Your code can be simplified quite a bit if I correctly understand what you
were actually trying to do. I have taken a stab at it, but I had to guess at
your intent with the layout of the fields, so let's clear that up first.
Your data have 9 fields (at least to the untrained eye), separated by the |
character, yet your "Fields:" line has 12 field names. Treating each line as
9 fields, do you wish to return fields 1, 6, 7, 8, and some portion of 9?
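
To make the field count concrete, here is a minimal sketch that splits one
of your hit lines on the | character and prints each piece with its number.
The line is copied from your data and rejoined onto a single line; the
variable names are only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# First hit line from the quoted data, rejoined onto one line.
my $line = 'gi|37182815|gb|AY358849.1| gi|28592069|gb|U63637.2|BTU63637 100.00 17 0 0 552 568 3218 3234   1.1 34.19';

# A limit of -1 keeps any trailing empty fields, so nothing is undercounted.
my @fields = split /\|/, $line, -1;

printf "%d pipe-delimited fields\n", scalar @fields;
printf "field %d: '%s'\n", $_ + 1, $fields[$_] for 0 .. $#fields;
```

For this line it should report 9 fields, with everything after the last |
lumped together as field 9.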

It's the 9th field that is troublesome. If you look at the last field of
your REQUIRED output, you want *6* items separated by spaces in rows 1 and 4
but only *5* items in rows 2 and 3. Is this a mistake? Please count your
fields in the raw data and your desired output. Make sure you are correct in
what you have and what you need. Where is the "gene n chromosome info?"

Here are the assumptions I made in order to solve your problem. Since the 4th
space-delimited column of the 9th field is always zero, I am assuming that
you do not need it. I am also assuming that you need 6 items from the 9th
field. Therefore, for the 9th field, I am assuming that you need items 1-3
and 5-7. Hopefully, this will solve your problem with the data.
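
As a rough sketch of those assumptions (not necessarily what you finally
need), this is how the two slices work on your first hit line. The names
@record and @items are just illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# First hit line from the quoted data, rejoined onto one line.
my $line = 'gi|37182815|gb|AY358849.1| gi|28592069|gb|U63637.2|BTU63637 100.00 17 0 0 552 568 3218 3234   1.1 34.19';

# Fields 1, 6, 7, 8 and 9 are list indices 0 and 5..8.
my @record = ( split /\|/, $line )[ 0, 5 .. 8 ];

# Within the 9th field, keep items 1-3 and 5-7 (indices 0..2 and 4..6),
# which drops the 4th space-delimited item.
my @items = ( split ' ', $record[-1] )[ 0 .. 2, 4 .. 6 ];
$record[-1] = join ' ', @items;

print join( ' | ', @record ), "\n";
```

For this line the last field comes out as "BTU63637 100.00 17 0 552 568".
On rows 2 and 3, where the 9th field has no leading name, the same slice
picks different columns, which is why the field count question above
matters.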

I also fetch the web page for each gi number. Please note that you are not
keeping track of the gi numbers in any way that would make it easy to see
whether you have already looked one up (this should be done in a hash, not
an array). Rows 2 and 3 of your data (the two AC089993.2 hits) have the same
gi number. This routine submits one request for each line, so there will be
repeats. I save the web page to "gi".htm (substitute the actual gi number
for "gi"). I then extract the DNA map text to "gi".txt. Finally, I push the
file name "gi".txt onto the end of the array reference for that record.
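
If you want to avoid the repeated lookups, a hash keyed on the gi number is
the usual way to do it. This is only a sketch: %seen and @to_fetch are names
I made up, and the gi numbers are the ones from your four hit lines.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# gi numbers in the order they occur in the four hit lines.
my @gis = qw(28592069 14318385 14318385 7385112);

my %seen;        # gi number => count of times we have met it
my @to_fetch;    # unique gi numbers, in first-seen order

for my $gi (@gis) {
    next if $seen{$gi}++;    # already handled this gi, skip the repeat
    push @to_fetch, $gi;
}

print "would fetch: @to_fetch\n";
```

In the script below, the same "next if $seen{$gi}++" test would go just
before the $ua->get call, so that each page is requested only once.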

Hopefully this will give you enough to complete the rest yourself. If you do
not require a printout of the final array, you can delete the two lines
containing "Dumper."

-------BEGIN CODE-------
#!/usr/bin/perl
use warnings;
use strict;
use LWP::UserAgent;
use Data::Dumper;

my @data;
my $ua    = LWP::UserAgent->new() or die "Could not create UserAgent: $!\n";
my $noeol = '[^\n\r\x0A\x0D]';

while (<DATA>) {
  next if /^#/ or /^\s*$/;
  chomp;
  push @data, [ ( split /\|/, $_ )[0,5..8] ];
  $data[-1]->[-1] = join ' ', ( split ' ', $data[-1]->[-1] )[0..2,4..6];

  my $gi = $data[-1]->[1];
  my $response = $ua->get(
    "http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=$gi",
    ':content_file' => "$gi.htm",
  );
  if ($response->is_success) {
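    local $/ = undef;    # slurp mode: read the whole saved page at once
    open my $htm, '<', "$gi.htm" or die "Cannot open $gi.htm for reading: $!\n";
    # keep only the text after the ORIGIN line, up to the next //
    (my $text = <$htm>) =~ s/.+ORIGIN$noeol+.(.+?)\/\/.+/$1/s;
    close $htm;
    open my $txt, '>', "$gi.txt" or die "Cannot open $gi.txt for writing: $!\n";
    print {$txt} $text;
    close $txt or die "Cannot close $gi.txt: $!\n";
    push @{$data[-1]}, "$gi.txt";    # remember the file name with the record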
  } else {
    print $response->as_string;
  }
}
print Dumper \@data;

__DATA__
# BLASTN 2.2.9 [May-01-2004]
your data from below goes here
-------END CODE-------


"Aditi gupta" <[EMAIL PROTECTED]> wrote in message
news:[EMAIL PROTECTED]
> hi to all,
>
> i had a file which contained following data:
>
> # BLASTN 2.2.9 [May-01-2004]
> # Query: gi|37182815|gb|AY358849.1| Homo sapiens clone DNA180287 ALTE (UNQ6508) mRNA, complete cds
> # Database: nr
> # Fields: Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score
> gi|37182815|gb|AY358849.1| gi|28592069|gb|U63637.2|BTU63637 100.00 17 0 0 552 568 3218 3234   1.1 34.19
> gi|37182815|gb|AY358849.1| gi|14318385|gb|AC089993.2| 95.24 21 1 0 435 455 56604 56624   1.1 34.19
> gi|37182815|gb|AY358849.1| gi|14318385|gb|AC089993.2| 100.00 16 0 0 260 275 89982 89967   4.2 32.21
> gi|37182815|gb|AY358849.1| gi|7385112|gb|AF222766.1|AF222766 100.00 17 0 0 345 361 242 226   1.1 34.19
>
> but i required only some of the fields, and with the help of members of
> this maillist, i succeeded and obtained following output:
>
> gi|28592069|gb|U63637.2|BTU63637   100.00   17   0   552   568
> gi|14318385|gb|AC089993.2|   95.24   21  1  435  455
> gi|14318385|gb|AC089993.2|  100.00   16  0  260  275
> gi|7385112|gb|AF222766.1|AF222766  100.00  17  0  345  361

> [code snipped]
>
>
>
> but i also have to feed the gi number(the first field) into ncbi entrez
> nucleotide site:
> http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Nucleotide
> and retreive the gene and chromosome name, if available from the resulting
> web page ........
> is it possible to get the gene n chromosome info in the output with other
> fields?what changes in code are required?


