Hi all, I have found the solution to my "HELP FORMATING A FILE" question.
Actually, I was already very close to the solution. In case anyone is
interested, here is the script.

Regards,

Pedro

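#!/usr/sbin/perl -w
use strict;

# Print a usage message and stop when no input file is given.
if (!@ARGV) {
    print "usage: $0 blast_output\n";
    exit 0;
}

while (<>) {
    # Record header: keep only the ">identifier" part.
    if (/(>\S+)\s*/) {
        print "$1\n";
    }
    next if (/Length/);
    next if (/^\s*$/);

    # Subject line: the third field is the aligned sequence.
    if (/Sbjct/) {
        chomp;
        my ($query, $number1, $sequence, $number2) = split;
        $sequence =~ tr/-//d;    # remove gap characters
        print "$sequence\n";
    }
}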
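
A note on the script above: it prints the "Sbjct" lines of every
alignment in a record and only deletes "-", whereas the sample output in
the quoted message below seems to keep just the first alignment of each
record and to drop "*" as well. The following is a minimal sketch of that
stricter reading, not a replacement for the script. The sub name
first_hsp_sbjct, the use of the "Score =" lines to count alignments, and
the removal of "*" are all assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns one ">id" line per record followed by the
# Sbjct sequence lines of that record's first alignment only.
sub first_hsp_sbjct {
    my @lines = @_;
    my @out;
    my $hsp = 0;                          # alignments seen in this record
    for my $line (@lines) {
        if ($line =~ /^(>\S+)/) {         # a new record starts here
            push @out, $1;
            $hsp = 0;
        }
        elsif ($line =~ /^\s*Score\s*=/) {
            $hsp++;                       # assumption: "Score =" opens an alignment
        }
        elsif ($hsp == 1 && $line =~ /^Sbjct:\s*\d+\s+(\S+)/) {
            (my $seq = $1) =~ tr/*\-//d;  # assumption: drop "*" as well as "-"
            push @out, $seq;
        }
    }
    return @out;
}

# Only read input when file names are given on the command line.
if (@ARGV) {
    print "$_\n" for first_hsp_sbjct(<>);
}
```

Saved under any file name, it would be run the same way as the script
above, with the BLAST report as its argument.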



Hi all, I have a file from a BLAST report output that looks like the
following:

gi|12383919|gb|BF981107.1|BF981107  602310351F1 NIH_MGC_88 H...   271  4e-72
gi|12168431|gb|BF825777.1|BF825777  MR2-HN0035-171100-001-a0...   242  3e-63

                                                Alignments

>gi|12383919|gb|BF981107.1|BF981107 602310351F1 NIH_MGC_88 Homo sapiens
cDNA clone IMAGE:4401421 5'.
          Length = 967

 Score =  271 bits (694), Expect = 4e-72
 Identities = 135/141 (95%), Positives = 138/141 (97%)
 Frame = +3

Query: 17  QAGPWRVSAPPSGPPQFPAVVPGPSLEVARAHMLALGPQQLLAQDEEGDTLLHLFAARGL 76
           +AGPWRVSAPPSGPPQFPAVVPGPSLEVARAHMLALGPQQLLAQDEEGDTLLHLFAARGL
Sbjct: 15  EAGPWRVSAPPSGPPQFPAVVPGPSLEVARAHMLALGPQQLLAQDEEGDTLLHLFAARGL 194

Query: 77  RWAAYAAAEVLQVYRRLDIREHKGKTPLLVAAAANQPLIVEDLLNLGAEPNAADHQGRSV 136
           RWAAYAAAEVLQVYRRLDIREHKGKTPLLVAAAANQPLIVEDLLNLGAEPNAADHQGRSV
Sbjct: 195 RWAAYAAAEVLQVYRRLDIREHKGKTPLLVAAAANQPLIVEDLLNLGAEPNAADHQGRSV 374

Query: 137 LHVAATYGLPGVLAVFKSGIQ 157
           LHVAATYGLPGVL V+ +G Q
Sbjct: 375 LHVAATYGLPGVLLVWPAGRQ 437

 Score = 32.7 bits (73), Expect = 4.4
 Identities = 21/46 (45%), Positives = 25/46 (53%), Gaps = 11/46 (23%)
 Frame = +2

Query: 133 GRSVLHVAAT------YGLPGVLAVFK-----SGIQVDLEARDFEG 167
           GR V  + A+      Y  P V  +F      SG+QVDLEARDFEG
Sbjct: 452 GRLVAQILASRPGGQGYPYPAVCLLFLPGCAYSGVQVDLEARDFEG 589

>gi|12168431|gb|BF825777.1|BF825777 MR2-HN0035-171100-001-a09 HN0035
Homo sapiens cDNA.
          Length = 598

 Score =  242 bits (618), Expect = 3e-63
 Identities = 136/184 (73%), Positives = 139/184 (74%), Gaps = 33/184 (17%)
 Frame = +1

Query: 16  PQAGPWRVSA-----PPSGPPQFPAVVPGPSLEVARAHMLALGPQQLLAQDEEGDT---- 66
           PQA  WR+       P   PPQFPAVVPGPSLEVARAHMLALGPQQLLAQDEEGDT
Sbjct: 31  PQA--WRLDPGEFLHPLQ*PPQFPAVVPGPSLEVARAHMLALGPQQLLAQDEEGDT*V*G 204

Query: 67  -----------------------LLHLFAARGLRWAAYAAAEVLQVYRRLDIREHKGKTP 103
                                  LLHLFAARGLRWAAYAAAEVLQVYRRLDIREHKGKTP
Sbjct: 205 IGLSADSWLGGGCSHGCPPPVLRLLHLFAARGLRWAAYAAAEVLQVYRRLDIREHKGKTP 384

Query: 104 LLVAAAANQPLIVEDLLNLGAEPNAADHQGRSVLHVAATYGLPGV-LAVFKSGIQVDLEA 162
           LLV AAANQPLIVEDLLNLGAEPNAADHQGRSVLHV ATYGLPGV LAV  SG+ V+LEA
Sbjct: 385 LLVVAAANQPLIVEDLLNLGAEPNAADHQGRSVLHVGATYGLPGVLLAVLNSGVHVELEA 564

Query: 163 RDFE 166
           RDFE
Sbjct: 565 RDFE 576

and basically I want to extract the "Sbjct" lines under every record
that starts with ">" and produce a file that, for the case above, looks
as follows:

>gi|12383919|gb|BF981107.1|BF981107
EAGPWRVSAPPSGPPQFPAVVPGPSLEVARAHMLALGPQQLLAQDEEGDTLLHLFAARGL
RWAAYAAAEVLQVYRRLDIREHKGKTPLLVAAAANQPLIVEDLLNLGAEPNAADHQGRSV
LHVAATYGLPGVLLVWPAGRQ
>gi|12168431|gb|BF825777.1|BF825777
PQAWRLDPGEFLHPLQPPQFPAVVPGPSLEVARAHMLALGPQQLLAQDEEGDTVG
IGLSADSWLGGGCSHGCPPPVLRLLHLFAARGLRWAAYAAAEVLQVYRRLDIREHKGKTP
LLVVAAANQPLIVEDLLNLGAEPNAADHQGRSVLHVGATYGLPGVLLAVLNSGVHVELEA
RDFE

The sequence strings under each line starting with ">" could also be
joined into a single line.

The code below does something with one of the ">" records, but it is
still not right. Moreover, I do not know how to make the program jump
from one ">" record to the next one.

Please help.

#!/usr/sbin/perl -w
use strict;
if (!@ARGV) {
    print "usage: $0 blast_output \n";
    exit 0;
}

while (<>) {

if (/(>\S+)\s*/) {
 print "$1\n";
}
next if (/Length/);
next if (/^\s*$/);

if (/Query/) {
 chomp;
 my ($query, $number1, $sequence, $number2) = split;

$sequence =~ tr/-//d;
$sequence.= $sequence;

}
}

print "$sequence\n";
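
For anyone comparing the two versions, three things seem to hold this
attempt back: $sequence is declared with "my" inside the "if" block, so
it is out of scope at the final print (and with "use strict" the script
does not compile); "$sequence .= $sequence" appends the string to itself
rather than to a running total; and the test matches "Query" rather than
"Sbjct". Below is a minimal sketch of the accumulation idea with those
points addressed, which also gives one sequence line per record. The sub
name sbjct_per_record and the $buffer variable are hypothetical, and the
sketch joins the "Sbjct" pieces of all alignments in a record.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns one ">id" line and one joined sequence
# line for each record found in the given lines.
sub sbjct_per_record {
    my @lines = @_;
    my @out;
    my $buffer = '';                  # declared outside the loop so it persists
    for my $line (@lines) {
        if ($line =~ /^(>\S+)/) {
            push @out, $buffer if length $buffer;   # flush the previous record
            push @out, $1;
            $buffer = '';
        }
        elsif ($line =~ /^Sbjct:\s*\d+\s+(\S+)/) {
            (my $piece = $1) =~ tr/-//d;            # remove gap characters
            $buffer .= $piece;                      # append the new piece only
        }
    }
    push @out, $buffer if length $buffer;           # flush the last record
    return @out;
}

# Only read input when file names are given on the command line.
if (@ARGV) {
    print "$_\n" for sbjct_per_record(<>);
}
```

Moving from one ">" record to the next needs no explicit jump here: the
loop simply resets the buffer each time a new ">" line is read.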



--
***************************************************************************

PEDRO a. RECHE gallardo, pHD            TL: 617 632 3824
Scientist, Mol.Immnunol.Foundation,     FX: 617 632 3351
Dana-Farber Cancer Institute,           EM:
[EMAIL PROTECTED]
Harvard Medical School,                 URL: http://www.reche.org
44 Binney Street, D610C,
Boston, MA 02115
***************************************************************************

