Hi all, I have found the solution to my "HELP FORMATING A FILE" question.
Actually, I was already very close to it. In case anyone is interested,
the script is below.
Regards,
Pedro
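#!/usr/sbin/perl -w
use strict;

# Print a usage message when no input file is given.
if (!@ARGV) {
    print "usage: $0 blast_output\n";
    exit 0;
}

while (<>) {
    # Record header: print the identifier up to the first whitespace.
    if (/^(>\S+)/) {
        print "$1\n";
    }
    next if (/Length/);
    next if (/^\s*$/);
    # Subject line: "Sbjct: start sequence end".
    if (/^Sbjct/) {
        chomp;
        my ($label, $start, $sequence, $end) = split;
        # Remove gap characters and "*", as in the desired output below.
        $sequence =~ tr/-*//d;
        print "$sequence\n";
    }
}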
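The script above prints each Sbjct fragment on its own line. For anyone who would rather have the whole sequence of a record on a single line, here is a rough sketch of one way it might be done. The helper name sbjct_records and the way records are stored are my own choices for illustration, and it assumes that every record starts with ">" at the beginning of a line and that Sbjct lines look like "Sbjct: start sequence end". Like the script above, it takes every Sbjct line in a record, including those from additional alignments of the same hit.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: takes the lines of a BLAST report and returns one
# "header\nsequence\n" string per ">" record, with all Sbjct fragments
# of that record joined together.
sub sbjct_records {
    my @lines = @_;
    my (@records, $current);
    for my $line (@lines) {
        if ($line =~ /^(>\S+)/) {
            # A new ">" line closes the previous record and opens a new one.
            push @records, $current if defined $current;
            $current = { id => $1, seq => '' };
        }
        elsif (defined $current && $line =~ /^Sbjct:\s*\d+\s+(\S+)/) {
            # Strip gaps and "*" from the fragment, then append it.
            (my $fragment = $1) =~ tr/-*//d;
            $current->{seq} .= $fragment;
        }
    }
    # Do not forget the last record in the file.
    push @records, $current if defined $current;
    return map { "$_->{id}\n$_->{seq}\n" } @records;
}

# Only read input when file names are given on the command line.
print sbjct_records(<>) if @ARGV;
```

Saved under some name such as sbjct_join.pl (a made-up name), it would be run as "perl sbjct_join.pl blast_output". The jump from one ">" record to the next is handled by starting a fresh record whenever a new ">" line is seen.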
Hi all, I have a BLAST report output file that looks like the
following:
gi|12383919|gb|BF981107.1|BF981107 602310351F1 NIH_MGC_88 H... 271 4e-72
gi|12168431|gb|BF825777.1|BF825777 MR2-HN0035-171100-001-a0... 242 3e-63
Alignments
>gi|12383919|gb|BF981107.1|BF981107 602310351F1 NIH_MGC_88 Homo sapiens
cDNA clone IMAGE:4401421 5'.
Length = 967
Score = 271 bits (694), Expect = 4e-72
Identities = 135/141 (95%), Positives = 138/141 (97%)
Frame = +3
Query: 17 QAGPWRVSAPPSGPPQFPAVVPGPSLEVARAHMLALGPQQLLAQDEEGDTLLHLFAARGL 76
+AGPWRVSAPPSGPPQFPAVVPGPSLEVARAHMLALGPQQLLAQDEEGDTLLHLFAARGL
Sbjct: 15 EAGPWRVSAPPSGPPQFPAVVPGPSLEVARAHMLALGPQQLLAQDEEGDTLLHLFAARGL 194
Query: 77 RWAAYAAAEVLQVYRRLDIREHKGKTPLLVAAAANQPLIVEDLLNLGAEPNAADHQGRSV 136
RWAAYAAAEVLQVYRRLDIREHKGKTPLLVAAAANQPLIVEDLLNLGAEPNAADHQGRSV
Sbjct: 195 RWAAYAAAEVLQVYRRLDIREHKGKTPLLVAAAANQPLIVEDLLNLGAEPNAADHQGRSV 374
Query: 137 LHVAATYGLPGVLAVFKSGIQ 157
LHVAATYGLPGVL V+ +G Q
Sbjct: 375 LHVAATYGLPGVLLVWPAGRQ 437
Score = 32.7 bits (73), Expect = 4.4
Identities = 21/46 (45%), Positives = 25/46 (53%), Gaps = 11/46 (23%)
Frame = +2
Query: 133 GRSVLHVAAT------YGLPGVLAVFK-----SGIQVDLEARDFEG 167
GR V + A+ Y P V +F SG+QVDLEARDFEG
Sbjct: 452 GRLVAQILASRPGGQGYPYPAVCLLFLPGCAYSGVQVDLEARDFEG 589
>gi|12168431|gb|BF825777.1|BF825777 MR2-HN0035-171100-001-a09 HN0035
Homo sapiens cDNA.
Length = 598
Score = 242 bits (618), Expect = 3e-63
Identities = 136/184 (73%), Positives = 139/184 (74%), Gaps = 33/184 (17%)
Frame = +1
Query: 16 PQAGPWRVSA-----PPSGPPQFPAVVPGPSLEVARAHMLALGPQQLLAQDEEGDT---- 66
PQA WR+ P PPQFPAVVPGPSLEVARAHMLALGPQQLLAQDEEGDT
Sbjct: 31 PQA--WRLDPGEFLHPLQ*PPQFPAVVPGPSLEVARAHMLALGPQQLLAQDEEGDT*V*G 204
Query: 67 -----------------------LLHLFAARGLRWAAYAAAEVLQVYRRLDIREHKGKTP 103
LLHLFAARGLRWAAYAAAEVLQVYRRLDIREHKGKTP
Sbjct: 205 IGLSADSWLGGGCSHGCPPPVLRLLHLFAARGLRWAAYAAAEVLQVYRRLDIREHKGKTP 384
Query: 104 LLVAAAANQPLIVEDLLNLGAEPNAADHQGRSVLHVAATYGLPGV-LAVFKSGIQVDLEA 162
LLV AAANQPLIVEDLLNLGAEPNAADHQGRSVLHV ATYGLPGV LAV SG+ V+LEA
Sbjct: 385 LLVVAAANQPLIVEDLLNLGAEPNAADHQGRSVLHVGATYGLPGVLLAVLNSGVHVELEA 564
Query: 163 RDFE 166
RDFE
Sbjct: 565 RDFE 576
Basically, I want to extract the "Sbjct" lines under every record that
starts with ">" and produce a file that, for the above case, looks as
follows:
>gi|12383919|gb|BF981107.1|BF981107
EAGPWRVSAPPSGPPQFPAVVPGPSLEVARAHMLALGPQQLLAQDEEGDTLLHLFAARGL
RWAAYAAAEVLQVYRRLDIREHKGKTPLLVAAAANQPLIVEDLLNLGAEPNAADHQGRSV
LHVAATYGLPGVLLVWPAGRQ
>gi|12168431|gb|BF825777.1|BF825777
PQAWRLDPGEFLHPLQPPQFPAVVPGPSLEVARAHMLALGPQQLLAQDEEGDTVG
IGLSADSWLGGGCSHGCPPPVLRLLHLFAARGLRWAAYAAAEVLQVYRRLDIREHKGKTP
LLVVAAANQPLIVEDLLNLGAEPNAADHQGRSVLHVGATYGLPGVLLAVLNSGVHVELEA
RDFE
The sequence fragments under each line starting with ">" could also be
joined on a single line.
The code below does something to one of the ">" records, but it is still
not right. Moreover, I do not know how to make the program jump from one
">" record to the next.
Please help.
#!/usr/sbin/perl -w
use strict;
if (!@ARGV) {
    print "usage: $0 blast_output \n";
    exit 0;
}
while (<>) {
    if (/(>\S+)\s*/) {
        print "$1\n";
    }
    next if (/Length/);
    next if (/^\s*$/);
    if (/Query/) {
        chomp;
        my ($query, $number1, $sequence, $number2) = split;
        $sequence =~ tr/-//d;
        $sequence.= $sequence;
    }
}
print "$sequence\n";
--
***************************************************************************
PEDRO a. RECHE gallardo, pHD TL: 617 632 3824
Scientist, Mol.Immnunol.Foundation, FX: 617 632 3351
Dana-Farber Cancer Institute, EM: [EMAIL PROTECTED]
Harvard Medical School, URL: http://www.reche.org
44 Binney Street, D610C,
Boston, MA 02115
***************************************************************************