Hi, I am interested in parsing the file at the bottom of this e-mail in
order to extract the strings between the double quotes following /product=,
/protein_id=, /db_xref= and /translation=, for each of the
segments separated by the string "CDS". The output for the example
below should look like this:
>V001|AAM13451.1|GI:20152990
MESLKYFYSLSLSLFNGLTKILNLFLMESLKYFYSLSLSLFNGL
TKILNLFLMVSIKRSIFLTL
>V002|AAA60951.1|GI:333518
MKQIVLACICLAAVAIPTSLQQSFSSSSSCTEEENKHHMGIDVI
IKVTKQDQTPTNDKICQSVTEVTESEDESEEVVKGDPTTYYTVVGGGLTMDFGFTKCP
KISSISEYSDGNTVNARLSSVSPGQGKDSPAITREEALSMIKDCEMSINIKCSEEEKD
SNIKTHPVLGSNISHKKVSYEDIIGSTIVDTKCVKNLEISVRIGDMCKESSELEVKDG
FKYVDGSASEDAADDTSLINSAKLIACV
So far I have used the code below, which actually works. However, I am not
pleased with it, because it generates an empty element in the hash from the
header of the file, and because there might be a better way to do
this. I would therefore be very grateful for any input or alternative way
to improve the code.
Regards,
pedro
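#!/usr/sbin/perl -w
# Read the input one CDS feature at a time.
$/ = "\n CDS";
while (<>) {
    # The file header (everything before the first CDS) matches none of
    # the patterns below, which is where the empty hash element comes from.
    $_ =~ /product=\"(.+)\"/;
    $gname = $1;
    $gname =~ s/\s+//g;
    push @ID, $gname;
    $_ =~ /protein_id="([\w\.]+)\"/;
    $ref = $1;
    $_ =~ /db_xref=\"GI:(\w+)\"/;
    $gid = $1;
    $_ =~ /translation=\"([A-Z\s]+)/;
    $seq = $1;
    $seq =~ s/\s+//g;    # join the wrapped lines of the translation
    $hash{$gname} = ["$ref", "$gid", "$seq"];
}
# Write one FASTA record per product, in the order they were read.
open(F, ">test") or die "Cannot open test: $!";
foreach $key (@ID) {
    print F ">gi|$hash{$key}[1]|$hash{$key}[0] $key\n$hash{$key}[2]\n";
}
close(F);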
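One possible way around the empty header entry, given as a minimal sketch rather than a definitive fix: split the text at each line that begins with "CDS", discard the first chunk (the file header), and capture each qualifier in list context so that a failed match gives undef instead of reusing an old $1. The names cds_to_fasta and $sample are hypothetical, and the sketch assumes that "CDS" only starts a line at a feature boundary and that the wanted /db_xref= is the "GI:" one.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn GenBank-style text into FASTA records,
# one record per CDS feature.
sub cds_to_fasta {
    my ($text) = @_;
    my @records;

    # Split at every line that starts with the CDS feature key. The first
    # chunk is whatever precedes the first CDS (the header), so drop it.
    my @chunks = split /^[ \t]*CDS\s/m, $text;
    shift @chunks;

    for my $chunk (@chunks) {
        # List-context matches return undef on failure, so a stale $1
        # from an earlier match can never leak into this record.
        my ($product) = $chunk =~ m{/product="([^"]+)"};
        my ($protein) = $chunk =~ m{/protein_id="([^"]+)"};
        my ($gi)      = $chunk =~ m{/db_xref="(GI:\d+)"};
        my ($seq)     = $chunk =~ m{/translation="([^"]+)"};

        # Skip features that lack any of the four qualifiers.
        next unless defined $product && defined $protein
                 && defined $gi      && defined $seq;

        $seq =~ s/\s+//g;    # the translation may span several lines
        push @records, ">$product|$protein|$gi\n$seq\n";
    }
    return @records;
}

# Small inline sample, shortened from the file in this e-mail.
my $sample = <<'END';
FEATURES Location/Qualifiers
source 1..224501
/organism="Cowpox virus"
/db_xref="taxon:10243"
CDS complement(156..350)
/product="V001"
/protein_id="AAM13451.1"
/db_xref="GI:20152990"
/translation="MESLKYFYSLSLSLFNGLTKILNLFLMESLKYFYSLSLSLFNGL
TKILNLFLMVSIKRSIFLTL"
ORIGIN
END

print cds_to_fasta($sample);
```

For the sample this prints a single record for V001 and nothing for the header. Unlike the desired output shown above, the sequence is written on one line; re-wrapping it to a fixed width would be a separate step.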
REFERENCE 6 (bases 1 to 224501)
AUTHORS Dietrich,F.S., Ray,C.A., Sharma,A.D., Allen,A. and Pickup,D.J.
TITLE Direct Submission
JOURNAL Submitted (11-FEB-2002) Molecular Genetics and Microbiology, Duke
University Medical Center, Box 3020 DUMC, 421 Jones Building,
Durham, NC 27710, USA
COMMENT On Apr 16, 2002 this sequence version replaced gi:333516.
FEATURES Location/Qualifiers
source 1..224501
/organism="Cowpox virus"
/strain="Brighton Red"
/db_xref="taxon:10243"
CDS complement(156..350)
/codon_start=1
/evidence=not_experimental
/product="V001"
/protein_id="AAM13451.1"
/db_xref="GI:20152990"
/translation="MESLKYFYSLSLSLFNGLTKILNLFLMESLKYFYSLSLSLFNGL
TKILNLFLMVSIKRSIFLTL"
CDS complement(2743..3483)
/codon_start=1
/evidence=not_experimental
/product="V002"
/protein_id="AAA60951.1"
/db_xref="GI:333518"
/translation="MKQIVLACICLAAVAIPTSLQQSFSSSSSCTEEENKHHMGIDVI
IKVTKQDQTPTNDKICQSVTEVTESEDESEEVVKGDPTTYYTVVGGGLTMDFGFTKCP
KISSISEYSDGNTVNARLSSVSPGQGKDSPAITREEALSMIKDCEMSINIKCSEEEKD
SNIKTHPVLGSNISHKKVSYEDIIGSTIVDTKCVKNLEISVRIGDMCKESSELEVKDG
FKYVDGSASEDAADDTSLINSAKLIACV"
BASE COUNT 74832 a 37730 c 37261 g 74678 t
ORIGIN
1 tagtaaaatt aaattaatta taaaattata tatataattt actaacttta gttagataaa
61 ttaataatat ataagtttta gtacattaat attatatttt aaatatttta tttagtgtct
//
*******************************************************************
PEDRO A. RECHE, PhD TL: 617 632 3824
Dana-Farber Cancer Institute, FX: 617 632 4569
Harvard Medical School, EM: [EMAIL PROTECTED]
44 Binney Street, D1510A, EM: [EMAIL PROTECTED]
Boston, MA 02115 URL: http://www.reche.org
*******************************************************************