Michael Oldham wrote:
Dear all,
I am a Perl newbie struggling to accomplish a conceptually simple
bioinformatics task. I have a single large file containing about
200,000 DNA probe sequences (from an Affymetrix microarray), each of
which is accompanied by a header, like so (this is in FASTA format):
probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
Antisense;
TGGCTCCTGCTGAGGTCCCCTTTCC
probe:HG_U95Av2:107_at:543:519; Interrogation_Position=258; Antisense;
CTACTCTCGTGGTGCACAAGGAGTG
probe:HG_U95Av2:1156_at:528:483; Interrogation_Position=2054;
Antisense;
TGCAGGTGGCAGATCTGCAGTCCAT
probe:HG_U95Av2:1102_s_at:541:589; Interrogation_Position=4316;
Antisense;
GTGAAGGTTGCTGAGGCTCTGACCC
.........etc.
What I would like to do is extract from this file a subset of ~130,800
probes (both the header and the sequence) and output this subset into a
new text file in the same (FASTA) format. These 130,800 probes
correspond to 8,175 probe set IDs ("1138_at" is an example of a probe
set ID in the header listed above). I have these 8,175 IDs listed in a
separate file called "ID.txt" and the 200,000 probe sequences in a file
called "HG_U95Av2_probe_fasta.txt". The script below is missing
something because the output file ("probe_subset.txt") is blank. This
is also the case if I replace the file "ID.txt" with a file consisting
of a single probe set ID (e.g. 1138_at). Does anyone know what I am
missing? I am running this script in Cygwin on Windows XP. I
appreciate any suggestions!
~ Mike O.
[...]
You'd have to modify this to work on your data, but it's close to what
you're looking for I think. For my convenience, I considered only the
numeric portions of the IDs, and I put the 'Antisense' strings on the
same lines as the 'probe' lines. (I assume it's this way in the original
file and the current look is due to MUA wrapping). I hope this helps.
use strict;
use warnings;
# This data should come from HG_U95Av2_probe_fasta.txt.
my $probe = q{
probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631; Antisense;
TGGCTCCTGCTGAGGTCCCCTTTCC
probe:HG_U95Av2:107_at:543:519; Interrogation_Position=258; Antisense;
CTACTCTCGTGGTGCACAAGGAGTG
probe:HG_U95Av2:1156_at:528:483; Interrogation_Position=2054; Antisense;
TGCAGGTGGCAGATCTGCAGTCCAT
probe:HG_U95Av2:1102_s_at:541:589; Interrogation_Position=4316; Antisense;
GTGAAGGTTGCTGAGGCTCTGACCC
};
# @ids should come from ID.txt.
my @ids = (1102);
my $fh;
open $fh, '<', \$probe or die("Couldn't open memory file\n");
while ($_ = <$fh>) {
next unless m/^>probe:[^:]+:(\d+)/;
my $id = $1;
if (grep $id eq $_, @ids) {
print;
$_ = <$fh>;
print;
}
}
close $fh;
--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
<http://learn.perl.org/> <http://learn.perl.org/first-response>