Michael Oldham wrote:
Dear all,

I am a Perl newbie struggling to accomplish a conceptually simple
bioinformatics task.  I have a single large file containing about
200,000 DNA probe sequences (from an Affymetrix microarray), each of
which is accompanied by a header, like so (this is in FASTA format):

probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
Antisense;
TGGCTCCTGCTGAGGTCCCCTTTCC
probe:HG_U95Av2:107_at:543:519; Interrogation_Position=258; Antisense;
CTACTCTCGTGGTGCACAAGGAGTG
probe:HG_U95Av2:1156_at:528:483; Interrogation_Position=2054;
Antisense;
TGCAGGTGGCAGATCTGCAGTCCAT
probe:HG_U95Av2:1102_s_at:541:589; Interrogation_Position=4316;
Antisense;
GTGAAGGTTGCTGAGGCTCTGACCC

.........etc.

What I would like to do is extract from this file a subset of ~130,800
probes (both the header and the sequence) and output this subset into a
new text file in the same (FASTA) format.  These 130,800 probes
correspond to 8,175 probe set IDs ("1138_at" is an example of a probe
set ID in the header listed above).  I have these 8,175 IDs listed in a
separate file called "ID.txt" and the 200,000 probe sequences in a file
called "HG_U95Av2_probe_fasta.txt".  The script below is missing
something because the output file ("probe_subset.txt") is blank.  This
is also the case if I replace the file "ID.txt" with a file consisting
of a single probe set ID (e.g. 1138_at).  Does anyone know what I am
missing?  I am running this script in Cygwin on Windows XP.  I
appreciate any suggestions!

~ Mike O.
[...]

You'd have to modify this to work on your data, but it's close to what
you're looking for I think. For my convenience, I considered only the
numeric portions of the IDs, and I put the 'Antisense' strings on the
same lines as the 'probe' lines. (I assume it's this way in the original
file and the current look is due to MUA wrapping). I hope this helps.


use strict;
use warnings;

# This data should come from HG_U95Av2_probe_fasta.txt.
my $probe = q{
probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631; Antisense;
TGGCTCCTGCTGAGGTCCCCTTTCC
probe:HG_U95Av2:107_at:543:519; Interrogation_Position=258; Antisense;
CTACTCTCGTGGTGCACAAGGAGTG
probe:HG_U95Av2:1156_at:528:483; Interrogation_Position=2054; Antisense;
TGCAGGTGGCAGATCTGCAGTCCAT
probe:HG_U95Av2:1102_s_at:541:589; Interrogation_Position=4316; Antisense;
GTGAAGGTTGCTGAGGCTCTGACCC
};

# @ids should come from ID.txt.
my @ids = (1102);

my $fh;
open $fh, '<', \$probe or die("Couldn't open memory file\n");

while ($_ = <$fh>) {
    next unless m/^>probe:[^:]+:(\d+)/;
    my $id = $1;
    if (grep $id eq $_, @ids) {
        print;
        $_ = <$fh>;
        print;
    }
}

close $fh;




--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
<http://learn.perl.org/> <http://learn.perl.org/first-response>


Reply via email to