Dear all,

I am a Perl newbie struggling to accomplish a conceptually simple
bioinformatics task.  I have a single large file containing about
200,000 DNA probe sequences (from an Affymetrix microarray), each of
which is accompanied by a header, like so (this is in FASTA format):

>probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
Antisense;
TGGCTCCTGCTGAGGTCCCCTTTCC
>probe:HG_U95Av2:107_at:543:519; Interrogation_Position=258; Antisense;
CTACTCTCGTGGTGCACAAGGAGTG
>probe:HG_U95Av2:1156_at:528:483; Interrogation_Position=2054;
Antisense;
TGCAGGTGGCAGATCTGCAGTCCAT
>probe:HG_U95Av2:1102_s_at:541:589; Interrogation_Position=4316;
Antisense;
GTGAAGGTTGCTGAGGCTCTGACCC

.........etc.

What I would like to do is extract from this file a subset of ~130,800
probes (both the header and the sequence) and output this subset into a
new text file in the same (FASTA) format.  These 130,800 probes
correspond to 8,175 probe set IDs ("1138_at" is an example of a probe
set ID in the header listed above).  I have these 8,175 IDs listed in a
separate file called "ID.txt" and the 200,000 probe sequences in a file
called "HG_U95Av2_probe_fasta.txt".  The script below is missing
something because the output file ("probe_subset.txt") is blank.  This
is also the case if I replace the file "ID.txt" with a file consisting
of a single probe set ID (e.g. 1138_at).  Does anyone know what I am
missing?  I am running this script in Cygwin on Windows XP.  I
appreciate any suggestions!

~ Mike O.

#!/usr/bin/perl -w

use strict;

my $IDs = 'ID.txt';

unless (open(IDFILE, $IDs)) {
        print "Could not open file $IDs!\n";
        }

my $probes = 'HG_U95Av2_probe_fasta.txt';

unless (open(PROBES, $probes)) {
        print "Could not open file $probes!\n";
        }

open (OUT,'>','probe_subset.txt') or die "Can't write output: $!";

my @ID = <IDFILE>;
chomp @ID;

        while (<PROBES>) {
                foreach (@ID) {
                        if(/^>probe:\w+:(\w+):/) {
                                print OUT;
                                print OUT scalar(<PROBES>);
                        }
                }

        }
exit;
--
No virus found in this outgoing message.
Checked by AVG Free Edition.
Version: 7.1.394 / Virus Database: 268.8.4/363 - Release Date: 6/13/2006


-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
<http://learn.perl.org/> <http://learn.perl.org/first-response>


Reply via email to