subroutine problem

2004-12-01 Thread Michael Robeson
I have written a rather simplistic script so I can get used to  
LWP::Simple etc... Anyway I am using a subroutine to get and print  
data from a website. I have gotten it to work except for the fact that  
the first iteration of the subroutine uses no data at all, yet after  
that it works fine. I know it has to do with how I am passing the data  
into the subroutine. The output is as follows (the perl code that was  
used is below too):

Begin OutPut
Fetching                                              # See, it is fetching nothing
Appending "" to "fetched_ncbi_sequences.txt".         # Appending nothing
Fetching cv889431                                     # It works fine from this point on
Appending "cv889431" to "fetched_ncbi_sequences.txt".
Fetching cv889432
Appending "cv889432" to "fetched_ncbi_sequences.txt".
Fetching cv889433
Appending "cv889433" to "fetched_ncbi_sequences.txt".
Fetching cv889434
Appending "cv889434" to "fetched_ncbi_sequences.txt".
Fetching cv889435
Appending "cv889435" to "fetched_ncbi_sequences.txt".
Fetching cv889436
Appending "cv889436" to "fetched_ncbi_sequences.txt".
Fetching cv889437
Appending "cv889437" to "fetched_ncbi_sequences.txt".
Fetching cv889438
Appending "cv889438" to "fetched_ncbi_sequences.txt".
Fetching cv889439
Appending "cv889439" to "fetched_ncbi_sequences.txt".
Fetching cv889440
Appending "cv889440" to "fetched_ncbi_sequences.txt".
Fetching cv889441
Appending "cv889441" to "fetched_ncbi_sequences.txt".
**Finished**
/End OutPut
The script is as follows:
Begin code
#!/usr/bin/perl -w
use strict;
use LWP::Simple;

open(FASTA, ">>fetched_ncbi_sequences.txt")
    or die "Cannot open FASTA file: $!";

print "\n\t**Welcome to Mike Robeson's NCBI-fetch Script!**\n
A - Just enter in the accession numbers of the sequence data
you wish to pull from genbank individually (e.g. cv889410) or
by defining a range (cv889431-cv889441). Hit enter after
each entry or entry range.\n
B - When finished, hit enter one last time and press ctrl-d.\n
C - All sequence data will be downloaded into one file in FASTA
format (e.g. fetched_ncbi_sequences.txt).
\n\n";

print "Enter a list of Sequence IDs to fetch:\n";
chomp (my @list = <ARGV>);

&printSequence;

foreach my $id (@list) {
    if ($id =~ s/([a-z]*)(\d+)-[a-z]*(\d+)//) {
        my @range = split(/-/,$id);
        my $init_range_letters = $1;
        my $init_range_num = $2;
        my $term_range_num = $3;
        for (my $count = $init_range_num; $count <= $term_range_num; $count++) {
            my $genbank = $init_range_letters.$count;
            printSequence($genbank);
        }
    } else {
        printSequence($id);
    }
}

print "\n\n**Finished**\n\n";

sub printSequence {
    my $accession = "@_";
    print "Fetching $accession \n";
    my $data = get("http://www.ncbi.nlm.nih.gov/entrez/batchseq.cgi?cmd=&txt=on&save=&cfm=&term=&list_uids=$accession&db=nucleotide&extrafeat=16&view=fasta&dispmax=20&SendTo=t&__from=&__to=&__strand=");
    print FASTA $data;
    print "Appending \"$accession\" to \"fetched_ncbi_sequences.txt\".\n";
}

\End Code
I have been trying to figure out why this is occurring and have  
remained stumped for 3 hours now and I can't figure out what is going  
on. Any suggestions?

-Mike
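
A likely cause, going only by the listing in this post: the bare call to printSequence that sits just before the foreach loop runs the subroutine once before any ID has been taken from @list. A call made without an argument list hands the subroutine an empty @_ at file level, so $accession is empty on that first pass and everything after it behaves normally. Removing that one line should be enough. The sketch below shows the effect without touching the network; describe_fetch and expand_ids are hypothetical stand-ins, not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical stand-in for printSequence: it only reports what it was
# asked to fetch instead of contacting NCBI.
sub describe_fetch {
    my ($accession) = @_;
    return defined $accession ? "Fetching $accession" : "Fetching (nothing)";
}

# Expand "cv889431-cv889433" into individual accession numbers;
# anything that is not a range is returned unchanged.
sub expand_ids {
    my ($id) = @_;
    if ($id =~ /^([a-z]*)(\d+)-[a-z]*(\d+)$/i) {
        my ($letters, $first, $last) = ($1, $2, $3);
        return map { $letters . $_ } $first .. $last;
    }
    return ($id);
}

# A call with no arguments reproduces the empty first line of output.
my $empty_call = describe_fetch();

# Calls made from inside the loop always carry an accession number.
my @messages = map { describe_fetch($_) } expand_ids('cv889431-cv889433');

print "$empty_call\n";
print "$_\n" for @messages;
```

Unpacking the argument with my ($accession) = @_; also makes a missing argument show up as undef, which -w reports as soon as the value is used in a string.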
--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
http://learn.perl.org/ http://learn.perl.org/first-response



Re: counting gaps in sequence data

2004-10-15 Thread Michael Robeson
Errin,
Thanks so much! I will spend the weekend going over what you've posted. 
Looks like I will learn a lot from this post alone. This stuff is so 
addictive. I can spend hours doing this and not realize it. If I am 
successful or not is another story!  :-)

I'll definitely let you know if I have any trouble.
-Cheers!
-Mike
On Oct 15, 2004, at 7:22 AM, Errin Larsen wrote:
On Thu, 14 Oct 2004 16:11:42 -0600, Michael Robeson [EMAIL PROTECTED] wrote:
> Yeah, I have just submitted that same question verbatim to the bio-perl
> list. I am still running through some ideas though. I have both
> Bioinformatics perl books. They are not very effective teaching books.
>
> The books spend too much time on using modules. Though while I
> understand the usefulness of not having to re-write code, it is a bad
> idea for beginners like me. Because re-writing code at first gives me a
> lot of practice. Some of the scripts in the books use like 3-5 modules,
> so it gets confusing on what is going on.
>
> I mean the books are not useless, but they definitely are structured
> for a class with a teacher.
> :-)
> -Mike

Hi again, Mike!

I've thrown together the following code.  I have not commented this!
If you have some questions, just ask.  I hard coded the sequences for
my ease-of-use.  It looked to me like you have figured out how to grab
the sequences out of a file and throw them in a hash.  This code uses
some deep nested references, and therefore, some crazy dereferences.
Have fun with it, I know I did!  Things that might look weird:  check
out "perldoc -f split" for info on using a null-string to split with
(that's where I found it!) and of course "perldoc perlref" for all the
deep nested references and dereferencing stuff!  I'm currently reading
"Learning Perl Objects, References & Modules" by Randal Schwartz.  I
highly recommend it.  It helped a lot in this exercise.  Here's the
code:

use warnings;
use strict;

my %sequences = (
    'Human' => "acgtt---cgatacg---acgact-----t",
    'Chimp' => "acgtt---cgatacg---acgact-----t",
    'Mouse' => "acgata---acgatcg----acgt",
);
my %results;

foreach my $species( keys %sequences ) {
    my $is_base_pair_gap = 0;
    my $base_pair_gap;
    my $base_pair_gap_pos;
    my $position = 1;
    foreach( split( / */, $sequences{$species} )) {
        if( /-/ ) {
            unless( $is_base_pair_gap ) {
                $base_pair_gap_pos = $position;
            }
            $is_base_pair_gap = 1;
            $base_pair_gap .= $_;
        } elsif( $is_base_pair_gap ) {
            push @{$results{$species}{length($base_pair_gap)}}, $base_pair_gap_pos;
            $is_base_pair_gap = 0;
            $base_pair_gap = undef;
        }
        $position++;
    }
}

foreach my $species( keys %results ) {
    print "$species:\n";
    foreach my $base_pair_gap( keys %{$results{$species}} ) {
        print "Number of $base_pair_gap base pair gaps:\t",
            scalar( @{$results{$species}{$base_pair_gap}}), "\n";
        print "  at position(s) ", join( ',',
            @{$results{$species}{$base_pair_gap}} ), ".\n";
    }
    print "\n";
}

The heart of this code is this line:

  push @{$results{$species}{length($base_pair_gap)}}, $base_pair_gap_pos;

There is a %results hash which has keys that are the different
species, and values that point to another hash.  THAT hash (the inner
hash) has keys that are the length of the base-pair-gaps, and values
that point to an array.  The array holds a list of the positions of
those base-pair gaps!  The first base pair gap in the human sequence
is '---' at the 6th character.  That looks like this (warning: pseudo
code for clarity!)

  %results->{'Human'}->{ 3 }->[6]

When we find the second '---' gap, we add its position to the array:

  %results->{'Human'}->{ 3 }->[6,16]

Then, we find a new base-pair-gap ('-----') so we add a new key to the
inner hash:

  %results->{'Human'}->{ 3 }->[6,16]
                     ->{ 5 }->[25]

Next, we move on to the next species ...

  %results->{'Human'}->{ 3 }->[6,16]
                     ->{ 5 }->[25]
          ->{'Mouse'}->{ 3 }->[7]

So, finally, with Data::Dumper, we can see the %results hash when the
code is done processing the sequence:

%results = {
  'Human' => {
    '3' => [
      6,
      16
    ],
    '5' => [
      25
    ]
  },
  'Mouse' => {
    '4' => [
      17
    ],
    '3' => [
      7
    ]
  },
  'Chimp' => {
    '3' => [
      6,
      16

counting gaps in sequence data

2004-10-14 Thread Michael Robeson
I have a set of data that looks something like the following:
human
acgtt---cgatacg---acgact-----t
chimp
acgtacgatac---actgca---ac
mouse
acgata---acgatcg----acgt
I am having trouble setting up a hash etc., to count the number and 
types of continuous gaps. For example the 'human' sequence above has 2 
sets of 3 gaps and 1 set of 5 gaps. The 'chimp' has 2 sets of 3 gaps 
and finally the 'mouse' has 1 set of 3 gaps and 1 set of 4 gaps.

So, I am having trouble being able to assign a dynamic variable (i.e. 
gap length) and place that in a pattern match so that it can count how 
many gaps of that length are in that particular sequence. I know how to 
set up a hash to count the number of times a gap appears: 
'$gaptype{$gap}++' or something. The problem is: what is the best way 
(and how) can I set '$gap' to be dynamic.

I need to know the length of each consecutive string of gaps. I know 
how to count the gaps by using the 'tr' function. But it gets confusing 
when I need to add counts to every instance of that gap length. I also 
need to know the position of each gap (denoted by the position of the 
first gap in that particular instance). I know that I can use the 
'pos()' command for this.

So, my problem is that I think I know some of the bits of code to put 
into place, but I am getting lost on how to structure it all 
together. For now I am just trying to get my output to look like this:

Human
number of 3 base pair gaps: 2
at positions:   6, 16
number of 5 base pair gaps: 1
at positions:   25
Chimp
 and so on ...
So, any suggestions would be greatly appreciated. If anyone can help me 
out with all or even just bits of this I would greatly appreciate it. 
This should help me get started on some more advanced parsing I need to 
do after this. I like to try and figure things out on my own if I can, 
so even pseudo code would be of great help!

-Thanks
-Mike
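
One way to collect both the length and the starting position of every run of gaps is a global match inside a while loop, storing the positions in a hash keyed by gap length. This is only a sketch under the assumptions stated in this post (positions are 1-based and refer to the first dash of a run); find_gaps is a hypothetical helper name, and the offset comes from Perl's built-in @- array rather than from pos().

```perl
use strict;
use warnings;

# Return a hash reference: gap length => [ 1-based start positions ].
sub find_gaps {
    my ($sequence) = @_;
    my %gaps;
    while ($sequence =~ /(-+)/g) {
        # $-[1] is the 0-based offset where capture group 1 started.
        push @{ $gaps{ length $1 } }, $-[1] + 1;
    }
    return \%gaps;
}

my $gaps = find_gaps('acgtt---cgatacg---acgact-----t');

for my $length (sort { $a <=> $b } keys %$gaps) {
    my @positions = @{ $gaps->{$length} };
    printf "number of %d base pair gaps: %d\n", $length, scalar @positions;
    printf "at positions:   %s\n", join ', ', @positions;
}
```

Because the gap length is simply the length of whatever (-+) matched, no pattern has to be built for each possible length in advance.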




Re: counting gaps in sequence data

2004-10-14 Thread Michael Robeson
Here is as far as I can get with some real code and pseudo code. This 
is just to show you that I am actually trying. :-)

Pseudo - code

# open DNA sequence file

print "Enter in the name of the DNA sequence file:\n";
chomp (my $dna_seq = <STDIN>);
open(DNA_SEQ, $dna_seq)
    or die "Can't open sequence file for input: $!";

# read sequence data into a hash - not sure if this is how I should do it?

my %sequences;
$/ = ''; # set to read in paragraph mode
print "\n***Discovered the following DNA sequences:***\n";
while ( <DNA_SEQ> ) {
    chomp;
    next unless s/^\s*(.+)//;
    my $name = $1;
    s/\s//g;
    $sequences{$name} = $_;
    print "$name found!\n";
}
close DNA_SEQ;

###
# search for and characterize gaps
###

somehow get data from hash and present it to a loop

my %gaptype;

major pseudo code below

foreach /\D(-+)\D/ found in each sequence   # searches for gaps flanked by sequence
	$position = pos($1);
	$gaplength = $1;
	$gaplength =~ tr/-//;	# count the number of '-' for that particular
				# gap being processed
	$gaptype{$gaplength}++;	# count the number of times each gap type appears

somehow get information from loop and print as seen below

OUTPUT_
Human
number of 3 base pair gaps: 2
at positions:   6, 16
number of 5 base pair gaps: 1
at positions:   25
.
.. and so on...
.
__DATA__
human
acgtt---cgatacg---acgact-----t
chimp
acgtacgatac---actgca---ac
mouse
acgata---acgatcg----acgt
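
For the "read sequence data into a hash" step, here is a minimal sketch of paragraph-mode reading. It assumes that records in the file are separated by blank lines (which paragraph mode requires) and that the first line of each record is the name; the in-memory string $input is made up for the example so that it runs without a data file.

```perl
use strict;
use warnings;

# Assumed layout: a name line, then one or more sequence lines,
# with a blank line between records.
my $input = "human\nacgtt---cgatacg\n---acgact-----t\n\n"
          . "mouse\nacgata---acgatcg----acgt\n";

my %sequences;
{
    local $/ = '';    # paragraph mode; the old value returns after this block
    open my $fh, '<', \$input or die "Can't open in-memory file: $!";
    while (my $record = <$fh>) {
        # First line is the name, the remaining lines are sequence data.
        my ($name, @lines) = split /\n/, $record;
        next unless defined $name && length $name;
        $sequences{$name} = join '', @lines;
    }
    close $fh;
}

print "$_: $sequences{$_}\n" for sort keys %sequences;
```

Using local keeps the change to $/ from leaking into later reads, such as a prompt on STDIN further down in the script.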



Re: counting gaps in sequence data

2004-10-14 Thread Michael Robeson
Yeah, I have just submitted that same question verbatim to the bio-perl 
list. I am still running through some ideas though. I have both 
Bioinformatics perl books. They are not very effective teaching books.

The books spend too much time on using modules. Though while I 
understand the usefulness of not having to re-write code, it is a bad 
idea for beginners like me. Because re-writing code at first gives me a 
lot of practice. Some of the scripts in the books use like 3-5 modules, 
so it gets confusing on what is going on.

I mean the books are not useless, but they definitely are structured 
for a class with a teacher.

:-)
-Mike



Moving between hashes 2.

2004-09-24 Thread Michael Robeson
Gunnar,
Thanks so much for the help and the links! They help quite a bit. I 
decided to use the if statement you posted:

    if ( $aa eq '-' ) {
        $hash3{$_} .= '---';
    } else {
        $hash3{$_} .= substr $dna,0,3,'';
    }

instead of:

    $hash3{$_} .= $aa eq '-' ? '---' : substr $dna,0,3,'';

only because I had to add a $count++ statement within the else block 
(shown below) to accomplish another task within my larger script:

    if ( $aa eq '-' ) {
        $hash3{$_} .= '---';
    } else {
        $hash3{$_} .= substr $dna,0,3,'';
        $count++
    }

I couldn't figure out if it was possible to add $count++ within the ?: 
statement above. I tried but could not get it to work.

However, everything works well at this point. Again, I really 
appreciate the help!

-Mike
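
For what it's worth, the counter can live inside the ?: form if the false branch is wrapped in a do block, since a do block evaluates to its last expression. This is a sketch with made-up scalar names ($aa_seq, $gapped) standing in for the hash entries of the larger script; the if/else version is arguably still the easier one to read.

```perl
use strict;
use warnings;

my $aa_seq = 'mf-d-f';          # amino-acid sequence with gaps
my $dna    = 'agtcatcactcg';    # matching DNA, three letters per amino acid
my $count  = 0;                 # number of codons copied from $dna
my $gapped = '';

for my $aa ( split //, $aa_seq ) {
    # The do block increments the counter first, then yields the codon
    # that the four-argument substr removes from the front of $dna.
    $gapped .= $aa eq '-' ? '---' : do { $count++; substr $dna, 0, 3, '' };
}

print "$gapped ($count codons)\n";
```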
On Sep 20, 2004, at 6:55 PM, [EMAIL PROTECTED] wrote:
From: Gunnar Hjalmarsson [EMAIL PROTECTED]
Date: September 19, 2004 9:12:32 PM MDT
To: [EMAIL PROTECTED]
Subject: Re: Moving between hashes 2.

Michael S. Robeson II wrote:
> Ok, well I think I can see the forest but I have little idea as to
> what is actually going on here. I spent a few hours looking things
> up and I have a general sense of what is actually occurring but I
> am getting lost in the details that were posted in the last digest.

Well, before an attempt to explain and/or point you to the applicable
docs, I'd like to change my mind once again. :)  This is my latest
idea:

    my %hash3;
    for ( keys %hash1 ) {
        my $dna = $hash2{$_};
        for my $aa ( split //, $hash1{$_} ) {
            $hash3{$_} .= $aa eq '-' ? '---' : substr $dna,0,3,'';
        }
    }

I'll assume that you don't have a problem with the outer loop, that
simply iterates over the hash keys. As a first step in each iteration
I copy the DNA sequence to the $dna variable, so as not to destroy
%hash2.

Over to the 'tricky' part. The inner loop iterates over each character
in the amino-acid sequence data, and each character in turn is assigned
to $aa. For that I use the split() function:

    http://www.perldoc.com/perl5.8.4/pod/func/split.html

>   $hash3{$_} .= $aa eq '-' ? '---' : substr $hash2{$_},0,3,'';
>
> This is something new to me. I think I follow your use of the ?:
> pattern feature. However, none of the perl books I have discuss
> its use in this fashion.

That sounds strange to me, because that's how it should be used...
Read about the conditional operator in

    http://www.perldoc.com/perl5.8.4/pod/perlop.html

OTOH, that notation is basically the same as:

    if ( $aa eq '-' ) {
        $hash3{$_} .= '---';
    } else {
        $hash3{$_} .= substr $dna,0,3,'';
    }

which is a little more intuitive (at least I think it is).

> So, as far as I can tell, you are saying: hey, if you find '-' in
> $aa then append a '---' in $hash3, otherwise append the next three
> DNA letters.

Precisely.

> However, I do not understand the syntax of how perl is actually
> doing this.

Hopefully the if/else statement makes it easier to grasp, and the '.='
operator is used just for appending something to a string.

Finally we have my use of the substr() function.

    http://www.perldoc.com/perl5.8.4/pod/func/substr.html

It returns the first three characters in $dna, and since I also pass
the null string as the fourth argument, it changes the content of $dna
at the same time, i.e. it replaces the first three characters with
nothing.

HTH. If you need further explanations, you'll have to ask specific
questions.
--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl
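
The four-argument substr described above can be tried on its own. A small sketch, using the dog DNA string from the original question; the @codons array is only there for illustration.

```perl
use strict;
use warnings;

my $dna = 'agtcatgcatcg';

# Four-argument substr: return the first three characters and replace
# them in $dna with the empty string, so $dna gets shorter.
my $codon = substr $dna, 0, 3, '';

print "codon: $codon\n";
print "remaining: $dna\n";

# Repeating the call until $dna is empty walks through it codon by codon.
my @codons = ($codon);
push @codons, substr($dna, 0, 3, '') while length $dna;

print "@codons\n";
```

This is why the loop in the message copies $hash2{$_} into $dna first: the string being consumed ends up empty.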





Moving between hashes 2.

2004-09-17 Thread Michael Robeson
**Sorry, if this is a repeat. Wasn't sure if the mail went through. If you already replied can you re-send it to my e-mail address above as well? Thanks!***

I have two sets of data that have been stored in hashes. The first hash
has amino-acid (protein) sequence data. The second hash has the
corresponding DNA sequence of those amino-acids:

Hash 1
key:    value:
cat   = mfgdhf
dog   = mfg--f
mouse = mf-d-f

Hash 2
key:    value:
cat   = agtcatgcacactgatcg
dog   = agtcatgcatcg
mouse = agtcatcactcg

And I need to insert gaps (missing or absent data) proportionally into
the DNA sequence (Hash 2) so that the output is as follows:

Hash 3
key:    value:
cat   = agtcatgcacactgatcg
dog   = agtcatgca------tcg
mouse = agtca---tca---ctcg

It doesn't look right here, but all the lines should end up being the
same length with courier font. Basically, I am having trouble scanning
through, say, hash1{cat} and having every dash found there
represented as three dashes in hash2{cat}. Also, every
amino-acid is represented by 3 DNA letters. This is why I need to move
in increments of 3 and add in increments of 3 for my final data to
appear as it does in Hash 3.

Example of relationship:
M   F   -   D   -   F    = amino-acid
agt tca --- act --- tcg  = dna

I have everything else set up; I just need a few suggestions on how to
do the above. Any help will be greatly appreciated.



Re: combining data from more than one file...

2004-05-20 Thread Michael Robeson
Thanks to those that helped. The code works great. Now I will practice 
honing it down to the bare essentials. Below is the final code you 
all helped with.

-Thanks a million!
-Mike
 Begin PERL Code
#! /usr/bin/perl -w
use strict;
use FileHandle;

my %organisms;

print "Enter in a list of files to be processed:\n";
# For example:
# CytB.fasta
# NADH1.fasta
#
chomp (my @infiles = <STDIN>);
# TODO we should make this nice later
#my @infiles = ('genetics.txt');

print "Enter in the name of the OUTFILE:\n";
chomp (my $outfile = <STDIN>);

open(OUTFILE, ">$outfile")
    or die "Can't open OUTFILE: $!";

foreach my $infile (@infiles) {
    my $FASTA = new FileHandle;
    open ($FASTA, $infile)
        or die "Can't open INFILE: $!";

    # I moved this variable outside the while-loop
    # in order to be able to assign the data in
    # the next line to the organism it belongs to
    # (we're keeping track of the last start line
    #  that we came across here)
    my $orgID;
    while (defined($_ = <$FASTA>)) {
        chomp;
        print "\nWorking on $_\n";
        # see if this line is the start of an
        # organism; the thing we're searching for
        # looks like this:
        #  >dog
        # so try to match something like
        #   \s*   zero-to-many characters of
        #         optional whitespace
        #   >     the bigger-than sign
        #   \w+   one-to-many (word) characters
        # the parenthesis around the \w+ means that
        # we want to access this value later using $1
        if (/\s*>(\w+)/) {
            $orgID = $1;
            print "Found a new organism start line ('$orgID')\n";
        }
        # or just some data belonging to the last
        # organism we found
        else {
            print "Sequence data found: $_\n";
            print "Appending data to $orgID\n";

            # let's check if we've got data for this entry
            if (exists ($organisms{$orgID})) {
                # append the data to the existing hash entry
                $organisms{$orgID} .= $_;
            }
            else {
                # create a new hash entry for this data
                $organisms{$orgID} = $_;
            }
        }
    }
    # do not forget to close the input file
    close ($FASTA)
        or die "could not close INFILE : $!";
}

# we've processed all input files...print the resulting hash
print "\n\n";
while (my ($orgID, $sequence) = each(%organisms)) {
    print OUTFILE ">$orgID\n$sequence\n\n";
}
END PERL CODE



RE: combining data from more than one file...

2004-05-20 Thread Michael Robeson
Sorry, I meant to upload this script (see below). However, I have one 
last question. Why can't I use

s/\n//g;# instead of
tr/A-Za-z-//cd;

in the script below? I thought it would be simpler to remove the 
newline characters from $_ which is all I really want to do. However, 
most of the time all I will see are - and letters which is why I set 
the tr function the way I did.

I just couldn't figure out why the substitution function wouldn't work 
in this case. How am I setting it up wrong?

-Thanks
-Mike
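
A possible explanation, offered as a guess because the input files are not shown: chomp has already removed the trailing "\n" from each line, so s/\n//g finds nothing left to remove. If the files carry DOS-style line endings, a stray "\r" survives the chomp; tr/A-Za-z-//cd removes it only because it deletes every character that is not a letter or a dash. The sketch below assumes such a line; the variable names are made up.

```perl
use strict;
use warnings;

my $line = "acg---tta\r\n";    # assumed: a line with a DOS-style ending
chomp $line;                    # removes the "\n" only (default $/)

(my $with_s  = $line) =~ s/\n//g;          # nothing left to match
(my $with_tr = $line) =~ tr/A-Za-z-//cd;   # deletes "\r" and any other stray character

printf "after s///: %d characters\n", length $with_s;
printf "after tr:   %d characters\n", length $with_tr;
```

If that is the cause, s/\r//g (or s/\s+//g) would work where s/\n//g appears to do nothing.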
 BEGIN PERL SCRIPT
#! /usr/bin/perl -w
use strict;
use FileHandle;

my %organisms;

print "Enter in a list of files to be processed:\n";
# For example:
# CytB.fasta
# NADH1.fasta
#
chomp (my @infiles = <STDIN>);

print "Enter in the name of the OUTFILE:\n";
chomp (my $outfile = <STDIN>);

open(OUTFILE, ">$outfile")
    or die "Can't open OUTFILE: $!";

foreach my $infile (@infiles) {
    my $FASTA = new FileHandle;
    open ($FASTA, $infile)
        or die "Can't open INFILE: $!";

    my $orgID;
    while (defined($_ = <$FASTA>)) {
        chomp;
        print "\n Processing $_\n";
        if (/\s*>(\w+)/) {
            $orgID = $1;
            print "Found a new organism start line ('$orgID')\n";
        }
        else {
            tr/A-Za-z-//cd;  # originally tried  s/\n//g;

            print "Sequence data found: $_\n";
            print "Appending data to $orgID\n";

            $organisms{$orgID} .= $_;
        }
    }
    # do not forget to close the input file
    close ($FASTA)
        or die "could not close INFILE : $!";
}

# we've processed all input files...print the resulting hash
print "\n\n";
while (my ($orgID, $sequence) = each(%organisms)) {
    print OUTFILE ">$orgID\n$sequence\n\n";
}
 END PERL SCRIPT 



Re: combining data from more than one file...

2004-05-18 Thread Michael Robeson
Ok great. Most of what you show does make sense. However, there are 
some bits of code that I need further clarification with. Some bits I 
am able to tell what they are doing but I do not quite know how or why 
they work the way they do. I'll state these areas in the code we've 
got together at this point.

Hopefully, I have copied over the bits you wrote correctly. I find this 
is like learning Spanish. I can read and (roughly) get the gist of the 
code. But when it comes to writing the original code on my own is when 
I have trouble. I am sure this will go away when I practice more. :-)

I didn't finish everything because I just need some code explained / 
clarified.

Start PERL code
#!/usr/bin/perl -w
use strict;
use FileHandle;
# I am unsure of what this module is. I've tried looking it up
# in the Camel and Llama book to no avail, not enough description.
# I guess I have to figure out the whole object thing?
my %organisms;

print "Enter in a list of files to be processed:\n";
# For example:
# Cytb.fasta
# NADH1.fasta
# ...
# chomp (my @infiles = <STDIN>);
# TODO we should make this nicer later
my @infiles = ('genetics.txt');

foreach my $infile(@infiles) {
    my $FASTA = new FileHandle;

    # Does the above statement tell PERL to create a new
    # filehandle for each file it finds? I guess I need to understand
    # what new and the module FileHandle are doing.
    open ($FASTA, $infile)
        or die "Can't open INFILE:$!";

    #$/='' #Set input operator
    my $orgID;
    while (defined($_ = <$FASTA>)) {

        # Above I am unsure of why the defined function
        # helps us here? I know it has something to do with an
        # expression containing a valid string, but I am unsure
        # of its function here. This is something I would have
        # never thought to do.  :-)

        chomp;
        print "\nworking on $_\n";

        if (/\s*>(\w+)/) {
            $orgID=$1;
            print "Found a new organism start line ('$orgID')\n";

            # The above regex makes complete sense. Actually, I was going to put
            # something similar to that in my original post but wasn't sure
            # if this was appropriate at the time. I guess it was!

        } else {
            print "This is just some data: $_\n";
            print "This data needs to be appended to the hash entry for $orgID\n";
            # okay, in the above you are taking the left over
            # sequence ($_) and linking it as a value to $orgID ?

            if (exists ($organisms{$orgID})) {
                #TODO append the data to the hash here

                # I guess I would put the following to append to
                # the already existing hash:
                # $organisms{$orgID} .= $_;

            } else {
                #create new hash entry for this data
                $organisms{$orgID} = $_;
            }
        }
    }

    # Do not forget to close the input file
    close ($FASTA)
        or die "Could not close INFILE: $!";
}

# We've processed all input files... print the resulting hash
print "\n*\n";
while (my($orgID, $sequence) = each(%organisms)) {
    # since I want the output as:
    # >cat
    # actgac---cgatc-ag-cttag---acg
    # >dog
    # actatc---actat-at-accta---atc
    # I would change the print statement to:
    print ">" . "$orgID\n$sequence\n";
}
end PERL code
Thanks for all your help so far! Most of this is starting to help my 
thinking. I will be doing a lot more of this multi-file parsing as most 
of my work entails manipulating data in several files or folders at 
once.

-Mike
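
On the two questions left in the comments, a sketch that may help. FileHandle's new simply creates a fresh handle object on each pass through the loop; a lexical handle from three-argument open does the same job without the module. The defined test keeps the loop from stopping early on a line whose text is false in Perl (for example a final "0" with no newline); Perl adds that test automatically when the loop is written as while (my $line = <$fh>). The sample data in $fasta is made up so that the example runs without a file.

```perl
use strict;
use warnings;

# Assumed sample input; a real script would pass a file name to open
# instead of the reference \$fasta.
my $fasta = ">cat\nactgac---cgatc\n-ag-cttag---acg\n"
          . ">dog\nactatc---actat\n-at-accta---atc\n";

my (%organisms, $orgID);

# A lexical handle from three-argument open; no FileHandle->new needed.
open my $fh, '<', \$fasta or die "Can't open input: $!";

# Perl treats this as while (defined(my $line = <$fh>)).
while (my $line = <$fh>) {
    chomp $line;
    if ($line =~ /^\s*>\s*(\w+)/) {
        $orgID = $1;                    # header line: remember the organism
    } elsif (defined $orgID) {
        $organisms{$orgID} .= $line;    # .= also creates the entry on first use
    }
}
close $fh or die "Could not close input: $!";

print ">$_\n$organisms{$_}\n" for sort keys %organisms;
```

Since .= creates a missing hash entry on its own, the exists check in the listing above is optional.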