position data

2004-11-07 Thread Michael S. Robeson II
Well, with much help I have been able to come up with this code, which 
currently does not work:

#!/usr/bin/perl
use warnings;
use strict;
use Data::Dumper;
my(%gap, %gap_pos, $animal);
while (<DATA>) {
  if (/>(\w+)/) {
    $animal = $1;
  } else {
    while (/(-+)/g) {
      my $gap_length = length $1;
      my $position = pos ($1);
      $gap{$animal}{$gap_length}++;
      push (@{$gap{$animal}{$gap_length}}, $position);
    }
  }
}
print Dumper \%gap;
__DATA__
>human
acgtt---cgatacg---acgact-t
>chimp
acgtacgatac---actgca---ac
>mouse
acgata---acgatcgacgt
Actually, the code will work fine if you comment out the second and last 
lines of the inner while loop. Anyway, I am having trouble adding 
position data to my hashes. I would like Data::Dumper to output data 
like this (I always get my syntax messed up, so I will just show part of 
$VAR1, but hopefully you'll understand):

$VAR1 = {
  'human' => {
   '5' => '1' =>  {
25
}
   '3' => '2' =>  {
6,
16,
 },
}
So, I am trying to figure out why the code I have does not work? What 
am I "not getting"? Any suggestions?

-Thanks
-Mike
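
Two things in the posted code look like the likely culprits: pos() 
expects the string being matched (here $_), not $1, and 
$gap{$animal}{$gap_length} is used first as a counter and then as an 
array reference, which "use strict" refuses. Below is a minimal sketch 
of one way around both, assuming a layout with separate "count" and 
"positions" keys. The sub name gap_table and those two key names are 
made up for illustration; they are not from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns { name => { gap_length => { count, positions } } }
sub gap_table {
    my @lines = @_;    # copy, so pos() has a modifiable string to work on
    my (%gap, $animal);
    for my $line (@lines) {
        if ($line =~ />(\w+)/) {
            $animal = $1;
            next;
        }
        while ($line =~ /(-+)/g) {
            my $gap_length = length $1;
            # pos() is the offset just past the match; step back to the
            # first dash of the run and count from 1
            my $position = pos($line) - $gap_length + 1;
            $gap{$animal}{$gap_length}{count}++;
            push @{ $gap{$animal}{$gap_length}{positions} }, $position;
        }
    }
    return \%gap;
}

my $gap = gap_table('>human', 'acgtt---cgatacg---acgact-t');
for my $size (sort { $a <=> $b } keys %{ $gap->{human} }) {
    my $entry = $gap->{human}{$size};
    printf "size %d: count %d, positions %s\n",
        $size, $entry->{count}, join(',', @{ $entry->{positions} });
}
```

Passing the returned reference to Data::Dumper should show the count and 
the list of positions side by side under each gap length.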
--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
 



comparing data between hashes

2004-10-24 Thread Michael S. Robeson II
I have several "hashes of hashes" that look like this (sorry if my 
formatting is a little off):

Human => {			# HoH for human
	1  => [1,32,54,67]	# numbers in [ ] are one comma-delimited string,
				# not separate hash values
	2  => [14,52,74,87]
	5  => [33,44,64,107]
}
Chimp => {			# HoH for Chimp
	1  => [1,32,67]	
	2  => [14,74,87]
	5  => [33,44]
}

Note: The numbers in between the [ ] are a STRING delimited by commas 
and NOT separate hash values. I already have a working script that 
appends these numbers to the appropriate hash separated by a comma or 
newline like this:
		$class{$position} .= "$pos" . " ,"; 		#to get:  5  => [33,44,64,107]

And I am having trouble trying to compare data. I want to compare each 
number (i.e. 1, 2, & 5) and its data with the same number in the 
other species. For example, I would like to print out a table (see 
below) that compares the data between the group "1" in each species.

Alleles for set: 1
Allele: 1   32  54  67
Human   1   1   1   1   # 1 = present
Chimp   1   1   0   1   # 0 = absent
And so on for 2 and 5.
I can most likely do the print formatting for the table myself so I do 
not need help with that. I just need help with being able to compare 
the data within each Hash of a Hash (HoH). I do not know if I should 
read this data into yet another hash (which would be very busy) or an 
array somehow. I have been trying to figure this out all week to no 
avail.

Any ideas or suggestions?
- Thanks
- Mike
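
One possible approach, sketched under assumptions: split each 
comma-delimited string back into numbers, collect the union of all 
numbers seen for a set, then mark each species 1 or 0 per number. The 
sub name presence_table and the shape of %data are invented here for 
illustration; the strings mimic what the .= line above would produce.

```perl
use strict;
use warnings;

# Hypothetical helper: $data maps species => { set => "comma,delimited,string" }
sub presence_table {
    my ($data, $set) = @_;
    my (%seen, %has);
    for my $species (keys %$data) {
        my $string = $data->{$species}{$set};
        next unless defined $string;
        for my $allele (grep { length } split /\s*,\s*/, $string) {
            $seen{$allele}          = 1;   # union of alleles over all species
            $has{$species}{$allele} = 1;   # which species carries which allele
        }
    }
    my @alleles = sort { $a <=> $b } keys %seen;
    my %row;
    for my $species (keys %$data) {
        $row{$species} = [ map { $has{$species}{$_} ? 1 : 0 } @alleles ];
    }
    return (\@alleles, \%row);
}

my %data = (
    Human => { 1 => '1 ,32 ,54 ,67 ,' },
    Chimp => { 1 => '1 ,32 ,67 ,' },
);
my ($alleles, $row) = presence_table(\%data, 1);
print join("\t", 'Allele:', @$alleles), "\n";
print join("\t", $_, @{ $row->{$_} }), "\n" for sort keys %$row;
```

Storing the positions as real arrays (push instead of .=) would make the 
split step unnecessary, but the sketch keeps the string format described 
above.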
 



Re: Counting gaps in sequence data revisited.

2004-10-17 Thread Michael S. Robeson II
I cleaned up the code a little. So, here it is for anyone interested:
#!/usr/bin/perl
# By Michael S. Robeson II with the help from the folks at learn.perl.org
# and bioperl.org
# 10/16/2004
# Last updated: 10/17/2004
# This script was made for the purpose of searching for indels (gaps) in
# aligned DNA or protein sequences that are in FASTA format. It tallies up
# all of the different size gaps within each sequence string. While it does
# this it counts the number of times each gap of a given size is represented
# in each sequence and at the same time reports all of the positions at
# which that particular "gap-size" or indel appears.
# contact: [EMAIL PROTECTED] if you have questions or comments

use warnings;
use strict;
#
# Introduction
#
print "\n\t**Welcome to Mike Robeson's Gap-Counting Script!**\n
	A - Just be sure that your sequence alignment file
	is in FASTA format!
	B - Make sure there are no duplicate names within an individual file!
	C - Output file will be based on the name of the input file. It is named
	by prepending \'indel_list_\' to the name of your input file.\n\n";

###
# Open Sequence Data & OUTFILE
###
print "Enter in the name of the DNA sequence file:\n";
chomp (my $dna_seq = <STDIN>);
open(DNA_SEQ, $dna_seq)
    or die "Can't open file: $!\n";
open(OUTFILE, ">indel_list_"."$dna_seq")
    or die "Can't open outfile: $!\n";
#
# Read sequence data into a hash
#
my %sequences;
$/ = '>';
print "\n***Discovered the following DNA sequences:***\n";
while ( <DNA_SEQ> ) {
    chomp;
    next unless s/^\s*(.+)//;
    my $name = $1;
    s/\s//g;
    $sequences{$name} = $_;
    print "$name found!\n";

    print "\n";
}
close DNA_SEQ;
##
# Iterate over gaps and write to file
##
foreach (keys %sequences) {
    print "\t\t\>\>\>\>\>\> $_ \<\<\<\<\<\<\n";
    print OUTFILE "\>\>\>\>\>\> $_ \<\<\<\<\<\<\n";

    my $dna = $sequences{$_};
    my %gap_data;
    my %position;
    while ($dna =~ /(\-+)/g) {
        my $gap_length = length $1;
        my $gap_pos = pos ($dna) - $gap_length + 1;
        $gap_data{$gap_length}++;
        $position{$gap_length} .= "$gap_pos"." \n";
    }

    my @indels = keys (%gap_data);
    my @keys = sort { $a <=> $b} @indels;
    foreach my $key (@keys) {
        print "Indel size:\t$key\tTimes found:\t$gap_data{$key}\n";
        print OUTFILE "Indel size:\t$key\tTimes found:\t$gap_data{$key}\n";
        print "Positions:\n";
        print OUTFILE "Positions:\n";
        print "$position{$key}";
        print OUTFILE "$position{$key}";
        print "\n";
        print OUTFILE "\n";
    }
}
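
The start of each gap is worked out above as pos() minus the gap length 
plus one. As a hedged alternative, Perl's built-in @- array (documented 
in perlvar) holds the 0-based offset at which each capture group began, 
so the same 1-based position can be read from it directly. The sub name 
gap_starts is made up for this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: returns [length, 1-based start] for each run of dashes
sub gap_starts {
    my ($dna) = @_;
    my @gaps;
    while ($dna =~ /(-+)/g) {
        # $-[1] is the 0-based offset where capture group 1 started
        push @gaps, [ length $1, $-[1] + 1 ];
    }
    return @gaps;
}

printf "size %d at %d\n", @$_ for gap_starts('---acgtacg-atcg---actgac-');
```

Both ways should report the same positions; which one reads better is a 
matter of taste.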




Counting gaps in sequence data revisited.

2004-10-17 Thread Michael S. Robeson II
I just wanted to thank everyone for their help and suggestions. This is 
the final full working code to count continuous gaps in every sequence 
in a multi-sequence FASTA file. It may not be elegant but it is fast 
and works well. Sorry for the long post but I wanted to share this with 
those that do any DNA work. :-)

I show in order: the format of the input data, the output of the input 
data, and finally the working script. I have yet to add comments into 
the code but I am sure many of you veterans will figure it out.

-Thanks again all! As always comments, questions or suggestions are 
welcome!
-Mike

--- INPUT ---
>dog
atcg--acgat---act-ca----
>cat
acgt-acgt----acgt-gt-agct-
>mouse
---acgtacg-atcg---actgac-
--- OUTPUT ---
***Discovered the following DNA sequences:***
dog found!
cat found!
mouse found!
>>>>>> mouse <<<<<<
Indel size: 1   Times found:2
Positions:
11
25
Indel size: 3   Times found:2
Positions:
1
16
>>>>>> cat <<<<<<
Indel size: 1   Times found:4
Positions:
5
18
21
26
Indel size: 4   Times found:1
Positions:
10
>>>>>> dog <<<<<<
Indel size: 1   Times found:1
Positions:
18
Indel size: 2   Times found:1
Positions:
5
Indel size: 3   Times found:1
Positions:
12
Indel size: 4   Times found:1
Positions:
21
--- Script ---
#!/usr/bin/perl
# By Michael S. Robeson II, with the help of friends at learn.perl.org
# and bioperl.org! :-)
# 10/16/2004

use warnings;
use strict;
###
# Open Sequence Data & OUTFILE
###
print "Enter in the name of the DNA sequence file:\n";
chomp (my $dna_seq = <STDIN>);
open(DNA_SEQ, $dna_seq)
    or die "Can't open file: $!\n";
open(OUTFILE, ">indel_list_"."$dna_seq")
    or die "Can't open outfile: $!\n";

# Read sequence data into a hash

my %sequences;
$/ = '>';
print "\n***Discovered the following DNA sequences:***\n";
while ( <DNA_SEQ> ) {
    chomp;
    next unless s/^\s*(.+)//;
    my $name = $1;
    s/\s//g;
    $sequences{$name} = $_;
    print "$name found!\n";
}
close DNA_SEQ;
##
# iterate over gaps and write to file
##
foreach (keys %sequences) {
    print "\t\t\>\>\>\>\>\> $_ \<\<\<\<\<\<\n";
    print OUTFILE "\>\>\>\>\>\> $_ \<\<\<\<\<\<\n";
    my $dna = $sequences{$_};
    my %gap_data;
    my %position;
    while ($dna =~ /(\-+)/g) {
        my $gap_pos = pos ($dna) - length($&) + 1;
        my $gap_length = length $1; #$1 =~ tr/\-+//
        $gap_data{$gap_length}++;
        $position{$gap_length} .= "$gap_pos"." \n";
    }

    my @indels = keys (%gap_data);
    my @keys = sort { $a <=> $b} @indels;

    foreach my $key (@keys) {
        print "Indel size:\t$key\tTimes found:\t$gap_data{$key}\n";
        print OUTFILE "Indel size:\t$key\tTimes found:\t$gap_data{$key}\n";
        print "Positions:\n";
        print OUTFILE "Positions:\n";
        print "$position{$key}";
        print OUTFILE "$position{$key}";
        print "\n";
        print OUTFILE "\n";
    }
    # Can replace the last "foreach loop" above with the while loop
    # below to do the same thing. Only gap sizes will not be sorted,
    # nor is it set up to print to a file.
    #
    # while (my ($key, $value) = each (%gap_data)) {
    #   print "Indel size:\t$key\tTimes found:\t$gap_data{$key}\n";
    #   print "Positions:\n";
    #   print "$position{$key}";
    #   print "\n\n";
    # }
}




Moving between hashes 2.

2004-09-19 Thread Michael S. Robeson II
Ok, well I think I can see the forest but I have little idea as to what 
is actually going on here. I spent a few hours looking things up and I 
have a general sense of what is actually occurring but I am getting 
lost in the details that were posted in the last digest. See below:

On Sep 19, 2004, at 10:08, [EMAIL PROTECTED] wrote:
I see that you also made use of arrays. It struck me that, since the 
starting point is strings and not lists, using substr() would be more 
straight-forward:

  my %hash3;
  for ( keys %hash1 ) {
while ( my $aa = substr $hash1{$_},0,1,'' ) {
	I have never seen anything like this nor can I find anything in any of 
my Perl books to help me explain what the 0,1 and the '' are doing to 
the substr of $hash1. I assume it is position information of some kind? 
If so, what is going on?

  $hash3{$_} .= $aa eq '-' ? '---' : substr $hash2{$_},0,3,'';
	
	This is something new to me. I think I follow your use of the ?: 
pattern feature. However, none of the Perl books I have discuss its 
use in this fashion. So, I am unsure of how you know to do that, or 
rather... how would I have known that I can do that? But basically I 
see that you are looking for '-' and equating it with what is matching 
between the ? and :  (i.e. '---').

	So, as far as I can tell, you are saying: "hey, if you find '-' in $aa 
then append a '---' in $hash3, otherwise append the next three DNA 
letters". However, I do not understand the syntax of how perl is 
actually doing this.

Help with explanation would be greatly appreciated. As you can see I 
can see what the big picture is, it's just that I am unable to 
determine mechanistically how perl is actually going about doing it. 
Also, any online references to the techniques used above would be 
great. I'd look for them myself but I do not know what some of these 
are actually called?

-Thanks so much, I have learned a little just from this much so far.
-mike
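
The two pieces asked about can be tried in isolation. The four-argument 
form of substr is described under "perldoc -f substr", and ?: is the 
"Conditional Operator" in perlop. A small sketch, using a made-up 
$protein string and the placeholder word 'codon' standing in for the 
three DNA letters:

```perl
use strict;
use warnings;

my $protein = 'mf-d-f';

# Four-argument substr: substr EXPR, OFFSET, LENGTH, REPLACEMENT
# Here: replace 1 character at offset 0 with '' (that is, delete it)
# and return the character that was removed.
my $first = substr $protein, 0, 1, '';
print "$first $protein\n";    # prints "m f-d-f"

# CONDITION ? VALUE_IF_TRUE : VALUE_IF_FALSE
# yields the first value when the condition holds, else the second.
my $piece = $first eq '-' ? '---' : 'codon';
print "$piece\n";             # prints "codon"
```

So each pass of the while loop bites one letter off the front of the 
protein string, and the loop ends once substr returns an empty (false) 
string.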
 



Moving between hashes.

2004-09-17 Thread Michael S. Robeson II
I have two sets of data that have been stored in hashes. The first hash 
has amino-acid (protein) sequence data. The second hash has the 
corresponding DNA sequence of those amino-acids:

Hash 1
key:value:
cat   = mfgdhf
dog   = mfg--f
mouse = mf-d-f
Hash 2
key:value:
cat = agtcatgcacactgatcg
dog = agtcatgcatcg
mouse = agtcatcactcg
And I need to insert gaps (missing or absent data) proportionally into 
the DNA sequence (Hash 2) so that the output is as follows:

Hash 3
key:value:
cat   = agtcatgcacactgatcg
dog   = agtcatgca------tcg
mouse = agtcat---cac---tcg
It doesn't look right here, but all the lines should end up being the 
same length with courier font. Basically, I am having trouble scanning 
through, say, hash1{cat} and having every dash found there 
represented as three dashes in hash2{cat}. Also, every 
amino-acid is represented by 3 DNA letters. This is why I need to move 
in increments of 3 and add in increments of 3 for my final data to 
appear as it does in Hash 3.

Example of relationship:
 M   F   -   D   -   F   = amino-acid
agt tca --- act --- tcg  = dna
I have everything else set up I just need a few suggestions on how to 
do the above. Any help will be greatly appreciated.
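
One way this could be approached, as a sketch only: cut the DNA into 
three-letter codons, then walk the protein one character at a time, 
emitting '---' for a gap and the next codon otherwise. The sub name 
gap_dna is invented for illustration, and the sketch assumes the DNA 
holds exactly one codon per non-gap amino acid.

```perl
use strict;
use warnings;

# Hypothetical helper: thread a gapped protein onto its ungapped DNA
sub gap_dna {
    my ($protein, $dna) = @_;
    my @codons = $dna =~ /(.{3})/g;    # split the DNA into 3-letter codons
    my $out = '';
    for my $aa (split //, $protein) {
        # a protein gap becomes three dashes; anything else uses up one codon
        $out .= $aa eq '-' ? '---' : shift @codons;
    }
    return $out;
}

print gap_dna('mf-d-f', 'agtcatcactcg'), "\n";    # prints "agtcat---cac---tcg"
```

Applied to every key shared by the two hashes, this would fill the third 
hash.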

 



Re: nested "if"

2004-07-04 Thread Michael S. Robeson II
Great, thanks for the help!
-Mike
On Jul 3, 2004, at 2:16 PM, Gunnar Hjalmarsson wrote:
Michael S. Robeson II wrote:
No, your post was not in the last e-mail digest I received,
I see. Sometimes I think that digest mode for mailing lists is a
nuisance. ;-)
But the link you provided seems to clear things up for me. So, it's
not about the order of operation but the timing of when the
variables actually get defined.
That is, during the beginning of the loop since the values are not
defined then they are ignored. When the loop proceeds again these
values (at this point) are now defined because of the previous
iteration of the loop has set the values from the else statement?
Yep, that's it.
I would suggest that you to post the above comment to the list, too.
Rgds,
Gunnar




Re: nested "if"

2004-07-04 Thread Michael S. Robeson II
No, your post was not in the last e-mail digest I received, unless I 
missed it somehow? But the link you provided seems to clear things up 
for me. So, it's not about the order of operation but the timing of 
when the variables actually get defined.

That is, during the beginning of the loop since the values are not 
defined then they are ignored. When the loop proceeds again these 
values (at this point) are now defined because of the previous 
iteration of the loop has set the values from the else statement?

-Mike
On Jul 2, 2004, at 10:25 PM, Michael S. Robeson II wrote:
Well yeah, the indentation makes it much clearer. However, this 
does not help me understand how the nested "if" statements are 
working. Which of the two "if" statements gets evaluated first? I am 
trying to figure out "in English" what the "if" statements are 
actually doing. Is it saying:

"If a line begins with  ">bla-bla"  and if  $seq  (which appears 
nowhere else in the code other than " $seq="" ") exists, assign it to 
the hash "pro" with the name "bla-bla"."

So my question is how does the inner if statement work when $seq="" is 
outside that "if" statement?

Is the outer "if" statement evaluated first, then the inner? Because 
how does the inner "if" statement know what "$seq" is?

I am probably not making any sense but I am trying to figure out 
mechanically how the perl interpreter knows what to do in the context 
of the nested if statements.

-Thanks
-Mike
Gunnar Hjalmarsson wrote:
That illustrates the importance of indenting the code in a way that 
makes sense:

while () {
    $line=$_;
    if ($line=~/^>(.+)/) {
        if ($seq) {
            $pro{$name}=$seq;
            #print "SEQ:\n$pro\n\n";
        }
        $name=$1;
        $name=~s/\s//g;
        push @names, $name;
        #print "$name\n";
        $k++;
        $seq="";
    } else {
        chomp $line;
        $seq.=$line;
    }
}
Quite a difference, isn't it?
--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl




Re: nested "if"

2004-07-03 Thread Michael S. Robeson II
Ok, that makes much more sense, I think. So, I guess, the outer 'if' 
and 'else' statements get evaluated first. Then the inner 'if' can 
proceed once all the lines of data have been gathered in the outer 
'else' statement. This way the lines can be assigned as a key-value 
pair in the hash. I guess the individual who wrote the code could have 
made it much cleaner or easier to read. The re-organizing of the 
conditionals in your second e-mail makes it perfectly clear and would 
be something I would have done had I known how the nested 'if' 
statement was working (again, if I have it right). Hopefully, I 'got 
it' now. I can see why many coders are annoyed with nested if 
statements.  :-)

-Thanks!
-Mike

On Jul 3, 2004, at 4:38 AM, Randy W. Sims wrote:
On 7/2/2004 10:25 PM, Michael S. Robeson II wrote:
Well yeah, the indentation makes it much clearer. However, this 
does not help me understand how the nested "if" statements are 
working. Which of the two "if" statements gets evaluated first? I am 
trying to figure out "in English" what the "if" statements are 
actually doing. Is it saying:
"If a line begins with  ">bla-bla"  and if  $seq  (which appears 
nowhere else in the code other than " $seq="" ") exists, assign it to 
the hash "pro" with the name "bla-bla"."
So my question is how does the inner if statement work when $seq="" is 
outside that "if" statement?
Is the outer "if" statement evaluated first, then the inner? Because 
how does the inner "if" statement know what "$seq" is?
I am probably not making any sense but I am trying to figure out 
mechanically how the perl interpreter knows what to do in the context 
of the nested if statements.
Tidied up a little more:
my( %pro, @names);
my( $name, $seq, $k );
while (defined( my $line =  )) {
    if ($line =~ /^>(.+)/) {
        if ($seq) {
            $pro{$name} = $seq;
            $seq = '';
        }
        $name = $1;
        $name =~ s/\s//g;
        push @names, $name;
        $k++;
    } else {
        chomp( $line );
        $seq .= $line;
    }
}
This code deals with multi-line sequences, putting multiple lines 
together until a sequence is complete. The 'else' part of the outer 
'if' does the accumulation of multiple lines into a sequence. The 'if' 
part determines that a sequence is complete, captures some type of 
name from the sequence, stores the complete sequence in the '%pro' 
hash, and pushes the name onto a '@names' array. I'm guessing '$k' 
keeps tally of the number of sequences; it's not clear if that is 
necessary since `scalar @names` possibly will provide the same info. 
It's also unclear why there is a '@names' array that mostly duplicates 
`keys %pro`.

I think where you're (understandably) getting confused is that most of 
those variables are global. That's made more explicit in my 
strictified rewrite above. It could probably be rewritten better if we 
knew the exact format of the data being read.

Randy.
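
To watch the accumulate-then-store pattern run, here is a self-contained 
sketch that feeds a few made-up lines through the same two branches. The 
sub name read_records and the sample lines are invented for 
illustration. The final store after the loop is an addition of this 
sketch, since the last record would otherwise never reach the hash.

```perl
use strict;
use warnings;

# Hypothetical helper: parse FASTA-style lines into a name => sequence hash
sub read_records {
    my @lines = @_;
    my (%pro, $name);
    my $seq = '';
    for my $line (@lines) {
        chomp $line;
        if ($line =~ /^>(.+)/) {
            # Header line: $seq still holds the PREVIOUS record's letters.
            # At the very first header it is empty, so nothing is stored.
            $pro{$name} = $seq if defined $name && length $seq;
            ($name = $1) =~ s/\s//g;
            $seq = '';
        }
        else {
            $seq .= $line;    # sequence line: keep accumulating
        }
    }
    # store the last record, which no later header will trigger
    $pro{$name} = $seq if defined $name && length $seq;
    return %pro;
}

my %pro = read_records(">cat\n", "atac\n", "gg-a\n", ">dog\n", "ttga\n");
print "$_ => $pro{$_}\n" for sort keys %pro;
```

The inner test is evaluated only on header lines, and it looks at 
whatever the 'else' branch collected during earlier iterations.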




Re: nested "if"

2004-07-03 Thread Michael S. Robeson II
Well yeah, the indentation makes it much clearer. However, this 
does not help me understand how the nested "if" statements are working. 
Which of the two "if" statements gets evaluated first? I am trying to 
figure out "in English" what the "if" statements are actually doing. Is 
it saying:

"If a line begins with  ">bla-bla"  and if  $seq  (which appears 
nowhere else in the code other than " $seq="" ") exists, assign it to 
the hash "pro" with the name "bla-bla"."

So my question is how does the inner if statement work when $seq="" is 
outside that "if" statement?

Is the outer "if" statement evaluated first, then the inner? Because 
how does the inner "if" statement know what "$seq" is?

I am probably not making any sense but I am trying to figure out 
mechanically how the perl interpreter knows what to do in the context 
of the nested if statements.

-Thanks
-Mike
Gunnar Hjalmarsson wrote:
That illustrates the importance of indenting the code in a way that 
makes sense:

while () {
    $line=$_;
    if ($line=~/^>(.+)/) {
        if ($seq) {
            $pro{$name}=$seq;
            #print "SEQ:\n$pro\n\n";
        }
        $name=$1;
        $name=~s/\s//g;
        push @names, $name;
        #print "$name\n";
        $k++;
        $seq="";
    } else {
        chomp $line;
        $seq.=$line;
    }
}
Quite a difference, isn't it?
--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl
 



nested "if"

2004-07-02 Thread Michael S. Robeson II
I came across some code on the internet that looks like this (this is 
only part of the script):

while () {
  $line=$_;
if ($line=~/^>(.+)/) {
if ($seq) {
$pro{$name}=$seq;
#print "SEQ:\n$pro\n\n";
}
$name=$1;
$name=~s/\s//g;
push @names, $name;
#print "$name\n";
$k++;
$seq="";
}
else {
chomp $line;
$seq.=$line;
}
}
I am having trouble figuring out how the nested if statements work 
(i.e. what is the order of operation etc...) and their associated else 
statements. I pretty much understand the rest of what is going on but I 
am having trouble "putting into words" what the nested if statements 
are doing. I mean I know enough that the code is... ummm... yuck!!!   
:-)

-Thanks for any help!
-Mike
 



Re: combining data from more than one file...

2004-05-17 Thread Michael S. Robeson II
Well, this is the best I could do thinking through what you said. This 
is actually my first time working with hashes. Also, I am still a PERL 
newbie. So, I guess a little helpful code would go a long way. I just 
can't figure out how to link the regular expressions to the hash when 
searching through the multiple files, to do as you say:

***Philipp wrote:***
- open the first file
- search for the beginning of an "organism" (say: ">cat"), read
  everything after this point
- search in your hash if you already stored data of this organism
  - if yes, append your new sequence to the already existing data
  - if no, create a new key in the hash
- repeat this until you run out of "organisms"
- repeat the whole procedure until you run out of files

***end***
#!/usr/bin/perl
# This script will take separate FASTA files and combine the "like"
# data into one FASTA file.
#
use warnings;
use strict;
my %organisms (
"$orgID" => "$orgSeq",
 );
print "Enter in a list of files to be processed:\n";
# For example:
# CytB.fasta
# NADH1.fasta
# 
chomp (my @infiles = <STDIN>);
foreach $infile (@infiles) {
open  (FASTA, $infile)
or die "Can't open INFILE: $!";
$/='>'; #Set input operator
while (FASTA) {
chomp;
# Some regular expression match here?
# something that will set, say... ">cat"
# as the key "$orgID", something similar
# to below?
# and then set the sequence as the value
# "$orgSeq" like below?
# Do not know if or where to put the following,
# but something like:
if (exists $organisms{$orgID}) {
# somehow concatenate "like" data
# from the different files
}
# print the final Hash to an outfile?
}
 yeah, I'm lost.  :-)
-Mike
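
Philipp's steps can be sketched roughly as follows. This is only one 
possible reading of them: the sub name combine_fasta is invented, and 
the two "files" are in-memory strings with made-up sequences so the 
example runs on its own. Because .= creates a hash key the first time 
it is used, the explicit "exists" test becomes optional.

```perl
use strict;
use warnings;

# Hypothetical helper: merge FASTA records from any number of open filehandles
sub combine_fasta {
    my @handles = @_;
    my %organisms;
    local $/ = '>';    # read one ">name sequence" record at a time
    for my $fh (@handles) {
        while (my $record = <$fh>) {
            chomp $record;                          # strip the trailing '>'
            next unless $record =~ s/^\s*(\S+)//;   # first token is the name
            my $name = $1;
            $record =~ s/\s//g;                     # join multi-line sequences
            $organisms{$name} .= $record;           # new key, or append to old
        }
    }
    return %organisms;
}

# Made-up sample data standing in for two input files
my $file1 = ">cat\natac--\nac-a\n>dog\natgc\n";
my $file2 = ">dog\n-ac-\n>cat\ngg-t\n";
open my $fh1, '<', \$file1 or die "Can't open in-memory file: $!";
open my $fh2, '<', \$file2 or die "Can't open in-memory file: $!";

my %combined = combine_fasta($fh1, $fh2);
print ">$_\n$combined{$_}\n" for sort keys %combined;
```

With real files, each name read from the prompt would be opened in turn 
and its handle passed in the same way.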
 



combining data from more than one file...

2004-05-17 Thread Michael S. Robeson II
Hi all,
I am having trouble with combining data from several files, and I can't  
even figure out how to get started. So, I am NOT asking for any code  
(though pseudo-code is ok) as I would like to try figuring this problem  
out myself. So, if anyone can give me any references or hints that  
would be great.

So, here is what I am trying to do:
I have say 2 files (I'd like to do this to as many files as the user  
needs):

***FILE 1***
>cat
atacta--gat--acgt-
ac-ac-ggttta-ca--
>dog
atgcgtatgc-atcgat-ac--ac-a-ac-a-cac
>mouse
acagctagc-atgca--
acgtatgctacg--atg-
***end file 1***
***FILE 2***
>mouse
aatctgatcgc-atgca--
acgtaaggctagg-
>cat
atacta--gat--acgt-
ac-acacagcta--ca--
>dog
atgcgtatgc-atcgat
-ac--ac-a-ac-a-cac
***end file 2***
Basically, I would like to concatenate the sequence of each  
corresponding animal so that the various input files would be output  
to a file like so:

***output***
>cat
atacta--gat--acgt-ac-ac-ggttta-ca--atacta--gat--acgt-ac-acacagcta--ca--
>dog
atgcgtatgc-atcgat-ac--ac-a-ac-a-cacatgcgtatgc-atcgat-ac--ac-a-ac-a-cac
>mouse
acagctagc-atgca--acgtatgctacg--atg-aatctgatcgc-atgca--acgtaaggctagg-
***output end***

Notice that in the two files the data are not in the same order. So, I  
am trying to figure out how to have the script figure out what the  
first organism is in FILE 1( say "cat" in this case) and find the  
corresponding "cat" in the other input files. Then take the sequence  
data (all the cat data) from FILE 2 and concatenate it to the cat  
sequence data in FILE 1 to an output file. Then it should go on to the  
next organism in FILE 1 and search for that next organism in the other  
files (in this case FILE 2). I do not care about the order of the data,  
only that the "like" data is concatenated together.

Again, I do NOT want this solved for me (unless I am totally lost).  
Otherwise, I'll never learn. I would just like either hints /  
suggestions / pseudo code / even links to books or sites that discuss  
this particular topic. Meanwhile, I am eagerly awaiting my "PERL  
Cookbook" and I'll keep searching the web.

-Thanks!
-Mike

 



Re: formatting the loop

2004-02-12 Thread Michael S. Robeson II
On Feb 11, 2004, at 2:55 PM, James Edward Gray II wrote:

[snip]

my @char = ( /[a-z]/ig, ( '-' ) x $len )[ 0 .. $len - 1 ];

If I may, yuck!  This builds up a list of all the A-Za-z characters in 
the string, adds a boat load of extra - characters, trims the whole 
list to the length you want and stuffs all that inside @char.  It 
also receives a rank of "awful" on the James Gray Scale of 
Readability.  ;)
[snip]

Ok, now I understand. I found that my problem was with how the "next" 
command was operating in conjunction with the grouping of characters. 
Ok, making progress.  :-)

Now, about that array slice I have:

my @char = ( /[a-z]/ig, ( '-' ) x $len) [0 .. $len - 1];

I know it wastes a lot of memory and makes perl do much extra work. 
However, when I try to replace that line with something like this:

my @char = ( /[a-z]/ig, ( '-' ) x ($len - length) );

it doesn't work the way I thought it would (gee what a thought). I 
would like to express the code similar to
( '-' ) x ($len - length)
because it is easy for me to read and it tells you clearly what is 
going on. However, every time I try to implement something like that I 
get unexpected output or I have to really rewrite the loop, which I 
have been unable to troubleshoot, as you have been seeing. :-)  I think 
the 'length' command is also counting any '\n' characters or something, 
because my output ends up with different lengths like this when I use 
the ($len - length) way:

 a c u g a c g a g u - - - - - - - -   bob
 a c u g a c u a g c u g - - - - - - -   fred
with this input:

>bob
actgacgagt
>fred
actgactagctg
The reason I went with  /[a-z]/ig is because some sequence data uses 
other letters to denote ambiguity and other things. I guess I can only 
list the letters it uses but I was just lazy and typed in the entire 
range of "a to z".
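
The suspicion about '\n' looks plausible: a bare "length" measures all 
of $_, newlines included, so fewer dashes are added than letters are 
missing. A minimal sketch of one way around it, assuming the letters are 
pulled out first and the padding is computed from their count. The sub 
name pad_sequence is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only letters, then pad (or trim) to exactly $len
sub pad_sequence {
    my ($raw, $len) = @_;
    my @char = $raw =~ /[a-z]/ig;    # letters only; newlines are never counted
    push @char, ('-') x ($len - @char) if @char < $len;
    $#char = $len - 1;               # trim anything beyond $len
    return join ' ', @char;
}

print pad_sequence("\nactgacgagt\n", 12), "\n";    # prints "a c t g a c g a g t - -"
```

Here @char in numeric context is the number of letters found, which is 
the figure the padding should be based on.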

I will be continuing to work on it but here is the code as it stands 
now (with that awful array slice).

#!/usr/bin/perl

use warnings;
use strict;

print "Enter the path of the INFILE to be processed:\n";

# For example "rotifer.txt" or "../Desktop/Folder/rotifer.txt"

chomp (my $infile = <STDIN>);

open(INFILE, $infile)
    or die "Can't open INFILE for input: $!";

print "Enter in the path of the OUTFILE:\n";

# For example "rotifer_out.txt" or "../Desktop/Folder/rotifer_out.txt"

chomp (my $outfile = <STDIN>);

open(OUTFILE, ">$outfile")
    or die "Can't open OUTFILE for output: $!";

print "Enter in the LENGTH you want the sequence to be:\n";
my ( $len ) = <STDIN> =~ /(\d+)/ or die "Invalid length parameter";

print OUTFILE "R 1 $len\n\n\n\n"; # The top of the file is supposed

$/ = '>';  # Set input operator

while ( <INFILE> ) {
    chomp;
    next unless s/^\s*(.+)//;  # delete name and place in memory
    my $name = $1;             # whatever is in memory is saved as $name
    # take only sequence letters and add '-' to the end
    my @char = ( /[a-z]/ig, ( '-' ) x $len) [0 .. $len -1];
    my $sequence = join( ' ', @char);   # turn into scalar
    $sequence =~ tr/Tt/Uu/;             # convert T's to U's
    print OUTFILE " $sequence   $name\n";
}

close INFILE;
close OUTFILE;
 



Re: formatting the loop

2004-02-11 Thread Michael S. Robeson II
See comments below.

On Feb 11, 2004, at 2:55 PM, James Edward Gray II wrote:

On Feb 11, 2004, at 1:27 PM, Michael S. Robeson II wrote:

[snip]

Anyway, though it works great I am having a tough time trying to 
figure out WHY it works.
See comments below, in the code.

[snip]

I think if I can understand the mechanics behind this script it will 
only help my future understanding of writing PERL scripts.
Perl.  The language you are learning is called Perl, not PERL.  :)

Hehe, thanks. :-)

[snip]

[snip]

$/ = '>';  # Set input operator
Here's most of the magic.  This sets Perl's input separator to a > 
character.  That means that <INFILE> won't return a sequence of 
characters ending in a \n like it usually does, but a sequence of 
characters ending in a >.  It basically jumps name to name, in other 
words.

while ( <INFILE> ) {
chomp;
chomp() will remove the trailing >.
OK that makes pretty good sense. I understand that now, I hope. See 
next comment.



next unless s/^\s*(\S+)//;
my $name = $1;
Well, if we're reading name to name, the thing right at the beginning 
of our sequence is going to be a name, right?  The above removes the 
name, and saves it for later use.
OK, I think this is were my problem is. That is how does it know that 
the characters as in "bob" or "fred" are the names and not mistaking 
the sequence of letters "agtcaccgatg" to be placed in memory ($name). 
Basically I am reading the following:

next unless s/^\s*(\S+)//;

as "Go to the next line unless you see a line with zero or more 
whitespace characters followed by one or more non-whitespace characters 
and save the non-whitespace characters in memory."  If this is correct 
then how can perl tell the difference between the lines containing 
"bob" or "fred" (and put them in memory) and the "acgatctagc" (and not 
put these in memory)? Both lines of data seem to fit the 
expression pattern to me. I think it has something to do with how perl 
is reading through the file that makes this work?

So, there is something I am "missing", not noticing or realizing here.
Maybe I've been staring at the code for far too long and should take a 
break!  :-)
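
A small sketch that may help with the question above. It assumes that a record read with $/ set to '>' holds the name line and all of its sequence lines together in one string. split_record is a hypothetical helper, not something from the thread:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split one '>'-delimited record into its name and
# its sequence letters.
sub split_record {
    my ($record) = @_;
    # Without the /m modifier, ^ matches only at the very start of the
    # string, so the substitution removes just the first run of
    # non-whitespace characters (the name), never a later sequence line.
    return unless $record =~ s/^\s*(\S+)//;
    my $name = $1;
    # Whatever is left after the name has been removed is sequence data.
    my $letters = join '', $record =~ /[a-z]/ig;
    return ( $name, $letters );
}

# One record as the loop would see it: name first, then sequence lines.
my ( $name, $letters ) = split_record("bob\natcg\nacac\n");
print "$name: $letters\n";    # prints "bob: atcgacac"
```

The point is that the substitution is never applied to "bob" and "atcgacac" as separate lines. It sees one multi-line string per name and only touches the front of it.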



my @char = ( /[a-z]/ig, ( '-' ) x $len )[ 0 .. $len - 1 ];
If I may, yuck!  This builds up a list of all the A-Za-z characters in 
the string, adds a boatload of extra - characters, trims the whole 
list to the length you want and stuffs all that inside @char.  It 
also receives a rank of "awful", on the James Gray Scale of 
Readability.  ;)
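
One possible way to spell the same steps out, offered as a sketch and not as what either poster ended up using. Both sub names are hypothetical. slice_letters wraps the one-line list slice from the script, and pad_letters does the same work in separate statements:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The list slice from the script, wrapped in a hypothetical sub so that
# it works on an argument instead of $_.
sub slice_letters {
    my ( $text, $len ) = @_;
    return ( $text =~ /[a-z]/ig, ('-') x $len )[ 0 .. $len - 1 ];
}

# Hypothetical step-by-step version of the same idea.
sub pad_letters {
    my ( $text, $len ) = @_;
    my @char = $text =~ /[a-z]/ig;             # every sequence letter
    splice @char, $len if @char > $len;        # trim if there are too many
    push @char, ('-') x ( $len - @char );      # pad with dashes if too few
    return @char;
}

print join( ' ', slice_letters( "atcg\nac", 10 ) ), "\n";  # a t c g a c - - - -
print join( ' ', pad_letters( "atcg\nac", 10 ) ), "\n";    # a t c g a c - - - -
```

The second version is longer, but each line does one thing, which may be easier to follow while learning.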

Yeah, I need to clean that up a bit!

[snip]

-Mike




formatting the loop

2004-02-11 Thread Michael S. Robeson II
Hi all!

Well, based on the input I have received from everyone thus far I have 
been able to cobble the following code together (see below for the 
input and output of this script).

Anyway, though it works great I am having a tough time trying to figure 
out WHY it works. I am especially having trouble with the line: "next 
unless s/^\s*(\S+)//" in relation to the while loop it is in. 
Basically, I do not understand how the script is differentiating the 
">bob" line in the input from the lines of "agactgatcg" (again see 
input and output at bottom). I know that the "$/" has something to do 
with this, but I am not sure how or why it works.

I hate to sound like a dummy, but if anyone can help me understand WHAT 
the script is doing in the "while loop" I would really appreciate it. I 
think if I can understand the mechanics behind this script it will only 
help my future understanding of writing PERL scripts. Especially, 
when it comes to regular expressions and loops. Heck, if there is a 
better way to do certain parts of this let me know! Also, special 
thanks to James Gray for the help thus far!! Till then, I'll be 
wracking my head with my PERL books!

The working script:
_
#!/usr/bin/perl

use warnings;
use strict;
print "Enter the path of the INFILE to be processed:\n";

# For example "rotifer.txt" or "../Desktop/Folder/rotifer.txt"

chomp (my $infile = <STDIN>);

open(INFILE, $infile)
    or die "Can't open INFILE for input: $!";
print "Enter in the path of the OUTFILE:\n";

# For example "rotifer_out.txt" or "../Desktop/Folder/rotifer_out.txt"

chomp (my $outfile = <STDIN>);

open(OUTFILE, ">$outfile")
    or die "Can't open OUTFILE for output: $!";
print "Enter in the LENGTH you want the sequence to be:\n";
my ( $len ) = <STDIN> =~ /(\d+)/ or die "Invalid length parameter";
print OUTFILE "R 1 $len\n\n\n\n"; # The top of the file.

$/ = '>';  # Set input record separator

while ( <INFILE> ) {
    chomp;    # removes the trailing '>'
    next unless s/^\s*(\S+)//;
    my $name = $1;
    my @char = ( /[a-z]/ig, ( '-' ) x $len )[ 0 .. $len - 1 ];
    my $sequence = join( ' ', @char);
    $sequence =~ tr/Tt/Uu/;
    print OUTFILE " $sequence   $name\n";
}
close INFILE;
close OUTFILE;
___

Again this script is to convert the following data existing as either 
single line or multiline sequence data:

### input type 1 ###
>bob
atcgactagcatcgatcg
acacgtacgactagcac
>fred
actgactacgatcgaca
acgcgcgatacggcat
#
or (as I posted originally)

### input type 2 ###
>bob
atcgactagcatcgatcgacacgtacgactagcac
>fred
actgactacgatcgacaacgcgcgatacggcat
#
###output##
## Note that the T's are converted to U's in the output! ##
R 1  42

 a u c g a c u a g c a u c g a u c g a c a c g u a c g a c u a g c a c - - - - - - -   bob
 a c u g a c u a c g a u c g a c a a c g c g c g a u a c g g c a u - - - - - - - - -   fred





 



formatting and syntax

2004-02-04 Thread Michael S. Robeson II
Hi, I am still all too new to PERL and I am having trouble 
formatting my data into a new format. So here is my problem:

I have data (DNA sequence) in a file that looks like this:


# Infile

>bob
AGTGATGCCGACG
>fred
ACGCATATCGCAT
>jon
CAGTACGATTTATC
and I need it converted to:


# Outfile

R 1 20
 A G U G A U G C C G A C G - - - - - - -   bob
 A C G C A U A U C G C A U - - - - - - -   fred
 C A G U A C G A U U U A U C - - - - - -   jon
The "R 1" is static and should always appear. The "20" at the top of 
the new file should be a number defined by the user, that is they 
should be prompted for the length they wish the sequence to be. That is, 
the total length of the sequence plus the added dashes could be 20 or 
3000 or whatever.  So, if they type 20 and there are only 10 letters in 
that row then the script should add 10 dashes to bring that total up to 
the 20 chosen by the user.

Note that there should be a space between all letters and dashes - 
including a space at the beginning. Then there are supposed to be 7 
spaces after the sequence string followed by the name as shown in the 
example output file above. Also, of note is the fact that all of the 
T's are changed to U's. For those of you that know biology I am not 
only switching formats of the data but also changing DNA to RNA.
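
The requirements above could be sketched for a single record roughly as follows. This is only an illustration under stated assumptions: format_line is a hypothetical name, the sequence is assumed to arrive as one clean string of letters, and the gap before the name defaults to the 7 spaces asked for in the text (the example output as archived shows fewer, so the gap is left adjustable):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build one output line from a name, a DNA string
# and the target length chosen by the user.
sub format_line {
    my ( $name, $dna, $len, $gap ) = @_;
    $gap = 7 unless defined $gap;       # assumption: 7 spaces, per the text

    ( my $rna = $dna ) =~ tr/Tt/Uu/;    # DNA to RNA: T becomes U
    # Append $len dashes, then keep only the first $len characters, so
    # short sequences are padded and long ones are trimmed.
    my $padded = substr( $rna . '-' x $len, 0, $len );
    my $spaced = join ' ', split //, $padded;   # a space between characters

    return ' ' . $spaced . ( ' ' x $gap ) . $name . "\n";
}

my $len = 20;                           # would come from the user prompt
print "R 1 $len\n";
print format_line( 'bob', 'AGTGATGCCGACG', $len );
```

Doing the padding on a plain string with substr avoids building a list first. Whether that reads more easily than a loop is a matter of taste.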

I hope I am explaining this clearly enough, but here (see below) is as 
far as I can get with the code. I just do not know how to structure the 
loop/code to do this. I always have trouble manipulating data the 
way I want when it comes to a loop. I would prefer 
easy-to-understand code over efficient code. This way I can learn the 
simple stuff first and learn the short-cuts later. Thanks to anyone who 
can help.

- Cheers!
- Mike
##
#!/usr/bin/perl
use warnings;
use strict;
print "Enter the path of the INFILE to be processed:\n";

# For example "rotifer.txt" or "../Desktop/Folder/rotifer.txt"

chomp (my $infile = <STDIN>);

open(INFILE, $infile)
    or die "Can't open INFILE for input: $!";
print "Enter in the path of the OUTFILE:\n";

# For example "rotifer_out.txt" or "../Desktop/Folder/rotifer_out.txt"

chomp (my $outfile = <STDIN>);

open(OUTFILE, ">$outfile")
    or die "Can't open OUTFILE for output: $!";
print "Enter in the LENGTH you want the sequence to be:\n";
my ( $len ) = <STDIN> =~ /(\d+)/ or die "Invalid length parameter";
print OUTFILE "R 1 $len\n\n\n\n"; # The top of the file is supposed

# type of loop or structure to follow ?

#
