Re: Counting gaps in sequence data revisited.

2004-10-26 Thread Zeus Odin

Michael S. Robeson II [EMAIL PROTECTED] wrote in message ...

 open(DNA_SEQ, $dna_seq)

Due to precedence, the parens are optional here.

 open(OUTFILE, indel_list_.$dna_seq)
^
indel_list_$dna_seq  or   indel_list_.$dna_seq

 foreach (keys %sequences) {
 print \t\t\\ $_ \\\n;
 print OUTFILE \\ $_ \\\n;

The  and  do not need to be escaped.

 $position{$gap_length} .= $gap_pos. \n;

$gap_pos \n or $gap_pos\n

 foreach my $key (@keys) {
 print Indel size:\t$key\tTimes found:\t$gap_data{$key}\n;
 print OUTFILE Indel size:\t$key\tTimes found:\t$gap_data{$key}\n;
 print Positions:\n;
 print OUTFILE Positions:\n;
 print $position{$key};
 print OUTFILE $position{$key};
 print \n;
 print OUTFILE \n;
 }

As a general rule, whenever you keep doing the same thing in multiple
places, think sub {}. See alternative below.

--

#!/usr/bin/perl
use warnings;
use strict;

# By pre-declaring print2mfh, you can call it without parentheses!
sub print2mfh;
my($animal, %gap);

# my $data = '/path/to/data.txt';
# open DATA, $data or die Could not open $data: $!\n;
while (DATA) {
  /(\w+)/ and $animal=$1 and next;
  while ( /(-+)/g ) {
push @{ $gap{$animal}{length $1} }, $-[1]+1;
  }
}

die No gaps found! unless scalar keys %gap;
# if you are on unix or linux, use tee to print to multiple outputs
# See Perl Cookbook 7.18
# open TEE, | tee outfile.txt or die Could not tee off!: $!\n;
# instead of print2mfh and OUT, you would do: print TEE text to print;

open OUT, output.txt or die Could not open output file: $!\n;
print ***Discovered the following DNA sequences:***\n;
print $_ found!\n for sort keys %gap;

foreach my $animal(sort keys %gap) {
  print2mfh \t\t $animal \n;
  foreach my $length(sort {$a = $b} keys %{ $gap{$animal} }) {
my $pos = [EMAIL PROTECTED] $gap{$animal}{$length} };
print2mfh Indel size:\t$length\tTimes found:\t, scalar @$pos, \n;
print2mfh join(\n, Positions:, @$pos, \n);
  }
}


sub print2mfh {
  # print to multiple file handles
  foreach my $fh(*STDOUT, *OUT) { print $fh @_ }
}

__DATA__
 dog
atcg--acgat---act-ca
 cat
acgt-acgtacgt-gt-agct-
 mouse
---acgtacg-atcg---actgac-



-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
http://learn.perl.org/ http://learn.perl.org/first-response




Re: counting gaps in sequence data

2004-10-24 Thread Zeus Odin
This is a really cool problem. See solution below. Michael, next time post
some code please.

Thanks,
ZO

--
#!/usr/bin/perl
use warnings;
use strict;
use Data::Dumper;

my(%gap, $animal);

while (DATA) {
  if (/(\w+)/) {
$animal = $1;
  } else {
while (/(-+)/) {
  $gap{$animal}{ length $1 } += s/-+//;
}
  }
}

print Dumper \%gap;

__DATA__
 human
acgtt---cgatacg---acgact-t
 chimp
acgtacgatac---actgca---ac
 mouse
acgata---acgatcgacgt
--


   bio-informatics is a big area in which Perl is involved...   there's
even
   a book from O'reilly on the subject...


  If what you say is true, then maybe Mike needs to take his questions
  to those list?  I mean, if the problem he's describing is common and
  the data format he's using is common, I bet it's been solved already.
 
  Hey mike, have you searched on CPAN (search.cpan.org) for this?
 

 that might be fine- but his question is fundamentally Perl in nature-
 he may use the information for bio-informatics, but he is looking for
 an answer regarding effective Perl use when dealing with strings-
 that's classic Perl and a classic question for this list :)

 i wish i could answer his question right off the bat, but i can't :/

 awell...



-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
http://learn.perl.org/ http://learn.perl.org/first-response




Re: Counting gaps in sequence data revisited.

2004-10-17 Thread Michael S. Robeson II
I cleaned up the code a little. So, here it is for anyone interested:
#!usr/bin/perl
# By Michael S. Robeson II with the help from the folks at lernperl.org 
and bioperl.org
# 10/16/2004
# Last updated: 10/17/2004
# This script was made for the purpose of searching for indels (gaps) 
in aligned
# DNA or protein sequences that are in FASTA format. It tallys up all 
of the different
# size gaps within each sequence string. While it does this it counts 
the number of
# times each gap of a given size is represented in each sequence and at 
the same time
# reports all of the positions that that particular gap-size or indel 
appears.
# contact: [EMAIL PROTECTED] if you have questions or comments

use warnings;
use strict;
#
# Introduction
#
print \n\t**Welcome to Mike Robeson's Gap-Counting Script!**\n
	A - Just be sure that your sequence alignment file
	is in FASTA format!
	B - Make sure there are no duplicate names within an individual file!
	C - Output file will be based on the name of the input file. It is 
named
	by appending \'indel_list_\' to the name of your input file.\n\n;

###
# Open Sequence Data  OUTFILE
###
print Enter in the name of the DNA sequence file:\n;
chomp (my $dna_seq = STDIN);
open(DNA_SEQ, $dna_seq)
or die Can't open file: $!\n;
open(OUTFILE, indel_list_.$dna_seq)
or die Can't open outfile: $!\n;
#
# Read sequence data into a hash
#
my %sequences;
$/ = '';
print \n***Discovered the following DNA sequences:***\n;
while ( DNA_SEQ ) {
chomp;
next unless s/^\s*(.+)//;
my $name = $1;  
s/\s//g;
$sequences{$name} = $_; 
print $name found!\n;   

print\n;
}
close DNA_SEQ;
##
# Iterate over gaps and write to file
##
foreach (keys %sequences) { 
print \t\t\\ $_ \\\n;   
print OUTFILE \\ $_ \\\n;   

my $dna = $sequences{$_};   
my %gap_data;   
my %position;   
while ($dna =~ /(\-+)/g) {
my $gap_length = length $1;
my $gap_pos = pos ($dna) - $gap_length + 1;
$gap_data{$gap_length}++;
$position{$gap_length} .= $gap_pos. \n;
}

my @indels = keys (%gap_data);  
my @keys = sort { $a = $b} @indels; 
	foreach my $key (@keys) {
		print Indel size:\t$key\tTimes found:\t$gap_data{$key}\n;
		print OUTFILE Indel size:\t$key\tTimes found:\t$gap_data{$key}\n;
		print Positions:\n;
		print OUTFILE Positions:\n;
		print $position{$key};
		print OUTFILE $position{$key};
		print \n;
		print OUTFILE \n;
		}
} 

--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
http://learn.perl.org/ http://learn.perl.org/first-response



Re: counting gaps in sequence data

2004-10-15 Thread Errin Larsen
On Thu, 14 Oct 2004 16:11:42 -0600, Michael Robeson [EMAIL PROTECTED] wrote:
 Yeah, I have just submitted that same question verbatim to the bio-perl
 list. I am still running through some ideas though. I have both
 Bioinformatics perl books. They are not very effective teaching books.
 
 The books spend too much time on using modules. Though while I
 understand the usefulness of not having to re-write code, it is a bad
 idea for beginners like me. Because re-writing code at first gives me a
 lot of practice. Some of the scripts in the books use like 3-5 modules,
 so it gets confusing on what is going on.
 
 I mean the books are not useless, but they definitely are structured
 for a class with a teacher.
 
 :-)
 
 -Mike
 

Hi again, Mike!

I've thrown together the following code.  I have not commented this! 
If you have some questions, just ask.  I hard coded the sequences for
my ease-of-use.  It looked to me like you have figured out how to grab
the sequences out of  a file and throw them in a hash.  This code uses
some deep nested references, and therefore, some crazy dereferences. 
Have fun with it, I know I did!  Things that might look weird:  check
out perldoc -f split for info on using a null-string to split with
(That's were I found it!) and of course perldoc perlref for all the
deep nested references and dereferencing stuff!  I'm currently reading
Learning Perl Objects, References  Modules by Randal Schwartz.  I
highly recommend it.  It helped a lot in this exercise.  Here's the
code:

use warnings;
use strict;

my %sequences = (
'Human' = acgtt---cgatacg---acgact-t,
'Chimp' = acgtt---cgatacg---acgact-t,
'Mouse' = acgata---acgatcgacgt,
);
my %results;

foreach my $species( keys %sequences ) {
my $is_base_pair_gap = 0;
my $base_pair_gap;
my $base_pair_gap_pos;
my $position = 1;
foreach( split( / */, $sequences{$species} )) {
if( /-/ ) {
unless( $is_base_pair_gap ) {
$base_pair_gap_pos = $position;
}
$is_base_pair_gap = 1;
$base_pair_gap .= $_;
} elsif( $is_base_pair_gap ) {
push
@{$results{$species}{length($base_pair_gap)}}, $base_pair_gap_pos;
$is_base_pair_gap = 0;
$base_pair_gap = undef;
}
$position++;
}
}

foreach my $species( keys %results ) {
print $species:\n;
foreach my $base_pair_gap( keys %{$results{$species}} ) {
printNumber of $base_pair_gap base pair gaps:\t,
scalar( @{$results{$species}{$base_pair_gap}}), \n;
print  at position(s) , join( ',',
@{$results{$species}{$base_pair_gap}} ), .\n;
}
print \n;
}




The heart of this code is this line:
push @{$results{$species}{length($base_pair_gap)}}, $base_pair_gap_pos;

there is a %results hash which has keys that are the different
species, and values that point to another hash.  THAT hash (the inner
hash) has keys that are the length of the base-pair-gaps, and values
that point to an array.  The array holds a list of the positions of
those base-pair gaps!  The first base pair gap in the human sequence
is '---' at the 6th character.  That looks like this (warning: pseudo
code for clarity!)
  %results-{'Human'}-{ 3 }-[6]
When we find the second '---' gap, we add it's position to the array:
  %results-{'Human'}-{ 3 }-[6,16]
Then, we find a new base-pair-gap ('-') so we add a new key to inner hash:
  %results-{'Human'}-{ 3 }-[6,16]
   -{ 5 }-[25]
Next, we move on to the next species ...
  %results-{'Human'}-{ 3 }-[6,16]
   -{ 5 }-[25]
   -{'Mouse'}-{ 3 }-[7]

So, finally, with Data::Dumper, we can see the %results hash when the
code is done processing the sequence:

%results = {
  'Human' = {
   '3' = [
6,
16
  ],
   '5' = [
25
  ]
 },
  'Mouse' = {
   '4' = [
17
  ],
   '3' = [
7
  ]
 },
  'Chimp' = {
   '3' = [
6,
16
  ],
   '5' = [
25
  ]
 }
};
 

I hope this is helpful!  This really was a lot of fun.

--Errin

-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: counting gaps in sequence data

2004-10-15 Thread Michael Robeson
Errin,
Thanks so much! I will spend the weekend going over what you've posted. 
Looks like I will learn a lot from this post alone. This stuff is so 
addictive. I can spend hours doing this and not realize it. If I am 
successful or not is another story!  :-)

I'll definitely let you know if I have any trouble.
-Cheers!
-Mike
On Oct 15, 2004, at 7:22 AM, Errin Larsen wrote:
On Thu, 14 Oct 2004 16:11:42 -0600, Michael Robeson [EMAIL PROTECTED] 
wrote:
Yeah, I have just submitted that same question verbatim to the 
bio-perl
list. I am still running through some ideas though. I have both
Bioinformatics perl books. They are not very effective teaching books.

The books spend too much time on using modules. Though while I
understand the usefulness of not having to re-write code, it is a bad
idea for beginners like me. Because re-writing code at first gives me 
a
lot of practice. Some of the scripts in the books use like 3-5 
modules,
so it gets confusing on what is going on.

I mean the books are not useless, but they definitely are structured
for a class with a teacher.
:-)
-Mike
Hi again, Mike!
I've thrown together the following code.  I have not commented this!
If you have some questions, just ask.  I hard coded the sequences for
my ease-of-use.  It looked to me like you have figured out how to grab
the sequences out of  a file and throw them in a hash.  This code uses
some deep nested references, and therefore, some crazy dereferences.
Have fun with it, I know I did!  Things that might look weird:  check
out perldoc -f split for info on using a null-string to split with
(That's were I found it!) and of course perldoc perlref for all the
deep nested references and dereferencing stuff!  I'm currently reading
Learning Perl Objects, References  Modules by Randal Schwartz.  I
highly recommend it.  It helped a lot in this exercise.  Here's the
code:
use warnings;
use strict;
my %sequences = (
'Human' = acgtt---cgatacg---acgact-t,
'Chimp' = acgtt---cgatacg---acgact-t,
'Mouse' = acgata---acgatcgacgt,
);
my %results;
foreach my $species( keys %sequences ) {
my $is_base_pair_gap = 0;
my $base_pair_gap;
my $base_pair_gap_pos;
my $position = 1;
foreach( split( / */, $sequences{$species} )) {
if( /-/ ) {
unless( $is_base_pair_gap ) {
$base_pair_gap_pos = $position;
}
$is_base_pair_gap = 1;
$base_pair_gap .= $_;
} elsif( $is_base_pair_gap ) {
push
@{$results{$species}{length($base_pair_gap)}}, $base_pair_gap_pos;
$is_base_pair_gap = 0;
$base_pair_gap = undef;
}
$position++;
}
}
foreach my $species( keys %results ) {
print $species:\n;
foreach my $base_pair_gap( keys %{$results{$species}} ) {
printNumber of $base_pair_gap base pair gaps:\t,
scalar( @{$results{$species}{$base_pair_gap}}), \n;
print  at position(s) , join( ',',
@{$results{$species}{$base_pair_gap}} ), .\n;
}
print \n;
}

The heart of this code is this line:
push @{$results{$species}{length($base_pair_gap)}}, $base_pair_gap_pos;
there is a %results hash which has keys that are the different
species, and values that point to another hash.  THAT hash (the inner
hash) has keys that are the length of the base-pair-gaps, and values
that point to an array.  The array holds a list of the positions of
those base-pair gaps!  The first base pair gap in the human sequence
is '---' at the 6th character.  That looks like this (warning: pseudo
code for clarity!)
  %results-{'Human'}-{ 3 }-[6]
When we find the second '---' gap, we add it's position to the array:
  %results-{'Human'}-{ 3 }-[6,16]
Then, we find a new base-pair-gap ('-') so we add a new key to 
inner hash:
  %results-{'Human'}-{ 3 }-[6,16]
   -{ 5 }-[25]
Next, we move on to the next species ...
  %results-{'Human'}-{ 3 }-[6,16]
   -{ 5 }-[25]
   -{'Mouse'}-{ 3 }-[7]

So, finally, with Data::Dumper, we can see the %results hash when the
code is done processing the sequence:
%results = {
  'Human' = {
   '3' = [
6,
16
  ],
   '5' = [
25
  ]
 },
  'Mouse' = {
   '4' = [
17
  ],
   '3' = [
7
  ]
 },
  'Chimp' = {
   '3' = [
6,
16
  

Re: counting gaps in sequence data

2004-10-14 Thread Errin Larsen
On Thu, 14 Oct 2004 11:02:06 -0600, Michael Robeson [EMAIL PROTECTED] wrote:
 I have a set of data that looks something like the following:
 
  SNIP
 
 So, any suggestions would be greatly appreciated. If anyone can help me
 out with all or even just bits of this I would greatly appreciate it.
 This should help me get started on some more advanced parsing I need to
 do after this. I like to try and figure things out on my own if I can,
 so even pseudo code would be of great help!
 
 -Thanks
 -Mike
 

Hi Mike,

  This list works best if you show us some code you have tried and
then we can help you troubleshoot it.  It lets us know that you've
already tried to solve your problem and aren't looking for a free
scripting service!  Show us some of that code you talked about and
we'll help you out

--Errin

PS: is this a common problem/exercise in some class somewhere?  I keep
seeing requests for help having to do with those exact strings of DNA
data.  Is there a bunch of people working on DNA projects using Perl
somewhere?  Or, is this some homework?

-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
http://learn.perl.org/ http://learn.perl.org/first-response




Re: counting gaps in sequence data

2004-10-14 Thread Willy West
 PS: is this a common problem/exercise in some class somewhere?  I keep
 seeing requests for help having to do with those exact strings of DNA
 data.  Is there a bunch of people working on DNA projects using Perl
 somewhere?  Or, is this some homework?

bio-informatics is a big area in which Perl is involved...   there's even
a book from O'reilly on the subject...

also, a mailing-list is available...


from http://lists.perl.org/   -

bioperl-announce-l List is for people only interested in announcements
of Bioperl code releases, updates and events.

bioperl-l Unmoderated list for general discussion of bioperl modules,
current  future efforts and related topics.

-

in the latter of the two lists, i counted about 80 messages in the first half
of this month.

hmm... i might join it... :)  might be fun!!

-- 
Willy
http://www.hackswell.com/corenth

-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
http://learn.perl.org/ http://learn.perl.org/first-response




Re: counting gaps in sequence data

2004-10-14 Thread Errin Larsen
On Thu, 14 Oct 2004 15:40:24 -0400, Willy West [EMAIL PROTECTED] wrote:
  PS: is this a common problem/exercise in some class somewhere?  I keep
  seeing requests for help having to do with those exact strings of DNA
  data.  Is there a bunch of people working on DNA projects using Perl
  somewhere?  Or, is this some homework?
 
 bio-informatics is a big area in which Perl is involved...   there's even
 a book from O'reilly on the subject...
 
 also, a mailing-list is available...
 
 from http://lists.perl.org/   -
 
 bioperl-announce-l List is for people only interested in announcements
 of Bioperl code releases, updates and events.
 
 bioperl-l Unmoderated list for general discussion of bioperl modules,
 current  future efforts and related topics.
 
 -
 
 in the latter of the two lists, i counted about 80 messages in the first half
 of this month.
 
 hmm... i might join it... :)  might be fun!!
 
 --
 Willy
 http://www.hackswell.com/corenth
 
 

If what you say is true, then maybe Mike needs to take his questions
to those list?  I mean, if the problem he's describing is common and
the data format he's using is common, I bet it's been solved already.

Hey mike, have you searched on CPAN (search.cpan.org) for this?

-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
http://learn.perl.org/ http://learn.perl.org/first-response




Re: counting gaps in sequence data

2004-10-14 Thread Willy West
On Thu, 14 Oct 2004 14:47:57 -0500, Errin Larsen [EMAIL PROTECTED] wrote:
 

  bio-informatics is a big area in which Perl is involved...   there's even
  a book from O'reilly on the subject...

 
 If what you say is true, then maybe Mike needs to take his questions
 to those list?  I mean, if the problem he's describing is common and
 the data format he's using is common, I bet it's been solved already.
 
 Hey mike, have you searched on CPAN (search.cpan.org) for this?
 

that might be fine- but his question is fundamentally Perl in nature-
he may use the information for bio-informatics, but he is looking for
an answer regarding effective Perl use when dealing with strings-
that's classic Perl and a classic question for this list :)

i wish i could answer his question right off the bat, but i can't :/   

awell...



-- 
Willy
http://www.hackswell.com/corenth

-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
http://learn.perl.org/ http://learn.perl.org/first-response




Re: counting gaps in sequence data

2004-10-14 Thread Errin Larsen
On Thu, 14 Oct 2004 16:08:44 -0400, Willy West [EMAIL PROTECTED] wrote:
 On Thu, 14 Oct 2004 14:47:57 -0500, Errin Larsen [EMAIL PROTECTED] wrote:
 
 
   bio-informatics is a big area in which Perl is involved...   there's even
   a book from O'reilly on the subject...
 
  If what you say is true, then maybe Mike needs to take his questions
  to those list?  I mean, if the problem he's describing is common and
  the data format he's using is common, I bet it's been solved already.
 
  Hey mike, have you searched on CPAN (search.cpan.org) for this?
 
 
 that might be fine- but his question is fundamentally Perl in nature-
 he may use the information for bio-informatics, but he is looking for
 an answer regarding effective Perl use when dealing with strings-
 that's classic Perl and a classic question for this list :)
 
 i wish i could answer his question right off the bat, but i can't :/
 
 awell...
 

very true.  I've been playin' with this problem, it seems very
fun/interesting.  I was thinking this would be alot easier to play
with if the characters in the string were all in an array.  so I
played with the length() function and the substr() function to push
them all into an array in a loop.  something like this (prolly a
prettier/easier way to do this):

use Data::Dumper;

my $human = acgtt---cgatacg---acgact-t;
my @human;

for my $pos( 0 .. (length($human) -1)) {
push @human, substr($human, $pos, 1);
}

print Dumper([EMAIL PROTECTED];

The above produces the following output, which I was thinking might be
easier to work with for Mike's problem:

$VAR1 = [
  'a',
  'c',
  'g',
  't',
  't',
  '-',
  '-',
  '-',
  'c',
  'g',
  'a',
  't',
  'a',
  'c',
  'g',
  '-',
  '-',
  '-',
  'a',
  'c',
  'g',
  'a',
  'c',
  't',
  '-',
  '-',
  '-',
  '-',
  '-',
  't'
];

This is kinda fun!

--Errin

-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
http://learn.perl.org/ http://learn.perl.org/first-response




Re: counting gaps in sequence data

2004-10-14 Thread Paul Johnson
On Thu, Oct 14, 2004 at 11:02:06AM -0600, Michael Robeson wrote:

 I have a set of data that looks something like the following:

 So, my problem is that I think I know some of the bits of code to put 
 into place the problem is I am getting lost on how to structure it all 
 together.

I like to try and figure things out on my own if I can, 
 so even pseudo code would be of great help!

Here's some code that works, incorporating most of your ideas.  It's
only a partial solution, but it think it covers the bit that was causing
you problems.  I'll present it without comment, but I'm sure people will
be able to explain anything you don't understand.

#!/usr/bin/perl -w

use strict;
use Data::Dumper;

while (DATA)
{
print, next unless /-/;
my %gaps;
push @{$gaps{length $1}}, pos() + 1 - length $1 while /(-+)/g;
print Dumper \%gaps;
}

__DATA__
human
acgtt---cgatacg---acgact-t
chimp
acgtacgatac---actgca---ac
mouse
acgata---acgatcgacgt

-- 
Paul Johnson - [EMAIL PROTECTED]
http://www.pjcj.net

-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
http://learn.perl.org/ http://learn.perl.org/first-response




Re: counting gaps in sequence data

2004-10-14 Thread Michael Robeson
Here is as far as I can get with some real code and pseudo code. This 
is just to show you that I am actually trying. :-)

Pseudo - code

# open DNA sequence file

print Enter in the name of the DNA sequence file:\n;
chomp (my $dna_seq = STDIN);
open(DNA_SEQ, $dna_seq)
or die Can't open sequence file for input: $!;

# read sequence data into a hash - not sure if this is how I should do 
it?


my %sequences;
$/ = ''; # set to read in paragraph mode
print \n***Discovered the following DNA sequences:***\n;
while ( DNA_SEQ ) {
chomp;
next unless s/^\s*(.+)//;   
my $name = $1;  
s/\s//g;
$sequences{$name} = $_;
print $name found!\n;
}
close DNA_SEQ;
###
# search for and characterize gaps
###
somehow get data from hash and present it to a loop
%gaptype;
major pseudo code below
foreach /\D(-+)\D/ found in each sequece # searches for gaps flanked by 
sequence
	$position = pos($1);
	$gaplength = $1;
	$gaplength =~ tr/-//g;	# count the number of '-' for that particular
		# gap being processed
	$gaptype{gaplength}++;	# count the number of times each gap type 
appears
	
somehow get information from loop an print as seen below

OUTPUT_
Human
number of 3 base pair gaps: 2
at positions:   6, 16
number of 5 base pair gaps: 1
at positions:   25
.
.. and so on...
.
__DATA__
human
acgtt---cgatacg---acgact-t
chimp
acgtacgatac---actgca---ac
mouse
acgata---acgatcgacgt
--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
http://learn.perl.org/ http://learn.perl.org/first-response



Re: counting gaps in sequence data

2004-10-14 Thread Errin Larsen
On Thu, 14 Oct 2004 23:23:48 +0200, Paul Johnson [EMAIL PROTECTED] wrote:
 On Thu, Oct 14, 2004 at 11:02:06AM -0600, Michael Robeson wrote:
 
  I have a set of data that looks something like the following:
 
  So, my problem is that I think I know some of the bits of code to put
  into place the problem is I am getting lost on how to structure it all
  together.

  Hi Paul,
  I think you missed a critical part of Mike's post!:

 For now I am just trying to get my output to look like this:

Human
number of 3 base pair gaps: 2
  at positions:   6, 16
number of 5 base pair gaps: 1
   at positions:   25

Chimp
 and so on ...

I've put together something that will get the first part done
(counting base pair gaps, I guess is the point!)  Code is as follows:

use warnings;
use strict;

use Data::Dumper;

my %sequences = (
'human' = acgtt---cgatacg---acgact-t,
'chimp' = acgtt---cgatacg---acgact-t,
'mouse' = acgata---acgatcgacgt,
);
my %results;

foreach my $species( keys %sequences ) {
my $base_pair = 0;
my $base_pair_value;
foreach( split( / */, $sequences{$species} )) {
if( /-/ ) {
$base_pair = 1;
$base_pair_value .= $_;
} elsif( $base_pair ) {
$results{$species}{length($base_pair_value)} += 1;
$base_pair = 0;
$base_pair_value = undef;
}
}
}
 
foreach my $species( keys %results ) {
print $species = $sequences{$species}\n;
foreach my $base_pair( keys %{$results{$species}} ) {
printNumber of $base_pair base pair
gaps:\t$results{$species}{$base_pair}\n;
}
}

This will produce the following output:

# dnatest
human = acgtt---cgatacg---acgact-t
   Number of 3 base pair gaps:  2
   Number of 5 base pair gaps:  1
chimp = acgtt---cgatacg---acgact-t
   Number of 3 base pair gaps:  2
   Number of 5 base pair gaps:  1
mouse = acgata---acgatcgacgt
   Number of 4 base pair gaps:  1
   Number of 3 base pair gaps:  1

I put the sequence in the output for easy troubleshooting and
checking.  I'm still working on figuring out the positional data.

This IS fun.  I'll post when I've got it figured out
--Errin

-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
http://learn.perl.org/ http://learn.perl.org/first-response




Re: counting gaps in sequence data

2004-10-14 Thread Paul Johnson
On Thu, Oct 14, 2004 at 04:50:30PM -0500, Errin Larsen wrote:

   Hi Paul,
   I think you missed a critical part of Mike's post!:

No, I didn't miss it.  I just thought it likely that Mike could get to
there from where I left off, so I gave a solution to the bit that seemed
most troublesome.

I'm still working on figuring out the positional data.

That data is included in the code I posted.

 This IS fun.  I'll post when I've got it figured out

I look forward to it.

-- 
Paul Johnson - [EMAIL PROTECTED]
http://www.pjcj.net

-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
http://learn.perl.org/ http://learn.perl.org/first-response




Re: counting gaps in sequence data

2004-10-14 Thread Michael Robeson
Yeah, I have just submitted that same question verbatim to the bio-perl 
list. I am still running through some ideas though. I have both 
Bioinformatics perl books. They are not very effective teaching books.

The books spend too much time on using modules. Though while I 
understand the usefulness of not having to re-write code, it is a bad 
idea for beginners like me. Because re-writing code at first gives me a 
lot of practice. Some of the scripts in the books use like 3-5 modules, 
so it gets confusing on what is going on.

I mean the books are not useless, but they definitely are structured 
for a class with a teacher.

:-)
-Mike
--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
http://learn.perl.org/ http://learn.perl.org/first-response



Re: counting gaps in sequence data

2004-10-14 Thread Philipp Traeder
On Thursday 14 October 2004 21:44, Michael Robeson wrote:
 Here is as far as I can get with some real code and pseudo code. This
 is just to show you that I am actually trying. :-)

Hi Michael,

this looks quite promising - I'll try to sketch a solution without giving 
everything away ;-)


 Pseudo - code

This is really important: Always start your scripts with

use warnings;
use strict;

This will help you avoiding errors.


 
 # open DNA sequence file
 

 print Enter in the name of the DNA sequence file:\n;
 chomp (my $dna_seq = STDIN);
 open(DNA_SEQ, $dna_seq)
   or die Can't open sequence file for input: $!;

 
 # read sequence data into a hash - not sure if this is how I should do
 it?
 

 my %sequences;

 $/ = ''; # set to read in paragraph mode

 print \n***Discovered the following DNA sequences:***\n;

 while ( DNA_SEQ ) {
   chomp;
   next unless s/^\s*(.+)//;
  my $name = $1;
  s/\s//g;
  $sequences{$name} = $_;
  print $name found!\n;
 }
 close DNA_SEQ;

This part is not working completely for me - I get the first sequence only.
I've never got used to working with the $/ operator, therefore I'd rewrite 
this part like this (without wanting to say that this would be easier than 
your approach):

print \n***Discovered the following DNA sequences:***\n;

# each line holds either a caption or a sequence
my $is_sequence_line = 0; # first line holds a caption
my $name;
while ( DNA_SEQ ) {
 chomp;
 
 if ($is_sequence_line) {   
# this line holds a sequence 
$sequences{$name} = $_;
print sequence for $name found!\n;
 }
 else {
# this line holds a name - remember it
print new name : $_\n;
$name = $_; 
 }
 
 $is_sequence_line = ! $is_sequence_line;
}
close DNA_SEQ;

print \n;



 ###
 # search for and characterize gaps
 ###

 somehow get data from hash and present it to a loop

 %gaptype;

 major pseudo code below
 foreach /\D(-+)\D/ found in each sequece # searches for gaps flanked by
 sequence
   $position = pos($1);
   $gaplength = $1;
   $gaplength =~ tr/-//g;  # count the number of '-' for that particular
   # gap being processed
   $gaptype{gaplength}++;  # count the number of times each gap type
 appears

 somehow get information from loop an print as seen below


# first, we need to iterate over the sequences
while (my ($name, $sequence) = each(%sequences)) {
# now we can work on each sequence
print working on sequence $name\n;
print sequence : $sequence\n;
# find all gaps in the sequence
# (the g modifier means to search for all occurences)
while ($sequence =~ /(-+)/g) {
print \n;
# the special variables $`, $ and $' give you the
# text left of the match, the match itself, and the
# text right of the match.
# Using those variables and the length() function,
# you can print something like this:
print found the following gap : ' . $ . '\n;
print the text before the gap : ' . $` . '\n;
print the text after the gap : ' . $' . '\n;
print the text before the gap is  . length ($`) .  chars long.\n;
}
print \n;

}

I hope this might give you an approach for handling this - with the variables 
I printed, you should be able to fill your %gaptype hash like you sketched 
above (you find the length of the gap in length($), and the gap's position 
is the same as the length of the text left of the gap + 1).

 OUTPUT_

 Human
 number of 3 base pair gaps:   2
   at positions:   6, 16
 number of 5 base pair gaps:   1
   at positions:   25
 .
 .. and so on...

You could either (easy variant) print this information while running through 
the while loop or (a bit harder, but might be easier in the long run if you 
need to process the data further) store the amount and position of the gaps 
inside a two-dimensional hash of arrays - something like:

  my %gaps = 
( human = 
  { 3 = [6, 16],
5 = [25]
  },
  chimp =
  ...
)

It's a bit tricky to deal with structures like this, but once you got used to 
them, you must love them ;-)

Hope this helps.

Philipp

-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
http://learn.perl.org/ http://learn.perl.org/first-response