Re: Counting gaps in sequence data revisited.
Michael S. Robeson II [EMAIL PROTECTED] wrote in message ... open(DNA_SEQ, $dna_seq) Due to precedence, the parens are optional here. open(OUTFILE, indel_list_.$dna_seq) ^ indel_list_$dna_seq or indel_list_.$dna_seq foreach (keys %sequences) { print \t\t\\ $_ \\\n; print OUTFILE \\ $_ \\\n; The and do not need to be escaped. $position{$gap_length} .= $gap_pos. \n; $gap_pos \n or $gap_pos\n foreach my $key (@keys) { print Indel size:\t$key\tTimes found:\t$gap_data{$key}\n; print OUTFILE Indel size:\t$key\tTimes found:\t$gap_data{$key}\n; print Positions:\n; print OUTFILE Positions:\n; print $position{$key}; print OUTFILE $position{$key}; print \n; print OUTFILE \n; } As a general rule, whenever you keep doing the same thing in multiple places, think sub {}. See alternative below. -- #!/usr/bin/perl use warnings; use strict; # By pre-declaring print2mfh, you can call it without parentheses! sub print2mfh; my($animal, %gap); # my $data = '/path/to/data.txt'; # open DATA, $data or die Could not open $data: $!\n; while (DATA) { /(\w+)/ and $animal=$1 and next; while ( /(-+)/g ) { push @{ $gap{$animal}{length $1} }, $-[1]+1; } } die No gaps found! unless scalar keys %gap; # if you are on unix or linux, use tee to print to multiple outputs # See Perl Cookbook 7.18 # open TEE, | tee outfile.txt or die Could not tee off!: $!\n; # instead of print2mfh and OUT, you would do: print TEE text to print; open OUT, output.txt or die Could not open output file: $!\n; print ***Discovered the following DNA sequences:***\n; print $_ found!\n for sort keys %gap; foreach my $animal(sort keys %gap) { print2mfh \t\t $animal \n; foreach my $length(sort {$a = $b} keys %{ $gap{$animal} }) { my $pos = [EMAIL PROTECTED] $gap{$animal}{$length} }; print2mfh Indel size:\t$length\tTimes found:\t, scalar @$pos, \n; print2mfh join(\n, Positions:, @$pos, \n); } } sub print2mfh { # print to multiple file handles foreach my $fh(*STDOUT, *OUT) { print $fh @_ } } __DATA__ dog atcg--acgat---act-ca cat acgt-acgtacgt-gt-agct- mouse ---acgtacg-atcg---actgac- -- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] http://learn.perl.org/ http://learn.perl.org/first-response
Re: counting gaps in sequence data
This is a really cool problem. See solution below. Michael, next time post some code please. Thanks, ZO -- #!/usr/bin/perl use warnings; use strict; use Data::Dumper; my(%gap, $animal); while (DATA) { if (/(\w+)/) { $animal = $1; } else { while (/(-+)/) { $gap{$animal}{ length $1 } += s/-+//; } } } print Dumper \%gap; __DATA__ human acgtt---cgatacg---acgact-t chimp acgtacgatac---actgca---ac mouse acgata---acgatcgacgt -- bio-informatics is a big area in which Perl is involved... there's even a book from O'reilly on the subject... If what you say is true, then maybe Mike needs to take his questions to those list? I mean, if the problem he's describing is common and the data format he's using is common, I bet it's been solved already. Hey mike, have you searched on CPAN (search.cpan.org) for this? that might be fine- but his question is fundamentally Perl in nature- he may use the information for bio-informatics, but he is looking for an answer regarding effective Perl use when dealing with strings- that's classic Perl and a classic question for this list :) i wish i could answer his question right off the bat, but i can't :/ awell... -- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] http://learn.perl.org/ http://learn.perl.org/first-response
Re: Counting gaps in sequence data revisited.
I cleaned up the code a little. So, here it is for anyone interested: #!usr/bin/perl # By Michael S. Robeson II with the help from the folks at lernperl.org and bioperl.org # 10/16/2004 # Last updated: 10/17/2004 # This script was made for the purpose of searching for indels (gaps) in aligned # DNA or protein sequences that are in FASTA format. It tallys up all of the different # size gaps within each sequence string. While it does this it counts the number of # times each gap of a given size is represented in each sequence and at the same time # reports all of the positions that that particular gap-size or indel appears. # contact: [EMAIL PROTECTED] if you have questions or comments use warnings; use strict; # # Introduction # print \n\t**Welcome to Mike Robeson's Gap-Counting Script!**\n A - Just be sure that your sequence alignment file is in FASTA format! B - Make sure there are no duplicate names within an individual file! C - Output file will be based on the name of the input file. It is named by appending \'indel_list_\' to the name of your input file.\n\n; ### # Open Sequence Data OUTFILE ### print Enter in the name of the DNA sequence file:\n; chomp (my $dna_seq = STDIN); open(DNA_SEQ, $dna_seq) or die Can't open file: $!\n; open(OUTFILE, indel_list_.$dna_seq) or die Can't open outfile: $!\n; # # Read sequence data into a hash # my %sequences; $/ = ''; print \n***Discovered the following DNA sequences:***\n; while ( DNA_SEQ ) { chomp; next unless s/^\s*(.+)//; my $name = $1; s/\s//g; $sequences{$name} = $_; print $name found!\n; print\n; } close DNA_SEQ; ## # Iterate over gaps and write to file ## foreach (keys %sequences) { print \t\t\\ $_ \\\n; print OUTFILE \\ $_ \\\n; my $dna = $sequences{$_}; my %gap_data; my %position; while ($dna =~ /(\-+)/g) { my $gap_length = length $1; my $gap_pos = pos ($dna) - $gap_length + 1; $gap_data{$gap_length}++; $position{$gap_length} .= $gap_pos. \n; } my @indels = keys (%gap_data); my @keys = sort { $a = $b} @indels; foreach my $key (@keys) { print Indel size:\t$key\tTimes found:\t$gap_data{$key}\n; print OUTFILE Indel size:\t$key\tTimes found:\t$gap_data{$key}\n; print Positions:\n; print OUTFILE Positions:\n; print $position{$key}; print OUTFILE $position{$key}; print \n; print OUTFILE \n; } } -- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] http://learn.perl.org/ http://learn.perl.org/first-response
Re: counting gaps in sequence data
On Thu, 14 Oct 2004 16:11:42 -0600, Michael Robeson [EMAIL PROTECTED] wrote: Yeah, I have just submitted that same question verbatim to the bio-perl list. I am still running through some ideas though. I have both Bioinformatics perl books. They are not very effective teaching books. The books spend too much time on using modules. Though while I understand the usefulness of not having to re-write code, it is a bad idea for beginners like me. Because re-writing code at first gives me a lot of practice. Some of the scripts in the books use like 3-5 modules, so it gets confusing on what is going on. I mean the books are not useless, but they definitely are structured for a class with a teacher. :-) -Mike Hi again, Mike! I've thrown together the following code. I have not commented this! If you have some questions, just ask. I hard coded the sequences for my ease-of-use. It looked to me like you have figured out how to grab the sequences out of a file and throw them in a hash. This code uses some deep nested references, and therefore, some crazy dereferences. Have fun with it, I know I did! Things that might look weird: check out perldoc -f split for info on using a null-string to split with (That's were I found it!) and of course perldoc perlref for all the deep nested references and dereferencing stuff! I'm currently reading Learning Perl Objects, References Modules by Randal Schwartz. I highly recommend it. It helped a lot in this exercise. Here's the code: use warnings; use strict; my %sequences = ( 'Human' = acgtt---cgatacg---acgact-t, 'Chimp' = acgtt---cgatacg---acgact-t, 'Mouse' = acgata---acgatcgacgt, ); my %results; foreach my $species( keys %sequences ) { my $is_base_pair_gap = 0; my $base_pair_gap; my $base_pair_gap_pos; my $position = 1; foreach( split( / */, $sequences{$species} )) { if( /-/ ) { unless( $is_base_pair_gap ) { $base_pair_gap_pos = $position; } $is_base_pair_gap = 1; $base_pair_gap .= $_; } elsif( $is_base_pair_gap ) { push @{$results{$species}{length($base_pair_gap)}}, $base_pair_gap_pos; $is_base_pair_gap = 0; $base_pair_gap = undef; } $position++; } } foreach my $species( keys %results ) { print $species:\n; foreach my $base_pair_gap( keys %{$results{$species}} ) { printNumber of $base_pair_gap base pair gaps:\t, scalar( @{$results{$species}{$base_pair_gap}}), \n; print at position(s) , join( ',', @{$results{$species}{$base_pair_gap}} ), .\n; } print \n; } The heart of this code is this line: push @{$results{$species}{length($base_pair_gap)}}, $base_pair_gap_pos; there is a %results hash which has keys that are the different species, and values that point to another hash. THAT hash (the inner hash) has keys that are the length of the base-pair-gaps, and values that point to an array. The array holds a list of the positions of those base-pair gaps! The first base pair gap in the human sequence is '---' at the 6th character. That looks like this (warning: pseudo code for clarity!) %results-{'Human'}-{ 3 }-[6] When we find the second '---' gap, we add it's position to the array: %results-{'Human'}-{ 3 }-[6,16] Then, we find a new base-pair-gap ('-') so we add a new key to inner hash: %results-{'Human'}-{ 3 }-[6,16] -{ 5 }-[25] Next, we move on to the next species ... %results-{'Human'}-{ 3 }-[6,16] -{ 5 }-[25] -{'Mouse'}-{ 3 }-[7] So, finally, with Data::Dumper, we can see the %results hash when the code is done processing the sequence: %results = { 'Human' = { '3' = [ 6, 16 ], '5' = [ 25 ] }, 'Mouse' = { '4' = [ 17 ], '3' = [ 7 ] }, 'Chimp' = { '3' = [ 6, 16 ], '5' = [ 25 ] } }; I hope this is helpful! This really was a lot of fun. --Errin -- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: counting gaps in sequence data
Errin, Thanks so much! I will spend the weekend going over what you've posted. Looks like I will learn a lot from this post alone. This stuff is so addictive. I can spend hours doing this and not realize it. If I am successful or not is another story! :-) I'll definitely let you know if I have any trouble. -Cheers! -Mike On Oct 15, 2004, at 7:22 AM, Errin Larsen wrote: On Thu, 14 Oct 2004 16:11:42 -0600, Michael Robeson [EMAIL PROTECTED] wrote: Yeah, I have just submitted that same question verbatim to the bio-perl list. I am still running through some ideas though. I have both Bioinformatics perl books. They are not very effective teaching books. The books spend too much time on using modules. Though while I understand the usefulness of not having to re-write code, it is a bad idea for beginners like me. Because re-writing code at first gives me a lot of practice. Some of the scripts in the books use like 3-5 modules, so it gets confusing on what is going on. I mean the books are not useless, but they definitely are structured for a class with a teacher. :-) -Mike Hi again, Mike! I've thrown together the following code. I have not commented this! If you have some questions, just ask. I hard coded the sequences for my ease-of-use. It looked to me like you have figured out how to grab the sequences out of a file and throw them in a hash. This code uses some deep nested references, and therefore, some crazy dereferences. Have fun with it, I know I did! Things that might look weird: check out perldoc -f split for info on using a null-string to split with (That's were I found it!) and of course perldoc perlref for all the deep nested references and dereferencing stuff! I'm currently reading Learning Perl Objects, References Modules by Randal Schwartz. I highly recommend it. It helped a lot in this exercise. Here's the code: use warnings; use strict; my %sequences = ( 'Human' = acgtt---cgatacg---acgact-t, 'Chimp' = acgtt---cgatacg---acgact-t, 'Mouse' = acgata---acgatcgacgt, ); my %results; foreach my $species( keys %sequences ) { my $is_base_pair_gap = 0; my $base_pair_gap; my $base_pair_gap_pos; my $position = 1; foreach( split( / */, $sequences{$species} )) { if( /-/ ) { unless( $is_base_pair_gap ) { $base_pair_gap_pos = $position; } $is_base_pair_gap = 1; $base_pair_gap .= $_; } elsif( $is_base_pair_gap ) { push @{$results{$species}{length($base_pair_gap)}}, $base_pair_gap_pos; $is_base_pair_gap = 0; $base_pair_gap = undef; } $position++; } } foreach my $species( keys %results ) { print $species:\n; foreach my $base_pair_gap( keys %{$results{$species}} ) { printNumber of $base_pair_gap base pair gaps:\t, scalar( @{$results{$species}{$base_pair_gap}}), \n; print at position(s) , join( ',', @{$results{$species}{$base_pair_gap}} ), .\n; } print \n; } The heart of this code is this line: push @{$results{$species}{length($base_pair_gap)}}, $base_pair_gap_pos; there is a %results hash which has keys that are the different species, and values that point to another hash. THAT hash (the inner hash) has keys that are the length of the base-pair-gaps, and values that point to an array. The array holds a list of the positions of those base-pair gaps! The first base pair gap in the human sequence is '---' at the 6th character. That looks like this (warning: pseudo code for clarity!) %results-{'Human'}-{ 3 }-[6] When we find the second '---' gap, we add it's position to the array: %results-{'Human'}-{ 3 }-[6,16] Then, we find a new base-pair-gap ('-') so we add a new key to inner hash: %results-{'Human'}-{ 3 }-[6,16] -{ 5 }-[25] Next, we move on to the next species ... %results-{'Human'}-{ 3 }-[6,16] -{ 5 }-[25] -{'Mouse'}-{ 3 }-[7] So, finally, with Data::Dumper, we can see the %results hash when the code is done processing the sequence: %results = { 'Human' = { '3' = [ 6, 16 ], '5' = [ 25 ] }, 'Mouse' = { '4' = [ 17 ], '3' = [ 7 ] }, 'Chimp' = { '3' = [ 6, 16
Re: counting gaps in sequence data
On Thu, 14 Oct 2004 11:02:06 -0600, Michael Robeson [EMAIL PROTECTED] wrote: I have a set of data that looks something like the following: SNIP So, any suggestions would be greatly appreciated. If anyone can help me out with all or even just bits of this I would greatly appreciate it. This should help me get started on some more advanced parsing I need to do after this. I like to try and figure things out on my own if I can, so even pseudo code would be of great help! -Thanks -Mike Hi Mike, This list works best if you show us some code you have tried and then we can help you troubleshoot it. It lets us know that you've already tried to solve your problem and aren't looking for a free scripting service! Show us some of that code you talked about and we'll help you out --Errin PS: is this a common problem/exercise in some class somewhere? I keep seeing requests for help having to do with those exact strings of DNA data. Is there a bunch of people working on DNA projects using Perl somewhere? Or, is this some homework? -- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] http://learn.perl.org/ http://learn.perl.org/first-response
Re: counting gaps in sequence data
PS: is this a common problem/exercise in some class somewhere? I keep seeing requests for help having to do with those exact strings of DNA data. Is there a bunch of people working on DNA projects using Perl somewhere? Or, is this some homework? bio-informatics is a big area in which Perl is involved... there's even a book from O'reilly on the subject... also, a mailing-list is available... from http://lists.perl.org/ - bioperl-announce-l List is for people only interested in announcements of Bioperl code releases, updates and events. bioperl-l Unmoderated list for general discussion of bioperl modules, current future efforts and related topics. - in the latter of the two lists, i counted about 80 messages in the first half of this month. hmm... i might join it... :) might be fun!! -- Willy http://www.hackswell.com/corenth -- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] http://learn.perl.org/ http://learn.perl.org/first-response
Re: counting gaps in sequence data
On Thu, 14 Oct 2004 15:40:24 -0400, Willy West [EMAIL PROTECTED] wrote: PS: is this a common problem/exercise in some class somewhere? I keep seeing requests for help having to do with those exact strings of DNA data. Is there a bunch of people working on DNA projects using Perl somewhere? Or, is this some homework? bio-informatics is a big area in which Perl is involved... there's even a book from O'reilly on the subject... also, a mailing-list is available... from http://lists.perl.org/ - bioperl-announce-l List is for people only interested in announcements of Bioperl code releases, updates and events. bioperl-l Unmoderated list for general discussion of bioperl modules, current future efforts and related topics. - in the latter of the two lists, i counted about 80 messages in the first half of this month. hmm... i might join it... :) might be fun!! -- Willy http://www.hackswell.com/corenth If what you say is true, then maybe Mike needs to take his questions to those list? I mean, if the problem he's describing is common and the data format he's using is common, I bet it's been solved already. Hey mike, have you searched on CPAN (search.cpan.org) for this? -- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] http://learn.perl.org/ http://learn.perl.org/first-response
Re: counting gaps in sequence data
On Thu, 14 Oct 2004 14:47:57 -0500, Errin Larsen [EMAIL PROTECTED] wrote: bio-informatics is a big area in which Perl is involved... there's even a book from O'reilly on the subject... If what you say is true, then maybe Mike needs to take his questions to those list? I mean, if the problem he's describing is common and the data format he's using is common, I bet it's been solved already. Hey mike, have you searched on CPAN (search.cpan.org) for this? that might be fine- but his question is fundamentally Perl in nature- he may use the information for bio-informatics, but he is looking for an answer regarding effective Perl use when dealing with strings- that's classic Perl and a classic question for this list :) i wish i could answer his question right off the bat, but i can't :/ awell... -- Willy http://www.hackswell.com/corenth -- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] http://learn.perl.org/ http://learn.perl.org/first-response
Re: counting gaps in sequence data
On Thu, 14 Oct 2004 16:08:44 -0400, Willy West [EMAIL PROTECTED] wrote: On Thu, 14 Oct 2004 14:47:57 -0500, Errin Larsen [EMAIL PROTECTED] wrote: bio-informatics is a big area in which Perl is involved... there's even a book from O'reilly on the subject... If what you say is true, then maybe Mike needs to take his questions to those list? I mean, if the problem he's describing is common and the data format he's using is common, I bet it's been solved already. Hey mike, have you searched on CPAN (search.cpan.org) for this? that might be fine- but his question is fundamentally Perl in nature- he may use the information for bio-informatics, but he is looking for an answer regarding effective Perl use when dealing with strings- that's classic Perl and a classic question for this list :) i wish i could answer his question right off the bat, but i can't :/ awell... very true. I've been playin' with this problem, it seems very fun/interesting. I was thinking this would be alot easier to play with if the characters in the string were all in an array. so I played with the length() function and the substr() function to push them all into an array in a loop. something like this (prolly a prettier/easier way to do this): use Data::Dumper; my $human = acgtt---cgatacg---acgact-t; my @human; for my $pos( 0 .. (length($human) -1)) { push @human, substr($human, $pos, 1); } print Dumper([EMAIL PROTECTED]; The above produces the following output, which I was thinking might be easier to work with for Mike's problem: $VAR1 = [ 'a', 'c', 'g', 't', 't', '-', '-', '-', 'c', 'g', 'a', 't', 'a', 'c', 'g', '-', '-', '-', 'a', 'c', 'g', 'a', 'c', 't', '-', '-', '-', '-', '-', 't' ]; This is kinda fun! --Errin -- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] http://learn.perl.org/ http://learn.perl.org/first-response
Re: counting gaps in sequence data
On Thu, Oct 14, 2004 at 11:02:06AM -0600, Michael Robeson wrote: I have a set of data that looks something like the following: So, my problem is that I think I know some of the bits of code to put into place the problem is I am getting lost on how to structure it all together. I like to try and figure things out on my own if I can, so even pseudo code would be of great help! Here's some code that works, incorporating most of your ideas. It's only a partial solution, but it think it covers the bit that was causing you problems. I'll present it without comment, but I'm sure people will be able to explain anything you don't understand. #!/usr/bin/perl -w use strict; use Data::Dumper; while (DATA) { print, next unless /-/; my %gaps; push @{$gaps{length $1}}, pos() + 1 - length $1 while /(-+)/g; print Dumper \%gaps; } __DATA__ human acgtt---cgatacg---acgact-t chimp acgtacgatac---actgca---ac mouse acgata---acgatcgacgt -- Paul Johnson - [EMAIL PROTECTED] http://www.pjcj.net -- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] http://learn.perl.org/ http://learn.perl.org/first-response
Re: counting gaps in sequence data
Here is as far as I can get with some real code and pseudo code. This is just to show you that I am actually trying. :-) Pseudo - code # open DNA sequence file print Enter in the name of the DNA sequence file:\n; chomp (my $dna_seq = STDIN); open(DNA_SEQ, $dna_seq) or die Can't open sequence file for input: $!; # read sequence data into a hash - not sure if this is how I should do it? my %sequences; $/ = ''; # set to read in paragraph mode print \n***Discovered the following DNA sequences:***\n; while ( DNA_SEQ ) { chomp; next unless s/^\s*(.+)//; my $name = $1; s/\s//g; $sequences{$name} = $_; print $name found!\n; } close DNA_SEQ; ### # search for and characterize gaps ### somehow get data from hash and present it to a loop %gaptype; major pseudo code below foreach /\D(-+)\D/ found in each sequece # searches for gaps flanked by sequence $position = pos($1); $gaplength = $1; $gaplength =~ tr/-//g; # count the number of '-' for that particular # gap being processed $gaptype{gaplength}++; # count the number of times each gap type appears somehow get information from loop an print as seen below OUTPUT_ Human number of 3 base pair gaps: 2 at positions: 6, 16 number of 5 base pair gaps: 1 at positions: 25 . .. and so on... . __DATA__ human acgtt---cgatacg---acgact-t chimp acgtacgatac---actgca---ac mouse acgata---acgatcgacgt -- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] http://learn.perl.org/ http://learn.perl.org/first-response
Re: counting gaps in sequence data
On Thu, 14 Oct 2004 23:23:48 +0200, Paul Johnson [EMAIL PROTECTED] wrote: On Thu, Oct 14, 2004 at 11:02:06AM -0600, Michael Robeson wrote: I have a set of data that looks something like the following: So, my problem is that I think I know some of the bits of code to put into place the problem is I am getting lost on how to structure it all together. Hi Paul, I think you missed a critical part of Mike's post!: For now I am just trying to get my output to look like this: Human number of 3 base pair gaps: 2 at positions: 6, 16 number of 5 base pair gaps: 1 at positions: 25 Chimp and so on ... I've put together something that will get the first part done (counting base pair gaps, I guess is the point!) Code is as follows: use warnings; use strict; use Data::Dumper; my %sequences = ( 'human' = acgtt---cgatacg---acgact-t, 'chimp' = acgtt---cgatacg---acgact-t, 'mouse' = acgata---acgatcgacgt, ); my %results; foreach my $species( keys %sequences ) { my $base_pair = 0; my $base_pair_value; foreach( split( / */, $sequences{$species} )) { if( /-/ ) { $base_pair = 1; $base_pair_value .= $_; } elsif( $base_pair ) { $results{$species}{length($base_pair_value)} += 1; $base_pair = 0; $base_pair_value = undef; } } } foreach my $species( keys %results ) { print $species = $sequences{$species}\n; foreach my $base_pair( keys %{$results{$species}} ) { printNumber of $base_pair base pair gaps:\t$results{$species}{$base_pair}\n; } } This will produce the following output: # dnatest human = acgtt---cgatacg---acgact-t Number of 3 base pair gaps: 2 Number of 5 base pair gaps: 1 chimp = acgtt---cgatacg---acgact-t Number of 3 base pair gaps: 2 Number of 5 base pair gaps: 1 mouse = acgata---acgatcgacgt Number of 4 base pair gaps: 1 Number of 3 base pair gaps: 1 I put the sequence in the output for easy troubleshooting and checking. I'm still working on figuring out the positional data. This IS fun. I'll post when I've got it figured out --Errin -- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] http://learn.perl.org/ http://learn.perl.org/first-response
Re: counting gaps in sequence data
On Thu, Oct 14, 2004 at 04:50:30PM -0500, Errin Larsen wrote: Hi Paul, I think you missed a critical part of Mike's post!: No, I didn't miss it. I just thought it likely that Mike could get to there from where I left off, so I gave a solution to the bit that seemed most troublesome. I'm still working on figuring out the positional data. That data is included in the code I posted. This IS fun. I'll post when I've got it figured out I look forward to it. -- Paul Johnson - [EMAIL PROTECTED] http://www.pjcj.net -- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] http://learn.perl.org/ http://learn.perl.org/first-response
Re: counting gaps in sequence data
Yeah, I have just submitted that same question verbatim to the bio-perl list. I am still running through some ideas though. I have both Bioinformatics perl books. They are not very effective teaching books. The books spend too much time on using modules. Though while I understand the usefulness of not having to re-write code, it is a bad idea for beginners like me. Because re-writing code at first gives me a lot of practice. Some of the scripts in the books use like 3-5 modules, so it gets confusing on what is going on. I mean the books are not useless, but they definitely are structured for a class with a teacher. :-) -Mike -- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] http://learn.perl.org/ http://learn.perl.org/first-response
Re: counting gaps in sequence data
On Thursday 14 October 2004 21:44, Michael Robeson wrote: Here is as far as I can get with some real code and pseudo code. This is just to show you that I am actually trying. :-) Hi Michael, this looks quite promising - I'll try to sketch a solution without giving everything away ;-) Pseudo - code This is really important: Always start your scripts with use warnings; use strict; This will help you avoiding errors. # open DNA sequence file print Enter in the name of the DNA sequence file:\n; chomp (my $dna_seq = STDIN); open(DNA_SEQ, $dna_seq) or die Can't open sequence file for input: $!; # read sequence data into a hash - not sure if this is how I should do it? my %sequences; $/ = ''; # set to read in paragraph mode print \n***Discovered the following DNA sequences:***\n; while ( DNA_SEQ ) { chomp; next unless s/^\s*(.+)//; my $name = $1; s/\s//g; $sequences{$name} = $_; print $name found!\n; } close DNA_SEQ; This part is not working completely for me - I get the first sequence only. I've never got used to working with the $/ operator, therefore I'd rewrite this part like this (without wanting to say that this would be easier than your approach): print \n***Discovered the following DNA sequences:***\n; # each line holds either a caption or a sequence my $is_sequence_line = 0; # first line holds a caption my $name; while ( DNA_SEQ ) { chomp; if ($is_sequence_line) { # this line holds a sequence $sequences{$name} = $_; print sequence for $name found!\n; } else { # this line holds a name - remember it print new name : $_\n; $name = $_; } $is_sequence_line = ! $is_sequence_line; } close DNA_SEQ; print \n; ### # search for and characterize gaps ### somehow get data from hash and present it to a loop %gaptype; major pseudo code below foreach /\D(-+)\D/ found in each sequece # searches for gaps flanked by sequence $position = pos($1); $gaplength = $1; $gaplength =~ tr/-//g; # count the number of '-' for that particular # gap being processed $gaptype{gaplength}++; # count the number of times each gap type appears somehow get information from loop an print as seen below # first, we need to iterate over the sequences while (my ($name, $sequence) = each(%sequences)) { # now we can work on each sequence print working on sequence $name\n; print sequence : $sequence\n; # find all gaps in the sequence # (the g modifier means to search for all occurences) while ($sequence =~ /(-+)/g) { print \n; # the special variables $`, $ and $' give you the # text left of the match, the match itself, and the # text right of the match. # Using those variables and the length() function, # you can print something like this: print found the following gap : ' . $ . '\n; print the text before the gap : ' . $` . '\n; print the text after the gap : ' . $' . '\n; print the text before the gap is . length ($`) . chars long.\n; } print \n; } I hope this might give you an approach for handling this - with the variables I printed, you should be able to fill your %gaptype hash like you sketched above (you find the length of the gap in length($), and the gap's position is the same as the length of the text left of the gap + 1). OUTPUT_ Human number of 3 base pair gaps: 2 at positions: 6, 16 number of 5 base pair gaps: 1 at positions: 25 . .. and so on... You could either (easy variant) print this information while running through the while loop or (a bit harder, but might be easier in the long run if you need to process the data further) store the amount and position of the gaps inside a two-dimensional hash of arrays - something like: my %gaps = ( human = { 3 = [6, 16], 5 = [25] }, chimp = ... ) It's a bit tricky to deal with structures like this, but once you got used to them, you must love them ;-) Hope this helps. Philipp -- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] http://learn.perl.org/ http://learn.perl.org/first-response