Re: counting gaps in sequence data

Philipp Traeder Thu, 14 Oct 2004 15:57:18 -0700

On Thursday 14 October 2004 21:44, Michael Robeson wrote:
> Here is as far as I can get with some real code and pseudo code. This
> is just to show you that I am actually trying. :-)


Hi Michael,

this looks quite promising - I'll try to sketch a solution without giving 
everything away ;-)

>
> Pseudo - code

This is really important: Always start your scripts with

use warnings;
use strict;

This will help you avoiding errors.

>
> ################################
> # open DNA sequence file
> ################################
>
> print "Enter in the name of the DNA sequence file:\n";
> chomp (my $dna_seq = <STDIN>);
> open(DNA_SEQ, $dna_seq)
>       or die "Can't open sequence file for input: $!";
>
> ########################################################
> # read sequence data into a hash - not sure if this is how I should do
> it?
> ########################################################
>
> my %sequences;
>
> $/ = '>'; # set to read in paragraph mode
>
> print "\n***Discovered the following DNA sequences:***\n";
>
> while ( <DNA_SEQ> ) {
>       chomp;
>       next unless s/^\s*(.+)//;
>      my $name = $1;
>      s/\s//g;
>      $sequences{$name} = $_;
>      print "$name found!\n";
> }
> close DNA_SEQ;

This part is not working completely for me - I get the first sequence only.
I've never got used to working with the $/ operator, therefore I'd rewrite 
this part like this (without wanting to say that this would be easier than 
your approach):

print "\n***Discovered the following DNA sequences:***\n";

# each line holds either a caption or a sequence
my $is_sequence_line = 0; # first line holds a caption
my $name;
while ( <DNA_SEQ> ) {
     chomp;    
     
     if ($is_sequence_line) {   
        # this line holds a sequence         
        $sequences{$name} = $_;
        print "sequence for $name found!\n";
     }
     else {
        # this line holds a name - remember it
        print "new name : $_\n";
        $name = $_;     
     }
     
     $is_sequence_line = ! $is_sequence_line;
}
close DNA_SEQ;

print "\n";

>
>
> ###################################
> # search for and characterize gaps
> ###################################
>
> <<somehow get data from hash and present it to a loop>>
>
> %gaptype;
>
> <<major pseudo code below>>
> foreach /\D(-+)\D/ found in each sequece # searches for gaps flanked by
> sequence
>       $position = pos($1);
>       $gaplength = $1;
>       $gaplength =~ tr/-//g;  # count the number of '-' for that particular
>                                               # gap being processed
>       $gaptype{gaplength}++;  # count the number of times each gap type
> appears
>
> <<somehow get information from loop an print as seen below>>
>

# first, we need to iterate over the sequences
while (my ($name, $sequence) = each(%sequences)) {
        # now we can work on each sequence
        print "working on sequence $name\n";
        print "sequence : $sequence\n";
        # find all gaps in the sequence
        # (the g modifier means to search for all occurences)
        while ($sequence =~ /(-+)/g) {
            print "\n";
            # the special variables $`, $& and $' give you the
            # text left of the match, the match itself, and the
            # text right of the match.
            # Using those variables and the length() function,
            # you can print something like this:
            print "found the following gap : '" . $& . "'\n";
            print "the text before the gap : '" . $` . "'\n";
            print "the text after the gap : '" . $' . "'\n";
            print "the text before the gap is " . length ($`) . " chars long.\n";
        }
        print "\n";
        
}

I hope this might give you an approach for handling this - with the variables 
I printed, you should be able to fill your %gaptype hash like you sketched 
above (you find the length of the gap in length($&), and the gap's position 
is the same as the "length of the text left of the gap" + 1).

> ____OUTPUT_____
>
> Human
> number of 3 base pair gaps:           2
>                       at positions:           6, 16
> number of 5 base pair gaps:           1
>                       at positions:           25
> .
> .. and so on...

You could either (easy variant) print this information while running through 
the while loop or (a bit harder, but might be easier in the long run if you 
need to process the data further) store the amount and position of the gaps 
inside a two-dimensional hash of arrays - something like:

  my %gaps = 
    ( human => 
      { 3 => [6, 16],
        5 => [25]
      },
      chimp =>
      ...
    )

It's a bit tricky to deal with structures like this, but once you got used to 
them, you must love them ;-)

Hope this helps.

Philipp

-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
<http://learn.perl.org/> <http://learn.perl.org/first-response>

Re: counting gaps in sequence data

Reply via email to