subject:"Counting gaps in sequence data revisited."

Re: Counting gaps in sequence data revisited.

2004-10-26 Thread Zeus Odin


"Michael S. Robeson II" <[EMAIL PROTECTED]> wrote in message ...

> open(DNA_SEQ, $dna_seq)

Due to precedence, the parens are optional here.

> open(OUTFILE, ">indel_list_"."$dna_seq")
^
">indel_list_$dna_seq"  or   ">indel_list_".$dna_seq

> foreach (keys %sequences) {
> print "\t\t\>\>\>\>\>\> $_ \<\<\<\<\<\<\n";
> print OUTFILE "\>\>\>\>\>\> $_ \<\<\<\<\<\<\n";

The > and < do not need to be escaped.

> $position{$gap_length} .= "$gap_pos"." \n";

"$gap_pos \n" or "$gap_pos\n"

> foreach my $key (@keys) {
> print "Indel size:\t$key\tTimes found:\t$gap_data{$key}\n";
> print OUTFILE "Indel size:\t$key\tTimes found:\t$gap_data{$key}\n";
> print "Positions:\n";
> print OUTFILE "Positions:\n";
> print "$position{$key}";
> print OUTFILE "$position{$key}";
> print "\n";
> print OUTFILE "\n";
> }

As a general rule, whenever you keep doing the same thing in multiple
places, think sub {}. See alternative below.

--

#!/usr/bin/perl
use warnings;
use strict;

# By pre-declaring print2mfh, you can call it without parentheses!
sub print2mfh;
my($animal, %gap);

# my $data = '/path/to/data.txt';
# open DATA, $data or die "Could not open $data: $!\n";
while () {
  />(\w+)/ and $animal=$1 and next;
  while ( /(-+)/g ) {
push @{ $gap{$animal}{length $1} }, $-[1]+1;
  }
}

die "No gaps found!" unless scalar keys %gap;
# if you are on unix or linux, use tee to print to multiple outputs
# See Perl Cookbook 7.18
# open TEE, "| tee outfile.txt" or die "Could not tee off!: $!\n";
# instead of print2mfh and OUT, you would do: print TEE "text to print";

open OUT, ">output.txt" or die "Could not open output file: $!\n";
print "***Discovered the following DNA sequences:***\n";
print "$_ found!\n" for sort keys %gap;

foreach my $animal(sort keys %gap) {
  print2mfh "\t\t>> $animal <<\n";
  foreach my $length(sort {$a <=> $b} keys %{ $gap{$animal} }) {
my $pos = [EMAIL PROTECTED] $gap{$animal}{$length} };
print2mfh "Indel size:\t$length\tTimes found:\t", scalar @$pos, "\n";
print2mfh join("\n", "Positions:", @$pos, "\n");
  }
}


sub print2mfh {
  # print to multiple file handles
  foreach my $fh(*STDOUT, *OUT) { print $fh @_ }
}

__DATA__
 >dog
atcg--acgat---act-ca
 >cat
acgt-acgtacgt-gt-agct-
 >mouse
---acgtacg-atcg---actgac-



-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Counting gaps in sequence data revisited.

2004-10-17 Thread Michael S. Robeson II

I cleaned up the code a little. So, here it is for anyone interested:
#!usr/bin/perl
# By Michael S. Robeson II with the help from the folks at lernperl.org 
and bioperl.org
# 10/16/2004
# Last updated: 10/17/2004
# This script was made for the purpose of searching for indels (gaps) 
in aligned
# DNA or protein sequences that are in FASTA format. It tallys up all 
of the different
# size gaps within each sequence string. While it does this it counts 
the number of
# times each gap of a given size is represented in each sequence and at 
the same time
# reports all of the positions that that particular "gap-size" or indel 
appears.
# contact: [EMAIL PROTECTED] if you have questions or comments

use warnings;
use strict;
#
# Introduction
#
print "\n\t**Welcome to Mike Robeson's Gap-Counting Script!**\n
	A - Just be sure that your sequence alignment file
	is in FASTA format!
	B - Make sure there are no duplicate names within an individual file!
	C - Output file will be based on the name of the input file. It is 
named
	by appending \'indel_list_\' to the name of your input file.\n\n";

###
# Open Sequence Data & OUTFILE
###
print "Enter in the name of the DNA sequence file:\n";
chomp (my $dna_seq = );
open(DNA_SEQ, $dna_seq)
or die "Can't open file: $!\n";
open(OUTFILE, ">indel_list_"."$dna_seq")
or die "Can't open outfile: $!\n";
#
# Read sequence data into a hash
#
my %sequences;
$/ = '>';
print "\n***Discovered the following DNA sequences:***\n";
while (  ) {
chomp;
next unless s/^\s*(.+)//;
my $name = $1;  
s/\s//g;
$sequences{$name} = $_; 
print "$name found!\n";   

print"\n";
}
close DNA_SEQ;
##
# Iterate over gaps and write to file
##
foreach (keys %sequences) { 
print "\t\t\>\>\>\>\>\> $_ \<\<\<\<\<\<\n";   
print OUTFILE "\>\>\>\>\>\> $_ \<\<\<\<\<\<\n";   

my $dna = $sequences{$_};   
my %gap_data;   
my %position;   
while ($dna =~ /(\-+)/g) {
my $gap_length = length $1;
my $gap_pos = pos ($dna) - $gap_length + 1;
$gap_data{$gap_length}++;
$position{$gap_length} .= "$gap_pos"." \n";
}

my @indels = keys (%gap_data);  
my @keys = sort { $a <=> $b} @indels; 
	foreach my $key (@keys) {
		print "Indel size:\t$key\tTimes found:\t$gap_data{$key}\n";
		print OUTFILE "Indel size:\t$key\tTimes found:\t$gap_data{$key}\n";
		print "Positions:\n";
		print OUTFILE "Positions:\n";
		print "$position{$key}";
		print OUTFILE "$position{$key}";
		print "\n";
		print OUTFILE "\n";
		}
} 

--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Counting gaps in sequence data revisited.

2004-10-17 Thread Michael S. Robeson II

I just wanted to thank everyone for their help and suggestions. This is 
the final full working code to count continuos gaps in a every sequence 
in a multi-sequence FASTA file. It may not be elegant but it is fast 
and works well. Sorry for the long post but I wanted to share this with 
those that do any DNA work. :-)

I show in order: the format of the input data, the output of the input 
data, and finally the working script. I have yet to add comments into 
the code but I am sure many of you veterans will figure it out.

-Thanks again all! As always comments, questions or suggestions are 
welcome!
-Mike

 INPUT 
>dog
atcg--acgat---act-ca
>cat
acgt-acgtacgt-gt-agct-
>mouse
---acgtacg-atcg---actgac-
--- OUTPUT -
***Discovered the following DNA sequences:***
dog found!
cat found!
mouse found!
>> mouse <<
Indel size: 1   Times found:2
Positions:
11
25
Indel size: 3   Times found:2
Positions:
1
16
>> cat <<
Indel size: 1   Times found:4
Positions:
5
18
21
26
Indel size: 4   Times found:1
Positions:
10
>> dog <<
Indel size: 1   Times found:1
Positions:
18
Indel size: 2   Times found:1
Positions:
5
Indel size: 3   Times found:1
Positions:
12
Indel size: 4   Times found:1
Positions:
21
--- Script ---
#!usr/bin/perl
# By Michael S. Robeson II, with the help of friends at lernperl.org 
and bioperl.org! :-)
# 10/16/2004

use warnings;
use strict;
###
# Open Sequence Data & OUTFILE
###
print "Enter in the name of the DNA sequence file:\n";
chomp (my $dna_seq = );
open(DNA_SEQ, $dna_seq)
or die "Can't open file: $!\n";
open(OUTFILE, ">indel_list_"."$dna_seq")
or die "Can't open outfile: $!\n";

# Read sequence data into a hash

my %sequences;
$/ = '>';
print "\n***Discovered the following DNA sequences:***\n";
while (  ) {
chomp;
next unless s/^\s*(.+)//;
my $name = $1;
s/\s//g;
$sequences{$name} = $_;
print "$name found!\n";
}
close DNA_SEQ;
##
# iterate over gaps and write to file
##
foreach (keys %sequences) {
print "\t\t\>\>\>\>\>\> $_ \<\<\<\<\<\<\n";
print OUTFILE "\>\>\>\>\>\> $_ \<\<\<\<\<\<\n";
my $dna = $sequences{$_};
my %gap_data;
my %position;
while ($dna =~ /(\-+)/g) {
my $gap_pos = pos ($dna) - length($&) + 1;
my $gap_length = length $1; #$1 =~ tr/\-+//
$gap_data{$gap_length}++;
$position{$gap_length} .= "$gap_pos"." \n";
}

my @indels = keys (%gap_data);
my @keys = sort { $a <=> $b} @indels;

foreach my $key (@keys) {
print "Indel size:\t$key\tTimes found:\t$gap_data{$key}\n";
print OUTFILE "Indel size:\t$key\tTimes found:\t$gap_data{$key}\n";
print "Positions:\n";
print OUTFILE "Positions:\n";
print "$position{$key}";
print OUTFILE "$position{$key}";
print "\n";
print OUTFILE "\n";
}
# Can replace the last "foreach loop" above with the while loop
# below to do the same thing. Only Gap sizes will not be sorted.
# nor is it set up to print to a file
#   
# while (my ($key, $vlaue)  = each (%gap_data)) {
#   print "Indel size:\t$key\tTimes found:\t$gap_data{$key}\n";
#   print "Positions:\n";
#   print "$position{$key}";
#   print "\n\n";
# }
} 

--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Counting gaps in sequence data revisited.

Re: Counting gaps in sequence data revisited.

Counting gaps in sequence data revisited.

3 matches

Site Navigation

Mail list logo

Footer information