Christiane Nerz wrote:
>
> Good morning!
Hello,
> Thx for your help..
>
> The problem seems to lay in "filling the hash".
> But I can't see why.
>
> I want to compare two fasta-files, more precisely the IDs of two sets
> of sequences.
I take it that the ID is everything between ' >' and "\n"?
> Each file looks like:
>
> >gi|13699918|dbj|BAB41217.1|.....
> MSEKEIWEKVLEIAQEKLSAVSYSTFLKDTELYTIKDGEAIVLSSIPFNANWLNQQYAEIIQAILFDVVG
> YEVKPHFITTEELANYSNNETATPKETTKPSTETTEDNHVLGREQFNAHNTFDTFVIGPGNRFPHAASLA
> VAEAPAKAYNPLFIYGGVGLGKTHLMHAIGHHVLDNNPDA......
> >gi|13699919|dbj|BAB41218.1|....
> QFQTLITSGHSEFNLSGLDPDQYPLLPQVSRDDAIQLSVKVLKNVIAQTNFAVSTSETRPVLTGVNWLIQ
> ENELICTATDSHRLAVRKLQLEDVSENKNVIIPGKALAE....
> >gi|13699920|dbj|BAB41219.1|.....
> MIILVQEVVVEGDINLGQFLKTEGIIESGGQAKWFLQDVEVLINGVRETRRGKKLEHQDRIDIPELPEDA
> GSFLIIHQGEQ
> >gi|13699921|dbj|BAB41220.1|...
> MKLNTLQLENYRNYDEVTLKCHPDVNILIGENAQGKTNLLESIYTLALAKSHRTSNDKELIRFNADYAKI
> EGELSYRHGTMPLTMFITKKGKQVKVNHLEQSRLTQYIGHLNVVLFAPEDLNIVKGSPQIRRRFIDMELG
> QISAVYLNDLAQYQRILKQKNNYLKQLQLGQ
>
> The whole code is as follows:
>
> #!/usr/local/bin/perl
> #
> ###############################################################################
> # add_IDs_by_pattern_matching
> #
> # Input: Two fastafiles.
> # (Some sequences are in both files, but labeled with different IDS)
> # Output: One fasta-file, sequences labeled with both IDs.
Here you seem to say that you want to compare the sequences and not the
IDs?
> ###############################################################################
>
> use strict;
> my $fastafilename1 = '';
> my $fastafilename2 = '';
> my %hash_fasta1 = '';
> my %hash_fasta2 = '';
> my $key_fasta1 = '';
> my $key_fasta2 = '';
> my $value_fasta1 = '';
> my $value_fasta2 = '';
> my %hash_pattern = '';
> my $key_pattern = '';
> my $value_pattern = '';
>
> print "Please type the name of the first file:\n";
> chomp ($fastafilename1 = <STDIN>);
> open(FASTAFILE1, $fastafilename1) ||
> die("Cannot open file for reading: $!");
>
> #read in the first file
> #and filling two hashes
> #values in hash_pattern later used for pattern-matching
>
> my $k = 0;
> my $i = 0;
> while (<FASTAFILE1>) {
> if(/^>/) {
Isn't there a space before the '>' character?
> chomp;
> $i =1;
> if ($k==1) {
> chomp;
You have already chomped the line, there is no point in chomping it
again.
> $hash_fasta1{$key_fasta1} = $value_fasta1;
> $value_fasta1 ='';
> }
> else {
> $key_fasta1 = $_;
> $key_pattern = $_;
> }
> }
> else {
> if ($i==0) {
> chomp;
You have already chomped the line, there is no point in chomping it
again.
> $value_fasta1 = $value_fasta1 . $_;
> $k=1;
> }
> else {
> $i = 0;
> chomp;
You have already chomped the line, there is no point in chomping it
again.
> $hash_pattern{$key_pattern} = $_;
> $value_fasta1 = $value_fasta1 . $_;
> $k=1;
> }
> }
> }
>
> $hash_fasta1{$key_fasta1} = $value_fasta1;
> delete $hash_fasta1 {''};
> delete $hash_pattern {''};
You should probably avoid adding that key in the first place instead of
deleting it later.
> close(FASTAFILE1) || die("Can't close in file: $!") ;
>
> #read in the second file
>
> print "Please type the name of the first file:\n\n";
Don't you mean the second file? :-)
> chomp ($fastafilename2 = <STDIN>);
> open(FASTAFILE2, $fastafilename2) ||
> die("Cannot open file for reading: $!");
>
> my $j = 0;
> while (<FASTAFILE2>) {
> if(/^>/) {
> chomp;
> if ($j==1) {
> chomp;
> $hash_fasta2{$key_fasta2} = $value_fasta2;
> $value_fasta2 ='';
> }
> else {
> $key_fasta2 = $_;
> }
> } else {
> chomp;
> $value_fasta2 = $value_fasta2 . $_;
> $j=1;
> }
> }
>
> $hash_fasta2{$key_fasta2} = $value_fasta2;
> delete $hash_fasta2 {''};
> close(FASTAFILE2) || die("Can't close in file: $!") ;
>
> my $outputfile = '';
>
> #open outputfile
> $outputfile = "both_IDs";
> unless (open(BOTH_IDS, ">$outputfile") ) {
> print "Cannot open file \"$outputfile\" to write to!!\n\n";
> exit;
> }
>
> my @array1 = keys %hash_fasta1;
> my @array2 = keys %hash_fasta2;
>
> ##################################################################
> # Because I only found one sequence, which is in both fasta-files,
> # I tried to find out, if the hashes are correctly filled.
> # So here I put the code as described** and got the different
> # output. Rest of code:
>
> my $key_hash1 = '';
> my $key_hash2 = '';
>
> #pattern-matching
>
> foreach (@array1) {
> $key_hash1 = $_;
foreach $key_hash1 (@array1) {
> foreach (@array2) {
> $key_hash2 = $_;
foreach $key_hash2 (@array2) {
> if ($hash_fasta2{$key_hash2} =~ $hash_pattern{$key_hash1}) {
You could use index() to do that and you wouldn't have to worry about
regex meta-characters.
> print "Pattern $hash_pattern{$key_hash1} is in
> $hash_fasta2{$key_hash2} \n";
> print BOTH_IDS $key_hash1 . $key_hash2 .
> $hash_fasta2{$key_hash2};
> }
> }
> }
>
> #close outputfile
> close (BOTH_IDS) || die("Can't close in file: $!");
If you want to associate all sequences with their corresponding IDs then
use a hash of arrays:
#!/usr/local/bin/perl
use warnings;
use strict;
print "Please type the name of the first file:\n";
chomp( @ARGV = scalar <STDIN> );
print "Please type the name of the second file:\n\n";
chomp( $ARGV[ 1 ] = <STDIN> );
my %hash_fasta;
$/ = ' >';
while ( <> ) {
my ( $id, $seq ) = split( /\n/, $_, 2 ) or next;
$seq =~ tr/A-Z//cd; # remove non-sequence characters
push @{ $hash_fasta{ $id } }, $seq;
}
for my $id ( keys %hash_fasta ) {
print " >$id\n";
print "$_\n" for @{ $hash_fasta{ $id } };
}
__END__
John
--
use Perl;
program
fulfillment
--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]