Re: solution for Regex
At 08:28 -0700 10/06/2011, Gurpreet Singh wrote:

> Correct me if i am wrong -
>
> $read = 0 unless /\s[A-Z]{3}:/;
>
> This might pick up wrong values also - since one of the DBLINKS data
> (the first record) might also get picked up - it should match this regex.

Run my script with the data and see. $read is already false at that point. If the data were arranged differently then maybe, but I was not proposing a universal solution - just a suggestion as to a different approach.

JD

--
To unsubscribe, e-mail: beginners-unsubscr...@perl.org
For additional commands, e-mail: beginners-h...@perl.org
http://learn.perl.org/
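For readers following the thread, here is a minimal sketch of why the DBLINKS continuation lines are not collected. It runs the two flag assignments from the script under discussion over a few lines of the first record. The 12-column layout of the sample lines is an assumption (the archive has collapsed the whitespace), and the sub name collect_genes is made up for illustration.

```perl
use strict;
use warnings;

# Sample lines from the first record; the 12-column layout is an assumption.
my @lines = (
    'DBLINKS     RN: R00746 R01041 R05231',
    '            COG: COG0656',
    '            GO: 0008106',
    'GENES       HSA: 10327(AKR1A1)',
    '            PTR: 741418(AKR1A1)',
    '///',
);

# Hypothetical helper wrapping the flag logic from the script under discussion.
sub collect_genes {
    my @kept;
    my $read = 0;
    for my $line (@_) {
        $read = 1 if $line =~ /^GENES/;             # switched on only by the GENES key
        $read = 0 unless $line =~ /\s[A-Z]{3}:/;    # switched off by any line without " XXX:"
        push @kept, $line if $read;
    }
    return @kept;
}

my @kept = collect_genes(@lines);
print "$_\n" for @kept;
```

The "COG:" line does match the regex, but a match only leaves $read unchanged; nothing but /^GENES/ ever sets it to 1, so only the two GENES lines are printed.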
Re: solution for Regex
At 18:50 -0400 09/06/2011, Uri Guttman wrote:

> ...i don't know the data logic so i can't go further. at least you can
> run this and assign it to $read to remove redundancy. also you can
> declare $read here.
>
> my $read = s/^GENES//;

No it isn't. If you don't know the data logic then it's because you have not read the message that started the thread, which contains the data.

>> $read = 0 unless /\s[A-Z]{3}:/;
>
> since you don't do the work unless that passes, just next away:
>
> next if /\s[A-Z]{3}:/;

Wrong again. That will collect zilch. Do your homework.

JD
Re: solution for Regex
JD == John Delacour johndelac...@gmail.com writes:

  JD> At 18:50 -0400 09/06/2011, Uri Guttman wrote:

  >> ...i don't know the data logic so i can't go further. at least you can
  >> run this and assign it to $read to remove redundancy. also you can
  >> declare $read here.
  >>
  >> my $read = s/^GENES//;

  JD> No it isn't. If you don't know the data logic then it's because you
  JD> have not read the message that started the thread, which contains the
  JD> data.

  >>> $read = 0 unless /\s[A-Z]{3}:/;
  >>
  >> since you don't do the work unless that passes, just next away:
  >>
  >> next if /\s[A-Z]{3}:/;

  JD> Wrong again. That will collect zilch. Do your homework.

not interested right now. i am sure i can clean it up if i did. flags like that are a red flag that there is something wrong in the design. too late for me to get into it now.

uri

--
Uri Guttman -- u...@stemsystems.com -- http://www.sysarch.com
Perl Code Review, Architecture, Development, Training, Support
Gourmet Hot Cocoa Mix -- http://bestfriendscocoa.com
Re: solution for Regex
Hi,

Correct me if I am wrong -

$read = 0 unless /\s[A-Z]{3}:/;

This might pick up wrong values too, since one of the DBLINKS data lines (in the first record) might also get picked up - it should match this regex.
Re: solution for Regex
At 10:48 AM +0200 6/9/11, venkates wrote:

> Hi,
>
> data snippet:
>
> I need to retrieve all the gene entries to add it to a hash ref. My code
> does that in the first record but in the second case it also pulls out
> the REFERENCE information. I have provided the code below. If some one
> could tell me where exactly I am going wrong (is it in the regex? or
> otherwise) I would be glad!!
>
> code :
>
> use strict;
> use warnings;
> use Carp;
> use Data::Dumper;
>
> my $set = parse("/home/venkates/workspace/KEGG_Parser/data/ko");
>
> sub parse {
>     my $kegg_file_path = shift;
>     my $keggData; # Hash ref

Please simplify your program for posting by using a hash instead of a hash reference. Your goal should be to make it as easy as possible for people to help you. Once you learn how to solve your problems, you can use the solution in your actual program with whatever complexity is necessary.

>     open my $fh, '<', $kegg_file_path or croak("Cannot open file '$kegg_file_path': $!");
>
>     local $/ = "\n///\n";
>     while (<$fh>){
>         chomp;
>         my $record = $_;

Why don't you just read into $record in the first place:

    while ( my $record = <$fh> ) {

>         $record =~ m/^ENTRY\s{7}(.+?)\s+/xms;
>         my $entries = $1;
>
>         if ($record =~ m/^GENES\s{7}(.+)$/xms){

You are capturing everything from just after GENES to the end of the record. Try putting in REFERENCE:

    if ($record =~ m/^GENES\s{7}(.+)REFERENCE/xms){

>             my $gene = $1;
>             ${$keggData}{$entries}{'GENE'} = $gene;
>             my @genes = split ('\s{13}', $gene);
>             foreach my $gene_element (@genes){
>                 my $taxon_label = substr($gene_element, 0, 3);
>                 my $gene_label = substr($gene_element, 5);
>                 my @gene_label_array = split '\s', $gene_label;
>                 push @{${$keggData}{$entries}{'GENES'}{$taxon_label}}, @gene_label_array;
>             }
>         }
>     }
>     print Dumper($keggData);
>     close $fh;
> }

Please use the DATA file handle to make it easier to run your program. Put your file data at the end of the program after the line

__DATA__

then use <DATA> to read the data lines. Thanks.
--
Jim Gibson
j...@gibson.org
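A minimal sketch of the two suggestions above, under stated assumptions. The sample data is a shortened version of the poster's two records with an assumed 12-column layout, and it is read from an in-memory file handle (in a posted script the same lines would sit after __DATA__ and be read from DATA). Note also that the first sample record has no REFERENCE field, so a pattern that requires the literal word REFERENCE would skip that record; the sketch swaps in a lazy capture that stops before the next key starting in column 1, which is a substitution for the suggestion rather than a restatement of it. The name %genes_for is made up for illustration.

```perl
use strict;
use warnings;

# Two shortened records; the 12-column layout is an assumption.
my $text = <<'END';
ENTRY       K2 KO
GENES       HSA: 10327(AKR1A1)
            PTR: 741418(AKR1A1)
///
ENTRY       K00730 KO
GENES       SCE: YDL232W(OST4)
            AGO: AGOS_ABL170C
REFERENCE   PMID:15001703
  AUTHORS   Zubkov S, Lennarz WJ, Mohanty S
///
END

my %genes_for;    # entry id => raw text of its GENES field (hypothetical name)

open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
{
    local $/ = "\n///\n";    # read one whole record at a time
    while ( my $record = <$fh> ) {
        chomp $record;       # removes the "\n///\n" separator
        my ($entry) = $record =~ /^ENTRY\s+(\S+)/m or next;

        # Capture lazily, stopping before the next key that starts in
        # column 1, or at the end of the record if GENES is the last field.
        if ( $record =~ /^GENES\s+(.+?)(?=^\S|\z)/ms ) {
            $genes_for{$entry} = $1;
        }
    }
}
close $fh;

print "$_:\n$genes_for{$_}\n" for sort keys %genes_for;
```

Reading straight into $record, as suggested, also avoids the extra copy from $_.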
Re: solution for Regex
On 2011-06-09 10:48, venkates wrote:

> my @gene_label_array = split '\s', $gene_label;

That '\s' is more clearly written as /\s/ or, for example, m{\s}. But best just make it ' ' (see perldoc -f split, about that special case).

--
Ruud
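To illustrate the special case mentioned above, here is a small sketch comparing the two forms. The input string is a made-up gene label line with leading and repeated spaces, not taken from the poster's file.

```perl
use strict;
use warnings;

# Hypothetical input with two leading spaces and a run of three spaces.
my $gene_label = "  10327(AKR1A1)   58810(Akr1a4)";

my @by_regex = split /\s/, $gene_label;    # splits at every single whitespace character
my @by_space = split ' ',  $gene_label;    # awk-style: skips leading whitespace, collapses runs

printf "split /\\s/ gave %d fields\n", scalar @by_regex;
printf "split ' '  gave %d fields\n", scalar @by_space;
```

With the regex form, every extra whitespace character produces an empty string in the result, which would then be pushed into the hash along with the real labels; the single-space form returns only the two labels.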
Re: solution for Regex
On 11-06-09 01:48 PM, Dr.Ruud wrote:

> On 2011-06-09 10:48, venkates wrote:
>
>> my @gene_label_array = split '\s', $gene_label;
>
> That '\s' is more clearly written as /\s/ or for example m{\s}. But best
> just make it ' ' (see perldoc -f split, about that special case).

FYI, some people find it hard to distinguish between ' ' and '', so they write it as "\x20". If you ever see this, you now know why. :)

--
Just my 0.0002 million dollars worth,
Shawn

Confusion is the first step of understanding.
Programming is as much about organization and communication as it is about coding.
The secret to great software: Fail early & often.
Eliminate software piracy: use only FLOSS.
Re: solution for Regex
At 10:48 +0200 09/06/2011, venkates wrote:

> I need to retrieve all the gene entries to add it to a hash ref.

Your code is very fussy with all those substrings etc. What about something like this:

#!/usr/local/bin/perl
use strict;
my $read = 0;
my @genes;
my %hash;
while (<DATA>){
    chomp;
    $read = 1 if /^GENES/;
    $read = 0 unless /\s[A-Z]{3}:/;
    if ($read){
        s/^GENES//;
        s/^\s+//;
        push @genes, $_;
    }
}
for (@genes){
    my ($taxon_label, $gene_label) = split /:\s*/;
    $hash{$taxon_label} = $gene_label;
}
for (keys %hash){
    print "Key: $_; Val: $hash{$_}\n";
}
__DATA__
Your file contents here

JD
Re: solution for Regex
On 09/06/2011 09:48, venkates wrote:

> Hi,
>
> data snippet:
>
> ENTRY       K2 KO
> NAME        E1.1.1.2, adh
> DEFINITION  alcohol dehydrogenase (NADP+) [EC:1.1.1.2]
> PATHWAY     ko00010 Glycolysis / Gluconeogenesis
>             ko00561 Glycerolipid metabolism
>             ko00930 Caprolactam degradation
> CLASS       Metabolism; Carbohydrate Metabolism; Glycolysis / Gluconeogenesis [PATH:ko00010]
>             Metabolism; Lipid Metabolism; Glycerolipid metabolism [PATH:ko00561]
>             Metabolism; Xenobiotics Biodegradation and Metabolism; Caprolactam degradation [PATH:ko00930]
> DBLINKS     RN: R00746 R01041 R05231
>             COG: COG0656
>             GO: 0008106
> GENES       HSA: 10327(AKR1A1)
>             PTR: 741418(AKR1A1)
>             PON: 100173796(AKR1A1)
>             MCC: 693380(AKR1A1)
>             MMU: 58810(Akr1a4)
>             RNO: 78959(Akr1a1)
>             CFA: 610537
> ///
> ENTRY       K00730 KO
> NAME        OST4
> DEFINITION  oligosaccharyl transferase complex subunit OST4
> PATHWAY     ko00510 N-Glycan biosynthesis
>             ko00513 Various types of N-glycan biosynthesis
>             ko04141 Protein processing in endoplasmic reticulum
> MODULE      M00072 Oligosaccharyltransferase
> CLASS       Metabolism; Glycan Biosynthesis and Metabolism; N-Glycan biosynthesis [PATH:ko00510]
>             Metabolism; Glycan Biosynthesis and Metabolism; Various types of N-glycan biosynthesis [PATH:ko00513]
>             Genetic Information Processing; Folding, Sorting and Degradation; Protein processing in endoplasmic reticulum [PATH:ko04141]
> DBLINKS     GO: 0008250
> GENES       SCE: YDL232W(OST4)
>             AGO: AGOS_ABL170C
>             KLA: KLLA0A01287g
>             VPO: Kpol_1054p35
>             SSL: SS1G_13465
> REFERENCE   PMID:15001703
>   AUTHORS   Zubkov S, Lennarz WJ, Mohanty S
>   TITLE     Structural basis for the function of a minimembrane protein subunit of yeast oligosaccharyltransferase.
>   JOURNAL   Proc Natl Acad Sci U S A 101:3821-6 (2004)
> ///
>
> I need to retrieve all the gene entries to add it to a hash ref. My code
> does that in the first record but in the second case it also pulls out
> the REFERENCE information. I have provided the code below. If some one
> could tell me where exactly I am going wrong (is it in the regex? or
> otherwise) I would be glad!!
> code :
>
> use strict;
> use warnings;
> use Carp;
> use Data::Dumper;
>
> my $set = parse("/home/venkates/workspace/KEGG_Parser/data/ko");
>
> sub parse {
>     my $kegg_file_path = shift;
>     my $keggData; # Hash ref
>
>     open my $fh, '<', $kegg_file_path or croak("Cannot open file '$kegg_file_path': $!");
>
>     local $/ = "\n///\n";
>     while (<$fh>){
>         chomp;
>         my $record = $_;
>         $record =~ m/^ENTRY\s{7}(.+?)\s+/xms;
>         my $entries = $1;
>
>         if ($record =~ m/^GENES\s{7}(.+)$/xms){
>             my $gene = $1;
>             ${$keggData}{$entries}{'GENE'} = $gene;
>             my @genes = split ('\s{13}', $gene);
>             foreach my $gene_element (@genes){
>                 my $taxon_label = substr($gene_element, 0, 3);
>                 my $gene_label = substr($gene_element, 5);
>                 my @gene_label_array = split '\s', $gene_label;
>                 push @{${$keggData}{$entries}{'GENES'}{$taxon_label}}, @gene_label_array;
>             }
>         }
>     }
>     print Dumper($keggData);
>     close $fh;
> }

I would prefer to read the file a line at a time. The code below seems to do what you want.

HTH,

Rob

use strict;
use warnings;

use Data::Dumper;

my $kegg_file = '/home/venkates/workspace/KEGG_Parser/data/ko';

my $fh;
unless (open $fh, $kegg_file) {
  warn "Failed to open file: $!. Defaulting to DATA.";
  $fh = *DATA;
}

parse($fh);

sub parse {

  my $kegg_file_handle = shift;
  my $keggData;
  my $entry;
  my $key;

  # Read from the handle that was passed in
  while (<$kegg_file_handle>) {
    next unless /\S/;
    if (m|///|) {
      # End of record: forget the current entry and key
      undef $entry;
      undef $key;
      next;
    }
    chomp;
    # Key (if any) sits in the first 12 columns; continuation lines have none
    next unless m|^(.{0,11}?)\s+(.+)|;
    $key = $1 if $1;
    my $val = $2;

    if ($key eq 'ENTRY') {
      ($entry) = $val =~ /(\S+)/;
    }
    elsif ($key eq 'GENES') {
      die "No current entry" unless $entry;
      my ($taxon_label, @gene_label_array) = split /:?\s+/, $val;
      push @{$keggData->{$entry}{$key}{$taxon_label}}, @gene_label_array;
    }
  }

  print Dumper($keggData);
}

__DATA__
ENTRY       K2 KO
NAME        E1.1.1.2, adh
DEFINITION  alcohol dehydrogenase (NADP+) [EC:1.1.1.2]
PATHWAY     ko00010 Glycolysis / Gluconeogenesis
            ko00561 Glycerolipid metabolism
            ko00930 Caprolactam degradation
CLASS       Metabolism; Carbohydrate Metabolism; Glycolysis / Gluconeogenesis [PATH:ko00010]
            Metabolism; Lipid Metabolism; Glycerolipid metabolism [PATH:ko00561]
            Metabolism; Xenobiotics Biodegradation and Metabolism; Caprolactam degradation [PATH:ko00930]
DBLINKS     RN: R00746 R01041 R05231
            COG: COG0656
            GO: 0008106
GENES       HSA: 10327(AKR1A1)
            PTR: 741418(AKR1A1)
            PON: 100173796(AKR1A1)
            MCC: 693380(AKR1A1)
            MMU: 58810(Akr1a4)
            RNO: 78959(Akr1a1)
            CFA: 610537
///
ENTRY       K00730 KO
NAME        OST4
DEFINITION  oligosaccharyl transferase complex subunit OST4
PATHWAY     ko00510 N-Glycan biosynthesis
            ko00513 Various types of N-glycan biosynthesis
            ko04141 Protein processing in endoplasmic reticulum
MODULE
Re: solution for Regex
JD == John Delacour johndelac...@gmail.com writes:

  JD> use strict;

use warnings ;

  JD> my $read = 0;
  JD> my @genes;
  JD> my %hash;

why the $read flag?

  JD> while (<DATA>){
  JD>     chomp;
  JD>     $read = 1 if /^GENES/;

you can always just do this and test for it. i don't know the data logic so i can't go further. at least you can run this and assign it to $read to remove redundancy. also you can declare $read here.

    my $read = s/^GENES//;

  JD>     $read = 0 unless /\s[A-Z]{3}:/;

since you don't do the work unless that passes, just next away:

    next if /\s[A-Z]{3}:/;

now you don't need to test $read at all. again, i haven't checked the flow logic so i could be wrong. i just smell better logic here. having data in this example (sure i can look at the OP's post but i am tired. :)

  JD>     if ($read){
  JD>         s/^GENES//;

that line isn't needed.

  JD>         s/^\s+//;
  JD>         push @genes, $_;
  JD>     }
  JD> }
  JD> for (@genes){
  JD>     my ($taxon_label, $gene_label) = split /:\s*/;
  JD>     $hash{$taxon_label} = $gene_label;
  JD> }

if the list isn't that long you can map/split in one cleaner line. and don't use %hash for a hash name.

    my %labeled_genes = map { split /:\s*/ } @genes ;

uri
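A runnable sketch of the map/split one-liner suggested above. The contents of @genes are assumed to be the cleaned "TAXON: label" lines that the earlier loop collects; the three sample lines are taken from the poster's first record. The limit argument of 2 is an addition not in the original suggestion: it keeps each line to exactly one key and one value, so the key/value pairing cannot go out of step if a label ever contains a colon.

```perl
use strict;
use warnings;

# Cleaned "TAXON: label" lines, as the collecting loop is assumed to leave them.
my @genes = (
    'HSA: 10327(AKR1A1)',
    'PTR: 741418(AKR1A1)',
    'CFA: 610537',
);

# Each split returns a (key, value) pair; map flattens the pairs into the hash.
# The limit of 2 is an added safeguard, not part of the original one-liner.
my %labeled_genes = map { split /:\s*/, $_, 2 } @genes;

print "$_ => $labeled_genes{$_}\n" for sort keys %labeled_genes;
```

As with the loop version, a taxon label that occurs more than once would keep only its last value.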