Re: solution for Regex

2011-06-11 Thread John Delacour

At 08:28 -0700 10/06/2011, Gurpreet Singh wrote:


> Correct me if i am wrong -
>
> $read = 0 unless /\s[A-Z]{3}:/;
> This might pick up wrong values also - since one of the DBLINKS data
> (the first record) might also get picked up - it should match this
> regex.


Run my script with the data and see.  $read is already false at that 
point.  If the data were differently arranged then maybe, but I was 
not proposing a universal solution - just a suggestion as to a 
different approach.
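
A minimal sketch of why the flag behaves this way, assuming a shortened, hypothetical sample in the same 12-column layout as the data in the original post (the @lines array is made up for illustration; it is not the poster's file). The "COG:" line does match the regex, but $read is still 0 when it is seen, so nothing is collected until a GENES line switches the flag on.

```perl
use strict;
use warnings;

# Hypothetical, shortened sample in the layout of the original post.
my @lines = (
    'DBLINKS     RN: R00746 R01041 R05231',
    '            COG: COG0656',
    'GENES       HSA: 10327(AKR1A1)',
    '            PTR: 741418(AKR1A1)',
    '///',
    'DBLINKS     GO: 0008250',
    'GENES       SCE: YDL232W(OST4)',
    'REFERENCE   PMID:15001703',
);

my $read = 0;    # true only while we are inside a GENES block
my @genes;

for my $line (@lines) {
    $read = 1 if $line =~ /^GENES/;                 # a GENES line opens the block
    $read = 0 unless $line =~ /\s[A-Z]{3}:/;        # any non-"XXX:" line closes it
    next unless $read;

    (my $entry = $line) =~ s/^GENES//;              # drop the key, keep the value
    $entry =~ s/^\s+//;
    push @genes, $entry;
}

print "$_\n" for @genes;
# HSA: 10327(AKR1A1)
# PTR: 741418(AKR1A1)
# SCE: YDL232W(OST4)
```

As the reply says, this depends on the arrangement of the data: a three-letter "XXX:" continuation line that directly follows a GENES block but belongs to some other key would be collected too.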


JD



--
To unsubscribe, e-mail: beginners-unsubscr...@perl.org
For additional commands, e-mail: beginners-h...@perl.org
http://learn.perl.org/




Re: solution for Regex

2011-06-10 Thread John Delacour

At 18:50 -0400 09/06/2011, Uri Guttman wrote:


> ...i don't know the data logic so i can't go further. at least you
> can run this and assign it to $read to remove redundancy. also you
> can declare $read here.
>
> my $read = s/^GENES//;


No it isn't.  If you don't know the data logic then it's because 
you have not read the message that started the thread, which contains 
the data.



>> $read = 0 unless /\s[A-Z]{3}:/;
>
> since you don't do the work unless that passes, just next away:
>
> next if /\s[A-Z]{3}:/;


Wrong again.  That will collect zilch.  Do your homework.



JD





Re: solution for Regex

2011-06-10 Thread Uri Guttman
>>>>> "JD" == John Delacour <johndelac...@gmail.com> writes:

  JD> At 18:50 -0400 09/06/2011, Uri Guttman wrote:
  >> ...i don't know the data logic so i can't go further. at least you
  >> can run this and assign it to $read to remove redundancy. also you
  >> can declare $read here.
  >>
  >> my $read = s/^GENES//;

  JD> No it isn't.  If you don't know the data logic then it's because you
  JD> have not read the message that started the thread, which contains the
  JD> data.

  >> $read = 0 unless /\s[A-Z]{3}:/;
  >>
  >> since you don't do the work unless that passes, just next away:
  >>
  >> next if /\s[A-Z]{3}:/;

  JD> Wrong again.  That will collect zilch.  Do your homework.

not interested right now. i am sure i can clean it up if i did. flags
like that are a red flag that there is something wrong in the
design. too late for me to get into it now.

uri

-- 
Uri Guttman  --  u...@stemsystems.com    http://www.sysarch.com --
-  Perl Code Review , Architecture, Development, Training, Support --
-  Gourmet Hot Cocoa Mix    http://bestfriendscocoa.com -





Re: solution for Regex

2011-06-10 Thread Gurpreet Singh
Hi,
Correct me if i am wrong - 

$read = 0 unless /\s[A-Z]{3}:/;
This might pick up wrong values also - since one of the DBLINKS data (the first 
record) might also get picked up - it should match this regex.






Re: solution for Regex

2011-06-09 Thread Jim Gibson

At 10:48 AM +0200 6/9/11, venkates wrote:

> Hi,
>
> data snippet:
>
>
> I need to retrieve all the gene entries to add it to a hash ref. My
> code does that in the first record but in the second case it also
> pulls out the REFERENCE information. I have provided the code below.
> If some one could tell me where exactly I am going wrong (is it in
> the regex? or otherwise) I would be glad!!
>
>
> code :
>
> use strict;
> use warnings;
> use Carp;
> use Data::Dumper;
>
>
> my $set = parse("/home/venkates/workspace/KEGG_Parser/data/ko");
>
> sub parse {
>
> my $kegg_file_path = shift;
> my $keggData; # Hash ref


Please simplify your program for posting by using a hash instead of a 
hash reference. Your goal should be to make it as easy as possible 
for people to help you. Once you learn how to solve your problems, 
you can use the solution in your actual program with whatever 
complexity is necessary.




> open my $fh, '<', $kegg_file_path
>     or croak("Cannot open file '$kegg_file_path': $!");
>
> local $/ = "\n///\n";
> while (<$fh>){
> chomp;
> my $record = $_;



Why don't you just read into $record in the first place:

while( my $record = <$fh> ) {



> $record =~ m/^ENTRY\s{7}(.+?)\s+/xms;
> my $entries = $1;
> if ($record =~ m/^GENES\s{7}(.+)$/xms){



You are capturing everything from just after GENES to the end of the 
record. Try putting in REFERENCE:


if ($record =~ m/^GENES\s{7}(.+)REFERENCE/xms){



> my $gene = $1;
> ${$keggData}{$entries}{'GENE'} = $gene;
> my @genes = split ('\s{13}', $gene);
> foreach my $gene_element (@genes){
> my $taxon_label = substr($gene_element, 0, 3);
> my $gene_label = substr($gene_element, 5);
> my @gene_label_array = split '\s', $gene_label;
> push @{${$keggData}{$entries}{'GENES'}{$taxon_label}}, @gene_label_array;
>
> }
> }
>
> }
> print Dumper($keggData);
> close $fh;
> }
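
A hedged sketch building on the REFERENCE suggestion above. A literal REFERENCE in the pattern fails for records that have no REFERENCE section (such as the first record in the original post), so one alternative is to capture lazily and stop at the next line that starts in column one, or at the end of the record. The helper name gene_lines and both sample records are hypothetical and only illustrate the idea; this is not the poster's code.

```perl
use strict;
use warnings;

# Return the lines of the GENES block of one record, without the key.
# Assumes continuation lines are indented and every new key starts in column one.
sub gene_lines {
    my ($record) = @_;

    # /m lets ^ match at each line start, /s lets . cross newlines.
    # The lookahead stops the lazy capture at the next top-level key or at \z.
    return unless $record =~ /^GENES\s+(.+?)(?=^\S|\z)/ms;

    return split /\n\s*/, $1;    # one element per gene line, indentation removed
}

# Hypothetical record with a REFERENCE section after GENES.
my $with_ref = join "\n",
    'ENTRY       K00730  KO',
    'GENES       SCE: YDL232W(OST4)',
    '            AGO: AGOS_ABL170C',
    'REFERENCE   PMID:15001703',
    '  AUTHORS   Zubkov S, Lennarz WJ, Mohanty S';

# Hypothetical record where GENES is the last section.
my $without_ref = join "\n",
    'DBLINKS     GO: 0008106',
    'GENES       HSA: 10327(AKR1A1)',
    '            PTR: 741418(AKR1A1)';

print join(' | ', gene_lines($with_ref)),    "\n";
print join(' | ', gene_lines($without_ref)), "\n";
```

The same pattern handles both kinds of record, which a hard-coded REFERENCE terminator does not.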


Please use the DATA file handle to make it easier to run your 
program. Put your file data at the end of the program after the line


__DATA__

then use <DATA> to read the data lines.

Thanks.

--
Jim Gibson
j...@gibson.org





Re: solution for Regex

2011-06-09 Thread Dr.Ruud

On 2011-06-09 10:48, venkates wrote:


> my @gene_label_array = split '\s', $gene_label;


That '\s' is more clearly written as /\s/ or for example m{\s}.

But best just make it ' ' (see perldoc -f split, about that special case).
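
A small sketch of the difference, using a made-up value for $gene_label with leading and doubled whitespace. With a /\s/ pattern every single whitespace character is a separator, so empty fields appear; with the special-case ' ' pattern, leading whitespace is skipped and runs of whitespace count as one separator.

```perl
use strict;
use warnings;

# Hypothetical input: two leading spaces and two spaces between the values.
my $gene_label = "  10327(AKR1A1)  10328";

my @by_regex = split /\s/, $gene_label;   # ("", "", "10327(AKR1A1)", "", "10328")
my @by_space = split ' ',  $gene_label;   # ("10327(AKR1A1)", "10328")

printf "split /\\s/ gives %d fields\n", scalar @by_regex;
printf "split ' '  gives %d fields\n", scalar @by_space;
```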

--
Ruud





Re: solution for Regex

2011-06-09 Thread Shawn H Corey

On 11-06-09 01:48 PM, Dr.Ruud wrote:

> On 2011-06-09 10:48, venkates wrote:
>
>> my @gene_label_array = split '\s', $gene_label;
>
> That '\s' is more clearly written as /\s/ or for example m{\s}.
>
> But best just make it ' ' (see perldoc -f split, about that special case).



FYI, some people find it hard to distinguish between ' ' and '', so they
write it as "\x20".  If you ever see this, you now know why.  :)



--
Just my 0.0002 million dollars worth,
  Shawn

Confusion is the first step of understanding.

Programming is as much about organization and communication
as it is about coding.

The secret to great software:  Fail early & often.

Eliminate software piracy:  use only FLOSS.





Re: solution for Regex

2011-06-09 Thread John Delacour

At 10:48 +0200 09/06/2011, venkates wrote:

> I need to retrieve all the gene entries to add it to a hash ref.

Your code is very fussy with all those substrings etc.  What about 
something like this:


#!/usr/local/bin/perl
use strict;
my $read = 0; my @genes; my %hash;
while (<DATA>){
  chomp;
  $read = 1 if /^GENES/;
  $read = 0 unless /\s[A-Z]{3}:/;
  if ($read){
    s/^GENES//;
    s/^\s+//;
    push @genes, $_;
  }
}
for (@genes){
  my ($taxon_label, $gene_label) = split /:\s*/;
  $hash{$taxon_label} = $gene_label;
}
for (keys %hash){
  print "Key: $_; Val: $hash{$_}\n";
}
__DATA__
Your file contents here



JD





Re: solution for Regex

2011-06-09 Thread Rob Dixon
On 09/06/2011 09:48, venkates wrote:
> Hi,
>
> data snippet:
>
> ENTRY       K2  KO
> NAME        E1.1.1.2, adh
> DEFINITION  alcohol dehydrogenase (NADP+) [EC:1.1.1.2]
> PATHWAY     ko00010  Glycolysis / Gluconeogenesis
>             ko00561  Glycerolipid metabolism
>             ko00930  Caprolactam degradation
> CLASS       Metabolism; Carbohydrate Metabolism; Glycolysis / Gluconeogenesis [PATH:ko00010]
>             Metabolism; Lipid Metabolism; Glycerolipid metabolism [PATH:ko00561]
>             Metabolism; Xenobiotics Biodegradation and Metabolism; Caprolactam degradation [PATH:ko00930]
> DBLINKS     RN: R00746 R01041 R05231
>             COG: COG0656
>             GO: 0008106
> GENES       HSA: 10327(AKR1A1)
>             PTR: 741418(AKR1A1)
>             PON: 100173796(AKR1A1)
>             MCC: 693380(AKR1A1)
>             MMU: 58810(Akr1a4)
>             RNO: 78959(Akr1a1)
>             CFA: 610537
> ///
> ENTRY       K00730  KO
> NAME        OST4
> DEFINITION  oligosaccharyl transferase complex subunit OST4
> PATHWAY     ko00510  N-Glycan biosynthesis
>             ko00513  Various types of N-glycan biosynthesis
>             ko04141  Protein processing in endoplasmic reticulum
> MODULE      M00072  Oligosaccharyltransferase
> CLASS       Metabolism; Glycan Biosynthesis and Metabolism; N-Glycan biosynthesis [PATH:ko00510]
>             Metabolism; Glycan Biosynthesis and Metabolism; Various types of N-glycan biosynthesis [PATH:ko00513]
>             Genetic Information Processing; Folding, Sorting and Degradation; Protein processing in endoplasmic reticulum [PATH:ko04141]
> DBLINKS     GO: 0008250
> GENES       SCE: YDL232W(OST4)
>             AGO: AGOS_ABL170C
>             KLA: KLLA0A01287g
>             VPO: Kpol_1054p35
>             SSL: SS1G_13465
> REFERENCE   PMID:15001703
>   AUTHORS   Zubkov S, Lennarz WJ, Mohanty S
>   TITLE     Structural basis for the function of a minimembrane protein subunit of yeast oligosaccharyltransferase.
>   JOURNAL   Proc Natl Acad Sci U S A 101:3821-6 (2004)
> ///
>
> I need to retrieve all the gene entries to add it to a hash ref. My code
> does that in the first record but in the second case it also pulls out
> the REFERENCE information. I have provided the code below. If some one
> could tell me where exactly I am going wrong (is it in the regex? or
> otherwise) I would be glad!!
>
> code :
>
> use strict;
> use warnings;
> use Carp;
> use Data::Dumper;
>
>
> my $set = parse("/home/venkates/workspace/KEGG_Parser/data/ko");
>
> sub parse {
>
>   my $kegg_file_path = shift;
>   my $keggData; # Hash ref
>
>   open my $fh, '<', $kegg_file_path
>     or croak("Cannot open file '$kegg_file_path': $!");
>   local $/ = "\n///\n";
>   while (<$fh>){
>     chomp;
>     my $record = $_;
>     $record =~ m/^ENTRY\s{7}(.+?)\s+/xms;
>     my $entries = $1;
>     if ($record =~ m/^GENES\s{7}(.+)$/xms){
>       my $gene = $1;
>       ${$keggData}{$entries}{'GENE'} = $gene;
>       my @genes = split ('\s{13}', $gene);
>       foreach my $gene_element (@genes){
>         my $taxon_label = substr($gene_element, 0, 3);
>         my $gene_label = substr($gene_element, 5);
>         my @gene_label_array = split '\s', $gene_label;
>         push @{${$keggData}{$entries}{'GENES'}{$taxon_label}}, @gene_label_array;
>       }
>     }
>
>   }
>   print Dumper($keggData);
>   close $fh;
> }

I would prefer to read the file a line at a time. The code below seems 
to do what you want.

HTH,

Rob


use strict;
use warnings;

use Data::Dumper;

my $kegg_file = '/home/venkates/workspace/KEGG_Parser/data/ko';

# Fall back to the DATA section if the file cannot be opened
my $fh;
unless (open $fh, '<', $kegg_file) {
  warn "Failed to open file: $!. Defaulting to DATA.";
  $fh = *DATA;
}

parse($fh);

sub parse {

  my $kegg_file_handle = shift;
  my $keggData;

  my $entry;   # ID of the record being read
  my $key;     # most recent key seen; continuation lines inherit it

  while (<$kegg_file_handle>) {

    next unless /\S/;
    if (m|///|) {
      # End of record: forget the current entry and key
      undef $entry;
      undef $key;
      next;
    }

    chomp;

    # Optional key in the first columns, then whitespace, then the value
    next unless m|^(.{0,11}?)\s+(.+)|;

    $key = $1 if $1;
    my $val = $2;

    if ($key eq 'ENTRY') {
      ($entry) = $val =~ /(\S+)/;
    }
    elsif ($key eq 'GENES') {
      die "No current entry" unless $entry;
      my ($taxon_label, @gene_label_array) = split /:?\s+/, $val;
      push @{$keggData->{$entry}{$key}{$taxon_label}}, @gene_label_array;
    }
  }

  print Dumper($keggData);
}

__DATA__
ENTRY       K2  KO
NAME        E1.1.1.2, adh
DEFINITION  alcohol dehydrogenase (NADP+) [EC:1.1.1.2]
PATHWAY     ko00010  Glycolysis / Gluconeogenesis
            ko00561  Glycerolipid metabolism
            ko00930  Caprolactam degradation
CLASS       Metabolism; Carbohydrate Metabolism; Glycolysis / Gluconeogenesis [PATH:ko00010]
            Metabolism; Lipid Metabolism; Glycerolipid metabolism [PATH:ko00561]
            Metabolism; Xenobiotics Biodegradation and Metabolism; Caprolactam degradation [PATH:ko00930]
DBLINKS     RN: R00746 R01041 R05231
            COG: COG0656
            GO: 0008106
GENES       HSA: 10327(AKR1A1)
            PTR: 741418(AKR1A1)
            PON: 100173796(AKR1A1)
            MCC: 693380(AKR1A1)
            MMU: 58810(Akr1a4)
            RNO: 78959(Akr1a1)
            CFA: 610537
///
ENTRY       K00730  KO
NAME        OST4
DEFINITION  oligosaccharyl transferase complex subunit OST4
PATHWAY     ko00510  N-Glycan biosynthesis
            ko00513  Various types of N-glycan biosynthesis
            ko04141  Protein processing in endoplasmic reticulum
MODULE

Re: solution for Regex

2011-06-09 Thread Uri Guttman
>>>>> "JD" == John Delacour <johndelac...@gmail.com> writes:

  JD> use strict;

use warnings ;

  JD> my $read = 0; my @genes; my %hash;

why the $read flag?

  JD> while (<DATA>){
  JD>   chomp;
  JD>   $read = 1 if /^GENES/;

you can always just do this and test for it. i don't know the data logic
so i can't go further. at least you can run this and assign it to $read
to remove redundancy. also you can declare $read here.

my $read = s/^GENES//;

  JD>   $read = 0 unless /\s[A-Z]{3}:/;

since you don't do the work unless that passes, just next away:

next if /\s[A-Z]{3}:/;

now you don't need to test $read at all. again, i haven't checked the
flow logic so i could be wrong. i just smell better logic here. having
data in this example would help (sure i can look at the OP's post but i
am tired. :)

  JD>   if ($read){
  JD>     s/^GENES//;

that line isn't needed.

  JD>     s/^\s+//;
  JD>     push @genes, $_;
  JD>   }
  JD> }
  JD> for (@genes){
  JD>   my ($taxon_label, $gene_label) = split /:\s*/;
  JD>   $hash{$taxon_label} = $gene_label;
  JD> }

if the list isn't that long you can map/split in one cleaner line. and
don't use %hash for a hash name.

my %labeled_genes = map { split /:\s*/ } @genes ;
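
A minimal sketch of the map/split idea above, assuming @genes already holds "label: value" strings (the three sample values are taken from the data in the original post). The limit of 2 passed to split is an addition here, not part of the line above: it keeps each element to exactly one key and one value even if the value itself contains a colon.

```perl
use strict;
use warnings;

# Assumed input: one "TAXON: gene" string per element.
my @genes = (
    'HSA: 10327(AKR1A1)',
    'PTR: 741418(AKR1A1)',
    'CFA: 610537',
);

# Each split returns a (key, value) pair; map flattens the pairs into the hash.
my %labeled_genes = map { split /:\s*/, $_, 2 } @genes;

print "$_ => $labeled_genes{$_}\n" for sort keys %labeled_genes;
# CFA => 610537
# HSA => 10327(AKR1A1)
# PTR => 741418(AKR1A1)
```

A plain hash keeps only one value per key, so if the same taxon label occurs more than once (for example in several records) later values overwrite earlier ones; a hash of arrays would be needed to keep them all.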

uri

-- 
Uri Guttman  --  u...@stemsystems.com    http://www.sysarch.com --
-  Perl Code Review , Architecture, Development, Training, Support --
-  Gourmet Hot Cocoa Mix    http://bestfriendscocoa.com -
