Grammars and biological data formats

Fields, Christopher J Sat, 09 Aug 2014 15:11:22 -0700

(Accidentally sent to perl6-lang; apologies for cross-posting, but this list 
seems more appropriate.)


I have a fairly simple question regarding the feasibility of using grammars 
with commonly used biological data formats.  

My main question: if I wanted to parse() or subparse() very large files (it is 
not unheard of for FASTA/FASTQ or similar data files to exceed hundreds of GB), 
would a grammar be the best solution?  Based on what I am reading, the 
semantics appear to be greedy; for instance:

   Grammar.parsefile($file)

appears to be a convenient shorthand for:

   Grammar.parse($file.slurp)

since Grammar.parse() works on a Str, not an IO::Handle or Buf.  Or am I 
misunderstanding how this could be accomplished?

(just to point out, I know I can subparse() as well but that also appears to 
act on a string…)

As an example, I have a simple grammar for parsing FASTA, which is a 
(deceptively) simple format for storing sequence data:

   http://en.wikipedia.org/wiki/FASTA_format

The grammar is here:

   https://github.com/cjfields/bioperl6/blob/master/lib/Bio/Grammar/Fasta.pm6

and tests here:

   https://github.com/cjfields/bioperl6/blob/master/t/Grammar/fasta.t

Tests pass with the latest Rakudo just fine.

chris
