I've been thinking for a while now that parsers need to be able to operate in a
streaming fashion (when the grammar lends itself to it, by not needing to look
ahead much, if at all, to understand what has already been seen) so that strings
that don't fit in memory all at once can be parsed.
Any parser that returns results to the caller piecewise rather than all at once,
such as by supporting callbacks, already provides a streaming interface on the
output end. It just needs to be lazy on the input end as well, and then one can
parse arbitrarily sized inputs while using little memory.
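As a rough sketch of what I mean, here is a hand-rolled reader for a simplified
FASTA-like format. It is written in Python rather than Perl 6 and does not use a
grammar at all; the names iter_fasta and parse_fasta are made up for
illustration, and the format handling is an assumption (a ">" header line
followed by sequence lines). The point is only that input is pulled one line at
a time and each record is handed to the caller as soon as it is complete.

```python
import io
from typing import Callable, Iterable, Iterator, Tuple


def iter_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Lazily yield (header, sequence) pairs, one record at a time.

    `lines` can be any iterable of text lines, e.g. an open file handle,
    so only the current record is ever held in memory.
    """
    header = None
    chunks = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif line and header is not None:
            # Non-empty sequence line belonging to the current record.
            chunks.append(line)
    # Emit the final record once the input is exhausted.
    if header is not None:
        yield header, "".join(chunks)


def parse_fasta(lines: Iterable[str],
                on_record: Callable[[str, str], None]) -> None:
    """Callback-style wrapper: call on_record(header, sequence) per record."""
    for header, sequence in iter_fasta(lines):
        on_record(header, sequence)


if __name__ == "__main__":
    # An in-memory stand-in for a large file; a real file handle works the same.
    sample = io.StringIO(">seq1 first\nACGT\nTTGA\n\n>seq2\nGGCC\n")
    parse_fasta(sample, lambda h, s: print(h, len(s)))
```

Memory use here is bounded by the largest single record, not by the file size.
A grammar-driven version would need the same two properties: lazy input, and
results delivered per record.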
Christopher's example is a good one.
Another example that I would deal with is database dumps. The parsers in psql,
mysql and others can handle SQL dump files that are many gigabytes in size, so
they are evidently parsing them in a streaming manner, and SQL files are really
just program source code files.
-- Darren Duncan
On 2014-08-09, 3:09 PM, Fields, Christopher J wrote:
(accidentally sent to perl6-lang, apologies for cross-posting but this seems
more appropriate)
I have a fairly simple question regarding the feasibility of using grammars
with commonly used biological data formats.
My main question: if I wanted to parse() or subparse() very large files (it is
not unheard of for FASTA/FASTQ or other similar data files to exceed hundreds of
GB), would a grammar be the best solution? Based on what I am reading, the
semantics appear to be greedy; for instance:
Grammar.parsefile($file)
appears to be a convenient shorthand for:
Grammar.parse($file.slurp)
since Grammar.parse() works on a Str, not an IO::Handle or Buf. Or am I
misunderstanding how this could be accomplished?
(just to point out, I know I can subparse() as well but that also appears to
act on a string…)
As an example, I have a simple grammar for parsing FASTA, which is a
(deceptively) simple format for storing sequence data:
http://en.wikipedia.org/wiki/FASTA_format
The grammar is here:
https://github.com/cjfields/bioperl6/blob/master/lib/Bio/Grammar/Fasta.pm6
and tests here:
https://github.com/cjfields/bioperl6/blob/master/t/Grammar/fasta.t
Tests pass with the latest Rakudo just fine.
chris