Hei Melissa,
> Input of 15 MB, 4 minutes runtime and it's using 1.9 GB of RAM (only 2
> GB total).
> Input of 1 MB used 1.1 GB of RAM and took 3:21 minutes to complete.
> Input of .25 MB used 308 MB of RAM and took a little over 22 seconds
> to complete.
> I have not tried treetop.
These numbers roughly add up - it looks as if you've hit real limitations
in the current parslet release. Parslet currently keeps a few things
around that likely add up to that memory consumption:
* an unreduced parse result (containing all the itsy bits, even those
you did not capture with .as())
* a packrat cache mapping positions in the file to parse results
(this alone is probably around 300 MB in your case)
* a line cache mapping ranges in the input to line numbers
In addition to that, parslet generates a lot of transient objects
(errors, result objects, etc.), putting load on the GC.
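If you want to quantify that GC pressure, plain MRI can report cumulative
allocation counts. A sketch (the GC.stat key name is the one current MRI
uses; it has varied between Ruby versions):

```ruby
# Rough allocation counter around a piece of work (MRI-specific).
# GC.stat(:total_allocated_objects) is monotonic, so the difference
# is the number of objects allocated inside the block.
def allocations
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
end

# Stand-in workload; you would wrap your parser.parse(input) call instead.
n = allocations { 1_000.times { "transient" * 2 } }
```

Comparing that number across rule changes tells you quickly whether a
rewrite actually reduces allocation, instead of guessing from wall time.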
Two initial remarks on the above:
* Parslet really is made for parsing source code currently. This
means relatively small input files with a high amount of complexity.
Implementation choices that derive from that will make it unsuitable for
parsing files with low complexity that are large in size.
* Choice of Ruby implementation affects parslet heavily. You might
experiment with other implementations (REE?).
That's not to say I don't want to lift these limitations at some point
in the future; it's just hard to give a schedule for it.
> The parser is for EDI files, specifically EDI 837 for now. Input looks
> like this:
> [... deletia ...]
>
> It's positively atrocious. :)
>
> The reason I am attempting to use parslet is:
>
> 1) Need the output to be in a hierarchical hash that matches the input
> hierarchy. - Parslet output is this already!
> 2) Need to be able to convert every instance of certain segments into
> certain formats - Transform works great!
> 3) Need to be able to handle dirty input that does not follow the
> spec. Other solutions out there for these file types either require
> that data follow the specification and proper segment order or are
> cumbersome to customize.
> 4) Needs to be in Ruby.
If it has to be parslet, you might want to throw hardware at this
problem right now. When I look at your grammar specification, I notice
that you might be better off with a hand-written parser; your grammar
isn't very complex (it's akin to the simpler parts of XML) and you
mention wanting to parse degenerate input - something parslet will fail
horribly at.
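A hand-written reader for segment-based input can stay tiny and tolerant.
A sketch (not 837-aware; it assumes the common X12 defaults of
'~'-terminated segments and '*'-separated elements - real code should
take the delimiters from the ISA header instead of hard-coding them):

```ruby
# Minimal hand-written EDI segment reader (sketch).
# Assumes '~' terminates segments and '*' separates elements.
def each_segment(input)
  input.split('~').each do |raw|
    raw = raw.strip
    next if raw.empty?          # tolerate stray whitespace/blank segments
    id, *elements = raw.split('*')
    yield id, elements
  end
end

segments = []
each_segment("ST*837*0001~BHT*0019*00~") { |id, els| segments << [id, els] }
```

Because each segment is handled and discarded as you go, memory stays
flat regardless of input size - exactly the property parslet's packrat
cache takes away from you.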
> Splitting the input and parsing it top-level group at a time, and sort
> of building the hash hierarchy myself, I can keep the memory usage
> down.
This sounds like another good practical solution to me.
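That approach could look like this (a sketch; using 'GS*' as the group
boundary is an assumption, and parse_group here is a stand-in for a
per-group parslet parser that builds one branch of your hash):

```ruby
# Sketch of group-at-a-time parsing to bound peak memory.
# parse_group stands in for a real per-group parser.
def parse_group(group)
  { segments: group.split('~').map(&:strip).reject(&:empty?) }
end

def parse_in_chunks(input)
  # The zero-width lookahead keeps the 'GS*' marker with each chunk.
  input.split(/(?=GS\*)/).reject(&:empty?).map { |g| parse_group(g) }
end

groups = parse_in_chunks("GS*1~AA~GS*2~BB~")
```

Peak memory is then bounded by the largest single group instead of the
whole file, since each parse (and its packrat cache) is thrown away
before the next group starts.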
> I am trying to find ways that maybe I can write the rules better, but
> I might just have to parse smaller bits of it at a time. Any tips or
> ideas you have, I'll try.
I am not convinced that there is a lot you can do. For an input/parser
combination like this:
'abc'
str('a') >> str('b').as(:b) >> str('c')
parslet will first build this internally:
['a'@0, {:b => 'b'@1}, 'c'@2]
which will then at the very end be reduced to:
{:b => 'b'@1}
So jiggling the rules will probably not affect memory footprint during
parsing.
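To make that concrete in plain Ruby (an illustration of the shape of the
data, not parslet's actual internals):

```ruby
# What parslet holds during the parse: everything, captured or not.
intermediate = ['a', { b: 'b' }, 'c']

# The reduction at the very end keeps only the .as() captures.
reduced = intermediate.grep(Hash).reduce({}, :merge)
```

The full intermediate structure lives for the whole parse either way,
which is why rewriting rules doesn't change the peak footprint.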
regards,
kaspar