Hei Melissa,
> Input of 15 MB, 4 minutes runtime and it's using 1.9 GB of RAM (only 2
> GB total).
> Input of 1 MB used 1.1 GB of RAM and took 3:21 minutes to complete.
> Input of .25 MB used 308 MB of RAM and took a little over 22 seconds
> to complete.
> I have not tried treetop.
These numbers roughly add up - it looks as if you've hit real limitations
in the current parslet release. Parslet currently keeps a few things
around that likely add up to that memory consumption:
* an unreduced parse result (containing all the itsy bits, even those
you did not capture with .as())
* a packrat cache mapping positions in the file to parse results
(this alone is probably around 300 MB in your case)
* a line cache mapping ranges in the input to line numbers
In addition to that, parslet generates a lot of transient objects
(errors, result objects, etc.), putting load on the GC.
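If you want to quantify that GC pressure, plain MRI can report cumulative
allocation counts. A sketch (the GC.stat key name is the one current MRI
uses; it has varied between Ruby versions):

```ruby
# Rough allocation counter around a piece of work (MRI-specific).
# GC.stat(:total_allocated_objects) is monotonic, so the difference
# is the number of objects allocated inside the block.
def allocations
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
end

# Stand-in workload; you would wrap your parser.parse(input) call instead.
n = allocations { 1_000.times { "transient" * 2 } }
```

Comparing that number across rule changes tells you quickly whether a
rewrite actually reduces allocation, instead of guessing from wall time.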
Two initial remarks on the above:
* Parslet really is made for parsing source code currently. This
means relatively small input files with a high amount of complexity.
Implementation choices that derive from that will make it unsuitable for
parsing files with low complexity that are large in size.
* Choice of Ruby implementation affects parslet heavily. You might
experiment with other implementations (REE?).
That's not to say I don't want to lift these limitations at some point
in the future; it's just hard to give a schedule for it.
> The parser is for EDI files, specifically EDI 837 for now. Input looks
> like this:
> [... deletia ...]
>
> It's positively atrocious. :)
>
> The reason I am attempting to use parslet is:
>
> 1) Need the output to be in a hierarchical hash that matches the input
> hierarchy. - Parslet output is this already!
> 2) Need to be able to convert every instance of certain segments into
> certain formats - Transform works great!
> 3) Need to be able to handle dirty input that does not follow the
> spec. Other solutions out there for these file types either require
> that data follow the specification and proper segment order or are
> cumbersome to customize.
> 4) Needs to be in Ruby.
If it has to be parslet, you might want to throw hardware at this
problem right now. When I look at your grammar specification, I notice
that you might be better off with a hand-written parser; your grammar
isn't very complex (it's akin to the simpler parts of XML) and you
mention wanting to parse degenerate input - something parslet will fail
horribly at.
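A hand-written reader for segment-based input can stay tiny and tolerant.
A sketch (not 837-aware; it assumes the common X12 defaults of
'~'-terminated segments and '*'-separated elements - real code should
take the delimiters from the ISA header instead of hard-coding them):

```ruby
# Minimal hand-written EDI segment reader (sketch).
# Assumes '~' terminates segments and '*' separates elements.
def each_segment(input)
  input.split('~').each do |raw|
    raw = raw.strip
    next if raw.empty?          # tolerate stray whitespace/blank segments
    id, *elements = raw.split('*')
    yield id, elements
  end
end

segments = []
each_segment("ST*837*0001~BHT*0019*00~") { |id, els| segments << [id, els] }
```

Because each segment is handled and discarded as you go, memory stays
flat regardless of input size - exactly the property parslet's packrat
cache takes away from you.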
> Splitting the input and parsing it top-level group at a time, and sort
> of building the hash hierarchy myself, I can keep the memory usage
> down.
This sounds like another good practical solution to me.
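That approach could look like this (a sketch; using 'GS*' as the group
boundary is an assumption, and parse_group here is a stand-in for a
per-group parslet parser that builds one branch of your hash):

```ruby
# Sketch of group-at-a-time parsing to bound peak memory.
# parse_group stands in for a real per-group parser.
def parse_group(group)
  { segments: group.split('~').map(&:strip).reject(&:empty?) }
end

def parse_in_chunks(input)
  # The zero-width lookahead keeps the 'GS*' marker with each chunk.
  input.split(/(?=GS\*)/).reject(&:empty?).map { |g| parse_group(g) }
end

groups = parse_in_chunks("GS*1~AA~GS*2~BB~")
```

Peak memory is then bounded by the largest single group instead of the
whole file, since each parse (and its packrat cache) is thrown away
before the next group starts.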
> I am trying to find ways that maybe I can write the rules better, but
> I might just have to parse smaller bits of it at a time. Any tips or
> ideas you have, I'll try.
I am not convinced that there is a lot you can do. For an input/parser
combination like this:
'abc'
str('a') >> str('b').as(:b) >> str('c')
parslet will first build this internally:
['a'@0, {:b => 'b'@1}, 'c'@2]
which will then at the very end be reduced to:
{:b => 'b'@1}
So jiggling the rules will probably not affect memory footprint during
parsing.
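To make that concrete in plain Ruby (an illustration of the shape of the
data, not parslet's actual internals):

```ruby
# What parslet holds during the parse: everything, captured or not.
intermediate = ['a', { b: 'b' }, 'c']

# The reduction at the very end keeps only the .as() captures.
reduced = intermediate.grep(Hash).reduce({}, :merge)
```

The full intermediate structure lives for the whole parse either way,
which is why rewriting rules doesn't change the peak footprint.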
regards,
kaspar