Re: [ruby.parslet] Re: parsing large input

Melissa Whittington Thu, 01 Dec 2011 11:36:19 -0800

Sorry, I was out on vacation for a while. Back now. :)

Input of 15 MB, 4 minutes runtime and it's using 1.9 GB of RAM (only 2
GB total).
Input of 1 MB used 1.1 GB of RAM and took 3 minutes 21 seconds to complete.
Input of 0.25 MB used 308 MB of RAM and took a little over 22 seconds
to complete.
I have not tried treetop.

The parser is for EDI files, specifically EDI 837 for now. Input looks
like this:

ST*837*987654~BHT*0019*00*A12345*20010801*1800~NM1*41*2*TOO GOOD
HOSPITAL*****46*999008888~NM1*40*2*MY STATE DATA
AGENCY*****46*12000~HL*1**20*1~NM1*85*2*TOO GOOD
HOSPITAL*****24*999008888~REF*1J*898989~HL*2*1*22*1~SBR*P********BL~
NM1*IL*1*GREENE*SCOTT*A**MI*GRNESSC1234~N3*1313 MOCKINGBIRD
LANE~N4*ANYTOWN*NY*09090~DMG*D8*19760706*M**::RET:3::RET:2~REF*SY*130281234~

Three delimiters, ~, *, and :.
~ is the segment delimiter.
* is the field delimiter within a segment.
: is the subfield delimiter within a field.
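
To make the three delimiter levels concrete, here is a minimal plain-Ruby sketch (no Parslet involved) that splits a string on them. The constant names and the `split_edi` helper are made up for illustration, and the sketch assumes the delimiters are always the three characters listed above.

```ruby
# Delimiters as described above; hypothetical constant names.
SEGMENT_DELIMITER  = '~'
FIELD_DELIMITER    = '*'
SUBFIELD_DELIMITER = ':'

# Split raw EDI text into an array of segments, each an array of fields.
# A field that contains the subfield delimiter becomes an array of subfields.
def split_edi(text)
  text.split(SEGMENT_DELIMITER).map do |segment|
    # The -1 limit keeps trailing empty fields, since position matters.
    segment.split(FIELD_DELIMITER, -1).map do |field|
      if field.include?(SUBFIELD_DELIMITER)
        field.split(SUBFIELD_DELIMITER, -1)
      else
        field
      end
    end
  end
end

# Three segments taken from the sample input above.
sample = 'ST*837*987654~DMG*D8*19760706*M**::RET:3::RET:2~REF*SY*130281234~'
split_edi(sample).each { |fields| p fields }
```

This only tokenizes; it knows nothing about which segments belong together, which is the part the parser rules have to handle.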

(For reference, the 15 MB input has 643k segments, the 1 MB input has 44k
segments, and the 0.25 MB input has 11k segments.)

The first field of a segment is the header that says what kind of segment it is.
Segments can be grouped together in loops that can be repeated.
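
As a rough illustration of segments forming repeated loops, the sketch below groups a flat list of already-split segments by looking only at the header field. Treating every NM1 segment as the start of a new loop is an assumption made for the example, not something taken from the 837 spec, and `LOOP_START_IDS` and `group_into_loops` are hypothetical names.

```ruby
# Hypothetical choice: every NM1 segment starts a new loop.
LOOP_START_IDS = ['NM1'].freeze

# Group a flat list of segments (arrays of fields) into loops.
# Segments seen before the first loop start form their own leading group.
def group_into_loops(segments)
  segments.slice_before { |fields| LOOP_START_IDS.include?(fields.first) }.to_a
end

# Segments abbreviated from the sample input above.
flat = [
  %w[BHT 0019 00 A12345 20010801 1800],
  ['NM1', '41', '2', 'TOO GOOD HOSPITAL'],
  ['NM1', 'IL', '1', 'GREENE', 'SCOTT'],
  ['N3', '1313 MOCKINGBIRD LANE'],
  %w[N4 ANYTOWN NY 09090]
]

# Print the segment headers of each loop on one line.
group_into_loops(flat).each do |loop_segments|
  puts loop_segments.map(&:first).join(' ')
end
```

A header-only grouping like this cannot express optional or nested groups, which is where grammar rules become useful.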

The hierarchy of input goes like this:

Document 1
- Group 1a
- - Group 2a
- - - Group 3a
- - - - Group 4a
- - - - Group 4b
- - - Group 3b
- - - - Group 4a
- - - - Group 4b
- - Group 2b
... etc.
- Group 1b
... etc.
Document 2
... etc.

Each group can be repeated any number of times below its parent group.
Each group has a specification of certain beginning segments, loops of
segments, sub groups, and ending segments.

It's positively atrocious. :)

The reason I am attempting to use parslet is:

1) Need the output to be in a hierarchical hash that matches the input
hierarchy. - Parslet output is this already!
2) Need to be able to convert every instance of certain segments into
certain formats - Transform works great!
3) Need to be able to handle dirty input that does not follow the
spec. Other solutions out there for these file types either require
that data follow the specification and proper segment order or are
cumbersome to customize.
4) Needs to be in Ruby.

So I wrote generic rules that define fields and segments, and on top of
those I wrote a rule for each specific segment.
Using those segment rules, I can quite nicely define a parser for EDI
837 input. A rule looks like this:

  rule(:entity) do
    nm1.as(:_nm1).as(:name) >>
    address.as(:address).maybe >>
    dates.maybe >>
    dmg.as(:_dmg).as(:demographics).maybe >>
    prv.as(:_prv).as(:speciality).maybe >>
    ref.as(:_ref).repeat.as(:_merge).as(:reference).maybe >>
    per.as(:_per).repeat.as(:contact).maybe
  end

address/dates are rules for groups of segments.
nm1, dmg, prv, ref, per are all specific segment rules, which are
defined like this:

  rule(:nm1)    { str('NM1').as(:id) >> fields }
  rule(:fields) { field.repeat.as(:fields) >> segment_delimiter }
  rule(:field)  { field_delimiter >> data.repeat(0, nil).as(:field) }

Everything except the first segment in a rule is .maybe because it
might not be there.
When I need to handle a segment that doesn't follow the spec, I can
write a new :entity rule (for example) that includes a different set
of segments, and ta-da! Working parser!

By splitting the input, parsing it one top-level group at a time, and
sort of building the hash hierarchy myself, I can keep the memory usage
down.
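
One way the splitting step might look is sketched below in plain Ruby. It reads the input one segment at a time and hands over the text of one group at a time, so only the current group is held in memory. The method name `each_top_level_group` is hypothetical, and the sketch assumes a top-level group begins at every ST segment; the actual boundary depends on how the files are laid out.

```ruby
require 'stringio'

# Yield the raw text of one top-level group at a time, reading the input
# segment by segment instead of loading the whole file.
# Assumption: a new group starts at every segment whose ID equals start_id.
def each_top_level_group(io, start_id: 'ST', delimiter: '~', field_delimiter: '*')
  chunk = String.new
  io.each_line(delimiter) do |segment|
    # The segment ID is everything before the first field delimiter.
    # lstrip tolerates a line break following the previous delimiter.
    id = segment.chomp(delimiter).lstrip.split(field_delimiter, 2).first
    if id == start_id && !chunk.empty?
      yield chunk
      chunk = String.new
    end
    chunk << segment
  end
  yield chunk unless chunk.empty?
end

# Made-up two-group input; a real run would pass File.open(path) instead.
input = StringIO.new('ST*837*0001~BHT*0019*00*A1~ST*837*0002~BHT*0019*00*A2~')
each_top_level_group(input) do |group_text|
  # This is where a parser would be applied to a single group.
  puts group_text
end
```

Each yielded string can be parsed on its own and the result merged into the larger hash, so the parser's intermediate state never covers more than one group.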

I am trying to find ways that maybe I can write the rules better, but
I might just have to parse smaller bits of it at a time. Any tips or
ideas you have, I'll try.

Hope I didn't bore anyone too much. :)

-mj

On Fri, Nov 25, 2011 at 4:15 AM, Kaspar Schiess <[email protected]> wrote:
> Hei Melissa,
>
> In short: Please, give us something to work with, we'll try to improve!
>
> kaspar
>
