Re: [ruby.parslet] parsing large input

Chris Corbyn Thu, 01 Dec 2011 14:23:21 -0800

Hi Melissa,

So is this example one "document", where "ST" is always the header that starts 
the document and each subsequent header inside it starts a new group within the 
document?  Presumably the parts with long strings of ***** mean there are 
empty fields in that segment.  Trying to get my head around how that hierarchy 
looks in practice.  Can you add whitespace to the actual input, for display 
purposes only, to indicate where this grouping and nesting is occurring in that 
excerpt you posted?


Cheers,

Chris


On 02/12/2011, at 6:35 AM, Melissa Whittington wrote:

> Sorry, I was out on vacation for a while. Back now. :)
> 
> Input of 15 MB, 4 minutes runtime and it's using 1.9 GB of RAM (only 2
> GB total).
> Input of 1 MB used 1.1 GB of RAM and took 3:21 minutes to complete.
> Input of .25 MB used 308 MB of RAM and took a little over 22 seconds
> to complete.
> I have not tried treetop.
> 
> The parser is for EDI files, specifically EDI 837 for now. Input looks
> like this:
> 
> ST*837*987654~BHT*0019*00*A12345*20010801*1800~NM1*41*2*TOO GOOD
> HOSPITAL*****46*999008888~NM1*40*2*MY STATE DATA
> AGENCY*****46*12000~HL*1**20*1~NM1*85*2*TOO GOOD
> HOSPITAL*****24*999008888~REF*1J*898989~HL*2*1*22*1~SBR*P********BL~
> NM1*IL*1*GREENE*SCOTT*A**MI*GRNESSC1234~N3*1313 MOCKINGBIRD
> LANE~N4*ANYTOWN*NY*09090~DMG*D8*19760706*M**::RET:3::RET:2~REF*SY*130281234~
> 
> Three delimiters, ~, *, and :.
> ~ is the segment delimiter.
> * is the field delimiter within a segment.
> : is the subfield delimiter within a field.
> 
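
To make the three delimiter levels concrete, here is a minimal plain-Ruby sketch (no Parslet involved) that splits raw text into segments, fields and subfields. The constant names and the split_segments helper are hypothetical, not part of the parser being discussed, and it assumes the delimiters never appear escaped inside data.

```ruby
# Assumed delimiters, taken from the description in the quoted message.
SEGMENT_DELIMITER  = "~"
FIELD_DELIMITER    = "*"
SUBFIELD_DELIMITER = ":"

# Hypothetical helper: split raw EDI text into segments. Each segment becomes
# a hash holding its header (the first field) and the remaining fields.
# A field that contains the subfield delimiter is split into an array.
def split_segments(input)
  input.split(SEGMENT_DELIMITER).map(&:strip).reject(&:empty?).map do |segment|
    # The -1 limit keeps trailing empty fields instead of dropping them.
    id, *fields = segment.split(FIELD_DELIMITER, -1)
    fields = fields.map do |field|
      field.include?(SUBFIELD_DELIMITER) ? field.split(SUBFIELD_DELIMITER, -1) : field
    end
    { id: id, fields: fields }
  end
end

segments = split_segments("ST*837*987654~DMG*D8*19760706*M**::RET:3::RET:2~")
segments.each { |segment| p segment }
```

Empty fields are kept as empty strings, so a run such as ***** stays positionally meaningful.
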
> (For reference, the 15 MB input has 643k segments, 1 MB input has 44k
> segments, and .25 MB input has 11k segments.)
> 
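
A quick back-of-the-envelope calculation on the two runs that completed, as a rough sketch only. It rounds "a little over 22 seconds" down to 22, treats 1.1 GB as 1100 MB, and leaves out the 15 MB run because it is not clear from the numbers whether that one finished.

```ruby
# Figures quoted in the message; the rounding noted above is an assumption.
runs = [
  { size_mb: 0.25, segments: 11_000, seconds: 22,  ram_mb: 308 },
  { size_mb: 1.0,  segments: 44_000, seconds: 201, ram_mb: 1100 },
]

rates = runs.map do |run|
  {
    size_mb: run[:size_mb],
    segments_per_second: (run[:segments].to_f / run[:seconds]).round,
    ram_kb_per_segment: (run[:ram_mb] * 1000.0 / run[:segments]).round(1),
  }
end

rates.each { |rate| puts rate }
```

On these figures, four times as many segments took roughly nine times as long, while memory per segment stayed about the same. That would suggest runtime is growing faster than linearly with input size, whereas memory grows roughly in proportion to it.
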
> The first field of a segment is the header that says what kind of segment it 
> is.
> Segments can be grouped together in loops that can be repeated.
> 
> The hierarchy of input goes like this:
> 
> Document 1
> - Group 1a
> - - Group 2a
> - - - Group 3a
> - - - - Group 4a
> - - - - Group 4b
> - - - Group 3b
> - - - - Group 4a
> - - - - Group 4b
> - - Group 2b
> ... etc.
> - Group 1b
> .. etc.
> Document 2
> ... etc.
> 
> Each group can be repeated any number of times below its parent group.
> Each group has a specification of certain beginning segments, loops of
> segments, sub groups, and ending segments.
> 
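
One way to picture the nesting, as a sketch under a stated assumption: in the excerpt, HL*1**20*1 is followed later by HL*2*1*22*1, which suggests that the first field of an HL segment is its own id and the second field is the id of its parent. If that holds, a tree can be rebuilt from a flat segment list in plain Ruby. build_hl_tree is a hypothetical helper, not something Parslet provides, and the segments in the usage line are shortened for display.

```ruby
# Hypothetical helper: collect the segments that follow each HL segment, then
# attach every HL group to its parent via the (assumed) parent id field.
# Segments that appear before the first HL segment are ignored here.
def build_hl_tree(segments)
  groups  = {}  # HL id => group hash
  roots   = []  # groups whose parent id is empty or unknown
  current = nil

  segments.each do |segment|
    fields = segment.split("*", -1)
    if fields[0] == "HL"
      current = { id: fields[1], segments: [], children: [] }
      groups[current[:id]] = current
      parent = groups[fields[2]]
      (parent ? parent[:children] : roots) << current
    elsif current
      current[:segments] << segment
    end
  end

  roots
end

tree = build_hl_tree(%w[HL*1**20*1 NM1*85*2 REF*1J*898989 HL*2*1*22*1 SBR*P])
p tree
```

This only covers the HL-driven part of the nesting. Groups that the specification defines purely by segment order would still need grammar rules.
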
> It's positively atrocious. :)
> 
> The reason I am attempting to use parslet is:
> 
> 1) Need the output to be in a hierarchical hash that matches the input
> hierarchy. - Parslet output is this already!
> 2) Need to be able to convert every instance of certain segments into
> certain formats - Transform works great!
> 3) Need to be able to handle dirty input that does not follow the
> spec. Other solutions out there for these file types either require
> that data follow the specification and proper segment order or are
> cumbersome to customize.
> 4) Needs to be in Ruby.
> 
> So I wrote rules that define fields and segments, with which I wrote
> rules to define each segment.
> Using those segment rules, I can quite nicely define a parser for EDI
> 837 input. A rule looks like this:
> 
>  rule(:entity) do
>    nm1.as(:_nm1).as(:name) >>
>    address.as(:address).maybe >>
>    dates.maybe >>
>    dmg.as(:_dmg).as(:demographics).maybe >>
>    prv.as(:_prv).as(:speciality).maybe >>
>    ref.as(:_ref).repeat.as(:_merge).as(:reference).maybe >>
>    per.as(:_per).repeat.as(:contact).maybe
>  end
> 
> address/dates are rules for groups of segments.
> nm1, dmg, prv, ref, per are all specific segment rules, which are
> defined like this:
> 
> rule(:nm1)    { str('NM1').as(:id) >> fields }
> rule(:fields) { field.repeat.as(:fields) >> segment_delimiter }
> rule(:field)  { field_delimiter >> data.repeat(0, nil).as(:field) }
> 
> Everything except the first segment in a rule is .maybe because it
> might not be there.
> When I need to handle a segment that doesn't follow the spec, I can
> write a new :entity rule (for example) that includes a different set
> of segments, and ta-da! Working parser!
> 
> Splitting the input and parsing it top-level group at a time, and sort
> of building the hash hierarchy myself, I can keep the memory usage
> down.
> 
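
The splitting approach described above could be sketched roughly like this, assuming that every document starts with an ST segment and that ~ never occurs inside data. each_document is a hypothetical helper. It reads the input one segment at a time and hands over one document's worth of segments per iteration, so only the current document is held in memory.

```ruby
require "stringio"

# Hypothetical helper: yield one document at a time as an array of segment
# strings. A new document begins at each segment whose header is ST.
def each_document(io, segment_delimiter = "~")
  io.each_line(segment_delimiter)
    .lazy
    .map { |raw| raw.chomp(segment_delimiter).strip }
    .reject(&:empty?)
    .slice_before { |segment| segment.start_with?("ST*") }
    .each { |document| yield document }
end

input = StringIO.new("ST*837*1~BHT*0019~NM1*41*2~ST*837*2~BHT*0019~")
documents = []
each_document(input) { |document| documents << document }
p documents
```

With a real file, the StringIO would be replaced by a File opened for reading, and each yielded document could be joined back together with the segment delimiter and passed to the Parslet parser on its own.
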
> I am trying to find ways that maybe I can write the rules better, but
> I might just have to parse smaller bits of it at a time. Any tips or
> ideas you have, I'll try.
> 
> Hope I didn't bore anyone too much. :)
> 
> -mj
> 
> On Fri, Nov 25, 2011 at 4:15 AM, Kaspar Schiess wrote:
>> Hei Melissa,
>> 
>> In short: Please, give us something to work with, we'll try to improve!
>> 
>> kaspar
>>
