Re: [ruby.parslet] parsing large input

Melissa Whittington Thu, 01 Dec 2011 15:08:08 -0800

You're correct that the strings of ***** are runs of empty fields.

There is always a certain segment that begins a new level, or group of segments.
Here are all the important segment names and their hierarchy:
https://gist.github.com/1420503


The example I pasted is actually the beginning of a transaction, which
is already a few nested levels deep.
Here I structured the example from my previous email in a similar way:
https://gist.github.com/1420518

-mj

On Thu, Dec 1, 2011 at 5:23 PM, Chris Corbyn <[email protected]> wrote:
> Hi Melissa,
>
> So is this example one "document", where "ST" is always the header that 
> starts the document and each subsequent header inside it starts a new group 
> within the document?  Presumably the parts with long strings of ***** mean 
> there are empty fields in that segment.  Trying to get my head around how 
> that hierarchy looks in practice.  Can you add whitespace to the actual 
> input, for display purposes only, to indicate where this grouping and nesting 
> is occurring in that excerpt you posted?
>
> Cheers,
>
> Chris
>
>
> On 02/12/2011, at 6:35 AM, Melissa Whittington wrote:
>
>> Sorry, I was out on vacation for a while. Back now. :)
>>
>> Input of 15 MB: 4 minutes of runtime, and it's using 1.9 GB of RAM (only 2
>> GB total).
>> Input of 1 MB used 1.1 GB of RAM and took 3:21 minutes to complete.
>> Input of 0.25 MB used 308 MB of RAM and took a little over 22 seconds
>> to complete.
>> I have not tried Treetop.
>>
>> The parser is for EDI files, specifically EDI 837 for now. Input looks
>> like this:
>>
>> ST*837*987654~BHT*0019*00*A12345*20010801*1800~NM1*41*2*TOO GOOD
>> HOSPITAL*****46*999008888~NM1*40*2*MY STATE DATA
>> AGENCY*****46*12000~HL*1**20*1~NM1*85*2*TOO GOOD
>> HOSPITAL*****24*999008888~REF*1J*898989~HL*2*1*22*1~SBR*P********BL~
>> NM1*IL*1*GREENE*SCOTT*A**MI*GRNESSC1234~N3*1313 MOCKINGBIRD
>> LANE~N4*ANYTOWN*NY*09090~DMG*D8*19760706*M**::RET:3::RET:2~REF*SY*130281234~
>>
>> Three delimiters, ~, *, and :.
>> ~ is the segment delimiter.
>> * is the field delimiter within a segment.
>> : is the subfield delimiter within a field.
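
A minimal Ruby sketch of how those three delimiters nest, using nothing but String#split on two segments from the excerpt above. The constant and method names are made up for illustration, and this is not how the parslet rules do it.

```ruby
# Hypothetical names; the delimiters are the ones described above.
SEGMENT_DELIMITER  = '~'
FIELD_DELIMITER    = '*'
SUBFIELD_DELIMITER = ':'

# Split raw EDI text into segments, each an array of fields,
# each field an array of subfields.
def split_edi(raw)
  raw.split(SEGMENT_DELIMITER).map do |segment|
    # The -1 limit keeps trailing empty fields instead of dropping them.
    segment.split(FIELD_DELIMITER, -1).map do |field|
      # Note: a completely empty field comes back as an empty array.
      field.split(SUBFIELD_DELIMITER, -1)
    end
  end
end

sample = 'REF*SY*130281234~DMG*D8*19760706*M**::RET:3::RET:2~'
segments = split_edi(sample)

# The first field of each segment is its header.
segments.each { |fields| puts fields.first.first }
```

Plain splitting like this is cheap, but it gives no hierarchy and no error handling, which is the part the grammar is there for.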
>>
>> (For reference, the 15 MB input has 643k segments, the 1 MB input has 44k
>> segments, and the 0.25 MB input has 11k segments.)
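
Dividing the figures above by the segment counts gives a rough per-segment cost. This is only back-of-the-envelope arithmetic on the numbers quoted in this message, taking 1 GB as 1000 MB and "a little over 22 seconds" as 22. The 15 MB run is left out because its 1.9 GB reading sits right at the 2 GB total.

```ruby
# Measurements copied from this message; 1 GB is taken as 1000 MB and
# "a little over 22 seconds" is rounded down to 22.
RUNS = [
  { input_mb: 0.25, segments: 11_000, ram_mb: 308,   seconds: 22 },
  { input_mb: 1.0,  segments: 44_000, ram_mb: 1_100, seconds: 3 * 60 + 21 }
].freeze

# Returns [KB of RAM per segment, milliseconds per segment].
def per_segment(run)
  kb = run[:ram_mb] * 1000.0 / run[:segments]
  ms = run[:seconds] * 1000.0 / run[:segments]
  [kb.round(1), ms.round(2)]
end

RUNS.each do |run|
  kb, ms = per_segment(run)
  puts format('%.2f MB input: %.1f KB RAM and %.2f ms per segment',
              run[:input_mb], kb, ms)
end
```

On these two runs the memory per segment stays roughly flat at about 25 to 28 KB, so RAM appears to grow about linearly with input size, while the time per segment more than doubles.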
>>
>> The first field of a segment is the header that says what kind of segment it 
>> is.
>> Segments can be grouped together in loops that can be repeated.
>>
>> The hierarchy of input goes like this:
>>
>> Document 1
>> - Group 1a
>> - - Group 2a
>> - - - Group 3a
>> - - - - Group 4a
>> - - - - Group 4b
>> - - - Group 3b
>> - - - - Group 4a
>> - - - - Group 4b
>> - - Group 2b
>> ... etc.
>> - Group 1b
>> ... etc.
>> Document 2
>> ... etc.
>>
>> Each group can be repeated any number of times below its parent group.
>> Each group has a specification of certain beginning segments, loops of
>> segments, sub groups, and ending segments.
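
One way to picture that nesting in plain Ruby, as a rough sketch only. It assumes that an HL segment carries its own ID in its first field and its parent's ID in the second, as HL*1**20*1 and HL*2*1*22*1 in the excerpt suggest, and that every other segment belongs to the most recent HL. The method name and hash keys are invented for illustration. The real groups are opened by other segments as well.

```ruby
# Build a nested structure from a flat list of segments (arrays of fields).
# Assumption: an HL segment is ["HL", id, parent_id, level_code, child_code].
def nest_by_hl(segments)
  roots = []
  by_id = {}
  current = nil

  segments.each do |fields|
    if fields[0] == 'HL'
      node = { id: fields[1], level: fields[3], segments: [], children: [] }
      by_id[node[:id]] = node
      parent = by_id[fields[2]]
      # An empty or unknown parent ID makes this a top-level group.
      (parent ? parent[:children] : roots) << node
      current = node
    elsif current
      current[:segments] << fields
    end
    # Segments before the first HL (ST, BHT, ...) are ignored in this sketch.
  end

  roots
end

raw = 'HL*1**20*1~NM1*85*2*TOO GOOD HOSPITAL*****24*999008888~REF*1J*898989~' \
      'HL*2*1*22*1~SBR*P********BL~'
flat = raw.split('~').map { |segment| segment.split('*', -1) }
tree = nest_by_hl(flat)
```

Here tree holds a single top-level group (HL 1, owning NM1 and REF) with one child group (HL 2, owning SBR).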
>>
>> It's positively atrocious. :)
>>
>> The reason I am attempting to use parslet is:
>>
>> 1) Need the output to be in a hierarchical hash that matches the input
>> hierarchy. - Parslet output is this already!
>> 2) Need to be able to convert every instance of certain segments into
>> certain formats - Transform works great!
>> 3) Need to be able to handle dirty input that does not follow the
>> spec. Other solutions out there for these file types either require
>> that data follow the specification and proper segment order or are
>> cumbersome to customize.
>> 4) Needs to be in Ruby.
>>
>> So I wrote generic rules that define fields and segments, and on top of
>> those I wrote a rule for each specific segment.
>> Using those segment rules, I can quite nicely define a parser for EDI
>> 837 input. A rule looks like this:
>>
>>  rule(:entity) do
>>    nm1.as(:_nm1).as(:name) >>
>>    address.as(:address).maybe >>
>>    dates.maybe >>
>>    dmg.as(:_dmg).as(:demographics).maybe >>
>>    prv.as(:_prv).as(:speciality).maybe >>
>>    ref.as(:_ref).repeat.as(:_merge).as(:reference).maybe >>
>>    per.as(:_per).repeat.as(:contact).maybe
>>  end
>>
>> address/dates are rules for groups of segments.
>> nm1, dmg, prv, ref, per are all specific segment rules, which are
>> defined like this:
>>
>> rule(:nm1)    { str('NM1').as(:id) >> fields }
>> rule(:fields) { field.repeat.as(:fields) >> segment_delimiter }
>> rule(:field)  { field_delimiter >> data.repeat(0, nil).as(:field) }
>>
>> Everything except the first segment in a rule is .maybe because it
>> might not be there.
>> When I need to handle a segment that doesn't follow the spec, I can
>> write a new :entity rule (for example) that includes a different set
>> of segments, and ta-da! Working parser!
>>
>> By splitting the input, parsing it one top-level group at a time, and
>> sort of building the hash hierarchy myself, I can keep the memory usage
>> down.
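
A minimal sketch of that splitting step, assuming '~' ends every segment and an ST segment opens each top-level group. The method name is hypothetical, and the second ST segment in the sample is made up so that there are two groups to show.

```ruby
require 'stringio'

# Yield one top-level group at a time (everything from an ST segment up to
# the next ST), so only one chunk has to be held and parsed at once.
# Assumptions: '~' ends every segment and 'ST' opens each top-level group.
def each_transaction(io)
  chunk = []
  io.each_line('~') do |line|
    segment = line.chomp('~').strip
    next if segment.empty?

    if segment.start_with?('ST*') && !chunk.empty?
      yield chunk
      chunk = []
    end
    chunk << segment
  end
  yield chunk unless chunk.empty?
end

# First two segments are from the excerpt; the second group is invented.
input = StringIO.new(
  'ST*837*987654~BHT*0019*00*A12345*20010801*1800~' \
  'ST*837*987655~BHT*0019*00*A12345*20010801*1800~'
)

chunks = []
each_transaction(input) { |chunk| chunks << chunk }
puts chunks.length
```

With a File in place of the StringIO, the input is read segment by segment rather than loaded whole, and each chunk can be joined back with '~' and handed to the parser on its own.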
>>
>> I am trying to find ways that maybe I can write the rules better, but
>> I might just have to parse smaller bits of it at a time. Any tips or
>> ideas you have, I'll try.
>>
>> Hope I didn't bore anyone too much. :)
>>
>> -mj
>>
>> On Fri, Nov 25, 2011 at 4:15 AM, Kaspar Schiess <[email protected]> wrote:
>>> Hei Melissa,
>>>
>>> In short: Please, give us something to work with, we'll try to improve!
>>>
>>> kaspar
>>>
>
