Correct about strings of ***** being a bunch of empty fields. There is always a certain segment that begins a new level, or group of segments. Here are all the important segment names and their hierarchy: https://gist.github.com/1420503
The example I pasted is actually the beginning of a transaction, which is already a few nested levels deep. Here I structured the example from my previous email in a similar way: https://gist.github.com/1420518

-mj

On Thu, Dec 1, 2011 at 5:23 PM, Chris Corbyn <[email protected]> wrote:
> Hi Melissa,
>
> So is this example one "document", where "ST" is always the header that
> starts the document and each subsequent header inside it starts a new group
> within the document? Presumably the parts with long strings of ***** means
> there are empty fields in that segment. Trying to get my head around how
> that hierarchy looks in practice. Can you add whitespace to the actual
> input, for display purposes only, to indicate where this grouping and nesting
> is occurring in that excerpt you posted?
>
> Cheers,
>
> Chris
>
>
> On 02/12/2011, at 6:35 AM, Melissa Whittington wrote:
>
>> Sorry, I was out on vacation for a while. Back now. :)
>>
>> Input of 15 MB, 4 minutes runtime and it's using 1.9 GB of RAM (only 2
>> GB total).
>> Input of 1 MB used 1.1 GB of RAM and took 3:21 minutes to complete.
>> Input of .25 MB used 308 MB of RAM and took a little over 22 seconds
>> to complete.
>> I have not tried treetop.
>>
>> The parser is for EDI files, specifically EDI 837 for now. Input looks
>> like this:
>>
>> ST*837*987654~BHT*0019*00*A12345*20010801*1800~NM1*41*2*TOO GOOD
>> HOSPITAL*****46*999008888~NM1*40*2*MY STATE DATA
>> AGENCY*****46*12000~HL*1**20*1~NM1*85*2*TOO GOOD
>> HOSPITAL*****24*999008888~REF*1J*898989~HL*2*1*22*1~SBR*P********BL~
>> NM1*IL*1*GREENE*SCOTT*A**MI*GRNESSC1234~N3*1313 MOCKINGBIRD
>> LANE~N4*ANYTOWN*NY*09090~DMG*D8*19760706*M**::RET:3::RET:2~REF*SY*130281234~
>>
>> Three delimiters, ~, *, and :.
>> ~ is the segment delimiter.
>> * is the field delimiter within a segment.
>> : is the subfield delimiter within a field.
>>
>> (For reference, the 15 MB input has 643k segments, 1 MB input has 44k
>> segments, and .25 MB input has 11k segments.)
>>
>> The first field of a segment is the header that says what kind of segment it
>> is.
>> Segments can be grouped together in loops that can be repeated.
>>
>> The hierarchy of input goes like this:
>>
>> Document 1
>> - Group 1a
>> - - Group 2a
>> - - - Group 3a
>> - - - - Group 4a
>> - - - - Group 4b
>> - - - Group 3b
>> - - - - Group 4a
>> - - - - Group 4b
>> - - Group 2b
>> ... etc.
>> - Group 1b
>> .. etc.
>> Document 2
>> ... etc.
>>
>> Each group can be repeated any number of times below its parent group.
>> Each group has a specification of certain beginning segments, loops of
>> segments, sub groups, and ending segments.
>>
>> It's positively atrocious. :)
>>
>> The reason I am attempting to use parslet is:
>>
>> 1) Need the output to be in a hierarchal hash that matches the input
>> hierarchy. - Parslet output is this already!
>> 2) Need to be able to convert every instance of certain segments into
>> certain formats - Transform works great!
>> 3) Need to be able to handle dirty input that does not follow the
>> spec. Other solutions out there for these file types either require
>> that data follow the specification and proper segment order or are
>> cumbersome to customize.
>> 4) Needs to be in Ruby.
>>
>> So I wrote rules that define fields and segments, with which I wrote
>> rules to define each segment.
>> Using those segment rules, I can quite nicely define a parser for EDI
>> 837 input. A rule looks like this:
>>
>> rule(:entity) do
>>   nm1.as(:_nm1).as(:name) >>
>>   address.as(:address).maybe >>
>>   dates.maybe >>
>>   dmg.as(:_dmg).as(:demographics).maybe >>
>>   prv.as(:_prv).as(:speciality).maybe >>
>>   ref.as(:_ref).repeat.as(:_merge).as(:reference).maybe >>
>>   per.as(:_per).repeat.as(:contact).maybe
>> end
>>
>> address/dates are rules for groups of segments.
>> nm1, dmg, prv, ref, per are all specific segment rules, which are
>> defined like this:
>>
>> rule(:nm1) { str('NM1').as(:id) >> fields }
>> rule(:fields) { field.repeat.as(:fields) >> segment_delimiter }
>> rule(:field) { field_delimiter >> data.repeat(0,nil).as(:field) }
>>
>> Everything except the first segment in a rule is .maybe because it
>> might not be there.
>> When I need to handle a segment that doesn't follow the spec, I can
>> write a new :entity rule (for example) that includes a different set
>> of segments, and ta-da! Working parser!
>>
>> Splitting the input and parsing it top-level group at a time, and sort
>> of building the hash hierarchy myself, I can keep the memory usage
>> down.
>>
>> I am trying to find ways that maybe I can write the rules better, but
>> I might just have to parse smaller bits of it at a time. Any tips or
>> ideas you have, I'll try.
>>
>> Hope I didn't bore anyone too much. :)
>>
>> -mj
>>
>> On Fri, Nov 25, 2011 at 4:15 AM, Kaspar Schiess <[email protected]> wrote:
>>> Hei Melissa,
>>>
>>> In short: Please, give us something to work with, we'll try to improve!
>>>
>>> kaspar
>>>
>
