If no pre-built solution exists, writing your own would not be that difficult. I suggest looking at a parser combinator library such as FastParse.
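A very rough sketch of what a FastParse-based segment tokenizer could look like follows. The '*' element separator and '~' segment terminator are hard-coded here for brevity and the object/function names are only illustrative; a real X12 parser would read its delimiters from the ISA header.

    import fastparse._, NoWhitespace._

    // Sketch only: delimiters hard-coded, no standards awareness.
    object EdiSegmentSketch {

      // One element: any run of characters up to the next '*', '~' or newline.
      def element[_: P]: P[String] =
        P(CharsWhile(c => c != '*' && c != '~' && c != '\r' && c != '\n', 0).!)

      // One segment: elements separated by '*', terminated by '~'
      // plus any trailing whitespace/newlines.
      def segment[_: P]: P[Seq[String]] =
        P(element.rep(1, sep = "*") ~ "~" ~ CharsWhileIn(" \r\n", 0))

      // A whole interchange is just a run of segments.
      def interchange[_: P]: P[Seq[Seq[String]]] =
        P(segment.rep ~ End)

      def main(args: Array[String]): Unit = {
        val sample = "ST*834*0001*005010X220A1~\nDTP*007*D8*20150301~"
        parse(sample, interchange(_)) match {
          case Parsed.Success(segs, _) =>
            // Prints: ST|834|0001|005010X220A1  then  DTP|007|D8|20150301
            segs.foreach(s => println(s.mkString("|")))
          case f: Parsed.Failure =>
            println(s"Parse failed: $f")
        }
      }
    }

From tokenized segments like these, mapping to the relevant standard and emitting JSON is application logic on top.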
http://www.lihaoyi.com/fastparse/

Regards,
Kurt

On Tue, Mar 13, 2018 at 7:47 AM Aakash Basu <aakash.spark....@gmail.com> wrote:

> Thanks again for the detailed explanation, I would like to go through it.
>
> In my case, I'm having to parse large-scale *.as2*, *.P3193*, *.edi*, and
> *.txt* data, mapping it to the respective standards and then building a JSON
> (so XML doesn't come into the picture), containing the following (small
> example of EDI) -
>
> ISA*00* *00* *ZZ*D00XXX *ZZ*00AA *070305*1832*^*00501*676048320*0*P*\~
> GS*BE*D00XXX*00AA*20150305*1832*260007982*X*005010X220A1~
> ST*834*0001*005010X220A1~
> BGN*00*88880070301 00*20150305*181245****4~
> DTP*007*D8*20150301~
> N1*P5*PAYER 1*FI*999999999~
> N1*IN*KCMHSAS*FI*999999999~
> INS*Y*18*030*XN*A*C **FT~
> REF*0F*00389999~
> REF*1L*000003409999~
> REF*3H*K129999A~
> DTP*356*D8*20150301~
> NM1*IL*1*DOE*JOHN*A***34*999999999~
> N3*777 ELM ST~
> N4*ALLEGAN*MI*49010**CY*03~
> DMG*D8*19670330*M**O~
> LUI***ESSPANISH~
> HD*030**AK*064703*IND~
> DTP*348*D8*20150301~
> AMT*P3*45.34~
> REF*17*E 1F~
> SE*20*0001~
> GE*1*260007982~
> IEA*1*676048320~
>
> Thanks,
> Aakash.
>
> On Tue, Mar 13, 2018 at 6:37 PM, Darin McBeath <ddmcbe...@yahoo.com> wrote:
>
>> I'm not familiar with EDI, but perhaps one option might be
>> spark-xml-utils (https://github.com/elsevierlabs-os/spark-xml-utils).
>> You could transform the XML to the XML format required by the xml-to-json
>> function and then return the JSON. Spark-xml-utils wraps the open source
>> Saxon project and supports XPath, XQuery, and XSLT. Spark-xml-utils
>> doesn't parallelize the parsing of an individual document, but if you have
>> your documents split across a cluster, the processing can be parallelized.
>> We use this package extensively within our company to process millions of
>> XML records. If you happen to be attending Spark Summit in a few months,
>> someone will be presenting on this topic
>> (https://databricks.com/session/mining-the-worlds-science-large-scale-data-matching-and-integration-from-xml-corpora).
>>
>> Below is a snippet for XQuery.
>>
>> let $retval :=
>>   <map>
>>     <string key="doi">{$doi}</string>
>>     <string key="cid">{$cid}</string>
>>     <string key="pii">{$pii}</string>
>>     <string key="contentType">{$content-type}</string>
>>     <string key="srctitle">{$srctitle}</string>
>>     <string key="documentType">{$document-type}</string>
>>     <string key="documentSubtype">{$document-subtype}</string>
>>     <string key="publicationDate">{$publication-date}</string>
>>     <string key="articleTitle">{$article-title}</string>
>>     <string key="issn">{$issn}</string>
>>     <string key="isbn">{$isbn}</string>
>>     <string key="lang">{$lang}</string>
>>     {$tables}
>>   </map>
>>
>> return xml-to-json($retval)
>>
>> Darin.
>>
>> On Tuesday, March 13, 2018, 8:52:42 AM EDT, Aakash Basu
>> <aakash.spark....@gmail.com> wrote:
>>
>> Hi Jörn,
>>
>> Thanks for the quick reply. I already built an EDI-to-JSON parser from
>> scratch using the 811 and 820 standard mapping documents. It can run on
>> any standard and for any type of EDI. But my implementation is in native
>> Python and doesn't leverage Spark's parallel processing, which I want to
>> do for large amounts of EDI data.
>>
>> Any pointers on that?
>>
>> Thanks,
>> Aakash.
>>
>> On Tue, Mar 13, 2018 at 3:44 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>>
>> Maybe there are commercial ones. You could also use some of the open
>> source parsers for XML.
>>
>> However, XML is very inefficient and you need to do a lot of tricks to
>> make it run in parallel. This also depends on the type of EDI message,
>> etc. Sophisticated unit testing and performance testing is key.
>>
>> Nevertheless, it is not as difficult as I made it sound just now.
>>
>> > On 13. Mar 2018, at 10:36, Aakash Basu <aakash.spark....@gmail.com> wrote:
>> >
>> > Hi,
>> >
>> > Has anyone built a parallel, large-scale X12 EDI parser to XML or
>> > JSON using Spark?
>> >
>> > Thanks,
>> > Aakash.
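On the question earlier in the thread about leveraging Spark's parallel processing: once a per-document parser exists (FastParse-based, the existing Python one, or anything else), the usual approach is to keep each interchange intact and distribute whole files across the cluster. Below is a rough Scala sketch under that assumption; the input/output paths are hypothetical and the ediToJson placeholder is a crude stand-in, not a standards-aware converter.

    import org.apache.spark.sql.SparkSession

    // Sketch only: assumes one EDI interchange per file.
    object EdiToJsonJob {

      // Crude placeholder: split segments on '~' and elements on '*',
      // emitting a JSON array of segment arrays. Replace with a real
      // standards-aware parser.
      def ediToJson(edi: String): String = {
        val segments = edi.split('~').map(_.trim).filter(_.nonEmpty)
        segments.map { seg =>
          seg.split('*')
            .map(e => "\"" + e.replace("\\", "\\\\").replace("\"", "\\\"") + "\"")
            .mkString("[", ",", "]")
        }.mkString("[", ",", "]")
      }

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("edi-to-json").getOrCreate()

        // wholeTextFiles keeps each file intact, so each (unsplittable)
        // interchange is parsed by a single task while the files themselves
        // are spread across the cluster.
        val jsonDocs = spark.sparkContext
          .wholeTextFiles("hdfs:///data/edi/*.edi")      // hypothetical path
          .mapValues(ediToJson)

        jsonDocs.values.saveAsTextFile("hdfs:///data/edi-json")  // hypothetical path
        spark.stop()
      }
    }

The same shape works if the existing Python parser is to be reused: read whole files in PySpark and map the parse function over them, so parallelism comes from the number of documents rather than from splitting any single one.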