Steve Lawrence has already pointed out that we have streaming APIs designed for large data that consists of many relatively small data items. Whether that meets your use case or not, I don't know.
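For concreteness, here's a minimal sketch of that pattern. It assumes the Java API (japi) of a recent Daffodil release (older releases have somewhat different signatures), and the schema and data file names are placeholders. Each parse call consumes one item from the input stream and writes its infoset out immediately, so only one item's worth of infoset is in memory at a time:

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.io.OutputStreamWriter;
    import java.io.Writer;

    import org.apache.daffodil.japi.Compiler;
    import org.apache.daffodil.japi.Daffodil;
    import org.apache.daffodil.japi.DataProcessor;
    import org.apache.daffodil.japi.ParseResult;
    import org.apache.daffodil.japi.ProcessorFactory;
    import org.apache.daffodil.japi.infoset.XMLTextInfosetOutputter;
    import org.apache.daffodil.japi.io.InputSourceDataInputStream;

    public class StreamingParseSketch {
        public static void main(String[] args) throws Exception {
            Compiler c = Daffodil.compiler();
            ProcessorFactory pf = c.compileFile(new File("mySchema.dfdl.xsd")); // placeholder
            if (pf.isError())
                throw new RuntimeException(pf.getDiagnostics().toString());
            DataProcessor dp = pf.onPath("/");

            try (InputStream in = new FileInputStream("data.bin")) { // placeholder
                InputSourceDataInputStream dis = new InputSourceDataInputStream(in);
                Writer out = new OutputStreamWriter(System.out);
                // Loop: each parse call consumes one item and streams its
                // infoset out, rather than materializing the whole file.
                while (dis.hasData()) {
                    ParseResult res = dp.parse(dis, new XMLTextInfosetOutputter(out, false));
                    if (res.isError()) {
                        res.getDiagnostics().forEach(d -> System.err.println(d.getMessage()));
                        break;
                    }
                }
                out.flush();
            }
        }
    }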
General streaming, or "event-oriented processing", where the parser calls back into the application as each element is parsed, is on our roadmap for the future. (The alternative style, where the application pulls each event or advances a cursor across the data, is similar; both styles of API are popular.) This has the advantage of eliminating the expanded XML or JSON intermediate data file, and it also eliminates the large memory footprint of the JVM process running Daffodil, which today keeps the entire parsed infoset in memory before writing any of it out.

Streaming does complicate error handling: the application doesn't know whether the data will even parse correctly until it has already started processing parts of it. But in many cases that won't be an impediment, as broken data files aren't common.

Alas, Daffodil doesn't have this yet. It is a pretty high priority for our users generally; you're not the only one asking for it. I should mention that IBM's DFDL implementation *does* have streaming behavior, but it also implements a subset of full DFDL, has different feature coverage than Daffodil, and of course comes with commercial license terms.

In addition, some formats just don't stream well in the unparse direction, the parse direction, or both. One reason is that they contain references such as stored lengths for large blocks of data. When the data doesn't stream well, you simply need a dense representation of it, so that you can tolerate storing it in files. And given that the streaming behavior described above isn't an option in Daffodil today, you doubly need this denser storage.

There have been efforts to standardize denser binary representations for XML, which take out all the tag redundancy and such. EXI is one such "standard" format. There's an open-source, Apache-licensed EXI implementation called Nagasena (http://openexi.sourceforge.net/). I haven't used it, so I can't recommend it specifically, but they do argue for why it's better than just using gzip. There may be other EXI tools as well if you search around. If you try EXI, I'd very much like to hear about your experience with it. (A rough sketch of encoding XML to EXI with Nagasena is appended at the bottom of this message.)

One possible, quite easy enhancement to Daffodil would be the ability to output EXI directly: a Daffodil "InfosetOutputter" could construct EXI instead of XML.

And of course, always storing your XML as ".xml.gz" (or JSON as ".json.gz"), even as it is first created, can help quite a bit as well. It depends on whether the compute burden of gzip/gunzip is acceptable for you. (That idea is also sketched at the bottom of this message.)

I hope that was helpful.

-mike beckerle

________________________________
From: Roberts, Amy L <[email protected]>
Sent: Friday, October 19, 2018 6:11:14 PM
To: [email protected]
Subject: large intermediate data files

Hello Daffodil Community,

I'm interested in using Daffodil to access similar scientific data that's stored in files with several different binary encodings. We've been successful in describing the data format with the DFDL specification, but so far the way we've been using Daffodil is to parse the binary data into an XML or JSON file.

The issue here is that the binary data files are ~2 GB, and both the XML and JSON versions are more than an order of magnitude larger. And analysis can involve anywhere from 1 to 100 files.

Has anyone used Daffodil to unparse data in a way that doesn't create an intermediate file? I'm imagining something that allows me to use Daffodil to look up and stream relevant data; I thought I'd reach out and see if this is a common use case, not possible, or somewhere in between.

Thank you!
Amy
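P.S. Here are the two sketches referenced above. First, the EXI route: an untested sketch of encoding an existing XML file to EXI, based on the Transmogrifier API shown in Nagasena's documentation (file names are placeholders; again, I haven't used Nagasena, so verify against its docs):

    import java.io.FileInputStream;
    import java.io.FileOutputStream;

    import org.openexi.proc.common.GrammarOptions;
    import org.openexi.proc.grammars.GrammarCache;
    import org.openexi.sax.Transmogrifier;
    import org.openexi.schema.EXISchema;
    import org.xml.sax.InputSource;

    public class XmlToExiSketch {
        public static void main(String[] args) throws Exception {
            // Schema-less encoding. Nagasena can also build the grammar
            // from an XML Schema, which should compact better than this.
            GrammarCache grammarCache =
                new GrammarCache((EXISchema) null, GrammarOptions.DEFAULT_OPTIONS);

            Transmogrifier transmogrifier = new Transmogrifier();
            transmogrifier.setGrammarCache(grammarCache);

            try (FileInputStream xmlIn = new FileInputStream("data.xml");   // placeholder
                 FileOutputStream exiOut = new FileOutputStream("data.exi")) {
                transmogrifier.setOutputStream(exiOut);
                // Reads the XML via SAX and writes the EXI encoding.
                transmogrifier.encode(new InputSource(xmlIn));
            }
        }
    }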

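Second, the .xml.gz idea: wrap the infoset writer in a GZIPOutputStream so the XML is compressed as Daffodil produces it and the uncompressed form never touches disk. This is again a sketch against the Java API of a recent Daffodil release; dp and dis would come from a setup like the first sketch above:

    import java.io.FileOutputStream;
    import java.io.OutputStreamWriter;
    import java.io.Writer;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.GZIPOutputStream;

    import org.apache.daffodil.japi.DataProcessor;
    import org.apache.daffodil.japi.ParseResult;
    import org.apache.daffodil.japi.infoset.XMLTextInfosetOutputter;
    import org.apache.daffodil.japi.io.InputSourceDataInputStream;

    public class GzipInfosetSketch {
        // Parse one item and write its XML infoset straight into a .xml.gz
        // file, so the expanded XML is never stored uncompressed.
        static void parseToXmlGz(DataProcessor dp, InputSourceDataInputStream dis,
                                 String gzPath) throws Exception {
            try (Writer out = new OutputStreamWriter(
                    new GZIPOutputStream(new FileOutputStream(gzPath)),
                    StandardCharsets.UTF_8)) {
                ParseResult res = dp.parse(dis, new XMLTextInfosetOutputter(out, false));
                if (res.isError())
                    throw new RuntimeException(res.getDiagnostics().toString());
            } // closing the writer finishes the gzip trailer
        }
    }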