Steve Lawrence already pointed out that we have streaming APIs designed for 
large data consisting of many relatively small data items. Whether that meets 
your use case or not, I don't know.


General streaming, or "event-oriented processing" (where the parser calls back 
into the application as each element is parsed), is on our roadmap for the 
future. (The alternative style, where the application pulls each event or 
advances a cursor across the data, is similar. Both styles of API are popular.)
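Since neither API exists in Daffodil yet, here is purely a hypothetical sketch of the two styles in Java. The interface and method names below are invented for illustration and are not Daffodil's actual API:

```java
import java.util.Iterator;
import java.util.List;

public class StreamingStyles {
    // A parse event: e.g. "start"/"end" of an element (names are illustrative).
    record InfosetEvent(String kind, String name) {}

    // Push style: the parser drives, calling back as each element is parsed.
    interface InfosetEventHandler {
        void onEvent(InfosetEvent e);
    }

    static void pushParse(List<InfosetEvent> events, InfosetEventHandler handler) {
        for (InfosetEvent e : events) {
            handler.onEvent(e); // parser calls back into the application
        }
    }

    // Pull style: the application drives, advancing a cursor over the events.
    static Iterator<InfosetEvent> pullCursor(List<InfosetEvent> events) {
        return events.iterator(); // application calls next() at its own pace
    }
}
```

The key difference is only who holds the control loop; the sequence of events is the same either way.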


This has the advantage of eliminating the expanded XML or JSON intermediate 
data file, and it also eliminates the large memory footprint of the JVM process 
running Daffodil, which today keeps the entire parsed infoset in memory before 
writing any of it out.


Streaming complicates error handling: the application doesn't know whether the 
data will even parse correctly until it has already started processing parts of 
it. But in many cases that won't be an impediment, as broken data files aren't 
common.


And... alas, Daffodil doesn't have this yet. It is a pretty high priority for 
our users generally. You're not the only one asking for it.


I should mention that IBM's DFDL implementation *does* have streaming behavior, 
but it implements a subset of full DFDL, has different feature coverage than 
Daffodil, and of course comes with commercial license terms.


Some formats, however, just don't stream well in the unparse direction, the 
parse direction, or both. One reason is that they contain references such as 
stored lengths for large blocks of data.
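A tiny illustration of the stored-length problem on the unparse side, using a made-up length-prefixed format: the 4-byte length field comes first in the output, but its value isn't known until the entire payload has been produced, so the unparser must buffer the whole block rather than stream it:

```java
import java.nio.ByteBuffer;

public class LengthPrefix {
    // Hypothetical format: [4-byte big-endian length][payload bytes].
    // The length precedes the data it describes, so nothing can be emitted
    // until the payload is complete -- this defeats streaming on unparse.
    static byte[] lengthPrefixed(byte[] payload) {
        ByteBuffer out = ByteBuffer.allocate(4 + payload.length);
        out.putInt(payload.length); // only known after buffering the payload
        out.put(payload);
        return out.array();
    }
}
```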


In cases where the data doesn't stream well, you simply need a denser 
representation of the data so that storing it in files is tolerable. And given 
that the streaming behavior described above isn't an option in Daffodil today, 
you doubly need that denser storage.


There have been efforts to standardize denser binary representations of XML, 
which remove the tag redundancy and the like. EXI is one such "standard" 
format. There's an open-source, Apache-licensed EXI implementation called 
Nagasena (http://openexi.sourceforge.net/). I haven't used it, so I can't 
recommend it specifically, but its authors do argue for why it's better than 
just using gzip. There may be other EXI tools as well if you search around.


If you try EXI, I'd very much like to hear of your experience with it.


One possible, fairly easy enhancement to Daffodil would be the ability to 
output EXI directly: a Daffodil "InfosetOutputter" could construct EXI instead 
of XML.
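As a rough sketch only (the class and method names below are invented for illustration and differ from Daffodil's real InfosetOutputter API), such an outputter would receive the same start/end/value callbacks used to build XML text and hand them to an EXI encoder instead:

```java
import java.util.ArrayList;
import java.util.List;

public class ExiOutputterSketch {
    // Stand-in for a real EXI encoder (e.g. one from Nagasena); here it
    // just records the event stream it would otherwise binary-encode.
    interface ExiEncoder {
        void encode(String token);
    }

    static class RecordingEncoder implements ExiEncoder {
        final List<String> tokens = new ArrayList<>();
        public void encode(String token) { tokens.add(token); }
    }

    // Hypothetical outputter: the same callbacks that would serialize
    // XML text are redirected to the EXI encoder.
    static class ExiOutputter {
        private final ExiEncoder enc;
        ExiOutputter(ExiEncoder enc) { this.enc = enc; }
        void startElement(String name) { enc.encode("SE " + name); }
        void characters(String value)  { enc.encode("CH " + value); }
        void endElement(String name)   { enc.encode("EE " + name); }
    }
}
```

The point is that no intermediate XML text need ever exist; the infoset events go straight into the compact encoding.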


And, of course, simply always storing your XML as ".xml.gz" (or JSON as 
".json.gz"), even as it is first created, can help quite a bit as well. It 
depends on whether the compute burden of gzip/gunzip is acceptable for you.
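For the gzip route, the plain JDK classes suffice; for example, writing the XML text through a GZIPOutputStream as it is produced (the content here is just a placeholder):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipRoundTrip {
    // Compress text as it would be written out to a .xml.gz file.
    static byte[] gzip(String text) throws Exception {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buf.toByteArray();
    }

    // Decompress later for analysis.
    static String gunzip(byte[] compressed) throws Exception {
        try (GZIPInputStream gz =
                 new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return new String(gz.readAllBytes(), StandardCharsets.UTF_8);
        }
    }
}
```

XML's heavy tag repetition tends to compress very well, which is why the space savings can be dramatic even with plain gzip.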


I hope that was helpful.


-mike beckerle






________________________________
From: Roberts, Amy L2 <[email protected]>
Sent: Friday, October 19, 2018 6:11:14 PM
To: [email protected]
Subject: large intermediate data files


Hello Daffodil Community,

I'm interested in using Daffodil to access similar scientific data that's 
stored in files with several different binary encodings.

We've been successful in describing the data format with the DFDL 
specification, but so far the way we've been using Daffodil is to parse the 
binary data into an xml or json file.  The issue here is that the binary data 
files are ~2 GB, and both the xml and json versions are more than an order of 
magnitude larger.  And analysis can involve anywhere from 1 to 100 files.

Has anyone used Daffodil to unparse data in a way that doesn't create an 
intermediate file?

I'm imagining something that allows me to use Daffodil to look up and stream 
relevant data; I thought I'd reach out and see if this is a common use case, 
not possible, or somewhere in between.

Thank you!

Amy
