Hi Amy,

Currently, there isn't really a good way to access Daffodil's internal
infoset representation and bypass output to XML/JSON (DAFFODIL-1717 is
one way we could support this in the future). You could potentially
write some Scala code to access the internals, but documentation is
limited and it isn't a stable API, so it might break with future
updates. If you do want to go this route, let us know and we can help
out--it might eventually lead to an officially supported API that does
what you want.

However, if each file is really just the same thing repeated multiple
times, you could use the --stream option with the CLI. With this option,
Daffodil parses the data, and if it does not reach the end of the data
it repeats the parse where the previous one left off, continuing until
it finally does reach the end of the data.

As an example, a CSV file is really just an unbounded repetition of
lines--we could model just a single line like this:

  <xs:element name="CSVLine" dfdl:terminator="%NL;">
    <xs:complexType>
      <xs:sequence dfdl:separator="," dfdl:separatorPosition="infix">
        <xs:element name="Item" type="xs:string" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

We can then parse an entire CSV file using just the description of a
single line with the --stream option, e.g.:

  daffodil parse --stream -s csvline.dfdl.xsd file.csv

If file.csv contained this data:

  row1_item1,row1_item2,row1_item3
  row2_item1,row2_item2,row2_item3
  row3_item1,row3_item2,row3_item3

The resulting output would be this:

  <?xml version="1.0" encoding="UTF-8" ?>
  <CSVLine>
    <Item>row1_item1</Item>
    <Item>row1_item2</Item>
    <Item>row1_item3</Item>
  </CSVLine>
  <?xml version="1.0" encoding="UTF-8" ?>
  <CSVLine>
    <Item>row2_item1</Item>
    <Item>row2_item2</Item>
    <Item>row2_item3</Item>
  </CSVLine>
  <?xml version="1.0" encoding="UTF-8" ?>
  <CSVLine>
    <Item>row3_item1</Item>
    <Item>row3_item2</Item>
    <Item>row3_item3</Item>
  </CSVLine>

In this case, there were three rows in the CSV file. Because of the
--stream option, three separate parses took place and three infosets
were output, with each infoset separated by a NUL character (0x0).

So rather than getting one giant chunk of XML that's 20GB at the end of
a parse, you instead get many smaller chunks of XML streamed out
separated by a NUL. Each chunk of XML is output at the end of each small
parse, and could be split on the NUL character and
processed/transformed/validated/etc. individually. This doesn't avoid
the creation of lots of XML/JSON, but does allow you to retrieve smaller
chunks in a streaming manner and filter out the pieces you don't need.


On 10/19/18 6:11 PM, Roberts, Amy L2 wrote:
> Hello Daffodil Community,
> 
> I'm interested in using Daffodil to access similar scientific data
> that's stored in files with several different binary encodings.
> 
> We've been successful in describing the data format with the DFDL
> specification, but so far the way we've been using Daffodil is to
> parse the binary data into an xml or json file.  The issue here is
> that the binary data files are ~2 GB, and both the xml and json
> versions are more than an order of magnitude larger.  And analysis
> can involve anywhere from 1 to 100 files.
> 
> Has anyone used Daffodil to unparse data in a way that doesn't create
> an intermediate file?
> 
> I'm imagining something that allows me to use Daffodil to look up and
> stream relevant data; I thought I'd reach out and see if this is a
> common use case, not possible, or somewhere in between.
> 
> Thank you!
> 
> Amy
> 
