Re: EXI capability for Daffodil - requirements and design

Adams, Joshua Tue, 19 Jul 2022 08:06:22 -0700

Here are my notes based on my work on supporting Exificient so far:

> * Daffodil APIs may need change to make use of various EXI libraries
> possible or smoother/easier.


I think this is definitely true.  In my current pull request, which adds 
Exificient to the CLI tool, parsing with Exificient via SAX looks like this:

    val saxXmlRdr = processor.newXMLReaderInstance
    saxXmlRdr.setContentHandler(saxContentHandler)
    saxXmlRdr.setProperty(XMLUtils.DAFFODIL_SAX_URN_BLOBDIRECTORY, blobDir)
    saxXmlRdr.setProperty(XMLUtils.DAFFODIL_SAX_URN_BLOBSUFFIX, blobSuffix)
    saxXmlRdr.parse(data)

I feel that it could be smoothed out greatly if we could do something like:

    processor.parseWithSAX(data, saxContentHandler, saxProperties)

Similar improvements could be made on the unparse side of things as well.
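
To make the idea concrete, here is a minimal sketch of what such a convenience 
wrapper might look like.  parseWithSAX is hypothetical, not an existing 
Daffodil method.  To stay runnable without Daffodil on the classpath, the 
sketch takes a plain org.xml.sax.XMLReader and an InputSource (assumptions on 
my part; the Daffodil reader in the snippet above is handed the data 
directly), and the ElementCounter handler exists only to demonstrate the call.

```scala
import java.io.StringReader
import javax.xml.parsers.SAXParserFactory
import org.xml.sax.{Attributes, ContentHandler, InputSource, XMLReader}
import org.xml.sax.helpers.DefaultHandler

object SaxConvenience {
  // Hypothetical helper: folds the setContentHandler / setProperty / parse
  // sequence into a single call. Properties are passed as a name -> value map.
  def parseWithSAX(
      reader: XMLReader,
      data: InputSource,
      handler: ContentHandler,
      properties: Map[String, AnyRef] = Map.empty
  ): Unit = {
    reader.setContentHandler(handler)
    properties.foreach { case (name, value) => reader.setProperty(name, value) }
    reader.parse(data)
  }
}

// Demonstration-only handler: counts start-element events.
class ElementCounter extends DefaultHandler {
  var count = 0
  override def startElement(
      uri: String,
      localName: String,
      qName: String,
      attributes: Attributes
  ): Unit = count += 1
}

object SaxConvenienceDemo {
  // Uses the JDK's own SAX parser as a stand-in for the Daffodil XMLReader.
  def countElements(xml: String): Int = {
    val reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader()
    val counter = new ElementCounter
    SaxConvenience.parseWithSAX(reader, new InputSource(new StringReader(xml)), counter)
    counter.count
  }
}
```

With something like this, the blob directory and suffix settings would simply 
become entries in the properties map.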

> * Daffodil's unparser SAX API has some overhead we may want to bypass. The
> unparser is naturally a pull/StAX style of API. If EXI libraries can
> accommodate this then that may be substantially better in performance. EXI
> is all about performance after all.

I did some testing with the current SAX-based approach for EXI in my pull 
request.  For parsing, there doesn't seem to be any difference in performance 
between EXI and regular XML infosets, but for unparsing, EXI (using SAX) is 
about 3 times slower than normal XML.

Exificient has a StAX API as well, so this is probably worth investigating.  
Does Daffodil already support StAX, or would we need to implement some sort of 
XMLStreamReader/XMLStreamWriter?
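
For reference, the pull style in question looks roughly like the sketch below, 
where the consumer asks for the next event instead of receiving callbacks.  It 
uses only the JDK's javax.xml.stream API over a text XML string.  It does not 
use Exificient's StAX classes, and it assumes nothing about what a Daffodil 
StAX API would look like; the names StaxPullSketch, pullEvents and fromString 
are made up for illustration.

```scala
import java.io.StringReader
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants, XMLStreamReader}
import scala.collection.mutable.ListBuffer

object StaxPullSketch {
  // Drains an XMLStreamReader, recording element and text events.
  // The loop drives the reader: each next() call pulls exactly one event.
  def pullEvents(reader: XMLStreamReader): List[String] = {
    val events = ListBuffer.empty[String]
    while (reader.hasNext()) {
      reader.next() match {
        case XMLStreamConstants.START_ELEMENT =>
          events += s"start:${reader.getLocalName()}"
        case XMLStreamConstants.END_ELEMENT =>
          events += s"end:${reader.getLocalName()}"
        case XMLStreamConstants.CHARACTERS if !reader.isWhiteSpace() =>
          events += s"text:${reader.getText()}"
        case _ => // other event types are ignored in this sketch
      }
    }
    events.toList
  }

  // Builds a reader over text XML using the JDK's default StAX implementation.
  def fromString(xml: String): XMLStreamReader =
    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml))
}
```

An unparser written against XMLStreamReader could presumably consume infoset 
events this way from any StAX-capable source, which is the appeal.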

> 1) support for multiple open and closed source EXI implementations that
> are not incorporated into Daffodil as dependencies
> I know we have users who want to see tests with at least Agile Delta EXI
> (closed source) and EXIficient.

I'm looking into Agile Delta and have requested an evaluation copy through 
their website, but I'm not sure that will give us access to their SDK.  It 
should at least allow us to verify that we can unparse an EXI file encoded by 
Agile Delta with our current implementation using Exificient, though.

> 2) support for schema-unaware EXI encoding

This is how it is currently implemented in my pull request.

> 3) support for schema-aware EXI encoding. This may introduce new
> requirements - e.g., unlike XML text or schema unaware, one may (I have a
> lack of EXI knowledge/experience here) need the schema (or some
> EXI-compiled flavor thereof) in order to consume such EXI. (Bunch of TBD
> here.)

This hopefully wouldn't be too difficult to implement in the CLI.  The only 
thing I'm not sure about is how well the EXI libraries would handle resolving 
our schemas spread out across several files, but that should be a solvable 
problem.  I'm thinking it would only be supported when the --schema option is 
present on the CLI (i.e., use schema-unaware when using saved parsers).

> 4)  ? TDML runner support (? is there any requirement here ? Unclear)

The only thing that might be nice to have here is a way to compare EXI 
infosets, but I'm not sure this is really necessary.  There isn't much value 
in inspecting an EXI infoset so long as you can verify that it unparses 
correctly, matching the original file.

> 5) CLI support to output schema-unaware exi.
>
> 5.5) CLI support to output schema-aware exi. (TBD: is this needed for CLI?
> Applications can do this from API, do we really also need to offer it from
> the CLI?)

I touched on these earlier.  This should be doable, but might be limited to 
schema-unaware EXI for saved parsers.

> 6) Enable EXI LZW Compression feature (or not) - EXI is all about
> performance by improving the data density hence the handling overhead. We
> should do experiments measuring the on/off of options such as compression
> (a LZW-style compression feature built into EXI encoders/decoders) which is
> optionally enabled. If this improves compression with low overhead we would
> just turn it on. If the benefits are small we would not bother with it,
> but...  if it reduces size substantially, but has real measurable cost,
> then we probably need a switch for on/off. An interesting point would be
> the use of LZW compression with non-schema aware EXI vs. schema-aware EXI
> (with or without compression).

This would simply be a matter of setting the appropriate options on the 
EXIFactory before creating the EXI SAX ContentHandler.  It could easily be 
added to my existing pull request.

> 7) Unparser - API Pull support - Speculation here - do we need to create a
> standard StAX API for Daffodil unparsing so that EXI software supporting
> StAX (or any other kind of StAX software) can be used with Daffodil more
> easily.

As mentioned earlier, Exificient does have a StAX API, and it is worth 
investigating due to the performance overhead of SAX when unparsing.

> * Examples should show both file-of-data mode, and streaming mode (many
> messages on unbounded input)
>
> * CLI exi feature must support both single-file parse/unparse, and
> streaming mode.

Exificient does have code in their sample program for streaming, but I'm not 
sure how easily it would integrate into our CLI tool without looking into it 
further.  Not to mention the NUL separator issue.

Josh
________________________________
From: Mike Beckerle <mbecke...@apache.org>
Sent: Monday, July 18, 2022 4:39 PM
To: dev@daffodil.apache.org <dev@daffodil.apache.org>
Subject: Re: EXI capability for Daffodil - requirements and design

Additional requirements:

* Examples should show both file-of-data mode, and streaming mode (many
messages on unbounded input)

* CLI exi feature must support both single-file parse/unparse, and
streaming mode.
Note that this raises the question of how messages are separated on the stream, if
at all. I know our streaming mode now uses NUL bytes between XML text
outputs from the parser, and the streaming unparser expects these NUL
bytes. This NUL between messages might need to become configurable, for
equivalence of XML-text streams with corresponding EXI streams.
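
As a rough illustration of a configurable separator, the sketch below splits 
an input stream into messages on a single separator byte that defaults to 
NUL.  MessageSplitter and its methods are hypothetical names, not part of the 
Daffodil CLI.  Note the caveat that EXI is a binary format, so the separator 
byte could presumably also occur inside an EXI message; a length-prefix 
framing might be needed there instead.

```scala
import java.io.{ByteArrayOutputStream, InputStream}

object MessageSplitter {
  // Reads bytes up to (not including) the next separator or end of stream.
  // Returns None only when the stream is exhausted and nothing was read.
  private def readOne(in: InputStream, separator: Byte): Option[Array[Byte]] = {
    val buf = new ByteArrayOutputStream()
    var b = in.read()
    var sawAny = false
    while (b != -1 && b.toByte != separator) {
      buf.write(b)
      sawAny = true
      b = in.read()
    }
    if (b == -1 && !sawAny) None else Some(buf.toByteArray)
  }

  // Lazily yields one message at a time, so it works on unbounded input.
  def split(in: InputStream, separator: Byte = 0): Iterator[Array[Byte]] =
    Iterator
      .continually(readOne(in, separator))
      .takeWhile(_.isDefined)
      .map(_.get)
}
```

A trailing separator does not produce an extra empty message, while two 
adjacent separators yield an empty message between them.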


On Mon, Jul 18, 2022 at 4:16 PM Mike Beckerle <mbecke...@apache.org> wrote:

> This email thread for discussion of EXI capabilities for Daffodil.
>
> The primary requirement is improved performance by avoiding the processing
> and size overhead of XML (or JSON) textual infoset output creation from
> parsing, and input to unparsing.
> Users want to process large binary data files (think 800GBytes) using
> Daffodil. Textual XML can blow up the size of binary data by a factor of
> 100, which is infeasible both space and processing-overhead wise for large
> input data files like this.
>
> Even for small data messages the overhead of XML text can be excessive and
> have major performance impact.
>
> Users of Daffodil need to be able to create applications that never
> realize textual XML in processing pipelines that parse data, transform it
> using XSLT, validate it using XSD validation and/or schematron validation,
> and unparse back to original format. Keeping the data as EXI as it moves
> between these kinds of processing should provide substantial performance
> benefits.
>
> Phased approach: I believe the requirements can be done in phases, e.g., I
> would be fine with requiring a specific open-source compatible EXI
> library in our CLI as a first version even though ultimately we want it to
> be pluggable. Also for phasing, schema-unaware EXI is a fine stepping stone
> to schema-aware EXI.
>
> Theoretically, at least, there is no need for Daffodil to support EXI
> directly, i.e., no changes to Daffodil. This EXI-enabling effort could, in
> theory, just be the creation of a couple of example applications of
> Daffodil and an EXI library using each from their APIs.
>
> In practice there may be changes to Daffodil needed because:
>
> * Daffodil APIs may need change to make use of various EXI libraries
> possible or smoother/easier.
> * CLI may want to expose EXI capability for easy user experience with it.
> * Daffodil's unparser SAX API has some overhead we may want to bypass. The
> unparser is naturally a pull/StAX style of API. If EXI libraries can
> accommodate this then that may be substantially better in performance. EXI
> is all about performance after all.
>
> Some requirements:
>
> 1) support for multiple open and closed source EXI implementations that
> are not incorporated into Daffodil as dependencies
> I know we have users who want to see tests with at least Agile Delta EXI
> (closed source) and EXIficient.
>
> 2) support for schema-unaware EXI encoding
>
> 3) support for schema-aware EXI encoding. This may introduce new
> requirements - e.g., unlike XML text or schema unaware, one may (I have a
> lack of EXI knowledge/experience here) need the schema (or some
> EXI-compiled flavor thereof) in order to consume such EXI. (Bunch of TBD
> here.)
>
> 4)  ? TDML runner support (? is there any requirement here ? Unclear)
>
> 5) CLI support to output schema-unaware exi.
>
> 5.5) CLI support to output schema-aware exi. (TBD: is this needed for CLI?
> Applications can do this from API, do we really also need to offer it from
> the CLI?)
>
> 6) Enable EXI LZW Compression feature (or not) - EXI is all about
> performance by improving the data density hence the handling overhead. We
> should do experiments measuring the on/off of options such as compression
> (a LZW-style compression feature built into EXI encoders/decoders) which is
> optionally enabled. If this improves compression with low overhead we would
> just turn it on. If the benefits are small we would not bother with it,
> but...  if it reduces size substantially, but has real measurable cost,
> then we probably need a switch for on/off. An interesting point would be
> the use of LZW compression with non-schema aware EXI vs. schema-aware EXI
> (with or without compression).
>
> 7) Unparser - API Pull support - Speculation here - do we need to create a
> standard StAX API for Daffodil unparsing so that EXI software supporting
> StAX (or any other kind of StAX software) can be used with Daffodil more
> easily.
>
> 8) Rich examples of Daffodil using EXI: Examples (openDFDL, not part of
> Daffodil) should show how to parse, transform (simple XSLT thing), and
> unparse data using Daffodil with EXI as the intermediate form between the
> parse and transform, and between the transform and unparse. This should be
> shown in schema-unaware and schema-aware variants. An important part of
> this example is illustrating any added complexities that schema-aware EXI
> imposes. These are effectively EXI versions of the openDFDL helloWorld
> example.
>
>
>
>
>
