Re: EXI capability for Daffodil - requirements and design

Mike Beckerle Mon, 18 Jul 2022 13:39:56 -0700

Additional requirements:

* Examples should show both file-of-data mode, and streaming mode (many
messages on unbounded input)


* The CLI EXI feature must support both single-file parse/unparse and
streaming mode.
Note that this raises the question of how messages are separated on the
stream, if at all. I know our streaming mode currently puts NUL bytes between
XML text outputs from the parser, and the streaming unparser expects these NUL
bytes. This NUL between messages might need to become configurable, so that
XML-text streams and the corresponding EXI streams can be handled equivalently.
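
To make the separator question concrete, here is a minimal sketch of splitting a stream into messages on a configurable one-byte separator. The class and method names are hypothetical and are not part of Daffodil or its CLI; only JDK classes are used.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a Daffodil API: splits a byte stream into
// messages on a configurable one-byte separator (NUL by default here).
final class DelimitedMessageSplitter {

    static List<byte[]> split(InputStream in, int separator) throws IOException {
        List<byte[]> messages = new ArrayList<>();
        ByteArrayOutputStream current = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) {
            if (b == separator) {
                // Separator seen: the bytes collected so far form one message.
                messages.add(current.toByteArray());
                current.reset();
            } else {
                current.write(b);
            }
        }
        // A trailing message with no final separator is still returned.
        if (current.size() > 0) {
            messages.add(current.toByteArray());
        }
        return messages;
    }

    public static void main(String[] args) throws IOException {
        // Two XML-text messages, each followed by a NUL byte.
        byte[] stream = "<a>1</a>\0<a>2</a>\0".getBytes(StandardCharsets.UTF_8);
        for (byte[] msg : split(new ByteArrayInputStream(stream), 0x00)) {
            System.out.println(new String(msg, StandardCharsets.UTF_8));
        }
    }
}
```

NUL works for XML text because NUL is not a legal XML character. EXI is a binary encoding, so a 0x00 byte may occur inside a message body; a single-byte separator is therefore probably not enough for EXI streams, and something like a length prefix may be needed instead.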


On Mon, Jul 18, 2022 at 4:16 PM Mike Beckerle <mbecke...@apache.org> wrote:

> This email thread is for discussion of EXI capabilities for Daffodil.
>
> The primary requirement is improved performance by avoiding the processing
> and size overhead of creating textual XML (or JSON) infoset output when
> parsing, and of consuming it as input when unparsing.
> Users want to process large binary data files (think 800 GB) using
> Daffodil. Textual XML can blow up the size of binary data by a factor of
> 100, which is infeasible in both space and processing overhead for input
> data files this large.
>
> Even for small data messages the overhead of XML text can be excessive and
> have major performance impact.
>
> Users of Daffodil need to be able to create applications that never
> realize textual XML in processing pipelines that parse data, transform it
> using XSLT, validate it using XSD validation and/or schematron validation,
> and unparse back to original format. Keeping the data as EXI as it moves
> between these kinds of processing should provide substantial performance
> benefits.
>
> Phased approach: I believe the requirements can be done in phases, e.g., I
> would be fine with requiring a specific open-source compatible EXI
> library in our CLI as a first version even though ultimately we want it to
> be pluggable. Also for phasing, schema-unaware EXI is a fine stepping stone
> to schema-aware EXI.
>
> Theoretically, at least, there is no need for Daffodil to support EXI
> directly, i.e., no changes to Daffodil. This EXI-enabling effort could, in
> theory, just be the creation of a couple of example applications that
> combine Daffodil and an EXI library, using each through its API.
>
> In practice there may be changes to Daffodil needed because:
>
> * Daffodil APIs may need changes to make use of various EXI libraries
> possible or smoother/easier.
> * The CLI may want to expose EXI capability so users can try it easily.
> * Daffodil's unparser SAX API has some overhead we may want to bypass. The
> unparser is naturally a pull/StAX style of API. If EXI libraries can
> accommodate this then that may be substantially better in performance. EXI
> is all about performance after all.
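
For readers less familiar with the push/pull distinction, here is a minimal sketch of pull-style (StAX) consumption using only the JDK's javax.xml.stream. It reads XML text; it does not use Daffodil's unparser or any EXI library, and the class name is hypothetical. Whether a given EXI library can supply events through this interface is an assumption that would need checking.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical illustration, not a Daffodil API: the consumer drives the
// loop and asks for the next event when it is ready for one (pull), instead
// of being called back by the producer as in SAX (push).
final class PullInfosetSketch {

    static List<String> pullEvents(String xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Merge adjacent character events so each text node arrives once.
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        List<String> events = new ArrayList<>();
        while (reader.hasNext()) {
            switch (reader.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    events.add("start " + reader.getLocalName());
                    break;
                case XMLStreamConstants.CHARACTERS:
                    if (!reader.isWhiteSpace()) {
                        events.add("text " + reader.getText());
                    }
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    events.add("end " + reader.getLocalName());
                    break;
                default:
                    break; // other event types are ignored in this sketch
            }
        }
        reader.close();
        return events;
    }

    public static void main(String[] args) throws XMLStreamException {
        for (String event : pullEvents("<msg><len>5</len></msg>")) {
            System.out.println(event);
        }
    }
}
```

The point of the sketch is the control flow: an unparser written this way requests infoset events as it needs them, which is the shape the unparser is described as naturally having.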
>
> Some requirements:
>
> 1) support for multiple open and closed source EXI implementations that
> are not incorporated into Daffodil as dependencies
> I know we have users who want to see tests with at least Agile Delta EXI
> (closed source) and EXIficient.
>
> 2) support for schema-unaware EXI encoding
>
> 3) support for schema-aware EXI encoding. This may introduce new
> requirements - e.g., unlike XML text or schema-unaware EXI, one may (I have
> a lack of EXI knowledge/experience here) need the schema (or some
> EXI-compiled flavor thereof) in order to consume such EXI. (Bunch of TBD
> here.)
>
> 4) TDML runner support? (Is there any requirement here? Unclear.)
>
> 5) CLI support to output schema-unaware EXI.
>
> 5.5) CLI support to output schema-aware EXI. (TBD: is this needed for CLI?
> Applications can do this from the API; do we really also need to offer it
> from the CLI?)
>
> 6) Enable the EXI compression feature (or not) - EXI is all about
> performance by improving data density and hence reducing handling overhead.
> We should run experiments measuring the effect of turning options such as
> compression on and off (EXI encoders/decoders have a built-in, optionally
> enabled compression feature based on DEFLATE). If this improves size with
> low overhead we would just turn it on. If the benefits are small we would
> not bother with it, but if it reduces size substantially at a real,
> measurable cost, then we probably need an on/off switch. An interesting
> point would be the use of compression with schema-unaware EXI vs.
> schema-aware EXI (with or without compression).
>
> 7) Unparser - API pull support - Speculation here: do we need to create a
> standard StAX API for Daffodil unparsing so that EXI software supporting
> StAX (or any other kind of StAX software) can be used with Daffodil more
> easily?
>
> 8) Rich examples of Daffodil using EXI: Examples (openDFDL, not part of
> Daffodil) should show how to parse, transform (simple XSLT thing), and
> unparse data using Daffodil with EXI as the intermediate form between the
> parse and transform, and between the transform and unparse. This should be
> shown in schema-unaware and schema-aware variants. An important part of
> this example is illustrating any added complexities that schema-aware EXI
> imposes. These are effectively EXI versions of the openDFDL helloWorld
> example.
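
For the on/off experiments in requirement 6, a measurement harness might look roughly like the sketch below. It uses the JDK's java.util.zip.Deflater purely as a stand-in for an EXI encoder's compression switch, and the XML-like payload is synthetic. Neither the class nor the sample data comes from Daffodil or any EXI library, and the numbers it prints say nothing about real EXI performance.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Hypothetical harness: compares output size and elapsed time with a
// compression switch off and on. Deflater stands in for an EXI encoder.
final class CompressionToggleSketch {

    // Returns the encoded size in bytes, with or without compression.
    static int encodedSize(byte[] payload, boolean compress) {
        if (!compress) {
            return payload.length;
        }
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION);
        deflater.setInput(payload);
        deflater.finish();
        byte[] buffer = new byte[4096];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    // Synthetic, repetitive XML-like text (an assumption, not real data).
    static byte[] samplePayload(int records) {
        StringBuilder sb = new StringBuilder("<records>");
        for (int i = 0; i < records; i++) {
            sb.append("<record><id>").append(i).append("</id></record>");
        }
        sb.append("</records>");
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] payload = samplePayload(10_000);
        for (boolean compress : new boolean[] {false, true}) {
            long start = System.nanoTime();
            int size = encodedSize(payload, compress);
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.println("compress=" + compress
                    + " bytes=" + size + " micros=" + micros);
        }
    }
}
```

A real experiment would replace `encodedSize` with calls into the EXI library under test, run each setting many times, and cover both schema-unaware and schema-aware encodings.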
>
>
>
>
>
