EXI capability for Daffodil - requirements and design

Mike Beckerle Mon, 18 Jul 2022 13:16:42 -0700

This email thread is for discussion of EXI capabilities for Daffodil.

The primary requirement is improved performance by avoiding the processing
and size overhead of creating textual XML (or JSON) infosets as output from
parsing and as input to unparsing.
Users want to process large binary data files (think 800 GBytes) using
Daffodil. Textual XML can blow up the size of binary data by a factor of
100, which is infeasible in both space and processing overhead for input
data files this large.
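
As a rough illustration of how the expansion arises, the sketch below packs
a hypothetical record of three small integer fields into binary and renders
the same values as textual XML. The record layout and the element names are
made up for illustration; the actual ratio depends entirely on the data
format and the DFDL schema.

```python
import struct

# Hypothetical record: three unsigned 8-bit fields (names are illustrative only).
fields = {"messageType": 7, "priority": 3, "sequenceNumber": 42}

# Binary form: one byte per field.
binary = struct.pack("BBB", *fields.values())

# Textual XML form of the same values, one element per field.
xml = "<record>" + "".join(
    f"<{name}>{value}</{name}>" for name, value in fields.items()
) + "</record>"
xml_bytes = xml.encode("utf-8")

# How many times larger the textual form is than the binary form.
ratio = len(xml_bytes) / len(binary)
print(len(binary), len(xml_bytes), round(ratio, 1))
```

Fields narrower than a byte, which are common in binary formats, push the
ratio higher still.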


Even for small data messages the overhead of XML text can be excessive and
have a major performance impact.

Users of Daffodil need to be able to create applications that never realize
textual XML in processing pipelines that parse data, transform it using
XSLT, validate it using XSD validation and/or Schematron validation, and
unparse back to the original format. Keeping the data as EXI as it moves
between these kinds of processing should provide substantial performance
benefits.

Phased approach: I believe the requirements can be done in phases, e.g., I
would be fine with requiring a specific open-source compatible EXI
library in our CLI as a first version even though ultimately we want it to
be pluggable. Also for phasing, schema-unaware EXI is a fine stepping stone
to schema-aware EXI.

Theoretically, at least, there is no need for Daffodil to support EXI
directly, i.e., no changes to Daffodil. This EXI-enabling effort could, in
theory, just be the creation of a couple of example applications that use
Daffodil and an EXI library, each through its API.

In practice there may be changes to Daffodil needed because:

* Daffodil APIs may need changes to make use of various EXI libraries
possible or smoother/easier.
* The CLI may want to expose EXI capability so users can try it easily.
* Daffodil's unparser SAX API has some overhead we may want to bypass. The
unparser is naturally a pull/StAX style of API. If EXI libraries can
accommodate this then that may be substantially better in performance. EXI
is all about performance after all.

Some requirements:

1) support for multiple open and closed source EXI implementations that are
not incorporated into Daffodil as dependencies. I know we have users who
want to see tests with at least AgileDelta EXI (closed source) and
EXIficient.

2) support for schema-unaware EXI encoding

3) support for schema-aware EXI encoding. This may introduce new
requirements - e.g., unlike XML text or schema-unaware EXI, one may (I lack
EXI knowledge/experience here) need the schema (or some
EXI-compiled flavor thereof) in order to consume such EXI. (Bunch of TBD
here.)

4) TDML runner support? (Is there any requirement here? Unclear.)

5) CLI support to output schema-unaware EXI.

5.5) CLI support to output schema-aware EXI. (TBD: is this needed for CLI?
Applications can do this from API, do we really also need to offer it from
the CLI?)

6) Enable the EXI compression feature (or not) - EXI is all about
performance by improving data density and hence reducing handling overhead.
We should do experiments measuring the effect of turning options on and
off, such as compression (a DEFLATE-based compression feature built into
EXI encoders/decoders), which is optionally enabled. If this improves data
density with low overhead we would just turn it on. If the benefits are
small we would not bother with it, but if it reduces size substantially at
a real measurable cost, then we probably need an on/off switch. An
interesting comparison would be compression with schema-unaware EXI vs.
schema-aware EXI (with or without compression).
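
The kind of on/off experiment described here could be structured along the
following lines. This is only a sketch of a measurement harness: Python's
zlib (DEFLATE) stands in for an EXI encoder's compression option, the
measure function is a hypothetical helper, and the payload is synthetic.
Real measurements would need to run actual EXI libraries on representative
infosets.

```python
import time
import zlib

def measure(payload: bytes, compress: bool) -> tuple[int, float]:
    """Return (output size in bytes, elapsed seconds) for one encoding run."""
    start = time.perf_counter()
    # zlib is a stand-in only; an EXI encoder would be called here instead.
    out = zlib.compress(payload) if compress else bytes(payload)
    elapsed = time.perf_counter() - start
    return len(out), elapsed

# Synthetic, highly repetitive stand-in for an encoded infoset.
payload = b"<record><messageType>7</messageType></record>" * 10_000

size_off, time_off = measure(payload, compress=False)
size_on, time_on = measure(payload, compress=True)

print(f"compression off: {size_off} bytes in {time_off:.6f}s")
print(f"compression on:  {size_on} bytes in {time_on:.6f}s")
print(f"size reduction:  {1 - size_on / size_off:.1%}")
```

Recording both size and time for each setting is what lets us decide
between "always on", "not worth it", and "needs a switch".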

7) Unparser - API pull support - Speculation here - do we need to create a
standard StAX API for Daffodil unparsing so that EXI software supporting
StAX (or any other kind of StAX software) can be used with Daffodil more
easily?
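
To make the push/pull distinction concrete, here is a small sketch using
Python's standard library rather than Daffodil's or any EXI library's API.
xml.sax drives a handler with callbacks (push), while xml.dom.pulldom lets
the consumer ask for the next event (pull), which is the shape an unparser
naturally wants. The sample document, PushConsumer, and pull_events are
illustrative names only.

```python
import io
import xml.sax
from xml.dom import pulldom

DOC = b"<record><messageType>7</messageType></record>"

# Push (SAX-style): the parser owns the loop and calls back into the consumer.
class PushConsumer(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

push = PushConsumer()
xml.sax.parseString(DOC, push)

# Pull (StAX-style): the consumer owns the loop and requests each event.
def pull_events(data: bytes):
    for event, node in pulldom.parse(io.BytesIO(data)):
        if event == pulldom.START_ELEMENT:
            yield ("start", node.tagName)
        elif event == pulldom.END_ELEMENT:
            yield ("end", node.tagName)

source = pull_events(DOC)
first = next(source)   # the consumer decides when to advance
rest = list(source)    # drain the remaining events

print(push.events)
print([first] + rest)
```

Both styles see the same events; the difference is who controls the loop.
Bridging a push source to a pull consumer generally needs buffering or a
second thread, which is the kind of overhead a native pull API would avoid.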

8) Rich examples of Daffodil using EXI: Examples (openDFDL, not part of
Daffodil) should show how to parse, transform (simple XSLT thing), and
unparse data using Daffodil with EXI as the intermediate form between the
parse and transform, and between the transform and unparse. This should be
shown in schema-unaware and schema-aware variants. An important part of
this example is illustrating any added complexities that schema-aware EXI
imposes. These are effectively EXI versions of the openDFDL helloWorld
example.
