This email thread is for discussion of EXI capabilities for Daffodil. The primary requirement is improved performance by avoiding the processing and size overhead of creating textual XML (or JSON) infoset output when parsing, and of consuming it as input when unparsing. Users want to process large binary data files (think 800 GBytes) using Daffodil. Textual XML can blow up the size of binary data by a factor of 100, which is infeasible in both space and processing overhead for input files this large.
Even for small data messages, the overhead of XML text can be excessive and have a major performance impact. Users of Daffodil need to be able to create applications that never realize textual XML in processing pipelines that parse data, transform it using XSLT, validate it using XSD validation and/or Schematron validation, and unparse it back to the original format. Keeping the data as EXI as it moves between these kinds of processing should provide substantial performance benefits.

Phased approach: I believe the requirements can be done in phases. For example, I would be fine with requiring a specific open-source-compatible EXI library in our CLI as a first version, even though ultimately we want it to be pluggable. Also for phasing, schema-unaware EXI is a fine stepping stone to schema-aware EXI.

Theoretically, at least, there is no need for Daffodil to support EXI directly, i.e., no changes to Daffodil. This EXI-enabling effort could, in theory, just be the creation of a couple of example applications that use Daffodil and an EXI library, each through its API.

In practice there may be changes to Daffodil needed because:

* Daffodil APIs may need to change to make use of various EXI libraries possible or smoother/easier.
* The CLI may want to expose EXI capability to give users an easy experience with it.
* Daffodil's unparser SAX API has some overhead we may want to bypass. The unparser is naturally a pull/StAX style of API. If EXI libraries can accommodate this, then performance may be substantially better. EXI is all about performance, after all.

Some requirements:

1) Support for multiple open- and closed-source EXI implementations that are not incorporated into Daffodil as dependencies. I know we have users who want to see tests with at least Agile Delta EXI (closed source) and EXIficient.

2) Support for schema-unaware EXI encoding.

3) Support for schema-aware EXI encoding. This may introduce new requirements. For example, unlike XML text or schema-unaware EXI, one may need the schema (or some EXI-compiled flavor thereof) in order to consume such EXI. (I lack EXI knowledge/experience here. Bunch of TBD here.)

4) TDML runner support? (Is there any requirement here? Unclear.)

5) CLI support to output schema-unaware EXI.

5.5) CLI support to output schema-aware EXI. (TBD: is this needed for the CLI? Applications can do this from the API; do we really also need to offer it from the CLI?)

6) Enable the EXI compression feature (or not). EXI is all about performance: improving the data density reduces the handling overhead. We should do experiments measuring the effect of turning options on and off, such as compression (a DEFLATE-based compression feature built into EXI encoders/decoders), which is optionally enabled. If this improves density with low overhead, we would just turn it on. If the benefits are small, we would not bother with it. But if it reduces size substantially and has a real, measurable cost, then we probably need an on/off switch. An interesting point of comparison would be compression with schema-unaware EXI vs. schema-aware EXI (with or without compression).

7) Unparser API pull support. Speculation here: do we need to create a standard StAX API for Daffodil unparsing, so that EXI software supporting StAX (or any other kind of StAX software) can be used with Daffodil more easily?

8) Rich examples of Daffodil using EXI. Examples (in openDFDL, not part of Daffodil) should show how to parse, transform (a simple XSLT thing), and unparse data using Daffodil, with EXI as the intermediate form between the parse and the transform, and between the transform and the unparse. This should be shown in schema-unaware and schema-aware variants. An important part of this example is illustrating any added complexities that schema-aware EXI imposes. These are effectively EXI versions of the openDFDL helloWorld example.
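
To make the shape of the pipeline in requirement 8 concrete, here is a rough sketch of chaining parse, XSLT transform, and unparse purely through SAX events, so that no textual XML is written between the stages. It uses only JDK classes. Everything named here is a stand-in I made up for illustration (SaxPipelineSketch, RecordingHandler, standInSource, the tiny stylesheet): the JDK SAX parser sits where Daffodil's SAX parse reader would be, and the recording handler sits where Daffodil's SAX unparse content handler would be. It does not call Daffodil or any EXI library.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

final class SaxPipelineSketch {

    // Hypothetical "simple XSLT thing": renames <greeting> to <salutation>.
    static final String RENAME_XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:template match='greeting'>"
      + "<salutation><xsl:value-of select='.'/></salutation>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Stand-in for the unparse side: records the SAX events it receives.
    static final class RecordingHandler extends DefaultHandler {
        final List<String> elements = new ArrayList<>();
        final StringBuilder text = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            elements.add(qName);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Text may arrive in several chunks, so accumulate it.
            text.append(ch, start, length);
        }
    }

    // Stand-in for the parse side: a plain JDK SAX parser reading XML text.
    static XMLReader standInSource() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newSAXParser().getXMLReader();
    }

    // Wires source -> XSLT -> sink using SAX events only.
    static void run(XMLReader source, InputSource input, String xslt, ContentHandler sink)
            throws Exception {
        TransformerFactory tf = TransformerFactory.newInstance();
        if (!tf.getFeature(SAXTransformerFactory.FEATURE)) {
            throw new IllegalStateException("XSLT processor lacks SAX support");
        }
        SAXTransformerFactory stf = (SAXTransformerFactory) tf;
        TransformerHandler transform =
            stf.newTransformerHandler(new StreamSource(new StringReader(xslt)));
        transform.setResult(new SAXResult(sink)); // transform output goes to the sink as events
        source.setContentHandler(transform);      // source events feed the transform
        source.parse(input);
    }

    public static void main(String[] args) throws Exception {
        RecordingHandler sink = new RecordingHandler();
        InputSource input = new InputSource(new StringReader("<greeting>hello</greeting>"));
        run(standInSource(), input, RENAME_XSLT, sink);
        System.out.println(sink.elements + " " + sink.text);
    }
}
```

My assumption, to be verified per library, is that an EXI implementation exposing a SAX ContentHandler for encoding and an XMLReader for decoding could be slotted in at either hand-off point, so that what is stored or transmitted between stages is EXI bytes rather than text. Note also that a typical XSLT 1.0 processor builds an in-memory tree before transforming, so this wiring avoids text but not buffering.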