Not a problem. But it is a bit tricky to understand. The "%NUL;" notation lets the schema talk about NUL characters in the data without having any NUL characters in the schema.
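To make that concrete, here is a small sketch using Python's standard XML parser. It is not Daffodil code, and the `element` tag with a `terminator` attribute is only a made-up stand-in for a real DFDL schema fragment. It shows that "%NUL;" is five ordinary characters to an XML parser, while an XML character reference to NUL is rejected.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a DFDL schema fragment; only the attribute value matters.
schema_text = '<element terminator="%NUL;"/>'
terminator = ET.fromstring(schema_text).get("terminator")

# The XML parser sees five ordinary characters, none of which is NUL.
print(len(terminator), "\x00" in terminator)  # prints: 5 False


def xml_accepts(text):
    """Return True if the text parses as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# A numeric character reference to NUL is not allowed in XML 1.0.
print(xml_accepts('<element terminator="&#0;"/>'))  # prints: False
```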
The %NUL; does not turn into a NUL character in the DFDL schema (which is an XML document). It remains the 5 characters "%NUL;" in the schema. Since a DFDL schema is itself an XML document, it has an XML Infoset. That Infoset does not contain a NUL character. It contains the 5-character string "%NUL;", which talks about NUL characters that appear in the data.

Now, if you use Apache Daffodil to convert data to XML, and the data contains NUL characters, not only as delimiters but right in the strings of data, then the XML we create needs to represent that the underlying data had a NUL in it.

This is not a problem for DFDL or DFDL processors, because the DFDL Infoset is different from the XML Infoset. One way it differs is that the DFDL Infoset is explicitly allowed to contain NUL, and in fact is allowed to contain all the other characters XML 1.0 disallows.

But that leaves the issue of what happens when the DFDL Infoset is converted into, say, an XML Infoset or a JSON Infoset representation. A string which has a NUL character in the middle of it must somehow represent that NUL, but XML cannot simply pass the NUL through, because XML doesn't allow NUL.

This problem is not at all unique to DFDL. Lots of software has had to cope with this XML restriction.

DFDL (the language) does not take any position on how DFDL implementations cope with this. This is probably a flaw in the specification, as conversions to/from common infoset representations should probably also be standardized. Perhaps that will happen in the future.

Apache Daffodil uses a strategy adapted from what we found many other pieces of software use. E.g., we do something similar to what Microsoft has published as what they do for Microsoft Visio.

Apache Daffodil utilizes the Unicode Private Use Area (PUA) characters, which are all legal as characters of XML documents, and converts characters that are illegal in XML 1.0 to corresponding characters in the Unicode Private Use Area.
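A rough sketch of this kind of remapping, in Python rather than Daffodil's own code. It assumes the simple rule of adding 0xE000 to the code point and handles only the control characters below 0x20 that XML 1.0 disallows. The function names are made up for illustration, and Daffodil's actual mapping covers more cases than this.

```python
PUA_OFFSET = 0xE000  # start of the Unicode Private Use Area


def is_xml_illegal_control(cp):
    """True for code points below 0x20 that XML 1.0 disallows (all but tab, LF, CR)."""
    return 0 <= cp < 0x20 and cp not in (0x09, 0x0A, 0x0D)


def to_xml_safe(s):
    """Shift XML-illegal control characters into the PUA (illustrative only)."""
    return "".join(
        chr(ord(c) + PUA_OFFSET) if is_xml_illegal_control(ord(c)) else c
        for c in s
    )


def from_xml_safe(s):
    """Inverse of to_xml_safe: shift the remapped PUA characters back."""
    return "".join(
        chr(ord(c) - PUA_OFFSET)
        if is_xml_illegal_control(ord(c) - PUA_OFFSET)
        else c
        for c in s
    )


original = "AB\x00CD\x01"
safe = to_xml_safe(original)
print([hex(ord(c)) for c in safe])
# prints: ['0x41', '0x42', '0xe000', '0x43', '0x44', '0xe001']
print(from_xml_safe(safe) == original)  # prints: True
```

This sketch round-trips only if the original data does not already contain the PUA characters 0xE000 through 0xE01F; that is a limitation of the sketch, not a statement about what Daffodil does.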
This mapping is a bijection, so on unparsing the inverse mapping applies and the original characters are recovered. A character is projected into the Private Use Area by adding 0xE000 to its Unicode character code. So NUL (0) becomes 0xE000. Ctrl-A (which is Unicode code point 1) becomes 0xE001. And so on. A different mapping is used for other disallowed Unicode characters such as isolated surrogate halves.

The gory details are here: https://daffodil.apache.org/infoset/ in the section called "XML Illegal Characters".

A problem remains: the fonts people use to look at data on their computer screens often lack glyphs for the Unicode PUA characters, so the 0xE000 character often just shows up as a box, with no way to tell whether that box represents 0xE000, 0xE001, 0xE002, etc. But the underlying string of XML does have distinct code points in it, so this on-screen representation is only a loss of information for a reader's eyes. XML Infoset data saved to a file will have the distinct character codes 0xE000, 0xE001, etc.

This is one of the reasons Apache Daffodil says you must use a computer set up for Unicode to use Daffodil. Even if all your data is US-ASCII, NUL is a legal US-ASCII character, and NUL will end up as a 0xE000 character, which cannot be depicted without Unicode font support.

This headache is one of the ones the DFDL workgroup accepted when we chose XML Schema and the XML Infoset as starting points for DFDL way back in 2001.

JSON has a different set of restrictions on characters. E.g., a string in JSON cannot contain a literal line-ending character; it must be written as an escape such as "\n". I suspect we're missing documentation of the conversion of the DFDL Infoset to the JSON Infoset. (Created: https://issues.apache.org/jira/browse/DAFFODIL-2621)

On Mon, Jan 10, 2022 at 9:48 AM Roger L Costello <[email protected]> wrote:

> Hi Folks,
>
> XML Schema hosts the DFDL language, i.e., DFDL properties are added into
> an XML Schema.
>
> XML Schema is XML.
>
> XML does not allow (among others) the NUL character. That means I cannot
> directly copy (from somewhere) a NUL character and paste it into an XML
> document, nor can I indirectly use the NUL character via XML's character
> entity mechanism.
>
> DFDL allows the NUL character via the DFDL entity %NUL;
>
> Isn't this a problem? Isn't DFDL violating a basic rule of XML?
>
> /Roger
