Sure.
So your element has lengthKind 'pattern', so the length of the content region is determined by what matches the pattern.

The terminator is in the framing, which is, according to DFDL's data grammar, outside of the content region.

So the length pattern gives the string hello00, because the pattern match doesn't include the lookahead of one NUL followed by a non-NUL. Then it will start to look for the terminator AFTER all of that content. This should find the remaining NUL.

The hello00 will then be the content, which is converted to a value. So if trimming is enabled, the pad characters are trimmed off the content, leaving just hello, and since the type is xs:string, we're done.

Key to understanding this is that framing surrounds content, and content surrounds value. Delimiters are part of the framing. Padding is part of the content. The length determines the start and end of the content. Only when lengthKind='delimited' is the terminator used to determine the content, and even in that case, the terminator that is found in the data stream isn't part of the content; it is part of the framing that comes after the content.

________________________________
From: Costello, Roger L. <[email protected]>
Sent: Tuesday, April 2, 2019 10:59:48 AM
To: [email protected]
Subject: Re: Question about parsing binary input containing strings separated by nulls

Thank you Mike and Brandon!

I’d like to follow up with a few specific questions about this:

    <xs:element name="string" type="xs:string" maxOccurs="unbounded"
                dfdl:lengthKind="pattern"
                dfdl:lengthPattern="[\x00-\xFF]+?(?=\x00([^\x00]|$))"
                dfdl:representation="text"
                dfdl:encoding="ISO-8859-1"
                dfdl:textTrimKind="padChar"
                dfdl:textStringPadCharacter="%NUL;"
                dfdl:textStringJustification="left"
                dfdl:terminator="%NUL;"/>

Suppose the input is the following (I’ll use 0 to represent the null symbol):

    Hello000World…

I’d like to focus on consuming the first string (Hello).
The lengthPattern says to consume characters up to the last null that precedes a non-null character. So, that results in this:

    Hello00

The terminator says to consume characters up to the first null. So, that results in this:

    Hello

So, the lengthPattern and the terminator result in different strings, right? Aren’t they contradicting each other?

How do the padding properties (textTrimKind, textStringPadCharacter, textStringJustification) come into play?

Would you give me some intuition about how Daffodil uses these seemingly contradictory properties to parse the input, please?

/Roger

From: Beckerle, Mike <[email protected]>
Sent: Tuesday, April 2, 2019 9:46 AM
To: [email protected]
Subject: [EXT] Re: Question about parsing binary input containing strings separated by nulls

Roger,

When you say "the input contains an unbounded number of strings, each string is padded by one or more nulls or ends at the end-of-file", that sounds simple, but there are lots of ambiguities.

1) Are NULs also allowed at the end of file, or is it strictly either/or?

2) If a value is the empty string, then when unparsed I cannot distinguish consecutive pad characters delimiting a single value from the representation of several empty strings, each with its own single pad character after it. Are empty strings even allowed? Are they syntactically illegal, or are they just invalid values? (I think this is the same point Brandon made.)

3) You have omitted whether there is any way for a NUL to appear in the content via some escaping mechanism. In the absence of such a statement, one can assume you mean "no it cannot". But consider that many, if not most, delimited formats have this kind of escaping/quoting to allow delimiters to appear in the content. For that to even make sense, delimiters must be distinguished from padding; otherwise you'd have to escape each padding character, which .... just isn't padding.
4) You are assuming a definition of the term "padding" that might seem intuitive to you, but doesn't match usage of the term in DFDL. Of course people have legitimately different intuitive definitions for these sorts of terms, so DFDL has to pick exactly what it means, and we hope it matches most people's intuition, or that it is learnable without too much trouble. The DFDL Working Group didn't invent DFDL's versions out of thin air. They were drawn from examples in existing format description mechanisms of data integration tools. I.e., they had workable precedent.

DFDL is very specific about the terms padding and delimiter.

A delimiter is part of the framing, which bounds the start or end of the content region. When lengthKind="delimited", the delimiter is used algorithmically to isolate the length of the content region. (When lengthKind is NOT delimited, the delimiter can still be used as a redundant marker of the start/end of the content region. Lots of data has this redundancy.)

Padding is about characters within the content region that are outside of the value region.

This is a simplification of the grammar rules in section 9 of the DFDL spec, but the grammar productions for DFDL-described data are roughly:

    simpleElement = preFraming simpleContent postFraming
    simpleContent = prePadding SimpleValue postPadding

SimpleValue is a terminal of the grammar. It is the region whose characters/bits are converted into the value.

There are two corresponding functions in the DFDL Expression language (sometimes called DPath since it's mostly like XPath): dfdl:valueLength(node, units) and dfdl:contentLength(node, units). The stored length in binary data is usually the content length. It is computed in terms of the value length and, per the grammar above, includes the length of any padding.

Delimiters are in the pre/postFraming, along with alignment regions and some other details.

Note that use of both padding AND delimiters is very atypical of most data.
Most data uses padding when the strings have a specified (fixed or expression) length, meaning there can be ZERO pad characters if the data value consumes all the space.

As for why you can't use dfdl:lengthKind="delimited" and somehow say "one or more NUL characters": this use case just isn't prevalent enough in textual data sets to be worth it. Or it didn't seem to be at the time DFDL was being formulated. The way of specifying delimiters in DFDL was intentionally kept simpler than full regular expressions, because users have terrible trouble with regular expressions. dfdl:lengthKind='pattern' lets you open this door and use full regular expressions. The limited functionality of dfdl:lengthKind="delimited" with the character class entities WSP, WSP*, WSP+, NL, etc. is sufficient to express many common textual data formats.

These DFDL capabilities were adopted from examining the data-format description capabilities of a number of industry data integration tools. I won't claim that process is/was perfect, but it was very rational and largely based on generalizing from existing practice.

Some important format needs were missed and have since been added to DFDL (e.g., the dfdl:bitOrder property came to DFDL quite late, but is needed for a bunch of formats). We've also definitely found a need for what I call LSP (intra-linear space, i.e., WSP without the line-endings). So that will get added at some point.

Long winded per my usual, sorry.

...mike beckerle

________________________________
From: Costello, Roger L. <[email protected]>
Sent: Monday, April 1, 2019 2:47 PM
To: [email protected]
Subject: Question about parsing binary input containing strings separated by nulls

Hello DFDL community,

My binary input file contains:

    string null(s) string null(s) ….
The following DFDL schema correctly parses the input file:

    <xs:element name="input">
        <xs:complexType>
            <xs:sequence>
                <xs:element name="string" type="xs:string" maxOccurs="unbounded"
                            dfdl:lengthKind="pattern"
                            dfdl:lengthPattern="[\x00-\xFF]+?(?=\x00([^\x00]|$))"
                            dfdl:representation="text"
                            dfdl:encoding="ISO-8859-1"
                            dfdl:textTrimKind="padChar"
                            dfdl:textStringPadCharacter="%NUL;"
                            dfdl:textStringJustification="left"
                            dfdl:terminator="%NUL;"/>
            </xs:sequence>
        </xs:complexType>
    </xs:element>

But why do I need dfdl:lengthPattern?

Why can’t I simply state this: the input contains an unbounded number of strings, each string is padded by one or more nulls or ends at the end-of-file.

Why can’t I throw out dfdl:lengthPattern and set dfdl:lengthKind to “delimited”? Why doesn’t the following work correctly?

    <xs:element name="input">
        <xs:complexType>
            <xs:sequence>
                <xs:element name="string" type="xs:string" maxOccurs="unbounded"
                            dfdl:lengthKind="delimited"
                            dfdl:representation="text"
                            dfdl:encoding="ISO-8859-1"
                            dfdl:textTrimKind="padChar"
                            dfdl:textStringPadCharacter="%NUL;"
                            dfdl:textStringJustification="left"
                            dfdl:terminator="%NUL;"/>
            </xs:sequence>
        </xs:complexType>
    </xs:element>

/Roger
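
Note for readers of the thread: the order of steps described in the reply at the top (the pattern fixes the content region, the terminator is then matched as framing after it, and padding is trimmed from the content to get the value) can be traced with an ordinary regex engine. What follows is only a minimal Python sketch of those steps under that reading; it is not Daffodil and makes no claim about how Daffodil is implemented. The helper names parse_one and parse_all are hypothetical, and the sample input with a trailing NUL is an assumption chosen for illustration.

```python
import re

# The same regular expression as dfdl:lengthPattern in the schema.
LENGTH_PATTERN = re.compile(r"[\x00-\xFF]+?(?=\x00([^\x00]|$))")
NUL = "\x00"


def parse_one(data, pos=0):
    """Mimic one element parse: return (content, value, next_pos)."""
    # Step 1: the pattern alone decides where the content region ends.
    # The lookahead is not part of the match, so one NUL is left over.
    m = LENGTH_PATTERN.match(data, pos)
    if m is None:
        raise ValueError("lengthPattern did not match at offset %d" % pos)
    content = m.group(0)

    # Step 2: the terminator is framing, looked for AFTER the content.
    if not data.startswith(NUL, m.end()):
        raise ValueError("terminator not found at offset %d" % m.end())

    # Step 3: padding lives inside the content; with left justification
    # the pad characters are on the right, so trim them there.
    value = content.rstrip(NUL)
    return content, value, m.end() + len(NUL)


def parse_all(data):
    """Repeat parse_one until the input is used up (hypothetical driver)."""
    pos, results = 0, []
    while pos < len(data):
        content, value, pos = parse_one(data, pos)
        results.append((content, value))
    return results
```

For the input "Hello\x00\x00\x00World\x00", parse_all returns the pairs ("Hello\x00\x00", "Hello") and ("World", "World"). The first content region is 7 characters long while its value is 5 characters long, which is the same distinction the thread draws between content length and value length.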
