Thank you Mike and Brandon!
I'd like to follow up with a few specific questions about this element
declaration:
<xs:element name="string" type="xs:string" maxOccurs="unbounded"
            dfdl:lengthKind="pattern"
            dfdl:lengthPattern="[\x00-\xFF]+?(?=\x00([^\x00]|$))"
            dfdl:representation="text"
            dfdl:encoding="ISO-8859-1"
            dfdl:textTrimKind="padChar"
            dfdl:textStringPadCharacter="%NUL;"
            dfdl:textStringJustification="left"
            dfdl:terminator="%NUL;"/>
Suppose the input is the following (I'll use 0 to represent the null symbol):
Hello000World...
I'd like to focus on consuming the first string (Hello).
The lengthPattern says to consume characters up to the last null that precedes
a non-null character. So, that results in this:
Hello00
The terminator says to consume characters up to the first null. So, that
results in this:
Hello
So, the lengthPattern and the terminator result in different strings, right?
Aren't they contradicting each other?
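One way to check the first half of this reasoning is to run the lengthPattern by itself on the sample input. The sketch below does that in Python rather than in Daffodil. It assumes Python's re module treats this pattern the same way as the regex engine Daffodil uses, and the trailing NUL after World is added here only to complete the sample.

```python
import re

# The dfdl:lengthPattern from the schema, written as a Python regex.
# Assumption: Python's `re` behaves like Daffodil's regex engine here.
LENGTH_PATTERN = re.compile(r"[\x00-\xFF]+?(?=\x00([^\x00]|$))")

# "Hello000World" from the example, with 0 written as a real NUL.
data = "Hello\x00\x00\x00World\x00"

match = LENGTH_PATTERN.match(data)
content = match.group(0)    # what the pattern alone selects
rest = data[match.end():]   # what is left over after the pattern

print(repr(content))        # 'Hello\x00\x00'
print(repr(rest))           # '\x00World\x00'
```

Under that assumption the pattern selects Hello plus two NULs and leaves exactly one NUL unconsumed in front of World.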
How do the padding properties (textTrimKind, textStringPadCharacter,
textStringJustification) come into play?
Would you give me some intuition about how Daffodil uses these seemingly
contradictory properties to parse the input, please?
/Roger
From: Beckerle, Mike <[email protected]>
Sent: Tuesday, April 2, 2019 9:46 AM
To: [email protected]
Subject: [EXT] Re: Question about parsing binary input containing strings
separated by nulls
Roger,
When you say "the input contains an unbounded number of strings, each string is
padded by one or more nulls or ends at the end-of-file" that sounds simple, but
there are lots of ambiguities.
1) Are NULs also allowed at the end of file, or is it strictly either/or?
2) If a value is the empty string, then once it is unparsed I cannot distinguish
consecutive pad characters delimiting a single value from the representation of
several empty strings, each with its own single pad character after it. Are
empty strings even allowed? Are they syntactically illegal, or are they just
invalid values?
(I think this is the same point Brandon made.)
3) You have omitted whether there is any way for a NUL to appear in the content
via some escaping mechanism. In the absence of such a statement, one can assume
you mean "no, it cannot", but consider that many/most delimited formats have
some kind of escaping/quoting to allow delimiters to appear in the content.
Delimiters must be distinguished from padding for that to even make sense;
otherwise you'd have to escape each padding character, which ... just isn't
padding.
4) You are assuming a definition of the term "padding" that might seem
intuitive to you, but doesn't match usage of the term in DFDL. Of course people
have legitimately different intuitive definitions for these sorts of terms, so
DFDL has to pick exactly what it means, and we hope it matches most people's
intuition, or that it is learnable without too much trouble. The DFDL Working
Group didn't invent DFDL's versions out of thin air. They were drawn from
examples in existing format description mechanisms of data integration tools.
I.e., they had workable precedent.
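Point 2 above can be illustrated with a small Python sketch. It is an emulation for illustration only, not Daffodil code, and the sample data and variable names are made up for this example. The same characters are read two ways: once treating every single NUL as the terminator of one element, and once treating a run of NULs as a single boundary between non-empty strings.

```python
import re

data = "Hello\x00\x00\x00World\x00"

# Reading A: every single NUL terminates one element, so consecutive
# NULs imply empty strings in between.
reading_a = data.split("\x00")[:-1]   # drop the piece after the final NUL

# Reading B: a run of one or more NULs separates two non-empty strings.
reading_b = [s for s in re.split(r"\x00+", data) if s]

print(reading_a)   # ['Hello', '', '', 'World']
print(reading_b)   # ['Hello', 'World']

# Writing reading A back out with one NUL per element reproduces the
# original data exactly, so the data alone cannot rule out reading A.
rewritten_a = "".join(s + "\x00" for s in reading_a)
print(rewritten_a == data)   # True
```

Both readings are consistent with the same input, which is why the format description has to say whether empty strings are allowed.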
DFDL is very specific about the terms padding and delimiter. A delimiter is part
of the framing which bounds the start or end of the content region. When
lengthKind="delimited" the delimiter is used algorithmically to isolate the
length of the content region. (When lengthKind is NOT delimited, the delimiter
can still be used as a redundant marker of the start/end of the content region.
Lots of data has this redundancy.)
Padding is about characters within the content region that are outside of the
value region.
This is a simplification of the grammar rules in section 9 of the DFDL spec,
but the grammar productions for DFDL-described data are roughly:
simpleElement = preFraming simpleContent postFraming
simpleContent = prePadding SimpleValue postPadding
SimpleValue is a terminal of the grammar. It is the region whose
characters/bits are converted into the value.
There are two corresponding functions in the DFDL Expression language
(sometimes called DPath since it's mostly like XPath): dfdl:valueLength(node,
units) and dfdl:contentLength(node, units). The stored length in binary data is
usually the content length, and is computed in terms of the value length, and
per the grammar above, includes the length of any padding.
Delimiters are in the pre/postFraming along with alignment regions and some
other details.
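To make these regions concrete, here is a minimal Python sketch that splits the sample data from this thread into value, content, and terminator. It is an emulation, not Daffodil's implementation. split_element is a hypothetical helper, and len() stands in for dfdl:valueLength and dfdl:contentLength measured in characters. The sketch assumes the schema's lengthPattern, a single NUL terminator, and trimming of NUL pad characters on the right (left justification).

```python
import re

LENGTH_PATTERN = re.compile(r"[\x00-\xFF]+?(?=\x00([^\x00]|$))")
NUL = "\x00"


def split_element(data, pos):
    """Hypothetical helper: return (value, content, next_pos) or None."""
    m = LENGTH_PATTERN.match(data, pos)
    if m is None:
        return None                       # no content found: stop
    content = m.group(0)                  # simpleContent = value + postPadding
    if not data.startswith(NUL, m.end()):
        return None                       # postFraming (terminator) missing
    value = content.rstrip(NUL)           # left-justified: trim pads on the right
    return value, content, m.end() + len(NUL)


data = "Hello\x00\x00\x00World\x00"
pos, rows = 0, []
while (elem := split_element(data, pos)) is not None:
    value, content, pos = elem
    rows.append((value, len(value), len(content)))

for value, value_len, content_len in rows:
    print(value, value_len, content_len)
# Hello 5 7
# World 5 5
```

In this emulation the first element has a value length of 5 and a content length of 7 (two pad NULs), and the third NUL is consumed as the terminator, outside the content.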
Note that using both padding AND delimiters together is atypical for most data.
Most data uses padding when the strings have a specified (fixed or expression)
length, meaning there can be ZERO pad characters if the data value consumes all
the space.
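A small Python sketch of that more typical case, a fixed-length field padded on the right with NULs. The field length of 8 and the helper names unparse_fixed and parse_fixed are made up for illustration; this is not how Daffodil implements it.

```python
FIELD_LENGTH = 8   # hypothetical fixed length, in characters
PAD = "\x00"


def unparse_fixed(value, length=FIELD_LENGTH):
    """Left-justify the value and pad on the right up to the fixed length."""
    if len(value) > length:
        raise ValueError("value does not fit in the field")
    return value.ljust(length, PAD)


def parse_fixed(field):
    """Trim pad characters from the right to recover the value."""
    return field.rstrip(PAD)


print(repr(unparse_fixed("Hello")))      # 'Hello\x00\x00\x00'  (three pads)
print(repr(unparse_fixed("HelloWor")))   # 'HelloWor'           (zero pads)
print(parse_fixed(unparse_fixed("Hello")))   # Hello
```

No delimiter is needed here because the length is known in advance; the pad characters only fill whatever space the value leaves unused.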
As for why you can't use dfdl:lengthKind="delimited" and somehow say "one or
more NUL characters": this use case just isn't prevalent enough in textual data
sets to be worth it, or at least it didn't seem to be at the time DFDL was being
formulated. The way of specifying delimiters in DFDL was intentionally kept
simpler than full regular expressions, because users have terrible trouble with
regular expressions. dfdl:lengthKind="pattern" lets you open this door and
use full regular expressions.
The limited functionality of dfdl:lengthKind="delimited" with the character
class entities WSP, WSP*, WSP+, NL, etc. is sufficient to express many common
textual data formats.
These DFDL capabilities were adopted from examining the data-format description
capabilities of a number of industry data integration tools. I won't claim that
process is/was perfect but it was very rational and largely based on
generalizing from existing practice. Some important format needs were missed
and have since been added to DFDL (e.g., the dfdl:bitOrder property came to
DFDL quite late, but is needed for a bunch of formats). We've also definitely
found a need for what I call LSP (intra-linear space, i.e., WSP without the
line endings). So that will get added at some point.
Long winded per my usual, sorry.
...mike beckerle
________________________________
From: Costello, Roger L. <[email protected]>
Sent: Monday, April 1, 2019 2:47 PM
To: [email protected]
Subject: Question about parsing binary input containing strings separated by
nulls
Hello DFDL community,
My binary input file contains: string null(s) string null(s) ....
The following DFDL schema correctly parses the input file:
<xs:element name="input">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="string" type="xs:string" maxOccurs="unbounded"
                  dfdl:lengthKind="pattern"
                  dfdl:lengthPattern="[\x00-\xFF]+?(?=\x00([^\x00]|$))"
                  dfdl:representation="text"
                  dfdl:encoding="ISO-8859-1"
                  dfdl:textTrimKind="padChar"
                  dfdl:textStringPadCharacter="%NUL;"
                  dfdl:textStringJustification="left"
                  dfdl:terminator="%NUL;"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>
But why do I need dfdl:lengthPattern?
Why can't I simply state this: the input contains an unbounded number of
strings, each string is padded by one or more nulls or ends at the end-of-file.
Why can't I throw out dfdl:lengthPattern and set dfdl:lengthKind to
"delimited"? Why doesn't the following work correctly?
<xs:element name="input">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="string" type="xs:string" maxOccurs="unbounded"
                  dfdl:lengthKind="delimited"
                  dfdl:representation="text"
                  dfdl:encoding="ISO-8859-1"
                  dfdl:textTrimKind="padChar"
                  dfdl:textStringPadCharacter="%NUL;"
                  dfdl:textStringJustification="left"
                  dfdl:terminator="%NUL;"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>
/Roger