Re: Question about parsing binary input containing strings separated by nulls

Costello, Roger L. Tue, 02 Apr 2019 08:00:03 -0700

Thank you Mike and Brandon!

I'd like to follow up with a few specific questions about this:

<xs:element name="string" type="xs:string" maxOccurs="unbounded"
    dfdl:lengthKind="pattern"
    dfdl:lengthPattern="[\x00-\xFF]+?(?=\x00([^\x00]|$))"
    dfdl:representation="text"
    dfdl:encoding="ISO-8859-1"
    dfdl:textTrimKind="padChar"
    dfdl:textStringPadCharacter="%NUL;"
    dfdl:textStringJustification="left"
    dfdl:terminator="%NUL;"/>

Suppose the input is the following (I'll use 0 to represent the null symbol):

Hello000World...

I'd like to focus on consuming the first string (Hello).

The lengthPattern says to consume characters up to, but not including, the last 
null that precedes a non-null character (or the end of input). So, that results 
in this:

Hello00

The terminator says to consume characters up to the first null. So, that 
results in this:

Hello

So, the lengthPattern and the terminator result in different strings, right? 
Aren't they contradicting each other?
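
One way to check the lengthPattern step on its own is to run the same regular 
expression outside Daffodil. The following is a minimal Python sketch, not 
Daffodil's implementation: it assumes Python's re module interprets this 
particular pattern the same way Daffodil's regex engine does, and the sample 
bytes and the name content_region are made up for illustration. It only shows 
what the pattern matches; it says nothing about how the terminator or the 
padding properties are applied afterwards.

```python
import re

# The dfdl:lengthPattern from the schema above, applied to raw bytes
# (ISO-8859-1 maps each byte to exactly one character).
LENGTH_PATTERN = re.compile(rb"[\x00-\xFF]+?(?=\x00([^\x00]|$))")

# Hypothetical input: "Hello", three NULs, "World", one NUL.
data = b"Hello\x00\x00\x00World\x00"

# match() anchors at the start of the data, the way a length pattern
# is applied at the current parse position.
m = LENGTH_PATTERN.match(data)
content_region = m.group(0)

print(content_region)       # the bytes the pattern selects
print(len(content_region))  # how many bytes that is
```

In this sketch the match stops just before the last NUL of the run, so one NUL 
is still left in the input after the matched bytes.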

How do the padding properties (textTrimKind, textStringPadCharacter, 
textStringJustification) come into play?

Would you give me some intuition about how Daffodil uses these seemingly 
contradictory properties to parse the input, please?

/Roger




From: Beckerle, Mike <[email protected]>
Sent: Tuesday, April 2, 2019 9:46 AM
To: [email protected]
Subject: [EXT] Re: Question about parsing binary input containing strings 
separated by nulls


Roger,


When you say "the input contains an unbounded number of strings, each string is 
padded by one or more nulls or ends at the end-of-file" that sounds simple, but 
there are lots of ambiguities.

1) Are NULs also allowed at the end of the file, or is it strictly either/or?

2) If a value is the empty string, then when unparsed I cannot distinguish 
consecutive pad characters delimiting a single value from the representation of 
several empty strings, each with its own single pad character after it. Are 
empty strings even allowed? Are they syntactically illegal, or are they just 
invalid values?
(I think this is the same point Brandon made.)

3) You have omitted whether there is any way for a NUL to appear in the content 
via some escaping mechanism. In the absence of such a statement, one can assume 
you mean "no, it cannot", but consider that many/most delimited formats have 
some kind of escaping/quoting to allow delimiters to appear in the content. For 
that to even make sense, delimiters must be distinguished from padding; 
otherwise you'd have to escape each padding character, which... just isn't 
padding.

4) You are assuming a definition of the term "padding" that might seem 
intuitive to you, but doesn't match usage of the term in DFDL. Of course people 
have legitimately different intuitive definitions for these sorts of terms, so 
DFDL has to pick exactly what it means, and we hope it matches most people's 
intuition, or that it is learnable without too much trouble. The DFDL Working 
Group didn't invent DFDL's versions out of thin air. They were drawn from 
examples in existing format description mechanisms of data integration tools. 
I.e., they had workable precedent.

DFDL is very specific about the terms padding and delimiter. A delimiter is part 
of the framing which bounds the start or end of the content region. When 
lengthKind="delimited" the delimiter is used algorithmically to isolate the 
length of the content region. (When lengthKind is NOT delimited, the delimiter 
can still be used as a redundant marker of the start/end of the content region. 
Lots of data has this redundancy.)

Padding is about characters within the content region that are outside of the 
value region.

This is a simplification of the grammar rules in section 9 of the DFDL spec, 
but the grammar productions for DFDL-described data are roughly:

simpleElement = preFraming simpleContent postFraming
simpleContent = prePadding SimpleValue postPadding

SimpleValue is a terminal of the grammar. It is the region whose 
characters/bits are converted into the value.

There are two corresponding functions in the DFDL Expression language 
(sometimes called DPath since it's mostly like XPath): dfdl:valueLength(node, 
units) and dfdl:contentLength(node, units). The stored length in binary data is 
usually the content length, and is computed in terms of the value length, and 
per the grammar above, includes the length of any padding.

Delimiters are in the pre/postFraming along with alignment regions and some 
other details.
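
To make the regions concrete, here is a rough Python sketch of how the first 
element of Roger's example (Hello followed by three NULs) could be split up 
under the simplified grammar above. The function name split_regions and the 
byte values are hypothetical, and the sketch hard-codes left justification, a 
NUL pad character, and a single NUL terminator. It is a model of the grammar, 
not of Daffodil's actual parser.

```python
PAD = b"\x00"         # dfdl:textStringPadCharacter="%NUL;"
TERMINATOR = b"\x00"  # dfdl:terminator="%NUL;"

def split_regions(content: bytes, rest: bytes):
    """Split one element into value, padding, and terminator regions.

    content: bytes already isolated as the content region
             (for example by a length pattern).
    rest:    bytes that follow the content region.
    """
    # Left-justified text is padded on the right, so the value is the
    # content with trailing pad characters removed.
    value = content.rstrip(PAD)
    post_padding = content[len(value):]
    # The terminator is framing: it sits after the content region.
    if not rest.startswith(TERMINATOR):
        raise ValueError("terminator not found after content region")
    remaining = rest[len(TERMINATOR):]
    return value, post_padding, TERMINATOR, remaining

value, padding, terminator, remaining = split_regions(
    b"Hello\x00\x00", b"\x00World"
)

print(value, len(value))          # value region and its length
print(len(value) + len(padding))  # content length includes the padding
print(remaining)                  # where the next element would start
```

In this model the value length is 5 and the content length is 7, which is the 
distinction dfdl:valueLength and dfdl:contentLength draw.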

Note that use of both padding AND delimiters is very atypical of most data. 
Most data uses padding when the strings have a specified length (fixed, or 
computed by an expression), meaning there can be ZERO pad characters if the 
data value consumes all the space.

As for why you can't use dfdl:lengthKind="delimited" and somehow say "one or 
more NUL characters": this use case just isn't prevalent enough in textual data 
sets to be worth it, or didn't seem to be at the time DFDL was being 
formulated. The way of specifying delimiters in DFDL was intentionally kept 
simpler than full regular expressions, because users have terrible trouble with 
regular expressions. dfdl:lengthKind="pattern" lets you open this door and 
use full regular expressions.
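
The difference between a fixed one-character delimiter and "one or more NULs" 
can be seen with plain string splitting. This Python sketch is only an analogy 
under stated assumptions: bytes.split and re.split stand in for the two 
readings of the data, the sample bytes are invented, and neither call 
reproduces Daffodil's delimiter scanning.

```python
import re

# Hypothetical input: "Hello", three NULs, "World", one NUL.
data = b"Hello\x00\x00\x00World\x00"

# Reading 1: every single NUL ends a string, so a run of NULs
# implies empty strings between them.
single_nul = data.split(b"\x00")

# Reading 2: a run of one or more NULs ends a string, which needs a
# regular expression rather than a fixed delimiter.
nul_run = re.split(rb"\x00+", data)

print(single_nul)
print(nul_run)
```

The first reading produces empty strings for the extra NULs and the second 
does not. The same bytes support both readings, which is the ambiguity raised 
in point 2 above.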

The limited functionality of dfdl:lengthKind="delimited" with the character 
class entities WSP, WSP*, WSP+, NL, etc. is sufficient to express many common 
textual data formats.

These DFDL capabilities were adopted from examining the data-format description 
capabilities of a number of industry data integration tools. I won't claim that 
process is/was perfect but it was very rational and largely based on 
generalizing from existing practice. Some important format needs were missed 
and have since been added to DFDL (e.g., dfdl:bitOrder property came to DFDL 
quite late, but is needed for a bunch of formats.) We've also definitely found 
a need for what I call LSP (intra-linear space i.e., WSP without the 
line-endings). So that will get added at some point.


Long-winded, per my usual. Sorry.

...mike beckerle

________________________________
From: Costello, Roger L. <[email protected]>
Sent: Monday, April 1, 2019 2:47 PM
To: [email protected]
Subject: Question about parsing binary input containing strings separated by 
nulls


Hello DFDL community,



My binary input file contains: string null(s) string null(s) ....



The following DFDL schema correctly parses the input file:



<xs:element name="input">
    <xs:complexType>
        <xs:sequence>
            <xs:element name="string" type="xs:string" maxOccurs="unbounded"
                dfdl:lengthKind="pattern"
                dfdl:lengthPattern="[\x00-\xFF]+?(?=\x00([^\x00]|$))"
                dfdl:representation="text"
                dfdl:encoding="ISO-8859-1"
                dfdl:textTrimKind="padChar"
                dfdl:textStringPadCharacter="%NUL;"
                dfdl:textStringJustification="left"
                dfdl:terminator="%NUL;"/>
        </xs:sequence>
    </xs:complexType>
</xs:element>



But why do I need dfdl:lengthPattern?



Why can't I simply state this: the input contains an unbounded number of 
strings, each string is padded by one or more nulls or ends at the end-of-file.



Why can't I throw out dfdl:lengthPattern and set dfdl:lengthKind to 
"delimited"? Why doesn't the following work correctly?



<xs:element name="input">
    <xs:complexType>
        <xs:sequence>
            <xs:element name="string" type="xs:string" maxOccurs="unbounded"
                dfdl:lengthKind="delimited"
                dfdl:representation="text"
                dfdl:encoding="ISO-8859-1"
                dfdl:textTrimKind="padChar"
                dfdl:textStringPadCharacter="%NUL;"
                dfdl:textStringJustification="left"
                dfdl:terminator="%NUL;"/>
        </xs:sequence>
    </xs:complexType>
</xs:element>



/Roger
