Re: Question about parsing binary input containing strings separated by nulls

Beckerle, Mike Tue, 02 Apr 2019 11:58:28 -0700

Sure.


So your element has lengthKind 'pattern', so the length of the content region 
is determined by what matches the pattern.


hello00


The terminator is in the framing, which is, according to DFDL's data grammar, 
outside of the content region.


So the length pattern yields the string hello00: the match itself stops before 
the lookahead, which requires one NUL followed by a non-NUL (or the end of data).


Then the parser will start to look for the terminator AFTER all of that 
content. This should find the remaining NUL.


The content is then hello00, which is converted to a value. Since trimming is 
enabled (textTrimKind='padChar'), the trailing pad characters are trimmed off, 
leaving just hello, and since the type is xs:string, we're done.


Key to understanding this is that framing surrounds content, content surrounds 
value. Delimiters are part of the framing. Padding is part of the content.


The length determines the start and end of the content. Only when 
lengthKind='delimited' is the terminator used to determine the content, and 
even in that case, the terminator that is found in the data stream isn't part 
of the content, it is part of the framing that is after the content.
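
The order of operations described above can be traced with a small Python 
sketch. This is only an illustration under stated assumptions, not how Daffodil 
is implemented: the helper name parse_one is hypothetical, and Python's re 
module stands in for whatever regular-expression engine a DFDL processor uses.

```python
import re

NUL = "\x00"
# The lengthPattern from the schema, transcribed for Python's re module.
LENGTH_PATTERN = re.compile(r"[\x00-\xFF]+?(?=\x00([^\x00]|$))")


def parse_one(data, pos):
    """Parse one element starting at pos; return (content, value, next_pos).

    Hypothetical helper mimicking the order: content, terminator, trim.
    """
    # 1. lengthKind='pattern': the match (excluding the lookahead) is the content.
    m = LENGTH_PATTERN.match(data, pos)
    if m is None:
        raise ValueError("length pattern did not match")
    content = m.group(0)
    pos = m.end()
    # 2. The terminator is framing, so it is looked for after the content.
    if not data.startswith(NUL, pos):
        raise ValueError("terminator not found")
    pos += len(NUL)
    # 3. Left-justified text with textTrimKind='padChar': trim pads on the right.
    value = content.rstrip(NUL)
    return content, value, pos


content, value, next_pos = parse_one("Hello\x00\x00\x00World\x00", 0)
print(repr(content), repr(value), next_pos)
```

For the input Hello000World0 this takes Hello00 as the content, consumes the 
third NUL as the terminator, and trims the value to Hello, leaving the next 
element to start at World.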










________________________________
From: Costello, Roger L. <[email protected]>
Sent: Tuesday, April 2, 2019 10:59:48 AM
To: [email protected]
Subject: Re: Question about parsing binary input containing strings separated 
by nulls


Thank you Mike and Brandon!



I’d like to follow up with a few specific questions about this element 
declaration, please:



<xs:element name="string" type="xs:string" maxOccurs="unbounded"
    dfdl:lengthKind="pattern"
    dfdl:lengthPattern="[\x00-\xFF]+?(?=\x00([^\x00]|$))"
    dfdl:representation="text"
    dfdl:encoding="ISO-8859-1"
    dfdl:textTrimKind="padChar"
    dfdl:textStringPadCharacter="%NUL;"
    dfdl:textStringJustification="left"
    dfdl:terminator="%NUL;"/>


Suppose the input is the following (I’ll use 0 to represent the null symbol):



Hello000World…



I’d like to focus on consuming the first string (Hello).



The lengthPattern says to consume characters up to the last null that precedes 
a non-null character. So, that results in this:



Hello00



The terminator says to consume characters up to the first null. So, that 
results in this:



Hello



So, the lengthPattern and the terminator result in different strings, right? 
Aren’t they contradicting each other?



How do the padding properties (textTrimKind, textStringPadCharacter, 
textStringJustification) come into play?



Would you give me some intuition about how Daffodil uses these seemingly 
contradictory properties to parse the input, please?



/Roger









From: Beckerle, Mike <[email protected]>
Sent: Tuesday, April 2, 2019 9:46 AM
To: [email protected]
Subject: [EXT] Re: Question about parsing binary input containing strings 
separated by nulls



Roger,



When you say "the input contains an unbounded number of strings, each string is 
padded by one or more nulls or ends at the end-of-file" that sounds simple, but 
there are lots of ambiguities.



1) Are NULs also allowed at the end of file, or is it strictly either/or?



2) If a value is the empty string, then when unparsed I cannot distinguish 
consecutive pad characters delimiting a single value from the representation of 
several empty strings, each with its own single pad character after it. Are 
empty strings even allowed? Are they syntactically illegal, or are they just 
invalid values?

(I think this is the same point Brandon made.)



3) You have omitted whether there is any way for a NUL to appear in the content 
via some escaping mechanism. In the absence of such a statement, one can assume 
you mean "no, it cannot". But consider that many, if not most, delimited 
formats have some kind of escaping/quoting that allows delimiters to appear in 
the content. For that to even make sense, delimiters must be distinguished from 
padding; otherwise you'd have to escape each padding character, which .... just 
isn't padding.



4) You are assuming a definition of the term "padding" that might seem 
intuitive to you, but doesn't match usage of the term in DFDL. Of course people 
have legitimately different intuitive definitions for these sorts of terms, so 
DFDL has to pick exactly what it means, and we hope it matches most people's 
intuition, or that it is learnable without too much trouble. The DFDL Working 
Group didn't invent DFDL's versions out of thin air. They were drawn from 
examples in existing format description mechanisms of data integration tools. 
I.e., they had workable precedent.
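
The ambiguity in point 2 can be made concrete with a short Python sketch. The 
unparse helper is hypothetical: it simply writes each value, then some number 
of NUL pad characters, then a single NUL terminator. It is not Daffodil's 
unparser, just an illustration of why the bytes alone are ambiguous.

```python
NUL = "\x00"


def unparse(values, pad_counts):
    """Write each value, then its NUL padding, then one NUL terminator."""
    return "".join(v + NUL * pad + NUL for v, pad in zip(values, pad_counts))


# Two values, the first followed by two pad characters ...
padded = unparse(["Hello", "World"], [2, 0])
# ... versus four values, two of them empty, with no padding at all.
with_empties = unparse(["Hello", "", "", "World"], [0, 0, 0, 0])

print(padded == with_empties)  # True: the output is byte-for-byte identical
```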



DFDL is very specific about the terms padding and delimiter. A delimiter is part 
of the framing which bounds the start or end of the content region. When 
lengthKind="delimited" the delimiter is used algorithmically to isolate the 
length of the content region. (When lengthKind is NOT delimited, the delimiter 
can still be used as a redundant marker of the start/end of the content region. 
Lots of data has this redundancy.)



Padding is about characters within the content region that are outside of the 
value region.



This is a simplification of the grammar rules in section 9 of the DFDL spec, 
but the grammar productions for DFDL-described data are roughly:



simpleElement = preFraming simpleContent postFraming

simpleContent = prePadding SimpleValue postPadding



SimpleValue is a terminal of the grammar. It is the region whose 
characters/bits are converted into the value.



There are two corresponding functions in the DFDL Expression language 
(sometimes called DPath since it's mostly like XPath): dfdl:valueLength(node, 
units) and dfdl:contentLength(node, units). The stored length in binary data is 
usually the content length, and is computed in terms of the value length, and 
per the grammar above, includes the length of any padding.



Delimiters are in the pre/postFraming along with alignment regions and some 
other details.
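
As a rough illustration of how the regions nest, here is a Python sketch of the 
simplified grammar above. The SimpleElement class and its property names are 
hypothetical; they only echo dfdl:valueLength and dfdl:contentLength and are 
not the DFDL expression language itself.

```python
from dataclasses import dataclass


@dataclass
class SimpleElement:
    """Regions of one element, following the simplified grammar above."""
    pre_framing: str   # e.g. an initiator
    pre_padding: str
    value: str         # SimpleValue: the characters converted to the value
    post_padding: str
    post_framing: str  # e.g. a terminator

    @property
    def content(self):
        # simpleContent = prePadding SimpleValue postPadding
        return self.pre_padding + self.value + self.post_padding

    @property
    def value_length(self):
        return len(self.value)

    @property
    def content_length(self):
        # Includes padding, excludes framing.
        return len(self.content)


# "Hello" padded with two NULs and terminated by a third NUL.
e = SimpleElement("", "", "Hello", "\x00\x00", "\x00")
print(e.value_length, e.content_length)  # 5 7
```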



Note that use of both padding AND delimiters is very atypical of most data. 
Most data uses padding when the strings have a specified (fixed or expression) 
length, meaning there can be ZERO pad characters if the data value consumes all 
the space.
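
A minimal Python sketch of that more typical case, assuming a fixed field 
length of 8 characters and NUL as the pad character (the parse_fixed helper and 
the sample data are made up for illustration):

```python
def parse_fixed(data, pos, length, pad="\x00"):
    """Take exactly `length` characters as content, then trim right padding."""
    content = data[pos:pos + length]
    if len(content) < length:
        raise ValueError("not enough data for fixed-length field")
    return content.rstrip(pad), pos + length


data = "Hello\x00\x00\x00WorldABC"
first, pos = parse_fixed(data, 0, 8)     # three pad characters trimmed
second, pos = parse_fixed(data, pos, 8)  # value fills the field: zero padding
print(repr(first), repr(second))
```

No delimiter is needed here because the length alone fixes where the content 
ends.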



As for why you can't use dfdl:lengthKind="delimited" and somehow say "one or 
more NUL characters": this use case just isn't prevalent enough in textual data 
sets to be worth it, or at least didn't seem to be at the time DFDL was being 
formulated. The way of specifying delimiters in DFDL was intentionally kept 
simpler than full regular expressions, because users have terrible trouble with 
regular expressions. The dfdl:lengthKind='pattern' lets you open this door and 
use full regular expressions.



The limited functionality of dfdl:lengthKind="delimited" with the character 
class entities WSP, WSP*, WSP+, NL, etc. is sufficient to express many common 
textual data formats.



These DFDL capabilities were adopted from examining the data-format description 
capabilities of a number of industry data integration tools. I won't claim that 
process is/was perfect but it was very rational and largely based on 
generalizing from existing practice. Some important format needs were missed 
and have since been added to DFDL (e.g., dfdl:bitOrder property came to DFDL 
quite late, but is needed for a bunch of formats.) We've also definitely found 
a need for what I call LSP (intra-linear space i.e., WSP without the 
line-endings). So that will get added at some point.





Long winded per my usual, sorry.



...mike beckerle



________________________________

From: Costello, Roger L. <[email protected]>
Sent: Monday, April 1, 2019 2:47 PM
To: [email protected]
Subject: Question about parsing binary input containing strings separated by 
nulls



Hello DFDL community,



My binary input file contains: string null(s) string null(s) ….



The following DFDL schema correctly parses the input file:



<xs:element name="input">
    <xs:complexType>
        <xs:sequence>
            <xs:element name="string" type="xs:string" maxOccurs="unbounded"
                dfdl:lengthKind="pattern"
                dfdl:lengthPattern="[\x00-\xFF]+?(?=\x00([^\x00]|$))"
                dfdl:representation="text"
                dfdl:encoding="ISO-8859-1"
                dfdl:textTrimKind="padChar"
                dfdl:textStringPadCharacter="%NUL;"
                dfdl:textStringJustification="left"
                dfdl:terminator="%NUL;"/>
        </xs:sequence>
    </xs:complexType>
</xs:element>



But why do I need dfdl:lengthPattern?



Why can’t I simply state this: the input contains an unbounded number of 
strings, each string is padded by one or more nulls or ends at the end-of-file.



Why can’t I throw out dfdl:lengthPattern and set dfdl:lengthKind to 
“delimited”? Why doesn’t the following work correctly?



<xs:element name="input">
    <xs:complexType>
        <xs:sequence>
            <xs:element name="string" type="xs:string" maxOccurs="unbounded"
                dfdl:lengthKind="delimited"
                dfdl:representation="text"
                dfdl:encoding="ISO-8859-1"
                dfdl:textTrimKind="padChar"
                dfdl:textStringPadCharacter="%NUL;"
                dfdl:textStringJustification="left"
                dfdl:terminator="%NUL;"/>
        </xs:sequence>
    </xs:complexType>
</xs:element>



/Roger
