It is interesting to try to do this sort of minimalist thing. However, I'm not
sure that making DFDL easier for sophisticated programmers who have used
lex/yacc/bison/antlr is the right goal.
I want DFDL to be easier for people who couldn't possibly understand or use
those things, which are, btw, far more powerful than DFDL, which is more
limited entirely on purpose.
I do understand why you would introduce DFDL with only strings as the sole data
type, because then you can get through the concepts of how the language hangs
together without getting bogged down in all the properties that are type
specific.
Those could then be studied separately when needed.
The basic concepts of delimited and fixed-length text are relatively easy to
understand and use.
But the concept of "length" is very central to data formats, and that implies
the concept of integers.
I would suggest:
1) add type xs:integer also - only text, base 10, standard.
2) drop dfdl:lengthKind='pattern' and regex.
3) add dfdl:outputValueCalc
My rationale follows.
You said in your description of the text of the runway width, and I quote: "The
value of width is 0-999 and is right-justified in a 4-character field". So you
are talking right there about integers, and about text justification in a
fixed-length environment.
Why make the user disguise all that in regular-expression complexity?
Why not use exactly the concepts and terms you used in your sentence: integers
with numeric range, and vocabulary like "right justified"? Here are the
properties and facets needed:
type="xs:integer"
dfdl:lengthKind="explicit"
dfdl:length="4"
dfdl:textNumberJustification="right"
xs:minInclusive value="0"
xs:maxInclusive value="999"
These seem pretty well motivated.
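To make the intent of those properties concrete, here is a minimal Python sketch of what a 4-character, right-justified, base-10 integer field with a 0-999 range means for parsing and unparsing. The function names parse_width and unparse_width are hypothetical, and this is only an illustration of the concept, not how Daffodil implements it. One assumption to note: the sketch accepts leading zeros (e.g. " 012"), which the regex-based version would reject.

```python
def parse_width(field: str, length: int = 4, lo: int = 0, hi: int = 999) -> int:
    """Parse a right-justified, space-padded base-10 integer from a fixed-length field."""
    if len(field) != length:
        raise ValueError(f"expected {length} characters, got {len(field)}")
    digits = field.lstrip(" ")  # right-justified means the padding is on the left
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"not a base-10 integer: {field!r}")
    value = int(digits)
    if not lo <= value <= hi:  # corresponds to minInclusive / maxInclusive
        raise ValueError(f"{value} is outside {lo}..{hi}")
    return value


def unparse_width(value: int, length: int = 4, lo: int = 0, hi: int = 999) -> str:
    """Write the integer back out, right-justified in a fixed-length field."""
    if not lo <= value <= hi:
        raise ValueError(f"{value} is outside {lo}..{hi}")
    return str(value).rjust(length)
```

Each check in the sketch maps onto one of the properties or facets listed above, which is the point: the description of the field and its declaration use the same vocabulary.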
btw: I still don't understand the regex you created for runway width. Why not:
"\ {1,3}(?:[1-9]\d\d|[1-9]\d|\d)"
Professional programmers pretty much universally share the experience of
finding regular expressions troublesome and difficult to use in almost any
context. (See countless web articles like:
http://www.ilian.io/the-road-to-hell-is-paved-with-regular-expressions/) They
are a useful tool that is hard to use in practice.
DFDL is supposed to be much easier than regular expressions. I believe it can
be easier if taught with the proper elaboration of concepts in the right order.
Well, the above is my rant about regular expressions.
As examples of things that cannot be expressed in your minimal DFDL: anything
with stored length or count information. E.g., this data:
5
foo
bar
baz
quux
blah
That 5 is the count of how many items follow. You can't unparse this without
dfdl:outputValueCalc to lay down the 5 by counting that the array of elements
has length 5 at unparse time, i.e.,
dfdl:outputValueCalc='{ fn:count(../theArray) }'
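As a rough illustration of why the count has to be computed at unparse time, here is a small Python sketch. It assumes the items are newline-terminated as in the sample data; the function names parse_counted and unparse_counted are hypothetical.

```python
def parse_counted(text: str) -> list[str]:
    """Read a leading count line, then that many newline-terminated items."""
    lines = text.split("\n")
    count = int(lines[0])
    items = lines[1:1 + count]
    if len(items) != count:
        raise ValueError(f"expected {count} items, found {len(items)}")
    return items


def unparse_counted(items: list[str]) -> str:
    """Write the count first; it is derived from the data, never supplied by the caller."""
    # This plays the role of dfdl:outputValueCalc='{ fn:count(../theArray) }'.
    return "\n".join([str(len(items)), *items]) + "\n"
```

If the application removes "blah" from the list before unparsing, the leading number must become 4. Without a computed value, the application itself would have to know to fix up the count.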
Strings with prefix lengths also cannot be expressed without
lengthKind='explicit' (and outputValueCalc for unparsing).
For example, this data is two fixed-length 8-digit numbers followed by a
variable-length string with a 2-digit stored length:
123456781234567807abcdefg
Another example would be an 80-character record containing two fixed-length
8-digit numbers followed by a variable-length string with a 2-digit stored
length, like this:
123456781234567807abcdefg*******************************************************
In that case the 07 says what part of the available length is actively used for
data, but the records are always the full 80 bytes/chars. This is very common.
You have to have something like dfdl:outputValueCalc='{
dfdl:valueLength(../theString) }' or you can't unparse this in general.
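Here is a minimal Python sketch of that 80-character record, again only to illustrate the idea. It assumes '*' is the fill character, as in the sample line, and the function names parse_record and unparse_record are hypothetical.

```python
RECORD_LEN = 80
HEADER_LEN = 8 + 8 + 2  # two 8-digit numbers plus the 2-digit stored length


def parse_record(rec: str) -> tuple[int, int, str]:
    """Split a fixed 80-character record; the stored length says how much text is live."""
    if len(rec) != RECORD_LEN:
        raise ValueError(f"expected {RECORD_LEN} characters, got {len(rec)}")
    first, second = int(rec[0:8]), int(rec[8:16])
    used = int(rec[16:18])
    if HEADER_LEN + used > RECORD_LEN:
        raise ValueError(f"stored length {used} does not fit in the record")
    return first, second, rec[HEADER_LEN:HEADER_LEN + used]


def unparse_record(first: int, second: int, text: str, pad: str = "*") -> str:
    """Rebuild the record; the stored length is computed from the string itself."""
    if not (0 <= first <= 99_999_999 and 0 <= second <= 99_999_999):
        raise ValueError("numbers must fit in 8 digits")
    if len(text) > RECORD_LEN - HEADER_LEN:
        raise ValueError("string does not fit in the record")
    # len(text) plays the role of dfdl:valueLength(../theString).
    body = f"{first:08d}{second:08d}{len(text):02d}{text}"
    return body.ljust(RECORD_LEN, pad)
```

Changing the string from "abcdefg" to "abc" must change the stored 07 to 03 while the record stays 80 characters long, which is exactly the part the application should not have to know about.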
In the narrow niche of cybersecurity data scanning, if you can restrict the
processing to things that never change the length nor count of anything, then
perhaps you don't need dfdl:outputValueCalc. But in general it is needed to
avoid application code having to know the intricate details of a data format.
-mikeb
________________________________
From: Roger L Costello <[email protected]>
Sent: Monday, August 2, 2021 1:47 PM
To: [email protected] <[email protected]>
Subject: Minimalist DFDL
Hi Folks,
The learning curve for DFDL is long and steep but I have found a tiny subset of
DFDL that can be learned in less than a day and has (I believe) all the power
of Full DFDL. I call the subset Minimalist DFDL. If you have experience with
other parser generators such as lex/yacc, flex/bison, or ANTLR, then you will
find their ideas directly apply to Minimalist DFDL. In other words, the
learning curve drops way down.
The following discussion applies only to text data formats. I haven’t thought
about a Minimalist DFDL for binary data formats.
The first key point in Minimalist DFDL is that there is only one datatype:
string. There are no integers, dates, Booleans, decimals, etc. What that means
is we can ignore all their properties.
The second key point is that every data item can be specified with a regular
expression (regex).
Let’s jump right in and look at an example. The example illustrates (nearly)
all the DFDL properties you need to know.
<xs:element name="Runway" dfdl:terminator="--">
<xs:complexType>
<xs:sequence dfdl:separator="/" dfdl:separatorPosition="infix">
<xs:element name="RunwayWidth" type="xs:string"
dfdl:lengthKind="pattern"
dfdl:lengthPattern="[ ]*(99[0-9]|9[0-8][0-9]|[1-8][0-9][0-9]|[1-9][0-9]|[0-9])">
<xs:annotation>
<xs:appinfo source="http://www.ogf.org/dfdl/">
<dfdl:assert test="{ fn:string-length(.) eq 4 }"/>
</xs:appinfo>
</xs:annotation>
</xs:element>
<xs:element name="RunwayComposition" type="xs:string"
dfdl:lengthKind="pattern"
dfdl:lengthPattern="(ASPHALT|BRICK|CLAY|CONCR|GRASS|GRAVEL)[ ]*">
<xs:annotation>
<xs:appinfo source="http://www.ogf.org/dfdl/">
<dfdl:assert test="{ fn:string-length(.) eq 8 }"/>
</xs:appinfo>
</xs:annotation>
</xs:element>
</xs:sequence>
</xs:complexType>
</xs:element>
In my example input documents contain data about a runway: the width of a
runway followed by the composition of the runway. The value of width is 0-999
and is right-justified in a 4-character field. 0-999 is specified by this regex:
[0-9]|[1-9][0-9]|[1-8][0-9][0-9]|9[0-8][0-9]|99[0-9]
Since we want the width value right justified (i.e., spaces precede the value),
we need to prepend this to the regex:
[ ]*
yielding this regex:
[ ]*([0-9]|[1-9][0-9]|[1-8][0-9][0-9]|9[0-8][0-9]|99[0-9])
There is a bug in Daffodil which prevents that regex from working. However, I
found that by rotating the parts of the regex that describe 0-999, with the
part describing the highest value (99[0-9]) first and the part describing the
smallest value ([0-9]) last, then Daffodil works fine. So the regex is this:
[ ]*(99[0-9]|9[0-8][0-9]|[1-8][0-9][0-9]|[1-9][0-9]|[0-9])
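The effect of reordering the alternatives can be reproduced outside Daffodil. The Python sketch below uses re.match, which matches a prefix of the input and does not require the pattern to consume everything. In a backtracking regex engine, alternation is ordered: the first alternative that succeeds wins, even if a later one would match more text. Whether this is what happens inside Daffodil is an assumption here, not something established in this thread; the helper name matched_prefix is hypothetical.

```python
import re

# Same alternatives, two orderings: smallest-first versus largest-first.
SMALL_FIRST = re.compile(r"[ ]*([0-9]|[1-9][0-9]|[1-8][0-9][0-9]|9[0-8][0-9]|99[0-9])")
LARGE_FIRST = re.compile(r"[ ]*(99[0-9]|9[0-8][0-9]|[1-8][0-9][0-9]|[1-9][0-9]|[0-9])")


def matched_prefix(pattern: re.Pattern, data: str) -> str:
    """Return the text the pattern matches at the start of data ('' if none)."""
    m = pattern.match(data)
    return m.group(0) if m else ""


data = " 123/ASPHALT --"
print(repr(matched_prefix(SMALL_FIRST, data)))  # ' 1'    (stops after one digit)
print(repr(matched_prefix(LARGE_FIRST, data)))  # ' 123'  (the full 4-character field)
```

With the smallest-first ordering the match is only 2 characters long, so a length-4 assertion on the element would fail. With the largest-first ordering the match covers the whole field.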
An oddity of DFDL is that zero-length strings match regexes. We don’t want
that. To prevent that, add a dfdl:assert containing an XPath expression which
says the string length (remember, everything is a string) must be greater than
0:
<xs:annotation>
<xs:appinfo source="http://www.ogf.org/dfdl/">
<dfdl:assert test="{ fn:string-length(.) gt 0 }"/>
</xs:appinfo>
</xs:annotation>
However, we can do better than that. We know the length must be 4, so let’s
have the XPath expression state that the string length is 4:
<xs:annotation>
<xs:appinfo source="http://www.ogf.org/dfdl/">
<dfdl:assert test="{ fn:string-length(.) eq 4 }"/>
</xs:appinfo>
</xs:annotation>
Next, runway composition. It has an enumeration list of values and is
left-justified in an 8-character field. The regex is trivial (be sure to put
[ ]* at the end to left-justify the enumeration value):
(ASPHALT|BRICK|CLAY|CONCR|GRASS|GRAVEL)[ ]*
Again, we add dfdl:assert to prevent zero-length strings from matching:
<xs:annotation>
<xs:appinfo source="http://www.ogf.org/dfdl/">
<dfdl:assert test="{ fn:string-length(.) eq 8 }"/>
</xs:appinfo>
</xs:annotation>
The data for runway width and runway composition are separated by a slash. If
we think of input data as a sentence, then slash is punctuation. We need a way
to express punctuation, and DFDL does a good job with that via the separator,
initiator, and terminator properties.
“What about nil values? Don’t we need the DFDL properties associated with
nillable?” No, we don’t. Nil values can be easily expressed in the regex. For
example, suppose that when there is no runway width data then the field must
contain a hyphen (with spaces to create a field with 4 characters). That is
easily incorporated into the regex using a regex choice:
([ ]*(99[0-9]|9[0-8][0-9]|[1-8][0-9][0-9]|[1-9][0-9]|[0-9]))| [ ]*\-[ ]*
“What about escaping values? Don’t we need the DFDL properties associated with
escapes?” No, we don’t. Escaping is something that was resolved long ago with
regexes.
“Isn’t it purer to treat numbers as numbers, dates as dates, etc. rather than
treating everything as strings?” Purer? What does that mean? It is a
meaningless term. Does Minimalist DFDL get the job done (parsing and
unparsing)? If so, that’s all that matters. I’ll take simplicity over purity
any day.
“Does Minimalist DFDL work with both parsing and unparsing?” Yes. Beautifully.
“What about hidden groups, your example doesn’t have that; are you saying that
hidden groups aren’t needed?” No, hidden groups are useful. Other things not
shown but needed include occursCountKind="implicit", dfdl:choiceLengthKind,
dfdl:choiceLength.
“What about the DFDL transformation properties such as inputValueCalc, aren’t
they needed?” No. The Minimalist DFDL philosophy is that it is a parsing
language, not a transformation language. If you need to do transformations,
then do it after parsing (using something other than DFDL).
“Aren’t regexes hard to read, write, and maintain?” Well, they are, but I’ll
make four points: (1) their complexity can be managed through various naming
mechanisms (e.g., use the XML ENTITY mechanism to create named regexes), (2)
regexes have been around a long time, are well-understood with lots of
excellent regex processors, and are widely used throughout the programming
community (i.e., there exists a large pool of people who understand regexes),
(3) regexes provide razor-sharp precision (no fuzziness/ambiguity), and (4)
despite their complexity they are a whole lot easier than having to deal with a
ton of DFDL properties.
I welcome your comments. Are there text data formats that can be specified
using Full DFDL that cannot be specified using Minimalist DFDL? Concrete
examples would be appreciated.
/Roger