Re: Minimalist DFDL, part II

Steve Lawrence Thu, 12 Aug 2021 11:06:02 -0700

Yep, this is related to the discussion about well-formed vs valid. It's
not uncommon, and often preferred, to model only the syntax of the data
so that you can parse data that is syntactically correct (i.e.
well-formed) but isn't semantically correct (i.e. not valid), and then do
the validation later.

That would, for example, let you parse data with the number "4" but only 3
students listed, and then use XSLT/Schematron afterwards to detect that
the counts don't match up.
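
As a rough illustration of that split, here is a minimal Python sketch. It
is only a stand-in for a DFDL parse followed by a Schematron check, not
either of those tools, and the function names parse_well_formed and
validate are made up for this example. The parse step accepts any number
followed by zero or more names; the count check runs afterwards.

```python
def parse_well_formed(text):
    """Syntax only: a number followed by zero or more names."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return {"number": int(lines[0]), "names": lines[1:]}


def validate(parsed):
    """Semantic check, run after parsing: does the count match?"""
    errors = []
    found = len(parsed["names"])
    if parsed["number"] != found:
        errors.append(f"declared {parsed['number']} students but found {found}")
    return errors


# Well-formed but not valid: the number says 4, only 3 names follow.
data = "4\nJohn Doe\nSally Smith\nJudy Jones\n"
parsed = parse_well_formed(data)
errors = validate(parsed)
print(parsed)
print(errors)
```

The parse succeeds even though the data is invalid, and the mismatch is
reported by the separate validation step.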

That said, I think you'll still often need occursCountKind="expression".
Once you start modeling more complicated data formats, you almost always
start seeing repetitions of types, and you often can't use speculative
parsing to differentiate between the types. In those cases the only
solution is to use expressions to determine the number of occurrences.

For example, say we had this data:

  3
  2
  John Doe
  Sally Smith
  Judy Jones
  Richard Roe
  Bob Barker

We really don't want to think of this as two numbers followed by 5
strings. That just isn't going to be useful. We instead want to think of
this as two numbers that specify the number of students and the number
of teachers, followed by a list of the student names and a list of the
teacher names. And so we really want an infoset that looks like this:

  <People>
    <NumStudents>3</NumStudents>
    <NumTeachers>2</NumTeachers>
    <Students>
      <name>John Doe</name>
      <name>Sally Smith</name>
      <name>Judy Jones</name>
    </Students>
    <Teachers>
      <name>Richard Roe</name>
      <name>Bob Barker</name>
    </Teachers>
  </People>

Notice this data doesn't allow speculative parsing to differentiate
student names from teacher names--the names have the exact same form.
So the only way to know when one ends and the other begins is by using
occursCountKind="expression" and an expression to reach back into the
parsed numbers to figure out the number of occurrences.
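
To make the point concrete, here is a minimal Python sketch of the same
idea. It is not DFDL and not how Daffodil implements it; parse_people is a
hypothetical helper written only for this example. The two counts parsed
first are the only thing that tells the parser where the student names end
and the teacher names begin.

```python
def parse_people(text):
    """Split the names into students and teachers using the leading counts."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    num_students = int(lines[0])
    num_teachers = int(lines[1])
    names = lines[2:]
    if len(names) < num_students + num_teachers:
        raise ValueError("not enough names for the declared counts")
    # The counts, not the form of the names, decide where each list ends.
    students = names[:num_students]
    teachers = names[num_students:num_students + num_teachers]
    return {
        "NumStudents": num_students,
        "NumTeachers": num_teachers,
        "Students": students,
        "Teachers": teachers,
    }


data = """3
2
John Doe
Sally Smith
Judy Jones
Richard Roe
Bob Barker
"""
people = parse_people(data)
print(people["Students"])
print(people["Teachers"])
```

Without the two leading numbers, nothing in the five name lines would
indicate which list a given name belongs to.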

- Steve

On 8/12/21 1:01 PM, Roger L Costello wrote:
> Hi Folks,
> 
> A couple of weeks ago Mike Beckerle pointed out that many data formats 
> contain things like this:
> 
> A number, N
> N occurrences of something
> 
> For example, 3 followed by the names of three students:
> 
> 3
> John Doe
> Sally Smith
> Judy Jones
> 
> How should that be parsed? Using the DFDL occursCount and 
> occursCountKind="expression" and hiddenGroup you can parse the input to 
> ensure that exactly three student names are consumed. The output is this XML:
> 
> <Students>
>     <name>John Doe</name>
>     <name>Sally Smith</name>
>     <name>Judy Jones</name>
> </Students>
> 
> But is it really the job of the parser to "ensure that exactly three student 
> names are consumed"?
> 
> I raised this question to the compiler experts on the compilers Usenet list. 
> Here's what one person wrote:
> 
>> I would contend that in your example the /syntax/ of lists is really a
>> number followed by zero or more strings (number string*), and that
>> verifying the string count is semantics, not syntax.  I believe that,
>> whenever possible, semantics are best left until after parsing is finished.
> 
> In other words, keep your DFDL schema simple: forget 
> occursCountKind="expression" and hiddenGroup; just parse the number and the 
> following strings. The output should be this:
> 
> <number>3</number>
> <Students>
>     <name>John Doe</name>
>     <name>Sally Smith</name>
>     <name>Judy Jones</name>
> </Students>
> 
> If you need to "ensure that there are 3 student names" you can do that check 
> *after* parsing.
> 
> This is the Minimalist DFDL philosophy.
> 
> /Roger
> 
>
