Re: How to dynamically specify the length and datatype of an element?

Steve Lawrence Mon, 08 Oct 2018 06:48:02 -0700

Sure thing. The basic idea is something like this:

1. Pass One: Create a DFDL schema that describes and parses just the
dbase header, and parse your dbase data using that. Note that this will
warn about leftover data since it only parses the header of the data:


  daffodil parse -s dBaseHeader.dfdl.xsd -o header.xml input.dbase

In your example, this will generate something like this:

  <field-descriptor-array>
    <field>
      <name>station-name</name>
      <length>254</length>
      <datatype>string</datatype>
    </field>
    <field>
      <name>line</name>
      <length>100</length>
      <datatype>string</datatype>
    </field>
    <field>
      <name>isActive</name>
      <length>1</length>
      <datatype>boolean</datatype>
    </field>
  </field-descriptor-array>

2. Transform that field descriptor XML to a new DFDL schema using a
custom XSLT:

  xsltproc -o genDBase.dfdl.xsd dBaseHeaderToDFDLSchema.xsl header.xml

Your XSLT might transform the above XML to something like this:

  <xs:schema>
    <xs:include schemaLocation="dBaseHeader.dfdl.xsd" />

    <xs:element name="dBase">
      <xs:complexType>
        <xs:sequence>
          <xs:element ref="field-descriptor-array" />
          <xs:element name="record" ... >
            <xs:complexType>
              <xs:sequence>
                <xs:element name="station-name" type="xs:string"
                            dfdl:length="254" ... />
                <xs:element name="line" type="xs:string"
                            dfdl:length="100" ... />
                <xs:element name="isActive" type="xs:boolean"
                            dfdl:length="1" ... />
              </xs:sequence>
            </xs:complexType>
          </xs:element>
          ...
        </xs:sequence>
      </xs:complexType>
    </xs:element>
  </xs:schema>

Note that most of this is boilerplate except for the elements specific
to this dbase file, which have the names, lengths, and types that you
would expect.
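
A minimal sketch of what dBaseHeaderToDFDLSchema.xsl could look like.
It assumes header.xml has field-descriptor-array as its root element and
that the datatype values are already XSD type names, as in the example
above. The match patterns and the dfdl:lengthKind="explicit" attribute
are my assumptions, and the properties elided with "..." above are still
left out, so the output is not yet a complete DFDL schema:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/">

  <xsl:output method="xml" indent="yes"/>

  <!-- Boilerplate: emitted once for the root of the pass-one infoset. -->
  <xsl:template match="/field-descriptor-array">
    <xs:schema>
      <!-- Reuse the pass-one schema so the header is parsed again. -->
      <xs:include schemaLocation="dBaseHeader.dfdl.xsd"/>

      <xs:element name="dBase">
        <xs:complexType>
          <xs:sequence>
            <xs:element ref="field-descriptor-array"/>
            <!-- Occurrence and format properties of record are elided
                 here, just as they are in the example schema. -->
            <xs:element name="record">
              <xs:complexType>
                <xs:sequence>
                  <!-- One generated element per field descriptor. -->
                  <xsl:apply-templates select="field"/>
                </xs:sequence>
              </xs:complexType>
            </xs:element>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:schema>
  </xsl:template>

  <!-- File-specific part: name, type, and length come from the header.
       Assumes <datatype> holds an XSD type name such as "string". -->
  <xsl:template match="field">
    <xs:element name="{name}" type="xs:{datatype}"
                dfdl:lengthKind="explicit" dfdl:length="{length}"/>
  </xsl:template>

</xsl:stylesheet>
```

Run with the xsltproc command from step 2, this should produce one
xs:element per field inside record. If the header schema reports raw
dBase type codes instead of XSD type names, the field template would
need an xsl:choose to map them.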

3. Pass Two: Use that generated DFDL Schema to parse the file again, but
this time the entire file will be parsed:

  daffodil parse -s genDBase.dfdl.xsd -o dbase.xml input.dbase

The result of this might look something like this:

  <dBase>
    <field-descriptor-array>...</field-descriptor-array>
    <record>
      <station-name>Van Dorn Street</station-name>
      <line>blue</line>
      <isActive>true</isActive>
    </record>
    ...
  </dBase>

Which is much closer to what you are looking for.

- Steve

On 10/8/18 9:07 AM, Costello, Roger L. wrote:
> Hi Steve,
> 
>> It can be done with a two pass solution, though.
> 
> Okay, I'll give this 2-pass approach a try. However, I've never done this 
> before. Does the first pass generate XML? And then the second pass (somehow) 
> uses the XML to parse the remainder of the dBase file? I don't have any idea 
> how to do this. Would you sketch out how to do 2-passes in Daffodil, please?
> 
> /Roger
> 
> -----Original Message-----
> From: Steve Lawrence <[email protected]> 
> Sent: Monday, October 8, 2018 8:19 AM
> To: [email protected]; Costello, Roger L. <[email protected]>
> Subject: Re: How to dynamically specify the length and datatype of an element?
> 
> It is possible to have a dynamic length using dfdl:lengthKind="explicit"
> and setting dfdl:length to an expression that reaches into the 
> field-descriptor-array to get the length.
> 
> However, there's no way to set the type dynamically at parse time.
> Element types must be statically defined in the DFDL schema, just like their 
> names. You could perhaps parse each field as an xs:hexBinary type and then use 
> XSLT to transform that hex binary based on the type, but then you lose a lot 
> of the benefits that DFDL/Daffodil provides.
> 
> To me, this sounds like a format that is self-descriptive--the specification 
> for that data is within the data itself. DFDL/Daffodil does not usually 
> handle these types of formats very well. It can be done with a two pass 
> solution, though. The first pass uses a schema that describes and parses only 
> the header of the data. The resulting XML infoset is then transformed into 
> another DFDL schema based on the self-description. The remaining data can 
> then be parsed with that generated schema.
> 
> This has clear performance implications since you need to perform a transform 
> and compile a new DFDL schema for every new piece of data, but it is really 
> the only way to handle these self describing formats.
> 
> - Steve
> 
> On 10/8/18 7:55 AM, Costello, Roger L. wrote:
>> Hello DFDL community!
>>
>> I am creating a DFDL schema to parse dBase files.
>>
>> A dBase file consists of a list of records. Each record consists of a list 
>> of fields. Prior to the list of records is a header which describes each 
>> record field: the field's name, the length of the field's value, and its 
>> datatype (string, date, numeric, boolean, etc.). For example, I have a dBase 
>> file containing railway data and the file looks like this (albeit in binary):
>>
>> Field-descriptor-array
>>     Field
>>         name: station-name
>>         length: 254
>>         datatype: string
>>     Field
>>         name: line
>>         length: 100
>>         datatype: string
>>     Field
>>         name: isActive
>>         length: 1
>>         datatype: boolean
>>
>> Here is a record:
>>
>>     Van Dorn Street
>>     blue
>>     T
>>
>> Ideally, parsing the dBase file would yield this XML:
>>
>>     <record>
>>         <station-name>Van Dorn Street</station-name>
>>         <line>blue</line>
>>         <isActive>true</isActive>
>>     </record>
>>     
>> However, that requires element names be dynamically generated, which is not 
>> currently supported. So, instead I can design the DFDL schema to generate 
>> this XML:
>>
>>     <record>
>>         <field>Van Dorn Street</field>
>>         <field>blue</field>
>>         <field>true</field>
>>     </record>
>>
>> That will require the DFDL schema to calculate the number of <field> 
>> elements:
>>
>>     <xs:element      name="field" 
>>              minOccurs="0" 
>>              maxOccurs="unbounded" 
>>              dfdl:occursCountKind="expression" 
>>              dfdl:occursCount="{ fn:count(../../Field-descriptor-array/Field) }" 
>>                      ...
>>
>> Does this seem reasonable thus far?
>>
>> Now I am stuck: how to specify the length and the datatype of each field 
>> element? The i'th <field> element must have a length and datatype as 
>> specified in the i'th Field (which are in the header section). For the 
>> example above, the first <field> element must be a string with length 254 
>> characters, the second <field> element must be a string with length 100 
>> characters, and the third <field> element must be a boolean with length 1 
>> byte. How do I dynamically specify length and datatype?
>>
>> /Roger
>>
> 
>
