Re: variable length CSV records via 'lengthKind' property

Beckerle, Mike Fri, 30 Jul 2021 13:29:46 -0700

I think we can conclude this as a DFDL rule/practice:

If optional leading zeros or optional trailing zeros must be preserved then the 
type should be string, not a numeric type.

Why: 01 and 1 are equivalent integers. But they're not equivalent text strings. 
If this difference matters to the data format (saying "must be preserved" in a 
parse-unparse cycle is the same as saying this difference matters to the 
format), you must capture this difference in the way you model this data.

If you always want the leading zeros, you can specify required leading zeros on 
the text form of an integer with textNumberPattern, but then they're not 
optional, they're always there when unparsing.
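
For illustration, here is a minimal sketch of the required-leading-zeros case. It assumes a two-digit count field and that the remaining text-number properties come from a dfdl:format already in scope; the element name and the "00" pattern are assumptions for this example, not taken from the schema under discussion.

```xml
<!-- Hypothetical element: always unparses with at least two digits. -->
<xs:element name="fields" type="xs:unsignedInt"
  dfdl:representation="text"
  dfdl:textNumberRep="standard"
  dfdl:textNumberPattern="00"/>
```

With this pattern the value 1 always unparses as "01", whether or not the original data had the leading zero.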

There is no way in DFDL to say "it's an integer, but if there are leading zeros 
when parsing, I consider that a different integer than if there aren't, so 
remember that for use when unparsing." The DFDL Infoset for an integer has no 
such concept or place to remember that. In some sense the point of saying 
something is an integer in a data format is to express that these sorts of 
things are not significant to the format, so unparsing them to a canonical 
text representation of an integer is the preferred thing.

You can interpret the text string as an integer by calling a conversion 
function such as xs:integer(...).
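
As a rough sketch of one way to do that, the string can be kept as the represented element and the numeric value exposed alongside it with dfdl:inputValueCalc. The element name "fieldCount" is hypothetical, and the usual text properties are assumed to come from a dfdl:format in scope.

```xml
<xs:sequence>
  <!-- Raw text such as "01"; unparsed exactly as it appears in the infoset. -->
  <xs:element name="fields" type="xs:string"/>
  <!-- Hypothetical computed element; it has no representation in the data. -->
  <xs:element name="fieldCount" type="xs:unsignedInt"
    dfdl:inputValueCalc="{ xs:unsignedInt(../fields) }"/>
</xs:sequence>
```

The string element carries the characters that must round-trip; the computed element is only a convenience for consumers of the infoset.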

A similar issue comes up with trailing zeros in decimal and floating-point 
numbers. If trailing zeros are optional, but "must be preserved", then they are 
part of the format, and must appear in the infoset if they are to be preserved.
The type should be xs:string, not xs:decimal, xs:float, or xs:double.
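
A small sketch of what that might look like for a decimal amount follows. The element name and the pattern are assumptions for illustration, and the pattern facet only matters when validation is enabled.

```xml
<!-- Hypothetical element: "1.50" and "1.5" remain distinct strings. -->
<xs:element name="amount">
  <xs:simpleType>
    <xs:restriction base="xs:string">
      <!-- Digits with an optional fractional part. -->
      <xs:pattern value="[0-9]+(\.[0-9]+)?"/>
    </xs:restriction>
  </xs:simpleType>
</xs:element>
```

Where the numeric value is needed in an expression, it can be obtained with xs:decimal(../amount).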

A similar but more obscure case arises in dates. Feb 29 in a non-leap year is 
the same as Mar 1. The infoset doesn't capture this. On unparse, an element of 
type xs:date will output "Mar 1". If you want the infoset to somehow remember 
the original string, then, well, it's a string.

________________________________
From: Attila Horvath <[email protected]>
Sent: Friday, July 30, 2021 1:13 PM
To: [email protected]
Subject: Re: variable length CSV records via 'lengthKind' property

EXCELLENT!!! That fits the bill...

<xs:element name="fields" minOccurs="1" maxOccurs="1">
  <xs:simpleType>
    <xs:restriction base="xs:string">
      <xs:pattern value="[0-9]{1,2}"/>
    </xs:restriction>
  </xs:simpleType>
</xs:element>

<xs:element name="field" maxOccurs="unbounded"
  dfdl:occursCount="{ xs:unsignedInt(../fields) }"
  dfdl:occursCountKind="expression">
  <xs:simpleType>
    <xs:restriction base="xs:string">
      <xs:pattern value="[a-zA-Z0-9+=]{0,64}"/>
    </xs:restriction>
  </xs:simpleType>
</xs:element>

Thx again... after the fact.

Attila

On Fri, Jul 30, 2021 at 12:55 PM Beckerle, Mike <[email protected]> wrote:
Model the length field as a string so its characters are perfectly preserved.

Where you need to use its value in the number of occurrences, call the 
xs:unsignedInt(...) function on the string value.

________________________________
From: Attila Horvath <[email protected]>
Sent: Friday, July 30, 2021 12:48 PM
To: [email protected]
Subject: Re: variable length CSV records via 'lengthKind' property

Thx to all responses. Got variable length record w/ variable number of 
'specified' fields working EXCEPT unparsing does not yield "lossless" 
reconstituted data.

Per...
     <xs:element name="fields" type="xs:unsignedInt"/>

"fields" specifies the number of fields in the record. When "fields" has value 
"1", unparse yields a lossless result. However, when "fields" has value "01", 
unparse yields a lossy "1" result, dropping the leading zero.

Suggestions to resolve? Our solution MUST be lossless.

Thx in advance

Attila

On 2021/07/21 14:01:12, "Beckerle, Mike" <[email protected]> wrote:
> There is an example of doing dfdl:occursCountKind 'expression' on github in 
> the CSV example here:
>
> https://github.com/DFDLSchemas/CSV/blob/master/src/main/resources/com/tresys/csv/xsd/csvHeaderEnforced.dfdl.xsd
>
> Line 75 of this file uses this technique to ensure each row has the same 
> number of items as there were titles in the initial header row. In your case 
> you want a different expression to pick out the exact count from an earlier 
> element, but the example may help.
>
>
>
> ________________________________
> From: Steve Lawrence <[email protected]>
> Sent: Wednesday, July 21, 2021 9:52 AM
> To: [email protected]
> Subject: Re: variable length CSV records via 'lengthKind' property
>
> You cannot dynamically set maxOccurs--that has to be either a static
> number or "unbounded". This is a restriction of XML Schema that DFDL is
> based off of.
>
> But Daffodil does support what you're trying to do. First, you'll want
> to set maxOccurs="unbounded" (since we don't know how many parameter
> instances there might be). Then we can use the
> dfdl:occursCountKind="expression" and dfdl:occursCount properties to
> evaluate an expression at parse time to set the number of occurrences to
> the value of the third field. For example:
>
>   ...
>   <xs:element name="field3" type="xs:int" ... />
>   <xs:element name="parameters" type="..." maxOccurs="unbounded"
>     dfdl:occursCountKind="expression"
>     dfdl:occursCount="{ ../field3 }" ... />
>   ...
>
> So we parse field3 to an integer and then use that value in an
> expression to determine the number of occurrences of the parameters array.
>
>
> On 7/21/21 8:04 AM, Horvath, Attila J CTR (USA) wrote:
> > ALCON - see attached PDF version of this message
> >
> > I have a Character Separated Values [CSV] ASCII input file where:
> >
> > - fields are ASCII STX character [0x02] separated
> >
> >    - variable number of fields per below
> >
> > - records are ASCII DC4 character [0x14] terminated
> >
> > - file is ASCII EOT character [0x04] terminated
> >
> > In following sample excerpt, highlighted in YELLOW is last record of data 
> > file.
> >
> > 3rd field [31 30] specifies 10 fields to follow each STX [0x02] delimited,
> >
> > followed by record and file terminators [0x14 0x04] respectively...
> >
> > Is it possible to read the 3rd field ['10'], save it via 
> > [dfdl:lengthKind and/or dfdl:defineVariable] as "NUM_FIELDS" and pass that
> > value to the 'maxOccurs' attribute on a per-record basis?
> >
> > EG:...
> >
> > <xs:element name="parameters" maxOccurs=NUM_FIELDS minOccurs="1">
> >
> > ...
> >
> > </xs:element>
> >
> > Thx in advance
> >
> > Attila
> >
>
>
