Re: How to specify data with two fields, no delimiter, variable length?

Steve Lawrence Tue, 20 Jul 2021 07:23:38 -0700

Technically, it is the dfdl:assert that specifies something to check
immediately after the element is successfully parsed. And in this case,
the assert expression happens to call the dfdl:checkConstraints
function, which validates what was parsed against the schema restrictions.


But yes, that's the right idea. The assert/checkConstraints only happens
immediately after the parse. It is the other DFDL properties, like
dfdl:lengthKind, that are first used to determine how to parse that field.
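
As a rough, untested sketch of how the two pieces might fit together for
item1: the fragment below assumes the xs and dfdl prefixes are bound as
usual and that a dfdl:format in scope supplies the remaining required
properties (encoding and so on). It only combines the lengthPattern and the
enumeration/assert already shown in this thread.

```xml
<!-- Illustrative fragment only: assumes xs/dfdl namespace prefixes are
     declared and a dfdl:format in scope provides the other properties. -->
<xs:element name="item1"
    dfdl:lengthKind="pattern" dfdl:lengthPattern="ABC|AB|AC|A">
  <xs:annotation>
    <xs:appinfo source="http://www.ogf.org/dfdl/">
      <!-- Evaluated after item1 has been parsed using the pattern above -->
      <dfdl:assert test="{ dfdl:checkConstraints(.) }"
          message="The value of item1 is not one of the allowable values" />
    </xs:appinfo>
  </xs:annotation>
  <xs:simpleType>
    <xs:restriction base="xs:string">
      <xs:enumeration value="A" />
      <xs:enumeration value="ABC" />
      <xs:enumeration value="AB" />
      <xs:enumeration value="AC" />
    </xs:restriction>
  </xs:simpleType>
</xs:element>
```

Here the lengthPattern decides how many characters item1 consumes, and the
assert then checks the parsed value against the enumeration. If the pattern
matches nothing, item1 should come out as the empty string, which is not in
the enumeration, so the assert should fail in that case as well.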



On 7/20/21 10:10 AM, Roger L Costello wrote:
> Thanks again Steve. To confirm my understanding: dfdl:checkConstraints 
> specifies something to check *after parsing* has been performed. The DFDL 
> schema must specify *how to parse*, which is why we need to specify 
> dfdl:lengthKind="pattern" and dfdl:lengthPattern="...". Do I understand correctly?
> 
> /Roger
> 
> -----Original Message-----
> From: Steve Lawrence <[email protected]> 
> Sent: Tuesday, July 20, 2021 9:49 AM
> To: [email protected]
> Subject: [EXT] Re: How to specify data with two fields, no delimiter, 
> variable length?
> 
> The enumeration + checkConstraints approach doesn't give daffodil any 
> information about the length of the field. Those are only used to validate 
> the field *after* it has been parsed.
> 
> So how is Daffodil determining the length of the field if you haven't 
> specified a length? My guess is since the schema compiles, that probably 
> means that your global dfdl:format has set lengthKind="delimited"--other 
> values would probably fail to compile since additional properties are 
> required.
> 
> And with lengthKind="delimited" and no delimiters in scope, the length is 
> just all the data up until the end-of-file is reached. So your item1 is going 
> to be parsed as the entire contents of the file (including any newlines), 
> which will fail the enumeration constraint.
> 
> So even if you add the enumeration + checkConstraints, you still need the 
> pattern length to tell Daffodil the length of the field (either of the ones I 
> mentioned should work).
> 
> On 7/20/21 9:34 AM, Roger L Costello wrote:
>> Thank you Steve. Terrific explanation. 
>>
>> I tried the approach you described - dfdl:lengthKind="pattern" 
>> dfdl:lengthPattern="ABC|AB|AC|A" - and it worked great.
>>
>> I also tried using enumeration facets coupled with 
>> dfdl:checkConstraints within dfdl:assert
>>
>> <xs:element name="item1">
>>     <xs:annotation>
>>         <xs:appinfo 
>>             source="http://www.ogf.org/dfdl/">
>>             <dfdl:assert 
>>                 test="{ dfdl:checkConstraints(.) }"
>>                 message="The value of item1 is not one of the allowable values"
>>             />
>>         </xs:appinfo>
>>     </xs:annotation>
>>     <xs:simpleType>
>>         <xs:restriction base="xs:string">
>>             <xs:enumeration value="A" />
>>             <xs:enumeration value="ABC" />
>>             <xs:enumeration value="AB" />
>>             <xs:enumeration value="AC" />
>>         </xs:restriction>
>>     </xs:simpleType>
>> </xs:element>
>>
>> But that did not work. Why does that not work?
>>
>> /Roger
>>
>> -----Original Message-----
>> From: Steve Lawrence <[email protected]>
>> Sent: Monday, July 12, 2021 2:39 PM
>> To: [email protected]
>> Subject: [EXT] Re: How to specify data with two fields, no delimiter, 
>> variable length?
>>
>> In cases like these, you need to use dfdl:lengthKind="pattern" and a regular 
>> expression to define the length of the first item.
>>
>> There are many different regexes you could use, depending on what kinds of 
>> infosets you want to allow.
>>
>> For example, one approach for the first item is a very strict regex that 
>> matches exactly one of the four values, e.g.
>>
>>   <xs:element name="item" type="xs:string"
>>     dfdl:lengthKind="pattern" dfdl:lengthPattern="ABC|AB|AC|A" />
>>
>> With this approach, the item will get a non-zero length if it is one of 
>> those items. Otherwise the item will be the empty string. And if you don't 
>> want empty string to be allowed, you need to add an assert that the length 
>> is greater than zero. Also, note that order in the regex matters so it 
>> matches the longest possibility first.
>>
>> On the other end of the spectrum, you could instead model the first item to 
>> match as many non-digits as possible:
>>
>>   <xs:element name="item" type="xs:string"
>>     dfdl:lengthKind="pattern" dfdl:lengthPattern="[^0-9]*" />
>>
>> This will match any of the four allowed values, but will also match anything 
>> else up to the first digit. So this could potentially produce infosets with 
>> an item value of XYZ, for example. In some cases, you might actually want 
>> this--we might consider the data to be "well-formed"
>> but not "valid". So you still get an infoset, it's just not "valid".
>> Whereas in the first case, you could only get a valid infoset.
>>
>> You'll probably also need to use regex length for matching the numeric item 
>> if there's no delimiter after the number.
>>
>> So putting it together, and using the second approach for both items, you 
>> might do something like this:
>>
>>   <xs:sequence>
>>     <xs:element name="item1" type="xs:string"
>>       dfdl:lengthKind="pattern" dfdl:lengthPattern="[^0-9]*" />
>>     <xs:element name="item2" type="xs:int"
>>       dfdl:lengthKind="pattern" dfdl:lengthPattern="[0-9]*" />
>>   </xs:sequence>
>>
>> So the first item is a string parsing as many non-digits as possible, and the 
>> second is an int parsing as many digits as possible. Note that this approach 
>> probably should have limits on the regex length in case the data is 
>> bad/malformed. For example, if the data didn't contain numbers then item1 
>> would just consume the entire data. So instead of *, you might instead want 
>> to use something like "{0,10}" for both regexes.
>>
>> - Steve
>>
>> On 7/12/21 2:05 PM, Roger L Costello wrote:
>>> Hi Folks,
>>>
>>> I have a data field composed of two items.
>>>
>>> The values for the first item can be enumerated:
>>>
>>>     A
>>>     ABC
>>>     AB
>>>     AC
>>>
>>> The value of the second item is any integer 0-999
>>>
>>> So, here is a sample data field:
>>>
>>>     A250
>>>
>>> How do I parse that using DFDL? I reckon I'm stuck.
>>>
>>> /Roger
>>>
>>
>
