Steve wrote:
You can actually get rid of this warning without
documentFinalTerminatorCanBeMissing by removing the ambiguity of the sequences
in a choice. For example, we could replace the choice with the
following:
<xs:element name="EOL" type="xs:string" minOccurs="0"
dfdl:initiator="%NL;" dfdl:lengthKind="explicit" dfdl:length="0" />
Wow! That is fantastic!
I went a step further and made that EOL element hidden. See below. It works
great! Thanks Steve!
<xs:sequence dfdl:hiddenGroupRef="hidden-newline" />
...
<xs:group name="hidden-newline">
  <xs:sequence>
    <xs:element name="EOL" type="xs:string" minOccurs="0"
      dfdl:initiator="%NL;" dfdl:lengthKind="explicit" dfdl:length="0" />
  </xs:sequence>
</xs:group>
-----Original Message-----
From: Steve Lawrence <[email protected]>
Sent: Sunday, November 10, 2019 3:43 PM
To: [email protected]
Subject: [EXT] Re: Is it okay to officially publish a DFDL schema that produces
warnings on valid input data?
The difference here is that this warning only appears when *compiling* the
schema, just to alert you that the schema might not give you the expected
behavior. In this case, it's relatively easy to know that the schema is
doing what we expect and that the warning can be ignored.
However, the warning about left over data appears when *parsing*, and only in
some cases. So you'd probably need to verify during each parse if that warning
is safe to be ignored or not. In most cases with this schema, it probably does
mean that there's just a left over NL and it can be ignored. But it's also
possible that some other parse error occurred and Daffodil stopped parsing
halfway through the data, leaving more than just a newline of left over data.
In that case, this warning might actually be a problem.
And in fact, some use cases might even consider left over data an error because
the schema doesn't describe the complete data. This is actually what the
Daffodil NiFi processor does. If the entire data isn't parsed the processor
considers it an error.
So while these are both technically warnings, they sort of have different
severities and potential implications.
Also, you can actually get rid of this warning without
documentFinalTerminatorCanBeMissing by removing the ambiguity of the sequences
in a choice. For example, we could replace the choice with the
following:
<xs:element name="EOL" type="xs:string" minOccurs="0"
dfdl:initiator="%NL;" dfdl:lengthKind="explicit" dfdl:length="0" />
So we have an optional zero-length string element with the NL initiator.
If the NL exists, then the EOL element will be in the infoset and NL will be
unparsed. If the data does not end with the NL, the initiator will not be found
and the EOL element will not be in the infoset (which is valid since it's
optional).
This is modeling the NL as data as Mike mentioned in a previous email.
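
For comparison, here is a rough, untested sketch of the variant Mike described, which keeps the choice but turns its first branch into an element. The element name finalNewLine comes from his email; the exact placement and the xs:/dfdl: prefix bindings are assumptions:

```xml
<!-- Sketch only: assumes the xs: and dfdl: prefixes are bound as in the rest of the schema -->
<xs:choice>
  <!-- First branch: the final newline modeled as data, so it appears in the infoset -->
  <xs:element name="finalNewLine" type="xs:string"
              dfdl:initiator="%NL;" dfdl:lengthKind="explicit" dfdl:length="0" />
  <!-- Second branch: no final newline, nothing is consumed -->
  <xs:sequence />
</xs:choice>
```

In principle the presence or absence of finalNewLine in the infoset gives the unparser something to select a branch by. Whether Daffodil still warns about the empty second branch would need to be checked.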
On 11/10/19 9:42 AM, Costello, Roger L. wrote:
> Steve wrote:
>
>> I think it would be reasonable to
>> ignore this warning.
>
> But, but, but, ...
>
> Mike said (paraphrasing) that it is unwise to officially publish a DFDL
> schema that produces warnings on valid data.
>
> It appears that it is impossible to avoid getting a warning message (for the
> CSV data format where the last record of a CSV file may or may not have a
> newline) until dfdl:documentFinalTerminatorCanBeMissing="yes" is implemented.
> Do you agree?
>
> /Roger
>
> -----Original Message-----
> From: Steve Lawrence <[email protected]>
> Sent: Sunday, November 10, 2019 9:32 AM
> To: [email protected]
> Subject: [EXT] Re: Is it okay to officially publish a DFDL schema that
> produces warnings on valid input data?
>
> When unparsing a choice, we use the infoset to determine which branch of the
> choice to unparse. For example, say we had this choice:
>
> <xs:choice>
> <xs:element name="A" type="xs:string" ... />
> <xs:element name="B" type="xs:int" ... />
> </xs:choice>
>
> If the infoset contained the "A" element, then we would unparse the first
> branch of the choice. If the infoset contained the "B" element, then we would
> unparse the second.
>
> However, in this new choice you have, each branch contains only a sequence,
> and sequences have no representation in the infoset. So when unparsing we
> don't know which branch to take.
>
> That warning is trying to alert you that Daffodil will just have to pick one,
> and that it might not be the one you expected. Daffodil will currently always
> unparse the first of the ambiguous branches.
>
> So this warning is actually normal and expected in this case. I think it
> would be reasonable to ignore this warning.
>
>
> On 11/10/19 8:54 AM, Costello, Roger L. wrote:
>> Mike wrote:
>>
>> I suggest adding this
>>
>> <choice>
>>
>> <sequence dfdl:initiator="%NL;" />
>>
>> <sequence />
>>
>> </choice>
>>
>> At the end of the schema after the repeating row element.
>>
>> This will absorb and discard any final newline.
>>
>> Oh! That is a wicked cool idea! I gave it a try. Daffodil doesn't seem to
>> like it:
>>
>> [warning] Schema Definition Warning: Multiple choice branches are
>> associated with the end of element {}csv.
>>
>> Note that elements with dfdl:outputValueCalc cannot be used to
>> distinguish choice branches.
>>
>> Note that choice branches with entirely optional content are not allowed.
>>
>> What does that message mean? How to fix it?
>>
>> /Roger
>>
>> *From:* Beckerle, Mike <[email protected]>
>> *Sent:* Sunday, November 10, 2019 7:56 AM
>> *To:* [email protected]
>> *Subject:* [EXT] Re: Is it okay to officially publish a DFDL schema
>> that produces warnings on valid input data?
>>
>> I would avoid this.
>>
>> One thing you need to take a position on is whether on unparsing you
>> generate this final new line, or not, or try to preserve whatever the file
>> had originally.
>>
>> Choosing to always generate this, or always omit it is canonicalization.
>>
>> I suggest adding this
>>
>> <choice>
>>
>> <sequence dfdl:initiator="%NL;" />
>>
>> <sequence />
>>
>> </choice>
>>
>> At the end of the schema after the repeating row element.
>>
>> This will absorb and discard any final newline.
>>
>> If you want to preserve the final newline then you have to model it
>> as data so change the first branch of the choice above and make it an
>> element named 'finalNewLine' with initiator and type string with explicit
>> length 0.
>>
>> --------------------------------------------------------------------------------
>>
>> *From:* Costello, Roger L. <[email protected]>
>> *Sent:* Saturday, November 9, 2019 8:05:19 AM
>> *To:* [email protected]
>> *Subject:* Is it okay to officially publish a DFDL schema that
>> produces warnings on valid input data?
>>
>> Hi Folks,
>>
>> Suppose you are creating the official, standard DFDL schema for a data
>> format.
>> Would you be okay with officially releasing a schema that generates
>> warnings on data that is valid?
>>
>> Here's an example. The RFC for CSV (RFC 4180) says that CSV files
>> consist of records separated by newlines. Each record consists of
>> fields separated by commas. The last record may or may not have a new line.
>>
>> Suppose the last record of a CSV file has a newline. My DFDL schema
>> generates this
>> warning:
>>
>> *[warning] Left over data. Consumed 1680 bit(s) with at least 16
>> bit(s) remaining.*
>>
>> I am thinking that that warning is okay. Why? Because when the last
>> record has a newline, then the file /really does/ have left over data
>> - the newline on the last record. So, a warning is not unreasonable.
>>
>> Well, that's what I think. I might be thinking wrongly. What do you
>> think? Would you ever officially release a DFDL schema that generates
>> warnings on valid input data?
>>
>> /Roger
>>
>