[ https://issues.apache.org/jira/browse/DAFFODIL-2884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17831839#comment-17831839 ]
Mike Beckerle commented on DAFFODIL-2884:
-----------------------------------------
Good point. There's the string-as-X feature (where X is XML, JSON, or anything
else we care about and can't write a parser for in DFDL), and that's separate
from what it means to project that string into an infoset representation of
type Y.
I would use the term "well-formed", not "valid", for this, as "validation" is
overloaded and means different things depending on whether the format is XML
or something else.
The interesting case for embedding into the infoset is when a string of format
Y is well-formed and gets embedded into an infoset of that same format Y. For
XML, one challenge is that a string of XML inside some data may use many things
in XML that are not allowed by DFDL schemas, such as attributes, mixed content,
etc., so we really cannot enable validation beyond the facet and min/maxOccurs
checking of Daffodil's "limited" mode.
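To make the attributes/mixed-content point concrete, here is a minimal sketch
in Python (chosen only for brevity; nothing here is Daffodil API, and
non_dfdl_features is a hypothetical helper). It reports which of those two
constructs a well-formed XML string uses, and checks nothing else:

```python
import xml.etree.ElementTree as ET


def non_dfdl_features(xml_text):
    """Return the constructs found that a DFDL schema cannot describe.

    Only the two examples named above are checked: attributes and mixed
    content. Raises ET.ParseError if xml_text is not well-formed.
    """
    found = set()
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        if elem.attrib:
            found.add("attributes")
        if len(elem) > 0:
            # Mixed content: non-whitespace text alongside child elements.
            texts = [elem.text] + [child.tail for child in elem]
            if any(t and t.strip() for t in texts):
                found.add("mixed content")
    return found
```

A string such as `<p>some <b>bold</b> text</p>` is perfectly well-formed XML,
yet this check flags it as mixed content.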
As we build out more JSON support, we should be careful not to repeat these
mistakes by assuming the JSON follows whatever conventions Daffodil uses when
it creates JSON, as embedded chunks of JSON won't necessarily adhere to those
conventions.
UDFs (or just built-in functions of this kind) that embed parsers so they can
answer isWellFormedXML, isWellFormedJSON, etc. are very feasible and would give
users control of what to do if the string isn't well-formed: reject the data as
unparsable, capture it as a plain string, etc.
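As a rough illustration of what such functions would compute, here is a
minimal Python sketch. It is not Daffodil's UDF API; the function names simply
mirror the ones above, and handle_embedded with its on_malformed policy is a
hypothetical stand-in for the choice the schema author would make:

```python
import json
import xml.etree.ElementTree as ET


def _reject_constant(name):
    # Python's json module accepts NaN/Infinity by default; strict JSON does not.
    raise ValueError(f"not strict JSON: {name}")


def is_well_formed_xml(text):
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def is_well_formed_json(text):
    try:
        json.loads(text, parse_constant=_reject_constant)
        return True
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False


def handle_embedded(text, is_well_formed, on_malformed="reject"):
    """Apply the user's policy: embed, reject, or keep as a plain string."""
    if is_well_formed(text):
        return ("embedded", text)
    if on_malformed == "reject":
        raise ValueError("embedded string is not well-formed")
    return ("string", text)
```

With on_malformed="string", malformed input is kept as an ordinary string
instead of failing the parse.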
A final thought is that the new Layer API may be helpful here, in that a layer
could parse the data according to a syntax (XML, JSON, etc.) and populate a
variable with whether the data is well-formed or not, or even validate it
against a schema if requested and populate a variable with error message
strings if invalid.
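The shape of that idea, sketched in Python under loud assumptions: this is not
the Layer API, the variable names wellFormed and errors are made up, and
validator is a hypothetical callable standing in for schema validation that
returns a list of error strings:

```python
import json
import xml.etree.ElementTree as ET

# Map a syntax name to a function that parses the text or raises an error.
PARSERS = {
    "xml": ET.fromstring,
    "json": json.loads,
}


def check_syntax(text, syntax, validator=None):
    """Return the 'variables' a layer might populate for this chunk of data."""
    variables = {"wellFormed": False, "errors": []}
    try:
        parsed = PARSERS[syntax](text)
    except (ET.ParseError, ValueError) as e:
        # Malformed data is recorded in a variable, not raised as a fatal error.
        variables["errors"].append(str(e))
        return variables
    variables["wellFormed"] = True
    if validator is not None:
        # Validation is optional and only runs on well-formed data.
        variables["errors"].extend(validator(parsed))
    return variables
```

The schema could then test the well-formedness variable and decide what to do,
rather than having the failure surface somewhere it cannot be handled.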
> String-As-XML cause SDE on malformed XML data. Needs to be PE.
> --------------------------------------------------------------
>
> Key: DAFFODIL-2884
> URL: https://issues.apache.org/jira/browse/DAFFODIL-2884
> Project: Daffodil
> Issue Type: Bug
> Components: Back End
> Affects Versions: 3.6.0
> Reporter: Mike Beckerle
> Priority: Major
>
> When using the string-as-XML feature, currently if the string that is
> supposed to be XML is malformed, then a WstxUnexpectedCharException (or other
> similar exception) gets thrown in the InfosetOutputter which is what does the
> string-of-XML to actual XML conversion. The InfosetOutputter is outside the
> scope of backtracking, so this error cannot be converted into a ParseError at
> this point. The InfosetOutputter currently escalates this to an SDE.
> That's not correct for a data problem. The parser could be speculating down a
> path where the string of data that is supposed to be XML is just gibberish.
> If that string is malformed XML, a Parse Error needs to occur so we can
> backtrack.
> Converting Infoset into XML is normally something done by the
> InfosetOutputter, but in this case it cannot be. It needs to be done in the
> string parser, and the Infoset needs to somehow cache the resulting XML so it
> can be handed off to the InfosetOutputter.
> I think this has to work analogously to text numbers. We parse the string
> first, then convert to the data type, which for numbers is an
> integer/float/decimal, etc. This conversion can fail, and that's a Parse
> Error. String-as-XML needs to work the same way. The string is parsed via one
> of the lengthKind techniques, then it is converted into XML. If the
> conversion to XML fails, then it's a Parse Error.
>
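The text-number analogy in the issue description above can be sketched as
follows. This is Python for illustration only, not Daffodil code: ParseError
here is a stand-in class, and choice is a hypothetical, much simplified model
of a point of uncertainty. The point is that the string-to-XML conversion
happens inside the parser, so its failure is an ordinary Parse Error that
backtracking can recover from:

```python
import xml.etree.ElementTree as ET


class ParseError(Exception):
    """Stand-in for a recoverable Parse Error (PE); not Daffodil's class."""


def parse_text_int(raw):
    # Text numbers: the string is parsed first, then converted; failure is a PE.
    try:
        return int(raw)
    except ValueError as e:
        raise ParseError(f"not an integer: {raw!r}") from e


def parse_string_as_xml(raw):
    # Same pattern: convert at parse time and keep the converted result,
    # so nothing can fail later when the infoset is output.
    try:
        return ET.fromstring(raw)
    except ET.ParseError as e:
        raise ParseError(f"malformed XML: {e}") from e


def choice(raw, *branches):
    """Try each branch in order, backtracking to the next on a ParseError."""
    failures = []
    for branch in branches:
        try:
            return branch(raw)
        except ParseError as e:
            failures.append(str(e))
    raise ParseError("all branches failed: " + "; ".join(failures))
```

For example, choice("just gibberish <", parse_string_as_xml, str) falls
through to the plain-string branch instead of aborting the whole parse.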
--
This message was sent by Atlassian Jira
(v8.20.10#820010)