[
https://issues.apache.org/jira/browse/DAFFODIL-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17955975#comment-17955975
]
Mike Beckerle commented on DAFFODIL-1559:
-----------------------------------------
The issue of CRLF handling being lossy came up again recently.
I'd like to suggest that we add a DFDL property to help here, rather than some
way to parameterize infoset outputters/inputters.
Note that DFDL allows us to specify custom character set encodings. So we could
define X-DAFFODIL-ASCII-JSON
to mean the same thing as ascii, but with C0 controls turned into JSON style
escapes (along with backslash being doubled).
Given that we can do that, it's fair game to say "properties can control the
way strings convert into the infoset".
So we provide an extension property such as dfdlx:infosetStringRemap which
identifies how to insert escapes/replacement characters into strings of the
infoset.
So dfdl:encoding="ascii" dfdlx:infosetStringRemap="json" would be the same
thing as the X-DAFFODIL-ASCII-JSON encoding I just described. A variation
"jsonExceptLF" would be the same except that it leaves LF alone.
There really is no reason not to allow a DFDL property to control this
behavior.
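To make the "json" and "jsonExceptLF" ideas concrete, here is a minimal Python
sketch of what such a remap and its inverse might look like. The function names
and the keep_lf flag are hypothetical, not an existing Daffodil API, and the
sketch escapes only backslash and C0 controls as described above (it does not
escape double quotes the way a full JSON encoder would).

```python
# Hypothetical sketch of a "json" / "jsonExceptLF" string remap.
# None of these names exist in Daffodil; they only illustrate the idea.

_SHORT = {"\b": "\\b", "\t": "\\t", "\n": "\\n", "\f": "\\f", "\r": "\\r"}
_SHORT_INV = {"b": "\b", "t": "\t", "n": "\n", "f": "\f", "r": "\r", "\\": "\\"}


def remap_json(s: str, keep_lf: bool = False) -> str:
    """Double backslashes and turn C0 controls into JSON-style escapes.

    keep_lf=True models the "jsonExceptLF" variation: LF is left alone.
    """
    out = []
    for ch in s:
        if ch == "\\":
            out.append("\\\\")
        elif ch == "\n" and keep_lf:
            out.append(ch)
        elif ch in _SHORT:
            out.append(_SHORT[ch])
        elif ord(ch) < 0x20:
            out.append(f"\\u{ord(ch):04x}")
        else:
            out.append(ch)
    return "".join(out)


def unmap_json(s: str) -> str:
    """Inverse of remap_json (for either variation); assumes well-formed input."""
    out = []
    i = 0
    while i < len(s):
        ch = s[i]
        if ch != "\\":
            out.append(ch)
            i += 1
        elif s[i + 1] == "u":
            out.append(chr(int(s[i + 2:i + 6], 16)))
            i += 6
        else:
            out.append(_SHORT_INV[s[i + 1]])
            i += 2
    return "".join(out)


if __name__ == "__main__":
    data = "line1\r\nline2"
    print(remap_json(data))                # line1\r\nline2 (escapes, no raw CR/LF)
    print(unmap_json(remap_json(data)) == data)  # True
```

Because backslash is doubled, the escaped form is unambiguous, so the original
CR and LF characters can be recovered exactly on unparse.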
This has the potential to be more efficient, as it can be done as the string is
parsed/unparsed, rather than as yet another operation on the string value after
the infoset has been created.
Each of the suggestions in this thread that we bother to implement could be
given a name.
I think the current scheme (the default) would be named
"Xml1.0IllegalRemapDropCR" or something else that makes it clear the CRs are
going to be dropped, except now they would be dropped before the
InfosetOutputter processes them.
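For comparison, a rough Python sketch of the lossy behavior such a default name
would describe. The CRLF and CR handling follows the issue description (both
become LF). The PUA offset of 0xE000 for other XML 1.0 illegal C0 controls is
an assumption made for illustration, and the function name is hypothetical.

```python
# Hypothetical model of the current default remap; not Daffodil code.
PUA_OFFSET = 0xE000  # assumed offset for remapping illegal C0 controls


def remap_xml10_drop_cr(s: str) -> str:
    # CRLF -> LF first, then any remaining isolated CR -> LF.
    s = s.replace("\r\n", "\n").replace("\r", "\n")
    # Remaining C0 controls other than TAB and LF are not legal in XML 1.0,
    # so move them into the Private Use Area (assumed mapping).
    return "".join(
        chr(PUA_OFFSET + ord(ch)) if ord(ch) < 0x20 and ch not in "\t\n" else ch
        for ch in s
    )


if __name__ == "__main__":
    variants = ["a\r\nb", "a\rb", "a\nb"]
    # All three distinct inputs collapse to the same infoset string,
    # which is exactly the information loss discussed in this issue.
    print({remap_xml10_drop_cr(v) for v in variants} == {"a\nb"})  # True
```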
There are implications here. If the DFDL schema has facet patterns, those
regexes would be processing infoset strings that already have the escaping
applied to them, and that's different from now, where those patterns operate
on the idealized DFDL infoset string.
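A small Python sketch of that facet-pattern implication. The pattern and the
value are made up for illustration, and Python's re module only approximates
XSD regular expressions. XSD patterns match the whole value, so fullmatch is
used here.

```python
import re

# Hypothetical facet pattern: uppercase letters followed by a CRLF.
FACET_PATTERN = r"[A-Z]+\r\n"


def json_escape(s: str) -> str:
    """Minimal JSON-style escaping of backslash, CR and LF only."""
    return s.replace("\\", "\\\\").replace("\r", "\\r").replace("\n", "\\n")


def facet_matches(value: str) -> bool:
    """Approximate an XSD pattern facet check against a string value."""
    return re.fullmatch(FACET_PATTERN, value) is not None


if __name__ == "__main__":
    ideal = "OK\r\n"              # idealized DFDL infoset string
    escaped = json_escape(ideal)  # what the pattern would see after remapping
    print(facet_matches(ideal))    # True
    print(facet_matches(escaped))  # False
```

A schema author would have to write the pattern against the escaped form, or
validation would need to happen before the remap is applied.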
I think in Cyberia (the cyber security application area) the need is for
options that let you achieve each of these requirements (not simultaneously):
# not lossy
# number of infoset characters matches number in original data
# canonicalizes line endings.
Not all of those are necessarily satisfied at the same time. The JSON styles
would satisfy (1) and be mostly readable; jsonExceptLF would improve
readability. Replacing CRLF by U+202B and LF, and replacing isolated CR by
U+2028, plus PUA remapping, would allow achieving (2). Choice (3) can be
achieved by ordinary DFDL, where you parse the string into an array of
strings delimited by all the various kinds of line endings.
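As a rough illustration of (2) and (3), the Python sketch below uses the code
points exactly as written in the paragraph above. The function names are
hypothetical. The count-preserving remap can only be inverted if the original
data does not already contain the replacement characters, which is presumably
where the additional PUA remapping would come in.

```python
import re

# Replacement code points as named in the comment above (not verified choices).
CR_BEFORE_LF = "\u202b"  # stands in for the CR of a CRLF pair
ISOLATED_CR = "\u2028"   # stands in for a CR not followed by LF


def remap_preserving_length(s: str) -> str:
    """Requirement (2): replace each CR one-for-one so the length is unchanged."""
    return re.sub(
        r"\r(\n?)",
        lambda m: CR_BEFORE_LF + "\n" if m.group(1) else ISOLATED_CR,
        s,
    )


def split_lines(s: str) -> list:
    """Requirement (3): canonicalize by splitting on every kind of line ending."""
    return re.split(r"\r\n|\r|\n", s)


if __name__ == "__main__":
    data = "a\r\nb\rc\nd"
    remapped = remap_preserving_length(data)
    print(len(remapped) == len(data))  # True
    print("\r" in remapped)            # False
    print(split_lines(data))           # ['a', 'b', 'c', 'd']
```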
> Add option to disable CRLF to LF XML canonicalization
> -----------------------------------------------------
>
> Key: DAFFODIL-1559
> URL: https://issues.apache.org/jira/browse/DAFFODIL-1559
> Project: Daffodil
> Issue Type: Improvement
> Components: API
> Reporter: Steve Lawrence
> Priority: Major
> Labels: beginner
> Fix For: 4.0.0
>
>
> See the review for more details. The short of it is that when converting parse
> results to XML, we convert CR to LF, and we convert CRLF to LF. This means
> that we lose the information that the data used to contain CRLF. This is
> similar to how we lose that information with delimiters if someone uses NL,
> but it's slightly different since it is actual data. However, it's most user
> friendly and consistent with other XML technologies to have this behavior.
> Perhaps we need an option to convert CRLF to somewhere in PUA so that this
> information can be maintained if someone needs it.