Very undesirable to have &#xA0; instead of the literal character.

I'm not seeing where in Daffodil this would be happening. We do remap certain 
characters, the ones that are illegal in XML, to the private use area. But 
there is nothing illegal about U+00A0. It's just a character.
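
To make the "nothing illegal" point concrete, here is a minimal sketch of the XML 1.0 "Char" production as a predicate. It is written in Python purely for illustration; the function name is hypothetical and this is not Daffodil's remapping code.

```python
def is_xml10_char(cp: int) -> bool:
    """Return True if code point `cp` matches the XML 1.0 'Char' production."""
    return (
        cp in (0x9, 0xA, 0xD)            # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF          # includes U+00A0
        or 0xE000 <= cp <= 0xFFFD        # skips the surrogate block
        or 0x10000 <= cp <= 0x10FFFF     # supplementary planes
    )


nbsp_is_legal = is_xml10_char(0x00A0)   # legal, so no remapping is needed
nul_is_legal = is_xml10_char(0x0000)    # illegal in XML 1.0
print(nbsp_is_legal, nul_is_legal)
```

Under this check U+00A0 is an ordinary legal character, so a remapping step aimed only at XML-illegal characters would have no reason to touch it.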


I am puzzled.


Some of the infoset outputters call scala.xml.Utility.escape(...), others 
don't, which is itself an issue, but I tested this, and that doesn't convert 
U+00A0 into the &#xA0; that you are observing. Nor does our remapping call.


________________________________
From: Sloane, Brandon <bslo...@tresys.com>
Sent: Monday, June 24, 2019 10:38:10 AM
To: dev@daffodil.apache.org
Subject: Re: Character Encodings - No Statement

Slightly different issue from what I was expecting. Daffodil appears to be 
outputting U+00A0 as "&#xA0;" instead of as a literal character.


This is not wrong, and I believe a compliant XML processor should not notice 
the difference, but is this desirable?


Additionally, it appears to not be simply a padding character. In my test data, 
I observed the string: "ADP&#xA0;ADP&#xA0;".
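
The claim that a compliant XML processor should not notice the difference can be checked with any XML parser. A small sketch using Python's standard library (the element name "field" is made up for the example):

```python
import xml.etree.ElementTree as ET

# Same content, written once with character references and once literally.
as_reference = ET.fromstring("<field>ADP&#xA0;ADP&#xA0;</field>").text
as_literal = ET.fromstring("<field>ADP\u00a0ADP\u00a0</field>").text

# After parsing, both forms yield the identical string.
same = as_reference == as_literal
print(same)
```

The two forms differ only in the serialized text, which matters to people reading the XML or to tools that compare files byte for byte, not to a consumer working from the parsed infoset.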

________________________________
From: Beckerle, Mike <mbecke...@tresys.com>
Sent: Tuesday, June 18, 2019 9:28:07 AM
To: dev@daffodil.apache.org
Subject: Re: Character Encodings - No Statement

One other possible mechanism:


nillable elements with dfdl:nilKind="literalCharacter"


This is a mechanism designed to handle fixed-length data where the "storage" 
for the data is filled with a character/byte and the parts of it that are 
in-use are overwritten with actual data. The unwritten data is then recognized 
as nilled based on appearance of the literalCharacter throughout the data field.
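
As a rough illustration of the idea (not the DFDL specification's exact rules, and not Daffodil code), nil detection for such a field might look like the following Python sketch. The names and the choice of U+00A0 as the fill character are assumptions for the example.

```python
from typing import Optional

NIL_CHAR = "\u00a0"  # assumed fill character for this example


def parse_field(raw: str, nil_char: str = NIL_CHAR) -> Optional[str]:
    """Return None (nil) when the whole field is the fill character."""
    if raw and all(ch == nil_char for ch in raw):
        return None
    return raw


untouched = parse_field(NIL_CHAR * 4)   # storage never overwritten -> nil
in_use = parse_field("ADP" + NIL_CHAR)  # partly overwritten -> kept as data
print(untouched, repr(in_use))
```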


The only thing that bugs me about this is that XSD doesn't allow 
nillable="true" as part of a type definition; you have to put it on an element 
declaration, which means you can't abstract over it without committing to some 
element name. I have the same complaint about dimensionality: it is tied to 
elements and therefore to element names.

________________________________
From: Sloane, Brandon <bslo...@tresys.com>
Sent: Monday, June 17, 2019 5:43:23 PM
To: dev@daffodil.apache.org
Subject: Re: Character Encodings - No Statement

The field it occurs in is fixed-length, so a padding character makes sense.


I am a bit concerned about the implications of using a character that looks 
like a space. This kind of look-alike character (a homoglyph) seems like a 
potential source of errors for people using the schema. Assuming we are correct 
that this character is intended as padding, we can probably avoid this issue by 
advising schema writers to specify U+A0 as a padding character, so it doesn't 
actually make it into the infoset.
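
A minimal sketch of what treating U+00A0 as a pad character would mean for a left-justified, fixed-length field. This is plain Python with hypothetical helper names, standing in for what a schema's padding properties would do:

```python
PAD = "\u00a0"  # assumed pad character


def parse_fixed(raw: str, pad: str = PAD) -> str:
    """Trim trailing pad characters so they never reach the infoset."""
    return raw.rstrip(pad)


def unparse_fixed(value: str, length: int, pad: str = PAD) -> str:
    """Pad the value on the right back out to the fixed field length."""
    if len(value) > length:
        raise ValueError("value longer than fixed-length field")
    return value.ljust(length, pad)


trimmed = parse_fixed("ADP" + PAD * 2)
restored = unparse_fixed(trimmed, 5)
print(repr(trimmed), repr(restored))
```

Only trailing pad characters are removed here, so a U+00A0 in the interior of a value would still appear in the infoset.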

________________________________
From: Beckerle, Mike <mbecke...@tresys.com>
Sent: Monday, June 17, 2019 5:17:20 PM
To: dev@daffodil.apache.org
Subject: Re: Character Encodings - No Statement

This sounds like fixed length data fields, or min-length data fields. So the 
character to use wants to be similar in concept to the pad character - i.e., it 
is used to add length to a fixed length field, but has no significance.


I suggest using U+A0 which is "Non Break Space". This is a space for all 
practical purposes, differing only in how it is treated by hyphenation 
algorithms. Using this instead of regular space will allow this data to 
round-trip.


This character should render like a space in every unicode-aware context.

________________________________
From: Sloane, Brandon <bslo...@tresys.com>
Sent: Monday, June 17, 2019 4:55:09 PM
To: dev@daffodil.apache.org
Subject: Character Encodings - No Statement

I am going through link16 (mil-std-6016e, not publicly available) to add 
support for some of the special character encodings to Daffodil (similar to 
dfi264:dui001 that has already been added).


While doing so, I came across DFI 311 DUI 002. Several bitcodes are 
"UNDEFINED", which I intend to translate into U+FFFD ('�' replacement 
character), which is what we are doing for 264:001.


However, there is also an explicit coding for a NO STATEMENT character. Any 
insight into what a reasonable choice for translating NO STATEMENT to Unicode 
is?
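
To frame the question, here is a sketch of the kind of mapping involved, in Python. The bitcode values and the table are entirely hypothetical (the real assignments are in the standard and are not reproduced here), and the U+00A0 choice for NO STATEMENT is just one candidate, the one suggested in the replies to this message.

```python
REPLACEMENT = "\ufffd"   # used for UNDEFINED bitcodes
NO_STATEMENT = "\u00a0"  # candidate mapping; an assumption, not a decision

# Hypothetical bitcode table, for illustration only.
DECODE = {0: NO_STATEMENT, 1: "A", 2: "B", 3: "C"}
ENCODE = {ch: code for code, ch in DECODE.items()}


def decode(codes):
    """Map bitcodes to characters; unknown codes become U+FFFD."""
    return "".join(DECODE.get(code, REPLACEMENT) for code in codes)


def encode(text):
    """Map characters back to bitcodes; U+FFFD has no code to return to."""
    return [ENCODE[ch] for ch in text]


decoded = decode([1, 2, 0, 7])   # 7 is not in the table
reencoded = encode(decoded[:3])  # the defined part round-trips
print(repr(decoded), reencoded)
```

Giving NO STATEMENT its own distinct character lets it survive a round trip, whereas every UNDEFINED bitcode collapses to the same U+FFFD and cannot be recovered.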


Regards,


Brandon T. Sloane

Associate, Services

bslo...@tresys.com | tresys.com
