I sent a request for clarifications to the DFDL workgroup to get the other 
participants to weigh in.

But I think the answer is going to be that delimited binary is as general as 
delimited text, and that features such as escape schemes are all required 
because they are not prohibited.

This means we really want to leverage the existing delimiter scanning code.

DFDL actually requires a "byte level scanner". Daffodil currently implements a 
text character scanner.

DFDL allows one to specify things like

dfdl:terminator="%#rFF;" dfdl:encoding="utf-8"

This means the terminator is actually the byte FF, which can never appear in 
well-formed utf-8. The FF would screw up the utf-8 decoder/encoder. If you read 
FF with the UTF-8 decoder, you will get either an error or the Unicode 
replacement character, depending on dfdl:encodingErrorPolicy.
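
To make that concrete, here is a small Python sketch (not Daffodil code) of 
what a UTF-8 decoder does with the byte FF. Python's "strict" and "replace" 
error handlers are used here only as rough stand-ins for the two behaviors 
described above; that mapping is my assumption for illustration.

```python
# Byte 0xFF can never appear in well-formed UTF-8.
raw = b"\xff"

# Strict decoding (Python's default) raises an error.
try:
    raw.decode("utf-8")
    strict_result = "decoded"
except UnicodeDecodeError:
    strict_result = "error"

# Lenient decoding substitutes U+FFFD, the Unicode replacement character.
replaced = raw.decode("utf-8", errors="replace")

print(strict_result)         # error
print(replaced == "\ufffd")  # True
```

Either way, the original byte value FF is no longer visible to a scanner 
that only sees decoded characters.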

But given the above, fundamentally DFDL requires a byte-level scanner.  You 
can't implement DFDL fully with delimiter scanning consuming the characters 
from a charset decoder.

Now, I don't think we should rewrite the scanner in Daffodil to fix this. At 
such time as it gets rewritten for performance or other reasons, that would be 
when to improve it to work at the byte level.

Honestly, I don't think implementing a byte-level scanner adds any value for 
the DFDL user community. Right now, getting TLOG to work, which uses delimited 
packed decimal, is the driving use case.

So, the technique I suggest is called "reduction to iso-8859-1". That is, a 
"binary delimited parser" is implemented by way of a "text delimited parser" 
using encoding iso-8859-1 under the covers.

In this encoding, every byte is a valid single-byte-wide character code. The 
correspondence to Unicode code points is exact. I.e., the byte 0xF3 found in 
the data becomes Unicode character U+00F3, which is "ó" (aka LATIN SMALL LETTER 
O WITH ACUTE).
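
A quick way to convince yourself of this, sketched in Python rather than in 
Daffodil's own code: decode all 256 byte values as iso-8859-1 and check that 
each one becomes the code point with the same number, and that encoding gives 
the original bytes back.

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# Decoding as iso-8859-1 never fails and never substitutes anything.
text = all_bytes.decode("iso-8859-1")

# Byte N becomes code point U+00NN ...
code_points = [ord(c) for c in text]
same_numbers = code_points == list(range(256))

# ... and encoding the string again is lossless.
round_trips = text.encode("iso-8859-1") == all_bytes

print(same_numbers, round_trips)                 # True True
print(b"\xf3".decode("iso-8859-1") == "\u00f3")  # True
```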

(Btw: I highly recommend this simple utf-8 tool: 
http://www.ltg.ed.ac.uk/~richard/utf-8.cgi)

You must translate the delimiters from whatever charset encoding they are 
specified in, to bytes - which are then the iso-8859-1 character codes one is 
searching for.

For example: if dfdl:encoding="ebcdic-cp-1" dfdl:terminator="$" we must 
translate the $ from ebcdic to get byte 5B, and then determine the iso-8859-1 
character corresponding to 5B which is "[".
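
Here is a minimal Python sketch of that translation step. The function name 
delimiter_to_latin1 is made up for illustration, and I am assuming Python's 
"cp037" codec (EBCDIC US/Canada) is an acceptable stand-in for the EBCDIC 
encoding named above.

```python
def delimiter_to_latin1(delimiter: str, encoding: str) -> str:
    """Re-express a delimiter as the iso-8859-1 characters to scan for.

    Hypothetical helper: encode the delimiter in its declared charset to
    get the bytes that appear in the data, then read those bytes back as
    iso-8859-1 characters.
    """
    raw = delimiter.encode(encoding)   # bytes as they appear in the data
    return raw.decode("iso-8859-1")    # one character per byte


# "$" in EBCDIC is byte 0x5B, which iso-8859-1 reads as "[".
print("$".encode("cp037").hex())           # 5b
print(delimiter_to_latin1("$", "cp037"))   # [

# For an ASCII delimiter the translation is the identity.
print(delimiter_to_latin1(":", "us-ascii"))  # :
```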

Then we artificially, in the implementation, change the encoding to iso-8859-1, 
and the terminator to "[", and set textPadKind and textTrimKind to 'none'.

Once we have isolated the iso-8859-1 string, we can convert to bytes and then 
interpret it as packed or hexBinary data.
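
Putting the pieces together, a rough end-to-end sketch in Python might look 
like the following. Both function names are hypothetical, the scan is a plain 
string search rather than Daffodil's delimiter scanner, and the packed decimal 
decoder assumes the common IBM layout (two digits per byte, sign in the last 
nibble), not the IBM4690 variant.

```python
from typing import Tuple


def unpack_packed_decimal(raw: bytes) -> int:
    """Decode packed decimal: BCD digit nibbles followed by a sign nibble."""
    nibbles = []
    for b in raw:
        nibbles.append(b >> 4)
        nibbles.append(b & 0x0F)
    *digits, sign = nibbles  # raises ValueError on empty input
    if any(d > 9 for d in digits):
        raise ValueError("invalid digit nibble in packed decimal")
    value = 0
    for d in digits:
        value = value * 10 + d
    # 0xB and 0xD mean negative; other sign nibbles are treated as positive.
    return -value if sign in (0x0B, 0x0D) else value


def parse_delimited_packed(data: bytes, terminator_latin1: str) -> Tuple[int, int]:
    """Return (value, bytes consumed including the terminator)."""
    text = data.decode("iso-8859-1")        # one character per byte
    end = text.find(terminator_latin1)      # "text" delimiter scan
    if end < 0:
        raise ValueError("terminator not found")
    field = text[:end].encode("iso-8859-1")  # back to the original bytes
    return unpack_packed_decimal(field), end + len(terminator_latin1)


data = b"\x12\x3C:\x45\x6D:"
first, used = parse_delimited_packed(data, ":")
second, _ = parse_delimited_packed(data[used:], ":")
print(first, second)  # 123 -456
```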

I recommend these restrictions to make the implementation as easy as possible 
for now.

In Daffodil delimited binary should require:

1) delimiters must not contain character class entities
2) all delimiter characters must have single-byte representations in the 
specified charset encoding
3) dfdl:encoding must not be a runtime expression and must be a byte-aligned 
encoding (not 7 bit, 6 bit, etc.).
3a) To ensure reasonable diagnostic messages, dfdl:encoding must be a 
single-byte-wide, ascii-derived encoding - practically speaking this means 
Daffodil would allow only us-ascii and iso-8859-1 encodings.
4) escape schemes must not be specified (dfdl:escapeSchemeRef="" or no 
definition in scope)
5) delimited binary elements must be byte aligned. (Cannot begin on a 4-bit 
boundary in the middle of a byte)
6) No support for raw byte value entities, i.e., the %#rHH; notation.
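
To give a feel for how small these checks are, here is a hedged Python sketch 
covering restrictions 1, 2, 3a and 6 for a single delimiter string. The 
function name and the messages are invented for illustration; this is not how 
Daffodil's schema compiler is structured, and restrictions 3, 4 and 5 would 
need schema context that this sketch does not model.

```python
import re

# DFDL character class entities: %NL; %WSP; %WSP*; %WSP+; %ES;
CHAR_CLASS_ENTITY = re.compile(r"%(NL|WSP[*+]?|ES);")
# DFDL raw byte value entities: %#rHH;
RAW_BYTE_ENTITY = re.compile(r"%#r[0-9A-Fa-f]{2};")
ALLOWED_ENCODINGS = {"us-ascii", "iso-8859-1"}


def check_binary_delimiter(delimiter: str, encoding: str) -> list:
    """Return a list of restriction violations (empty means acceptable)."""
    problems = []
    if encoding.lower() not in ALLOWED_ENCODINGS:
        problems.append("encoding must be us-ascii or iso-8859-1")
        return problems
    if CHAR_CLASS_ENTITY.search(delimiter):
        problems.append("character class entities are not allowed")
    if RAW_BYTE_ENTITY.search(delimiter):
        problems.append("raw byte entities (%#rHH;) are not allowed")
    if not problems:
        # Both allowed encodings are single-byte, so a character has a
        # single-byte representation exactly when it can be encoded.
        for ch in delimiter:
            try:
                ch.encode(encoding)
            except UnicodeEncodeError:
                problems.append("no single-byte representation for %r" % ch)
    return problems


print(check_binary_delimiter(":", "us-ascii"))     # []
print(len(check_binary_delimiter("%NL;", "us-ascii")))    # 1
print(len(check_binary_delimiter("%#rFF;", "iso-8859-1")))  # 1
```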

I'd be completely happy with separate JIRA tickets for enhancing the 
implementation to lift any of these restrictions (some certainly exist), but I 
wouldn't even create them yet. I'd just write down these restrictions in our 
Daffodil-specific documentation, our release notes, and in code comments.

________________________________
From: Joshua Adams <[email protected]>
Sent: Wednesday, November 8, 2017 8:07:21 AM
To: [email protected]
Subject: Packed Decimal lengthKind="delimited"

I have been in the process of implementing support for packed decimal, BCD, and 
Ibm4690Packed binary formats and while I believe I have implemented the parsers 
and unparsers for these correctly, I am running into an issue getting the 
IBM4690-TLOG schema project running.  Both the ACE and SA schemas make use of 
lengthKind="delimited" with ':' as the separator.  Currently there is no 
support for delimited binary data in the codebase, as we did not have support 
for these packed formats and lengthKind="delimited" is only allowed on packed 
formats according to the spec.

So, I'm guessing I will need to add a binary delimited parser in order to 
handle this data as I am assuming that the TextDelimitedParser will not 
work with binary data?  I just want to verify that I am headed in the right 
direction before committing a bunch of time to implementing a new delimited 
parser.

Thanks,

Josh
