Re: Question about gobbling up hex digits until arriving at a string

Mike Beckerle Tue, 20 Nov 2018 08:17:41 -0800

Roger,


I think you will need to use dfdl:byteOrder bigEndian with your hexBinary due 
to this.


I do think we will have to revert this "feature" of daffodil where hexBinary is 
sensitive to bit and byte order.


I have had a pending action item with the DFDL Working Group for over a year to 
provide a write-up of the hexBinary with length-in-bits functionality. When 
originally discussed, this idea was considered "not unreasonable", but it has 
been pending a write-up that the workgroup can analyze with multiple 
implementations in mind.


I'm pretty sure though, that the way this works in Daffodil today will not be 
acceptable. The DFDL workgroup is very unlikely to accept any proposal that 
will change the behavior of existing schemas. Making the hex string in the 
infoset reverse because of byteOrder will be grossly backward-incompatible.


Furthermore the spec says the length units are implicitly always bytes for 
hexBinary. So we can't say the hexBinary behavior will change if lengthUnits is 
'bits' either without introducing backward incompatibility. I can't recall the 
alignment constraints, but there ought to be 4-bit alignment required as well.


Whatever we do to allow hexBinary with length in bits will have to be 100% 
backward compatible with all those constraints.


What we'll need is a daffodil-specific flag for turning on this new mode of 
behavior where hexBinary effectively behaves exactly like 
xs:nonNegativeInteger, but presented in the infoset as a hexadecimal number, 
instead of decimal.
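
To illustrate the integer-like reading described above, here is a minimal
Python sketch. It is not Daffodil code, and the function name
hex_binary_as_integer is made up for this example. It reads whole bytes as a
non-negative integer using the given byte order and prints that value as
zero-padded hexadecimal. It assumes byte-sized lengths and
most-significant-bit-first bit order only.

```python
def hex_binary_as_integer(data: bytes, byte_order: str) -> str:
    """Hypothetical helper: read bytes as a non-negative integer, show as hex."""
    if byte_order not in ("big", "little"):
        raise ValueError("byte_order must be 'big' or 'little'")
    if not data:
        return ""
    value = int.from_bytes(data, byteorder=byte_order)
    # Pad to two hex digits per input byte so leading zero bytes survive.
    return format(value, "0{}X".format(2 * len(data)))


# The integer 4023 (0x0FB7) stored little-endian is the byte pair B7 0F.
stored = bytes.fromhex("B70F")
little = hex_binary_as_integer(stored, "little")  # "0FB7"
big = hex_binary_as_integer(stored, "big")        # "B70F"
print(little, big)
```

With "big" the hex string mirrors the bytes as they appear in the file, which
is why bigEndian avoids the flipped output.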


________________________________
From: Steve Lawrence <[email protected]>
Sent: Tuesday, November 20, 2018 10:35:46 AM
To: [email protected]; Roger Costello
Subject: Re: Question about gobbling up hex digits until arriving at a string

Statement (a) is correct in our current implementation.

But now that I look at the spec, it does state that byteOrder only
applies to Number, Calendar, and Boolean types. So perhaps our behavior
is incorrect, or perhaps the DFDL spec may need an errata. The
discussion pasted below does seem like reasonable reasoning for how
byteOrder and bitOrder should play a role in hexBinary data, especially
when one takes into account non-byte size lengths, which were not
originally part of the DFDL spec.

- Steve

On 11/20/18 10:21 AM, Costello, Roger L. wrote:
>> The value of dfdl:byteOrder actually does affect
>> the order of hexBinary output.
>
> A few days ago, Mike posted a message saying that a hex array (i.e., 
> xs:hexBinary) is not affected by littleEndian/bigEndian:
>
>        Byte order doesn't apply to type xs:hexBinary
>        because a hexBinary is effectively a byte string,
>        and byte order only applies when more than
>        one byte is spanned by a single number.
>
> Now I'm confused. Which of the following statements is correct:
>
> (a) Byte order applies to xs:hexBinary.
> (b) Byte order does not apply to xs:hexBinary.
>
> /Roger
>
> -----Original Message-----
> From: Steve Lawrence <[email protected]>
> Sent: Tuesday, November 20, 2018 10:10 AM
> To: [email protected]; Costello, Roger L. <[email protected]>
> Subject: Re: Question about gobbling up hex digits until arriving at a string
>
>
> The value of dfdl:byteOrder actually does affect the order of hexBinary 
> output. Looking through the git log and my email history, I've found where 
> this decision was made.
>
> In March 2017 we added support for non-byte size lengths for hexBinary data. 
> This resulted in some discussions about how to handle canonicalization of 
> hexBinary data where there aren't full bytes of data (which non-byte size 
> lengths would allow). The XSD specification is mostly silent on this since it 
> states that hexBinary data must always represent full bytes. So some 
> interpretation was needed. I've copied and pasted the result of those 
> discussions from Mike that I think explains the reasoning why byteOrder (and 
> bitOrder) affect the hexBinary output.
>
> The gist is that you could think of the process as
>
> 1. Convert the specified length number of bits to a nonNegativeInteger
>    using byteOrder and bitOrder
> 2. Convert that logical value to a big-endian two's complement bit
>    string
> 3. Convert those bits to hexBinary
>
> The actual process is a bit more efficient than that, but that's the general 
> idea.
>
> The result is that if you don't want your bytes flipped in hexBinary data, 
> model it as bigEndian instead of littleEndian.
>
> - Steve
>
> Original discussion below:
>
>> I looked up the XPath xs:hexBinary constructor function, ended up at
>> this statement found in the XSD description of hexBinary:
>>
>>   hexBinary has a lexical representation where each binary octet
>>   is encoded as a character tuple, consisting of two hexadecimal
>>   digits ([0-9a-fA-F]) representing the octet code. For example,
>>   "0FB7" is a hex encoding for the 16-bit integer 4023 (whose
>>   binary representation is 111110110111).
>>
>> (They then say the "canonical" representation doesn't use a-f
>> lowercase.)
>>
>> Note how the bit string they give has 12 bits in it, so they are
>> padding on the left with zeros to get full bytes.
>>
>> So what they're saying here is that hexBinary's lexical representation
>> is given in terms of numeric value equivalents.
>>
>> This achieves the canonicalization you were suggesting. It loses the
>> behavior where if everything is byte aligned and byte sized that bit
>> order and byte order don't matter, because they do matter to numbers.
>>
>> For DFDL this would mean we define the xs:hexBinary value you get, in
>> terms of xs:nonNegativeInteger of the same bits.
>>
>> But notice how the interpretation of the binary representation is
>> bigEndian MSBF relative to the binary data they give there, which is a
>> "logical" binary number. Still. They're stating that if the data is
>> the number 4023, then the hexBinary of it *is* "0FB7".
>>
>> So, if I store logical integer 4023 in 16 bits, I can do it bigEndian
>> MSBF, or littleEndian MSBF, or littleEndian LSBF. In all cases, if the
>> value stored would be parsed/unparsed as 4023, the hexBinary would be
>> "0FB7".
>>
>> Now, if I store it as 12 bits, which is capable of holding enough bits
>> to store that value as an unsignedInt, then bigEndian MSBF, starting
>> at bit 2 of a byte I get
>>
>>   XX111110 110111XX
>>
>> Stored LSBF littleEndian, numbering bits and byte RTL I get that exact
>> same picture, just numbered all backwards. But if I number everything
>> the normal LTR way, I get
>>
>>   110111XX XX111110
>>
>> The bytes in the file, to represent the value 4023 in these two
>> representations are extraordinarily different. But the hexBinary
>> representation of these 12 bit elements would be exactly the same.
>> I.e.,
>>
>>   <element name="lsbf" type="xs:hexBinary"
>>     dfdl:byteOrder="littleEndian"
>>     dfdl:bitOrder="leastSignificantBitFirst"
>>     dfdl:alignmentUnits="bits"
>>     dfdl:leadingSkip="2"
>>     dfdl:lengthUnits="bits"
>>     dfdl:length="12"/>
>>
>> vs. same thing but bigEndian, MSBF
>>
>> Now, let's look at an interesting case:
>>
>>   <element name="foo" dfdl:length="5"
>>     .... everything else as above lsbf />
>>
>> Suppose the byte is 01011010 (5A)
>>
>> The foo element is X10110XX. The value is 0x16 or 22 decimal.
>>
>> Now let's describe those exact same bits msbf.
>>
>>   <element name="bar" dfdl:leadingSkip="1" dfdl:length="5".... />
>>
>> These are the exact same 5 bits. We are just "coming at them" from the
>> other side.
>>
>> And the hexBinary for them would be... I think 0x16 also, or 22
>> decimal.
>>
>> Since this is less than 1 byte of data, byteOrder doesn't come into
>> play. BitOrder plays the role of isolating which bits we're talking
>> about, but once the same 5 bits are isolated, the bit positions don't
>> matter. MSBF and LSBF are about the assignment of bit positions to
>> place values, but only the place value of each bit matters for
>> purposes of the numeric value.
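
The 5-bit case above can be checked with a small Python sketch. This is an
illustration only, not Daffodil's implementation; isolate_bits and its
parameters are hypothetical names. It assumes the field lies inside a single
byte, that "msbf" counts the skip from the most significant bit, and that
"lsbf" counts it from the least significant bit.

```python
def isolate_bits(byte: int, skip: int, length: int, bit_order: str) -> int:
    """Hypothetical helper: numeric value of `length` bits inside one byte."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("byte must be in 0..255")
    if skip < 0 or length < 1 or skip + length > 8:
        raise ValueError("field must fit inside one byte")
    if bit_order == "msbf":
        # Skipped bits are the high-order ones; shift out what follows the field.
        shift = 8 - skip - length
    elif bit_order == "lsbf":
        # Skipped bits are the low-order ones; shift them out directly.
        shift = skip
    else:
        raise ValueError("bit_order must be 'msbf' or 'lsbf'")
    return (byte >> shift) & ((1 << length) - 1)


# Byte 01011010 (0x5A): "foo" skips 2 bits LSBF, "bar" skips 1 bit MSBF.
foo = isolate_bits(0x5A, skip=2, length=5, bit_order="lsbf")
bar = isolate_bits(0x5A, skip=1, length=5, bit_order="msbf")
print(foo, bar, format(foo, "02X"))
```

Both calls select the same five bits (10110), so both give 22, shown as hex 16.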
>>
>> The implications of the above: hexBinary parser/unparser should share
>> lots of code with xs:nonNegativeInteger parser/unparser.
>
>
>
>
>
> On 11/20/18 9:00 AM, Costello, Roger L. wrote:
>> Thank you Steve!
>>
>> A very odd thing happened after I made those changes.
>>
>> Recall that the second group of binary is this: all the binary until "PE" is 
>> encountered.
>>
>> The PE data is actually 4 bytes (PE\0\0). Looking at the PE data in a hex 
>> editor I see this:
>>
>>       50 45 00 00
>>
>> So, after outputting the second group of binary, I output the PE data:
>>
>> <xs:element name="PE_Header">
>>     <xs:complexType>
>>         <xs:sequence>
>>             <xs:sequence dfdl:hiddenGroupRef="hidden_signature_Group" />
>>             <xs:element name='Signature' type='xs:string'
>>                 dfdl:inputValueCalc='{
>>                 if (xs:string(../Hidden_signature) eq "50450000") then "PE\0\0"
>>                 else fn:error("signature PE\0\0 not present")
>>                 }'>
>>             </xs:element>
>>         </xs:sequence>
>>     </xs:complexType>
>> </xs:element>
>>
>> Here's the odd part: Somehow, the PE data got reversed! For my 
>> inputValueCalc to work, I needed to change the if-statement to this:
>>
>> if (xs:string(../Hidden_signature) eq "00004500") then "PE\0\0"
>>
>> Notice that I flipped 50450000 to 00004500.
>>
>> It seems that picking up the second group of binary has had the side effect 
>> of flipping the PE data.
>>
>> Note: in my DFDL schema I have this setting: byteOrder="littleEndian" (I 
>> think that is somehow related to what's happening).
>>
>> Can you explain what's happening Steve, please? Why is the PE data flipping?
>>
>> /Roger
>>
>> -----Original Message-----
>> From: Steve Lawrence <[email protected]>
>> Sent: Tuesday, November 20, 2018 8:23 AM
>> To: [email protected]; Costello, Roger L. <[email protected]>
>> Subject: Re: Question about gobbling up hex digits until arriving at a
>> string
>>
>> Looks like the issue in this case was the use of the dot character--my 
>> suggested regex wasn't completely correct. By default the dot character in 
>> regular expressions matches all characters EXCEPT for new line characters. 
>> And some of the bytes in the second Instruction area happen to be 0x0D, 
>> which is a carriage return and which dot does not match. So the regular 
>> expression I provided didn't actually match all bytes like I suggested it 
>> did.
>>
>> So you can replace the dot with a character class that matches all bytes 
>> (i.e. [\x00-\xFF]) to ensure those newline characters are matched. Also, the 
>> + needs to be made non-greedy by appending a question mark. Try changing the 
>> regular expressions to the following:
>>
>>   [\x00-\xFF]+?(?=This program cannot be run in DOS mode\.)
>>
>>   [\x00-\xFF]+?(?=PE)
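
The effect of the dot versus the byte character class can be reproduced with
a short Python sketch. Two assumptions: the bytes are decoded as ISO-8859-1
so each byte is one character, and the sample bytes are invented. Python's
dot excludes only 0x0A by default, so the sample uses a 0x0A byte; Java-style
regex engines also exclude 0x0D, which is the case described above. The
helper matched_length is a hypothetical name.

```python
import re


def matched_length(pattern: str, text: str) -> int:
    """Hypothetical helper: length of the match at the start, or 0 if none."""
    m = re.match(pattern, text)
    return len(m.group(0)) if m else 0


# Four arbitrary bytes (one of them a newline) followed by the PE signature.
data = bytes([0x0E, 0x1F, 0x0A, 0xBA]) + b"PE\x00\x00"
text = data.decode("iso-8859-1")

dot_length = matched_length(r".+?(?=PE)", text)
class_length = matched_length(r"[\x00-\xFF]+?(?=PE)", text)
print(dot_length, class_length)
```

The dot pattern stops at the newline and never reaches "PE", so nothing is
matched; the character class consumes all four bytes.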
>>
>> - Steve
>>
>> On 11/19/18 5:17 PM, Costello, Roger L. wrote:
>>> Thank you Steve and Mike!
>>>
>>> I have made progress, using your suggestions. I have almost got it.
>>>
>>> My input contains:
>>>
>>>   * A bunch of binary
>>>   * Then the string: This program cannot be run in DOS mode.
>>>   * Then a bunch more binary
>>>   * And then the string: PE
>>>
>>> Here's the DFDL code:
>>>
>>> <xs:element name="DOS_Stub">
>>> <xs:complexType>
>>> <xs:sequence>
>>> <xs:element    name="Instructions"
>>>                                     type="xs:hexBinary"
>>>                                     dfdl:lengthKind="pattern"
>>>                                     dfdl:lengthPattern=".+(?=This program cannot be run in DOS mode\.)"/>
>>> <xs:element    name="Message"
>>>                                     type="xs:string"
>>>                                     dfdl:lengthUnits="characters"
>>>                                     dfdl:lengthKind="explicit"
>>>                                     dfdl:length="39"
>>>                                     dfdl:representation="text"
>>>                                     dfdl:encoding="ISO-8859-1"/>
>>> <xs:element    name="Instructions"
>>>                                     type="xs:hexBinary"
>>>                                     dfdl:lengthKind="pattern"
>>>                                     dfdl:lengthUnits="bytes"
>>>                                     dfdl:representation="binary"
>>>                                     dfdl:lengthPattern=".+(?=PE)"/>
>>> </xs:sequence> </xs:complexType> </xs:element>
>>>
>>> Parsing successfully gobbles up the first group of binary, then the
>>> first string, but fails to gobble up the second group of binary:
>>>
>>> <DOS_Stub>
>>> <Instructions>0E1FBA0E00B409CD21B8014CCD21</Instructions>
>>> <Message>This program cannot be run in DOS mode.</Message>
>>> <Instructions></Instructions> </DOS_Stub>
>>>
>>> Why is the second group of binary not being picked up?
>>>
>>> /Roger
>>>
>>> *From:* Mike Beckerle <[email protected]>
>>> *Sent:* Monday, November 19, 2018 2:00 PM
>>> *To:* [email protected]; Costello, Roger L.
>>> <[email protected]>
>>> *Subject:* Re: Question about gobbling up hex digits until arriving
>>> at a string
>>>
>>> Also,
>>>
>>> Set dfdl:encoding to 'iso-8859-1'.
>>>
>>> If you are using ASCII, then as soon as a byte with the 8th bit set
>>> is encountered, you won't get what you think.
>>>
>>> Encoding 'iso-8859-1' is the magic "bytes" encoding where every byte
>>> is one character no matter the byte value.
>>>
>>> ASCII, surprising to some people, is not at all like this.
>>>
>>> ASCII is 7-bit, and if a byte has the 8th bit set, it will cause a
>>> decode error, and you will instead get a
>>> Unicode-replacement-character created for that byte.
>>>
>>> This replacement character  usually looks like a stylized question
>>> mark (if you have a unicode font). But that won't match your regex
>>> because the code-point for the Unicode replacement character is
>>> U+FFFD.  The ranges in your regex won't accept these.
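
A small Python sketch of the difference, using Python's own decoders as a
stand-in (Daffodil's decoders are separate code, so this only illustrates the
idea; the sample bytes are invented):

```python
import re

# Three 7-bit bytes and one byte (0xBA) with the 8th bit set.
raw = bytes([0x0E, 0x1F, 0xBA, 0x0E])

# ASCII cannot decode 0xBA; with errors="replace" it becomes U+FFFD.
as_ascii = raw.decode("ascii", errors="replace")
# ISO-8859-1 maps every byte value to the code point with the same number.
as_latin1 = raw.decode("iso-8859-1")

byte_class = re.compile(r"[\x00-\xFF]+")
latin1_ok = byte_class.fullmatch(as_latin1) is not None
ascii_ok = byte_class.fullmatch(as_ascii) is not None
print(repr(as_ascii), repr(as_latin1), latin1_ok, ascii_ok)
```

The ISO-8859-1 text stays inside the \x00-\xFF range, while the replacement
character in the ASCII-decoded text falls outside it.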
>>>
>>> ...mike beckerle
>>>
>>> ---------------------------------------------------------------------
>>>
>>> *From:* Steve Lawrence <[email protected]>
>>> *Sent:* Monday, November 19, 2018 1:47:57 PM
>>> *To:* [email protected]; Roger Costello
>>> *Subject:* Re: Question about gobbling up hex digits until arriving
>>> at a string
>>>
>>> On second look, I think the issue is more clear. The regex you have is:
>>>
>>>    [\x30-\x39\x41-\x46\x61-\x66]+?(?=\x54)
>>>
>>> Those hex values are all ASCII characters, and could be rewritten like so:
>>>
>>>    [0-9A-Fa-f]+?(?=T)
>>>
>>> So your regex actually will only match data that contains those ASCII
>>> characters followed by the letter T. But I suspect your data isn't
>>> ASCII, it's actual binary data that could be anything. Since your
>>> data doesn't contain those ASCII characters, your pattern will fail
>>> to match and the matched length is considered zero. It then decodes 39
>>> bytes of data, with the initial bytes being binary data followed by
>>> the beginning of the ASCII string.
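
A quick way to see this, sketched in Python with invented sample data and
assuming the input is decoded as ISO-8859-1: the escaped and plain patterns
behave identically, and both match only text made of hex digit characters,
not raw bytes.

```python
import re

escaped = re.compile(r"[\x30-\x39\x41-\x46\x61-\x66]+?(?=\x54)")
plain = re.compile(r"[0-9A-Fa-f]+?(?=T)")

# Hex digits written out as text, followed by the message.
hex_text = "0E1FBAThis program"
# The same three bytes as raw binary, followed by the message.
binary = (bytes.fromhex("0E1FBA") + b"This program").decode("iso-8859-1")

text_escaped = escaped.match(hex_text)
text_plain = plain.match(hex_text)
binary_escaped = escaped.match(binary)
binary_plain = plain.match(binary)
print(text_escaped.group(0), text_plain.group(0), binary_escaped, binary_plain)
```

Both patterns match "0E1FBA" in the text form and return no match for the raw
bytes, which corresponds to the zero-length result described above.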
>>>
>>> So the schema needs to be modified to either use a different regex or
>>> use some other method to determine where the data ends and the
>>> message begins. To me, it seems odd to have a binary format where the
>>> length of binary data is just some amount until it finds the letter
>>> 'T', so I would think a better description would exist. That said,
>>> such a regex would look like this:
>>>
>>>    [^T]+
>>>
>>> - Steve
>>>
>>>
>>> On 11/19/18 12:50 PM, Steve Lawrence wrote:
>>>  > Roger,
>>>  >
>>>  > I am unable to reproduce this issue. I've created a TDML file at the
>>>  > below link, which defines a schema and a test case with sample input
>>>  > data and expected infoset, based on your description.
>>>  >
>>>  >
>>>  > https://gist.github.com/stevedlawrence/c4051386c4ed58279dbcae1e75d08218
>>>  >
>>>  > This can be tested with:
>>>  >
>>>  >   daffodil test -i hexPattern.tml
>>>  >
>>>  > And I get the output:
>>>  >
>>>  >   [Fail] hexPattern
>>>  >     Failure Information:
>>>  >       Left over data. Consumed 408 bit(s) with 16 bit(s) remaining.
>>>  >
>>>  >   Total: 1, Pass: 0, Fail: 1, Not Found: 0
>>>  >
>>>  > So it fails, but it fails because the schema does not consume the
>>>  > trailing PE, so that's expected. The actual infoset does match the
>>>  > expected infoset.
>>>  >
>>>  > Maybe your input data is different or there is some other property you
>>>  > have defined in dfdl:format that is changing the behavior?
>>>  >
>>>  > Thanks,
>>>  > - Steve
>>>  >
>>>  > On 11/17/18 10:54 AM, Costello, Roger L. wrote:
>>>  >> Hello DFDL Community,
>>>  >>
>>>  >> Within my input is this:
>>>  >>
>>>  >> - a series of bytes
>>>  >> - then the string: "This program cannot be run in DOS mode."
>>>  >> - then another series of bytes until arriving at this string: "PE"
>>>  >>
>>>  >> I figured that for the first series of bytes I would use
>>>  >> xs:hexBinary whose length ends when getting to "T" (hex 54)
>>>  >>
>>>  >> <xs:element   name="Instructions_in_hex"
>>>  >>               type="xs:hexBinary"
>>>  >>               dfdl:lengthKind="pattern"
>>>  >>               dfdl:lengthPattern="[\x30-\x39\x41-\x46\x61-\x66]+?(?=\x54)" />
>>>  >>
>>>  >> The next item is a string of length 39
>>>  >>
>>>  >> <xs:element   name="Message"
>>>  >>               type="xs:string"
>>>  >>               dfdl:lengthUnits="characters"
>>>  >>               dfdl:lengthKind="explicit"
>>>  >>               dfdl:length="39" />
>>>  >>
>>>  >> The last item is a series of hex digits whose length ends when
>>>  >> getting to "P" (hex 50)
>>>  >>
>>>  >> <xs:element   name="Instructions_in_hex"
>>>  >>               type="xs:hexBinary"
>>>  >>               dfdl:lengthKind="pattern"
>>>  >>               dfdl:lengthPattern="[\x30-\x39\x41-\x46\x61-\x66]+?(?=\x50)" />
>>>  >>
>>>  >> At the bottom of this message is the complete set of declarations.
>>>  >>
>>>  >> Unfortunately, it doesn't work. The first <Instructions_in_hex>
>>>  >> picks up nothing. Then the <Message> element erroneously picks up a
>>>  >> bunch of hex digits and the first part of the string "This program
>>>  >> cannot be run in DOS mode.". Then it crashes.
>>>  >>
>>>  >> What am I doing wrong, please?
>>>  >>
>>>  >> /Roger
>>>  >>
>>>  >> <xs:element name="DOS_Stub">
>>>  >>     <xs:complexType>
>>>  >>         <xs:sequence>
>>>  >>             <xs:element       name="Instructions_in_hex"
>>>  >>                       type="xs:hexBinary"
>>>  >>                       dfdl:lengthKind="pattern"
>>>  >>                       dfdl:lengthPattern="[\x30-\x39\x41-\x46\x61-\x66]+?(?=\x54)" />
>>>  >>             <xs:element       name="Message"
>>>  >>                       type="xs:string"
>>>  >>                       dfdl:lengthUnits="characters"
>>>  >>                       dfdl:lengthKind="explicit"
>>>  >>                       dfdl:length="39" />
>>>  >>             <xs:element       name="Instructions_in_hex"
>>>  >>                       type="xs:hexBinary"
>>>  >>                       dfdl:lengthKind="pattern"
>>>  >>                       dfdl:lengthPattern="[\x30-\x39\x41-\x46\x61-\x66]+?(?=\x50)" />
>>>  >>         </xs:sequence>
>>>  >>     </xs:complexType>
>>>  >> </xs:element>
>>>  >>
>>>  >
>>>
>>
>
