Also,

Set dfdl:encoding to 'iso-8859-1'.


If you are using ASCII, then as soon as a byte with the 8th bit set is 
encountered, you won't get the character you expect.


Encoding 'iso-8859-1' is the magic "bytes" encoding where every byte is one 
character no matter the byte value.


ASCII, surprising to some people, is not at all like this.


ASCII is 7-bit, and if a byte has the 8th bit set, it will cause a decode 
error, and you will instead get a Unicode replacement character created for 
that byte.


This replacement character usually looks like a stylized question mark (if you 
have a Unicode font). But it won't match your regex, because the code point 
of the Unicode replacement character is U+FFFD, and the ranges in your regex 
don't include it.
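
As a rough illustration of the difference, here is a minimal Python sketch. It
is not Daffodil itself: the sample bytes and the "any byte until T" pattern are
made up for the example, and Python's errors="replace" is assumed to stand in
for the replacement behavior described above.

```python
import re

# Hypothetical sample: two non-text bytes followed by the start of the message.
raw = bytes([0x0E, 0xB4]) + b"This"

# ASCII decoding: 0xB4 has the 8th bit set, so it becomes U+FFFD.
as_ascii = raw.decode("ascii", errors="replace")
# iso-8859-1 decoding: every byte becomes the code point with the same value.
as_latin1 = raw.decode("iso-8859-1")

print([hex(ord(c)) for c in as_ascii[:2]])   # ['0xe', '0xfffd']
print([hex(ord(c)) for c in as_latin1[:2]])  # ['0xe', '0xb4']

# A pattern whose ranges only cover code points 0x00 to 0xFF.
any_byte_until_T = re.compile(r"[\x00-\xff]+?(?=T)")

# Anchored at the start of the data, as a length pattern would be.
ascii_match = any_byte_until_T.match(as_ascii)    # None: U+FFFD is out of range
latin1_match = any_byte_until_T.match(as_latin1)  # matches the two leading bytes
```

With iso-8859-1 the pattern sees the original byte values, so ranges written
with \x escapes behave as byte ranges.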


...mike beckerle


________________________________
From: Steve Lawrence <[email protected]>
Sent: Monday, November 19, 2018 1:47:57 PM
To: [email protected]; Roger Costello
Subject: Re: Question about gobbling up hex digits until arriving at a string

On second look, I think the issue is more clear. The regex you have is:

  [\x30-\x39\x41-\x46\x61-\x66]+?(?=\x54)

Those hex values are all ASCII characters, and could be rewritten like so:

  [0-9A-Fa-f]+?(?=T)

So your regex actually will only match data that contains those ASCII
characters followed by the letter T. But I suspect your data isn't
ASCII, it's actual binary data that could be anything. Since your data
doesn't contain those ASCII characters, your pattern will fail to match
and the matched length is considered zero. It then decodes 39 bytes of
data, with the initial bytes being binary data followed by the beginning
of the ASCII string.
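
The failure mode described above can be sketched in Python. This is only an
approximation: Python's re module is not the regex engine Daffodil uses, the
four leading bytes are invented sample data, and slicing 39 characters is a
stand-in for the explicit-length Message element.

```python
import re

# The original pattern and its plain-ASCII rewrite accept the same characters.
original = re.compile(r"[\x30-\x39\x41-\x46\x61-\x66]+?(?=\x54)")
rewritten = re.compile(r"[0-9A-Fa-f]+?(?=T)")

text = "This program cannot be run in DOS mode."  # 39 characters

# Hypothetical binary prefix, decoded one byte per character.
binary_data = bytes([0x0E, 0x1F, 0xBA, 0x0E]).decode("iso-8859-1") + text

# Anchored at the start: the first byte is not a hex digit character, so no match.
m = original.match(binary_data)
matched_length = len(m.group(0)) if m else 0  # 0

# The next element then takes 39 characters starting from position 0.
message = binary_data[matched_length:matched_length + 39]

# The pattern only matches when the data really is hex digit *characters*.
text_data = "4D5A" + text
text_match = rewritten.match(text_data)  # matches "4D5A"
```

Here message ends up holding the four binary characters followed by only the
first 35 characters of the string, which is the symptom reported in the
original question.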

So the schema needs to be modified to either use a different regex or
use some other method to determine where the data ends and the message
begins. To me, it seems odd to have a binary format where the length of
binary data is just some amount until it finds the letter 'T', so I
would think a better description would exist. That said, such a regex
would look like this:

  [^T]+
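
A quick way to see what this pattern does, sketched in Python rather than in
Daffodil (the byte values are made-up sample data, and iso-8859-1 decoding is
assumed so that every byte is one character):

```python
import re

message = "This program cannot be run in DOS mode."

# Hypothetical binary prefix that does not contain the byte 0x54 ('T').
binary_prefix = bytes([0x0E, 0x1F, 0xBA, 0x0E, 0x00, 0xB4])
data = binary_prefix.decode("iso-8859-1") + message

# Everything up to, but not including, the first 'T'.
prefix_length = len(re.match(r"[^T]+", data).group(0))  # 6

# If the binary data itself contains 0x54, the match stops there instead.
tricky = bytes([0x0E, 0x54, 0xBA]).decode("iso-8859-1") + message
early_length = len(re.match(r"[^T]+", tricky).group(0))  # 1
```

The second case shows why a length that depends on finding the letter 'T' is
fragile for arbitrary binary data.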

- Steve


On 11/19/18 12:50 PM, Steve Lawrence wrote:
> Roger,
>
> I am unable to reproduce this issue. I've created a TDML file at the
> below link, which defines a schema and a test case with sample input
> data and expected infoset, based on your description.
>
>   https://gist.github.com/stevedlawrence/c4051386c4ed58279dbcae1e75d08218
>
> This can be tested with:
>
>   daffodil test -i hexPattern.tml
>
> And I get the output:
>
>   [Fail] hexPattern
>     Failure Information:
>       Left over data. Consumed 408 bit(s) with 16 bit(s) remaining.
>
>   Total: 1, Pass: 0, Fail: 1, Not Found: 0
>
> So it fails, but it fails because the schema does not consume the
> trailing PE, so that's expected. The actual infoset does match the
> expected infoset.
>
> Maybe your input data is different or there is some other property you
> have defined in dfdl:format that is changing the behavior?
>
> Thanks,
> - Steve
>
> On 11/17/18 10:54 AM, Costello, Roger L. wrote:
>> Hello DFDL Community,
>>
>> Within my input is this:
>>
>> - a series of bytes
>> - then the string: "This program cannot be run in DOS mode."
>> - then another series of bytes until arriving at this string: "PE"
>>
>> I figured that for the first series of bytes I would use xs:hexBinary whose 
>> length ends when getting to "T" (hex 54)
>>
>> <xs:element   name="Instructions_in_hex"
>>               type="xs:hexBinary"
>>               dfdl:lengthKind="pattern"
>>               dfdl:lengthPattern="[\x30-\x39\x41-\x46\x61-\x66]+?(?=\x54)" />
>>
>> The next item is a string of length 39
>>
>> <xs:element   name="Message"
>>               type="xs:string"
>>               dfdl:lengthUnits="characters"
>>               dfdl:lengthKind="explicit"
>>               dfdl:length="39" />
>>
>> The last item is a series of hex digits whose length ends when getting to 
>> "P"(hex 50)
>>
>> <xs:element   name="Instructions_in_hex"
>>               type="xs:hexBinary"
>>               dfdl:lengthKind="pattern"
>>               dfdl:lengthPattern="[\x30-\x39\x41-\x46\x61-\x66]+?(?=\x50)" />
>>
>> At the bottom of this message is the complete set of declarations.
>>
>> Unfortunately, it doesn't work. The first <Instructions_in_hex> picks up 
>> nothing. Then the <Message> element erroneously picks up a bunch of hex 
>> digits and the first part of the string "This program cannot be run in DOS 
>> mode.". Then it crashes.
>>
>> What am I doing wrong, please?  /Roger
>>
>> <xs:element name="DOS_Stub">
>>     <xs:complexType>
>>         <xs:sequence>
>>             <xs:element       name="Instructions_in_hex"
>>                       type="xs:hexBinary"
>>                       dfdl:lengthKind="pattern"
>>                       dfdl:lengthPattern="[\x30-\x39\x41-\x46\x61-\x66]+?(?=\x54)" />
>>             <xs:element       name="Message"
>>                       type="xs:string"
>>                       dfdl:lengthUnits="characters"
>>                       dfdl:lengthKind="explicit"
>>                       dfdl:length="39" />
>>             <xs:element       name="Instructions_in_hex"
>>                       type="xs:hexBinary"
>>                       dfdl:lengthKind="pattern"
>>                       dfdl:lengthPattern="[\x30-\x39\x41-\x46\x61-\x66]+?(?=\x50)" />
>>         </xs:sequence>
>>     </xs:complexType>
>> </xs:element>
>>
>
