Thank you Steve!
A very odd thing happened after I made those changes.
Recall that the second group of binary is this: all the binary until "PE" is
encountered.
The PE data is actually 4 bytes (PE\0\0). Looking at the PE data in a hex
editor I see this:
50 45 00 00
So, after outputting the second group of binary, I output the PE data:
<xs:element name="PE_Header">
  <xs:complexType>
    <xs:sequence>
      <xs:sequence dfdl:hiddenGroupRef="hidden_signature_Group" />
      <xs:element name='Signature' type='xs:string' dfdl:inputValueCalc='{
          if (xs:string(../Hidden_signature) eq "50450000") then "PE\0\0"
          else fn:error("signature PE\0\0 not present")
        }'>
      </xs:element>
    </xs:sequence>
  </xs:complexType>
</xs:element>
Here's the odd part: Somehow, the PE data got reversed! For my inputValueCalc
to work, I needed to change the if-statement to this:
if (xs:string(../Hidden_signature) eq "00004550") then "PE\0\0"
Notice that I flipped 50450000 to 00004550.
It seems that picking up the second group of binary has had the side effect of
flipping the PE data.
Note: in my DFDL schema I have this setting: byteOrder="littleEndian" (I think
that is somehow related to what's happening).
Can you explain what's happening Steve, please? Why is the PE data flipping?
/Roger
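
To make the byte-order suspicion above concrete, here is a small Python sketch. It assumes the hidden signature is read as a 4-byte integer, which the thread does not show (the hidden group's definition is not included), so treat it as one possible explanation rather than a diagnosis. It only demonstrates that the same four bytes print as different hex strings depending on the byte order used to interpret them.

```python
import struct

# The four signature bytes as seen in the hex editor: 50 45 00 00
signature = bytes([0x50, 0x45, 0x00, 0x00])

# Interpret the same bytes as an unsigned 32-bit integer in each byte order.
# ASSUMPTION: an integer interpretation is what happens in the schema;
# the thread does not confirm this.
big = struct.unpack(">I", signature)[0]
little = struct.unpack("<I", signature)[0]

big_hex = format(big, "08X")
little_hex = format(little, "08X")

print(big_hex)     # 50450000
print(little_hex)  # 00004550
```

If that assumption holds, the reversal would come from how the four bytes are interpreted, not from the pattern change to the preceding element.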
-----Original Message-----
From: Steve Lawrence <[email protected]>
Sent: Tuesday, November 20, 2018 8:23 AM
To: [email protected]; Costello, Roger L. <[email protected]>
Subject: Re: Question about gobbling up hex digits until arriving at a string
Looks like the issue in this case was the use of the dot character--my
suggested regex wasn't completely correct. By default the dot character in
regular expressions matches all characters EXCEPT for new line characters. And
some of the bytes in the second Instruction area happen to be 0x0D, which is a
carriage return and which dot does not match. So the regular expression I
provided didn't actually match all bytes like I suggested it did.
So you can replace the dot with a character class that matches all bytes (i.e.
[\x00-\xFF]) to ensure those newline characters are matched. Also, the + needs
to be made non-greedy by appending a question mark. Try changing the regular
expressions to the following:
[\x00-\xFF]+?(?=This program cannot be run in DOS mode\.)
[\x00-\xFF]+?(?=PE)
- Steve
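
The difference between the dot and an explicit byte-range class can be seen in a few lines of Python. This is only a sketch: the sample bytes are made up, and Python's `re` stands in for the regex engine Daffodil uses, which is an assumption. Python's dot skips only 0x0A (line feed), so the sample uses 0x0A where the data in this thread has 0x0D.

```python
import re

# Made-up sample: binary bytes (including a 0x0A line feed), then PE\0\0.
data = bytes([0x0E, 0x1F, 0x0A, 0xBA, 0x50, 0x45, 0x00, 0x00])

# Decode as ISO-8859-1 so that every byte becomes exactly one character.
text = data.decode("iso-8859-1")

# The dot cannot cross the line feed, so no match reaches the lookahead.
dot = re.match(r".+?(?=PE)", text)

# The explicit class matches every byte value, including the line feed.
any_byte = re.match(r"[\x00-\xFF]+?(?=PE)", text)

matched = any_byte.group(0).encode("iso-8859-1")
print(dot)            # None
print(matched.hex())  # 0e1f0aba
```

The non-greedy `+?` stops at the first position where the lookahead succeeds, so the match ends just before the first "PE".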
On 11/19/18 5:17 PM, Costello, Roger L. wrote:
> Thank you Steve and Mike!
>
> I have made progress, using your suggestions. I have almost got it.
>
> My input contains:
>
> * A bunch of binary
> * Then the string: This program cannot be run in DOS mode.
> * Then a bunch more binary
> * And then the string: PE
>
> Here's the DFDL code:
>
> <xs:element name="DOS_Stub">
>   <xs:complexType>
>     <xs:sequence>
>       <xs:element name="Instructions"
>                   type="xs:hexBinary"
>                   dfdl:lengthKind="pattern"
>                   dfdl:lengthPattern=".+(?=This program cannot be run in DOS mode\.)"/>
>       <xs:element name="Message"
>                   type="xs:string"
>                   dfdl:lengthUnits="characters"
>                   dfdl:lengthKind="explicit"
>                   dfdl:length="39"
>                   dfdl:representation="text"
>                   dfdl:encoding="ISO-8859-1"/>
>       <xs:element name="Instructions"
>                   type="xs:hexBinary"
>                   dfdl:lengthKind="pattern"
>                   dfdl:lengthUnits="bytes"
>                   dfdl:representation="binary"
>                   dfdl:lengthPattern=".+(?=PE)"/>
>     </xs:sequence>
>   </xs:complexType>
> </xs:element>
>
> Parsing successfully gobbles up the first group of binary, then the
> first string, but fails to gobble up the second group of binary:
>
> <DOS_Stub>
>   <Instructions>0E1FBA0E00B409CD21B8014CCD21</Instructions>
>   <Message>This program cannot be run in DOS mode.</Message>
>   <Instructions></Instructions>
> </DOS_Stub>
>
> Why is the second group of binary not being picked up?
>
> /Roger
>
> *From:* Mike Beckerle <[email protected]>
> *Sent:* Monday, November 19, 2018 2:00 PM
> *To:* [email protected]; Costello, Roger L.
> <[email protected]>
> *Subject:* Re: Question about gobbling up hex digits until arriving at
> a string
>
> Also,
>
> Set dfdl:encoding to 'iso-8859-1'.
>
> If you are using ASCII, then as soon as a byte with the 8th bit set is
> encountered, you won't get what you think.
>
> Encoding 'iso-8859-1' is the magic "bytes" encoding where every byte
> is one character no matter the byte value.
>
> ASCII, surprising to some people, is not at all like this.
>
> ASCII is 7-bit, and if a byte has the 8th bit set, it will cause a
> decode error, and you will instead get a Unicode-replacement-character
> created for that byte.
>
> This replacement character usually looks like a stylized question
> mark (if you have a unicode font). But that won't match your regex
> because the code-point for the Unicode replacement character is
> U+FFFD. The ranges in your regex won't accept these.
>
> ...mike beckerle
>
> --------------------------------------------------------------------------------
>
> *From:* Steve Lawrence <[email protected]>
> *Sent:* Monday, November 19, 2018 1:47:57 PM
> *To:* [email protected]; Roger Costello
> *Subject:* Re: Question about gobbling up hex digits until arriving at
> a string
>
> On second look, I think the issue is more clear. The regex you have is:
>
> [\x30-\x39\x41-\x46\x61-\x66]+?(?=\x54)
>
> Those hex values are all ASCII characters, and could be rewritten like so:
>
> [0-9A-Fa-f]+?(?=T)
>
> So your regex actually will only match data that contains those ASCII
> characters followed by the letter T. But I suspect your data isn't
> ASCII, it's actual binary data that could be anything. Since your data
> doesn't contain those ASCII characters, your pattern will fail to
> match and the matched length is considered zero. It then decodes 39
> bytes of data, with the initial bytes being binary data followed by
> the beginning of the ASCII string.
>
> So the schema needs to be modified to either use a different regex or
> use some other method to determine where the data ends and the message
> begins. To me, it seems odd to have a binary format where the length
> of binary data is just some amount until it finds the letter 'T', so I
> would think a better description would exist. That said, such a regex
> would look like this:
>
> [^T]+
>
> - Steve
>
>
> On 11/19/18 12:50 PM, Steve Lawrence wrote:
> > Roger,
> >
> > I am unable to reproduce this issue. I've created a TDML file at the
> > below link, which defines a schema and a test case with sample input
> > data and expected infoset, based on your description.
> >
> >
> > https://gist.github.com/stevedlawrence/c4051386c4ed58279dbcae1e75d08218
> >
> > This can be tested with:
> >
> > daffodil test -i hexPattern.tml
> >
> > And I get the output:
> >
> > [Fail] hexPattern
> > Failure Information:
> > Left over data. Consumed 408 bit(s) with 16 bit(s) remaining.
> >
> > Total: 1, Pass: 0, Fail: 1, Not Found: 0
> >
> > So it fails, but it fails because the schema does not consume the
> > trailing PE, so that's expected. The actual infoset does match the
> > expected infoset.
> >
> > Maybe your input data is different or there is some other property you
> > have defined in dfdl:format that is changing the behavior?
> >
> > Thanks,
> > - Steve
> >
> > On 11/17/18 10:54 AM, Costello, Roger L. wrote:
> >> Hello DFDL Community,
> >>
> >> Within my input is this:
> >>
> >> - a series of bytes
> >> - then the string: "This program cannot be run in DOS mode."
> >> - then another series of bytes until arriving at this string: "PE"
> >>
> >> I figured that for the first series of bytes I would use xs:hexBinary
> >> whose length ends when getting to "T" (hex 54)
> >>
> >> <xs:element name="Instructions_in_hex"
> >>             type="xs:hexBinary"
> >>             dfdl:lengthKind="pattern"
> >>             dfdl:lengthPattern="[\x30-\x39\x41-\x46\x61-\x66]+?(?=\x54)" />
> >>
> >> The next item is a string of length 39
> >>
> >> <xs:element name="Message"
> >>             type="xs:string"
> >>             dfdl:lengthUnits="characters"
> >>             dfdl:lengthKind="explicit"
> >>             dfdl:length="39" />
> >>
> >> The last item is a series of hex digits whose length ends when getting
> >> to "P" (hex 50)
> >>
> >> <xs:element name="Instructions_in_hex"
> >>             type="xs:hexBinary"
> >>             dfdl:lengthKind="pattern"
> >>             dfdl:lengthPattern="[\x30-\x39\x41-\x46\x61-\x66]+?(?=\x50)" />
> >>
> >> At the bottom of this message is the complete set of declarations.
> >>
> >> Unfortunately, it doesn't work. The first <Instructions_in_hex> picks
> >> up nothing. Then the <Message> element erroneously picks up a bunch of
> >> hex digits and the first part of the string "This program cannot be
> >> run in DOS mode.". Then it crashes.
> >>
> >> What am I doing wrong, please?
> >>
> >> /Roger
> >>
> >> <xs:element name="DOS_Stub">
> >>   <xs:complexType>
> >>     <xs:sequence>
> >>       <xs:element name="Instructions_in_hex"
> >>                   type="xs:hexBinary"
> >>                   dfdl:lengthKind="pattern"
> >>                   dfdl:lengthPattern="[\x30-\x39\x41-\x46\x61-\x66]+?(?=\x54)" />
> >>       <xs:element name="Message"
> >>                   type="xs:string"
> >>                   dfdl:lengthUnits="characters"
> >>                   dfdl:lengthKind="explicit"
> >>                   dfdl:length="39" />
> >>       <xs:element name="Instructions_in_hex"
> >>                   type="xs:hexBinary"
> >>                   dfdl:lengthKind="pattern"
> >>                   dfdl:lengthPattern="[\x30-\x39\x41-\x46\x61-\x66]+?(?=\x50)" />
> >>     </xs:sequence>
> >>   </xs:complexType>
> >> </xs:element>
> >>
> >
>
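
Mike's point about ASCII versus ISO-8859-1, quoted above, can be checked with a short Python sketch. Python's codecs stand in for Daffodil's decoders here, which is an assumption, and the sample bytes are made up. It shows that ISO-8859-1 maps every byte to one character, while an ASCII decode turns each byte with the 8th bit set into U+FFFD, which a `[\x00-\xFF]` class no longer matches.

```python
import re

# Made-up sample: two 7-bit bytes, two bytes with the 8th bit set, then "!".
raw = bytes([0x0E, 0x1F, 0xBA, 0xCD, 0x21])

# ISO-8859-1: every byte becomes exactly one character, and it round-trips.
latin1 = raw.decode("iso-8859-1")

# ASCII with replacement: each byte above 0x7F becomes U+FFFD.
ascii_text = raw.decode("ascii", errors="replace")

byte_class = re.compile(r"[\x00-\xFF]+")
latin1_ok = byte_class.fullmatch(latin1) is not None
ascii_ok = byte_class.fullmatch(ascii_text) is not None

print(latin1_ok)                  # True
print(ascii_ok)                   # False
print(ascii_text.count("\ufffd")) # 2
```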