Looks like the issue in this case was the use of the dot character; my suggested regex wasn't completely correct. By default, the dot in a regular expression matches every character EXCEPT line terminators. Some of the bytes in the second Instructions area happen to be 0x0D, which is a carriage return and which the dot does not match. So the regular expression I provided didn't actually match all bytes like I suggested it did.
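The dot behavior described above can be checked outside of Daffodil. The following is only a minimal sketch, assuming the length pattern follows java.util.regex semantics, that the data is decoded as ISO-8859-1, and that the pattern is applied at the start of the data. The class name, the method name and the sample bytes are made up for illustration; they are not taken from the real file.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotVersusByteClass {
    // Length matched at the start of the data, or 0 if the pattern does not match there.
    static int matchedLength(String regex, byte[] data) {
        // ISO-8859-1 maps every byte value to exactly one character.
        String text = new String(data, StandardCharsets.ISO_8859_1);
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.lookingAt() ? m.end() : 0;
    }

    public static void main(String[] args) {
        // Hypothetical sample: four binary bytes (one of them 0x0D) followed by "PE".
        byte[] data = {(byte) 0xB4, 0x0D, (byte) 0xCD, 0x21, 'P', 'E'};

        // The dot stops at the carriage return, so the lookahead never sees "PE".
        System.out.println(matchedLength(".+(?=PE)", data));

        // A class covering every byte value also consumes the 0x0D.
        System.out.println(matchedLength("[\\x00-\\xFF]+?(?=PE)", data));
    }
}
```

With this sample the first pattern reports a length of 0 and the second a length of 4, which is consistent with the empty second Instructions element in the output quoted below.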
So you can replace the dot with a character class that matches all bytes
(i.e. [\x00-\xFF]) to ensure those newline characters are matched. Also,
the + needs to be made non-greedy by appending a question mark. Try
changing the regular expressions to the following:

  [\x00-\xFF]+?(?=This program cannot be run in DOS mode\.)

  [\x00-\xFF]+?(?=PE)

- Steve

On 11/19/18 5:17 PM, Costello, Roger L. wrote:
> Thank you Steve and Mike!
>
> I have made progress, using your suggestions. I have almost got it.
>
> My input contains:
>
>   * A bunch of binary
>   * Then the string: This program cannot be run in DOS mode.
>   * Then a bunch more binary
>   * And then the string: PE
>
> Here’s the DFDL code:
>
> <xs:element name="DOS_Stub">
>     <xs:complexType>
>         <xs:sequence>
>             <xs:element name="Instructions"
>                 type="xs:hexBinary"
>                 dfdl:lengthKind="pattern"
>                 dfdl:lengthPattern=".+(?=This program cannot be run in DOS mode\.)"/>
>             <xs:element name="Message"
>                 type="xs:string"
>                 dfdl:lengthUnits="characters"
>                 dfdl:lengthKind="explicit"
>                 dfdl:length="39"
>                 dfdl:representation="text"
>                 dfdl:encoding="ISO-8859-1"/>
>             <xs:element name="Instructions"
>                 type="xs:hexBinary"
>                 dfdl:lengthKind="pattern"
>                 dfdl:lengthUnits="bytes"
>                 dfdl:representation="binary"
>                 dfdl:lengthPattern=".+(?=PE)"/>
>         </xs:sequence>
>     </xs:complexType>
> </xs:element>
>
> Parsing successfully gobbles up the first group of binary, then the first
> string, but fails to gobble up the second group of binary:
>
> <DOS_Stub>
>     <Instructions>0E1FBA0E00B409CD21B8014CCD21</Instructions>
>     <Message>This program cannot be run in DOS mode.</Message>
>     <Instructions></Instructions>
> </DOS_Stub>
>
> Why is the second group of binary not being picked up?
>
> /Roger
>
> *From:* Mike Beckerle <[email protected]>
> *Sent:* Monday, November 19, 2018 2:00 PM
> *To:* [email protected]; Costello, Roger L. <[email protected]>
> *Subject:* Re: Question about gobbling up hex digits until arriving at a string
>
> Also,
>
> Set dfdl:encoding to 'iso-8859-1'.
>
> If you are using ASCII, then as soon as a byte with the 8th bit set is
> encountered, you won't get what you think.
>
> Encoding 'iso-8859-1' is the magic "bytes" encoding where every byte is one
> character no matter the byte value.
>
> ASCII, surprising to some people, is not at all like this.
>
> ASCII is 7-bit, and if a byte has the 8th bit set, it will cause a decode
> error, and you will instead get a Unicode-replacement-character created for
> that byte.
>
> This replacement character usually looks like a stylized question mark (if you
> have a unicode font). But that won't match your regex because the code-point
> for the Unicode replacement character is U+FFFD. The ranges in your regex won't
> accept these.
>
> ...mike beckerle
>
> --------------------------------------------------------------------------------
>
> *From:* Steve Lawrence <[email protected]>
> *Sent:* Monday, November 19, 2018 1:47:57 PM
> *To:* [email protected]; Roger Costello
> *Subject:* Re: Question about gobbling up hex digits until arriving at a string
>
> On second look, I think the issue is more clear. The regex you have is:
>
>   [\x30-\x39\x41-\x46\x61-\x66]+?(?=\x54)
>
> Those hex values are all ASCII characters, and could be rewritten like so:
>
>   [0-9A-Fa-f]+?(?=T)
>
> So your regex actually will only match data that contains those ASCII
> characters followed by the letter T. But I suspect your data isn't
> ASCII, it's actual binary data that could be anything. Since your data
> doesn't contain those ASCII characters, your pattern will fail to match
> and the matched length is considered zero. It then decodes 39 bytes of
> data, with the initial bytes being binary data followed by the beginning
> of the ASCII string.
>
> So the schema needs to be modified to either use a different regex or
> use some other method to determine where the data ends and the message
> begins. To me, it seems odd to have a binary format where the length of
> binary data is just some amount until it finds the letter 'T', so I
> would think a better description would exist. That said, such a regex
> would look like this:
>
>   [^T]+
>
> - Steve
>
>
> On 11/19/18 12:50 PM, Steve Lawrence wrote:
> > Roger,
> >
> > I am unable to reproduce this issue. I've created a TDML file at the
> > below link, which defines a schema and a test case with sample input
> > data and expected infoset, based on your description.
> >
> > https://gist.github.com/stevedlawrence/c4051386c4ed58279dbcae1e75d08218
> >
> > This can be tested with:
> >
> >   daffodil test -i hexPattern.tml
> >
> > And I get the output:
> >
> >   [Fail] hexPattern
> >   Failure Information:
> >   Left over data. Consumed 408 bit(s) with 16 bit(s) remaining.
> >
> >   Total: 1, Pass: 0, Fail: 1, Not Found: 0
> >
> > So it fails, but it fails because the schema does not consume the
> > trailing PE, so that's expected. The actual infoset does match the
> > expected infoset.
> >
> > Maybe your input data is different or there is some other property you
> > have defined in dfdl:format that is changing the behavior?
> >
> > Thanks,
> > - Steve
> >
> > On 11/17/18 10:54 AM, Costello, Roger L. wrote:
> >> Hello DFDL Community,
> >>
> >> Within my input is this:
> >>
> >> - a series of bytes
> >> - then the string: "This program cannot be run in DOS mode."
> >> - then another series of bytes until arriving at this string: "PE"
> >>
> >> I figured that for the first series of bytes I would use xs:hexBinary whose
> >> length ends when getting to "T" (hex 54)
> >>
> >> <xs:element name="Instructions_in_hex"
> >>     type="xs:hexBinary"
> >>     dfdl:lengthKind="pattern"
> >>     dfdl:lengthPattern="[\x30-\x39\x41-\x46\x61-\x66]+?(?=\x54)" />
> >>
> >> The next item is a string of length 39
> >>
> >> <xs:element name="Message"
> >>     type="xs:string"
> >>     dfdl:lengthUnits="characters"
> >>     dfdl:lengthKind="explicit"
> >>     dfdl:length="39" />
> >>
> >> The last item is a series of hex digits whose length ends when getting to
> >> "P" (hex 50)
> >>
> >> <xs:element name="Instructions_in_hex"
> >>     type="xs:hexBinary"
> >>     dfdl:lengthKind="pattern"
> >>     dfdl:lengthPattern="[\x30-\x39\x41-\x46\x61-\x66]+?(?=\x50)" />
> >>
> >> At the bottom of this message is the complete set of declarations.
> >>
> >> Unfortunately, it doesn't work. The first <Instructions_in_hex> picks up
> >> nothing. Then the <Message> element erroneously picks up a bunch of hex
> >> digits and the first part of the string "This program cannot be run in DOS
> >> mode.". Then it crashes.
> >>
> >> What am I doing wrong, please?
> >>
> >> /Roger
> >>
> >> <xs:element name="DOS_Stub">
> >>     <xs:complexType>
> >>         <xs:sequence>
> >>             <xs:element name="Instructions_in_hex"
> >>                 type="xs:hexBinary"
> >>                 dfdl:lengthKind="pattern"
> >>                 dfdl:lengthPattern="[\x30-\x39\x41-\x46\x61-\x66]+?(?=\x54)" />
> >>             <xs:element name="Message"
> >>                 type="xs:string"
> >>                 dfdl:lengthUnits="characters"
> >>                 dfdl:lengthKind="explicit"
> >>                 dfdl:length="39" />
> >>             <xs:element name="Instructions_in_hex"
> >>                 type="xs:hexBinary"
> >>                 dfdl:lengthKind="pattern"
> >>                 dfdl:lengthPattern="[\x30-\x39\x41-\x46\x61-\x66]+?(?=\x50)" />
> >>         </xs:sequence>
> >>     </xs:complexType>
> >> </xs:element>
