A good example of this, and of what I'm suggesting, is in
SpecifiedLengthParserBase. This parser is created when there is a
dfdl:lengthKind="explicit" and a dfdl:length property. The way this
parser works is it first figures out what the specified length is:

  val maybeNBits = getBitLength(pState)

In this case, this is returning a value greater than 256MB. We then
check to see if there are actually this many bits available in the input
stream:

  if (!dis.isDefinedForLength(nBits)) {
    PENotEnoughBits(pState, nBits, dis.remainingBits)
    return
  }

This isDefinedForLength function is what reads nBits from the underlying
input stream and caches them. The problem here is that if nBits > 256MB,
then this function also starts throwing away bytes to keep the cache
small. So even though we know the stream is defined for that length, we
won't actually have the first chunk of bytes available.
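
To make that concrete, here's a scaled-down, self-contained sketch of
the eviction behavior (these are stand-in classes, not Daffodil's real
bucketing code, and the cap is shrunk from 256MB to 256KB so it runs
quickly). The point is that "that many bytes exist" and "the first
bytes are still cached" are two different guarantees, and the second is
silently lost once the length exceeds the cache cap:

  object EvictionSketch {
    final case class Bucket(startPos: Long, bytes: Array[Byte])

    class BucketCache(source: java.io.InputStream, maxCacheBytes: Long,
        bucketSize: Int = 8192) {
      private var buckets = Vector.empty[Bucket]
      private var totalRead = 0L

      // Read until n bytes have been pulled from the source, evicting the
      // oldest buckets whenever the cache grows past maxCacheBytes.
      // Returns true if the source really does contain at least n bytes.
      def isDefinedForLength(n: Long): Boolean = {
        while (totalRead < n) {
          val chunk = source.readNBytes(bucketSize)
          if (chunk.isEmpty) return false
          buckets = buckets :+ Bucket(totalRead, chunk)
          totalRead += chunk.length
          while (buckets.iterator.map(_.bytes.length.toLong).sum > maxCacheBytes)
            buckets = buckets.tail // throw away the oldest data
        }
        true
      }

      // The earliest stream position whose bytes are still cached.
      def earliestCachedPos: Long =
        buckets.headOption.map(_.startPos).getOrElse(totalRead)
    }

    def main(args: Array[String]): Unit = {
      val oneMB = new java.io.ByteArrayInputStream(new Array[Byte](1024 * 1024))
      val cache = new BucketCache(oneMB, maxCacheBytes = 256L * 1024)
      println(cache.isDefinedForLength(1024 * 1024)) // true: the bytes exist
      println(cache.earliestCachedPos)               // > 0: position 0 is gone
    }
  }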

After that, we try to actually parse a simple/complex type with the
following:

  val isLimitOk: Boolean = dis.withBitLengthLimit(nBits) {
    eParser.parse1(pState)
  }

So we set a bit limit to the number of bits, and then parse whatever
we're parsing (e.g. a complex/simple type). The bit limit prevents that
child parser from ever parsing beyond nBits, and because
isDefinedForLength succeeded, these parsers can assume that they will
have the appropriate bits available.
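
For reference, this is roughly the shape of what withBitLengthLimit
does. This is only an illustrative stand-in (the real DataInputStream
also deals with marks, nested limits, etc.), but it shows why the call
returns a Boolean and why the child parser physically can't read past
nBits:

  class LimitedBits(var bitPos: Long, var bitLimit: Option[Long]) {
    // Narrow the limit to bitPos + nBits for the duration of body, then
    // restore the previous limit. Returns false (without running body) if
    // the requested limit would extend past an enclosing limit.
    def withBitLengthLimit(nBits: Long)(body: => Unit): Boolean = {
      val newLimit = bitPos + nBits
      if (bitLimit.exists(newLimit > _)) return false
      val savedLimit = bitLimit
      bitLimit = Some(newLimit)
      try { body; true } finally { bitLimit = savedLimit }
    }

    // How many bits the current limit still allows us to consume.
    def remainingBits: Option[Long] = bitLimit.map(_ - bitPos)
  }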

The second idea I mentioned was to essentially get rid of the
isDefinedForLength check, but still keep the withBitLengthLimit call.
This means the child eParser.parse1() call can no longer assume that
bits are available, so it would have to handle something like an
EndOfStream error and create the same error as if isDefinedForLength had
failed. This adds complications because the eParser can be many
different parsers: a simple type parser, a complex type parser, a
sequence-of-many-types parser, etc.
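
Here's a rough, self-contained sketch of what that implies. All of
these names (NotEnoughDataException, peNotEnoughBits, the tiny PState)
are stand-ins rather than the real Daffodil types; the point is just
that the framing parser, or every element parser, has to catch an
end-of-data condition somewhere and convert it into the same diagnostic
the up-front check gives us today:

  object OptionTwoSketch {
    final class NotEnoughDataException(msg: String) extends Exception(msg)
    final case class PState(var failure: Option[String] = None)

    // Stand-in for PENotEnoughBits: record a parse error on the state.
    def peNotEnoughBits(pState: PState, neededBits: Long): Unit =
      pState.failure = Some(s"Insufficient bits in data: needed $neededBits")

    // No isDefinedForLength up front: run the child parse (inside the bit
    // limit, elided here) and translate a mid-parse end-of-data failure
    // into the same "not enough bits" error.
    def parseWithLimitOnly(pState: PState, nBits: Long)(childParse: PState => Unit): Unit =
      try childParse(pState)
      catch {
        case _: NotEnoughDataException => peNotEnoughBits(pState, nBits)
      }

    def main(args: Array[String]): Unit = {
      val pState = PState()
      parseWithLimitOnly(pState, nBits = 2048L) { _ =>
        throw new NotEnoughDataException("hit end of data mid-element")
      }
      println(pState.failure)
    }
  }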

I think this would be a pretty large change to the DataInputStream API.
Right now, because we always check isDefinedForLength, we can always
assume calls like getSignedLong will succeed. If we remove the
isDefinedForLength check, calls to DataInputStream functions can no
longer make assumptions about having enough data, and must handle
EndOfStream.
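
Purely as an illustration of the API impact (these signatures are made
up, not the real trait, but they show the shape of the change):

  trait DataInputStreamToday {
    // Implicit precondition: isDefinedForLength(bitLength) has already
    // returned true, so callers may assume this succeeds.
    def getSignedLong(bitLength: Int): Long
  }

  trait DataInputStreamWithoutPrecheck {
    // No precondition: None means the data ended before bitLength more
    // bits were available.
    def getSignedLong(bitLength: Int): Option[Long]
  }

Every call site of every such function grows a branch (or an exception
handler) for the "not enough data" case, which is what makes this feel
like a large change.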

It also changes how the bucketing/caching works in the
BucketingInputSourceDataInputStream, since isDefinedForLength is what
fills the buckets. Instead, that filling would need to happen on reads
from the underlying stream. Not a big change, but a change nonetheless.
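
A sketch of what fill-on-read could look like, again with made-up names
and byte (rather than bit) granularity:

  class LazyFillCache(source: java.io.InputStream, bucketSize: Int = 8192) {
    private val buckets = scala.collection.mutable.ArrayBuffer.empty[Array[Byte]]
    private var filled = 0L

    // Ensure bytes [0, upTo) are cached, pulling one bucket at a time
    // from the underlying stream only when a read actually needs them.
    private def fillTo(upTo: Long): Boolean = {
      while (filled < upTo) {
        val chunk = source.readNBytes(bucketSize)
        if (chunk.isEmpty) return false
        buckets += chunk
        filled += chunk.length
      }
      true
    }

    // Read one byte at an absolute position, filling buckets on demand.
    def byteAt(pos: Long): Option[Byte] =
      if (!fillTo(pos + 1)) None
      else Some(buckets((pos / bucketSize).toInt)((pos % bucketSize).toInt))
  }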

It doesn't seem unreasonable to only call isDefinedForLength for
smaller lengths so we can still quickly fill the cache, but the
DataInputStream and callers of its functions would still need to be
changed, because they can no longer assume that it has been called.
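
In SpecifiedLengthParserBase terms, that hybrid would look something
like the following untested sketch, where maxEagerCheckBits is a
made-up name for whatever constant or tunable corresponds to the 256MB
cache cap:

  // Only do the eager existence check when the whole length fits in the
  // cache; for larger lengths rely on the bit limit alone and let the
  // child parsers report end-of-data if it arrives early.
  val maxEagerCheckBits = 256L * 1024 * 1024 * 8

  if (nBits <= maxEagerCheckBits && !dis.isDefinedForLength(nBits)) {
    PENotEnoughBits(pState, nBits, dis.remainingBits)
    return
  }

  val isLimitOk: Boolean = dis.withBitLengthLimit(nBits) {
    eParser.parse1(pState)
  }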



On 9/11/20 2:40 PM, Interrante, John A (GE Research, US) wrote:
> Steve,
> 
> Can you elaborate on what you mean by setting the bit limit?  The DFDL 
> specification has too many occurrences of "length" and "limit" for me to 
> easily grasp what you propose to do, although setting a limit of how many 
> bits to read for the current complex element and failing only once we find 
> out there aren't enough bits OR we hit that limit makes sense to me.
> 
> I searched the Daffodil codebase and found that both "bitLimit0b" and 
> "bitLimit1b" are used in the codebase.  Seems like you plan to tie into that 
> somehow instead of trying to read that many bits into your cache buckets 
> simply to confirm as early as possible that the given number of bits exists.  
> That sounds reasonable for explicit lengths that are larger than what you 
> want to cache ahead of time (256MB).
> 
> I could envision you setting both a bit limit and reading bytes into the 
> cache buckets when the explicit length is smaller than 256MB, but skipping 
> the read-ahead check whenever the explicit length is greater than 256MB and 
> relying on the bit limit to produce the right exception later.
> 
> John
> 
> -----Original Message-----
> From: Steve Lawrence <slawre...@apache.org> 
> Sent: Friday, September 11, 2020 1:14 PM
> To: dev@daffodil.apache.org
> Subject: EXT: Large dfdl:length values on complex types
> 
> I recently came across an issue where we have something like this:
> 
>   <xs:element name="length" type="xs:int" ... />
>   <xs:element name="data"
>     dfdl:lengthKind="explicit" dfdl:length="{ ../length }">
>     <xs:complexType>
>       <xs:sequence>
>         <xs:element name="field1" ... />
>         <xs:element name="field2" ... />
>         ...
>         <xs:element name="fieldN" ... />
>       </xs:sequence>
>     </xs:complexType>
>   </xs:element>
> 
> So we have a length element and a complex data field that uses this length, 
> and the data field is made up of a bunch of fields.
> 
> The issue I come across is related to how we cache bytes in buckets for 
> backtracking. As we fill up buckets, we currently limit the total cache 
> size of the buckets to 256MB. So if someone ever parses more than 256MB 
> of data and then tries to backtrack past that, we error. The idea being that 
> we don't want to keep an infinite cache for potential backtracking and people 
> should have realized that they went down the wrong branch much earlier.
> 
> Though, a problem occurs with the complex types with a large specified length 
> like above. When we have the complex type with expression ../length, before 
> trying to parse any of the fields, we read that length number of bytes into 
> our cache buckets to confirm that that number of bytes exists. The problem 
> occurs if length is more than 256MB. In this case, we read length number of 
> bytes, and start removing elements from the cache once we read more than 
> 256MB.
> 
> But once that succeeds and we read length bytes, we then try to start parsing 
> the fields within the complex type, but we've removed those early cached 
> bytes, and so we fail with an unhelpful backtracking exception.
> 
> I'm not sure of the right solution here.
> 
> Perhaps we shouldn't be throwing away these bytes when dealing with complex 
> lengths?
> 
> Or perhaps we shouldn't even be trying to determine if that many bytes are 
> available when we have a specified length. Instead, maybe we should just set 
> the bit limit to make sure we don't parse more than that?
> And if eventually something tries to read a byte and there aren't enough and 
> we hit that limit, only then do we fail. This feels like the right solution, 
> but wanted to start a discussion to see if maybe there's a reason we try to 
> read the full length, or maybe there's another alternative?
> 
