Steve,

Can you elaborate on what you mean by setting the bit limit?  The DFDL 
specification has too many occurrences of "length" and "limit" for me to easily 
grasp exactly what you propose to do, although the general idea makes sense to 
me: set a limit on how many bits may be read for the current complex element, 
and fail only once we find out there aren't enough bits OR we hit that limit.

I searched the Daffodil codebase and found that both "bitLimit0b" and 
"bitLimit1b" are already in use.  It seems like you plan to tie into that 
mechanism instead of reading that many bits into your cache buckets simply to 
confirm as early as possible that the given number of bits exists.  That sounds 
reasonable for explicit lengths that are larger than what you want to cache 
ahead of time (256MB).

I could envision you setting both a bit limit and reading bytes into the cache 
buckets when the explicit length is smaller than 256MB, but skipping the 
read-ahead check whenever the explicit length is greater than 256MB and relying 
on the bit limit to produce the right exception later.
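
To check that I follow, here is a rough sketch of the decision logic I have in 
mind.  Every name in it (ParseState, setBitLimit1b, tryFillCacheAhead, 
MaxCacheBytes) is a placeholder I made up for illustration, not the actual 
Daffodil API:

  // A rough sketch of the hybrid behavior described above; placeholder
  // names only, not the real Daffodil code.
  trait ParseState {
    def bitPos1b: Long // 1-based position of the next bit to read
    def setBitLimit1b(lastAllowedBit1b: Long): Unit
    def tryFillCacheAhead(nBytes: Long): Boolean // false if the data ends first
  }

  object ComplexLengthSketch {
    val MaxCacheBytes: Long = 256L * 1024 * 1024 // 256MB bucket cache cap

    def establishComplexLength(state: ParseState, lengthInBits: Long): Unit = {
      // Always bound how far the children of this complex element may parse;
      // the limit is the last 1-based bit position they may read.
      state.setBitLimit1b(state.bitPos1b + lengthInBits - 1)

      val lengthInBytes = (lengthInBits + 7) / 8
      if (lengthInBytes <= MaxCacheBytes) {
        // Small enough to cache: eagerly confirm that the bytes exist.
        if (!state.tryFillCacheAhead(lengthInBytes))
          throw new Exception(
            s"not enough data for declared length of $lengthInBytes bytes")
      }
      // Larger than the cache cap: skip the read-ahead and rely on a later
      // read hitting the bit limit to produce the right error.
    }
  }

If I've got that right, the eager check for small lengths is just an 
optimization that reports the error earlier.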

John

-----Original Message-----
From: Steve Lawrence <slawre...@apache.org> 
Sent: Friday, September 11, 2020 1:14 PM
To: dev@daffodil.apache.org
Subject: EXT: Large dfdl:length values on complex types

I recently came across an issue where we have something like this:

  <xs:element name="length" type="xs:int" ... />
  <xs:element name="data"
    dfdl:lengthKind="explicit" dfdl:length="{ ../length }">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="field1" ... />
        <xs:element name="field2" ... />
        ...
        <xs:element name="fieldN" ... />
      </xs:sequence>
    </xs:complexType>
  </xs:element>

So we have a length element and a complex data field that uses this length, and 
the data field is made up of a bunch of fields.

The issue I came across is related to how we cache bytes in buckets for 
backtracking. As we fill up buckets, we currently limit the total cache size of 
the buckets to 256MB. So if someone ever parses more than 256MB of data and 
then tries to backtrack past that point, we error. The idea is that we don't 
want to keep an infinite cache for potential backtracking, and people should 
have realized much earlier that they went down the wrong branch.
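
For anyone not familiar with that code, the behavior is roughly like the 
following sketch. This is just an illustration of the idea with made-up names, 
not the real implementation:

  import scala.collection.mutable

  // Byte buckets are retained for backtracking, the total retained size is
  // capped (256MB in Daffodil), and rewinding to data that has already been
  // evicted is an error.
  class BucketCacheSketch(maxBytes: Long = 256L * 1024 * 1024) {
    private val buckets = mutable.ArrayDeque.empty[Array[Byte]]
    private var retainedBytes: Long = 0
    private var evictedBytes: Long = 0 // position of the oldest retained byte

    def append(bucket: Array[Byte]): Unit = {
      buckets.append(bucket)
      retainedBytes += bucket.length
      while (retainedBytes > maxBytes) {
        val dropped = buckets.removeHead() // oldest bucket is discarded
        retainedBytes -= dropped.length
        evictedBytes += dropped.length
      }
    }

    def backtrackTo(bytePos: Long): Unit = {
      if (bytePos < evictedBytes)
        throw new Exception(
          s"cannot backtrack to byte $bytePos: data before $evictedBytes is gone")
      // ... otherwise reposition within the retained buckets ...
    }
  }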

However, a problem occurs with complex types that have a large specified length 
like the one above. When we have the complex type with the expression 
../length, before trying to parse any of the fields, we read that many bytes 
into our cache buckets to confirm that the data actually contains them. The 
problem occurs if length is more than 256MB. In this case, we read length 
number of bytes and start evicting buckets from the cache once we've read more 
than 256MB.

Once that read-ahead succeeds and we've read length bytes, we go back and try 
to start parsing the fields within the complex type. But we've already evicted 
those early cached bytes, and so we fail with an unhelpful backtracking 
exception.

I'm not sure of the right solution here.

Perhaps we shouldn't be throwing away these bytes when dealing with specified 
lengths on complex types?

Or perhaps we shouldn't even be trying to determine whether that many bytes are 
available when we have a specified length. Instead, maybe we should just set 
the bit limit to make sure we don't parse more than that?
And if eventually something tries to read bytes, there aren't enough, and we 
hit that limit, only then do we fail. This feels like the right solution, but I 
wanted to start a discussion to see if maybe there's a reason we try to read 
the full length, or maybe there's another alternative?
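
In other words, something like the sketch below, where the failure happens 
lazily at the point of the read rather than up front. The names here (bitPos1b, 
bitLimit1b, bitsRemainingInData) are illustrative rather than the real API; the 
point is just where the error would be raised:

  // Lazy check: each primitive read checks the bit limit and the actual
  // data, and only fails at that moment.
  trait LazyBitReader {
    def bitPos1b: Long            // 1-based position of the next bit to read
    def bitLimit1b: Long          // last bit this element is allowed to read
    def bitsRemainingInData: Long // bits actually left in the input

    def readBits(nBits: Long): Unit = {
      // Fail only when a read would cross the limit or run out of data,
      // rather than checking the whole declared length up front.
      if (bitPos1b + nBits - 1 > bitLimit1b)
        throw new Exception("read would exceed the enclosing element's bit limit")
      if (nBits > bitsRemainingInData)
        throw new Exception("declared length exceeds the data actually available")
      // ... otherwise consume nBits as usual ...
    }
  }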
