This was discovered with the current NITF schema, which can have complex
types covering very large chunks of data. So this is definitely a real
world use case.
Here's the spot in the schema where this occurs:

https://github.com/DFDLSchemas/NITF/blob/master/src/main/resources/com/tresys/nitf/xsd/nitf.dfdl.xsd#L627

In an example large NITF file, the DataLength field is around 570MB, so
this hits the issue of a complex type larger than 256MB. The temporary
workaround is to remove the dfdl:length for this complex type, but that
isn't necessarily correct and doesn't allow dealing with padding in the
data.

I don't think there is currently a tunable that limits the size of
complex types. Maybe there is one for simple types? I'm not sure. If
there is, it must currently be bigger than the 256MB bucket cache limit.
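
If we were to add one, I'd expect it to be settable like any other
tunable through the API. Purely as a sketch (the tunable name
maxLengthCheckBytes is made up, and I'm assuming the usual withTunable
call on the compiler in the Scala API):

  import org.apache.daffodil.sapi.Daffodil

  // Sketch only: "maxLengthCheckBytes" is a hypothetical tunable name,
  // not an existing Daffodil tunable. The idea is that a user could
  // raise the limit above the 256MB default when a format legitimately
  // has huge specified-length complex types.
  val compiler = Daffodil.compiler()
    .withTunable("maxLengthCheckBytes", (1024L * 1024 * 1024).toString)
  val pf = compiler.compileFile(new java.io.File("nitf.dfdl.xsd"))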

- Steve


On 9/11/20 1:24 PM, Beckerle, Mike wrote:
> Maybe a silly question, but why don't we just hit a tunable size limit 
> immediately before we "try to read" that data? 256MB is very big.
> 
> Is this a real format, or a test case designed to push the boundaries?
> 
> 
> ________________________________
> From: Steve Lawrence <slawre...@apache.org>
> Sent: Friday, September 11, 2020 1:14 PM
> To: dev@daffodil.apache.org <dev@daffodil.apache.org>
> Subject: Large dfdl:length values on complex types
> 
> I recently came across an issue where we have something like this:
> 
>   <xs:element name="length" type="xs:int" ... />
>   <xs:element name="data"
>     dfdl:lengthKind="explicit" dfdl:length="{ ../length }">
>     <xs:complexType>
>       <xs:sequence>
>         <xs:element name="field1" ... />
>         <xs:element name="field2" ... />
>         ...
>         <xs:element name="fieldN" ... />
>       </xs:sequence>
>     </xs:complexType>
>   </xs:element>
> 
> So we have a length element and a complex data field that uses this
> length, and the data field is made up of a bunch of fields.
> 
> The issue I came across is related to how we cache bytes in buckets for
> backtracking. As we fill up buckets, we currently limit the total cache
> size of the buckets to 256MB. So if someone ever parses more than 256MB
> of data and then tries to backtrack past that, we error. The idea is
> that we don't want to keep an unbounded cache for potential
> backtracking, and people should have realized that they went down the
> wrong branch much earlier.
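>
> To make the mechanism concrete, here's a minimal sketch of the idea
> (not our actual implementation, just an illustration of a bucketed
> cache that evicts the oldest buckets once a total size limit is
> exceeded):
>
>   import scala.collection.mutable
>
>   // Sketch only: a simplified bucketed byte cache, not Daffodil's
>   // real code. Each call to append adds one "bucket" of bytes; once
>   // the total cached size exceeds maxCacheSize, the oldest buckets
>   // are evicted.
>   class BucketCache(maxCacheSize: Long = 256L * 1024 * 1024) {
>     private val buckets = mutable.Queue[Array[Byte]]()
>     private var firstCachedPos = 0L // position of the oldest cached byte
>     private var endPos = 0L         // position just past the newest byte
>
>     def append(bucket: Array[Byte]): Unit = {
>       buckets.enqueue(bucket)
>       endPos += bucket.length
>       while (endPos - firstCachedPos > maxCacheSize) {
>         firstCachedPos += buckets.dequeue().length
>       }
>     }
>
>     // Backtracking to a position only works if it is still in the cache
>     def canBacktrackTo(pos: Long): Boolean =
>       pos >= firstCachedPos && pos <= endPos
>   }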
> 
> Though, a problem occurs with complex types that have a large specified
> length like the above. When we have a complex type whose length is the
> expression ../length, before trying to parse any of the fields we read
> that many bytes into our cache buckets to confirm that that number of
> bytes exists. The problem occurs if the length is more than 256MB. In
> that case, we read length bytes and start evicting buckets from the
> cache once we have read more than 256MB.
> 
> But once that succeeds and we have read length bytes, we then try to
> start parsing the fields within the complex type. By that point we have
> already evicted those early cached bytes, so we fail with an unhelpful
> backtracking exception.
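>
> Using the sketch above, the failure sequence is roughly the following
> (again just an illustration, with a tiny 16-byte limit standing in for
> the 256MB limit):
>
>   // The specified-length check reads past the cache limit, evicting
>   // the earliest buckets along the way.
>   val cache = new BucketCache(maxCacheSize = 16)
>   val complexStart = 0L // where the complex type's content begins
>   for (_ <- 1 to 3) cache.append(new Array[Byte](8))
>
>   // Now we go back to parse field1 at the start of the complex type...
>   assert(!cache.canBacktrackTo(complexStart)) // ...but those bytes are gone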
> 
> I'm not sure of the right solution here.
> 
> Perhaps we shouldn't be throwing away these bytes when dealing with
> specified lengths on complex types?
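>
> If we went that way, one shape it could take is letting the parser
> "pin" the start of a specified-length complex type so buckets at or
> after that position are never evicted, even if the cache then grows
> past the limit. A sketch of that variant (again, not real code):
>
>   import scala.collection.mutable
>
>   // Sketch only: like BucketCache above, but a pinned position blocks
>   // eviction of any bucket that contains it or follows it.
>   class PinnableBucketCache(maxCacheSize: Long) {
>     private val buckets = mutable.Queue[Array[Byte]]()
>     private var firstCachedPos = 0L
>     private var endPos = 0L
>     private var pinnedPos: Option[Long] = None
>
>     def pin(pos: Long): Unit = pinnedPos = Some(pos)
>     def unpin(): Unit = pinnedPos = None
>
>     def append(bucket: Array[Byte]): Unit = {
>       buckets.enqueue(bucket)
>       endPos += bucket.length
>       // Only evict the oldest bucket if no pinned position falls
>       // inside or before it.
>       while (endPos - firstCachedPos > maxCacheSize &&
>              pinnedPos.forall(_ >= firstCachedPos + buckets.head.length)) {
>         firstCachedPos += buckets.dequeue().length
>       }
>     }
>
>     def canBacktrackTo(pos: Long): Boolean =
>       pos >= firstCachedPos && pos <= endPos
>   }
>
> The downside is that memory use becomes unbounded again whenever a
> specified length is huge, which is exactly what the 256MB limit was
> meant to prevent.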
> 
> Or perhaps we shouldn't even be trying to determine whether that many
> bytes are available when we have a specified length. Instead, maybe we
> should just set the bit limit to make sure we don't parse more than
> that. Then, if something eventually tries to read past that limit and
> there aren't enough bytes, only then do we fail. This feels like the
> right solution, but I wanted to start a discussion to see if maybe
> there's a reason we try to read the full length, or maybe there's
> another alternative?
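>
> As a rough sketch of what I mean by that second option (not real code,
> just the shape of the lazy check):
>
>   // Sketch only: instead of eagerly reading `length` bytes up front,
>   // we just record a limit when entering the complex type and only
>   // fail when a read actually runs past it or past the end of the data.
>   class LimitedStream(data: java.io.InputStream) {
>     private var limit: Option[Long] = None // bytes here; ours is in bits
>     private var pos = 0L
>
>     // Called on entering the specified-length complex type: no I/O,
>     // just remember how far we are allowed to read.
>     def setLimit(limitFromHere: Long): Unit =
>       limit = Some(pos + limitFromHere)
>
>     def readByte(): Int = {
>       if (limit.exists(pos >= _))
>         throw new Exception(s"parsed past specified length at $pos")
>       val b = data.read()
>       if (b < 0)
>         throw new Exception(s"not enough data: ran out at $pos")
>       pos += 1
>       b
>     }
>   }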
> 
