Yep, we would still call withBitLengthLimit, which will restrict child elements to that length. Fortunately, withBitLengthLimit doesn't actually care whether we've read that amount of data yet.
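[Editor's note: to make that concrete, here is a minimal standalone sketch in Scala of the combinator behavior being described. It is not Daffodil source: StreamState, parseChildren, and the simplified withBitLengthLimit signature are illustrative stand-ins. The point is only that the combinator records a tighter bit limit for the children and skips the ElementUnused region afterward, without requiring the data to be buffered up front.]

    // Illustrative sketch only -- not Daffodil source. Simplified stand-ins for
    // the real parse state and combinator machinery discussed in this thread.
    object ExplicitLengthSketch {

      // Minimal stand-in for the data stream's position and current bit limit.
      final case class StreamState(bitPos: Long, bitLimit: Option[Long])

      // Run `body` under a tighter bit limit, then restore the enclosing limit.
      // Note: this only records the limit; it never checks that the underlying
      // bytes have actually been read or buffered yet.
      def withBitLengthLimit[A](s: StreamState, lengthInBits: Long)(
          body: StreamState => (StreamState, A)): (StreamState, A) = {
        val limited = s.copy(bitLimit = Some(s.bitPos + lengthInBits))
        val (after, result) = body(limited)
        (after.copy(bitLimit = s.bitLimit), result)
      }

      // Explicit-length complex type: no upfront "is all the data present?" check.
      // Parse the children under the limit, then skip ElementUnused by advancing
      // to the end position. A real implementation would only fail here if the
      // data actually runs out.
      def parseExplicitLengthComplex(
          s: StreamState,
          lengthInBits: Long,
          parseChildren: StreamState => (StreamState, Boolean)): (StreamState, Boolean) = {
        val endBitPos = s.bitPos + lengthInBits
        val (afterChildren, ok) = withBitLengthLimit(s, lengthInBits)(parseChildren)
        if (ok) (afterChildren.copy(bitPos = endBitPos), true)
        else (afterChildren, false)
      }
    }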
On 9/14/20 12:02 PM, Beckerle, Mike wrote:
> A further observation: we would need to tell the I/O layer about the bounded length of the complex element (which probably has to be in length units of bytes/bits/fixed-width characters, not characters of a variable-width encoding like UTF-8), for purposes of avoiding the children of the complex type just happening to turn out to be larger than the explicit length, and also to someday implement the lengthKind 'endOfParent' for the final child.
>
> ________________________________
> From: Beckerle, Mike <[email protected]>
> Sent: Monday, September 14, 2020 11:58 AM
> To: [email protected] <[email protected]>
> Subject: Re: Large dfdl:length values on complex types
>
> Ok, so assuming the input source is a pipe/stream kind of thing, what exactly are we hoping to achieve?
>
> The scenario is that a complex element with an explicit length, and that length is vast, is encountered.
>
> We want to NOT check in advance that all the bytes are present, and just go about parsing the children, and maybe we'll have enough, or maybe we will not have enough data, but we don't want to grab it all in advance.
>
> Seems like we just remove the upfront check. Proceed to parse the children.
>
> We still need to save the value of the explicit length for use at the end of the complex element in case there is a need for the ElementUnused region to be skipped.
>
> The parser behavior would change: we won't immediately backtrack on short data in this case, but that's probably ok.
>
> So it's a different backend combinator for explicit-length complex types, but it seems to me the issue is just that the check is too aggressive and not consistent with streaming behavior.
>
> Does that make sense?
>
> ________________________________
> From: Steve Lawrence <[email protected]>
> Sent: Monday, September 14, 2020 11:34 AM
> To: [email protected] <[email protected]>
> Subject: Re: Large dfdl:length values on complex types
>
> We could have a separate I/O path specific to files. Our current input API supports Array[Byte], ByteBuffer, and InputStream as ways to get data.
>
> For the first two, we create a "ByteBufferInputSource", which doesn't actually throw away data since it's all already in an array in memory. isDefinedForLength doesn't need to read anything to return an answer since it knows the full length of the data.
>
> For the InputStream, we create a "BucketingInputSource", which is what does this caching/discarding stuff and needs to read/cache the input stream, which is where we get into this cache issue.
>
> We could definitely create a new FileInputSource that could have specific logic for a java File/Path, and only use this when we have length information (which I assume we can't get when dealing with things like pipes). We would still have to fall back to the BucketingInputSource for those cases.
>
> But as Brandon points out, it doesn't solve the case when the input data is not from a file. And it also requires the user to use a new API function to create a File-specific InputSource.
>
> I've created DAFFODIL-2397 to track the API issue.
>
> On 9/14/20 11:19 AM, Sloane, Brandon wrote:
>> Does that solve the issue? Not all data sources are files. A file-centric path could be used as an optimization, but we also need to behave correctly on non-file data sources.
>>
>> Also, if we go this route, we need to be careful about what we consider to be a "file".
>> Anonymous and named pipes are commonly used types of files which do not provide a file-length.
>>
>> ________________________________
>> From: Beckerle, Mike <[email protected]>
>> Sent: Monday, September 14, 2020 10:40 AM
>> To: [email protected] <[email protected]>
>> Subject: Re: Large dfdl:length values on complex types
>>
>> Is it possible to create a separate I/O path that is file-centric, so that isDefinedForLength can answer without actually accessing any data at all?
>>
>> The notion that there is no tunable access suggests we do need an API change to a factory pattern, so that the I/O abstractions that need them have access to tunables. Efficient I/O is just one of those areas where tuning is generally needed.
>>
>> ...mikeb
>>
>> ________________________________
>> From: Steve Lawrence <[email protected]>
>> Sent: Monday, September 14, 2020 8:32:34 AM
>> To: [email protected] <[email protected]>
>> Subject: Re: Large dfdl:length values on complex types
>>
>>> The issue isn't so much backtracking, as fortracking. At the moment, we need to read in all of dfdl:length before we begin parsing the first time, so we get a problem even if we never attempt to backtrack.
>>
>> This is correct. I think this is the core issue. As best I can come up with, the right fix is the removal (or very selective use) of isDefinedForLength, along with parsers handling some new end-of-stream error. Not sure how big of a change that is or if maybe there's another approach.
>>
>> I've created DAFFODIL-2395 to track this issue. I think this is probably a blocker for the next release.
>>
>> Note that this isn't actually an issue for blobs. The BlobLengthParser reads blobs in chunks with a default size of 4KB, and it only calls isDefinedForLength for individual chunks, so it never comes close to the 256MB limit. So large blobs are never an issue. The only issue here is with complex or simple types with a specified length larger than 256MB. A simple type of 256MB seems unlikely, but it is very possible for complex types.
>>
>> Also note that the upcoming streaming changes won't help with this issue. That is really just about outputting the infoset while we parse rather than waiting until the end of parse, as well as the use of a SAX API. It won't affect how we handle the data at all.
>>
>> On 9/11/20 3:04 PM, Sloane, Brandon wrote:
>>> We should be careful about making the cache too big due to the memory overhead. I'm not sure where we are with support for streaming parses (beyond the --stream option in the CLI, which assumes multiple distinct documents that have been concatted), so in most cases, a multiple of this overhead would be needed by the infoset itself anyway.
>>>
>>> The reason I bring up blobs is that they bypass the requirement to have a large infoset. Imagine a 4GB document consisting of a length prefix, 3.99GB worth of binary blobs, and metadata. There is no good reason parsing this document should require anywhere near 4GB of memory overhead.
>>>
>>>> The DFDL spec allows implementations to have limits on backtracking capability, it just requires us to document them.
>>>
>>> The issue isn't so much backtracking, as fortracking. At the moment, we need to read in all of dfdl:length before we begin parsing the first time, so we get a problem even if we never attempt to backtrack.
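[Editor's note: a rough Scala sketch of the file-centric idea discussed above, using only hypothetical names (FileBackedLengthCheck, data.bin), not the actual Daffodil InputSource API. For a regular file, an isDefinedForLength-style question can be answered from file metadata without reading or caching anything; a pipe, which is also a "file", cannot report a usable size and would still need the bucketing/caching path.]

    // Illustrative sketch only -- FileBackedLengthCheck and data.bin are
    // hypothetical, not part of Daffodil's API.
    import java.nio.file.{ Files, Path, Paths }

    final class FileBackedLengthCheck(path: Path) {
      // Only a regular file has a meaningful size; a named pipe (FIFO) does not.
      private val knownSizeInBytes: Option[Long] =
        if (Files.isRegularFile(path)) Some(Files.size(path)) else None

      // Stand-in for isDefinedForLength: Some(answer) from metadata when possible,
      // None meaning "unknown, fall back to the reading/caching path".
      def isDefinedForLength(startBytePos: Long, lengthInBytes: Long): Option[Boolean] =
        knownSizeInBytes.map(size => startBytePos + lengthInBytes <= size)
    }

    object FileBackedLengthCheckExample extends App {
      val check = new FileBackedLengthCheck(Paths.get("data.bin"))
      // Answers without touching the data: Some(true/false) for a regular file,
      // None for a FIFO or other non-regular file.
      println(check.isDefinedForLength(0, 300L * 1024 * 1024))
    }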
>>>
>>> ________________________________
>>> From: Beckerle, Mike <[email protected]>
>>> Sent: Friday, September 11, 2020 2:51 PM
>>> To: [email protected] <[email protected]>
>>> Subject: Re: Large dfdl:length values on complex types
>>>
>>> We have tunable limits that people sometimes have to enlarge, e.g., the max size of a regex match is limited.
>>>
>>> The DFDL spec allows implementations to have limits on backtracking capability, it just requires us to document them.
>>>
>>> That said, we really don't want to have this sort of limit hard-coded.
>>>
>>> A non-streamable file format that just stores the total length in a header record is pretty common, and people will use those for really big pieces of data these days.
>>>
>>> Can this 256MB just be a tunable constant that people can enlarge?
>>>
>>> From a security perspective, a stored length field is always something that should be sanity checked - if clobbered, it could contain -1, and that could be interpreted as maxInt or something, resulting in a denial-of-service attack. It would be better to get a processing error in that case.
>>>
>>> Can we just check complex type lengths against a tunable limit, and error out before we even try to read the data in?
>>>
>>> ________________________________
>>> From: Sloane, Brandon <[email protected]>
>>> Sent: Friday, September 11, 2020 1:53 PM
>>> To: [email protected] <[email protected]>
>>> Subject: Re: Large dfdl:length values on complex types
>>>
>>> 256MB doesn't strike me as that big. We haven't had filesize limits measured in GB since FAT32 and ext1 (depending on settings; even ext4 can have a limit as low as 16GB). All it takes is such a file to have a length prefix, and we can very easily run into the limit. Combined with our support of large binary blobs, it is not unreasonable that someone would want to use DFDL on such a format.
>>>
>>> ________________________________
>>> From: Beckerle, Mike <[email protected]>
>>> Sent: Friday, September 11, 2020 1:24 PM
>>> To: [email protected] <[email protected]>
>>> Subject: Re: Large dfdl:length values on complex types
>>>
>>> Maybe a silly question, but why don't we just hit a tunable size limit immediately before we "try to read" that data? 256MB is very big.
>>>
>>> Is this a real format, or a test case designed to push the boundaries?
>>>
>>> ________________________________
>>> From: Steve Lawrence <[email protected]>
>>> Sent: Friday, September 11, 2020 1:14 PM
>>> To: [email protected] <[email protected]>
>>> Subject: Large dfdl:length values on complex types
>>>
>>> I recently came across an issue where we have something like this:
>>>
>>> <xs:element name="length" type="xs:int" ... />
>>> <xs:element name="data"
>>>   dfdl:lengthKind="explicit" dfdl:length="{ ../length }">
>>>   <xs:complexType>
>>>     <xs:sequence>
>>>       <xs:element name="field1" ... />
>>>       <xs:element name="field2" ... />
>>>       ...
>>>       <xs:element name="fieldN" ... />
>>>     </xs:sequence>
>>>   </xs:complexType>
>>> </xs:element>
>>>
>>> So we have a length element and a complex data field that uses this length, and the data field is made up of a bunch of fields.
>>>
>>> The issue I came across is related to how we cache bytes in buckets for backtracking. As we fill up buckets, we currently limit the total cache size of the buckets to 256MB. So if someone ever parses more than 256MB of data and then tries to backtrack past that, we error.
>>> The idea being that we don't want to keep an infinite cache for potential backtracking, and people should have realized that they went down the wrong branch much earlier.
>>>
>>> Though, a problem occurs with complex types with a large specified length like the above. When we have the complex type with expression ../length, before trying to parse any of the fields, we read that length number of bytes into our cache buckets to confirm that that number of bytes exists. The problem occurs if length is more than 256MB. In this case, we read length number of bytes and start removing elements from the cache once we read more than 256MB.
>>>
>>> But once that succeeds and we have read length bytes, we then try to start parsing the fields within the complex type, but we've removed those early cached bytes, and so we fail with an unhelpful backtracking exception.
>>>
>>> I'm not sure of the right solution here.
>>>
>>> Perhaps we shouldn't be throwing away these bytes when dealing with complex lengths?
>>>
>>> Or perhaps we shouldn't even be trying to determine if that many bytes are available when we have a specified length. Instead, maybe we should just set the bit limit to make sure we don't parse more than that? And if eventually something tries to read a byte and there aren't enough and we hit that limit, only then do we fail. This feels like the right solution, but I wanted to start a discussion to see if maybe there's a reason we try to read the full length, or maybe there's another alternative?
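[Editor's note: to illustrate the failure mode described in this original message, here is a scaled-down standalone Scala sketch. TinyBucketCache is hypothetical, not Daffodil's BucketingInputSource, and 4-byte buckets with a 16-byte cap stand in for the real bucket size and the 256MB limit. Pre-reading the whole explicit length evicts the oldest buckets, so the bytes needed to parse the first child are already gone by the time child parsing starts.]

    // Illustrative sketch only -- not Daffodil's BucketingInputSource.
    import scala.collection.mutable

    final class TinyBucketCache(bucketSize: Int, maxCachedBytes: Int) {
      private val buckets = mutable.Queue[Array[Byte]]()  // oldest bucket first
      private var firstCachedPos = 0L                     // absolute position of oldest cached byte
      private var cachedBytes = 0

      // Read and cache `n` bytes from `in`, evicting the oldest buckets once the
      // cache grows past maxCachedBytes. Returns false if the data runs out.
      def readAhead(in: java.io.InputStream, n: Long): Boolean = {
        var remaining = n
        while (remaining > 0) {
          val buf = new Array[Byte](math.min(bucketSize.toLong, remaining).toInt)
          val numRead = in.read(buf)
          if (numRead <= 0) return false
          val chunk = buf.take(numRead)
          buckets.enqueue(chunk)
          cachedBytes += chunk.length
          remaining -= chunk.length
          while (cachedBytes > maxCachedBytes) {
            val evicted = buckets.dequeue()
            cachedBytes -= evicted.length
            firstCachedPos += evicted.length  // these bytes are gone for good
          }
        }
        true
      }

      // Can we still back up and re-read from absolute position `pos`?
      def canBacktrackTo(pos: Long): Boolean = pos >= firstCachedPos
    }

    object TinyBucketCacheDemo extends App {
      val data = new java.io.ByteArrayInputStream(Array.fill[Byte](64)(0.toByte))
      val cache = new TinyBucketCache(bucketSize = 4, maxCachedBytes = 16)
      // The upfront check: pre-read the full explicit length (48 bytes here)...
      println(cache.readAhead(data, 48))  // true: the data does exist
      // ...but the start of the complex element has been evicted, so parsing its
      // first child at position 0 now fails with the unhelpful backtracking error.
      println(cache.canBacktrackTo(0))    // false
    }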
