Yep, we would still call withBitLengthLimit, which will restrict child elements to that length. Fortunately, withBitLengthLimit doesn't actually care whether we've read that amount of data yet.
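[Editor's note: to make that concrete, here is a minimal standalone sketch in Scala of the combinator behavior being described. It is not Daffodil source: StreamState, parseChildren, and the simplified withBitLengthLimit signature are illustrative stand-ins. The point is only that the combinator records a tighter bit limit for the children and skips the ElementUnused region afterward, without requiring the data to be buffered up front.]

    // Illustrative sketch only -- not Daffodil source. Simplified stand-ins for
    // the real parse state and combinator machinery discussed in this thread.
    object ExplicitLengthSketch {

      // Minimal stand-in for the data stream's position and current bit limit.
      final case class StreamState(bitPos: Long, bitLimit: Option[Long])

      // Run `body` under a tighter bit limit, then restore the enclosing limit.
      // Note: this only records the limit; it never checks that the underlying
      // bytes have actually been read or buffered yet.
      def withBitLengthLimit[A](s: StreamState, lengthInBits: Long)(
          body: StreamState => (StreamState, A)): (StreamState, A) = {
        val limited = s.copy(bitLimit = Some(s.bitPos + lengthInBits))
        val (after, result) = body(limited)
        (after.copy(bitLimit = s.bitLimit), result)
      }

      // Explicit-length complex type: no upfront "is all the data present?" check.
      // Parse the children under the limit, then skip ElementUnused by advancing
      // to the end position. A real implementation would only fail here if the
      // data actually runs out.
      def parseExplicitLengthComplex(
          s: StreamState,
          lengthInBits: Long,
          parseChildren: StreamState => (StreamState, Boolean)): (StreamState, Boolean) = {
        val endBitPos = s.bitPos + lengthInBits
        val (afterChildren, ok) = withBitLengthLimit(s, lengthInBits)(parseChildren)
        if (ok) (afterChildren.copy(bitPos = endBitPos), true)
        else (afterChildren, false)
      }
    }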
On 9/14/20 12:02 PM, Beckerle, Mike wrote:
> A further observation: we would need to tell the I/O layer about the bounded length of the complex element (which probably has to be in length units of bytes/bits/fixed-width characters, not characters of a variable-width encoding like UTF-8), for purposes of avoiding the children of the complex type just happening to turn out to be larger than the explicit length, and also to someday implement the lengthKind 'endOfParent' for the final child.
>
> ________________________________
> From: Beckerle, Mike <[email protected]>
> Sent: Monday, September 14, 2020 11:58 AM
> To: [email protected] <[email protected]>
> Subject: Re: Large dfdl:length values on complex types
>
> Ok, so assuming the input source is a pipe/stream kind of thing, what exactly are we hoping to achieve?
>
> The scenario is that a complex element with an explicit length, and that length is vast, is encountered.
>
> We want to NOT check in advance that all the bytes are present, and just go about parsing the children, and maybe we'll have enough, or maybe we will not have enough data, but we don't want to grab it all in advance.
>
> Seems like we just remove the upfront check. Proceed to parse the children.
>
> We still need to save the value of the explicit length for use at the end of the complex element in case there is a need for the ElementUnused region to be skipped.
>
> The parser behavior would change: we won't immediately backtrack on short data in this case, but that's probably ok.
>
> So it's a different backend combinator for explicit-length complex types, but it seems to me the issue is just that the check is too aggressive and not consistent with streaming behavior.
>
> Does that make sense?
>
> ________________________________
> From: Steve Lawrence <[email protected]>
> Sent: Monday, September 14, 2020 11:34 AM
> To: [email protected] <[email protected]>
> Subject: Re: Large dfdl:length values on complex types
>
> We could have a separate I/O path specific to files. Our current input API supports Array[Byte], ByteBuffer, and InputStream as ways to get data.
>
> For the first two, we create a "ByteBufferInputSource", which doesn't actually throw away data since it's all already in an array in memory. isDefinedForLength doesn't need to read anything to return an answer since it knows the full length of the data.
>
> For the InputStream, we create a "BucketingInputSource", which is what does this caching/discarding stuff and needs to read/cache the input stream, which is where we get into this cache issue.
>
> We could definitely create a new FileInputSource that could have specific logic for a java File/Path, and only use this when we have length information (which I assume we can't get when dealing with things like pipes). We would still have to fall back to the BucketingInputSource for those cases.
>
> But as Brandon points out, it doesn't solve the case when the input data is not from a file. And it also requires the user to use a new API function to create a File-specific InputSource.
>
> I've created DAFFODIL-2397 to track the API issue.
>
> On 9/14/20 11:19 AM, Sloane, Brandon wrote:
>> Does that solve the issue? Not all data sources are files. A file-centric path could be used as an optimization, but we also need to behave correctly on non-file data sources.
>>
>> Also, if we go this route, we need to be careful about what we consider to be a "file".
>> Anonymous and named pipes are commonly used types of files which do not provide a file-length.
>>
>> ________________________________
>> From: Beckerle, Mike <[email protected]>
>> Sent: Monday, September 14, 2020 10:40 AM
>> To: [email protected] <[email protected]>
>> Subject: Re: Large dfdl:length values on complex types
>>
>> Is it possible to create a separate I/O path that is file-centric, so that isDefinedForLength can answer without actually accessing any data at all?
>>
>> The notion that there is no tunable access suggests we do need an API change to a factory pattern, so that the I/O abstractions that need them have access to tunables. Efficient I/O is just one of those areas where tuning is generally needed.
>>
>> ...mikeb
>>
>> ________________________________
>> From: Steve Lawrence <[email protected]>
>> Sent: Monday, September 14, 2020 8:32:34 AM
>> To: [email protected] <[email protected]>
>> Subject: Re: Large dfdl:length values on complex types
>>
>>> The issue isn't so much backtracking, as fortracking. At the moment, we need to read in all of dfdl:length before we begin parsing the first time, so we get a problem even if we never attempt to backtrack.
>>
>> This is correct. I think this is the core issue. As best I can come up with, the right fix is the removal (or very selective use) of isDefinedForLength, along with parsers handling some new end-of-stream error. Not sure how big of a change that is or if maybe there's another approach.
>>
>> I've created DAFFODIL-2395 to track this issue. I think this is probably a blocker for the next release.
>>
>> Note that this isn't actually an issue for blobs. The BlobLengthParser reads blobs in chunks with a default size of 4KB, and it only calls isDefinedForLength for individual chunks, so it never comes close to the 256MB limit. So large blobs are never an issue. The only issue here is with complex or simple types with a specified length larger than 256MB. A simple type of 256MB seems unlikely, but it is very possible for complex types.
>>
>> Also note that the upcoming streaming changes won't help with this issue. That is really just about outputting the infoset while we parse rather than waiting until the end of parse, as well as the use of a SAX API. It won't affect how we handle the data at all.
>>
>> On 9/11/20 3:04 PM, Sloane, Brandon wrote:
>>> We should be careful about making the cache too big due to the memory overhead. I'm not sure where we are with support for streaming parses (beyond the --stream option in the CLI, which assumes multiple distinct documents that have been concatted), so in most cases, a multiple of this overhead would be needed by the infoset itself anyway.
>>>
>>> The reason I bring up blobs is that they bypass the requirement to have a large infoset. Imagine a 4GB document consisting of a length prefix, 3.99GB worth of binary blobs, and metadata. There is no good reason parsing this document should require anywhere near 4GB of memory overhead.
>>>
>>>> The DFDL spec allows implementations to have limits on backtracking capability, it just requires us to document them.
>>>
>>> The issue isn't so much backtracking, as fortracking. At the moment, we need to read in all of dfdl:length before we begin parsing the first time, so we get a problem even if we never attempt to backtrack.
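[Editor's note: a rough Scala sketch of the file-centric idea discussed above, using only hypothetical names (FileBackedLengthCheck, data.bin), not the actual Daffodil InputSource API. For a regular file, an isDefinedForLength-style question can be answered from file metadata without reading or caching anything; a pipe, which is also a "file", cannot report a usable size and would still need the bucketing/caching path.]

    // Illustrative sketch only -- FileBackedLengthCheck and data.bin are
    // hypothetical, not part of Daffodil's API.
    import java.nio.file.{ Files, Path, Paths }

    final class FileBackedLengthCheck(path: Path) {
      // Only a regular file has a meaningful size; a named pipe (FIFO) does not.
      private val knownSizeInBytes: Option[Long] =
        if (Files.isRegularFile(path)) Some(Files.size(path)) else None

      // Stand-in for isDefinedForLength: Some(answer) from metadata when possible,
      // None meaning "unknown, fall back to the reading/caching path".
      def isDefinedForLength(startBytePos: Long, lengthInBytes: Long): Option[Boolean] =
        knownSizeInBytes.map(size => startBytePos + lengthInBytes <= size)
    }

    object FileBackedLengthCheckExample extends App {
      val check = new FileBackedLengthCheck(Paths.get("data.bin"))
      // Answers without touching the data: Some(true/false) for a regular file,
      // None for a FIFO or other non-regular file.
      println(check.isDefinedForLength(0, 300L * 1024 * 1024))
    }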
>>>
>>> ________________________________
>>> From: Beckerle, Mike <[email protected]>
>>> Sent: Friday, September 11, 2020 2:51 PM
>>> To: [email protected] <[email protected]>
>>> Subject: Re: Large dfdl:length values on complex types
>>>
>>> We have tunable limits that people sometimes have to enlarge, e.g., the max size of a regex match is limited.
>>>
>>> The DFDL spec allows implementations to have limits on backtracking capability, it just requires us to document them.
>>>
>>> That said, we really don't want to have this sort of limit hard-coded.
>>>
>>> A non-streamable file format that just stores the total length in a header record is pretty common, and people will use those for really big pieces of data these days.
>>>
>>> Can this 256MB just be a tunable constant that people can enlarge?
>>>
>>> From a security perspective, a stored length field is always something that should be sanity checked - if clobbered, it could contain -1, and that could be interpreted as maxInt or something, resulting in a denial-of-service attack. It would be better to get a processing error in that case.
>>>
>>> Can we just check complex type lengths against a tunable limit, and error out before we even try to read the data in?
>>>
>>> ________________________________
>>> From: Sloane, Brandon <[email protected]>
>>> Sent: Friday, September 11, 2020 1:53 PM
>>> To: [email protected] <[email protected]>
>>> Subject: Re: Large dfdl:length values on complex types
>>>
>>> 256MB doesn't strike me as that big. We haven't had filesize limits measured in GB since FAT32 and ext1 (depending on settings; even ext4 can have a limit as low as 16GB). All it takes is such a file to have a length prefix, and we can very easily run into the limit. Combined with our support of large binary blobs, it is not unreasonable that someone would want to use DFDL on such a format.
>>>
>>> ________________________________
>>> From: Beckerle, Mike <[email protected]>
>>> Sent: Friday, September 11, 2020 1:24 PM
>>> To: [email protected] <[email protected]>
>>> Subject: Re: Large dfdl:length values on complex types
>>>
>>> Maybe a silly question, but why don't we just hit a tunable size limit immediately before we "try to read" that data? 256MB is very big.
>>>
>>> Is this a real format, or a test case designed to push the boundaries?
>>>
>>> ________________________________
>>> From: Steve Lawrence <[email protected]>
>>> Sent: Friday, September 11, 2020 1:14 PM
>>> To: [email protected] <[email protected]>
>>> Subject: Large dfdl:length values on complex types
>>>
>>> I recently came across an issue where we have something like this:
>>>
>>> <xs:element name="length" type="xs:int" ... />
>>> <xs:element name="data"
>>>   dfdl:lengthKind="explicit" dfdl:length="{ ../length }">
>>>   <xs:complexType>
>>>     <xs:sequence>
>>>       <xs:element name="field1" ... />
>>>       <xs:element name="field2" ... />
>>>       ...
>>>       <xs:element name="fieldN" ... />
>>>     </xs:sequence>
>>>   </xs:complexType>
>>> </xs:element>
>>>
>>> So we have a length element and a complex data field that uses this length, and the data field is made up of a bunch of fields.
>>>
>>> The issue I came across is related to how we cache bytes in buckets for backtracking. As we fill up buckets, we currently limit the total cache size of the buckets to 256MB. So if someone ever parses more than 256MB of data and then tries to backtrack past that, we error.
>>> The idea being that we don't want to keep an infinite cache for potential backtracking, and people should have realized that they went down the wrong branch much earlier.
>>>
>>> Though, a problem occurs with complex types with a large specified length like the above. When we have the complex type with expression ../length, before trying to parse any of the fields, we read that length number of bytes into our cache buckets to confirm that that number of bytes exists. The problem occurs if length is more than 256MB. In this case, we read length number of bytes and start removing elements from the cache once we read more than 256MB.
>>>
>>> But once that succeeds and we have read length bytes, we then try to start parsing the fields within the complex type, but we've removed those early cached bytes, and so we fail with an unhelpful backtracking exception.
>>>
>>> I'm not sure of the right solution here.
>>>
>>> Perhaps we shouldn't be throwing away these bytes when dealing with complex lengths?
>>>
>>> Or perhaps we shouldn't even be trying to determine if that many bytes are available when we have a specified length. Instead, maybe we should just set the bit limit to make sure we don't parse more than that? And if eventually something tries to read a byte and there aren't enough and we hit that limit, only then do we fail. This feels like the right solution, but I wanted to start a discussion to see if maybe there's a reason we try to read the full length, or maybe there's another alternative?
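[Editor's note: to illustrate the failure mode described in this original message, here is a scaled-down standalone Scala sketch. TinyBucketCache is hypothetical, not Daffodil's BucketingInputSource, and 4-byte buckets with a 16-byte cap stand in for the real bucket size and the 256MB limit. Pre-reading the whole explicit length evicts the oldest buckets, so the bytes needed to parse the first child are already gone by the time child parsing starts.]

    // Illustrative sketch only -- not Daffodil's BucketingInputSource.
    import scala.collection.mutable

    final class TinyBucketCache(bucketSize: Int, maxCachedBytes: Int) {
      private val buckets = mutable.Queue[Array[Byte]]()  // oldest bucket first
      private var firstCachedPos = 0L                     // absolute position of oldest cached byte
      private var cachedBytes = 0

      // Read and cache `n` bytes from `in`, evicting the oldest buckets once the
      // cache grows past maxCachedBytes. Returns false if the data runs out.
      def readAhead(in: java.io.InputStream, n: Long): Boolean = {
        var remaining = n
        while (remaining > 0) {
          val buf = new Array[Byte](math.min(bucketSize.toLong, remaining).toInt)
          val numRead = in.read(buf)
          if (numRead <= 0) return false
          val chunk = buf.take(numRead)
          buckets.enqueue(chunk)
          cachedBytes += chunk.length
          remaining -= chunk.length
          while (cachedBytes > maxCachedBytes) {
            val evicted = buckets.dequeue()
            cachedBytes -= evicted.length
            firstCachedPos += evicted.length  // these bytes are gone for good
          }
        }
        true
      }

      // Can we still back up and re-read from absolute position `pos`?
      def canBacktrackTo(pos: Long): Boolean = pos >= firstCachedPos
    }

    object TinyBucketCacheDemo extends App {
      val data = new java.io.ByteArrayInputStream(Array.fill[Byte](64)(0.toByte))
      val cache = new TinyBucketCache(bucketSize = 4, maxCachedBytes = 16)
      // The upfront check: pre-read the full explicit length (48 bytes here)...
      println(cache.readAhead(data, 48))  // true: the data does exist
      // ...but the start of the complex element has been evicted, so parsing its
      // first child at position 0 now fails with the unhelpful backtracking error.
      println(cache.canBacktrackTo(0))    // false
    }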
