This is an automated email from the ASF dual-hosted git repository.
slawrence pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-daffodil.git
The following commit(s) were added to refs/heads/master by this push:
new 40c8b79 Fix how we throw away buckets during parsing
40c8b79 is described below
commit 40c8b799d68c94545f5a2ddc9c8753986cd168e1
Author: Steve Lawrence <[email protected]>
AuthorDate: Thu Jan 14 09:47:40 2021 -0500
Fix how we throw away buckets during parsing
The BucketingInputSource caches small-ish buckets of data in a "buckets"
array while parsing. As we determine that Daffodil will no longer need
data from older buckets (e.g. when PoUs are resolved), we set the array
indices to null so that Java can garbage collect those buckets. This
means that although the size of the buckets array may grow large for
very large data, the majority of the elements in that array are null, so
the actual cached data in memory is quite small.
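
As an illustration, a minimal sketch of that idea in Scala. Bucket here
is simplified, and BucketCacheSketch and its methods are hypothetical
names for this sketch, not the actual BucketingInputSource internals:

    import scala.collection.mutable.ArrayBuffer

    class Bucket(bucketSize: Int) {
      val bytes = new Array[Byte](bucketSize)
    }

    class BucketCacheSketch(bucketSize: Int) {
      // Grows as data is parsed; discarded slots are set to null so
      // the JVM can garbage collect their byte arrays.
      private val buckets = ArrayBuffer[Bucket]()
      private var oldestBucketIndex = 0L

      def addBucket(): Unit = buckets += new Bucket(bucketSize)

      // Called once older data can never be needed again (e.g. a PoU
      // resolved). Only null slots stay behind; the data is freed.
      def releaseThrough(index: Long): Unit = {
        while (oldestBucketIndex <= index) {
          buckets(oldestBucketIndex.toInt) = null // eligible for GC
          oldestBucketIndex += 1
        }
      }
    }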
Sometimes we cannot discard buckets due to unresolved PoUs, so we
impose a maximum number of buckets that can be cached. Once we reach
this number of buckets in the buckets array, we simply discard the
oldest bucket by setting it to null like above. If anything ever tries
to use data from a discarded bucket, we throw a backtracking error. In
practice, the maximum bucket limit is equivalent to about 256MB of
cached data, which is high enough that nothing reasonably needs to
backtrack that far.
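
Continuing the simplified sketch above (addBucketWithCap, byteAt, and
the exception used are hypothetical stand-ins, not Daffodil's API), the
cap and the backtracking check might look like:

    // Enforce the cache cap: once too many non-null buckets are live,
    // drop the oldest one even if a PoU still references it.
    def addBucketWithCap(maxNonNullBuckets: Long): Unit = {
      buckets += new Bucket(bucketSize)
      // Count only live (non-null) buckets, not every slot ever made.
      if (buckets.size - oldestBucketIndex >= maxNonNullBuckets) {
        buckets(oldestBucketIndex.toInt) = null
        oldestBucketIndex += 1
      }
    }

    // Reading from a slot that was already nulled means the parser
    // tried to backtrack further than the cache retains.
    def byteAt(bucketIndex: Long, offset: Int): Byte = {
      if (bucketIndex < oldestBucketIndex)
        throw new IllegalStateException("backtracked into a discarded bucket")
      buckets(bucketIndex.toInt).bytes(offset)
    }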
However, we incorrectly calculate when to throw away the oldest bucket.
We currently do so when the buckets array grows beyond some maximum
number of elements. But that only means we have parsed a large amount
of data, not that we have actually cached a large amount of data; it
fails to account for the fact that many of the buckets in the array may
already have been discarded. This can lead to a situation where we
think we have reached the maximum cache size, but we really haven't, and so we
start throwing away buckets that we actually need and could reasonably
backtrack to. And if we do backtrack that small amount, we get an error.
So instead of just throwing away the oldest bucket once the buckets
array grows to some size, we really only want to do so when the number of
*non-null buckets* grows too large, since that is what actually implies
the cached data has grown too large. This patch fixes the calculation to
do that.
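
To make the difference concrete, a small worked example with made-up
numbers (the variable names mirror the patch below; the values are
illustrative only):

    object ConditionSketch extends App {
      val maxNumberOfNonNullBuckets = 4L

      // Suppose 100 buckets were appended over the parse and 98 of
      // them have already been discarded (their slots set to null):
      val bucketsSize = 100L // buckets.size: every slot, nulls included
      val lastBucketIndex = 99L
      val oldestBucketIndex = 98L

      // Old, buggy check: fires because much data has been *parsed*...
      assert(bucketsSize > maxNumberOfNonNullBuckets)

      // ...yet only two buckets are actually still *cached*, so the
      // fixed check correctly does not discard anything:
      assert(lastBucketIndex - oldestBucketIndex < maxNumberOfNonNullBuckets)
    }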
DAFFODIL-2455
---
.../src/main/scala/org/apache/daffodil/io/InputSource.scala | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/daffodil-io/src/main/scala/org/apache/daffodil/io/InputSource.scala b/daffodil-io/src/main/scala/org/apache/daffodil/io/InputSource.scala
index 24dc6c6..0c6586f 100644
--- a/daffodil-io/src/main/scala/org/apache/daffodil/io/InputSource.scala
+++ b/daffodil-io/src/main/scala/org/apache/daffodil/io/InputSource.scala
@@ -188,7 +188,15 @@ class BucketingInputSource(
     val bytes = new Array[Byte](bucketSize)
   }
 
-  private val maxNumberOfBuckets = Math.max(maxCacheSizeInBytes / bucketSize, 2)
+  /**
+   * The maximum number of buckets that can be cached in the buckets array. Note
+   * that because we periodically remove buckets from the buckets array by
+   * setting them to null, it is possible (and expected for large data) for the
+   * size of the buckets array to grow beyond this number. So this does not
+   * represent the maximum size of the buckets array, but instead represents the
+   * maximum number of non-null elements in the buckets array.
+   */
+  private val maxNumberOfNonNullBuckets = Math.max(maxCacheSizeInBytes / bucketSize, 2)
 
   /**
    * Array of buckets
@@ -276,7 +284,7 @@ class BucketingInputSource(
       buckets += new Bucket()
       bytesFilledInLastBucket = 0
       lastBucketIndex += 1
-      if (buckets.size > maxNumberOfBuckets) {
+      if ((lastBucketIndex - oldestBucketIndex) >= maxNumberOfNonNullBuckets) {
         // This frees the oldest bucket, allowing it to be garbage collected.
         buckets(oldestBucketIndex.toInt) = null
         oldestBucketIndex += 1