This is an automated email from the ASF dual-hosted git repository.
slawrence pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-daffodil.git
The following commit(s) were added to refs/heads/master by this push:
new 40c8b79 Fix how we throw away buckets during parsing
40c8b79 is described below
commit 40c8b799d68c94545f5a2ddc9c8753986cd168e1
Author: Steve Lawrence <[email protected]>
AuthorDate: Thu Jan 14 09:47:40 2021 -0500
Fix how we throw away buckets during parsing
The BucketingInputSource caches small-ish buckets of data in a "buckets"
array while parsing. As we determine that Daffodil will no longer need
data from older buckets (e.g. when PoUs are resolved), we set the array
indices to null so that Java can garbage collect those buckets. This
means that although the size of the buckets array may grow large for
very large data, the majority of the elements in that array are null, so
the actual cached data in memory is quite small.
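
As an illustration, a minimal sketch of that idea in Scala. Bucket here
is simplified, and BucketCacheSketch and its methods are hypothetical
names for this sketch, not the actual BucketingInputSource internals:

    import scala.collection.mutable.ArrayBuffer

    class Bucket(bucketSize: Int) {
      val bytes = new Array[Byte](bucketSize)
    }

    class BucketCacheSketch(bucketSize: Int) {
      // Grows as data is parsed; discarded slots are set to null so
      // the JVM can garbage collect their byte arrays.
      private val buckets = ArrayBuffer[Bucket]()
      private var oldestBucketIndex = 0L

      def addBucket(): Unit = buckets += new Bucket(bucketSize)

      // Called once older data can never be needed again (e.g. a PoU
      // resolved). Only null slots stay behind; the data is freed.
      def releaseThrough(index: Long): Unit = {
        while (oldestBucketIndex <= index) {
          buckets(oldestBucketIndex.toInt) = null // eligible for GC
          oldestBucketIndex += 1
        }
      }
    }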
Sometimes we cannot discard buckets due to unresolved PoUs, so we
impose a maximum number of buckets that can be cached. Once we reach
this number of buckets in the buckets array, we simply discard the
oldest bucket by setting it to null like above. If anything ever tries
to use data from a discarded bucket, we throw a backtracking error. In
practice, the maximum bucket limit is equivalent to about 256MB of
cached data, which is high enough that nothing reasonably needs to
backtrack that far.
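
Continuing the simplified sketch above (addBucketWithCap, byteAt, and
the exception used are hypothetical stand-ins, not Daffodil's API), the
cap and the backtracking check might look like:

    // Enforce the cache cap: once too many non-null buckets are live,
    // drop the oldest one even if a PoU still references it.
    def addBucketWithCap(maxNonNullBuckets: Long): Unit = {
      buckets += new Bucket(bucketSize)
      // Count only live (non-null) buckets, not every slot ever made.
      if (buckets.size - oldestBucketIndex >= maxNonNullBuckets) {
        buckets(oldestBucketIndex.toInt) = null
        oldestBucketIndex += 1
      }
    }

    // Reading from a slot that was already nulled means the parser
    // tried to backtrack further than the cache retains.
    def byteAt(bucketIndex: Long, offset: Int): Byte = {
      if (bucketIndex < oldestBucketIndex)
        throw new IllegalStateException("backtracked into a discarded bucket")
      buckets(bucketIndex.toInt).bytes(offset)
    }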
However, we incorrectly calculate when to throw away the oldest bucket.
We currently do so when the buckets array grows beyond some maximum
number of elements. But that only means we have parsed a large amount
of data, not that we have actually cached a large amount of data; it
fails to account for the fact that many of the buckets in the array may
already have been discarded. This can lead to a situation where we
think we have reached the maximum cache size, but we really haven't, and so we
start throwing away buckets that we actually need and could reasonably
backtrack to. And if we do backtrack that small amount, we get an error.
So instead of just throwing away the oldest bucket once the buckets
array grows to some size, we really only want to do so when the number of
*non-null buckets* grows too large, since that is what actually implies
the cached data has grown too large. This patch fixes the calculation to
do that.
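
To make the difference concrete, a small worked example with made-up
numbers (the variable names mirror the patch below; the values are
illustrative only):

    object ConditionSketch extends App {
      val maxNumberOfNonNullBuckets = 4L

      // Suppose 100 buckets were appended over the parse and 98 of
      // them have already been discarded (their slots set to null):
      val bucketsSize = 100L // buckets.size: every slot, nulls included
      val lastBucketIndex = 99L
      val oldestBucketIndex = 98L

      // Old, buggy check: fires because much data has been *parsed*...
      assert(bucketsSize > maxNumberOfNonNullBuckets)

      // ...yet only two buckets are actually still *cached*, so the
      // fixed check correctly does not discard anything:
      assert(lastBucketIndex - oldestBucketIndex < maxNumberOfNonNullBuckets)
    }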
DAFFODIL-2455
---
.../src/main/scala/org/apache/daffodil/io/InputSource.scala | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/daffodil-io/src/main/scala/org/apache/daffodil/io/InputSource.scala b/daffodil-io/src/main/scala/org/apache/daffodil/io/InputSource.scala
index 24dc6c6..0c6586f 100644
--- a/daffodil-io/src/main/scala/org/apache/daffodil/io/InputSource.scala
+++ b/daffodil-io/src/main/scala/org/apache/daffodil/io/InputSource.scala
@@ -188,7 +188,15 @@ class BucketingInputSource(
     val bytes = new Array[Byte](bucketSize)
   }
 
-  private val maxNumberOfBuckets = Math.max(maxCacheSizeInBytes / bucketSize, 2)
+  /**
+   * The maximum number of buckets that can be cached in the buckets array. Note
+   * that because we periodically remove buckets from the buckets array by
+   * setting them to null, it is possible (and expected for large data) for the
+   * size of the buckets array to grow beyond this number. So this does not
+   * represent the maximum size of the buckets array, but instead represents the
+   * maximum number of non-null elements in the buckets array.
+   */
+  private val maxNumberOfNonNullBuckets = Math.max(maxCacheSizeInBytes / bucketSize, 2)
 
   /**
    * Array of buckets
@@ -276,7 +284,7 @@ class BucketingInputSource(
       buckets += new Bucket()
       bytesFilledInLastBucket = 0
       lastBucketIndex += 1
-      if (buckets.size > maxNumberOfBuckets) {
+      if ((lastBucketIndex - oldestBucketIndex) >= maxNumberOfNonNullBuckets) {
         // This frees the oldest bucket, allowing it to be garbage collected.
         buckets(oldestBucketIndex.toInt) = null
         oldestBucketIndex += 1