Will-Lo commented on code in PR #3787:
URL: https://github.com/apache/gobblin/pull/3787#discussion_r1334771397
##########
gobblin-modules/gobblin-orc/src/main/java/org/apache/gobblin/writer/OrcConverterMemoryManager.java:
##########
@@ -97,4 +149,19 @@ public long getConverterBufferTotalSize() {
return converterBufferTotalSize;
}
+ /**
+ * Resize the child-array size based on configuration.
+ * If smart resizing is enabled, it will use an exponential decay algorithm where it would resize the array by a smaller amount
+ * the more records the converter has processed, as the fluctuation in record size becomes less likely to differ significantly by then
+ * Since the writer is closed and reset periodically, this is generally a safe assumption that should prevent large empty array buffers
+ */
+ public int resize(int rowsAdded, int requestedSize) {
+ resizeCount += 1;
+ log.info(String.format("It has been resized %s times in current writer", resizeCount));
+ if (enabledSmartSizing) {
+ double decayingEnlargeFactor = this.smartArrayEnlargeFactorMax * Math.pow((1-this.smartArrayEnlargeDecayFactor), rowsAdded-1);
Review Comment:
My reasoning is that the rowbatch buffer size should also be proportional to
the maximum number of records in the rowbatch. The fuller the rowbatch is, the
less likely it is that a column will need a large increase, given that the
records are roughly the same size. This is also why the min still leaves a 20%
buffer at the end.
With self-tuning, though, the rowbatch max size is now variable, so there can
be some degradation: if, say, the rowbatch starts small, processes a lot of
records, and then its size increases, resizes become less effective. I think
those instances are pretty rare, and it will still be able to handle them and
recover (especially given that the Writer resets every 5 minutes).
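
To make the decay behaviour concrete, here is a minimal standalone sketch of
the formula from the diff. The class name, the constant values (max 5.0,
decay 0.003) and the way the floor is applied via Math.max are all
assumptions for illustration; the quoted diff is cut off before the floor is
used, and only the "20% buffer" (a factor of 1.2) comes from the comment
above.

```java
public class DecayingResizeSketch {
    // Hypothetical values, not taken from the PR's configuration defaults.
    static final double ENLARGE_FACTOR_MAX = 5.0;
    static final double DECAY_FACTOR = 0.003;
    // Floor of 1.2 mirrors the "20% buffer" mentioned in the review comment.
    static final double ENLARGE_FACTOR_MIN = 1.2;

    // Same shape as the expression in the diff: max * (1 - decay)^(rowsAdded - 1),
    // clamped from below (the clamp is an assumption about the unseen code).
    static double enlargeFactor(int rowsAdded) {
        double decaying = ENLARGE_FACTOR_MAX * Math.pow(1 - DECAY_FACTOR, rowsAdded - 1);
        return Math.max(decaying, ENLARGE_FACTOR_MIN);
    }

    // New child-array size: requested size scaled by the current factor, rounded up.
    static int resize(int rowsAdded, int requestedSize) {
        return (int) Math.ceil(enlargeFactor(rowsAdded) * requestedSize);
    }

    public static void main(String[] args) {
        for (int rows : new int[] {1, 100, 500, 1000}) {
            System.out.printf("rowsAdded=%d factor=%.3f newSize=%d%n",
                rows, enlargeFactor(rows), resize(rows, 1000));
        }
    }
}
```

With these assumed values the factor starts at the max for the first row and
shrinks toward the 1.2 floor as more rows are added, which is the "smaller
increase the fuller the rowbatch is" behaviour described above.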