[ 
https://issues.apache.org/jira/browse/GOBBLIN-1918?focusedWorklogId=881468&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-881468
 ]

ASF GitHub Bot logged work on GOBBLIN-1918:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 22/Sep/23 20:00
            Start Date: 22/Sep/23 20:00
    Worklog Time Spent: 10m 
      Work Description: Will-Lo commented on code in PR #3787:
URL: https://github.com/apache/gobblin/pull/3787#discussion_r1334771397


##########
gobblin-modules/gobblin-orc/src/main/java/org/apache/gobblin/writer/OrcConverterMemoryManager.java:
##########
@@ -97,4 +149,19 @@ public long getConverterBufferTotalSize() {
     return converterBufferTotalSize;
   }
 
+  /**
+   * Resizes a child array based on configuration.
+   * If smart resizing is enabled, an exponential decay algorithm enlarges the array by a smaller amount
+   * the more records the converter has processed, since fluctuations in record size become less likely to differ significantly by then.
+   * Because the writer is closed and reset periodically, this is generally a safe assumption that should prevent large empty array buffers.
+   */
+  public int resize(int rowsAdded, int requestedSize) {
+    resizeCount += 1;
+    log.info(String.format("Buffer has been resized %d times in the current writer", resizeCount));
+    if (enabledSmartSizing) {
+      double decayingEnlargeFactor = this.smartArrayEnlargeFactorMax * Math.pow(1 - this.smartArrayEnlargeDecayFactor, rowsAdded - 1);

Review Comment:
   My reasoning is that the rowbatch buffer size should also be proportional to 
the maximum number of records in the rowbatch. The fuller the rowbatch is, the 
less likely a large increase in the column will be needed, given that the 
records are approximately equal in size. This is also why the min still leaves 
a 20% buffer at the end.
   
   Although with self-tuning the rowbatch max size is now variable, there can 
be some degradation: if, say, the rowbatch starts small, processes a lot of 
records, and then the rowbatch size increases, resizes become less effective. 
But I think those instances are pretty rare, and the writer will still be able 
to handle them and recover (especially given that it resets every 5 minutes)
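
   For illustration, the decaying enlarge factor described in the diff can be 
sketched as below. The constant values and the min clamp (the "20% buffer") are 
assumptions inferred from this review thread, not the final merged code; the 
actual defaults and field names live in the Gobblin ORC writer configuration.

```java
// Hypothetical sketch of the decaying enlarge factor from the PR discussion.
// Constants are assumed defaults, not values taken from the merged code.
public class SmartResizeSketch {
  // Maximum multiplier applied to a requested size early in the batch (assumed).
  static final double ENLARGE_FACTOR_MAX = 5.0;
  // Lower bound, so a resize always leaves at least a 20% buffer (per the comment).
  static final double ENLARGE_FACTOR_MIN = 1.2;
  // Per-record decay rate; higher values shrink the factor faster (assumed).
  static final double DECAY_FACTOR = 0.002;

  /** Enlarge factor after rowsAdded records: max * (1 - decay)^(rowsAdded - 1), clamped below. */
  static double decayingEnlargeFactor(int rowsAdded) {
    double decayed = ENLARGE_FACTOR_MAX * Math.pow(1 - DECAY_FACTOR, rowsAdded - 1);
    return Math.max(ENLARGE_FACTOR_MIN, decayed);
  }

  /** Resize a requested column size by the decayed factor. */
  static int resize(int rowsAdded, int requestedSize) {
    return (int) Math.round(requestedSize * decayingEnlargeFactor(rowsAdded));
  }

  public static void main(String[] args) {
    // Early in the batch the factor is at the max; after many rows it decays
    // toward the minimum 20% buffer.
    System.out.println(decayingEnlargeFactor(1));      // 5.0
    System.out.println(decayingEnlargeFactor(100000)); // 1.2 (clamped)
    System.out.println(resize(1, 100));
  }
}
```

   The key property is that early resizes are aggressive (avoiding frequent 
re-allocation) while late resizes are conservative (avoiding large empty 
buffers), matching the reasoning above about fuller rowbatches needing smaller 
increases.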





Issue Time Tracking
-------------------

    Worklog Id:     (was: 881468)
    Time Spent: 50m  (was: 40m)

> Optimize smart resizing for ORC Writer converter buffer
> -------------------------------------------------------
>
>                 Key: GOBBLIN-1918
>                 URL: https://issues.apache.org/jira/browse/GOBBLIN-1918
>             Project: Apache Gobblin
>          Issue Type: Improvement
>          Components: gobblin-core
>            Reporter: William Lo
>            Assignee: Abhishek Tiwari
>            Priority: Major
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> The GobblinOrcWriter contains a converter and a buffer rowbatch. The buffer 
> holds the converted Avro -> Orc records before adding them to the native orc 
> writer.
> Since it can contain multiple records, it constantly needs to resize the 
> columns of the rowbatch in order to hold multiple records. This problem 
> affects both performance and memory when resizing is done either too often 
> (enlarge factor is too low) or not often enough (enlarge factor is too high 
> and thus the buffer dominates the container memory).
> Because there is a bounded number of records that can persist in the buffer 
> before getting flushed, we want to reduce the aggressiveness of the resizing 
> algorithm the more records that have been processed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)