[ 
https://issues.apache.org/jira/browse/GOBBLIN-1898?focusedWorklogId=879592&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-879592
 ]

ASF GitHub Bot logged work on GOBBLIN-1898:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 08/Sep/23 21:16
            Start Date: 08/Sep/23 21:16
    Worklog Time Spent: 10m 
      Work Description: ZihanLi58 commented on code in PR #3762:
URL: https://github.com/apache/gobblin/pull/3762#discussion_r1320347626


##########
gobblin-modules/gobblin-orc/src/main/java/org/apache/gobblin/writer/GobblinBaseOrcWriter.java:
##########
@@ -47,25 +47,35 @@
 public abstract class GobblinBaseOrcWriter<S, D> extends FsDataWriter<D> {
   public static final String ORC_WRITER_PREFIX = "orcWriter.";
   public static final String ORC_WRITER_BATCH_SIZE = ORC_WRITER_PREFIX + "batchSize";
-  public static final int DEFAULT_ORC_WRITER_BATCH_SIZE = 1000;
   public static final String ORC_WRITER_AUTO_SELFTUNE_ENABLED = ORC_WRITER_PREFIX + "auto.selfTune.enabled";
   public static final String ORC_WRITER_ESTIMATED_RECORD_SIZE = ORC_WRITER_PREFIX + "estimated.recordSize";
+  public static final String ORC_WRITER_AUTO_SELFTUNE_MAX_BATCH_SIZE = ORC_WRITER_PREFIX + "auto.selfTune.max.batch.size";

Review Comment:
   Also, please add a comment specifying which of these configs are supplied 
as job input and which are set at runtime.



##########
gobblin-modules/gobblin-orc/src/main/java/org/apache/gobblin/writer/GobblinBaseOrcWriter.java:
##########
@@ -47,25 +47,35 @@
 public abstract class GobblinBaseOrcWriter<S, D> extends FsDataWriter<D> {
   public static final String ORC_WRITER_PREFIX = "orcWriter.";
   public static final String ORC_WRITER_BATCH_SIZE = ORC_WRITER_PREFIX + "batchSize";
-  public static final int DEFAULT_ORC_WRITER_BATCH_SIZE = 1000;
   public static final String ORC_WRITER_AUTO_SELFTUNE_ENABLED = ORC_WRITER_PREFIX + "auto.selfTune.enabled";
   public static final String ORC_WRITER_ESTIMATED_RECORD_SIZE = ORC_WRITER_PREFIX + "estimated.recordSize";
+  public static final String ORC_WRITER_AUTO_SELFTUNE_MAX_BATCH_SIZE = ORC_WRITER_PREFIX + "auto.selfTune.max.batch.size";

Review Comment:
   Can we move all of these configs into a separate config class?
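
   A minimal sketch of what such a class could look like, for illustration 
only. The class name `OrcWriterConfigKeys`, the nested `RuntimeState` class, 
and the split between job-input and runtime keys are assumptions, not what 
the PR does; only the key strings are copied from the diff above.

```java
/**
 * Hypothetical holder for the ORC writer config keys shown in the diff.
 * The class name and the grouping below are assumptions for illustration.
 */
final class OrcWriterConfigKeys {
  private OrcWriterConfigKeys() {
    // constants only, never instantiated
  }

  public static final String ORC_WRITER_PREFIX = "orcWriter.";

  // Assumed to be supplied through job input.
  public static final String ORC_WRITER_BATCH_SIZE =
      ORC_WRITER_PREFIX + "batchSize";
  public static final String ORC_WRITER_AUTO_SELFTUNE_ENABLED =
      ORC_WRITER_PREFIX + "auto.selfTune.enabled";
  public static final String ORC_WRITER_AUTO_SELFTUNE_MAX_BATCH_SIZE =
      ORC_WRITER_PREFIX + "auto.selfTune.max.batch.size";

  /** Keys assumed to be set by the writer during runtime. */
  static final class RuntimeState {
    private RuntimeState() {
      // constants only, never instantiated
    }

    public static final String ORC_WRITER_ESTIMATED_RECORD_SIZE =
        ORC_WRITER_PREFIX + "estimated.recordSize";
  }
}
```

   Grouping the runtime keys in a nested class would make the job-input versus 
runtime distinction visible at every call site, in addition to the comments.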





Issue Time Tracking
-------------------

            Worklog Id:     (was: 879592)
    Remaining Estimate: 0h
            Time Spent: 10m

> Improve performance of Selftune ORC Writer
> ------------------------------------------
>
>                 Key: GOBBLIN-1898
>                 URL: https://issues.apache.org/jira/browse/GOBBLIN-1898
>             Project: Apache Gobblin
>          Issue Type: Improvement
>          Components: gobblin-core
>            Reporter: William Lo
>            Assignee: Abhishek Tiwari
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> The ORCWriter's new self-tuning feature writes less frequently when 
> ingesting datasets with a low volume of records.
> This is primarily caused by the assumption that the native ORC writer will be 
> saturated, which implies a memory footprint of STRIPE_SIZE + 
> avgSizeOfRecord * rowsBetweenMemoryCheck.
> However, that assumption generally does not hold for a low-volume dataset 
> with only a few records to write, and the result is slow writes. We should 
> use a newer API on ORCWriter introduced by 
> [https://github.com/apache/orc/pull/1057]
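
The footprint expression in the description can be worked through as a small 
sketch. The class and method names are hypothetical, and the input values in 
the usage note are illustrative, not taken from the issue or from any ORC or 
Gobblin default.

```java
/** Hypothetical helper that evaluates the footprint formula from the description. */
final class OrcFootprintEstimate {
  private OrcFootprintEstimate() {
    // static helper only
  }

  /**
   * Returns STRIPE_SIZE + avgSizeOfRecord * rowsBetweenMemoryCheck in bytes,
   * i.e. the footprint assumed when the native writer is saturated.
   * Throws ArithmeticException if the result overflows a long.
   */
  static long saturatedFootprintBytes(long stripeSizeBytes,
                                      long avgRecordSizeBytes,
                                      long rowsBetweenMemoryCheck) {
    long bufferedRecordBytes =
        Math.multiplyExact(avgRecordSizeBytes, rowsBetweenMemoryCheck);
    return Math.addExact(stripeSizeBytes, bufferedRecordBytes);
  }
}
```

With an illustrative 64 MiB stripe, 1 KiB records, and 5000 rows between 
memory checks, `saturatedFootprintBytes(64L * 1024 * 1024, 1024L, 5000L)` 
returns 72228864 bytes. A low-volume dataset may never buffer anywhere near 
that much, which is the mismatch the description points at.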



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
