hudi-bot opened a new issue, #14728:
URL: https://github.com/apache/hudi/issues/14728

   Input and Output streams created in HUDI through calls to 
HoodieWrapperFileSystem do not include any buffering unless the underlying file 
system implements buffering.
   
   DistributedFileSystem (over HDFS) does not implement any buffering. This 
leads to a very large number of small IO calls being sent to HDFS while 
performing HUDI IO operations such as reading parquet, writing parquet, 
reading/writing log files, reading/writing instants, etc. 
   
   This patch introduces buffering at the HoodieWrapperFileSystem level so that 
all types of reads and writes benefit from buffering.
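
The effect described above can be illustrated with a minimal, self-contained sketch. It does not use Hudi or Hadoop APIs and is not the patch's implementation: `CountingOutputStream` is a hypothetical stand-in for an unbuffered file system stream, and the record size, record count, and buffer size are assumed values chosen only for illustration. It counts how many write calls reach the underlying stream with and without a `java.io.BufferedOutputStream` wrapper.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

public class BufferingSketch {

    /** Hypothetical stand-in for an unbuffered stream: counts every call that reaches it. */
    static final class CountingOutputStream extends OutputStream {
        long calls = 0;
        long bytes = 0;

        @Override
        public void write(int b) {
            calls++;
            bytes++;
        }

        @Override
        public void write(byte[] buf, int off, int len) {
            calls++;
            bytes += len;
        }
    }

    /**
     * Writes recordCount records of recordSize bytes and returns the number of
     * write calls that reached the sink. A bufferSize of 0 means "no buffering".
     */
    public static long countSinkCalls(int recordCount, int recordSize, int bufferSize) {
        CountingOutputStream sink = new CountingOutputStream();
        OutputStream out = bufferSize > 0 ? new BufferedOutputStream(sink, bufferSize) : sink;
        byte[] record = new byte[recordSize];
        try {
            for (int i = 0; i < recordCount; i++) {
                out.write(record, 0, record.length);
            }
            out.flush(); // push any bytes still held in the buffer down to the sink
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.calls;
    }

    public static void main(String[] args) {
        // Assumed workload: 1000 records of 16 bytes each.
        System.out.println("unbuffered sink calls: " + countSinkCalls(1000, 16, 0));
        System.out.println("buffered sink calls:   " + countSinkCalls(1000, 16, 4096));
    }
}
```

Wrapping at a single layer, as the issue proposes for HoodieWrapperFileSystem, means callers keep issuing small writes while the underlying stream only sees them in buffer-sized batches.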
   
    
   
   In my tests at scale on HDFS, writing 1 million records into a parquet 
file (read from an existing parquet file in the same dataset), I observed the 
following benefits:

   1. About a 40% reduction in the total time to run the test
   2. Total write calls to HDFS reduced from 19.1M to 328
   3. Total read calls reduced from 229M to 515K
   
    
   
   ## JIRA info
   
   - Link: https://issues.apache.org/jira/browse/HUDI-1554
   - Type: Improvement
   
   
   ---
   
   
   ## Comments
   
   08/Aug/21 20:17;githubbot;hudi-bot edited a comment on pull request #2496:
   URL: https://github.com/apache/hudi/pull/2496#issuecomment-869762023
   
   
      <!--
      Meta data
      {
        "version" : 1,
        "metaDataEntries" : [ {
          "hash" : "ba72d3ee9f569bc68f21d410e672378881c954b9",
          "status" : "FAILURE",
          "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=501",
          "triggerID" : "ba72d3ee9f569bc68f21d410e672378881c954b9",
          "triggerType" : "PUSH"
        } ]
      }-->
      ## CI report:
      
      * ba72d3ee9f569bc68f21d410e672378881c954b9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=501)
 
      
      <details>
      <summary>Bot commands</summary>
        The @flinkbot bot supports the following commands:
      
       - `@flinkbot run travis` re-run the last Travis build
       - `@flinkbot run azure` re-run the last Azure build
      </details>
   
   
   -- 
   This is an automated message from the Apache Git Service.
   To respond to the message, please log on to GitHub and use the
   URL above to go to the specific comment.
   
   To unsubscribe, e-mail: [email protected]
   
   For queries about this service, please contact Infrastructure at:
   [email protected]
   ;;;
   
   ---
   
   09/Aug/21 04:21;githubbot;hudi-bot edited a comment on pull request #2496:
   URL: https://github.com/apache/hudi/pull/2496#issuecomment-869762023
   
   
      <!--
      Meta data
      {
        "version" : 1,
        "metaDataEntries" : [ {
          "hash" : "ba72d3ee9f569bc68f21d410e672378881c954b9",
          "status" : "FAILURE",
          "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=501",
          "triggerID" : "ba72d3ee9f569bc68f21d410e672378881c954b9",
          "triggerType" : "PUSH"
        } ]
      }-->
      ## CI report:
      
      * ba72d3ee9f569bc68f21d410e672378881c954b9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=501)
 
      
      <details>
      <summary>Bot commands</summary>
        @hudi-bot supports the following commands:
      
       - `@hudi-bot run travis` re-run the last Travis build
       - `@hudi-bot run azure` re-run the last Azure build
      </details>
   
   
   ;;;

