hudi-bot opened a new issue, #14728:
URL: https://github.com/apache/hudi/issues/14728
Input and output streams created in HUDI through calls to
HoodieWrapperFileSystem are not buffered unless the underlying file
system implements buffering.
DistributedFileSystem (over HDFS) does not implement any buffering. This
leads to a very large number of small IO calls being sent to HDFS while
performing HUDI IO operations such as reading parquet, writing parquet,
reading/writing log files, reading/writing instants, etc.
This patch introduces buffering at the HoodieWrapperFileSystem level so that
all types of reads and writes benefit from buffering.
In my tests at scale on HDFS, writing 1 million records into a parquet
file (read from an existing parquet file in the same dataset), I observed the
following benefits:
1. About 40% reduction in total time to run the test
2. Total write calls to HDFS reduced from 19.1M to 328
3. Total read calls reduced from 229M to 515K
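
To illustrate why wrapping a stream in a buffer reduces the number of calls reaching the file system, here is a minimal, self-contained sketch. It is not the code from the patch: `CountingOutputStream` is a hypothetical stand-in for an unbuffered file-system stream, and the record size and the 64 KiB buffer size are assumptions chosen only for the demonstration.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;

class BufferingDemo {

    /** Hypothetical stand-in for an unbuffered file-system stream; counts the calls that reach it. */
    static final class CountingOutputStream extends OutputStream {
        long calls = 0;
        long bytes = 0;

        @Override
        public void write(int b) {
            calls++;
            bytes++;
        }

        @Override
        public void write(byte[] buf, int off, int len) {
            calls++;
            bytes += len;
        }
    }

    /**
     * Writes `records` small records and returns how many write calls reached the sink.
     * A bufferSize of 0 means the sink is written to directly (no buffering).
     */
    static long countSinkCalls(int records, int recordSize, int bufferSize) throws IOException {
        CountingOutputStream sink = new CountingOutputStream();
        OutputStream out = bufferSize > 0 ? new BufferedOutputStream(sink, bufferSize) : sink;
        byte[] record = new byte[recordSize];
        for (int i = 0; i < records; i++) {
            out.write(record, 0, record.length);
        }
        out.flush(); // push any bytes still held in the buffer down to the sink
        return sink.calls;
    }

    public static void main(String[] args) throws IOException {
        // Assumed workload: 100,000 records of 16 bytes each.
        System.out.println("unbuffered sink calls: " + countSinkCalls(100_000, 16, 0));
        System.out.println("buffered sink calls:   " + countSinkCalls(100_000, 16, 64 * 1024));
    }
}
```

Without the buffer, every small write becomes one call on the sink; with the buffer, the sink only sees a call each time the buffer fills or is flushed. The same idea applies to reads with `BufferedInputStream`.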
## JIRA info
- Link: https://issues.apache.org/jira/browse/HUDI-1554
- Type: Improvement
---
## Comments
08/Aug/21 20:17;githubbot;hudi-bot edited a comment on pull request #2496:
URL: https://github.com/apache/hudi/pull/2496#issuecomment-869762023
<!--
Meta data
{
"version" : 1,
"metaDataEntries" : [ {
"hash" : "ba72d3ee9f569bc68f21d410e672378881c954b9",
"status" : "FAILURE",
"url" :
"https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=501",
"triggerID" : "ba72d3ee9f569bc68f21d410e672378881c954b9",
"triggerType" : "PUSH"
} ]
}-->
## CI report:
* ba72d3ee9f569bc68f21d410e672378881c954b9 Azure:
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=501)
<details>
<summary>Bot commands</summary>
The @flinkbot bot supports the following commands:
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
</details>
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
;;;
---
09/Aug/21 04:21;githubbot;hudi-bot edited a comment on pull request #2496:
URL: https://github.com/apache/hudi/pull/2496#issuecomment-869762023
<!--
Meta data
{
"version" : 1,
"metaDataEntries" : [ {
"hash" : "ba72d3ee9f569bc68f21d410e672378881c954b9",
"status" : "FAILURE",
"url" :
"https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=501",
"triggerID" : "ba72d3ee9f569bc68f21d410e672378881c954b9",
"triggerType" : "PUSH"
} ]
}-->
## CI report:
* ba72d3ee9f569bc68f21d410e672378881c954b9 Azure:
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=501)
<details>
<summary>Bot commands</summary>
@hudi-bot supports the following commands:
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
</details>
;;;