Re: [PR] [HUDI-7278] make bloom filter skippable for CPU saving [hudi]

2024-01-14 Thread via GitHub


waitingF commented on PR #10457:
URL: https://github.com/apache/hudi/pull/10457#issuecomment-1891361011

   > Guess it is valuable for insert only table.
   
   Nope. In theory, it benefits all non-bloom-index scenarios, including 
insert, upsert, and delete. Any of these operations may write Parquet files, 
and the bloom filter is skipped whenever they do.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7278] make bloom filter skippable for CPU saving [hudi]

2024-01-12 Thread via GitHub


danny0405 commented on PR #10457:
URL: https://github.com/apache/hudi/pull/10457#issuecomment-1888757163

   Guess it is valuable for insert only table.





Re: [PR] [HUDI-7278] make bloom filter skippable for CPU saving [hudi]

2024-01-12 Thread via GitHub


bvaradar merged PR #10457:
URL: https://github.com/apache/hudi/pull/10457





Re: [PR] [HUDI-7278] make bloom filter skippable for CPU saving [hudi]

2024-01-09 Thread via GitHub


hudi-bot commented on PR #10457:
URL: https://github.com/apache/hudi/pull/10457#issuecomment-1884319262

   
   ## CI report:
   
   * af71b6b0adf5722b58b941ad129f685f1242a808 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21898)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7278] make bloom filter skippable for CPU saving [hudi]

2024-01-09 Thread via GitHub


hudi-bot commented on PR #10457:
URL: https://github.com/apache/hudi/pull/10457#issuecomment-1884141208

   
   ## CI report:
   
   * 7c668bbb0b7cafeb9b6c4d302d6154c91beb366e Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21859)
   * af71b6b0adf5722b58b941ad129f685f1242a808 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21898)
   
   





Re: [PR] [HUDI-7278] make bloom filter skippable for CPU saving [hudi]

2024-01-09 Thread via GitHub


hudi-bot commented on PR #10457:
URL: https://github.com/apache/hudi/pull/10457#issuecomment-1884135931

   
   ## CI report:
   
   * 7c668bbb0b7cafeb9b6c4d302d6154c91beb366e Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21859)
   * af71b6b0adf5722b58b941ad129f685f1242a808 UNKNOWN
   
   





Re: [PR] [HUDI-7278] make bloom filter skippable for CPU saving [hudi]

2024-01-08 Thread via GitHub


hudi-bot commented on PR #10457:
URL: https://github.com/apache/hudi/pull/10457#issuecomment-1880899016

   
   ## CI report:
   
   * 7c668bbb0b7cafeb9b6c4d302d6154c91beb366e Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21859)
   
   





Re: [PR] [HUDI-7278] make bloom filter skippable for CPU saving [hudi]

2024-01-08 Thread via GitHub


waitingF commented on code in PR #10457:
URL: https://github.com/apache/hudi/pull/10457#discussion_r135982


##
hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieAvroFileWriterFactory.java:
##
@@ -51,7 +51,7 @@ protected HoodieFileWriter newParquetFileWriter(
       String instantTime, Path path, Configuration conf, HoodieConfig config, Schema schema,
       TaskContextSupplier taskContextSupplier) throws IOException {
     boolean populateMetaFields = config.getBooleanOrDefault(HoodieTableConfig.POPULATE_META_FIELDS);
-    boolean enableBloomFilter = populateMetaFields;
+    boolean enableBloomFilter = populateMetaFields && config.getBooleanOrDefault(HoodieStorageConfig.PARQUET_WITH_BLOOM_FILTER_ENABLED);

Review Comment:
   Nice advice, will adjust






Re: [PR] [HUDI-7278] make bloom filter skippable for CPU saving [hudi]

2024-01-08 Thread via GitHub


voonhous commented on code in PR #10457:
URL: https://github.com/apache/hudi/pull/10457#discussion_r119915


##
hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieAvroFileWriterFactory.java:
##
@@ -51,7 +51,7 @@ protected HoodieFileWriter newParquetFileWriter(
       String instantTime, Path path, Configuration conf, HoodieConfig config, Schema schema,
       TaskContextSupplier taskContextSupplier) throws IOException {
     boolean populateMetaFields = config.getBooleanOrDefault(HoodieTableConfig.POPULATE_META_FIELDS);
-    boolean enableBloomFilter = populateMetaFields;
+    boolean enableBloomFilter = populateMetaFields && config.getBooleanOrDefault(HoodieStorageConfig.PARQUET_WITH_BLOOM_FILTER_ENABLED);

Review Comment:
   Is it possible to add a check here to ensure that the user is not also 
using the Bloom index?
   
   Maybe put it into a function so that it is visible and can be used in unit tests.
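
One way to read the suggestion above is sketched below. This is not the merged change: the class, the enum, and the rule "keep the bloom filter whenever a bloom-based index is configured" are assumptions made for illustration. Only the `populateMetaFields && configEnabled` part comes from the diff hunk under review.

```java
// Hypothetical helper, not Hudi code: isolates the decision so a unit test can call it directly.
final class BloomFilterDecision {

    // Hypothetical stand-in for the configured index type; only bloom vs. non-bloom matters here.
    enum IndexKind { BLOOM, GLOBAL_BLOOM, BUCKET, SIMPLE }

    private BloomFilterDecision() {
    }

    static boolean usesBloomIndex(IndexKind kind) {
        return kind == IndexKind.BLOOM || kind == IndexKind.GLOBAL_BLOOM;
    }

    /**
     * Decides whether the Parquet writer should build a bloom filter.
     * Without meta fields no filter is written, as in the reviewed diff.
     * Assumption: a bloom-based index overrides a disabled flag, because
     * that index needs the filter at lookup time.
     */
    static boolean enableBloomFilter(boolean populateMetaFields,
                                     boolean bloomFilterConfigEnabled,
                                     IndexKind indexKind) {
        if (!populateMetaFields) {
            return false;
        }
        return bloomFilterConfigEnabled || usesBloomIndex(indexKind);
    }

    public static void main(String[] args) {
        for (IndexKind kind : IndexKind.values()) {
            System.out.println(kind + " with flag disabled -> "
                + enableBloomFilter(true, false, kind));
        }
    }
}
```

An alternative design would reject the combination (flag disabled plus Bloom index) with an exception instead of silently overriding the flag; which behavior fits better is a question for the PR discussion.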






Re: [PR] [HUDI-7278] make bloom filter skippable for CPU saving [hudi]

2024-01-08 Thread via GitHub


hudi-bot commented on PR #10457:
URL: https://github.com/apache/hudi/pull/10457#issuecomment-1880718436

   
   ## CI report:
   
   * 7c668bbb0b7cafeb9b6c4d302d6154c91beb366e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21859)
   
   





Re: [PR] [HUDI-7278] make bloom filter skippable for CPU saving [hudi]

2024-01-08 Thread via GitHub


hudi-bot commented on PR #10457:
URL: https://github.com/apache/hudi/pull/10457#issuecomment-1880652695

   
   ## CI report:
   
   * 7c668bbb0b7cafeb9b6c4d302d6154c91beb366e UNKNOWN
   
   





[PR] [HUDI-7278] make bloom filter skippable for CPU saving [hudi]

2024-01-08 Thread via GitHub


waitingF opened a new pull request, #10457:
URL: https://github.com/apache/hudi/pull/10457

   ### Change Logs
   
   When ingesting with Parquet, Hudi writes the bloom filter even when we use 
the bucket index, which never reads it.
   
   We can save about 9% CPU by skipping the bloom filter in those cases.
   
   
![image](https://github.com/apache/hudi/assets/19326824/63dda2ac-23fe-407b-8734-9987e4a27599)
   
   
   ### Impact
   
   Saves CPU when writing the bloom filter is skipped.
   
   ### Risk level (write none, low medium or high below)
   
   low
   
   ### Documentation Update
   
   - Adds a new config, `hoodie.parquet.bloom.filter.enabled` (default 
`true`). Set it to `false` to save CPU when a non-bloom index is used.
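
A minimal sketch of how the new option might be set alongside a bucket index. The key `hoodie.parquet.bloom.filter.enabled` and its default come from this description; the `hoodie.index.type` key, the `BUCKET` value, and the class and method names are assumptions for illustration and are not taken from this PR.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical holder for write options; it does not call any Hudi API.
final class BloomFilterWriteOptions {

    // Key and default as stated in this PR description.
    static final String BLOOM_FILTER_ENABLED_KEY = "hoodie.parquet.bloom.filter.enabled";
    static final boolean BLOOM_FILTER_ENABLED_DEFAULT = true;

    // Assumption: the index type is chosen through this key; it is not mentioned in this thread.
    static final String INDEX_TYPE_KEY = "hoodie.index.type";

    private BloomFilterWriteOptions() {
    }

    // Builds options for a writer that uses a bucket index and skips the bloom filter.
    static Map<String, String> bucketIndexOptionsWithoutBloomFilter() {
        Map<String, String> options = new LinkedHashMap<>();
        options.put(INDEX_TYPE_KEY, "BUCKET");
        options.put(BLOOM_FILTER_ENABLED_KEY, "false");
        return options;
    }

    // Reads the flag, falling back to the documented default when the key is absent.
    static boolean isBloomFilterEnabled(Map<String, String> options) {
        String raw = options.get(BLOOM_FILTER_ENABLED_KEY);
        return raw == null ? BLOOM_FILTER_ENABLED_DEFAULT : Boolean.parseBoolean(raw);
    }

    public static void main(String[] args) {
        Map<String, String> options = bucketIndexOptionsWithoutBloomFilter();
        options.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

The resulting map would typically be handed to whatever writer API is in use (for example as data source options); that wiring is left out because it depends on the engine.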
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   

