parisni commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1936917209
Hi @bk-mz, thanks for your interest in the Parquet bloom filter. We have [an
open documentation PR](https://github.com/apache/hudi/pull/9056/files) about
bloom filters, which states:
>
bhasudha commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1920095297
Hi @bk-mz. I wanted to add to this thread. As explained in the comments
above, query latency may not be the only metric to measure. The runs with
Parquet native bloom filters
bk-mz commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1919010119
>when the number of output rows with bloom is clearly a lot less than the
number of output rows without bloom.
@ad1happy2go
The query performance is the same for both read-optimized (ro) and snapshot
ad1happy2go commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1918965640
@bk-mz Why do you think "indexing and statistical means in hudi are
ineffective" when the number of output rows with bloom is clearly a lot less
than the number of output rows without
bk-mz commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1908210748
>What do you think about,
TBH a bit of mixed emotions here.
With 0.14, there is practically no way of understanding how indexing or
statistical means are affecting queries
KnightChess commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1905205987
A variety of factors can lead to the difference in query time, such as IO,
CPU, and disk load...; in Spark, parallelism and the time to scale up
executors...; in Hudi,
bk-mz commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1904742903
I.e. it's caused by a RO reader just reading different files?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use
bk-mz commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1904742187
How can we clarify that the difference is not caused by the read-optimized
and snapshot paths, excluding any bloom filters on indexes?
KnightChess commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1904082576
@bk-mz yes, according to the indicators, it is effective
bk-mz commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1904051040
@KnightChess Did I understand you correctly, you are claiming that bloom
filters actually work correctly?
bk-mz commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1904045416
`WholeStageCodegen (1) duration: total (min, med, max) 13.4 m (79 ms, 1.5 s,
3.4 s)` for snapshot.
```WholeStageCodegen (1) duration: total (min, med, max )6.5 m (249 ms, 552
KnightChess commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1903609477
We can only analyse the scan RDD. A query spends time in various places, so
I think the result is normal.
KnightChess commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1903579710
![image](https://github.com/apache/hudi/assets/20125927/2dd2b745-96b2-464d-8541-1119197bed48)
KnightChess commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1903579464
@bk-mz can you see the time cost at this point?
bk-mz commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1903567882
for snapshot: 441,483,112, query time 28141ms
for read-optimized: 22,887,045, query time 26054ms.
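The two figures above can be put side by side with a little arithmetic. This is only a sketch: it assumes the large numbers are the scan's `number of output rows` for each query type (the comment itself does not label them), and the variable names are illustrative.

```python
# Figures copied from the comment above; their meaning as "scan output rows"
# is an assumption based on the surrounding discussion.
snapshot_rows = 441_483_112
read_optimized_rows = 22_887_045
snapshot_ms = 28_141
read_optimized_ms = 26_054

# How many times fewer rows the read-optimized scan emitted.
row_ratio = snapshot_rows / read_optimized_rows

# Fraction of wall-clock time saved by the read-optimized query.
latency_saving = (snapshot_ms - read_optimized_ms) / snapshot_ms

print(f"rows: {row_ratio:.1f}x fewer, latency: {latency_saving:.1%} lower")
```

Under that assumption, the read-optimized scan emits roughly 19 times fewer rows while the end-to-end query time drops by only about 7%, which is the gap the thread is debating.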
KnightChess commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1902623887
@bk-mz you can see the scan RDD's `number of output rows` in the SQL tab of
the Spark UI.
bk-mz commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1900238202
Sure, but anything specific you want to see?
KnightChess commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1898309708
@bk-mz The operating system cache may also have an impact. Can you provide
detailed metrics from the Spark UI?
bk-mz commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1898187110
```scala> spark.time({
| val df = spark.read
| .format("org.apache.hudi")
| .option("hoodie.datasource.query.type", "read_optimized")
|
KnightChess commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1897674210
@bk-mz yes, `set hoodie.datasource.query.type = read_optimized`
bk-mz commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1896670833
>mor read_optimized can use it.
Can I set spark-sql to use read_optimized to test it out?
KnightChess commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1895858498
@bk-mz yes, MOR does not support the Parquet native bloom filter: log files
are merged on read, so the native bloom filter is not up to date and is not
accurate. Only `cow` or `mor
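The staleness argument in this comment can be illustrated with a toy model. The sketch below is not Hudi or Parquet code: `TinyBloom`, `base_keys`, `log_keys`, and both lookup functions are hypothetical names, and the filter is a simple hash-based one rather than Parquet's actual bloom filter format. It only shows why a filter built from a base file cannot answer for records that arrived later in a log file.

```python
import hashlib


class TinyBloom:
    """Toy Bloom filter for illustration; not Parquet's implementation."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit set stored in a Python int

    def _positions(self, key):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


# The base file and its filter are written together.
base_keys = ["k1", "k2", "k3"]
base_filter = TinyBloom()
for k in base_keys:
    base_filter.add(k)

# A later write lands in a log file; the base file's filter is not updated.
log_keys = ["k4"]


def lookup_merged(key):
    """Snapshot-style read: consult both base and log records."""
    return key in base_keys or key in log_keys


def lookup_base_filter_only(key):
    """Prune with the stale base-file filter before merging."""
    if not base_filter.might_contain(key):
        return False  # file skipped, so the log record is never seen
    return lookup_merged(key)
```

In this model `lookup_merged("k4")` finds the record, while `lookup_base_filter_only("k4")` will almost certainly skip it, because the filter was built before `k4` existed. A read that uses only base files does not have this mismatch, which is consistent with the point made in the comment.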
bk-mz opened a new issue, #10511:
URL: https://github.com/apache/hudi/issues/10511
**Describe the problem you faced**
We encountered an issue with a MOR table that uses metadata bloom filters
and Parquet bloom filters and has statistics enabled. When attempting to query
data, the