Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]

2024-02-09 Thread via GitHub
parisni commented on issue #10511: URL: https://github.com/apache/hudi/issues/10511#issuecomment-1936917209 Hi @bk-mz, thanks for your interest in the Parquet bloom filter. We have [open documentation](https://github.com/apache/hudi/pull/9056/files) about bloom filters, which states: >

Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]

2024-01-31 Thread via GitHub
bhasudha commented on issue #10511: URL: https://github.com/apache/hudi/issues/10511#issuecomment-1920095297 Hi @bk-mz. Wanted to add to this thread. Query latency may not be the only metric to measure, as explained in the comments above. The runs with Parquet native bloom filters

Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]

2024-01-31 Thread via GitHub
bk-mz commented on issue #10511: URL: https://github.com/apache/hudi/issues/10511#issuecomment-1919010119 >when the number of output rows with bloom is clearly a lot less than the number of output rows without bloom. @ad1happy2go The query performance is the same for both RO and snapshot

Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]

2024-01-31 Thread via GitHub
ad1happy2go commented on issue #10511: URL: https://github.com/apache/hudi/issues/10511#issuecomment-1918965640 @bk-mz Why do you think "indexing and statistical means in hudi are ineffective" when the number of output rows with bloom is clearly a lot less than the number of output rows without

Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]

2024-01-24 Thread via GitHub
bk-mz commented on issue #10511: URL: https://github.com/apache/hudi/issues/10511#issuecomment-1908210748 >What do you think about, TBH a bit of mixed emotions here. With 0.14 there is practically no way of understanding how indexing or statistical means are affecting queries

Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]

2024-01-22 Thread via GitHub
KnightChess commented on issue #10511: URL: https://github.com/apache/hudi/issues/10511#issuecomment-1905205987 A variety of factors can lead to differences in query time, like I/O, CPU, disk load...; in Spark, like parallelism, executor expansion time...; in hudi,

Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]

2024-01-22 Thread via GitHub
bk-mz commented on issue #10511: URL: https://github.com/apache/hudi/issues/10511#issuecomment-1904742903 I.e. it's caused by a RO reader just reading different files? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use

Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]

2024-01-22 Thread via GitHub
bk-mz commented on issue #10511: URL: https://github.com/apache/hudi/issues/10511#issuecomment-1904742187 How can we verify that the difference is not caused by the read-optimized and snapshot paths, excluding any bloom filters on indexes?

Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]

2024-01-22 Thread via GitHub
KnightChess commented on issue #10511: URL: https://github.com/apache/hudi/issues/10511#issuecomment-1904082576 @bk-mz yes, according to the indicators, it is effective

Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]

2024-01-22 Thread via GitHub
bk-mz commented on issue #10511: URL: https://github.com/apache/hudi/issues/10511#issuecomment-1904051040 @KnightChess Did I understand you correctly, you are claiming that bloom filters actually work correctly?

Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]

2024-01-22 Thread via GitHub
bk-mz commented on issue #10511: URL: https://github.com/apache/hudi/issues/10511#issuecomment-1904045416 ```WholeStageCodegen (1) duration: total (min, med, max )13.4 m (79 ms, 1.5 s, 3.4 s )``` for snapshot. ```WholeStageCodegen (1) duration: total (min, med, max )6.5 m (249 ms, 552

Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]

2024-01-22 Thread via GitHub
KnightChess commented on issue #10511: URL: https://github.com/apache/hudi/issues/10511#issuecomment-1903609477 We can only analyse the scan RDD. A query spends time in many different places, so I think the result is normal.

Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]

2024-01-22 Thread via GitHub
KnightChess commented on issue #10511: URL: https://github.com/apache/hudi/issues/10511#issuecomment-1903579710 ![image](https://github.com/apache/hudi/assets/20125927/2dd2b745-96b2-464d-8541-1119197bed48)

Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]

2024-01-22 Thread via GitHub
KnightChess commented on issue #10511: URL: https://github.com/apache/hudi/issues/10511#issuecomment-1903579464 @bk-mz can you see the time cost at this point?

Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]

2024-01-22 Thread via GitHub
bk-mz commented on issue #10511: URL: https://github.com/apache/hudi/issues/10511#issuecomment-1903567882 For snapshot: 441,483,112, query time 28141 ms. For read-optimized: 22,887,045, query time 26054 ms.
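To put the two runs side by side, here is a small arithmetic sketch. It assumes the two large figures are the scan's number of output rows, which this snippet does not state explicitly; the variable names are illustrative only.

```python
# Figures quoted in the comment above. Treating the large numbers as the
# scan's "number of output rows" is an assumption, not something the
# snippet itself labels.
snapshot_rows, snapshot_ms = 441_483_112, 28_141
read_optimized_rows, read_optimized_ms = 22_887_045, 26_054

# How many times fewer rows the read-optimized scan emitted.
row_ratio = snapshot_rows / read_optimized_rows

# Fraction of wall-clock time saved by the read-optimized query.
time_saving = (snapshot_ms - read_optimized_ms) / snapshot_ms

print(f"rows: {row_ratio:.1f}x fewer, time: {time_saving:.1%} less")
```

Under that assumption the read-optimized scan emits roughly 19 times fewer rows while the end-to-end query is only about 7% faster, which is why the row counts and the query times are discussed separately in this thread.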

Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]

2024-01-21 Thread via GitHub
KnightChess commented on issue #10511: URL: https://github.com/apache/hudi/issues/10511#issuecomment-1902623887 @bk-mz you can see the scan RDD's `number of output rows` in the SQL tab of the Spark UI.

Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]

2024-01-19 Thread via GitHub
bk-mz commented on issue #10511: URL: https://github.com/apache/hudi/issues/10511#issuecomment-1900238202 Sure, but anything specific you want to see?

Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]

2024-01-18 Thread via GitHub
KnightChess commented on issue #10511: URL: https://github.com/apache/hudi/issues/10511#issuecomment-1898309708 @bk-mz the operating system cache may also have an impact. Can you provide detailed metrics from the Spark UI?

Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]

2024-01-18 Thread via GitHub
bk-mz commented on issue #10511: URL: https://github.com/apache/hudi/issues/10511#issuecomment-1898187110 ```scala> spark.time({ | val df = spark.read | .format("org.apache.hudi") | .option("hoodie.datasource.query.type", "read_optimized") |

Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]

2024-01-17 Thread via GitHub
KnightChess commented on issue #10511: URL: https://github.com/apache/hudi/issues/10511#issuecomment-1897674210 @bk-mz yes, `set hoodie.datasource.query.type = read_optimized`
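For reference, the same setting can also be passed as a reader option from PySpark. The sketch below only builds the option map. `hudi_read_options` is a hypothetical helper, the tuple of query types is my understanding of the values Hudi documents for `hoodie.datasource.query.type`, and the commented-out read assumes an existing `spark` session and a placeholder table path.

```python
QUERY_TYPE_KEY = "hoodie.datasource.query.type"

# Values assumed from Hudi's documented query types.
VALID_QUERY_TYPES = ("snapshot", "read_optimized", "incremental")


def hudi_read_options(query_type: str) -> dict:
    """Hypothetical helper: build reader options for a given query type."""
    if query_type not in VALID_QUERY_TYPES:
        raise ValueError(f"unknown query type: {query_type!r}")
    return {QUERY_TYPE_KEY: query_type}


opts = hudi_read_options("read_optimized")
print(opts)

# With a live session (not run here; the path is a placeholder):
# df = spark.read.format("hudi").options(**opts).load("/path/to/table")
```

Building the options once and reusing them for both the `snapshot` and `read_optimized` runs keeps the two queries identical apart from the query type.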

Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]

2024-01-17 Thread via GitHub
bk-mz commented on issue #10511: URL: https://github.com/apache/hudi/issues/10511#issuecomment-1896670833 >mor read_optimized can use it. Can I set spark-sql to use read_optimized to test it out?

Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]

2024-01-17 Thread via GitHub
KnightChess commented on issue #10511: URL: https://github.com/apache/hudi/issues/10511#issuecomment-1895858498 @bk-mz yes, MOR does not support the Parquet native bloom filter: log files are merged on read, so the native bloom filter is not up to date and is not accurate. Only `cow` or `mor
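The reasoning in this comment can be illustrated with a toy model. This is not Hudi's or Parquet's implementation: `TinyBloom`, the dictionaries standing in for a base file and a log file, and the sample values are all hypothetical, and the filter parameters are arbitrary.

```python
import hashlib


class TinyBloom:
    """Minimal Bloom filter: false positives possible, false negatives not."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, value):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(value)
        )


# Base file: record key -> value of the filtered column. Its filter is
# built once, when the base file is written.
base_file = {"r1": "alice", "r2": "bob"}
base_bloom = TinyBloom()
for value in base_file.values():
    base_bloom.add(value)

# A later update lands in a log file and is merged only at read time.
log_file = {"r2": "carol"}
snapshot_view = {**base_file, **log_file}

wanted = "carol"
print("in merged snapshot view:", wanted in snapshot_view.values())
# Barring a rare false positive this prints False: the base file's filter
# predates the update, so skipping the file on its answer would lose "r2".
print("base-file filter might contain:", base_bloom.might_contain(wanted))
```

A read-optimized query reads only the base file, so in this model the filter and the data it describes stay consistent; a snapshot query merges the log file, and the stale filter can no longer be trusted for pruning.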

[I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]

2024-01-16 Thread via GitHub
bk-mz opened a new issue, #10511: URL: https://github.com/apache/hudi/issues/10511 **Describe the problem you faced** We encountered an issue with a MOR table that uses metadata bloom filters and Parquet bloom filters and has statistics enabled. When attempting to query data, the