Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]
parisni commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1936917209

Hi @bk-mz, thanks for the interest in Parquet bloom filters. We have [open documentation](https://github.com/apache/hudi/pull/9056/files) about bloom filters, which states:

> So bloom would be useful in either case (at the parquet file level) :
> - the column has no duplicates
> - the column number of unique values is more than 40k

If your column does not fall into either case, a Parquet bloom filter would only add overhead and slow down a given query.

There are also [benchmarks on the Spark side](https://github.com/apache/spark/blob/master/sql/core/benchmarks/BloomFilterBenchmark-results.txt) that could be of interest.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
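
For reference, Parquet native bloom filters are switched on per column through writer options. The following is a minimal sketch, assuming Spark 3.2+ (Parquet 1.12) and a plain Parquet write without Hudi. The helper name `bloomFilterOptions`, the column name, the expected number of distinct values, and the commented-out output path are all hypothetical.

```scala
// Builds per-column Parquet bloom filter writer options.
// The keys are parquet-mr writer settings; the "#<column>" suffix
// scopes each setting to a single column.
def bloomFilterOptions(column: String, expectedNdv: Long): Map[String, String] = Map(
  s"parquet.bloom.filter.enabled#$column" -> "true",
  s"parquet.bloom.filter.expected.ndv#$column" -> expectedNdv.toString
)

// Hypothetical column and distinct-value estimate, for illustration only.
val opts = bloomFilterOptions("account_id", 1000000L)
opts.foreach { case (key, value) => println(s"$key=$value") }

// In spark-shell the map would be handed to the DataFrame writer, e.g.:
// df.write.options(opts).parquet("/tmp/with_bloom")   // hypothetical path
```

Writing the same data once with and once without these options gives the two datasets needed for a like-for-like comparison.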
Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]
bhasudha commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1920095297

Hi @bk-mz. I wanted to add to this thread. As explained in the comments above, query latency may not be the only metric worth measuring. When runs with Parquet native bloom filters enabled still take a similar amount of time, a few factors could dominate: every file still has to be opened to load its Parquet native bloom filter, S3 throttling, etc.

One way I would test this is to remove Hudi from the picture: take the same Parquet dataset and run the query with and without Parquet native bloom filters enabled. You should see the number of output rows drop, but the query time may not improve much, because each file still has to be opened to read its bloom filter.

The column stats index in Hudi's metadata table helps reduce the number of files scanned (unlike Parquet native bloom filters). With data skipping enabled, Hudi uses the column stats stored in the metadata table instead of scanning the metadata in each Parquet file, so it can plan the query from those stats and the predicates and read fewer files when possible (see this [blog](https://www.onehouse.ai/blog/hudis-column-stats-index-and-data-skipping-feature-help-speed-up-queries-by-an-orders-of-magnitude) for more details on data skipping in Hudi). This is particularly helpful on cloud storage, where requests carry a constant overhead and are subject to rate limiting.

You bring valid feedback that we will take and work on: better showcasing the impact of these indexes so that users can easily spot it. We will update you shortly on how we are incorporating this.
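
The data skipping described above is controlled on the read side. A minimal sketch of the relevant reader options follows, assuming a Hudi table whose metadata table and column stats index were already built at write time. The name `dataSkippingOptions` and the `basePath` in the commented-out usage are hypothetical.

```scala
// Read-side options that let Hudi prune files using the column stats
// kept in the metadata table, instead of opening every Parquet footer.
val dataSkippingOptions: Map[String, String] = Map(
  "hoodie.metadata.enable" -> "true",
  "hoodie.enable.data.skipping" -> "true"
)

dataSkippingOptions.foreach { case (key, value) => println(s"$key=$value") }

// In spark-shell the map would be handed to the reader, e.g.:
// val df = spark.read.format("hudi").options(dataSkippingOptions).load(basePath)
```

Running the same filter with and without these options, and comparing the number of files read in the Spark UI, is one way to check whether pruning takes effect.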
Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]
bk-mz commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1919010119

>when number of output rows with bloom is clearly lot less than number of output rows without bloom.

@ad1happy2go Query performance is the same for both the read-optimized and snapshot cases, which is why I'm making that statement. One number simply being smaller than the other is cryptic.

>You can also try column stats indexing also in this case.

As you can see, they are enabled:

```
hoodie.metadata.index.bloom.filter.column.list=id,account_id
hoodie.metadata.index.bloom.filter.enable=true
hoodie.metadata.index.column.stats.column.list=id,account_id
hoodie.metadata.index.column.stats.enable=true
```

My concern with Hudi, and in this ticket specifically, is that today Hudi does not let you introspect and figure out whether any statistical or indexing solution is actually improving performance.

We can't tie Hudi configurations to actual results; they are not logically connected, as the queries above show.

I.e. I can't say "OK, I removed that configuration and my query started to lag", nor, vice versa, "I added that column to the statistics config and my queries are faster now", because there are no metrics or practical evidence anywhere to help understand the cause.
Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]
ad1happy2go commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1918965640

@bk-mz Why do you think "indexing and statistical means in hudi are ineffective" when the number of output rows with bloom is clearly a lot lower than the number of output rows without bloom? You can also try column stats indexing in this case. That will optimise your read queries.
Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]
bk-mz commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1908210748

>What do you think about,

TBH, a bit of mixed emotions here. With 0.14 there is practically no way to understand how indexing or statistical means affect queries, apart from the "output number of rows" of a Spark SQL DataFrame query: are they used at all and, if they are, how effectively?

This issue could be closed. From our end we'll move on with the assumption that indexing and statistical means in Hudi are ineffective, though we'd still enable them on our critical fields in case future Hudi releases implement performance improvements.
Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]
KnightChess commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1905205987

A variety of factors can lead to differences in query time: IO, CPU, disk load, and so on; in Spark, parallelism, executor scale-up time, and so on. In Hudi, snapshot reads should in theory be slower than read-optimized reads, and the two use different readers for different files (ro: base files; rt: base + log files). There is another question: will a Parquet file with a bloom filter be read faster than one without? I don't think that is certain; you need to look at its actual effect in production. In a Spark query, a difference of 2 s cannot explain the slowness.

What do you think? This is just my limited understanding; maybe others have a better opinion.
Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]
bk-mz commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1904742903

I.e. it's caused by a RO reader just reading different files?
Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]
bk-mz commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1904742187

How can we verify that the difference is not caused by the read-optimized and snapshot paths themselves, leaving aside any bloom filters or indexes?
Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]
KnightChess commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1904082576

@bk-mz yes, according to the indicators, it is effective
Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]
bk-mz commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1904051040

@KnightChess Did I understand you correctly: are you claiming that bloom filters actually work correctly?
Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]
bk-mz commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1904045416

```
WholeStageCodegen (1) duration: total (min, med, max )13.4 m (79 ms, 1.5 s, 3.4 s )
```

for snapshot.

```
WholeStageCodegen (1) duration: total (min, med, max )6.5 m (249 ms, 552 ms, 5.9 s )
```

for read-optimized
Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]
KnightChess commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1903609477

We can only analyse the scan RDD. A query spends time in many different places. I think the result is normal.
Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]
KnightChess commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1903579710

![image](https://github.com/apache/hudi/assets/20125927/2dd2b745-96b2-464d-8541-1119197bed48)
Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]
KnightChess commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1903579464

@bk-mz can you see the time spent at this point?
Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]
bk-mz commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1903567882

For snapshot: 441,483,112 output rows, query time 28141 ms.

For read-optimized: 22,887,045 output rows, query time 26054 ms.

![read-optimized](https://github.com/apache/hudi/assets/892781/d61438ac-3792-4217-9b79-23783128def1)

![snapshot](https://github.com/apache/hudi/assets/892781/3d8d3326-8eb6-4a3a-88a7-0b46d27405e7)

```
scala> spark.time({
     |   val df = spark.read
     |     .format("org.apache.hudi")
     |     .option("hoodie.datasource.query.type", "read_optimized")
     |     .load("s3://com.twilio.messaging.mdp.datalake.tables.mdp.pffm.pii.temp/table/")
     |
     |   val count = df.filter(
     |     (df("year") === 2024) &&
     |     (df("month") === 1) &&
     |     (df("day") === 16) &&
     |     (df("account_sid") === "AC5aa523aa6e1271134f9adda35cd08c7c")
     |   ).count()
     |
     |   println(s"Count: $count")
     | })
24/01/22 09:05:38 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
Count: 47
Time taken: 26054 ms
```

```
scala> spark.time({
     |   val df = spark.read
     |     .format("org.apache.hudi")
     |     .option("hoodie.datasource.query.type", "snapshot")
     |     .load("s3://com.twilio.messaging.mdp.datalake.tables.mdp.pffm.pii.temp/table/")
     |
     |   val count = df.filter(
     |     (df("year") === 2024) &&
     |     (df("month") === 1) &&
     |     (df("day") === 16) &&
     |     (df("account_sid") === "AC5aa523aa6e1271134f9adda35cd08c7c")
     |   ).count()
     |
     |   println(s"Count: $count")
     | })
24/01/22 09:09:03 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
Count: 47
Time taken: 28141 ms
```

Okay, your point stands: the numbers of output rows are indeed different. Though, how can we explain the same query times?
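
To put the figures reported in this comment side by side, the arithmetic can be worked through in a few lines of plain Scala. This is only a sketch: it restates the four reported numbers and says nothing about the cause of the gap, which may come from the different readers as much as from any filter.

```scala
// Figures reported in the comment above (scan output rows and wall-clock time).
val snapshotRows = 441483112L
val readOptimizedRows = 22887045L
val snapshotMs = 28141L
val readOptimizedMs = 26054L

// How many times fewer rows the read-optimized scan emitted (about 19.3x).
val rowReduction = snapshotRows.toDouble / readOptimizedRows

// Fraction of wall-clock time saved by the read-optimized run (about 0.074).
val timeSaved = (snapshotMs - readOptimizedMs).toDouble / snapshotMs
val timeSavedPct = timeSaved * 100

println(f"scan output rows: $rowReduction%.1fx fewer")
println(f"wall-clock time: $timeSavedPct%.1f%% lower")
```

A roughly 19-fold drop in scan output rows against a roughly 7% drop in elapsed time is consistent with the runtime being dominated by costs that do not scale with rows emitted, such as opening files on S3.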
Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]
KnightChess commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1902623887

@bk-mz you can see the scan RDD's `number of output rows` in the SQL tab of the Spark UI.
Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]
bk-mz commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1900238202

Sure, but anything specific you want to see?
Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]
KnightChess commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1898309708

@bk-mz the operating system cache may also have an impact. Can you provide detailed metrics from the Spark UI?
Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]
bk-mz commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1898187110

```
scala> spark.time({
     |   val df = spark.read
     |     .format("org.apache.hudi")
     |     .option("hoodie.datasource.query.type", "read_optimized")
     |     .load("s3://path/table/")
     |
     |   val count = df.filter(
     |     (df("year") === 2024) &&
     |     (df("month") === 1) &&
     |     (df("day") === 16) &&
     |     (df("account_id") === "id1")
     |   ).count()
     |
     |   println(s"Count: $count")
     | })
Count: 47
Time taken: 30477 ms
```

```
scala> spark.time({
     |   val df = spark.read
     |     .format("org.apache.hudi")
     |     .option("hoodie.datasource.query.type", "snapshot")
     |     .load("s3://path/table/")
     |
     |   val count = df.filter(
     |     (df("year") === 2024) &&
     |     (df("month") === 1) &&
     |     (df("day") === 16) &&
     |     (df("account_sid") === "id1")
     |   ).count()
     |
     |   println(s"Count: $count")
     | })
24/01/18 10:06:51 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
Count: 47
Time taken: 22594 ms
```

For a clean experiment, I created two separate sessions for the queries above.

It's just super confusing, as it contradicts the logic: `read_optimized` actually takes more time to load the same data than `snapshot` does.

Can we say for sure that Parquet native bloom filters are simply not effective for Hudi?
Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]
KnightChess commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1897674210

@bk-mz yes, `set hoodie.datasource.query.type = read_optimized`
Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]
bk-mz commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1896670833

>mor read_optimized can use it.

Can I set spark-sql to use read_optimized to test it out?
Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]
KnightChess commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1895858498

@bk-mz yes, MOR does not support the Parquet native bloom filter: log files are merged on read, so the native bloom filter does not reflect the latest data and is not accurate. Only `cow` or `mor read_optimized` can use it. Also, in version 0.14.0, the bloom filter in Hudi is only used on the write path to tag records.