Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]

2024-02-09 Thread via GitHub


parisni commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1936917209

   Hi @bk-mz, thanks for your interest in Parquet bloom filters. We have an open 
[documentation PR](https://github.com/apache/hudi/pull/9056/files) on bloom 
filters, which states:
   
   > So bloom would be useful in either case (at the parquet file level) :
   > - the column has no duplicates
   > - the column number of unique values is more than 40k
   
   If your column meets neither condition, a Parquet bloom filter only adds 
overhead and slows the query down.
   
   There are also [benchmarks on the Spark 
side](https://github.com/apache/spark/blob/master/sql/core/benchmarks/BloomFilterBenchmark-results.txt)
 that could be of interest.
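
   For reference, a minimal sketch of how the 40k threshold quoted above could 
translate into writer settings. The `parquet.bloom.filter.enabled#<column>` and 
`parquet.bloom.filter.expected.ndv#<column>` keys are Parquet's per-column 
writer options; the `ParquetBloomOptions` object, its `forColumn` helper and the 
way the cutoff is applied are hypothetical, chosen only for illustration, and 
are not part of Hudi or Spark.

```scala
object ParquetBloomOptions {
  // Threshold quoted in this thread: at or below ~40k distinct values the
  // filter is expected to add overhead rather than help.
  val MinUsefulDistinctValues: Long = 40000L

  /** Builds per-column Parquet writer options, or an empty map when the
    * estimated distinct count is too low for a bloom filter to pay off.
    * (Hypothetical helper; only the option keys come from Parquet.) */
  def forColumn(column: String, estimatedDistinct: Long): Map[String, String] =
    if (estimatedDistinct <= MinUsefulDistinctValues) Map.empty
    else Map(
      s"parquet.bloom.filter.enabled#$column" -> "true",
      s"parquet.bloom.filter.expected.ndv#$column" -> estimatedDistinct.toString
    )
}
```

   With a plain Spark Parquet write this could be passed as 
`df.write.options(ParquetBloomOptions.forColumn("account_id", 5000000L)).parquet(path)`; 
whether a Hudi writer picks the keys up the same way is not covered here.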


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]

2024-01-31 Thread via GitHub


bhasudha commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1920095297

   Hi @bk-mz. I wanted to add to this thread. As explained in the comments 
above, query latency may not be the only metric to measure. Runs with Parquet 
native bloom filters enabled can still take a similar time because a few 
factors dominate: every file must still be opened to load its Parquet native 
bloom filter, S3 throttling, etc. 
   
   One way I would test this is to remove Hudi from the picture: take the same 
Parquet dataset and run the query with and without the Parquet native bloom 
filter enabled. You should see the number of output rows drop, but the query 
time may not improve much because each file still has to be loaded to read its 
bloom filter. 
   
   The column stats in Hudi's metadata table help reduce the number of 
files scanned (unlike Parquet native bloom filters). With data skipping 
enabled, Hudi uses the column stats stored in the metadata table instead of 
scanning the metadata in each parquet file, so Hudi can better plan the query 
with such stats and the predicates by scanning/reading fewer files when 
possible (see this 
[blog](https://www.onehouse.ai/blog/hudis-column-stats-index-and-data-skipping-feature-help-speed-up-queries-by-an-orders-of-magnitude)
 for more details on data skipping in Hudi).  This is particularly helpful on 
cloud storage as cloud storage requests have constant overhead and are subject 
to rate limiting. 
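
   A minimal sketch of the read-side settings this paragraph refers to, 
assuming the `hoodie.metadata.enable` and `hoodie.enable.data.skipping` keys as 
documented for Hudi 0.14. The `HudiDataSkipping` object is a hypothetical 
wrapper for illustration, and the column stats index must already have been 
built on write (`hoodie.metadata.index.column.stats.enable=true`).

```scala
object HudiDataSkipping {
  /** Read options for a Hudi query. The metadata table stays enabled and only
    * data skipping is toggled, so the same query can be run both ways and the
    * scan metrics compared. (Hypothetical helper for illustration.) */
  def readOptions(dataSkipping: Boolean): Map[String, String] = Map(
    "hoodie.metadata.enable" -> "true",
    "hoodie.enable.data.skipping" -> dataSkipping.toString
  )
}
```

   These could be applied with 
`spark.read.format("hudi").options(HudiDataSkipping.readOptions(true)).load(basePath)`, 
where `basePath` stands in for the table location.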
   
   You bring valid feedback that we will take and work on: better showcasing 
the impact of these indexes so that users can easily spot it. We will update 
you shortly on how we are incorporating this.
   






Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]

2024-01-31 Thread via GitHub


bk-mz commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1919010119

   >when number of output rows with bloom is clearly lot less than number of 
output rows without bloom.
   
   @ad1happy2go 
   
   Query performance is the same for both the RO and snapshot cases, which is 
why I'm making that statement. Just having one number smaller than another is 
cryptic. 
   
   >You can also try column stats indexing also in this case. 
   
   As you can see, they are enabled:
   
   ```
   hoodie.metadata.index.bloom.filter.column.list=id,account_id
   hoodie.metadata.index.bloom.filter.enable=true
   hoodie.metadata.index.column.stats.column.list=id,account_id
   hoodie.metadata.index.column.stats.enable=true
   ```
   
   My concern with Hudi, and in this ticket specifically, is that today Hudi 
does not let you introspect and figure out whether any statistical or indexing 
solution is actually improving performance. 
   
   We can't tie Hudi configurations to actual results; they are not logically 
connected, as the queries above show. 
   
   I.e. I can't say "OK, I removed that configuration and my query started to 
lag", and, vice versa, I can't say "I added that column to the statistics 
config and my queries are faster now", because there are no metrics or 
practical evidence anywhere to help understand the cause.





Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]

2024-01-31 Thread via GitHub


ad1happy2go commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1918965640

   @bk-mz Why do you think "indexing and statistical means in hudi are 
ineffective" when the number of output rows with bloom is clearly a lot less 
than the number of output rows without bloom? 
   You can also try column stats indexing in this case. That will optimise 
your read queries.





Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]

2024-01-24 Thread via GitHub


bk-mz commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1908210748

   >What do you think about,
   
   TBH a bit of mixed emotions here.
   
   With 0.14 there is practically no way of understanding how indexing or 
statistical means affect queries, apart from the "output number of rows" in 
the Spark SQL dataframe, i.e. are they used at all and, if so, how effectively?
   
   This issue could be closed. From our end we'll move on with the assumption 
that indexing and statistical means in hudi are ineffective, though we'll keep 
them enabled on our critical fields in case later Hudi releases implement 
performance improvements.  





Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]

2024-01-22 Thread via GitHub


KnightChess commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1905205987

   Many factors can lead to differences in query time: I/O, CPU and disk 
load; in Spark, parallelism and the time executors take to scale up; and in 
Hudi, snapshot reads should theoretically be slower than read-optimized reads, 
since they use different readers on different files (RO reads base files, RT 
reads base plus log files).
   There is another question: is reading a Parquet file with a bloom filter 
faster than reading one without? I don't think that is certain; you need to 
look at the actual effect in production. 
   In a Spark query, a 2 s difference cannot explain the slowness. What do you 
think? This is just my limited understanding; maybe others have a better 
opinion.
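
   One way to keep a single noisy run from deciding the comparison is to 
repeat each query and compare medians. The sketch below is plain Scala with no 
Spark dependency; `QueryTiming`, `timeRuns` and `median` are hypothetical names 
for illustration, and the body passed in would be the `df.filter(...).count()` 
call from the queries in this thread.

```scala
object QueryTiming {
  /** Runs `body` `runs` times and returns each run's wall-clock time in ms. */
  def timeRuns[A](runs: Int)(body: => A): Vector[Long] =
    Vector.fill(runs) {
      val start = System.nanoTime()
      body // by-name parameter: re-evaluated on every run
      (System.nanoTime() - start) / 1000000L
    }

  /** Median of a non-empty sample (lower middle element for even sizes). */
  def median(samples: Seq[Long]): Long = {
    require(samples.nonEmpty, "need at least one sample")
    samples.sorted.apply((samples.size - 1) / 2)
  }
}
```

   Usage would look like `QueryTiming.median(QueryTiming.timeRuns(5)(runQuery()))`, 
with `runQuery` standing in for the query under test. The first run is still 
affected by cold caches, so it may be worth discarding.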





Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]

2024-01-22 Thread via GitHub


bk-mz commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1904742903

   I.e. it's caused by a RO reader just reading different files? 





Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]

2024-01-22 Thread via GitHub


bk-mz commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1904742187

   How can we confirm that the difference is not caused by the read-optimized 
and snapshot read paths themselves, independent of any bloom filters or indexes? 





Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]

2024-01-22 Thread via GitHub


KnightChess commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1904082576

   @bk-mz yes, according to the indicators, it is effective 





Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]

2024-01-22 Thread via GitHub


bk-mz commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1904051040

   @KnightChess Did I understand you correctly, you are claiming that bloom 
filters actually work correctly? 





Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]

2024-01-22 Thread via GitHub


bk-mz commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1904045416

   `WholeStageCodegen (1) duration: total (min, med, max )13.4 m (79 ms, 1.5 s, 3.4 s )` for snapshot.
   
   `WholeStageCodegen (1) duration: total (min, med, max )6.5 m (249 ms, 552 ms, 5.9 s )` for read-optimized.





Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]

2024-01-22 Thread via GitHub


KnightChess commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1903609477

   We can only analyse the scan RDD. A query spends time in many places, so I 
think the result is normal.





Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]

2024-01-22 Thread via GitHub


KnightChess commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1903579710

   
![image](https://github.com/apache/hudi/assets/20125927/2dd2b745-96b2-464d-8541-1119197bed48)
   





Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]

2024-01-22 Thread via GitHub


KnightChess commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1903579464

   @bk-mz can you see the time spent at this point?





Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]

2024-01-22 Thread via GitHub


bk-mz commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1903567882

   Number of output rows for snapshot: 441,483,112, query time 28141 ms.
   Number of output rows for read-optimized: 22,887,045, query time 26054 ms.
   
   
![read-optimized](https://github.com/apache/hudi/assets/892781/d61438ac-3792-4217-9b79-23783128def1)
   
![snapshot](https://github.com/apache/hudi/assets/892781/3d8d3326-8eb6-4a3a-88a7-0b46d27405e7)
   
   ```
   scala> spark.time({
        |   val df = spark.read
        |     .format("org.apache.hudi")
        |     .option("hoodie.datasource.query.type", "read_optimized")
        |     .load("s3://com.twilio.messaging.mdp.datalake.tables.mdp.pffm.pii.temp/table/")
        |
        |   val count = df.filter(
        |     (df("year") === 2024) &&
        |     (df("month") === 1) &&
        |     (df("day") === 16) &&
        |     (df("account_sid") === "AC5aa523aa6e1271134f9adda35cd08c7c")
        |   ).count()
        |
        |   println(s"Count: $count")
        | })
   24/01/22 09:05:38 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
   Count: 47
   Time taken: 26054 ms
   ```
   
   ```
   scala> spark.time({
        |   val df = spark.read
        |     .format("org.apache.hudi")
        |     .option("hoodie.datasource.query.type", "snapshot")
        |     .load("s3://com.twilio.messaging.mdp.datalake.tables.mdp.pffm.pii.temp/table/")
        |
        |   val count = df.filter(
        |     (df("year") === 2024) &&
        |     (df("month") === 1) &&
        |     (df("day") === 16) &&
        |     (df("account_sid") === "AC5aa523aa6e1271134f9adda35cd08c7c")
        |   ).count()
        |
        |   println(s"Count: $count")
        | })
   24/01/22 09:09:03 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
   Count: 47
   Time taken: 28141 ms
   ```
   
   Okay, your point stands: the numbers of output rows are indeed different.
   
   Though, how can we explain the nearly identical query times?
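
   To put the two measurements side by side, here is the arithmetic on the 
figures reported in this comment, as a small Scala sketch. The `ScanComparison` 
object and `Run` case class are hypothetical names; only the four numbers come 
from the runs above.

```scala
object ScanComparison {
  /** One measured run: scan output rows and wall-clock time in ms. */
  final case class Run(outputRows: Long, millis: Long)

  // Figures reported above for the two query types.
  val snapshot: Run = Run(441483112L, 28141L)
  val readOptimized: Run = Run(22887045L, 26054L)

  /** How many times more rows the snapshot scan emitted (roughly 19x). */
  def rowRatio: Double = snapshot.outputRows.toDouble / readOptimized.outputRows

  /** How much longer the snapshot query took (roughly 1.08x). */
  def timeRatio: Double = snapshot.millis.toDouble / readOptimized.millis
}
```

   The scan emits roughly 19 times fewer rows in the read-optimized case while 
the wall-clock time differs by only about 8%, which suggests the time is spent 
somewhere other than on processing the rows that were filtered out.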
   
   





Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]

2024-01-21 Thread via GitHub


KnightChess commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1902623887

   @bk-mz you can see the scan RDD's `number of output rows` in the SQL tab of the Spark UI.





Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]

2024-01-19 Thread via GitHub


bk-mz commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1900238202

   Sure, but anything specific you want to see?





Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]

2024-01-18 Thread via GitHub


KnightChess commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1898309708

   @bk-mz The operating system cache may also have an impact. Can you 
provide detailed metrics from the Spark UI?





Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]

2024-01-18 Thread via GitHub


bk-mz commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1898187110

   ```
   scala> spark.time({
        |   val df = spark.read
        |     .format("org.apache.hudi")
        |     .option("hoodie.datasource.query.type", "read_optimized")
        |     .load("s3://path/table/")
        |
        |   val count = df.filter(
        |     (df("year") === 2024) &&
        |     (df("month") === 1) &&
        |     (df("day") === 16) &&
        |     (df("account_id") === "id1")
        |   ).count()
        |
        |   println(s"Count: $count")
        | })
   Count: 47
   Time taken: 30477 ms
   ```
   
   ```
   scala> spark.time({
        |   val df = spark.read
        |     .format("org.apache.hudi")
        |     .option("hoodie.datasource.query.type", "snapshot")
        |     .load("s3://path/table/")
        |
        |   val count = df.filter(
        |     (df("year") === 2024) &&
        |     (df("month") === 1) &&
        |     (df("day") === 16) &&
        |     (df("account_sid") === "id1")
        |   ).count()
        |
        |   println(s"Count: $count")
        | })
   24/01/18 10:06:51 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
   Count: 47
   Time taken: 22594 ms
   ```
   
   For a clean experiment, I created two separate sessions for the queries above.
   
   It's just super confusing, as it contradicts the logic: `read_optimized` 
actually takes more time to load the same data than `snapshot` does.
   
   Can we say for sure that Parquet native bloom filters are simply not 
effective for Hudi? 





Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]

2024-01-17 Thread via GitHub


KnightChess commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1897674210

   @bk-mz yes, `set hoodie.datasource.query.type = read_optimized`





Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]

2024-01-17 Thread via GitHub


bk-mz commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1896670833

   >mor read_optimized can use it.
   
   Can I set spark-sql to use read_optimized to test it out?





Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]

2024-01-17 Thread via GitHub


KnightChess commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1895858498

   @bk-mz yes, MOR snapshot reads do not support the Parquet native bloom 
filter: log files are merged on read, so the native bloom filter in the base 
file is not up to date and is not accurate. Only `cow` or 
`mor read_optimized` can use it.
   
   And in version 0.14.0, the Hudi bloom filter is only used on write, to tag 
records.
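
   The rule stated above can be written down as a small check. This is only a 
restatement of the comment for 0.14, not Hudi code: `NativeBloomSupport`, 
`TableType` and `canUseParquetBloom` are hypothetical names, and the query type 
strings are the values of `hoodie.datasource.query.type` used in this thread.

```scala
object NativeBloomSupport {
  sealed trait TableType
  case object CopyOnWrite extends TableType
  case object MergeOnRead extends TableType

  /** True when, per the comment above, a query can rely on the Parquet native
    * bloom filter: any COW read, or a MOR read with query type
    * "read_optimized". MOR snapshot reads merge log files on read, so the
    * base-file filter may be stale. */
  def canUseParquetBloom(tableType: TableType, queryType: String): Boolean =
    tableType match {
      case CopyOnWrite => true
      case MergeOnRead => queryType == "read_optimized"
    }
}
```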

