boneanxs commented on PR #8452:
URL: https://github.com/apache/hudi/pull/8452#issuecomment-1606013767

   > if you could attach the query plan for before and after this change, it 
would be helpful.
   
   The query plan is the same before and after this change, since all filters 
are pushed down to Hudi in both cases. The difference is that some of those 
filters did not take effect before this PR.
   
   I tested a table with 50,000 partitions (partition columns: region, date, 
hour) and logged the time cost in 
`org.apache.hudi.SparkHoodieTableFileIndex#tryListByPartitionPathPrefix`:
   
   ```scala
     private def tryListByPartitionPathPrefix(partitionColumnNames: Seq[String],
                                              partitionColumnPredicates: Seq[Expression]) = {
       // Static partition-path prefix is defined as a prefix of the full partition-path where only
       // first N partition columns (in-order) have proper (static) values bound in equality predicates,
       // allowing in turn to build such prefix to be used in subsequent filtering
   
       val startTime = System.currentTimeMillis()
       //...
   
       log.info(s"Time cost to listing files: ${System.currentTimeMillis() - startTime}ms")
       result
     }
   ```
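   
   For illustration only, the same measurement could be factored into a small 
wrapper so the timing code does not have to be spliced into the method body. 
This is a minimal sketch, not code from the PR: `TimedSketch` and `timed` are 
hypothetical names, and `println` stands in for `log.info`.
   
   ```scala
   object TimedSketch {
     // Hypothetical helper, not part of Hudi: evaluates `body`, passes a message with the
     // elapsed milliseconds to `report`, and returns the value of `body` unchanged.
     def timed[T](label: String)(report: String => Unit)(body: => T): T = {
       val startNanos = System.nanoTime()
       try body
       finally {
         val elapsedMs = (System.nanoTime() - startNanos) / 1000000L
         report(s"$label: ${elapsedMs}ms")
       }
     }
   
     def main(args: Array[String]): Unit = {
       // Stand-in workload (an assumption): the real call would be the file listing.
       val total = timed("Time cost to listing files")(println) {
         (1 to 1000000).map(_.toLong).sum
       }
       println(s"result = $total")
     }
   }
   ```
   
   `System.nanoTime()` is used here because it is monotonic. 
`System.currentTimeMillis()`, as in the snippet above, is also adequate for 
timings in the range of seconds.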
   
   With the filter `date=date"2023-06-20"` pushed down, I ran the query 3 times 
in `local[10]` mode. Listing time drops from roughly 37-43s to 10-12s with this 
PR:
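   
   One way to see why a filter on `date` alone is awkward for prefix-based 
listing: under the rule quoted in the code comment, a static prefix needs the 
first N partition columns to be bound by equality predicates, and `date` is 
not the leading column. The sketch below is a simplified model of that rule, 
not Hudi's implementation. `StaticPrefixSketch`, `staticPrefix` and the region 
value `r1` are hypothetical.
   
   ```scala
   object StaticPrefixSketch {
     // Simplified model (an assumption, not Hudi's code): walk the partition columns in
     // order and keep only the leading ones that are bound by an equality predicate.
     def staticPrefix(partitionColumns: Seq[String],
                      equalityValues: Map[String, String]): Seq[(String, String)] =
       partitionColumns
         .takeWhile(equalityValues.contains)
         .map(col => col -> equalityValues(col))
   
     def main(args: Array[String]): Unit = {
       val columns = Seq("region", "date", "hour")
       // Only `date` is bound and `region` is not, so no static prefix can be built.
       println(staticPrefix(columns, Map("date" -> "2023-06-20")))
       // With a (hypothetical) region value bound as well, a two-column prefix exists.
       println(staticPrefix(columns, Map("region" -> "r1", "date" -> "2023-06-20")))
     }
   }
   ```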
   
   ### Before the PR
   
   ```
   23/06/25 18:09:11 INFO HoodieFileIndex: Time cost to listing files: 42745ms
   23/06/25 18:12:04 INFO HoodieFileIndex: Time cost to listing files: 37495ms
   23/06/25 18:15:14 INFO HoodieFileIndex: Time cost to listing files: 43496ms
   ```
   
   ### After the PR
   
   ```
   23/06/25 18:19:35 INFO HoodieFileIndex: Time cost to listing files: 10928ms
   23/06/25 18:20:29 INFO HoodieFileIndex: Time cost to listing files: 10015ms
   23/06/25 18:21:25 INFO HoodieFileIndex: Time cost to listing files: 12032ms
   ```
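   
   Averaged over the three runs, that is about 41.2s before versus about 11.0s 
after, roughly a 3.75x reduction. A small sketch that derives these figures 
from the log lines above (`ListingTimeSummary` and its helpers are names made 
up for this example):
   
   ```scala
   object ListingTimeSummary {
     // Matches the "Time cost to listing files: <n>ms" part of the log lines quoted above.
     private val CostPattern = """Time cost to listing files: (\d+)ms""".r
   
     // Extracts the millisecond values; lines that do not match are skipped.
     def costsMs(lines: Seq[String]): Seq[Long] =
       lines.flatMap(line => CostPattern.findFirstMatchIn(line).map(_.group(1).toLong))
   
     def mean(values: Seq[Long]): Double = values.sum.toDouble / values.size
   
     def main(args: Array[String]): Unit = {
       val before = costsMs(Seq(
         "23/06/25 18:09:11 INFO HoodieFileIndex: Time cost to listing files: 42745ms",
         "23/06/25 18:12:04 INFO HoodieFileIndex: Time cost to listing files: 37495ms",
         "23/06/25 18:15:14 INFO HoodieFileIndex: Time cost to listing files: 43496ms"))
       val after = costsMs(Seq(
         "23/06/25 18:19:35 INFO HoodieFileIndex: Time cost to listing files: 10928ms",
         "23/06/25 18:20:29 INFO HoodieFileIndex: Time cost to listing files: 10015ms",
         "23/06/25 18:21:25 INFO HoodieFileIndex: Time cost to listing files: 12032ms"))
       val ratio = mean(before) / mean(after)
       println(f"before: ${mean(before)}%.0f ms, after: ${mean(after)}%.0f ms, ratio: $ratio%.2f")
     }
   }
   ```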
   
   Since my backend storage is `HDFS`, I think the saving could be even larger 
on an object store.

