umehrot2 commented on issue #1798:
URL: https://github.com/apache/hudi/issues/1798#issuecomment-658385297


   As @bvaradar mentioned, in the first query the glob pattern matches 950 
folders, which are then listed in parallel across the cluster using the Spark 
context. In the second query the glob pattern matches 4750 files because of the 
extra *, so Spark now has to list 4750 paths in parallel instead. This is most 
likely the cause of the performance difference. In addition, I think 
**HoodieROTablePathFilter**, which is applied per file, may be amplifying the 
slowdown.
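
   To make the path counts concrete, here is a minimal sketch in plain Python 
(standard library only, not Spark or Hudi code). It builds a throwaway 
directory tree and counts what each glob depth matches. The folder and file 
names are hypothetical, and 5 files per folder is an assumption inferred only 
from 4750 / 950.

```python
import glob
import os
import tempfile


def count_glob_matches(num_folders, files_per_folder):
    """Build a throwaway partition layout and count matches at two glob depths."""
    with tempfile.TemporaryDirectory() as base:
        for i in range(num_folders):
            # Hypothetical partition folder names, for illustration only.
            folder = os.path.join(base, f"partition_{i:04d}")
            os.makedirs(folder)
            for j in range(files_per_folder):
                # Empty placeholder files stand in for parquet files.
                open(os.path.join(folder, f"part_{j}.parquet"), "w").close()
        # One wildcard level: matches the partition folders.
        folder_matches = glob.glob(os.path.join(base, "*"))
        # Extra wildcard level: matches every file inside every folder.
        file_matches = glob.glob(os.path.join(base, "*", "*"))
        return len(folder_matches), len(file_matches)


# Assumed layout: 950 folders with 5 files each.
folders, files = count_glob_matches(num_folders=950, files_per_folder=5)
print(folders, files)  # prints: 950 4750
```

   The extra wildcard level multiplies the number of paths to list by the 
number of files per folder, which fits the counts above.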
   
   Can you run similar test queries on a plain parquet table (non-Hudi 
table) and compare the listing performance? I think you may see 
somewhat similar behavior.
   
   ```
   spark.read.parquet(globPath)
   ```



