umehrot2 commented on issue #1798: URL: https://github.com/apache/hudi/issues/1798#issuecomment-658385297
As @bvaradar mentioned, in the first query the glob pattern matches 950 folders, which are then listed in parallel across the cluster using the Spark context. In the second query the glob pattern matches 4750 files because of the extra `*`, so Spark now has to list 4750 paths in parallel using the Spark context. This is most likely the cause of the performance difference. On top of that, I think the time taken by **HoodieROTablePathFilter**, which is applied per file, might be amplifying it further.

Can you run similar test queries on a plain parquet table (non-Hudi) and observe the performance difference in listing? I think you may see somewhat similar behavior.

```
spark.read.parquet(globPath)
```
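To make the path-count argument concrete, here is a minimal sketch that only counts how many paths each glob pattern expands to. It is not Spark's or Hudi's listing code: it uses Python's standard `glob` module on a local temporary directory instead of S3/HDFS, the layout (`part=NNNN/file_N.parquet`) and the helper names `build_table` and `count_matches` are hypothetical, and the sizes are scaled down to 95 folders with 5 files each to keep the same 1:5 ratio as 950 folders versus 4750 files.

```python
import glob
import os
import tempfile


def build_table(root, num_partitions, files_per_partition):
    """Create a hypothetical layout root/part=NNNN/file_N.parquet with empty files."""
    for p in range(num_partitions):
        part_dir = os.path.join(root, f"part={p:04d}")
        os.makedirs(part_dir)
        for f in range(files_per_partition):
            # Empty placeholder files: only the number of paths matters here.
            open(os.path.join(part_dir, f"file_{f}.parquet"), "w").close()


def count_matches(pattern):
    """Return how many paths the glob pattern expands to."""
    return len(glob.glob(pattern))


with tempfile.TemporaryDirectory() as root:
    # Scaled-down stand-in for the table in the issue (assumed 5 files per folder).
    build_table(root, num_partitions=95, files_per_partition=5)

    # Pattern ending at the folder level: one match per partition folder.
    folder_level = count_matches(os.path.join(root, "*"))

    # One extra "*": the pattern now matches every file inside every folder.
    file_level = count_matches(os.path.join(root, "*", "*"))

print("paths matched at folder level:", folder_level)
print("paths matched at file level:", file_level)
```

The second pattern expands to five times as many paths as the first. Each of those paths then has to be listed and, for a Hudi table, passed through the path filter, which is where the extra time would be expected to come from if the explanation above holds.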