rubenssoto commented on issue #1981:
URL: https://github.com/apache/hudi/issues/1981#issuecomment-678550642


   Hi guys, AWS Support answered me; it's the same topic that we debated here.
   
   Hello,
   
   Thank you for your patience. I have heard back from the Service team, and 
here's why such behavior has been observed when querying Apache Hudi tables:
   
   When running 'SELECT COUNT(1)' queries on Hudi tables using 
HoodieParquetInputFormat, Athena has to bypass its own implementation of S3 
file listing. Thus Hudi tables can be much less efficient in a query where the 
bottleneck is the speed at which files are listed. The Apache Hudi community is 
already aware of the performance impact caused by its S3 listing logic [1], as 
was rightly suggested on the thread you created.
   
   Further, 'SELECT COUNT(1)' queries over either format are nearly 
instantaneous to process on the Query Engine, so they effectively measure how 
quickly the S3 listing completes. If you instead compare performance on more 
complex queries (that require meaningful work on both sides), you should see a 
less pronounced difference in the results.
   
   I hope this information helps. Feel free to reach out to me with any 
additional queries you may have on this topic. I will be glad to assist you!
   
   References:
   [1]. S3 slow file listing (Hudi) - 
https://github.com/apache/hudi/issues/1829 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

