rubenssoto commented on issue #1829: URL: https://github.com/apache/hudi/issues/1829#issuecomment-766821182
@vinothchandar Thank you so much for your answer. When do you plan to release this version? I will try to work around it until then. Is this configuration right?

```json
{
  "conf": {
    "spark.jars.packages": "org.apache.spark:spark-avro_2.12:2.4.4",
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.jars": "s3://dl/lib/hudi-spark-bundle_2.12-0.8.0-SNAPSHOT.jar",
    "spark.sql.hive.convertMetastoreParquet": "false",
    "spark.hadoop.hoodie.metadata.enable": "true"
  }
}
```

I ran these two queries:

```python
spark.read.format('hudi').load('s3://ze-data-lake/temp/order_test').count()
```

```
%%sql
select count('*') from raw_courier_api.order_test
```

For the PySpark query, Spark creates a job with 143 tasks; after about 10 seconds of file listing, the count finishes quickly. For the Spark SQL query, however, Spark creates a job with 2000 tasks and it is very slow. Is this a Hudi issue or a Spark issue? Thank you so much!
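As a possible interim workaround (a sketch only, not a confirmed fix): instead of setting the flag session-wide via `spark.hadoop.hoodie.metadata.enable`, the same option can be passed on an individual DataFrame read. This assumes a running `SparkSession` named `spark` with the Hudi bundle on the classpath, and reuses the S3 path from the comment above:

```python
# Sketch: enable the Hudi metadata table for a single read via a
# per-read option rather than a session-wide spark.hadoop.* setting.
# Assumes `spark` is an existing SparkSession with the Hudi bundle loaded.
df = (
    spark.read.format("hudi")
    .option("hoodie.metadata.enable", "true")
    .load("s3://ze-data-lake/temp/order_test")
)
print(df.count())
```

Note that a per-read option only affects DataFrame reads; the Spark SQL query against the Hive-registered table would still rely on the session-level configuration.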