umehrot2 commented on issue #1829:
URL: https://github.com/apache/hudi/issues/1829#issuecomment-660389870


   @zuyanton In your test with regular parquet tables, you are probably not setting the property ```spark.sql.hive.convertMetastoreParquet=false``` in the Spark config. Only when you set this property to ```false``` will Spark use the `Parquet InputFormat` and its listing code. Otherwise, by default, Spark uses its native listing (parallelized over the cluster) and its native parquet readers, which are supposed to be faster.
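   
   For reference, here is a minimal sketch of setting this property when building the session (Hive support enabled; the app name is illustrative):
   ```scala
   import org.apache.spark.sql.SparkSession

   // Setting convertMetastoreParquet to false makes Spark read Hive
   // Parquet tables through the Hive InputFormat path instead of its
   // native, parallelized listing and Parquet readers.
   val spark = SparkSession.builder()
     .appName("parquet-inputformat-test") // illustrative name
     .enableHiveSupport()
     .config("spark.sql.hive.convertMetastoreParquet", "false")
     .getOrCreate()
   ```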
   
   However, Hudi works through an `InputFormat` implementation. Thus, for a fair comparison, when you test regular parquet with Spark you should set ```spark.sql.hive.convertMetastoreParquet=false```, and I think you will then observe behavior quite similar to what you are seeing. Would you mind trying that out once? A rough sketch of such a run follows below.
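   
   To illustrate the comparison on an existing session (the table name `my_parquet_table` is hypothetical; substitute your test table):
   ```scala
   // spark.sql.hive.convertMetastoreParquet is a SQL conf, so it can
   // also be toggled at runtime on an existing session.
   spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")

   // Time the same query that was benchmarked against the Hudi table;
   // with the property set to false it should go through the
   // InputFormat code path, including its listing.
   val start = System.nanoTime()
   spark.sql("SELECT COUNT(*) FROM my_parquet_table").show()
   println(s"Elapsed: ${(System.nanoTime() - start) / 1e9} s")
   ```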
   
   But @bvaradar, irrespective of that, I think for Hudi we should always compare our performance against standard Spark performance (native listing and reading), not against Spark's performance when it is forced through the InputFormat. So we need to get this fixed either way if we are to be comparable to Spark parquet performance, which uses listing parallelized over the cluster.
   

