Re: parquet vs orc files

2018-03-01 Thread Sushrut Ikhar
To add, schema evolution is better for parquet compared to orc (at the cost of a bit of slowness), as orc is truly index based; this is especially useful in case you want to delete some column later. Regards, Sushrut Ikhar about.me/sushrutikhar
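
As a rough sketch of what schema evolution looks like on the Parquet side (the path and the use of a SparkSession are hypothetical, not from the thread), Spark can merge the schemas of all part files at read time:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("schema-evolution-sketch").getOrCreate()

    // mergeSchema asks Spark to union the footers of all part files, so columns
    // added (or later dropped) in some writes still resolve when reading the dataset.
    val events = spark.read.option("mergeSchema", "true").parquet("/data/events")
    events.printSchema()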

Re: parquet vs orc files

2018-02-22 Thread Jörn Franke
Look at the documentation of the formats. In any case: * additionally use partitions on the filesystem * sort the data on filter columns - otherwise you do not benefit from min/max and bloom filters
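
A minimal sketch of that advice, assuming an existing DataFrame df with hypothetical columns date (partition key) and user_id (filter column):

    import org.apache.spark.sql.functions.col

    // df: an existing DataFrame with (hypothetical) columns "date" and "user_id".
    df.repartition(col("date"))            // group rows for each date into the same task
      .sortWithinPartitions("user_id")     // clustered values give tight min/max ranges per row group
      .write
      .partitionBy("date")                 // directory-level partitions on the filesystem
      .parquet("/warehouse/events")        // hypothetical output path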

Re: parquet vs orc files

2018-02-22 Thread Kurt Fehlhauer
Hi Kane, It really depends on your use case. I generally use Parquet because it seems to have better support beyond Spark. However, if you are dealing with partitioned Hive tables, the current versions of Spark have an issue where compression will not be applied. This will be fixed in version
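
For completeness, a sketch of requesting a Parquet codec explicitly; this is only an assumption about how one would set it, and it does not by itself address the partitioned Hive table issue mentioned above:

    // Sketch: request a codec explicitly; whether it is honored for partitioned
    // Hive tables depends on the Spark version, per the note above.
    spark.conf.set("spark.sql.parquet.compression.codec", "snappy")

    df.write
      .option("compression", "snappy")     // per-write override of the same setting
      .parquet("/warehouse/events_snappy") // hypothetical path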

Re: parquet vs orc files

2018-02-21 Thread Stephen Joung
In the case of parquet, the best source for me on configuring and ensuring "min/max statistics" was https://www.slideshare.net/mobile/RyanBlue3/parquet-performance-tuning-the-missing-guide --- I don't have any experience with orc.
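
One way to check that those statistics can actually be used (an assumption on my part, not from the slides) is to look at the pushed filters in the physical plan; the column and path below are hypothetical:

    // Sketch: confirm that a filter is pushed down to the Parquet scan, where it
    // can be evaluated against row-group min/max statistics.
    val events = spark.read.parquet("/warehouse/events")
    events.filter("user_id = 42").explain()
    // The physical plan should list something like
    // PushedFilters: [IsNotNull(user_id), EqualTo(user_id,42)]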

Re: parquet vs orc files

2018-02-21 Thread Kane Kim
Thanks, how does the min/max index work? Can Spark itself configure bloom filters when saving as orc?

Re: parquet vs orc files

2018-02-21 Thread Jörn Franke
In the latest version both are equally well supported. You need to insert the data sorted on the filtering columns; then you will benefit from min/max indexes and, in the case of orc, additionally from bloom filters, if you configure them. In any case I recommend also partitioning of files (do not confuse
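
A hedged sketch of the ORC side of that advice, assuming Spark's native ORC writer, an existing DataFrame df, and a hypothetical filter column user_id:

    // Sketch: sort on the filter column and ask the ORC writer for bloom filters on it.
    // Column name, fpp value, and path are hypothetical.
    df.sortWithinPartitions("user_id")
      .write
      .option("orc.bloom.filter.columns", "user_id") // ORC writer property passed through by Spark
      .option("orc.bloom.filter.fpp", "0.05")        // target false-positive probability
      .orc("/warehouse/events_orc")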

parquet vs orc files

2018-02-21 Thread Kane Kim
Hello, Which format is better supported in spark, parquet or orc? Will spark use internal sorting of parquet/orc files (and how to test that)? Can spark save sorted parquet/orc files? Thanks!