To add: schema evolution is better supported in parquet than in orc (at
the cost of slightly slower reads), as orc is truly index based;
this is especially useful in case you want to delete some column later.
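To make that concrete, here is a minimal sketch (paths and column names
are made up; assumes a spark-shell session where spark is the
SparkSession):

    import spark.implicits._
    // Two writes of the same dataset whose schema evolved between runs:
    Seq((1, "a")).toDF("id", "name").write.parquet("/tmp/events/batch=1")
    Seq((2, "b", 0.5)).toDF("id", "name", "score").write.parquet("/tmp/events/batch=2")
    // mergeSchema reconciles the footers into one evolved schema;
    // files written before the change just return null for the new column.
    spark.read.option("mergeSchema", "true").parquet("/tmp/events").printSchema()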
Regards,
Sushrut Ikhar
about.me/sushrutikhar
Look at the documentation of the formats. In any case:
* additionally, use partitions on the filesystem
* sort the data on the filter columns - otherwise you do not benefit from
min/max and bloom filters
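Both points together in a sketch, assuming a DataFrame df with
hypothetical columns event_date and customer_id:

    import org.apache.spark.sql.SaveMode
    import org.apache.spark.sql.functions.col

    df.repartition(col("event_date"))            // group rows by partition value
      .sortWithinPartitions(col("customer_id"))  // sort rows inside each output file
      .write
      .mode(SaveMode.Overwrite)
      .partitionBy("event_date")                 // directories on the filesystem
      .parquet("/data/events")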
Hi Kane,
It really depends on your use case. I generally use Parquet because it
seems to have better support beyond Spark. However, if you are dealing
with partitioned Hive tables, current versions of Spark have an issue
where compression will not be applied. This will be fixed in an
upcoming version.
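One possible workaround sketch in the meantime is to set the codec
explicitly (the output path here is made up):

    // Session-wide default codec for parquet output:
    spark.conf.set("spark.sql.parquet.compression.codec", "snappy")
    // or per write:
    df.write.option("compression", "snappy").parquet("/data/out")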
In the case of parquet, the best source for me on configuring and
verifying "min/max statistics" was
https://www.slideshare.net/mobile/RyanBlue3/parquet-performance-tuning-the-missing-guide
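The short version of that guide, as a sketch (df, user_id and the path
are made up; the two size keys are standard Parquet writer properties
passed via the Hadoop configuration):

    import org.apache.spark.sql.functions.col
    // Row-group (block) and page sizes set the granularity of min/max stats:
    spark.sparkContext.hadoopConfiguration.setInt("parquet.block.size", 128 * 1024 * 1024)
    spark.sparkContext.hadoopConfiguration.setInt("parquet.page.size", 1024 * 1024)
    // The stats only prune reads if the data is clustered on the filter column:
    df.sort(col("user_id")).write.parquet("/data/sorted")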
---
I don't have any experience in orc.
On Thu, Feb 22, 2018 at 6:59 AM, Kane Kim wrote:
Thanks, how does min/max index work? Can spark itself configure bloom
filters when saving as orc?
On Wed, Feb 21, 2018 at 1:40 PM, Jörn Franke wrote:
In the latest version both are equally well supported.
You need to insert the data sorted on the filtering columns.
Then you will benefit from min/max indexes and, in the case of orc,
additionally from bloom filters, if you configure them.
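For the bloom filter part, a sketch of how configuring them from Spark
can look (df, the column name and the path are made up; the
orc.bloom.filter.* keys are ORC writer properties passed as writer
options):

    import org.apache.spark.sql.functions.col
    df.sortWithinPartitions(col("id"))
      .write
      .option("orc.bloom.filter.columns", "id")  // build bloom filters for this column
      .option("orc.bloom.filter.fpp", "0.05")    // target false-positive rate
      .orc("/data/orc_table")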
In any case I also recommend partitioning of files (do not confuse
Hello,
Which format is better supported in spark, parquet or orc?
Will spark use internal sorting of parquet/orc files (and how to test that)?
Can spark save sorted parquet/orc files?
Thanks!