Hi,
I ran some tests on Parquet files with the Spark SQL DataFrame API.
I generated 36 gzip-compressed Parquet files with Spark SQL and stored them on
Tachyon. Each file is about 222 MB. Then I read them with the code below.
val tfs
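(The snippet above is truncated in the archive. A minimal sketch of what the read likely looked like, using the Spark 1.3-era API; the Tachyon URI is hypothetical:)

  import org.apache.spark.sql.SQLContext

  val sqlContext = new SQLContext(sc)  // sc: an existing SparkContext
  // Hypothetical Tachyon path; the original path was cut off.
  val tfs = sqlContext.parquetFile("tachyon://host:19998/user/parquet")
  tfs.count()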
It's because you did a repartition, which rearranges all the data.
Parquet uses several compression techniques, such as dictionary
encoding and run-length encoding, whose effectiveness depends on row
order, so the same data ordered differently can produce files of
different sizes.
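To illustrate, a sketch along these lines (continuing from the read above; the paths, column name, and partition count are hypothetical, Spark 1.3-era API) would show the effect, since sorted data gives the encodings long runs of equal values to work with:

  // Sorted by a low-cardinality column: long runs of repeated values,
  // which dictionary and run-length encoding compress well.
  tfs.sort("someColumn").saveAsParquetFile("tachyon://host:19998/sorted")

  // Repartitioned: same rows in an arbitrary order, so the encodings
  // are less effective and the files come out larger.
  tfs.repartition(36).saveAsParquetFile("tachyon://host:19998/shuffled")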
On Fri, Apr 17, 2015 at 4:51 AM, zhangxiongfei wrote: