Why does the HDFS parquet file generated by Spark SQL have different size with those on Tachyon?

2015-04-17 Thread zhangxiongfei
Hi, I did some tests on Parquet files with the Spark SQL DataFrame API. I generated 36 gzip-compressed Parquet files with Spark SQL and stored them on Tachyon. The size of each file is about 222M. Then I read them with the code below. val tfs

Re: Why does the HDFS parquet file generated by Spark SQL have different size with those on Tachyon?

2015-04-17 Thread Reynold Xin
It's because you did a repartition -- which rearranges all the data. Parquet uses several encoding techniques, such as dictionary encoding and run-length encoding, whose output size depends on how the values are ordered, so the files differ in size when the data is ordered differently. On Fri, Apr 17, 2015 at 4:51 AM, zhangxiongfei
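
To make the point about ordering concrete, here is a minimal sketch in plain Python (not Spark or Parquet code) of how run-length encoding reacts to row order. The column contents and the `run_length_encode` helper are assumptions made up for illustration; Parquet's real encoders are more elaborate, but the same effect applies: identical rows in a different order can encode to a very different size.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]


# Hypothetical column: three distinct values, 1000 rows each, stored contiguously.
clustered = ["a"] * 1000 + ["b"] * 1000 + ["c"] * 1000

# The same 3000 rows in a different order, standing in for data
# that has been rearranged (for example by a repartition).
interleaved = ["a", "b", "c"] * 1000

clustered_runs = run_length_encode(clustered)
interleaved_runs = run_length_encode(interleaved)

print(len(clustered_runs))    # 3 runs: one per distinct value
print(len(interleaved_runs))  # 3000 runs: every row starts a new run
```

Both lists hold exactly the same values, yet the clustered layout needs 3 runs and the interleaved one needs 3000, which is why rearranging rows can change the size of the written files.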