
Re: Why does the HDFS parquet file generated by Spark SQL have different size with those on Tachyon?

2015-04-17 Thread Reynold Xin

It's because you did a repartition, which shuffles the rows and rearranges
all the data.

Parquet uses several compression and encoding techniques, such as dictionary
encoding and run-length encoding. How well they work depends on the order of
the values in each column, so the same data ordered differently can produce
files of different sizes.
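
To illustrate why row order matters for run-length encoding, here is a minimal
sketch in plain Scala. It is not Parquet's actual encoder: `runLengthEncode`,
the `RleOrderingDemo` object, and the sample values are hypothetical and only
show that the same values in a different order yield a different number of runs.

```scala
object RleOrderingDemo {
  // Toy run-length encoder (an illustration, not Parquet's implementation):
  // collapses consecutive equal values into (value, count) pairs.
  def runLengthEncode[A](values: Seq[A]): List[(A, Int)] =
    values.foldLeft(List.empty[(A, Int)]) {
      case ((v, n) :: rest, x) if v == x => (v, n + 1) :: rest // extend the current run
      case (acc, x)                      => (x, 1) :: acc      // start a new run
    }.reverse

  // The same eight values in two different row orders.
  val clustered: Seq[String]   = Seq("a", "a", "a", "a", "b", "b", "b", "b")
  val interleaved: Seq[String] = Seq("a", "b", "a", "b", "a", "b", "a", "b")

  val clusteredRuns: List[(String, Int)]   = runLengthEncode(clustered)
  val interleavedRuns: List[(String, Int)] = runLengthEncode(interleaved)
}
```

The clustered order collapses into two runs, while the interleaved order needs
one run per value. A shuffle such as `repartition` can turn the first kind of
layout into something closer to the second.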

On Fri, Apr 17, 2015 at 4:51 AM, zhangxiongfei wrote:

> Hi,
> I ran some tests on Parquet files with the Spark SQL DataFrame API.
> I generated 36 gzip-compressed Parquet files with Spark SQL and stored them
> on Tachyon. Each file is about 222 MB. I then read them with the code below:
>
> val tfs = sqlContext.parquetFile("tachyon://datanode8.bitauto.dmp:19998/apps/tachyon/adClick");
>
> Next, I saved this DataFrame to HDFS with the code below. It also generates
> 36 Parquet files, but each file is about 265 MB:
>
> tfs.repartition(36).saveAsParquetFile("/user/zhangxf/adClick-parquet-tachyon");
>
> My question is: why do the files on HDFS differ in size from those on
> Tachyon, even though they come from the same original data?
>
>
> Thanks
> Zhang Xiongfei
>
>

Why does the HDFS parquet file generated by Spark SQL have different size with those on Tachyon?

2015-04-17 Thread zhangxiongfei

Hi,
I ran some tests on Parquet files with the Spark SQL DataFrame API.
I generated 36 gzip-compressed Parquet files with Spark SQL and stored them on
Tachyon. Each file is about 222 MB. I then read them with the code below:

val tfs = sqlContext.parquetFile("tachyon://datanode8.bitauto.dmp:19998/apps/tachyon/adClick");

Next, I saved this DataFrame to HDFS with the code below. It also generates 36
Parquet files, but each file is about 265 MB:

tfs.repartition(36).saveAsParquetFile("/user/zhangxf/adClick-parquet-tachyon");

My question is: why do the files on HDFS differ in size from those on Tachyon,
even though they come from the same original data?


Thanks
Zhang Xiongfei
