Well, I only have data for 2015-08, so in the end there are only 31 partitions.
What I'm looking for is reasonably sized partitions.
In any case, a way to control the size or number of the output Parquet files 
would be nice.
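
For a rough sense of scale, the figures quoted further down in this thread (40k files of 15-25MB) can be combined with the 31 partitions mentioned above. This is only a back-of-the-envelope sketch: it assumes both figures describe the same job, and the 256MB target file size is a hypothetical value chosen for illustration, not something stated in the thread.

```python
import math

FILES = 40_000        # number of output files reported in the thread
PARTITIONS = 31       # one partition per day of 2015-08
SIZE_MB = (15, 25)    # reported per-file size range, in MB
TARGET_MB = 256       # hypothetical target file size (assumption)

# Average number of files that land in each daily partition.
files_per_partition = FILES / PARTITIONS

# Total output size implied by the reported file count and size range.
total_mb = tuple(FILES * s for s in SIZE_MB)

# Number of files needed if every file were TARGET_MB large.
target_files = tuple(math.ceil(t / TARGET_MB) for t in total_mb)

print(round(files_per_partition), total_mb, target_files)
```

Under these assumptions each daily partition holds roughly 1,290 small files, while a few thousand files in total would be enough at the assumed target size.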

Younes Naguib Streaming Division
Triton Digital | 1440 Ste-Catherine W., Suite 1200 | Montreal, QC  H3G 1R8
Tel.: +1 514 448 4037 x2688 | Tel.: +1 866 448 4037 x2688 | 
younes.nag...@tritondigital.com
________________________________
From: Cheng Lian [lian.cs....@gmail.com]
Sent: Wednesday, October 07, 2015 7:01 PM
To: Younes Naguib; 'user@spark.apache.org'
Subject: Re: Parquet file size

So many small files are probably generated because you are inserting into a 
partitioned table with three partition columns.

If you want larger Parquet files, you may try either avoiding a partitioned 
table altogether or using fewer partition columns (e.g., only year, without 
month and day).
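
The effect of the partition columns on the file count can be sketched in plain Python, without Spark. This is an illustration only: it assumes each write task produces one file per partition directory it holds rows for, and the task count of 1000 is a hypothetical number, not taken from this thread.

```python
from datetime import date, timedelta

def partition_dirs(days, columns):
    """Return the distinct partition directories for the given dates and columns."""
    return {tuple(getattr(d, c) for c in columns) for d in days}

# All days of 2015-08, the only month present in the data.
august = [date(2015, 8, 1) + timedelta(days=i) for i in range(31)]

by_day = partition_dirs(august, ("year", "month", "day"))
by_year = partition_dirs(august, ("year",))

# Hypothetical number of write tasks (assumption for illustration).
tasks = 1000

# Upper bound on output files if every task holds rows for every partition.
max_files_by_day = tasks * len(by_day)
max_files_by_year = tasks * len(by_year)

print(len(by_day), len(by_year), max_files_by_day, max_files_by_year)
```

With three partition columns the same data is spread over 31 directories, so the upper bound on the file count is 31 times higher than when partitioning by year alone.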

Cheng

So you want to dump all data into a single large Parquet file?

On 10/7/15 1:55 PM, Younes Naguib wrote:
The original TSV files total 600GB, and the job generated 40k files of 15-25MB each.

y

From: Cheng Lian [mailto:lian.cs....@gmail.com]
Sent: October-07-15 3:18 PM
To: Younes Naguib; 'user@spark.apache.org'
Subject: Re: Parquet file size

Why do you want larger files? Doesn't the resulting Parquet file contain all the 
data in the original TSV file?

Cheng
On 10/7/15 11:07 AM, Younes Naguib wrote:
Hi,

I’m reading a large TSV file and creating Parquet files using Spark SQL:
insert overwrite table tbl partition(year, month, day)....
select .... from tbl_tsv;

This works nicely, but it generates small Parquet files (15MB).
I would like to generate larger files. Any idea how to address this?

Thanks,
Younes Naguib


