Spark Parquet file size

2020-11-10 Thread Tzahi File
Hi,

We have many Spark jobs that create a large number of small files. We would
like to improve read performance for our analysts, so I'm testing for the
optimal Parquet file size.
I've found that the optimal file size should be around 1 GB, and not less
than 128 MB, depending on the size of the data.

I took one process to examine. It uses shuffle partitions = 600, which
produces files of about 11 MB each. I added a repartition step to create
fewer files - roughly 12 files of about 600 MB each. After testing it
(select * from table where ...), I saw that the old version (with more files)
ran faster than the new one. I then increased the number of files to 40 -
about 130 MB each - and it still runs slower.
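
For reference, the repartition step looks roughly like this (a sketch with
made-up table and path names; the sizes follow the numbers above):

import org.apache.spark.sql.SparkSession

// Minimal sketch (illustrative names): choose the number of output files from the
// total data size so each file lands in the 128 MB - 1 GB range, instead of
// inheriting the 600 shuffle partitions.
val spark = SparkSession.builder().appName("parquet-file-size-test").getOrCreate()
val df = spark.table("source_table")            // hypothetical source table

val targetFileSizeMB = 512                      // anywhere in the 128 MB - 1 GB range
val totalSizeMB = 600 * 11                      // roughly 600 partitions x ~11 MB each
val numFiles = math.max(1, totalSizeMB / targetFileSizeMB)

df.repartition(numFiles)                        // full shuffle; coalesce(numFiles) avoids one
  .write
  .mode("overwrite")
  .parquet("s3://bucket/path/output/")          // hypothetical output path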

Would appreciate your experience with file sizes, and how to optimize the
num and size of files.

Thanks,
Tzahi


Re: Parquet file size

2015-10-08 Thread Cheng Lian
How many tasks are there in the write job? Since each task may write one 
file for each partition, you may end up with taskNum * 31 files.


Increasing SPLIT_MINSIZE does help reduce the number of tasks. Another way to
address this issue is to use DataFrame.coalesce(n) to shrink the number of
tasks to n explicitly.
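
A minimal sketch of the coalesce(n) approach (illustrative paths; assumes the
Spark 1.4+ DataFrame writer API and a spark-shell style sqlContext):

// Sketch of the coalesce(n) suggestion; coalesce shrinks the number of write
// tasks -- and therefore the number of output files -- without a full shuffle.
val df = sqlContext.read.parquet("/data/tbl_small_files")   // illustrative path
df.coalesce(8)                                              // n = desired number of output files
  .write
  .mode("overwrite")
  .parquet("/data/tbl_compacted")                           // illustrative path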


Cheng

On 10/7/15 6:40 PM, Younes Naguib wrote:

Thanks, I'll try that.



RE: Parquet file size

2015-10-07 Thread Younes Naguib
The original TSV data is 600 GB, and the job generated 40k Parquet files of 15-25 MB each.

y

On 10/7/15 3:18 PM, Cheng Lian wrote:

Why do you want larger files? Doesn't the result Parquet file contain all the 
data in the original TSV file?

Cheng




Re: Parquet file size

2015-10-07 Thread Cheng Lian
The reason so many small files are generated is probably that you are
inserting into a partitioned table with three partition columns.


If you want larger Parquet files, you may try to either avoid using a
partitioned table, or use fewer partition columns (e.g., only year, without
month and day).
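
For illustration, the second option could look roughly like this (a sketch
assuming a Hive-enabled SQLContext; the table tbl_by_year and the columns
col1/col2 are hypothetical):

// Hypothetical variant of the original INSERT: partitioning by year only, so each
// write task produces at most one file per year instead of one file per
// (year, month, day) combination. col1/col2 stand in for the real column list.
sqlContext.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
sqlContext.sql("""
  INSERT OVERWRITE TABLE tbl_by_year PARTITION (year)
  SELECT col1, col2, year
  FROM tbl_tsv
""")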


Cheng

So you want to dump all data into a single large Parquet file?

On 10/7/15 1:55 PM, Younes Naguib wrote:


The original TSV data is 600 GB, and the job generated 40k Parquet files of 15-25 MB each.

y






Re: Parquet file size

2015-10-07 Thread Deng Ching-Mallete
Hi,

In our case, we're using
the org.apache.hadoop.mapreduce.lib.input.FileInputFormat.SPLIT_MINSIZE to
increase the size of the RDD partitions when loading text files, so it
would generate larger parquet files. We just set it in the Hadoop conf of
the SparkContext. You need to be careful though about setting it to a large
value, as you might encounter issues related to this:

https://issues.apache.org/jira/browse/SPARK-6235

For our jobs, we're setting the split size to 512 MB, which generates Parquet
files of 110-200 MB with the default compression. We're using Spark 1.3.1,
btw, and we also use the same year/month/day partitioning for our Parquet files.
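
In code, that looks roughly like this (a sketch; sc is the spark-shell
SparkContext and the input path is made up):

import org.apache.hadoop.mapreduce.lib.input.FileInputFormat

// Raise the minimum input split size to 512 MB on the SparkContext's Hadoop
// configuration, so text files are read into fewer, larger partitions and the
// Parquet files written from them come out larger as well.
sc.hadoopConfiguration.setLong(FileInputFormat.SPLIT_MINSIZE, 512L * 1024 * 1024)

val lines = sc.textFile("/data/input/large.tsv")   // hypothetical input path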

HTH,
Deng



RE: Parquet file size

2015-10-07 Thread Younes Naguib
Well, I only have data for 2015-08, so in the end there are only 31 partitions.
What I'm looking for is some reasonably sized partitions.
In any case, just being able to control the size or number of the output
Parquet files would be nice.
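
For instance, something along these lines would give that kind of control (a
sketch only, assuming a Hive-enabled context where DISTRIBUTE BY is available;
the column list is illustrative):

// DISTRIBUTE BY routes all rows of a given (year, month, day) combination to a
// single task, so each partition directory ends up with one larger Parquet file
// instead of one small file per task. col1/col2 stand in for the real columns;
// this assumes dynamic partition inserts are already configured, as in the original job.
sqlContext.sql("""
  INSERT OVERWRITE TABLE tbl PARTITION (year, month, day)
  SELECT col1, col2, year, month, day
  FROM tbl_tsv
  DISTRIBUTE BY year, month, day
""")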

Younes Naguib, Streaming Division
Triton Digital | 1440 Ste-Catherine W., Suite 1200 | Montreal, QC  H3G 1R8
Tel.: +1 514 448 4037 x2688 | +1 866 448 4037 x2688 | younes.nag...@tritondigital.com






Re: Parquet file size

2015-10-07 Thread Cheng Lian
Why do you want larger files? Doesn't the result Parquet file contain 
all the data in the original TSV file?


Cheng







Parquet file size

2015-10-07 Thread Younes Naguib
Hi,

I'm reading a large TSV file and creating Parquet files using Spark SQL:
insert overwrite
table tbl partition(year, month, day)
Select  from tbl_tsv;

This works nicely, but it generates small Parquet files (about 15 MB).
I'd like to generate larger files; any idea how to address this?

Thanks,
Younes Naguib
Triton Digital | 1440 Ste-Catherine W., Suite 1200 | Montreal, QC  H3G 1R8
Tel.: +1 514 448 4037 x2688 | Tel.: +1 866 448 4037 x2688 | 
younes.nag...@tritondigital.com