How many tasks are there in the write job? Since each task may write one file for each partition, you may end up with taskNum * 31 files.

Increasing SPLIT_MINSIZE does help reduce the number of tasks. Another way to address this is to call DataFrame.coalesce(n) to shrink the number of tasks to n explicitly.
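
For example, something along these lines (an untested sketch assuming the Spark 1.4+ DataFrameWriter API; the table name, output path, and n = 8 are just placeholders):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    // Coalesce to n tasks before writing, so each partition directory
    // receives at most n files. All names here are placeholders.
    val sc = new SparkContext(new SparkConf().setAppName("parquet-coalesce"))
    val sqlContext = new HiveContext(sc)

    sqlContext.table("tbl_tsv")
      .coalesce(8)                           // n = 8 write tasks
      .write
      .partitionBy("year", "month", "day")
      .mode("overwrite")
      .parquet("hdfs:///warehouse/tbl")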

Cheng

On 10/7/15 6:40 PM, Younes Naguib wrote:
Thanks, I'll try that.

Younes Naguib | Streaming Division

Triton Digital | 1440 Ste-Catherine W., Suite 1200 | Montreal, QC H3G 1R8

Tel.: +1 514 448 4037 x2688 | Tel.: +1 866 448 4037 x2688 | younes.nag...@tritondigital.com

------------------------------------------------------------------------
From: odeach...@gmail.com [odeach...@gmail.com] on behalf of Deng Ching-Mallete [och...@apache.org]
Sent: Wednesday, October 07, 2015 9:14 PM
To: Younes Naguib
Cc: Cheng Lian; user@spark.apache.org
Subject: Re: Parquet file size

Hi,

In our case, we're using org.apache.hadoop.mapreduce.lib.input.FileInputFormat.SPLIT_MINSIZE to increase the size of the RDD partitions when loading text files, so that it generates larger Parquet files. We just set it in the Hadoop conf of the SparkContext. You need to be careful, though, about setting it to a large value, as you might encounter issues related to this:
https://issues.apache.org/jira/browse/SPARK-6235

For our jobs, we set the split size to 512MB, which generates Parquet files between 110MB and 200MB using the default compression. We're using Spark 1.3.1, btw, and we also have the same year/month/day partitioning for our Parquet files.
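
In code it's roughly this (a minimal sketch of the setting described above; the app name is just a placeholder):

    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
    import org.apache.spark.{SparkConf, SparkContext}

    // Raise the minimum input split size to 512MB so the text input is read
    // into fewer, larger partitions before the Parquet files are written.
    val sc = new SparkContext(new SparkConf().setAppName("parquet-split-minsize"))
    sc.hadoopConfiguration.setLong(FileInputFormat.SPLIT_MINSIZE, 512L * 1024 * 1024)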

HTH,
Deng

On Thu, Oct 8, 2015 at 8:25 AM, Younes Naguib <younes.nag...@tritondigital.com> wrote:

    Well, I only have data for 2015-08, so in the end there are only 31
    partitions.
    What I'm looking for is reasonably sized partitions.
    In any case, just being able to control the size or number of the
    output Parquet files would be nice.

    Younes Naguib | Streaming Division

    Triton Digital | 1440 Ste-Catherine W., Suite 1200 | Montreal, QC H3G 1R8

    Tel.: +1 514 448 4037 x2688 | Tel.: +1 866 448 4037 x2688 |
    younes.nag...@tritondigital.com

    ------------------------------------------------------------------------
    From: Cheng Lian [lian.cs....@gmail.com]
    Sent: Wednesday, October 07, 2015 7:01 PM
    To: Younes Naguib; 'user@spark.apache.org'
    Subject: Re: Parquet file size

    The reason so many small files are generated is probably that you
    are inserting into a partitioned table with three partition columns.

    If you want large Parquet files, you may try either avoiding the
    partitioned table, or using fewer partition columns (e.g., only
    year, without month and day).
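
    For instance, partitioning by year only could look roughly like this
    (an untested sketch; the column names ts and value are placeholders
    for your actual schema):

        import org.apache.spark.sql.hive.HiveContext

        // Same dynamic-partition insert, but keyed on year only, so each
        // write task produces at most one file per year instead of per day.
        val hiveContext = new HiveContext(sc)  // sc: an existing SparkContext
        hiveContext.sql("SET hive.exec.dynamic.partition=true")
        hiveContext.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
        hiveContext.sql(
          """INSERT OVERWRITE TABLE tbl PARTITION (year)
            |SELECT ts, value, year FROM tbl_tsv""".stripMargin)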

    Cheng

    So you want to dump all data into a single large Parquet file?

    On 10/7/15 1:55 PM, Younes Naguib wrote:

    The original TSV files total 600GB and generated 40k Parquet files of 15-25MB each.

    y

    From: Cheng Lian [mailto:lian.cs....@gmail.com]
    Sent: October-07-15 3:18 PM
    To: Younes Naguib; 'user@spark.apache.org'
    Subject: Re: Parquet file size

    Why do you want larger files? Doesn't the resulting Parquet file
    contain all the data in the original TSV file?

    Cheng

    On 10/7/15 11:07 AM, Younes Naguib wrote:

        Hi,

        I'm reading a large TSV file and creating Parquet files
        using Spark SQL:

        insert overwrite table tbl partition(year, month, day) ....
        select .... from tbl_tsv;

        This works nicely, but generates small Parquet files (15MB).

        I want to generate larger files. Any idea how to address this?

        Thanks,

        Younes Naguib

        Triton Digital | 1440 Ste-Catherine W., Suite 1200 |
        Montreal, QC  H3G 1R8

        Tel.: +1 514 448 4037 x2688 | Tel.: +1 866 448 4037 x2688 |
        younes.nag...@tritondigital.com



