Damian Momot created SPARK-18556:
------------------------------------

             Summary: Suboptimal number of tasks when writing partitioned data 
with desired number of files per directory
                 Key: SPARK-18556
                 URL: https://issues.apache.org/jira/browse/SPARK-18556
             Project: Spark
          Issue Type: Improvement
    Affects Versions: 2.0.2, 2.0.1, 2.0.0
            Reporter: Damian Momot


There is currently no way to get an optimal number of write tasks when the desired number of files 
per directory is known. Example:

When saving data to HDFS:

1. The data is supposed to be partitioned by a column (for example date) and contains, say, 90 
different dates
2. There is upfront knowledge that each date should be written into X files (for example 4, 
because of the recommended HDFS/Parquet block size etc.)
3. During processing, the dataset was partitioned into 200 partitions (for example because of 
some grouping operations)

Currently we can do:

{code}
val data: Dataset[Row] = ???

data
  .write
  .partitionBy("date")
  .parquet("/xyz")
{code}

This correctly writes data into 90 date directories (see point 1), but each directory will 
contain 200 files (see point 3)
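
For reference, the 200 partitions from point 3 usually come from the default shuffle partition 
setting; a quick way to confirm this (sketch, assuming a SparkSession named spark):

{code}
// spark.sql.shuffle.partitions defaults to 200 in Spark 2.x, so any
// groupBy/join result has 200 partitions unless the setting is overridden
spark.conf.get("spark.sql.shuffle.partitions")   // "200" by default

val data: Dataset[Row] = ???   // result of some grouping operation
data.rdd.getNumPartitions      // 200 -> 200 files per date directory when written
{code}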

We can force the number of files by using repartition/coalesce:

{code}
val data: Dataset[Row] = ???

data
  .repartition(4)
  .write
  .partitionBy("date")
  .parquet("xyz")
{code}

This correctly saves 90 directories with 4 files each... but it does so using only 4 tasks, 
which is far too slow: the 360 files could be written in parallel by 360 tasks
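
A possible workaround (not a real fix, hence this improvement request) is to repartition by the 
partition column plus a synthetic bucket column, so the write runs with roughly 90 * 4 = 360 
tasks while keeping about 4 files per date directory. A minimal sketch, assuming the Spark 2.x 
Dataset API; the "bucket" column name and the counts are illustrative, and hash collisions mean 
the per-directory file count is only approximate:

{code}
import org.apache.spark.sql.functions._

val data: Dataset[Row] = ???
val filesPerDate = 4   // desired files per date directory (assumption)
val numDates = 90      // number of distinct dates (assumption)

data
  // synthetic bucket column with values in [0, filesPerDate)
  .withColumn("bucket", (rand() * filesPerDate).cast("int"))
  // hash-partition on (date, bucket) so up to 360 tasks each write ~1 file per directory
  .repartition(numDates * filesPerDate, col("date"), col("bucket"))
  .drop("bucket")
  .write
  .partitionBy("date")
  .parquet("/xyz")
{code}

Even then the per-directory file counts are not exact, which is why a first-class way to request 
X files per partition directory would help.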


