[jira] [Commented] (SPARK-31588) merge small files may need more common setting

philipse (Jira) Thu, 07 May 2020 08:23:06 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-31588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17101775#comment-17101775
 ]


philipse commented on SPARK-31588:
----------------------------------

For example:

Suppose a job outputs 3 files of 10M, 50M, and 200M, and the block size is 
128M. We want to keep the file sizes close to the average, but the average 
should never be smaller than the block size, in case someone sets the wrong 
parameters.

Case 1: we set the target size to 60M. The expected average file size is 
max(blocksize, 60M) = 128M, and the repartition number is the integer file 
count floor(total_file_size / average_file_size) + 1 = floor(260M / 128M) + 1 
= 3.

The final result will be 3 files, with sizes 128M, 128M, and 4M.

 

Case 2: if we set the target size to 5120M, the output is repartitioned into 
1 file of 260M.

Thus we can set the target size as a global parameter, and it will benefit 
all tasks.
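
The arithmetic in the two cases above can be checked with a small sketch. 
This is only an illustration of the formula in this comment, not an existing 
Spark API: the function name repartition_count and the MB constant are 
hypothetical, and the sketch computes only the file count, not how the bytes 
end up distributed across the output files.

```python
MB = 1024 * 1024  # hypothetical helper constant: bytes per megabyte


def repartition_count(file_sizes, target_size, block_size):
    """Return the proposed number of output files.

    Implements the formula from the comment:
        floor(total_file_size / max(block_size, target_size)) + 1
    All sizes are in bytes.
    """
    # Never aim for files smaller than one block, even if the
    # configured target size is set too low.
    average_file_size = max(block_size, target_size)
    total_file_size = sum(file_sizes)
    # Integer division gives the floor; the +1 covers the remainder.
    return total_file_size // average_file_size + 1


sizes = [10 * MB, 50 * MB, 200 * MB]  # 260M in total

# Case 1: the target (60M) is below the block size, so 128M is used.
case_1 = repartition_count(sizes, target_size=60 * MB, block_size=128 * MB)

# Case 2: the target (5120M) is far above the total output size.
case_2 = repartition_count(sizes, target_size=5120 * MB, block_size=128 * MB)

print(case_1, case_2)
```

Note that with this formula a total that is an exact multiple of the average 
file size still gets one extra file, because of the unconditional +1.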

> merge small files may need more common setting
> ----------------------------------------------
>
>                 Key: SPARK-31588
>                 URL: https://issues.apache.org/jira/browse/SPARK-31588
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.5
>         Environment: spark:2.4.5
> hdp:2.7
>            Reporter: philipse
>            Priority: Major
>
> Hi,
> Spark SQL now allows us to use REPARTITION or COALESCE hints to manually 
> control small files, as in the following:
> /*+ REPARTITION(1) */
> /*+ COALESCE(1) */
> But this can only be tuned case by case: we have to decide whether to use 
> COALESCE or REPARTITION. Can we try a more common way that avoids this 
> decision by setting a target size, as Hive does?
> *Good points:*
> 1) the new partition number is also calculated for us
> 2) with an ON-OFF parameter provided, users can turn it off if needed
> 3) the parameter can be set at the cluster level instead of on the user 
> side, which makes it easier to control small files
> 4) it greatly reduces the pressure on the NameNode
>  
> *Not good points:*
> 1) It will add a new task to calculate the target number from statistics on 
> the output files.
>  
> I don't know whether this is already planned for the future.
>  
> Thanks



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

