[
https://issues.apache.org/jira/browse/SPARK-31588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17101775#comment-17101775
]
philipse commented on SPARK-31588:
----------------------------------
For example:
Suppose a job outputs 3 files of 10M, 50M and 200M, and the block size is
128M. We want to keep the file sizes close to the average, but we should also
keep each file at least as large as a block, in case someone sets the wrong
parameters.
Case 1: we set the target size to 60M. The expected average file size is
Max(blocksize, 60M) = 128M, and the repartition number is the integer file
count [total_file_size / average_file_size] + 1 = [260M / 128M] + 1 = 3.
The final result will be 3 files of 128M, 128M and 4M.
Case 2: we set the target size to 5120M. The output is then repartitioned
into 1 file of 260M.
Thus, we can set the target size as a global parameter, and it will benefit
all tasks.
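
As a rough sketch of the calculation above (plain Python, not actual Spark
code; the function and parameter names are hypothetical, and all sizes are
in MB):

```python
def target_partition_count(file_sizes_mb, target_size_mb, block_size_mb=128):
    """Return the repartition number described above (all sizes in MB)."""
    # Never aim below one block, in case the target size is misconfigured.
    average_size_mb = max(block_size_mb, target_size_mb)
    total_size_mb = sum(file_sizes_mb)
    # [total / average] + 1, where [] is integer (floor) division.
    return total_size_mb // average_size_mb + 1


sizes = [10, 50, 200]  # the three output files from the example, 260M total
print(target_partition_count(sizes, target_size_mb=60))    # prints 3
print(target_partition_count(sizes, target_size_mb=5120))  # prints 1
```

Note that with this formula a total that is an exact multiple of the average
size (for example 256M with a 128M block) yields one extra partition; a
ceiling division would avoid that.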
> merge small files may need more common setting
> ----------------------------------------------
>
> Key: SPARK-31588
> URL: https://issues.apache.org/jira/browse/SPARK-31588
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.4.5
> Environment: spark:2.4.5
> hdp:2.7
> Reporter: philipse
> Priority: Major
>
> Hi,
> Spark SQL now allows us to use REPARTITION or COALESCE hints to manually
> control small files, like the following:
> /*+ REPARTITION(1) */
> /*+ COALESCE(1) */
> But this can only be tuned case by case: we need to decide whether to use
> COALESCE or REPARTITION. Can we try a more common way that avoids this
> decision by setting a target size, as Hive does?
> *Good points:*
> 1) We will also get the new partition number.
> 2) With an ON-OFF parameter provided, users can turn it off if needed.
> 3) The parameter can be set at the cluster level instead of on the user
> side, which makes it easier to control small files.
> 4) It greatly reduces the pressure on the NameNode.
>
> *Not good points:*
> 1) It will add a new task to calculate the target partition number from
> statistics on the output files.
>
> I don't know whether this is already planned for the future.
>
> Thanks