[ https://issues.apache.org/jira/browse/SPARK-31588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17105530#comment-17105530 ]
philipse commented on SPARK-31588:
----------------------------------

Thanks, Hyukjin, for your advice. I will reconsider it.

> merge small files may need more common setting
> ----------------------------------------------
>
>                 Key: SPARK-31588
>                 URL: https://issues.apache.org/jira/browse/SPARK-31588
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.5
>         Environment: spark:2.4.5
> hdp:2.7
>            Reporter: philipse
>            Priority: Major
>
> Hi,
> Spark SQL now allows us to use REPARTITION or COALESCE hints to manually
> control small files, like the following:
> /*+ REPARTITION(1) */
> /*+ COALESCE(1) */
> But this can only be tuned case by case: for each query we need to decide
> whether to use COALESCE or REPARTITION. Can we try a more common way that
> reduces this decision to setting a target size, as Hive does?
> *Good points:*
> 1) we will also the new partitions number
> 2) With an on/off parameter provided, users can turn it off if needed.
> 3) The parameter can be set at the cluster level instead of on the user
> side, which makes it easier to control small files.
> 4) It greatly reduces the pressure on the NameNode.
>
> *Not good points:*
> 1) It will add a new task to calculate the target number from statistics on
> the output files.
>
> I don't know whether this is already planned for the future.
>
> Thanks
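
For illustration, the target-size idea in the issue could be sketched roughly as follows. This is only a sketch under stated assumptions, not anything Spark implements: the function names, the 128 MiB target, and the partition cap are hypothetical, and the total output size is assumed to be known already (the issue notes that computing it would cost an extra task).

```python
def target_partitions(total_bytes: int, target_file_bytes: int,
                      max_partitions: int = 10000) -> int:
    """Hypothetical helper: number of output files needed so that each file
    is roughly target_file_bytes in size."""
    if target_file_bytes <= 0:
        raise ValueError("target_file_bytes must be positive")
    if total_bytes <= 0:
        return 1
    # Integer ceiling division, clamped to the range [1, max_partitions].
    wanted = (total_bytes + target_file_bytes - 1) // target_file_bytes
    return max(1, min(max_partitions, wanted))


def choose_hint(current_partitions: int, target: int) -> str:
    """Hypothetical helper: pick one of the two hints quoted in the issue.

    COALESCE only reduces the partition count and avoids a shuffle;
    REPARTITION shuffles and can also increase the count.
    """
    if target < current_partitions:
        return f"/*+ COALESCE({target}) */"
    return f"/*+ REPARTITION({target}) */"


if __name__ == "__main__":
    # Assumed numbers: 10 GiB of output, 128 MiB target file size.
    n = target_partitions(10 * 1024**3, 128 * 1024**2)
    print(n, choose_hint(200, n))
```

With a helper like this, the per-query decision between COALESCE and REPARTITION would come down to a single cluster-level target size, which is the simplification the issue asks for.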