John, do you want to create a ticket and submit a patch for this? If there is a coalesce hint, inject a coalesce logical node. Pretty simple.
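
To make "if there is a coalesce hint, inject a coalesce logical node" concrete, here is a rough sketch of that rewrite over a toy logical plan. It is plain Python, not Spark code: the classes `Relation`, `Hint`, and `Repartition` and the function `resolve_coalesce_hints` are hypothetical names made up for illustration, and a real patch would be written against Spark's own analyzer and plan classes instead.

```python
from dataclasses import dataclass
from typing import Tuple, Union


# Toy plan nodes (hypothetical; these are NOT Spark's Catalyst classes).
@dataclass(frozen=True)
class Relation:
    name: str


@dataclass(frozen=True)
class Hint:
    name: str          # e.g. "COALESCE" or "BROADCAST"
    params: Tuple      # raw hint arguments, e.g. (10, False)
    child: "Plan"


@dataclass(frozen=True)
class Repartition:
    num_partitions: int
    shuffle: bool
    child: "Plan"


Plan = Union[Relation, Hint, Repartition]


def resolve_coalesce_hints(plan: Plan) -> Plan:
    """Replace every COALESCE hint with a Repartition node, bottom-up.

    Hints with other names are left in place for other rules to handle.
    """
    if isinstance(plan, Relation):
        return plan  # leaf node, nothing to rewrite

    child = resolve_coalesce_hints(plan.child)

    if isinstance(plan, Hint):
        if plan.name.upper() == "COALESCE":
            n = int(plan.params[0])
            # Assumption: shuffle defaults to False when not given,
            # mirroring the proposed COALESCE(n, shuffle) hint.
            shuffle = bool(plan.params[1]) if len(plan.params) > 1 else False
            return Repartition(n, shuffle, child)
        return Hint(plan.name, plan.params, child)

    return Repartition(plan.num_partitions, plan.shuffle, child)


# Roughly what SELECT /*+ COALESCE(10, false) */ * FROM t would turn into.
hinted = Hint("COALESCE", (10, False), Relation("t"))
print(resolve_coalesce_hints(hinted))
```

The rule only touches hints named COALESCE, so other hints (such as the broadcast join hints) pass through unchanged.
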

On Wed, Jul 25, 2018 at 2:48 PM John Zhuge <jzh...@apache.org> wrote:

> Thanks for the comment, Forest. What I am asking is to make the DF
> repartition/coalesce functionality available to SQL users.
>
> Agree with you that reducing the final number of output files by file
> size is very nice to have. Lukas indicated this is planned.
>
> On Wed, Jul 25, 2018 at 2:31 PM Forest Fang <forest.f...@outlook.com>
> wrote:
>
>> Sorry, I see https://issues.apache.org/jira/browse/SPARK-6221 was
>> referenced in John's email. Can you elaborate on how your requirement
>> is different? In my experience, it usually is driven by the need to
>> decrease the final output parallelism without compromising compute
>> parallelism (i.e. to prevent too many small files from being persisted
>> on HDFS). The requirement in my experience is often pretty ballpark and
>> does not require a precise number of partitions. Therefore setting the
>> desired output size to, say, 32-64 MB usually gives a good enough
>> result. I'm curious why 6221 was marked as won't fix.
>>
>> On Wed, Jul 25, 2018 at 2:26 PM Forest Fang <forest.f...@outlook.com>
>> wrote:
>>
>>> Has there been any discussion to simply support Hive's merge small files
>>> configuration? It simply adds one additional stage to inspect the size
>>> of each output file, recompute the desired parallelism to reach a target
>>> size, and run a map-only coalesce before committing the final files.
>>> Since AFAIK SparkSQL already stages the final output commit, it seems
>>> feasible to respect this Hive config.
>>>
>>> https://community.hortonworks.com/questions/106987/hive-multiple-small-files.html
>>>
>>> On Wed, Jul 25, 2018 at 1:55 PM Mark Hamstra <m...@clearstorydata.com>
>>> wrote:
>>>
>>>> See some of the related discussion under
>>>> https://github.com/apache/spark/pull/21589
>>>>
>>>> It feels to me like we need some kind of user code mechanism to signal
>>>> policy preferences to Spark. This could also include ways to signal
>>>> scheduling policy, which could include things like scheduling pool
>>>> and/or barrier scheduling. Some of those scheduling policies operate at
>>>> inherently different levels currently -- e.g. scheduling pools at the
>>>> Job level (really, the thread local level in the current
>>>> implementation) and barrier scheduling at the Stage level -- so it is
>>>> not completely obvious how to unify all of these policy
>>>> options/preferences/mechanisms, or whether it is possible, but I think
>>>> it is worth considering such things at a fairly high level of
>>>> abstraction and trying to unify and simplify before making things more
>>>> complex with multiple policy mechanisms.
>>>>
>>>> On Wed, Jul 25, 2018 at 1:37 PM Reynold Xin <r...@databricks.com>
>>>> wrote:
>>>>
>>>>> Seems like a good idea in general. Do other systems have similar
>>>>> concepts? In general it'd be easier if we can follow existing
>>>>> convention if there is any.
>>>>>
>>>>> On Wed, Jul 25, 2018 at 11:50 AM John Zhuge <jzh...@apache.org> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> Many Spark users in my company are asking for a way to control the
>>>>>> number of output files in Spark SQL. There are use cases to either
>>>>>> reduce or increase the number. The users prefer not to use the
>>>>>> functions repartition(n) or coalesce(n, shuffle), which require them
>>>>>> to write and deploy Scala/Java/Python code.
>>>>>>
>>>>>> Could we introduce a query hint for this purpose (similar to
>>>>>> Broadcast Join Hints)?
>>>>>>
>>>>>>     /*+ COALESCE(n, shuffle) */
>>>>>>
>>>>>> In general, is a query hint the best way to bring DF functionality
>>>>>> to SQL without extending SQL syntax? Any suggestion is highly
>>>>>> appreciated.
>>>>>>
>>>>>> This requirement is not the same as SPARK-6221, which asked for
>>>>>> auto-merging output files.
>>>>>>
>>>>>> Thanks,
>>>>>> John Zhuge
>>>>>>
>
> --
> John Zhuge