
You want to create a ticket and submit a patch for this? If there is a
coalesce hint, inject a coalesce logical node. Pretty simple.
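
A rough sketch of that idea, as a toy model in Python rather than Spark's actual Catalyst code. The node classes (Scan, Hint, Coalesce) and resolve_hints are hypothetical names made up for illustration, and the hint arguments are assumed to be (n, shuffle) as in the proposal quoted below.

```python
from dataclasses import dataclass
from typing import Tuple, Union


# Toy plan nodes for illustration only; these are NOT Spark's Catalyst classes.
@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Hint:
    name: str
    args: Tuple
    child: "Plan"


@dataclass(frozen=True)
class Coalesce:
    num_partitions: int
    shuffle: bool
    child: "Plan"


Plan = Union[Scan, Hint, Coalesce]


def resolve_hints(plan):
    """Replace COALESCE hint nodes with Coalesce nodes; keep everything else."""
    if isinstance(plan, Hint):
        child = resolve_hints(plan.child)
        if plan.name.upper() == "COALESCE" and plan.args:
            n = int(plan.args[0])
            # Assumption: shuffle defaults to False when the hint omits it.
            shuffle = bool(plan.args[1]) if len(plan.args) > 1 else False
            return Coalesce(n, shuffle, child)
        # Unknown hint: leave it in place for some other rule to handle.
        return Hint(plan.name, plan.args, child)
    if isinstance(plan, Coalesce):
        return Coalesce(plan.num_partitions, plan.shuffle, resolve_hints(plan.child))
    return plan
```

With this model, resolve_hints(Hint("COALESCE", (8, False), Scan("t"))) yields Coalesce(8, False, Scan("t")), while hints with other names pass through unchanged.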

On Wed, Jul 25, 2018 at 2:48 PM John Zhuge <> wrote:

> Thanks for the comment, Forest. What I am asking is to make whatever DF
> repartition/coalesce functionalities available to SQL users.
> Agree with you that reducing the final number of output files by file
> size is very nice to have. Lukas indicated this is planned.
> On Wed, Jul 25, 2018 at 2:31 PM Forest Fang <>
> wrote:
>> Sorry I see was
>> referenced in John's email. Can you elaborate on how your requirement is
>> different? In my experience, it is usually driven by the need to decrease
>> the final output parallelism without compromising compute parallelism (i.e.
>> to prevent too many small files from being persisted on HDFS). The requirement
>> in my experience is often pretty ballpark and does not require a precise
>> number of partitions. Therefore, setting the desired output size to, say,
>> 32-64 MB usually gives a good enough result. I'm curious why 6221 was marked
>> as won't fix.
>> On Wed, Jul 25, 2018 at 2:26 PM Forest Fang <>
>> wrote:
>>> Has there been any discussion about simply supporting Hive's merge small
>>> files configuration? It simply adds one additional stage to inspect the
>>> size of each output file, recompute the desired parallelism to reach a
>>> target size, and run a map-only coalesce before committing the final files.
>>> Since AFAIK SparkSQL already stages the final output commit, it seems
>>> feasible to respect this Hive config.
>>> On Wed, Jul 25, 2018 at 1:55 PM Mark Hamstra <>
>>> wrote:
>>>> See some of the related discussion under
>>>> It feels to me like we need some kind of user code mechanism to signal
>>>> policy preferences to Spark. This could also include ways to signal
>>>> scheduling policy, which could include things like scheduling pool and/or
>>>> barrier scheduling. Some of those scheduling policies operate at inherently
>>>> different levels currently -- e.g. scheduling pools at the Job level
>>>> (really, the thread local level in the current implementation) and barrier
>>>> scheduling at the Stage level -- so it is not completely obvious how to
>>>> unify all of these policy options/preferences/mechanisms, or whether it is
>>>> possible, but I think it is worth considering such things at a fairly high
>>>> level of abstraction and trying to unify and simplify before making things
>>>> more complex with multiple policy mechanisms.
>>>> On Wed, Jul 25, 2018 at 1:37 PM Reynold Xin <>
>>>> wrote:
>>>>> Seems like a good idea in general. Do other systems have similar
>>>>> concepts? In general it'd be easier if we can follow existing convention
>>>>> if there is any.
>>>>> On Wed, Jul 25, 2018 at 11:50 AM John Zhuge <> wrote:
>>>>>> Hi all,
>>>>>> Many Spark users in my company are asking for a way to control the
>>>>>> number of output files in Spark SQL. There are use cases to either reduce
>>>>>> or increase the number. The users prefer not to use functions such as
>>>>>> *repartition*(n) or *coalesce*(n, shuffle), which require them to
>>>>>> write and deploy Scala/Java/Python code.
>>>>>> Could we introduce a query hint for this purpose (similar to
>>>>>> Broadcast Join Hints)?
>>>>>>     /*+ *COALESCE*(n, shuffle) */
>>>>>> In general, is a query hint the best way to bring DF functionality
>>>>>> to SQL without extending SQL syntax? Any suggestions are highly
>>>>>> appreciated.
>>>>>> This requirement is not the same as SPARK-6221, which asked for
>>>>>> auto-merging output files.
>>>>>> Thanks,
>>>>>> John Zhuge
> --
> John Zhuge
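
For reference, the target-size calculation Forest describes above can be sketched roughly as follows. This is a minimal illustration in plain Python, not Hive's or Spark's implementation; target_partitions and its 64 MB default are assumptions chosen to match the 32-64 MB range mentioned in the thread.

```python
import math


def target_partitions(file_sizes_bytes, target_bytes=64 * 1024 * 1024):
    """Return how many output files give roughly target_bytes per file.

    file_sizes_bytes: sizes of the staged output files, in bytes.
    target_bytes: desired size of each final file (assumed default: 64 MB).
    """
    if target_bytes <= 0:
        raise ValueError("target_bytes must be positive")
    total = sum(file_sizes_bytes)
    # Always keep at least one partition, even for empty output.
    return max(1, math.ceil(total / target_bytes))


# 1000 staged files of 1 MB each would be merged into 16 files of about 64 MB.
small_files = [1024 * 1024] * 1000
print(target_partitions(small_files))  # 16
```

The resulting number is the value one would pass to a map-only coalesce(n) before committing the final files.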
