Re: [DISCUSS][SQL] Control the number of output files

Reynold Xin Thu, 26 Jul 2018 16:26:11 -0700

John,

Do you want to create a ticket and submit a patch for this? If there is a
coalesce hint, inject a coalesce logical node. Pretty simple.
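
As a rough illustration of that suggestion, here is a toy sketch in plain Python. It is not Spark Catalyst code: the `Scan` and `Coalesce` classes and the `apply_coalesce_hint` helper are hypothetical stand-ins, and defaulting `shuffle` to false when it is omitted is an assumption. It only shows the shape of the idea: detect the hint, then wrap the plan in a coalesce node.

```python
import re
from dataclasses import dataclass

# Toy stand-ins for logical plan nodes; these are NOT Spark Catalyst classes.
@dataclass(frozen=True)
class Scan:
    table: str

@dataclass(frozen=True)
class Coalesce:
    num_partitions: int
    shuffle: bool
    child: object

# Matches the hint syntax proposed in the thread: /*+ COALESCE(n[, shuffle]) */
_HINT = re.compile(
    r"/\*\+\s*COALESCE\s*\(\s*(\d+)\s*(?:,\s*(true|false)\s*)?\)\s*\*/",
    re.IGNORECASE,
)

def apply_coalesce_hint(sql, plan):
    """Wrap `plan` in a Coalesce node if `sql` carries a COALESCE hint."""
    match = _HINT.search(sql)
    if match is None:
        return plan  # no hint: leave the plan untouched
    n = int(match.group(1))
    # Assumption: shuffle defaults to false when the hint omits it.
    shuffle = (match.group(2) or "false").lower() == "true"
    return Coalesce(n, shuffle, plan)
```

For example, `apply_coalesce_hint("SELECT /*+ COALESCE(8, true) */ * FROM t", Scan("t"))` wraps the scan in `Coalesce(8, True, ...)`, while a query without the hint is returned unchanged.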



On Wed, Jul 25, 2018 at 2:48 PM John Zhuge <jzh...@apache.org> wrote:

> Thanks for the comment, Forest. What I am asking for is to make the DF
> repartition/coalesce functionality available to SQL users.
>
> I agree that reducing the final number of output files based on file
> size would be very nice to have. Lukas indicated this is planned.
>
> On Wed, Jul 25, 2018 at 2:31 PM Forest Fang <forest.f...@outlook.com>
> wrote:
>
>> Sorry, I see https://issues.apache.org/jira/browse/SPARK-6221 was
>> referenced in John's email. Can you elaborate on how your requirement is
>> different? In my experience, it is usually driven by the need to decrease
>> the final output parallelism without compromising compute parallelism (i.e.
>> to prevent too many small files from being persisted on HDFS). The
>> requirement in my experience is often pretty ballpark and does not require
>> a precise number of partitions. Therefore setting the desired output size
>> to, say, 32-64 MB usually gives a good enough result. I'm curious why 6221
>> was marked as won't fix.
>>
>> On Wed, Jul 25, 2018 at 2:26 PM Forest Fang <forest.f...@outlook.com>
>> wrote:
>>
>>> Has there been any discussion about simply supporting Hive's merge small
>>> files configuration? It simply adds one additional stage that inspects the
>>> size of each output file, recomputes the desired parallelism to reach a
>>> target size, and runs a map-only coalesce before committing the final
>>> files. Since AFAIK SparkSQL already stages the final output commit, it
>>> seems feasible to respect this Hive config.
>>>
>>>
>>> https://community.hortonworks.com/questions/106987/hive-multiple-small-files.html
>>>
>>>
>>> On Wed, Jul 25, 2018 at 1:55 PM Mark Hamstra <m...@clearstorydata.com>
>>> wrote:
>>>
>>>> See some of the related discussion under
>>>> https://github.com/apache/spark/pull/21589
>>>>
>>>> It feels to me like we need some kind of user code mechanism to signal
>>>> policy preferences to Spark. This could also include ways to signal
>>>> scheduling policy, which could include things like scheduling pool and/or
>>>> barrier scheduling. Some of those scheduling policies operate at inherently
>>>> different levels currently -- e.g. scheduling pools at the Job level
>>>> (really, the thread local level in the current implementation) and barrier
>>>> scheduling at the Stage level -- so it is not completely obvious how to
>>>> unify all of these policy options/preferences/mechanisms, or whether it is
>>>> possible, but I think it is worth considering such things at a fairly high
>>>> level of abstraction and trying to unify and simplify before making things
>>>> more complex with multiple policy mechanisms.
>>>>
>>>> On Wed, Jul 25, 2018 at 1:37 PM Reynold Xin <r...@databricks.com>
>>>> wrote:
>>>>
>>>>> Seems like a good idea in general. Do other systems have similar
>>>>> concepts? In general it'd be easier if we can follow an existing
>>>>> convention if there is one.
>>>>>
>>>>>
>>>>> On Wed, Jul 25, 2018 at 11:50 AM John Zhuge <jzh...@apache.org> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> Many Spark users in my company are asking for a way to control the
>>>>>> number of output files in Spark SQL. There are use cases to either reduce
>>>>>> or increase the number. The users prefer not to use functions such as
>>>>>> repartition(n) or coalesce(n, shuffle), which require them to
>>>>>> write and deploy Scala/Java/Python code.
>>>>>>
>>>>>> Could we introduce a query hint for this purpose (similar to
>>>>>> Broadcast Join Hints)?
>>>>>>
>>>>>>     /*+ COALESCE(n, shuffle) */
>>>>>>
>>>>>> In general, is a query hint the best way to bring DF functionality
>>>>>> to SQL without extending SQL syntax? Any suggestion is highly
>>>>>> appreciated.
>>>>>>
>>>>>> This requirement is not the same as SPARK-6221, which asked for
>>>>>> auto-merging output files.
>>>>>>
>>>>>> Thanks,
>>>>>> John Zhuge
>>>>>>
>>>>>
>
> --
> John Zhuge
>
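
The size-driven approach Forest describes in the thread (inspect the size of each output file, then recompute the parallelism needed to reach a target size) comes down to a small calculation. Below is a minimal sketch in plain Python; `merged_file_count` is a hypothetical helper name, not a Hive or Spark API, and the rounding policy is an assumption.

```python
import math

def merged_file_count(file_sizes_bytes, target_bytes):
    """Number of output files needed so each is roughly `target_bytes`."""
    if target_bytes <= 0:
        raise ValueError("target_bytes must be positive")
    total = sum(file_sizes_bytes)
    # Assumption: round up so the average file does not exceed the target,
    # and always keep at least one output file, even for empty output.
    return max(1, math.ceil(total / target_bytes))

# 200 staged files of 1 MiB each, with a 64 MiB target (the upper end of
# the 32-64 MB range mentioned in the thread).
staged = [1 << 20] * 200
num_files = merged_file_count(staged, 64 << 20)
```

The resulting count would then be the argument to a final map-only coalesce, which is the step the thread says Hive runs before committing the output.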
