Re: [DISCUSS][SQL] Control the number of output files

John Zhuge Thu, 26 Jul 2018 16:57:42 -0700

Filed https://issues.apache.org/jira/browse/SPARK-24940. Will upload a
patch shortly.


SPARK-20857 introduced a generic SQL hint framework in 2.2.0.
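
A minimal sketch of the idea Reynold describes below (if there is a coalesce hint, inject a coalesce logical node), written as a toy model in plain Python. The classes Scan, Hint and Repartition and the function resolve_coalesce_hint are hypothetical names made up for illustration; they are not Spark's Catalyst classes, and the real patch may look quite different.

```python
from dataclasses import dataclass, replace
from typing import Union


# Toy stand-ins for logical plan nodes (hypothetical, not Spark's classes).
@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Hint:
    name: str
    args: tuple
    child: "Plan"


@dataclass(frozen=True)
class Repartition:
    num_partitions: int
    shuffle: bool
    child: "Plan"


Plan = Union[Scan, Hint, Repartition]


def resolve_coalesce_hint(plan: Plan) -> Plan:
    """Rewrite the plan bottom-up, replacing COALESCE hints with Repartition nodes."""
    if isinstance(plan, Scan):
        return plan  # leaf node, nothing to rewrite

    child = resolve_coalesce_hint(plan.child)

    if isinstance(plan, Hint) and plan.name.upper() == "COALESCE":
        if not plan.args:
            raise ValueError("COALESCE hint needs a partition count")
        n = plan.args[0]
        # Assumption: the shuffle flag is optional and defaults to False.
        shuffle = plan.args[1] if len(plan.args) > 1 else False
        if isinstance(n, bool) or not isinstance(n, int) or n <= 0:
            raise ValueError("partition count must be a positive integer")
        return Repartition(n, bool(shuffle), child)

    # Any other node (including unrelated hints) is kept, with its child rewritten.
    return replace(plan, child=child)
```

For example, resolve_coalesce_hint(Hint("COALESCE", (10,), Scan("t"))) yields Repartition(10, False, Scan("t")), while hints with other names pass through unchanged.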

On Thu, Jul 26, 2018 at 4:25 PM Reynold Xin <r...@databricks.com> wrote:

> John,
>
> You want to create a ticket and submit a patch for this? If there is a
> coalesce hint, inject a coalesce logical node. Pretty simple.
>
>
> On Wed, Jul 25, 2018 at 2:48 PM John Zhuge <jzh...@apache.org> wrote:
>
>> Thanks for the comment, Forest. What I am asking for is to make the DF
>> repartition/coalesce functionality available to SQL users.
>>
>> I agree that reducing the final number of output files by file size
>> would be very nice to have. Lukas indicated this is planned.
>>
>> On Wed, Jul 25, 2018 at 2:31 PM Forest Fang <forest.f...@outlook.com>
>> wrote:
>>
>>> Sorry, I see https://issues.apache.org/jira/browse/SPARK-6221 was
>>> referenced in John's email. Can you elaborate on how your requirement is
>>> different? In my experience, it is usually driven by the need to decrease
>>> the final output parallelism without compromising compute parallelism (i.e.
>>> to prevent too many small files from being persisted on HDFS). The
>>> requirement in my experience is often pretty ballpark and does not require a
>>> precise number of partitions. Therefore setting the desired output size to,
>>> say, 32-64 MB usually gives a good enough result. I'm curious why 6221 was
>>> marked as won't fix.
>>>
>>> On Wed, Jul 25, 2018 at 2:26 PM Forest Fang <forest.f...@outlook.com>
>>> wrote:
>>>
>>>> Has there been any discussion about simply supporting Hive's merge small
>>>> files configuration? It adds one additional stage that inspects the size of
>>>> each output file, recomputes the desired parallelism to reach a target size,
>>>> and runs a map-only coalesce before committing the final files. Since AFAIK
>>>> SparkSQL already stages the final output commit, it seems feasible to
>>>> respect this Hive config.
>>>>
>>>>
>>>> https://community.hortonworks.com/questions/106987/hive-multiple-small-files.html
>>>>
>>>>
>>>> On Wed, Jul 25, 2018 at 1:55 PM Mark Hamstra <m...@clearstorydata.com>
>>>> wrote:
>>>>
>>>>> See some of the related discussion under
>>>>> https://github.com/apache/spark/pull/21589
>>>>>
>>>>> It feels to me like we need some kind of user code mechanism to signal
>>>>> policy preferences to Spark. This could also include ways to signal
>>>>> scheduling policy, which could include things like scheduling pool and/or
>>>>> barrier scheduling. Some of those scheduling policies currently operate at
>>>>> inherently different levels -- e.g. scheduling pools at the Job level
>>>>> (really, the thread local level in the current implementation) and barrier
>>>>> scheduling at the Stage level -- so it is not completely obvious how to
>>>>> unify all of these policy options/preferences/mechanisms, or whether it is
>>>>> possible, but I think it is worth considering such things at a fairly high
>>>>> level of abstraction and trying to unify and simplify before making things
>>>>> more complex with multiple policy mechanisms.
>>>>>
>>>>> On Wed, Jul 25, 2018 at 1:37 PM Reynold Xin <r...@databricks.com>
>>>>> wrote:
>>>>>
>>>>>> Seems like a good idea in general. Do other systems have similar
>>>>>> concepts? In general it'd be easier if we can follow an existing
>>>>>> convention if there is one.
>>>>>>
>>>>>>
>>>>>> On Wed, Jul 25, 2018 at 11:50 AM John Zhuge <jzh...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> Many Spark users in my company are asking for a way to control the
>>>>>>> number of output files in Spark SQL. There are use cases to either
>>>>>>> reduce or increase the number. The users prefer not to use the functions
>>>>>>> repartition(n) or coalesce(n, shuffle), which require them to
>>>>>>> write and deploy Scala/Java/Python code.
>>>>>>>
>>>>>>> Could we introduce a query hint for this purpose (similar to
>>>>>>> Broadcast Join Hints)?
>>>>>>>
>>>>>>>     /*+ COALESCE(n, shuffle) */
>>>>>>>
>>>>>>> In general, is a query hint the best way to bring DF functionality
>>>>>>> to SQL without extending SQL syntax? Any suggestion is highly
>>>>>>> appreciated.
>>>>>>>
>>>>>>> This requirement is not the same as SPARK-6221, which asked for
>>>>>>> auto-merging output files.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> John Zhuge
>>>>>>>
>>>>>>
>>
>> --
>> John Zhuge
>>
>

-- 
John Zhuge
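
For reference, the size-based approach Forest describes in the thread (inspect the size of each output file, then recompute the parallelism needed to reach a target size) comes down to a small calculation. Below is a rough sketch in plain Python; the function name target_num_files and its parameters are hypothetical, and this is neither Hive's nor Spark's actual implementation.

```python
from typing import Iterable


def target_num_files(file_sizes_bytes: Iterable[int], target_file_bytes: int) -> int:
    """Return how many output files are needed so each is about target_file_bytes.

    Assumption: data can be redistributed evenly, so only the total size matters.
    """
    if target_file_bytes <= 0:
        raise ValueError("target_file_bytes must be positive")
    total = sum(file_sizes_bytes)
    # Integer ceiling division; always write at least one file.
    return max(1, -(-total // target_file_bytes))


MIB = 1024 * 1024

# 1000 small files of 1 MiB each, merged toward a 64 MiB target size.
num_files = target_num_files([1 * MIB] * 1000, 64 * MIB)
print(num_files)
```

The resulting count is the kind of value one might pass as n to coalesce(n), or to the hint proposed in this thread. As Forest notes, the target is a ballpark figure, so the estimate does not need to be exact.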
