Hi Koert,
There is no such Jira yet. We need SPARK-23889 first. You can find some
mentions in the design document inside SPARK-23889.
Best regards
Lukas

2018-08-06 18:34 GMT+02:00 Koert Kuipers <ko...@tresata.com>:

> i went through the jiras targeting 2.4.0 trying to find a feature where
> spark would coalesce/repartition by size (so merge small files
> automatically), but didn't find it.
> can someone point me to it?
> thank you.
> best,
> koert
>
> On Sun, Aug 5, 2018 at 9:06 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> lukas,
>> what is the jira ticket for this? i would like to follow its activity.
>> thanks!
>> koert
>>
>> On Wed, Jul 25, 2018 at 5:32 PM, lukas nalezenec <lu...@apache.org>
>> wrote:
>>
>>> Hi,
>>> Yes, this feature is planned: Spark should soon be able to repartition
>>> output by size.
>>> Lukas
>>>
>>>
>>> On Wed, Jul 25, 2018 at 11:26 PM, Forest Fang <forest.f...@outlook.com>
>>> wrote:
>>>
>>>> Has there been any discussion of simply supporting Hive's merge small
>>>> files configuration? It simply adds one additional stage that inspects the
>>>> size of each output file, recomputes the desired parallelism to reach a
>>>> target size, and runs a map-only coalesce before committing the final
>>>> files. Since AFAIK SparkSQL already stages the final output commit, it
>>>> seems feasible to respect this Hive config.
>>>>
>>>> https://community.hortonworks.com/questions/106987/hive-multiple-small-files.html
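The "recompute the desired parallelism to reach a target size" step described above boils down to a ceiling division over the total output size. A minimal sketch with hypothetical numbers (this is illustrative only, not Spark's or Hive's actual code):

```python
# Sketch of the parallelism recomputation: given the sizes of the files a
# job produced and a target size per file, how many output partitions
# should the map-only coalesce use?

def desired_partitions(file_sizes_bytes, target_bytes):
    """Number of output files needed so each is roughly target_bytes."""
    total = sum(file_sizes_bytes)
    # Ceiling division, with at least one partition.
    return max(1, -(-total // target_bytes))

# e.g. 1000 small files of 1 MB each, targeting 128 MB files -> 8 partitions
print(desired_partitions([1_000_000] * 1000, 128 * 1024 * 1024))  # -> 8
```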
>>>>
>>>>
>>>> On Wed, Jul 25, 2018 at 1:55 PM Mark Hamstra <m...@clearstorydata.com>
>>>> wrote:
>>>>
>>>>> See some of the related discussion under https://github.com/apache/spark/pull/21589
>>>>>
>>>>> It feels to me like we need some kind of user code mechanism to signal
>>>>> policy preferences to Spark. This could also include ways to signal
>>>>> scheduling policy, such as scheduling pools and/or barrier scheduling.
>>>>> Some of those scheduling policies currently operate at inherently
>>>>> different levels -- e.g. scheduling pools at the Job level (really, the
>>>>> thread-local level in the current implementation) and barrier scheduling
>>>>> at the Stage level -- so it is not completely obvious how to unify all of
>>>>> these policy options/preferences/mechanisms, or whether it is even
>>>>> possible. But I think it is worth considering such things at a fairly
>>>>> high level of abstraction, and trying to unify and simplify before making
>>>>> things more complex with multiple policy mechanisms.
>>>>>
>>>>> On Wed, Jul 25, 2018 at 1:37 PM Reynold Xin <r...@databricks.com>
>>>>> wrote:
>>>>>
>>>>>> Seems like a good idea in general. Do other systems have similar
>>>>>> concepts? In general it'd be easier if we could follow an existing
>>>>>> convention, if there is one.
>>>>>>
>>>>>>
>>>>>> On Wed, Jul 25, 2018 at 11:50 AM John Zhuge <jzh...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> Many Spark users in my company are asking for a way to control the
>>>>>>> number of output files in Spark SQL. There are use cases to either
>>>>>>> reduce or increase the number. The users prefer not to use the
>>>>>>> functions *repartition*(n) or *coalesce*(n, shuffle), which require
>>>>>>> them to write and deploy Scala/Java/Python code.
>>>>>>>
>>>>>>> Could we introduce a query hint for this purpose (similar to
>>>>>>> Broadcast Join Hints)?
>>>>>>>
>>>>>>>     /*+ *COALESCE*(n, shuffle) */
>>>>>>>
>>>>>>> In general, is a query hint the best way to bring DataFrame
>>>>>>> functionality to SQL without extending SQL syntax? Any suggestion is
>>>>>>> highly appreciated.
>>>>>>>
>>>>>>> This requirement is not the same as SPARK-6221, which asked for
>>>>>>> auto-merging of output files.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> John Zhuge
>>>>>>>
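As a toy illustration of the hint syntax proposed above, a comment-style hint like /*+ COALESCE(n, shuffle) */ could be recognized in a SQL string roughly as follows. This is only a sketch of the idea; Spark's real parser handles hints in the grammar, not with a regex:

```python
# Toy sketch: extract the proposed COALESCE(n[, shuffle]) hint from a
# SQL string. Illustrative only.
import re

HINT_RE = re.compile(
    r"/\*\+\s*COALESCE\(\s*(\d+)(?:\s*,\s*(\w+))?\s*\)\s*\*/",
    re.IGNORECASE,
)

def coalesce_hint(sql):
    """Return (n, shuffle) from a COALESCE hint, or None if absent."""
    m = HINT_RE.search(sql)
    if m is None:
        return None
    n = int(m.group(1))
    shuffle = (m.group(2) or "").lower() == "true"
    return n, shuffle

print(coalesce_hint("SELECT /*+ COALESCE(8, true) */ * FROM t"))  # -> (8, True)
print(coalesce_hint("SELECT * FROM t"))                           # -> None
```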
>>>>>>
>>
>
