Re: [DISCUSS][SQL] Control the number of output files

2018-08-06 Thread lukas nalezenec
Hi Koert,
There is no such Jira yet; it depends on SPARK-23889 landing first. You can
find some mentions of it in the design document attached to SPARK-23889.
Best regards
Lukas

2018-08-06 18:34 GMT+02:00 Koert Kuipers :

> I went through the Jiras targeting 2.4.0 trying to find a feature where
> Spark would coalesce/repartition by size (i.e. merge small files
> automatically), but didn't find it.
> Can someone point me to it?
> Thank you.
> Best,
> Koert
>
> On Sun, Aug 5, 2018 at 9:06 PM, Koert Kuipers  wrote:
>
>> Lukas,
>> What is the Jira ticket for this? I would like to follow its activity.
>> Thanks!
>> Koert
>>
>> On Wed, Jul 25, 2018 at 5:32 PM, lukas nalezenec 
>> wrote:
>>
>>> Hi,
>>> Yes, this feature is planned -- Spark should soon be able to repartition
>>> output by size.
>>> Lukas
>>>
>>>
>>> On Wed, Jul 25, 2018 at 11:26 PM, Forest Fang 
>>> wrote:
>>>
>>>> Has there been any discussion to simply support Hive's merge-small-files
>>>> configuration? It simply adds one additional stage that inspects the size
>>>> of each output file, recomputes the desired parallelism to reach a target
>>>> size, and runs a map-only coalesce before committing the final files.
>>>> Since AFAIK Spark SQL already stages the final output commit, it seems
>>>> feasible to respect this Hive config.
>>>>
>>>> https://community.hortonworks.com/questions/106987/hive-multiple-small-files.html
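
To make the approach Forest describes concrete: the Hive knobs in question are
(roughly) hive.merge.mapfiles / hive.merge.mapredfiles together with
hive.merge.smallfiles.avgsize and hive.merge.size.per.task. Below is a rough,
hand-rolled Spark sketch of the same idea -- estimate the output size, derive a
file count from a target size per file, and finish with a map-only coalesce.
The helper name, the 128 MB target, and the use of the optimized plan's
statistics (Spark 2.3+ API; accuracy depends on the source relation) are
illustrative assumptions, not an existing Spark feature.

    import org.apache.spark.sql.DataFrame

    // Hypothetical helper: pick the number of output files from an estimated
    // data size and a target size per file, then narrow with a map-only coalesce.
    def writeWithTargetFileSize(df: DataFrame, path: String,
                                targetBytesPerFile: Long = 128L * 1024 * 1024): Unit = {
      // Rough size estimate from the optimized plan's statistics; file-based
      // relations usually report a usable sizeInBytes.
      val estimatedBytes = df.queryExecution.optimizedPlan.stats.sizeInBytes
      val numFiles = math.max(1L, (estimatedBytes / BigInt(targetBytesPerFile)).toLong + 1L).toInt
      // coalesce() merges partitions without a shuffle -- the "map-only" step
      // before the final commit.
      df.coalesce(numFiles).write.mode("overwrite").parquet(path)
    }

Since coalesce() avoids a shuffle, the resulting files can be skewed if the
upstream partitions vary widely in size; a repartition() would balance them at
the cost of a shuffle.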
>>>>
>>>>
>>>> On Wed, Jul 25, 2018 at 1:55 PM Mark Hamstra 
>>>> wrote:
>>>>
>>>>> See some of the related discussion under
>>>>> https://github.com/apache/spark/pull/21589
>>>>>
>>>>> It feels to me like we need some kind of user-code mechanism to signal
>>>>> policy preferences to Spark. This could also include ways to signal
>>>>> scheduling policy, which could include things like scheduling pools and/or
>>>>> barrier scheduling. Some of those scheduling policies currently operate at
>>>>> inherently different levels -- e.g. scheduling pools at the Job level
>>>>> (really, the thread-local level in the current implementation) and barrier
>>>>> scheduling at the Stage level -- so it is not completely obvious how to
>>>>> unify all of these policy options/preferences/mechanisms, or whether it is
>>>>> even possible, but I think it is worth considering such things at a fairly
>>>>> high level of abstraction and trying to unify and simplify before making
>>>>> things more complex with multiple policy mechanisms.
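
To illustrate the "different levels" point above, here is a small sketch of how
the two existing mechanisms are expressed today: a scheduling pool is a
thread-local (job-level) property, while barrier mode is declared per stage on
an RDD. The pool name and the toy dataset are made up, the "etl" pool is
assumed to be defined in the fair-scheduler allocation file, and the barrier
API reflects the Spark 2.4-era interface.

    import org.apache.spark.BarrierTaskContext
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("policy-levels").getOrCreate()
    val sc = spark.sparkContext

    // Job level (really thread-local): jobs submitted from this thread go to the "etl" pool.
    sc.setLocalProperty("spark.scheduler.pool", "etl")

    // Stage level: barrier mode applies to the stage produced by this mapPartitions.
    val doubled = sc.parallelize(1 to 100, 4)
      .barrier()
      .mapPartitions { iter =>
        // All tasks of a barrier stage are launched together; barrier() is a global sync point.
        BarrierTaskContext.get().barrier()
        iter.map(_ * 2)
      }
    doubled.count()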
>>>>>
>>>>> On Wed, Jul 25, 2018 at 1:37 PM Reynold Xin 
>>>>> wrote:
>>>>>
>>>>>> Seems like a good idea in general. Do other systems have similar
>>>>>> concepts? In general it'd be easier if we could follow an existing
>>>>>> convention, if there is one.
>>>>>>
>>>>>>
>>>>>> On Wed, Jul 25, 2018 at 11:50 AM John Zhuge 
>>>>>> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> Many Spark users in my company are asking for a way to control the
>>>>>>> number of output files in Spark SQL. There are use cases for both
>>>>>>> reducing and increasing the number. The users prefer not to use the
>>>>>>> functions *repartition*(n) or *coalesce*(n, shuffle), which require
>>>>>>> them to write and deploy Scala/Java/Python code.
>>>>>>>
>>>>>>> Could we introduce a query hint for this purpose (similar to
>>>>>>> Broadcast Join Hints)?
>>>>>>>
>>>>>>> /*+ *COALESCE*(n, shuffle) */
>>>>>>>
>>>>>>> In general, is a query hint the best way to bring DataFrame
>>>>>>> functionality to SQL without extending the SQL syntax? Any suggestion
>>>>>>> is highly appreciated.
>>>>>>>
>>>>>>> This requirement is not the same as SPARK-6221, which asked for
>>>>>>> auto-merging of output files.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> John Zhuge
>>>>>>>
>>>>>>
>>
>
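
To make the contrast in John's proposal concrete: today, controlling the number
of output files means writing and deploying code that calls repartition or
coalesce, whereas the proposed hint would express the same intent in plain SQL.
A minimal sketch follows; the table name and paths are placeholders, and the
hint syntax is the one proposed in this thread, not a settled feature.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("output-file-count").getOrCreate()
    val df = spark.table("events")  // placeholder table

    // Today: control the file count from Scala/Java/Python code.
    df.repartition(200).write.mode("overwrite").parquet("/tmp/events_wide")  // shuffles; can increase or decrease
    df.coalesce(10).write.mode("overwrite").parquet("/tmp/events_narrow")    // no shuffle; can only decrease

    // Proposed: the same intent expressed purely in SQL via a hint, e.g.
    //   INSERT OVERWRITE TABLE events_out
    //   SELECT /*+ COALESCE(10, shuffle) */ * FROM events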

