UNSUBSCRIBE

2018-07-25 Thread sridhararao mutluri

Re: code freeze and branch cut for Apache Spark 2.4

2018-07-25 Thread Xingbo Jiang
Xiangrui and I are leading an effort to implement a highly desirable feature, Barrier Execution Mode. https://issues.apache.org/jira/browse/SPARK-24374. This introduces a new scheduling model to Apache Spark so users can properly embed distributed DL training as a Spark stage to simplify the

Re: [DISCUSS][SQL] Control the number of output files

2018-07-25 Thread John Zhuge
Thanks for the comment, Forest. What I am asking for is to make the DataFrame repartition/coalesce functionality available to SQL users. I agree with you that reducing the final number of output files by file size would be very nice to have. Lukas indicated this is planned. On Wed, Jul 25, 2018 at 2:31

Re: [DISCUSS][SQL] Control the number of output files

2018-07-25 Thread Forest Fang
Sorry, I see that https://issues.apache.org/jira/browse/SPARK-6221 was referenced in John's email. Can you elaborate on how your requirement is different? In my experience, it is usually driven by the need to decrease the final output parallelism without compromising compute parallelism (i.e. to prevent

Re: [DISCUSS][SQL] Control the number of output files

2018-07-25 Thread Forest Fang
Has there been any discussion of simply supporting Hive's merge small files configuration? It adds one additional stage that inspects the size of each output file, recomputes the desired parallelism to reach a target size, and runs a map-only coalesce before committing the final files. Since AFAIK
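
The "recompute the desired parallelism to reach a target size" step described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Hive's or Spark's actual implementation: the function name `target_file_count` and the sample sizes are hypothetical, and the sketch assumes the sizes of the already-written files are known in bytes.

```python
from typing import Sequence


def target_file_count(file_sizes: Sequence[int], target_bytes: int) -> int:
    """Return how many output files are needed so each is about target_bytes.

    Hypothetical helper for illustration; not part of Spark or Hive.
    """
    if target_bytes <= 0:
        raise ValueError("target_bytes must be positive")
    total = sum(file_sizes)
    # Integer ceiling division; always keep at least one output file.
    return max(1, (total + target_bytes - 1) // target_bytes)


# Assumed example: 200 small files of 4 MiB each, merged toward 128 MiB files.
sizes = [4 * 1024 * 1024] * 200
print(target_file_count(sizes, 128 * 1024 * 1024))  # prints 7
```

The resulting count is the kind of value one would then hand to a coalesce step before the final commit.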

Re: [DISCUSS][SQL] Control the number of output files

2018-07-25 Thread lukas nalezenec
Hi, Yes, this feature is planned: Spark should soon be able to repartition output by size. Lukas On Wed, 25 Jul 2018 at 23:26, Forest Fang wrote: > Has there been any discussion to simply support Hive's merge small files > configuration? It simply adds one additional stage to inspect

Re: [DISCUSS][SQL] Control the number of output files

2018-07-25 Thread Mark Hamstra
See some of the related discussion under https://github.com/apache/spark/pull/21589. It feels to me like we need some kind of user code mechanism to signal policy preferences to Spark. This could also include ways to signal scheduling policy, which could include things like scheduling pool and/or

Re: [DISCUSS][SQL] Control the number of output files

2018-07-25 Thread Reynold Xin
Seems like a good idea in general. Do other systems have similar concepts? In general it'd be easier if we could follow an existing convention, if there is one. On Wed, Jul 25, 2018 at 11:50 AM John Zhuge wrote: > Hi all, > > Many Spark users in my company are asking for a way to control the number

Re: [DISCUSS] Multiple catalog support

2018-07-25 Thread Ryan Blue
Quick update: I've updated my PR to add the table catalog API to implement this proposal. Here's the PR: https://github.com/apache/spark/pull/21306 On Mon, Jul 23, 2018 at 5:01 PM Ryan Blue wrote: > Lately, I’ve been working on implementing the new SQL logical plans. I’m > currently blocked

[DISCUSS][SQL] Control the number of output files

2018-07-25 Thread John Zhuge
Hi all, Many Spark users in my company are asking for a way to control the number of output files in Spark SQL. There are use cases for both reducing and increasing the number. The users prefer not to use the functions *repartition*(n) or *coalesce*(n, shuffle), which require them to write and deploy