Xiangrui and I are leading an effort to implement a highly desirable
feature, Barrier Execution Mode.
https://issues.apache.org/jira/browse/SPARK-24374. This introduces a new
scheduling model to Apache Spark so users can properly embed distributed DL
training as a Spark stage to simplify the
Thanks for the comment, Forest. What I am asking is to make the DataFrame
repartition/coalesce functionality available to SQL users.
Agree with you that reducing the final number of output files by file
size would be very nice to have. Lukas indicated this is planned.
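For concreteness, the SQL-level surface being asked for could look like a planner hint mirroring Dataset.repartition(n) / coalesce(n) from the DataFrame API; hints along these lines were later added to Spark SQL, but the exact syntax below is illustrative, not a committed API at the time of this thread:

```sql
-- Illustrative only: ask the planner for 10 output partitions,
-- the SQL analogue of Dataset.repartition(10) in the DataFrame API.
INSERT OVERWRITE TABLE reports
SELECT /*+ REPARTITION(10) */ *
FROM events
WHERE dt = '2018-07-25';
```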
On Wed, Jul 25, 2018 at 2:31
Sorry, I see https://issues.apache.org/jira/browse/SPARK-6221 was referenced in
John's email. Can you elaborate on how your requirement differs? In my
experience, it is usually driven by the need to decrease the final output
parallelism without compromising compute parallelism (i.e. to prevent
Has there been any discussion of simply supporting Hive's merge-small-files
configuration? It simply adds one additional stage that inspects the size of each
output file, recomputes the desired parallelism to reach a target size, and runs
a map-only coalesce before committing the final files. Since AFAIK
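The sizing step in that extra stage can be sketched in a few lines; this is a toy model of the "recompute parallelism from observed file sizes" idea, with a 128 MiB target chosen here only as an illustrative default:

```python
import math

def target_partitions(file_sizes_bytes, target_file_bytes=128 * 1024 * 1024):
    """Toy version of the post-hoc pass described above: given the sizes of
    the files a stage wrote, pick how many partitions a map-only coalesce
    should produce so each final file lands near the target size."""
    total = sum(file_sizes_bytes)
    if total == 0:
        return 1
    return max(1, math.ceil(total / target_file_bytes))

# 1000 small 1 MiB files collapse to 8 files of ~128 MiB each.
sizes = [1024 * 1024] * 1000
print(target_partitions(sizes))  # -> 8
```

A real implementation would read the sizes from the committed (but not yet finalized) task outputs rather than taking them as a list.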
Hi,
Yes, this feature is planned - Spark should soon be able to repartition
output by size.
Lukas
On Wed, Jul 25, 2018 at 23:26, Forest Fang
wrote:
> Has there been any discussion to simply support Hive's merge small files
> configuration? It simply adds one additional stage to inspect
See some of the related discussion under
https://github.com/apache/spark/pull/21589
It feels to me like we need some kind of user-code mechanism to signal
policy preferences to Spark. This could also include ways to signal
scheduling policy, which could include things like scheduling pool and/or
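One existing precedent for signaling scheduling policy from user code is Spark's fair scheduler: pools are declared in an allocation file (fairscheduler.xml, pointed to by spark.scheduler.allocation.file), and a job opts into a pool at runtime with sc.setLocalProperty("spark.scheduler.pool", "production"). A minimal pool definition, with illustrative weight and share values, looks like:

```xml
<?xml version="1.0"?>
<!-- fairscheduler.xml: referenced by spark.scheduler.allocation.file -->
<allocations>
  <pool name="production">
    <schedulingMode>FAIR</schedulingMode>
    <weight>2</weight>
    <minShare>4</minShare>
  </pool>
</allocations>
```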
Seems like a good idea in general. Do other systems have similar concepts?
In general it would be easier to follow an existing convention, if there
is one.
On Wed, Jul 25, 2018 at 11:50 AM John Zhuge wrote:
> Hi all,
>
> Many Spark users in my company are asking for a way to control the number
Quick update: I've updated my PR to add the table catalog API to implement
this proposal. Here's the PR: https://github.com/apache/spark/pull/21306
On Mon, Jul 23, 2018 at 5:01 PM Ryan Blue wrote:
> Lately, I’ve been working on implementing the new SQL logical plans. I’m
> currently blocked
Hi all,
Many Spark users in my company are asking for a way to control the number
of output files in Spark SQL. There are use cases to either reduce or
increase the number. The users prefer not to use the functions *repartition*(n)
or *coalesce*(n, shuffle), which require them to write and deploy
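The trade-off between those two functions is worth spelling out, since it drives which one users reach for. Below is a toy model (plain Python, not Spark's actual partition-grouping logic) of the key difference: coalesce merges whole partitions without a shuffle, so input skew carries through, while repartition shuffles individual records and rebalances:

```python
from itertools import cycle

def coalesce(partitions, n):
    """Toy model: merge whole partitions into n buckets without a shuffle.
    Records never move individually, so skewed inputs stay skewed."""
    out = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        out[i % n].extend(part)
    return out

def repartition(partitions, n):
    """Toy model: full shuffle, redistributing individual records
    round-robin. Balances skew, but every record moves."""
    out = [[] for _ in range(n)]
    buckets = cycle(range(n))
    for part in partitions:
        for record in part:
            out[next(buckets)].append(record)
    return out

parts = [[1, 2, 3, 4, 5], [6], [7]]  # skewed input partitions
print([len(p) for p in coalesce(parts, 2)])     # -> [6, 1]: skew preserved
print([len(p) for p in repartition(parts, 2)])  # -> [4, 3]: balanced
```

Either way, in Spark today the choice is made in application code, which is exactly the write-and-deploy burden the email describes.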