we have found that to make shuffles reliable without OOMs we need to have
spark.sql.shuffle.partitions at a high number, bigger than 2000 at least.
yet this leads to a large amount of part files, which puts big pressure on
spark driver programs.
i tried to mitigate this with dataframe.coalesce to
Hi Koert,
There is no such Jira yet. We need SPARK-23889 before. You can find some
mentions in the design document inside 23889.
Best regards
Lukas
2018-08-06 18:34 GMT+02:00 Koert Kuipers :
> i went through the jiras targeting 2.4.0 trying to find a feature where
> spark would coalesce/repartiti
i went through the jiras targeting 2.4.0 trying to find a feature where
spark would coalesce/repartition by size (so merge small files
automatically), but didn't find it.
can someone point me to it?
thank you.
best,
koert
On Sun, Aug 5, 2018 at 9:06 PM, Koert Kuipers wrote:
> lukas,
> what is th
Great help from the community!
On Sun, Aug 5, 2018 at 6:17 PM Xiao Li wrote:
> FYI, the new hints have been merged. They will be available in the
> upcoming release (Spark 2.4).
>
> *John Zhuge*, thanks for your work! Really appreciate it! Please submit
> more PRs and help the community improve
https://issues.apache.org/jira/browse/SPARK-24940
The PR has been merged to 2.4.0.
On Sun, Aug 5, 2018 at 6:06 PM Koert Kuipers wrote:
> lukas,
> what is the jira ticket for this? i would like to follow it's activity.
> thanks!
> koert
>
> On Wed, Jul 25, 2018 at 5:32 PM, lukas nalezenec wrote
FYI, the new hints have been merged. They will be available in the upcoming
release (Spark 2.4).
*John Zhuge*, thanks for your work! Really appreciate it! Please submit
more PRs and help the community improve Spark. : )
Xiao
2018-08-05 21:06 GMT-04:00 Koert Kuipers :
> lukas,
> what is the jira
lukas,
what is the jira ticket for this? i would like to follow it's activity.
thanks!
koert
On Wed, Jul 25, 2018 at 5:32 PM, lukas nalezenec wrote:
> Hi,
> Yes, This feature is planned - Spark should be soon able to repartition
> output by size.
> Lukas
>
>
> Dne st 25. 7. 2018 23:26 uživatel F
Filed https://issues.apache.org/jira/browse/SPARK-24940. Will upload a
patch shortly.
SPARK-20857 introduced a generic SQL Hint Framework since 2.2.0.
On Thu, Jul 26, 2018 at 4:25 PM Reynold Xin wrote:
> John,
>
> You want to create a ticket and submit a patch for this? If there is a
> coalesce
John,
You want to create a ticket and submit a patch for this? If there is a
coalesce hint, inject a coalesce logical node. Pretty simple.
On Wed, Jul 25, 2018 at 2:48 PM John Zhuge wrote:
> Thanks for the comment, Forest. What I am asking is to make whatever DF
> repartition/coalesce function
Thanks for the comment, Forest. What I am asking is to make whatever DF
repartition/coalesce functionalities available to SQL users.
Agree with you on that reducing the final number of output files by file
size is very nice to have. Lukas indicated this is planned.
On Wed, Jul 25, 2018 at 2:31 PM
Sorry I see https://issues.apache.org/jira/browse/SPARK-6221 was referenced in
John's email. Can you elaborate how is your requirement different? In my
experience, it usually is driven by the need to decrease the final output
parallelism without compromising compute parallelism (i.e. to prevent
Sorry I see https://issues.apache.org/jira/browse/SPARK-6221 was referenced in
John's email. Can you elaborate how is your requirement different? In my
experience, it usually is driven by the need to decrease the final output
parallelism without compromising compute parallelism (i.e. to prevent
Has there been any discussion to simply support Hive's merge small files
configuration? It simply adds one additional stage to inspect size of each
output file, recompute the desired parallelism to reach a target size, and runs
a map-only coalesce before committing the final files. Since AFAIK S
Has there been any discussion to simply support Hive's merge small files
configuration? It simply adds one additional stage to inspect size of each
output file, recompute the desired parallelism to reach a target size, and runs
a map-only coalesce before committing the final files. Since AFAIK S
Hi,
Yes, This feature is planned - Spark should be soon able to repartition
output by size.
Lukas
Dne st 25. 7. 2018 23:26 uživatel Forest Fang
napsal:
> Has there been any discussion to simply support Hive's merge small files
> configuration? It simply adds one additional stage to inspect size
Has there been any discussion to simply support Hive's merge small files
configuration? It simply adds one additional stage to inspect size of each
output file, recompute the desired parallelism to reach a target size, and runs
a map-only coalesce before committing the final files. Since AFAIK S
Sorry I see https://issues.apache.org/jira/browse/SPARK-6221 was referenced in
John's email. Can you elaborate how is your requirement different? In my
experience, it usually is driven by the need to decrease the final output
parallelism without compromising compute parallelism (i.e. to prevent
Sorry I see https://issues.apache.org/jira/browse/SPARK-6221 was referenced in
John's email. Can you elaborate how is your requirement different? In my
experience, it usually is driven by the need to decrease the final output
parallelism without compromising compute parallelism (i.e. to prevent
Has there been any discussion to simply support Hive's merge small files
configuration? It simply adds one additional stage to inspect size of each
output file, recompute the desired parallelism to reach a target size, and runs
a map-only coalesce before committing the final files. Since AFAIK S
See some of the related discussion under
https://github.com/apache/spark/pull/21589
If feels to me like we need some kind of user code mechanism to signal
policy preferences to Spark. This could also include ways to signal
scheduling policy, which could include things like scheduling pool and/or
b
Seems like a good idea in general. Do other systems have similar concepts?
In general it'd be easier if we can follow existing convention if there is
any.
On Wed, Jul 25, 2018 at 11:50 AM John Zhuge wrote:
> Hi all,
>
> Many Spark users in my company are asking for a way to control the number
>
Hi all,
Many Spark users in my company are asking for a way to control the number
of output files in Spark SQL. There are use cases to either reduce or
increase the number. The users prefer not to use function *repartition*(n)
or *coalesce*(n, shuffle) that require them to write and deploy
Scala/J
22 matches
Mail list logo