Re: [DISCUSS][SQL] Control the number of output files

2018-08-10 Thread Koert Kuipers
we have found that to make shuffles reliable without OOMs we need to have spark.sql.shuffle.partitions at a high number, bigger than 2000 at least. yet this leads to a large amount of part files, which puts big pressure on spark driver programs. i tried to mitigate this with dataframe.coalesce to

Re: [DISCUSS][SQL] Control the number of output files

2018-08-06 Thread lukas nalezenec
Hi Koert, There is no such Jira yet. We need SPARK-23889 before. You can find some mentions in the design document inside 23889. Best regards Lukas 2018-08-06 18:34 GMT+02:00 Koert Kuipers : > i went through the jiras targeting 2.4.0 trying to find a feature where > spark would coalesce/repartiti

Re: [DISCUSS][SQL] Control the number of output files

2018-08-06 Thread Koert Kuipers
i went through the jiras targeting 2.4.0 trying to find a feature where spark would coalesce/repartition by size (so merge small files automatically), but didn't find it. can someone point me to it? thank you. best, koert On Sun, Aug 5, 2018 at 9:06 PM, Koert Kuipers wrote: > lukas, > what is th

Re: [DISCUSS][SQL] Control the number of output files

2018-08-05 Thread John Zhuge
Great help from the community! On Sun, Aug 5, 2018 at 6:17 PM Xiao Li wrote: > FYI, the new hints have been merged. They will be available in the > upcoming release (Spark 2.4). > > *John Zhuge*, thanks for your work! Really appreciate it! Please submit > more PRs and help the community improve

Re: [DISCUSS][SQL] Control the number of output files

2018-08-05 Thread John Zhuge
https://issues.apache.org/jira/browse/SPARK-24940 The PR has been merged to 2.4.0. On Sun, Aug 5, 2018 at 6:06 PM Koert Kuipers wrote: > lukas, > what is the jira ticket for this? i would like to follow it's activity. > thanks! > koert > > On Wed, Jul 25, 2018 at 5:32 PM, lukas nalezenec wrote

Re: [DISCUSS][SQL] Control the number of output files

2018-08-05 Thread Xiao Li
FYI, the new hints have been merged. They will be available in the upcoming release (Spark 2.4). *John Zhuge*, thanks for your work! Really appreciate it! Please submit more PRs and help the community improve Spark. : ) Xiao 2018-08-05 21:06 GMT-04:00 Koert Kuipers : > lukas, > what is the jira

Re: [DISCUSS][SQL] Control the number of output files

2018-08-05 Thread Koert Kuipers
lukas, what is the jira ticket for this? i would like to follow it's activity. thanks! koert On Wed, Jul 25, 2018 at 5:32 PM, lukas nalezenec wrote: > Hi, > Yes, This feature is planned - Spark should be soon able to repartition > output by size. > Lukas > > > Dne st 25. 7. 2018 23:26 uživatel F

Re: [DISCUSS][SQL] Control the number of output files

2018-07-26 Thread John Zhuge
Filed https://issues.apache.org/jira/browse/SPARK-24940. Will upload a patch shortly. SPARK-20857 introduced a generic SQL Hint Framework since 2.2.0. On Thu, Jul 26, 2018 at 4:25 PM Reynold Xin wrote: > John, > > You want to create a ticket and submit a patch for this? If there is a > coalesce

Re: [DISCUSS][SQL] Control the number of output files

2018-07-26 Thread Reynold Xin
John, You want to create a ticket and submit a patch for this? If there is a coalesce hint, inject a coalesce logical node. Pretty simple. On Wed, Jul 25, 2018 at 2:48 PM John Zhuge wrote: > Thanks for the comment, Forest. What I am asking is to make whatever DF > repartition/coalesce function

Re: [DISCUSS][SQL] Control the number of output files

2018-07-25 Thread John Zhuge
Thanks for the comment, Forest. What I am asking is to make whatever DF repartition/coalesce functionalities available to SQL users. Agree with you on that reducing the final number of output files by file size is very nice to have. Lukas indicated this is planned. On Wed, Jul 25, 2018 at 2:31 PM

Re: [DISCUSS][SQL] Control the number of output files

2018-07-25 Thread Forest Fang
Sorry I see https://issues.apache.org/jira/browse/SPARK-6221 was referenced in John's email. Can you elaborate how is your requirement different? In my experience, it usually is driven by the need to decrease the final output parallelism without compromising compute parallelism (i.e. to prevent

Re: [DISCUSS][SQL] Control the number of output files

2018-07-25 Thread Forest Fang
Sorry I see https://issues.apache.org/jira/browse/SPARK-6221 was referenced in John's email. Can you elaborate how is your requirement different? In my experience, it usually is driven by the need to decrease the final output parallelism without compromising compute parallelism (i.e. to prevent

Re: [DISCUSS][SQL] Control the number of output files

2018-07-25 Thread Forest Fang
Has there been any discussion to simply support Hive's merge small files configuration? It simply adds one additional stage to inspect size of each output file, recompute the desired parallelism to reach a target size, and runs a map-only coalesce before committing the final files. Since AFAIK S

Re: [DISCUSS][SQL] Control the number of output files

2018-07-25 Thread Forest Fang
Has there been any discussion to simply support Hive's merge small files configuration? It simply adds one additional stage to inspect size of each output file, recompute the desired parallelism to reach a target size, and runs a map-only coalesce before committing the final files. Since AFAIK S

Re: [DISCUSS][SQL] Control the number of output files

2018-07-25 Thread lukas nalezenec
Hi, Yes, This feature is planned - Spark should be soon able to repartition output by size. Lukas Dne st 25. 7. 2018 23:26 uživatel Forest Fang napsal: > Has there been any discussion to simply support Hive's merge small files > configuration? It simply adds one additional stage to inspect size

Re: [DISCUSS][SQL] Control the number of output files

2018-07-25 Thread Forest Fang
Has there been any discussion to simply support Hive's merge small files configuration? It simply adds one additional stage to inspect size of each output file, recompute the desired parallelism to reach a target size, and runs a map-only coalesce before committing the final files. Since AFAIK S

Re: [DISCUSS][SQL] Control the number of output files

2018-07-25 Thread Forest Fang
Sorry I see https://issues.apache.org/jira/browse/SPARK-6221 was referenced in John's email. Can you elaborate how is your requirement different? In my experience, it usually is driven by the need to decrease the final output parallelism without compromising compute parallelism (i.e. to prevent

Re: [DISCUSS][SQL] Control the number of output files

2018-07-25 Thread Forest Fang
Sorry I see https://issues.apache.org/jira/browse/SPARK-6221 was referenced in John's email. Can you elaborate how is your requirement different? In my experience, it usually is driven by the need to decrease the final output parallelism without compromising compute parallelism (i.e. to prevent

Re: [DISCUSS][SQL] Control the number of output files

2018-07-25 Thread Forest Fang
Has there been any discussion to simply support Hive's merge small files configuration? It simply adds one additional stage to inspect size of each output file, recompute the desired parallelism to reach a target size, and runs a map-only coalesce before committing the final files. Since AFAIK S

Re: [DISCUSS][SQL] Control the number of output files

2018-07-25 Thread Mark Hamstra
See some of the related discussion under https://github.com/apache/spark/pull/21589 If feels to me like we need some kind of user code mechanism to signal policy preferences to Spark. This could also include ways to signal scheduling policy, which could include things like scheduling pool and/or b

Re: [DISCUSS][SQL] Control the number of output files

2018-07-25 Thread Reynold Xin
Seems like a good idea in general. Do other systems have similar concepts? In general it'd be easier if we can follow existing convention if there is any. On Wed, Jul 25, 2018 at 11:50 AM John Zhuge wrote: > Hi all, > > Many Spark users in my company are asking for a way to control the number >

[DISCUSS][SQL] Control the number of output files

2018-07-25 Thread John Zhuge
Hi all, Many Spark users in my company are asking for a way to control the number of output files in Spark SQL. There are use cases to either reduce or increase the number. The users prefer not to use function *repartition*(n) or *coalesce*(n, shuffle) that require them to write and deploy Scala/J