Filed https://issues.apache.org/jira/browse/SPARK-24940. Will upload a patch shortly.
SPARK-20857 introduced a generic SQL hint framework in 2.2.0.

On Thu, Jul 26, 2018 at 4:25 PM Reynold Xin <r...@databricks.com> wrote:

> John,
>
> You want to create a ticket and submit a patch for this? If there is a
> coalesce hint, inject a coalesce logical node. Pretty simple.
>
>
> On Wed, Jul 25, 2018 at 2:48 PM John Zhuge <jzh...@apache.org> wrote:
>
>> Thanks for the comment, Forest. What I am asking for is to make the DF
>> repartition/coalesce functionality available to SQL users.
>>
>> I agree with you that reducing the final number of output files by file
>> size would be very nice to have. Lukas indicated this is planned.
>>
>> On Wed, Jul 25, 2018 at 2:31 PM Forest Fang <forest.f...@outlook.com>
>> wrote:
>>
>>> Sorry, I see https://issues.apache.org/jira/browse/SPARK-6221 was
>>> referenced in John's email. Can you elaborate on how your requirement
>>> is different? In my experience, it is usually driven by the need to
>>> decrease the final output parallelism without compromising compute
>>> parallelism (i.e. to prevent too many small files from being persisted
>>> on HDFS). The requirement in my experience is often pretty ballpark and
>>> does not require a precise number of partitions. Therefore setting the
>>> desired output size to, say, 32-64 MB usually gives a good enough
>>> result. I'm curious why 6221 was marked as won't fix.
>>>
>>> On Wed, Jul 25, 2018 at 2:26 PM Forest Fang <forest.f...@outlook.com>
>>> wrote:
>>>
>>>> Has there been any discussion to simply support Hive's merge small
>>>> files configuration? It simply adds one additional stage to inspect
>>>> the size of each output file, recompute the desired parallelism to
>>>> reach a target size, and run a map-only coalesce before committing the
>>>> final files. Since AFAIK SparkSQL already stages the final output
>>>> commit, it seems feasible to respect this Hive config.
>>>>
>>>>
>>>> https://community.hortonworks.com/questions/106987/hive-multiple-small-files.html
>>>>
>>>>
>>>> On Wed, Jul 25, 2018 at 1:55 PM Mark Hamstra <m...@clearstorydata.com>
>>>> wrote:
>>>>
>>>>> See some of the related discussion under
>>>>> https://github.com/apache/spark/pull/21589
>>>>>
>>>>> It feels to me like we need some kind of user code mechanism to signal
>>>>> policy preferences to Spark. This could also include ways to signal
>>>>> scheduling policy, which could include things like scheduling pool
>>>>> and/or barrier scheduling. Some of those scheduling policies operate
>>>>> at inherently different levels currently -- e.g. scheduling pools at
>>>>> the Job level (really, the thread local level in the current
>>>>> implementation) and barrier scheduling at the Stage level -- so it is
>>>>> not completely obvious how to unify all of these policy
>>>>> options/preferences/mechanisms, or whether it is possible, but I think
>>>>> it is worth considering such things at a fairly high level of
>>>>> abstraction and trying to unify and simplify before making things
>>>>> more complex with multiple policy mechanisms.
>>>>>
>>>>> On Wed, Jul 25, 2018 at 1:37 PM Reynold Xin <r...@databricks.com>
>>>>> wrote:
>>>>>
>>>>>> Seems like a good idea in general. Do other systems have similar
>>>>>> concepts? In general it'd be easier if we can follow an existing
>>>>>> convention if there is any.
>>>>>>
>>>>>>
>>>>>> On Wed, Jul 25, 2018 at 11:50 AM John Zhuge <jzh...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> Many Spark users in my company are asking for a way to control the
>>>>>>> number of output files in Spark SQL. There are use cases to either
>>>>>>> reduce or increase the number. The users prefer not to use the
>>>>>>> functions repartition(n) or coalesce(n, shuffle), which require them
>>>>>>> to write and deploy Scala/Java/Python code.
>>>>>>>
>>>>>>> Could we introduce a query hint for this purpose (similar to
>>>>>>> Broadcast Join Hints)?
>>>>>>>
>>>>>>> /*+ COALESCE(n, shuffle) */
>>>>>>>
>>>>>>> In general, is a query hint the best way to bring DF functionality
>>>>>>> to SQL without extending SQL syntax? Any suggestion is highly
>>>>>>> appreciated.
>>>>>>>
>>>>>>> This requirement is not the same as SPARK-6221, which asked for
>>>>>>> auto-merging output files.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> John Zhuge
>>>>>>>
>>>>>>
>>
>>
>> --
>> John Zhuge
>>
>
--
John Zhuge
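
To make the "if there is a coalesce hint, inject a coalesce logical node" idea concrete, here is a rough sketch in plain Python. It is not Spark code and not the patch for SPARK-24940: `parse_coalesce_hint`, `CoalesceNode`, and the regular expression are hypothetical names made up for illustration, and defaulting `shuffle` to false when it is omitted is my own assumption. It only shows the shape of the mapping from the proposed `/*+ COALESCE(n, shuffle) */` text to a node carrying the same two arguments as the DF call.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical pattern for the proposed hint text; Spark's real parser is
# grammar-based and is not reproduced here.
HINT_RE = re.compile(
    r"/\*\+\s*COALESCE\s*\(\s*(\d+)\s*(?:,\s*(true|false)\s*)?\)\s*\*/",
    re.IGNORECASE,
)


@dataclass(frozen=True)
class CoalesceNode:
    """Illustrative stand-in for a coalesce logical plan node."""
    num_partitions: int
    shuffle: bool


def parse_coalesce_hint(sql: str) -> Optional[CoalesceNode]:
    """Return a CoalesceNode if the query carries a COALESCE hint, else None."""
    match = HINT_RE.search(sql)
    if match is None:
        return None
    num_partitions = int(match.group(1))
    if num_partitions <= 0:
        raise ValueError("COALESCE hint needs a positive partition count")
    # Assumption: an omitted shuffle argument means "no shuffle".
    shuffle = (match.group(2) or "false").lower() == "true"
    return CoalesceNode(num_partitions, shuffle)
```

For example, `parse_coalesce_hint("SELECT /*+ COALESCE(10, true) */ * FROM t")` yields a node with `num_partitions=10` and `shuffle=True`, while a query without the hint yields `None` and would be planned unchanged.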