Sorry, I see https://issues.apache.org/jira/browse/SPARK-6221 was referenced in 
John's email. Can you elaborate on how your requirement is different? In my 
experience, this is usually driven by the need to decrease the final output 
parallelism without compromising compute parallelism (i.e. to prevent too many 
small files from being persisted on HDFS). The requirement is usually a 
ballpark one and does not call for a precise number of partitions, so setting 
the desired output size to, say, 32-64 MB usually gives a good enough result. 
I'm curious why SPARK-6221 was marked as Won't Fix.
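
To make the size heuristic concrete, here is a minimal Scala sketch (a 
hypothetical helper, not an existing Spark API) of how a target output size 
translates into a partition count, assuming the total output size can be 
estimated somehow:

    // Pick a partition count so each output file lands near a target size.
    // totalOutputBytes would have to be estimated or measured, e.g. by an
    // extra stage that inspects the staged output before the final commit.
    def partitionsForTargetSize(totalOutputBytes: Long,
                                targetBytes: Long = 64L * 1024 * 1024): Int =
      math.max(1, math.ceil(totalOutputBytes.toDouble / targetBytes).toInt)

    // e.g. ~10 GB of output with a 64 MB target works out to 160 files:
    //   df.coalesce(partitionsForTargetSize(10L * 1024 * 1024 * 1024)).write.parquet(path)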

On Wed, Jul 25, 2018 at 2:26 PM Forest Fang <forest.f...@outlook.com> wrote:
Has there been any discussion of simply supporting Hive's merge-small-files 
configuration? It adds just one additional stage that inspects the size of each 
output file, recomputes the desired parallelism to reach a target size, and 
runs a map-only coalesce before committing the final files. Since, AFAIK, Spark 
SQL already stages the final output commit, it seems feasible to respect this 
Hive config.

https://community.hortonworks.com/questions/106987/hive-multiple-small-files.html
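
For reference, these are the Hive properties that drive that merge stage (the 
property names are Hive's; the values below are only illustrative, and as far 
as I know Spark SQL does not honor them today, which is exactly the gap being 
discussed):

    hive.merge.mapfiles=true                   # merge small files after a map-only job
    hive.merge.mapredfiles=true                # merge small files after a map-reduce job
    hive.merge.smallfiles.avgsize=16000000     # average output file size that triggers a merge
    hive.merge.size.per.task=256000000         # target size of the merged files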


On Wed, Jul 25, 2018 at 1:55 PM Mark Hamstra <m...@clearstorydata.com> wrote:
See some of the related discussion under 
https://github.com/apache/spark/pull/21589

It feels to me like we need some kind of user code mechanism to signal policy 
preferences to Spark. This could also include ways to signal scheduling policy, 
which could include things like scheduling pools and/or barrier scheduling. 
Some of those scheduling policies currently operate at inherently different 
levels (e.g. scheduling pools at the Job level, really the thread-local level 
in the current implementation, and barrier scheduling at the Stage level), so 
it is not completely obvious how to unify all of these policy 
options/preferences/mechanisms, or whether that is even possible. Still, I 
think it is worth considering such things at a fairly high level of 
abstraction, and trying to unify and simplify, before making things more 
complex with multiple policy mechanisms.

On Wed, Jul 25, 2018 at 1:37 PM Reynold Xin <r...@databricks.com> wrote:
Seems like a good idea in general. Do other systems have similar concepts? It 
would be easier if we could follow an existing convention, if there is one.


On Wed, Jul 25, 2018 at 11:50 AM John Zhuge <jzh...@apache.org> wrote:
Hi all,

Many Spark users in my company are asking for a way to control the number of 
output files in Spark SQL. There are use cases for both reducing and increasing 
that number. The users prefer not to use the repartition(n) or coalesce(n, 
shuffle) functions, which require them to write and deploy Scala/Java/Python 
code.
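
For context, this is roughly what those users have to write and deploy today 
(a minimal Scala sketch; the paths and partition counts are illustrative):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("control-output-files").getOrCreate()
    val df = spark.read.parquet("/path/to/input")

    // Reduce the number of output files without a full shuffle:
    df.coalesce(10).write.parquet("/path/to/output_few_files")

    // Increase the number of output files (incurs a shuffle):
    df.repartition(200).write.parquet("/path/to/output_many_files")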

Could we introduce a query hint for this purpose (similar to Broadcast Join 
Hints)?

    /*+ COALESCE(n, shuffle) */
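
For illustration, a hypothetical end-to-end usage if such a hint were adopted 
(the hint is only a proposal and its exact argument form is undecided; assumes 
an existing SparkSession named spark):

    // Sketch only: the COALESCE hint below is proposed, not implemented.
    spark.sql(
      """INSERT OVERWRITE TABLE db.target
        |SELECT /*+ COALESCE(10) */ * FROM db.source""".stripMargin)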

In general, is a query hint the best way to bring DataFrame functionality to 
SQL without extending the SQL syntax? Any suggestion is highly appreciated.

This requirement is not the same as SPARK-6221, which asked for auto-merging 
of output files.

Thanks,
John Zhuge
