UNSUBSCRIBE

2018-07-25 Thread sridhararao mutluri



Re: code freeze and branch cut for Apache Spark 2.4

2018-07-25 Thread Xingbo Jiang
Xiangrui and I are leading an effort to implement a highly desirable
feature, Barrier Execution Mode
(https://issues.apache.org/jira/browse/SPARK-24374). It introduces a new
scheduling model to Apache Spark so users can properly embed distributed DL
training as a Spark stage, simplifying the distributed training workflow.
The prototype was demoed in the Spark Summit keynote, and the feature has
received very positive feedback from the whole community. The design doc
and pull requests got more comments than we initially anticipated. We want
to finish this feature in the upcoming release, Spark 2.4. Would it be
possible to extend the code freeze by a week?

Thanks,

Xingbo

2018-07-07 0:47 GMT+08:00 Reynold Xin :

> FYI, it is coming up on 6 months since the last release. We will cut the
> branch and code freeze on Aug 1st in order to get 2.4 out on time.
>
>
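
For readers unfamiliar with the proposal: a minimal sketch of what barrier
mode looks like from user code, based on the API proposed in SPARK-24374
(the exact method names in the final 2.4 release may differ):

    import org.apache.spark.BarrierTaskContext
    import org.apache.spark.sql.SparkSession

    object BarrierDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("barrier-demo").getOrCreate()
        val rdd = spark.sparkContext.parallelize(1 to 100, numSlices = 4)

        // barrier() marks the stage so that all four tasks are scheduled
        // together (gang scheduling), as distributed DL training requires.
        val doubled = rdd.barrier().mapPartitions { iter =>
          val ctx = BarrierTaskContext.get()
          ctx.barrier() // global synchronization point across all tasks
          iter.map(_ * 2)
        }.collect()

        println(doubled.length)
        spark.stop()
      }
    }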


Re: [DISCUSS][SQL] Control the number of output files

2018-07-25 Thread John Zhuge
Thanks for the comment, Forest. What I am asking for is to make the DF
repartition/coalesce functionality available to SQL users.

Agreed that reducing the final number of output files based on file size
would be very nice to have; Lukas indicated this is planned.

On Wed, Jul 25, 2018 at 2:31 PM Forest Fang  wrote:

> Sorry, I see https://issues.apache.org/jira/browse/SPARK-6221 was
> referenced in John's email. Can you elaborate on how your requirement
> differs? In my experience, this is usually driven by the need to decrease
> the final output parallelism without compromising compute parallelism
> (i.e. to prevent too many small files from being persisted on HDFS). The
> requirement in my experience is often pretty ballpark and does not demand
> a precise number of partitions, so setting the desired output size to,
> say, 32-64 MB usually gives a good enough result. I'm curious why
> SPARK-6221 was marked as Won't Fix.
>
> On Wed, Jul 25, 2018 at 2:26 PM Forest Fang 
> wrote:
>
>> Has there been any discussion of simply supporting Hive's merge-small-files
>> configuration? It adds one additional stage that inspects the size of each
>> output file, recomputes the desired parallelism to reach a target size, and
>> runs a map-only coalesce before committing the final files. Since AFAIK
>> SparkSQL already stages the final output commit, it seems feasible to
>> respect this Hive config.
>>
>>
>> https://community.hortonworks.com/questions/106987/hive-multiple-small-files.html
>>
>>
>> On Wed, Jul 25, 2018 at 1:55 PM Mark Hamstra 
>> wrote:
>>
>>> See some of the related discussion under
>>> https://github.com/apache/spark/pull/21589
>>>
>>> It feels to me like we need some kind of user code mechanism to signal
>>> policy preferences to Spark. This could also include ways to signal
>>> scheduling policy, which could include things like scheduling pools and/or
>>> barrier scheduling. Some of those scheduling policies operate at inherently
>>> different levels currently -- e.g. scheduling pools at the Job level
>>> (really, the thread-local level in the current implementation) and barrier
>>> scheduling at the Stage level -- so it is not completely obvious how to
>>> unify all of these policy options/preferences/mechanisms, or whether it is
>>> possible at all. But I think it is worth considering such things at a
>>> fairly high level of abstraction, and trying to unify and simplify, before
>>> making things more complex with multiple policy mechanisms.
>>>
>>> On Wed, Jul 25, 2018 at 1:37 PM Reynold Xin  wrote:
>>>
 Seems like a good idea in general. Do other systems have similar
 concepts? In general it'd be easier if we can follow an existing
 convention, if there is any.


 On Wed, Jul 25, 2018 at 11:50 AM John Zhuge  wrote:

> Hi all,
>
> Many Spark users in my company are asking for a way to control the
> number of output files in Spark SQL. There are use cases both to reduce
> and to increase the number. The users prefer not to use the functions
> *repartition*(n) or *coalesce*(n, shuffle), which require them to write
> and deploy Scala/Java/Python code.
>
> Could we introduce a query hint for this purpose (similar to the
> Broadcast Join Hints)?
>
> /*+ *COALESCE*(n, shuffle) */
>
> In general, is a query hint the best way to bring DF functionality to
> SQL without extending SQL syntax? Any suggestion is highly appreciated.
>
> This requirement is not the same as SPARK-6221, which asked for
> auto-merging of output files.
>
> Thanks,
> John Zhuge
>


-- 
John Zhuge
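
For context, a hedged sketch of the two sides of John's request: the
DataFrame calls users must currently deploy as code, and the proposed hint
syntax (the hint shown is the proposal from this thread, not a feature that
had shipped at the time):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("output-files").getOrCreate()
    val df = spark.range(1000000).toDF("id")

    // Today: control the file count by deploying Scala/Java/Python code.
    df.repartition(10)        // full shuffle to exactly 10 partitions
      .write.mode("overwrite").parquet("/tmp/out-repartitioned")
    df.coalesce(10)           // narrow, shuffle-free; can only decrease
      .write.mode("overwrite").parquet("/tmp/out-coalesced")

    // Proposed: the same control from pure SQL via a hint (hypothetical
    // syntax from this thread).
    // spark.sql("SELECT /*+ COALESCE(10) */ * FROM events")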


Re: [DISCUSS][SQL] Control the number of output files

2018-07-25 Thread Forest Fang
Sorry, I see https://issues.apache.org/jira/browse/SPARK-6221 was referenced
in John's email. Can you elaborate on how your requirement differs? In my
experience, this is usually driven by the need to decrease the final output
parallelism without compromising compute parallelism (i.e. to prevent too
many small files from being persisted on HDFS). The requirement in my
experience is often pretty ballpark and does not demand a precise number of
partitions, so setting the desired output size to, say, 32-64 MB usually
gives a good enough result. I'm curious why SPARK-6221 was marked as Won't
Fix.

On Wed, Jul 25, 2018 at 2:26 PM Forest Fang <forest.f...@outlook.com> wrote:
Has there been any discussion of simply supporting Hive's merge-small-files
configuration? It adds one additional stage that inspects the size of each
output file, recomputes the desired parallelism to reach a target size, and
runs a map-only coalesce before committing the final files. Since AFAIK
SparkSQL already stages the final output commit, it seems feasible to
respect this Hive config.

https://community.hortonworks.com/questions/106987/hive-multiple-small-files.html


On Wed, Jul 25, 2018 at 1:55 PM Mark Hamstra <m...@clearstorydata.com> wrote:
See some of the related discussion under
https://github.com/apache/spark/pull/21589

It feels to me like we need some kind of user code mechanism to signal policy
preferences to Spark. This could also include ways to signal scheduling
policy, which could include things like scheduling pools and/or barrier
scheduling. Some of those scheduling policies operate at inherently different
levels currently -- e.g. scheduling pools at the Job level (really, the
thread-local level in the current implementation) and barrier scheduling at
the Stage level -- so it is not completely obvious how to unify all of these
policy options/preferences/mechanisms, or whether it is possible at all. But
I think it is worth considering such things at a fairly high level of
abstraction, and trying to unify and simplify, before making things more
complex with multiple policy mechanisms.

On Wed, Jul 25, 2018 at 1:37 PM Reynold Xin <r...@databricks.com> wrote:
Seems like a good idea in general. Do other systems have similar concepts?
In general it'd be easier if we can follow an existing convention, if there
is any.


On Wed, Jul 25, 2018 at 11:50 AM John Zhuge <jzh...@apache.org> wrote:
Hi all,

Many Spark users in my company are asking for a way to control the number of
output files in Spark SQL. There are use cases both to reduce and to increase
the number. The users prefer not to use the functions repartition(n) or
coalesce(n, shuffle), which require them to write and deploy
Scala/Java/Python code.

Could we introduce a query hint for this purpose (similar to the Broadcast
Join Hints)?

/*+ COALESCE(n, shuffle) */

In general, is a query hint the best way to bring DF functionality to SQL
without extending SQL syntax? Any suggestion is highly appreciated.

This requirement is not the same as SPARK-6221, which asked for auto-merging
of output files.

Thanks,
John Zhuge



Re: [DISCUSS][SQL] Control the number of output files

2018-07-25 Thread Forest Fang
Has there been any discussion of simply supporting Hive's merge-small-files
configuration? It adds one additional stage that inspects the size of each
output file, recomputes the desired parallelism to reach a target size, and
runs a map-only coalesce before committing the final files. Since AFAIK
SparkSQL already stages the final output commit, it seems feasible to
respect this Hive config.

https://community.hortonworks.com/questions/106987/hive-multiple-small-files.html


On Wed, Jul 25, 2018 at 1:55 PM Mark Hamstra <m...@clearstorydata.com> wrote:
See some of the related discussion under
https://github.com/apache/spark/pull/21589

It feels to me like we need some kind of user code mechanism to signal policy
preferences to Spark. This could also include ways to signal scheduling
policy, which could include things like scheduling pools and/or barrier
scheduling. Some of those scheduling policies operate at inherently different
levels currently -- e.g. scheduling pools at the Job level (really, the
thread-local level in the current implementation) and barrier scheduling at
the Stage level -- so it is not completely obvious how to unify all of these
policy options/preferences/mechanisms, or whether it is possible at all. But
I think it is worth considering such things at a fairly high level of
abstraction, and trying to unify and simplify, before making things more
complex with multiple policy mechanisms.

On Wed, Jul 25, 2018 at 1:37 PM Reynold Xin <r...@databricks.com> wrote:
Seems like a good idea in general. Do other systems have similar concepts?
In general it'd be easier if we can follow an existing convention, if there
is any.


On Wed, Jul 25, 2018 at 11:50 AM John Zhuge <jzh...@apache.org> wrote:
Hi all,

Many Spark users in my company are asking for a way to control the number of
output files in Spark SQL. There are use cases both to reduce and to increase
the number. The users prefer not to use the functions repartition(n) or
coalesce(n, shuffle), which require them to write and deploy
Scala/Java/Python code.

Could we introduce a query hint for this purpose (similar to the Broadcast
Join Hints)?

/*+ COALESCE(n, shuffle) */

In general, is a query hint the best way to bring DF functionality to SQL
without extending SQL syntax? Any suggestion is highly appreciated.

This requirement is not the same as SPARK-6221, which asked for auto-merging
of output files.

Thanks,
John Zhuge
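
A rough sketch of the size-inspection stage Forest describes, written as a
hypothetical two-pass helper: write to a staging path, measure the bytes
written, derive a parallelism from a target file size, then run the map-only
coalesce before the final commit. This illustrates the idea only; it is not
how Hive implements it, and the helper is not a Spark or Hive API:

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.sql.{DataFrame, SparkSession}

    // Hypothetical helper: rewrite the output with roughly targetBytes
    // per file (default 64 MB).
    def writeWithTargetSize(spark: SparkSession, df: DataFrame,
                            path: String, targetBytes: Long = 64L << 20): Unit = {
      val staging = path + "_staging"
      df.write.mode("overwrite").parquet(staging)

      // Inspect the total size of what was just written.
      val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
      val totalBytes = fs.getContentSummary(new Path(staging)).getLength

      // Ballpark parallelism: one output file per targetBytes, at least one.
      val numFiles = math.max(1, (totalBytes / targetBytes).toInt)

      // Map-only coalesce before committing the final files.
      spark.read.parquet(staging).coalesce(numFiles)
        .write.mode("overwrite").parquet(path)
      fs.delete(new Path(staging), true) // recursive delete of staging dir
    }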


Re: [DISCUSS][SQL] Control the number of output files

2018-07-25 Thread lukas nalezenec
Hi,
Yes, this feature is planned -- Spark should soon be able to repartition
output by size.
Lukas


On Wed, Jul 25, 2018 at 23:26, Forest Fang 
wrote:

> Has there been any discussion of simply supporting Hive's merge-small-files
> configuration? It adds one additional stage that inspects the size of each
> output file, recomputes the desired parallelism to reach a target size, and
> runs a map-only coalesce before committing the final files. Since AFAIK
> SparkSQL already stages the final output commit, it seems feasible to
> respect this Hive config.
>
>
> https://community.hortonworks.com/questions/106987/hive-multiple-small-files.html
>
>
> On Wed, Jul 25, 2018 at 1:55 PM Mark Hamstra 
> wrote:
>
>> See some of the related discussion under
>> https://github.com/apache/spark/pull/21589
>>
>> It feels to me like we need some kind of user code mechanism to signal
>> policy preferences to Spark. This could also include ways to signal
>> scheduling policy, which could include things like scheduling pools and/or
>> barrier scheduling. Some of those scheduling policies operate at inherently
>> different levels currently -- e.g. scheduling pools at the Job level
>> (really, the thread-local level in the current implementation) and barrier
>> scheduling at the Stage level -- so it is not completely obvious how to
>> unify all of these policy options/preferences/mechanisms, or whether it is
>> possible at all. But I think it is worth considering such things at a
>> fairly high level of abstraction, and trying to unify and simplify, before
>> making things more complex with multiple policy mechanisms.
>>
>> On Wed, Jul 25, 2018 at 1:37 PM Reynold Xin  wrote:
>>
>>> Seems like a good idea in general. Do other systems have similar
>>> concepts? In general it'd be easier if we can follow an existing
>>> convention, if there is any.
>>>
>>>
>>> On Wed, Jul 25, 2018 at 11:50 AM John Zhuge  wrote:
>>>
 Hi all,

 Many Spark users in my company are asking for a way to control the
 number of output files in Spark SQL. There are use cases both to reduce
 and to increase the number. The users prefer not to use the functions
 *repartition*(n) or *coalesce*(n, shuffle), which require them to write
 and deploy Scala/Java/Python code.

 Could we introduce a query hint for this purpose (similar to the
 Broadcast Join Hints)?

 /*+ *COALESCE*(n, shuffle) */

 In general, is a query hint the best way to bring DF functionality to
 SQL without extending SQL syntax? Any suggestion is highly appreciated.

 This requirement is not the same as SPARK-6221, which asked for
 auto-merging of output files.

 Thanks,
 John Zhuge

>>>


Re: [DISCUSS][SQL] Control the number of output files

2018-07-25 Thread Mark Hamstra
See some of the related discussion under
https://github.com/apache/spark/pull/21589

It feels to me like we need some kind of user code mechanism to signal
policy preferences to Spark. This could also include ways to signal
scheduling policy, which could include things like scheduling pools and/or
barrier scheduling. Some of those scheduling policies operate at inherently
different levels currently -- e.g. scheduling pools at the Job level
(really, the thread-local level in the current implementation) and barrier
scheduling at the Stage level -- so it is not completely obvious how to
unify all of these policy options/preferences/mechanisms, or whether it is
possible at all. But I think it is worth considering such things at a
fairly high level of abstraction, and trying to unify and simplify, before
making things more complex with multiple policy mechanisms.

On Wed, Jul 25, 2018 at 1:37 PM Reynold Xin  wrote:

> Seems like a good idea in general. Do other systems have similar concepts?
> In general it'd be easier if we can follow an existing convention, if there
> is any.
>
>
> On Wed, Jul 25, 2018 at 11:50 AM John Zhuge  wrote:
>
>> Hi all,
>>
>> Many Spark users in my company are asking for a way to control the number
>> of output files in Spark SQL. There are use cases both to reduce and to
>> increase the number. The users prefer not to use the functions
>> *repartition*(n) or *coalesce*(n, shuffle), which require them to write
>> and deploy Scala/Java/Python code.
>>
>> Could we introduce a query hint for this purpose (similar to the Broadcast
>> Join Hints)?
>>
>> /*+ *COALESCE*(n, shuffle) */
>>
>> In general, is a query hint the best way to bring DF functionality to
>> SQL without extending SQL syntax? Any suggestion is highly appreciated.
>>
>> This requirement is not the same as SPARK-6221, which asked for
>> auto-merging of output files.
>>
>> Thanks,
>> John Zhuge
>>
>
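
For concreteness, the thread-local scheduling-pool mechanism Mark refers to,
as it exists in Spark today (the pool-level policy is signaled per thread via
a local property; the pool definitions themselves come from the fair-scheduler
configuration):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("pools")
      .config("spark.scheduler.mode", "FAIR")
      .getOrCreate()
    val sc = spark.sparkContext

    // Job-level (really thread-local) policy: jobs submitted from this
    // thread are scheduled in the "production" pool.
    sc.setLocalProperty("spark.scheduler.pool", "production")
    spark.range(100).count()

    // Revert this thread to the default pool.
    sc.setLocalProperty("spark.scheduler.pool", null)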


Re: [DISCUSS][SQL] Control the number of output files

2018-07-25 Thread Reynold Xin
Seems like a good idea in general. Do other systems have similar concepts?
In general it'd be easier if we can follow an existing convention, if there
is any.


On Wed, Jul 25, 2018 at 11:50 AM John Zhuge  wrote:

> Hi all,
>
> Many Spark users in my company are asking for a way to control the number
> of output files in Spark SQL. There are use cases both to reduce and to
> increase the number. The users prefer not to use the functions
> *repartition*(n) or *coalesce*(n, shuffle), which require them to write
> and deploy Scala/Java/Python code.
>
> Could we introduce a query hint for this purpose (similar to the Broadcast
> Join Hints)?
>
> /*+ *COALESCE*(n, shuffle) */
>
> In general, is a query hint the best way to bring DF functionality to SQL
> without extending SQL syntax? Any suggestion is highly appreciated.
>
> This requirement is not the same as SPARK-6221, which asked for
> auto-merging of output files.
>
> Thanks,
> John Zhuge
>


Re: [DISCUSS] Multiple catalog support

2018-07-25 Thread Ryan Blue
Quick update: I've updated my PR to add the table catalog API to implement
this proposal. Here's the PR: https://github.com/apache/spark/pull/21306

On Mon, Jul 23, 2018 at 5:01 PM Ryan Blue  wrote:

> Lately, I’ve been working on implementing the new SQL logical plans. I’m
> currently blocked working on the plans that require table metadata
> operations. For example, CTAS will be implemented as a create table and a
> write using DSv2 (and a drop table if anything goes wrong). That requires
> something to expose the create and drop table actions: a table catalog.
>
> Initially, I opened #21306 
> to get a table catalog from the data source, but that’s a bad idea because
> it conflicts with future multi-catalog support. Sources are an
> implementation of a read and write API that can be shared between catalogs.
> For example, you could have prod and test HMS catalogs that both use the
> Parquet source. The Parquet source shouldn’t determine whether a CTAS
> statement creates a table in prod or test.
>
> That means that CTAS and other plans for DataSourceV2 need a solution to
> determine the catalog to use.
> Proposal
>
> I propose we add support for multiple catalogs now in support of the
> DataSourceV2 work, to avoid hacky work-arounds.
>
> First, I think we need to add catalog to TableIdentifier so tables are
> identified by catalog.db.table, not just db.table. This would make it
> easy to specify the intended catalog for SQL statements, like CREATE
> cat.db.table AS ..., and in the DataFrame API:
> df.write.saveAsTable("cat.db.table") or spark.table("cat.db.table").
>
> Second, we will need an API for catalogs to implement. The SPIP on APIs
> for Table Metadata
> 
> already proposed the API for create/alter/drop table operations. The only
> part that is missing is how to register catalogs instead of using
> DataSourceV2 to instantiate them.
>
> I think we should configure catalogs through Spark config properties, like
> this:
>
> spark.sql.catalog.<name> = <implementation class>
> spark.sql.catalog.<name>.<key> = <value>
>
> When a catalog is referenced by name, Spark would instantiate the
> specified class using a no-arg constructor. The instance would then be
> configured by passing a map of the remaining pairs in the
> spark.sql.catalog.<name>.* namespace to a configure method, with the
> namespace part removed and an extra “name” parameter with the catalog name.
> This would support external sources like JDBC, which have common options
> like driver or hostname and port.
> Backward-compatibility
>
> The current spark.catalog / ExternalCatalog would be used when the
> catalog element of a TableIdentifier is left blank. That would provide
> backward-compatibility. We could optionally allow users to control the
> default table catalog with a property.
> Relationship between catalogs and data sources
>
> In the proposed table catalog API, actions return a Table object that
> exposes the DSv2 ReadSupport and WriteSupport traits. Table catalogs
> would share data source implementations by returning Table instances that
> use the correct data source. V2 sources would no longer need to be loaded
> by reflection; the catalog would be loaded instead.
>
> Tables created using format("source") or USING source in SQL specify the
> data source implementation directly. This “format” should be passed to the
> source as a table property. The existing ExternalCatalog will need to
> implement the new TableCatalog API for v2 sources and would continue to
> use the property to determine the table’s data source or format
> implementation. Other table catalog implementations would be free to
> interpret the format string as they choose or to use it to choose a data
> source implementation as in the default catalog.
>
> rb
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


-- 
Ryan Blue
Software Engineer
Netflix
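
To make the proposed plugin surface concrete, a hedged Scala sketch of the
shape a table catalog might take under this proposal. The trait and method
names here are illustrative only, drawn from the SPIP's description rather
than any final Spark API:

    import java.util.{Map => JMap}
    import org.apache.spark.sql.types.StructType

    // Illustrative plugin contract: a no-arg-constructed class that is then
    // handed its catalog name plus the flattened spark.sql.catalog.<name>.*
    // options.
    trait CatalogPlugin {
      def configure(name: String, options: JMap[String, String]): Unit
    }

    // Minimal table handle; in the proposal it also exposes the DSv2
    // ReadSupport/WriteSupport traits.
    trait Table {
      def name: String
      def properties: JMap[String, String]
    }

    // Create/load/drop surface per the table-metadata SPIP (simplified).
    trait TableCatalog extends CatalogPlugin {
      def loadTable(ident: String): Table
      def createTable(ident: String, schema: StructType,
                      properties: JMap[String, String]): Table
      def dropTable(ident: String): Boolean
    }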


[DISCUSS][SQL] Control the number of output files

2018-07-25 Thread John Zhuge
Hi all,

Many Spark users in my company are asking for a way to control the number
of output files in Spark SQL. There are use cases both to reduce and to
increase the number. The users prefer not to use the functions
*repartition*(n) or *coalesce*(n, shuffle), which require them to write and
deploy Scala/Java/Python code.

Could we introduce a query hint for this purpose (similar to the Broadcast
Join Hints)?

/*+ *COALESCE*(n, shuffle) */

In general, is a query hint the best way to bring DF functionality to SQL
without extending SQL syntax? Any suggestion is highly appreciated.

This requirement is not the same as SPARK-6221, which asked for
auto-merging of output files.

Thanks,
John Zhuge