[jira] [Commented] (SPARK-27827) File does not exist notice is misleading in FileScanRDD
[ https://issues.apache.org/jira/browse/SPARK-27827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16857297#comment-16857297 ] zhoukang commented on SPARK-27827: -- I just tested this in a 2.3 cluster [~dongjoon] > File does not exist notice is misleading in FileScanRDD > --- > > Key: SPARK-27827 > URL: https://issues.apache.org/jira/browse/SPARK-27827 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.2 >Reporter: zhoukang >Priority: Minor > > When we encounter the error below, we try "refresh table" and expect the > error not to be thrown again. > {code:java} > Error: java.lang.IllegalStateException: Can't overwrite cause with > java.io.FileNotFoundException: File does not exist: > /user/s_xdata/kuduhive_warehouse/info_dev/dws_quality_time_dictionary/part-3-92c84bf9-99c0-49d9-8cdf-78b1844d75c3.snappy.parquet > It is possible the underlying files have been updated. You can explicitly > invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in > SQL or by recreating the Dataset/DataFrame involved. (state=,code=0) > {code} > The cause is that the 'InMemoryFileIndex' is cached in 'HiveMetaStoreCatalog', and the > refresh command only invalidates the table for the current session. The notice is > misleading when we have a long-running thriftserver. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
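The interaction described in the report — a file listing cached at the catalog level while REFRESH TABLE only clears session-level state — can be modeled with a small, purely illustrative Python sketch. These are not Spark's actual classes; all names here are hypothetical:

```python
# Illustrative model of the reported behavior: a catalog-level file index
# shared across sessions (standing in for the InMemoryFileIndex cached in
# HiveMetaStoreCatalog), while REFRESH TABLE only drops per-session state.
shared_file_index = {}  # table name -> cached file listing


class Session:
    def __init__(self):
        self.relation_cache = {}  # per-session cached table state

    def scan(self, table, filesystem):
        # A session-level cache miss falls through to the shared file index,
        # which may still hold a stale listing of deleted files.
        if table not in shared_file_index:
            shared_file_index[table] = list(filesystem[table])
        self.relation_cache[table] = shared_file_index[table]
        return self.relation_cache[table]

    def refresh_table(self, table):
        # Models the reported problem: only this session's entry is dropped;
        # the shared index keeps serving the stale listing.
        self.relation_cache.pop(table, None)


fs = {"t": ["part-0.parquet"]}
session = Session()
assert session.scan("t", fs) == ["part-0.parquet"]

fs["t"] = ["part-1.parquet"]   # underlying files rewritten
session.refresh_table("t")
# Even after REFRESH TABLE, the shared cache still reports the old file,
# so the FileNotFoundException (and its misleading notice) would recur.
assert session.scan("t", fs) == ["part-0.parquet"]
```

This is only a model of why the suggested "REFRESH TABLE" advice does not help in a long-running thriftserver: the stale entry outlives the session-level invalidation.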
[jira] [Updated] (SPARK-27827) File does not exist notice is misleading in FileScanRDD
[ https://issues.apache.org/jira/browse/SPARK-27827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated SPARK-27827: - Affects Version/s: (was: 2.4.3)
[jira] [Commented] (SPARK-27068) Support failed jobs ui and completed jobs ui use different queue
[ https://issues.apache.org/jira/browse/SPARK-27068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16857291#comment-16857291 ] zhoukang commented on SPARK-27068: -- [~srowen] Here is a use case from our cluster. We have a long-running Spark SQL thriftserver that users use as an ad-hoc query engine and also for an online BI service. The number of failures is not large, but the total query count increases quickly, as shown in the image below. When we want to find the root cause of a failed query, it is currently not very convenient. !屏幕快照 2019-06-06 下午1.12.04.png! > Support failed jobs ui and completed jobs ui use different queue > > > Key: SPARK-27068 > URL: https://issues.apache.org/jira/browse/SPARK-27068 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.4.0 >Reporter: zhoukang >Priority: Major > Attachments: 屏幕快照 2019-06-06 下午1.12.04.png > > > For some long-running jobs, we may want to check the cause of some failed > jobs. But most jobs have completed and the failed jobs UI may disappear; we could use > different queues for these two kinds of jobs.
[jira] [Updated] (SPARK-27068) Support failed jobs ui and completed jobs ui use different queue
[ https://issues.apache.org/jira/browse/SPARK-27068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated SPARK-27068: - Attachment: 屏幕快照 2019-06-06 下午1.12.04.png
[jira] [Updated] (SPARK-27965) Add extractors for logical transforms
[ https://issues.apache.org/jira/browse/SPARK-27965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27965: -- Issue Type: Improvement (was: Bug) > Add extractors for logical transforms > - > > Key: SPARK-27965 > URL: https://issues.apache.org/jira/browse/SPARK-27965 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Priority: Major > > Extractors can be used to make any Transform class appear like a case class > to Spark internals.
[jira] [Commented] (SPARK-27965) Add extractors for logical transforms
[ https://issues.apache.org/jira/browse/SPARK-27965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16857232#comment-16857232 ] Dongjoon Hyun commented on SPARK-27965: --- Hi, [~rdblue]. Could you use `Improvement` issue type when you create this kind of issue?
[jira] [Resolved] (SPARK-27964) Create CatalogV2Util
[ https://issues.apache.org/jira/browse/SPARK-27964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-27964. --- Resolution: Fixed Assignee: Ryan Blue Fix Version/s: 3.0.0 This is resolved via https://github.com/apache/spark/pull/24813 > Create CatalogV2Util > > > Key: SPARK-27964 > URL: https://issues.apache.org/jira/browse/SPARK-27964 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > Fix For: 3.0.0 > > > Need to move utility functions from test.
[jira] [Updated] (SPARK-27964) Create CatalogV2Util
[ https://issues.apache.org/jira/browse/SPARK-27964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27964: -- Issue Type: Improvement (was: Bug)
[jira] [Assigned] (SPARK-27964) Create CatalogV2Util
[ https://issues.apache.org/jira/browse/SPARK-27964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27964: Assignee: Apache Spark
[jira] [Assigned] (SPARK-27964) Create CatalogV2Util
[ https://issues.apache.org/jira/browse/SPARK-27964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27964: Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-27965) Add extractors for logical transforms
[ https://issues.apache.org/jira/browse/SPARK-27965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27965: Assignee: Apache Spark
[jira] [Assigned] (SPARK-27965) Add extractors for logical transforms
[ https://issues.apache.org/jira/browse/SPARK-27965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27965: Assignee: (was: Apache Spark)
[jira] [Created] (SPARK-27965) Add extractors for logical transforms
Ryan Blue created SPARK-27965: - Summary: Add extractors for logical transforms Key: SPARK-27965 URL: https://issues.apache.org/jira/browse/SPARK-27965 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Ryan Blue
[jira] [Created] (SPARK-27964) Create CatalogV2Util
Ryan Blue created SPARK-27964: - Summary: Create CatalogV2Util Key: SPARK-27964 URL: https://issues.apache.org/jira/browse/SPARK-27964 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Ryan Blue
[jira] [Updated] (SPARK-27931) Accept 'on' and 'off' as input for boolean data type
[ https://issues.apache.org/jira/browse/SPARK-27931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-27931: Description: This ticket contains three things: 1. Accept 'on' and 'off' as input for boolean data type Example: {code:sql} SELECT cast('no' as boolean) AS false; SELECT cast('off' as boolean) AS false; {code} 2. Accept unique prefixes thereof: Example: {code:sql} SELECT cast('of' as boolean) AS false; SELECT cast('fal' as boolean) AS false; {code} 3. Trim the string when cast to boolean type {code:sql} SELECT cast('true ' as boolean) AS true; SELECT cast(' FALSE' as boolean) AS true; {code} More details: [https://www.postgresql.org/docs/devel/datatype-boolean.html] [https://github.com/postgres/postgres/blob/REL_12_BETA1/src/backend/utils/adt/bool.c#L25] [https://github.com/postgres/postgres/commit/05a7db05826c5eb68173b6d7ef1553c19322ef48] Other DBs: http://docs.aws.amazon.com/redshift/latest/dg/r_Boolean_type.html https://my.vertica.com/docs/5.0/HTML/Master/2983.htm https://github.com/prestosql/presto/blob/b845cd66da3eb1fcece50efba83ea12bc40afbaa/presto-main/src/main/java/com/facebook/presto/type/VarcharOperators.java#L108-L138 was: This ticket contains two things: 1. Accept 'on' and 'off' as input for boolean data type Example: {code:sql} SELECT cast('no' as boolean) AS false; SELECT cast('off' as boolean) AS false; {code} 2.
Trim the string when cast to boolean type {code:sql} SELECT cast('true ' as boolean) AS true; SELECT cast(' FALSE' as boolean) AS true; {code} More details: [https://www.postgresql.org/docs/devel/datatype-boolean.html] [https://github.com/postgres/postgres/commit/05a7db05826c5eb68173b6d7ef1553c19322ef48] Other DBs: http://docs.aws.amazon.com/redshift/latest/dg/r_Boolean_type.html https://my.vertica.com/docs/5.0/HTML/Master/2983.htm https://github.com/prestosql/presto/blob/b845cd66da3eb1fcece50efba83ea12bc40afbaa/presto-main/src/main/java/com/facebook/presto/type/VarcharOperators.java#L108-L138
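The PostgreSQL-style semantics proposed in this ticket (trim whitespace, accept 'on'/'off', and accept unique prefixes) can be sketched as a small Python function. This is only an illustrative model of the rules described above, not Spark's or PostgreSQL's implementation, and `parse_bool` is a hypothetical name:

```python
def parse_bool(s):
    """Illustrative model of PostgreSQL-style boolean literal parsing:
    trims whitespace, ignores case, and accepts any unique prefix of
    'true' / 'false' / 'yes' / 'no' / 'on' / 'off' (plus '1' and '0')."""
    v = s.strip().lower()
    if not v:
        raise ValueError("invalid boolean literal: %r" % s)
    # '1' and '0' are accepted exactly; they have no shorter forms.
    if v == "1":
        return True
    if v == "0":
        return False
    truthy, falsy = ("true", "yes", "on"), ("false", "no", "off")
    is_true = any(word.startswith(v) for word in truthy)
    is_false = any(word.startswith(v) for word in falsy)
    # A prefix matching both sides (e.g. 'o' for 'on'/'off') is ambiguous,
    # and a prefix matching neither side is invalid.
    if is_true == is_false:
        raise ValueError("invalid boolean literal: %r" % s)
    return is_true


assert parse_bool("off") is False   # mirrors SELECT cast('off' as boolean)
assert parse_bool("of") is False    # unique prefix of 'off'
assert parse_bool("fal") is False   # unique prefix of 'false'
assert parse_bool(" TRUE ") is True # trimmed and case-insensitive
```

Note how the prefix rule forces 'o' to be rejected: it is a prefix of both 'on' and 'off', so it is not unique.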
[jira] [Commented] (SPARK-27963) Allow dynamic allocation without an external shuffle service
[ https://issues.apache.org/jira/browse/SPARK-27963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16857146#comment-16857146 ] Marcelo Vanzin commented on SPARK-27963: FYI I have a WIP patch to implement this that I plan to post soon (although I'll be out for a couple of weeks and won't be able to update it). > Allow dynamic allocation without an external shuffle service > > > Key: SPARK-27963 > URL: https://issues.apache.org/jira/browse/SPARK-27963 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Marcelo Vanzin >Priority: Major > > It would be useful for users to be able to enable dynamic allocation without > the need to provision an external shuffle service. One immediate use case is > the ability to use dynamic allocation on Kubernetes, which doesn't yet have > that service. > This has been suggested before (e.g. > https://github.com/apache/spark/pull/24083, which was attached to the > k8s-specific SPARK-24432), and can actually be done without affecting the > internals of the Spark scheduler (aside from the dynamic allocation code).
[jira] [Created] (SPARK-27963) Allow dynamic allocation without an external shuffle service
Marcelo Vanzin created SPARK-27963: -- Summary: Allow dynamic allocation without an external shuffle service Key: SPARK-27963 URL: https://issues.apache.org/jira/browse/SPARK-27963 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 3.0.0 Reporter: Marcelo Vanzin
[jira] [Updated] (SPARK-27919) DataSourceV2: Add v2 session catalog
[ https://issues.apache.org/jira/browse/SPARK-27919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated SPARK-27919: -- Affects Version/s: (was: 2.4.3) 3.0.0 > DataSourceV2: Add v2 session catalog > > > Key: SPARK-27919 > URL: https://issues.apache.org/jira/browse/SPARK-27919 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Priority: Major > > When no default catalog is set, the session catalog (v1) is responsible for > table identifiers with no catalog part. When CTAS creates a table with a v2 > provider, a v2 catalog is required and the default catalog is used. But this > may cause Spark to create a table in a catalog that it cannot use to look up > the table. > In this case, a v2 catalog that delegates to the session catalog should be > used instead.
[jira] [Resolved] (SPARK-27857) DataSourceV2: Support ALTER TABLE statements
[ https://issues.apache.org/jira/browse/SPARK-27857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-27857. - Resolution: Fixed Assignee: Ryan Blue Fix Version/s: 3.0.0 > DataSourceV2: Support ALTER TABLE statements > > > Key: SPARK-27857 > URL: https://issues.apache.org/jira/browse/SPARK-27857 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.3 >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > Fix For: 3.0.0 > > > ALTER TABLE statements should be supported for v2 tables.
[jira] [Commented] (SPARK-24130) Data Source V2: Join Push Down
[ https://issues.apache.org/jira/browse/SPARK-24130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16857006#comment-16857006 ] Parshuram V Patki commented on SPARK-24130: --- Do we have any traction on this? > Data Source V2: Join Push Down > -- > > Key: SPARK-24130 > URL: https://issues.apache.org/jira/browse/SPARK-24130 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Jia Li >Priority: Major > Attachments: Data Source V2 Join Push Down.pdf > > > Spark applications often directly query external data sources such as > relational databases or files. Spark provides Data Sources APIs for > accessing structured data through Spark SQL. The Data Sources APIs in both V1 and > V2 support optimizations such as filter push down and column pruning, which > are a subset of the functionality that can be pushed down to some data sources. > We're proposing to extend the Data Sources APIs with join push down (JPD). Join > push down significantly improves query performance by reducing the amount of > data transfer and exploiting the capabilities of the data sources, such as > index access. > The join push down design document is available > [here|https://docs.google.com/document/d/1k-kRadTcUbxVfUQwqBbIXs_yPZMxh18-e-cz77O_TaE/edit?usp=sharing].
[jira] [Assigned] (SPARK-27962) Propagate subprocess stdout when subprocess exits with nonzero status in deploy.RRunner
[ https://issues.apache.org/jira/browse/SPARK-27962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27962: Assignee: (was: Apache Spark) > Propagate subprocess stdout when subprocess exits with nonzero status in > deploy.RRunner > --- > > Key: SPARK-27962 > URL: https://issues.apache.org/jira/browse/SPARK-27962 > Project: Spark > Issue Type: Improvement > Components: Deploy, Spark Core >Affects Versions: 2.4.3 >Reporter: Jeremy Liu >Priority: Minor > > When the R process launched in org.apache.spark.deploy.RRunner terminates > with a nonzero status code, only the status code is passed on in the > SparkUserAppException. > Although the subprocess' stdout is continually piped to System.out, it would > be useful for users without access to the JVM's stdout to also capture the > last few lines of the R process and pass it along in the exception message.
[jira] [Assigned] (SPARK-27962) Propagate subprocess stdout when subprocess exits with nonzero status in deploy.RRunner
[ https://issues.apache.org/jira/browse/SPARK-27962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27962: Assignee: Apache Spark
[jira] [Created] (SPARK-27962) Propagate subprocess stdout when subprocess exits with nonzero status in deploy.RRunner
Jeremy Liu created SPARK-27962: -- Summary: Propagate subprocess stdout when subprocess exits with nonzero status in deploy.RRunner Key: SPARK-27962 URL: https://issues.apache.org/jira/browse/SPARK-27962 Project: Spark Issue Type: Improvement Components: Deploy, Spark Core Affects Versions: 2.4.3 Reporter: Jeremy Liu
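The behavior proposed for SPARK-27962 — mirror the child's stdout while retaining its last few lines so they can be attached to the failure message — can be sketched in Python. RRunner itself is Scala; `run_with_stdout_tail` and the tail size are hypothetical, illustrative choices:

```python
import collections
import subprocess


def run_with_stdout_tail(cmd, tail_lines=10):
    """Run a subprocess, mirroring its stdout line by line, and keep the
    last `tail_lines` lines so a nonzero exit can report them (a sketch of
    the proposed RRunner improvement, not Spark's actual code)."""
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr, as RRunner sees one stream
        text=True,
    )
    tail = collections.deque(maxlen=tail_lines)  # ring buffer of recent lines
    for line in proc.stdout:
        print(line, end="")  # keep piping to our own stdout, as before
        tail.append(line.rstrip("\n"))
    code = proc.wait()
    if code != 0:
        # Attach the retained tail to the exception message instead of
        # reporting only the bare status code.
        raise RuntimeError(
            "subprocess exited with code %d; last output:\n%s"
            % (code, "\n".join(tail))
        )
    return code


# Hypothetical usage:
# run_with_stdout_tail(["Rscript", "script.R"])
```

The ring buffer keeps memory bounded for long-running children, which is the main design constraint here: the full output has already been streamed, only a short tail is retained for the error message.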
[jira] [Assigned] (SPARK-27760) Spark resources - user configs change .count to be .amount
[ https://issues.apache.org/jira/browse/SPARK-27760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27760: Assignee: Thomas Graves (was: Apache Spark) > Spark resources - user configs change .count to be .amount > -- > > Key: SPARK-27760 > URL: https://issues.apache.org/jira/browse/SPARK-27760 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Major > > For the Spark resources, we created the config > spark.\{driver/executor}.resource.\{resourceName}.count > I think we should change .count to be .amount. That more easily allows users > to specify things with a suffix, like memory, in a single config, and they can > combine the value and unit. Without this they would have to specify two > separate configs (like .count and .unit), which seems more of a hassle for the > user. > Note the yarn configs for resources use amount: > spark.yarn.\{executor/driver/am}.resource=, where the amount is the value and unit together. I think that makes a lot of sense. Filed a > separate Jira to add .amount to the yarn configs as well.
[jira] [Assigned] (SPARK-27760) Spark resources - user configs change .count to be .amount
[ https://issues.apache.org/jira/browse/SPARK-27760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27760: Assignee: Apache Spark (was: Thomas Graves)
[jira] [Updated] (SPARK-27961) DataSourceV2Relation should not have refresh method
[ https://issues.apache.org/jira/browse/SPARK-27961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-27961: --- Description: The newly added `Refresh` method in [PR #24401|https://github.com/apache/spark/pull/24401] prevented me from moving DataSourceV2Relation into catalyst. It calls `case table: FileTable => table.fileIndex.refresh()` while `FileTable` belongs to sql/core. More importantly, [~rdblue] pointed out that DataSourceV2Relation is immutable by design, so it should not have a refresh method. was: The newly added `Refresh` method in PR #24401 prevented me from moving DataSourceV2Relation into catalyst. It calls `case table: FileTable => table.fileIndex.refresh()` while `FileTable` belongs to sql/core. More importantly, [~rdblue] pointed out DataSourceV2Relation is immutable by design, it should not have refresh method.
[jira] [Commented] (SPARK-27961) DataSourceV2Relation should not have refresh method
[ https://issues.apache.org/jira/browse/SPARK-27961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16856982#comment-16856982 ] John Zhuge commented on SPARK-27961: [~Gengliang.Wang] [~cloud_fan] Could you help?
[jira] [Created] (SPARK-27961) DataSourceV2Relation should not have refresh method
John Zhuge created SPARK-27961: -- Summary: DataSourceV2Relation should not have refresh method Key: SPARK-27961 URL: https://issues.apache.org/jira/browse/SPARK-27961 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: John Zhuge
[jira] [Commented] (SPARK-27939) Defining a schema with VectorUDT
[ https://issues.apache.org/jira/browse/SPARK-27939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16856975#comment-16856975 ] Johannes Schaffrath commented on SPARK-27939: - Hi Bryan, thank you very much for the detailed information. I just saw that this is also mentioned in the documentation [1], but like you said it is not intuitive. [1] http://spark.apache.org/docs/2.2.1/api/python/pyspark.sql.html#pyspark.sql.Row > Defining a schema with VectorUDT > > > Key: SPARK-27939 > URL: https://issues.apache.org/jira/browse/SPARK-27939 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Johannes Schaffrath >Priority: Minor > > When I try to define a dataframe schema which has a VectorUDT field, I run > into an error when the VectorUDT field is not the last element of the > StructType list. > The following example causes the error below: > {code:java} > // from pyspark.sql import functions as F > from pyspark.sql import types as T > from pyspark.sql import Row > from pyspark.ml.linalg import VectorUDT, SparseVector > #VectorUDT should be the last structfield > train_schema = T.StructType([ > T.StructField('features', VectorUDT()), > T.StructField('SALESCLOSEPRICE', T.IntegerType()) > ]) > > train_df = spark.createDataFrame( > [Row(features=SparseVector(135, {0: 139900.0, 1: 139900.0, 2: 980.0, 3: 10.0, > 5: 980.0, 6: 1858.0, 7: 1858.0, 8: 980.0, 9: 1950.0, 10: 1.28, 11: 1.0, 12: > 1.0, 15: 2.0, 16: 3.0, 20: 2017.0, 21: 7.0, 22: 28.0, 23: 15.0, 24: 196.0, > 25: 25.0, 26: -1.0, 27: 4.03, 28: 3.96, 29: 3.88, 30: 3.9, 31: 3.91, 32: 9.8, > 33: 22.4, 34: 67.8, 35: 49.8, 36: 11.9, 37: 2.7, 38: 0.2926, 39: 142.7551, > 40: 980.0, 41: 0.0133, 42: 1.5, 43: 1.0, 51: -1.0, 52: -1.0, 53: -1.0, 54: > -1.0, 55: -1.0, 56: -1.0, 57: -1.0, 62: 1.0, 68: 1.0, 77: 1.0, 81: 1.0, 89: > 1.0, 95: 1.0, 96: 1.0, 101: 1.0, 103: 1.0, 108: 1.0, 114: 1.0, 115: 1.0, 123: > 1.0, 133: 1.0}), SALESCLOSEPRICE=143000), > 
Row(features=SparseVector(135, {0: 21.0, 1: 21.0, 2: 1144.0, 3: 4.0, > 5: 1268.0, 6: 1640.0, 7: 1640.0, 8: 2228.0, 9: 1971.0, 10: 0.32, 11: 1.0, 14: > 2.0, 15: 3.0, 16: 4.0, 17: 960.0, 20: 2017.0, 21: 10.0, 22: 41.0, 23: 9.0, > 24: 282.0, 25: 2.0, 26: -1.0, 27: 3.91, 28: 3.85, 29: 3.83, 30: 3.83, 31: > 3.78, 32: 32.2, 33: 49.0, 34: 18.8, 35: 14.0, 36: 35.8, 37: 14.6, 38: 0.4392, > 39: 94.2549, 40: 2228.0, 41: 0.0078, 42: 1., 43: -1.0, 44: -1.0, 45: > -1.0, 46: -1.0, 47: -1.0, 48: -1.0, 49: -1.0, 50: -1.0, 52: 1.0, 55: -1.0, > 56: -1.0, 57: -1.0, 62: 1.0, 68: 1.0, 77: 1.0, 79: 1.0, 89: 1.0, 92: 1.0, 96: > 1.0, 101: 1.0, 103: 1.0, 108: 1.0, 114: 1.0, 115: 1.0, 124: 1.0, 133: 1.0}), > SALESCLOSEPRICE=19), > Row(features=SparseVector(135, {0: 225000.0, 1: 225000.0, 2: 1102.0, 3: > 28.0, 5: 1102.0, 6: 2390.0, 7: 2390.0, 8: 1102.0, 9: 1949.0, 10: 0.822, 11: > 1.0, 15: 1.0, 16: 2.0, 20: 2017.0, 21: 6.0, 22: 26.0, 23: 26.0, 24: 177.0, > 25: 25.0, 26: -1.0, 27: 3.88, 28: 3.9, 29: 3.91, 30: 3.89, 31: 3.94, 32: 9.8, > 33: 22.4, 34: 67.8, 35: 61.7, 36: 2.7, 38: 0.4706, 39: 204.1742, 40: 1102.0, > 41: 0.0106, 42: 2.0, 49: 1.0, 51: -1.0, 52: -1.0, 53: -1.0, 54: -1.0, 57: > 1.0, 62: 1.0, 68: 1.0, 70: 1.0, 79: 1.0, 89: 1.0, 92: 1.0, 96: 1.0, 100: 1.0, > 103: 1.0, 108: 1.0, 110: 1.0, 115: 1.0, 123: 1.0, 131: 1.0, 132: 1.0}), > SALESCLOSEPRICE=225000) > ], schema=train_schema) > > train_df.printSchema() > train_df.show() > {code} > Error message: > {code:java} > // Fail to execute line 17: ], schema=train_schema) Traceback (most recent > call last): File "/tmp/zeppelin_pyspark-3793375738105660281.py", line 375, in > exec(code, _zcUserQueryNameSpace) File "", line 17, in > File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py", > line 748, in createDataFrame rdd, schema = self._createFromLocal(map(prepare, > data), schema) File > "/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 429, in > _createFromLocal data = [schema.toInternal(row) for row in data] File > 
"/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 429, in > data = [schema.toInternal(row) for row in data] File > "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 604, in > toInternal for f, v, c in zip(self.fields, obj, self._needConversion)) File > "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 604, in > for f, v, c in zip(self.fields, obj, self._needConversion)) File > "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 442, in > toInternal return self.dataType.toInternal(obj) File > "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 685, in > toInternal return
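The root cause discussed above is that, in Spark 2.x, `pyspark.sql.Row` built from keyword arguments sorts its field names alphabetically, so the row's field order can differ from the `StructType` order and `toInternal` then hands a `SparseVector` to the wrong field converter. The pure-Python sketch below (no PySpark required; the helper name is illustrative, the real logic lives inside `pyspark.sql.types`) mimics that sorting to show why `SALESCLOSEPRICE` jumps ahead of `features`:

```python
# Sketch: mimic how Spark 2.x's pyspark.sql.Row ordered keyword fields.
# Row(features=..., SALESCLOSEPRICE=...) stored fields in sorted-name order,
# and ASCII uppercase letters sort before lowercase ones.

def row_field_order(**kwargs):
    """Field names in the order a Spark 2.x Row would store them."""
    return sorted(kwargs)  # alphabetical, case-sensitive (ASCII) sort

schema_order = ["features", "SALESCLOSEPRICE"]  # order declared in the StructType
row_order = row_field_order(features=object(), SALESCLOSEPRICE=143000)

# 'S' (0x53) < 'f' (0x66), so SALESCLOSEPRICE comes first in the Row:
assert row_order == ["SALESCLOSEPRICE", "features"]
assert row_order != schema_order  # the mismatch that breaks toInternal()
```

Declaring the `StructType` fields in alphabetical order (or, as the reporter found, putting the `VectorUDT` field last here) avoids the mismatch on Spark 2.x.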
[jira] [Created] (SPARK-27960) DataSourceV2 ORC implementation doesn't handle schemas correctly
Ryan Blue created SPARK-27960: - Summary: DataSourceV2 ORC implementation doesn't handle schemas correctly Key: SPARK-27960 URL: https://issues.apache.org/jira/browse/SPARK-27960 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.3 Reporter: Ryan Blue While testing SPARK-27919 (#[24768|https://github.com/apache/spark/pull/24768]), I tried to use the v2 ORC implementation to validate a v2 catalog that delegates to the session catalog. The ORC implementation fails the following test case because it cannot infer a schema (there is no data) but it should be using the schema used to create the table. Test case: {code} test("CreateTable: test ORC source") { spark.conf.set("spark.sql.catalog.session", classOf[V2SessionCatalog].getName) spark.sql(s"CREATE TABLE table_name (id bigint, data string) USING $orc2") val testCatalog = spark.catalog("session").asTableCatalog val table = testCatalog.loadTable(Identifier.of(Array(), "table_name")) assert(table.name == "orc ") // <-- should this be table_name? assert(table.partitioning.isEmpty) assert(table.properties == Map( "provider" -> orc2, "database" -> "default", "table" -> "table_name").asJava) assert(table.schema == new StructType().add("id", LongType).add("data", StringType)) // <-- fail val rdd = spark.sparkContext.parallelize(table.asInstanceOf[InMemoryTable].rows) checkAnswer(spark.internalCreateDataFrame(rdd, table.schema), Seq.empty) } {code} Error: {code} Unable to infer schema for ORC. It must be specified manually.; org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. 
It must be specified manually.; at org.apache.spark.sql.execution.datasources.v2.FileTable.$anonfun$dataSchema$5(FileTable.scala:61) at scala.Option.getOrElse(Option.scala:138) at org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema$lzycompute(FileTable.scala:61) at org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema(FileTable.scala:54) at org.apache.spark.sql.execution.datasources.v2.FileTable.schema$lzycompute(FileTable.scala:67) at org.apache.spark.sql.execution.datasources.v2.FileTable.schema(FileTable.scala:65) at org.apache.spark.sql.sources.v2.DataSourceV2SQLSuite.$anonfun$new$5(DataSourceV2SQLSuite.scala:82) {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27960) DataSourceV2 ORC implementation doesn't handle schemas correctly
[ https://issues.apache.org/jira/browse/SPARK-27960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16856955#comment-16856955 ] Ryan Blue commented on SPARK-27960: --- [~Gengliang.Wang], FYI > DataSourceV2 ORC implementation doesn't handle schemas correctly > > > Key: SPARK-27960 > URL: https://issues.apache.org/jira/browse/SPARK-27960 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: Ryan Blue >Priority: Major > > While testing SPARK-27919 > (#[24768|https://github.com/apache/spark/pull/24768]), I tried to use the v2 > ORC implementation to validate a v2 catalog that delegates to the session > catalog. The ORC implementation fails the following test case because it > cannot infer a schema (there is no data) but it should be using the schema > used to create the table. > Test case: > {code} > test("CreateTable: test ORC source") { > spark.conf.set("spark.sql.catalog.session", > classOf[V2SessionCatalog].getName) > spark.sql(s"CREATE TABLE table_name (id bigint, data string) USING $orc2") > val testCatalog = spark.catalog("session").asTableCatalog > val table = testCatalog.loadTable(Identifier.of(Array(), "table_name")) > assert(table.name == "orc ") // <-- should this be table_name? > assert(table.partitioning.isEmpty) > assert(table.properties == Map( > "provider" -> orc2, > "database" -> "default", > "table" -> "table_name").asJava) > assert(table.schema == new StructType().add("id", LongType).add("data", > StringType)) // <-- fail > val rdd = > spark.sparkContext.parallelize(table.asInstanceOf[InMemoryTable].rows) > checkAnswer(spark.internalCreateDataFrame(rdd, table.schema), Seq.empty) > } > {code} > Error: > {code} > Unable to infer schema for ORC. It must be specified manually.; > org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. 
It > must be specified manually.; > at > org.apache.spark.sql.execution.datasources.v2.FileTable.$anonfun$dataSchema$5(FileTable.scala:61) > at scala.Option.getOrElse(Option.scala:138) > at > org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema$lzycompute(FileTable.scala:61) > at > org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema(FileTable.scala:54) > at > org.apache.spark.sql.execution.datasources.v2.FileTable.schema$lzycompute(FileTable.scala:67) > at > org.apache.spark.sql.execution.datasources.v2.FileTable.schema(FileTable.scala:65) > at > org.apache.spark.sql.sources.v2.DataSourceV2SQLSuite.$anonfun$new$5(DataSourceV2SQLSuite.scala:82) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21136) Misleading error message for typo in SQL
[ https://issues.apache.org/jira/browse/SPARK-21136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21136: Assignee: Apache Spark (was: Yesheng Ma) > Misleading error message for typo in SQL > > > Key: SPARK-21136 > URL: https://issues.apache.org/jira/browse/SPARK-21136 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Daniel Darabos >Assignee: Apache Spark >Priority: Critical > > {code} > scala> spark.sql("select * from a left joinn b on a.id = b.id").show > org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input 'from' expecting {<EOF>, 'WHERE', 'GROUP', 'ORDER', > 'HAVING', 'LIMIT', 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'MINUS', > 'INTERSECT', 'SORT', 'CLUSTER', 'DISTRIBUTE'}(line 1, pos 9) > == SQL == > select * from a left joinn b on a.id = b.id > -^^^ > {code} > The issue is that {{^^^}} points at {{from}}, not at {{joinn}}. The text of > the error makes no sense either. If {{*}}, {{a}}, and {{b}} are complex in > themselves, a misleading error like this can hinder debugging substantially. > I tried to see if maybe I could fix this. Am I correct to deduce that the > error message originates in ANTLR4, which parses the query based on the > syntax defined in {{SqlBase.g4}}? If so, I guess I would have to figure out > how that syntax definition works, and why it misattributes the error. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21136) Misleading error message for typo in SQL
[ https://issues.apache.org/jira/browse/SPARK-21136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21136: Assignee: Yesheng Ma (was: Apache Spark) > Misleading error message for typo in SQL > > > Key: SPARK-21136 > URL: https://issues.apache.org/jira/browse/SPARK-21136 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Daniel Darabos >Assignee: Yesheng Ma >Priority: Critical > > {code} > scala> spark.sql("select * from a left joinn b on a.id = b.id").show > org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input 'from' expecting {<EOF>, 'WHERE', 'GROUP', 'ORDER', > 'HAVING', 'LIMIT', 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'MINUS', > 'INTERSECT', 'SORT', 'CLUSTER', 'DISTRIBUTE'}(line 1, pos 9) > == SQL == > select * from a left joinn b on a.id = b.id > -^^^ > {code} > The issue is that {{^^^}} points at {{from}}, not at {{joinn}}. The text of > the error makes no sense either. If {{*}}, {{a}}, and {{b}} are complex in > themselves, a misleading error like this can hinder debugging substantially. > I tried to see if maybe I could fix this. Am I correct to deduce that the > error message originates in ANTLR4, which parses the query based on the > syntax defined in {{SqlBase.g4}}? If so, I guess I would have to figure out > how that syntax definition works, and why it misattributes the error. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24615) SPIP: Accelerator-aware task scheduling for Spark
[ https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves reassigned SPARK-24615: - Assignee: Thomas Graves (was: Xingbo Jiang) > SPIP: Accelerator-aware task scheduling for Spark > - > > Key: SPARK-24615 > URL: https://issues.apache.org/jira/browse/SPARK-24615 > Project: Spark > Issue Type: Epic > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Saisai Shao >Assignee: Thomas Graves >Priority: Major > Labels: Hydrogen, SPIP > Attachments: Accelerator-aware scheduling in Apache Spark 3.0.pdf, > SPIP_ Accelerator-aware scheduling.pdf > > > (The JIRA received a major update on 2019/02/28. Some comments were based on > an earlier version. Please ignore them. New comments start at > [#comment-16778026].) > h2. Background and Motivation > GPUs and other accelerators have been widely used for accelerating special > workloads, e.g., deep learning and signal processing. While users from the AI > community use GPUs heavily, they often need Apache Spark to load and process > large datasets and to handle complex data scenarios like streaming. YARN and > Kubernetes already support GPUs in their recent releases. Although Spark > supports those two cluster managers, Spark itself is not aware of GPUs > exposed by them and hence Spark cannot properly request GPUs and schedule > them for users. This leaves a critical gap to unify big data and AI workloads > and make life simpler for end users. > To make Spark be aware of GPUs, we shall make two major changes at high level: > * At cluster manager level, we update or upgrade cluster managers to include > GPU support. Then we expose user interfaces for Spark to request GPUs from > them. > * Within Spark, we update its scheduler to understand available GPUs > allocated to executors, user task requests, and assign GPUs to tasks properly. 
> Based on the work done in YARN and Kubernetes to support GPUs and some > offline prototypes, we could have necessary features implemented in the next > major release of Spark. You can find a detailed scoping doc here, where we > listed user stories and their priorities. > h2. Goals > * Make Spark 3.0 GPU-aware in standalone, YARN, and Kubernetes. > * No regression on scheduler performance for normal jobs. > h2. Non-goals > * Fine-grained scheduling within one GPU card. > ** We treat one GPU card and its memory together as a non-divisible unit. > * Support TPU. > * Support Mesos. > * Support Windows. > h2. Target Personas > * Admins who need to configure clusters to run Spark with GPU nodes. > * Data scientists who need to build DL applications on Spark. > * Developers who need to integrate DL features on Spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27368) Design: Standalone supports GPU scheduling
[ https://issues.apache.org/jira/browse/SPARK-27368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27368: -- Description: Design draft: Scenarios: * client-mode, worker might create one or more executor processes, from different Spark applications. * cluster-mode, worker might create driver process as well. * local-cluster model, there could be multiple worker processes on the same node. This is an undocumented use of standalone mode, which is mainly for tests. * Resource isolation is not considered here. Because executor and driver processes on the same node will share the accelerator resources, the worker must take the role of allocating resources. So we will add a spark.worker.resource.[resourceName].discoveryScript conf for workers to discover resources. Users need to match the resourceName in driver and executor requests. Besides CPU cores and memory, the worker now also considers resources when creating new executors or drivers. Example conf: {code} spark.worker.resource.gpu.discoveryScript=/path/to/list-gpus.sh spark.driver.resource.gpu.count=4 spark.worker.resource.gpu.count=1 {code} In client mode, the driver process is not launched by the worker, so the user can specify a driver resource discovery script. In cluster mode, if the user still specifies a driver resource discovery script, it is ignored with a warning. Supporting resource isolation is tricky because the Spark worker doesn't know how to isolate resources unless we hardcode some resource names, like the GPU support in YARN, which is less ideal. Supporting resource isolation for multiple resource types is even harder. In the first version, we will implement accelerator support without resource isolation. was: Design draft: Scenarios: * client-mode, worker might create one or more executor processes, from different Spark applications. * cluster-mode, worker might create driver process as well. * local-cluster model, there could be multiple worker processes on the same node. 
This is an undocumented use of standalone mode, which is mainly for tests. Because executor and driver processes on the same node will share the accelerator resources, the worker must take the role of allocating resources. So we will add a spark.worker.resource.[resourceName].discoveryScript conf for workers to discover resources. Users need to match the resourceName in driver and executor requests and they don't need to specify discovery scripts separately. > Design: Standalone supports GPU scheduling > -- > > Key: SPARK-27368 > URL: https://issues.apache.org/jira/browse/SPARK-27368 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > > Design draft: > Scenarios: > * client-mode, worker might create one or more executor processes, from > different Spark applications. > * cluster-mode, worker might create driver process as well. > * local-cluster model, there could be multiple worker processes on the same > node. This is an undocumented use of standalone mode, which is mainly for > tests. > * Resource isolation is not considered here. > Because executor and driver processes on the same node will share the > accelerator resources, the worker must take the role of allocating resources. So > we will add a spark.worker.resource.[resourceName].discoveryScript conf for > workers to discover resources. Users need to match the resourceName in driver > and executor requests. Besides CPU cores and memory, the worker now also > considers resources when creating new executors or drivers. > Example conf: > {code} > spark.worker.resource.gpu.discoveryScript=/path/to/list-gpus.sh > spark.driver.resource.gpu.count=4 > spark.worker.resource.gpu.count=1 > {code} > In client mode, the driver process is not launched by the worker, so the user can > specify a driver resource discovery script. In cluster mode, if the user still > specifies a driver resource discovery script, it is ignored with a warning. 
> Supporting resource isolation is tricky because the Spark worker doesn't know how > to isolate resources unless we hardcode some resource names, like the GPU support > in YARN, which is less ideal. Supporting resource isolation for multiple resource > types is even harder. In the first version, we will implement accelerator > support without resource isolation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
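The design above hinges on the discovery-script conf (spark.worker.resource.[resourceName].discoveryScript). A minimal sketch of such a script follows, assuming the JSON contract Spark 3.0 eventually settled on: the script prints one JSON object with the resource name and its addresses. The fixed address list is a stand-in; a real script would enumerate devices, e.g. via nvidia-smi:

```python
#!/usr/bin/env python3
# Hypothetical GPU discovery script, e.g. referenced as
#   spark.worker.resource.gpu.discoveryScript=/path/to/list_gpus.py
# Assumed output contract (matches Spark 3.0's ResourceInformation JSON):
#   {"name": "gpu", "addresses": ["0", "1"]}
import json

def discover_gpus():
    # Stand-in for real device enumeration (e.g. parsing `nvidia-smi -L`).
    addresses = ["0", "1"]
    return {"name": "gpu", "addresses": addresses}

if __name__ == "__main__":
    print(json.dumps(discover_gpus()))
```

The worker runs the script once, parses the JSON, and can then hand out the listed addresses to executors it launches.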
[jira] [Updated] (SPARK-27368) Design: Standalone supports GPU scheduling
[ https://issues.apache.org/jira/browse/SPARK-27368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27368: -- Description: Design draft: Scenarios: * client-mode, worker might create one or more executor processes, from different Spark applications. * cluster-mode, worker might create driver process as well. * local-cluster model, there could be multiple worker processes on the same node. This is an undocumented use of standalone mode, which is mainly for tests. Because executor and driver processes on the same node will share the accelerator resources, the worker must take the role of allocating resources. So we will add a spark.worker.resource.[resourceName].discoveryScript conf for workers to discover resources. Users need to match the resourceName in driver and executor requests and they don't need to specify discovery scripts separately. > Design: Standalone supports GPU scheduling > -- > > Key: SPARK-27368 > URL: https://issues.apache.org/jira/browse/SPARK-27368 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > > Design draft: > Scenarios: > * client-mode, worker might create one or more executor processes, from > different Spark applications. > * cluster-mode, worker might create driver process as well. > * local-cluster model, there could be multiple worker processes on the same > node. This is an undocumented use of standalone mode, which is mainly for > tests. > Because executor and driver processes on the same node will share the > accelerator resources, the worker must take the role of allocating resources. So > we will add a spark.worker.resource.[resourceName].discoveryScript conf for > workers to discover resources. Users need to match the resourceName in driver > and executor requests and they don't need to specify discovery scripts > separately. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27760) Spark resources - user configs change .count to be .amount
[ https://issues.apache.org/jira/browse/SPARK-27760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-27760: -- Description: For the Spark resources, we created the config spark.\{driver/executor}.resource.\{resourceName}.count I think we should change .count to be .amount. That more easily allows users to specify things with suffix like memory in a single config and they can combine the value and unit. Without this they would have to specify 2 separate configs (like .count and .unit) which seems more of a hassle for the user. Note the yarn configs for resources use amount: spark.yarn.\{executor/driver/am}.resource=<amount>, where the <amount> is value and unit together. I think that makes a lot of sense. Filed a separate Jira to add .amount to the yarn configs as well. was: For the Spark resources, we created the config spark.\{driver/executor}.resource.\{resourceName}.count I think we should change .count to be .amount. That more easily allows users to specify things with suffix like memory in a single config and they can combine the value and unit. Without this they would have to specify 2 separate configs which seems more of a hassle for the user. Note the yarn configs for resources use amount: spark.yarn.\{executor/driver/am}.resource=<amount>, where the <amount> is value and unit together. I think that makes a lot of sense. Filed a separate Jira to add .amount to the yarn configs as well. > Spark resources - user configs change .count to be .amount > -- > > Key: SPARK-27760 > URL: https://issues.apache.org/jira/browse/SPARK-27760 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Major > > For the Spark resources, we created the config > spark.\{driver/executor}.resource.\{resourceName}.count > I think we should change .count to be .amount. 
That more easily allows users > to specify things with suffix like memory in a single config and they can > combine the value and unit. Without this they would have to specify 2 > separate configs (like .count and .unit) which seems more of a hassle for the > user. > Note the yarn configs for resources use amount: > spark.yarn.\{executor/driver/am}.resource=<amount>, where the <amount> is value and unit together. I think that makes a lot of sense. Filed a > separate Jira to add .amount to the yarn configs as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
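The point of the rename is that a single `.amount` value can carry both the quantity and an optional unit (e.g. "4" GPUs or "2g" of a memory-like resource), so no second `.unit` config is needed. A small illustrative parser sketch follows; the helper name and supported suffixes are assumptions for illustration, not Spark's actual config parser:

```python
# Sketch: parse a combined "<value><unit>" amount string into base units,
# showing why one ".amount" key can replace separate ".count"/".unit" keys.
# Supported suffixes (k/m/g, powers of 1024) are an assumption here.
import re

_UNITS = {"": 1, "k": 1024, "m": 1024**2, "g": 1024**3}

def parse_amount(amount: str) -> int:
    """Return the amount in base units; unitless strings are plain counts."""
    match = re.fullmatch(r"(\d+)([kmg]?)", amount.strip().lower())
    if not match:
        raise ValueError(f"malformed amount: {amount!r}")
    value, unit = match.groups()
    return int(value) * _UNITS[unit]

assert parse_amount("4") == 4             # e.g. 4 GPUs, no unit
assert parse_amount("2g") == 2 * 1024**3  # value and unit in one config value
```

With separate `.count` and `.unit` keys, the same information would need two lookups plus validation that both keys are present and consistent.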
[jira] [Updated] (SPARK-27760) Spark resources - user configs change .count to be .amount
[ https://issues.apache.org/jira/browse/SPARK-27760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-27760: -- Description: For the Spark resources, we created the config spark.\{driver/executor}.resource.\{resourceName}.count I think we should change .count to be .amount. That more easily allows users to specify things with suffix like memory in a single config and they can combine the value and unit. Without this they would have to specify 2 separate configs which seems more of a hassle for the user. Note the yarn configs for resources use amount: spark.yarn.\{executor/driver/am}.resource=<amount>, where the <amount> is value and unit together. I think that makes a lot of sense. Filed a separate Jira to add .amount to the yarn configs as well. was: For the Spark resources, we created the config spark.\{driver/executor}.resource.\{resourceName}.count I think we should change .count to be .amount. That more easily allows users to specify things with suffix like memory in a single config and they can combine the value and unit. Without this they would have to specify 2 separate configs which seems more of a hassle for the user. > Spark resources - user configs change .count to be .amount > -- > > Key: SPARK-27760 > URL: https://issues.apache.org/jira/browse/SPARK-27760 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Major > > For the Spark resources, we created the config > spark.\{driver/executor}.resource.\{resourceName}.count > I think we should change .count to be .amount. That more easily allows users > to specify things with suffix like memory in a single config and they can > combine the value and unit. Without this they would have to specify 2 > separate configs which seems more of a hassle for the user. 
> Note the yarn configs for resources use amount: > spark.yarn.\{executor/driver/am}.resource=<amount>, where the <amount> is value and unit together. I think that makes a lot of sense. Filed a > separate Jira to add .amount to the yarn configs as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27760) Spark resources - user configs change .count to be .amount
[ https://issues.apache.org/jira/browse/SPARK-27760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves reassigned SPARK-27760: - Assignee: Thomas Graves > Spark resources - user configs change .count to be .amount > -- > > Key: SPARK-27760 > URL: https://issues.apache.org/jira/browse/SPARK-27760 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Major > > For the Spark resources, we created the config > spark.\{driver/executor}.resource.\{resourceName}.count > I think we should change .count to be .amount. That more easily allows users > to specify things with suffix like memory in a single config and they can > combine the value and unit. Without this they would have to specify 2 > separate configs which seems more of a hassle for the user. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27760) Spark resources - user configs change .count to be .amount
[ https://issues.apache.org/jira/browse/SPARK-27760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-27760: -- Summary: Spark resources - user configs change .count to be .amount (was: Spark resources - user configs change .count to be .amount, and yarn configs should match) > Spark resources - user configs change .count to be .amount > -- > > Key: SPARK-27760 > URL: https://issues.apache.org/jira/browse/SPARK-27760 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Major > > For the Spark resources, we created the config > spark.\{driver/executor}.resource.\{resourceName}.count > I think we should change .count to be .amount. That more easily allows users > to specify things with suffix like memory in a single config and they can > combine the value and unit. Without this they would have to specify 2 > separate configs which seems more of a hassle for the user. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27959) Change YARN resource configs to use .amount
Thomas Graves created SPARK-27959: - Summary: Change YARN resource configs to use .amount Key: SPARK-27959 URL: https://issues.apache.org/jira/browse/SPARK-27959 Project: Spark Issue Type: Story Components: YARN Affects Versions: 3.0.0 Reporter: Thomas Graves We are adding generic resource support into Spark, where we have a suffix for the amount of the resource so that we can support other configs. Spark on YARN has already added configs to request resources via the configs spark.yarn.\{executor/driver/am}.resource=<amount>, where the <amount> is value and unit together. We should change those configs to have a .amount suffix on them to match the Spark configs and to allow future configs to be more easily added. YARN itself already supports tags and attributes, so if we want the user to be able to pass those from Spark at some point, having a suffix makes sense. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
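The extensibility argument above is that a `.amount` suffix turns each resource into a small config namespace, so later suffixes (e.g. something like `.tags` for YARN tags/attributes) can be added without breaking existing keys. A sketch of grouping such suffixed keys; the key layout and helper are illustrative assumptions, not Spark's implementation:

```python
# Sketch (hypothetical key layout): group suffixed resource configs by
# resource name, so ".amount" today and other suffixes later coexist.

def resource_confs(confs, prefix="spark.yarn.executor.resource."):
    """Return {resourceName: {suffix: value}} for keys under `prefix`."""
    out = {}
    for key, value in confs.items():
        if key.startswith(prefix):
            # "gpu.amount" -> resource "gpu", suffix "amount"
            resource, _, suffix = key[len(prefix):].partition(".")
            out.setdefault(resource, {})[suffix] = value
    return out

confs = {
    "spark.yarn.executor.resource.gpu.amount": "2",
    "spark.yarn.executor.resource.fpga.amount": "1",
}
assert resource_confs(confs) == {"gpu": {"amount": "2"}, "fpga": {"amount": "1"}}
```

Without the suffix, the bare `...resource.gpu=<amount>` key leaves nowhere to attach additional per-resource settings later.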
[jira] [Updated] (SPARK-27760) Spark resources - user configs change .count to be .amount, and yarn configs should match
[ https://issues.apache.org/jira/browse/SPARK-27760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-27760: -- Description: For the Spark resources, we created the config spark.\{driver/executor}.resource.\{resourceName}.count I think we should change .count to be .amount. That more easily allows users to specify things with suffix like memory in a single config and they can combine the value and unit. Without this they would have to specify 2 separate configs which seems more of a hassle for the user. was: For the Spark resources, we created the config spark.\{driver/executor}.resource.\{resourceName}.count I think we should change .count to be .amount. That more easily allows users to specify things with suffix like memory in a single config and they can combine the value and unit. Without this they would have to specify 2 separate configs which seems more of a hassle for the user. Spark on yarn already had added configs to request resources via the configs spark.yarn.\{executor/driver/am}.resource=<amount>, where the <amount> is value and unit together. We should change those configs to have a .amount suffix on them to match the spark configs and to allow future configs to be more easily added. YARN itself already supports tags and attributes so if we want the user to be able to pass those from spark at some point having a suffix makes sense. > Spark resources - user configs change .count to be .amount, and yarn configs > should match > - > > Key: SPARK-27760 > URL: https://issues.apache.org/jira/browse/SPARK-27760 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Major > > For the Spark resources, we created the config > spark.\{driver/executor}.resource.\{resourceName}.count > I think we should change .count to be .amount. That more easily allows users > to specify things with suffix like memory in a single config and they can > combine the value and unit. 
Without this they would have to specify 2 > separate configs which seems more of a hassle for the user. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
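The single-config argument can be made concrete with a small parser: one `.amount`-style value carries both magnitude and unit, where separate configs would need one key for each. This is an illustrative sketch, not Spark's actual parser; the suffix table and function name are assumptions:

```python
import re

# Illustrative parser for a combined "value + unit" resource amount,
# e.g. "4", "24g", "512m". The suffix table is an assumption for the
# sketch, not Spark's documented unit handling.
_SUFFIXES = {"": 1, "k": 1 << 10, "m": 1 << 20, "g": 1 << 30}

def parse_amount(amount: str) -> int:
    """Parse a combined value+unit string into a plain integer count."""
    match = re.fullmatch(r"(\d+)([kmg]?)", amount.strip().lower())
    if match is None:
        raise ValueError(f"malformed amount: {amount!r}")
    value, unit = match.groups()
    return int(value) * _SUFFIXES[unit]
```

With two separate configs the user would have to keep a value key and a unit key consistent; a single combined amount removes that failure mode.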
[jira] [Resolved] (SPARK-27933) Extracting common purge "behaviour" to the parent StreamExecution
[ https://issues.apache.org/jira/browse/SPARK-27933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-27933. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24781 [https://github.com/apache/spark/pull/24781] > Extracting common purge "behaviour" to the parent StreamExecution > - > > Key: SPARK-27933 > URL: https://issues.apache.org/jira/browse/SPARK-27933 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.4.3 >Reporter: Jacek Laskowski >Assignee: Jacek Laskowski >Priority: Minor > Fix For: 3.0.0 > > > Extracting the common {{purge}} "behaviour" to the parent {{StreamExecution}}.
[jira] [Assigned] (SPARK-27933) Extracting common purge "behaviour" to the parent StreamExecution
[ https://issues.apache.org/jira/browse/SPARK-27933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-27933: - Assignee: Jacek Laskowski
[jira] [Resolved] (SPARK-27364) User-facing APIs for GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-27364. --- Resolution: Fixed Fix Version/s: 3.0.0 > User-facing APIs for GPU-aware scheduling > - > > Key: SPARK-27364 > URL: https://issues.apache.org/jira/browse/SPARK-27364 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Thomas Graves >Priority: Major > Fix For: 3.0.0 > > > Design and implement: > * General guidelines for cluster managers to understand resource requests at > application start. The concrete conf/param will be under the design of each > cluster manager. > * APIs to fetch assigned resources from task context.
[jira] [Commented] (SPARK-27364) User-facing APIs for GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16856895#comment-16856895 ] Thomas Graves commented on SPARK-27364: --- User-facing changes are all committed, so I'm going to close this. A few changes from the above: getResources was just called resources. The driver config for standalone mode takes a JSON file rather than individual address configs (spark.driver.resourceFile).
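The comment notes that standalone mode reads the driver's assigned resources from a JSON file (`spark.driver.resourceFile`). A plausible shape for such a file, plus a reader for it, is sketched below; the exact field names here are assumptions for illustration, not Spark's documented schema:

```python
import json

# Hypothetical resource-file payload: each entry names a component, a
# resource, and the addresses assigned to it. Field names are
# assumptions for this sketch.
RESOURCE_FILE = """
[
  {"id": {"componentName": "spark.driver", "resourceName": "gpu"},
   "addresses": ["0", "1"]}
]
"""

def assigned_addresses(payload: str, resource: str) -> list:
    """Return all addresses assigned to the given resource name."""
    return [addr
            for alloc in json.loads(payload)
            if alloc["id"]["resourceName"] == resource
            for addr in alloc["addresses"]]
```

A task would then see its slice of such addresses through the task-context resources API mentioned above.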
[jira] [Resolved] (SPARK-27521) move data source v2 API to catalyst module
[ https://issues.apache.org/jira/browse/SPARK-27521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-27521. - Resolution: Fixed Fix Version/s: 3.0.0 > move data source v2 API to catalyst module > -- > > Key: SPARK-27521 > URL: https://issues.apache.org/jira/browse/SPARK-27521 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Labels: release-notes > Fix For: 3.0.0 > >
[jira] [Commented] (SPARK-25994) SPIP: Property Graphs, Cypher Queries, and Algorithms
[ https://issues.apache.org/jira/browse/SPARK-25994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16856827#comment-16856827 ] Ruben Berenguel commented on SPARK-25994: - Hi [~mju] I'd like to lend a hand if you feel like it (I've been following on-and-off the discussions and SPIPs for this, and currently use GraphFrames). Wouldn't mind helping with Python APIs (I'm somewhat familiar with the Python APIs and a bit of the internals, even if I'm not a frequent user of PySpark) > SPIP: Property Graphs, Cypher Queries, and Algorithms > - > > Key: SPARK-25994 > URL: https://issues.apache.org/jira/browse/SPARK-25994 > Project: Spark > Issue Type: Epic > Components: Graph >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Martin Junghanns >Priority: Major > Labels: SPIP > > Copied from the SPIP doc: > {quote} > GraphX was one of the foundational pillars of the Spark project, and is the > current graph component. This reflects the importance of the graphs data > model, which naturally pairs with an important class of analytic function, > the network or graph algorithm. > However, GraphX is not actively maintained. It is based on RDDs, and cannot > exploit Spark 2’s Catalyst query engine. GraphX is only available to Scala > users. > GraphFrames is a Spark package, which implements DataFrame-based graph > algorithms, and also incorporates simple graph pattern matching with fixed > length patterns (called “motifs”). GraphFrames is based on DataFrames, but > has a semantically weak graph data model (based on untyped edges and > vertices). The motif pattern matching facility is very limited by comparison > with the well-established Cypher language. > The Property Graph data model has become quite widespread in recent years, > and is the primary focus of commercial graph data management and of graph > data research, both for on-premises and cloud data management. 
Many users of > transactional graph databases also wish to work with immutable graphs in > Spark. > The idea is to define a Cypher-compatible Property Graph type based on > DataFrames; to replace GraphFrames querying with Cypher; to reimplement > GraphX/GraphFrames algos on the PropertyGraph type. > To achieve this goal, a core subset of Cypher for Apache Spark (CAPS), > reusing existing proven designs and code, will be employed in Spark 3.0. This > graph query processor, like CAPS, will overlay and drive the SparkSQL > Catalyst query engine, using the CAPS graph query planner. > {quote}
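The property graph model the SPIP describes — typed nodes and relationships, each carrying a property map — can be sketched as toy data structures. This is for intuition only, unrelated to the proposed DataFrame-backed implementation; all names are made up for the sketch:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A property-graph node: labels plus a property map."""
    id: int
    labels: frozenset                       # e.g. {"Person"}
    properties: dict = field(default_factory=dict)

@dataclass
class Relationship:
    """A typed, directed edge with its own property map."""
    src: int
    dst: int
    rel_type: str                           # e.g. "KNOWS"
    properties: dict = field(default_factory=dict)

def neighbors(nodes, rels, node_id, rel_type):
    """Toy pattern match: (a)-[:rel_type]->(b), returning the b nodes."""
    by_id = {n.id: n for n in nodes}
    return [by_id[r.dst] for r in rels
            if r.src == node_id and r.rel_type == rel_type]
```

In Cypher the same pattern would be written `MATCH (a)-[:KNOWS]->(b)`; typed edges and labels are exactly what the SPIP notes GraphFrames' untyped model lacks.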
[jira] [Assigned] (SPARK-27958) Stopping a SparkSession should not always stop Spark Context
[ https://issues.apache.org/jira/browse/SPARK-27958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27958: Assignee: Apache Spark > Stopping a SparkSession should not always stop Spark Context > > > Key: SPARK-27958 > URL: https://issues.apache.org/jira/browse/SPARK-27958 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Vinoo Ganesh >Assignee: Apache Spark >Priority: Major > > Creating a ticket to track the discussion here: > [http://mail-archives.apache.org/mod_mbox/spark-dev/201904.mbox/%3CCAO4re1=Nk1E1VwGzSZwQ5x0SY=_heupmed8n5yydccml_t5...@mail.gmail.com%3E] > Right now, stopping a SparkSession stops the underlying SparkContext. This > behavior is not ideal and doesn't really make sense. Stopping a SparkSession > should only stop the SparkContext in the event that it is the only session.
[jira] [Assigned] (SPARK-27958) Stopping a SparkSession should not always stop Spark Context
[ https://issues.apache.org/jira/browse/SPARK-27958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27958: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-27958) Stopping a SparkSession should not always stop Spark Context
[ https://issues.apache.org/jira/browse/SPARK-27958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16856810#comment-16856810 ] Apache Spark commented on SPARK-27958: -- User 'vinooganesh' has created a pull request for this issue: https://github.com/apache/spark/pull/24807
[jira] [Commented] (SPARK-27958) Stopping a SparkSession should not always stop Spark Context
[ https://issues.apache.org/jira/browse/SPARK-27958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16856808#comment-16856808 ] Vinoo Ganesh commented on SPARK-27958: -- [https://github.com/apache/spark/pull/24807]
[jira] [Commented] (SPARK-27958) Stopping a SparkSession should not always stop Spark Context
[ https://issues.apache.org/jira/browse/SPARK-27958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16856803#comment-16856803 ] Vinoo Ganesh commented on SPARK-27958: -- Putting up a PR shortly
[jira] [Created] (SPARK-27958) Stopping a SparkSession should not always stop Spark Context
Vinoo Ganesh created SPARK-27958: Summary: Stopping a SparkSession should not always stop Spark Context Key: SPARK-27958 URL: https://issues.apache.org/jira/browse/SPARK-27958 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0 Reporter: Vinoo Ganesh Creating a ticket to track the discussion here: [http://mail-archives.apache.org/mod_mbox/spark-dev/201904.mbox/%3CCAO4re1=Nk1E1VwGzSZwQ5x0SY=_heupmed8n5yydccml_t5...@mail.gmail.com%3E] Right now, stopping a SparkSession stops the underlying SparkContext. This behavior is not ideal and doesn't really make sense. Stopping a SparkSession should only stop the SparkContext in the event that it is the only session.
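The behavior the ticket asks for amounts to reference counting: the shared context tracks its live sessions and shuts down only when the last one stops. A toy Python model of that rule (not Spark's code; the class names are made up for the sketch):

```python
class ToyContext:
    """Stands in for SparkContext: shared, expensive, stoppable."""
    def __init__(self):
        self.stopped = False
        self.sessions = set()

    def stop(self):
        self.stopped = True

class ToySession:
    """Stands in for SparkSession: cheap, many per context."""
    def __init__(self, ctx):
        self.ctx = ctx
        ctx.sessions.add(self)

    def stop(self):
        # Only stop the shared context when this was the last session,
        # instead of unconditionally tearing it down.
        self.ctx.sessions.discard(self)
        if not self.ctx.sessions:
            self.ctx.stop()
```

Under the current behavior the first `ToySession.stop()` would call `ctx.stop()` unconditionally, killing every other session's context along with it.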
[jira] [Resolved] (SPARK-27749) hadoop-3.2 support hive-thriftserver
[ https://issues.apache.org/jira/browse/SPARK-27749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-27749. - Resolution: Fixed Assignee: Yuming Wang Fix Version/s: 3.0.0 > hadoop-3.2 support hive-thriftserver > > > Key: SPARK-27749 > URL: https://issues.apache.org/jira/browse/SPARK-27749 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.0.0 > >
[jira] [Resolved] (SPARK-20286) dynamicAllocation.executorIdleTimeout is ignored after unpersist
[ https://issues.apache.org/jira/browse/SPARK-20286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid resolved SPARK-20286. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24704 [https://github.com/apache/spark/pull/24704] > dynamicAllocation.executorIdleTimeout is ignored after unpersist > > > Key: SPARK-20286 > URL: https://issues.apache.org/jira/browse/SPARK-20286 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.1 >Reporter: Miguel Pérez >Priority: Major > Fix For: 3.0.0 > > > With dynamic allocation enabled, it seems that executors with cached data > which are unpersisted are still being killed using the > {{dynamicAllocation.cachedExecutorIdleTimeout}} configuration, instead of > {{dynamicAllocation.executorIdleTimeout}}. Assuming the default configuration > ({{dynamicAllocation.cachedExecutorIdleTimeout = Infinity}}), an executor > with unpersisted data won't be released until the job ends. > *How to reproduce* > - Set different values for {{dynamicAllocation.executorIdleTimeout}} and > {{dynamicAllocation.cachedExecutorIdleTimeout}} > - Load a file into an RDD and persist it > - Execute an action on the RDD (like a count) so some executors are activated. > - When the action has finished, unpersist the RDD > - The application UI correctly removes the persisted data from the *Storage* > tab, but if you look in the *Executors* tab, you will find that the executors > remain *active* until {{dynamicAllocation.cachedExecutorIdleTimeout}} is > reached.
[jira] [Assigned] (SPARK-20286) dynamicAllocation.executorIdleTimeout is ignored after unpersist
[ https://issues.apache.org/jira/browse/SPARK-20286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid reassigned SPARK-20286: Assignee: Marcelo Vanzin
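The bug reduces to which timeout the allocation logic consults once an executor no longer holds cached blocks. The intended selection can be sketched as a small decision function — an illustration of the rule, not the actual ExecutorAllocationManager code:

```python
import math

def idle_timeout(has_cached_blocks: bool,
                 executor_idle_timeout: float,
                 cached_executor_idle_timeout: float = math.inf) -> float:
    """Pick the idle timeout an executor should be judged against.

    Once an executor's cached blocks are unpersisted, it should fall
    back to the ordinary idle timeout; the reported bug is that the
    (default infinite) cached timeout kept applying after unpersist.
    """
    if has_cached_blocks:
        return cached_executor_idle_timeout
    return executor_idle_timeout
```

With the default infinite cached timeout, any code path that keeps treating an unpersisted executor as "cached" never releases it, matching the symptom in the *Executors* tab.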
[jira] [Updated] (SPARK-27943) Implement default constraint with Column for Hive table
[ https://issues.apache.org/jira/browse/SPARK-27943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-27943: --- Description: *Background* Default constraint with column is ANSI standard. Hive 3.0+ has supported default constraints, ref: https://issues.apache.org/jira/browse/HIVE-18726 But Spark SQL has not implemented this feature yet. *Design* Hive is widely used in production environments and is the de facto standard in the big data field. But many versions of Hive are used in production, and features differ between versions. Spark SQL needs to implement default constraints, but there are two points to pay attention to in the design: One is that Spark SQL should reduce coupling with Hive. Another is that default constraints should be compatible with different versions of Hive. We want to save the default constraint metadata into the Hive table's properties, and then restore it from the properties after the client gets the newest metadata. The implementation is the same as for other metadata (e.g. partition, bucket, statistics). Because the default constraint is part of a column, I think we could reuse the metadata of StructField. The default constraint will be cached in the StructField's metadata. *Tasks* This is a big piece of work, so I want to split it into some sub-tasks, as follows: was: Default constraint with column is ANSI standard. Hive 3.0+ has supported default constraints, ref: https://issues.apache.org/jira/browse/HIVE-18726 But Spark SQL has not implemented this feature yet. Hive is widely used in production environments and is the de facto standard in the big data field. But many versions of Hive are used in production, and features differ between versions. Spark SQL needs to implement default constraints, but there are two points to pay attention to in the design: One is that Spark SQL should reduce coupling with Hive. Another is that default constraints should be compatible with different versions of Hive. We want to save the default constraint metadata into the Hive table's properties, and then restore it from the properties after the client gets the newest metadata. The implementation is the same as for other metadata (e.g. partition, bucket, statistics). Because the default constraint is part of a column, I think we could reuse the metadata of StructField. The default constraint will be cached in the StructField's metadata. This is a big piece of work, so I want to split it into some sub-tasks, as follows: > Implement default constraint with Column for Hive table > --- > > Key: SPARK-27943 > URL: https://issues.apache.org/jira/browse/SPARK-27943 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0, 2.4.0 >Reporter: jiaan.geng >Priority: Major > > *Background* > Default constraint with column is ANSI standard. > Hive 3.0+ has supported default constraints, ref: > https://issues.apache.org/jira/browse/HIVE-18726 > But Spark SQL has not implemented this feature yet. > *Design* > Hive is widely used in production environments and is the de facto standard > in the big data field. But many versions of Hive are used in production, and > features differ between versions. > Spark SQL needs to implement default constraints, but there are two points to > pay attention to in the design: > One is that Spark SQL should reduce coupling with Hive. > Another is that default constraints should be compatible with different > versions of Hive. > We want to save the default constraint metadata into the Hive table's > properties, and then restore it from the properties after the client gets the > newest metadata. > The implementation is the same as for other metadata (e.g. > partition, bucket, statistics). > Because the default constraint is part of a column, I think we could reuse > the metadata of StructField. The default constraint will be cached in the > StructField's metadata. 
> *Tasks* > This is a big piece of work, so I want to split it into some sub-tasks, as > follows: >
[jira] [Created] (SPARK-27957) Display default constraint of column when running desc table.
jiaan.geng created SPARK-27957: -- Summary: Display default constraint of column when running desc table. Key: SPARK-27957 URL: https://issues.apache.org/jira/browse/SPARK-27957 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.4.0, 2.3.0 Reporter: jiaan.geng This is a sub-task of implementing default constraints. This JIRA covers displaying a column's default constraint when executing {code:java} desc table{code}
[jira] [Commented] (SPARK-27943) Implement default constraint with Column for Hive table
[ https://issues.apache.org/jira/browse/SPARK-27943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16856595#comment-16856595 ] Apache Spark commented on SPARK-27943: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/24372
[jira] [Updated] (SPARK-27943) Implement default constraint with Column for Hive table
[ https://issues.apache.org/jira/browse/SPARK-27943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-27943: --- Description: Default constraint with column is ANSI standard. Hive 3.0+ has supported default constraints, ref: https://issues.apache.org/jira/browse/HIVE-18726 But Spark SQL has not implemented this feature yet. Hive is widely used in production environments and is the de facto standard in the big data field. But many versions of Hive are used in production, and features differ between versions. Spark SQL needs to implement default constraints, but there are two points to pay attention to in the design: One is that Spark SQL should reduce coupling with Hive. Another is that default constraints should be compatible with different versions of Hive. We want to save the default constraint metadata into the Hive table's properties, and then restore it from the properties after the client gets the newest metadata. The implementation is the same as for other metadata (e.g. partition, bucket, statistics). Because the default constraint is part of a column, I think we could reuse the metadata of StructField. The default constraint will be cached in the StructField's metadata. This is a big piece of work, so I want to split it into some sub-tasks, as follows: was: Default constraint with column is ANSI standard. Hive 3.0+ has supported default constraints, ref: https://issues.apache.org/jira/browse/HIVE-18726 But Spark SQL has not implemented this feature yet. Hive is widely used in production environments and is the de facto standard in the big data field. But many versions of Hive are used in production, and features differ between versions. Spark SQL needs to implement default constraints, but there are two points to pay attention to in the design: One is that Spark SQL should reduce coupling with Hive. Another is that default constraints should be compatible with different versions of Hive. 
We want to save the default constraint metadata into the Hive table's properties, and then restore it from the properties after the client gets the newest metadata. The implementation is the same as for other metadata (e.g. partition, bucket, statistics). Because This is a big piece of work, so I want to split it into some sub-tasks, as follows:
[jira] [Updated] (SPARK-27943) Implement default constraint with Column for Hive table
[ https://issues.apache.org/jira/browse/SPARK-27943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-27943: --- Description: Column-level default constraints are part of the ANSI SQL standard. Hive 3.0+ supports default constraints (ref: https://issues.apache.org/jira/browse/HIVE-18726), but Spark SQL does not implement this feature yet. Hive is widely used in production environments and is the de facto standard in the big data field, but many Hive versions are used in production and features differ between versions. Spark SQL needs to implement default constraints, with two points to pay attention to in the design: one, Spark SQL should reduce coupling with Hive; two, default constraints should be compatible with different versions of Hive. We want to save the default constraint metadata into the Hive table's properties, and then restore the metadata from those properties after the client gets the latest metadata. The implementation is the same as for other metadata (e.g. partitions, buckets, statistics). Because this is a large piece of work, I want to split it into sub-tasks, as follows:

was: Column-level default constraints are part of the ANSI SQL standard. Hive 3.0+ supports default constraints (ref: https://issues.apache.org/jira/browse/HIVE-18726), but Spark SQL does not implement this feature yet. Hive is widely used in production environments and is the de facto standard in the big data field, but many Hive versions are used in production and features differ between versions. Spark SQL needs to implement default constraints, with two points to pay attention to in the design: one, Spark SQL should reduce coupling with Hive; two, default constraints should be compatible with different versions of Hive. We want to save the default constraint metadata into the Hive table's properties, and then restore the metadata from those properties after the client gets the latest metadata. The implementation is the same as for other metadata (e.g. partitions, buckets, statistics). This is a large piece of work, so I want to split it into sub-tasks, as follows:

> Implement default constraint with Column for Hive table > --- > > Key: SPARK-27943 > URL: https://issues.apache.org/jira/browse/SPARK-27943 > Project: Spark > Issue Type: New Feature > Components: SQL > Affects Versions: 2.3.0, 2.4.0 > Reporter: jiaan.geng > Priority: Major > > Column-level default constraints are part of the ANSI SQL standard. > Hive 3.0+ supports default constraints (ref: https://issues.apache.org/jira/browse/HIVE-18726), but Spark SQL does not implement this feature yet. > Hive is widely used in production environments and is the de facto standard in the big data field, but many Hive versions are used in production and features differ between versions. > Spark SQL needs to implement default constraints, with two points to pay attention to in the design: > one, Spark SQL should reduce coupling with Hive; > two, default constraints should be compatible with different versions of Hive. > We want to save the default constraint metadata into the Hive table's properties, and then restore the metadata from those properties after the client gets the latest metadata. > The implementation is the same as for other metadata (e.g. partitions, buckets, statistics). > Because this is a large piece of work, I want to split it into sub-tasks, as follows:
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
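The save-and-restore round trip through table properties described above can be sketched without any Hive or Spark dependency. This is a minimal illustration, not Spark code; the property key prefix `spark.sql.constraints.default.` and the helper names are hypothetical, chosen only to show how per-column default-constraint metadata survives inside a flat string map such as Hive table properties, independently of the Hive version.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class DefaultConstraintProps {
    // Hypothetical property-key namespace; Spark would choose its own.
    static final String PREFIX = "spark.sql.constraints.default.";

    // Save: one table property per column, e.g. "spark.sql.constraints.default.age" -> "18".
    static Map<String, String> save(Map<String, String> columnDefaults) {
        Map<String, String> props = new HashMap<>();
        for (Map.Entry<String, String> e : columnDefaults.entrySet()) {
            props.put(PREFIX + e.getKey(), e.getValue());
        }
        return props;
    }

    // Restore: scan all table properties and pick out the default-constraint keys,
    // ignoring unrelated Hive-managed properties.
    static Map<String, String> restore(Map<String, String> tableProps) {
        Map<String, String> defaults = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : tableProps.entrySet()) {
            if (e.getKey().startsWith(PREFIX)) {
                defaults.put(e.getKey().substring(PREFIX.length()), e.getValue());
            }
        }
        return defaults;
    }

    public static void main(String[] args) {
        Map<String, String> defaults = new HashMap<>();
        defaults.put("age", "18");
        Map<String, String> props = save(defaults);
        System.out.println(restore(props));  // prints {age=18}
    }
}
```

Because the metadata travels as plain strings, this is the same mechanism Spark already uses for other catalog metadata stored in table properties, which is what makes it compatible across Hive versions.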
[jira] [Commented] (SPARK-27913) Spark SQL's native ORC reader implements its own schema evolution
[ https://issues.apache.org/jira/browse/SPARK-27913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16856592#comment-16856592 ] Andrey Zinovyev commented on SPARK-27913: - Simple way to reproduce it {code:sql} create external table test_broken_orc(a struct) stored as orc; insert into table test_broken_orc select named_struct("f1", 1); drop table test_broken_orc; create external table test_broken_orc(a struct) stored as orc; select * from test_broken_orc; {code} Last statement fails with exception {noformat} Caused by: java.lang.ArrayIndexOutOfBoundsException: 1 at org.apache.orc.mapred.OrcStruct.getFieldValue(OrcStruct.java:49) at org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:133) at org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:123) at org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51) at org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51) at org.apache.spark.sql.execution.datasources.orc.OrcDeserializer.deserialize(OrcDeserializer.scala:64) at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$7.apply(OrcFileFormat.scala:230) at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$7.apply(OrcFileFormat.scala:230) at scala.collection.Iterator$$anon$11.next(Iterator.scala:410) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:104) {noformat} Also you can remove column or add column in the middle of struct field. 
As far as I understand the current implementation, it supports by-name field resolution only at the top level of the ORC structure. Everything deeper is resolved by index and is expected to match the reader schema exactly.

> Spark SQL's native ORC reader implements its own schema evolution > - > > Key: SPARK-27913 > URL: https://issues.apache.org/jira/browse/SPARK-27913 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.3.3 > Reporter: Owen O'Malley > Priority: Major > > ORC's reader handles a wide range of schema evolution, but the Spark SQL > native ORC bindings do not provide the desired schema to the ORC reader. This > causes a regression when moving spark.sql.orc.impl from 'hive' to 'native'.
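The failure mode above can be modeled without ORC or Spark. The sketch below uses hypothetical stand-in classes (it is not the ORC code itself): the stored struct is just an array of values with no field names, and nested reads are driven purely by the reader schema's field positions, so a reader schema that grew a field indexes past the end of the stored values, just as `OrcStruct.getFieldValue` does in the stack trace.

```java
import java.util.Arrays;
import java.util.List;

public class NestedResolution {
    // Stand-in for OrcStruct: stored values are just an array, with no field names.
    static class StoredStruct {
        final Object[] fields;
        StoredStruct(Object... fields) { this.fields = fields; }
        Object getFieldValue(int i) { return fields[i]; }  // out-of-bounds when i >= fields.length
    }

    // Nested fields are read purely by position in the *reader* schema,
    // so a reader schema with extra fields indexes past the stored values.
    static Object[] readNested(StoredStruct fileValue, List<String> readerFields) {
        Object[] out = new Object[readerFields.size()];
        for (int i = 0; i < readerFields.size(); i++) {
            out[i] = fileValue.getFieldValue(i);  // no by-name lookup at this level
        }
        return out;
    }

    public static void main(String[] args) {
        StoredStruct fileStruct = new StoredStruct(1);  // the file wrote a one-field struct
        try {
            readNested(fileStruct, Arrays.asList("f1", "f2"));  // reader schema grew a field
        } catch (ArrayIndexOutOfBoundsException e) {
            // mirrors the ArrayIndexOutOfBoundsException from OrcStruct.getFieldValue
            System.out.println("read past stored fields, as in the stack trace above");
        }
    }
}
```

By-name resolution at the nested level (matching `readerFields` against stored field names) would avoid the exception; positional resolution cannot.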
[jira] [Assigned] (SPARK-27798) from_avro can modify variables in other rows in local mode
[ https://issues.apache.org/jira/browse/SPARK-27798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27798: Assignee: (was: Apache Spark) > from_avro can modify variables in other rows in local mode > -- > > Key: SPARK-27798 > URL: https://issues.apache.org/jira/browse/SPARK-27798 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: Yosuke Mori >Priority: Blocker > Labels: correctness > Attachments: Screen Shot 2019-05-21 at 2.39.27 PM.png > > > Steps to reproduce: > Create a local Dataset (at least two distinct rows) with a binary Avro field. > Use the {{from_avro}} function to deserialize the binary into another column. > Verify that all of the rows incorrectly have the same value. > Here's a concrete example (using Spark 2.4.3). All it does is converts a list > of TestPayload objects into binary using the defined avro schema, then tries > to deserialize using {{from_avro}} with that same schema: > {code:java} > import org.apache.avro.Schema > import org.apache.avro.generic.{GenericDatumWriter, GenericRecord, > GenericRecordBuilder} > import org.apache.avro.io.EncoderFactory > import org.apache.spark.sql.SparkSession > import org.apache.spark.sql.avro.from_avro > import org.apache.spark.sql.functions.col > import java.io.ByteArrayOutputStream > object TestApp extends App { > // Payload container > case class TestEvent(payload: Array[Byte]) > // Deserialized Payload > case class TestPayload(message: String) > // Schema for Payload > val simpleSchema = > """ > |{ > |"type": "record", > |"name" : "Payload", > |"fields" : [ {"name" : "message", "type" : [ "string", "null" ] } ] > |} > """.stripMargin > // Convert TestPayload into avro binary > def generateSimpleSchemaBinary(record: TestPayload, avsc: String): > Array[Byte] = { > val schema = new Schema.Parser().parse(avsc) > val out = new ByteArrayOutputStream() > val writer = new GenericDatumWriter[GenericRecord](schema) > val 
encoder = EncoderFactory.get().binaryEncoder(out, null) > val rootRecord = new GenericRecordBuilder(schema).set("message", > record.message).build() > writer.write(rootRecord, encoder) > encoder.flush() > out.toByteArray > } > val spark: SparkSession = > SparkSession.builder().master("local[*]").getOrCreate() > import spark.implicits._ > List( > TestPayload("one"), > TestPayload("two"), > TestPayload("three"), > TestPayload("four") > ).map(payload => TestEvent(generateSimpleSchemaBinary(payload, > simpleSchema))) > .toDS() > .withColumn("deserializedPayload", from_avro(col("payload"), > simpleSchema)) > .show(truncate = false) > } > {code} > And here is what this program outputs: > {noformat} > +--+---+ > |payload |deserializedPayload| > +--+---+ > |[00 06 6F 6E 65] |[four] | > |[00 06 74 77 6F] |[four] | > |[00 0A 74 68 72 65 65]|[four] | > |[00 08 66 6F 75 72] |[four] | > +--+---+{noformat} > Here, we can see that the avro binary is correctly generated, but the > deserialized version is a copy of the last row. I have not yet verified that > this is an issue in cluster mode as well. > > I dug into a bit more of the code and it seems like the resuse of {{result}} > in {{AvroDataToCatalyst}} is overwriting the decoded values of previous rows. > I set a breakpoint in {{LocalRelation}} and the {{data}} sequence seem to all > point to the same address in memory - and therefore a mutation in one > variable will cause all of it to mutate. > !Screen Shot 2019-05-21 at 2.39.27 PM.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
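The aliasing the reporter describes can be reproduced without Spark or Avro. The sketch below uses hypothetical names and only mirrors the pattern of reusing one mutable `result` object across rows, as {{AvroDataToCatalyst}} reportedly does: every list slot ends up pointing at the same object, so all rows display the last decoded value, and materializing a fresh copy per row fixes it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ReusedResultDemo {
    // A mutable row holder, reused across deserialization calls (the buggy pattern).
    static class Row {
        String message;
        @Override public String toString() { return "[" + message + "]"; }
    }

    public static void main(String[] args) {
        List<String> payloads = Arrays.asList("one", "two", "three", "four");

        // Buggy: one shared Row instance is mutated in place and re-appended.
        Row shared = new Row();
        List<Row> buggy = new ArrayList<>();
        for (String p : payloads) {
            shared.message = p;   // overwrite the shared holder
            buggy.add(shared);    // every slot references the same object
        }
        System.out.println(buggy);  // prints [[four], [four], [four], [four]]

        // Fixed: materialize a fresh copy per row before handing it downstream.
        List<Row> fixed = new ArrayList<>();
        for (String p : payloads) {
            Row r = new Row();
            r.message = p;
            fixed.add(r);
        }
        System.out.println(fixed);  // prints [[one], [two], [three], [four]]
    }
}
```

This matches the observation that the heap dump shows all entries of the {{LocalRelation}} {{data}} sequence pointing at the same address.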
[jira] [Assigned] (SPARK-27798) from_avro can modify variables in other rows in local mode
[ https://issues.apache.org/jira/browse/SPARK-27798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27798: Assignee: Apache Spark > from_avro can modify variables in other rows in local mode > -- > > Key: SPARK-27798 > URL: https://issues.apache.org/jira/browse/SPARK-27798 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: Yosuke Mori >Assignee: Apache Spark >Priority: Blocker > Labels: correctness > Attachments: Screen Shot 2019-05-21 at 2.39.27 PM.png > > > Steps to reproduce: > Create a local Dataset (at least two distinct rows) with a binary Avro field. > Use the {{from_avro}} function to deserialize the binary into another column. > Verify that all of the rows incorrectly have the same value. > Here's a concrete example (using Spark 2.4.3). All it does is converts a list > of TestPayload objects into binary using the defined avro schema, then tries > to deserialize using {{from_avro}} with that same schema: > {code:java} > import org.apache.avro.Schema > import org.apache.avro.generic.{GenericDatumWriter, GenericRecord, > GenericRecordBuilder} > import org.apache.avro.io.EncoderFactory > import org.apache.spark.sql.SparkSession > import org.apache.spark.sql.avro.from_avro > import org.apache.spark.sql.functions.col > import java.io.ByteArrayOutputStream > object TestApp extends App { > // Payload container > case class TestEvent(payload: Array[Byte]) > // Deserialized Payload > case class TestPayload(message: String) > // Schema for Payload > val simpleSchema = > """ > |{ > |"type": "record", > |"name" : "Payload", > |"fields" : [ {"name" : "message", "type" : [ "string", "null" ] } ] > |} > """.stripMargin > // Convert TestPayload into avro binary > def generateSimpleSchemaBinary(record: TestPayload, avsc: String): > Array[Byte] = { > val schema = new Schema.Parser().parse(avsc) > val out = new ByteArrayOutputStream() > val writer = new 
GenericDatumWriter[GenericRecord](schema) > val encoder = EncoderFactory.get().binaryEncoder(out, null) > val rootRecord = new GenericRecordBuilder(schema).set("message", > record.message).build() > writer.write(rootRecord, encoder) > encoder.flush() > out.toByteArray > } > val spark: SparkSession = > SparkSession.builder().master("local[*]").getOrCreate() > import spark.implicits._ > List( > TestPayload("one"), > TestPayload("two"), > TestPayload("three"), > TestPayload("four") > ).map(payload => TestEvent(generateSimpleSchemaBinary(payload, > simpleSchema))) > .toDS() > .withColumn("deserializedPayload", from_avro(col("payload"), > simpleSchema)) > .show(truncate = false) > } > {code} > And here is what this program outputs: > {noformat} > +--+---+ > |payload |deserializedPayload| > +--+---+ > |[00 06 6F 6E 65] |[four] | > |[00 06 74 77 6F] |[four] | > |[00 0A 74 68 72 65 65]|[four] | > |[00 08 66 6F 75 72] |[four] | > +--+---+{noformat} > Here, we can see that the avro binary is correctly generated, but the > deserialized version is a copy of the last row. I have not yet verified that > this is an issue in cluster mode as well. > > I dug into a bit more of the code and it seems like the resuse of {{result}} > in {{AvroDataToCatalyst}} is overwriting the decoded values of previous rows. > I set a breakpoint in {{LocalRelation}} and the {{data}} sequence seem to all > point to the same address in memory - and therefore a mutation in one > variable will cause all of it to mutate. > !Screen Shot 2019-05-21 at 2.39.27 PM.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27953) Save default constraint with Column into table properties when create Hive table
[ https://issues.apache.org/jira/browse/SPARK-27953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27953: Assignee: Apache Spark

> Save default constraint with Column into table properties when create Hive > table > > > Key: SPARK-27953 > URL: https://issues.apache.org/jira/browse/SPARK-27953 > Project: Spark > Issue Type: Sub-task > Components: SQL > Affects Versions: 2.3.0, 2.4.0 > Reporter: jiaan.geng > Assignee: Apache Spark > Priority: Major > > This is a sub-task of implementing default constraints. > This Jira aims to save the default constraint into the Hive table's properties when the table is created.
[jira] [Commented] (SPARK-27953) Save default constraint with Column into table properties when create Hive table
[ https://issues.apache.org/jira/browse/SPARK-27953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16856588#comment-16856588 ] Apache Spark commented on SPARK-27953: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/24792

> Save default constraint with Column into table properties when create Hive > table > > > Key: SPARK-27953 > URL: https://issues.apache.org/jira/browse/SPARK-27953 > Project: Spark > Issue Type: Sub-task > Components: SQL > Affects Versions: 2.3.0, 2.4.0 > Reporter: jiaan.geng > Priority: Major > > This is a sub-task of implementing default constraints. > This Jira aims to save the default constraint into the Hive table's properties when the table is created.
[jira] [Assigned] (SPARK-27953) Save default constraint with Column into table properties when create Hive table
[ https://issues.apache.org/jira/browse/SPARK-27953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27953: Assignee: (was: Apache Spark)

> Save default constraint with Column into table properties when create Hive > table > > > Key: SPARK-27953 > URL: https://issues.apache.org/jira/browse/SPARK-27953 > Project: Spark > Issue Type: Sub-task > Components: SQL > Affects Versions: 2.3.0, 2.4.0 > Reporter: jiaan.geng > Priority: Major > > This is a sub-task of implementing default constraints. > This Jira aims to save the default constraint into the Hive table's properties when the table is created.
[jira] [Created] (SPARK-27956) Allow subqueries as partition filter
Johannes Mayer created SPARK-27956: -- Summary: Allow subqueries as partition filter Key: SPARK-27956 URL: https://issues.apache.org/jira/browse/SPARK-27956 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Johannes Mayer

Subqueries are not pushed down as partition filters. See the following example:

{code:java} create table user_mayerjoh.tab (c1 string) partitioned by (c2 string) stored as parquet; {code}

{code:java} explain select * from user_mayerjoh.tab where c2 < 1;{code}

== Physical Plan == *(1) FileScan parquet user_mayerjoh.tab[c1#22,c2#23] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[], PartitionCount: 0, *PartitionFilters: [isnotnull(c2#23), (cast(c2#23 as int) < 1)]*, PushedFilters: [], ReadSchema: struct

{code:java} explain select * from user_mayerjoh.tab where c2 < (select 1);{code}

== Physical Plan == +- *(1) FileScan parquet user_mayerjoh.tab[c1#30,c2#31] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[], PartitionCount: 0, *PartitionFilters: [isnotnull(c2#31)]*, PushedFilters: [], ReadSchema: struct

Is it possible to execute the subquery first and use its result as a partition filter?
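Until scalar subqueries can participate in partition pruning, a common workaround is to evaluate the subquery separately and inline its result as a literal, which the planner does push into PartitionFilters. The sketch below reuses the reporter's table; the trivial `SELECT 1` stands in for whatever the real subquery computes, and in practice a client would substitute the fetched value into the second statement.

```sql
-- Step 1: evaluate the subquery on its own and capture its (single) result.
SELECT 1;

-- Step 2: re-run the main query with that result inlined as a constant;
-- EXPLAIN then shows it in PartitionFilters, as in the first plan above.
EXPLAIN SELECT * FROM user_mayerjoh.tab WHERE c2 < 1;
```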
[jira] [Commented] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data
[ https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16856558#comment-16856558 ] Piotr Chowaniec commented on SPARK-18105: - I have a similar issue with Spark 2.3.2. Here is a stack trace: {code:java} org.apache.spark.scheduler.DAGScheduler : ShuffleMapStage 647 (count at Step.java:20) failed in 1.908 s due to org.apache.spark.shuffle.FetchFailedException: Stream is corrupted at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:528) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:444) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:62) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4.agg_doAggregateWithKeys_1$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4.agg_doAggregateWithKeys_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at org.apache.spark.scheduler.Task.run(Task.scala:109) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.io.IOException: Stream is corrupted at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:252) at net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:157) at net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:170) at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply$mcJ$sp(Utils.scala:349) at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:336) at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:336) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1381) at org.apache.spark.util.Utils$.copyStream(Utils.scala:357) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:436) ... 21 more Caused by: net.jpountz.lz4.LZ4Exception: Error decoding offset 2010 of input buffer at net.jpountz.lz4.LZ4JNIFastDecompressor.decompress(LZ4JNIFastDecompressor.java:39) at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:247) ... 29 more {code} It happens during ETL process that has about 200 steps. It looks like it depends on the input data because we have exceptions only on the production environment (on test and dev machines same process with different data is running without problems). Unfortunately there is no way to use production data on other environment, so we cannot find differences. 
Changing compression codec to Snappy gives: {code:java} o.apache.spark.scheduler.TaskSetManager : Lost task 0.0 in stage 852.3 (TID 308 36, localhost, executor driver): FetchFailed(BlockManagerId(driver, DNS.domena, 33588, None), shuffleId=298, mapId=2, reduceId=3, message= org.apache.spark.shuffle.FetchFailedException: FAILED_TO_UNCOMPRESS(5) at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:528) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:444) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:62) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at
[jira] [Updated] (SPARK-27923) List all cases that PostgreSQL throws an exception but Spark SQL is NULL
[ https://issues.apache.org/jira/browse/SPARK-27923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-27923: Description: In this ticket, we plan to list all cases that PostgreSQL throws an exception but Spark SQL is NULL. When porting the [boolean.sql|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/boolean.sql] found a case: # Cast unaccepted value to boolean type throws [invalid input syntax|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/boolean.out#L45-L47]. When porting the [case.sql|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/case.sql] found a case: # Division by zero [throws an exception|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/case.out#L96-L99]. was: # {{SELECT bool 'test' AS error;}} [link|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/boolean.out#L45-L47]. # {{SELECT 1/0 AS error;}} [link|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/case.out#L96-L99]. > List all cases that PostgreSQL throws an exception but Spark SQL is NULL > > > Key: SPARK-27923 > URL: https://issues.apache.org/jira/browse/SPARK-27923 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > In this ticket, we plan to list all cases that PostgreSQL throws an exception > but Spark SQL is NULL. > When porting the > [boolean.sql|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/boolean.sql] > found a case: > # Cast unaccepted value to boolean type throws [invalid input > syntax|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/boolean.out#L45-L47]. 
> When porting the > [case.sql|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/case.sql] > found a case: > # Division by zero [throws an > exception|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/case.out#L96-L99].
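For concreteness, the two cases listed above read as follows (results noted in comments; the Spark SQL behavior described is the default, non-ANSI behavior):

```sql
-- Casting an unaccepted value to boolean:
SELECT CAST('test' AS BOOLEAN);  -- PostgreSQL: error (invalid input syntax); Spark SQL: NULL

-- Division by zero:
SELECT 1 / 0;                    -- PostgreSQL: error (division by zero); Spark SQL: NULL
```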
[jira] [Commented] (SPARK-25380) Generated plans occupy over 50% of Spark driver memory
[ https://issues.apache.org/jira/browse/SPARK-25380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16856470#comment-16856470 ] Iris Shaibsky commented on SPARK-25380: --- We are also facing this on Spark 2.4.2. I see that the PR was merged to master on March 13, but it was not included in the Spark 2.4.3 release. When will this PR be included in a release? Thanks!

> Generated plans occupy over 50% of Spark driver memory > -- > > Key: SPARK-25380 > URL: https://issues.apache.org/jira/browse/SPARK-25380 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 2.3.1 > Environment: Spark 2.3.1 (AWS emr-5.16.0) > > Reporter: Michael Spector > Priority: Minor > Attachments: Screen Shot 2018-09-06 at 23.19.56.png, Screen Shot 2018-09-12 at 8.20.05.png, heapdump_OOM.png, image-2018-09-16-14-21-38-939.png > > > When debugging an OOM exception during a long run of a Spark application (many iterations of the same code), I found that generated plans occupy most of the driver memory. I'm not sure whether this is a memory leak or not, but it would be helpful if old plans could be purged from memory anyway. > Attached are screenshots of the OOM heap dump opened in JVisualVM.
[jira] [Updated] (SPARK-27462) Spark hive can not choose some columns in target table flexibly, when running insert into.
[ https://issues.apache.org/jira/browse/SPARK-27462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-27462: --- Issue Type: Sub-task (was: New Feature) Parent: SPARK-27943

> Spark hive can not choose some columns in target table flexibly, when running > insert into. > -- > > Key: SPARK-27462 > URL: https://issues.apache.org/jira/browse/SPARK-27462 > Project: Spark > Issue Type: Sub-task > Components: SQL > Affects Versions: 2.3.0, 2.4.0 > Reporter: jiaan.geng > Priority: Major > > Spark SQL does not support choosing a subset of the target table's columns when running > {code:java} > insert into tableA select ... from tableB;{code} > This feature is supported by Hive, so I think this syntax should be consistent with Hive. > Hive supports the following forms of 'insert into': > {code:java} > insert into gja_test_spark select * from gja_test; > insert into gja_test_spark(key, value, other) select key, value, other from gja_test; > insert into gja_test_spark(key, value) select value, other from gja_test; > insert into gja_test_spark(key, other) select value, other from gja_test; > insert into gja_test_spark(value, other) select value, other from gja_test;{code}
[jira] [Created] (SPARK-27955) Update default constraint with Column into table properties when alter Hive table
jiaan.geng created SPARK-27955: -- Summary: Update default constraint with Column into table properties when alter Hive table Key: SPARK-27955 URL: https://issues.apache.org/jira/browse/SPARK-27955 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.4.0, 2.3.0 Reporter: jiaan.geng This is a sub-task of implementing default constraints. This Jira aims to update the default constraint stored in the Hive table's properties after an ALTER TABLE.
[jira] [Created] (SPARK-27954) Restore default constraint with Column from table properties after get metadata from Hive
jiaan.geng created SPARK-27954: -- Summary: Restore default constraint with Column from table properties after get metadata from Hive Key: SPARK-27954 URL: https://issues.apache.org/jira/browse/SPARK-27954 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.4.0, 2.3.0 Reporter: jiaan.geng This is a sub-task of implementing default constraints. This Jira aims to restore the default constraint from the Hive table's properties after fetching metadata from Hive.
[jira] [Created] (SPARK-27953) Save default constraint with Column into table properties when create Hive table
jiaan.geng created SPARK-27953: -- Summary: Save default constraint with Column into table properties when create Hive table Key: SPARK-27953 URL: https://issues.apache.org/jira/browse/SPARK-27953 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.4.0, 2.3.0 Reporter: jiaan.geng This is a sub-task of implementing default constraints. This Jira aims to save the default constraint into the Hive table's properties when the table is created.
[jira] [Updated] (SPARK-27943) Implement default constraint with Column for Hive table
[ https://issues.apache.org/jira/browse/SPARK-27943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-27943: --- Description: Default constraint with column is ANSI standard. Hive 3.0+ has supported default constraint ref:https://issues.apache.org/jira/browse/HIVE-18726 But Spark SQL implement this feature not yet. Hive is widely used in production environments and is the standard in the field of big data in fact. But Hive exists many version used in production and the feature between each version are different. Spark SQL need to implement default constraint, but there are two points to pay attention to in design: One is Spark SQL should reduce coupling with Hive. Another is default constraint could compatible with different versions of Hive. We want to save the metadata of default constraint into properties of Hive table, and then we restore metadata from the properties after client gets newest metadata. The implement is the same as other metadata (e.g. partition,bucket,statistics). This is a big work, wo I want to split this work into some sub tasks, as follows: was: Default constraint with column is ANSI standard. Hive 3.0+ has supported default constraint ref:https://issues.apache.org/jira/browse/HIVE-18726 But Spark SQL implement this feature not yet. Hive is widely used in production environments and is the standard in the field of big data in fact. But Hive exists many version used in production and the feature between each version are different. Spark SQL need to implement default constraint, but there are two points to pay attention to in design: One is Spark SQL should reduce coupling with Hive. Another is default constraint could compatible with different versions of Hive. We want to save the metadata of default constraint into properties of Hive table, and then we restore metadata from the properties after client gets newest metadata. 
> Implement default constraint with Column for Hive table > --- > > Key: SPARK-27943 > URL: https://issues.apache.org/jira/browse/SPARK-27943 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0, 2.4.0 >Reporter: jiaan.geng >Priority: Major > > Default constraint with column is ANSI standard. > Hive 3.0+ has supported default constraint > ref:https://issues.apache.org/jira/browse/HIVE-18726 > But Spark SQL implement this feature not yet. > Hive is widely used in production environments and is the standard in the > field of big data in fact. But Hive exists many version used in production > and the feature between each version are different. > Spark SQL need to implement default constraint, but there are two points to > pay attention to in design: > One is Spark SQL should reduce coupling with Hive. > Another is default constraint could compatible with different versions of > Hive. > We want to save the metadata of default constraint into properties of Hive > table, and then we restore metadata from the properties after client gets > newest metadata. > The implement is the same as other metadata (e.g. > partition,bucket,statistics). > This is a big work, wo I want to split this work into some sub tasks, as > follows: > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27521) move data source v2 API to catalyst module
[ https://issues.apache.org/jira/browse/SPARK-27521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-27521: Labels: release-notes (was: ) > move data source v2 API to catalyst module > -- > > Key: SPARK-27521 > URL: https://issues.apache.org/jira/browse/SPARK-27521 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.0.0 > Reporter: Wenchen Fan > Assignee: Wenchen Fan > Priority: Major > Labels: release-notes >
[jira] [Updated] (SPARK-27943) Implement default constraint with Column for Hive table
[ https://issues.apache.org/jira/browse/SPARK-27943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-27943: --- Description: A default constraint on a column is part of the ANSI SQL standard. Hive 3.0+ supports default constraints (ref: https://issues.apache.org/jira/browse/HIVE-18726), but Spark SQL does not implement this feature yet. Hive is widely used in production environments and is the de facto standard in the big data field, but many Hive versions are used in production and their feature sets differ. Spark SQL needs to implement default constraints, with two points to consider in the design: one is that Spark SQL should reduce its coupling with Hive; the other is that default constraints should be compatible with different versions of Hive. We want to save the default-constraint metadata into the Hive table's properties and restore it from those properties after the client fetches the latest metadata. was: A default constraint on a column is part of the ANSI SQL standard. Hive 3.0+ supports default constraints (ref: https://issues.apache.org/jira/browse/HIVE-18726), but Spark SQL does not implement this feature yet. Hive is widely used in production environments and is the de facto standard in the big data field, but many Hive versions are used in production and their feature sets differ. Spark SQL needs to implement default constraints, with two points to consider in the design: one is that Spark SQL should reduce its coupling with Hive; the other is that default constraints should be compatible with different versions of Hive. > Implement default constraint with Column for Hive table > --- > > Key: SPARK-27943 > URL: https://issues.apache.org/jira/browse/SPARK-27943 > Project: Spark > Issue Type: New Feature > Components: SQL > Affects Versions: 2.3.0, 2.4.0 > Reporter: jiaan.geng > Priority: Major > > A default constraint on a column is part of the ANSI SQL standard.
> Hive 3.0+ supports default constraints (ref: https://issues.apache.org/jira/browse/HIVE-18726), but Spark SQL does not implement this feature yet. > Hive is widely used in production environments and is the de facto standard in the big data field, but many Hive versions are used in production and their feature sets differ. > Spark SQL needs to implement default constraints, with two points to consider in the design: one is that Spark SQL should reduce its coupling with Hive; the other is that default constraints should be compatible with different versions of Hive. > We want to save the default-constraint metadata into the Hive table's properties and restore it from those properties after the client fetches the latest metadata.
[jira] [Updated] (SPARK-27943) Implement default constraint with Column for Hive table
[ https://issues.apache.org/jira/browse/SPARK-27943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-27943: --- Description: A default constraint on a column is part of the ANSI SQL standard. Hive 3.0+ supports default constraints (ref: https://issues.apache.org/jira/browse/HIVE-18726), but Spark SQL does not implement this feature yet. Hive is widely used in production environments and is the de facto standard in the big data field, but many Hive versions are used in production and their feature sets differ. Spark SQL needs to implement default constraints, with two points to consider in the design: one is that Spark SQL should reduce its coupling with Hive; the other is that default constraints should be compatible with different versions of Hive. was: A default constraint on a column is part of the ANSI SQL standard. Hive 3.0+ supports default constraints (ref: https://issues.apache.org/jira/browse/HIVE-18726), but Spark SQL does not implement this feature yet. Hive is widely used in production environments and is the de facto standard in the big data field, but many Hive versions are used in production and their feature sets differ. Spark SQL needs to implement > Implement default constraint with Column for Hive table > --- > > Key: SPARK-27943 > URL: https://issues.apache.org/jira/browse/SPARK-27943 > Project: Spark > Issue Type: New Feature > Components: SQL > Affects Versions: 2.3.0, 2.4.0 > Reporter: jiaan.geng > Priority: Major > > A default constraint on a column is part of the ANSI SQL standard. > Hive 3.0+ supports default constraints (ref: https://issues.apache.org/jira/browse/HIVE-18726), but Spark SQL does not implement this feature yet. > Hive is widely used in production environments and is the de facto standard in the big data field, but many Hive versions are used in production and their feature sets differ.
> Spark SQL needs to implement default constraints, with two points to consider in the design: one is that Spark SQL should reduce its coupling with Hive; the other is that default constraints should be compatible with different versions of Hive. >
[jira] [Updated] (SPARK-27943) Implement default constraint with Column for Hive table
[ https://issues.apache.org/jira/browse/SPARK-27943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-27943: --- Description: A default constraint on a column is part of the ANSI SQL standard. Hive 3.0+ supports default constraints (ref: https://issues.apache.org/jira/browse/HIVE-18726), but Spark SQL does not implement this feature yet. Hive is widely used in production environments and is the de facto standard in the big data field, but many Hive versions are used in production and their feature sets differ. Spark SQL needs to implement was: A default constraint on a column is part of the ANSI SQL standard. Hive 3.0+ supports default constraints (ref: https://issues.apache.org/jira/browse/HIVE-18726), but Spark SQL does not implement this feature yet. Hive is widely used in production environments and is the de facto standard in the big data field, but many Hive versions are used in production and their feature sets differ. > Implement default constraint with Column for Hive table > --- > > Key: SPARK-27943 > URL: https://issues.apache.org/jira/browse/SPARK-27943 > Project: Spark > Issue Type: New Feature > Components: SQL > Affects Versions: 2.3.0, 2.4.0 > Reporter: jiaan.geng > Priority: Major > > A default constraint on a column is part of the ANSI SQL standard. > Hive 3.0+ supports default constraints (ref: https://issues.apache.org/jira/browse/HIVE-18726), but Spark SQL does not implement this feature yet. > Hive is widely used in production environments and is the de facto standard in the big data field, but many Hive versions are used in production and their feature sets differ. > Spark SQL needs to implement
[jira] [Updated] (SPARK-27943) Implement default constraint with Column for Hive table
[ https://issues.apache.org/jira/browse/SPARK-27943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-27943: --- Description: A default constraint on a column is part of the ANSI SQL standard. Hive 3.0+ supports default constraints (ref: https://issues.apache.org/jira/browse/HIVE-18726), but Spark SQL does not implement this feature yet. Hive is widely used in production environments and is the de facto standard in the big data field, but many Hive versions are used in production and their feature sets differ. was: A default constraint on a column is part of the ANSI SQL standard. Hive 3.0+ supports default constraints (ref: https://issues.apache.org/jira/browse/HIVE-18726), but Spark SQL does not implement this feature yet. > Implement default constraint with Column for Hive table > --- > > Key: SPARK-27943 > URL: https://issues.apache.org/jira/browse/SPARK-27943 > Project: Spark > Issue Type: New Feature > Components: SQL > Affects Versions: 2.3.0, 2.4.0 > Reporter: jiaan.geng > Priority: Major > > A default constraint on a column is part of the ANSI SQL standard. > Hive 3.0+ supports default constraints (ref: https://issues.apache.org/jira/browse/HIVE-18726), but Spark SQL does not implement this feature yet. > Hive is widely used in production environments and is the de facto standard in the big data field, but many Hive versions are used in production and their feature sets differ.
[jira] [Updated] (SPARK-27943) Implement default constraint with Column for Hive table
[ https://issues.apache.org/jira/browse/SPARK-27943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-27943: --- Summary: Implement default constraint with Column for Hive table (was: Add default constraint when create hive table) > Implement default constraint with Column for Hive table > --- > > Key: SPARK-27943 > URL: https://issues.apache.org/jira/browse/SPARK-27943 > Project: Spark > Issue Type: New Feature > Components: SQL > Affects Versions: 2.3.0, 2.4.0 > Reporter: jiaan.geng > Priority: Major > > A default constraint on a column is part of the ANSI SQL standard. > Hive 3.0+ supports default constraints (ref: https://issues.apache.org/jira/browse/HIVE-18726), but Spark SQL does not implement this feature yet. >
[jira] [Updated] (SPARK-27943) Add default constraint when create hive table
[ https://issues.apache.org/jira/browse/SPARK-27943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-27943: --- Description: A default constraint on a column is part of the ANSI SQL standard. Hive 3.0+ supports default constraints (ref: https://issues.apache.org/jira/browse/HIVE-18726), but Spark SQL does not implement this feature yet. was: A default constraint on a column is part of the ANSI SQL standard. Hive 3.0+ supports default constraints (ref: https://issues.apache.org/jira/browse/HIVE-18726), but Spark SQL does not implement this feature yet. > Add default constraint when create hive table > - > > Key: SPARK-27943 > URL: https://issues.apache.org/jira/browse/SPARK-27943 > Project: Spark > Issue Type: New Feature > Components: SQL > Affects Versions: 2.3.0, 2.4.0 > Reporter: jiaan.geng > Priority: Major > > A default constraint on a column is part of the ANSI SQL standard. > Hive 3.0+ supports default constraints (ref: https://issues.apache.org/jira/browse/HIVE-18726), but Spark SQL does not implement this feature yet. >