[jira] [Commented] (SPARK-26091) Upgrade to 2.3.4 for Hive Metastore Client 2.3

2019-12-05 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989490#comment-16989490
 ] 

Dongjoon Hyun commented on SPARK-26091:
---

Apache Spark 2.4.4 is able to talk to your Hive Metastore 2.3.4 if you set the 
Hive metastore client version to 2.3.3 instead of 2.3.4.
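As an illustration only (not from this thread): the metastore client version can 
be pinned when the session is created, and {{spark.sql.hive.metastore.jars=maven}} 
is just one way to supply the matching client jars.

{code:java}
// Illustrative sketch: talk to a Hive Metastore 2.3.4 service while pinning
// the Hive metastore client version to 2.3.3.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-metastore-2.3-example")                // placeholder name
  .enableHiveSupport()
  .config("spark.sql.hive.metastore.version", "2.3.3")
  .config("spark.sql.hive.metastore.jars", "maven")     // or a path to 2.3.3 client jars
  .getOrCreate()
{code}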

> Upgrade to 2.3.4 for Hive Metastore Client 2.3
> --
>
> Key: SPARK-26091
> URL: https://issues.apache.org/jira/browse/SPARK-26091
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29966) Add version method in TableCatalog to avoid load table twice

2019-12-05 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989486#comment-16989486
 ] 

Wenchen Fan commented on SPARK-29966:
-

This should be fixed by https://github.com/apache/spark/pull/26684

> Add version method in TableCatalog to avoid load table twice
> 
>
> Key: SPARK-29966
> URL: https://issues.apache.org/jira/browse/SPARK-29966
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: ulysses you
>Priority: Minor
>
> Currently, resolving a logical plan loads the table twice, once in 
> ResolveTables and once in ResolveRelations. ResolveRelations is the old code 
> path and ResolveTables is the v2 code path; the table is loaded twice because 
> ResolveTables loads it and then falls back to the ResolveRelations code path 
> for v1 tables.
> The same pattern also exists in ResolveSessionCatalog.
> As a result, executing a command can take twice as long as in Spark 2.4.
> The idea is to add a table version method to TableCatalog, so that rules can 
> first check the table version without loading the table.
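A rough sketch of the proposed probe (the trait and method names below are 
illustrative only; this is not an API that exists in TableCatalog):

{code:java}
// Hypothetical shape of the proposed "table version" method; names are
// illustrative and not part of the real TableCatalog interface.
import org.apache.spark.sql.connector.catalog.{Identifier, TableCatalog}

trait VersionedTableCatalog extends TableCatalog {
  // A cheap, monotonically increasing version (or timestamp) for the table,
  // so analyzer rules could detect changes without calling loadTable again.
  def tableVersion(ident: Identifier): Long
}
{code}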






[jira] [Resolved] (SPARK-29966) Add version method in TableCatalog to avoid load table twice

2019-12-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-29966.
-
Fix Version/s: 3.0.0
 Assignee: Terry Kim
   Resolution: Fixed

> Add version method in TableCatalog to avoid load table twice
> 
>
> Key: SPARK-29966
> URL: https://issues.apache.org/jira/browse/SPARK-29966
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: ulysses you
>Assignee: Terry Kim
>Priority: Minor
> Fix For: 3.0.0
>
>
> Currently, resolving a logical plan loads the table twice, once in 
> ResolveTables and once in ResolveRelations. ResolveRelations is the old code 
> path and ResolveTables is the v2 code path; the table is loaded twice because 
> ResolveTables loads it and then falls back to the ResolveRelations code path 
> for v1 tables.
> The same pattern also exists in ResolveSessionCatalog.
> As a result, executing a command can take twice as long as in Spark 2.4.
> The idea is to add a table version method to TableCatalog, so that rules can 
> first check the table version without loading the table.






[jira] [Resolved] (SPARK-30001) can't lookup v1 tables whose names specify the session catalog

2019-12-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30001.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26684
[https://github.com/apache/spark/pull/26684]

> can't lookup v1 tables whose names specify the session catalog
> --
>
> Key: SPARK-30001
> URL: https://issues.apache.org/jira/browse/SPARK-30001
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>
> A simple way to reproduce it
> {code}
> scala> sql("create table t using hive as select 1 as i")
> res2: org.apache.spark.sql.DataFrame = []
> scala> sql("select * from t").show
> +---+
> |  i|
> +---+
> |  1|
> +---+
> scala> sql("select * from spark_catalog.t").show
> org.apache.spark.sql.AnalysisException: Table or view not found: 
> spark_catalog.t; line 1 pos 14;
> 'Project [*]
> +- 'UnresolvedRelation [spark_catalog, t]
> {code}
> The reason is that we first go into `ResolveTables`, which looks up the table 
> successfully but then gives up because it's a v1 table. Next we go into 
> `ResolveRelations`, which does not recognize catalog names at all.
> Similar to https://issues.apache.org/jira/browse/SPARK-29966 , we should make 
> `ResolveRelations` responsible for looking up both v1 and v2 tables from the 
> session catalog, and have it correctly recognize catalog names.
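For reference, a spark-shell sketch of the intended behaviour (assuming the 
built-in session catalog keeps its default name spark_catalog and the table 
lives in the default database):

{code:java}
// Once ResolveRelations recognizes catalog names, all of these should resolve
// to the same v1 table created in the reproduction above.
sql("SELECT * FROM t").show()
sql("SELECT * FROM default.t").show()
sql("SELECT * FROM spark_catalog.default.t").show()
{code}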






[jira] [Assigned] (SPARK-30001) can't lookup v1 tables whose names specify the session catalog

2019-12-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-30001:
---

Assignee: Terry Kim

> can't lookup v1 tables whose names specify the session catalog
> --
>
> Key: SPARK-30001
> URL: https://issues.apache.org/jira/browse/SPARK-30001
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Terry Kim
>Priority: Major
> Fix For: 3.0.0
>
>
> A simple way to reproduce it
> {code}
> scala> sql("create table t using hive as select 1 as i")
> res2: org.apache.spark.sql.DataFrame = []
> scala> sql("select * from t").show
> +---+
> |  i|
> +---+
> |  1|
> +---+
> scala> sql("select * from spark_catalog.t").show
> org.apache.spark.sql.AnalysisException: Table or view not found: 
> spark_catalog.t; line 1 pos 14;
> 'Project [*]
> +- 'UnresolvedRelation [spark_catalog, t]
> {code}
> The reason is that we first go into `ResolveTables`, which looks up the table 
> successfully but then gives up because it's a v1 table. Next we go into 
> `ResolveRelations`, which does not recognize catalog names at all.
> Similar to https://issues.apache.org/jira/browse/SPARK-29966 , we should make 
> `ResolveRelations` responsible for looking up both v1 and v2 tables from the 
> session catalog, and have it correctly recognize catalog names.






[jira] [Updated] (SPARK-29966) avoid load table twice

2019-12-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-29966:

Summary: avoid load table twice  (was: Add version method in TableCatalog 
to avoid load table twice)

> avoid load table twice
> --
>
> Key: SPARK-29966
> URL: https://issues.apache.org/jira/browse/SPARK-29966
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: ulysses you
>Assignee: Terry Kim
>Priority: Minor
> Fix For: 3.0.0
>
>
> Currently, resolving a logical plan loads the table twice, once in 
> ResolveTables and once in ResolveRelations. ResolveRelations is the old code 
> path and ResolveTables is the v2 code path; the table is loaded twice because 
> ResolveTables loads it and then falls back to the ResolveRelations code path 
> for v1 tables.
> The same pattern also exists in ResolveSessionCatalog.
> As a result, executing a command can take twice as long as in Spark 2.4.
> The idea is to add a table version method to TableCatalog, so that rules can 
> first check the table version without loading the table.






[jira] [Resolved] (SPARK-30067) A bug in getBlockHosts

2019-12-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30067.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26650
[https://github.com/apache/spark/pull/26650]

> A bug in getBlockHosts
> --
>
> Key: SPARK-30067
> URL: https://issues.apache.org/jira/browse/SPARK-30067
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: madianjun
>Assignee: madianjun
>Priority: Minor
> Fix For: 3.0.0
>
>
> There is a bug in the getBlockHosts() function. In the case "The fragment 
> ends at a position within this block", the end of the fragment should be 
> before the end of the block, where the "end of block" means {{b.getOffset + 
> b.getLength}}, not {{b.getLength}}.
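An illustrative sketch of the corrected comparison (standalone Scala; the 
parameter names are assumptions, not the actual Spark source):

{code:java}
// Does the fragment end inside this block? Compare against the block's
// absolute end (offset + length), not just its length.
def fragmentEndsInBlock(fragmentStart: Long, fragmentLength: Long,
                        blockOffset: Long, blockLength: Long): Boolean = {
  val fragmentEnd = fragmentStart + fragmentLength
  fragmentEnd > blockOffset && fragmentEnd <= blockOffset + blockLength
}
{code}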






[jira] [Updated] (SPARK-30067) Fix fragment offset comparison in getBlockHosts

2019-12-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30067:
--
Summary: Fix fragment offset comparison in getBlockHosts  (was: A bug in 
getBlockHosts)

> Fix fragment offset comparison in getBlockHosts
> ---
>
> Key: SPARK-30067
> URL: https://issues.apache.org/jira/browse/SPARK-30067
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: madianjun
>Assignee: madianjun
>Priority: Minor
> Fix For: 3.0.0
>
>
> There is a bug in the getBlockHosts() function. In the case "The fragment 
> ends at a position within this block", the end of the fragment should be 
> before the end of the block, where the "end of block" means {{b.getOffset + 
> b.getLength}}, not {{b.getLength}}.






[jira] [Assigned] (SPARK-30067) A bug in getBlockHosts

2019-12-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-30067:
-

Assignee: madianjun

> A bug in getBlockHosts
> --
>
> Key: SPARK-30067
> URL: https://issues.apache.org/jira/browse/SPARK-30067
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: madianjun
>Assignee: madianjun
>Priority: Minor
>
> There is a bug in the getBlockHosts() function. In the case "The fragment 
> ends at a position within this block", the end of the fragment should be 
> before the end of the block, where the "end of block" means {{b.getOffset + 
> b.getLength}}, not {{b.getLength}}.






[jira] [Resolved] (SPARK-23534) Spark run on Hadoop 3.0.0

2019-12-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-23534.
---
Fix Version/s: 3.0.0
   Resolution: Done

> Spark run on Hadoop 3.0.0
> -
>
> Key: SPARK-23534
> URL: https://issues.apache.org/jira/browse/SPARK-23534
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>Priority: Major
> Fix For: 3.0.0
>
>
> Major Hadoop vendors have already stepped, or will soon step, into Hadoop 
> 3.0, so we should also make sure Spark can run with Hadoop 3.0. This Jira 
> tracks the work to make Spark run on Hadoop 3.0.
> The work includes:
>  # Add a new Hadoop 3.0.0 profile to make Spark buildable with Hadoop 3.0.
>  # Test to see if there are dependency issues with Hadoop 3.0.
>  # Investigate the feasibility of using shaded client jars (HADOOP-11804).






[jira] [Closed] (SPARK-24590) Make Jenkins tests passed with hadoop 3 profile

2019-12-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-24590.
-

> Make Jenkins tests passed with hadoop 3 profile
> ---
>
> Key: SPARK-24590
> URL: https://issues.apache.org/jira/browse/SPARK-24590
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently, some tests fail with the hadoop-3 profile.
> The PR builder 
> (https://github.com/apache/spark/pull/21441#issuecomment-397818337) 
> reported:
> {code}
> org.apache.spark.sql.hive.HiveSparkSubmitSuite.SPARK-8020: set sql conf in 
> spark conf
> org.apache.spark.sql.hive.HiveSparkSubmitSuite.SPARK-9757 Persist Parquet 
> relation with decimal column
> org.apache.spark.sql.hive.HiveSparkSubmitSuite.ConnectionURL
> org.apache.spark.sql.hive.StatisticsSuite.SPARK-22745 - read Hive's 
> statistics for partition
> org.apache.spark.sql.hive.StatisticsSuite.alter table rename after analyze 
> table
> org.apache.spark.sql.hive.StatisticsSuite.alter table SET TBLPROPERTIES after 
> analyze table
> org.apache.spark.sql.hive.StatisticsSuite.alter table UNSET TBLPROPERTIES 
> after analyze table
> org.apache.spark.sql.hive.client.HiveClientSuites.(It is not a test it is a 
> sbt.testing.SuiteSelector)
> org.apache.spark.sql.hive.client.VersionsSuite.success sanity check
> org.apache.spark.sql.hive.client.VersionsSuite.hadoop configuration preserved 
> 75 ms
> org.apache.spark.sql.hive.client.VersionsSuite.*: * (roughly)
> org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite.basic DDL using 
> locale tr - caseSensitive true
> org.apache.spark.sql.hive.execution.HiveDDLSuite.create Hive-serde table and 
> view with unicode columns and comment
> org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.SPARK-21617: ALTER 
> TABLE for non-compatible DataSource tables
> org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.SPARK-21617: ALTER 
> TABLE for Hive-compatible DataSource tables
> org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.SPARK-21617: ALTER 
> TABLE for Hive tables
> org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.SPARK-21617: ALTER 
> TABLE with incompatible schema on Hive-compatible table
> org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.(It is not a test it is 
> a sbt.testing.SuiteSelector)
> org.apache.spark.sql.hive.execution.SQLQuerySuite.SPARK-18355 Read data from 
> a hive table with a new column - orc
> org.apache.spark.sql.hive.execution.SQLQuerySuite.SPARK-18355 Read data from 
> a hive table with a new column - parquet
> org.apache.spark.sql.hive.orc.HiveOrcSourceSuite.SPARK-19459/SPARK-18220: 
> read char/varchar column written by Hive
> {code}






[jira] [Commented] (SPARK-24590) Make Jenkins tests passed with hadoop 3 profile

2019-12-05 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989472#comment-16989472
 ] 

Dongjoon Hyun commented on SPARK-24590:
---

Thanks. Yes. This is superseded by the other JIRA.

> Make Jenkins tests passed with hadoop 3 profile
> ---
>
> Key: SPARK-24590
> URL: https://issues.apache.org/jira/browse/SPARK-24590
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently, some tests fail with the hadoop-3 profile.
> The PR builder 
> (https://github.com/apache/spark/pull/21441#issuecomment-397818337) 
> reported:
> {code}
> org.apache.spark.sql.hive.HiveSparkSubmitSuite.SPARK-8020: set sql conf in 
> spark conf
> org.apache.spark.sql.hive.HiveSparkSubmitSuite.SPARK-9757 Persist Parquet 
> relation with decimal column
> org.apache.spark.sql.hive.HiveSparkSubmitSuite.ConnectionURL
> org.apache.spark.sql.hive.StatisticsSuite.SPARK-22745 - read Hive's 
> statistics for partition
> org.apache.spark.sql.hive.StatisticsSuite.alter table rename after analyze 
> table
> org.apache.spark.sql.hive.StatisticsSuite.alter table SET TBLPROPERTIES after 
> analyze table
> org.apache.spark.sql.hive.StatisticsSuite.alter table UNSET TBLPROPERTIES 
> after analyze table
> org.apache.spark.sql.hive.client.HiveClientSuites.(It is not a test it is a 
> sbt.testing.SuiteSelector)
> org.apache.spark.sql.hive.client.VersionsSuite.success sanity check
> org.apache.spark.sql.hive.client.VersionsSuite.hadoop configuration preserved 
> 75 ms
> org.apache.spark.sql.hive.client.VersionsSuite.*: * (roughly)
> org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite.basic DDL using 
> locale tr - caseSensitive true
> org.apache.spark.sql.hive.execution.HiveDDLSuite.create Hive-serde table and 
> view with unicode columns and comment
> org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.SPARK-21617: ALTER 
> TABLE for non-compatible DataSource tables
> org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.SPARK-21617: ALTER 
> TABLE for Hive-compatible DataSource tables
> org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.SPARK-21617: ALTER 
> TABLE for Hive tables
> org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.SPARK-21617: ALTER 
> TABLE with incompatible schema on Hive-compatible table
> org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.(It is not a test it is 
> a sbt.testing.SuiteSelector)
> org.apache.spark.sql.hive.execution.SQLQuerySuite.SPARK-18355 Read data from 
> a hive table with a new column - orc
> org.apache.spark.sql.hive.execution.SQLQuerySuite.SPARK-18355 Read data from 
> a hive table with a new column - parquet
> org.apache.spark.sql.hive.orc.HiveOrcSourceSuite.SPARK-19459/SPARK-18220: 
> read char/varchar column written by Hive
> {code}






[jira] [Resolved] (SPARK-24590) Make Jenkins tests passed with hadoop 3 profile

2019-12-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-24590.
---
Resolution: Duplicate

> Make Jenkins tests passed with hadoop 3 profile
> ---
>
> Key: SPARK-24590
> URL: https://issues.apache.org/jira/browse/SPARK-24590
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently, some tests fail with the hadoop-3 profile.
> The PR builder 
> (https://github.com/apache/spark/pull/21441#issuecomment-397818337) 
> reported:
> {code}
> org.apache.spark.sql.hive.HiveSparkSubmitSuite.SPARK-8020: set sql conf in 
> spark conf
> org.apache.spark.sql.hive.HiveSparkSubmitSuite.SPARK-9757 Persist Parquet 
> relation with decimal column
> org.apache.spark.sql.hive.HiveSparkSubmitSuite.ConnectionURL
> org.apache.spark.sql.hive.StatisticsSuite.SPARK-22745 - read Hive's 
> statistics for partition
> org.apache.spark.sql.hive.StatisticsSuite.alter table rename after analyze 
> table
> org.apache.spark.sql.hive.StatisticsSuite.alter table SET TBLPROPERTIES after 
> analyze table
> org.apache.spark.sql.hive.StatisticsSuite.alter table UNSET TBLPROPERTIES 
> after analyze table
> org.apache.spark.sql.hive.client.HiveClientSuites.(It is not a test it is a 
> sbt.testing.SuiteSelector)
> org.apache.spark.sql.hive.client.VersionsSuite.success sanity check
> org.apache.spark.sql.hive.client.VersionsSuite.hadoop configuration preserved 
> 75 ms
> org.apache.spark.sql.hive.client.VersionsSuite.*: * (roughly)
> org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite.basic DDL using 
> locale tr - caseSensitive true
> org.apache.spark.sql.hive.execution.HiveDDLSuite.create Hive-serde table and 
> view with unicode columns and comment
> org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.SPARK-21617: ALTER 
> TABLE for non-compatible DataSource tables
> org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.SPARK-21617: ALTER 
> TABLE for Hive-compatible DataSource tables
> org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.SPARK-21617: ALTER 
> TABLE for Hive tables
> org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.SPARK-21617: ALTER 
> TABLE with incompatible schema on Hive-compatible table
> org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.(It is not a test it is 
> a sbt.testing.SuiteSelector)
> org.apache.spark.sql.hive.execution.SQLQuerySuite.SPARK-18355 Read data from 
> a hive table with a new column - orc
> org.apache.spark.sql.hive.execution.SQLQuerySuite.SPARK-18355 Read data from 
> a hive table with a new column - parquet
> org.apache.spark.sql.hive.orc.HiveOrcSourceSuite.SPARK-19459/SPARK-18220: 
> read char/varchar column written by Hive
> {code}






[jira] [Updated] (SPARK-29957) Reset MiniKDC's default enctypes to fit jdk8/jdk11

2019-12-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29957:
--
Description: 
MiniKdc versions below the hadoop-3.0 line can't work well on JDK 11.
New encryption types, aes128-cts-hmac-sha256-128 and 
aes256-cts-hmac-sha384-192 (for Kerberos 5), were added and enabled by default 
in Java 11, while the MiniKdc versions below 3.0.0 used by Spark do not support 
these encryption types and do not work well when they are enabled, which 
results in authentication failures.

-
Hadoop jira: https://issues.apache.org/jira/browse/HADOOP-12911
In that jira, the author proposed replacing the original Apache Directory 
project, which is not maintained (though it was not said that it won't work on 
JDK 11), with Apache Kerby, whose Java binding matches the Java version.

And in Flink: apache/flink#9622
The author shows why hadoop-2.7.2's MiniKdc fails with JDK 11: the new 
encryption types aes128-cts-hmac-sha256-128 and aes256-cts-hmac-sha384-192 
(for Kerberos 5) were added and enabled by default in Java 11, and Spark with 
hadoop-2.7's MiniKdc does not support them and does not work well when they 
are enabled, which results in authentication failures.

And when I test hadoop-2.7.2's MiniKdc locally, the Kerberos debug error 
message is "read message stream failed, message can't match".

  was:
MiniKdc versions below the hadoop-3.0 line can't work well on JDK 11.
New encryption types, aes128-cts-hmac-sha256-128 and 
aes256-cts-hmac-sha384-192 (for Kerberos 5), were added and enabled by default 
in Java 11, while the MiniKdc versions below 3.0.0 used by Spark do not support 
these encryption types and do not work well when they are enabled, which 
results in authentication failures.


> Reset MiniKDC's default enctypes to fit jdk8/jdk11
> --
>
> Key: SPARK-29957
> URL: https://issues.apache.org/jira/browse/SPARK-29957
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.0.0
>
>
> MiniKdc versions below the hadoop-3.0 line can't work well on JDK 11.
> New encryption types, aes128-cts-hmac-sha256-128 and 
> aes256-cts-hmac-sha384-192 (for Kerberos 5), were added and enabled by 
> default in Java 11, while the MiniKdc versions below 3.0.0 used by Spark do 
> not support these encryption types and do not work well when they are 
> enabled, which results in authentication failures.
> -
> Hadoop jira: https://issues.apache.org/jira/browse/HADOOP-12911
> In that jira, the author proposed replacing the original Apache Directory 
> project, which is not maintained (though it was not said that it won't work 
> on JDK 11), with Apache Kerby, whose Java binding matches the Java version.
> And in Flink: apache/flink#9622
> The author shows why hadoop-2.7.2's MiniKdc fails with JDK 11: the new 
> encryption types aes128-cts-hmac-sha256-128 and aes256-cts-hmac-sha384-192 
> (for Kerberos 5) were added and enabled by default in Java 11, and Spark 
> with hadoop-2.7's MiniKdc does not support them and does not work well when 
> they are enabled, which results in authentication failures.
> And when I test hadoop-2.7.2's MiniKdc locally, the Kerberos debug error 
> message is "read message stream failed, message can't match".
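A minimal sketch, assuming a MiniKdc-based test simply wants to stay on 
pre-JDK-11 encryption types (an illustration, not the actual Spark change): 
write a krb5.conf that pins the enctypes and point the JVM at it before the 
KDC and clients start.

{code:java}
// Illustration only: pin Kerberos enctypes for a JDK 8/11 test run by pointing
// the JVM at a krb5.conf that excludes the new aes*-sha2 types added in Java 11.
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

val krb5Conf =
  """[libdefaults]
    |  default_realm = EXAMPLE.COM
    |  default_tkt_enctypes = aes128-cts-hmac-sha1-96
    |  default_tgs_enctypes = aes128-cts-hmac-sha1-96
    |  permitted_enctypes = aes128-cts-hmac-sha1-96
    |""".stripMargin

val confPath = Files.write(Paths.get("/tmp/krb5-test.conf"),
  krb5Conf.getBytes(StandardCharsets.UTF_8))
System.setProperty("java.security.krb5.conf", confPath.toString)
{code}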






[jira] [Resolved] (SPARK-29957) Bump MiniKdc to 3.2.0

2019-12-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-29957.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26594
[https://github.com/apache/spark/pull/26594]

> Bump MiniKdc to 3.2.0
> -
>
> Key: SPARK-29957
> URL: https://issues.apache.org/jira/browse/SPARK-29957
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.0.0
>
>
> MiniKdc versions below the hadoop-3.0 line can't work well on JDK 11.
> New encryption types, aes128-cts-hmac-sha256-128 and 
> aes256-cts-hmac-sha384-192 (for Kerberos 5), were added and enabled by 
> default in Java 11, while the MiniKdc versions below 3.0.0 used by Spark do 
> not support these encryption types and do not work well when they are 
> enabled, which results in authentication failures.






[jira] [Assigned] (SPARK-29957) Bump MiniKdc to 3.2.0

2019-12-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-29957:
-

Assignee: angerszhu

> Bump MiniKdc to 3.2.0
> -
>
> Key: SPARK-29957
> URL: https://issues.apache.org/jira/browse/SPARK-29957
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
>
> MiniKdc versions below the hadoop-3.0 line can't work well on JDK 11.
> New encryption types, aes128-cts-hmac-sha256-128 and 
> aes256-cts-hmac-sha384-192 (for Kerberos 5), were added and enabled by 
> default in Java 11, while the MiniKdc versions below 3.0.0 used by Spark do 
> not support these encryption types and do not work well when they are 
> enabled, which results in authentication failures.






[jira] [Updated] (SPARK-29957) Reset MiniKDC's default enctypes to fit jdk8/jdk11

2019-12-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29957:
--
Summary: Reset MiniKDC's default enctypes to fit jdk8/jdk11  (was: Bump 
MiniKdc to 3.2.0)

> Reset MiniKDC's default enctypes to fit jdk8/jdk11
> --
>
> Key: SPARK-29957
> URL: https://issues.apache.org/jira/browse/SPARK-29957
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.0.0
>
>
> MiniKdc versions below the hadoop-3.0 line can't work well on JDK 11.
> New encryption types, aes128-cts-hmac-sha256-128 and 
> aes256-cts-hmac-sha384-192 (for Kerberos 5), were added and enabled by 
> default in Java 11, while the MiniKdc versions below 3.0.0 used by Spark do 
> not support these encryption types and do not work well when they are 
> enabled, which results in authentication failures.






[jira] [Commented] (SPARK-26091) Upgrade to 2.3.4 for Hive Metastore Client 2.3

2019-12-05 Thread t oo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989461#comment-16989461
 ] 

t oo commented on SPARK-26091:
--

I just want to access Hive Metastore 2.3.4.

> Upgrade to 2.3.4 for Hive Metastore Client 2.3
> --
>
> Key: SPARK-26091
> URL: https://issues.apache.org/jira/browse/SPARK-26091
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.0.0
>
>







[jira] [Resolved] (SPARK-29774) Date and Timestamp type +/- null should be null as Postgres

2019-12-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-29774.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26412
[https://github.com/apache/spark/pull/26412]

> Date and Timestamp type +/- null should be null as Postgres
> ---
>
> Key: SPARK-29774
> URL: https://issues.apache.org/jira/browse/SPARK-29774
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Minor
> Fix For: 3.0.0
>
>
> {code:sql}
> postgres=# select timestamp '1999-12-31' - null;
>  ?column?
> --
> (1 row)
> postgres=# select date '1999-12-31' - null;
>  ?column?
> --
> (1 row)
> {code}
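For comparison on the Spark side, a spark-shell sketch of the behaviour this 
ticket asks for (the typed-literal syntax assumes Spark 3.0):

{code:java}
// After this change, date/timestamp +/- NULL should evaluate to NULL in Spark
// as well, matching the Postgres sessions quoted above.
sql("SELECT timestamp '1999-12-31' - null AS ts_minus_null").show()
sql("SELECT date '1999-12-31' - null AS date_minus_null").show()
// expected: one row each, containing NULL
{code}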






[jira] [Assigned] (SPARK-29774) Date and Timestamp type +/- null should be null as Postgres

2019-12-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-29774:
---

Assignee: Kent Yao

> Date and Timestamp type +/- null should be null as Postgres
> ---
>
> Key: SPARK-29774
> URL: https://issues.apache.org/jira/browse/SPARK-29774
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Minor
>
> {code:sql}
> postgres=# select timestamp '1999-12-31' - null;
>  ?column?
> --
> (1 row)
> postgres=# select date '1999-12-31' - null;
>  ?column?
> --
> (1 row)
> {code}






[jira] [Created] (SPARK-30146) add setWeightCol to GBTs in PySpark

2019-12-05 Thread Huaxin Gao (Jira)
Huaxin Gao created SPARK-30146:
--

 Summary: add setWeightCol to GBTs in PySpark
 Key: SPARK-30146
 URL: https://issues.apache.org/jira/browse/SPARK-30146
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 3.0.0
Reporter: Huaxin Gao


Add setWeightCol and setMinWeightFractionPerNode on the Python side of 
GBTClassifier and GBTRegressor.






[jira] [Resolved] (SPARK-29953) File stream source cleanup options may break a file sink output

2019-12-05 Thread Shixiong Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-29953.
--
Fix Version/s: 3.0.0
 Assignee: Jungtaek Lim
   Resolution: Fixed

> File stream source cleanup options may break a file sink output
> ---
>
> Key: SPARK-29953
> URL: https://issues.apache.org/jira/browse/SPARK-29953
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Shixiong Zhu
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.0.0
>
>
> SPARK-20568 added options to the file streaming source to clean up processed 
> files. However, when these options are applied to a directory that was 
> written by a file streaming sink, the directory becomes unqueryable, because 
> we delete files from it that are still tracked by the file sink logs.
> I think we should block the options if the input source is a file streaming 
> sink path (i.e. it has a "_spark_metadata" folder).
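A rough sketch of the proposed guard (the helper names and wiring are 
assumptions, not the actual Spark change):

{code:java}
// Refuse source-cleanup options when the input directory looks like the
// output of a file streaming sink, i.e. it contains a _spark_metadata folder.
import org.apache.hadoop.fs.{FileSystem, Path}

def isFileSinkOutput(fs: FileSystem, dir: Path): Boolean =
  fs.exists(new Path(dir, "_spark_metadata"))

def validateCleanSource(fs: FileSystem, dir: Path, cleanSource: String): Unit = {
  if (cleanSource != "off" && isFileSinkOutput(fs, dir)) {
    throw new IllegalArgumentException(
      s"'cleanSource' cannot be used on $dir: it is the output of a file streaming sink")
  }
}
{code}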






[jira] [Commented] (SPARK-30145) sparkContext.addJar fails when file path contains spaces

2019-12-05 Thread Ankit Raj Boudh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989398#comment-16989398
 ] 

Ankit Raj Boudh commented on SPARK-30145:
-

I will raise a PR for this today.

> sparkContext.addJar fails when file path contains spaces
> 
>
> Key: SPARK-30145
> URL: https://issues.apache.org/jira/browse/SPARK-30145
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Minor
>







[jira] [Created] (SPARK-30145) sparkContext.addJar fails when file path contains spaces

2019-12-05 Thread ABHISHEK KUMAR GUPTA (Jira)
ABHISHEK KUMAR GUPTA created SPARK-30145:


 Summary: sparkContext.addJar fails when file path contains spaces
 Key: SPARK-30145
 URL: https://issues.apache.org/jira/browse/SPARK-30145
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: ABHISHEK KUMAR GUPTA









[jira] [Updated] (SPARK-29510) JobGroup ID is not set for the job submitted from Spark-SQL and Spark -Shell

2019-12-05 Thread ABHISHEK KUMAR GUPTA (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ABHISHEK KUMAR GUPTA updated SPARK-29510:
-
Description: 
When a user submits jobs from spark-shell or Spark SQL, the job group ID is 
not set (UI screenshots attached).

But when jobs are submitted from beeline, the job group ID is set.

Steps:


{code:java}
create table customer(id int, name String, CName String, address String, city 
String, pin int, country String);
insert into customer values(1,'Alfred','Maria','Obere Str 
57','Berlin',12209,'Germany');
insert into customer values(2,'Ana','trujilo','Adva de la','Maxico 
D.F.',05021,'Maxico');
insert into customer values(3,'Antonio','Antonio Moreno','Mataderos 
2312','Maxico D.F.',05023,'Maxico'); 
SELECT A.CName AS CustomerName1, B.CName AS CustomerName2, A.City FROM customer 
A, customer B WHERE A.id <> B.id AND A.City = B.City ORDER BY A.City;{code}
 

  was:
When a user submits jobs from spark-shell or Spark SQL, the job group ID is 
not set (UI screenshots attached).

But when jobs are submitted from beeline, the job group ID is set.

Steps:


{code:java}
create table customer(id int, name String, CName String, address String, city 
String, pin int, country String);
insert into customer values(1,'Alfred','Maria','Obere Str 
57','Berlin',12209,'Germany');
insert into customer values(2,'Ana','trujilo','Adva de la','Maxico 
D.F.',05021,'Maxico');
insert into customer values(3,'Antonio','Antonio Moreno','Mataderos 
2312','Maxico D.F.',05023,'Maxico'); {code}
 


> JobGroup ID is not set for the job submitted from Spark-SQL and Spark -Shell
> 
>
> Key: SPARK-29510
> URL: https://issues.apache.org/jira/browse/SPARK-29510
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
> Attachments: JobGroup1.png, JobGroup2.png, JobGroup3.png
>
>
> When a user submits jobs from spark-shell or Spark SQL, the job group ID is 
> not set (UI screenshots attached).
> But when jobs are submitted from beeline, the job group ID is set.
> Steps:
> {code:java}
> create table customer(id int, name String, CName String, address String, city 
> String, pin int, country String);
> insert into customer values(1,'Alfred','Maria','Obere Str 
> 57','Berlin',12209,'Germany');
> insert into customer values(2,'Ana','trujilo','Adva de la','Maxico 
> D.F.',05021,'Maxico');
> insert into customer values(3,'Antonio','Antonio Moreno','Mataderos 
> 2312','Maxico D.F.',05023,'Maxico'); 
> SELECT A.CName AS CustomerName1, B.CName AS CustomerName2, A.City FROM 
> customer A, customer B WHERE A.id <> B.id AND A.City = B.City ORDER BY 
> A.City;{code}
>  
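A workaround sketch for spark-shell (identifiers are illustrative; this does 
not fix the underlying issue): the job group can be set explicitly on the 
SparkContext before running the query, which is essentially what the Thrift 
server does for beeline sessions.

{code:java}
// Set an explicit job group so the query's jobs show up grouped in the UI.
spark.sparkContext.setJobGroup("adhoc-customer-join",
  "customer self-join by city", interruptOnCancel = true)

spark.sql(
  """SELECT A.CName AS CustomerName1, B.CName AS CustomerName2, A.City
    |FROM customer A, customer B
    |WHERE A.id <> B.id AND A.City = B.City
    |ORDER BY A.City""".stripMargin).show()

spark.sparkContext.clearJobGroup()
{code}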






[jira] [Assigned] (SPARK-24666) Word2Vec generate infinity vectors when numIterations are large

2019-12-05 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh reassigned SPARK-24666:
---

Assignee: L. C. Hsieh

> Word2Vec generate infinity vectors when numIterations are large
> ---
>
> Key: SPARK-24666
> URL: https://issues.apache.org/jira/browse/SPARK-24666
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.3.1, 2.4.4
> Environment:  2.0.X, 2.1.X, 2.2.X, 2.3.X, 2.4.X
>Reporter: ZhongYu
>Assignee: L. C. Hsieh
>Priority: Critical
>
> We found that Word2Vec generates vectors with large absolute values when 
> numIterations is large, and if numIterations is large enough (>20), the 
> vector values may be *infinity* (or *-infinity*), resulting in useless 
> vectors.
> In normal situations, vector values are mostly around -1.0~1.0 when 
> numIterations = 1.
> The bug shows up on Spark 2.0.X, 2.1.X, 2.2.X, 2.3.X, 2.4.X.
> There is already an issue reporting this bug: 
> https://issues.apache.org/jira/browse/SPARK-5261 , but the fix seems to be 
> missing.
> Other people's reports:
> [https://stackoverflow.com/questions/49741956/infinity-vectors-in-spark-mllib-word2vec]
> [http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-outputs-Infinity-Infinity-vectors-with-increasing-iterations-td29020.html]
> ===
> Here is the code to reproduce the issue. You can download title.akas.tsv 
> from [https://datasets.imdbws.com/] and upload it to HDFS.
>  
> {code:java}
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.ml.feature.Word2Vec
> case class Sentences(name: String, words: Array[String])
> import spark.implicits._
> // IMDB raw data title.akas.tsv download from https://datasets.imdbws.com/
> val dataset = spark.read
>   .option("header", "true").option("sep", "\t")
>   .option("quote", "").option("nullValue", "\\N")
>   .csv("/tmp/word2vec/title.akas.tsv")
>   .filter("region = 'US' or language = 'en'")
>   .select("title")
>   .as[String]
>   .map(s => Sentences(s, s.split(' ')))
>   .persist()
> println("Training model...")
> val word2Vec = new Word2Vec()
>   .setInputCol("words")
>   .setOutputCol("vector")
>   .setVectorSize(64)
>   .setWindowSize(4)
>   .setNumPartitions(50)
>   .setMinCount(5)
>   .setMaxIter(20)
> val model = word2Vec.fit(dataset)
> model.getVectors.show()
> {code}
> When maxIter is set to 30, you get the following result.
> {code:java}
> scala> model.getVectors.show()
> +-++
> | word|  vector|
> +-++
> | Unspoken|[-Infinity,-Infin...|
> |   Talent|[Infinity,-Infini...|
> |Hourglass|[1.09657520526310...|
> |Nickelodeon's|[2.20436549446219...|
> |  Priests|[-1.9625896848389...|
> |Religion:|[-3.8815759928213...|
> |   Bu|[-7.9722236466752...|
> |  Totoro:|[-4.1829056206528...|
> | Trouble,|[2.51985378203136...|
> |   Hatter|[8.49108115961009...|
> |  '79|[-5.4560309784650...|
> | Vile|[-1.2059769646379...|
> | 9/11|[Infinity,-Infini...|
> |  Santino|[6.30405421282099...|
> |  Motives|[1.96207712570869...|
> |  '13|[-1.7641987324084...|
> |   Fierce|[-Infinity,Infini...|
> |   Stover|[5.10057474120744...|
> |  'It|[1.08629989605664...|
> |Butts|[Infinity,Infinit...|
> +-++
> only showing top 20 rows
> {code}
> In this case, setting maxIter to 20 may not generate Infinity but still 
> produces very large absolute values; it depends on the training data sample 
> and other configurations.
> {code:java}
> scala> model.getVectors.show(2,false)
> 

[jira] [Resolved] (SPARK-24666) Word2Vec generate infinity vectors when numIterations are large

2019-12-05 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh resolved SPARK-24666.
-
Fix Version/s: 3.1.0
   2.4.5
   Resolution: Fixed

Issue resolved by pull request 26722
[https://github.com/apache/spark/pull/26722]

> Word2Vec generate infinity vectors when numIterations are large
> ---
>
> Key: SPARK-24666
> URL: https://issues.apache.org/jira/browse/SPARK-24666
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.3.1, 2.4.4
> Environment:  2.0.X, 2.1.X, 2.2.X, 2.3.X, 2.4.X
>Reporter: ZhongYu
>Assignee: L. C. Hsieh
>Priority: Critical
> Fix For: 2.4.5, 3.1.0
>
>
> We found that Word2Vec generates vectors with large absolute values when 
> numIterations is large, and if numIterations is large enough (>20), the 
> vector values may be *infinity* (or *-infinity*), resulting in useless 
> vectors.
> In normal situations, vector values are mostly around -1.0~1.0 when 
> numIterations = 1.
> The bug shows up on Spark 2.0.X, 2.1.X, 2.2.X, 2.3.X, 2.4.X.
> There is already an issue reporting this bug: 
> https://issues.apache.org/jira/browse/SPARK-5261 , but the fix seems to be 
> missing.
> Other people's reports:
> [https://stackoverflow.com/questions/49741956/infinity-vectors-in-spark-mllib-word2vec]
> [http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-outputs-Infinity-Infinity-vectors-with-increasing-iterations-td29020.html]
> ===
> Here is the code to reproduce the issue. You can download title.akas.tsv 
> from [https://datasets.imdbws.com/] and upload it to HDFS.
>  
> {code:java}
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.ml.feature.Word2Vec
> case class Sentences(name: String, words: Array[String])
> import spark.implicits._
> // IMDB raw data title.akas.tsv download from https://datasets.imdbws.com/
> val dataset = spark.read
>   .option("header", "true").option("sep", "\t")
>   .option("quote", "").option("nullValue", "\\N")
>   .csv("/tmp/word2vec/title.akas.tsv")
>   .filter("region = 'US' or language = 'en'")
>   .select("title")
>   .as[String]
>   .map(s => Sentences(s, s.split(' ')))
>   .persist()
> println("Training model...")
> val word2Vec = new Word2Vec()
>   .setInputCol("words")
>   .setOutputCol("vector")
>   .setVectorSize(64)
>   .setWindowSize(4)
>   .setNumPartitions(50)
>   .setMinCount(5)
>   .setMaxIter(20)
> val model = word2Vec.fit(dataset)
> model.getVectors.show()
> {code}
> When maxIter is set to 30, you get the following result.
> {code:java}
> scala> model.getVectors.show()
> +-++
> | word|  vector|
> +-++
> | Unspoken|[-Infinity,-Infin...|
> |   Talent|[Infinity,-Infini...|
> |Hourglass|[1.09657520526310...|
> |Nickelodeon's|[2.20436549446219...|
> |  Priests|[-1.9625896848389...|
> |Religion:|[-3.8815759928213...|
> |   Bu|[-7.9722236466752...|
> |  Totoro:|[-4.1829056206528...|
> | Trouble,|[2.51985378203136...|
> |   Hatter|[8.49108115961009...|
> |  '79|[-5.4560309784650...|
> | Vile|[-1.2059769646379...|
> | 9/11|[Infinity,-Infini...|
> |  Santino|[6.30405421282099...|
> |  Motives|[1.96207712570869...|
> |  '13|[-1.7641987324084...|
> |   Fierce|[-Infinity,Infini...|
> |   Stover|[5.10057474120744...|
> |  'It|[1.08629989605664...|
> |Butts|[Infinity,Infinit...|
> +-++
> only showing top 20 rows
> {code}
> In this case, setting maxIter to 20 may not generate Infinity but still 
> produces very large absolute values; it depends on the training data sample 
> and other configurations.
> {code:java}
> scala> model.getVectors.show(2,false)
> 

[jira] [Commented] (SPARK-26091) Upgrade to 2.3.4 for Hive Metastore Client 2.3

2019-12-05 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989271#comment-16989271
 ] 

Dongjoon Hyun commented on SPARK-26091:
---

Since SPARK-26091 was an improvement issue, we didn't backport this; only bug 
fixes are allowed for backporting. Is there any problem with Apache Spark 
2.4.4? Or do you just want to access Hive Metastore 2.3.4?

> Upgrade to 2.3.4 for Hive Metastore Client 2.3
> --
>
> Key: SPARK-26091
> URL: https://issues.apache.org/jira/browse/SPARK-26091
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.0.0
>
>







[jira] [Commented] (SPARK-26091) Upgrade to 2.3.4 for Hive Metastore Client 2.3

2019-12-05 Thread t oo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989256#comment-16989256
 ] 

t oo commented on SPARK-26091:
--

Can this go in Spark 2.4.5?

 

> Upgrade to 2.3.4 for Hive Metastore Client 2.3
> --
>
> Key: SPARK-26091
> URL: https://issues.apache.org/jira/browse/SPARK-26091
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.0.0
>
>







[jira] [Commented] (SPARK-22860) Spark workers log ssl passwords passed to the executors

2019-12-05 Thread t oo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989255#comment-16989255
 ] 

t oo commented on SPARK-22860:
--

[~kabhwan] can this go in 2.4.5?

> Spark workers log ssl passwords passed to the executors
> ---
>
> Key: SPARK-22860
> URL: https://issues.apache.org/jira/browse/SPARK-22860
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Felix K.
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.0.0
>
>
> The workers log the spark.ssl.keyStorePassword and 
> spark.ssl.trustStorePassword values that are passed on the command line to 
> the executor processes. The ExecutorRunner should mask these passwords so 
> they do not appear in the worker's log files at INFO level. In this example, 
> you can see my 'SuperSecretPassword' in a worker log:
> {code}
> 17/12/08 08:04:12 INFO ExecutorRunner: Launch command: 
> "/global/myapp/oem/jdk/bin/java" "-cp" 
> "/global/myapp/application/myapp_software/thing_loader_lib/core-repository-model-zzz-1.2.3-SNAPSHOT.jar
> [...]
> :/global/myapp/application/spark-2.1.1-bin-hadoop2.7/jars/*" "-Xmx16384M" 
> "-Dspark.authenticate.enableSaslEncryption=true" 
> "-Dspark.ssl.keyStorePassword=SuperSecretPassword" 
> "-Dspark.ssl.keyStore=/global/myapp/application/config/ssl/keystore.jks" 
> "-Dspark.ssl.trustStore=/global/myapp/application/config/ssl/truststore.jks" 
> "-Dspark.ssl.enabled=true" "-Dspark.driver.port=39927" 
> "-Dspark.ssl.protocol=TLS" 
> "-Dspark.ssl.trustStorePassword=SuperSecretPassword" 
> "-Dspark.authenticate=true" "-Dmyapp_IMPORT_DATE=2017-10-30" 
> "-Dmyapp.config.directory=/global/myapp/application/config" 
> "-Dsolr.httpclient.builder.factory=com.company.myapp.loader.auth.LoaderConfigSparkSolrBasicAuthConfigurer"
>  
> "-Djavax.net.ssl.trustStore=/global/myapp/application/config/ssl/truststore.jks"
>  "-XX:+UseG1GC" "-XX:+UseStringDeduplication" 
> "-Dthings.loader.export.zzz_files=false" 
> "-Dlog4j.configuration=file:/global/myapp/application/config/spark-executor-log4j.properties"
>  "-XX:+HeapDumpOnOutOfMemoryError" "-XX:+UseStringDeduplication" 
> "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" 
> "spark://CoarseGrainedScheduler@192.168.0.1:39927" "--executor-id" "2" 
> "--hostname" "192.168.0.1" "--cores" "4" "--app-id" "app-20171208080412-" 
> "--worker-url" "spark://Worker@192.168.0.1:59530"
> {code}
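An illustrative redaction sketch (standalone; not the actual ExecutorRunner 
change): mask password-bearing JVM options before the launch command is logged.

{code:java}
// Mask the value of any -Dspark.ssl.*Password=... argument before logging.
def redactCommand(command: Seq[String]): Seq[String] =
  command.map { arg =>
    if (arg.matches("-Dspark\\.ssl\\.\\w*[Pp]assword=.*")) {
      arg.substring(0, arg.indexOf('=') + 1) + "*********(redacted)"
    } else {
      arg
    }
  }

// redactCommand(Seq("-Dspark.ssl.keyStorePassword=SuperSecretPassword"))
//   == Seq("-Dspark.ssl.keyStorePassword=*********(redacted)")
{code}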






[jira] [Commented] (SPARK-23534) Spark run on Hadoop 3.0.0

2019-12-05 Thread t oo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989254#comment-16989254
 ] 

t oo commented on SPARK-23534:
--

close?

> Spark run on Hadoop 3.0.0
> -
>
> Key: SPARK-23534
> URL: https://issues.apache.org/jira/browse/SPARK-23534
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>Priority: Major
>
> Major Hadoop vendors have already stepped, or will soon step, into Hadoop 
> 3.0, so we should also make sure Spark can run with Hadoop 3.0. This Jira 
> tracks the work to make Spark run on Hadoop 3.0.
> The work includes:
>  # Add a new Hadoop 3.0.0 profile to make Spark buildable with Hadoop 3.0.
>  # Test to see if there are dependency issues with Hadoop 3.0.
>  # Investigate the feasibility of using shaded client jars (HADOOP-11804).






[jira] [Commented] (SPARK-24590) Make Jenkins tests passed with hadoop 3 profile

2019-12-05 Thread t oo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989253#comment-16989253
 ] 

t oo commented on SPARK-24590:
--

close?

 

> Make Jenkins tests passed with hadoop 3 profile
> ---
>
> Key: SPARK-24590
> URL: https://issues.apache.org/jira/browse/SPARK-24590
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently, some tests fail with the hadoop-3 profile.
> The PR builder 
> (https://github.com/apache/spark/pull/21441#issuecomment-397818337) 
> reported:
> {code}
> org.apache.spark.sql.hive.HiveSparkSubmitSuite.SPARK-8020: set sql conf in 
> spark conf
> org.apache.spark.sql.hive.HiveSparkSubmitSuite.SPARK-9757 Persist Parquet 
> relation with decimal column
> org.apache.spark.sql.hive.HiveSparkSubmitSuite.ConnectionURL
> org.apache.spark.sql.hive.StatisticsSuite.SPARK-22745 - read Hive's 
> statistics for partition
> org.apache.spark.sql.hive.StatisticsSuite.alter table rename after analyze 
> table
> org.apache.spark.sql.hive.StatisticsSuite.alter table SET TBLPROPERTIES after 
> analyze table
> org.apache.spark.sql.hive.StatisticsSuite.alter table UNSET TBLPROPERTIES 
> after analyze table
> org.apache.spark.sql.hive.client.HiveClientSuites.(It is not a test it is a 
> sbt.testing.SuiteSelector)
> org.apache.spark.sql.hive.client.VersionsSuite.success sanity check
> org.apache.spark.sql.hive.client.VersionsSuite.hadoop configuration preserved 
> 75 ms
> org.apache.spark.sql.hive.client.VersionsSuite.*: * (roughly)
> org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite.basic DDL using 
> locale tr - caseSensitive true
> org.apache.spark.sql.hive.execution.HiveDDLSuite.create Hive-serde table and 
> view with unicode columns and comment
> org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.SPARK-21617: ALTER 
> TABLE for non-compatible DataSource tables
> org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.SPARK-21617: ALTER 
> TABLE for Hive-compatible DataSource tables
> org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.SPARK-21617: ALTER 
> TABLE for Hive tables
> org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.SPARK-21617: ALTER 
> TABLE with incompatible schema on Hive-compatible table
> org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.(It is not a test it is 
> a sbt.testing.SuiteSelector)
> org.apache.spark.sql.hive.execution.SQLQuerySuite.SPARK-18355 Read data from 
> a hive table with a new column - orc
> org.apache.spark.sql.hive.execution.SQLQuerySuite.SPARK-18355 Read data from 
> a hive table with a new column - parquet
> org.apache.spark.sql.hive.orc.HiveOrcSourceSuite.SPARK-19459/SPARK-18220: 
> read char/varchar column written by Hive
> {code}






[jira] [Comment Edited] (SPARK-5159) Thrift server does not respect hive.server2.enable.doAs=true

2019-12-05 Thread t oo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-5159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989251#comment-16989251
 ] 

t oo edited comment on SPARK-5159 at 12/5/19 11:31 PM:
---

[~yumwang] does removal of hive fork solve this one?

 


was (Author: toopt4):
[~yumwang] does removal of hive fork soove this one?

 

> Thrift server does not respect hive.server2.enable.doAs=true
> 
>
> Key: SPARK-5159
> URL: https://issues.apache.org/jira/browse/SPARK-5159
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Andrew Ray
>Priority: Major
> Attachments: spark_thrift_server_log.txt
>
>
> I'm currently testing the Spark SQL Thrift server on a Kerberos-secured 
> cluster in YARN mode. Currently any user can access any table regardless of 
> HDFS permissions, as all data is read as the hive user. In HiveServer2 the 
> property hive.server2.enable.doAs=true causes all access to be done as the 
> submitting user. We should do the same.






[jira] [Commented] (SPARK-5159) Thrift server does not respect hive.server2.enable.doAs=true

2019-12-05 Thread t oo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-5159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989251#comment-16989251
 ] 

t oo commented on SPARK-5159:
-

[~yumwang] does removal of hive fork solve this one?

 

> Thrift server does not respect hive.server2.enable.doAs=true
> 
>
> Key: SPARK-5159
> URL: https://issues.apache.org/jira/browse/SPARK-5159
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Andrew Ray
>Priority: Major
> Attachments: spark_thrift_server_log.txt
>
>
> I'm currently testing the spark sql thrift server on a kerberos secured 
> cluster in YARN mode. Currently any user can access any table regardless of 
> HDFS permissions as all data is read as the hive user. In HiveServer2 the 
> property hive.server2.enable.doAs=true causes all access to be done as the 
> submitting user. We should do the same.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27750) Standalone scheduler - ability to prioritize applications over drivers, many drivers act like Denial of Service

2019-12-05 Thread t oo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989249#comment-16989249
 ] 

t oo commented on SPARK-27750:
--

bump

 

> Standalone scheduler - ability to prioritize applications over drivers, many 
> drivers act like Denial of Service
> ---
>
> Key: SPARK-27750
> URL: https://issues.apache.org/jira/browse/SPARK-27750
> Project: Spark
>  Issue Type: New Feature
>  Components: Scheduler
>Affects Versions: 3.0.0
>Reporter: t oo
>Priority: Minor
>
> If I submit 1000 spark submit drivers then they consume all the cores on my 
> cluster (essentially it acts like a Denial of Service) and no spark 
> 'application' gets to run since the cores are all consumed by the 'drivers'. 
> This feature is about having the ability to prioritize applications over 
> drivers so that at least some 'applications' can start running. I guess it 
> would be like: If (driver.state = 'submitted' and (exists some app.state = 
> 'submitted')) then set app.state = 'running'
> if all apps have app.state = 'running' then set driver.state = 'submitted' 
>  
> Secondary to this, why must a driver consume a minimum of 1 entire core?
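
To make the proposed rule concrete, here is a rough, plain-Python sketch of the 
prioritization described above (hypothetical state names and data shapes; the real 
standalone Master also has to account for available cores, which this ignores):

{code:python}
def schedule(drivers, apps):
    """Sketch of the proposed rule: waiting applications start before waiting drivers."""
    # 1. While any application is still waiting, hand the cores to applications.
    for app in apps:
        if app["state"] == "SUBMITTED":
            app["state"] = "RUNNING"

    # 2. Only when no application is left waiting do submitted drivers get launched.
    if all(app["state"] == "RUNNING" for app in apps):
        for driver in drivers:
            if driver["state"] == "SUBMITTED":
                driver["state"] = "RUNNING"
{code}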



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27821) Spark WebUI - show numbers of drivers/apps in waiting/submitted/killed/running state

2019-12-05 Thread t oo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989248#comment-16989248
 ] 

t oo commented on SPARK-27821:
--

The duration of running drivers is missing too.

> Spark WebUI - show numbers of drivers/apps in 
> waiting/submitted/killed/running state
> 
>
> Key: SPARK-27821
> URL: https://issues.apache.org/jira/browse/SPARK-27821
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: t oo
>Priority: Minor
> Attachments: webui.png
>
>
> The webui shows total number of apps/drivers in running/completed state. This 
> improvement is to show total number in following more fine-grained states: 
> waiting/submitted/killed/running/completed 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30144) MLP param map missing

2019-12-05 Thread Glen-Erik Cortes (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen-Erik Cortes updated SPARK-30144:
-
Description: 
Param maps for fitted classifiers are available with all classifiers except for 
the MultilayerPerceptronClassifier.
  
 There is no way to track or know what parameters were best during a 
crossvalidation or which parameters were used for submodels.
  
{code:java}
{
Param(parent='MultilayerPerceptronClassifier_eeab0cc242d1', name='featuresCol', 
doc='features column name'): 'features', 
Param(parent='MultilayerPerceptronClassifier_eeab0cc242d1', name='labelCol', 
doc='label column name'): 'fake_banknote', 
Param(parent='MultilayerPerceptronClassifier_eeab0cc242d1', 
name='predictionCol', doc='prediction column name'): 'prediction', 
Param(parent='MultilayerPerceptronClassifier_eeab0cc242d1', 
name='probabilityCol', doc='Column name for predicted class conditional 
probabilities. Note: Not all models output well-calibrated probability 
estimates! These probabilities should be treated as confidences, not precise 
probabilities'): 'probability', 
Param(parent='MultilayerPerceptronClassifier_eeab0cc242d1', 
name='rawPredictionCol', doc='raw prediction (a.k.a. confidence) column name'): 
'rawPrediction'}{code}
 
 GBTClassifier for example shows all parameters:
  
{code:java}
  {
Param(parent='GBTClassifier_a0e77b3430aa', name='cacheNodeIds', doc='If false, 
the algorithm will pass trees to executors to match instances with nodes. If 
true, the algorithm will cache node IDs for each instance. Caching can speed up 
training of deeper trees.'): False, 
Param(parent='GBTClassifier_a0e77b3430aa', name='checkpointInterval', doc='set 
checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the 
cache will get checkpointed every 10 iterations. Note: this setting will be 
ignored if the checkpoint directory is not set in the SparkContext'): 10, 
Param(parent='GBTClassifier_a0e77b3430aa', name='featureSubsetStrategy', 
doc='The number of features to consider for splits at each tree node. Supported 
options: auto, all, onethird, sqrt, log2, (0.0-1.0], [1-n].'): 'all', 
Param(parent='GBTClassifier_a0e77b3430aa', name='featuresCol', doc='features 
column name'): 'features', 
Param(parent='GBTClassifier_a0e77b3430aa', name='labelCol', doc='label column 
name'): 'fake_banknote', Param(parent='GBTClassifier_a0e77b3430aa', 
name='lossType', doc='Loss function which GBT tries to minimize 
(case-insensitive). Supported options: logistic'): 'logistic', 
Param(parent='GBTClassifier_a0e77b3430aa', name='maxBins', doc='Max number of 
bins for discretizing continuous features. Must be >=2 and >= number of 
categories for any categorical feature.'): 8, 
Param(parent='GBTClassifier_a0e77b3430aa', name='maxDepth', doc='Maximum depth 
of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal 
node + 2 leaf nodes.'): 5, Param(parent='GBTClassifier_a0e77b3430aa', 
name='maxIter', doc='maximum number of iterations (>= 0)'): 20, 
Param(parent='GBTClassifier_a0e77b3430aa', name='maxMemoryInMB', doc='Maximum 
memory in MB allocated to histogram aggregation.'): 256, 
Param(parent='GBTClassifier_a0e77b3430aa', name='minInfoGain', doc='Minimum 
information gain for a split to be considered at a tree node.'): 0.0, 
Param(parent='GBTClassifier_a0e77b3430aa', name='minInstancesPerNode', 
doc='Minimum number of instances each child must have after split. If a split 
causes the left or right child to have fewer than minInstancesPerNode, the 
split will be discarded as invalid. Should be >= 1.'): 1, 
Param(parent='GBTClassifier_a0e77b3430aa', name='predictionCol', 
doc='prediction column name'): 'prediction', 
Param(parent='GBTClassifier_a0e77b3430aa', name='seed', doc='random seed'): 
1234, 
Param(parent='GBTClassifier_a0e77b3430aa', name='stepSize', doc='Step size 
(a.k.a. learning rate) in interval (0, 1] for shrinking the contribution of 
each estimator.'): 0.1, 
Param(parent='GBTClassifier_a0e77b3430aa', name='subsamplingRate', 
doc='Fraction of the training data used for learning each decision tree, in 
range (0, 1].'): 1.0}{code}
 
 Full example notebook here:

[https://colab.research.google.com/drive/1lwSHioZKlLh96FhGkdYFe6FUuRfTcSxH]

  was:
Param maps for fitted classifiers are available with all classifiers except for 
the
 MultilayerPerceptronClassifier.
  
 There is no way to track or know what parameters were best during a 
crossvalidation or which parameters were used for submodels.
  
{code:java}
{
Param(parent='MultilayerPerceptronClassifier_eeab0cc242d1', name='featuresCol', 
doc='features column name'): 'features', 
Param(parent='MultilayerPerceptronClassifier_eeab0cc242d1', name='labelCol', 
doc='label column name'): 'fake_banknote', 
Param(parent='MultilayerPerceptronClassifier_eeab0cc242d1', 
name='predictionCol', doc='prediction column name'): 'prediction', 

[jira] [Updated] (SPARK-30144) MLP param map missing

2019-12-05 Thread Glen-Erik Cortes (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen-Erik Cortes updated SPARK-30144:
-
Description: 
Param maps for fitted classifiers are available with all classifiers except for 
the
 MultilayerPerceptronClassifier.
  
 There is no way to track or know what parameters were best during a 
crossvalidation or which parameters were used for submodels.
  
{code:java}
{
Param(parent='MultilayerPerceptronClassifier_eeab0cc242d1', name='featuresCol', 
doc='features column name'): 'features', 
Param(parent='MultilayerPerceptronClassifier_eeab0cc242d1', name='labelCol', 
doc='label column name'): 'fake_banknote', 
Param(parent='MultilayerPerceptronClassifier_eeab0cc242d1', 
name='predictionCol', doc='prediction column name'): 'prediction', 
Param(parent='MultilayerPerceptronClassifier_eeab0cc242d1', 
name='probabilityCol', doc='Column name for predicted class conditional 
probabilities. Note: Not all models output well-calibrated probability 
estimates! These probabilities should be treated as confidences, not precise 
probabilities'): 'probability', 
Param(parent='MultilayerPerceptronClassifier_eeab0cc242d1', 
name='rawPredictionCol', doc='raw prediction (a.k.a. confidence) column name'): 
'rawPrediction'}{code}
 
 GBTClassifier for example shows all parameters:
  
{code:java}
  {
Param(parent='GBTClassifier_a0e77b3430aa', name='cacheNodeIds', doc='If false, 
the algorithm will pass trees to executors to match instances with nodes. If 
true, the algorithm will cache node IDs for each instance. Caching can speed up 
training of deeper trees.'): False, 
Param(parent='GBTClassifier_a0e77b3430aa', name='checkpointInterval', doc='set 
checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the 
cache will get checkpointed every 10 iterations. Note: this setting will be 
ignored if the checkpoint directory is not set in the SparkContext'): 10, 
Param(parent='GBTClassifier_a0e77b3430aa', name='featureSubsetStrategy', 
doc='The number of features to consider for splits at each tree node. Supported 
options: auto, all, onethird, sqrt, log2, (0.0-1.0], [1-n].'): 'all', 
Param(parent='GBTClassifier_a0e77b3430aa', name='featuresCol', doc='features 
column name'): 'features', 
Param(parent='GBTClassifier_a0e77b3430aa', name='labelCol', doc='label column 
name'): 'fake_banknote', Param(parent='GBTClassifier_a0e77b3430aa', 
name='lossType', doc='Loss function which GBT tries to minimize 
(case-insensitive). Supported options: logistic'): 'logistic', 
Param(parent='GBTClassifier_a0e77b3430aa', name='maxBins', doc='Max number of 
bins for discretizing continuous features. Must be >=2 and >= number of 
categories for any categorical feature.'): 8, 
Param(parent='GBTClassifier_a0e77b3430aa', name='maxDepth', doc='Maximum depth 
of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal 
node + 2 leaf nodes.'): 5, Param(parent='GBTClassifier_a0e77b3430aa', 
name='maxIter', doc='maximum number of iterations (>= 0)'): 20, 
Param(parent='GBTClassifier_a0e77b3430aa', name='maxMemoryInMB', doc='Maximum 
memory in MB allocated to histogram aggregation.'): 256, 
Param(parent='GBTClassifier_a0e77b3430aa', name='minInfoGain', doc='Minimum 
information gain for a split to be considered at a tree node.'): 0.0, 
Param(parent='GBTClassifier_a0e77b3430aa', name='minInstancesPerNode', 
doc='Minimum number of instances each child must have after split. If a split 
causes the left or right child to have fewer than minInstancesPerNode, the 
split will be discarded as invalid. Should be >= 1.'): 1, 
Param(parent='GBTClassifier_a0e77b3430aa', name='predictionCol', 
doc='prediction column name'): 'prediction', 
Param(parent='GBTClassifier_a0e77b3430aa', name='seed', doc='random seed'): 
1234, 
Param(parent='GBTClassifier_a0e77b3430aa', name='stepSize', doc='Step size 
(a.k.a. learning rate) in interval (0, 1] for shrinking the contribution of 
each estimator.'): 0.1, 
Param(parent='GBTClassifier_a0e77b3430aa', name='subsamplingRate', 
doc='Fraction of the training data used for learning each decision tree, in 
range (0, 1].'): 1.0}{code}
 
 Full example notebook here:

[https://colab.research.google.com/drive/1lwSHioZKlLh96FhGkdYFe6FUuRfTcSxH]

  was:
Param maps for fitted classifiers are available with all classifiers except for 
the
MultilayerPerceptronClassifier.
 
There is no way to track or know what parameters were best during a 
crossvalidation or which parameters were used for submodels.
 
{code:java}
{Param(parent='MultilayerPerceptronClassifier_eeab0cc242d1', 
name='featuresCol', doc='features column name'): 'features', 
Param(parent='MultilayerPerceptronClassifier_eeab0cc242d1', name='labelCol', 
doc='label column name'): 'fake_banknote', 
Param(parent='MultilayerPerceptronClassifier_eeab0cc242d1', 
name='predictionCol', doc='prediction column name'): 'prediction', 

[jira] [Created] (SPARK-30144) MLP param map missing

2019-12-05 Thread Glen-Erik Cortes (Jira)
Glen-Erik Cortes created SPARK-30144:


 Summary: MLP param map missing
 Key: SPARK-30144
 URL: https://issues.apache.org/jira/browse/SPARK-30144
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 2.4.4
Reporter: Glen-Erik Cortes


Param maps for fitted classifiers are available with all classifiers except for 
the
MultilayerPerceptronClassifier.
 
There is no way to track or know what parameters were best during a 
crossvalidation or which parameters were used for submodels.
 
{code:java}
{Param(parent='MultilayerPerceptronClassifier_eeab0cc242d1', 
name='featuresCol', doc='features column name'): 'features', 
Param(parent='MultilayerPerceptronClassifier_eeab0cc242d1', name='labelCol', 
doc='label column name'): 'fake_banknote', 
Param(parent='MultilayerPerceptronClassifier_eeab0cc242d1', 
name='predictionCol', doc='prediction column name'): 'prediction', 
Param(parent='MultilayerPerceptronClassifier_eeab0cc242d1', 
name='probabilityCol', doc='Column name for predicted class conditional 
probabilities. Note: Not all models output well-calibrated probability 
estimates! These probabilities should be treated as confidences, not precise 
probabilities'): 'probability', 
Param(parent='MultilayerPerceptronClassifier_eeab0cc242d1', 
name='rawPredictionCol', doc='raw prediction (a.k.a. confidence) column name'): 
'rawPrediction'}{code}
 
GBTClassifier for example shows all parameters:
 
{code:java}
  {Param(parent='GBTClassifier_a0e77b3430aa', name='cacheNodeIds', doc='If 
false, the algorithm will pass trees to executors to match instances with 
nodes. If true, the algorithm will cache node IDs for each instance. Caching 
can speed up training of deeper trees.'): False, 
Param(parent='GBTClassifier_a0e77b3430aa', name='checkpointInterval', doc='set 
checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the 
cache will get checkpointed every 10 iterations. Note: this setting will be 
ignored if the checkpoint directory is not set in the SparkContext'): 10, 
Param(parent='GBTClassifier_a0e77b3430aa', name='featureSubsetStrategy', 
doc='The number of features to consider for splits at each tree node. Supported 
options: auto, all, onethird, sqrt, log2, (0.0-1.0], [1-n].'): 'all', 
Param(parent='GBTClassifier_a0e77b3430aa', name='featuresCol', doc='features 
column name'): 'features', Param(parent='GBTClassifier_a0e77b3430aa', 
name='labelCol', doc='label column name'): 'fake_banknote', 
Param(parent='GBTClassifier_a0e77b3430aa', name='lossType', doc='Loss function 
which GBT tries to minimize (case-insensitive). Supported options: logistic'): 
'logistic', Param(parent='GBTClassifier_a0e77b3430aa', name='maxBins', doc='Max 
number of bins for discretizing continuous features. Must be >=2 and >= number 
of categories for any categorical feature.'): 8, 
Param(parent='GBTClassifier_a0e77b3430aa', name='maxDepth', doc='Maximum depth 
of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal 
node + 2 leaf nodes.'): 5, Param(parent='GBTClassifier_a0e77b3430aa', 
name='maxIter', doc='maximum number of iterations (>= 0)'): 20, 
Param(parent='GBTClassifier_a0e77b3430aa', name='maxMemoryInMB', doc='Maximum 
memory in MB allocated to histogram aggregation.'): 256, 
Param(parent='GBTClassifier_a0e77b3430aa', name='minInfoGain', doc='Minimum 
information gain for a split to be considered at a tree node.'): 0.0, 
Param(parent='GBTClassifier_a0e77b3430aa', name='minInstancesPerNode', 
doc='Minimum number of instances each child must have after split. If a split 
causes the left or right child to have fewer than minInstancesPerNode, the 
split will be discarded as invalid. Should be >= 1.'): 1, 
Param(parent='GBTClassifier_a0e77b3430aa', name='predictionCol', 
doc='prediction column name'): 'prediction', 
Param(parent='GBTClassifier_a0e77b3430aa', name='seed', doc='random seed'): 
1234, Param(parent='GBTClassifier_a0e77b3430aa', name='stepSize', doc='Step 
size (a.k.a. learning rate) in interval (0, 1] for shrinking the contribution 
of each estimator.'): 0.1, Param(parent='GBTClassifier_a0e77b3430aa', 
name='subsamplingRate', doc='Fraction of the training data used for learning 
each decision tree, in range (0, 1].'): 1.0}{code}
 
Full example notebook here:

https://colab.research.google.com/drive/1lwSHioZKlLh96FhGkdYFe6FUuRfTcSxH
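
For context, a minimal PySpark sketch of how such a param map is read from a fitted 
model (the two-row toy dataset and the 'fake_banknote' label column are assumptions 
for illustration, not the notebook's actual data):

{code:python}
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("mlp-param-map").getOrCreate()

# Tiny toy frame: two features, binary label.
train = spark.createDataFrame(
    [(Vectors.dense([0.0, 0.0]), 0.0), (Vectors.dense([1.0, 1.0]), 1.0)],
    ["features", "fake_banknote"],
)

mlp = MultilayerPerceptronClassifier(labelCol="fake_banknote", layers=[2, 4, 2], maxIter=5)
model = mlp.fit(train)

# On 2.4.4 this prints only the shared column-name params shown above, whereas a
# fitted GBTClassificationModel lists its full training configuration.
print(model.extractParamMap())
{code}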



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30143) StreamingQuery.stop() should not block indefinitely

2019-12-05 Thread Burak Yavuz (Jira)
Burak Yavuz created SPARK-30143:
---

 Summary: StreamingQuery.stop() should not block indefinitely
 Key: SPARK-30143
 URL: https://issues.apache.org/jira/browse/SPARK-30143
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.4.4
Reporter: Burak Yavuz


The stop() method on a Streaming Query awaits the termination of the stream 
execution thread. However, the stream execution thread may block forever 
depending on the streaming source implementation (like in Kafka, which runs 
UninterruptibleThreads).

This causes control flow applications to hang indefinitely as well. We'd like 
to introduce a timeout to stop the execution thread, so that the control flow 
thread can decide to do an action if a timeout is hit. 
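
Until such a timeout exists in the engine, a caller-side workaround sketch (plain 
PySpark plus threading, not the proposed API change) is to invoke stop() from a 
helper thread and bound how long the control-flow thread waits for it:

{code:python}
import threading

def stop_with_timeout(query, timeout_sec=30):
    """Ask a StreamingQuery to stop, but wait at most timeout_sec for stop() to return.

    stop() itself may block on the stream execution thread (e.g. Kafka's
    uninterruptible threads), so it is called from a daemon thread and the
    caller only joins with a bounded timeout.
    """
    stopper = threading.Thread(target=query.stop, daemon=True)
    stopper.start()
    stopper.join(timeout=timeout_sec)
    return not stopper.is_alive()  # True if stop() actually returned in time
{code}

If this returns False the query may still be shutting down in the background; the 
control-flow code then gets to decide how to react, which is the behaviour this 
ticket asks the engine to support directly.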



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30124) unnecessary persist in PythonMLLibAPI.scala

2019-12-05 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-30124.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26758
[https://github.com/apache/spark/pull/26758]

> unnecessary persist in PythonMLLibAPI.scala
> ---
>
> Key: SPARK-30124
> URL: https://issues.apache.org/jira/browse/SPARK-30124
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 3.0.0
>Reporter: Aman Omer
>Assignee: Aman Omer
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30124) unnecessary persist in PythonMLLibAPI.scala

2019-12-05 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-30124:


Assignee: Aman Omer

> unnecessary persist in PythonMLLibAPI.scala
> ---
>
> Key: SPARK-30124
> URL: https://issues.apache.org/jira/browse/SPARK-30124
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 3.0.0
>Reporter: Aman Omer
>Assignee: Aman Omer
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30121) Fix memory usage in sbt build script

2019-12-05 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-30121.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26757
[https://github.com/apache/spark/pull/26757]

> Fix memory usage in sbt build script
> 
>
> Key: SPARK-30121
> URL: https://issues.apache.org/jira/browse/SPARK-30121
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Trivial
> Fix For: 3.0.0
>
>
> 1. the default memory setting is missing in usage instructions 
> {code:java}
> ```
> build/sbt -h
> ```
> ```
> -mem  set memory options (default: , which is -Xms2048m 
> -Xmx2048m -XX:ReservedCodeCacheSize=256m)
> ```
> {code}
> 2. the Perm space is not needed anymore, since java7 is removed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30121) Fix memory usage in sbt build script

2019-12-05 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-30121:


Assignee: Kent Yao

> Fix memory usage in sbt build script
> 
>
> Key: SPARK-30121
> URL: https://issues.apache.org/jira/browse/SPARK-30121
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Trivial
>
> 1. the default memory setting is missing in usage instructions 
> {code:java}
> ```
> build/sbt -h
> ```
> ```
> -mem  set memory options (default: , which is -Xms2048m 
> -Xmx2048m -XX:ReservedCodeCacheSize=256m)
> ```
> {code}
> 2. the Perm space is not needed anymore, since java7 is removed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30121) Fix memory usage in sbt build script

2019-12-05 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-30121:
-
Priority: Trivial  (was: Minor)

> Fix memory usage in sbt build script
> 
>
> Key: SPARK-30121
> URL: https://issues.apache.org/jira/browse/SPARK-30121
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Kent Yao
>Priority: Trivial
>
> 1. the default memory setting is missing in usage instructions 
> {code:java}
> ```
> build/sbt -h
> ```
> ```
> -mem  set memory options (default: , which is -Xms2048m 
> -Xmx2048m -XX:ReservedCodeCacheSize=256m)
> ```
> {code}
> 2. the Perm space is not needed anymore, since java7 is removed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30142) Upgrade Maven to 3.6.3

2019-12-05 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-30142:
-

 Summary: Upgrade Maven to 3.6.3
 Key: SPARK-30142
 URL: https://issues.apache.org/jira/browse/SPARK-30142
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun


This issue aims to upgrade to Maven 3.6.3.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28961) Upgrade Maven to 3.6.2

2019-12-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-28961.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

This is fixed by https://github.com/apache/spark/pull/25665

> Upgrade Maven to 3.6.2
> --
>
> Key: SPARK-28961
> URL: https://issues.apache.org/jira/browse/SPARK-28961
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
> Fix For: 3.0.0
>
>
> It looks like maven 3.6.1 is missing from the apache maven repo:
> [http://apache.claz.org/maven/maven-3/]
> This is causing PR build failures:
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/110045/console]
>  
>  exec: curl -s -L 
>  
> [https://www.apache.org/dyn/closer.lua?action=download=/maven/maven-3/3.6.1/binaries/apache-maven-3.6.1-bin.tar.gz]
>  gzip: stdin: not in gzip format
>  tar: Child returned status 1
>  tar: Error is not recoverable: exiting now
>  Using `mvn` from path: 
> /home/jenkins/workspace/SparkPullRequestBuilder/build/apache-maven-3.6.1/bin/mvn
>  build/mvn: line 163: 
> /home/jenkins/workspace/SparkPullRequestBuilder/build/apache-maven-3.6.1/bin/mvn:
>  No such file or directory
>  Error while getting version string from Maven:



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30129) New auth engine does not keep client ID in TransportClient after auth

2019-12-05 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16988982#comment-16988982
 ] 

Dongjoon Hyun commented on SPARK-30129:
---

This is backported to branch-2.4 via https://github.com/apache/spark/pull/26764

> New auth engine does not keep client ID in TransportClient after auth
> -
>
> Key: SPARK-30129
> URL: https://issues.apache.org/jira/browse/SPARK-30129
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Marcelo Masiero Vanzin
>Assignee: Marcelo Masiero Vanzin
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> Found a little bug when working on a feature; when auth is on, it's expected 
> that the {{TransportClient}} provides the authenticated ID of the client 
> (generally the app ID), but the new auth engine is not setting that 
> information.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30129) New auth engine does not keep client ID in TransportClient after auth

2019-12-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30129:
--
Fix Version/s: 2.4.5

> New auth engine does not keep client ID in TransportClient after auth
> -
>
> Key: SPARK-30129
> URL: https://issues.apache.org/jira/browse/SPARK-30129
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Marcelo Masiero Vanzin
>Assignee: Marcelo Masiero Vanzin
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> Found a little bug when working on a feature; when auth is on, it's expected 
> that the {{TransportClient}} provides the authenticated ID of the client 
> (generally the app ID), but the new auth engine is not setting that 
> information.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30099) Improve Analyzed Logical Plan as duplicate AnalysisExceptions are coming

2019-12-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-30099:
---

Assignee: Aman Omer  (was: jobit mathew)

> Improve Analyzed Logical Plan as duplicate AnalysisExceptions are coming
> 
>
> Key: SPARK-30099
> URL: https://issues.apache.org/jira/browse/SPARK-30099
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: jobit mathew
>Assignee: Aman Omer
>Priority: Minor
> Fix For: 3.0.0
>
>
> Spark SQL 
>  explain extended select * from a non-existent table shows duplicate 
> AnalysisExceptions.
> {code:java}
>  spark-sql>explain extended select * from wrong
> == Parsed Logical Plan ==
>  'Project [*]
>  +- 'UnresolvedRelation `wrong`
> == Analyzed Logical Plan ==
>  org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line 
> 1 p
>  os 31
>  org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line 
> 1 p
>  os 31
>  == Optimized Logical Plan ==
>  org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line 
> 1 p
>  os 31
>  == Physical Plan ==
>  org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line 
> 1 p
>  os 31
>  Time taken: 6.0 seconds, Fetched 1 row(s)
>  19/12/02 14:33:32 INFO SparkSQLCLIDriver: Time taken: 6.0 seconds, Fetched 1 
> row
>  (s)
>  spark-sql>
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30099) Improve Analyzed Logical Plan as duplicate AnalysisExceptions are coming

2019-12-05 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16988840#comment-16988840
 ] 

Wenchen Fan commented on SPARK-30099:
-

ah sorry I made a mistake. Fixed now.

> Improve Analyzed Logical Plan as duplicate AnalysisExceptions are coming
> 
>
> Key: SPARK-30099
> URL: https://issues.apache.org/jira/browse/SPARK-30099
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: jobit mathew
>Assignee: Aman Omer
>Priority: Minor
> Fix For: 3.0.0
>
>
> Spark SQL 
>  explain extended select * from a non-existent table shows duplicate 
> AnalysisExceptions.
> {code:java}
>  spark-sql>explain extended select * from wrong
> == Parsed Logical Plan ==
>  'Project [*]
>  +- 'UnresolvedRelation `wrong`
> == Analyzed Logical Plan ==
>  org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line 
> 1 p
>  os 31
>  org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line 
> 1 p
>  os 31
>  == Optimized Logical Plan ==
>  org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line 
> 1 p
>  os 31
>  == Physical Plan ==
>  org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line 
> 1 p
>  os 31
>  Time taken: 6.0 seconds, Fetched 1 row(s)
>  19/12/02 14:33:32 INFO SparkSQLCLIDriver: Time taken: 6.0 seconds, Fetched 1 
> row
>  (s)
>  spark-sql>
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30141) Fix QueryTest.checkAnswer usage

2019-12-05 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-30141:
---

 Summary: Fix QueryTest.checkAnswer usage
 Key: SPARK-30141
 URL: https://issues.apache.org/jira/browse/SPARK-30141
 Project: Spark
  Issue Type: Bug
  Components: SQL, Tests
Affects Versions: 3.0.0
Reporter: Yuming Wang






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30140) Code comment error

2019-12-05 Thread wuv1up (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuv1up updated SPARK-30140:
---
Attachment: (was: 1575546368156.jpg)

>  Code comment error
> ---
>
> Key: SPARK-30140
> URL: https://issues.apache.org/jira/browse/SPARK-30140
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: wuv1up
>Priority: Trivial
>
> ignore...



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30140) Code comment error

2019-12-05 Thread wuv1up (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuv1up updated SPARK-30140:
---
Description: ignore...  (was: !image-2019-12-05-19-44-08-141.png!

I think the red box is written as transitivity.)

>  Code comment error
> ---
>
> Key: SPARK-30140
> URL: https://issues.apache.org/jira/browse/SPARK-30140
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: wuv1up
>Priority: Trivial
>
> ignore...



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-30140) Code comment error

2019-12-05 Thread wuv1up (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuv1up updated SPARK-30140:
---
Comment: was deleted

(was: The picture seems to be hanging.

The error in clean method of ClosureCleaner.scala.)

>  Code comment error
> ---
>
> Key: SPARK-30140
> URL: https://issues.apache.org/jira/browse/SPARK-30140
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: wuv1up
>Priority: Trivial
> Attachments: 1575546368156.jpg
>
>
> !image-2019-12-05-19-44-08-141.png!
> I think the red box is written as transitivity.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30140) Code comment error

2019-12-05 Thread wuv1up (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuv1up updated SPARK-30140:
---
Attachment: 1575546368156.jpg

>  Code comment error
> ---
>
> Key: SPARK-30140
> URL: https://issues.apache.org/jira/browse/SPARK-30140
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: wuv1up
>Priority: Trivial
> Attachments: 1575546368156.jpg
>
>
> !image-2019-12-05-19-44-08-141.png!
> I think the red box is written as transitivity.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30140) Code comment error

2019-12-05 Thread wuv1up (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16988711#comment-16988711
 ] 

wuv1up commented on SPARK-30140:


The picture seems to be hanging.

The error in clean method of ClosureCleaner.scala.

>  Code comment error
> ---
>
> Key: SPARK-30140
> URL: https://issues.apache.org/jira/browse/SPARK-30140
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: wuv1up
>Priority: Trivial
>
> !image-2019-12-05-19-44-08-141.png!
> I think the red box is written as transitivity.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30140) Code comment error

2019-12-05 Thread wuv1up (Jira)
wuv1up created SPARK-30140:
--

 Summary:  Code comment error
 Key: SPARK-30140
 URL: https://issues.apache.org/jira/browse/SPARK-30140
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.4
Reporter: wuv1up


!image-2019-12-05-19-44-08-141.png!

I think the red box is written as transitivity.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30139) get_json_object does not work correctly

2019-12-05 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16988693#comment-16988693
 ] 

Rakesh Raushan commented on SPARK-30139:


I will look into this issue.

> get_json_object does not work correctly
> ---
>
> Key: SPARK-30139
> URL: https://issues.apache.org/jira/browse/SPARK-30139
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Clemens Valiente
>Priority: Major
>
> according to documentation:
> [https://spark.apache.org/docs/2.4.4/api/java/org/apache/spark/sql/functions.html#get_json_object-org.apache.spark.sql.Column-java.lang.String-]
> get_json_object "Extracts json object from a json string based on json path 
> specified, and returns json string of the extracted json object. It will 
> return null if the input json string is invalid."
>  
> the following SQL snippet returns null even though it should return 'a'
> {code}
> select get_json_object([{"id":123,"value":"a"},\{"id":456,"value":"b"}], 
> $[?($.id==123)].value){code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30101) spark.sql.shuffle.partitions is not in Configuration docs, but a very critical parameter

2019-12-05 Thread sam (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16988676#comment-16988676
 ] 

sam commented on SPARK-30101:
-

[~kabhwan] [~cloud_fan] [~sowen]

> We may deal with it if we strongly agree about the need to prioritize this.

Oh great thanks.

I think part of the problem here is that Google SEO is broken because its 
algorithm has been trained on RDD-era content. Googling how to set parallelism 
always gives `spark.default.parallelism`. Even if you Google "set default 
parallelism dataset spark" it still doesn't take you to 
http://spark.apache.org/docs/latest/sql-performance-tuning.html

I think setting parallelism is indeed one of the most important things you 
would ever need to do in Spark, so yes, making it easier to find this would be 
super helpful to the community.

> spark.sql.shuffle.partitions is not in Configuration docs, but a very 
> critical parameter
> 
>
> Key: SPARK-30101
> URL: https://issues.apache.org/jira/browse/SPARK-30101
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 2.4.4
>Reporter: sam
>Priority: Major
>
> I'm creating a `SparkSession` like this:
> ```
> SparkSession
>   .builder().appName("foo").master("local")
>   .config("spark.default.parallelism", 2).getOrCreate()
> ```
> when I run
> ```
> ((1 to 10) ++ (1 to 10)).toDS().distinct().count()
> ```
> I get 200 partitions
> ```
> 19/12/02 10:29:34 INFO TaskSchedulerImpl: Adding task set 1.0 with 200 tasks
> ...
> 19/12/02 10:29:34 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) 
> in 46 ms on localhost (executor driver) (1/200)
> ```
> It is the `distinct` that is broken since `ds.rdd.getNumPartitions` gives 
> `2`, while `ds.distinct().rdd.getNumPartitions` gives `200`.  
> `ds.rdd.groupBy(identity).map(_._2.head)` and `ds.rdd.distinct()` work 
> correctly.
> Finally I notice that the good old `RDD` interface has a `distinct` that 
> accepts `numPartitions` partitions, while `Dataset` does not.
> ...
> According to below comments, it uses spark.sql.shuffle.partitions, which 
> needs documenting in configuration.
> > Default number of partitions in RDDs returned by transformations like join, 
> > reduceByKey, and parallelize when not set by user.
> in https://spark.apache.org/docs/latest/configuration.html should say
> > Default number of partitions in RDDs, but not DS/DF (see 
> > spark.sql.shuffle.partitions) returned by transformations like join, 
> > reduceByKey, and parallelize when not set by user.
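
A minimal PySpark sketch of the distinction (local session, toy data) between the 
two settings:

{code:python}
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")
         .appName("shuffle-partitions-demo")
         .config("spark.default.parallelism", 2)     # governs RDD operations
         .config("spark.sql.shuffle.partitions", 2)  # governs DataFrame/Dataset shuffles
         .getOrCreate())

df = spark.createDataFrame([(i,) for i in list(range(1, 11)) * 2], ["id"])

print(df.rdd.getNumPartitions())             # small, driven by the input
print(df.distinct().rdd.getNumPartitions())  # 2 here; 200 if the SQL conf is left at its default
{code}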



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30139) get_json_object does not work correctly

2019-12-05 Thread Clemens Valiente (Jira)
Clemens Valiente created SPARK-30139:


 Summary: get_json_object does not work correctly
 Key: SPARK-30139
 URL: https://issues.apache.org/jira/browse/SPARK-30139
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.4
Reporter: Clemens Valiente


according to documentation:

[https://spark.apache.org/docs/2.4.4/api/java/org/apache/spark/sql/functions.html#get_json_object-org.apache.spark.sql.Column-java.lang.String-]

get_json_object "Extracts json object from a json string based on json path 
specified, and returns json string of the extracted json object. It will return 
null if the input json string is invalid."

 

the following SQL snippet returns null even though it should return 'a'

{code}
select get_json_object([{"id":123,"value":"a"},\{"id":456,"value":"b"}], 
$[?($.id==123)].value){code}
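
For comparison, a small PySpark sketch (with the string and path quoted as SQL 
requires) showing that a plain index path is extracted fine, while the JSONPath 
filter form above is the part that comes back null; this suggests the filter 
predicate syntax, rather than invalid JSON, is what is not being handled:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("get-json-object-check").getOrCreate()

js = '[{"id":123,"value":"a"},{"id":456,"value":"b"}]'

spark.sql(
    f"""SELECT get_json_object('{js}', '$[0].value')            AS by_index,
               get_json_object('{js}', '$[?($.id==123)].value') AS by_filter"""
).show()
# Expected: by_index = a, by_filter = null
{code}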

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30099) Improve Analyzed Logical Plan as duplicate AnalysisExceptions are coming

2019-12-05 Thread Aman Omer (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16988607#comment-16988607
 ] 

Aman Omer commented on SPARK-30099:
---

[~cloud_fan] can you assign this jira ticket to me?
id: aman_omer

> Improve Analyzed Logical Plan as duplicate AnalysisExceptions are coming
> 
>
> Key: SPARK-30099
> URL: https://issues.apache.org/jira/browse/SPARK-30099
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: jobit mathew
>Assignee: jobit mathew
>Priority: Minor
> Fix For: 3.0.0
>
>
> Spark SQL 
>  explain extended select * from a non-existent table shows duplicate 
> AnalysisExceptions.
> {code:java}
>  spark-sql>explain extended select * from wrong
> == Parsed Logical Plan ==
>  'Project [*]
>  +- 'UnresolvedRelation `wrong`
> == Analyzed Logical Plan ==
>  org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line 
> 1 p
>  os 31
>  org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line 
> 1 p
>  os 31
>  == Optimized Logical Plan ==
>  org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line 
> 1 p
>  os 31
>  == Physical Plan ==
>  org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line 
> 1 p
>  os 31
>  Time taken: 6.0 seconds, Fetched 1 row(s)
>  19/12/02 14:33:32 INFO SparkSQLCLIDriver: Time taken: 6.0 seconds, Fetched 1 
> row
>  (s)
>  spark-sql>
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30138) Separate configuration key of max iterations for analyzer and optimizer

2019-12-05 Thread Hu Fuwang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hu Fuwang updated SPARK-30138:
--
Description: 
Currently, both the Analyzer and the Optimizer use the conf 
"spark.sql.optimizer.maxIterations" to set the max number of iterations to run, 
which is a little confusing.

It is clearer to add a new conf "spark.sql.analyzer.maxIterations" for the 
analyzer's max iterations.

> Separate configuration key of max iterations for analyzer and optimizer
> ---
>
> Key: SPARK-30138
> URL: https://issues.apache.org/jira/browse/SPARK-30138
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Hu Fuwang
>Priority: Major
>
> Currently, both the Analyzer and the Optimizer use the conf 
> "spark.sql.optimizer.maxIterations" to set the max number of iterations to 
> run, which is a little confusing.
> It is clearer to add a new conf "spark.sql.analyzer.maxIterations" for the 
> analyzer's max iterations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30138) Separate configuration key of max iterations for analyzer and optimizer

2019-12-05 Thread Hu Fuwang (Jira)
Hu Fuwang created SPARK-30138:
-

 Summary: Separate configuration key of max iterations for analyzer 
and optimizer
 Key: SPARK-30138
 URL: https://issues.apache.org/jira/browse/SPARK-30138
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Hu Fuwang






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29425) Alter database statement erases hive database's ownership

2019-12-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-29425.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26080
[https://github.com/apache/spark/pull/26080]

> Alter database statement erases hive database's ownership
> -
>
> Key: SPARK-29425
> URL: https://issues.apache.org/jira/browse/SPARK-29425
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.4
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.0.0
>
>
> Commands like `ALTER DATABASE kyuubi SET DBPROPERTIES ('in'='out')` will 
> erase a hive database's owner



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29425) Alter database statement erases hive database's ownership

2019-12-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-29425:
---

Assignee: Kent Yao

> Alter database statement erases hive database's ownership
> -
>
> Key: SPARK-29425
> URL: https://issues.apache.org/jira/browse/SPARK-29425
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.4
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>
> Commands like `ALTER DATABASE kyuubi SET DBPROPERTIES ('in'='out')` will 
> erase a hive database's owner



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29860) [SQL] Fix data type mismatch issue for inSubQuery

2019-12-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-29860.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26485
[https://github.com/apache/spark/pull/26485]

> [SQL] Fix data type mismatch issue for inSubQuery
> -
>
> Key: SPARK-29860
> URL: https://issues.apache.org/jira/browse/SPARK-29860
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: feiwang
>Assignee: feiwang
>Priority: Major
> Fix For: 3.0.0
>
>
> The following statement throws an exception.
> {code:java}
>   sql("create table ta(id Decimal(18,0)) using parquet")
>   sql("create table tb(id Decimal(19,0)) using parquet")
>   sql("select * from ta where id in (select id from tb)").shown()
> {code}
> {code:java}
> // Exception information
> cannot resolve '(default.ta.`id` IN (listquery()))' due to data type 
> mismatch: 
> The data type of one or more elements in the left hand side of an IN subquery
> is not compatible with the data type of the output of the subquery
> Mismatched columns:
> [(default.ta.`id`:decimal(18,0), default.tb.`id`:decimal(19,0))]
> Left side:
> [decimal(18,0)].
> Right side:
> [decimal(19,0)].;;
> 'Project [*]
> +- 'Filter id#219 IN (list#218 [])
>:  +- Project [id#220]
>: +- SubqueryAlias `default`.`tb`
>:+- Relation[id#220] parquet
>+- SubqueryAlias `default`.`ta`
>   +- Relation[id#219] parquet
> {code}
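
As a stop-gap on 2.4.x (before the fix above), one way to sidestep the mismatch is 
to cast the narrower side so both columns share one decimal type; a PySpark sketch 
(local session, empty tables, assumes a writable spark-warehouse directory):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("in-subquery-decimal").getOrCreate()

spark.sql("create table ta(id decimal(18,0)) using parquet")
spark.sql("create table tb(id decimal(19,0)) using parquet")

# Casting ta.id to decimal(19,0) makes both sides of the IN subquery the same type,
# so the analyzer no longer rejects the query.
spark.sql(
    "select * from ta where cast(id as decimal(19,0)) in (select id from tb)"
).show()
{code}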



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29860) [SQL] Fix data type mismatch issue for inSubQuery

2019-12-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-29860:
---

Assignee: feiwang

> [SQL] Fix data type mismatch issue for inSubQuery
> -
>
> Key: SPARK-29860
> URL: https://issues.apache.org/jira/browse/SPARK-29860
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: feiwang
>Assignee: feiwang
>Priority: Major
>
> The following statement throws an exception.
> {code:java}
>   sql("create table ta(id Decimal(18,0)) using parquet")
>   sql("create table tb(id Decimal(19,0)) using parquet")
>   sql("select * from ta where id in (select id from tb)").shown()
> {code}
> {code:java}
> // Exception information
> cannot resolve '(default.ta.`id` IN (listquery()))' due to data type 
> mismatch: 
> The data type of one or more elements in the left hand side of an IN subquery
> is not compatible with the data type of the output of the subquery
> Mismatched columns:
> [(default.ta.`id`:decimal(18,0), default.tb.`id`:decimal(19,0))]
> Left side:
> [decimal(18,0)].
> Right side:
> [decimal(19,0)].;;
> 'Project [*]
> +- 'Filter id#219 IN (list#218 [])
>:  +- Project [id#220]
>: +- SubqueryAlias `default`.`tb`
>:+- Relation[id#220] parquet
>+- SubqueryAlias `default`.`ta`
>   +- Relation[id#219] parquet
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org