[jira] [Commented] (SPARK-26091) Upgrade to 2.3.4 for Hive Metastore Client 2.3
[ https://issues.apache.org/jira/browse/SPARK-26091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989490#comment-16989490 ] Dongjoon Hyun commented on SPARK-26091: --- Apache Spark 2.4.4 is able to talk with your HiveMetastore 2.3.4 if you use 2.3.3 instead of 2.3.4. > Upgrade to 2.3.4 for Hive Metastore Client 2.3 > -- > > Key: SPARK-26091 > URL: https://issues.apache.org/jira/browse/SPARK-26091 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
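The workaround above amounts to pinning Spark's built-in Hive metastore client version. A minimal spark-shell sketch of that kind of setup follows; the application name and the jar-resolution mode are illustrative assumptions, not details taken from this thread.

{code}
// Sketch: have Spark 2.4.4 talk to a Hive Metastore 2.3.4 by requesting the
// 2.3.3 metastore client, as suggested in the comment above.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hms-2.3-client-sketch")                      // illustrative name
  .config("spark.sql.hive.metastore.version", "2.3.3")   // metastore client version to use
  .config("spark.sql.hive.metastore.jars", "maven")      // or a local classpath
  .enableHiveSupport()
  .getOrCreate()

spark.sql("SHOW DATABASES").show()
{code}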
[jira] [Commented] (SPARK-29966) Add version method in TableCatalog to avoid load table twice
[ https://issues.apache.org/jira/browse/SPARK-29966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989486#comment-16989486 ] Wenchen Fan commented on SPARK-29966: - This should be fixed by https://github.com/apache/spark/pull/26684 > Add version method in TableCatalog to avoid load table twice > > > Key: SPARK-29966 > URL: https://issues.apache.org/jira/browse/SPARK-29966 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: ulysses you >Priority: Minor > > Now resolve logic plan will load table twice which are in ResolveTables and > ResolveRelations. The ResolveRelations is old code path, and ResolveTables is > v2 code path, and the reason why load table twice is that ResolveTables will > load table and rollback v1 table to ResolveRelations code path. > The same scene also exists in ResolveSessionCatalog. > It affect that execute command will cost double time than spark 2.4. > Here is the idea that add a table version method in TableCatalog, and rules > should always get table version firstly without load table. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29966) Add version method in TableCatalog to avoid load table twice
[ https://issues.apache.org/jira/browse/SPARK-29966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29966. - Fix Version/s: 3.0.0 Assignee: Terry Kim Resolution: Fixed > Add version method in TableCatalog to avoid load table twice > > > Key: SPARK-29966 > URL: https://issues.apache.org/jira/browse/SPARK-29966 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: ulysses you >Assignee: Terry Kim >Priority: Minor > Fix For: 3.0.0 > > > Now resolve logic plan will load table twice which are in ResolveTables and > ResolveRelations. The ResolveRelations is old code path, and ResolveTables is > v2 code path, and the reason why load table twice is that ResolveTables will > load table and rollback v1 table to ResolveRelations code path. > The same scene also exists in ResolveSessionCatalog. > It affect that execute command will cost double time than spark 2.4. > Here is the idea that add a table version method in TableCatalog, and rules > should always get table version firstly without load table. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
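The version-method idea described above was ultimately superseded by the fix in pull request 26684, but as a rough illustration of what the description proposes, a hypothetical catalog extension could look like the following; the trait and method names are invented for illustration and are not the API that was merged.

{code}
// Hypothetical sketch of the idea in the description: let analyzer rules ask
// for a cheap table "version" token instead of calling loadTable twice.
// Not the actual merged API.
import org.apache.spark.sql.connector.catalog.{Identifier, TableCatalog}

trait TableCatalogWithVersion extends TableCatalog {
  // Returns an opaque version token for the table without loading it.
  def tableVersion(ident: Identifier): String
}
{code}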
[jira] [Resolved] (SPARK-30001) can't lookup v1 tables whose names specify the session catalog
[ https://issues.apache.org/jira/browse/SPARK-30001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-30001. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26684 [https://github.com/apache/spark/pull/26684] > can't lookup v1 tables whose names specify the session catalog > -- > > Key: SPARK-30001 > URL: https://issues.apache.org/jira/browse/SPARK-30001 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > > A simple way to reproduce it > {code} > scala> sql("create table t using hive as select 1 as i") > res2: org.apache.spark.sql.DataFrame = [] > scala> sql("select * from t").show > +---+ > | i| > +---+ > | 1| > +---+ > scala> sql("select * from spark_catalog.t").show > org.apache.spark.sql.AnalysisException: Table or view not found: > spark_catalog.t; line 1 pos 14; > 'Project [*] > +- 'UnresolvedRelation [spark_catalog, t] > {code} > The reason is that, we first go into `ResolveTables`, which lookups the table > successfully, but then give up because it's a v1 table. Next we go into > `ResolveRelations`, which do not recognize catalog name at all. > Similar to https://issues.apache.org/jira/browse/SPARK-29966 , we should make > `ResolveRelations` responsible for lookup both v1 and v2 tables from the > session catalog, and correctly recognize catalog name. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30001) can't lookup v1 tables whose names specify the session catalog
[ https://issues.apache.org/jira/browse/SPARK-30001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-30001: --- Assignee: Terry Kim > can't lookup v1 tables whose names specify the session catalog > -- > > Key: SPARK-30001 > URL: https://issues.apache.org/jira/browse/SPARK-30001 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Terry Kim >Priority: Major > Fix For: 3.0.0 > > > A simple way to reproduce it > {code} > scala> sql("create table t using hive as select 1 as i") > res2: org.apache.spark.sql.DataFrame = [] > scala> sql("select * from t").show > +---+ > | i| > +---+ > | 1| > +---+ > scala> sql("select * from spark_catalog.t").show > org.apache.spark.sql.AnalysisException: Table or view not found: > spark_catalog.t; line 1 pos 14; > 'Project [*] > +- 'UnresolvedRelation [spark_catalog, t] > {code} > The reason is that, we first go into `ResolveTables`, which lookups the table > successfully, but then give up because it's a v1 table. Next we go into > `ResolveRelations`, which do not recognize catalog name at all. > Similar to https://issues.apache.org/jira/browse/SPARK-29966 , we should make > `ResolveRelations` responsible for lookup both v1 and v2 tables from the > session catalog, and correctly recognize catalog name. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29966) avoid load table twice
[ https://issues.apache.org/jira/browse/SPARK-29966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-29966: Summary: avoid load table twice (was: Add version method in TableCatalog to avoid load table twice) > avoid load table twice > -- > > Key: SPARK-29966 > URL: https://issues.apache.org/jira/browse/SPARK-29966 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: ulysses you >Assignee: Terry Kim >Priority: Minor > Fix For: 3.0.0 > > > Now resolve logic plan will load table twice which are in ResolveTables and > ResolveRelations. The ResolveRelations is old code path, and ResolveTables is > v2 code path, and the reason why load table twice is that ResolveTables will > load table and rollback v1 table to ResolveRelations code path. > The same scene also exists in ResolveSessionCatalog. > It affect that execute command will cost double time than spark 2.4. > Here is the idea that add a table version method in TableCatalog, and rules > should always get table version firstly without load table. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30067) A bug in getBlockHosts
[ https://issues.apache.org/jira/browse/SPARK-30067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-30067. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26650 [https://github.com/apache/spark/pull/26650] > A bug in getBlockHosts > -- > > Key: SPARK-30067 > URL: https://issues.apache.org/jira/browse/SPARK-30067 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: madianjun >Assignee: madianjun >Priority: Minor > Fix For: 3.0.0 > > > There is a bug in the getBlockHosts() function. In the case "The fragment > ends at a position within this block", the end of fragment should be before > the end of block,where the "end of block" means {{b.getOffset + > b.getLength}},not {{b.getLength}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
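The condition being corrected compares a file fragment against an HDFS block's absolute boundaries. Below is a simplified sketch of the comparison as described in the issue; the method and variable names are illustrative, not copied from the Spark source.

{code}
// A fragment [offset, offset + length) "ends within" a block only if its end
// falls at or before the block's absolute end, i.e. blockOffset + blockLength,
// not blockLength alone.
def fragmentEndsWithinBlock(
    fragmentOffset: Long,
    fragmentLength: Long,
    blockOffset: Long,
    blockLength: Long): Boolean = {
  val fragmentEnd = fragmentOffset + fragmentLength
  val blockEnd = blockOffset + blockLength   // absolute end, not just the length
  fragmentEnd > blockOffset && fragmentEnd <= blockEnd
}
{code}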
[jira] [Updated] (SPARK-30067) Fix fragment offset comparison in getBlockHosts
[ https://issues.apache.org/jira/browse/SPARK-30067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30067: -- Summary: Fix fragment offset comparison in getBlockHosts (was: A bug in getBlockHosts) > Fix fragment offset comparison in getBlockHosts > --- > > Key: SPARK-30067 > URL: https://issues.apache.org/jira/browse/SPARK-30067 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: madianjun >Assignee: madianjun >Priority: Minor > Fix For: 3.0.0 > > > There is a bug in the getBlockHosts() function. In the case "The fragment > ends at a position within this block", the end of fragment should be before > the end of block,where the "end of block" means {{b.getOffset + > b.getLength}},not {{b.getLength}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30067) A bug in getBlockHosts
[ https://issues.apache.org/jira/browse/SPARK-30067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-30067: - Assignee: madianjun > A bug in getBlockHosts > -- > > Key: SPARK-30067 > URL: https://issues.apache.org/jira/browse/SPARK-30067 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: madianjun >Assignee: madianjun >Priority: Minor > > There is a bug in the getBlockHosts() function. In the case "The fragment > ends at a position within this block", the end of fragment should be before > the end of block,where the "end of block" means {{b.getOffset + > b.getLength}},not {{b.getLength}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23534) Spark run on Hadoop 3.0.0
[ https://issues.apache.org/jira/browse/SPARK-23534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-23534. --- Fix Version/s: 3.0.0 Resolution: Done > Spark run on Hadoop 3.0.0 > - > > Key: SPARK-23534 > URL: https://issues.apache.org/jira/browse/SPARK-23534 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.3.0 >Reporter: Saisai Shao >Priority: Major > Fix For: 3.0.0 > > > Major Hadoop vendors already/will step in Hadoop 3.0. So we should also make > sure Spark can run with Hadoop 3.0. This Jira tracks the work to make Spark > run on Hadoop 3.0. > The work includes: > # Add a Hadoop 3.0.0 new profile to make Spark build-able with Hadoop 3.0. > # Test to see if there's dependency issues with Hadoop 3.0. > # Investigating the feasibility to use shaded client jars (HADOOP-11804). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-24590) Make Jenkins tests passed with hadoop 3 profile
[ https://issues.apache.org/jira/browse/SPARK-24590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-24590. - > Make Jenkins tests passed with hadoop 3 profile > --- > > Key: SPARK-24590 > URL: https://issues.apache.org/jira/browse/SPARK-24590 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 2.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > Currently, some tests are being failed with hadoop-3 profile. > Given PR builder > (https://github.com/apache/spark/pull/21441#issuecomment-397818337), it > reported: > {code} > org.apache.spark.sql.hive.HiveSparkSubmitSuite.SPARK-8020: set sql conf in > spark conf > org.apache.spark.sql.hive.HiveSparkSubmitSuite.SPARK-9757 Persist Parquet > relation with decimal column > org.apache.spark.sql.hive.HiveSparkSubmitSuite.ConnectionURL > org.apache.spark.sql.hive.StatisticsSuite.SPARK-22745 - read Hive's > statistics for partition > org.apache.spark.sql.hive.StatisticsSuite.alter table rename after analyze > table > org.apache.spark.sql.hive.StatisticsSuite.alter table SET TBLPROPERTIES after > analyze table > org.apache.spark.sql.hive.StatisticsSuite.alter table UNSET TBLPROPERTIES > after analyze table > org.apache.spark.sql.hive.client.HiveClientSuites.(It is not a test it is a > sbt.testing.SuiteSelector) > org.apache.spark.sql.hive.client.VersionsSuite.success sanity check > org.apache.spark.sql.hive.client.VersionsSuite.hadoop configuration preserved > 75 ms > org.apache.spark.sql.hive.client.VersionsSuite.*: * (roughly) > org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite.basic DDL using > locale tr - caseSensitive true > org.apache.spark.sql.hive.execution.HiveDDLSuite.create Hive-serde table and > view with unicode columns and comment > org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.SPARK-21617: ALTER > TABLE for non-compatible DataSource tables > org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.SPARK-21617: ALTER > TABLE for Hive-compatible DataSource tables > org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.SPARK-21617: ALTER > TABLE for Hive tables > org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.SPARK-21617: ALTER > TABLE with incompatible schema on Hive-compatible table > org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.(It is not a test it is > a sbt.testing.SuiteSelector) > org.apache.spark.sql.hive.execution.SQLQuerySuite.SPARK-18355 Read data from > a hive table with a new column - orc > org.apache.spark.sql.hive.execution.SQLQuerySuite.SPARK-18355 Read data from > a hive table with a new column - parquet > org.apache.spark.sql.hive.orc.HiveOrcSourceSuite.SPARK-19459/SPARK-18220: > read char/varchar column written by Hive > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24590) Make Jenkins tests passed with hadoop 3 profile
[ https://issues.apache.org/jira/browse/SPARK-24590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989472#comment-16989472 ] Dongjoon Hyun commented on SPARK-24590: --- Thanks. Yes. This is superseded by the other JIRA. > Make Jenkins tests passed with hadoop 3 profile > --- > > Key: SPARK-24590 > URL: https://issues.apache.org/jira/browse/SPARK-24590 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 2.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > Currently, some tests are being failed with hadoop-3 profile. > Given PR builder > (https://github.com/apache/spark/pull/21441#issuecomment-397818337), it > reported: > {code} > org.apache.spark.sql.hive.HiveSparkSubmitSuite.SPARK-8020: set sql conf in > spark conf > org.apache.spark.sql.hive.HiveSparkSubmitSuite.SPARK-9757 Persist Parquet > relation with decimal column > org.apache.spark.sql.hive.HiveSparkSubmitSuite.ConnectionURL > org.apache.spark.sql.hive.StatisticsSuite.SPARK-22745 - read Hive's > statistics for partition > org.apache.spark.sql.hive.StatisticsSuite.alter table rename after analyze > table > org.apache.spark.sql.hive.StatisticsSuite.alter table SET TBLPROPERTIES after > analyze table > org.apache.spark.sql.hive.StatisticsSuite.alter table UNSET TBLPROPERTIES > after analyze table > org.apache.spark.sql.hive.client.HiveClientSuites.(It is not a test it is a > sbt.testing.SuiteSelector) > org.apache.spark.sql.hive.client.VersionsSuite.success sanity check > org.apache.spark.sql.hive.client.VersionsSuite.hadoop configuration preserved > 75 ms > org.apache.spark.sql.hive.client.VersionsSuite.*: * (roughly) > org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite.basic DDL using > locale tr - caseSensitive true > org.apache.spark.sql.hive.execution.HiveDDLSuite.create Hive-serde table and > view with unicode columns and comment > org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.SPARK-21617: ALTER > TABLE for non-compatible DataSource tables > org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.SPARK-21617: ALTER > TABLE for Hive-compatible DataSource tables > org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.SPARK-21617: ALTER > TABLE for Hive tables > org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.SPARK-21617: ALTER > TABLE with incompatible schema on Hive-compatible table > org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.(It is not a test it is > a sbt.testing.SuiteSelector) > org.apache.spark.sql.hive.execution.SQLQuerySuite.SPARK-18355 Read data from > a hive table with a new column - orc > org.apache.spark.sql.hive.execution.SQLQuerySuite.SPARK-18355 Read data from > a hive table with a new column - parquet > org.apache.spark.sql.hive.orc.HiveOrcSourceSuite.SPARK-19459/SPARK-18220: > read char/varchar column written by Hive > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24590) Make Jenkins tests passed with hadoop 3 profile
[ https://issues.apache.org/jira/browse/SPARK-24590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-24590. --- Resolution: Duplicate > Make Jenkins tests passed with hadoop 3 profile > --- > > Key: SPARK-24590 > URL: https://issues.apache.org/jira/browse/SPARK-24590 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 2.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > Currently, some tests are being failed with hadoop-3 profile. > Given PR builder > (https://github.com/apache/spark/pull/21441#issuecomment-397818337), it > reported: > {code} > org.apache.spark.sql.hive.HiveSparkSubmitSuite.SPARK-8020: set sql conf in > spark conf > org.apache.spark.sql.hive.HiveSparkSubmitSuite.SPARK-9757 Persist Parquet > relation with decimal column > org.apache.spark.sql.hive.HiveSparkSubmitSuite.ConnectionURL > org.apache.spark.sql.hive.StatisticsSuite.SPARK-22745 - read Hive's > statistics for partition > org.apache.spark.sql.hive.StatisticsSuite.alter table rename after analyze > table > org.apache.spark.sql.hive.StatisticsSuite.alter table SET TBLPROPERTIES after > analyze table > org.apache.spark.sql.hive.StatisticsSuite.alter table UNSET TBLPROPERTIES > after analyze table > org.apache.spark.sql.hive.client.HiveClientSuites.(It is not a test it is a > sbt.testing.SuiteSelector) > org.apache.spark.sql.hive.client.VersionsSuite.success sanity check > org.apache.spark.sql.hive.client.VersionsSuite.hadoop configuration preserved > 75 ms > org.apache.spark.sql.hive.client.VersionsSuite.*: * (roughly) > org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite.basic DDL using > locale tr - caseSensitive true > org.apache.spark.sql.hive.execution.HiveDDLSuite.create Hive-serde table and > view with unicode columns and comment > org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.SPARK-21617: ALTER > TABLE for non-compatible DataSource tables > org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.SPARK-21617: ALTER > TABLE for Hive-compatible DataSource tables > org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.SPARK-21617: ALTER > TABLE for Hive tables > org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.SPARK-21617: ALTER > TABLE with incompatible schema on Hive-compatible table > org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.(It is not a test it is > a sbt.testing.SuiteSelector) > org.apache.spark.sql.hive.execution.SQLQuerySuite.SPARK-18355 Read data from > a hive table with a new column - orc > org.apache.spark.sql.hive.execution.SQLQuerySuite.SPARK-18355 Read data from > a hive table with a new column - parquet > org.apache.spark.sql.hive.orc.HiveOrcSourceSuite.SPARK-19459/SPARK-18220: > read char/varchar column written by Hive > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29957) Reset MiniKDC's default enctypes to fit jdk8/jdk11
[ https://issues.apache.org/jira/browse/SPARK-29957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29957: -- Description: Since MiniKdc version lower than hadoop-3.0 can't work well in jdk11. New encryption types of aes128-cts-hmac-sha256-128 and aes256-cts-hmac-sha384-192 (for Kerberos 5) enabled by default were added in Java 11, while version of MiniKdc under 3.0.0 used by Spark does not support these encryption types and does not work well when these encryption types are enabled, which results in the authentication failure. - Hadoop jira: https://issues.apache.org/jira/browse/HADOOP-12911 In this jira, the author said to replace the original Apache Directory project, which is not maintained (but did not say it won't work well in jdk11), with Apache Kerby, which is a Java binding (fitting the Java version). And in Flink: apache/flink#9622 The author shows the reason why hadoop-2.7.2's MiniKdc failed with jdk11. Because new encryption types of aes128-cts-hmac-sha256-128 and aes256-cts-hmac-sha384-192 (for Kerberos 5) enabled by default were added in Java 11. Spark with hadoop-2.7's MiniKdc does not support these encryption types and does not work well when these encryption types are enabled, which results in the authentication failure. And when I test hadoop-2.7.2's MiniKdc locally, the Kerberos debug error message is "read message stream failed, message can't match". was: Since MiniKdc version lower than hadoop-3.0 can't work well in jdk11. New encryption types of aes128-cts-hmac-sha256-128 and aes256-cts-hmac-sha384-192 (for Kerberos 5) enabled by default were added in Java 11, while version of MiniKdc under 3.0.0 used by Spark does not support these encryption types and does not work well when these encryption types are enabled, which results in the authentication failure. > Reset MiniKDC's default enctypes to fit jdk8/jdk11 > -- > > Key: SPARK-29957 > URL: https://issues.apache.org/jira/browse/SPARK-29957 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.0.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.0.0 > > > Since MiniKdc version lower than hadoop-3.0 can't work well in jdk11. > New encryption types of aes128-cts-hmac-sha256-128 and > aes256-cts-hmac-sha384-192 (for Kerberos 5) enabled by default were added in > Java 11, while version of MiniKdc under 3.0.0 used by Spark does not support > these encryption types and does not work well when these encryption types are > enabled, which results in the authentication failure. > - > Hadoop jira: https://issues.apache.org/jira/browse/HADOOP-12911 > In this jira, the author said to replace the original Apache Directory project > which is not maintained (but did not say it won't work well in jdk11) with Apache > Kerby which is a Java binding (fitting the Java version). > And in Flink: apache/flink#9622 > The author shows the reason why hadoop-2.7.2's MiniKdc failed with jdk11. > Because new encryption types of aes128-cts-hmac-sha256-128 and > aes256-cts-hmac-sha384-192 (for Kerberos 5) enabled by default were added in > Java 11. > Spark with hadoop-2.7's MiniKdc does not support these encryption types and > does not work well when these encryption types are enabled, which results in > the authentication failure. > And when I test hadoop-2.7.2's MiniKdc locally, the Kerberos debug error > message is "read message stream failed, message can't match".
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29957) Bump MiniKdc to 3.2.0
[ https://issues.apache.org/jira/browse/SPARK-29957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29957. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26594 [https://github.com/apache/spark/pull/26594] > Bump MiniKdc to 3.2.0 > - > > Key: SPARK-29957 > URL: https://issues.apache.org/jira/browse/SPARK-29957 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.0.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.0.0 > > > Since MiniKdc version lower than hadoop-3.0 can't work well in jdk11. > New encryption types of aes128-cts-hmac-sha256-128 and > aes256-cts-hmac-sha384-192 (for Kerberos 5) enabled by default were added in > Java 11, while version of MiniKdc under 3.0.0 used by Spark does not support > these encryption types and does not work well when these encryption types are > enabled, which results in the authentication failure. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29957) Bump MiniKdc to 3.2.0
[ https://issues.apache.org/jira/browse/SPARK-29957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-29957: - Assignee: angerszhu > Bump MiniKdc to 3.2.0 > - > > Key: SPARK-29957 > URL: https://issues.apache.org/jira/browse/SPARK-29957 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.0.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > > Since MiniKdc version lower than hadoop-3.0 can't work well in jdk11. > New encryption types of aes128-cts-hmac-sha256-128 and > aes256-cts-hmac-sha384-192 (for Kerberos 5) enabled by default were added in > Java 11, while version of MiniKdc under 3.0.0 used by Spark does not support > these encryption types and does not work well when these encryption types are > enabled, which results in the authentication failure. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29957) Reset MiniKDC's default enctypes to fit jdk8/jdk11
[ https://issues.apache.org/jira/browse/SPARK-29957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29957: -- Summary: Reset MiniKDC's default enctypes to fit jdk8/jdk11 (was: Bump MiniKdc to 3.2.0) > Reset MiniKDC's default enctypes to fit jdk8/jdk11 > -- > > Key: SPARK-29957 > URL: https://issues.apache.org/jira/browse/SPARK-29957 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.0.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.0.0 > > > Since MiniKdc version lower than hadoop-3.0 can't work well in jdk11. > New encryption types of aes128-cts-hmac-sha256-128 and > aes256-cts-hmac-sha384-192 (for Kerberos 5) enabled by default were added in > Java 11, while version of MiniKdc under 3.0.0 used by Spark does not support > these encryption types and does not work well when these encryption types are > enabled, which results in the authentication failure. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26091) Upgrade to 2.3.4 for Hive Metastore Client 2.3
[ https://issues.apache.org/jira/browse/SPARK-26091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989461#comment-16989461 ] t oo commented on SPARK-26091: -- I just want to access HiveMetastore 2.3.4. > Upgrade to 2.3.4 for Hive Metastore Client 2.3 > -- > > Key: SPARK-26091 > URL: https://issues.apache.org/jira/browse/SPARK-26091 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29774) Date and Timestamp type +/- null should be null as Postgres
[ https://issues.apache.org/jira/browse/SPARK-29774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29774. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26412 [https://github.com/apache/spark/pull/26412] > Date and Timestamp type +/- null should be null as Postgres > --- > > Key: SPARK-29774 > URL: https://issues.apache.org/jira/browse/SPARK-29774 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Minor > Fix For: 3.0.0 > > > {code:sql} > postgres=# select timestamp '1999-12-31' - null; > ?column? > -- > (1 row) > postgres=# select date '1999-12-31' - null; > ?column? > -- > (1 row) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
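After this change, the Spark behaviour is expected to mirror the Postgres sessions quoted in the description; a minimal spark-shell sketch, where the noted output is the expected result rather than a captured transcript:

{code}
// Adding or subtracting NULL to a date/timestamp literal should yield NULL
// rather than failing analysis.
spark.sql("SELECT timestamp '1999-12-31' - null").show()  // expected: one row containing null
spark.sql("SELECT date '1999-12-31' + null").show()       // expected: one row containing null
{code}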
[jira] [Assigned] (SPARK-29774) Date and Timestamp type +/- null should be null as Postgres
[ https://issues.apache.org/jira/browse/SPARK-29774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-29774: --- Assignee: Kent Yao > Date and Timestamp type +/- null should be null as Postgres > --- > > Key: SPARK-29774 > URL: https://issues.apache.org/jira/browse/SPARK-29774 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Minor > > {code:sql} > postgres=# select timestamp '1999-12-31' - null; > ?column? > -- > (1 row) > postgres=# select date '1999-12-31' - null; > ?column? > -- > (1 row) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30146) add setWeightCol to GBTs in PySpark
Huaxin Gao created SPARK-30146: -- Summary: add setWeightCol to GBTs in PySpark Key: SPARK-30146 URL: https://issues.apache.org/jira/browse/SPARK-30146 Project: Spark Issue Type: Improvement Components: ML, PySpark Affects Versions: 3.0.0 Reporter: Huaxin Gao add setWeightCol and setMinWeightFractionPerNode in Python side of GBTClassifier and GBTRegressor -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
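The setters named above are being mirrored from the Scala estimators into the Python API; a short Scala sketch of their intended use, where the column names and the training DataFrame are illustrative assumptions:

{code}
import org.apache.spark.ml.classification.GBTClassifier

// Scala-side setters that this ticket proposes to expose in PySpark.
// "features", "label" and "weight" are placeholder column names.
val gbt = new GBTClassifier()
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setWeightCol("weight")                  // per-instance weight column
  .setMinWeightFractionPerNode(0.05)
// val model = gbt.fit(trainingData)       // trainingData is assumed to exist
{code}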
[jira] [Resolved] (SPARK-29953) File stream source cleanup options may break a file sink output
[ https://issues.apache.org/jira/browse/SPARK-29953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-29953. -- Fix Version/s: 3.0.0 Assignee: Jungtaek Lim Resolution: Fixed > File stream source cleanup options may break a file sink output > --- > > Key: SPARK-29953 > URL: https://issues.apache.org/jira/browse/SPARK-29953 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Shixiong Zhu >Assignee: Jungtaek Lim >Priority: Major > Fix For: 3.0.0 > > > SPARK-20568 added options to file streaming source to clean up processed > files. However, when applying these options to a directory that was written > by a file streaming sink, it will make the directory not queryable any more > because we delete files from the directory but they are still tracked by file > sink logs. > I think we should block the options if the input source is a file streaming > sink path (has "_spark_metadata" folder). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
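The cleanup options referred to are the file streaming source options introduced by SPARK-20568. A hedged sketch of the combination the issue warns against, reading from a directory that a file sink produced; the schema and paths are placeholders:

{code}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// A file-sink output directory carries a _spark_metadata log of the files it
// wrote. Enabling source-side cleanup on such a directory deletes (or moves)
// files the sink log still references, breaking later reads of the sink output.
val schema = StructType(Seq(StructField("value", StringType)))

val stream = spark.readStream
  .format("parquet")
  .schema(schema)
  .option("cleanSource", "delete")          // or "archive" together with "sourceArchiveDir"
  .load("/warehouse/file-sink-output")      // placeholder: a directory written by a file sink
{code}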
[jira] [Commented] (SPARK-30145) sparkContext.addJar fails when file path contains spaces
[ https://issues.apache.org/jira/browse/SPARK-30145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989398#comment-16989398 ] Ankit Raj Boudh commented on SPARK-30145: - I will raise PR for this today > sparkContext.addJar fails when file path contains spaces > > > Key: SPARK-30145 > URL: https://issues.apache.org/jira/browse/SPARK-30145 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30145) sparkContext.addJar fails when file path contains spaces
ABHISHEK KUMAR GUPTA created SPARK-30145: Summary: sparkContext.addJar fails when file path contains spaces Key: SPARK-30145 URL: https://issues.apache.org/jira/browse/SPARK-30145 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0 Reporter: ABHISHEK KUMAR GUPTA -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
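The ticket has no description yet; a minimal hypothetical reproduction of the reported symptom, with an illustrative jar path, would be along these lines:

{code}
// Hypothetical reproduction: register a jar whose local path contains a space.
// Per the report, this fails on 3.0.0 while space-free paths work.
sc.addJar("/tmp/my test dir/example.jar")
{code}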
[jira] [Updated] (SPARK-29510) JobGroup ID is not set for the job submitted from Spark-SQL and Spark -Shell
[ https://issues.apache.org/jira/browse/SPARK-29510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ABHISHEK KUMAR GUPTA updated SPARK-29510: - Description: When user submit jobs from Spark-shell or SQL Job group id is not set. UI Screen shot attached. But from beeline job Group ID is set. Steps: {code:java} create table customer(id int, name String, CName String, address String, city String, pin int, country String); insert into customer values(1,'Alfred','Maria','Obere Str 57','Berlin',12209,'Germany'); insert into customer values(2,'Ana','trujilo','Adva de la','Maxico D.F.',05021,'Maxico'); insert into customer values(3,'Antonio','Antonio Moreno','Mataderos 2312','Maxico D.F.',05023,'Maxico'); SELECT A.CName AS CustomerName1, B.CName AS CustomerName2, A.City FROM customer A, customer B WHERE A.id <> B.id AND A.City = B.City ORDER BY A.City;{code} was: When user submit jobs from Spark-shell or SQL Job group id is not set. UI Screen shot attached. But from beeline job Group ID is set. Steps: {code:java} create table customer(id int, name String, CName String, address String, city String, pin int, country String); insert into customer values(1,'Alfred','Maria','Obere Str 57','Berlin',12209,'Germany'); insert into customer values(2,'Ana','trujilo','Adva de la','Maxico D.F.',05021,'Maxico'); insert into customer values(3,'Antonio','Antonio Moreno','Mataderos 2312','Maxico D.F.',05023,'Maxico'); {code} > JobGroup ID is not set for the job submitted from Spark-SQL and Spark -Shell > > > Key: SPARK-29510 > URL: https://issues.apache.org/jira/browse/SPARK-29510 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > Attachments: JobGroup1.png, JobGroup2.png, JobGroup3.png > > > When user submit jobs from Spark-shell or SQL Job group id is not set. UI > Screen shot attached. > But from beeline job Group ID is set. > Steps: > {code:java} > create table customer(id int, name String, CName String, address String, city > String, pin int, country String); > insert into customer values(1,'Alfred','Maria','Obere Str > 57','Berlin',12209,'Germany'); > insert into customer values(2,'Ana','trujilo','Adva de la','Maxico > D.F.',05021,'Maxico'); > insert into customer values(3,'Antonio','Antonio Moreno','Mataderos > 2312','Maxico D.F.',05023,'Maxico'); > SELECT A.CName AS CustomerName1, B.CName AS CustomerName2, A.City FROM > customer A, customer B WHERE A.id <> B.id AND A.City = B.City ORDER BY > A.City;{code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
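For context, a job group can be attached explicitly from spark-shell before running the query; the report is that nothing equivalent happens automatically for spark-shell or Spark SQL submissions the way it does for beeline. The group id and description below are illustrative:

{code}
// Explicitly tagging subsequent jobs with a job group from spark-shell.
sc.setJobGroup("adhoc-analysis", "self-join on customer city")
spark.sql(
  """SELECT A.CName AS CustomerName1, B.CName AS CustomerName2, A.City
    |FROM customer A, customer B
    |WHERE A.id <> B.id AND A.City = B.City
    |ORDER BY A.City""".stripMargin).show()
{code}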
[jira] [Assigned] (SPARK-24666) Word2Vec generate infinity vectors when numIterations are large
[ https://issues.apache.org/jira/browse/SPARK-24666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh reassigned SPARK-24666: --- Assignee: L. C. Hsieh > Word2Vec generate infinity vectors when numIterations are large > --- > > Key: SPARK-24666 > URL: https://issues.apache.org/jira/browse/SPARK-24666 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.3.1, 2.4.4 > Environment: 2.0.X, 2.1.X, 2.2.X, 2.3.X, 2.4.X >Reporter: ZhongYu >Assignee: L. C. Hsieh >Priority: Critical > > We found that Word2Vec generate large absolute value vectors when > numIterations are large, and if numIterations are large enough (>20), the > vector's value many be *infinity(or -**infinity)***, resulting in useless > vectors. > In normal situations, vectors values are mainly around -1.0~1.0 when > numIterations = 1. > The bug is shown on spark 2.0.X, 2.1.X, 2.2.X, 2.3.X, 2.4.X > There are already issues report this bug: > https://issues.apache.org/jira/browse/SPARK-5261 , but the bug fix works > seems missing. > Other people's reports: > [https://stackoverflow.com/questions/49741956/infinity-vectors-in-spark-mllib-word2vec] > [http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-outputs-Infinity-Infinity-vectors-with-increasing-iterations-td29020.html] > === > Here are the code to reproduce the issue. You can download title.akas.tsv > from [https://datasets.imdbws.com/] and upload to hdfs. > > {code:java} > import org.apache.spark.sql.SparkSession > import org.apache.spark.ml.feature.Word2Vec > case class Sentences(name: String, words: Array[String]) > import spark.implicits._ > // IMDB raw data title.akas.tsv download from https://datasets.imdbws.com/ > val dataset = spark.read > .option("header", "true").option("sep", "\t") > .option("quote", "").option("nullValue", "\\N") > .csv("/tmp/word2vec/title.akas.tsv") > .filter("region = 'US' or language = 'en'") > .select("title") > .as[String] > .map(s => Sentences(s, s.split(' '))) > .persist() > println("Training model...") > val word2Vec = new Word2Vec() > .setInputCol("words") > .setOutputCol("vector") > .setVectorSize(64) > .setWindowSize(4) > .setNumPartitions(50) > .setMinCount(5) > .setMaxIter(20) > val model = word2Vec.fit(dataset) > model.getVectors.show() > {code} > When set maxIter to 30, you will get the result. > {code:java} > scala> model.getVectors.show() > +-++ > | word| vector| > +-++ > | Unspoken|[-Infinity,-Infin...| > | Talent|[Infinity,-Infini...| > |Hourglass|[1.09657520526310...| > |Nickelodeon's|[2.20436549446219...| > | Priests|[-1.9625896848389...| > |Religion:|[-3.8815759928213...| > | Bu|[-7.9722236466752...| > | Totoro:|[-4.1829056206528...| > | Trouble,|[2.51985378203136...| > | Hatter|[8.49108115961009...| > | '79|[-5.4560309784650...| > | Vile|[-1.2059769646379...| > | 9/11|[Infinity,-Infini...| > | Santino|[6.30405421282099...| > | Motives|[1.96207712570869...| > | '13|[-1.7641987324084...| > | Fierce|[-Infinity,Infini...| > | Stover|[5.10057474120744...| > | 'It|[1.08629989605664...| > |Butts|[Infinity,Infinit...| > +-++ > only showing top 20 rows > {code} > In this case, set maxIter to 20 may not generate Infinity but very large > absolute values. It depends on the training data sample and other > configurations. > {code:java} > scala> model.getVectors.show(2,false) >
[jira] [Resolved] (SPARK-24666) Word2Vec generate infinity vectors when numIterations are large
[ https://issues.apache.org/jira/browse/SPARK-24666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh resolved SPARK-24666. - Fix Version/s: 3.1.0 2.4.5 Resolution: Fixed Issue resolved by pull request 26722 [https://github.com/apache/spark/pull/26722] > Word2Vec generate infinity vectors when numIterations are large > --- > > Key: SPARK-24666 > URL: https://issues.apache.org/jira/browse/SPARK-24666 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.3.1, 2.4.4 > Environment: 2.0.X, 2.1.X, 2.2.X, 2.3.X, 2.4.X >Reporter: ZhongYu >Assignee: L. C. Hsieh >Priority: Critical > Fix For: 2.4.5, 3.1.0 > > > We found that Word2Vec generate large absolute value vectors when > numIterations are large, and if numIterations are large enough (>20), the > vector's value many be *infinity(or -**infinity)***, resulting in useless > vectors. > In normal situations, vectors values are mainly around -1.0~1.0 when > numIterations = 1. > The bug is shown on spark 2.0.X, 2.1.X, 2.2.X, 2.3.X, 2.4.X > There are already issues report this bug: > https://issues.apache.org/jira/browse/SPARK-5261 , but the bug fix works > seems missing. > Other people's reports: > [https://stackoverflow.com/questions/49741956/infinity-vectors-in-spark-mllib-word2vec] > [http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-outputs-Infinity-Infinity-vectors-with-increasing-iterations-td29020.html] > === > Here are the code to reproduce the issue. You can download title.akas.tsv > from [https://datasets.imdbws.com/] and upload to hdfs. > > {code:java} > import org.apache.spark.sql.SparkSession > import org.apache.spark.ml.feature.Word2Vec > case class Sentences(name: String, words: Array[String]) > import spark.implicits._ > // IMDB raw data title.akas.tsv download from https://datasets.imdbws.com/ > val dataset = spark.read > .option("header", "true").option("sep", "\t") > .option("quote", "").option("nullValue", "\\N") > .csv("/tmp/word2vec/title.akas.tsv") > .filter("region = 'US' or language = 'en'") > .select("title") > .as[String] > .map(s => Sentences(s, s.split(' '))) > .persist() > println("Training model...") > val word2Vec = new Word2Vec() > .setInputCol("words") > .setOutputCol("vector") > .setVectorSize(64) > .setWindowSize(4) > .setNumPartitions(50) > .setMinCount(5) > .setMaxIter(20) > val model = word2Vec.fit(dataset) > model.getVectors.show() > {code} > When set maxIter to 30, you will get the result. > {code:java} > scala> model.getVectors.show() > +-++ > | word| vector| > +-++ > | Unspoken|[-Infinity,-Infin...| > | Talent|[Infinity,-Infini...| > |Hourglass|[1.09657520526310...| > |Nickelodeon's|[2.20436549446219...| > | Priests|[-1.9625896848389...| > |Religion:|[-3.8815759928213...| > | Bu|[-7.9722236466752...| > | Totoro:|[-4.1829056206528...| > | Trouble,|[2.51985378203136...| > | Hatter|[8.49108115961009...| > | '79|[-5.4560309784650...| > | Vile|[-1.2059769646379...| > | 9/11|[Infinity,-Infini...| > | Santino|[6.30405421282099...| > | Motives|[1.96207712570869...| > | '13|[-1.7641987324084...| > | Fierce|[-Infinity,Infini...| > | Stover|[5.10057474120744...| > | 'It|[1.08629989605664...| > |Butts|[Infinity,Infinit...| > +-++ > only showing top 20 rows > {code} > In this case, set maxIter to 20 may not generate Infinity but very large > absolute values. It depends on the training data sample and other > configurations. > {code:java} > scala> model.getVectors.show(2,false) >
[jira] [Commented] (SPARK-26091) Upgrade to 2.3.4 for Hive Metastore Client 2.3
[ https://issues.apache.org/jira/browse/SPARK-26091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989271#comment-16989271 ] Dongjoon Hyun commented on SPARK-26091: --- Since SPARK-26091 was an improvement issue, we didn't backport this. Only bug fixes are allowed for backporting. Is there any problem with Apache Spark 2.4.4? Or, do you just want to access HiveMetastore 2.4.5? > Upgrade to 2.3.4 for Hive Metastore Client 2.3 > -- > > Key: SPARK-26091 > URL: https://issues.apache.org/jira/browse/SPARK-26091 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26091) Upgrade to 2.3.4 for Hive Metastore Client 2.3
[ https://issues.apache.org/jira/browse/SPARK-26091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989256#comment-16989256 ] t oo commented on SPARK-26091: -- can this go in spark 2.4.5 ? > Upgrade to 2.3.4 for Hive Metastore Client 2.3 > -- > > Key: SPARK-26091 > URL: https://issues.apache.org/jira/browse/SPARK-26091 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22860) Spark workers log ssl passwords passed to the executors
[ https://issues.apache.org/jira/browse/SPARK-22860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989255#comment-16989255 ] t oo commented on SPARK-22860: -- [~kabhwan] can this go in 2.4.5? > Spark workers log ssl passwords passed to the executors > --- > > Key: SPARK-22860 > URL: https://issues.apache.org/jira/browse/SPARK-22860 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Felix K. >Assignee: Jungtaek Lim >Priority: Major > Fix For: 3.0.0 > > > The workers log the spark.ssl.keyStorePassword and > spark.ssl.trustStorePassword passed by cli to the executor processes. The > ExecutorRunner should escape passwords to not appear in the worker's log > files in INFO level. In this example, you can see my 'SuperSecretPassword' in > a worker log: > {code} > 17/12/08 08:04:12 INFO ExecutorRunner: Launch command: > "/global/myapp/oem/jdk/bin/java" "-cp" > "/global/myapp/application/myapp_software/thing_loader_lib/core-repository-model-zzz-1.2.3-SNAPSHOT.jar > [...] > :/global/myapp/application/spark-2.1.1-bin-hadoop2.7/jars/*" "-Xmx16384M" > "-Dspark.authenticate.enableSaslEncryption=true" > "-Dspark.ssl.keyStorePassword=SuperSecretPassword" > "-Dspark.ssl.keyStore=/global/myapp/application/config/ssl/keystore.jks" > "-Dspark.ssl.trustStore=/global/myapp/application/config/ssl/truststore.jks" > "-Dspark.ssl.enabled=true" "-Dspark.driver.port=39927" > "-Dspark.ssl.protocol=TLS" > "-Dspark.ssl.trustStorePassword=SuperSecretPassword" > "-Dspark.authenticate=true" "-Dmyapp_IMPORT_DATE=2017-10-30" > "-Dmyapp.config.directory=/global/myapp/application/config" > "-Dsolr.httpclient.builder.factory=com.company.myapp.loader.auth.LoaderConfigSparkSolrBasicAuthConfigurer" > > "-Djavax.net.ssl.trustStore=/global/myapp/application/config/ssl/truststore.jks" > "-XX:+UseG1GC" "-XX:+UseStringDeduplication" > "-Dthings.loader.export.zzz_files=false" > "-Dlog4j.configuration=file:/global/myapp/application/config/spark-executor-log4j.properties" > "-XX:+HeapDumpOnOutOfMemoryError" "-XX:+UseStringDeduplication" > "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" > "spark://CoarseGrainedScheduler@192.168.0.1:39927" "--executor-id" "2" > "--hostname" "192.168.0.1" "--cores" "4" "--app-id" "app-20171208080412-" > "--worker-url" "spark://Worker@192.168.0.1:59530" > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23534) Spark run on Hadoop 3.0.0
[ https://issues.apache.org/jira/browse/SPARK-23534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989254#comment-16989254 ] t oo commented on SPARK-23534: -- close? > Spark run on Hadoop 3.0.0 > - > > Key: SPARK-23534 > URL: https://issues.apache.org/jira/browse/SPARK-23534 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.3.0 >Reporter: Saisai Shao >Priority: Major > > Major Hadoop vendors already/will step in Hadoop 3.0. So we should also make > sure Spark can run with Hadoop 3.0. This Jira tracks the work to make Spark > run on Hadoop 3.0. > The work includes: > # Add a Hadoop 3.0.0 new profile to make Spark build-able with Hadoop 3.0. > # Test to see if there's dependency issues with Hadoop 3.0. > # Investigating the feasibility to use shaded client jars (HADOOP-11804). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24590) Make Jenkins tests passed with hadoop 3 profile
[ https://issues.apache.org/jira/browse/SPARK-24590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989253#comment-16989253 ] t oo commented on SPARK-24590: -- close? > Make Jenkins tests passed with hadoop 3 profile > --- > > Key: SPARK-24590 > URL: https://issues.apache.org/jira/browse/SPARK-24590 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 2.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > Currently, some tests are being failed with hadoop-3 profile. > Given PR builder > (https://github.com/apache/spark/pull/21441#issuecomment-397818337), it > reported: > {code} > org.apache.spark.sql.hive.HiveSparkSubmitSuite.SPARK-8020: set sql conf in > spark conf > org.apache.spark.sql.hive.HiveSparkSubmitSuite.SPARK-9757 Persist Parquet > relation with decimal column > org.apache.spark.sql.hive.HiveSparkSubmitSuite.ConnectionURL > org.apache.spark.sql.hive.StatisticsSuite.SPARK-22745 - read Hive's > statistics for partition > org.apache.spark.sql.hive.StatisticsSuite.alter table rename after analyze > table > org.apache.spark.sql.hive.StatisticsSuite.alter table SET TBLPROPERTIES after > analyze table > org.apache.spark.sql.hive.StatisticsSuite.alter table UNSET TBLPROPERTIES > after analyze table > org.apache.spark.sql.hive.client.HiveClientSuites.(It is not a test it is a > sbt.testing.SuiteSelector) > org.apache.spark.sql.hive.client.VersionsSuite.success sanity check > org.apache.spark.sql.hive.client.VersionsSuite.hadoop configuration preserved > 75 ms > org.apache.spark.sql.hive.client.VersionsSuite.*: * (roughly) > org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite.basic DDL using > locale tr - caseSensitive true > org.apache.spark.sql.hive.execution.HiveDDLSuite.create Hive-serde table and > view with unicode columns and comment > org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.SPARK-21617: ALTER > TABLE for non-compatible DataSource tables > org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.SPARK-21617: ALTER > TABLE for Hive-compatible DataSource tables > org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.SPARK-21617: ALTER > TABLE for Hive tables > org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.SPARK-21617: ALTER > TABLE with incompatible schema on Hive-compatible table > org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.(It is not a test it is > a sbt.testing.SuiteSelector) > org.apache.spark.sql.hive.execution.SQLQuerySuite.SPARK-18355 Read data from > a hive table with a new column - orc > org.apache.spark.sql.hive.execution.SQLQuerySuite.SPARK-18355 Read data from > a hive table with a new column - parquet > org.apache.spark.sql.hive.orc.HiveOrcSourceSuite.SPARK-19459/SPARK-18220: > read char/varchar column written by Hive > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5159) Thrift server does not respect hive.server2.enable.doAs=true
[ https://issues.apache.org/jira/browse/SPARK-5159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989251#comment-16989251 ] t oo edited comment on SPARK-5159 at 12/5/19 11:31 PM: --- [~yumwang] does removal of hive fork solve this one? was (Author: toopt4): [~yumwang] does removal of hive fork soove this one? > Thrift server does not respect hive.server2.enable.doAs=true > > > Key: SPARK-5159 > URL: https://issues.apache.org/jira/browse/SPARK-5159 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0 >Reporter: Andrew Ray >Priority: Major > Attachments: spark_thrift_server_log.txt > > > I'm currently testing the spark sql thrift server on a kerberos secured > cluster in YARN mode. Currently any user can access any table regardless of > HDFS permissions as all data is read as the hive user. In HiveServer2 the > property hive.server2.enable.doAs=true causes all access to be done as the > submitting user. We should do the same. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5159) Thrift server does not respect hive.server2.enable.doAs=true
[ https://issues.apache.org/jira/browse/SPARK-5159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989251#comment-16989251 ] t oo commented on SPARK-5159: - [~yumwang] does removal of hive fork soove this one? > Thrift server does not respect hive.server2.enable.doAs=true > > > Key: SPARK-5159 > URL: https://issues.apache.org/jira/browse/SPARK-5159 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0 >Reporter: Andrew Ray >Priority: Major > Attachments: spark_thrift_server_log.txt > > > I'm currently testing the spark sql thrift server on a kerberos secured > cluster in YARN mode. Currently any user can access any table regardless of > HDFS permissions as all data is read as the hive user. In HiveServer2 the > property hive.server2.enable.doAs=true causes all access to be done as the > submitting user. We should do the same. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27750) Standalone scheduler - ability to prioritize applications over drivers, many drivers act like Denial of Service
[ https://issues.apache.org/jira/browse/SPARK-27750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989249#comment-16989249 ] t oo commented on SPARK-27750: -- bump > Standalone scheduler - ability to prioritize applications over drivers, many > drivers act like Denial of Service > --- > > Key: SPARK-27750 > URL: https://issues.apache.org/jira/browse/SPARK-27750 > Project: Spark > Issue Type: New Feature > Components: Scheduler >Affects Versions: 3.0.0 >Reporter: t oo >Priority: Minor > > If I submit 1000 spark submit drivers then they consume all the cores on my > cluster (essentially it acts like a Denial of Service) and no spark > 'application' gets to run since the cores are all consumed by the 'drivers'. > This feature is about having the ability to prioritize applications over > drivers so that at least some 'applications' can start running. I guess it > would be like: If (driver.state = 'submitted' and (exists some app.state = > 'submitted')) then set app.state = 'running' > if all apps have app.state = 'running' then set driver.state = 'submitted' > > Secondary to this, why must a driver consume a minimum of 1 entire core? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27821) Spark WebUI - show numbers of drivers/apps in waiting/submitted/killed/running state
[ https://issues.apache.org/jira/browse/SPARK-27821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989248#comment-16989248 ] t oo commented on SPARK-27821: -- the duration of running drivers is missing too > Spark WebUI - show numbers of drivers/apps in > waiting/submitted/killed/running state > > > Key: SPARK-27821 > URL: https://issues.apache.org/jira/browse/SPARK-27821 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.0 >Reporter: t oo >Priority: Minor > Attachments: webui.png > > > The webui shows the total number of apps/drivers in the running/completed states. This > improvement is to show the total number in the following more fine-grained states: > waiting/submitted/killed/running/completed -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30144) MLP param map missing
[ https://issues.apache.org/jira/browse/SPARK-30144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Glen-Erik Cortes updated SPARK-30144: - Description: Param maps for fitted classifiers are available with all classifiers except for the MultilayerPerceptronClassifier. There is no way to track or know what parameters were best during a crossvalidation or which parameters were used for submodels. {code:java} { Param(parent='MultilayerPerceptronClassifier_eeab0cc242d1', name='featuresCol', doc='features column name'): 'features', Param(parent='MultilayerPerceptronClassifier_eeab0cc242d1', name='labelCol', doc='label column name'): 'fake_banknote', Param(parent='MultilayerPerceptronClassifier_eeab0cc242d1', name='predictionCol', doc='prediction column name'): 'prediction', Param(parent='MultilayerPerceptronClassifier_eeab0cc242d1', name='probabilityCol', doc='Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities'): 'probability', Param(parent='MultilayerPerceptronClassifier_eeab0cc242d1', name='rawPredictionCol', doc='raw prediction (a.k.a. confidence) column name'): 'rawPrediction'}{code} GBTClassifier for example shows all parameters: {code:java} { Param(parent='GBTClassifier_a0e77b3430aa', name='cacheNodeIds', doc='If false, the algorithm will pass trees to executors to match instances with nodes. If true, the algorithm will cache node IDs for each instance. Caching can speed up training of deeper trees.'): False, Param(parent='GBTClassifier_a0e77b3430aa', name='checkpointInterval', doc='set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations. Note: this setting will be ignored if the checkpoint directory is not set in the SparkContext'): 10, Param(parent='GBTClassifier_a0e77b3430aa', name='featureSubsetStrategy', doc='The number of features to consider for splits at each tree node. Supported options: auto, all, onethird, sqrt, log2, (0.0-1.0], [1-n].'): 'all', Param(parent='GBTClassifier_a0e77b3430aa', name='featuresCol', doc='features column name'): 'features', Param(parent='GBTClassifier_a0e77b3430aa', name='labelCol', doc='label column name'): 'fake_banknote', Param(parent='GBTClassifier_a0e77b3430aa', name='lossType', doc='Loss function which GBT tries to minimize (case-insensitive). Supported options: logistic'): 'logistic', Param(parent='GBTClassifier_a0e77b3430aa', name='maxBins', doc='Max number of bins for discretizing continuous features. Must be >=2 and >= number of categories for any categorical feature.'): 8, Param(parent='GBTClassifier_a0e77b3430aa', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.'): 5, Param(parent='GBTClassifier_a0e77b3430aa', name='maxIter', doc='maximum number of iterations (>= 0)'): 20, Param(parent='GBTClassifier_a0e77b3430aa', name='maxMemoryInMB', doc='Maximum memory in MB allocated to histogram aggregation.'): 256, Param(parent='GBTClassifier_a0e77b3430aa', name='minInfoGain', doc='Minimum information gain for a split to be considered at a tree node.'): 0.0, Param(parent='GBTClassifier_a0e77b3430aa', name='minInstancesPerNode', doc='Minimum number of instances each child must have after split. If a split causes the left or right child to have fewer than minInstancesPerNode, the split will be discarded as invalid. 
Should be >= 1.'): 1, Param(parent='GBTClassifier_a0e77b3430aa', name='predictionCol', doc='prediction column name'): 'prediction', Param(parent='GBTClassifier_a0e77b3430aa', name='seed', doc='random seed'): 1234, Param(parent='GBTClassifier_a0e77b3430aa', name='stepSize', doc='Step size (a.k.a. learning rate) in interval (0, 1] for shrinking the contribution of each estimator.'): 0.1, Param(parent='GBTClassifier_a0e77b3430aa', name='subsamplingRate', doc='Fraction of the training data used for learning each decision tree, in range (0, 1].'): 1.0}{code} Full example notebook here: [https://colab.research.google.com/drive/1lwSHioZKlLh96FhGkdYFe6FUuRfTcSxH] was: Param maps for fitted classifiers are available with all classifiers except for the MultilayerPerceptronClassifier. There is no way to track or know what parameters were best during a crossvalidation or which parameters were used for submodels. {code:java} { Param(parent='MultilayerPerceptronClassifier_eeab0cc242d1', name='featuresCol', doc='features column name'): 'features', Param(parent='MultilayerPerceptronClassifier_eeab0cc242d1', name='labelCol', doc='label column name'): 'fake_banknote', Param(parent='MultilayerPerceptronClassifier_eeab0cc242d1', name='predictionCol', doc='prediction column name'): 'prediction',
[jira] [Updated] (SPARK-30144) MLP param map missing
[ https://issues.apache.org/jira/browse/SPARK-30144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Glen-Erik Cortes updated SPARK-30144: - Description: Param maps for fitted classifiers are available with all classifiers except for the MultilayerPerceptronClassifier. There is no way to track or know what parameters were best during a crossvalidation or which parameters were used for submodels. {code:java} { Param(parent='MultilayerPerceptronClassifier_eeab0cc242d1', name='featuresCol', doc='features column name'): 'features', Param(parent='MultilayerPerceptronClassifier_eeab0cc242d1', name='labelCol', doc='label column name'): 'fake_banknote', Param(parent='MultilayerPerceptronClassifier_eeab0cc242d1', name='predictionCol', doc='prediction column name'): 'prediction', Param(parent='MultilayerPerceptronClassifier_eeab0cc242d1', name='probabilityCol', doc='Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities'): 'probability', Param(parent='MultilayerPerceptronClassifier_eeab0cc242d1', name='rawPredictionCol', doc='raw prediction (a.k.a. confidence) column name'): 'rawPrediction'}{code} GBTClassifier for example shows all parameters: {code:java} { Param(parent='GBTClassifier_a0e77b3430aa', name='cacheNodeIds', doc='If false, the algorithm will pass trees to executors to match instances with nodes. If true, the algorithm will cache node IDs for each instance. Caching can speed up training of deeper trees.'): False, Param(parent='GBTClassifier_a0e77b3430aa', name='checkpointInterval', doc='set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations. Note: this setting will be ignored if the checkpoint directory is not set in the SparkContext'): 10, Param(parent='GBTClassifier_a0e77b3430aa', name='featureSubsetStrategy', doc='The number of features to consider for splits at each tree node. Supported options: auto, all, onethird, sqrt, log2, (0.0-1.0], [1-n].'): 'all', Param(parent='GBTClassifier_a0e77b3430aa', name='featuresCol', doc='features column name'): 'features', Param(parent='GBTClassifier_a0e77b3430aa', name='labelCol', doc='label column name'): 'fake_banknote', Param(parent='GBTClassifier_a0e77b3430aa', name='lossType', doc='Loss function which GBT tries to minimize (case-insensitive). Supported options: logistic'): 'logistic', Param(parent='GBTClassifier_a0e77b3430aa', name='maxBins', doc='Max number of bins for discretizing continuous features. Must be >=2 and >= number of categories for any categorical feature.'): 8, Param(parent='GBTClassifier_a0e77b3430aa', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.'): 5, Param(parent='GBTClassifier_a0e77b3430aa', name='maxIter', doc='maximum number of iterations (>= 0)'): 20, Param(parent='GBTClassifier_a0e77b3430aa', name='maxMemoryInMB', doc='Maximum memory in MB allocated to histogram aggregation.'): 256, Param(parent='GBTClassifier_a0e77b3430aa', name='minInfoGain', doc='Minimum information gain for a split to be considered at a tree node.'): 0.0, Param(parent='GBTClassifier_a0e77b3430aa', name='minInstancesPerNode', doc='Minimum number of instances each child must have after split. If a split causes the left or right child to have fewer than minInstancesPerNode, the split will be discarded as invalid. 
Should be >= 1.'): 1, Param(parent='GBTClassifier_a0e77b3430aa', name='predictionCol', doc='prediction column name'): 'prediction', Param(parent='GBTClassifier_a0e77b3430aa', name='seed', doc='random seed'): 1234, Param(parent='GBTClassifier_a0e77b3430aa', name='stepSize', doc='Step size (a.k.a. learning rate) in interval (0, 1] for shrinking the contribution of each estimator.'): 0.1, Param(parent='GBTClassifier_a0e77b3430aa', name='subsamplingRate', doc='Fraction of the training data used for learning each decision tree, in range (0, 1].'): 1.0}{code} Full example notebook here: [https://colab.research.google.com/drive/1lwSHioZKlLh96FhGkdYFe6FUuRfTcSxH] was: Param maps for fitted classifiers are available with all classifiers except for the MultilayerPerceptronClassifier. There is no way to track or know what parameters were best during a crossvalidation or which parameters were used for submodels. {code:java} {Param(parent='MultilayerPerceptronClassifier_eeab0cc242d1', name='featuresCol', doc='features column name'): 'features', Param(parent='MultilayerPerceptronClassifier_eeab0cc242d1', name='labelCol', doc='label column name'): 'fake_banknote', Param(parent='MultilayerPerceptronClassifier_eeab0cc242d1', name='predictionCol', doc='prediction column name'): 'prediction',
[jira] [Created] (SPARK-30144) MLP param map missing
Glen-Erik Cortes created SPARK-30144: Summary: MLP param map missing Key: SPARK-30144 URL: https://issues.apache.org/jira/browse/SPARK-30144 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 2.4.4 Reporter: Glen-Erik Cortes Param maps for fitted classifiers are available with all classifiers except for the MultilayerPerceptronClassifier. There is no way to track or know what parameters were best during a crossvalidation or which parameters were used for submodels. {code:java} {Param(parent='MultilayerPerceptronClassifier_eeab0cc242d1', name='featuresCol', doc='features column name'): 'features', Param(parent='MultilayerPerceptronClassifier_eeab0cc242d1', name='labelCol', doc='label column name'): 'fake_banknote', Param(parent='MultilayerPerceptronClassifier_eeab0cc242d1', name='predictionCol', doc='prediction column name'): 'prediction', Param(parent='MultilayerPerceptronClassifier_eeab0cc242d1', name='probabilityCol', doc='Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities'): 'probability', Param(parent='MultilayerPerceptronClassifier_eeab0cc242d1', name='rawPredictionCol', doc='raw prediction (a.k.a. confidence) column name'): 'rawPrediction'}{code} GBTClassifier for example shows all parameters: {code:java} {Param(parent='GBTClassifier_a0e77b3430aa', name='cacheNodeIds', doc='If false, the algorithm will pass trees to executors to match instances with nodes. If true, the algorithm will cache node IDs for each instance. Caching can speed up training of deeper trees.'): False, Param(parent='GBTClassifier_a0e77b3430aa', name='checkpointInterval', doc='set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations. Note: this setting will be ignored if the checkpoint directory is not set in the SparkContext'): 10, Param(parent='GBTClassifier_a0e77b3430aa', name='featureSubsetStrategy', doc='The number of features to consider for splits at each tree node. Supported options: auto, all, onethird, sqrt, log2, (0.0-1.0], [1-n].'): 'all', Param(parent='GBTClassifier_a0e77b3430aa', name='featuresCol', doc='features column name'): 'features', Param(parent='GBTClassifier_a0e77b3430aa', name='labelCol', doc='label column name'): 'fake_banknote', Param(parent='GBTClassifier_a0e77b3430aa', name='lossType', doc='Loss function which GBT tries to minimize (case-insensitive). Supported options: logistic'): 'logistic', Param(parent='GBTClassifier_a0e77b3430aa', name='maxBins', doc='Max number of bins for discretizing continuous features. Must be >=2 and >= number of categories for any categorical feature.'): 8, Param(parent='GBTClassifier_a0e77b3430aa', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.'): 5, Param(parent='GBTClassifier_a0e77b3430aa', name='maxIter', doc='maximum number of iterations (>= 0)'): 20, Param(parent='GBTClassifier_a0e77b3430aa', name='maxMemoryInMB', doc='Maximum memory in MB allocated to histogram aggregation.'): 256, Param(parent='GBTClassifier_a0e77b3430aa', name='minInfoGain', doc='Minimum information gain for a split to be considered at a tree node.'): 0.0, Param(parent='GBTClassifier_a0e77b3430aa', name='minInstancesPerNode', doc='Minimum number of instances each child must have after split. 
If a split causes the left or right child to have fewer than minInstancesPerNode, the split will be discarded as invalid. Should be >= 1.'): 1, Param(parent='GBTClassifier_a0e77b3430aa', name='predictionCol', doc='prediction column name'): 'prediction', Param(parent='GBTClassifier_a0e77b3430aa', name='seed', doc='random seed'): 1234, Param(parent='GBTClassifier_a0e77b3430aa', name='stepSize', doc='Step size (a.k.a. learning rate) in interval (0, 1] for shrinking the contribution of each estimator.'): 0.1, Param(parent='GBTClassifier_a0e77b3430aa', name='subsamplingRate', doc='Fraction of the training data used for learning each decision tree, in range (0, 1].'): 1.0}{code} Full example notebook here: https://colab.research.google.com/drive/1lwSHioZKlLh96FhGkdYFe6FUuRfTcSxH -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
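To make the expectation in SPARK-30144 concrete, the sketch below fits a MultilayerPerceptronClassifier on a tiny inline DataFrame and prints its param map. It uses the Scala API with made-up toy data (the report itself concerns the PySpark wrapper); the expectation is that the printed map lists layers, maxIter, seed, and so on, not just the column-name params.
{code:scala}
// Minimal sketch for SPARK-30144: inspect the param map of a fitted MLP model.
// The toy XOR-style data and local[*] session are assumptions for illustration.
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object MlpParamMapSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("mlp-params").getOrCreate()

    val train = spark.createDataFrame(Seq(
      (0.0, Vectors.dense(0.0, 0.0)),
      (1.0, Vectors.dense(1.0, 1.0)),
      (0.0, Vectors.dense(0.0, 1.0)),
      (1.0, Vectors.dense(1.0, 0.0))
    )).toDF("label", "features")

    val mlp = new MultilayerPerceptronClassifier()
      .setLayers(Array(2, 4, 2)) // input, hidden, output layer sizes
      .setMaxIter(10)
      .setSeed(1234L)

    val model = mlp.fit(train)
    // Per this ticket, the fitted model's param map should expose the training
    // params (layers, maxIter, seed, ...), not only the column-name params.
    println(model.extractParamMap())

    spark.stop()
  }
}
{code}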
[jira] [Created] (SPARK-30143) StreamingQuery.stop() should not block indefinitely
Burak Yavuz created SPARK-30143: --- Summary: StreamingQuery.stop() should not block indefinitely Key: SPARK-30143 URL: https://issues.apache.org/jira/browse/SPARK-30143 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.4.4 Reporter: Burak Yavuz The stop() method on a Streaming Query awaits the termination of the stream execution thread. However, the stream execution thread may block forever depending on the streaming source implementation (like in Kafka, which runs UninterruptibleThreads). This causes control flow applications to hang indefinitely as well. We'd like to introduce a timeout to stop the execution thread, so that the control flow thread can decide to do an action if a timeout is hit. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
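Until a timeout exists inside stop() itself, a caller-side pattern can approximate what SPARK-30143 asks for. The sketch below is a workaround-style illustration that assumes an already running StreamingQuery; stopWithTimeout and its executor are invented names, and if the timeout fires the stop attempt simply keeps running in the background.
{code:scala}
// Caller-side sketch for SPARK-30143: issue stop() from another thread and
// bound how long the control-flow thread waits for it.
import java.util.concurrent.{Executors, TimeUnit, TimeoutException}

import org.apache.spark.sql.streaming.StreamingQuery

object StopWithTimeoutSketch {
  private val pool = Executors.newSingleThreadExecutor()

  /** Returns true if stop() completed within timeoutMs, false otherwise. */
  def stopWithTimeout(query: StreamingQuery, timeoutMs: Long): Boolean = {
    val attempt = pool.submit(new Runnable {
      override def run(): Unit = query.stop()
    })
    try {
      attempt.get(timeoutMs, TimeUnit.MILLISECONDS) // wait at most timeoutMs
      true
    } catch {
      case _: TimeoutException =>
        // The stream execution thread is still blocked; the stop attempt keeps
        // running in the background and the caller can decide what to do next.
        false
    }
  }
}
{code}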
[jira] [Resolved] (SPARK-30124) unnecessary persist in PythonMLLibAPI.scala
[ https://issues.apache.org/jira/browse/SPARK-30124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-30124. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26758 [https://github.com/apache/spark/pull/26758] > unnecessary persist in PythonMLLibAPI.scala > --- > > Key: SPARK-30124 > URL: https://issues.apache.org/jira/browse/SPARK-30124 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 3.0.0 >Reporter: Aman Omer >Assignee: Aman Omer >Priority: Minor > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30124) unnecessary persist in PythonMLLibAPI.scala
[ https://issues.apache.org/jira/browse/SPARK-30124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-30124: Assignee: Aman Omer > unnecessary persist in PythonMLLibAPI.scala > --- > > Key: SPARK-30124 > URL: https://issues.apache.org/jira/browse/SPARK-30124 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 3.0.0 >Reporter: Aman Omer >Assignee: Aman Omer >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30121) Fix memory usage in sbt build script
[ https://issues.apache.org/jira/browse/SPARK-30121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-30121. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26757 [https://github.com/apache/spark/pull/26757] > Fix memory usage in sbt build script > > > Key: SPARK-30121 > URL: https://issues.apache.org/jira/browse/SPARK-30121 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.4, 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Trivial > Fix For: 3.0.0 > > > 1. the default memory setting is missing in usage instructions > {code:java} > ``` > build/sbt -h > ``` > ``` > -mem set memory options (default: , which is -Xms2048m > -Xmx2048m -XX:ReservedCodeCacheSize=256m) > ``` > {code} > 2. the Perm space is not needed anymore, since java7 is removed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30121) Fix memory usage in sbt build script
[ https://issues.apache.org/jira/browse/SPARK-30121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-30121: Assignee: Kent Yao > Fix memory usage in sbt build script > > > Key: SPARK-30121 > URL: https://issues.apache.org/jira/browse/SPARK-30121 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.4, 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Trivial > > 1. the default memory setting is missing in usage instructions > {code:java} > ``` > build/sbt -h > ``` > ``` > -mem set memory options (default: , which is -Xms2048m > -Xmx2048m -XX:ReservedCodeCacheSize=256m) > ``` > {code} > 2. the Perm space is not needed anymore, since java7 is removed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30121) Fix memory usage in sbt build script
[ https://issues.apache.org/jira/browse/SPARK-30121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-30121: - Priority: Trivial (was: Minor) > Fix memory usage in sbt build script > > > Key: SPARK-30121 > URL: https://issues.apache.org/jira/browse/SPARK-30121 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.4, 3.0.0 >Reporter: Kent Yao >Priority: Trivial > > 1. the default memory setting is missing in usage instructions > {code:java} > ``` > build/sbt -h > ``` > ``` > -mem set memory options (default: , which is -Xms2048m > -Xmx2048m -XX:ReservedCodeCacheSize=256m) > ``` > {code} > 2. the Perm space is not needed anymore, since java7 is removed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30142) Upgrade Maven to 3.6.3
Dongjoon Hyun created SPARK-30142: - Summary: Upgrade Maven to 3.6.3 Key: SPARK-30142 URL: https://issues.apache.org/jira/browse/SPARK-30142 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.0.0 Reporter: Dongjoon Hyun This issue aims to upgrade to Maven 3.6.3. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28961) Upgrade Maven to 3.6.2
[ https://issues.apache.org/jira/browse/SPARK-28961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-28961. --- Fix Version/s: 3.0.0 Resolution: Fixed This is fixed https://github.com/apache/spark/pull/25665 > Upgrade Maven to 3.6.2 > -- > > Key: SPARK-28961 > URL: https://issues.apache.org/jira/browse/SPARK-28961 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Major > Fix For: 3.0.0 > > > It looks like maven 3.6.1 is missing from the apache maven repo: > [http://apache.claz.org/maven/maven-3/] > This is causing PR build failures: > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/110045/console] > > exec: curl -s -L > > [https://www.apache.org/dyn/closer.lua?action=download=/maven/maven-3/3.6.1/binaries/apache-maven-3.6.1-bin.tar.gz] > gzip: stdin: not in gzip format > tar: Child returned status 1 > tar: Error is not recoverable: exiting now > Using `mvn` from path: > /home/jenkins/workspace/SparkPullRequestBuilder/build/apache-maven-3.6.1/bin/mvn > build/mvn: line 163: > /home/jenkins/workspace/SparkPullRequestBuilder/build/apache-maven-3.6.1/bin/mvn: > No such file or directory > Error while getting version string from Maven: > h4. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30129) New auth engine does not keep client ID in TransportClient after auth
[ https://issues.apache.org/jira/browse/SPARK-30129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16988982#comment-16988982 ] Dongjoon Hyun commented on SPARK-30129: --- This is backported to branch-2.4 via https://github.com/apache/spark/pull/26764 > New auth engine does not keep client ID in TransportClient after auth > - > > Key: SPARK-30129 > URL: https://issues.apache.org/jira/browse/SPARK-30129 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4, 3.0.0 >Reporter: Marcelo Masiero Vanzin >Assignee: Marcelo Masiero Vanzin >Priority: Major > Fix For: 2.4.5, 3.0.0 > > > Found a little bug when working on a feature; when auth is on, it's expected > that the {{TransportClient}} provides the authenticated ID of the client > (generally the app ID), but the new auth engine is not setting that > information. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30129) New auth engine does not keep client ID in TransportClient after auth
[ https://issues.apache.org/jira/browse/SPARK-30129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30129: -- Fix Version/s: 2.4.5 > New auth engine does not keep client ID in TransportClient after auth > - > > Key: SPARK-30129 > URL: https://issues.apache.org/jira/browse/SPARK-30129 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4, 3.0.0 >Reporter: Marcelo Masiero Vanzin >Assignee: Marcelo Masiero Vanzin >Priority: Major > Fix For: 2.4.5, 3.0.0 > > > Found a little bug when working on a feature; when auth is on, it's expected > that the {{TransportClient}} provides the authenticated ID of the client > (generally the app ID), but the new auth engine is not setting that > information. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30099) Improve Analyzed Logical Plan as duplicate AnalysisExceptions are coming
[ https://issues.apache.org/jira/browse/SPARK-30099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-30099: --- Assignee: Aman Omer (was: jobit mathew) > Improve Analyzed Logical Plan as duplicate AnalysisExceptions are coming > > > Key: SPARK-30099 > URL: https://issues.apache.org/jira/browse/SPARK-30099 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: jobit mathew >Assignee: Aman Omer >Priority: Minor > Fix For: 3.0.0 > > > Spark SQL > explain extended select * from any non existing table shows duplicate > AnalysisExceptions. > {code:java} > spark-sql>explain extended select * from wrong > == Parsed Logical Plan == > 'Project [*] > +- 'UnresolvedRelation `wrong` > == Analyzed Logical Plan == > org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line > 1 p > os 31 > org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line > 1 p > os 31 > == Optimized Logical Plan == > org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line > 1 p > os 31 > == Physical Plan == > org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line > 1 p > os 31 > Time taken: 6.0 seconds, Fetched 1 row(s) > 19/12/02 14:33:32 INFO SparkSQLCLIDriver: Time taken: 6.0 seconds, Fetched 1 > row > (s) > spark-sql> > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30099) Improve Analyzed Logical Plan as duplicate AnalysisExceptions are coming
[ https://issues.apache.org/jira/browse/SPARK-30099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16988840#comment-16988840 ] Wenchen Fan commented on SPARK-30099: - ah sorry I made a mistake. Fixed now. > Improve Analyzed Logical Plan as duplicate AnalysisExceptions are coming > > > Key: SPARK-30099 > URL: https://issues.apache.org/jira/browse/SPARK-30099 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: jobit mathew >Assignee: Aman Omer >Priority: Minor > Fix For: 3.0.0 > > > Spark SQL > explain extended select * from any non existing table shows duplicate > AnalysisExceptions. > {code:java} > spark-sql>explain extended select * from wrong > == Parsed Logical Plan == > 'Project [*] > +- 'UnresolvedRelation `wrong` > == Analyzed Logical Plan == > org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line > 1 p > os 31 > org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line > 1 p > os 31 > == Optimized Logical Plan == > org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line > 1 p > os 31 > == Physical Plan == > org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line > 1 p > os 31 > Time taken: 6.0 seconds, Fetched 1 row(s) > 19/12/02 14:33:32 INFO SparkSQLCLIDriver: Time taken: 6.0 seconds, Fetched 1 > row > (s) > spark-sql> > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30141) Fix QueryTest.checkAnswer usage
Yuming Wang created SPARK-30141: --- Summary: Fix QueryTest.checkAnswer usage Key: SPARK-30141 URL: https://issues.apache.org/jira/browse/SPARK-30141 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 3.0.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30140) Code comment error
[ https://issues.apache.org/jira/browse/SPARK-30140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuv1up updated SPARK-30140: --- Attachment: (was: 1575546368156.jpg) > Code comment error > --- > > Key: SPARK-30140 > URL: https://issues.apache.org/jira/browse/SPARK-30140 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: wuv1up >Priority: Trivial > > ignore... -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30140) Code comment error
[ https://issues.apache.org/jira/browse/SPARK-30140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuv1up updated SPARK-30140: --- Description: ignore... (was: !image-2019-12-05-19-44-08-141.png! I think the red box is writen as transitivity.) > Code comment error > --- > > Key: SPARK-30140 > URL: https://issues.apache.org/jira/browse/SPARK-30140 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: wuv1up >Priority: Trivial > > ignore... -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-30140) Code comment error
[ https://issues.apache.org/jira/browse/SPARK-30140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuv1up updated SPARK-30140: --- Comment: was deleted (was: The picture seems to be hanging. The error in clean method of ClosureCleaner.scala.) > Code comment error > --- > > Key: SPARK-30140 > URL: https://issues.apache.org/jira/browse/SPARK-30140 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: wuv1up >Priority: Trivial > Attachments: 1575546368156.jpg > > > !image-2019-12-05-19-44-08-141.png! > I think the red box is writen as transitivity. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30140) Code comment error
[ https://issues.apache.org/jira/browse/SPARK-30140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuv1up updated SPARK-30140: --- Attachment: 1575546368156.jpg > Code comment error > --- > > Key: SPARK-30140 > URL: https://issues.apache.org/jira/browse/SPARK-30140 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: wuv1up >Priority: Trivial > Attachments: 1575546368156.jpg > > > !image-2019-12-05-19-44-08-141.png! > I think the red box is writen as transitivity. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30140) Code comment error
[ https://issues.apache.org/jira/browse/SPARK-30140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16988711#comment-16988711 ] wuv1up commented on SPARK-30140: The picture seems to be hanging. The error in clean method of ClosureCleaner.scala. > Code comment error > --- > > Key: SPARK-30140 > URL: https://issues.apache.org/jira/browse/SPARK-30140 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: wuv1up >Priority: Trivial > > !image-2019-12-05-19-44-08-141.png! > I think the red box is writen as transitivity. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30140) Code comment error
wuv1up created SPARK-30140: -- Summary: Code comment error Key: SPARK-30140 URL: https://issues.apache.org/jira/browse/SPARK-30140 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.4.4 Reporter: wuv1up !image-2019-12-05-19-44-08-141.png! I think the red box is writen as transitivity. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30139) get_json_object does not work correctly
[ https://issues.apache.org/jira/browse/SPARK-30139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16988693#comment-16988693 ] Rakesh Raushan commented on SPARK-30139: I will look into this issue. > get_json_object does not work correctly > --- > > Key: SPARK-30139 > URL: https://issues.apache.org/jira/browse/SPARK-30139 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.4 >Reporter: Clemens Valiente >Priority: Major > > according to documentation: > [https://spark.apache.org/docs/2.4.4/api/java/org/apache/spark/sql/functions.html#get_json_object-org.apache.spark.sql.Column-java.lang.String-] > get_json_object "Extracts json object from a json string based on json path > specified, and returns json string of the extracted json object. It will > return null if the input json string is invalid." > > the following SQL snippet returns null even though it should return 'a' > {code} > select get_json_object('[{"id":123,"value":"a"},{"id":456,"value":"b"}]', > '$[?($.id==123)].value'){code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30101) spark.sql.shuffle.partitions is not in Configuration docs, but a very critical parameter
[ https://issues.apache.org/jira/browse/SPARK-30101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16988676#comment-16988676 ] sam commented on SPARK-30101: - [~kabhwan] [~cloud_fan] [~sowen] > We may deal with it we strongly agree about needs for prioritizing this. Oh great, thanks. I think part of the problem here is that Google SEO is broken because its algorithm has been trained by RDD. Googling how to set parallelism always gives `spark.default.parallelism`. Even if you Google "set default parallelism dataset spark" it still doesn't take you to http://spark.apache.org/docs/latest/sql-performance-tuning.html I think setting parallelism is indeed one of the most important things you would ever need to do in Spark, so yes, making it easier to find this would be super helpful to the community. > spark.sql.shuffle.partitions is not in Configuration docs, but a very > critical parameter > > > Key: SPARK-30101 > URL: https://issues.apache.org/jira/browse/SPARK-30101 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0, 2.4.4 >Reporter: sam >Priority: Major > > I'm creating a `SparkSession` like this: > ``` > SparkSession > .builder().appName("foo").master("local") > .config("spark.default.parallelism", 2).getOrCreate() > ``` > when I run > ``` > ((1 to 10) ++ (1 to 10)).toDS().distinct().count() > ``` > I get 200 partitions > ``` > 19/12/02 10:29:34 INFO TaskSchedulerImpl: Adding task set 1.0 with 200 tasks > ... > 19/12/02 10:29:34 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) > in 46 ms on localhost (executor driver) (1/200) > ``` > It is the `distinct` that is broken since `ds.rdd.getNumPartitions` gives > `2`, while `ds.distinct().rdd.getNumPartitions` gives `200`. > `ds.rdd.groupBy(identity).map(_._2.head)` and `ds.rdd.distinct()` work > correctly. > Finally I notice that the good old `RDD` interface has a `distinct` that > accepts `numPartitions` partitions, while `Dataset` does not. > ... > According to the comments below, it uses spark.sql.shuffle.partitions, which > needs documenting in configuration. > > Default number of partitions in RDDs returned by transformations like join, > > reduceByKey, and parallelize when not set by user. > in https://spark.apache.org/docs/latest/configuration.html should say > > Default number of partitions in RDDs, but not DS/DF (see > > spark.sql.shuffle.partitions) returned by transformations like join, > > reduceByKey, and parallelize when not set by user. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
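The configuration point in SPARK-30101 can be shown in a few lines: for Dataset/DataFrame shuffles it is spark.sql.shuffle.partitions, not spark.default.parallelism, that decides the post-shuffle partition count. The sketch below mirrors the reporter's snippet; the local-mode session and the explicit setting of both keys are illustrative assumptions, not part of the proposed documentation change.
{code:scala}
// Sketch for SPARK-30101: spark.sql.shuffle.partitions controls Dataset/DataFrame
// shuffles such as distinct(), independently of spark.default.parallelism.
import org.apache.spark.sql.SparkSession

object ShufflePartitionsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("foo").master("local")
      .config("spark.default.parallelism", 2)    // RDD-side default parallelism
      .config("spark.sql.shuffle.partitions", 2) // Dataset/DataFrame shuffle partitions
      .getOrCreate()
    import spark.implicits._

    val ds = ((1 to 10) ++ (1 to 10)).toDS()
    // Without the spark.sql.shuffle.partitions setting above, distinct() would
    // shuffle into the default 200 partitions regardless of spark.default.parallelism.
    println(ds.distinct().rdd.getNumPartitions) // 2
    spark.stop()
  }
}
{code}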
[jira] [Created] (SPARK-30139) get_json_object does not work correctly
Clemens Valiente created SPARK-30139: Summary: get_json_object does not work correctly Key: SPARK-30139 URL: https://issues.apache.org/jira/browse/SPARK-30139 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.4 Reporter: Clemens Valiente according to documentation: [https://spark.apache.org/docs/2.4.4/api/java/org/apache/spark/sql/functions.html#get_json_object-org.apache.spark.sql.Column-java.lang.String-] get_json_object "Extracts json object from a json string based on json path specified, and returns json string of the extracted json object. It will return null if the input json string is invalid." the following SQL snippet returns null even though it should return 'a' {code} select get_json_object('[{"id":123,"value":"a"},{"id":456,"value":"b"}]', '$[?($.id==123)].value'){code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
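To separate the two behaviours in SPARK-30139, the sketch below runs the reporter's filter-style path next to a plain positional path against the same JSON literal, assuming a local SparkSession purely for illustration. The filter path returning null is the behaviour reported above; the positional path is one that get_json_object does resolve.
{code:scala}
// Sketch for SPARK-30139: contrast a positional JSON path with the filter-style
// path from the report. Assumes a local SparkSession for illustration only.
import org.apache.spark.sql.SparkSession

object GetJsonObjectSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").appName("gjo").getOrCreate()

    val json = """[{"id":123,"value":"a"},{"id":456,"value":"b"}]"""

    // Positional path: picks the first array element and returns "a".
    spark.sql(s"SELECT get_json_object('$json', '$$[0].value')").show(false)

    // Filter-predicate path from the report: returns null on Spark 2.4.4.
    spark.sql(s"SELECT get_json_object('$json', '$$[?($$.id==123)].value')").show(false)

    spark.stop()
  }
}
{code}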
[jira] [Commented] (SPARK-30099) Improve Analyzed Logical Plan as duplicate AnalysisExceptions are coming
[ https://issues.apache.org/jira/browse/SPARK-30099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16988607#comment-16988607 ] Aman Omer commented on SPARK-30099: --- [~cloud_fan] can you assign this jira ticket to me? id: aman_omer > Improve Analyzed Logical Plan as duplicate AnalysisExceptions are coming > > > Key: SPARK-30099 > URL: https://issues.apache.org/jira/browse/SPARK-30099 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: jobit mathew >Assignee: jobit mathew >Priority: Minor > Fix For: 3.0.0 > > > Spark SQL > explain extended select * from any non existing table shows duplicate > AnalysisExceptions. > {code:java} > spark-sql>explain extended select * from wrong > == Parsed Logical Plan == > 'Project [*] > +- 'UnresolvedRelation `wrong` > == Analyzed Logical Plan == > org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line > 1 p > os 31 > org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line > 1 p > os 31 > == Optimized Logical Plan == > org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line > 1 p > os 31 > == Physical Plan == > org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line > 1 p > os 31 > Time taken: 6.0 seconds, Fetched 1 row(s) > 19/12/02 14:33:32 INFO SparkSQLCLIDriver: Time taken: 6.0 seconds, Fetched 1 > row > (s) > spark-sql> > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30138) Separate configuration key of max iterations for analyzer and optimizer
[ https://issues.apache.org/jira/browse/SPARK-30138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hu Fuwang updated SPARK-30138: -- Description: Currently, both Analyzer and Optimizer use conf "spark.sql.optimizer.maxIterations" to set the max iterations to run, which is a little confusing. It is clearer to add a new conf "spark.sql.analyzer.maxIterations" for analyzer max iterations. > Separate configuration key of max iterations for analyzer and optimizer > --- > > Key: SPARK-30138 > URL: https://issues.apache.org/jira/browse/SPARK-30138 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Hu Fuwang >Priority: Major > > Currently, both Analyzer and Optimizer use conf > "spark.sql.optimizer.maxIterations" to set the max iterations to run, which > is a little confusing. > It is clearer to add a new conf "spark.sql.analyzer.maxIterations" for > analyzer max iterations. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30138) Separate configuration key of max iterations for analyzer and optimizer
Hu Fuwang created SPARK-30138: - Summary: Separate configuration key of max iterations for analyzer and optimizer Key: SPARK-30138 URL: https://issues.apache.org/jira/browse/SPARK-30138 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Hu Fuwang -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
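For context on SPARK-30138, the snippet below sets the existing optimizer key at runtime; the analyzer-specific key is the one this ticket proposes and appears only as a commented-out, hypothetical addition.
{code:scala}
// Sketch for SPARK-30138. spark.sql.optimizer.maxIterations is the existing key;
// spark.sql.analyzer.maxIterations is the proposed (not yet existing) one.
import org.apache.spark.sql.SparkSession

object MaxIterationsConfSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").appName("max-iter").getOrCreate()

    // Today this single key bounds the fixed-point iterations of both the
    // analyzer and the optimizer rule batches, per the description above.
    spark.conf.set("spark.sql.optimizer.maxIterations", 50)

    // Proposed in this ticket (hypothetical until implemented):
    // spark.conf.set("spark.sql.analyzer.maxIterations", 25)

    spark.stop()
  }
}
{code}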
[jira] [Resolved] (SPARK-29425) Alter database statement erases hive database's ownership
[ https://issues.apache.org/jira/browse/SPARK-29425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29425. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26080 [https://github.com/apache/spark/pull/26080] > Alter database statement erases hive database's ownership > - > > Key: SPARK-29425 > URL: https://issues.apache.org/jira/browse/SPARK-29425 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.4, 2.4.4 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.0.0 > > > Commands like `ALTER DATABASE kyuubi SET DBPROPERTIES ('in'='out')` will > erase a hive database's owner -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29425) Alter database statement erases hive database's ownership
[ https://issues.apache.org/jira/browse/SPARK-29425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-29425: --- Assignee: Kent Yao > Alter database statement erases hive database's ownership > - > > Key: SPARK-29425 > URL: https://issues.apache.org/jira/browse/SPARK-29425 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.4, 2.4.4 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > > Commands like `ALTER DATABASE kyuubi SET DBPROPERTIES ('in'='out')` will > erase a hive database's owner -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29860) [SQL] Fix data type mismatch issue for inSubQuery
[ https://issues.apache.org/jira/browse/SPARK-29860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29860. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26485 [https://github.com/apache/spark/pull/26485] > [SQL] Fix data type mismatch issue for inSubQuery > - > > Key: SPARK-29860 > URL: https://issues.apache.org/jira/browse/SPARK-29860 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.4 >Reporter: feiwang >Assignee: feiwang >Priority: Major > Fix For: 3.0.0 > > > The following statement would throw an exception. > {code:java} > sql("create table ta(id Decimal(18,0)) using parquet") > sql("create table tb(id Decimal(19,0)) using parquet") > sql("select * from ta where id in (select id from tb)").show() > {code} > {code:java} > // Exception information > cannot resolve '(default.ta.`id` IN (listquery()))' due to data type > mismatch: > The data type of one or more elements in the left hand side of an IN subquery > is not compatible with the data type of the output of the subquery > Mismatched columns: > [(default.ta.`id`:decimal(18,0), default.tb.`id`:decimal(19,0))] > Left side: > [decimal(18,0)]. > Right side: > [decimal(19,0)].;; > 'Project [*] > +- 'Filter id#219 IN (list#218 []) >: +- Project [id#220] >: +- SubqueryAlias `default`.`tb` >:+- Relation[id#220] parquet >+- SubqueryAlias `default`.`ta` > +- Relation[id#219] parquet > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29860) [SQL] Fix data type mismatch issue for inSubQuery
[ https://issues.apache.org/jira/browse/SPARK-29860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-29860: --- Assignee: feiwang > [SQL] Fix data type mismatch issue for inSubQuery > - > > Key: SPARK-29860 > URL: https://issues.apache.org/jira/browse/SPARK-29860 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.4 >Reporter: feiwang >Assignee: feiwang >Priority: Major > > The following statement would throw an exception. > {code:java} > sql("create table ta(id Decimal(18,0)) using parquet") > sql("create table tb(id Decimal(19,0)) using parquet") > sql("select * from ta where id in (select id from tb)").show() > {code} > {code:java} > // Exception information > cannot resolve '(default.ta.`id` IN (listquery()))' due to data type > mismatch: > The data type of one or more elements in the left hand side of an IN subquery > is not compatible with the data type of the output of the subquery > Mismatched columns: > [(default.ta.`id`:decimal(18,0), default.tb.`id`:decimal(19,0))] > Left side: > [decimal(18,0)]. > Right side: > [decimal(19,0)].;; > 'Project [*] > +- 'Filter id#219 IN (list#218 []) >: +- Project [id#220] >: +- SubqueryAlias `default`.`tb` >:+- Relation[id#220] parquet >+- SubqueryAlias `default`.`ta` > +- Relation[id#219] parquet > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org