[jira] [Updated] (SPARK-30262) Fix NumberFormatException when totalSize is empty
[ https://issues.apache.org/jira/browse/SPARK-30262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chenliang updated SPARK-30262: -- Description: Since Spark 2.3.0, we can read partition statistics. But in some special cases, statistics such as totalSize, rawDataSize, and rowCount may be empty. When we run a DDL statement such as {code:java}desc formatted <table> partition(<spec>){code}, a NumberFormatException is thrown as below:
{code:java}
spark-sql> desc formatted table1 partition(year='2019', month='10', day='17', hour='23');
19/10/19 00:02:40 ERROR SparkSQLDriver: Failed in [desc formatted table1 partition(year='2019', month='10', day='17', hour='23')]
java.lang.NumberFormatException: Zero length BigInteger
at java.math.BigInteger.<init>(BigInteger.java:411)
at java.math.BigInteger.<init>(BigInteger.java:597)
at scala.math.BigInt$.apply(BigInt.scala:77)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$31.apply(HiveClientImpl.scala:1056)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$31.apply(HiveClientImpl.scala:1056)
at scala.Option.map(Option.scala:146)
at org.apache.spark.sql.hive.client.HiveClientImpl$.org$apache$spark$sql$hive$client$HiveClientImpl$$readHiveStats(HiveClientImpl.scala:1056)
at org.apache.spark.sql.hive.client.HiveClientImpl$.fromHivePartition(HiveClientImpl.scala:1048)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionOption$1$$anonfun$apply$16.apply(HiveClientImpl.scala:659)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionOption$1$$anonfun$apply$16.apply(HiveClientImpl.scala:659)
at scala.Option.map(Option.scala:146)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionOption$1.apply(HiveClientImpl.scala:659)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionOption$1.apply(HiveClientImpl.scala:656)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:281)
at
org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:219)
at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:218)
at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:264)
at org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionOption(HiveClientImpl.scala:656)
at org.apache.spark.sql.hive.client.HiveClient$class.getPartitionOption(HiveClient.scala:194)
at org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionOption(HiveClientImpl.scala:84)
at org.apache.spark.sql.hive.client.HiveClient$class.getPartition(HiveClient.scala:174)
at org.apache.spark.sql.hive.client.HiveClientImpl.getPartition(HiveClientImpl.scala:84)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$getPartition$1.apply(HiveExternalCatalog.scala:1125)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$getPartition$1.apply(HiveExternalCatalog.scala:1124)
{code}
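The trace points at readHiveStats (HiveClientImpl.scala:1056), where the raw Hive stat parameter string is fed straight into BigInt, and java.math.BigInteger rejects an empty string with exactly this "Zero length BigInteger" message. A minimal Java sketch of the failure and of a defensive parse; the helper name parseStat is hypothetical, not Spark's actual code:

```java
import java.math.BigInteger;
import java.util.Optional;

public class EmptyStatDemo {
    // Hypothetical guard: treat null/blank stat values (totalSize, rowCount, ...)
    // as absent instead of passing them to BigInteger, which would throw
    // "NumberFormatException: Zero length BigInteger" on an empty string.
    static Optional<BigInteger> parseStat(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(new BigInteger(raw.trim()));
    }

    public static void main(String[] args) {
        try {
            new BigInteger("");  // reproduces the reported failure
        } catch (NumberFormatException e) {
            System.out.println(e.getMessage());  // prints "Zero length BigInteger"
        }
        System.out.println(parseStat(""));      // Optional.empty
        System.out.println(parseStat("1024"));  // Optional[1024]
    }
}
```

The same idea, applied where the stack trace's Option.map call sits, would make `desc formatted ... partition(...)` report missing statistics rather than fail.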
[jira] [Created] (SPARK-30262) Fix NumberFormatException when totalSize is empty
chenliang created SPARK-30262: - Summary: Fix NumberFormatException when totalSize is empty Key: SPARK-30262 URL: https://issues.apache.org/jira/browse/SPARK-30262 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.3 Reporter: chenliang Fix For: 2.4.3, 2.3.2
Since Spark 2.3.0, we can read partition statistics. But in some special cases, statistics such as totalSize, rawDataSize, and rowCount may be empty. When we run a DDL statement such as {code:java}desc formatted <table> partition(<spec>){code}, a NumberFormatException is thrown as below:
{code:java}
spark-sql> desc formatted gulfstream.ods_binlog_business_config_whole partition(year='2019', month='10', day='17', hour='23');
19/10/19 00:02:40 ERROR SparkSQLDriver: Failed in [desc formatted gulfstream.ods_binlog_business_config_whole partition(year='2019', month='10', day='17', hour='23')]
java.lang.NumberFormatException: Zero length BigInteger
at java.math.BigInteger.<init>(BigInteger.java:411)
at java.math.BigInteger.<init>(BigInteger.java:597)
at scala.math.BigInt$.apply(BigInt.scala:77)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$31.apply(HiveClientImpl.scala:1056)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$31.apply(HiveClientImpl.scala:1056)
at scala.Option.map(Option.scala:146)
at org.apache.spark.sql.hive.client.HiveClientImpl$.org$apache$spark$sql$hive$client$HiveClientImpl$$readHiveStats(HiveClientImpl.scala:1056)
at org.apache.spark.sql.hive.client.HiveClientImpl$.fromHivePartition(HiveClientImpl.scala:1048)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionOption$1$$anonfun$apply$16.apply(HiveClientImpl.scala:659)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionOption$1$$anonfun$apply$16.apply(HiveClientImpl.scala:659)
at scala.Option.map(Option.scala:146)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionOption$1.apply(HiveClientImpl.scala:659)
at
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionOption$1.apply(HiveClientImpl.scala:656)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:281)
at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:219)
at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:218)
at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:264)
at org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionOption(HiveClientImpl.scala:656)
at org.apache.spark.sql.hive.client.HiveClient$class.getPartitionOption(HiveClient.scala:194)
at org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionOption(HiveClientImpl.scala:84)
at org.apache.spark.sql.hive.client.HiveClient$class.getPartition(HiveClient.scala:174)
at org.apache.spark.sql.hive.client.HiveClientImpl.getPartition(HiveClientImpl.scala:84)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$getPartition$1.apply(HiveExternalCatalog.scala:1125)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$getPartition$1.apply(HiveExternalCatalog.scala:1124)
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30261) Should not change owner of hive table for some commands like 'alter' operation
[ https://issues.apache.org/jira/browse/SPARK-30261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chenliang updated SPARK-30261: -- Description: For Spark SQL, when we run alter operations on a Hive table, the owner of the table is changed to whoever invoked the operation, which is unreasonable. In fact, the owner should not change in a real production environment; otherwise the authorization check breaks. The problem can be reproduced as described below:
1. First I create a table as user 'xie'; {{desc formatted}} then shows the owner is 'xie':
{code:java}
spark-sql> desc formatted bigdata_test.tt1;
col_name data_type comment
c int NULL
# Detailed Table Information
Database bigdata_test
Table tt1
Owner xie
Created Time Wed Sep 11 11:30:49 CST 2019
Last Access Thu Jan 01 08:00:00 CST 1970
Created By Spark 2.2 or prior
Type MANAGED
Provider hive
Table Properties [PART_LIMIT=1, transient_lastDdlTime=1568172649, LEVEL=1, TTL=60]
Location hdfs://NS1/user/hive_admin/warehouse/bigdata_test.db/tt1
Serde Library org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Storage Properties [serialization.format=1]
Partition Provider Catalog
Time taken: 0.371 seconds, Fetched 18 row(s)
{code}
2. Then, as another user, 'johnchen', I execute {{alter table bigdata_test.tt1 set location 'hdfs://NS1/user/hive_admin/warehouse/bigdata_test.db/tt1'}}; afterwards the owner of the table is 'johnchen', which is unreasonable:
{code:java}
spark-sql> desc formatted bigdata_test.tt1;
col_name data_type comment
c int NULL
# Detailed Table Information
Database bigdata_test
Table tt1
Owner johnchen
Created Time Wed Sep 11 11:30:49 CST 2019
Last Access Thu Jan 01 08:00:00 CST 1970
Created By Spark 2.2 or prior
Type MANAGED
Provider hive
Table Properties [transient_lastDdlTime=1568871017, PART_LIMIT=1, LEVEL=1, TTL=60]
Location
hdfs://NS1/user/hive_admin/warehouse/bigdata_test.db/tt1
Serde Library org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Storage Properties [serialization.format=1]
Partition Provider Catalog
Time taken: 0.041 seconds, Fetched 18 row(s)
{code}
> Should not change owner of hive table for some commands like 'alter' operation
>
> Key: SPARK-30261
> URL: https://issues.apache.org/jira/browse/SPARK-30261
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0, 2.3.0, 2.4.3
> Reporter: chenliang
> Priority: Critical
> Fix For: 2.2.0, 2.3.0, 2.4.3
>
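The expected behavior can be sketched as: when an alter-style command rewrites table metadata, the owner field should be carried over from the existing catalog entry rather than reset to the session user. A minimal Java illustration of the buggy versus the expected behavior; TableMeta and both method names are hypothetical stand-ins, not Spark's CatalogTable API:

```java
public class OwnerPreservationDemo {
    // Hypothetical, simplified stand-in for a catalog table entry.
    static final class TableMeta {
        final String owner;
        final String location;
        TableMeta(String owner, String location) {
            this.owner = owner;
            this.location = location;
        }
    }

    // Reported behavior: the session user running ALTER becomes the new owner.
    static TableMeta alterLocationBuggy(TableMeta existing, String newLocation, String sessionUser) {
        return new TableMeta(sessionUser, newLocation);
    }

    // Expected behavior: keep the original owner; change only what ALTER touches.
    static TableMeta alterLocationFixed(TableMeta existing, String newLocation, String sessionUser) {
        return new TableMeta(existing.owner, newLocation);
    }

    public static void main(String[] args) {
        TableMeta t = new TableMeta("xie", "hdfs://NS1/user/hive_admin/warehouse/bigdata_test.db/tt1");
        System.out.println(alterLocationBuggy(t, t.location, "johnchen").owner);  // johnchen
        System.out.println(alterLocationFixed(t, t.location, "johnchen").owner);  // xie
    }
}
```

Preserving the owner this way keeps authorization checks consistent regardless of which user last ran a DDL command.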
[jira] [Commented] (SPARK-29940) Whether contains schema for this parameter "spark.yarn.historyServer.address"
[ https://issues.apache.org/jira/browse/SPARK-29940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16996224#comment-16996224 ] hehuiyuan commented on SPARK-29940: --- Hi, anyone?
> Whether contains schema for this parameter "spark.yarn.historyServer.address"
>
> Key: SPARK-29940
> URL: https://issues.apache.org/jira/browse/SPARK-29940
> Project: Spark
> Issue Type: Wish
> Components: Documentation
> Affects Versions: 3.0.0
> Reporter: hehuiyuan
> Priority: Minor
> Attachments: image-2019-11-18-15-44-10-358.png, image-2019-11-18-15-45-33-295.png
>
> !image-2019-11-18-15-44-10-358.png|width=815,height=156!
> !image-2019-11-18-15-45-33-295.png|width=673,height=273!
[jira] [Created] (SPARK-30261) Should not change owner of hive table for some commands like 'alter' operation
chenliang created SPARK-30261: - Summary: Should not change owner of hive table for some commands like 'alter' operation Key: SPARK-30261 URL: https://issues.apache.org/jira/browse/SPARK-30261 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.3, 2.3.0, 2.2.0 Reporter: chenliang Fix For: 2.4.3, 2.3.0, 2.2.0
For Spark SQL, when we run alter operations on a Hive table, the owner of the table is changed to whoever invoked the operation, which is unreasonable. In fact, the owner should not change in a real production environment; otherwise the authorization check breaks.
[jira] [Commented] (SPARK-30250) SparkQL div is undocumented
[ https://issues.apache.org/jira/browse/SPARK-30250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16996209#comment-16996209 ] Michael Chirico commented on SPARK-30250: - Doubly great news! Thanks
> SparkQL div is undocumented
>
> Key: SPARK-30250
> URL: https://issues.apache.org/jira/browse/SPARK-30250
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.0.0
> Reporter: Michael Chirico
> Priority: Minor
>
> https://issues.apache.org/jira/browse/SPARK-15407
> Mentions the div operator in SparkQL.
> However, it's undocumented in the SQL API docs:
> https://spark.apache.org/docs/latest/api/sql/index.html
> It's documented in the HiveQL docs:
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
[jira] [Updated] (SPARK-30260) Spark-Shell throw ClassNotFoundException exception for more than one statement to use UDF jar
[ https://issues.apache.org/jira/browse/SPARK-30260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chenliang updated SPARK-30260: -- Target Version/s: 2.4.3, 2.3.0 (was: 2.4.3)
> Spark-Shell throw ClassNotFoundException exception for more than one statement to use UDF jar
>
> Key: SPARK-30260
> URL: https://issues.apache.org/jira/browse/SPARK-30260
> Project: Spark
> Issue Type: Bug
> Components: Spark Shell, SQL
> Affects Versions: 2.2.0, 2.3.0, 2.4.3, 2.4.4
> Reporter: chenliang
> Priority: Major
> Fix For: 2.3.0, 2.4.3
>
> When we start spark-shell and use the UDF in the first statement, it works. But for subsequent statements the jar fails to load into the current classpath and a ClassNotFoundException is thrown. The problem can be reproduced as described below.
> {code:java}
> scala> val res = spark.sql("select bigdata_test.Add(1,2)").show()
> --
> |bigdata_test.Add(1, 2)|
> --
> | 3|
> --
> scala> val res = spark.sql("select bigdata_test.Add(1,2)").show()
> org.apache.spark.sql.AnalysisException: No handler for UDF/UDAF/UDTF 'scala.didi.udf.Add': java.lang.ClassNotFoundException: scala.didi.udf.Add; line 1 pos 8
> at scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:62)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at org.apache.spark.sql.hive.HiveShim$HiveFunctionWrapper.createFunction(HiveShim.scala:251)
> at org.apache.spark.sql.hive.HiveSimpleUDF.function$lzycompute(hiveUDFs.scala:56)
> at org.apache.spark.sql.hive.HiveSimpleUDF.function(hiveUDFs.scala:56)
> at org.apache.spark.sql.hive.HiveSimpleUDF.method$lzycompute(hiveUDFs.scala:60)
> at org.apache.spark.sql.hive.HiveSimpleUDF.method(hiveUDFs.scala:59)
> at org.apache.spark.sql.hive.HiveSimpleUDF.dataType$lzycompute(hiveUDFs.scala:77)
> at org.apache.spark.sql.hive.HiveSimpleUDF.dataType(hiveUDFs.scala:77)
> at
org.apache.spark.sql.hive.HiveSessionCatalog$$anonfun$makeFunctionExpression$3.apply(HiveSessionCatalog.scala:79) > at > org.apache.spark.sql.hive.HiveSessionCatalog$$anonfun$makeFunctionExpression$3.apply(HiveSessionCatalog.scala:71) > at scala.util.Try.getOrElse(Try.scala:79) > at > org.apache.spark.sql.hive.HiveSessionCatalog.makeFunctionExpression(HiveSessionCatalog.scala:71) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$org$apache$spark$sql$catalyst$catalog$SessionCatalog$$makeFunctionBuilder$1.apply(SessionCatalog.scala:1133){code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30260) Spark-Shell throws ClassNotFoundException when more than one statement uses a UDF jar
[ https://issues.apache.org/jira/browse/SPARK-30260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chenliang updated SPARK-30260: -- Fix Version/s: 2.3.0 > Spark-Shell throw ClassNotFoundException exception for more than one > statement to use UDF jar > - > > Key: SPARK-30260 > URL: https://issues.apache.org/jira/browse/SPARK-30260 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 2.2.0, 2.3.0, 2.4.3, 2.4.4 >Reporter: chenliang >Priority: Major > Fix For: 2.3.0, 2.4.3 > > > When we start spark-shell and use the udf for the first statement ,it's ok. > But for the other statements it failed to load jar to current classpath and > would throw ClassNotFoundException,the problem can be reproduced as described > in the below. > {code:java} > scala> val res = spark.sql("select bigdata_test.Add(1,2)").show() > -- > |bigdata_test.Add(1, 2)| > -- > | 3| > -- > scala> val res = spark.sql("select bigdata_test.Add(1,2)").show() > org.apache.spark.sql.AnalysisException: No handler for UDF/UDAF/UDTF > 'scala.didi.udf.Add': java.lang.ClassNotFoundException: scala.didi.udf.Add; > line 1 pos 8 > at > scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:62) > at java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at java.lang.ClassLoader.loadClass(ClassLoader.java:357) > at > org.apache.spark.sql.hive.HiveShim$HiveFunctionWrapper.createFunction(HiveShim.scala:251) > at > org.apache.spark.sql.hive.HiveSimpleUDF.function$lzycompute(hiveUDFs.scala:56) > at org.apache.spark.sql.hive.HiveSimpleUDF.function(hiveUDFs.scala:56) > at > org.apache.spark.sql.hive.HiveSimpleUDF.method$lzycompute(hiveUDFs.scala:60) > at org.apache.spark.sql.hive.HiveSimpleUDF.method(hiveUDFs.scala:59) > at > org.apache.spark.sql.hive.HiveSimpleUDF.dataType$lzycompute(hiveUDFs.scala:77) > at org.apache.spark.sql.hive.HiveSimpleUDF.dataType(hiveUDFs.scala:77) > at > 
org.apache.spark.sql.hive.HiveSessionCatalog$$anonfun$makeFunctionExpression$3.apply(HiveSessionCatalog.scala:79) > at > org.apache.spark.sql.hive.HiveSessionCatalog$$anonfun$makeFunctionExpression$3.apply(HiveSessionCatalog.scala:71) > at scala.util.Try.getOrElse(Try.scala:79) > at > org.apache.spark.sql.hive.HiveSessionCatalog.makeFunctionExpression(HiveSessionCatalog.scala:71) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$org$apache$spark$sql$catalyst$catalog$SessionCatalog$$makeFunctionBuilder$1.apply(SessionCatalog.scala:1133){code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30260) Spark-Shell throws ClassNotFoundException when more than one statement uses a UDF jar
[ https://issues.apache.org/jira/browse/SPARK-30260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chenliang updated SPARK-30260: -- Description: When we start spark-shell and use the udf for the first statement ,it's ok. But for the other statements it failed to load jar to current classpath and would throw ClassNotFoundException,the problem can be reproduced as described in the below. {{}} {code:java} scala> val res = spark.sql("select bigdata_test.Add(1,2)").show() -- |bigdata_test.Add(1, 2)| -- | 3| -- scala> val res = spark.sql("select bigdata_test.Add(1,2)").show() org.apache.spark.sql.AnalysisException: No handler for UDF/UDAF/UDTF 'scala.didi.udf.Add': java.lang.ClassNotFoundException: scala.didi.udf.Add; line 1 pos 8 at scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:62) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at org.apache.spark.sql.hive.HiveShim$HiveFunctionWrapper.createFunction(HiveShim.scala:251) at org.apache.spark.sql.hive.HiveSimpleUDF.function$lzycompute(hiveUDFs.scala:56) at org.apache.spark.sql.hive.HiveSimpleUDF.function(hiveUDFs.scala:56) at org.apache.spark.sql.hive.HiveSimpleUDF.method$lzycompute(hiveUDFs.scala:60) at org.apache.spark.sql.hive.HiveSimpleUDF.method(hiveUDFs.scala:59) at org.apache.spark.sql.hive.HiveSimpleUDF.dataType$lzycompute(hiveUDFs.scala:77) at org.apache.spark.sql.hive.HiveSimpleUDF.dataType(hiveUDFs.scala:77) at org.apache.spark.sql.hive.HiveSessionCatalog$$anonfun$makeFunctionExpression$3.apply(HiveSessionCatalog.scala:79) at org.apache.spark.sql.hive.HiveSessionCatalog$$anonfun$makeFunctionExpression$3.apply(HiveSessionCatalog.scala:71) at scala.util.Try.getOrElse(Try.scala:79) at org.apache.spark.sql.hive.HiveSessionCatalog.makeFunctionExpression(HiveSessionCatalog.scala:71) at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$org$apache$spark$sql$catalyst$catalog$SessionCatalog$$makeFunctionBuilder$1.apply(SessionCatalog.scala:1133){code} was: When we start spark-shell and use the udf for the first statement ,it's ok. But for the other statements it failed to load jar to current classpath and would throw ClassNotFoundException,the problem can be reproduced as described in the below. {{scala> val res = spark.sql("select bigdata_test.Add(1,2)").show()}} {{+--+}} {{|bigdata_test.Add(1, 2)|}} {{+--+}} {{| 3|}} {{+--+}} {{scala> val res = spark.sql("select bigdata_test.Add(1,2)").show()}} {{org.apache.spark.sql.AnalysisException: No handler for UDF/UDAF/UDTF 'scala.didi.udf.Add': java.lang.ClassNotFoundException: scala.didi.udf.Add; line 1 pos 8}} {{ }}{{at scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:62)}} {{ }}{{at java.lang.ClassLoader.loadClass(ClassLoader.java:424)}} {{ }}{{at java.lang.ClassLoader.loadClass(ClassLoader.java:357)}} {{ }}{{at org.apache.spark.sql.hive.HiveShim$HiveFunctionWrapper.createFunction(HiveShim.scala:251)}} {{ }}{{at org.apache.spark.sql.hive.HiveSimpleUDF.function$lzycompute(hiveUDFs.scala:56)}} {{ }}{{at org.apache.spark.sql.hive.HiveSimpleUDF.function(hiveUDFs.scala:56)}} {{ }}{{at org.apache.spark.sql.hive.HiveSimpleUDF.method$lzycompute(hiveUDFs.scala:60)}} {{ }}{{at org.apache.spark.sql.hive.HiveSimpleUDF.method(hiveUDFs.scala:59)}} {{ }}{{at org.apache.spark.sql.hive.HiveSimpleUDF.dataType$lzycompute(hiveUDFs.scala:77)}} {{ }}{{at org.apache.spark.sql.hive.HiveSimpleUDF.dataType(hiveUDFs.scala:77)}} {{ }}{{at org.apache.spark.sql.hive.HiveSessionCatalog$$anonfun$makeFunctionExpression$3.apply(HiveSessionCatalog.scala:79)}} {{ }}{{at org.apache.spark.sql.hive.HiveSessionCatalog$$anonfun$makeFunctionExpression$3.apply(HiveSessionCatalog.scala:71)}} {{ }}{{at scala.util.Try.getOrElse(Try.scala:79)}} {{ }}{{at 
org.apache.spark.sql.hive.HiveSessionCatalog.makeFunctionExpression(HiveSessionCatalog.scala:71)}} {{ }}{{at org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$org$apache$spark$sql$catalyst$catalog$SessionCatalog$$makeFunctionBuilder$1.apply(SessionCatalog.scala:1133)}} > Spark-Shell throw ClassNotFoundException exception for more than one > statement to use UDF jar > - > > Key: SPARK-30260 > URL: https://issues.apache.org/jira/browse/SPARK-30260 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 2.2.0, 2.3.0, 2.4.3, 2.4.4 >Reporter: chenliang >Priority: Major > Fix For: 2.4.3 > > > When
[jira] [Updated] (SPARK-30260) Spark-Shell throws ClassNotFoundException when more than one statement uses a UDF jar
[ https://issues.apache.org/jira/browse/SPARK-30260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chenliang updated SPARK-30260: -- Description: When we start spark-shell and use the udf for the first statement ,it's ok. But for the other statements it failed to load jar to current classpath and would throw ClassNotFoundException,the problem can be reproduced as described in the below. {code:java} scala> val res = spark.sql("select bigdata_test.Add(1,2)").show() -- |bigdata_test.Add(1, 2)| -- | 3| -- scala> val res = spark.sql("select bigdata_test.Add(1,2)").show() org.apache.spark.sql.AnalysisException: No handler for UDF/UDAF/UDTF 'scala.didi.udf.Add': java.lang.ClassNotFoundException: scala.didi.udf.Add; line 1 pos 8 at scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:62) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at org.apache.spark.sql.hive.HiveShim$HiveFunctionWrapper.createFunction(HiveShim.scala:251) at org.apache.spark.sql.hive.HiveSimpleUDF.function$lzycompute(hiveUDFs.scala:56) at org.apache.spark.sql.hive.HiveSimpleUDF.function(hiveUDFs.scala:56) at org.apache.spark.sql.hive.HiveSimpleUDF.method$lzycompute(hiveUDFs.scala:60) at org.apache.spark.sql.hive.HiveSimpleUDF.method(hiveUDFs.scala:59) at org.apache.spark.sql.hive.HiveSimpleUDF.dataType$lzycompute(hiveUDFs.scala:77) at org.apache.spark.sql.hive.HiveSimpleUDF.dataType(hiveUDFs.scala:77) at org.apache.spark.sql.hive.HiveSessionCatalog$$anonfun$makeFunctionExpression$3.apply(HiveSessionCatalog.scala:79) at org.apache.spark.sql.hive.HiveSessionCatalog$$anonfun$makeFunctionExpression$3.apply(HiveSessionCatalog.scala:71) at scala.util.Try.getOrElse(Try.scala:79) at org.apache.spark.sql.hive.HiveSessionCatalog.makeFunctionExpression(HiveSessionCatalog.scala:71) at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$org$apache$spark$sql$catalyst$catalog$SessionCatalog$$makeFunctionBuilder$1.apply(SessionCatalog.scala:1133){code} was: When we start spark-shell and use the udf for the first statement ,it's ok. But for the other statements it failed to load jar to current classpath and would throw ClassNotFoundException,the problem can be reproduced as described in the below. {{}} {code:java} scala> val res = spark.sql("select bigdata_test.Add(1,2)").show() -- |bigdata_test.Add(1, 2)| -- | 3| -- scala> val res = spark.sql("select bigdata_test.Add(1,2)").show() org.apache.spark.sql.AnalysisException: No handler for UDF/UDAF/UDTF 'scala.didi.udf.Add': java.lang.ClassNotFoundException: scala.didi.udf.Add; line 1 pos 8 at scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:62) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at org.apache.spark.sql.hive.HiveShim$HiveFunctionWrapper.createFunction(HiveShim.scala:251) at org.apache.spark.sql.hive.HiveSimpleUDF.function$lzycompute(hiveUDFs.scala:56) at org.apache.spark.sql.hive.HiveSimpleUDF.function(hiveUDFs.scala:56) at org.apache.spark.sql.hive.HiveSimpleUDF.method$lzycompute(hiveUDFs.scala:60) at org.apache.spark.sql.hive.HiveSimpleUDF.method(hiveUDFs.scala:59) at org.apache.spark.sql.hive.HiveSimpleUDF.dataType$lzycompute(hiveUDFs.scala:77) at org.apache.spark.sql.hive.HiveSimpleUDF.dataType(hiveUDFs.scala:77) at org.apache.spark.sql.hive.HiveSessionCatalog$$anonfun$makeFunctionExpression$3.apply(HiveSessionCatalog.scala:79) at org.apache.spark.sql.hive.HiveSessionCatalog$$anonfun$makeFunctionExpression$3.apply(HiveSessionCatalog.scala:71) at scala.util.Try.getOrElse(Try.scala:79) at org.apache.spark.sql.hive.HiveSessionCatalog.makeFunctionExpression(HiveSessionCatalog.scala:71) at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$org$apache$spark$sql$catalyst$catalog$SessionCatalog$$makeFunctionBuilder$1.apply(SessionCatalog.scala:1133){code} > Spark-Shell throw ClassNotFoundException exception for more than one > statement to use UDF jar > - > > Key: SPARK-30260 > URL: https://issues.apache.org/jira/browse/SPARK-30260 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 2.2.0, 2.3.0, 2.4.3, 2.4.4 >Reporter: chenliang >Priority: Major > Fix For: 2.4.3 > > > When we start spark-shell and use the udf for the first statement ,it's ok. > But for the other statements it failed to load
[jira] [Updated] (SPARK-30260) Spark-Shell throws ClassNotFoundException when more than one statement uses a UDF jar
[ https://issues.apache.org/jira/browse/SPARK-30260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chenliang updated SPARK-30260: -- Description: When we start spark-shell and use the udf for the first statement ,it's ok. But for the other statements it failed to load jar to current classpath and would throw ClassNotFoundException,the problem can be reproduced as described in the below. {{scala> val res = spark.sql("select bigdata_test.Add(1,2)").show()}} {{+--+}} {{|bigdata_test.Add(1, 2)|}} {{+--+}} {{| 3|}} {{+--+}} {{scala> val res = spark.sql("select bigdata_test.Add(1,2)").show()}} {{org.apache.spark.sql.AnalysisException: No handler for UDF/UDAF/UDTF 'scala.didi.udf.Add': java.lang.ClassNotFoundException: scala.didi.udf.Add; line 1 pos 8}} {{ }}{{at scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:62)}} {{ }}{{at java.lang.ClassLoader.loadClass(ClassLoader.java:424)}} {{ }}{{at java.lang.ClassLoader.loadClass(ClassLoader.java:357)}} {{ }}{{at org.apache.spark.sql.hive.HiveShim$HiveFunctionWrapper.createFunction(HiveShim.scala:251)}} {{ }}{{at org.apache.spark.sql.hive.HiveSimpleUDF.function$lzycompute(hiveUDFs.scala:56)}} {{ }}{{at org.apache.spark.sql.hive.HiveSimpleUDF.function(hiveUDFs.scala:56)}} {{ }}{{at org.apache.spark.sql.hive.HiveSimpleUDF.method$lzycompute(hiveUDFs.scala:60)}} {{ }}{{at org.apache.spark.sql.hive.HiveSimpleUDF.method(hiveUDFs.scala:59)}} {{ }}{{at org.apache.spark.sql.hive.HiveSimpleUDF.dataType$lzycompute(hiveUDFs.scala:77)}} {{ }}{{at org.apache.spark.sql.hive.HiveSimpleUDF.dataType(hiveUDFs.scala:77)}} {{ }}{{at org.apache.spark.sql.hive.HiveSessionCatalog$$anonfun$makeFunctionExpression$3.apply(HiveSessionCatalog.scala:79)}} {{ }}{{at org.apache.spark.sql.hive.HiveSessionCatalog$$anonfun$makeFunctionExpression$3.apply(HiveSessionCatalog.scala:71)}} {{ }}{{at scala.util.Try.getOrElse(Try.scala:79)}} {{ }}{{at 
org.apache.spark.sql.hive.HiveSessionCatalog.makeFunctionExpression(HiveSessionCatalog.scala:71)}} {{ }}{{at org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$org$apache$spark$sql$catalyst$catalog$SessionCatalog$$makeFunctionBuilder$1.apply(SessionCatalog.scala:1133)}} was: When we start spark-shell and use the udf for the first statement ,it's ok. But for the other statements it failed to load jar to current classpath and would throw ClassNotFoundException,the problem can be reproduced as described in the below. > Spark-Shell throw ClassNotFoundException exception for more than one > statement to use UDF jar > - > > Key: SPARK-30260 > URL: https://issues.apache.org/jira/browse/SPARK-30260 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 2.2.0, 2.3.0, 2.4.3, 2.4.4 >Reporter: chenliang >Priority: Major > Fix For: 2.4.3 > > > When we start spark-shell and use the udf for the first statement ,it's ok. > But for the other statements it failed to load jar to current classpath and > would throw ClassNotFoundException,the problem can be reproduced as described > in the below. 
> > {{scala> val res = spark.sql("select bigdata_test.Add(1,2)").show()}} > {{+--+}} > {{|bigdata_test.Add(1, 2)|}} > {{+--+}} > {{| 3|}} > {{+--+}} > {{scala> val res = spark.sql("select bigdata_test.Add(1,2)").show()}} > {{org.apache.spark.sql.AnalysisException: No handler for UDF/UDAF/UDTF > 'scala.didi.udf.Add': java.lang.ClassNotFoundException: scala.didi.udf.Add; > line 1 pos 8}} > {{ }}{{at > scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:62)}} > {{ }}{{at java.lang.ClassLoader.loadClass(ClassLoader.java:424)}} > {{ }}{{at java.lang.ClassLoader.loadClass(ClassLoader.java:357)}} > {{ }}{{at > org.apache.spark.sql.hive.HiveShim$HiveFunctionWrapper.createFunction(HiveShim.scala:251)}} > {{ }}{{at > org.apache.spark.sql.hive.HiveSimpleUDF.function$lzycompute(hiveUDFs.scala:56)}} > {{ }}{{at > org.apache.spark.sql.hive.HiveSimpleUDF.function(hiveUDFs.scala:56)}} > {{ }}{{at > org.apache.spark.sql.hive.HiveSimpleUDF.method$lzycompute(hiveUDFs.scala:60)}} > {{ }}{{at org.apache.spark.sql.hive.HiveSimpleUDF.method(hiveUDFs.scala:59)}} > {{ }}{{at > org.apache.spark.sql.hive.HiveSimpleUDF.dataType$lzycompute(hiveUDFs.scala:77)}} > {{ }}{{at > org.apache.spark.sql.hive.HiveSimpleUDF.dataType(hiveUDFs.scala:77)}} > {{ }}{{at > org.apache.spark.sql.hive.HiveSessionCatalog$$anonfun$makeFunctionExpression$3.apply(HiveSessionCatalog.scala:79)}} > {{ }}{{at > org.apache.spark.sql.hive.Hiv
[jira] [Updated] (SPARK-30260) Spark-Shell throws ClassNotFoundException when more than one statement uses a UDF jar
[ https://issues.apache.org/jira/browse/SPARK-30260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chenliang updated SPARK-30260: -- Description: When we start spark-shell and use the udf for the first statement ,it's ok. But for the other statements it failed to load jar to current classpath and would throw ClassNotFoundException,the problem can be reproduced as described in the below. was:When we start spark-shell and use the udf for the first statement ,it's ok. But for the other statements it failed to load jar to current classpath and would throw > Spark-Shell throw ClassNotFoundException exception for more than one > statement to use UDF jar > - > > Key: SPARK-30260 > URL: https://issues.apache.org/jira/browse/SPARK-30260 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 2.2.0, 2.3.0, 2.4.3, 2.4.4 >Reporter: chenliang >Priority: Major > Fix For: 2.4.3 > > > When we start spark-shell and use the udf for the first statement ,it's ok. > But for the other statements it failed to load jar to current classpath and > would throw ClassNotFoundException,the problem can be reproduced as described > in the below.
[jira] [Updated] (SPARK-30260) Spark-Shell throws ClassNotFoundException when more than one statement uses a UDF jar
[ https://issues.apache.org/jira/browse/SPARK-30260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chenliang updated SPARK-30260: -- Description: When we start spark-shell and use the udf for the first statement ,it's ok. But for the other statements it failed to load jar to current classpath and would throw > Spark-Shell throw ClassNotFoundException exception for more than one > statement to use UDF jar > - > > Key: SPARK-30260 > URL: https://issues.apache.org/jira/browse/SPARK-30260 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 2.2.0, 2.3.0, 2.4.3, 2.4.4 >Reporter: chenliang >Priority: Major > Fix For: 2.4.3 > > > When we start spark-shell and use the udf for the first statement ,it's ok. > But for the other statements it failed to load jar to current classpath and > would throw
[jira] [Created] (SPARK-30260) Spark-Shell throws ClassNotFoundException when more than one statement uses a UDF jar
chenliang created SPARK-30260: - Summary: Spark-Shell throw ClassNotFoundException exception for more than one statement to use UDF jar Key: SPARK-30260 URL: https://issues.apache.org/jira/browse/SPARK-30260 Project: Spark Issue Type: Bug Components: Spark Shell, SQL Affects Versions: 2.4.4, 2.4.3, 2.3.0, 2.2.0 Reporter: chenliang Fix For: 2.4.3
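The failure mode in the SPARK-30260 reports above — the UDF jar is visible to the classloader for the first statement only, so a later statement cannot find the class — can be illustrated with a small stand-alone sketch. This is a hypothetical toy model in plain Python, not Spark's actual classloading code; only the class name `scala.didi.udf.Add` is taken from the report.

```python
# Toy model of the reported behaviour: the UDF's jar is registered with the
# classloader only when the first statement is analyzed; a later statement
# that resolves the function against a fresh loader cannot find the class.
class ToyClassLoader:
    def __init__(self):
        self.known = set()

    def add_jar(self):
        # Pretend the jar contributes this one class (name from the report).
        self.known.add("scala.didi.udf.Add")

    def load(self, name):
        if name not in self.known:
            raise LookupError(f"ClassNotFoundException: {name}")
        return name

first = ToyClassLoader()
first.add_jar()  # first statement: jar added to the current classpath
assert first.load("scala.didi.udf.Add") == "scala.didi.udf.Add"

second = ToyClassLoader()  # later statement: jar is never re-registered
try:
    second.load("scala.didi.udf.Add")
except LookupError as e:
    print(e)  # ClassNotFoundException: scala.didi.udf.Add
```

A fix along these lines would keep the jar registered with (or re-register it on) whatever loader each statement's function resolution uses.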
[jira] [Updated] (SPARK-30259) CREATE TABLE throws an error when the session catalog is specified
[ https://issues.apache.org/jira/browse/SPARK-30259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hu Fuwang updated SPARK-30259: -- Description: Spark throw error when the session catalog is specified explicitly in "CREATE TABLE" and "CREATE TABLE AS SELECT" command, eg. {code:java} CREATE TABLE spark_catalog.tbl USING json AS SELECT 1 AS i; {code} the error message is like below: {noformat} 19/12/14 10:56:08 INFO HiveMetaStore: 0: get_table : db=spark_catalog tbl=tbl 19/12/14 10:56:08 INFO audit: ugi=fuwhu ip=unknown-ip-addr cmd=get_table : db=spark_catalog tbl=tbl 19/12/14 10:56:08 INFO HiveMetaStore: 0: get_database: spark_catalog 19/12/14 10:56:08 INFO audit: ugi=fuwhu ip=unknown-ip-addr cmd=get_database: spark_catalog 19/12/14 10:56:08 WARN ObjectStore: Failed to get database spark_catalog, returning NoSuchObjectException Error in query: Database 'spark_catalog' not found;{noformat} was: Spark throw error when the session catalog is specified explicitly in "CREATE TABLE" and "CREATE TABLE AS SELECT" command, eg. 
{code:java} // code placeholder CREATE TABLE spark_catalog.tbl USING json AS SELECT 1 AS i; {code} the error message is like below: {noformat} 19/12/14 10:56:08 INFO HiveMetaStore: 0: get_table : db=spark_catalog tbl=tbl 19/12/14 10:56:08 INFO audit: ugi=fuwhu ip=unknown-ip-addr cmd=get_table : db=spark_catalog tbl=tbl 19/12/14 10:56:08 INFO HiveMetaStore: 0: get_database: spark_catalog 19/12/14 10:56:08 INFO audit: ugi=fuwhu ip=unknown-ip-addr cmd=get_database: spark_catalog 19/12/14 10:56:08 WARN ObjectStore: Failed to get database spark_catalog, returning NoSuchObjectException Error in query: Database 'spark_catalog' not found;{noformat} > CREATE TABLE throw error when session catalog specified > --- > > Key: SPARK-30259 > URL: https://issues.apache.org/jira/browse/SPARK-30259 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Hu Fuwang >Priority: Major > > Spark throw error when the session catalog is specified explicitly in "CREATE > TABLE" and "CREATE TABLE AS SELECT" command, eg. > {code:java} > CREATE TABLE spark_catalog.tbl USING json AS SELECT 1 AS i; > {code} > the error message is like below: > {noformat} > 19/12/14 10:56:08 INFO HiveMetaStore: 0: get_table : db=spark_catalog tbl=tbl > 19/12/14 10:56:08 INFO audit: ugi=fuwhu ip=unknown-ip-addr cmd=get_table > : db=spark_catalog tbl=tbl > 19/12/14 10:56:08 INFO HiveMetaStore: 0: get_database: spark_catalog > 19/12/14 10:56:08 INFO audit: ugi=fuwhu ip=unknown-ip-addr > cmd=get_database: spark_catalog > 19/12/14 10:56:08 WARN ObjectStore: Failed to get database spark_catalog, > returning NoSuchObjectException > Error in query: Database 'spark_catalog' not found;{noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30259) CREATE TABLE throws an error when the session catalog is specified
[ https://issues.apache.org/jira/browse/SPARK-30259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hu Fuwang updated SPARK-30259: -- Description: Spark throw error when the session catalog is specified explicitly in "CREATE TABLE" and "CREATE TABLE AS SELECT" command, eg. {code:java} // code placeholder CREATE TABLE spark_catalog.tbl USING json AS SELECT 1 AS i; {code} the error message is like below: {noformat} 19/12/14 10:56:08 INFO HiveMetaStore: 0: get_table : db=spark_catalog tbl=tbl 19/12/14 10:56:08 INFO audit: ugi=fuwhu ip=unknown-ip-addr cmd=get_table : db=spark_catalog tbl=tbl 19/12/14 10:56:08 INFO HiveMetaStore: 0: get_database: spark_catalog 19/12/14 10:56:08 INFO audit: ugi=fuwhu ip=unknown-ip-addr cmd=get_database: spark_catalog 19/12/14 10:56:08 WARN ObjectStore: Failed to get database spark_catalog, returning NoSuchObjectException Error in query: Database 'spark_catalog' not found;{noformat} was: Spark throw error when the session catalog is specified explicitly in the CREATE TABLE AS SELECT command, eg. 
{code:java} // code placeholder CREATE TABLE spark_catalog.tbl USING json AS SELECT 1 AS i; {code} the error message is like below: {noformat} 19/12/14 10:56:08 INFO HiveMetaStore: 0: get_table : db=spark_catalog tbl=tbl 19/12/14 10:56:08 INFO audit: ugi=fuwhu ip=unknown-ip-addr cmd=get_table : db=spark_catalog tbl=tbl 19/12/14 10:56:08 INFO HiveMetaStore: 0: get_database: spark_catalog 19/12/14 10:56:08 INFO audit: ugi=fuwhu ip=unknown-ip-addr cmd=get_database: spark_catalog 19/12/14 10:56:08 WARN ObjectStore: Failed to get database spark_catalog, returning NoSuchObjectException Error in query: Database 'spark_catalog' not found;{noformat} > CREATE TABLE throw error when session catalog specified > --- > > Key: SPARK-30259 > URL: https://issues.apache.org/jira/browse/SPARK-30259 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Hu Fuwang >Priority: Major > > Spark throw error when the session catalog is specified explicitly in "CREATE > TABLE" and "CREATE TABLE AS SELECT" command, eg. > {code:java} > // code placeholder > CREATE TABLE spark_catalog.tbl USING json AS SELECT 1 AS i; > {code} > the error message is like below: > {noformat} > 19/12/14 10:56:08 INFO HiveMetaStore: 0: get_table : db=spark_catalog tbl=tbl > 19/12/14 10:56:08 INFO audit: ugi=fuwhu ip=unknown-ip-addr cmd=get_table > : db=spark_catalog tbl=tbl > 19/12/14 10:56:08 INFO HiveMetaStore: 0: get_database: spark_catalog > 19/12/14 10:56:08 INFO audit: ugi=fuwhu ip=unknown-ip-addr > cmd=get_database: spark_catalog > 19/12/14 10:56:08 WARN ObjectStore: Failed to get database spark_catalog, > returning NoSuchObjectException > Error in query: Database 'spark_catalog' not found;{noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30259) CREATE TABLE throws an error when the session catalog is specified
[ https://issues.apache.org/jira/browse/SPARK-30259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hu Fuwang updated SPARK-30259: -- Summary: CREATE TABLE throw error when session catalog specified (was: CREATE TABLE AS SELECT throw error when session catalog specified) > CREATE TABLE throw error when session catalog specified > --- > > Key: SPARK-30259 > URL: https://issues.apache.org/jira/browse/SPARK-30259 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Hu Fuwang >Priority: Major > > Spark throw error when the session catalog is specified explicitly in the > CREATE TABLE AS SELECT command, eg. > {code:java} > // code placeholder > CREATE TABLE spark_catalog.tbl USING json AS SELECT 1 AS i; > {code} > the error message is like below: > {noformat} > 19/12/14 10:56:08 INFO HiveMetaStore: 0: get_table : db=spark_catalog tbl=tbl > 19/12/14 10:56:08 INFO audit: ugi=fuwhu ip=unknown-ip-addr cmd=get_table > : db=spark_catalog tbl=tbl > 19/12/14 10:56:08 INFO HiveMetaStore: 0: get_database: spark_catalog > 19/12/14 10:56:08 INFO audit: ugi=fuwhu ip=unknown-ip-addr > cmd=get_database: spark_catalog > 19/12/14 10:56:08 WARN ObjectStore: Failed to get database spark_catalog, > returning NoSuchObjectException > Error in query: Database 'spark_catalog' not found;{noformat}
[jira] [Updated] (SPARK-30259) CREATE TABLE AS SELECT throws an error when the session catalog is specified
[ https://issues.apache.org/jira/browse/SPARK-30259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hu Fuwang updated SPARK-30259: -- Description: Spark throw error when the session catalog is specified explicitly in the CREATE TABLE AS SELECT command, eg. {code:java} // code placeholder CREATE TABLE spark_catalog.tbl USING json AS SELECT 1 AS i; {code} the error message is like below: {noformat} 19/12/14 10:56:08 INFO HiveMetaStore: 0: get_table : db=spark_catalog tbl=tbl 19/12/14 10:56:08 INFO audit: ugi=fuwhu ip=unknown-ip-addr cmd=get_table : db=spark_catalog tbl=tbl 19/12/14 10:56:08 INFO HiveMetaStore: 0: get_database: spark_catalog 19/12/14 10:56:08 INFO audit: ugi=fuwhu ip=unknown-ip-addr cmd=get_database: spark_catalog 19/12/14 10:56:08 WARN ObjectStore: Failed to get database spark_catalog, returning NoSuchObjectException Error in query: Database 'spark_catalog' not found;{noformat} > CREATE TABLE AS SELECT throw error when session catalog specified > - > > Key: SPARK-30259 > URL: https://issues.apache.org/jira/browse/SPARK-30259 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Hu Fuwang >Priority: Major > > Spark throw error when the session catalog is specified explicitly in the > CREATE TABLE AS SELECT command, eg. 
> > {code:java} > // code placeholder > CREATE TABLE spark_catalog.tbl USING json AS SELECT 1 AS i; > {code} > the error message is like below: > {noformat} > 19/12/14 10:56:08 INFO HiveMetaStore: 0: get_table : db=spark_catalog tbl=tbl > 19/12/14 10:56:08 INFO audit: ugi=fuwhu ip=unknown-ip-addr cmd=get_table > : db=spark_catalog tbl=tbl > 19/12/14 10:56:08 INFO HiveMetaStore: 0: get_database: spark_catalog > 19/12/14 10:56:08 INFO audit: ugi=fuwhu ip=unknown-ip-addr > cmd=get_database: spark_catalog > 19/12/14 10:56:08 WARN ObjectStore: Failed to get database spark_catalog, > returning NoSuchObjectException > Error in query: Database 'spark_catalog' not found;{noformat}
[jira] [Created] (SPARK-30259) CREATE TABLE AS SELECT throws an error when the session catalog is specified
Hu Fuwang created SPARK-30259: - Summary: CREATE TABLE AS SELECT throw error when session catalog specified Key: SPARK-30259 URL: https://issues.apache.org/jira/browse/SPARK-30259 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Hu Fuwang
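The SPARK-30259 reports above come down to name resolution: the explicit session-catalog prefix `spark_catalog` is being interpreted as a database name. A minimal sketch of the intended behavior follows, in plain Python; `SESSION_CATALOG`, `resolve`, and the `default` fallback are hypothetical illustrations, not Spark's actual resolution code.

```python
# Hypothetical sketch: strip an explicit session-catalog prefix from a
# multi-part table identifier before interpreting the remaining parts as
# database.table, so "spark_catalog.tbl" does not trigger a lookup of a
# database literally named "spark_catalog".
SESSION_CATALOG = "spark_catalog"

def resolve(name, current_db="default"):
    parts = name.split(".")
    if parts and parts[0] == SESSION_CATALOG:
        parts = parts[1:]          # drop the catalog prefix
    if len(parts) == 1:
        return current_db, parts[0]  # bare table name: use current database
    return parts[0], parts[1]        # db.table

assert resolve("spark_catalog.tbl") == ("default", "tbl")  # no bogus database lookup
assert resolve("db1.tbl") == ("db1", "tbl")
```

With resolution like this, `CREATE TABLE spark_catalog.tbl ...` targets table `tbl` in the current database instead of failing with `Database 'spark_catalog' not found`.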
[jira] [Updated] (SPARK-30190) HistoryServerDiskManager will fail on appStoreDir in s3
[ https://issues.apache.org/jira/browse/SPARK-30190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30190: -- Affects Version/s: (was: 2.4.4) 3.0.0 > HistoryServerDiskManager will fail on appStoreDir in s3 > --- > > Key: SPARK-30190 > URL: https://issues.apache.org/jira/browse/SPARK-30190 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: thierry accart >Priority: Major > > Hi > While setting spark.eventLog.dir to s3a://... I realized that it *requires > destination directory to preexists for S3* > This is explained I think in HistoryServerDiskManager's appStoreDir: it tries > check if directory exists or can be created > {{if (!appStoreDir.isDirectory() && !appStoreDir.mkdir()) \{throw new > IllegalArgumentException(s"Failed to create app directory ($appStoreDir).")}}} > But in S3, a directory does not exists and cannot be created: directories > don't exists by themselves, they are only materialized due to existence of > objects. > Before proposing a patch, I wanted to know what are the prefered options : > should we have a spark option to skip the appStoreDir test, or skip it only > when a particular scheme is set, have a custom implementation of > HistoryServerDiskManager ...? > > _Note for people facing the {{IllegalArgumentException:}} {{Failed to create > app directory}} *you just have to put an empty file in bucket destination > 'path'*._
[jira] [Commented] (SPARK-30218) Columns used in inequality conditions for joins not resolved correctly in case of common lineage
[ https://issues.apache.org/jira/browse/SPARK-30218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16996084#comment-16996084 ] Dongjoon Hyun commented on SPARK-30218: --- Thank you for reporting with the investigation result, [~FC]. > Columns used in inequality conditions for joins not resolved correctly in > case of common lineage > > > Key: SPARK-30218 > URL: https://issues.apache.org/jira/browse/SPARK-30218 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.4, 2.4.4 >Reporter: Francesco Cavrini >Priority: Major > Labels: correctness > > When columns from different data-frames that have a common lineage are used > in inequality conditions in joins, they are not resolved correctly. In > particular, both the column from the left DF and the one from the right DF > are resolved to the same column, thus making the inequality condition either > always satisfied or always not-satisfied. > Minimal example to reproduce follows. > {code:python} > import pyspark.sql.functions as F > data = spark.createDataFrame([["id1", "A", 0], ["id1", "A", 1], ["id2", "A", > 2], ["id2", "A", 3], ["id1", "B", 1] , ["id1", "B", 5], ["id2", "B", 10]], > ["id", "kind", "timestamp"]) > df_left = data.where(F.col("kind") == "A").alias("left") > df_right = data.where(F.col("kind") == "B").alias("right") > conds = [df_left["id"] == df_right["id"]] > conds.append(df_right["timestamp"].between(df_left["timestamp"], > df_left["timestamp"] + 2)) > res = df_left.join(df_right, conds, how="left") > {code} > The result is: > | id|kind|timestamp| id|kind|timestamp| > |id1| A|0|id1| B|1| > |id1| A|0|id1| B|5| > |id1| A|1|id1| B|1| > |id1| A|1|id1| B|5| > |id2| A|2|id2| B| 10| > |id2| A|3|id2| B| 10| > which violates the condition that the timestamp from the right DF should be > between df_left["timestamp"] and df_left["timestamp"] + 2. > The plan shows the problem in the column resolution. 
> {code:bash} > == Parsed Logical Plan == > Join LeftOuter, ((id#0 = id#36) && ((timestamp#2L >= timestamp#2L) && > (timestamp#2L <= (timestamp#2L + cast(2 as bigint) > :- SubqueryAlias `left` > : +- Filter (kind#1 = A) > : +- LogicalRDD [id#0, kind#1, timestamp#2L], false > +- SubqueryAlias `right` >+- Filter (kind#37 = B) > +- LogicalRDD [id#36, kind#37, timestamp#38L], false > {code} > Note, the columns used in the equality condition of the join have been > correctly resolved. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
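Until the resolution bug is fixed, one commonly suggested workaround (an assumption on my part, not something proposed in this thread) is to express the join condition as a SQL string against the DataFrame aliases, e.g. via `F.expr(...)`, so that columns are resolved by alias name rather than through `Column` objects sharing a common lineage. A small sketch of building such a condition; the helper name is made up:

```python
def between_cond(left_alias: str, right_alias: str, col: str, window: int) -> str:
    """Build an alias-qualified join condition string, avoiding Column objects
    that the left and right DataFrames share through a common lineage."""
    return (f"{left_alias}.id = {right_alias}.id AND "
            f"{right_alias}.{col} BETWEEN {left_alias}.{col} "
            f"AND {left_alias}.{col} + {window}")

# Usage with Spark (not executed here):
#   import pyspark.sql.functions as F
#   res = df_left.join(df_right,
#                      F.expr(between_cond("left", "right", "timestamp", 2)),
#                      how="left")
```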
[jira] [Comment Edited] (SPARK-23015) spark-submit fails when submitting several jobs in parallel
[ https://issues.apache.org/jira/browse/SPARK-23015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16996082#comment-16996082 ] Evgenii edited comment on SPARK-23015 at 12/14/19 2:25 AM: --- We invoke spark-submit from Java code in parallel too was (Author: lartcev): We invoke it from Java code in parallel too. > spark-submit fails when submitting several jobs in parallel > --- > > Key: SPARK-23015 > URL: https://issues.apache.org/jira/browse/SPARK-23015 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.2.0, 2.2.1 > Environment: Windows 10 (1709/16299.125) > Spark 2.3.0 > Java 8, Update 151 >Reporter: Hugh Zabriskie >Priority: Major > > Spark Submit's launching library prints the command to execute the launcher > (org.apache.spark.launcher.main) to a temporary text file, reads the result > back into a variable, and then executes that command. > {code} > set LAUNCHER_OUTPUT=%temp%\spark-class-launcher-output-%RANDOM%.txt > "%RUNNER%" -Xmx128m -cp "%LAUNCH_CLASSPATH%" org.apache.spark.launcher.Main > %* > %LAUNCHER_OUTPUT% > {code} > [bin/spark-class2.cmd, > L67|https://github.com/apache/spark/blob/master/bin/spark-class2.cmd#L66] > That temporary text file is given a pseudo-random name by the %RANDOM% env > variable generator, which generates a number between 0 and 32767. > This appears to be the cause of an error occurring when several spark-submit > jobs are launched simultaneously. The following error is returned from stderr: > {quote}The process cannot access the file because it is being used by another > process. The system cannot find the file > USER/AppData/Local/Temp/spark-class-launcher-output-RANDOM.txt. 
> The process cannot access the file because it is being used by another > process.{quote} > My hypothesis is that %RANDOM% is returning the same value for multiple jobs, > causing the launcher library to attempt to write to the same file from > multiple processes. Another mechanism is needed for reliably generating the > names of the temporary files so that the concurrency issue is resolved. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
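The reporter's hypothesis is easy to sanity-check with birthday-problem arithmetic: with names drawn uniformly from a space of only 32768 values, collisions between parallel jobs become likely quickly. A small sketch in plain Python, independent of the batch script:

```python
def collision_probability(n_jobs: int, space: int = 32768) -> float:
    """P(at least two of n_jobs draw the same %RANDOM% value)."""
    p_no_collision = 1.0
    for i in range(n_jobs):
        p_no_collision *= (space - i) / space  # i names already taken
    return 1.0 - p_no_collision
```

Even a few hundred simultaneous submissions give a very high collision chance, which is why widening the name space (or using a truly unique name) fixes the symptom.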
[jira] [Commented] (SPARK-23015) spark-submit fails when submitting several jobs in parallel
[ https://issues.apache.org/jira/browse/SPARK-23015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16996082#comment-16996082 ] Evgenii commented on SPARK-23015: - We invoke it from Java code in parallel too. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23015) spark-submit fails when submitting several jobs in parallel
[ https://issues.apache.org/jira/browse/SPARK-23015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16996080#comment-16996080 ] Evgenii edited comment on SPARK-23015 at 12/14/19 2:23 AM: --- Guys, why not invoke %RANDOM% multiple times? Just change the spark-class2.cmd set LAUNCHER_OUTPUT=%temp%\spark-class-launcher-output-%RANDOM%.txt to set LAUNCHER_OUTPUT=%temp%\spark-class-launcher-output-%RANDOM%%RANDOM%%RANDOM%.txt was (Author: lartcev): Guys, why not invoke %RANDOM% multiple times? Just change the spark-class2.cmd set LAUNCHER_OUTPUT=%temp%\spark-class-launcher-output-%RANDOM%.txt to set LAUNCHER_OUTPUT=%temp%\spark-class-launcher-output-%RANDOM%-%RANDOM%-%RANDOM%.txt -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23015) spark-submit fails when submitting several jobs in parallel
[ https://issues.apache.org/jira/browse/SPARK-23015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16996080#comment-16996080 ] Evgenii commented on SPARK-23015: - Guys, why not invoke %RANDOM% multiple times? Just change the spark-class2.cmd set LAUNCHER_OUTPUT=%temp%\spark-class-launcher-output-%RANDOM%.txt to set LAUNCHER_OUTPUT=%temp%\spark-class-launcher-output-%RANDOM%-%RANDOM%-%RANDOM%.txt -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30218) Columns used in inequality conditions for joins not resolved correctly in case of common lineage
[ https://issues.apache.org/jira/browse/SPARK-30218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30218: -- Affects Version/s: (was: 2.4.3) (was: 2.4.2) (was: 2.4.1) 2.3.4 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30218) Columns used in inequality conditions for joins not resolved correctly in case of common lineage
[ https://issues.apache.org/jira/browse/SPARK-30218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30218: -- Affects Version/s: 2.4.2 2.4.3 2.4.4 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-17398) Failed to query on external JSon Partitioned table
[ https://issues.apache.org/jira/browse/SPARK-17398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wing Yew Poon reopened SPARK-17398: --- This issue was never actually fixed. Evidently the problem still exists. I'll create a PR with a fix. > Failed to query on external JSON partitioned table > -- > > Key: SPARK-17398 > URL: https://issues.apache.org/jira/browse/SPARK-17398 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: pin_zhang >Priority: Major > Fix For: 2.0.1 > > Attachments: screenshot-1.png > > > 1. Create an external JSON partitioned table > with the SerDe in hive-hcatalog-core-1.2.1.jar, downloaded from > https://mvnrepository.com/artifact/org.apache.hive.hcatalog/hive-hcatalog-core/1.2.1 > 2. Querying the table hits an exception, which worked in Spark 1.5.2 > Exception in thread "main" org.apache.spark.SparkException: Job aborted due > to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: > Lost task > 0.0 in stage 1.0 (TID 1, localhost): java.lang.ClassCastException: > java.util.ArrayList cannot be cast to org.apache.hive.hcatalog.data.HCatRecord > at > org.apache.hive.hcatalog.data.HCatRecordObjectInspector.getStructFieldData(HCatRecordObjectInspector.java:45) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:430) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:426) > > 3.
Test Code > import org.apache.spark.SparkConf > import org.apache.spark.SparkContext > import org.apache.spark.sql.hive.HiveContext > object JsonBugs { > def main(args: Array[String]): Unit = { > val table = "test_json" > val location = "file:///g:/home/test/json" > val create = s"""CREATE EXTERNAL TABLE ${table} > (id string, seq string ) > PARTITIONED BY(index int) > ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' > LOCATION "${location}" > """ > val add_part = s""" > ALTER TABLE ${table} ADD > PARTITION (index=1)LOCATION '${location}/index=1' > """ > val conf = new SparkConf().setAppName("scala").setMaster("local[2]") > conf.set("spark.sql.warehouse.dir", "file:///g:/home/warehouse") > val ctx = new SparkContext(conf) > val hctx = new HiveContext(ctx) > val exist = hctx.tableNames().map { x => x.toLowerCase() }.contains(table) > if (!exist) { > hctx.sql(create) > hctx.sql(add_part) > } else { > hctx.sql("show partitions " + table).show() > } > hctx.sql("select * from test_json").show() > } > } -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30194) Re-enable checkstyle for Java
[ https://issues.apache.org/jira/browse/SPARK-30194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-30194. --- Resolution: Won't Do > Re-enable checkstyle for Java > - > > Key: SPARK-30194 > URL: https://issues.apache.org/jira/browse/SPARK-30194 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-30232) Fix the ArithmeticException caused by division by zero when AQE is enabled
[ https://issues.apache.org/jira/browse/SPARK-30232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-30232. - > Fix the ArithmeticException caused by division by zero when AQE is enabled > -- > > Key: SPARK-30232 > URL: https://issues.apache.org/jira/browse/SPARK-30232 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ke Jia >Priority: Major > > Add a check for the divisor to avoid an ArithmeticException caused by division by zero. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
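The described fix, guarding the divisor, amounts to a one-line check. A hedged sketch with hypothetical names, not Spark's actual AQE code:

```python
def safe_partition_count(total_bytes: int, target_bytes: int) -> int:
    """Compute ceil(total_bytes / target_bytes), guarding the divisor so a
    zero (or negative) target never raises a division-by-zero error."""
    if target_bytes <= 0:
        return 1  # fall back to a single partition instead of dividing by zero
    return max(1, -(-total_bytes // target_bytes))  # ceiling division
```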
[jira] [Commented] (SPARK-30212) COUNT(DISTINCT) window function should be supported
[ https://issues.apache.org/jira/browse/SPARK-30212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16996067#comment-16996067 ] Dongjoon Hyun commented on SPARK-30212: --- Thank you for filing a JIRA, @Kernel Force. > COUNT(DISTINCT) window function should be supported > --- > > Key: SPARK-30212 > URL: https://issues.apache.org/jira/browse/SPARK-30212 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.4 > Environment: Spark 2.4.4 > Scala 2.11.12 > Hive 2.3.6 >Reporter: Kernel Force >Priority: Major > > Suppose we have a typical table in Hive like below: > {code:sql} > CREATE TABLE DEMO_COUNT_DISTINCT ( > demo_date string, > demo_id string > ); > {code} > {noformat} > ++--+ > | demo_count_distinct.demo_date | demo_count_distinct.demo_id | > ++--+ > | 20180301 | 101 | > | 20180301 | 102 | > | 20180301 | 103 | > | 20180401 | 201 | > | 20180401 | 202 | > ++--+ > {noformat} > Now I want to count the distinct number of DEMO_DATE values while also preserving > every column's data in each row. > So I use the COUNT(DISTINCT) window function like below in Hive beeline and it > works: > {code:sql} > SELECT T.*, COUNT(DISTINCT T.DEMO_DATE) OVER(PARTITION BY NULL) UNIQ_DATES > FROM DEMO_COUNT_DISTINCT T; > {code} > {noformat} > +--++-+ > | t.demo_date | t.demo_id | uniq_dates | > +--++-+ > | 20180401 | 202 | 2 | > | 20180401 | 201 | 2 | > | 20180301 | 103 | 2 | > | 20180301 | 102 | 2 | > | 20180301 | 101 | 2 | > +--++-+ > {noformat} > But when I came to Spark SQL, it threw an exception even though I ran the same SQL. 
> {code:sql} > spark.sql(""" > SELECT T.*, COUNT(DISTINCT T.DEMO_DATE) OVER(PARTITION BY NULL) UNIQ_DATES > FROM DEMO_COUNT_DISTINCT T > """).show > {code} > {noformat} > org.apache.spark.sql.AnalysisException: Distinct window functions are not > supported: count(distinct DEMO_DATE#1) windowspecdefinition(null, > specifiedwindowframe(RowFrame, unboundedpreceding$(), > unboundedfollowing$()));; > Project [demo_date#1, demo_id#2, UNIQ_DATES#0L] > +- Project [demo_date#1, demo_id#2, UNIQ_DATES#0L, UNIQ_DATES#0L] > +- Window [count(distinct DEMO_DATE#1) windowspecdefinition(null, > specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) > AS UNIQ_DATES#0L], [null] > +- Project [demo_date#1, demo_id#2] > +- SubqueryAlias `T` > +- SubqueryAlias `default`.`demo_count_distinct` > +- HiveTableRelation `default`.`demo_count_distinct`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [demo_date#1, demo_id#2] > {noformat} > Then I tried to use the countDistinct function but also got an exception. > {code:sql} > spark.sql(""" > SELECT T.*, countDistinct(T.DEMO_DATE) OVER(PARTITION BY NULL) UNIQ_DATES > FROM DEMO_COUNT_DISTINCT T > """).show > {code} > {noformat} > org.apache.spark.sql.AnalysisException: Undefined function: 'countDistinct'. > This function is neither a registered temporary function nor a permanent > function registered in the database 'default'.; line 2 pos 12 > at > org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15$$anonfun$applyOrElse$49.apply(Analyzer.scala:1279) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15$$anonfun$applyOrElse$49.apply(Analyzer.scala:1279) > at > org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:53) > .. > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
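A workaround often suggested for this limitation (an assumption on my part, not something from this thread) is to rewrite COUNT(DISTINCT x) OVER (...) as SIZE(COLLECT_SET(x)) OVER (...), since collect_set is accepted as a window aggregate in Spark SQL. A minimal sketch of that rewrite as a string transformation:

```python
import re

def rewrite_count_distinct(sql: str) -> str:
    """Rewrite COUNT(DISTINCT col) OVER (...) into SIZE(COLLECT_SET(col)) OVER (...),
    a form Spark's window functions accept. Purely textual; a real rewrite
    would work on the parsed plan instead."""
    return re.sub(r"COUNT\(\s*DISTINCT\s+([^)]+)\)\s*(OVER)",
                  r"SIZE(COLLECT_SET(\1)) \2",
                  sql,
                  flags=re.IGNORECASE)
```

The rewritten query can then be passed to spark.sql(...) unchanged otherwise.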
[jira] [Updated] (SPARK-30212) COUNT(DISTINCT) window function should be supported
[ https://issues.apache.org/jira/browse/SPARK-30212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30212: -- Affects Version/s: (was: 2.4.4) 3.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30212) COUNT(DISTINCT) window function should be supported
[ https://issues.apache.org/jira/browse/SPARK-30212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30212: -- Issue Type: Improvement (was: Bug) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30212) COUNT(DISTINCT) window function should be supported
[ https://issues.apache.org/jira/browse/SPARK-30212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30212: -- Labels: (was: SQL distinct window_function)
[jira] [Commented] (SPARK-30181) Throws runtime exception when filter metastore partition key that's not string type or integral types
[ https://issues.apache.org/jira/browse/SPARK-30181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16996065#comment-16996065 ] L. C. Hsieh commented on SPARK-30181: - This should be fixed by SPARK-30238. > Throws runtime exception when filter metastore partition key that's not > string type or integral types > - > > Key: SPARK-30181 > URL: https://issues.apache.org/jira/browse/SPARK-30181 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4 >Reporter: Yu-Jhe Li >Priority: Major > > The SQL below has thrown a runtime exception since Spark 2.4.0. I think it's a > bug introduced by SPARK-22384. > {code:scala} > spark.sql("CREATE TABLE timestamp_part (value INT) PARTITIONED BY (dt > TIMESTAMP)") > val df = Seq( > (1, java.sql.Timestamp.valueOf("2019-12-01 00:00:00"), 1), > (2, java.sql.Timestamp.valueOf("2019-12-01 01:00:00"), 1) > ).toDF("id", "dt", "value") > df.write.partitionBy("dt").mode("overwrite").saveAsTable("timestamp_part") > spark.sql("select * from timestamp_part where dt >= '2019-12-01 > 00:00:00'").explain(true) > {code} > {noformat} > Caught Hive MetaException attempting to get partition metadata by filter from > Hive. You can set the Spark configuration setting > spark.sql.hive.manageFilesourcePartitions to false to work around this > problem, however this will result in degraded performance. Please report a > bug: https://issues.apache.org/jira/browse/SPARK > java.lang.RuntimeException: Caught Hive MetaException attempting to get > partition metadata by filter from Hive. You can set the Spark configuration > setting spark.sql.hive.manageFilesourcePartitions to false to work around > this problem, however this will result in degraded performance. 
Please report > a bug: https://issues.apache.org/jira/browse/SPARK > at > org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:774) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:679) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:677) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258) > at > org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:677) > at > org.apache.spark.sql.hive.client.HiveClientSuite.testMetastorePartitionFiltering(HiveClientSuite.scala:310) > at > org.apache.spark.sql.hive.client.HiveClientSuite.org$apache$spark$sql$hive$client$HiveClientSuite$$testMetastorePartitionFiltering(HiveClientSuite.scala:282) > at > org.apache.spark.sql.hive.client.HiveClientSuite$$anonfun$1.apply$mcV$sp(HiveClientSuite.scala:105) > at > org.apache.spark.sql.hive.client.HiveClientSuite$$anonfun$1.apply(HiveClientSuite.scala:105) > at > org.apache.spark.sql.hive.client.HiveClientSuite$$anonfun$1.apply(HiveClientSuite.scala:105) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:103) > at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183) > at > 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289) > at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196) > at org.scalatest.FunSuite.runTest(FunSuite.scala:1560) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384) > at scala.collection.immutable.List.foreach(List.scala:392) > at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384) > at > org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379) >
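Until that fix lands, the exception text above already names a session-level escape hatch. A hedged sketch (it trades away metastore partition pruning, as the message warns, and this config is typically set when the session is built rather than at runtime):

{code:python}
from pyspark.sql import SparkSession

# Grounded in the error message above: stop asking the Hive metastore to
# filter partitions, at the cost of listing all of them.
spark = (SparkSession.builder
         .config("spark.sql.hive.manageFilesourcePartitions", "false")
         .enableHiveSupport()
         .getOrCreate())
{code}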
[jira] [Commented] (SPARK-30257) Mapping simpleString to Spark SQL types
[ https://issues.apache.org/jira/browse/SPARK-30257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16996064#comment-16996064 ] Dongjoon Hyun commented on SPARK-30257: --- Hi, [~svanhooser]. I triggered the full Spark Jenkins test on your PR. > Mapping simpleString to Spark SQL types > --- > > Key: SPARK-30257 > URL: https://issues.apache.org/jira/browse/SPARK-30257 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Shelby Vanhooser >Priority: Major > > The PySpark mapping from simpleString to Spark SQL types is too manual right > now; instead, pyspark.sql.types should expose a method that maps the > simpleString representation of these types to the underlying Spark SQL ones. > > Tracked here: [https://github.com/apache/spark/pull/26884]
[jira] [Updated] (SPARK-30257) Mapping simpleString to Spark SQL types
[ https://issues.apache.org/jira/browse/SPARK-30257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30257: -- Affects Version/s: (was: 2.4.4) 3.0.0
[jira] [Updated] (SPARK-30257) Mapping simpleString to Spark SQL types
[ https://issues.apache.org/jira/browse/SPARK-30257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30257: -- Priority: Major (was: Critical)
[jira] [Updated] (SPARK-30257) Mapping simpleString to Spark SQL types
[ https://issues.apache.org/jira/browse/SPARK-30257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30257: -- Labels: (was: PySpark feature)
[jira] [Commented] (SPARK-30242) Support reading Parquet files from Stream Buffer
[ https://issues.apache.org/jira/browse/SPARK-30242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16996046#comment-16996046 ] Jelther Oliveira Gonçalves commented on SPARK-30242: Hi [~dongjoon], thanks for the update. I see you have already changed it. Thanks. > Support reading Parquet files from Stream Buffer > > > Key: SPARK-30242 > URL: https://issues.apache.org/jira/browse/SPARK-30242 > Project: Spark > Issue Type: Wish > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Jelther Oliveira Gonçalves >Priority: Trivial > > Reading a Parquet file from a Python BytesIO buffer is not possible using PySpark. > Using: > > {code:java} > from io import BytesIO > parquetbytes: bytes = b'PAR...' > df = spark.read.format("parquet").load(BytesIO(parquetbytes)) > {code} > Raises: > {code:java} > java.lang.ClassCastException: java.util.ArrayList cannot be cast to > java.lang.String{code} > > Is there any chance this will be available in the future? >
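One workaround sketch (a hypothetical helper, not part of any Spark API): since Spark's DataFrameReader takes paths rather than Python file objects, spill the buffer to a local temp file and load that path instead.

```python
import tempfile

def bytes_to_local_file(data: bytes, suffix: str = ".parquet") -> str:
    """Write an in-memory buffer to a local temp file and return its path,
    so it can be handed to an API that only accepts paths."""
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as f:
        f.write(data)
        return f.name

# Usage sketch, assuming an active SparkSession named `spark`:
# df = spark.read.parquet(bytes_to_local_file(parquetbytes))
```

On a cluster the temp file must be visible to whichever processes read it (for example via a shared filesystem), which is part of why a true buffer-reading API would be nicer.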
[jira] [Updated] (SPARK-30239) Creating a dataframe with Pandas rather than Numpy datatypes fails
[ https://issues.apache.org/jira/browse/SPARK-30239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30239: -- Summary: Creating a dataframe with Pandas rather than Numpy datatypes fails (was: [Python] Creating a dataframe with Pandas rather than Numpy datatypes fails) > Creating a dataframe with Pandas rather than Numpy datatypes fails > -- > > Key: SPARK-30239 > URL: https://issues.apache.org/jira/browse/SPARK-30239 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.3 > Environment: DataBricks: 48.00 GB | 24 Cores | DBR 6.0 | Spark 2.4.3 > | Scala 2.11 >Reporter: Philip Kahn >Priority: Minor > > It's possible to work with DataFrames in Pandas and shuffle them back over to > Spark dataframes for processing; however, using Pandas extended datatypes > like {{Int64 }}( > [https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html] ) > throws an error (that long / float can't be converted). > This is internally because {{np.nan}} is a float, and {{pd.Int64DType()}} > allows only integers except for the single float value {{np.nan}}. > > The current workaround for this is to use the columns as floats, and after > conversion to the Spark DataFrame, to recast the column as {{LongType()}}. > For example: > > {{sdfC = spark.createDataFrame(kgridCLinked)}} > {{sdfC = sdfC.withColumn("gridID", sdfC["gridID"].cast(LongType()))}} > > However, this is awkward and redundant. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
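The root cause called out above, that {{np.nan}} is a float, can be checked with plain Python (no Spark or NumPy needed, since {{np.nan}} is an ordinary IEEE-754 float NaN):

```python
nan = float("nan")  # np.nan is this same kind of value: a Python float NaN

print(isinstance(nan, float))  # True: the missing-value sentinel is a float
print(nan != nan)              # True: NaN compares unequal to itself
```

This is why a pandas nullable-integer column trips up createDataFrame: every value is an integer except the sentinel for missing data, which is a float.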
[jira] [Updated] (SPARK-30242) Support reading Parquet files from Stream Buffer
[ https://issues.apache.org/jira/browse/SPARK-30242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30242: -- Component/s: (was: SQL) PySpark
[jira] [Updated] (SPARK-30239) [Python] Creating a dataframe with Pandas rather than Numpy datatypes fails
[ https://issues.apache.org/jira/browse/SPARK-30239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30239: -- Labels: (was: easyfix)
[jira] [Commented] (SPARK-30242) Support reading Parquet files from Stream Buffer
[ https://issues.apache.org/jira/browse/SPARK-30242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16996045#comment-16996045 ] Dongjoon Hyun commented on SPARK-30242: --- Hi, [~jetolgon]. Thank you for the suggestion. For a new feature, you need to set the next version of the master branch. As of today, that's 3.0.0.
[jira] [Updated] (SPARK-30242) Support reading Parquet files from Stream Buffer
[ https://issues.apache.org/jira/browse/SPARK-30242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30242: -- Affects Version/s: (was: 2.4.4) 3.0.0
[jira] [Updated] (SPARK-30242) Support reading Parquet files from Stream Buffer
[ https://issues.apache.org/jira/browse/SPARK-30242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30242: -- Component/s: (was: Spark Core) SQL
[jira] [Commented] (SPARK-30249) Invalid Column Names in parquet tables should not be allowed
[ https://issues.apache.org/jira/browse/SPARK-30249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16996044#comment-16996044 ] Dongjoon Hyun commented on SPARK-30249: --- I believe it's prevented because the ORC format doesn't support that. When you use those columns in a Parquet file, does the Parquet table work incorrectly? I didn't test it, but it might be valid in the Parquet file format. > Invalid Column Names in parquet tables should not be allowed > > > Key: SPARK-30249 > URL: https://issues.apache.org/jira/browse/SPARK-30249 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Rakesh Raushan >Priority: Minor > > Column names such as `a:b`, `??`, `,,`, `^^`, `++` etc. are allowed when we > create Parquet tables, whereas when we create tables with ORC all such column names are marked > as invalid and an AnalysisException is thrown. > These column names should not be allowed for Parquet tables either. > This also introduces an inconsistency in column naming between Parquet and ORC.
[jira] [Resolved] (SPARK-30250) SparkQL div is undocumented
[ https://issues.apache.org/jira/browse/SPARK-30250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-30250. --- Resolution: Duplicate Thank you for filing a JIRA, [~michaelchirico]. SPARK-16323 added it at 3.0.0. - [https://github.com/apache/spark/commit/553af22f2c8ecdc039c8d06431564b1432e60d2d] And it's documented in the 3.0.0-preview docs: https://spark.apache.org/docs/3.0.0-preview/api/sql/#div > SparkQL div is undocumented > --- > > Key: SPARK-30250 > URL: https://issues.apache.org/jira/browse/SPARK-30250 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: Michael Chirico >Priority: Minor > > https://issues.apache.org/jira/browse/SPARK-15407 > mentions the div operator in SparkQL. > However, it's undocumented in the SQL API docs: > https://spark.apache.org/docs/latest/api/sql/index.html > It's documented in the HiveQL docs: > https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
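For reference, the behavior the linked 3.0.0-preview function list describes is integral division (the result is cast to long), so:

{code:sql}
SELECT 3 div 2;  -- returns 1, not 1.5
{code}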
[jira] [Updated] (SPARK-30250) SparkQL div is undocumented
[ https://issues.apache.org/jira/browse/SPARK-30250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30250: -- Affects Version/s: (was: 2.4.4) 3.0.0
[jira] [Updated] (SPARK-28264) Revisiting Python / pandas UDF
[ https://issues.apache.org/jira/browse/SPARK-28264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-28264: Priority: Blocker (was: Critical) > Revisiting Python / pandas UDF > -- > > Key: SPARK-28264 > URL: https://issues.apache.org/jira/browse/SPARK-28264 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Reynold Xin >Assignee: Reynold Xin >Priority: Blocker > > In the past two years, the pandas UDFs are perhaps the most important changes > to Spark for Python data science. However, these functionalities have evolved > organically, leading to some inconsistencies and confusions among users. This > document revisits UDF definition and naming, as a result of discussions among > Xiangrui, Li Jin, Hyukjin, and Reynold. > > See document here: > [https://docs.google.com/document/d/10Pkl-rqygGao2xQf6sddt0b-4FYK4g8qr_bXLKTL65A/edit#|https://docs.google.com/document/d/10Pkl-rqygGao2xQf6sddt0b-4FYK4g8qr_bXLKTL65A/edit] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-30251) faster way to read csv.gz?
[ https://issues.apache.org/jira/browse/SPARK-30251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-30251. - > faster way to read csv.gz? > -- > > Key: SPARK-30251 > URL: https://issues.apache.org/jira/browse/SPARK-30251 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: t oo >Priority: Major > > Some data providers give files in csv.gz (i.e. 1gb compressed which is 25gb > uncompressed; or 5gb compressed which is 130gb uncompressed; or 0.1gb compressed > which is 2.5gb uncompressed). Now when I tell my boss that the famous big data > tool Spark takes 16hrs to convert the 1gb compressed file into parquet, there > is a look of shock. This is batch data we receive daily (80gb compressed, 2tb > uncompressed every day, spread across ~300 files). > I know gz is not splittable, so it ends up loaded on a single worker. But we > don't have the space or patience to do a pre-conversion to bz2 or uncompressed. Can > Spark have a better codec? I saw posts mentioning that even plain Python is faster than > Spark: > > [https://stackoverflow.com/questions/40492967/dealing-with-a-large-gzipped-file-in-spark] > [https://github.com/nielsbasjes/splittablegzip] > > 
[jira] [Resolved] (SPARK-30251) faster way to read csv.gz?
[ https://issues.apache.org/jira/browse/SPARK-30251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-30251. --- Resolution: Invalid Hi, [~toopt4]. Sorry, but Jira is not for Q&A. You had better send an email to the dev list.
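For anyone who lands on this thread from search: since a gzip file is decompressed by a single task, one common mitigation (a sketch with placeholder paths, not advice from this thread) is to read the file as raw text, repartition, and parse afterwards, so only the decompression stays single-threaded while the CSV parsing and the parquet write fan out across the cluster:

{code:python}
# Assumes an active SparkSession named `spark`; paths are placeholders,
# and header/schema handling is elided for brevity.
lines = spark.read.text("/data/in/big.csv.gz").repartition(200)  # one task reads the gzip
df = spark.read.csv(lines.rdd.map(lambda r: r.value))            # parsing now runs in parallel
df.write.mode("overwrite").parquet("/data/out/big_parquet/")
{code}

The splittablegzip codec linked above attacks the read itself; this sketch only parallelizes the work after the read.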
[jira] [Resolved] (SPARK-30143) StreamingQuery.stop() should not block indefinitely
[ https://issues.apache.org/jira/browse/SPARK-30143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz resolved SPARK-30143. - Fix Version/s: 3.0.0 Resolution: Done Resolved as part of [https://github.com/apache/spark/pull/26771] > StreamingQuery.stop() should not block indefinitely > --- > > Key: SPARK-30143 > URL: https://issues.apache.org/jira/browse/SPARK-30143 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.4.4 >Reporter: Burak Yavuz >Assignee: Burak Yavuz >Priority: Major > Fix For: 3.0.0 > > > The stop() method on a Streaming Query awaits the termination of the stream > execution thread. However, the stream execution thread may block forever > depending on the streaming source implementation (like in Kafka, which runs > UninterruptibleThreads). > This causes control flow applications to hang indefinitely as well. We'd like > to introduce a timeout to stop the execution thread, so that the control flow > thread can decide to do an action if a timeout is hit. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
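Until a built-in timeout exists, the control-flow side can protect itself with a generic pattern (a plain-Python sketch, independent of whatever API this ticket eventually added): run the potentially blocking stop() on a helper thread and give up after a deadline.

```python
import threading

def stop_with_timeout(stop_fn, timeout_s):
    """Invoke a potentially blocking stop_fn on a daemon thread.
    Returns True if it returned within timeout_s seconds, False if it
    is still blocked (so the caller can decide what to do next)."""
    t = threading.Thread(target=stop_fn, daemon=True)
    t.start()
    t.join(timeout_s)
    return not t.is_alive()

# Usage sketch: finished = stop_with_timeout(query.stop, 30.0)
```

The daemon flag matters: if stop_fn never returns, the helper thread must not keep the whole process alive at shutdown.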
[jira] [Assigned] (SPARK-30143) StreamingQuery.stop() should not block indefinitely
[ https://issues.apache.org/jira/browse/SPARK-30143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz reassigned SPARK-30143: --- Assignee: Burak Yavuz
[jira] [Resolved] (SPARK-30167) Log4j configuration for REPL can't override the root logger properly.
[ https://issues.apache.org/jira/browse/SPARK-30167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin resolved SPARK-30167. Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26798 [https://github.com/apache/spark/pull/26798] > Log4j configuration for REPL can't override the root logger properly. > - > > Key: SPARK-30167 > URL: https://issues.apache.org/jira/browse/SPARK-30167 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 3.0.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > Fix For: 3.0.0 > > > SPARK-11929 enabled REPL's log4j configuration to override root logger but > SPARK-26753 seems to have broken the feature. > You can see one example when you modifies the default log4j configuration > like as follows. > {code:java} > # Change the log level for rootCategory to DEBUG > log4j.rootCategory=DEBUG, console > ... > # The log level for repl.Main remains WARN > log4j.logger.org.apache.spark.repl.Main=WARN{code} > If you launch REPL with the configuration, INFO level logs appear even though > the log level for REPL is WARN. > {code:java} > ・・・ > 19/12/08 23:31:38 INFO Utils: Successfully started service 'sparkDriver' on > port 33083. > 19/12/08 23:31:38 INFO SparkEnv: Registering MapOutputTracker > 19/12/08 23:31:38 INFO SparkEnv: Registering BlockManagerMaster > 19/12/08 23:31:38 INFO BlockManagerMasterEndpoint: Using > org.apache.spark.storage.DefaultTopologyMapper for getting topology > information > 19/12/08 23:31:38 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint > up > 19/12/08 23:31:38 INFO SparkEnv: Registering BlockManagerMasterHeartbeat > ・・・{code} > > Before SPARK-26753 was applied, those INFO level logs are not shown with the > same log4j.properties. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30077) create TEMPORARY VIEW USING should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-30077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Huaxin Gao resolved SPARK-30077. Resolution: Invalid > create TEMPORARY VIEW USING should look up catalog/table like v2 commands > - > > Key: SPARK-30077 > URL: https://issues.apache.org/jira/browse/SPARK-30077 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Priority: Major > > create TEMPORARY VIEW USING should look up catalog/table like v2 commands
[jira] [Commented] (SPARK-30077) create TEMPORARY VIEW USING should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-30077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16995945#comment-16995945 ] Pablo Langa Blanco commented on SPARK-30077: [~huaxingao] Can we close this ticket? Reading the comments, I understand we are not going to make this change. Thanks > create TEMPORARY VIEW USING should look up catalog/table like v2 commands > - > > Key: SPARK-30077 > URL: https://issues.apache.org/jira/browse/SPARK-30077 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Priority: Major > > create TEMPORARY VIEW USING should look up catalog/table like v2 commands
[jira] [Commented] (SPARK-29563) CREATE TABLE LIKE should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16995942#comment-16995942 ] Pablo Langa Blanco commented on SPARK-29563: [~dkbiswal] Are you still working on this? If not, I can continue. Thanks > CREATE TABLE LIKE should look up catalog/table like v2 commands > --- > > Key: SPARK-29563 > URL: https://issues.apache.org/jira/browse/SPARK-29563 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Dilip Biswal >Assignee: Dilip Biswal >Priority: Major >
[jira] [Commented] (SPARK-30257) Mapping simpleString to Spark SQL types
[ https://issues.apache.org/jira/browse/SPARK-30257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16995896#comment-16995896 ] Shelby Vanhooser commented on SPARK-30257: -- All tests passing! > Mapping simpleString to Spark SQL types > --- > > Key: SPARK-30257 > URL: https://issues.apache.org/jira/browse/SPARK-30257 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.4 >Reporter: Shelby Vanhooser >Priority: Critical > > The PySpark mapping from simpleString to Spark SQL types are too manual right > now; instead, pyspark.sql.types should expose a method that maps the > simpleString representation of these types to the underlying Spark SQL ones > > Tracked here : [https://github.com/apache/spark/pull/26884] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30257) Mapping simpleString to Spark SQL types
[ https://issues.apache.org/jira/browse/SPARK-30257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shelby Vanhooser updated SPARK-30257: - Labels: PySpark feature (was: ) > Mapping simpleString to Spark SQL types > --- > > Key: SPARK-30257 > URL: https://issues.apache.org/jira/browse/SPARK-30257 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.4 >Reporter: Shelby Vanhooser >Priority: Critical > Labels: PySpark, feature > > The PySpark mapping from simpleString to Spark SQL types are too manual right > now; instead, pyspark.sql.types should expose a method that maps the > simpleString representation of these types to the underlying Spark SQL ones > > Tracked here : [https://github.com/apache/spark/pull/26884] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30258) Eliminate warnings of deprecated Spark APIs in tests
[ https://issues.apache.org/jira/browse/SPARK-30258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-30258: --- Summary: Eliminate warnings of deprecated Spark APIs in tests (was: Eliminate warnings of depracted Spark APIs in tests) > Eliminate warnings of deprecated Spark APIs in tests > > > Key: SPARK-30258 > URL: https://issues.apache.org/jira/browse/SPARK-30258 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Minor > > Suppress deprecation warnings in tests that check deprecated Spark APIs.
[jira] [Created] (SPARK-30258) Eliminate warnings of depracted Spark APIs in tests
Maxim Gekk created SPARK-30258: -- Summary: Eliminate warnings of depracted Spark APIs in tests Key: SPARK-30258 URL: https://issues.apache.org/jira/browse/SPARK-30258 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.0.0 Reporter: Maxim Gekk Suppress deprecation warnings in tests that check deprecated Spark APIs.
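The kind of scoped suppression this ticket asks for (the Spark patch itself targets Scala deprecation warnings) can be illustrated with Python's `warnings` machinery; `old_api` here is a made-up stand-in for a deprecated API that a test still needs to exercise.

```python
import warnings

def old_api():
    # Stand-in for a deprecated API that a test deliberately calls.
    warnings.warn("old_api() is deprecated; use new_api()", DeprecationWarning)
    return 42

# Scoped suppression: the deprecated code path is still tested, but its
# warning no longer pollutes the test output. The filter change is undone
# when the `with` block exits.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("ignore", DeprecationWarning)
    result = old_api()

print(result, len(caught))  # 42 0
```

Suppressing only inside the block keeps the warning visible everywhere else, which is the point of the ticket: silence the *expected* warnings without hiding new ones.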
[jira] [Commented] (SPARK-6235) Address various 2G limits
[ https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16995889#comment-16995889 ] Samuel Shepard commented on SPARK-6235: --- One use case could be fetching large results to the driver when computing PCA on large square matrices (e.g., distance matrices, similar to Classical MDS). This is very helpful in bioinformatics. Sorry if this is already fixed past 2.4.0... > Address various 2G limits > - > > Key: SPARK-6235 > URL: https://issues.apache.org/jira/browse/SPARK-6235 > Project: Spark > Issue Type: Umbrella > Components: Shuffle, Spark Core >Reporter: Reynold Xin >Priority: Major > Fix For: 2.4.0 > > Attachments: SPARK-6235_Design_V0.02.pdf > > > An umbrella ticket to track the various 2G limits we have in Spark, due to the > use of byte arrays and ByteBuffers.
[jira] [Resolved] (SPARK-29449) Add tooltip to Spark WebUI
[ https://issues.apache.org/jira/browse/SPARK-29449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-29449. -- Fix Version/s: 3.0.0 Resolution: Done > Add tooltip to Spark WebUI > -- > > Key: SPARK-29449 > URL: https://issues.apache.org/jira/browse/SPARK-29449 > Project: Spark > Issue Type: Umbrella > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Major > Fix For: 3.0.0 > > > The initial effort was made in > https://issues.apache.org/jira/browse/SPARK-2384. This umbrella Jira is to > track the progress of adding tooltips across the WebUI for better usability. >
[jira] [Resolved] (SPARK-29455) Improve tooltip information for Stages Tab
[ https://issues.apache.org/jira/browse/SPARK-29455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29455. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26859 [https://github.com/apache/spark/pull/26859] > Improve tooltip information for Stages Tab > -- > > Key: SPARK-29455 > URL: https://issues.apache.org/jira/browse/SPARK-29455 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Assignee: Sharanabasappa G Keriwaddi >Priority: Minor > Fix For: 3.0.0 > >
[jira] [Assigned] (SPARK-29455) Improve tooltip information for Stages Tab
[ https://issues.apache.org/jira/browse/SPARK-29455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-29455: - Assignee: Sharanabasappa G Keriwaddi > Improve tooltip information for Stages Tab > -- > > Key: SPARK-29455 > URL: https://issues.apache.org/jira/browse/SPARK-29455 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Assignee: Sharanabasappa G Keriwaddi >Priority: Minor >
[jira] [Resolved] (SPARK-30216) Use python3 in Docker release image
[ https://issues.apache.org/jira/browse/SPARK-30216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-30216. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26848 [https://github.com/apache/spark/pull/26848] > Use python3 in Docker release image > --- > > Key: SPARK-30216 > URL: https://issues.apache.org/jira/browse/SPARK-30216 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.0.0 > >
[jira] [Commented] (SPARK-30230) Like ESCAPE syntax can not use '_' and '%'
[ https://issues.apache.org/jira/browse/SPARK-30230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16995838#comment-16995838 ] Dongjoon Hyun commented on SPARK-30230: --- The commit is reverted via https://github.com/apache/spark/commit/4da9780bc0a12672b45ffdcc28e594593bc68350 > Like ESCAPE syntax can not use '_' and '%' > -- > > Key: SPARK-30230 > URL: https://issues.apache.org/jira/browse/SPARK-30230 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: ulysses you >Priority: Major > > '%' and '_' is the reserve char in `Like` expression. We can not use them as > escape char. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-30230) Like ESCAPE syntax can not use '_' and '%'
[ https://issues.apache.org/jira/browse/SPARK-30230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reopened SPARK-30230: --- Assignee: (was: ulysses you) > Like ESCAPE syntax can not use '_' and '%' > -- > > Key: SPARK-30230 > URL: https://issues.apache.org/jira/browse/SPARK-30230 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: ulysses you >Priority: Major > > '%' and '_' is the reserve char in `Like` expression. We can not use them as > escape char. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
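The restriction discussed in SPARK-30230 makes sense because an escape character that is itself a LIKE wildcard is ambiguous (with escape `'%'`, a pattern like `'%%'` has no unique reading). A minimal LIKE-to-regex translator sketches where such a check would live; this is illustrative Python, not Spark's actual parser or semantics.

```python
import re

def like_to_regex(pattern: str, escape: str = "\\") -> str:
    """Translate a SQL LIKE pattern into a Python regex string."""
    # Reject wildcards as escape characters, mirroring the ticket's point.
    if escape in ("%", "_"):
        raise ValueError("the escape character cannot be '%' or '_'")
    out, i = [], 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == escape and i + 1 < len(pattern):
            out.append(re.escape(pattern[i + 1]))  # escaped char is a literal
            i += 2
        elif ch == "%":
            out.append(".*")  # any sequence of characters
            i += 1
        elif ch == "_":
            out.append(".")   # any single character
            i += 1
        else:
            out.append(re.escape(ch))
            i += 1
    return "".join(out)

print(bool(re.fullmatch(like_to_regex(r"100\%"), "100%")))  # True
```

With a non-wildcard escape character, `\%` matches a literal percent sign; allowing `'%'` as the escape would make the translator's first two branches indistinguishable.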
[jira] [Updated] (SPARK-30257) Mapping simpleString to Spark SQL types
[ https://issues.apache.org/jira/browse/SPARK-30257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shelby Vanhooser updated SPARK-30257: - Component/s: (was: Input/Output) PySpark > Mapping simpleString to Spark SQL types > --- > > Key: SPARK-30257 > URL: https://issues.apache.org/jira/browse/SPARK-30257 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.4 >Reporter: Shelby Vanhooser >Priority: Critical > > The PySpark mapping from simpleString to Spark SQL types are too manual right > now; instead, pyspark.sql.types should expose a method that maps the > simpleString representation of these types to the underlying Spark SQL ones > > Tracked here : [https://github.com/apache/spark/pull/26884] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30257) Mapping simpleString to Spark SQL types
[ https://issues.apache.org/jira/browse/SPARK-30257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shelby Vanhooser updated SPARK-30257: - Description: The PySpark mapping from simpleString to Spark SQL types are too manual right now; instead, pyspark.sql.types should expose a method that maps the simpleString representation of these types to the underlying Spark SQL ones Tracked here : [https://github.com/apache/spark/pull/26884] was:The PySpark mapping from simpleString to Spark SQL types are too manual right now; instead, pyspark.sql.types should expose a method that maps the simpleString representation of these types to the underlying Spark SQL ones > Mapping simpleString to Spark SQL types > --- > > Key: SPARK-30257 > URL: https://issues.apache.org/jira/browse/SPARK-30257 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 2.4.4 >Reporter: Shelby Vanhooser >Priority: Critical > > The PySpark mapping from simpleString to Spark SQL types are too manual right > now; instead, pyspark.sql.types should expose a method that maps the > simpleString representation of these types to the underlying Spark SQL ones > > Tracked here : [https://github.com/apache/spark/pull/26884] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30257) Mapping simpleString to Spark SQL types
Shelby Vanhooser created SPARK-30257: Summary: Mapping simpleString to Spark SQL types Key: SPARK-30257 URL: https://issues.apache.org/jira/browse/SPARK-30257 Project: Spark Issue Type: Improvement Components: Input/Output Affects Versions: 2.4.4 Reporter: Shelby Vanhooser The PySpark mapping from simpleString to Spark SQL types is too manual right now; instead, pyspark.sql.types should expose a method that maps the simpleString representation of these types to the underlying Spark SQL ones
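What the ticket asks for amounts to a lookup from simpleString names to type constructors. The table below is a hand-written illustration only: in real PySpark the values would be `DataType` instances from `pyspark.sql.types`, and a complete mapping would also handle parameterized types such as `decimal(p, s)` and nested `array`/`map`/`struct` strings.

```python
# Illustrative lookup: simpleString name -> Spark SQL type-constructor name.
# (In pyspark.sql.types these would be instances like IntegerType().)
SIMPLE_STRING_TO_TYPE = {
    "string": "StringType",
    "boolean": "BooleanType",
    "int": "IntegerType",
    "bigint": "LongType",
    "float": "FloatType",
    "double": "DoubleType",
    "date": "DateType",
    "timestamp": "TimestampType",
}

def from_simple_string(name: str) -> str:
    """Resolve a simpleString such as 'bigint' to its Spark SQL type name."""
    try:
        return SIMPLE_STRING_TO_TYPE[name.strip().lower()]
    except KeyError:
        raise ValueError(f"unsupported simpleString: {name!r}") from None

print(from_simple_string(" BigInt "))  # LongType
```

Centralizing the table in one exposed function is the improvement being proposed, replacing ad-hoc per-caller mappings.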
[jira] [Comment Edited] (SPARK-28502) Error with struct conversion while using pandas_udf
[ https://issues.apache.org/jira/browse/SPARK-28502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16995815#comment-16995815 ] Nasir Ali edited comment on SPARK-28502 at 12/13/19 6:45 PM: - {code:java} import numpy as np import pandas as pd import json from geopy.distance import great_circle from pyspark.sql.functions import pandas_udf, PandasUDFType from shapely.geometry.multipoint import MultiPoint from sklearn.cluster import DBSCAN from pyspark.sql.types import StructField, StructType, StringType, FloatType, MapType from pyspark.sql.types import StructField, StructType, StringType, FloatType, TimestampType, IntegerType,DateType,TimestampTypeschema = StructType([ StructField("timestamp", TimestampType()), StructField("window", StructType([ StructField("start", TimestampType()), StructField("end", TimestampType())])), StructField("some_val", StringType()) ])@pandas_udf(schema, PandasUDFType.GROUPED_MAP) def get_win_col(key, user_data): all_vals = [] for index, row in user_data.iterrows(): all_vals.append([row["timestamp"],key[2],"tesss"]) return pd.DataFrame(all_vals,columns=['timestamp','window','some_val']) {code} I am not even able to manually return window column. 
It throws error {code:java} Traceback (most recent call last): File "/usr/local/spark-3.0.0-preview/python/pyspark/sql/udf.py", line 139, in returnType to_arrow_type(self._returnType_placeholder) File "/usr/local/spark-3.0.0-preview/python/pyspark/sql/types.py", line 1641, in to_arrow_type raise TypeError("Nested StructType not supported in conversion to Arrow") TypeError: Nested StructType not supported in conversion to ArrowDuring handling of the above exception, another exception occurred:Traceback (most recent call last): File "", line 1, in File "/usr/local/spark-3.0.0-preview/python/pyspark/sql/udf.py", line 79, in _create_udf return udf_obj._wrapped() File "/usr/local/spark-3.0.0-preview/python/pyspark/sql/udf.py", line 234, in _wrapped wrapper.returnType = self.returnType File "/usr/local/spark-3.0.0-preview/python/pyspark/sql/udf.py", line 143, in returnType "%s is not supported" % str(self._returnType_placeholder)) NotImplementedError: Invalid returnType with grouped map Pandas UDFs: StructType(List(StructField(timestamp,TimestampType,true),StructField(window,StructType(List(StructField(start,TimestampType,true),StructField(end,TimestampType,true))),true),StructField(some_val,StringType,true))) is not supported {code} However, if I manually run *to_arrow_schema(schema)*. It works all fine and there is no exception. 
[https://github.com/apache/spark/blob/master/python/pyspark/sql/udf.py#L139] {code:java} from pyspark.sql.types import to_arrow_schema to_arrow_schema(schema) {code} was (Author: nasirali): {code:java} import numpy as np import pandas as pd import json from geopy.distance import great_circle from pyspark.sql.functions import pandas_udf, PandasUDFType from shapely.geometry.multipoint import MultiPoint from sklearn.cluster import DBSCAN from pyspark.sql.types import StructField, StructType, StringType, FloatType, MapType from pyspark.sql.types import StructField, StructType, StringType, FloatType, TimestampType, IntegerType,DateType,TimestampTypeschema = StructType([ StructField("timestamp", TimestampType()), StructField("window", StructType([ StructField("start", TimestampType()), StructField("end", TimestampType())])), StructField("some_val", StringType()) ])@pandas_udf(schema, PandasUDFType.GROUPED_MAP) def get_win_col(key, user_data): all_vals = [] for index, row in user_data.iterrows(): all_vals.append([row["timestamp"],key[2],"tesss"]) return pd.DataFrame(all_vals,columns=['timestamp','window','some_val']) {code} I am not even able to manually return window column. 
It throws error {code:java} Traceback (most recent call last): File "/usr/local/spark-3.0.0-preview/python/pyspark/sql/udf.py", line 139, in returnType to_arrow_type(self._returnType_placeholder) File "/usr/local/spark-3.0.0-preview/python/pyspark/sql/types.py", line 1641, in to_arrow_type raise TypeError("Nested StructType not supported in conversion to Arrow") TypeError: Nested StructType not supported in conversion to ArrowDuring handling of the above exception, another exception occurred:Traceback (most recent call last): File "", line 1, in File "/usr/local/spark-3.0.0-preview/python/pyspark/sql/udf.py", line 79, in _create_udf return udf_obj._wrapped() File "/usr/local/spark-3.0.0-preview/python/pyspark/sql/udf.py", line 234, in _wrapped wrapper.returnType = self.returnType File "/usr/local/spark-3.0.0-preview/python/pyspark/sql/udf.py", line 143, in returnType "%s is not supported" % str(self._returnType_placeholder)) NotImplementedError: Invalid returnType wi
[jira] [Commented] (SPARK-28502) Error with struct conversion while using pandas_udf
[ https://issues.apache.org/jira/browse/SPARK-28502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16995815#comment-16995815 ] Nasir Ali commented on SPARK-28502: --- {code:java} import numpy as np import pandas as pd import json from geopy.distance import great_circle from pyspark.sql.functions import pandas_udf, PandasUDFType from shapely.geometry.multipoint import MultiPoint from sklearn.cluster import DBSCAN from pyspark.sql.types import StructField, StructType, StringType, FloatType, MapType from pyspark.sql.types import StructField, StructType, StringType, FloatType, TimestampType, IntegerType,DateType,TimestampTypeschema = StructType([ StructField("timestamp", TimestampType()), StructField("window", StructType([ StructField("start", TimestampType()), StructField("end", TimestampType())])), StructField("some_val", StringType()) ])@pandas_udf(schema, PandasUDFType.GROUPED_MAP) def get_win_col(key, user_data): all_vals = [] for index, row in user_data.iterrows(): all_vals.append([row["timestamp"],key[2],"tesss"]) return pd.DataFrame(all_vals,columns=['timestamp','window','some_val']) {code} I am not even able to manually return window column. 
It throws error {code:java} Traceback (most recent call last): File "/usr/local/spark-3.0.0-preview/python/pyspark/sql/udf.py", line 139, in returnType to_arrow_type(self._returnType_placeholder) File "/usr/local/spark-3.0.0-preview/python/pyspark/sql/types.py", line 1641, in to_arrow_type raise TypeError("Nested StructType not supported in conversion to Arrow") TypeError: Nested StructType not supported in conversion to ArrowDuring handling of the above exception, another exception occurred:Traceback (most recent call last): File "", line 1, in File "/usr/local/spark-3.0.0-preview/python/pyspark/sql/udf.py", line 79, in _create_udf return udf_obj._wrapped() File "/usr/local/spark-3.0.0-preview/python/pyspark/sql/udf.py", line 234, in _wrapped wrapper.returnType = self.returnType File "/usr/local/spark-3.0.0-preview/python/pyspark/sql/udf.py", line 143, in returnType "%s is not supported" % str(self._returnType_placeholder)) NotImplementedError: Invalid returnType with grouped map Pandas UDFs: StructType(List(StructField(timestamp,TimestampType,true),StructField(window,StructType(List(StructField(start,TimestampType,true),StructField(end,TimestampType,true))),true),StructField(some_val,StringType,true))) is not supported {code} However, if I manually run *to_arrow_schema(schema)*. It works all fine and there is no exception. {code:java} from pyspark.sql.types import to_arrow_schema to_arrow_schema(schema) {code} > Error with struct conversion while using pandas_udf > --- > > Key: SPARK-28502 > URL: https://issues.apache.org/jira/browse/SPARK-28502 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.3 > Environment: OS: Ubuntu > Python: 3.6 >Reporter: Nasir Ali >Priority: Minor > Fix For: 3.0.0 > > > What I am trying to do: Group data based on time intervals (e.g., 15 days > window) and perform some operations on dataframe using (pandas) UDFs. I don't > know if there is a better/cleaner way to do it. 
> Below is the sample code that I tried and error message I am getting. > > {code:java} > df = sparkSession.createDataFrame([(17.00, "2018-03-10T15:27:18+00:00"), > (13.00, "2018-03-11T12:27:18+00:00"), > (25.00, "2018-03-12T11:27:18+00:00"), > (20.00, "2018-03-13T15:27:18+00:00"), > (17.00, "2018-03-14T12:27:18+00:00"), > (99.00, "2018-03-15T11:27:18+00:00"), > (156.00, "2018-03-22T11:27:18+00:00"), > (17.00, "2018-03-31T11:27:18+00:00"), > (25.00, "2018-03-15T11:27:18+00:00"), > (25.00, "2018-03-16T11:27:18+00:00") > ], >["id", "ts"]) > df = df.withColumn('ts', df.ts.cast('timestamp')) > schema = StructType([ > StructField("id", IntegerType()), > StructField("ts", TimestampType()) > ]) > @pandas_udf(schema, PandasUDFType.GROUPED_MAP) > def some_udf(df): > # some computation > return df > df.groupby('id', F.window("ts", "15 days")).apply(some_udf).show() > {code} > This throws following exception: > {code:java} > TypeError: Unsupported type in conversion from Arrow: struct timestamp[us, tz=America/Chicago], end: timestamp[us, tz=America/Chicago]> > {code} > > However, if I use builtin agg method then it works all fine. For example, > {code:java} > df.groupby('id', F.window("ts", "15 days"
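Until nested StructType survives the Arrow conversion, one common workaround is to flatten the window struct into top-level start/end columns before declaring the UDF's return schema. The helper below sketches that flattening over a plain-Python stand-in for a schema (dicts mark nested structs); it is not a PySpark API, only an illustration of the shape of the workaround.

```python
def flatten_schema(fields):
    """Flatten one level of struct nesting.

    ('window', {'start': ..., 'end': ...}) becomes two top-level fields,
    window_start and window_end, which Arrow can convert without complaint.
    """
    flat = []
    for name, dtype in fields:
        if isinstance(dtype, dict):  # stand-in for a nested StructType
            flat.extend((f"{name}_{sub}", sub_t) for sub, sub_t in dtype.items())
        else:
            flat.append((name, dtype))
    return flat

nested = [("timestamp", "timestamp"),
          ("window", {"start": "timestamp", "end": "timestamp"}),
          ("some_val", "string")]
print(flatten_schema(nested))
```

The UDF would then fill `window_start`/`window_end` from `key[2]` instead of returning the struct whole.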
[jira] [Updated] (SPARK-30256) Allow SparkLauncher to sudo before executing spark-submit
[ https://issues.apache.org/jira/browse/SPARK-30256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Evans updated SPARK-30256: --- Description: It would be useful if {{org.apache.spark.launcher.SparkLauncher}} allowed for a "sudo as user X" option. This way, multi-tenant applications that run Spark jobs could give end users greater security, by ensuring that the files (including, importantly, keytabs) can remain readable only by the end users instead of the UID that runs this multi-tenant application itself. I believe that {{sudo -u spark-submit }} should work. The builder maintained by {{SparkLauncher}} could simply have a {{setSudoUser}} method. (was: It would be useful if {{org.apache.spark.launcher.SparkLauncher}} allowed for a "sudo as user X" option. This way, multi-tenant applications that run Spark jobs could give end users greater security, by ensuring that the files (including, importantly, keytabs) can remain readable only by the end users instead of the UID that runs this multi-tenant application itself. I believe that {{sudo -u spark-submit Allow SparkLauncher to sudo before executing spark-submit > - > > Key: SPARK-30256 > URL: https://issues.apache.org/jira/browse/SPARK-30256 > Project: Spark > Issue Type: Improvement > Components: Spark Submit >Affects Versions: 3.0.0 >Reporter: Jeff Evans >Priority: Minor > > It would be useful if {{org.apache.spark.launcher.SparkLauncher}} allowed for > a "sudo as user X" option. This way, multi-tenant applications that run > Spark jobs could give end users greater security, by ensuring that the files > (including, importantly, keytabs) can remain readable only by the end users > instead of the UID that runs this multi-tenant application itself. I believe > that {{sudo -u spark-submit }} should work. The > builder maintained by {{SparkLauncher}} could simply have a {{setSudoUser}} > method. 
[jira] [Created] (SPARK-30256) Allow SparkLauncher to sudo before executing spark-submit
Jeff Evans created SPARK-30256: -- Summary: Allow SparkLauncher to sudo before executing spark-submit Key: SPARK-30256 URL: https://issues.apache.org/jira/browse/SPARK-30256 Project: Spark Issue Type: Improvement Components: Spark Submit Affects Versions: 3.0.0 Reporter: Jeff Evans It would be useful if {{org.apache.spark.launcher.SparkLauncher}} allowed for a "sudo as user X" option. This way, multi-tenant applications that run Spark jobs could give end users greater security, by ensuring that the files (including, importantly, keytabs) can remain readable only by the end users instead of the UID that runs this multi-tenant application itself. I believe that {{sudo -u spark-submit
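The requested behaviour amounts to prefixing the child-process command line before it is spawned. A hypothetical sketch in Python (SparkLauncher itself is Java, and `setSudoUser`, this function, and its argument handling are assumptions, not an existing API):

```python
def build_launch_command(spark_submit_args, sudo_user=None):
    """Prefix a spark-submit invocation with `sudo -u <user>` when requested.

    Mirrors what a hypothetical SparkLauncher.setSudoUser(...) could do before
    spawning the child process; sudo_user=None keeps the current behaviour.
    """
    cmd = ["spark-submit"] + list(spark_submit_args)
    if sudo_user is not None:
        cmd = ["sudo", "-u", sudo_user] + cmd
    return cmd

print(build_launch_command(["--class", "app.Main", "app.jar"], sudo_user="etl"))
```

Building the command as a list (rather than a shell string) keeps arguments safe from shell interpolation, which matters here since the sudo target user is caller-supplied.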
[jira] [Commented] (SPARK-30168) Eliminate warnings in Parquet datasource
[ https://issues.apache.org/jira/browse/SPARK-30168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16995754#comment-16995754 ] Maxim Gekk commented on SPARK-30168: [~Ankitraj] Go ahead. > Eliminate warnings in Parquet datasource > > > Key: SPARK-30168 > URL: https://issues.apache.org/jira/browse/SPARK-30168 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Minor > > # > sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetPartitionReaderFactory.scala > {code} > Warning:Warning:line (120)class ParquetInputSplit in package hadoop is > deprecated: see corresponding Javadoc for more information. > Option[TimeZone]) => RecordReader[Void, T]): RecordReader[Void, T] > = { > Warning:Warning:line (125)class ParquetInputSplit in package hadoop is > deprecated: see corresponding Javadoc for more information. > new org.apache.parquet.hadoop.ParquetInputSplit( > Warning:Warning:line (134)method readFooter in class ParquetFileReader is > deprecated: see corresponding Javadoc for more information. > ParquetFileReader.readFooter(conf, filePath, > SKIP_ROW_GROUPS).getFileMetaData > Warning:Warning:line (183)class ParquetInputSplit in package hadoop is > deprecated: see corresponding Javadoc for more information. > split: ParquetInputSplit, > Warning:Warning:line (212)class ParquetInputSplit in package hadoop is > deprecated: see corresponding Javadoc for more information. 
> split: ParquetInputSplit, > {code} > # > sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java > {code} > Warning:Warning:line (55)java: org.apache.parquet.hadoop.ParquetInputSplit in > org.apache.parquet.hadoop has been deprecated > Warning:Warning:line (95)java: > org.apache.parquet.hadoop.ParquetInputSplit in org.apache.parquet.hadoop has > been deprecated > Warning:Warning:line (95)java: > org.apache.parquet.hadoop.ParquetInputSplit in org.apache.parquet.hadoop has > been deprecated > Warning:Warning:line (97)java: getRowGroupOffsets() in > org.apache.parquet.hadoop.ParquetInputSplit has been deprecated > Warning:Warning:line (105)java: > readFooter(org.apache.hadoop.conf.Configuration,org.apache.hadoop.fs.Path,org.apache.parquet.format.converter.ParquetMetadataConverter.MetadataFilter) > in org.apache.parquet.hadoop.ParquetFileReader has been deprecated > Warning:Warning:line (108)java: > filterRowGroups(org.apache.parquet.filter2.compat.FilterCompat.Filter,java.util.List,org.apache.parquet.schema.MessageType) > in org.apache.parquet.filter2.compat.RowGroupFilter has been deprecated > Warning:Warning:line (111)java: > readFooter(org.apache.hadoop.conf.Configuration,org.apache.hadoop.fs.Path,org.apache.parquet.format.converter.ParquetMetadataConverter.MetadataFilter) > in org.apache.parquet.hadoop.ParquetFileReader has been deprecated > Warning:Warning:line (147)java: > ParquetFileReader(org.apache.hadoop.conf.Configuration,org.apache.parquet.hadoop.metadata.FileMetaData,org.apache.hadoop.fs.Path,java.util.List,java.util.List) > in org.apache.parquet.hadoop.ParquetFileReader has been deprecated > Warning:Warning:line (203)java: > readFooter(org.apache.hadoop.conf.Configuration,org.apache.hadoop.fs.Path,org.apache.parquet.format.converter.ParquetMetadataConverter.MetadataFilter) > in org.apache.parquet.hadoop.ParquetFileReader has been deprecated > Warning:Warning:line (226)java: > 
ParquetFileReader(org.apache.hadoop.conf.Configuration,org.apache.parquet.hadoop.metadata.FileMetaData,org.apache.hadoop.fs.Path,java.util.List,java.util.List) > in org.apache.parquet.hadoop.ParquetFileReader has been deprecated > {code} > # > sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetCompatibilityTest.scala > # > sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetInteroperabilitySuite.scala > # > sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetTest.scala > # > sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30243) Upgrade K8s client dependency to 4.6.4
[ https://issues.apache.org/jira/browse/SPARK-30243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-30243: - Assignee: Dongjoon Hyun > Upgrade K8s client dependency to 4.6.4 > -- > > Key: SPARK-30243 > URL: https://issues.apache.org/jira/browse/SPARK-30243 > Project: Spark > Issue Type: Improvement > Components: Build, Kubernetes >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30243) Upgrade K8s client dependency to 4.6.4
[ https://issues.apache.org/jira/browse/SPARK-30243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-30243. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26874 [https://github.com/apache/spark/pull/26874] > Upgrade K8s client dependency to 4.6.4 > -- > > Key: SPARK-30243 > URL: https://issues.apache.org/jira/browse/SPARK-30243 > Project: Spark > Issue Type: Improvement > Components: Build, Kubernetes >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30248) DROP TABLE doesn't work if session catalog name is provided
[ https://issues.apache.org/jira/browse/SPARK-30248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-30248. - Fix Version/s: 3.0.0 Assignee: Terry Kim Resolution: Fixed > DROP TABLE doesn't work if session catalog name is provided > --- > > Key: SPARK-30248 > URL: https://issues.apache.org/jira/browse/SPARK-30248 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > Fix For: 3.0.0 > > > If a table name is qualified with session catalog name ("spark_catalog"), the > DROP TABLE command fails. > For example, the following > {code:java} > sql("CREATE TABLE tbl USING json AS SELECT 1 AS i") > sql("DROP TABLE spark_catalog.tbl") > {code} > fails with: > {code:java} > org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database > 'spark_catalog' not found; >at > org.apache.spark.sql.catalyst.catalog.ExternalCatalog.requireDbExists(ExternalCatalog.scala:42) >at > org.apache.spark.sql.catalyst.catalog.ExternalCatalog.requireDbExists$(ExternalCatalog.scala:40) >at > org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.requireDbExists(InMemoryCatalog.scala:45) >at > org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.tableExists(InMemoryCatalog.scala:336) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30072) Create dedicated planner for subqueries
[ https://issues.apache.org/jira/browse/SPARK-30072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16995627#comment-16995627 ] Xiaoju Wu commented on SPARK-30072: --- [~cloud_fan] If the sql looks like: SELECT * FROM df2 WHERE df2.k = (SELECT max(df2.k) FROM df1 JOIN df2 ON df1.k = df2.k AND df2.id < 2) The nested subquery "SELECT max(df2.k) FROM df1 JOIN df2 ON df1.k = df2.k AND df2.id < 2" will be run in another QueryExecution, so there's no way to pass "isSubquery" information to InsertAdaptiveSparkPlan in the nested QueryExecution. > Create dedicated planner for subqueries > --- > > Key: SPARK-30072 > URL: https://issues.apache.org/jira/browse/SPARK-30072 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ali Afroozeh >Assignee: Ali Afroozeh >Priority: Minor > Fix For: 3.0.0 > > > This PR changes subquery planning by calling the planner and plan preparation > rules on the subquery plan directly. Before, we were creating a QueryExecution > instance for subqueries to get the executedPlan. This would re-run analysis > and optimization on the subquery's plan. Running the analysis again on an > optimized query plan can have unwanted consequences, as some rules, for > example DecimalPrecision, are not idempotent. > As an example, consider the expression 1.7 * avg(a), which after applying the > DecimalPrecision rule becomes: > promote_precision(1.7) * promote_precision(avg(a)) > After the optimization, more specifically the constant folding rule, this > expression becomes: > 1.7 * promote_precision(avg(a)) > Now if we run the analyzer on this optimized query again, we will get: > promote_precision(1.7) * promote_precision(promote_precision(avg(a))) > which will later be optimized as: > 1.7 * promote_precision(promote_precision(avg(a))) > As can be seen, re-running the analysis and optimization on this expression > results in an expression with extra nested promote_precision nodes. Adding > unneeded nodes to the plan is problematic because it can eliminate situations > where we can reuse the plan. > We opted to introduce dedicated planners for subqueries, instead of making > the DecimalPrecision rule idempotent, because this eliminates this entire > category of problems. Another benefit is that planning time for subqueries is > reduced. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
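The non-idempotence argument in the issue description can be reproduced with a toy rewrite rule. This is a hypothetical Python stand-in for DecimalPrecision and constant folding, not Spark's actual rules; it only shows why re-analyzing an already-optimized tree nests the wrapper:

```python
# Toy illustration of a non-idempotent analysis rule (hypothetical; not
# Spark's real DecimalPrecision). Expressions are (left, op, right) tuples
# of strings. "analyze" wraps both operands unconditionally, even ones
# that are already wrapped, so it is not idempotent.

def analyze(expr):
    """Wrap each operand of the expression in promote_precision(...)."""
    left, op, right = expr
    return (f"promote_precision({left})", op, f"promote_precision({right})")

def constant_fold(expr):
    """Simulate constant folding: unwrap a wrapped literal back to itself."""
    left, op, right = expr
    if left == "promote_precision(1.7)":
        left = "1.7"
    return (left, op, right)

expr = ("1.7", "*", "avg(a)")
once = constant_fold(analyze(expr))
# Re-running analysis on the already-optimized tree nests the wrapper:
twice = constant_fold(analyze(once))
print(once)   # ('1.7', '*', 'promote_precision(avg(a))')
print(twice)  # ('1.7', '*', 'promote_precision(promote_precision(avg(a)))')
```

Planning subqueries without a fresh QueryExecution avoids the second `analyze` pass entirely, which is why the dedicated planner sidesteps this whole class of problems.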
[jira] [Created] (SPARK-30255) Support explain mode in SparkR df.explain
Takeshi Yamamuro created SPARK-30255: Summary: Support explain mode in SparkR df.explain Key: SPARK-30255 URL: https://issues.apache.org/jira/browse/SPARK-30255 Project: Spark Issue Type: Improvement Components: R, SQL Affects Versions: 3.0.0 Reporter: Takeshi Yamamuro This pr intends to support explain modes implemented in SPARK-30200(#26829) for SparkR. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30227) Add close() on DataWriter interface
[ https://issues.apache.org/jira/browse/SPARK-30227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-30227. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26855 [https://github.com/apache/spark/pull/26855] > Add close() on DataWriter interface > --- > > Key: SPARK-30227 > URL: https://issues.apache.org/jira/browse/SPARK-30227 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > Fix For: 3.0.0 > > > If the scaladoc of DataWriter is correct, the lifecycle of a DataWriter > instance ends at either commit() or abort(). That makes datasource > implementors feel they can place resource cleanup on either side, but > abort() can be called when commit() fails, so they have to ensure they don't > do double-cleanup if cleanup is not idempotent. > So I'm proposing to add close() on DataWriter explicitly, which is "the > place" for resource cleanup. The lifecycle of a DataWriter instance will (and > should) end at close(). > I've checked some callers to see whether they can apply "try-catch-finally" > to ensure close() is called at the end of the DataWriter lifecycle, and it > looks like they can. > The change would be backward incompatible, but given that the interface > is marked as Evolving and we're making backward incompatible changes in Spark > 3.0, I feel it may not matter. > I've raised the discussion around this issue and the feedback is positive: > https://lists.apache.org/thread.html/bfdb989fa83bc4d774804473610bd0cfcaa1dd5a020ca9a522f3510c%40%3Cdev.spark.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
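As a rough sketch of the lifecycle the issue proposes (a hypothetical Python analogue; Spark's real DataWriter is a Java interface with different signatures), commit()/abort() decide the outcome while a single close() in a finally block owns resource cleanup, so neither commit() nor abort() needs cleanup-idempotent logic:

```python
# Hypothetical Python analogue of the proposed DataWriter lifecycle.
# close() is the single place for resource cleanup; commit()/abort()
# only decide the outcome of the write.

class DataWriter:
    def __init__(self):
        self.closed = False
        self.outcome = None

    def write(self, record):
        pass  # buffer/stream the record

    def commit(self):
        self.outcome = "committed"

    def abort(self):
        self.outcome = "aborted"

    def close(self):
        self.closed = True  # release files, sockets, buffers, ...

def run_task(writer, records, fail=False):
    # Caller pattern the issue describes: try/except decides commit vs
    # abort, and finally guarantees close() exactly once at end of life.
    try:
        for r in records:
            writer.write(r)
            if fail:
                raise RuntimeError("write failed")
        writer.commit()
    except Exception:
        writer.abort()
    finally:
        writer.close()

w = DataWriter()
run_task(w, [1, 2, 3])
print(w.outcome, w.closed)  # committed True
```

With this split, a datasource that calls abort() after a failed commit() no longer risks double-cleanup, because cleanup lives only in close().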
[jira] [Assigned] (SPARK-30227) Add close() on DataWriter interface
[ https://issues.apache.org/jira/browse/SPARK-30227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-30227: --- Assignee: Jungtaek Lim > Add close() on DataWriter interface > --- > > Key: SPARK-30227 > URL: https://issues.apache.org/jira/browse/SPARK-30227 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > > If the scaladoc of DataWriter is correct, the lifecycle of a DataWriter > instance ends at either commit() or abort(). That makes datasource > implementors feel they can place resource cleanup on either side, but > abort() can be called when commit() fails, so they have to ensure they don't > do double-cleanup if cleanup is not idempotent. > So I'm proposing to add close() on DataWriter explicitly, which is "the > place" for resource cleanup. The lifecycle of a DataWriter instance will (and > should) end at close(). > I've checked some callers to see whether they can apply "try-catch-finally" > to ensure close() is called at the end of the DataWriter lifecycle, and it > looks like they can. > The change would be backward incompatible, but given that the interface > is marked as Evolving and we're making backward incompatible changes in Spark > 3.0, I feel it may not matter. > I've raised the discussion around this issue and the feedback is positive: > https://lists.apache.org/thread.html/bfdb989fa83bc4d774804473610bd0cfcaa1dd5a020ca9a522f3510c%40%3Cdev.spark.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29741) Spark Application UI- In Environment tab add "Search" option
[ https://issues.apache.org/jira/browse/SPARK-29741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-29741. -- Resolution: Won't Fix > Spark Application UI- In Environment tab add "Search" option > > > Key: SPARK-29741 > URL: https://issues.apache.org/jira/browse/SPARK-29741 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.0 >Reporter: jobit mathew >Priority: Minor > > Spark Application UI - in the Environment tab, add a "Search" option. > As the Environment tab now has different sections for information and > properties (Runtime Information, Spark Properties, Hadoop Properties, System Properties & > Classpath Entries), it would be better to provide a single *Search* field, so it will be > easy to search for any parameter value even when we don't know which section it > appears in. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30079) Tests fail in environments with locale different from en_US
[ https://issues.apache.org/jira/browse/SPARK-30079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-30079. -- Resolution: Not A Problem > Tests fail in environments with locale different from en_US > --- > > Key: SPARK-30079 > URL: https://issues.apache.org/jira/browse/SPARK-30079 > Project: Spark > Issue Type: Bug > Components: Build, Tests >Affects Versions: 3.0.0 > Environment: any environment with a non-English locale and/or > different separators for numbers. >Reporter: Lukas Menzel >Priority: Trivial > > Tests fail on systems with a locale different from en_US. > Assertions on exception messages fail because the messages are localized > by Java depending on the system environment (e.g. > org.apache.spark.deploy.SparkSubmitSuite). > Other tests fail because of assertions about formatted numbers, which use > different separators (see > [https://docs.oracle.com/cd/E19455-01/806-0169/overview-9/index.html]) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27021) Leaking Netty event loop group for shuffle chunk fetch requests
[ https://issues.apache.org/jira/browse/SPARK-27021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16995538#comment-16995538 ] Attila Zsolt Piros commented on SPARK-27021: [~roncenzhao] # yes # I do not think so. This bug mostly affects the test system, as test execution is the place where multiple TransportContext, NettyRpcEnv, etc ... are created and not closed correctly. > Leaking Netty event loop group for shuffle chunk fetch requests > --- > > Key: SPARK-27021 > URL: https://issues.apache.org/jira/browse/SPARK-27021 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0, 2.4.1, 3.0.0 >Reporter: Attila Zsolt Piros >Assignee: Attila Zsolt Piros >Priority: Major > Fix For: 3.0.0 > > > The extra event loop group created for handling shuffle chunk fetch requests > is never closed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30208) A race condition when reading from Kafka in PySpark
[ https://issues.apache.org/jira/browse/SPARK-30208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16995532#comment-16995532 ] Jungtaek Lim commented on SPARK-30208: -- I've just tested it simply with additional logging:
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TaskCompletionListenerTesting").getOrCreate()

df = spark \
    .read \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "topic1") \
    .load()

def f(rows):
    for row in rows:
        print(row.key)

df.foreachPartition(f)
{code}
and no, KafkaRDD registers its listener earlier than PythonRunner, which means the callback from PythonRunner will be called earlier. That sounds natural, as KafkaRDD is a data source and hence should be placed first. (I can't imagine the other case.) So my guess seems wrong; there is another, slightly possible case - the completion callback of PythonRunner doesn't even join the writer thread - but given this is about a race condition, I'm not 100% sure.
> A race condition when reading from Kafka in PySpark > --- > > Key: SPARK-30208 > URL: https://issues.apache.org/jira/browse/SPARK-30208 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.4 >Reporter: Jiawen Zhu >Priority: Major > > When using PySpark to read from Kafka, there is a race condition that Spark > may use KafkaConsumer in multiple threads at the same time and throw the > following error: > {code} > java.util.ConcurrentModificationException: KafkaConsumer is not safe for > multi-threaded access > at > kafkashaded.org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:2215) > at > kafkashaded.org.apache.kafka.clients.consumer.KafkaConsumer.close(KafkaConsumer.java:2104) > at > kafkashaded.org.apache.kafka.clients.consumer.KafkaConsumer.close(KafkaConsumer.java:2059) > at > org.apache.spark.sql.kafka010.InternalKafkaConsumer.close(KafkaDataConsumer.scala:451) > at > org.apache.spark.sql.kafka010.KafkaDataConsumer$NonCachedKafkaDataConsumer.release(KafkaDataConsumer.scala:508) > at > org.apache.spark.sql.kafka010.KafkaSourceRDD$$anon$1.close(KafkaSourceRDD.scala:126) > at > org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:66) > at > org.apache.spark.sql.kafka010.KafkaSourceRDD$$anonfun$compute$3.apply(KafkaSourceRDD.scala:131) > at > org.apache.spark.sql.kafka010.KafkaSourceRDD$$anonfun$compute$3.apply(KafkaSourceRDD.scala:130) > at > org.apache.spark.TaskContext$$anon$1.onTaskCompletion(TaskContext.scala:162) > at > org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:131) > at > org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:131) > at > org.apache.spark.TaskContextImpl$$anonfun$invokeListeners$1.apply(TaskContextImpl.scala:144) > at > org.apache.spark.TaskContextImpl$$anonfun$invokeListeners$1.apply(TaskContextImpl.scala:142) > at > 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.TaskContextImpl.invokeListeners(TaskContextImpl.scala:142) > at > org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:130) > at org.apache.spark.scheduler.Task.doRunTask(Task.scala:155) > at org.apache.spark.scheduler.Task.run(Task.scala:112) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:497) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1526) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:503) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > {code} > When using PySpark, reading from Kafka actually happens in a separate > writer thread rather than the task thread. When a task is terminated early > (e.g., there is a limit operator), the task thread may stop the KafkaConsumer > while the writer thread is still using it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
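The registration-order reasoning in the comment above hinges on task completion listeners being invoked in reverse registration order (the stack trace shows `TaskContextImpl.invokeListeners` doing this). A minimal sketch of that contract, as a hypothetical Python stand-in for TaskContextImpl rather than Spark's actual implementation:

```python
# Minimal sketch of LIFO completion-listener invocation (hypothetical;
# Spark's TaskContextImpl invokes listeners in reverse registration
# order). If KafkaRDD registers first, its close-consumer callback runs
# last, after PythonRunner's callback has had the chance to join the
# writer thread.

class TaskContext:
    def __init__(self):
        self._listeners = []

    def add_task_completion_listener(self, fn):
        self._listeners.append(fn)

    def mark_task_completed(self):
        # Last registered runs first.
        for fn in reversed(self._listeners):
            fn()

calls = []
ctx = TaskContext()
ctx.add_task_completion_listener(lambda: calls.append("KafkaRDD: close consumer"))
ctx.add_task_completion_listener(lambda: calls.append("PythonRunner: join writer thread"))
ctx.mark_task_completed()
print(calls)
# ['PythonRunner: join writer thread', 'KafkaRDD: close consumer']
```

If PythonRunner's callback returned without actually joining the writer thread, the consumer could still be in use when KafkaRDD's callback closes it, which matches the ConcurrentModificationException in the report.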
[jira] [Commented] (SPARK-28825) Document EXPLAIN Statement in SQL Reference.
[ https://issues.apache.org/jira/browse/SPARK-28825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16995531#comment-16995531 ] pavithra ramachandran commented on SPARK-28825: --- [~dkbiswal] are you working on this? If not I would like to work on this. > Document EXPLAIN Statement in SQL Reference. > > > Key: SPARK-28825 > URL: https://issues.apache.org/jira/browse/SPARK-28825 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 2.4.3 >Reporter: jobit mathew >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30254) Fix use custom escape lead to LikeSimplification optimize failed
ulysses you created SPARK-30254: --- Summary: Fix use custom escape lead to LikeSimplification optimize failed Key: SPARK-30254 URL: https://issues.apache.org/jira/browse/SPARK-30254 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: ulysses you We should also sync the escape used by `LikeSimplification`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30253) Do not add commits when releasing preview version
[ https://issues.apache.org/jira/browse/SPARK-30253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-30253: Attachment: 3.0.0-preview.png > Do not add commits when releasing preview version > - > > Key: SPARK-30253 > URL: https://issues.apache.org/jira/browse/SPARK-30253 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > Attachments: 3.0.0-preview.png > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30253) Do not add commits when releasing preview version
[ https://issues.apache.org/jira/browse/SPARK-30253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-30253: Summary: Do not add commits when releasing preview version (was: Preview release does not add version change commits to master branch) > Do not add commits when releasing preview version > - > > Key: SPARK-30253 > URL: https://issues.apache.org/jira/browse/SPARK-30253 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30253) Preview release does not add version change commits to master branch
Yuming Wang created SPARK-30253: --- Summary: Preview release does not add version change commits to master branch Key: SPARK-30253 URL: https://issues.apache.org/jira/browse/SPARK-30253 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 3.0.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30252) Disallow negative scale of Decimal under ansi mode
wuyi created SPARK-30252: Summary: Disallow negative scale of Decimal under ansi mode Key: SPARK-30252 URL: https://issues.apache.org/jira/browse/SPARK-30252 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: wuyi According to SQL standard, {quote}4.4.2 Characteristics of numbers An exact numeric type has a precision P and a scale S. P is a positive integer that determines the number of significant digits in a particular radix R, where R is either 2 or 10. S is a non-negative integer. {quote} scale of Decimal should always be non-negative. And other mainstream databases, like Presto, PostgreSQL, also don't allow negative scale. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
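For intuition, an exact numeric stores a value as unscaled * 10^(-S), so a negative scale S places the significant digits left of the decimal point. Python's decimal module can illustrate the idea, since a Decimal's tuple exponent plays the role of -S; this is an analogy, not Spark's Decimal type:

```python
from decimal import Decimal

# A Decimal's tuple exponent corresponds to -scale: value = digits * 10**exp.
# The SQL standard requires scale S to be non-negative, i.e. exponent <= 0.
d = Decimal("12.345")
print(d.as_tuple().exponent)  # -3, i.e. scale 3: three digits after the point

# A "negative scale" value like 1.23E+4 has a positive exponent: its
# significant digits (123) end two places before the decimal point.
neg_scale = Decimal("1.23E+4")
print(neg_scale.as_tuple().exponent)  # 2, i.e. scale -2
```

Disallowing the second form under ansi mode aligns Spark with the standard and with databases like Presto and PostgreSQL.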
[jira] [Resolved] (SPARK-30231) Support explain mode in PySpark df.explain
[ https://issues.apache.org/jira/browse/SPARK-30231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30231. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26861 [https://github.com/apache/spark/pull/26861] > Support explain mode in PySpark df.explain > -- > > Key: SPARK-30231 > URL: https://issues.apache.org/jira/browse/SPARK-30231 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Fix For: 3.0.0 > > > This pr intends to support explain modes implemented in SPARK-30200(#26829) > for PySpark. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30231) Support explain mode in PySpark df.explain
[ https://issues.apache.org/jira/browse/SPARK-30231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-30231: Assignee: Takeshi Yamamuro > Support explain mode in PySpark df.explain > -- > > Key: SPARK-30231 > URL: https://issues.apache.org/jira/browse/SPARK-30231 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > > This pr intends to support explain modes implemented in SPARK-30200(#26829) > for PySpark. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30251) faster way to read csv.gz?
[ https://issues.apache.org/jira/browse/SPARK-30251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] t oo updated SPARK-30251: - Description: Some data providers give files in csv.gz (i.e. 1gb compressed which is 25gb uncompressed; or 5gb compressed which is 130gb uncompressed; or .1gb compressed which is 2.5gb uncompressed). Now when I tell my boss that the famous big data tool Spark takes 16hrs to convert the 1gb compressed file into parquet, there is a look of shock. This is batch data we receive daily (80gb compressed, 2tb uncompressed every day, spread across ~300 files). I know gz is not splittable so it ends up loaded on a single worker, but we don't have the space/patience to do a pre-conversion to bz2 or uncompressed. Can Spark have a better codec? I saw posts mentioning even Python is faster than Spark [https://stackoverflow.com/questions/40492967/dealing-with-a-large-gzipped-file-in-spark] [https://github.com/nielsbasjes/splittablegzip] was: Some data providers give files in csv.gz (i.e. 1gb compressed which is 25gb uncompressed; or 5gb compressed which is 130gb uncompressed; or .1gb compressed which is 2.5gb uncompressed). Now when I tell my boss that the famous big data tool Spark takes 16hrs to convert the 1gb compressed file into parquet, there is a look of shock. This is batch data we receive daily (80gb compressed, 2tb uncompressed every day, spread across ~300 files). I know gz is not splittable so it is currently loaded on a single worker, but we don't have the space/patience to do a pre-conversion to bz2 or uncompressed. Can Spark have a better codec? I saw posts mentioning even Python is faster than Spark [https://stackoverflow.com/questions/40492967/dealing-with-a-large-gzipped-file-in-spark] [https://github.com/nielsbasjes/splittablegzip] > faster way to read csv.gz? > -- > > Key: SPARK-30251 > URL: https://issues.apache.org/jira/browse/SPARK-30251 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: t oo >Priority: Major > > Some data providers give files in csv.gz (i.e. 1gb compressed which is 25gb > uncompressed; or 5gb compressed which is 130gb uncompressed; or .1gb compressed > which is 2.5gb uncompressed). Now when I tell my boss that the famous big data > tool Spark takes 16hrs to convert the 1gb compressed file into parquet, there > is a look of shock. This is batch data we receive daily (80gb compressed, 2tb > uncompressed every day, spread across ~300 files). > I know gz is not splittable so it ends up loaded on a single worker, but we > don't have the space/patience to do a pre-conversion to bz2 or uncompressed. Can > Spark have a better codec? I saw posts mentioning even Python is faster than > Spark > > [https://stackoverflow.com/questions/40492967/dealing-with-a-large-gzipped-file-in-spark] > [https://github.com/nielsbasjes/splittablegzip] > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
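For reference, the splittablegzip project linked in the report is wired in through Hadoop configuration rather than Spark code changes. A hedged sketch, assuming its jar (nl.basjes.hadoop:splittablegzip) is on the executor classpath and with placeholder paths; this is configuration only, not a tested pipeline:

```python
from pyspark.sql import SparkSession

# Sketch only: requires the splittablegzip jar on the classpath.
# The codec class name comes from that project's documentation.
spark = (SparkSession.builder
         .appName("SplittableGzipRead")
         # Register the codec so .gz files can be split across tasks:
         .config("spark.hadoop.io.compression.codecs",
                 "nl.basjes.hadoop.io.compress.SplittableGzipCodec")
         .getOrCreate())

# Placeholder input/output paths:
df = spark.read.csv("/data/input/*.csv.gz", header=True)
df.write.parquet("/data/output-parquet/")
```

The codec trades redundant decompression work for parallelism: several tasks each read the same gzip stream from the start but keep only their assigned split, which can still be much faster than one worker decompressing 25gb alone.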