[jira] [Updated] (SPARK-30262) Fix NumberFormatException when totalSize is empty
[ https://issues.apache.org/jira/browse/SPARK-30262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chenliang updated SPARK-30262: -- Description: Since Spark 2.3.0, we can read partition statistics. But in some special cases, statistics such as totalSize, rawDataSize, and rowCount may be empty. When we run a DDL statement such as {code:java}desc formatted <table> partition(<spec>){code}, a NumberFormatException is thrown as below:
{code:java}
spark-sql> desc formatted table1 partition(year='2019', month='10', day='17', hour='23');
19/10/19 00:02:40 ERROR SparkSQLDriver: Failed in [desc formatted table1 partition(year='2019', month='10', day='17', hour='23')]
java.lang.NumberFormatException: Zero length BigInteger
at java.math.BigInteger.<init>(BigInteger.java:411)
at java.math.BigInteger.<init>(BigInteger.java:597)
at scala.math.BigInt$.apply(BigInt.scala:77)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$31.apply(HiveClientImpl.scala:1056)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$31.apply(HiveClientImpl.scala:1056)
at scala.Option.map(Option.scala:146)
at org.apache.spark.sql.hive.client.HiveClientImpl$.org$apache$spark$sql$hive$client$HiveClientImpl$$readHiveStats(HiveClientImpl.scala:1056)
at org.apache.spark.sql.hive.client.HiveClientImpl$.fromHivePartition(HiveClientImpl.scala:1048)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionOption$1$$anonfun$apply$16.apply(HiveClientImpl.scala:659)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionOption$1$$anonfun$apply$16.apply(HiveClientImpl.scala:659)
at scala.Option.map(Option.scala:146)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionOption$1.apply(HiveClientImpl.scala:659)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionOption$1.apply(HiveClientImpl.scala:656)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:281)
at
org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:219)
at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:218)
at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:264)
at org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionOption(HiveClientImpl.scala:656)
at org.apache.spark.sql.hive.client.HiveClient$class.getPartitionOption(HiveClient.scala:194)
at org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionOption(HiveClientImpl.scala:84)
at org.apache.spark.sql.hive.client.HiveClient$class.getPartition(HiveClient.scala:174)
at org.apache.spark.sql.hive.client.HiveClientImpl.getPartition(HiveClientImpl.scala:84)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$getPartition$1.apply(HiveExternalCatalog.scala:1125)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$getPartition$1.apply(HiveExternalCatalog.scala:1124)
{code}
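The trace points at readHiveStats (HiveClientImpl.scala:1056), where the raw Hive stat parameter string is fed straight into BigInt, and java.math.BigInteger rejects an empty string with exactly this "Zero length BigInteger" message. A minimal Java sketch of the failure and of a defensive parse; the helper name parseStat is hypothetical, not Spark's actual code:

```java
import java.math.BigInteger;
import java.util.Optional;

public class EmptyStatDemo {
    // Hypothetical guard: treat null/blank stat values (totalSize, rowCount, ...)
    // as absent instead of passing them to BigInteger, which would throw
    // "NumberFormatException: Zero length BigInteger" on an empty string.
    static Optional<BigInteger> parseStat(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(new BigInteger(raw.trim()));
    }

    public static void main(String[] args) {
        try {
            new BigInteger("");  // reproduces the reported failure
        } catch (NumberFormatException e) {
            System.out.println(e.getMessage());  // prints "Zero length BigInteger"
        }
        System.out.println(parseStat(""));      // Optional.empty
        System.out.println(parseStat("1024"));  // Optional[1024]
    }
}
```

The same idea, applied where the stack trace's Option.map call sits, would make `desc formatted ... partition(...)` report missing statistics rather than fail.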
[jira] [Created] (SPARK-30262) Fix NumberFormatException when totalSize is empty
chenliang created SPARK-30262: - Summary: Fix NumberFormatException when totalSize is empty Key: SPARK-30262 URL: https://issues.apache.org/jira/browse/SPARK-30262 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.3 Reporter: chenliang Fix For: 2.4.3, 2.3.2
Since Spark 2.3.0, we can read partition statistics. But in some special cases, statistics such as totalSize, rawDataSize, and rowCount may be empty. When we run a DDL statement such as {code:java}desc formatted <table> partition(<spec>){code}, a NumberFormatException is thrown as below:
{code:java}
spark-sql> desc formatted gulfstream.ods_binlog_business_config_whole partition(year='2019', month='10', day='17', hour='23');
19/10/19 00:02:40 ERROR SparkSQLDriver: Failed in [desc formatted gulfstream.ods_binlog_business_config_whole partition(year='2019', month='10', day='17', hour='23')]
java.lang.NumberFormatException: Zero length BigInteger
at java.math.BigInteger.<init>(BigInteger.java:411)
at java.math.BigInteger.<init>(BigInteger.java:597)
at scala.math.BigInt$.apply(BigInt.scala:77)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$31.apply(HiveClientImpl.scala:1056)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$31.apply(HiveClientImpl.scala:1056)
at scala.Option.map(Option.scala:146)
at org.apache.spark.sql.hive.client.HiveClientImpl$.org$apache$spark$sql$hive$client$HiveClientImpl$$readHiveStats(HiveClientImpl.scala:1056)
at org.apache.spark.sql.hive.client.HiveClientImpl$.fromHivePartition(HiveClientImpl.scala:1048)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionOption$1$$anonfun$apply$16.apply(HiveClientImpl.scala:659)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionOption$1$$anonfun$apply$16.apply(HiveClientImpl.scala:659)
at scala.Option.map(Option.scala:146)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionOption$1.apply(HiveClientImpl.scala:659)
at
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionOption$1.apply(HiveClientImpl.scala:656)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:281)
at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:219)
at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:218)
at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:264)
at org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionOption(HiveClientImpl.scala:656)
at org.apache.spark.sql.hive.client.HiveClient$class.getPartitionOption(HiveClient.scala:194)
at org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionOption(HiveClientImpl.scala:84)
at org.apache.spark.sql.hive.client.HiveClient$class.getPartition(HiveClient.scala:174)
at org.apache.spark.sql.hive.client.HiveClientImpl.getPartition(HiveClientImpl.scala:84)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$getPartition$1.apply(HiveExternalCatalog.scala:1125)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$getPartition$1.apply(HiveExternalCatalog.scala:1124)
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30261) Should not change owner of hive table for some commands like 'alter' operation
[ https://issues.apache.org/jira/browse/SPARK-30261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chenliang updated SPARK-30261: -- Description: For Spark SQL, when we run alter operations on a Hive table, the owner of the table is changed to whoever invoked the operation, which is unreasonable. In fact, the owner should not change in a real production environment; otherwise the authorization check breaks. The problem can be reproduced as described below:
1. First I create a table as user 'xie'; {{desc formatted}} then shows the owner is 'xie':
{code:java}
spark-sql> desc formatted bigdata_test.tt1;
col_name data_type comment
c int NULL
# Detailed Table Information
Database bigdata_test
Table tt1
Owner xie
Created Time Wed Sep 11 11:30:49 CST 2019
Last Access Thu Jan 01 08:00:00 CST 1970
Created By Spark 2.2 or prior
Type MANAGED
Provider hive
Table Properties [PART_LIMIT=1, transient_lastDdlTime=1568172649, LEVEL=1, TTL=60]
Location hdfs://NS1/user/hive_admin/warehouse/bigdata_test.db/tt1
Serde Library org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Storage Properties [serialization.format=1]
Partition Provider Catalog
Time taken: 0.371 seconds, Fetched 18 row(s)
{code}
2. Then, as another user, 'johnchen', I execute {{alter table bigdata_test.tt1 set location 'hdfs://NS1/user/hive_admin/warehouse/bigdata_test.db/tt1'}}; afterwards the owner of the table is 'johnchen', which is unreasonable:
{code:java}
spark-sql> desc formatted bigdata_test.tt1;
col_name data_type comment
c int NULL
# Detailed Table Information
Database bigdata_test
Table tt1
Owner johnchen
Created Time Wed Sep 11 11:30:49 CST 2019
Last Access Thu Jan 01 08:00:00 CST 1970
Created By Spark 2.2 or prior
Type MANAGED
Provider hive
Table Properties [transient_lastDdlTime=1568871017, PART_LIMIT=1, LEVEL=1, TTL=60]
Location
hdfs://NS1/user/hive_admin/warehouse/bigdata_test.db/tt1
Serde Library org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Storage Properties [serialization.format=1]
Partition Provider Catalog
Time taken: 0.041 seconds, Fetched 18 row(s)
{code}
> Should not change owner of hive table for some commands like 'alter' operation
>
> Key: SPARK-30261
> URL: https://issues.apache.org/jira/browse/SPARK-30261
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0, 2.3.0, 2.4.3
> Reporter: chenliang
> Priority: Critical
> Fix For: 2.2.0, 2.3.0, 2.4.3
>
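The expected behavior can be sketched as: when an alter-style command rewrites table metadata, the owner field should be carried over from the existing catalog entry rather than reset to the session user. A minimal Java illustration of the buggy versus the expected behavior; TableMeta and both method names are hypothetical stand-ins, not Spark's CatalogTable API:

```java
public class OwnerPreservationDemo {
    // Hypothetical, simplified stand-in for a catalog table entry.
    static final class TableMeta {
        final String owner;
        final String location;
        TableMeta(String owner, String location) {
            this.owner = owner;
            this.location = location;
        }
    }

    // Reported behavior: the session user running ALTER becomes the new owner.
    static TableMeta alterLocationBuggy(TableMeta existing, String newLocation, String sessionUser) {
        return new TableMeta(sessionUser, newLocation);
    }

    // Expected behavior: keep the original owner; change only what ALTER touches.
    static TableMeta alterLocationFixed(TableMeta existing, String newLocation, String sessionUser) {
        return new TableMeta(existing.owner, newLocation);
    }

    public static void main(String[] args) {
        TableMeta t = new TableMeta("xie", "hdfs://NS1/user/hive_admin/warehouse/bigdata_test.db/tt1");
        System.out.println(alterLocationBuggy(t, t.location, "johnchen").owner);  // johnchen
        System.out.println(alterLocationFixed(t, t.location, "johnchen").owner);  // xie
    }
}
```

Preserving the owner this way keeps authorization checks consistent regardless of which user last ran a DDL command.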
[jira] [Commented] (SPARK-29940) Whether contains schema for this parameter "spark.yarn.historyServer.address"
[ https://issues.apache.org/jira/browse/SPARK-29940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16996224#comment-16996224 ] hehuiyuan commented on SPARK-29940: --- Hi, anyone?
> Whether contains schema for this parameter "spark.yarn.historyServer.address"
>
> Key: SPARK-29940
> URL: https://issues.apache.org/jira/browse/SPARK-29940
> Project: Spark
> Issue Type: Wish
> Components: Documentation
> Affects Versions: 3.0.0
> Reporter: hehuiyuan
> Priority: Minor
> Attachments: image-2019-11-18-15-44-10-358.png, image-2019-11-18-15-45-33-295.png
>
> !image-2019-11-18-15-44-10-358.png|width=815,height=156!
> !image-2019-11-18-15-45-33-295.png|width=673,height=273!
[jira] [Created] (SPARK-30261) Should not change owner of hive table for some commands like 'alter' operation
chenliang created SPARK-30261: - Summary: Should not change owner of hive table for some commands like 'alter' operation Key: SPARK-30261 URL: https://issues.apache.org/jira/browse/SPARK-30261 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.3, 2.3.0, 2.2.0 Reporter: chenliang Fix For: 2.4.3, 2.3.0, 2.2.0
For Spark SQL, when we run alter operations on a Hive table, the owner of the table is changed to whoever invoked the operation, which is unreasonable. In fact, the owner should not change in a real production environment; otherwise the authorization check breaks.
[jira] [Commented] (SPARK-30250) SparkQL div is undocumented
[ https://issues.apache.org/jira/browse/SPARK-30250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16996209#comment-16996209 ] Michael Chirico commented on SPARK-30250: - Doubly great news! Thanks
> SparkQL div is undocumented
>
> Key: SPARK-30250
> URL: https://issues.apache.org/jira/browse/SPARK-30250
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.0.0
> Reporter: Michael Chirico
> Priority: Minor
>
> https://issues.apache.org/jira/browse/SPARK-15407
> Mentions the div operator in SparkQL.
> However, it's undocumented in the SQL API docs:
> https://spark.apache.org/docs/latest/api/sql/index.html
> It's documented in the HiveQL docs:
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
[jira] [Updated] (SPARK-30260) Spark-Shell throw ClassNotFoundException exception for more than one statement to use UDF jar
[ https://issues.apache.org/jira/browse/SPARK-30260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chenliang updated SPARK-30260: -- Target Version/s: 2.4.3, 2.3.0 (was: 2.4.3)
> Spark-Shell throw ClassNotFoundException exception for more than one statement to use UDF jar
>
> Key: SPARK-30260
> URL: https://issues.apache.org/jira/browse/SPARK-30260
> Project: Spark
> Issue Type: Bug
> Components: Spark Shell, SQL
> Affects Versions: 2.2.0, 2.3.0, 2.4.3, 2.4.4
> Reporter: chenliang
> Priority: Major
> Fix For: 2.3.0, 2.4.3
>
> When we start spark-shell and use the UDF in the first statement, it works. But for subsequent statements the jar fails to load into the current classpath and a ClassNotFoundException is thrown. The problem can be reproduced as described below.
> {code:java}
> scala> val res = spark.sql("select bigdata_test.Add(1,2)").show()
> --
> |bigdata_test.Add(1, 2)|
> --
> | 3|
> --
> scala> val res = spark.sql("select bigdata_test.Add(1,2)").show()
> org.apache.spark.sql.AnalysisException: No handler for UDF/UDAF/UDTF 'scala.didi.udf.Add': java.lang.ClassNotFoundException: scala.didi.udf.Add; line 1 pos 8
> at scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:62)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at org.apache.spark.sql.hive.HiveShim$HiveFunctionWrapper.createFunction(HiveShim.scala:251)
> at org.apache.spark.sql.hive.HiveSimpleUDF.function$lzycompute(hiveUDFs.scala:56)
> at org.apache.spark.sql.hive.HiveSimpleUDF.function(hiveUDFs.scala:56)
> at org.apache.spark.sql.hive.HiveSimpleUDF.method$lzycompute(hiveUDFs.scala:60)
> at org.apache.spark.sql.hive.HiveSimpleUDF.method(hiveUDFs.scala:59)
> at org.apache.spark.sql.hive.HiveSimpleUDF.dataType$lzycompute(hiveUDFs.scala:77)
> at org.apache.spark.sql.hive.HiveSimpleUDF.dataType(hiveUDFs.scala:77)
> at
org.apache.spark.sql.hive.HiveSessionCatalog$$anonfun$makeFunctionExpression$3.apply(HiveSessionCatalog.scala:79) > at > org.apache.spark.sql.hive.HiveSessionCatalog$$anonfun$makeFunctionExpression$3.apply(HiveSessionCatalog.scala:71) > at scala.util.Try.getOrElse(Try.scala:79) > at > org.apache.spark.sql.hive.HiveSessionCatalog.makeFunctionExpression(HiveSessionCatalog.scala:71) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$org$apache$spark$sql$catalyst$catalog$SessionCatalog$$makeFunctionBuilder$1.apply(SessionCatalog.scala:1133){code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30260) Spark-Shell throws ClassNotFoundException when more than one statement uses a UDF jar
[ https://issues.apache.org/jira/browse/SPARK-30260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chenliang updated SPARK-30260: -- Fix Version/s: 2.3.0 > Spark-Shell throw ClassNotFoundException exception for more than one > statement to use UDF jar > - > > Key: SPARK-30260 > URL: https://issues.apache.org/jira/browse/SPARK-30260 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 2.2.0, 2.3.0, 2.4.3, 2.4.4 >Reporter: chenliang >Priority: Major > Fix For: 2.3.0, 2.4.3 > > > When we start spark-shell and use the udf for the first statement ,it's ok. > But for the other statements it failed to load jar to current classpath and > would throw ClassNotFoundException,the problem can be reproduced as described > in the below. > {code:java} > scala> val res = spark.sql("select bigdata_test.Add(1,2)").show() > -- > |bigdata_test.Add(1, 2)| > -- > | 3| > -- > scala> val res = spark.sql("select bigdata_test.Add(1,2)").show() > org.apache.spark.sql.AnalysisException: No handler for UDF/UDAF/UDTF > 'scala.didi.udf.Add': java.lang.ClassNotFoundException: scala.didi.udf.Add; > line 1 pos 8 > at > scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:62) > at java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at java.lang.ClassLoader.loadClass(ClassLoader.java:357) > at > org.apache.spark.sql.hive.HiveShim$HiveFunctionWrapper.createFunction(HiveShim.scala:251) > at > org.apache.spark.sql.hive.HiveSimpleUDF.function$lzycompute(hiveUDFs.scala:56) > at org.apache.spark.sql.hive.HiveSimpleUDF.function(hiveUDFs.scala:56) > at > org.apache.spark.sql.hive.HiveSimpleUDF.method$lzycompute(hiveUDFs.scala:60) > at org.apache.spark.sql.hive.HiveSimpleUDF.method(hiveUDFs.scala:59) > at > org.apache.spark.sql.hive.HiveSimpleUDF.dataType$lzycompute(hiveUDFs.scala:77) > at org.apache.spark.sql.hive.HiveSimpleUDF.dataType(hiveUDFs.scala:77) > at > 
org.apache.spark.sql.hive.HiveSessionCatalog$$anonfun$makeFunctionExpression$3.apply(HiveSessionCatalog.scala:79) > at > org.apache.spark.sql.hive.HiveSessionCatalog$$anonfun$makeFunctionExpression$3.apply(HiveSessionCatalog.scala:71) > at scala.util.Try.getOrElse(Try.scala:79) > at > org.apache.spark.sql.hive.HiveSessionCatalog.makeFunctionExpression(HiveSessionCatalog.scala:71) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$org$apache$spark$sql$catalyst$catalog$SessionCatalog$$makeFunctionBuilder$1.apply(SessionCatalog.scala:1133){code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30260) Spark-Shell throws ClassNotFoundException when more than one statement uses a UDF jar
[ https://issues.apache.org/jira/browse/SPARK-30260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chenliang updated SPARK-30260: -- Description: When we start spark-shell and use the udf for the first statement ,it's ok. But for the other statements it failed to load jar to current classpath and would throw ClassNotFoundException,the problem can be reproduced as described in the below. {{}} {code:java} scala> val res = spark.sql("select bigdata_test.Add(1,2)").show() -- |bigdata_test.Add(1, 2)| -- | 3| -- scala> val res = spark.sql("select bigdata_test.Add(1,2)").show() org.apache.spark.sql.AnalysisException: No handler for UDF/UDAF/UDTF 'scala.didi.udf.Add': java.lang.ClassNotFoundException: scala.didi.udf.Add; line 1 pos 8 at scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:62) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at org.apache.spark.sql.hive.HiveShim$HiveFunctionWrapper.createFunction(HiveShim.scala:251) at org.apache.spark.sql.hive.HiveSimpleUDF.function$lzycompute(hiveUDFs.scala:56) at org.apache.spark.sql.hive.HiveSimpleUDF.function(hiveUDFs.scala:56) at org.apache.spark.sql.hive.HiveSimpleUDF.method$lzycompute(hiveUDFs.scala:60) at org.apache.spark.sql.hive.HiveSimpleUDF.method(hiveUDFs.scala:59) at org.apache.spark.sql.hive.HiveSimpleUDF.dataType$lzycompute(hiveUDFs.scala:77) at org.apache.spark.sql.hive.HiveSimpleUDF.dataType(hiveUDFs.scala:77) at org.apache.spark.sql.hive.HiveSessionCatalog$$anonfun$makeFunctionExpression$3.apply(HiveSessionCatalog.scala:79) at org.apache.spark.sql.hive.HiveSessionCatalog$$anonfun$makeFunctionExpression$3.apply(HiveSessionCatalog.scala:71) at scala.util.Try.getOrElse(Try.scala:79) at org.apache.spark.sql.hive.HiveSessionCatalog.makeFunctionExpression(HiveSessionCatalog.scala:71) at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$org$apache$spark$sql$catalyst$catalog$SessionCatalog$$makeFunctionBuilder$1.apply(SessionCatalog.scala:1133){code} was: When we start spark-shell and use the udf for the first statement ,it's ok. But for the other statements it failed to load jar to current classpath and would throw ClassNotFoundException,the problem can be reproduced as described in the below. {{scala> val res = spark.sql("select bigdata_test.Add(1,2)").show()}} {{+--+}} {{|bigdata_test.Add(1, 2)|}} {{+--+}} {{| 3|}} {{+--+}} {{scala> val res = spark.sql("select bigdata_test.Add(1,2)").show()}} {{org.apache.spark.sql.AnalysisException: No handler for UDF/UDAF/UDTF 'scala.didi.udf.Add': java.lang.ClassNotFoundException: scala.didi.udf.Add; line 1 pos 8}} {{ }}{{at scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:62)}} {{ }}{{at java.lang.ClassLoader.loadClass(ClassLoader.java:424)}} {{ }}{{at java.lang.ClassLoader.loadClass(ClassLoader.java:357)}} {{ }}{{at org.apache.spark.sql.hive.HiveShim$HiveFunctionWrapper.createFunction(HiveShim.scala:251)}} {{ }}{{at org.apache.spark.sql.hive.HiveSimpleUDF.function$lzycompute(hiveUDFs.scala:56)}} {{ }}{{at org.apache.spark.sql.hive.HiveSimpleUDF.function(hiveUDFs.scala:56)}} {{ }}{{at org.apache.spark.sql.hive.HiveSimpleUDF.method$lzycompute(hiveUDFs.scala:60)}} {{ }}{{at org.apache.spark.sql.hive.HiveSimpleUDF.method(hiveUDFs.scala:59)}} {{ }}{{at org.apache.spark.sql.hive.HiveSimpleUDF.dataType$lzycompute(hiveUDFs.scala:77)}} {{ }}{{at org.apache.spark.sql.hive.HiveSimpleUDF.dataType(hiveUDFs.scala:77)}} {{ }}{{at org.apache.spark.sql.hive.HiveSessionCatalog$$anonfun$makeFunctionExpression$3.apply(HiveSessionCatalog.scala:79)}} {{ }}{{at org.apache.spark.sql.hive.HiveSessionCatalog$$anonfun$makeFunctionExpression$3.apply(HiveSessionCatalog.scala:71)}} {{ }}{{at scala.util.Try.getOrElse(Try.scala:79)}} {{ }}{{at 
org.apache.spark.sql.hive.HiveSessionCatalog.makeFunctionExpression(HiveSessionCatalog.scala:71)}} {{ }}{{at org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$org$apache$spark$sql$catalyst$catalog$SessionCatalog$$makeFunctionBuilder$1.apply(SessionCatalog.scala:1133)}} > Spark-Shell throw ClassNotFoundException exception for more than one > statement to use UDF jar > - > > Key: SPARK-30260 > URL: https://issues.apache.org/jira/browse/SPARK-30260 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 2.2.0, 2.3.0, 2.4.3, 2.4.4 >Reporter: chenliang >Priority: Major > Fix For: 2.4.3 > > > When
[jira] [Updated] (SPARK-30260) Spark-Shell throws ClassNotFoundException when more than one statement uses a UDF jar
[ https://issues.apache.org/jira/browse/SPARK-30260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chenliang updated SPARK-30260: -- Description: When we start spark-shell and use the udf for the first statement ,it's ok. But for the other statements it failed to load jar to current classpath and would throw ClassNotFoundException,the problem can be reproduced as described in the below. {code:java} scala> val res = spark.sql("select bigdata_test.Add(1,2)").show() -- |bigdata_test.Add(1, 2)| -- | 3| -- scala> val res = spark.sql("select bigdata_test.Add(1,2)").show() org.apache.spark.sql.AnalysisException: No handler for UDF/UDAF/UDTF 'scala.didi.udf.Add': java.lang.ClassNotFoundException: scala.didi.udf.Add; line 1 pos 8 at scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:62) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at org.apache.spark.sql.hive.HiveShim$HiveFunctionWrapper.createFunction(HiveShim.scala:251) at org.apache.spark.sql.hive.HiveSimpleUDF.function$lzycompute(hiveUDFs.scala:56) at org.apache.spark.sql.hive.HiveSimpleUDF.function(hiveUDFs.scala:56) at org.apache.spark.sql.hive.HiveSimpleUDF.method$lzycompute(hiveUDFs.scala:60) at org.apache.spark.sql.hive.HiveSimpleUDF.method(hiveUDFs.scala:59) at org.apache.spark.sql.hive.HiveSimpleUDF.dataType$lzycompute(hiveUDFs.scala:77) at org.apache.spark.sql.hive.HiveSimpleUDF.dataType(hiveUDFs.scala:77) at org.apache.spark.sql.hive.HiveSessionCatalog$$anonfun$makeFunctionExpression$3.apply(HiveSessionCatalog.scala:79) at org.apache.spark.sql.hive.HiveSessionCatalog$$anonfun$makeFunctionExpression$3.apply(HiveSessionCatalog.scala:71) at scala.util.Try.getOrElse(Try.scala:79) at org.apache.spark.sql.hive.HiveSessionCatalog.makeFunctionExpression(HiveSessionCatalog.scala:71) at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$org$apache$spark$sql$catalyst$catalog$SessionCatalog$$makeFunctionBuilder$1.apply(SessionCatalog.scala:1133){code} was: When we start spark-shell and use the udf for the first statement ,it's ok. But for the other statements it failed to load jar to current classpath and would throw ClassNotFoundException,the problem can be reproduced as described in the below. {{}} {code:java} scala> val res = spark.sql("select bigdata_test.Add(1,2)").show() -- |bigdata_test.Add(1, 2)| -- | 3| -- scala> val res = spark.sql("select bigdata_test.Add(1,2)").show() org.apache.spark.sql.AnalysisException: No handler for UDF/UDAF/UDTF 'scala.didi.udf.Add': java.lang.ClassNotFoundException: scala.didi.udf.Add; line 1 pos 8 at scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:62) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at org.apache.spark.sql.hive.HiveShim$HiveFunctionWrapper.createFunction(HiveShim.scala:251) at org.apache.spark.sql.hive.HiveSimpleUDF.function$lzycompute(hiveUDFs.scala:56) at org.apache.spark.sql.hive.HiveSimpleUDF.function(hiveUDFs.scala:56) at org.apache.spark.sql.hive.HiveSimpleUDF.method$lzycompute(hiveUDFs.scala:60) at org.apache.spark.sql.hive.HiveSimpleUDF.method(hiveUDFs.scala:59) at org.apache.spark.sql.hive.HiveSimpleUDF.dataType$lzycompute(hiveUDFs.scala:77) at org.apache.spark.sql.hive.HiveSimpleUDF.dataType(hiveUDFs.scala:77) at org.apache.spark.sql.hive.HiveSessionCatalog$$anonfun$makeFunctionExpression$3.apply(HiveSessionCatalog.scala:79) at org.apache.spark.sql.hive.HiveSessionCatalog$$anonfun$makeFunctionExpression$3.apply(HiveSessionCatalog.scala:71) at scala.util.Try.getOrElse(Try.scala:79) at org.apache.spark.sql.hive.HiveSessionCatalog.makeFunctionExpression(HiveSessionCatalog.scala:71) at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$org$apache$spark$sql$catalyst$catalog$SessionCatalog$$makeFunctionBuilder$1.apply(SessionCatalog.scala:1133){code} > Spark-Shell throw ClassNotFoundException exception for more than one > statement to use UDF jar > - > > Key: SPARK-30260 > URL: https://issues.apache.org/jira/browse/SPARK-30260 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 2.2.0, 2.3.0, 2.4.3, 2.4.4 >Reporter: chenliang >Priority: Major > Fix For: 2.4.3 > > > When we start spark-shell and use the udf for the first statement ,it's ok. > But for the other statements it failed to load
[jira] [Updated] (SPARK-30260) Spark-Shell throws ClassNotFoundException when more than one statement uses a UDF jar
[ https://issues.apache.org/jira/browse/SPARK-30260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chenliang updated SPARK-30260: -- Description: When we start spark-shell and use the udf for the first statement ,it's ok. But for the other statements it failed to load jar to current classpath and would throw ClassNotFoundException,the problem can be reproduced as described in the below. {{scala> val res = spark.sql("select bigdata_test.Add(1,2)").show()}} {{+--+}} {{|bigdata_test.Add(1, 2)|}} {{+--+}} {{| 3|}} {{+--+}} {{scala> val res = spark.sql("select bigdata_test.Add(1,2)").show()}} {{org.apache.spark.sql.AnalysisException: No handler for UDF/UDAF/UDTF 'scala.didi.udf.Add': java.lang.ClassNotFoundException: scala.didi.udf.Add; line 1 pos 8}} {{ }}{{at scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:62)}} {{ }}{{at java.lang.ClassLoader.loadClass(ClassLoader.java:424)}} {{ }}{{at java.lang.ClassLoader.loadClass(ClassLoader.java:357)}} {{ }}{{at org.apache.spark.sql.hive.HiveShim$HiveFunctionWrapper.createFunction(HiveShim.scala:251)}} {{ }}{{at org.apache.spark.sql.hive.HiveSimpleUDF.function$lzycompute(hiveUDFs.scala:56)}} {{ }}{{at org.apache.spark.sql.hive.HiveSimpleUDF.function(hiveUDFs.scala:56)}} {{ }}{{at org.apache.spark.sql.hive.HiveSimpleUDF.method$lzycompute(hiveUDFs.scala:60)}} {{ }}{{at org.apache.spark.sql.hive.HiveSimpleUDF.method(hiveUDFs.scala:59)}} {{ }}{{at org.apache.spark.sql.hive.HiveSimpleUDF.dataType$lzycompute(hiveUDFs.scala:77)}} {{ }}{{at org.apache.spark.sql.hive.HiveSimpleUDF.dataType(hiveUDFs.scala:77)}} {{ }}{{at org.apache.spark.sql.hive.HiveSessionCatalog$$anonfun$makeFunctionExpression$3.apply(HiveSessionCatalog.scala:79)}} {{ }}{{at org.apache.spark.sql.hive.HiveSessionCatalog$$anonfun$makeFunctionExpression$3.apply(HiveSessionCatalog.scala:71)}} {{ }}{{at scala.util.Try.getOrElse(Try.scala:79)}} {{ }}{{at 
org.apache.spark.sql.hive.HiveSessionCatalog.makeFunctionExpression(HiveSessionCatalog.scala:71)}} {{ }}{{at org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$org$apache$spark$sql$catalyst$catalog$SessionCatalog$$makeFunctionBuilder$1.apply(SessionCatalog.scala:1133)}} was: When we start spark-shell and use the udf for the first statement ,it's ok. But for the other statements it failed to load jar to current classpath and would throw ClassNotFoundException,the problem can be reproduced as described in the below. > Spark-Shell throw ClassNotFoundException exception for more than one > statement to use UDF jar > - > > Key: SPARK-30260 > URL: https://issues.apache.org/jira/browse/SPARK-30260 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 2.2.0, 2.3.0, 2.4.3, 2.4.4 >Reporter: chenliang >Priority: Major > Fix For: 2.4.3 > > > When we start spark-shell and use the udf for the first statement ,it's ok. > But for the other statements it failed to load jar to current classpath and > would throw ClassNotFoundException,the problem can be reproduced as described > in the below. 
> > {{scala> val res = spark.sql("select bigdata_test.Add(1,2)").show()}} > {{+--+}} > {{|bigdata_test.Add(1, 2)|}} > {{+--+}} > {{| 3|}} > {{+--+}} > {{scala> val res = spark.sql("select bigdata_test.Add(1,2)").show()}} > {{org.apache.spark.sql.AnalysisException: No handler for UDF/UDAF/UDTF > 'scala.didi.udf.Add': java.lang.ClassNotFoundException: scala.didi.udf.Add; > line 1 pos 8}} > {{ }}{{at > scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:62)}} > {{ }}{{at java.lang.ClassLoader.loadClass(ClassLoader.java:424)}} > {{ }}{{at java.lang.ClassLoader.loadClass(ClassLoader.java:357)}} > {{ }}{{at > org.apache.spark.sql.hive.HiveShim$HiveFunctionWrapper.createFunction(HiveShim.scala:251)}} > {{ }}{{at > org.apache.spark.sql.hive.HiveSimpleUDF.function$lzycompute(hiveUDFs.scala:56)}} > {{ }}{{at > org.apache.spark.sql.hive.HiveSimpleUDF.function(hiveUDFs.scala:56)}} > {{ }}{{at > org.apache.spark.sql.hive.HiveSimpleUDF.method$lzycompute(hiveUDFs.scala:60)}} > {{ }}{{at org.apache.spark.sql.hive.HiveSimpleUDF.method(hiveUDFs.scala:59)}} > {{ }}{{at > org.apache.spark.sql.hive.HiveSimpleUDF.dataType$lzycompute(hiveUDFs.scala:77)}} > {{ }}{{at > org.apache.spark.sql.hive.HiveSimpleUDF.dataType(hiveUDFs.scala:77)}} > {{ }}{{at > org.apache.spark.sql.hive.HiveSessionCatalog$$anonfun$makeFunctionExpression$3.apply(HiveSessionCatalog.scala:79)}} > {{ }}{{at > org.apache.spark.sql.hive.Hiv
[jira] [Updated] (SPARK-30260) Spark-Shell throws ClassNotFoundException when more than one statement uses a UDF jar
[ https://issues.apache.org/jira/browse/SPARK-30260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chenliang updated SPARK-30260: -- Description: When we start spark-shell and use the udf for the first statement ,it's ok. But for the other statements it failed to load jar to current classpath and would throw ClassNotFoundException,the problem can be reproduced as described in the below. was:When we start spark-shell and use the udf for the first statement ,it's ok. But for the other statements it failed to load jar to current classpath and would throw > Spark-Shell throw ClassNotFoundException exception for more than one > statement to use UDF jar > - > > Key: SPARK-30260 > URL: https://issues.apache.org/jira/browse/SPARK-30260 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 2.2.0, 2.3.0, 2.4.3, 2.4.4 >Reporter: chenliang >Priority: Major > Fix For: 2.4.3 > > > When we start spark-shell and use the udf for the first statement ,it's ok. > But for the other statements it failed to load jar to current classpath and > would throw ClassNotFoundException,the problem can be reproduced as described > in the below.
[jira] [Updated] (SPARK-30260) Spark-Shell throws ClassNotFoundException when more than one statement uses a UDF jar
[ https://issues.apache.org/jira/browse/SPARK-30260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chenliang updated SPARK-30260: -- Description: When we start spark-shell and use the udf for the first statement ,it's ok. But for the other statements it failed to load jar to current classpath and would throw > Spark-Shell throw ClassNotFoundException exception for more than one > statement to use UDF jar > - > > Key: SPARK-30260 > URL: https://issues.apache.org/jira/browse/SPARK-30260 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 2.2.0, 2.3.0, 2.4.3, 2.4.4 >Reporter: chenliang >Priority: Major > Fix For: 2.4.3 > > > When we start spark-shell and use the udf for the first statement ,it's ok. > But for the other statements it failed to load jar to current classpath and > would throw
[jira] [Created] (SPARK-30260) Spark-Shell throws ClassNotFoundException when more than one statement uses a UDF jar
chenliang created SPARK-30260: - Summary: Spark-Shell throw ClassNotFoundException exception for more than one statement to use UDF jar Key: SPARK-30260 URL: https://issues.apache.org/jira/browse/SPARK-30260 Project: Spark Issue Type: Bug Components: Spark Shell, SQL Affects Versions: 2.4.4, 2.4.3, 2.3.0, 2.2.0 Reporter: chenliang Fix For: 2.4.3
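The failure mode in the SPARK-30260 reports above — the UDF jar is visible to the classloader for the first statement only, so a later statement cannot find the class — can be illustrated with a small stand-alone sketch. This is a hypothetical toy model in plain Python, not Spark's actual classloading code; only the class name `scala.didi.udf.Add` is taken from the report.

```python
# Toy model of the reported behaviour: the UDF's jar is registered with the
# classloader only when the first statement is analyzed; a later statement
# that resolves the function against a fresh loader cannot find the class.
class ToyClassLoader:
    def __init__(self):
        self.known = set()

    def add_jar(self):
        # Pretend the jar contributes this one class (name from the report).
        self.known.add("scala.didi.udf.Add")

    def load(self, name):
        if name not in self.known:
            raise LookupError(f"ClassNotFoundException: {name}")
        return name

first = ToyClassLoader()
first.add_jar()  # first statement: jar added to the current classpath
assert first.load("scala.didi.udf.Add") == "scala.didi.udf.Add"

second = ToyClassLoader()  # later statement: jar is never re-registered
try:
    second.load("scala.didi.udf.Add")
except LookupError as e:
    print(e)  # ClassNotFoundException: scala.didi.udf.Add
```

A fix along these lines would keep the jar registered with (or re-register it on) whatever loader each statement's function resolution uses.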
[jira] [Updated] (SPARK-30259) CREATE TABLE throws an error when the session catalog is specified
[ https://issues.apache.org/jira/browse/SPARK-30259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hu Fuwang updated SPARK-30259: -- Description: Spark throw error when the session catalog is specified explicitly in "CREATE TABLE" and "CREATE TABLE AS SELECT" command, eg. {code:java} CREATE TABLE spark_catalog.tbl USING json AS SELECT 1 AS i; {code} the error message is like below: {noformat} 19/12/14 10:56:08 INFO HiveMetaStore: 0: get_table : db=spark_catalog tbl=tbl 19/12/14 10:56:08 INFO audit: ugi=fuwhu ip=unknown-ip-addr cmd=get_table : db=spark_catalog tbl=tbl 19/12/14 10:56:08 INFO HiveMetaStore: 0: get_database: spark_catalog 19/12/14 10:56:08 INFO audit: ugi=fuwhu ip=unknown-ip-addr cmd=get_database: spark_catalog 19/12/14 10:56:08 WARN ObjectStore: Failed to get database spark_catalog, returning NoSuchObjectException Error in query: Database 'spark_catalog' not found;{noformat} was: Spark throw error when the session catalog is specified explicitly in "CREATE TABLE" and "CREATE TABLE AS SELECT" command, eg. 
{code:java} // code placeholder CREATE TABLE spark_catalog.tbl USING json AS SELECT 1 AS i; {code} the error message is like below: {noformat} 19/12/14 10:56:08 INFO HiveMetaStore: 0: get_table : db=spark_catalog tbl=tbl 19/12/14 10:56:08 INFO audit: ugi=fuwhu ip=unknown-ip-addr cmd=get_table : db=spark_catalog tbl=tbl 19/12/14 10:56:08 INFO HiveMetaStore: 0: get_database: spark_catalog 19/12/14 10:56:08 INFO audit: ugi=fuwhu ip=unknown-ip-addr cmd=get_database: spark_catalog 19/12/14 10:56:08 WARN ObjectStore: Failed to get database spark_catalog, returning NoSuchObjectException Error in query: Database 'spark_catalog' not found;{noformat} > CREATE TABLE throw error when session catalog specified > --- > > Key: SPARK-30259 > URL: https://issues.apache.org/jira/browse/SPARK-30259 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Hu Fuwang >Priority: Major > > Spark throw error when the session catalog is specified explicitly in "CREATE > TABLE" and "CREATE TABLE AS SELECT" command, eg. > {code:java} > CREATE TABLE spark_catalog.tbl USING json AS SELECT 1 AS i; > {code} > the error message is like below: > {noformat} > 19/12/14 10:56:08 INFO HiveMetaStore: 0: get_table : db=spark_catalog tbl=tbl > 19/12/14 10:56:08 INFO audit: ugi=fuwhu ip=unknown-ip-addr cmd=get_table > : db=spark_catalog tbl=tbl > 19/12/14 10:56:08 INFO HiveMetaStore: 0: get_database: spark_catalog > 19/12/14 10:56:08 INFO audit: ugi=fuwhu ip=unknown-ip-addr > cmd=get_database: spark_catalog > 19/12/14 10:56:08 WARN ObjectStore: Failed to get database spark_catalog, > returning NoSuchObjectException > Error in query: Database 'spark_catalog' not found;{noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30259) CREATE TABLE throws an error when the session catalog is specified
[ https://issues.apache.org/jira/browse/SPARK-30259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hu Fuwang updated SPARK-30259: -- Description: Spark throw error when the session catalog is specified explicitly in "CREATE TABLE" and "CREATE TABLE AS SELECT" command, eg. {code:java} // code placeholder CREATE TABLE spark_catalog.tbl USING json AS SELECT 1 AS i; {code} the error message is like below: {noformat} 19/12/14 10:56:08 INFO HiveMetaStore: 0: get_table : db=spark_catalog tbl=tbl 19/12/14 10:56:08 INFO audit: ugi=fuwhu ip=unknown-ip-addr cmd=get_table : db=spark_catalog tbl=tbl 19/12/14 10:56:08 INFO HiveMetaStore: 0: get_database: spark_catalog 19/12/14 10:56:08 INFO audit: ugi=fuwhu ip=unknown-ip-addr cmd=get_database: spark_catalog 19/12/14 10:56:08 WARN ObjectStore: Failed to get database spark_catalog, returning NoSuchObjectException Error in query: Database 'spark_catalog' not found;{noformat} was: Spark throw error when the session catalog is specified explicitly in the CREATE TABLE AS SELECT command, eg. 
{code:java} // code placeholder CREATE TABLE spark_catalog.tbl USING json AS SELECT 1 AS i; {code} the error message is like below: {noformat} 19/12/14 10:56:08 INFO HiveMetaStore: 0: get_table : db=spark_catalog tbl=tbl 19/12/14 10:56:08 INFO audit: ugi=fuwhu ip=unknown-ip-addr cmd=get_table : db=spark_catalog tbl=tbl 19/12/14 10:56:08 INFO HiveMetaStore: 0: get_database: spark_catalog 19/12/14 10:56:08 INFO audit: ugi=fuwhu ip=unknown-ip-addr cmd=get_database: spark_catalog 19/12/14 10:56:08 WARN ObjectStore: Failed to get database spark_catalog, returning NoSuchObjectException Error in query: Database 'spark_catalog' not found;{noformat} > CREATE TABLE throw error when session catalog specified > --- > > Key: SPARK-30259 > URL: https://issues.apache.org/jira/browse/SPARK-30259 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Hu Fuwang >Priority: Major > > Spark throw error when the session catalog is specified explicitly in "CREATE > TABLE" and "CREATE TABLE AS SELECT" command, eg. > {code:java} > // code placeholder > CREATE TABLE spark_catalog.tbl USING json AS SELECT 1 AS i; > {code} > the error message is like below: > {noformat} > 19/12/14 10:56:08 INFO HiveMetaStore: 0: get_table : db=spark_catalog tbl=tbl > 19/12/14 10:56:08 INFO audit: ugi=fuwhu ip=unknown-ip-addr cmd=get_table > : db=spark_catalog tbl=tbl > 19/12/14 10:56:08 INFO HiveMetaStore: 0: get_database: spark_catalog > 19/12/14 10:56:08 INFO audit: ugi=fuwhu ip=unknown-ip-addr > cmd=get_database: spark_catalog > 19/12/14 10:56:08 WARN ObjectStore: Failed to get database spark_catalog, > returning NoSuchObjectException > Error in query: Database 'spark_catalog' not found;{noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30259) CREATE TABLE throws an error when the session catalog is specified
[ https://issues.apache.org/jira/browse/SPARK-30259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hu Fuwang updated SPARK-30259: -- Summary: CREATE TABLE throw error when session catalog specified (was: CREATE TABLE AS SELECT throw error when session catalog specified) > CREATE TABLE throw error when session catalog specified > --- > > Key: SPARK-30259 > URL: https://issues.apache.org/jira/browse/SPARK-30259 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Hu Fuwang >Priority: Major > > Spark throw error when the session catalog is specified explicitly in the > CREATE TABLE AS SELECT command, eg. > {code:java} > // code placeholder > CREATE TABLE spark_catalog.tbl USING json AS SELECT 1 AS i; > {code} > the error message is like below: > {noformat} > 19/12/14 10:56:08 INFO HiveMetaStore: 0: get_table : db=spark_catalog tbl=tbl > 19/12/14 10:56:08 INFO audit: ugi=fuwhu ip=unknown-ip-addr cmd=get_table > : db=spark_catalog tbl=tbl > 19/12/14 10:56:08 INFO HiveMetaStore: 0: get_database: spark_catalog > 19/12/14 10:56:08 INFO audit: ugi=fuwhu ip=unknown-ip-addr > cmd=get_database: spark_catalog > 19/12/14 10:56:08 WARN ObjectStore: Failed to get database spark_catalog, > returning NoSuchObjectException > Error in query: Database 'spark_catalog' not found;{noformat}
[jira] [Updated] (SPARK-30259) CREATE TABLE AS SELECT throws an error when the session catalog is specified
[ https://issues.apache.org/jira/browse/SPARK-30259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hu Fuwang updated SPARK-30259: -- Description: Spark throw error when the session catalog is specified explicitly in the CREATE TABLE AS SELECT command, eg. {code:java} // code placeholder CREATE TABLE spark_catalog.tbl USING json AS SELECT 1 AS i; {code} the error message is like below: {noformat} 19/12/14 10:56:08 INFO HiveMetaStore: 0: get_table : db=spark_catalog tbl=tbl 19/12/14 10:56:08 INFO audit: ugi=fuwhu ip=unknown-ip-addr cmd=get_table : db=spark_catalog tbl=tbl 19/12/14 10:56:08 INFO HiveMetaStore: 0: get_database: spark_catalog 19/12/14 10:56:08 INFO audit: ugi=fuwhu ip=unknown-ip-addr cmd=get_database: spark_catalog 19/12/14 10:56:08 WARN ObjectStore: Failed to get database spark_catalog, returning NoSuchObjectException Error in query: Database 'spark_catalog' not found;{noformat} > CREATE TABLE AS SELECT throw error when session catalog specified > - > > Key: SPARK-30259 > URL: https://issues.apache.org/jira/browse/SPARK-30259 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Hu Fuwang >Priority: Major > > Spark throw error when the session catalog is specified explicitly in the > CREATE TABLE AS SELECT command, eg. 
> > {code:java} > // code placeholder > CREATE TABLE spark_catalog.tbl USING json AS SELECT 1 AS i; > {code} > the error message is like below: > {noformat} > 19/12/14 10:56:08 INFO HiveMetaStore: 0: get_table : db=spark_catalog tbl=tbl > 19/12/14 10:56:08 INFO audit: ugi=fuwhu ip=unknown-ip-addr cmd=get_table > : db=spark_catalog tbl=tbl > 19/12/14 10:56:08 INFO HiveMetaStore: 0: get_database: spark_catalog > 19/12/14 10:56:08 INFO audit: ugi=fuwhu ip=unknown-ip-addr > cmd=get_database: spark_catalog > 19/12/14 10:56:08 WARN ObjectStore: Failed to get database spark_catalog, > returning NoSuchObjectException > Error in query: Database 'spark_catalog' not found;{noformat}
[jira] [Created] (SPARK-30259) CREATE TABLE AS SELECT throws an error when the session catalog is specified
Hu Fuwang created SPARK-30259: - Summary: CREATE TABLE AS SELECT throw error when session catalog specified Key: SPARK-30259 URL: https://issues.apache.org/jira/browse/SPARK-30259 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Hu Fuwang
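The SPARK-30259 reports above come down to name resolution: the explicit session-catalog prefix `spark_catalog` is being interpreted as a database name. A minimal sketch of the intended behavior follows, in plain Python; `SESSION_CATALOG`, `resolve`, and the `default` fallback are hypothetical illustrations, not Spark's actual resolution code.

```python
# Hypothetical sketch: strip an explicit session-catalog prefix from a
# multi-part table identifier before interpreting the remaining parts as
# database.table, so "spark_catalog.tbl" does not trigger a lookup of a
# database literally named "spark_catalog".
SESSION_CATALOG = "spark_catalog"

def resolve(name, current_db="default"):
    parts = name.split(".")
    if parts and parts[0] == SESSION_CATALOG:
        parts = parts[1:]          # drop the catalog prefix
    if len(parts) == 1:
        return current_db, parts[0]  # bare table name: use current database
    return parts[0], parts[1]        # db.table

assert resolve("spark_catalog.tbl") == ("default", "tbl")  # no bogus database lookup
assert resolve("db1.tbl") == ("db1", "tbl")
```

With resolution like this, `CREATE TABLE spark_catalog.tbl ...` targets table `tbl` in the current database instead of failing with `Database 'spark_catalog' not found`.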
[jira] [Updated] (SPARK-30190) HistoryServerDiskManager will fail on appStoreDir in s3
[ https://issues.apache.org/jira/browse/SPARK-30190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30190: -- Affects Version/s: (was: 2.4.4) 3.0.0 > HistoryServerDiskManager will fail on appStoreDir in s3 > --- > > Key: SPARK-30190 > URL: https://issues.apache.org/jira/browse/SPARK-30190 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: thierry accart >Priority: Major > > Hi > While setting spark.eventLog.dir to s3a://... I realized that it *requires > destination directory to preexists for S3* > This is explained I think in HistoryServerDiskManager's appStoreDir: it tries > check if directory exists or can be created > {{if (!appStoreDir.isDirectory() && !appStoreDir.mkdir()) \{throw new > IllegalArgumentException(s"Failed to create app directory ($appStoreDir).")}}} > But in S3, a directory does not exists and cannot be created: directories > don't exists by themselves, they are only materialized due to existence of > objects. > Before proposing a patch, I wanted to know what are the prefered options : > should we have a spark option to skip the appStoreDir test, or skip it only > when a particular scheme is set, have a custom implementation of > HistoryServerDiskManager ...? > > _Note for people facing the {{IllegalArgumentException:}} {{Failed to create > app directory}} *you just have to put an empty file in bucket destination > 'path'*._
[jira] [Commented] (SPARK-30218) Columns used in inequality conditions for joins not resolved correctly in case of common lineage
[ https://issues.apache.org/jira/browse/SPARK-30218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16996084#comment-16996084 ] Dongjoon Hyun commented on SPARK-30218: --- Thank you for reporting with the investigation result, [~FC]. > Columns used in inequality conditions for joins not resolved correctly in > case of common lineage > > > Key: SPARK-30218 > URL: https://issues.apache.org/jira/browse/SPARK-30218 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.4, 2.4.4 >Reporter: Francesco Cavrini >Priority: Major > Labels: correctness > > When columns from different data-frames that have a common lineage are used > in inequality conditions in joins, they are not resolved correctly. In > particular, both the column from the left DF and the one from the right DF > are resolved to the same column, thus making the inequality condition either > always satisfied or always not-satisfied. > Minimal example to reproduce follows. > {code:python} > import pyspark.sql.functions as F > data = spark.createDataFrame([["id1", "A", 0], ["id1", "A", 1], ["id2", "A", > 2], ["id2", "A", 3], ["id1", "B", 1] , ["id1", "B", 5], ["id2", "B", 10]], > ["id", "kind", "timestamp"]) > df_left = data.where(F.col("kind") == "A").alias("left") > df_right = data.where(F.col("kind") == "B").alias("right") > conds = [df_left["id"] == df_right["id"]] > conds.append(df_right["timestamp"].between(df_left["timestamp"], > df_left["timestamp"] + 2)) > res = df_left.join(df_right, conds, how="left") > {code} > The result is: > | id|kind|timestamp| id|kind|timestamp| > |id1| A|0|id1| B|1| > |id1| A|0|id1| B|5| > |id1| A|1|id1| B|1| > |id1| A|1|id1| B|5| > |id2| A|2|id2| B| 10| > |id2| A|3|id2| B| 10| > which violates the condition that the timestamp from the right DF should be > between df_left["timestamp"] and df_left["timestamp"] + 2. > The plan shows the problem in the column resolution. 
> {code:bash} > == Parsed Logical Plan == > Join LeftOuter, ((id#0 = id#36) && ((timestamp#2L >= timestamp#2L) && > (timestamp#2L <= (timestamp#2L + cast(2 as bigint) > :- SubqueryAlias `left` > : +- Filter (kind#1 = A) > : +- LogicalRDD [id#0, kind#1, timestamp#2L], false > +- SubqueryAlias `right` >+- Filter (kind#37 = B) > +- LogicalRDD [id#36, kind#37, timestamp#38L], false > {code} > Note, the columns used in the equality condition of the join have been > correctly resolved. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
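Until the resolution bug is fixed, one commonly suggested workaround (an assumption on my part, not something proposed in this thread) is to express the join condition as a SQL string against the DataFrame aliases, e.g. via `F.expr(...)`, so that columns are resolved by alias name rather than through `Column` objects sharing a common lineage. A small sketch of building such a condition; the helper name is made up:

```python
def between_cond(left_alias: str, right_alias: str, col: str, window: int) -> str:
    """Build an alias-qualified join condition string, avoiding Column objects
    that the left and right DataFrames share through a common lineage."""
    return (f"{left_alias}.id = {right_alias}.id AND "
            f"{right_alias}.{col} BETWEEN {left_alias}.{col} "
            f"AND {left_alias}.{col} + {window}")

# Usage with Spark (not executed here):
#   import pyspark.sql.functions as F
#   res = df_left.join(df_right,
#                      F.expr(between_cond("left", "right", "timestamp", 2)),
#                      how="left")
```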
[jira] [Comment Edited] (SPARK-23015) spark-submit fails when submitting several jobs in parallel
[ https://issues.apache.org/jira/browse/SPARK-23015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16996082#comment-16996082 ] Evgenii edited comment on SPARK-23015 at 12/14/19 2:25 AM: --- We invoke spark-submit from Java code in parallel too was (Author: lartcev): We invoke it from Java code in parallel too. > spark-submit fails when submitting several jobs in parallel > --- > > Key: SPARK-23015 > URL: https://issues.apache.org/jira/browse/SPARK-23015 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.2.0, 2.2.1 > Environment: Windows 10 (1709/16299.125) > Spark 2.3.0 > Java 8, Update 151 >Reporter: Hugh Zabriskie >Priority: Major > > Spark Submit's launching library prints the command to execute the launcher > (org.apache.spark.launcher.main) to a temporary text file, reads the result > back into a variable, and then executes that command. > {code} > set LAUNCHER_OUTPUT=%temp%\spark-class-launcher-output-%RANDOM%.txt > "%RUNNER%" -Xmx128m -cp "%LAUNCH_CLASSPATH%" org.apache.spark.launcher.Main > %* > %LAUNCHER_OUTPUT% > {code} > [bin/spark-class2.cmd, > L67|https://github.com/apache/spark/blob/master/bin/spark-class2.cmd#L66] > That temporary text file is given a pseudo-random name by the %RANDOM% env > variable generator, which generates a number between 0 and 32767. > This appears to be the cause of an error occurring when several spark-submit > jobs are launched simultaneously. The following error is returned from stderr: > {quote}The process cannot access the file because it is being used by another > process. The system cannot find the file > USER/AppData/Local/Temp/spark-class-launcher-output-RANDOM.txt. 
> The process cannot access the file because it is being used by another > process.{quote} > My hypothesis is that %RANDOM% is returning the same value for multiple jobs, > causing the launcher library to attempt to write to the same file from > multiple processes. Another mechanism is needed for reliably generating the > names of the temporary files so that the concurrency issue is resolved. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
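The reporter's hypothesis is easy to sanity-check with birthday-problem arithmetic: with names drawn uniformly from a space of only 32768 values, collisions between parallel jobs become likely quickly. A small sketch in plain Python, independent of the batch script:

```python
def collision_probability(n_jobs: int, space: int = 32768) -> float:
    """P(at least two of n_jobs draw the same %RANDOM% value)."""
    p_no_collision = 1.0
    for i in range(n_jobs):
        p_no_collision *= (space - i) / space  # i names already taken
    return 1.0 - p_no_collision
```

Even a few hundred simultaneous submissions give a very high collision chance, which is why widening the name space (or using a truly unique name) fixes the symptom.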
[jira] [Commented] (SPARK-23015) spark-submit fails when submitting several jobs in parallel
[ https://issues.apache.org/jira/browse/SPARK-23015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16996082#comment-16996082 ] Evgenii commented on SPARK-23015: - We invoke it from Java code in parallel too. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23015) spark-submit fails when submitting several jobs in parallel
[ https://issues.apache.org/jira/browse/SPARK-23015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16996080#comment-16996080 ] Evgenii edited comment on SPARK-23015 at 12/14/19 2:23 AM: --- Guys, why not invoke %RANDOM% multiple times? Just change the spark-class2.cmd set LAUNCHER_OUTPUT=%temp%\spark-class-launcher-output-%RANDOM%.txt to set LAUNCHER_OUTPUT=%temp%\spark-class-launcher-output-%RANDOM%%RANDOM%%RANDOM%.txt was (Author: lartcev): Guys, why not invoke %RANDOM% multiple times? Just change the spark-class2.cmd set LAUNCHER_OUTPUT=%temp%\spark-class-launcher-output-%RANDOM%.txt to set LAUNCHER_OUTPUT=%temp%\spark-class-launcher-output-%RANDOM%-%RANDOM%-%RANDOM%.txt -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23015) spark-submit fails when submitting several jobs in parallel
[ https://issues.apache.org/jira/browse/SPARK-23015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16996080#comment-16996080 ] Evgenii commented on SPARK-23015: - Guys, why not invoke %RANDOM% multiple times? Just change the spark-class2.cmd set LAUNCHER_OUTPUT=%temp%\spark-class-launcher-output-%RANDOM%.txt to set LAUNCHER_OUTPUT=%temp%\spark-class-launcher-output-%RANDOM%-%RANDOM%-%RANDOM%.txt -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30218) Columns used in inequality conditions for joins not resolved correctly in case of common lineage
[ https://issues.apache.org/jira/browse/SPARK-30218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30218: -- Affects Version/s: (was: 2.4.3) (was: 2.4.2) (was: 2.4.1) 2.3.4 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30218) Columns used in inequality conditions for joins not resolved correctly in case of common lineage
[ https://issues.apache.org/jira/browse/SPARK-30218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30218: -- Affects Version/s: 2.4.2 2.4.3 2.4.4 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-17398) Failed to query on external JSon Partitioned table
[ https://issues.apache.org/jira/browse/SPARK-17398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wing Yew Poon reopened SPARK-17398: --- This issue was never actually fixed. Evidently the problem still exists. I'll create a PR with a fix. > Failed to query on external JSON partitioned table > -- > > Key: SPARK-17398 > URL: https://issues.apache.org/jira/browse/SPARK-17398 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: pin_zhang >Priority: Major > Fix For: 2.0.1 > > Attachments: screenshot-1.png > > > 1. Create an external JSON partitioned table > with the SerDe in hive-hcatalog-core-1.2.1.jar, downloaded from > https://mvnrepository.com/artifact/org.apache.hive.hcatalog/hive-hcatalog-core/1.2.1 > 2. Querying the table hits an exception, which worked in Spark 1.5.2 > Exception in thread "main" org.apache.spark.SparkException: Job aborted due > to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: > Lost task > 0.0 in stage 1.0 (TID 1, localhost): java.lang.ClassCastException: > java.util.ArrayList cannot be cast to org.apache.hive.hcatalog.data.HCatRecord > at > org.apache.hive.hcatalog.data.HCatRecordObjectInspector.getStructFieldData(HCatRecordObjectInspector.java:45) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:430) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:426) > > 3.
Test Code > import org.apache.spark.SparkConf > import org.apache.spark.SparkContext > import org.apache.spark.sql.hive.HiveContext > object JsonBugs { > def main(args: Array[String]): Unit = { > val table = "test_json" > val location = "file:///g:/home/test/json" > val create = s"""CREATE EXTERNAL TABLE ${table} > (id string, seq string ) > PARTITIONED BY(index int) > ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' > LOCATION "${location}" > """ > val add_part = s""" > ALTER TABLE ${table} ADD > PARTITION (index=1)LOCATION '${location}/index=1' > """ > val conf = new SparkConf().setAppName("scala").setMaster("local[2]") > conf.set("spark.sql.warehouse.dir", "file:///g:/home/warehouse") > val ctx = new SparkContext(conf) > val hctx = new HiveContext(ctx) > val exist = hctx.tableNames().map { x => x.toLowerCase() }.contains(table) > if (!exist) { > hctx.sql(create) > hctx.sql(add_part) > } else { > hctx.sql("show partitions " + table).show() > } > hctx.sql("select * from test_json").show() > } > } -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30194) Re-enable checkstyle for Java
[ https://issues.apache.org/jira/browse/SPARK-30194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-30194. --- Resolution: Won't Do > Re-enable checkstyle for Java > - > > Key: SPARK-30194 > URL: https://issues.apache.org/jira/browse/SPARK-30194 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-30232) Fix the ArithmeticException caused by division by zero when AQE is enabled
[ https://issues.apache.org/jira/browse/SPARK-30232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-30232. - > Fix the ArithmeticException caused by division by zero when AQE is enabled > -- > > Key: SPARK-30232 > URL: https://issues.apache.org/jira/browse/SPARK-30232 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ke Jia >Priority: Major > > Add a check for the divisor to avoid an ArithmeticException caused by division by zero. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
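The described fix, guarding the divisor, amounts to a one-line check. A hedged sketch with hypothetical names, not Spark's actual AQE code:

```python
def safe_partition_count(total_bytes: int, target_bytes: int) -> int:
    """Compute ceil(total_bytes / target_bytes), guarding the divisor so a
    zero (or negative) target never raises a division-by-zero error."""
    if target_bytes <= 0:
        return 1  # fall back to a single partition instead of dividing by zero
    return max(1, -(-total_bytes // target_bytes))  # ceiling division
```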
[jira] [Commented] (SPARK-30212) COUNT(DISTINCT) window function should be supported
[ https://issues.apache.org/jira/browse/SPARK-30212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16996067#comment-16996067 ] Dongjoon Hyun commented on SPARK-30212: --- Thank you for filing a JIRA, @Kernel Force. > COUNT(DISTINCT) window function should be supported > --- > > Key: SPARK-30212 > URL: https://issues.apache.org/jira/browse/SPARK-30212 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.4 > Environment: Spark 2.4.4 > Scala 2.11.12 > Hive 2.3.6 >Reporter: Kernel Force >Priority: Major > > Suppose we have a typical table in Hive like below: > {code:sql} > CREATE TABLE DEMO_COUNT_DISTINCT ( > demo_date string, > demo_id string > ); > {code} > {noformat} > ++--+ > | demo_count_distinct.demo_date | demo_count_distinct.demo_id | > ++--+ > | 20180301 | 101 | > | 20180301 | 102 | > | 20180301 | 103 | > | 20180401 | 201 | > | 20180401 | 202 | > ++--+ > {noformat} > Now I want to count the distinct number of DEMO_DATE values while also preserving > every column's data in each row. > So I use the COUNT(DISTINCT) window function like below in Hive beeline and it > works: > {code:sql} > SELECT T.*, COUNT(DISTINCT T.DEMO_DATE) OVER(PARTITION BY NULL) UNIQ_DATES > FROM DEMO_COUNT_DISTINCT T; > {code} > {noformat} > +--++-+ > | t.demo_date | t.demo_id | uniq_dates | > +--++-+ > | 20180401 | 202 | 2 | > | 20180401 | 201 | 2 | > | 20180301 | 103 | 2 | > | 20180301 | 102 | 2 | > | 20180301 | 101 | 2 | > +--++-+ > {noformat} > But when I came to Spark SQL, it threw an exception even though I ran the same SQL. 
> {code:sql} > spark.sql(""" > SELECT T.*, COUNT(DISTINCT T.DEMO_DATE) OVER(PARTITION BY NULL) UNIQ_DATES > FROM DEMO_COUNT_DISTINCT T > """).show > {code} > {noformat} > org.apache.spark.sql.AnalysisException: Distinct window functions are not > supported: count(distinct DEMO_DATE#1) windowspecdefinition(null, > specifiedwindowframe(RowFrame, unboundedpreceding$(), > unboundedfollowing$()));; > Project [demo_date#1, demo_id#2, UNIQ_DATES#0L] > +- Project [demo_date#1, demo_id#2, UNIQ_DATES#0L, UNIQ_DATES#0L] > +- Window [count(distinct DEMO_DATE#1) windowspecdefinition(null, > specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) > AS UNIQ_DATES#0L], [null] > +- Project [demo_date#1, demo_id#2] > +- SubqueryAlias `T` > +- SubqueryAlias `default`.`demo_count_distinct` > +- HiveTableRelation `default`.`demo_count_distinct`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [demo_date#1, demo_id#2] > {noformat} > Then I tried to use the countDistinct function but also got an exception. > {code:sql} > spark.sql(""" > SELECT T.*, countDistinct(T.DEMO_DATE) OVER(PARTITION BY NULL) UNIQ_DATES > FROM DEMO_COUNT_DISTINCT T > """).show > {code} > {noformat} > org.apache.spark.sql.AnalysisException: Undefined function: 'countDistinct'. > This function is neither a registered temporary function nor a permanent > function registered in the database 'default'.; line 2 pos 12 > at > org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15$$anonfun$applyOrElse$49.apply(Analyzer.scala:1279) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15$$anonfun$applyOrElse$49.apply(Analyzer.scala:1279) > at > org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:53) > .. > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
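A workaround often suggested for this limitation (an assumption on my part, not something from this thread) is to rewrite COUNT(DISTINCT x) OVER (...) as SIZE(COLLECT_SET(x)) OVER (...), since collect_set is accepted as a window aggregate in Spark SQL. A minimal sketch of that rewrite as a string transformation:

```python
import re

def rewrite_count_distinct(sql: str) -> str:
    """Rewrite COUNT(DISTINCT col) OVER (...) into SIZE(COLLECT_SET(col)) OVER (...),
    a form Spark's window functions accept. Purely textual; a real rewrite
    would work on the parsed plan instead."""
    return re.sub(r"COUNT\(\s*DISTINCT\s+([^)]+)\)\s*(OVER)",
                  r"SIZE(COLLECT_SET(\1)) \2",
                  sql,
                  flags=re.IGNORECASE)
```

The rewritten query can then be passed to spark.sql(...) unchanged otherwise.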
[jira] [Updated] (SPARK-30212) COUNT(DISTINCT) window function should be supported
[ https://issues.apache.org/jira/browse/SPARK-30212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30212: -- Affects Version/s: (was: 2.4.4) 3.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30212) COUNT(DISTINCT) window function should be supported
[ https://issues.apache.org/jira/browse/SPARK-30212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30212: -- Issue Type: Improvement (was: Bug) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30212) COUNT(DISTINCT) window function should be supported
[ https://issues.apache.org/jira/browse/SPARK-30212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30212: -- Labels: (was: SQL distinct window_function)
[jira] [Commented] (SPARK-30181) Throws runtime exception when filter metastore partition key that's not string type or integral types
[ https://issues.apache.org/jira/browse/SPARK-30181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16996065#comment-16996065 ] L. C. Hsieh commented on SPARK-30181: - This should be fixed by SPARK-30238. > Throws runtime exception when filter metastore partition key that's not > string type or integral types > - > > Key: SPARK-30181 > URL: https://issues.apache.org/jira/browse/SPARK-30181 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4 >Reporter: Yu-Jhe Li >Priority: Major > > The SQL below has thrown a runtime exception since Spark 2.4.0. I think it's a > bug introduced by SPARK-22384. > {code:scala} > spark.sql("CREATE TABLE timestamp_part (value INT) PARTITIONED BY (dt > TIMESTAMP)") > val df = Seq( > (1, java.sql.Timestamp.valueOf("2019-12-01 00:00:00"), 1), > (2, java.sql.Timestamp.valueOf("2019-12-01 01:00:00"), 1) > ).toDF("id", "dt", "value") > df.write.partitionBy("dt").mode("overwrite").saveAsTable("timestamp_part") > spark.sql("select * from timestamp_part where dt >= '2019-12-01 > 00:00:00'").explain(true) > {code} > {noformat} > Caught Hive MetaException attempting to get partition metadata by filter from > Hive. You can set the Spark configuration setting > spark.sql.hive.manageFilesourcePartitions to false to work around this > problem, however this will result in degraded performance. Please report a > bug: https://issues.apache.org/jira/browse/SPARK > java.lang.RuntimeException: Caught Hive MetaException attempting to get > partition metadata by filter from Hive. You can set the Spark configuration > setting spark.sql.hive.manageFilesourcePartitions to false to work around > this problem, however this will result in degraded performance. 
Please report > a bug: https://issues.apache.org/jira/browse/SPARK > at > org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:774) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:679) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:677) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258) > at > org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:677) > at > org.apache.spark.sql.hive.client.HiveClientSuite.testMetastorePartitionFiltering(HiveClientSuite.scala:310) > at > org.apache.spark.sql.hive.client.HiveClientSuite.org$apache$spark$sql$hive$client$HiveClientSuite$$testMetastorePartitionFiltering(HiveClientSuite.scala:282) > at > org.apache.spark.sql.hive.client.HiveClientSuite$$anonfun$1.apply$mcV$sp(HiveClientSuite.scala:105) > at > org.apache.spark.sql.hive.client.HiveClientSuite$$anonfun$1.apply(HiveClientSuite.scala:105) > at > org.apache.spark.sql.hive.client.HiveClientSuite$$anonfun$1.apply(HiveClientSuite.scala:105) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:103) > at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183) > at > 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289) > at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196) > at org.scalatest.FunSuite.runTest(FunSuite.scala:1560) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384) > at scala.collection.immutable.List.foreach(List.scala:392) > at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384) > at > org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379) >
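Until that fix lands, the exception text above already names a session-level escape hatch. A hedged sketch (it trades away metastore partition pruning, as the message warns, and this config is typically set when the session is built rather than at runtime):

{code:python}
from pyspark.sql import SparkSession

# Grounded in the error message above: stop asking the Hive metastore to
# filter partitions, at the cost of listing all of them.
spark = (SparkSession.builder
         .config("spark.sql.hive.manageFilesourcePartitions", "false")
         .enableHiveSupport()
         .getOrCreate())
{code}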
[jira] [Commented] (SPARK-30257) Mapping simpleString to Spark SQL types
[ https://issues.apache.org/jira/browse/SPARK-30257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16996064#comment-16996064 ] Dongjoon Hyun commented on SPARK-30257: --- Hi, [~svanhooser]. I triggered the full Spark Jenkins test on your PR. > Mapping simpleString to Spark SQL types > --- > > Key: SPARK-30257 > URL: https://issues.apache.org/jira/browse/SPARK-30257 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Shelby Vanhooser >Priority: Major > > The PySpark mapping from simpleString to Spark SQL types is too manual right > now; instead, pyspark.sql.types should expose a method that maps the > simpleString representation of these types to the underlying Spark SQL ones. > > Tracked here: [https://github.com/apache/spark/pull/26884]
[jira] [Updated] (SPARK-30257) Mapping simpleString to Spark SQL types
[ https://issues.apache.org/jira/browse/SPARK-30257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30257: -- Affects Version/s: (was: 2.4.4) 3.0.0
[jira] [Updated] (SPARK-30257) Mapping simpleString to Spark SQL types
[ https://issues.apache.org/jira/browse/SPARK-30257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30257: -- Priority: Major (was: Critical)
[jira] [Updated] (SPARK-30257) Mapping simpleString to Spark SQL types
[ https://issues.apache.org/jira/browse/SPARK-30257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30257: -- Labels: (was: PySpark feature)
[jira] [Commented] (SPARK-30242) Support reading Parquet files from Stream Buffer
[ https://issues.apache.org/jira/browse/SPARK-30242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16996046#comment-16996046 ] Jelther Oliveira Gonçalves commented on SPARK-30242: Hi [~dongjoon], thanks for the update. I see you have already changed it. Thanks. > Support reading Parquet files from Stream Buffer > > > Key: SPARK-30242 > URL: https://issues.apache.org/jira/browse/SPARK-30242 > Project: Spark > Issue Type: Wish > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Jelther Oliveira Gonçalves >Priority: Trivial > > Reading a Parquet file from a Python BytesIO buffer is not possible using PySpark. > Using: > > {code:java} > from io import BytesIO > parquetbytes: bytes = b'PAR...' > df = spark.read.format("parquet").load(BytesIO(parquetbytes)) > {code} > Raises: > {code:java} > java.lang.ClassCastException: java.util.ArrayList cannot be cast to > java.lang.String{code} > > Is there any chance this will be available in the future? >
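One workaround sketch (a hypothetical helper, not part of any Spark API): since Spark's DataFrameReader takes paths rather than Python file objects, spill the buffer to a local temp file and load that path instead.

```python
import tempfile

def bytes_to_local_file(data: bytes, suffix: str = ".parquet") -> str:
    """Write an in-memory buffer to a local temp file and return its path,
    so it can be handed to an API that only accepts paths."""
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as f:
        f.write(data)
        return f.name

# Usage sketch, assuming an active SparkSession named `spark`:
# df = spark.read.parquet(bytes_to_local_file(parquetbytes))
```

On a cluster the temp file must be visible to whichever processes read it (for example via a shared filesystem), which is part of why a true buffer-reading API would be nicer.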
[jira] [Updated] (SPARK-30239) Creating a dataframe with Pandas rather than Numpy datatypes fails
[ https://issues.apache.org/jira/browse/SPARK-30239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30239: -- Summary: Creating a dataframe with Pandas rather than Numpy datatypes fails (was: [Python] Creating a dataframe with Pandas rather than Numpy datatypes fails) > Creating a dataframe with Pandas rather than Numpy datatypes fails > -- > > Key: SPARK-30239 > URL: https://issues.apache.org/jira/browse/SPARK-30239 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.3 > Environment: DataBricks: 48.00 GB | 24 Cores | DBR 6.0 | Spark 2.4.3 > | Scala 2.11 >Reporter: Philip Kahn >Priority: Minor > > It's possible to work with DataFrames in Pandas and shuffle them back over to > Spark dataframes for processing; however, using Pandas extended datatypes > like {{Int64 }}( > [https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html] ) > throws an error (that long / float can't be converted). > This is internally because {{np.nan}} is a float, and {{pd.Int64DType()}} > allows only integers except for the single float value {{np.nan}}. > > The current workaround for this is to use the columns as floats, and after > conversion to the Spark DataFrame, to recast the column as {{LongType()}}. > For example: > > {{sdfC = spark.createDataFrame(kgridCLinked)}} > {{sdfC = sdfC.withColumn("gridID", sdfC["gridID"].cast(LongType()))}} > > However, this is awkward and redundant. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
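The root cause called out above, that {{np.nan}} is a float, can be checked with plain Python (no Spark or NumPy needed, since {{np.nan}} is an ordinary IEEE-754 float NaN):

```python
nan = float("nan")  # np.nan is this same kind of value: a Python float NaN

print(isinstance(nan, float))  # True: the missing-value sentinel is a float
print(nan != nan)              # True: NaN compares unequal to itself
```

This is why a pandas nullable-integer column trips up createDataFrame: every value is an integer except the sentinel for missing data, which is a float.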
[jira] [Updated] (SPARK-30242) Support reading Parquet files from Stream Buffer
[ https://issues.apache.org/jira/browse/SPARK-30242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30242: -- Component/s: (was: SQL) PySpark
[jira] [Updated] (SPARK-30239) [Python] Creating a dataframe with Pandas rather than Numpy datatypes fails
[ https://issues.apache.org/jira/browse/SPARK-30239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30239: -- Labels: (was: easyfix)
[jira] [Commented] (SPARK-30242) Support reading Parquet files from Stream Buffer
[ https://issues.apache.org/jira/browse/SPARK-30242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16996045#comment-16996045 ] Dongjoon Hyun commented on SPARK-30242: --- Hi, [~jetolgon]. Thank you for the suggestion. For a new feature, you need to set the next version of the master branch. As of today, that's 3.0.0.
[jira] [Updated] (SPARK-30242) Support reading Parquet files from Stream Buffer
[ https://issues.apache.org/jira/browse/SPARK-30242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30242: -- Affects Version/s: (was: 2.4.4) 3.0.0
[jira] [Updated] (SPARK-30242) Support reading Parquet files from Stream Buffer
[ https://issues.apache.org/jira/browse/SPARK-30242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30242: -- Component/s: (was: Spark Core) SQL
[jira] [Commented] (SPARK-30249) Invalid Column Names in parquet tables should not be allowed
[ https://issues.apache.org/jira/browse/SPARK-30249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16996044#comment-16996044 ] Dongjoon Hyun commented on SPARK-30249: --- I believe it's prevented because the ORC format doesn't support that. When you use those columns in a Parquet file, does the Parquet table work incorrectly? I didn't test it, but it might be valid in the Parquet file format. > Invalid Column Names in parquet tables should not be allowed > > > Key: SPARK-30249 > URL: https://issues.apache.org/jira/browse/SPARK-30249 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Rakesh Raushan >Priority: Minor > > Column names such as `a:b`, `??`, `,,`, `^^`, `++` etc. are allowed when we > create Parquet tables, whereas when we create tables with ORC all such column names are marked > as invalid and an AnalysisException is thrown. > These column names should not be allowed for Parquet tables either. > This also introduces an inconsistency in column naming between Parquet and ORC.
[jira] [Resolved] (SPARK-30250) SparkQL div is undocumented
[ https://issues.apache.org/jira/browse/SPARK-30250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-30250. --- Resolution: Duplicate Thank you for filing a JIRA, [~michaelchirico]. SPARK-16323 added it at 3.0.0. - [https://github.com/apache/spark/commit/553af22f2c8ecdc039c8d06431564b1432e60d2d] And it's documented in the 3.0.0-preview docs: https://spark.apache.org/docs/3.0.0-preview/api/sql/#div > SparkQL div is undocumented > --- > > Key: SPARK-30250 > URL: https://issues.apache.org/jira/browse/SPARK-30250 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: Michael Chirico >Priority: Minor > > https://issues.apache.org/jira/browse/SPARK-15407 > mentions the div operator in SparkQL. > However, it's undocumented in the SQL API docs: > https://spark.apache.org/docs/latest/api/sql/index.html > It's documented in the HiveQL docs: > https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
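For reference, the behavior the linked 3.0.0-preview function list describes is integral division (the result is cast to long), so:

{code:sql}
SELECT 3 div 2;  -- returns 1, not 1.5
{code}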
[jira] [Updated] (SPARK-30250) SparkQL div is undocumented
[ https://issues.apache.org/jira/browse/SPARK-30250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30250: -- Affects Version/s: (was: 2.4.4) 3.0.0
[jira] [Updated] (SPARK-28264) Revisiting Python / pandas UDF
[ https://issues.apache.org/jira/browse/SPARK-28264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-28264: Priority: Blocker (was: Critical) > Revisiting Python / pandas UDF > -- > > Key: SPARK-28264 > URL: https://issues.apache.org/jira/browse/SPARK-28264 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Reynold Xin >Assignee: Reynold Xin >Priority: Blocker > > In the past two years, the pandas UDFs are perhaps the most important changes > to Spark for Python data science. However, these functionalities have evolved > organically, leading to some inconsistencies and confusions among users. This > document revisits UDF definition and naming, as a result of discussions among > Xiangrui, Li Jin, Hyukjin, and Reynold. > > See document here: > [https://docs.google.com/document/d/10Pkl-rqygGao2xQf6sddt0b-4FYK4g8qr_bXLKTL65A/edit#|https://docs.google.com/document/d/10Pkl-rqygGao2xQf6sddt0b-4FYK4g8qr_bXLKTL65A/edit] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-30251) faster way to read csv.gz?
[ https://issues.apache.org/jira/browse/SPARK-30251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-30251. - > faster way to read csv.gz? > -- > > Key: SPARK-30251 > URL: https://issues.apache.org/jira/browse/SPARK-30251 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: t oo >Priority: Major > > Some data providers give files in csv.gz (i.e. 1gb compressed which is 25gb > uncompressed; or 5gb compressed which is 130gb uncompressed; or 0.1gb compressed > which is 2.5gb uncompressed). Now when I tell my boss that the famous big data > tool Spark takes 16hrs to convert the 1gb compressed file into parquet, there > is a look of shock. This is batch data we receive daily (80gb compressed, 2tb > uncompressed every day, spread across ~300 files). > I know gz is not splittable, so it ends up loaded on a single worker. But we > don't have the space or patience to do a pre-conversion to bz2 or uncompressed. Can > Spark have a better codec? I saw posts mentioning that even plain Python is faster than > Spark: > > [https://stackoverflow.com/questions/40492967/dealing-with-a-large-gzipped-file-in-spark] > [https://github.com/nielsbasjes/splittablegzip] > > 
[jira] [Resolved] (SPARK-30251) faster way to read csv.gz?
[ https://issues.apache.org/jira/browse/SPARK-30251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-30251. --- Resolution: Invalid Hi, [~toopt4]. Sorry, but Jira is not for Q&A. You had better send an email to the dev list.
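For anyone who lands on this thread from search: since a gzip file is decompressed by a single task, one common mitigation (a sketch with placeholder paths, not advice from this thread) is to read the file as raw text, repartition, and parse afterwards, so only the decompression stays single-threaded while the CSV parsing and the parquet write fan out across the cluster:

{code:python}
# Assumes an active SparkSession named `spark`; paths are placeholders,
# and header/schema handling is elided for brevity.
lines = spark.read.text("/data/in/big.csv.gz").repartition(200)  # one task reads the gzip
df = spark.read.csv(lines.rdd.map(lambda r: r.value))            # parsing now runs in parallel
df.write.mode("overwrite").parquet("/data/out/big_parquet/")
{code}

The splittablegzip codec linked above attacks the read itself; this sketch only parallelizes the work after the read.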
[jira] [Resolved] (SPARK-30143) StreamingQuery.stop() should not block indefinitely
[ https://issues.apache.org/jira/browse/SPARK-30143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz resolved SPARK-30143. - Fix Version/s: 3.0.0 Resolution: Done Resolved as part of [https://github.com/apache/spark/pull/26771] > StreamingQuery.stop() should not block indefinitely > --- > > Key: SPARK-30143 > URL: https://issues.apache.org/jira/browse/SPARK-30143 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.4.4 >Reporter: Burak Yavuz >Assignee: Burak Yavuz >Priority: Major > Fix For: 3.0.0 > > > The stop() method on a Streaming Query awaits the termination of the stream > execution thread. However, the stream execution thread may block forever > depending on the streaming source implementation (like in Kafka, which runs > UninterruptibleThreads). > This causes control flow applications to hang indefinitely as well. We'd like > to introduce a timeout to stop the execution thread, so that the control flow > thread can decide to do an action if a timeout is hit. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
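Until a built-in timeout exists, the control-flow side can protect itself with a generic pattern (a plain-Python sketch, independent of whatever API this ticket eventually added): run the potentially blocking stop() on a helper thread and give up after a deadline.

```python
import threading

def stop_with_timeout(stop_fn, timeout_s):
    """Invoke a potentially blocking stop_fn on a daemon thread.
    Returns True if it returned within timeout_s seconds, False if it
    is still blocked (so the caller can decide what to do next)."""
    t = threading.Thread(target=stop_fn, daemon=True)
    t.start()
    t.join(timeout_s)
    return not t.is_alive()

# Usage sketch: finished = stop_with_timeout(query.stop, 30.0)
```

The daemon flag matters: if stop_fn never returns, the helper thread must not keep the whole process alive at shutdown.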
[jira] [Assigned] (SPARK-30143) StreamingQuery.stop() should not block indefinitely
[ https://issues.apache.org/jira/browse/SPARK-30143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz reassigned SPARK-30143: --- Assignee: Burak Yavuz
[jira] [Resolved] (SPARK-30167) Log4j configuration for REPL can't override the root logger properly.
[ https://issues.apache.org/jira/browse/SPARK-30167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin resolved SPARK-30167. Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26798 [https://github.com/apache/spark/pull/26798] > Log4j configuration for REPL can't override the root logger properly. > - > > Key: SPARK-30167 > URL: https://issues.apache.org/jira/browse/SPARK-30167 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 3.0.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > Fix For: 3.0.0 > > > SPARK-11929 enabled REPL's log4j configuration to override root logger but > SPARK-26753 seems to have broken the feature. > You can see one example when you modifies the default log4j configuration > like as follows. > {code:java} > # Change the log level for rootCategory to DEBUG > log4j.rootCategory=DEBUG, console > ... > # The log level for repl.Main remains WARN > log4j.logger.org.apache.spark.repl.Main=WARN{code} > If you launch REPL with the configuration, INFO level logs appear even though > the log level for REPL is WARN. > {code:java} > ・・・ > 19/12/08 23:31:38 INFO Utils: Successfully started service 'sparkDriver' on > port 33083. > 19/12/08 23:31:38 INFO SparkEnv: Registering MapOutputTracker > 19/12/08 23:31:38 INFO SparkEnv: Registering BlockManagerMaster > 19/12/08 23:31:38 INFO BlockManagerMasterEndpoint: Using > org.apache.spark.storage.DefaultTopologyMapper for getting topology > information > 19/12/08 23:31:38 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint > up > 19/12/08 23:31:38 INFO SparkEnv: Registering BlockManagerMasterHeartbeat > ・・・{code} > > Before SPARK-26753 was applied, those INFO level logs are not shown with the > same log4j.properties. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30077) create TEMPORARY VIEW USING should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-30077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Huaxin Gao resolved SPARK-30077. Resolution: Invalid > create TEMPORARY VIEW USING should look up catalog/table like v2 commands > - > > Key: SPARK-30077 > URL: https://issues.apache.org/jira/browse/SPARK-30077 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Priority: Major > > create TEMPORARY VIEW USING should look up catalog/table like v2 commands
[jira] [Commented] (SPARK-30077) create TEMPORARY VIEW USING should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-30077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16995945#comment-16995945 ] Pablo Langa Blanco commented on SPARK-30077: [~huaxingao] Can we close this ticket? Reading the comments, I understand we are not going to make this change. Thanks > create TEMPORARY VIEW USING should look up catalog/table like v2 commands > - > > Key: SPARK-30077 > URL: https://issues.apache.org/jira/browse/SPARK-30077 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Priority: Major > > create TEMPORARY VIEW USING should look up catalog/table like v2 commands
[jira] [Commented] (SPARK-29563) CREATE TABLE LIKE should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16995942#comment-16995942 ] Pablo Langa Blanco commented on SPARK-29563: [~dkbiswal] Are you still working on this? If not, I can continue. Thanks > CREATE TABLE LIKE should look up catalog/table like v2 commands > --- > > Key: SPARK-29563 > URL: https://issues.apache.org/jira/browse/SPARK-29563 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Dilip Biswal >Assignee: Dilip Biswal >Priority: Major >
[jira] [Commented] (SPARK-30257) Mapping simpleString to Spark SQL types
[ https://issues.apache.org/jira/browse/SPARK-30257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16995896#comment-16995896 ] Shelby Vanhooser commented on SPARK-30257: -- All tests passing! > Mapping simpleString to Spark SQL types > --- > > Key: SPARK-30257 > URL: https://issues.apache.org/jira/browse/SPARK-30257 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.4 >Reporter: Shelby Vanhooser >Priority: Critical > > The PySpark mapping from simpleString to Spark SQL types are too manual right > now; instead, pyspark.sql.types should expose a method that maps the > simpleString representation of these types to the underlying Spark SQL ones > > Tracked here : [https://github.com/apache/spark/pull/26884] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30257) Mapping simpleString to Spark SQL types
[ https://issues.apache.org/jira/browse/SPARK-30257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shelby Vanhooser updated SPARK-30257: - Labels: PySpark feature (was: ) > Mapping simpleString to Spark SQL types > --- > > Key: SPARK-30257 > URL: https://issues.apache.org/jira/browse/SPARK-30257 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.4 >Reporter: Shelby Vanhooser >Priority: Critical > Labels: PySpark, feature > > The PySpark mapping from simpleString to Spark SQL types are too manual right > now; instead, pyspark.sql.types should expose a method that maps the > simpleString representation of these types to the underlying Spark SQL ones > > Tracked here : [https://github.com/apache/spark/pull/26884] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30258) Eliminate warnings of deprecated Spark APIs in tests
[ https://issues.apache.org/jira/browse/SPARK-30258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-30258: --- Summary: Eliminate warnings of deprecated Spark APIs in tests (was: Eliminate warnings of depracted Spark APIs in tests) > Eliminate warnings of deprecated Spark APIs in tests > > > Key: SPARK-30258 > URL: https://issues.apache.org/jira/browse/SPARK-30258 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Minor > > Suppress deprecation warnings in tests that check deprecated Spark APIs.
[jira] [Created] (SPARK-30258) Eliminate warnings of depracted Spark APIs in tests
Maxim Gekk created SPARK-30258: -- Summary: Eliminate warnings of depracted Spark APIs in tests Key: SPARK-30258 URL: https://issues.apache.org/jira/browse/SPARK-30258 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.0.0 Reporter: Maxim Gekk Suppress deprecation warnings in tests that check deprecated Spark APIs.
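The kind of scoped suppression this ticket asks for (the Spark patch itself targets Scala deprecation warnings) can be illustrated with Python's `warnings` machinery; `old_api` here is a made-up stand-in for a deprecated API that a test still needs to exercise.

```python
import warnings

def old_api():
    # Stand-in for a deprecated API that a test deliberately calls.
    warnings.warn("old_api() is deprecated; use new_api()", DeprecationWarning)
    return 42

# Scoped suppression: the deprecated code path is still tested, but its
# warning no longer pollutes the test output. The filter change is undone
# when the `with` block exits.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("ignore", DeprecationWarning)
    result = old_api()

print(result, len(caught))  # 42 0
```

Suppressing only inside the block keeps the warning visible everywhere else, which is the point of the ticket: silence the *expected* warnings without hiding new ones.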
[jira] [Commented] (SPARK-6235) Address various 2G limits
[ https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16995889#comment-16995889 ] Samuel Shepard commented on SPARK-6235: --- One use case could be fetching large results to the driver when computing PCA on large square matrices (e.g., distance matrices, similar to Classical MDS). This is very helpful in bioinformatics. Sorry if this is already fixed past 2.4.0... > Address various 2G limits > - > > Key: SPARK-6235 > URL: https://issues.apache.org/jira/browse/SPARK-6235 > Project: Spark > Issue Type: Umbrella > Components: Shuffle, Spark Core >Reporter: Reynold Xin >Priority: Major > Fix For: 2.4.0 > > Attachments: SPARK-6235_Design_V0.02.pdf > > > An umbrella ticket to track the various 2G limits we have in Spark, due to the > use of byte arrays and ByteBuffers.
[jira] [Resolved] (SPARK-29449) Add tooltip to Spark WebUI
[ https://issues.apache.org/jira/browse/SPARK-29449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-29449. -- Fix Version/s: 3.0.0 Resolution: Done > Add tooltip to Spark WebUI > -- > > Key: SPARK-29449 > URL: https://issues.apache.org/jira/browse/SPARK-29449 > Project: Spark > Issue Type: Umbrella > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Major > Fix For: 3.0.0 > > > The initial effort was made in > https://issues.apache.org/jira/browse/SPARK-2384. This umbrella Jira is to > track the progress of adding tooltips across the WebUI for better usability. >
[jira] [Resolved] (SPARK-29455) Improve tooltip information for Stages Tab
[ https://issues.apache.org/jira/browse/SPARK-29455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29455. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26859 [https://github.com/apache/spark/pull/26859] > Improve tooltip information for Stages Tab > -- > > Key: SPARK-29455 > URL: https://issues.apache.org/jira/browse/SPARK-29455 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Assignee: Sharanabasappa G Keriwaddi >Priority: Minor > Fix For: 3.0.0 > >
[jira] [Assigned] (SPARK-29455) Improve tooltip information for Stages Tab
[ https://issues.apache.org/jira/browse/SPARK-29455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-29455: - Assignee: Sharanabasappa G Keriwaddi > Improve tooltip information for Stages Tab > -- > > Key: SPARK-29455 > URL: https://issues.apache.org/jira/browse/SPARK-29455 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Assignee: Sharanabasappa G Keriwaddi >Priority: Minor >
[jira] [Resolved] (SPARK-30216) Use python3 in Docker release image
[ https://issues.apache.org/jira/browse/SPARK-30216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-30216. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26848 [https://github.com/apache/spark/pull/26848] > Use python3 in Docker release image > --- > > Key: SPARK-30216 > URL: https://issues.apache.org/jira/browse/SPARK-30216 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.0.0 > >
[jira] [Commented] (SPARK-30230) Like ESCAPE syntax can not use '_' and '%'
[ https://issues.apache.org/jira/browse/SPARK-30230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16995838#comment-16995838 ] Dongjoon Hyun commented on SPARK-30230: --- The commit is reverted via https://github.com/apache/spark/commit/4da9780bc0a12672b45ffdcc28e594593bc68350 > Like ESCAPE syntax can not use '_' and '%' > -- > > Key: SPARK-30230 > URL: https://issues.apache.org/jira/browse/SPARK-30230 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: ulysses you >Priority: Major > > '%' and '_' is the reserve char in `Like` expression. We can not use them as > escape char. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-30230) Like ESCAPE syntax can not use '_' and '%'
[ https://issues.apache.org/jira/browse/SPARK-30230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reopened SPARK-30230: --- Assignee: (was: ulysses you) > Like ESCAPE syntax can not use '_' and '%' > -- > > Key: SPARK-30230 > URL: https://issues.apache.org/jira/browse/SPARK-30230 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: ulysses you >Priority: Major > > '%' and '_' is the reserve char in `Like` expression. We can not use them as > escape char. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
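The restriction discussed in SPARK-30230 makes sense because an escape character that is itself a LIKE wildcard is ambiguous (with escape `'%'`, a pattern like `'%%'` has no unique reading). A minimal LIKE-to-regex translator sketches where such a check would live; this is illustrative Python, not Spark's actual parser or semantics.

```python
import re

def like_to_regex(pattern: str, escape: str = "\\") -> str:
    """Translate a SQL LIKE pattern into a Python regex string."""
    # Reject wildcards as escape characters, mirroring the ticket's point.
    if escape in ("%", "_"):
        raise ValueError("the escape character cannot be '%' or '_'")
    out, i = [], 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == escape and i + 1 < len(pattern):
            out.append(re.escape(pattern[i + 1]))  # escaped char is a literal
            i += 2
        elif ch == "%":
            out.append(".*")  # any sequence of characters
            i += 1
        elif ch == "_":
            out.append(".")   # any single character
            i += 1
        else:
            out.append(re.escape(ch))
            i += 1
    return "".join(out)

print(bool(re.fullmatch(like_to_regex(r"100\%"), "100%")))  # True
```

With a non-wildcard escape character, `\%` matches a literal percent sign; allowing `'%'` as the escape would make the translator's first two branches indistinguishable.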
[jira] [Updated] (SPARK-30257) Mapping simpleString to Spark SQL types
[ https://issues.apache.org/jira/browse/SPARK-30257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shelby Vanhooser updated SPARK-30257: - Component/s: (was: Input/Output) PySpark > Mapping simpleString to Spark SQL types > --- > > Key: SPARK-30257 > URL: https://issues.apache.org/jira/browse/SPARK-30257 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.4 >Reporter: Shelby Vanhooser >Priority: Critical > > The PySpark mapping from simpleString to Spark SQL types are too manual right > now; instead, pyspark.sql.types should expose a method that maps the > simpleString representation of these types to the underlying Spark SQL ones > > Tracked here : [https://github.com/apache/spark/pull/26884] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30257) Mapping simpleString to Spark SQL types
[ https://issues.apache.org/jira/browse/SPARK-30257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shelby Vanhooser updated SPARK-30257: - Description: The PySpark mapping from simpleString to Spark SQL types are too manual right now; instead, pyspark.sql.types should expose a method that maps the simpleString representation of these types to the underlying Spark SQL ones Tracked here : [https://github.com/apache/spark/pull/26884] was:The PySpark mapping from simpleString to Spark SQL types are too manual right now; instead, pyspark.sql.types should expose a method that maps the simpleString representation of these types to the underlying Spark SQL ones > Mapping simpleString to Spark SQL types > --- > > Key: SPARK-30257 > URL: https://issues.apache.org/jira/browse/SPARK-30257 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 2.4.4 >Reporter: Shelby Vanhooser >Priority: Critical > > The PySpark mapping from simpleString to Spark SQL types are too manual right > now; instead, pyspark.sql.types should expose a method that maps the > simpleString representation of these types to the underlying Spark SQL ones > > Tracked here : [https://github.com/apache/spark/pull/26884] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30257) Mapping simpleString to Spark SQL types
Shelby Vanhooser created SPARK-30257: Summary: Mapping simpleString to Spark SQL types Key: SPARK-30257 URL: https://issues.apache.org/jira/browse/SPARK-30257 Project: Spark Issue Type: Improvement Components: Input/Output Affects Versions: 2.4.4 Reporter: Shelby Vanhooser The PySpark mapping from simpleString to Spark SQL types is too manual right now; instead, pyspark.sql.types should expose a method that maps the simpleString representation of these types to the underlying Spark SQL ones
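What the ticket asks for amounts to a lookup from simpleString names to type constructors. The table below is a hand-written illustration only: in real PySpark the values would be `DataType` instances from `pyspark.sql.types`, and a complete mapping would also handle parameterized types such as `decimal(p, s)` and nested `array`/`map`/`struct` strings.

```python
# Illustrative lookup: simpleString name -> Spark SQL type-constructor name.
# (In pyspark.sql.types these would be instances like IntegerType().)
SIMPLE_STRING_TO_TYPE = {
    "string": "StringType",
    "boolean": "BooleanType",
    "int": "IntegerType",
    "bigint": "LongType",
    "float": "FloatType",
    "double": "DoubleType",
    "date": "DateType",
    "timestamp": "TimestampType",
}

def from_simple_string(name: str) -> str:
    """Resolve a simpleString such as 'bigint' to its Spark SQL type name."""
    try:
        return SIMPLE_STRING_TO_TYPE[name.strip().lower()]
    except KeyError:
        raise ValueError(f"unsupported simpleString: {name!r}") from None

print(from_simple_string(" BigInt "))  # LongType
```

Centralizing the table in one exposed function is the improvement being proposed, replacing ad-hoc per-caller mappings.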
[jira] [Comment Edited] (SPARK-28502) Error with struct conversion while using pandas_udf
[ https://issues.apache.org/jira/browse/SPARK-28502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16995815#comment-16995815 ] Nasir Ali edited comment on SPARK-28502 at 12/13/19 6:45 PM: - {code:java} import numpy as np import pandas as pd import json from geopy.distance import great_circle from pyspark.sql.functions import pandas_udf, PandasUDFType from shapely.geometry.multipoint import MultiPoint from sklearn.cluster import DBSCAN from pyspark.sql.types import StructField, StructType, StringType, FloatType, MapType from pyspark.sql.types import StructField, StructType, StringType, FloatType, TimestampType, IntegerType,DateType,TimestampTypeschema = StructType([ StructField("timestamp", TimestampType()), StructField("window", StructType([ StructField("start", TimestampType()), StructField("end", TimestampType())])), StructField("some_val", StringType()) ])@pandas_udf(schema, PandasUDFType.GROUPED_MAP) def get_win_col(key, user_data): all_vals = [] for index, row in user_data.iterrows(): all_vals.append([row["timestamp"],key[2],"tesss"]) return pd.DataFrame(all_vals,columns=['timestamp','window','some_val']) {code} I am not even able to manually return window column. 
It throws error {code:java} Traceback (most recent call last): File "/usr/local/spark-3.0.0-preview/python/pyspark/sql/udf.py", line 139, in returnType to_arrow_type(self._returnType_placeholder) File "/usr/local/spark-3.0.0-preview/python/pyspark/sql/types.py", line 1641, in to_arrow_type raise TypeError("Nested StructType not supported in conversion to Arrow") TypeError: Nested StructType not supported in conversion to ArrowDuring handling of the above exception, another exception occurred:Traceback (most recent call last): File "", line 1, in File "/usr/local/spark-3.0.0-preview/python/pyspark/sql/udf.py", line 79, in _create_udf return udf_obj._wrapped() File "/usr/local/spark-3.0.0-preview/python/pyspark/sql/udf.py", line 234, in _wrapped wrapper.returnType = self.returnType File "/usr/local/spark-3.0.0-preview/python/pyspark/sql/udf.py", line 143, in returnType "%s is not supported" % str(self._returnType_placeholder)) NotImplementedError: Invalid returnType with grouped map Pandas UDFs: StructType(List(StructField(timestamp,TimestampType,true),StructField(window,StructType(List(StructField(start,TimestampType,true),StructField(end,TimestampType,true))),true),StructField(some_val,StringType,true))) is not supported {code} However, if I manually run *to_arrow_schema(schema)*. It works all fine and there is no exception. 
[https://github.com/apache/spark/blob/master/python/pyspark/sql/udf.py#L139] {code:java} from pyspark.sql.types import to_arrow_schema to_arrow_schema(schema) {code} was (Author: nasirali): {code:java} import numpy as np import pandas as pd import json from geopy.distance import great_circle from pyspark.sql.functions import pandas_udf, PandasUDFType from shapely.geometry.multipoint import MultiPoint from sklearn.cluster import DBSCAN from pyspark.sql.types import StructField, StructType, StringType, FloatType, MapType from pyspark.sql.types import StructField, StructType, StringType, FloatType, TimestampType, IntegerType,DateType,TimestampTypeschema = StructType([ StructField("timestamp", TimestampType()), StructField("window", StructType([ StructField("start", TimestampType()), StructField("end", TimestampType())])), StructField("some_val", StringType()) ])@pandas_udf(schema, PandasUDFType.GROUPED_MAP) def get_win_col(key, user_data): all_vals = [] for index, row in user_data.iterrows(): all_vals.append([row["timestamp"],key[2],"tesss"]) return pd.DataFrame(all_vals,columns=['timestamp','window','some_val']) {code} I am not even able to manually return window column. 
It throws error {code:java} Traceback (most recent call last): File "/usr/local/spark-3.0.0-preview/python/pyspark/sql/udf.py", line 139, in returnType to_arrow_type(self._returnType_placeholder) File "/usr/local/spark-3.0.0-preview/python/pyspark/sql/types.py", line 1641, in to_arrow_type raise TypeError("Nested StructType not supported in conversion to Arrow") TypeError: Nested StructType not supported in conversion to ArrowDuring handling of the above exception, another exception occurred:Traceback (most recent call last): File "", line 1, in File "/usr/local/spark-3.0.0-preview/python/pyspark/sql/udf.py", line 79, in _create_udf return udf_obj._wrapped() File "/usr/local/spark-3.0.0-preview/python/pyspark/sql/udf.py", line 234, in _wrapped wrapper.returnType = self.returnType File "/usr/local/spark-3.0.0-preview/python/pyspark/sql/udf.py", line 143, in returnType "%s is not supported" % str(self._returnType_placeholder)) NotImplementedError: Invalid returnType wi
[jira] [Commented] (SPARK-28502) Error with struct conversion while using pandas_udf
[ https://issues.apache.org/jira/browse/SPARK-28502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16995815#comment-16995815 ] Nasir Ali commented on SPARK-28502: --- {code:java} import numpy as np import pandas as pd import json from geopy.distance import great_circle from pyspark.sql.functions import pandas_udf, PandasUDFType from shapely.geometry.multipoint import MultiPoint from sklearn.cluster import DBSCAN from pyspark.sql.types import StructField, StructType, StringType, FloatType, MapType from pyspark.sql.types import StructField, StructType, StringType, FloatType, TimestampType, IntegerType,DateType,TimestampTypeschema = StructType([ StructField("timestamp", TimestampType()), StructField("window", StructType([ StructField("start", TimestampType()), StructField("end", TimestampType())])), StructField("some_val", StringType()) ])@pandas_udf(schema, PandasUDFType.GROUPED_MAP) def get_win_col(key, user_data): all_vals = [] for index, row in user_data.iterrows(): all_vals.append([row["timestamp"],key[2],"tesss"]) return pd.DataFrame(all_vals,columns=['timestamp','window','some_val']) {code} I am not even able to manually return window column. 
It throws error {code:java} Traceback (most recent call last): File "/usr/local/spark-3.0.0-preview/python/pyspark/sql/udf.py", line 139, in returnType to_arrow_type(self._returnType_placeholder) File "/usr/local/spark-3.0.0-preview/python/pyspark/sql/types.py", line 1641, in to_arrow_type raise TypeError("Nested StructType not supported in conversion to Arrow") TypeError: Nested StructType not supported in conversion to ArrowDuring handling of the above exception, another exception occurred:Traceback (most recent call last): File "", line 1, in File "/usr/local/spark-3.0.0-preview/python/pyspark/sql/udf.py", line 79, in _create_udf return udf_obj._wrapped() File "/usr/local/spark-3.0.0-preview/python/pyspark/sql/udf.py", line 234, in _wrapped wrapper.returnType = self.returnType File "/usr/local/spark-3.0.0-preview/python/pyspark/sql/udf.py", line 143, in returnType "%s is not supported" % str(self._returnType_placeholder)) NotImplementedError: Invalid returnType with grouped map Pandas UDFs: StructType(List(StructField(timestamp,TimestampType,true),StructField(window,StructType(List(StructField(start,TimestampType,true),StructField(end,TimestampType,true))),true),StructField(some_val,StringType,true))) is not supported {code} However, if I manually run *to_arrow_schema(schema)*. It works all fine and there is no exception. {code:java} from pyspark.sql.types import to_arrow_schema to_arrow_schema(schema) {code} > Error with struct conversion while using pandas_udf > --- > > Key: SPARK-28502 > URL: https://issues.apache.org/jira/browse/SPARK-28502 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.3 > Environment: OS: Ubuntu > Python: 3.6 >Reporter: Nasir Ali >Priority: Minor > Fix For: 3.0.0 > > > What I am trying to do: Group data based on time intervals (e.g., 15 days > window) and perform some operations on dataframe using (pandas) UDFs. I don't > know if there is a better/cleaner way to do it. 
> Below is the sample code that I tried and error message I am getting. > > {code:java} > df = sparkSession.createDataFrame([(17.00, "2018-03-10T15:27:18+00:00"), > (13.00, "2018-03-11T12:27:18+00:00"), > (25.00, "2018-03-12T11:27:18+00:00"), > (20.00, "2018-03-13T15:27:18+00:00"), > (17.00, "2018-03-14T12:27:18+00:00"), > (99.00, "2018-03-15T11:27:18+00:00"), > (156.00, "2018-03-22T11:27:18+00:00"), > (17.00, "2018-03-31T11:27:18+00:00"), > (25.00, "2018-03-15T11:27:18+00:00"), > (25.00, "2018-03-16T11:27:18+00:00") > ], >["id", "ts"]) > df = df.withColumn('ts', df.ts.cast('timestamp')) > schema = StructType([ > StructField("id", IntegerType()), > StructField("ts", TimestampType()) > ]) > @pandas_udf(schema, PandasUDFType.GROUPED_MAP) > def some_udf(df): > # some computation > return df > df.groupby('id', F.window("ts", "15 days")).apply(some_udf).show() > {code} > This throws following exception: > {code:java} > TypeError: Unsupported type in conversion from Arrow: struct timestamp[us, tz=America/Chicago], end: timestamp[us, tz=America/Chicago]> > {code} > > However, if I use builtin agg method then it works all fine. For example, > {code:java} > df.groupby('id', F.window("ts", "15 days"
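Until nested StructType survives the Arrow conversion, one common workaround is to flatten the window struct into top-level start/end columns before declaring the UDF's return schema. The helper below sketches that flattening over a plain-Python stand-in for a schema (dicts mark nested structs); it is not a PySpark API, only an illustration of the shape of the workaround.

```python
def flatten_schema(fields):
    """Flatten one level of struct nesting.

    ('window', {'start': ..., 'end': ...}) becomes two top-level fields,
    window_start and window_end, which Arrow can convert without complaint.
    """
    flat = []
    for name, dtype in fields:
        if isinstance(dtype, dict):  # stand-in for a nested StructType
            flat.extend((f"{name}_{sub}", sub_t) for sub, sub_t in dtype.items())
        else:
            flat.append((name, dtype))
    return flat

nested = [("timestamp", "timestamp"),
          ("window", {"start": "timestamp", "end": "timestamp"}),
          ("some_val", "string")]
print(flatten_schema(nested))
```

The UDF would then fill `window_start`/`window_end` from `key[2]` instead of returning the struct whole.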
[jira] [Updated] (SPARK-30256) Allow SparkLauncher to sudo before executing spark-submit
[ https://issues.apache.org/jira/browse/SPARK-30256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Evans updated SPARK-30256: --- Description: It would be useful if {{org.apache.spark.launcher.SparkLauncher}} allowed for a "sudo as user X" option. This way, multi-tenant applications that run Spark jobs could give end users greater security, by ensuring that the files (including, importantly, keytabs) can remain readable only by the end users instead of the UID that runs this multi-tenant application itself. I believe that {{sudo -u spark-submit }} should work. The builder maintained by {{SparkLauncher}} could simply have a {{setSudoUser}} method. (was: It would be useful if {{org.apache.spark.launcher.SparkLauncher}} allowed for a "sudo as user X" option. This way, multi-tenant applications that run Spark jobs could give end users greater security, by ensuring that the files (including, importantly, keytabs) can remain readable only by the end users instead of the UID that runs this multi-tenant application itself. I believe that {{sudo -u spark-submit Allow SparkLauncher to sudo before executing spark-submit > - > > Key: SPARK-30256 > URL: https://issues.apache.org/jira/browse/SPARK-30256 > Project: Spark > Issue Type: Improvement > Components: Spark Submit >Affects Versions: 3.0.0 >Reporter: Jeff Evans >Priority: Minor > > It would be useful if {{org.apache.spark.launcher.SparkLauncher}} allowed for > a "sudo as user X" option. This way, multi-tenant applications that run > Spark jobs could give end users greater security, by ensuring that the files > (including, importantly, keytabs) can remain readable only by the end users > instead of the UID that runs this multi-tenant application itself. I believe > that {{sudo -u spark-submit }} should work. The > builder maintained by {{SparkLauncher}} could simply have a {{setSudoUser}} > method. 
[jira] [Created] (SPARK-30256) Allow SparkLauncher to sudo before executing spark-submit
Jeff Evans created SPARK-30256: -- Summary: Allow SparkLauncher to sudo before executing spark-submit Key: SPARK-30256 URL: https://issues.apache.org/jira/browse/SPARK-30256 Project: Spark Issue Type: Improvement Components: Spark Submit Affects Versions: 3.0.0 Reporter: Jeff Evans It would be useful if {{org.apache.spark.launcher.SparkLauncher}} allowed for a "sudo as user X" option. This way, multi-tenant applications that run Spark jobs could give end users greater security, by ensuring that the files (including, importantly, keytabs) can remain readable only by the end users instead of the UID that runs this multi-tenant application itself. I believe that {{sudo -u spark-submit
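The requested behaviour amounts to prefixing the child-process command line before it is spawned. A hypothetical sketch in Python (SparkLauncher itself is Java, and `setSudoUser`, this function, and its argument handling are assumptions, not an existing API):

```python
def build_launch_command(spark_submit_args, sudo_user=None):
    """Prefix a spark-submit invocation with `sudo -u <user>` when requested.

    Mirrors what a hypothetical SparkLauncher.setSudoUser(...) could do before
    spawning the child process; sudo_user=None keeps the current behaviour.
    """
    cmd = ["spark-submit"] + list(spark_submit_args)
    if sudo_user is not None:
        cmd = ["sudo", "-u", sudo_user] + cmd
    return cmd

print(build_launch_command(["--class", "app.Main", "app.jar"], sudo_user="etl"))
```

Building the command as a list (rather than a shell string) keeps arguments safe from shell interpolation, which matters here since the sudo target user is caller-supplied.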
[jira] [Commented] (SPARK-30168) Eliminate warnings in Parquet datasource
[ https://issues.apache.org/jira/browse/SPARK-30168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16995754#comment-16995754 ] Maxim Gekk commented on SPARK-30168: [~Ankitraj] Go ahead. > Eliminate warnings in Parquet datasource > > > Key: SPARK-30168 > URL: https://issues.apache.org/jira/browse/SPARK-30168 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Minor > > # > sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetPartitionReaderFactory.scala > {code} > Warning:Warning:line (120)class ParquetInputSplit in package hadoop is > deprecated: see corresponding Javadoc for more information. > Option[TimeZone]) => RecordReader[Void, T]): RecordReader[Void, T] > = { > Warning:Warning:line (125)class ParquetInputSplit in package hadoop is > deprecated: see corresponding Javadoc for more information. > new org.apache.parquet.hadoop.ParquetInputSplit( > Warning:Warning:line (134)method readFooter in class ParquetFileReader is > deprecated: see corresponding Javadoc for more information. > ParquetFileReader.readFooter(conf, filePath, > SKIP_ROW_GROUPS).getFileMetaData > Warning:Warning:line (183)class ParquetInputSplit in package hadoop is > deprecated: see corresponding Javadoc for more information. > split: ParquetInputSplit, > Warning:Warning:line (212)class ParquetInputSplit in package hadoop is > deprecated: see corresponding Javadoc for more information. 
> split: ParquetInputSplit, > {code} > # > sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java > {code} > Warning:Warning:line (55)java: org.apache.parquet.hadoop.ParquetInputSplit in > org.apache.parquet.hadoop has been deprecated > Warning:Warning:line (95)java: > org.apache.parquet.hadoop.ParquetInputSplit in org.apache.parquet.hadoop has > been deprecated > Warning:Warning:line (95)java: > org.apache.parquet.hadoop.ParquetInputSplit in org.apache.parquet.hadoop has > been deprecated > Warning:Warning:line (97)java: getRowGroupOffsets() in > org.apache.parquet.hadoop.ParquetInputSplit has been deprecated > Warning:Warning:line (105)java: > readFooter(org.apache.hadoop.conf.Configuration,org.apache.hadoop.fs.Path,org.apache.parquet.format.converter.ParquetMetadataConverter.MetadataFilter) > in org.apache.parquet.hadoop.ParquetFileReader has been deprecated > Warning:Warning:line (108)java: > filterRowGroups(org.apache.parquet.filter2.compat.FilterCompat.Filter,java.util.List,org.apache.parquet.schema.MessageType) > in org.apache.parquet.filter2.compat.RowGroupFilter has been deprecated > Warning:Warning:line (111)java: > readFooter(org.apache.hadoop.conf.Configuration,org.apache.hadoop.fs.Path,org.apache.parquet.format.converter.ParquetMetadataConverter.MetadataFilter) > in org.apache.parquet.hadoop.ParquetFileReader has been deprecated > Warning:Warning:line (147)java: > ParquetFileReader(org.apache.hadoop.conf.Configuration,org.apache.parquet.hadoop.metadata.FileMetaData,org.apache.hadoop.fs.Path,java.util.List,java.util.List) > in org.apache.parquet.hadoop.ParquetFileReader has been deprecated > Warning:Warning:line (203)java: > readFooter(org.apache.hadoop.conf.Configuration,org.apache.hadoop.fs.Path,org.apache.parquet.format.converter.ParquetMetadataConverter.MetadataFilter) > in org.apache.parquet.hadoop.ParquetFileReader has been deprecated > Warning:Warning:line (226)java: > 
ParquetFileReader(org.apache.hadoop.conf.Configuration,org.apache.parquet.hadoop.metadata.FileMetaData,org.apache.hadoop.fs.Path,java.util.List,java.util.List) > in org.apache.parquet.hadoop.ParquetFileReader has been deprecated > {code} > # > sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetCompatibilityTest.scala > # > sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetInteroperabilitySuite.scala > # > sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetTest.scala > # > sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30243) Upgrade K8s client dependency to 4.6.4
[ https://issues.apache.org/jira/browse/SPARK-30243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-30243: - Assignee: Dongjoon Hyun > Upgrade K8s client dependency to 4.6.4 > -- > > Key: SPARK-30243 > URL: https://issues.apache.org/jira/browse/SPARK-30243 > Project: Spark > Issue Type: Improvement > Components: Build, Kubernetes >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30243) Upgrade K8s client dependency to 4.6.4
[ https://issues.apache.org/jira/browse/SPARK-30243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-30243. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26874 [https://github.com/apache/spark/pull/26874] > Upgrade K8s client dependency to 4.6.4 > -- > > Key: SPARK-30243 > URL: https://issues.apache.org/jira/browse/SPARK-30243 > Project: Spark > Issue Type: Improvement > Components: Build, Kubernetes >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30248) DROP TABLE doesn't work if session catalog name is provided
[ https://issues.apache.org/jira/browse/SPARK-30248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-30248. - Fix Version/s: 3.0.0 Assignee: Terry Kim Resolution: Fixed > DROP TABLE doesn't work if session catalog name is provided > --- > > Key: SPARK-30248 > URL: https://issues.apache.org/jira/browse/SPARK-30248 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > Fix For: 3.0.0 > > > If a table name is qualified with session catalog name ("spark_catalog"), the > DROP TABLE command fails. > For example, the following > {code:java} > sql("CREATE TABLE tbl USING json AS SELECT 1 AS i") > sql("DROP TABLE spark_catalog.tbl") > {code} > fails with: > {code:java} > org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database > 'spark_catalog' not found; >at > org.apache.spark.sql.catalyst.catalog.ExternalCatalog.requireDbExists(ExternalCatalog.scala:42) >at > org.apache.spark.sql.catalyst.catalog.ExternalCatalog.requireDbExists$(ExternalCatalog.scala:40) >at > org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.requireDbExists(InMemoryCatalog.scala:45) >at > org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.tableExists(InMemoryCatalog.scala:336) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30072) Create dedicated planner for subqueries
[ https://issues.apache.org/jira/browse/SPARK-30072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16995627#comment-16995627 ] Xiaoju Wu commented on SPARK-30072: --- [~cloud_fan] If the sql looks like: SELECT * FROM df2 WHERE df2.k = (SELECT max(df2.k) FROM df1 JOIN df2 ON df1.k = df2.k AND df2.id < 2) The nested subquery "SELECT max(df2.k) FROM df1 JOIN df2 ON df1.k = df2.k AND df2.id < 2" will be run in another QueryExecution, so there's no way to pass "isSubquery" information to InsertAdaptiveSparkPlan in the nested QueryExecution. > Create dedicated planner for subqueries > --- > > Key: SPARK-30072 > URL: https://issues.apache.org/jira/browse/SPARK-30072 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ali Afroozeh >Assignee: Ali Afroozeh >Priority: Minor > Fix For: 3.0.0 > > > This PR changes subquery planning by calling the planner and plan preparation > rules on the subquery plan directly. Before, we were creating a QueryExecution > instance for subqueries to get the executedPlan. This would re-run analysis > and optimization on the subquery's plan. Running the analysis again on an > optimized query plan can have unwanted consequences, as some rules, for > example DecimalPrecision, are not idempotent. > As an example, consider the expression 1.7 * avg(a), which after applying the > DecimalPrecision rule becomes: > promote_precision(1.7) * promote_precision(avg(a)) > After the optimization, more specifically the constant folding rule, this > expression becomes: > 1.7 * promote_precision(avg(a)) > Now if we run the analyzer on this optimized query again, we will get: > promote_precision(1.7) * promote_precision(promote_precision(avg(a))) > which will later be optimized as: > 1.7 * promote_precision(promote_precision(avg(a))) > As can be seen, re-running the analysis and optimization on this expression > results in an expression with extra nested promote_precision nodes. Adding > unneeded nodes to the plan is problematic because it can eliminate situations > where we can reuse the plan. > We opted to introduce dedicated planners for subqueries, instead of making > the DecimalPrecision rule idempotent, because this eliminates this entire > category of problems. Another benefit is that planning time for subqueries is > reduced. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
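The non-idempotence argument in the issue description can be reproduced with a toy rewrite rule. This is a hypothetical Python stand-in for DecimalPrecision and constant folding, not Spark's actual rules; it only shows why re-analyzing an already-optimized tree nests the wrapper:

```python
# Toy illustration of a non-idempotent analysis rule (hypothetical; not
# Spark's real DecimalPrecision). Expressions are (left, op, right) tuples
# of strings. "analyze" wraps both operands unconditionally, even ones
# that are already wrapped, so it is not idempotent.

def analyze(expr):
    """Wrap each operand of the expression in promote_precision(...)."""
    left, op, right = expr
    return (f"promote_precision({left})", op, f"promote_precision({right})")

def constant_fold(expr):
    """Simulate constant folding: unwrap a wrapped literal back to itself."""
    left, op, right = expr
    if left == "promote_precision(1.7)":
        left = "1.7"
    return (left, op, right)

expr = ("1.7", "*", "avg(a)")
once = constant_fold(analyze(expr))
# Re-running analysis on the already-optimized tree nests the wrapper:
twice = constant_fold(analyze(once))
print(once)   # ('1.7', '*', 'promote_precision(avg(a))')
print(twice)  # ('1.7', '*', 'promote_precision(promote_precision(avg(a)))')
```

Planning subqueries without a fresh QueryExecution avoids the second `analyze` pass entirely, which is why the dedicated planner sidesteps this whole class of problems.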
[jira] [Created] (SPARK-30255) Support explain mode in SparkR df.explain
Takeshi Yamamuro created SPARK-30255: Summary: Support explain mode in SparkR df.explain Key: SPARK-30255 URL: https://issues.apache.org/jira/browse/SPARK-30255 Project: Spark Issue Type: Improvement Components: R, SQL Affects Versions: 3.0.0 Reporter: Takeshi Yamamuro This pr intends to support explain modes implemented in SPARK-30200(#26829) for SparkR. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30227) Add close() on DataWriter interface
[ https://issues.apache.org/jira/browse/SPARK-30227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-30227. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26855 [https://github.com/apache/spark/pull/26855] > Add close() on DataWriter interface > --- > > Key: SPARK-30227 > URL: https://issues.apache.org/jira/browse/SPARK-30227 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > Fix For: 3.0.0 > > > If the scaladoc of DataWriter is correct, the lifecycle of a DataWriter > instance ends at either commit() or abort(). That makes datasource > implementors feel they can place resource cleanup on either side, but > abort() can be called when commit() fails, so they have to ensure they don't > do double-cleanup if cleanup is not idempotent. > So I'm proposing to add close() on DataWriter explicitly, which is "the > place" for resource cleanup. The lifecycle of a DataWriter instance will (and > should) end at close(). > I've checked some callers to see whether they can apply "try-catch-finally" > to ensure close() is called at the end of the DataWriter lifecycle, and it > looks like they can. > The change would be backward incompatible, but given that the interface > is marked as Evolving and we're making backward incompatible changes in Spark > 3.0, I feel it may not matter. > I've raised the discussion around this issue and the feedback is positive: > https://lists.apache.org/thread.html/bfdb989fa83bc4d774804473610bd0cfcaa1dd5a020ca9a522f3510c%40%3Cdev.spark.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
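As a rough sketch of the lifecycle the issue proposes (a hypothetical Python analogue; Spark's real DataWriter is a Java interface with different signatures), commit()/abort() decide the outcome while a single close() in a finally block owns resource cleanup, so neither commit() nor abort() needs cleanup-idempotent logic:

```python
# Hypothetical Python analogue of the proposed DataWriter lifecycle.
# close() is the single place for resource cleanup; commit()/abort()
# only decide the outcome of the write.

class DataWriter:
    def __init__(self):
        self.closed = False
        self.outcome = None

    def write(self, record):
        pass  # buffer/stream the record

    def commit(self):
        self.outcome = "committed"

    def abort(self):
        self.outcome = "aborted"

    def close(self):
        self.closed = True  # release files, sockets, buffers, ...

def run_task(writer, records, fail=False):
    # Caller pattern the issue describes: try/except decides commit vs
    # abort, and finally guarantees close() exactly once at end of life.
    try:
        for r in records:
            writer.write(r)
            if fail:
                raise RuntimeError("write failed")
        writer.commit()
    except Exception:
        writer.abort()
    finally:
        writer.close()

w = DataWriter()
run_task(w, [1, 2, 3])
print(w.outcome, w.closed)  # committed True
```

With this split, a datasource that calls abort() after a failed commit() no longer risks double-cleanup, because cleanup lives only in close().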
[jira] [Assigned] (SPARK-30227) Add close() on DataWriter interface
[ https://issues.apache.org/jira/browse/SPARK-30227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-30227: --- Assignee: Jungtaek Lim > Add close() on DataWriter interface > --- > > Key: SPARK-30227 > URL: https://issues.apache.org/jira/browse/SPARK-30227 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > > If the scaladoc of DataWriter is correct, the lifecycle of a DataWriter > instance ends at either commit() or abort(). That makes datasource > implementors feel they can place resource cleanup on either side, but > abort() can be called when commit() fails, so they have to ensure they don't > do double-cleanup if cleanup is not idempotent. > So I'm proposing to add close() on DataWriter explicitly, which is "the > place" for resource cleanup. The lifecycle of a DataWriter instance will (and > should) end at close(). > I've checked some callers to see whether they can apply "try-catch-finally" > to ensure close() is called at the end of the DataWriter lifecycle, and it > looks like they can. > The change would be backward incompatible, but given that the interface > is marked as Evolving and we're making backward incompatible changes in Spark > 3.0, I feel it may not matter. > I've raised the discussion around this issue and the feedback is positive: > https://lists.apache.org/thread.html/bfdb989fa83bc4d774804473610bd0cfcaa1dd5a020ca9a522f3510c%40%3Cdev.spark.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29741) Spark Application UI- In Environment tab add "Search" option
[ https://issues.apache.org/jira/browse/SPARK-29741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-29741. -- Resolution: Won't Fix > Spark Application UI- In Environment tab add "Search" option > > > Key: SPARK-29741 > URL: https://issues.apache.org/jira/browse/SPARK-29741 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.0 >Reporter: jobit mathew >Priority: Minor > > Spark Application UI - in the Environment tab, add a "Search" option. > As the Environment tab now has different sections for information and > properties (Runtime Information, Spark Properties, Hadoop Properties, System Properties & > Classpath Entries), it would be better to provide a single *Search* field, so it will be > easy to search for any parameter value even when we don't know which section it > appears in. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30079) Tests fail in environments with locale different from en_US
[ https://issues.apache.org/jira/browse/SPARK-30079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-30079. -- Resolution: Not A Problem > Tests fail in environments with locale different from en_US > --- > > Key: SPARK-30079 > URL: https://issues.apache.org/jira/browse/SPARK-30079 > Project: Spark > Issue Type: Bug > Components: Build, Tests >Affects Versions: 3.0.0 > Environment: any environment with a non-English locale and/or > different separators for numbers. >Reporter: Lukas Menzel >Priority: Trivial > > Tests fail on systems with a locale different from en_US. > Assertions on exception messages fail because the messages are localized > by Java depending on the system environment (e.g. > org.apache.spark.deploy.SparkSubmitSuite). > Other tests fail because of assertions about formatted numbers, which use > different separators (see > [https://docs.oracle.com/cd/E19455-01/806-0169/overview-9/index.html]) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27021) Leaking Netty event loop group for shuffle chunk fetch requests
[ https://issues.apache.org/jira/browse/SPARK-27021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16995538#comment-16995538 ] Attila Zsolt Piros commented on SPARK-27021: [~roncenzhao] # yes # I do not think so. This bug mostly affects the test system, as test execution is the place where multiple TransportContext, NettyRpcEnv, etc ... are created and not closed correctly. > Leaking Netty event loop group for shuffle chunk fetch requests > --- > > Key: SPARK-27021 > URL: https://issues.apache.org/jira/browse/SPARK-27021 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0, 2.4.1, 3.0.0 >Reporter: Attila Zsolt Piros >Assignee: Attila Zsolt Piros >Priority: Major > Fix For: 3.0.0 > > > The extra event loop group created for handling shuffle chunk fetch requests > is never closed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30208) A race condition when reading from Kafka in PySpark
[ https://issues.apache.org/jira/browse/SPARK-30208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16995532#comment-16995532 ] Jungtaek Lim commented on SPARK-30208: -- I've just tested it simply with additional logging:
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TaskCompletionListenerTesting").getOrCreate()

df = spark \
    .read \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "topic1") \
    .load()

def f(rows):
    for row in rows:
        print(row.key)

df.foreachPartition(f)
{code}
and no, KafkaRDD registers its listener earlier than PythonRunner, which means the callback from PythonRunner will be called earlier. That sounds natural, as KafkaRDD is a data source and hence should be placed first. (I can't imagine the other case.) So my guess seems wrong; there is another, slightly possible case - the completion callback of PythonRunner doesn't even join the writer thread - but given this is about a race condition, I'm not 100% sure.
> A race condition when reading from Kafka in PySpark > --- > > Key: SPARK-30208 > URL: https://issues.apache.org/jira/browse/SPARK-30208 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.4 >Reporter: Jiawen Zhu >Priority: Major > > When using PySpark to read from Kafka, there is a race condition that Spark > may use KafkaConsumer in multiple threads at the same time and throw the > following error: > {code} > java.util.ConcurrentModificationException: KafkaConsumer is not safe for > multi-threaded access > at > kafkashaded.org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:2215) > at > kafkashaded.org.apache.kafka.clients.consumer.KafkaConsumer.close(KafkaConsumer.java:2104) > at > kafkashaded.org.apache.kafka.clients.consumer.KafkaConsumer.close(KafkaConsumer.java:2059) > at > org.apache.spark.sql.kafka010.InternalKafkaConsumer.close(KafkaDataConsumer.scala:451) > at > org.apache.spark.sql.kafka010.KafkaDataConsumer$NonCachedKafkaDataConsumer.release(KafkaDataConsumer.scala:508) > at > org.apache.spark.sql.kafka010.KafkaSourceRDD$$anon$1.close(KafkaSourceRDD.scala:126) > at > org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:66) > at > org.apache.spark.sql.kafka010.KafkaSourceRDD$$anonfun$compute$3.apply(KafkaSourceRDD.scala:131) > at > org.apache.spark.sql.kafka010.KafkaSourceRDD$$anonfun$compute$3.apply(KafkaSourceRDD.scala:130) > at > org.apache.spark.TaskContext$$anon$1.onTaskCompletion(TaskContext.scala:162) > at > org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:131) > at > org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:131) > at > org.apache.spark.TaskContextImpl$$anonfun$invokeListeners$1.apply(TaskContextImpl.scala:144) > at > org.apache.spark.TaskContextImpl$$anonfun$invokeListeners$1.apply(TaskContextImpl.scala:142) > at > 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.TaskContextImpl.invokeListeners(TaskContextImpl.scala:142) > at > org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:130) > at org.apache.spark.scheduler.Task.doRunTask(Task.scala:155) > at org.apache.spark.scheduler.Task.run(Task.scala:112) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:497) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1526) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:503) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > {code} > When using PySpark, reading from Kafka actually happens in a separate > writer thread rather than the task thread. When a task is terminated early > (e.g., there is a limit operator), the task thread may stop the KafkaConsumer > while the writer thread is still using it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
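The registration-order reasoning in the comment above hinges on task completion listeners being invoked in reverse registration order (the stack trace shows `TaskContextImpl.invokeListeners` doing this). A minimal sketch of that contract, as a hypothetical Python stand-in for TaskContextImpl rather than Spark's actual implementation:

```python
# Minimal sketch of LIFO completion-listener invocation (hypothetical;
# Spark's TaskContextImpl invokes listeners in reverse registration
# order). If KafkaRDD registers first, its close-consumer callback runs
# last, after PythonRunner's callback has had the chance to join the
# writer thread.

class TaskContext:
    def __init__(self):
        self._listeners = []

    def add_task_completion_listener(self, fn):
        self._listeners.append(fn)

    def mark_task_completed(self):
        # Last registered runs first.
        for fn in reversed(self._listeners):
            fn()

calls = []
ctx = TaskContext()
ctx.add_task_completion_listener(lambda: calls.append("KafkaRDD: close consumer"))
ctx.add_task_completion_listener(lambda: calls.append("PythonRunner: join writer thread"))
ctx.mark_task_completed()
print(calls)
# ['PythonRunner: join writer thread', 'KafkaRDD: close consumer']
```

If PythonRunner's callback returned without actually joining the writer thread, the consumer could still be in use when KafkaRDD's callback closes it, which matches the ConcurrentModificationException in the report.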
[jira] [Commented] (SPARK-28825) Document EXPLAIN Statement in SQL Reference.
[ https://issues.apache.org/jira/browse/SPARK-28825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16995531#comment-16995531 ] pavithra ramachandran commented on SPARK-28825: --- [~dkbiswal] are you working on this? If not I would like to work on this. > Document EXPLAIN Statement in SQL Reference. > > > Key: SPARK-28825 > URL: https://issues.apache.org/jira/browse/SPARK-28825 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 2.4.3 >Reporter: jobit mathew >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30254) Fix use custom escape lead to LikeSimplification optimize failed
ulysses you created SPARK-30254: --- Summary: Fix use custom escape lead to LikeSimplification optimize failed Key: SPARK-30254 URL: https://issues.apache.org/jira/browse/SPARK-30254 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: ulysses you We should also sync the escape used by `LikeSimplification`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30253) Do not add commits when releasing preview version
[ https://issues.apache.org/jira/browse/SPARK-30253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-30253: Attachment: 3.0.0-preview.png > Do not add commits when releasing preview version > - > > Key: SPARK-30253 > URL: https://issues.apache.org/jira/browse/SPARK-30253 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > Attachments: 3.0.0-preview.png > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30253) Do not add commits when releasing preview version
[ https://issues.apache.org/jira/browse/SPARK-30253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-30253: Summary: Do not add commits when releasing preview version (was: Preview release does not add version change commits to master branch) > Do not add commits when releasing preview version > - > > Key: SPARK-30253 > URL: https://issues.apache.org/jira/browse/SPARK-30253 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30253) Preview release does not add version change commits to master branch
Yuming Wang created SPARK-30253: --- Summary: Preview release does not add version change commits to master branch Key: SPARK-30253 URL: https://issues.apache.org/jira/browse/SPARK-30253 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 3.0.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30252) Disallow negative scale of Decimal under ansi mode
wuyi created SPARK-30252: Summary: Disallow negative scale of Decimal under ansi mode Key: SPARK-30252 URL: https://issues.apache.org/jira/browse/SPARK-30252 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: wuyi According to SQL standard, {quote}4.4.2 Characteristics of numbers An exact numeric type has a precision P and a scale S. P is a positive integer that determines the number of significant digits in a particular radix R, where R is either 2 or 10. S is a non-negative integer. {quote} scale of Decimal should always be non-negative. And other mainstream databases, like Presto, PostgreSQL, also don't allow negative scale. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
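For intuition, an exact numeric stores a value as unscaled * 10^(-S), so a negative scale S places the significant digits left of the decimal point. Python's decimal module can illustrate the idea, since a Decimal's tuple exponent plays the role of -S; this is an analogy, not Spark's Decimal type:

```python
from decimal import Decimal

# A Decimal's tuple exponent corresponds to -scale: value = digits * 10**exp.
# The SQL standard requires scale S to be non-negative, i.e. exponent <= 0.
d = Decimal("12.345")
print(d.as_tuple().exponent)  # -3, i.e. scale 3: three digits after the point

# A "negative scale" value like 1.23E+4 has a positive exponent: its
# significant digits (123) end two places before the decimal point.
neg_scale = Decimal("1.23E+4")
print(neg_scale.as_tuple().exponent)  # 2, i.e. scale -2
```

Disallowing the second form under ansi mode aligns Spark with the standard and with databases like Presto and PostgreSQL.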
[jira] [Resolved] (SPARK-30231) Support explain mode in PySpark df.explain
[ https://issues.apache.org/jira/browse/SPARK-30231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30231. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26861 [https://github.com/apache/spark/pull/26861] > Support explain mode in PySpark df.explain > -- > > Key: SPARK-30231 > URL: https://issues.apache.org/jira/browse/SPARK-30231 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Fix For: 3.0.0 > > > This pr intends to support explain modes implemented in SPARK-30200(#26829) > for PySpark. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30231) Support explain mode in PySpark df.explain
[ https://issues.apache.org/jira/browse/SPARK-30231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-30231: Assignee: Takeshi Yamamuro > Support explain mode in PySpark df.explain > -- > > Key: SPARK-30231 > URL: https://issues.apache.org/jira/browse/SPARK-30231 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > > This pr intends to support explain modes implemented in SPARK-30200(#26829) > for PySpark. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30251) faster way to read csv.gz?
[ https://issues.apache.org/jira/browse/SPARK-30251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] t oo updated SPARK-30251: - Description: Some data providers give files in csv.gz (i.e. 1gb compressed which is 25gb uncompressed; or 5gb compressed which is 130gb uncompressed; or .1gb compressed which is 2.5gb uncompressed). Now when I tell my boss that the famous big data tool Spark takes 16hrs to convert the 1gb compressed file into parquet, there is a look of shock. This is batch data we receive daily (80gb compressed, 2tb uncompressed every day, spread across ~300 files). I know gz is not splittable so it ends up loaded on a single worker, but we don't have the space/patience to do a pre-conversion to bz2 or uncompressed. Can Spark have a better codec? I saw posts mentioning even Python is faster than Spark [https://stackoverflow.com/questions/40492967/dealing-with-a-large-gzipped-file-in-spark] [https://github.com/nielsbasjes/splittablegzip] was: Some data providers give files in csv.gz (i.e. 1gb compressed which is 25gb uncompressed; or 5gb compressed which is 130gb uncompressed; or .1gb compressed which is 2.5gb uncompressed). Now when I tell my boss that the famous big data tool Spark takes 16hrs to convert the 1gb compressed file into parquet, there is a look of shock. This is batch data we receive daily (80gb compressed, 2tb uncompressed every day, spread across ~300 files). I know gz is not splittable so it is currently loaded on a single worker, but we don't have the space/patience to do a pre-conversion to bz2 or uncompressed. Can Spark have a better codec? I saw posts mentioning even Python is faster than Spark [https://stackoverflow.com/questions/40492967/dealing-with-a-large-gzipped-file-in-spark] [https://github.com/nielsbasjes/splittablegzip] > faster way to read csv.gz? > -- > > Key: SPARK-30251 > URL: https://issues.apache.org/jira/browse/SPARK-30251 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: t oo >Priority: Major > > Some data providers give files in csv.gz (i.e. 1gb compressed which is 25gb > uncompressed; or 5gb compressed which is 130gb uncompressed; or .1gb compressed > which is 2.5gb uncompressed). Now when I tell my boss that the famous big data > tool Spark takes 16hrs to convert the 1gb compressed file into parquet, there > is a look of shock. This is batch data we receive daily (80gb compressed, 2tb > uncompressed every day, spread across ~300 files). > I know gz is not splittable so it ends up loaded on a single worker, but we > don't have the space/patience to do a pre-conversion to bz2 or uncompressed. Can > Spark have a better codec? I saw posts mentioning even Python is faster than > Spark > > [https://stackoverflow.com/questions/40492967/dealing-with-a-large-gzipped-file-in-spark] > [https://github.com/nielsbasjes/splittablegzip] > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
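For reference, the splittablegzip project linked in the report is wired in through Hadoop configuration rather than Spark code changes. A hedged sketch, assuming its jar (nl.basjes.hadoop:splittablegzip) is on the executor classpath and with placeholder paths; this is configuration only, not a tested pipeline:

```python
from pyspark.sql import SparkSession

# Sketch only: requires the splittablegzip jar on the classpath.
# The codec class name comes from that project's documentation.
spark = (SparkSession.builder
         .appName("SplittableGzipRead")
         # Register the codec so .gz files can be split across tasks:
         .config("spark.hadoop.io.compression.codecs",
                 "nl.basjes.hadoop.io.compress.SplittableGzipCodec")
         .getOrCreate())

# Placeholder input/output paths:
df = spark.read.csv("/data/input/*.csv.gz", header=True)
df.write.parquet("/data/output-parquet/")
```

The codec trades redundant decompression work for parallelism: several tasks each read the same gzip stream from the start but keep only their assigned split, which can still be much faster than one worker decompressing 25gb alone.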