[jira] [Commented] (SPARK-32380) sparksql cannot access hive table while data in hbase

2021-01-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263152#comment-17263152
 ] 

Apache Spark commented on SPARK-32380:
--

User 'yangBottle' has created a pull request for this issue:
https://github.com/apache/spark/pull/31147

> sparksql cannot access hive table while data in hbase
> -
>
> Key: SPARK-32380
> URL: https://issues.apache.org/jira/browse/SPARK-32380
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: ||component||version||
> |hadoop|2.8.5|
> |hive|2.3.7|
> |spark|3.0.0|
> |hbase|1.4.9|
>Reporter: deyzhong
>Priority: Major
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> * step1: create hbase table
> {code:java}
>  hbase(main):001:0> create 'hbase_test', 'cf1'
>  hbase(main):001:0> put 'hbase_test', 'r1', 'cf1:v1', '123'
> {code}
>  * step2: create hive table related to hbase table
>  
> {code:java}
> hive> 
> CREATE EXTERNAL TABLE `hivetest.hbase_test`(
>   `key` string COMMENT '', 
>   `value` string COMMENT '')
> ROW FORMAT SERDE 
>   'org.apache.hadoop.hive.hbase.HBaseSerDe' 
> STORED BY 
>   'org.apache.hadoop.hive.hbase.HBaseStorageHandler' 
> WITH SERDEPROPERTIES ( 
>   'hbase.columns.mapping'=':key,cf1:v1', 
>   'serialization.format'='1')
> TBLPROPERTIES (
>   'hbase.table.name'='hbase_test')
>  {code}
>  * step3: sparksql query hive table while data in hbase
> {code:java}
> spark-sql --master yarn -e "select * from hivetest.hbase_test"
> {code}
>  
> The error log is as follows:
> java.io.IOException: Cannot create a record reader because of a previous 
> error. Please look at the previous logs lines from the task's full log for 
> more details.
>  at 
> org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:270)
>  at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:131)
>  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
>  at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
>  at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
>  at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
>  at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
>  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2158)
>  at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1004)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>  at org.apache.spark.rdd.RDD.withScope(RDD.scala:388)
>  at org.apache.spark.rdd.RDD.collect(RDD.scala:1003)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:385)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:412)
>  at 
> org.apache.spark.sql.execution.HiveResult$.hiveResultString(HiveResult.scala:58)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.$anonfun$run$1(SparkSQLDriver.scala:65)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
>  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:65)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:377)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:496)
>  at scala.collection.Iterator.foreach(Iterator.scala:941)
>  at scala.collection.It
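
A minimal sketch of the failing access path, assuming the Hive metastore from the environment above is configured and the HBase/Hive-HBase handler jars (hive-hbase-handler, hbase-client, and friends) are supplied to Spark via --jars or spark.jars; the missing-jar setup is an assumption for illustration, not a confirmed diagnosis from this report:

{code:scala}
// Sketch only: query the HBase-backed Hive table from a Spark application.
// Assumes Hive support is enabled and the HBase connector jars are on the classpath.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hbase-backed-hive-table")
  .enableHiveSupport()                 // lets Spark resolve hivetest.hbase_test
  .getOrCreate()

spark.sql("SELECT * FROM hivetest.hbase_test").show()
{code}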

[jira] [Commented] (SPARK-32380) sparksql cannot access hive table while data in hbase

2021-01-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263153#comment-17263153
 ] 

Apache Spark commented on SPARK-32380:
--

User 'yangBottle' has created a pull request for this issue:
https://github.com/apache/spark/pull/31147


[jira] [Updated] (SPARK-33867) java.time.Instant and java.time.LocalDate not handled in org.apache.spark.sql.jdbc.JdbcDialect#compileValue

2021-01-12 Thread Cristi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cristi updated SPARK-33867:
---
Description: 
When using the new java time API (spark.sql.datetime.java8API.enabled=true) 
LocalDate and Instant aren't handled in 
org.apache.spark.sql.jdbc.JdbcDialect#compileValue so exceptions are thrown 
when they are used in filters since a filter condition would be translated to 
something like this: "valid_from" > 2020-12-21T11:40:24.413681Z.

To reproduce you can write a simple filter like: 

dataset.filter(current_timestamp().gt(col(VALID_FROM)))

The error and stacktrace:

Caused by: org.postgresql.util.PSQLException: ERROR: syntax error at or near 
"T11"Caused by: org.postgresql.util.PSQLException: ERROR: syntax error at or 
near "T11"  Position: 285 at 
org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2103)
 at 
org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1836)
 at 
org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:257) at 
org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:512)
 at 
org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:388)
 at 
org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:273)
 at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:304)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at 
org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) 
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) 
at org.apache.spark.scheduler.Task.run(Task.scala:127) at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449) at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
 at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
 at java.base/java.lang.Thread.run(Thread.java:834)

  was:
When using the new java time API (spark.sql.datetime.java8API.enabled=true) 
LocalDate and Instant aren't handled in 
org.apache.spark.sql.jdbc.JdbcDialect#compileValue so exceptions are thrown 
when they are used in filters since a filter condition would be translated to 
something like this: "validity_end" > 2020-12-21T11:40:24.413681Z

The error and stacktrace:

Caused by: org.postgresql.util.PSQLException: ERROR: syntax error at or near 
"T11"Caused by: org.postgresql.util.PSQLException: ERROR: syntax error at or 
near "T11"  Position: 285 at 
org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2103)
 at 
org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1836)
 at 
org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:257) at 
org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:512)
 at 
org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:388)
 at 
org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:273)
 at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:304)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at 
org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) 
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) 
at org.apache.spark.scheduler.Task.run(Task.scala:127) at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at 
org.apache.spark.executor.Executor$
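
The report boils down to java.time values being embedded verbatim in the pushed-down WHERE clause. A minimal sketch, not Spark's actual implementation, of converting the Java 8 time types into quoted JDBC literals before they are compiled into the predicate; the helper name is illustrative:

{code:scala}
// Sketch only: turn java.time values into quoted SQL literals.
// The method name and its placement in the dialect are assumptions.
import java.sql.{Date, Timestamp}
import java.time.{Instant, LocalDate}

def compileTemporalValue(value: Any): Any = value match {
  case i: Instant   => s"'${Timestamp.from(i)}'"   // e.g. '2020-12-21 11:40:24.413681'
  case d: LocalDate => s"'${Date.valueOf(d)}'"     // e.g. '2020-12-21'
  case other        => other                       // defer everything else to the existing dialect logic
}
{code}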

[jira] [Commented] (SPARK-34079) Improvement CTE table scan

2021-01-12 Thread Peter Toth (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263168#comment-17263168
 ] 

Peter Toth commented on SPARK-34079:


Thanks for pinging me [~yumwang]. I'm happy to work on this if you haven't 
started it yet.

> Improvement CTE table scan
> --
>
> Key: SPARK-34079
> URL: https://issues.apache.org/jira/browse/SPARK-34079
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> Prepare table:
> {code:sql}
> CREATE TABLE store_sales (  ss_sold_date_sk INT,  ss_sold_time_sk INT,  
> ss_item_sk INT,  ss_customer_sk INT,  ss_cdemo_sk INT,  ss_hdemo_sk INT,  
> ss_addr_sk INT,  ss_store_sk INT,  ss_promo_sk INT,  ss_ticket_number INT,  
> ss_quantity INT,  ss_wholesale_cost DECIMAL(7,2),  ss_list_price 
> DECIMAL(7,2),  ss_sales_price DECIMAL(7,2),  ss_ext_discount_amt 
> DECIMAL(7,2),  ss_ext_sales_price DECIMAL(7,2),  ss_ext_wholesale_cost 
> DECIMAL(7,2),  ss_ext_list_price DECIMAL(7,2),  ss_ext_tax DECIMAL(7,2),  
> ss_coupon_amt DECIMAL(7,2),  ss_net_paid DECIMAL(7,2),  ss_net_paid_inc_tax 
> DECIMAL(7,2),ss_net_profit DECIMAL(7,2));
> CREATE TABLE reason (  r_reason_sk INT,  r_reason_id varchar(255),  
> r_reason_desc varchar(255));
> {code}
> SQL:
> {code:sql}
> WITH bucket_result AS (
> SELECT
> CASE WHEN (count (CASE WHEN ss_quantity BETWEEN 1 AND 20 THEN ss_quantity 
> END)) > 62316685
>   THEN (avg(CASE WHEN ss_quantity BETWEEN 1 AND 20 THEN 
> ss_ext_discount_amt END))
> ELSE (avg(CASE WHEN ss_quantity BETWEEN 1 AND 20 THEN ss_net_paid END)) 
> END bucket1,
> CASE WHEN (count (CASE WHEN ss_quantity BETWEEN 21 AND 40 THEN 
> ss_quantity END)) > 19045798
>   THEN (avg(CASE WHEN ss_quantity BETWEEN 21 AND 40 THEN 
> ss_ext_discount_amt END))
> ELSE (avg(CASE WHEN ss_quantity BETWEEN 21 AND 40 THEN ss_net_paid END)) 
> END bucket2,
> CASE WHEN (count (CASE WHEN ss_quantity BETWEEN 41 AND 60 THEN 
> ss_quantity END)) > 365541424
>   THEN (avg(CASE WHEN ss_quantity BETWEEN 41 AND 60 THEN 
> ss_ext_discount_amt END))
> ELSE (avg(CASE WHEN ss_quantity BETWEEN 41 AND 60 THEN ss_net_paid END)) 
> END bucket3,
> CASE WHEN (count (CASE WHEN ss_quantity BETWEEN 61 AND 80 THEN 
> ss_quantity END)) > 19045798
>   THEN (avg(CASE WHEN ss_quantity BETWEEN 61 AND 80 THEN 
> ss_ext_discount_amt END))
> ELSE (avg(CASE WHEN ss_quantity BETWEEN 61 AND 80 THEN ss_net_paid END)) 
> END bucket4,
> CASE WHEN (count (CASE WHEN ss_quantity BETWEEN 81 AND 100 THEN 
> ss_quantity END)) > 365541424
>   THEN (avg(CASE WHEN ss_quantity BETWEEN 81 AND 100 THEN 
> ss_ext_discount_amt END))
> ELSE (avg(CASE WHEN ss_quantity BETWEEN 81 AND 100 THEN ss_net_paid END)) 
> END bucket5
>   FROM store_sales
> )
> SELECT
>   (SELECT bucket1 FROM bucket_result) as bucket1,
>   (SELECT bucket2 FROM bucket_result) as bucket2,
>   (SELECT bucket3 FROM bucket_result) as bucket3,
>   (SELECT bucket4 FROM bucket_result) as bucket4,
>   (SELECT bucket5 FROM bucket_result) as bucket5
> FROM reason
> WHERE r_reason_sk = 1;
> {code}
> Plan of Spark SQL:
> {noformat}
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- Project [Subquery subquery#0, [id=#23] AS bucket1#1, Subquery subquery#2, 
> [id=#34] AS bucket2#3, Subquery subquery#4, [id=#45] AS bucket3#5, Subquery 
> subquery#6, [id=#56] AS bucket4#7, Subquery subquery#8, [id=#67] AS bucket5#9]
>:  :- Subquery subquery#0, [id=#23]
>:  :  +- AdaptiveSparkPlan isFinalPlan=false
>:  : +- HashAggregate(keys=[], functions=[count(if (((ss_quantity#28 
> >= 1) AND (ss_quantity#28 <= 20))) ss_quantity#28 else null), 
> avg(UnscaledValue(if (((ss_quantity#28 >= 1) AND (ss_quantity#28 <= 20))) 
> ss_ext_discount_amt#32 else null)), avg(UnscaledValue(if (((ss_quantity#28 >= 
> 1) AND (ss_quantity#28 <= 20))) ss_net_paid#38 else null))])
>:  :+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#21]
>:  :   +- HashAggregate(keys=[], functions=[partial_count(if 
> (((ss_quantity#28 >= 1) AND (ss_quantity#28 <= 20))) ss_quantity#28 else 
> null), partial_avg(UnscaledValue(if (((ss_quantity#28 >= 1) AND 
> (ss_quantity#28 <= 20))) ss_ext_discount_amt#32 else null)), 
> partial_avg(UnscaledValue(if (((ss_quantity#28 >= 1) AND (ss_quantity#28 <= 
> 20))) ss_net_paid#38 else null))])
>:  :  +- FileScan parquet 
> default.store_sales[ss_quantity#28,ss_ext_discount_amt#32,ss_net_paid#38] 
> Batched: true, DataFilters: [], Format: Parquet, Location: 
> InMemoryFileIndex[file:/Users/yumwang/spark/SPARK-28169/spark-warehouse/org.apache.spark.sql.Data...,
>  PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct
>:  :- Subquery subquery#2, [id=#34]
>:  :  +- AdaptiveS
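
The plan fragment above shows each scalar subquery re-reading store_sales. A sketch of why reusing the CTE result matters: the same answer can be produced with a single scan by cross-joining the one-row aggregate (simplified here to one bucket; this illustrates the redundancy, it is not the proposed fix):

{code:scala}
// Sketch only: one bucket, computed once and cross-joined, so store_sales is scanned once.
val singleScan = spark.sql("""
  WITH bucket_result AS (
    SELECT avg(ss_net_paid) AS bucket1
    FROM store_sales
    WHERE ss_quantity BETWEEN 1 AND 20
  )
  SELECT b.bucket1
  FROM reason r CROSS JOIN bucket_result b
  WHERE r.r_reason_sk = 1
""")
singleScan.explain()   // store_sales should appear only once in this plan
{code}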

[jira] [Created] (SPARK-34083) Using TPCDS original definitions for char/varchar columns

2021-01-12 Thread Kent Yao (Jira)
Kent Yao created SPARK-34083:


 Summary: Using TPCDS original definitions for char/varchar columns
 Key: SPARK-34083
 URL: https://issues.apache.org/jira/browse/SPARK-34083
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0, 3.2.0
Reporter: Kent Yao


Using TPCDS original definitions for char/varchar columns instead of the 
modified string
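
For context, a sketch of what the original definitions look like for the reason table, assuming the char/varchar lengths from the TPC-DS specification (treat the exact lengths here as illustrative):

{code:scala}
// Sketch only: TPC-DS-style char/varchar columns instead of plain STRING.
// Lengths follow the TPC-DS spec but are illustrative in this note.
spark.sql("""
  CREATE TABLE reason_tpcds (
    r_reason_sk   INT,
    r_reason_id   CHAR(16),
    r_reason_desc VARCHAR(100)
  )
""")
{code}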




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33867) java.time.Instant and java.time.LocalDate not handled in org.apache.spark.sql.jdbc.JdbcDialect#compileValue

2021-01-12 Thread Cristi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cristi updated SPARK-33867:
---
Description: 
When using the new java time API (spark.sql.datetime.java8API.enabled=true) 
LocalDate and Instant aren't handled in 
org.apache.spark.sql.jdbc.JdbcDialect#compileValue so exceptions are thrown 
when they are used in filters since a filter condition would be translated to 
something like this: "valid_from" > 2020-12-21T11:40:24.413681Z.

To reproduce, you can write a simple filter like the one below, where the dataset is 
backed by a DB table (in my case PostgreSQL): 

dataset.filter(current_timestamp().gt(col(VALID_FROM)))

The error and stacktrace:

Caused by: org.postgresql.util.PSQLException: ERROR: syntax error at or near 
"T11"Caused by: org.postgresql.util.PSQLException: ERROR: syntax error at or 
near "T11"  Position: 285 at 
org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2103)
 at 
org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1836)
 at 
org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:257) at 
org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:512)
 at 
org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:388)
 at 
org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:273)
 at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:304)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at 
org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) 
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) 
at org.apache.spark.scheduler.Task.run(Task.scala:127) at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449) at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
 at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
 at java.base/java.lang.Thread.run(Thread.java:834)

  was:
When using the new java time API (spark.sql.datetime.java8API.enabled=true) 
LocalDate and Instant aren't handled in 
org.apache.spark.sql.jdbc.JdbcDialect#compileValue so exceptions are thrown 
when they are used in filters since a filter condition would be translated to 
something like this: "valid_from" > 2020-12-21T11:40:24.413681Z.

To reproduce you can write a simple filter like: 

dataset.filter(current_timestamp().gt(col(VALID_FROM)))

The error and stacktrace:

Caused by: org.postgresql.util.PSQLException: ERROR: syntax error at or near 
"T11"Caused by: org.postgresql.util.PSQLException: ERROR: syntax error at or 
near "T11"  Position: 285 at 
org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2103)
 at 
org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1836)
 at 
org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:257) at 
org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:512)
 at 
org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:388)
 at 
org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:273)
 at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:304)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at 
org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) 
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) 
at org.apache.spark.scheduler.Task.run(Task.scala:127) at 
org.apache.spark.e

[jira] [Updated] (SPARK-33867) java.time.Instant and java.time.LocalDate not handled in org.apache.spark.sql.jdbc.JdbcDialect#compileValue

2021-01-12 Thread Cristi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cristi updated SPARK-33867:
---
Priority: Major  (was: Minor)

> java.time.Instant and java.time.LocalDate not handled in 
> org.apache.spark.sql.jdbc.JdbcDialect#compileValue
> ---
>
> Key: SPARK-33867
> URL: https://issues.apache.org/jira/browse/SPARK-33867
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Cristi
>Priority: Major
>
> When using the new java time API (spark.sql.datetime.java8API.enabled=true) 
> LocalDate and Instant aren't handled in 
> org.apache.spark.sql.jdbc.JdbcDialect#compileValue so exceptions are thrown 
> when they are used in filters since a filter condition would be translated to 
> something like this: "valid_from" > 2020-12-21T11:40:24.413681Z.
> To reproduce, you can write a simple filter like the one below, where the dataset 
> is backed by a DB table (in my case PostgreSQL): 
> dataset.filter(current_timestamp().gt(col(VALID_FROM)))
> The error and stacktrace:
> Caused by: org.postgresql.util.PSQLException: ERROR: syntax error at or near 
> "T11"Caused by: org.postgresql.util.PSQLException: ERROR: syntax error at or 
> near "T11"  Position: 285 at 
> org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2103)
>  at 
> org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1836)
>  at 
> org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:257) 
> at 
> org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:512)
>  at 
> org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:388)
>  at 
> org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:273)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:304)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>  at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) at 
> org.apache.spark.scheduler.Task.run(Task.scala:127) at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449) at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>  at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>  at java.base/java.lang.Thread.run(Thread.java:834)






[jira] [Resolved] (SPARK-34082) Window expression with alias inside WHERE and HAVING clauses fail with non-descriptive exceptions

2021-01-12 Thread Lantao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin resolved SPARK-34082.

Resolution: Invalid

Closing this because {{cannot resolve 'b' given input columns}} seems to be a correct 
error message: Filter should be resolved before Projection. I was confused by the 
QUALIFY syntax in our internal Spark version.

> Window expression with alias inside WHERE and HAVING clauses fail with 
> non-descriptive exceptions
> -
>
> Key: SPARK-34082
> URL: https://issues.apache.org/jira/browse/SPARK-34082
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.2.0, 3.1.1
>Reporter: Lantao Jin
>Priority: Minor
>
> SPARK-24575 prohibits window expressions inside WHERE and HAVING clauses. But 
> if a window expression is referenced through an alias inside a WHERE or HAVING 
> clause, Spark does not handle this explicitly and fails with non-descriptive exceptions.
> {code}
> SELECT a, RANK() OVER(ORDER BY b) AS s FROM testData2 WHERE b = 2 AND s = 1
> {code}
> {code}
> cannot resolve '`s`' given input columns: [testdata2.a, testdata2.b]
> {code}
> {code}
> SELECT a, MAX(b), RANK() OVER(ORDER BY a) AS s
> FROM testData2
> GROUP BY a
> HAVING SUM(b) = 5 AND s = 1
> {code}
> {code}
> cannot resolve '`b`' given input columns: [testdata2.a, max(b)]
> {code}
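
For reference, the intended result of the first query can be expressed legally by computing the window function in a subquery and filtering on its alias one level up; a sketch, assuming the testData2 table from the report:

{code:scala}
// Sketch only: window functions are not allowed in WHERE/HAVING, so compute the
// rank in a subquery and filter on the alias outside it.
spark.sql("""
  SELECT a, s
  FROM (
    SELECT a, b, RANK() OVER (ORDER BY b) AS s
    FROM testData2
  ) t
  WHERE b = 2 AND s = 1
""").show()
{code}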






[jira] [Closed] (SPARK-34082) Window expression with alias inside WHERE and HAVING clauses fail with non-descriptive exceptions

2021-01-12 Thread Lantao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin closed SPARK-34082.
--

> Window expression with alias inside WHERE and HAVING clauses fail with 
> non-descriptive exceptions
> -
>
> Key: SPARK-34082
> URL: https://issues.apache.org/jira/browse/SPARK-34082
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.2.0, 3.1.1
>Reporter: Lantao Jin
>Priority: Minor
>
> SPARK-24575 prohibits window expressions inside WHERE and HAVING clauses. But 
> if a window expression is referenced through an alias inside a WHERE or HAVING 
> clause, Spark does not handle this explicitly and fails with non-descriptive exceptions.
> {code}
> SELECT a, RANK() OVER(ORDER BY b) AS s FROM testData2 WHERE b = 2 AND s = 1
> {code}
> {code}
> cannot resolve '`s`' given input columns: [testdata2.a, testdata2.b]
> {code}
> {code}
> SELECT a, MAX(b), RANK() OVER(ORDER BY a) AS s
> FROM testData2
> GROUP BY a
> HAVING SUM(b) = 5 AND s = 1
> {code}
> {code}
> cannot resolve '`b`' given input columns: [testdata2.a, max(b)]
> {code}






[jira] [Updated] (SPARK-34067) PartitionPruning push down pruningHasBenefit function into insertPredicate function to decrease calculate time

2021-01-12 Thread jiahong.li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiahong.li updated SPARK-34067:
---
Affects Version/s: 3.0.2, 3.1.0, 3.1.1, 3.2.0

> PartitionPruning push down pruningHasBenefit function into insertPredicate 
> function to decrease calculate time
> --
>
> Key: SPARK-34067
> URL: https://issues.apache.org/jira/browse/SPARK-34067
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.2.0, 3.1.1
>Reporter: jiahong.li
>Priority: Minor
>
> In the class PartitionPruning, function prune: push the pruningHasBenefit check 
> down into the insertPredicate function, since `SQLConf.get.exchangeReuseEnabled` 
> and `SQLConf.get.dynamicPartitionPruningReuseBroadcastOnly` are both true by 
> default.
> By making hasBenefit a lazy val, we avoid invoking pruningHasBenefit when it is 
> not needed, which saves time.
> solved by #31122
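
A minimal sketch of the lazy-evaluation idea described above; the names and signature are illustrative, not Spark's internal API:

{code:scala}
// Sketch only: evaluate the cheap config checks first; the expensive benefit
// estimate is deferred and computed at most once, and only when it can change
// the outcome.
def shouldInsertPredicate(exchangeReuseEnabled: Boolean,
                          reuseBroadcastOnly: Boolean)
                         (estimateBenefit: => Boolean): Boolean = {
  lazy val hasBenefit = estimateBenefit            // deferred; evaluated at most once
  if (exchangeReuseEnabled && reuseBroadcastOnly) true
  else hasBenefit                                  // only this branch forces the estimate
}
{code}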






[jira] [Created] (SPARK-34084) ALTER TABLE .. ADD PARTITION does not update table stats

2021-01-12 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-34084:
--

 Summary: ALTER TABLE .. ADD PARTITION does not update table stats
 Key: SPARK-34084
 URL: https://issues.apache.org/jira/browse/SPARK-34084
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.2, 3.2.0, 3.1.1
Reporter: Maxim Gekk


The example below demonstrates the issue:
{code:sql}
spark-sql> create table tbl (col0 int, part int) partitioned by (part);
spark-sql> insert into tbl partition (part = 0) select 0;
spark-sql> set spark.sql.statistics.size.autoUpdate.enabled=true;
spark-sql> alter table tbl add partition (part = 1);
{code}
There are no stats:
{code:sql}
spark-sql> describe table extended tbl;
col0int NULL
partint NULL
# Partition Information
# col_name  data_type   comment
partint NULL

# Detailed Table Information
Databasedefault
Table   tbl
Owner   maximgekk
Created TimeTue Jan 12 12:00:03 MSK 2021
Last Access UNKNOWN
Created By  Spark 3.2.0-SNAPSHOT
TypeMANAGED
Providerhive
Table Properties[transient_lastDdlTime=1610442003]
Location
file:/Users/maximgekk/proj/fix-stats-in-add-partition/spark-warehouse/tbl
Serde Library   org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat org.apache.hadoop.mapred.TextInputFormat
OutputFormatorg.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Storage Properties  [serialization.format=1]
Partition Provider  Catalog
{code}

*As we can see, there are no stats.* In contrast, ALTER TABLE .. DROP PARTITION 
updates the stats:
{code:sql}
spark-sql> alter table tbl drop partition (part = 1);
spark-sql> describe table extended tbl;
col0int NULL
partint NULL
# Partition Information
# col_name  data_type   comment
partint NULL

# Detailed Table Information
...
Statistics  2 bytes
{code}
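
Until this is fixed, a possible workaround (an assumption on my part, not part of the report) is to refresh the size statistics explicitly after adding the partition:

{code:scala}
// Sketch only: recompute table-level size statistics after ADD PARTITION so that
// DESCRIBE TABLE EXTENDED reports a Statistics line again.
spark.sql("ALTER TABLE tbl ADD PARTITION (part = 1)")
spark.sql("ANALYZE TABLE tbl COMPUTE STATISTICS NOSCAN")   // size-only stats, no row scan
spark.sql("DESCRIBE TABLE EXTENDED tbl").show(100, truncate = false)
{code}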






[jira] [Commented] (SPARK-28065) ntile only accepting positive (>0) values

2021-01-12 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263187#comment-17263187
 ] 

jiaan.geng commented on SPARK-28065:


After investigation, I found that most databases require the argument to be a 
positive integer greater than 0.

> ntile only accepting positive (>0) values
> -
>
> Key: SPARK-28065
> URL: https://issues.apache.org/jira/browse/SPARK-28065
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Dylan Guedes
>Priority: Major
>
> Currently, Spark does not accept null or zero as an input for `ntile`; however, 
> Postgres supports both.
>  Example:
> {code:sql}
> SELECT ntile(NULL) OVER (ORDER BY ten, four), ten, four FROM tenk1 LIMIT 2;
> {code}
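
For comparison, a sketch of a query Spark accepts today, where the argument is a positive integer literal (the inline data is illustrative):

{code:scala}
// Sketch only: ntile with a positive integer argument, which Spark currently accepts.
spark.sql("""
  SELECT x, NTILE(2) OVER (ORDER BY x) AS bucket
  FROM VALUES (1), (2), (3), (4) AS t(x)
""").show()
{code}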






[jira] [Updated] (SPARK-33867) java.time.Instant and java.time.LocalDate not handled in org.apache.spark.sql.jdbc.JdbcDialect#compileValue

2021-01-12 Thread Cristi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cristi updated SPARK-33867:
---
Description: 
When using the new java time API (spark.sql.datetime.java8API.enabled=true) 
LocalDate and Instant aren't handled in 
org.apache.spark.sql.jdbc.JdbcDialect#compileValue so exceptions are thrown 
when they are used in filters since a filter condition would be translated to 
something like this: "valid_from" > 2020-12-21T11:40:24.413681Z.

To reproduce, you can write a simple filter like the one below, where the dataset is 
backed by a DB table (in my case PostgreSQL): 

dataset.filter(current_timestamp().gt(col(VALID_FROM)))

The error and stacktrace:

Caused by: org.postgresql.util.PSQLException: ERROR: syntax error at or near 
"T11"Caused by: org.postgresql.util.PSQLException: ERROR: syntax error at or 
near "T11"  Position: 285 at 
org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2103)
 at 
org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1836)
 at 
org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:257) at 
org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:512)
 at 
org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:388)
 at 
org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:273)
 at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:304)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at 
org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) 
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) 
at org.apache.spark.scheduler.Task.run(Task.scala:127) at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449) at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
 at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
 at java.base/java.lang.Thread.run(Thread.java:834)

  was:
When using the new java time API (spark.sql.datetime.java8API.enabled=true) 
LocalDate and Instant aren't handled in 
org.apache.spark.sql.jdbc.JdbcDialect#compileValue so exceptions are thrown 
when they are used in filters since a filter condition would be translated to 
something like this: "valid_from" > 2020-12-21T11:40:24.413681Z.

To reproduce you can write a simple filter like where dataset is backed by a DB 
table (in b=my case PostgreSQL): 

dataset.filter(current_timestamp().gt(col(VALID_FROM)))

The error and stacktrace:

Caused by: org.postgresql.util.PSQLException: ERROR: syntax error at or near 
"T11"Caused by: org.postgresql.util.PSQLException: ERROR: syntax error at or 
near "T11"  Position: 285 at 
org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2103)
 at 
org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1836)
 at 
org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:257) at 
org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:512)
 at 
org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:388)
 at 
org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:273)
 at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:304)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at 
org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) 
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) 
at org.apache.

[jira] [Commented] (SPARK-34079) Improvement CTE table scan

2021-01-12 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263193#comment-17263193
 ] 

Yuming Wang commented on SPARK-34079:
-

Thank you. Go ahead, please.

> Improvement CTE table scan
> --
>
> Key: SPARK-34079
> URL: https://issues.apache.org/jira/browse/SPARK-34079
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> Prepare table:
> {code:sql}
> CREATE TABLE store_sales (  ss_sold_date_sk INT,  ss_sold_time_sk INT,  
> ss_item_sk INT,  ss_customer_sk INT,  ss_cdemo_sk INT,  ss_hdemo_sk INT,  
> ss_addr_sk INT,  ss_store_sk INT,  ss_promo_sk INT,  ss_ticket_number INT,  
> ss_quantity INT,  ss_wholesale_cost DECIMAL(7,2),  ss_list_price 
> DECIMAL(7,2),  ss_sales_price DECIMAL(7,2),  ss_ext_discount_amt 
> DECIMAL(7,2),  ss_ext_sales_price DECIMAL(7,2),  ss_ext_wholesale_cost 
> DECIMAL(7,2),  ss_ext_list_price DECIMAL(7,2),  ss_ext_tax DECIMAL(7,2),  
> ss_coupon_amt DECIMAL(7,2),  ss_net_paid DECIMAL(7,2),  ss_net_paid_inc_tax 
> DECIMAL(7,2),ss_net_profit DECIMAL(7,2));
> CREATE TABLE reason (  r_reason_sk INT,  r_reason_id varchar(255),  
> r_reason_desc varchar(255));
> {code}
> SQL:
> {code:sql}
> WITH bucket_result AS (
> SELECT
> CASE WHEN (count (CASE WHEN ss_quantity BETWEEN 1 AND 20 THEN ss_quantity 
> END)) > 62316685
>   THEN (avg(CASE WHEN ss_quantity BETWEEN 1 AND 20 THEN 
> ss_ext_discount_amt END))
> ELSE (avg(CASE WHEN ss_quantity BETWEEN 1 AND 20 THEN ss_net_paid END)) 
> END bucket1,
> CASE WHEN (count (CASE WHEN ss_quantity BETWEEN 21 AND 40 THEN 
> ss_quantity END)) > 19045798
>   THEN (avg(CASE WHEN ss_quantity BETWEEN 21 AND 40 THEN 
> ss_ext_discount_amt END))
> ELSE (avg(CASE WHEN ss_quantity BETWEEN 21 AND 40 THEN ss_net_paid END)) 
> END bucket2,
> CASE WHEN (count (CASE WHEN ss_quantity BETWEEN 41 AND 60 THEN 
> ss_quantity END)) > 365541424
>   THEN (avg(CASE WHEN ss_quantity BETWEEN 41 AND 60 THEN 
> ss_ext_discount_amt END))
> ELSE (avg(CASE WHEN ss_quantity BETWEEN 41 AND 60 THEN ss_net_paid END)) 
> END bucket3,
> CASE WHEN (count (CASE WHEN ss_quantity BETWEEN 61 AND 80 THEN 
> ss_quantity END)) > 19045798
>   THEN (avg(CASE WHEN ss_quantity BETWEEN 61 AND 80 THEN 
> ss_ext_discount_amt END))
> ELSE (avg(CASE WHEN ss_quantity BETWEEN 61 AND 80 THEN ss_net_paid END)) 
> END bucket4,
> CASE WHEN (count (CASE WHEN ss_quantity BETWEEN 81 AND 100 THEN 
> ss_quantity END)) > 365541424
>   THEN (avg(CASE WHEN ss_quantity BETWEEN 81 AND 100 THEN 
> ss_ext_discount_amt END))
> ELSE (avg(CASE WHEN ss_quantity BETWEEN 81 AND 100 THEN ss_net_paid END)) 
> END bucket5
>   FROM store_sales
> )
> SELECT
>   (SELECT bucket1 FROM bucket_result) as bucket1,
>   (SELECT bucket2 FROM bucket_result) as bucket2,
>   (SELECT bucket3 FROM bucket_result) as bucket3,
>   (SELECT bucket4 FROM bucket_result) as bucket4,
>   (SELECT bucket5 FROM bucket_result) as bucket5
> FROM reason
> WHERE r_reason_sk = 1;
> {code}
> Plan of Spark SQL:
> {noformat}
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- Project [Subquery subquery#0, [id=#23] AS bucket1#1, Subquery subquery#2, 
> [id=#34] AS bucket2#3, Subquery subquery#4, [id=#45] AS bucket3#5, Subquery 
> subquery#6, [id=#56] AS bucket4#7, Subquery subquery#8, [id=#67] AS bucket5#9]
>:  :- Subquery subquery#0, [id=#23]
>:  :  +- AdaptiveSparkPlan isFinalPlan=false
>:  : +- HashAggregate(keys=[], functions=[count(if (((ss_quantity#28 
> >= 1) AND (ss_quantity#28 <= 20))) ss_quantity#28 else null), 
> avg(UnscaledValue(if (((ss_quantity#28 >= 1) AND (ss_quantity#28 <= 20))) 
> ss_ext_discount_amt#32 else null)), avg(UnscaledValue(if (((ss_quantity#28 >= 
> 1) AND (ss_quantity#28 <= 20))) ss_net_paid#38 else null))])
>:  :+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#21]
>:  :   +- HashAggregate(keys=[], functions=[partial_count(if 
> (((ss_quantity#28 >= 1) AND (ss_quantity#28 <= 20))) ss_quantity#28 else 
> null), partial_avg(UnscaledValue(if (((ss_quantity#28 >= 1) AND 
> (ss_quantity#28 <= 20))) ss_ext_discount_amt#32 else null)), 
> partial_avg(UnscaledValue(if (((ss_quantity#28 >= 1) AND (ss_quantity#28 <= 
> 20))) ss_net_paid#38 else null))])
>:  :  +- FileScan parquet 
> default.store_sales[ss_quantity#28,ss_ext_discount_amt#32,ss_net_paid#38] 
> Batched: true, DataFilters: [], Format: Parquet, Location: 
> InMemoryFileIndex[file:/Users/yumwang/spark/SPARK-28169/spark-warehouse/org.apache.spark.sql.Data...,
>  PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct
>:  :- Subquery subquery#2, [id=#34]
>:  :  +- AdaptiveSparkPlan isFinalPlan=false
>:  : +- HashAggregate(key

[jira] [Assigned] (SPARK-34083) Using TPCDS original definitions for char/varchar columns

2021-01-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34083:


Assignee: (was: Apache Spark)

> Using TPCDS original definitions for char/varchar columns
> -
>
> Key: SPARK-34083
> URL: https://issues.apache.org/jira/browse/SPARK-34083
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0, 3.2.0
>Reporter: Kent Yao
>Priority: Major
>
> Using TPCDS original definitions for char/varchar columns instead of the 
> modified string






[jira] [Commented] (SPARK-34083) Using TPCDS original definitions for char/varchar columns

2021-01-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263194#comment-17263194
 ] 

Apache Spark commented on SPARK-34083:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/31012

> Using TPCDS original definitions for char/varchar columns
> -
>
> Key: SPARK-34083
> URL: https://issues.apache.org/jira/browse/SPARK-34083
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0, 3.2.0
>Reporter: Kent Yao
>Priority: Major
>
> Using TPCDS original definitions for char/varchar columns instead of the 
> modified string






[jira] [Assigned] (SPARK-34083) Using TPCDS original definitions for char/varchar columns

2021-01-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34083:


Assignee: Apache Spark

> Using TPCDS original definitions for char/varchar columns
> -
>
> Key: SPARK-34083
> URL: https://issues.apache.org/jira/browse/SPARK-34083
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0, 3.2.0
>Reporter: Kent Yao
>Assignee: Apache Spark
>Priority: Major
>
> Using TPCDS original definitions for char/varchar columns instead of the 
> modified string






[jira] [Commented] (SPARK-34083) Using TPCDS original definitions for char/varchar columns

2021-01-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263195#comment-17263195
 ] 

Apache Spark commented on SPARK-34083:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/31012

> Using TPCDS original definitions for char/varchar columns
> -
>
> Key: SPARK-34083
> URL: https://issues.apache.org/jira/browse/SPARK-34083
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0, 3.2.0
>Reporter: Kent Yao
>Priority: Major
>
> Using TPCDS original definitions for char/varchar columns instead of the 
> modified string






[jira] [Updated] (SPARK-34075) Hidden directories are being listed for partition inference

2021-01-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-34075:
-
Target Version/s: 3.1.1

> Hidden directories are being listed for partition inference
> ---
>
> Key: SPARK-34075
> URL: https://issues.apache.org/jira/browse/SPARK-34075
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Burak Yavuz
>Priority: Blocker
>
> Marking this as a blocker since it seems to be a regression. We are running 
> Delta's tests against Spark 3.1 as part of QA here: 
> [https://github.com/delta-io/delta/pull/579]
>  
> We have noticed that one of our tests regressed with:
> {code:java}
> java.lang.AssertionError: assertion failed: Conflicting directory structures 
> detected. Suspicious paths:
> [info]
> file:/private/var/folders/_2/xn1c9yr11_93wjdk2vkvmwm0gp/t/spark-18706bcc-23ea-4853-b8bc-c4cc2a5ed551
> [info]
> file:/private/var/folders/_2/xn1c9yr11_93wjdk2vkvmwm0gp/t/spark-18706bcc-23ea-4853-b8bc-c4cc2a5ed551/_delta_log
> [info] 
> [info] If provided paths are partition directories, please set "basePath" in 
> the options of the data source to specify the root directory of the table. If 
> there are multiple root directories, please load them separately and then 
> union them.
> [info]   at scala.Predef$.assert(Predef.scala:223)
> [info]   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:172)
> [info]   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:104)
> [info]   at 
> org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning(PartitioningAwareFileIndex.scala:158)
> [info]   at 
> org.apache.spark.sql.execution.datasources.InMemoryFileIndex.partitionSpec(InMemoryFileIndex.scala:73)
> [info]   at 
> org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.partitionSchema(PartitioningAwareFileIndex.scala:50)
> [info]   at 
> org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:167)
> [info]   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:418)
> [info]   at 
> org.apache.spark.sql.execution.datasources.ResolveSQLOnFile$$anonfun$apply$1.applyOrElse(rules.scala:62)
> [info]   at 
> org.apache.spark.sql.execution.datasources.ResolveSQLOnFile$$anonfun$apply$1.applyOrElse(rules.scala:45)
> [info]   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$2(AnalysisHelper.scala:108)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73)
> [info]   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$1(AnalysisHelper.scala:108)
> [info]   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:221)
> [info]   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown(AnalysisHelper.scala:106)
> [info]   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown$(AnalysisHelper.scala:104)
> [info]   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDown(LogicalPlan.scala:29)
> [info]   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperators(AnalysisHelper.scala:73)
> [info]   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperators$(AnalysisHelper.scala:72)
> [info]   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:29)
> [info]   at 
> org.apache.spark.sql.execution.datasources.ResolveSQLOnFile.apply(rules.scala:45)
> [info]   at 
> org.apache.spark.sql.execution.datasources.ResolveSQLOnFile.apply(rules.scala:40)
> [info]   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:216)
> [info]   at 
> scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
> [info]   at 
> scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
> [info]   at scala.collection.immutable.List.foldLeft(List.scala:89)
> [info]   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:213)
> [info]   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:205)
> [info]   at scala.collection.immutable.List.foreach(List.scala:392)
> [info]   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:205)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSam
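
The regression is about partition inference descending into metadata directories such as _delta_log. A sketch of the long-standing convention at stake, as an illustration rather than Spark's exact code path: names starting with an underscore or a dot are hidden and should be excluded from inference:

{code:scala}
// Sketch only: underscore- and dot-prefixed entries (e.g. _delta_log, _SUCCESS)
// are metadata and should not be treated as partition directories.
def isHiddenDir(name: String): Boolean = name.startsWith("_") || name.startsWith(".")

val children = Seq("part=0", "part=1", "_delta_log", "_SUCCESS", ".staging")
children.filterNot(isHiddenDir)   // Seq("part=0", "part=1")
{code}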

[jira] [Assigned] (SPARK-33867) java.time.Instant and java.time.LocalDate not handled in org.apache.spark.sql.jdbc.JdbcDialect#compileValue

2021-01-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33867:


Assignee: Apache Spark

> java.time.Instant and java.time.LocalDate not handled in 
> org.apache.spark.sql.jdbc.JdbcDialect#compileValue
> ---
>
> Key: SPARK-33867
> URL: https://issues.apache.org/jira/browse/SPARK-33867
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Cristi
>Assignee: Apache Spark
>Priority: Major
>
> When using the new java time API (spark.sql.datetime.java8API.enabled=true) 
> LocalDate and Instant aren't handled in 
> org.apache.spark.sql.jdbc.JdbcDialect#compileValue so exceptions are thrown 
> when they are used in filters since a filter condition would be translated to 
> something like this: "valid_from" > 2020-12-21T11:40:24.413681Z.
> To reproduce, you can write a simple filter like the one below, where the dataset 
> is backed by a DB table (in my case PostgreSQL): 
> dataset.filter(current_timestamp().gt(col(VALID_FROM)))
> The error and stacktrace:
> Caused by: org.postgresql.util.PSQLException: ERROR: syntax error at or near 
> "T11"Caused by: org.postgresql.util.PSQLException: ERROR: syntax error at or 
> near "T11"  Position: 285 at 
> org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2103)
>  at 
> org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1836)
>  at 
> org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:257) 
> at 
> org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:512)
>  at 
> org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:388)
>  at 
> org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:273)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:304)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>  at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) at 
> org.apache.spark.scheduler.Task.run(Task.scala:127) at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449) at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>  at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>  at java.base/java.lang.Thread.run(Thread.java:834)
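
A hedged sketch of one possible direction for a fix (an assumption, not the
actual Spark patch): render java.time values as quoted JDBC literals before
they reach the pushed-down WHERE clause, similar to how compileValue already
quotes java.sql.Timestamp and java.sql.Date values.

{code:java}
import java.sql.{Date, Timestamp}
import java.time.{Instant, LocalDate}

// Hypothetical helper: turn java.time values into SQL literal strings so the
// generated filter is valid SQL instead of a bare ISO-8601 timestamp.
def compileJavaTimeValue(value: Any): Any = value match {
  case i: Instant   => s"'${Timestamp.from(i)}'"   // e.g. '2020-12-21 11:40:24.413681'
  case d: LocalDate => s"'${Date.valueOf(d)}'"     // e.g. '2020-12-21'
  case other        => other                       // fall through to the existing compileValue logic
}
{code}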



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33867) java.time.Instant and java.time.LocalDate not handled in org.apache.spark.sql.jdbc.JdbcDialect#compileValue

2021-01-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33867:


Assignee: (was: Apache Spark)

> java.time.Instant and java.time.LocalDate not handled in 
> org.apache.spark.sql.jdbc.JdbcDialect#compileValue
> ---
>
> Key: SPARK-33867
> URL: https://issues.apache.org/jira/browse/SPARK-33867
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Cristi
>Priority: Major
>
> When using the new java.time API (spark.sql.datetime.java8API.enabled=true), 
> LocalDate and Instant aren't handled in 
> org.apache.spark.sql.jdbc.JdbcDialect#compileValue, so exceptions are thrown 
> when they are used in filters: the filter condition gets translated to 
> something like "valid_from" > 2020-12-21T11:40:24.413681Z, which is not valid SQL.
> To reproduce, write a simple filter like the one below, where the dataset is 
> backed by a DB table (in my case PostgreSQL): 
> dataset.filter(current_timestamp().gt(col(VALID_FROM)))
> The error and stacktrace:
> Caused by: org.postgresql.util.PSQLException: ERROR: syntax error at or near 
> "T11"Caused by: org.postgresql.util.PSQLException: ERROR: syntax error at or 
> near "T11"  Position: 285 at 
> org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2103)
>  at 
> org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1836)
>  at 
> org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:257) 
> at 
> org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:512)
>  at 
> org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:388)
>  at 
> org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:273)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:304)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>  at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) at 
> org.apache.spark.scheduler.Task.run(Task.scala:127) at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449) at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>  at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>  at java.base/java.lang.Thread.run(Thread.java:834)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33867) java.time.Instant and java.time.LocalDate not handled in org.apache.spark.sql.jdbc.JdbcDialect#compileValue

2021-01-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263207#comment-17263207
 ] 

Apache Spark commented on SPARK-33867:
--

User 'cristichircu' has created a pull request for this issue:
https://github.com/apache/spark/pull/31148

> java.time.Instant and java.time.LocalDate not handled in 
> org.apache.spark.sql.jdbc.JdbcDialect#compileValue
> ---
>
> Key: SPARK-33867
> URL: https://issues.apache.org/jira/browse/SPARK-33867
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Cristi
>Priority: Major
>
> When using the new java.time API (spark.sql.datetime.java8API.enabled=true), 
> LocalDate and Instant aren't handled in 
> org.apache.spark.sql.jdbc.JdbcDialect#compileValue, so exceptions are thrown 
> when they are used in filters: the filter condition gets translated to 
> something like "valid_from" > 2020-12-21T11:40:24.413681Z, which is not valid SQL.
> To reproduce, write a simple filter like the one below, where the dataset is 
> backed by a DB table (in my case PostgreSQL): 
> dataset.filter(current_timestamp().gt(col(VALID_FROM)))
> The error and stacktrace:
> Caused by: org.postgresql.util.PSQLException: ERROR: syntax error at or near 
> "T11"Caused by: org.postgresql.util.PSQLException: ERROR: syntax error at or 
> near "T11"  Position: 285 at 
> org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2103)
>  at 
> org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1836)
>  at 
> org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:257) 
> at 
> org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:512)
>  at 
> org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:388)
>  at 
> org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:273)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:304)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>  at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) at 
> org.apache.spark.scheduler.Task.run(Task.scala:127) at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449) at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>  at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>  at java.base/java.lang.Thread.run(Thread.java:834)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34084) ALTER TABLE .. ADD PARTITION does not update table stats

2021-01-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263222#comment-17263222
 ] 

Apache Spark commented on SPARK-34084:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/31149

> ALTER TABLE .. ADD PARTITION does not update table stats
> 
>
> Key: SPARK-34084
> URL: https://issues.apache.org/jira/browse/SPARK-34084
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.2.0, 3.1.1
>Reporter: Maxim Gekk
>Priority: Major
>
> The example below illustrates the issue:
> {code:sql}
> spark-sql> create table tbl (col0 int, part int) partitioned by (part);
> spark-sql> insert into tbl partition (part = 0) select 0;
> spark-sql> set spark.sql.statistics.size.autoUpdate.enabled=true;
> spark-sql> alter table tbl add partition (part = 1);
> {code}
> There are no stats:
> {code:sql}
> spark-sql> describe table extended tbl;
> col0  int NULL
> part  int NULL
> # Partition Information
> # col_namedata_type   comment
> part  int NULL
> # Detailed Table Information
> Database  default
> Table tbl
> Owner maximgekk
> Created Time  Tue Jan 12 12:00:03 MSK 2021
> Last Access   UNKNOWN
> Created BySpark 3.2.0-SNAPSHOT
> Type  MANAGED
> Provider  hive
> Table Properties  [transient_lastDdlTime=1610442003]
> Location  
> file:/Users/maximgekk/proj/fix-stats-in-add-partition/spark-warehouse/tbl
> Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> InputFormat   org.apache.hadoop.mapred.TextInputFormat
> OutputFormat  org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> Storage Properties[serialization.format=1]
> Partition ProviderCatalog
> {code}
> *As we can see, there are no stats.* In contrast, ALTER TABLE .. DROP 
> PARTITION does update the stats:
> {code:sql}
> spark-sql> alter table tbl drop partition (part = 1);
> spark-sql> describe table extended tbl;
> col0  int NULL
> part  int NULL
> # Partition Information
> # col_namedata_type   comment
> part  int NULL
> # Detailed Table Information
> ...
> Statistics2 bytes
> {code}
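
A small, hedged way to check programmatically whether the table-level stats
changed after ADD PARTITION; it goes through Spark's internal session catalog,
so treat it as a debugging sketch rather than a stable API:

{code:java}
import org.apache.spark.sql.catalyst.TableIdentifier

// Read the catalog entry for "tbl" and print its size statistics, if any.
val tableMeta = spark.sessionState.catalog.getTableMetadata(TableIdentifier("tbl"))
println(s"stats after ADD PARTITION: ${tableMeta.stats.map(_.sizeInBytes)}")
// With spark.sql.statistics.size.autoUpdate.enabled=true one would expect
// Some(<size>), but per this report ADD PARTITION leaves the stats missing.
{code}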



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34084) ALTER TABLE .. ADD PARTITION does not update table stats

2021-01-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34084:


Assignee: Apache Spark

> ALTER TABLE .. ADD PARTITION does not update table stats
> 
>
> Key: SPARK-34084
> URL: https://issues.apache.org/jira/browse/SPARK-34084
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.2.0, 3.1.1
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>
> The example below illustrates the issue:
> {code:sql}
> spark-sql> create table tbl (col0 int, part int) partitioned by (part);
> spark-sql> insert into tbl partition (part = 0) select 0;
> spark-sql> set spark.sql.statistics.size.autoUpdate.enabled=true;
> spark-sql> alter table tbl add partition (part = 1);
> {code}
> There are no stats:
> {code:sql}
> spark-sql> describe table extended tbl;
> col0  int NULL
> part  int NULL
> # Partition Information
> # col_namedata_type   comment
> part  int NULL
> # Detailed Table Information
> Database  default
> Table tbl
> Owner maximgekk
> Created Time  Tue Jan 12 12:00:03 MSK 2021
> Last Access   UNKNOWN
> Created BySpark 3.2.0-SNAPSHOT
> Type  MANAGED
> Provider  hive
> Table Properties  [transient_lastDdlTime=1610442003]
> Location  
> file:/Users/maximgekk/proj/fix-stats-in-add-partition/spark-warehouse/tbl
> Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> InputFormat   org.apache.hadoop.mapred.TextInputFormat
> OutputFormat  org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> Storage Properties[serialization.format=1]
> Partition ProviderCatalog
> {code}
> *As we can see, there are no stats.* In contrast, ALTER TABLE .. DROP 
> PARTITION does update the stats:
> {code:sql}
> spark-sql> alter table tbl drop partition (part = 1);
> spark-sql> describe table extended tbl;
> col0  int NULL
> part  int NULL
> # Partition Information
> # col_namedata_type   comment
> part  int NULL
> # Detailed Table Information
> ...
> Statistics2 bytes
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34084) ALTER TABLE .. ADD PARTITION does not update table stats

2021-01-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34084:


Assignee: (was: Apache Spark)

> ALTER TABLE .. ADD PARTITION does not update table stats
> 
>
> Key: SPARK-34084
> URL: https://issues.apache.org/jira/browse/SPARK-34084
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.2.0, 3.1.1
>Reporter: Maxim Gekk
>Priority: Major
>
> The example below illustrates the issue:
> {code:sql}
> spark-sql> create table tbl (col0 int, part int) partitioned by (part);
> spark-sql> insert into tbl partition (part = 0) select 0;
> spark-sql> set spark.sql.statistics.size.autoUpdate.enabled=true;
> spark-sql> alter table tbl add partition (part = 1);
> {code}
> There are no stats:
> {code:sql}
> spark-sql> describe table extended tbl;
> col0  int NULL
> part  int NULL
> # Partition Information
> # col_namedata_type   comment
> part  int NULL
> # Detailed Table Information
> Database  default
> Table tbl
> Owner maximgekk
> Created Time  Tue Jan 12 12:00:03 MSK 2021
> Last Access   UNKNOWN
> Created BySpark 3.2.0-SNAPSHOT
> Type  MANAGED
> Provider  hive
> Table Properties  [transient_lastDdlTime=1610442003]
> Location  
> file:/Users/maximgekk/proj/fix-stats-in-add-partition/spark-warehouse/tbl
> Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> InputFormat   org.apache.hadoop.mapred.TextInputFormat
> OutputFormat  org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> Storage Properties[serialization.format=1]
> Partition ProviderCatalog
> {code}
> *As we can see, there are no stats.* In contrast, ALTER TABLE .. DROP 
> PARTITION does update the stats:
> {code:sql}
> spark-sql> alter table tbl drop partition (part = 1);
> spark-sql> describe table extended tbl;
> col0  int NULL
> part  int NULL
> # Partition Information
> # col_namedata_type   comment
> part  int NULL
> # Detailed Table Information
> ...
> Statistics2 bytes
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34085) History server missing failed stage

2021-01-12 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-34085:
---

 Summary: History server missing failed stage
 Key: SPARK-34085
 URL: https://issues.apache.org/jira/browse/SPARK-34085
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Yuming Wang


It is missing the failed stage (261716).

!image-2021-01-12-18-28-34-153.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34085) History server missing failed stage

2021-01-12 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-34085:

Description: 
It is missing the failed stage (261716).

!image-2021-01-12-18-30-45-862.png!

  was:
It is missing the failed stage (261716).

!image-2021-01-12-18-28-34-153.png!


> History server missing failed stage
> ---
>
> Key: SPARK-34085
> URL: https://issues.apache.org/jira/browse/SPARK-34085
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: image-2021-01-12-18-30-45-862.png
>
>
> It is missing the failed stage (261716).
> !image-2021-01-12-18-30-45-862.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34085) History server missing failed stage

2021-01-12 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-34085:

Attachment: image-2021-01-12-18-30-45-862.png

> History server missing failed stage
> ---
>
> Key: SPARK-34085
> URL: https://issues.apache.org/jira/browse/SPARK-34085
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: image-2021-01-12-18-30-45-862.png
>
>
> It is missing the failed stage (261716).
> !image-2021-01-12-18-28-34-153.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34086) RaiseError generates too much code and may fail codegen

2021-01-12 Thread Kent Yao (Jira)
Kent Yao created SPARK-34086:


 Summary: RaiseError generates too much code and may fail codegen
 Key: SPARK-34086
 URL: https://issues.apache.org/jira/browse/SPARK-34086
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.0
Reporter: Kent Yao


https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133928/testReport/org.apache.spark.sql.execution/LogicalPlanTagInSparkPlanSuite/q41/

We can reduce the generated code by more than 8000 bytes by removing the 
unnecessary CONCAT expression.
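
For anyone who wants to see how much code an expression generates, Spark's
debug helpers print the generated Java source per whole-stage-codegen subtree.
A hedged sketch (the query below is illustrative only, not the failing q41
plan from the linked test report):

{code:java}
import org.apache.spark.sql.execution.debug._

// Prints the generated Java source for each WholeStageCodegen subtree, which
// is a quick way to see the extra bytes contributed by CONCAT/RaiseError.
spark.range(10).selectExpr("concat('value: ', cast(id as string))").debugCodegen()
{code}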



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34086) RaiseError generates too much code and may fail codegen in length check for char varchar

2021-01-12 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-34086:
-
Summary: RaiseError generates too much code and may fail codegen in length 
check for char varchar  (was: RaiseError generates too much code and may fail 
codegen)

> RaiseError generates too much code and may fail codegen in length check for 
> char varchar
> -
>
> Key: SPARK-34086
> URL: https://issues.apache.org/jira/browse/SPARK-34086
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133928/testReport/org.apache.spark.sql.execution/LogicalPlanTagInSparkPlanSuite/q41/
> We can reduce the generated code by more than 8000 bytes by removing the 
> unnecessary CONCAT expression.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34086) RaiseError generates too much code and may fail codegen in length check for char varchar

2021-01-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263251#comment-17263251
 ] 

Apache Spark commented on SPARK-34086:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/31150

> RaiseError generates too much code and may fail codegen in length check for 
> char varchar
> -
>
> Key: SPARK-34086
> URL: https://issues.apache.org/jira/browse/SPARK-34086
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133928/testReport/org.apache.spark.sql.execution/LogicalPlanTagInSparkPlanSuite/q41/
> We can reduce the generated code by more than 8000 bytes by removing the 
> unnecessary CONCAT expression.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34086) RaiseError generates too much code and may fail codegen in length check for char varchar

2021-01-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34086:


Assignee: Apache Spark

> RaiseError generates too much code and may fail codegen in length check for 
> char varchar
> -
>
> Key: SPARK-34086
> URL: https://issues.apache.org/jira/browse/SPARK-34086
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Assignee: Apache Spark
>Priority: Major
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133928/testReport/org.apache.spark.sql.execution/LogicalPlanTagInSparkPlanSuite/q41/
> We can reduce the generated code by more than 8000 bytes by removing the 
> unnecessary CONCAT expression.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34086) RaiseError generates too much code and may fail codegen in length check for char varchar

2021-01-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34086:


Assignee: (was: Apache Spark)

> RaiseError generates too much code and may fail codegen in length check for 
> char varchar
> -
>
> Key: SPARK-34086
> URL: https://issues.apache.org/jira/browse/SPARK-34086
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133928/testReport/org.apache.spark.sql.execution/LogicalPlanTagInSparkPlanSuite/q41/
> We can reduce the generated code by more than 8000 bytes by removing the 
> unnecessary CONCAT expression.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34087) a memory leak occurs when we clone the spark session

2021-01-12 Thread Fu Chen (Jira)
Fu Chen created SPARK-34087:
---

 Summary: a memory leak occurs when we clone the spark session
 Key: SPARK-34087
 URL: https://issues.apache.org/jira/browse/SPARK-34087
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.1
Reporter: Fu Chen


In Spark 3.0.1, a memory leak occurs when we keep cloning the Spark session, 
because a new ExecutionListenerBus instance is added to the AsyncEventQueue 
every time we clone a session.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34087) a memory leak occurs when we clone the spark session

2021-01-12 Thread Fu Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fu Chen updated SPARK-34087:

Attachment: (was: 1610451044690.jpg)

> a memory leak occurs when we clone the spark session
> 
>
> Key: SPARK-34087
> URL: https://issues.apache.org/jira/browse/SPARK-34087
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Fu Chen
>Priority: Major
>
> In Spark 3.0.1, a memory leak occurs when we keep cloning the Spark session, 
> because a new ExecutionListenerBus instance is added to the AsyncEventQueue 
> every time we clone a session.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34087) a memory leak occurs when we clone the spark session

2021-01-12 Thread Fu Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fu Chen updated SPARK-34087:

Attachment: 1610451044690.jpg

> a memory leak occurs when we clone the spark session
> 
>
> Key: SPARK-34087
> URL: https://issues.apache.org/jira/browse/SPARK-34087
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Fu Chen
>Priority: Major
>
> In Spark 3.0.1, a memory leak occurs when we keep cloning the Spark session, 
> because a new ExecutionListenerBus instance is added to the AsyncEventQueue 
> every time we clone a session.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34087) a memory leak occurs when we clone the spark session

2021-01-12 Thread Fu Chen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263264#comment-17263264
 ] 

Fu Chen commented on SPARK-34087:
-

*bug replay*

here is code to replay this bug:
{code:java}
test("bug replay") {
  (1 to 1000).foreach(i => { 
spark.cloneSession() 
  })
  val cnt = spark.sparkContext
   .listenerBus
   .listeners
   .asScala
   .collect{ case e: ExecutionListenerBus => e}
   .size
  println(s"total ExecutionListenerBus count ${cnt}.")
  Thread.sleep(Int.MaxValue) 
}
{code}
*output:*

total ExecutionListenerBus count 1001.

*jmap*

*!1610451044690.jpg!*

Each ExecutionListenerBus holds one SparkSession instance, so the JVM can't 
collect these SparkSession objects.
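
A hedged workaround sketch until this is fixed. Like the replay test above, it
touches private[spark]/private[sql] members, so it only compiles from inside
Spark's own packages; treat it as an illustration, not a supported API:

{code:java}
import scala.collection.JavaConverters._
import org.apache.spark.sql.util.ExecutionListenerBus

// Drop all ExecutionListenerBus instances from the shared listener bus
// (including the active session's, so only do this when tearing sessions down)
// so the bus no longer pins the cloned SparkSession objects.
spark.sparkContext.listenerBus.listeners.asScala
  .collect { case bus: ExecutionListenerBus => bus }
  .foreach(spark.sparkContext.removeSparkListener)
{code}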

> a memory leak occurs when we clone the spark session
> 
>
> Key: SPARK-34087
> URL: https://issues.apache.org/jira/browse/SPARK-34087
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Fu Chen
>Priority: Major
>
> In Spark 3.0.1, a memory leak occurs when we keep cloning the Spark session, 
> because a new ExecutionListenerBus instance is added to the AsyncEventQueue 
> every time we clone a session.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34087) a memory leak occurs when we clone the spark session

2021-01-12 Thread Fu Chen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263266#comment-17263266
 ] 

Fu Chen commented on SPARK-34087:
-

*bug replay*

here is code to replay this bug:
{code:java}
test("bug replay") {
  (1 to 1000).foreach(i => { 
spark.cloneSession() 
  })
  val cnt = spark.sparkContext
   .listenerBus
   .listeners
   .asScala
   .collect{ case e: ExecutionListenerBus => e}
   .size
  println(s"total ExecutionListenerBus count ${cnt}.")
  Thread.sleep(Int.MaxValue) 
}
{code}
*output:*

total ExecutionListenerBus count 1001.

*jmap*

  !1610451044690.jpg!

Each ExecutionListenerBus holds one SparkSession instance, so the JVM can't 
collect these SparkSession objects.

> a memory leak occurs when we clone the spark session
> 
>
> Key: SPARK-34087
> URL: https://issues.apache.org/jira/browse/SPARK-34087
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Fu Chen
>Priority: Major
> Attachments: 1610451044690.jpg
>
>
> In Spark 3.0.1, a memory leak occurs when we keep cloning the Spark session, 
> because a new ExecutionListenerBus instance is added to the AsyncEventQueue 
> every time we clone a session.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-34087) a memory leak occurs when we clone the spark session

2021-01-12 Thread Fu Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fu Chen updated SPARK-34087:

Comment: was deleted

(was: *bug replay*

here is code to replay this bug:
{code:java}
test("bug replay") {
  (1 to 1000).foreach(i => { 
spark.cloneSession() 
  })
  val cnt = spark.sparkContext
   .listenerBus
   .listeners
   .asScala
   .collect{ case e: ExecutionListenerBus => e}
   .size
  println(s"total ExecutionListenerBus count ${cnt}.")
  Thread.sleep(Int.MaxValue) 
}
{code}
*output:*

total ExecutionListenerBus count 1001.

*jmap*

*!1610451044690.jpg!*

Each ExecutionListenerBus holds one SparkSession instance, so the JVM can't 
collect these SparkSession objects)

> a memory leak occurs when we clone the spark session
> 
>
> Key: SPARK-34087
> URL: https://issues.apache.org/jira/browse/SPARK-34087
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Fu Chen
>Priority: Major
> Attachments: 1610451044690.jpg
>
>
> In Spark 3.0.1, a memory leak occurs when we keep cloning the Spark session, 
> because a new ExecutionListenerBus instance is added to the AsyncEventQueue 
> every time we clone a session.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34087) a memory leak occurs when we clone the spark session

2021-01-12 Thread Fu Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fu Chen updated SPARK-34087:

Attachment: 1610451044690.jpg

> a memory leak occurs when we clone the spark session
> 
>
> Key: SPARK-34087
> URL: https://issues.apache.org/jira/browse/SPARK-34087
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Fu Chen
>Priority: Major
> Attachments: 1610451044690.jpg
>
>
> In Spark 3.0.1, a memory leak occurs when we keep cloning the Spark session, 
> because a new ExecutionListenerBus instance is added to the AsyncEventQueue 
> every time we clone a session.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-34087) a memory leak occurs when we clone the spark session

2021-01-12 Thread Fu Chen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263266#comment-17263266
 ] 

Fu Chen edited comment on SPARK-34087 at 1/12/21, 12:02 PM:


*bug replay*

here is code to replay this bug:
{code:java}
test("bug replay") {
  (1 to 1000).foreach(i => { 
spark.cloneSession() 
SparkSession.clearActiveSession()
  })
  val cnt = spark.sparkContext
   .listenerBus
   .listeners
   .asScala
   .collect{ case e: ExecutionListenerBus => e}
   .size
  println(s"total ExecutionListenerBus count ${cnt}.")
  Thread.sleep(Int.MaxValue) 
}
{code}
*output:*

total ExecutionListenerBus count 1001.

*jmap*

  !1610451044690.jpg!

Each ExecutionListenerBus holds one SparkSession instance, so the JVM can't 
collect these SparkSession objects.


was (Author: fchen):
*bug replay*

here is code to replay this bug:
{code:java}
test("bug replay") {
  (1 to 1000).foreach(i => { 
spark.cloneSession() 
  })
  val cnt = spark.sparkContext
   .listenerBus
   .listeners
   .asScala
   .collect{ case e: ExecutionListenerBus => e}
   .size
  println(s"total ExecutionListenerBus count ${cnt}.")
  Thread.sleep(Int.MaxValue) 
}
{code}
*output:*

total ExecutionListenerBus count 1001.

*jmap*

  !1610451044690.jpg!

Each ExecutionListenerBus holds one SparkSession instance, so the JVM can't 
collect these SparkSession objects.

> a memory leak occurs when we clone the spark session
> 
>
> Key: SPARK-34087
> URL: https://issues.apache.org/jira/browse/SPARK-34087
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Fu Chen
>Priority: Major
> Attachments: 1610451044690.jpg
>
>
> In Spark 3.0.1, a memory leak occurs when we keep cloning the Spark session, 
> because a new ExecutionListenerBus instance is added to the AsyncEventQueue 
> every time we clone a session.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34088) Rename all decommission configurations to use the same namespace "spark.decommission.*"

2021-01-12 Thread wuyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuyi updated SPARK-34088:
-
Summary: Rename all decommission configurations to use the same namespace 
"spark.decommission.*"  (was: Rename all decommission configurations to fix 
same namespace "spark.decommission.*")

> Rename all decommission configurations to use the same namespace 
> "spark.decommission.*"
> ---
>
> Key: SPARK-34088
> URL: https://issues.apache.org/jira/browse/SPARK-34088
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: wuyi
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34088) Rename all decommission configurations to fix same namespace "spark.decommission.*"

2021-01-12 Thread wuyi (Jira)
wuyi created SPARK-34088:


 Summary: Rename all decommission configurations to fix same 
namespace "spark.decommission.*"
 Key: SPARK-34088
 URL: https://issues.apache.org/jira/browse/SPARK-34088
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.1.0
Reporter: wuyi






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-34087) a memory leak occurs when we clone the spark session

2021-01-12 Thread Fu Chen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263266#comment-17263266
 ] 

Fu Chen edited comment on SPARK-34087 at 1/12/21, 12:03 PM:


*bug replay*

here is code to replay this bug:
{code:java}
// run with spark-3.0.1
test("bug replay") {
  (1 to 1000).foreach(i => { 
spark.cloneSession() 
SparkSession.clearActiveSession()
  })
  val cnt = spark.sparkContext
   .listenerBus
   .listeners
   .asScala
   .collect{ case e: ExecutionListenerBus => e}
   .size
  println(s"total ExecutionListenerBus count ${cnt}.")
  Thread.sleep(Int.MaxValue) 
}
{code}
*output:*

total ExecutionListenerBus count 1001.

*jmap*

  !1610451044690.jpg!

Each ExecutionListenerBus holds one SparkSession instance, so the JVM can't 
collect these SparkSession objects.


was (Author: fchen):
*bug replay*

here is code to replay this bug:
{code:java}
test("bug replay") {
  (1 to 1000).foreach(i => { 
spark.cloneSession() 
SparkSession.clearActiveSession()
  })
  val cnt = spark.sparkContext
   .listenerBus
   .listeners
   .asScala
   .collect{ case e: ExecutionListenerBus => e}
   .size
  println(s"total ExecutionListenerBus count ${cnt}.")
  Thread.sleep(Int.MaxValue) 
}
{code}
*output:*

total ExecutionListenerBus count 1001.

*jmap*

  !1610451044690.jpg!

Each ExecutionListenerBus holds one SparkSession instance, so the JVM can't 
collect these SparkSession objects.

> a memory leak occurs when we clone the spark session
> 
>
> Key: SPARK-34087
> URL: https://issues.apache.org/jira/browse/SPARK-34087
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Fu Chen
>Priority: Major
> Attachments: 1610451044690.jpg
>
>
> In Spark 3.0.1, a memory leak occurs when we keep cloning the Spark session, 
> because a new ExecutionListenerBus instance is added to the AsyncEventQueue 
> every time we clone a session.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34088) Rename all decommission configurations to use the same namespace "spark.decommission.*"

2021-01-12 Thread wuyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuyi updated SPARK-34088:
-
Description: 
Currently, decommission configurations are using different namespaces, e.g., 
 * spark.decommission

 * spark.storage.decommission

 * spark.executor.decommission

which may introduce unnecessary overhead for end-users. It's better to keep 
them under the same namespace.

> Rename all decommission configurations to use the same namespace 
> "spark.decommission.*"
> ---
>
> Key: SPARK-34088
> URL: https://issues.apache.org/jira/browse/SPARK-34088
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: wuyi
>Priority: Major
>
> Currently, decommission configurations are using different namespaces, e.g., 
>  * spark.decommission
>  * spark.storage.decommission
>  * spark.executor.decommission
> which may introduce unnecessary overhead for end-users. It's better to keep 
> them under the same namespace.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34088) Rename all decommission configurations to use the same namespace "spark.decommission.*"

2021-01-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34088:


Assignee: Apache Spark

> Rename all decommission configurations to use the same namespace 
> "spark.decommission.*"
> ---
>
> Key: SPARK-34088
> URL: https://issues.apache.org/jira/browse/SPARK-34088
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: wuyi
>Assignee: Apache Spark
>Priority: Major
>
> Currently, decommission configurations are using different namespaces, e.g., 
>  * spark.decommission
>  * spark.storage.decommission
>  * spark.executor.decommission
> which may introduce unnecessary overhead for end-users. It's better to keep 
> them under the same namespace.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34088) Rename all decommission configurations to use the same namespace "spark.decommission.*"

2021-01-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263289#comment-17263289
 ] 

Apache Spark commented on SPARK-34088:
--

User 'Ngone51' has created a pull request for this issue:
https://github.com/apache/spark/pull/31151

> Rename all decommission configurations to use the same namespace 
> "spark.decommission.*"
> ---
>
> Key: SPARK-34088
> URL: https://issues.apache.org/jira/browse/SPARK-34088
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: wuyi
>Priority: Major
>
> Currently, decommission configurations are using different namespaces, e.g., 
>  * spark.decommission
>  * spark.storage.decommission
>  * spark.executor.decommission
> which may introduce unnecessary overhead for end-users. It's better to keep 
> them under the same namespace.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34088) Rename all decommission configurations to use the same namespace "spark.decommission.*"

2021-01-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34088:


Assignee: (was: Apache Spark)

> Rename all decommission configurations to use the same namespace 
> "spark.decommission.*"
> ---
>
> Key: SPARK-34088
> URL: https://issues.apache.org/jira/browse/SPARK-34088
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: wuyi
>Priority: Major
>
> Currently, decommission configurations are using different namespaces, e.g., 
>  * spark.decommission
>  * spark.storage.decommission
>  * spark.executor.decommission
> which may introduce unnecessary overhead for end-users. It's better to keep 
> them under the same namespace.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34089) MemoryConsumer's memory mode should respect MemoryManager's memory mode

2021-01-12 Thread wuyi (Jira)
wuyi created SPARK-34089:


 Summary: MemoryConsumer's memory mode should respect 
MemoryManager's memory mode
 Key: SPARK-34089
 URL: https://issues.apache.org/jira/browse/SPARK-34089
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.0, 2.4.7, 3.1.0
Reporter: wuyi


Currently, the memory mode is always set to ON_HEAP for a memory consumer when 
it's not explicitly specified.

However, we can actually determine the correct memory mode from 
taskMemoryManager.getTungstenMemoryMode().

 

[https://github.com/apache/spark/blob/3a299aa6480ac22501512cd0310d31a441d7dfdc/core/src/main/java/org/apache/spark/memory/MemoryConsumer.java#L43-L45]
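
A rough sketch of the idea in Scala (the real MemoryConsumer is Java; the
three-argument protected constructor and the TaskMemoryManager accessors used
below are assumptions based on the linked code, and this is not a proposed
patch):

{code:java}
import org.apache.spark.memory.{MemoryConsumer, TaskMemoryManager}

// Instead of defaulting to MemoryMode.ON_HEAP, derive the mode from the task
// memory manager so it matches the MemoryManager's tungsten memory mode.
abstract class ModeAwareConsumer(tmm: TaskMemoryManager)
  extends MemoryConsumer(tmm, tmm.pageSizeBytes(), tmm.getTungstenMemoryMode())
{code}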

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34089) MemoryConsumer's memory mode should respect MemoryManager's memory mode

2021-01-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263316#comment-17263316
 ] 

Apache Spark commented on SPARK-34089:
--

User 'Ngone51' has created a pull request for this issue:
https://github.com/apache/spark/pull/31152

> MemoryConsumer's memory mode should respect MemoryManager's memory mode
> ---
>
> Key: SPARK-34089
> URL: https://issues.apache.org/jira/browse/SPARK-34089
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.7, 3.0.0, 3.1.0
>Reporter: wuyi
>Priority: Major
>
> Currently, the memory mode is always set to ON_HEAP for a memory consumer 
> when it's not explicitly specified.
> However, we can actually determine the correct memory mode from 
> taskMemoryManager.getTungstenMemoryMode().
>  
> [https://github.com/apache/spark/blob/3a299aa6480ac22501512cd0310d31a441d7dfdc/core/src/main/java/org/apache/spark/memory/MemoryConsumer.java#L43-L45]
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34089) MemoryConsumer's memory mode should respect MemoryManager's memory mode

2021-01-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34089:


Assignee: (was: Apache Spark)

> MemoryConsumer's memory mode should respect MemoryManager's memory mode
> ---
>
> Key: SPARK-34089
> URL: https://issues.apache.org/jira/browse/SPARK-34089
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.7, 3.0.0, 3.1.0
>Reporter: wuyi
>Priority: Major
>
> Currently, the memory mode is always set to ON_HEAP for a memory consumer 
> when it's not explicitly specified.
> However, we can actually determine the correct memory mode from 
> taskMemoryManager.getTungstenMemoryMode().
>  
> [https://github.com/apache/spark/blob/3a299aa6480ac22501512cd0310d31a441d7dfdc/core/src/main/java/org/apache/spark/memory/MemoryConsumer.java#L43-L45]
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34089) MemoryConsumer's memory mode should respect MemoryManager's memory mode

2021-01-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34089:


Assignee: Apache Spark

> MemoryConsumer's memory mode should respect MemoryManager's memory mode
> ---
>
> Key: SPARK-34089
> URL: https://issues.apache.org/jira/browse/SPARK-34089
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.7, 3.0.0, 3.1.0
>Reporter: wuyi
>Assignee: Apache Spark
>Priority: Major
>
> Currently, the memory mode is always set to ON_HEAP for a memory consumer 
> when it's not explicitly specified.
> However, we can actually determine the correct memory mode from 
> taskMemoryManager.getTungstenMemoryMode().
>  
> [https://github.com/apache/spark/blob/3a299aa6480ac22501512cd0310d31a441d7dfdc/core/src/main/java/org/apache/spark/memory/MemoryConsumer.java#L43-L45]
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34089) MemoryConsumer's memory mode should respect MemoryManager's memory mode

2021-01-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263318#comment-17263318
 ] 

Apache Spark commented on SPARK-34089:
--

User 'Ngone51' has created a pull request for this issue:
https://github.com/apache/spark/pull/31152

> MemoryConsumer's memory mode should respect MemoryManager's memory mode
> ---
>
> Key: SPARK-34089
> URL: https://issues.apache.org/jira/browse/SPARK-34089
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.7, 3.0.0, 3.1.0
>Reporter: wuyi
>Priority: Major
>
> Currently, the memory mode is always set to ON_HEAP for a memory consumer 
> when it's not explicitly specified.
> However, we can actually determine the correct memory mode from 
> taskMemoryManager.getTungstenMemoryMode().
>  
> [https://github.com/apache/spark/blob/3a299aa6480ac22501512cd0310d31a441d7dfdc/core/src/main/java/org/apache/spark/memory/MemoryConsumer.java#L43-L45]
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34090) HadoopDelegationTokenManager.isServiceEnabled used in KafkaTokenUtil.needTokenUpdate needs to be cached because it slows down Kafka stream processing in case of delegati

2021-01-12 Thread Gabor Somogyi (Jira)
Gabor Somogyi created SPARK-34090:
-

 Summary: HadoopDelegationTokenManager.isServiceEnabled used in 
KafkaTokenUtil.needTokenUpdate needs to be cached because it slows down Kafka 
stream processing in case of delegation token
 Key: SPARK-34090
 URL: https://issues.apache.org/jira/browse/SPARK-34090
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 3.1.1
Reporter: Gabor Somogyi
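
The description is empty, but the title already states the fix: the
enabled-check is recomputed on the streaming hot path, so its result should be
computed once and reused. A generic, hedged sketch of that caching pattern
(names are illustrative, not Spark's actual API):

{code:java}
// Memoize an expensive "is this service enabled?" lookup so that a per-batch
// check such as needTokenUpdate does not repeat the work every time.
final class CachedServiceCheck(compute: () => Boolean) {
  private lazy val cached: Boolean = compute()   // evaluated once, on first use
  def isServiceEnabled: Boolean = cached
}

// usage sketch:
// val kafkaTokenService = new CachedServiceCheck(() => expensiveConfigScan())
// if (kafkaTokenService.isServiceEnabled && tokenExpired) { /* refresh */ }
{code}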






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34091) Shuffle batch fetch can't be disabled once it's enabled previously

2021-01-12 Thread wuyi (Jira)
wuyi created SPARK-34091:


 Summary: Shuffle batch fetch can't be disabled once it's enabled 
previously
 Key: SPARK-34091
 URL: https://issues.apache.org/jira/browse/SPARK-34091
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0, 3.1.0
Reporter: wuyi


{code:java}
  if (SQLConf.get.fetchShuffleBlocksInBatch) {
dependency.rdd.context.setLocalProperty(
  SortShuffleManager.FETCH_SHUFFLE_BLOCKS_IN_BATCH_ENABLED_KEY, "true")
  }
{code}
The current code has a problem: once `fetchShuffleBlocksInBatch` has been set 
to true, we can never disable batch fetch again, even if 
`fetchShuffleBlocksInBatch` is later set to false.
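
One possible shape of a fix, as a hedged sketch that reuses only the
identifiers quoted above (this is not necessarily the actual patch): always
write the local property from the current conf value, so a later "false" takes
effect too.

{code:java}
// Set the flag unconditionally instead of only when it is true, so disabling
// fetchShuffleBlocksInBatch later is propagated to the shuffle reader as well.
dependency.rdd.context.setLocalProperty(
  SortShuffleManager.FETCH_SHUFFLE_BLOCKS_IN_BATCH_ENABLED_KEY,
  SQLConf.get.fetchShuffleBlocksInBatch.toString)
{code}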

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34090) HadoopDelegationTokenManager.isServiceEnabled used in KafkaTokenUtil.needTokenUpdate needs to be cached because it slows down Kafka stream processing in case of delega

2021-01-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263336#comment-17263336
 ] 

Apache Spark commented on SPARK-34090:
--

User 'gaborgsomogyi' has created a pull request for this issue:
https://github.com/apache/spark/pull/31154

> HadoopDelegationTokenManager.isServiceEnabled used in 
> KafkaTokenUtil.needTokenUpdate needs to be cached because it slows down Kafka 
> stream processing in case of delegation token
> -
>
> Key: SPARK-34090
> URL: https://issues.apache.org/jira/browse/SPARK-34090
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.1.1
>Reporter: Gabor Somogyi
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34090) HadoopDelegationTokenManager.isServiceEnabled used in KafkaTokenUtil.needTokenUpdate needs to be cached because it slows down Kafka stream processing in case of delegat

2021-01-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34090:


Assignee: Apache Spark

> HadoopDelegationTokenManager.isServiceEnabled used in 
> KafkaTokenUtil.needTokenUpdate needs to be cached because it slows down Kafka 
> stream processing in case of delegation token
> -
>
> Key: SPARK-34090
> URL: https://issues.apache.org/jira/browse/SPARK-34090
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.1.1
>Reporter: Gabor Somogyi
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34090) HadoopDelegationTokenManager.isServiceEnabled used in KafkaTokenUtil.needTokenUpdate needs to be cached because it slows down Kafka stream processing in case of delegat

2021-01-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34090:


Assignee: (was: Apache Spark)

> HadoopDelegationTokenManager.isServiceEnabled used in 
> KafkaTokenUtil.needTokenUpdate needs to be cached because it slows down Kafka 
> stream processing in case of delegation token
> -
>
> Key: SPARK-34090
> URL: https://issues.apache.org/jira/browse/SPARK-34090
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.1.1
>Reporter: Gabor Somogyi
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34090) HadoopDelegationTokenManager.isServiceEnabled used in KafkaTokenUtil.needTokenUpdate needs to be cached because it slows down Kafka stream processing in case of delega

2021-01-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263338#comment-17263338
 ] 

Apache Spark commented on SPARK-34090:
--

User 'gaborgsomogyi' has created a pull request for this issue:
https://github.com/apache/spark/pull/31154

> HadoopDelegationTokenManager.isServiceEnabled used in 
> KafkaTokenUtil.needTokenUpdate needs to be cached because it slows down Kafka 
> stream processing in case of delegation token
> -
>
> Key: SPARK-34090
> URL: https://issues.apache.org/jira/browse/SPARK-34090
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.1.1
>Reporter: Gabor Somogyi
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34091) Shuffle batch fetch can't be disabled once it's enabled previously

2021-01-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34091:


Assignee: (was: Apache Spark)

> Shuffle batch fetch can't be disabled once it's enabled previously
> --
>
> Key: SPARK-34091
> URL: https://issues.apache.org/jira/browse/SPARK-34091
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: wuyi
>Priority: Major
>
> {code:java}
>   if (SQLConf.get.fetchShuffleBlocksInBatch) {
> dependency.rdd.context.setLocalProperty(
>   SortShuffleManager.FETCH_SHUFFLE_BLOCKS_IN_BATCH_ENABLED_KEY, "true")
>   }
> {code}
> The current code has a problem: once we set `fetchShuffleBlocksInBatch` to 
> true first, we can never disable batch fetch, even if we set 
> `fetchShuffleBlocksInBatch` to false later.
>  
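
Read literally, the snippet only ever writes "true" and never clears the property, so a later false value of the conf has no effect. A hedged sketch of the obvious remedy, mirroring the fragment quoted above (whether the merged PR does exactly this is not claimed here):

{code:scala}
// Always propagate the current conf value (true or false) instead of only
// writing "true" when the conf happens to be enabled.
dependency.rdd.context.setLocalProperty(
  SortShuffleManager.FETCH_SHUFFLE_BLOCKS_IN_BATCH_ENABLED_KEY,
  SQLConf.get.fetchShuffleBlocksInBatch.toString)
{code}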



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34091) Shuffle batch fetch can't be disabled once it's enabled previously

2021-01-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34091:


Assignee: Apache Spark

> Shuffle batch fetch can't be disabled once it's enabled previously
> --
>
> Key: SPARK-34091
> URL: https://issues.apache.org/jira/browse/SPARK-34091
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: wuyi
>Assignee: Apache Spark
>Priority: Major
>
> {code:java}
>   if (SQLConf.get.fetchShuffleBlocksInBatch) {
> dependency.rdd.context.setLocalProperty(
>   SortShuffleManager.FETCH_SHUFFLE_BLOCKS_IN_BATCH_ENABLED_KEY, "true")
>   }
> {code}
> The current code has a problem: once we set `fetchShuffleBlocksInBatch` to 
> true first, we can never disable batch fetch, even if we set 
> `fetchShuffleBlocksInBatch` to false later.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34091) Shuffle batch fetch can't be disabled once it's enabled previously

2021-01-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263343#comment-17263343
 ] 

Apache Spark commented on SPARK-34091:
--

User 'Ngone51' has created a pull request for this issue:
https://github.com/apache/spark/pull/31155

> Shuffle batch fetch can't be disabled once it's enabled previously
> --
>
> Key: SPARK-34091
> URL: https://issues.apache.org/jira/browse/SPARK-34091
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: wuyi
>Priority: Major
>
> {code:java}
>   if (SQLConf.get.fetchShuffleBlocksInBatch) {
> dependency.rdd.context.setLocalProperty(
>   SortShuffleManager.FETCH_SHUFFLE_BLOCKS_IN_BATCH_ENABLED_KEY, "true")
>   }
> {code}
> The current code has a problem: once we set `fetchShuffleBlocksInBatch` to 
> true first, we can never disable batch fetch, even if we set 
> `fetchShuffleBlocksInBatch` to false later.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34053) Please reduce GitHub Actions matrix or improve the build time

2021-01-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263347#comment-17263347
 ] 

Apache Spark commented on SPARK-34053:
--

User 'potiuk' has created a pull request for this issue:
https://github.com/apache/spark/pull/31153

> Please reduce GitHub Actions matrix or improve the build time
> -
>
> Key: SPARK-34053
> URL: https://issues.apache.org/jira/browse/SPARK-34053
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.0.1, 3.1.0, 3.2.0
>Reporter: Vladimir Sitnikov
>Assignee: Kamil Bregula
>Priority: Critical
> Fix For: 3.2.0
>
> Attachments: Screen Shot 2021-01-08 at 2.57.07 PM.png
>
>
> GitHub Actions queue is very high for Apache projects, and it looks like a 
> significant number of executors are occupied by Spark jobs :-(
> Note: all Apache projects share the same limit of shared GitHub Actions 
> runners, and based on the chart below Spark is consuming 20+ runners while 
> the total limit (for all ASF projects) is 180.
> See 
> https://lists.apache.org/thread.html/r5303eec41cc1dfc51c15dbe44770e37369330f9644ef09813f649120%40%3Cbuilds.apache.org%3E
> for the number of GA workflows in progress/queued per project; they clearly show 
> the situation is getting worse by the day: https://pasteboard.co/JIJa5Xg.png



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34053) Please reduce GitHub Actions matrix or improve the build time

2021-01-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263346#comment-17263346
 ] 

Apache Spark commented on SPARK-34053:
--

User 'potiuk' has created a pull request for this issue:
https://github.com/apache/spark/pull/31153

> Please reduce GitHub Actions matrix or improve the build time
> -
>
> Key: SPARK-34053
> URL: https://issues.apache.org/jira/browse/SPARK-34053
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.0.1, 3.1.0, 3.2.0
>Reporter: Vladimir Sitnikov
>Assignee: Kamil Bregula
>Priority: Critical
> Fix For: 3.2.0
>
> Attachments: Screen Shot 2021-01-08 at 2.57.07 PM.png
>
>
> GitHub Actions queue is very high for Apache projects, and it looks like a 
> significant number of executors are occupied by Spark jobs :-(
> Note: all Apache projects share the same limit of shared GitHub Actions 
> runners, and based on the chart below Spark is consuming 20+ runners while 
> the total limit (for all ASF projects) is 180.
> See 
> https://lists.apache.org/thread.html/r5303eec41cc1dfc51c15dbe44770e37369330f9644ef09813f649120%40%3Cbuilds.apache.org%3E
> for the number of GA workflows in progress/queued per project; they clearly show 
> the situation is getting worse by the day: https://pasteboard.co/JIJa5Xg.png



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34084) ALTER TABLE .. ADD PARTITION does not update table stats

2021-01-12 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-34084:
---

Assignee: Maxim Gekk

> ALTER TABLE .. ADD PARTITION does not update table stats
> 
>
> Key: SPARK-34084
> URL: https://issues.apache.org/jira/browse/SPARK-34084
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.2.0, 3.1.1
> Environment: strong text
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>
> The example below portrays the issue:
> {code:sql}
> spark-sql> create table tbl (col0 int, part int) partitioned by (part);
> spark-sql> insert into tbl partition (part = 0) select 0;
> spark-sql> set spark.sql.statistics.size.autoUpdate.enabled=true;
> spark-sql> alter table tbl add partition (part = 1);
> {code}
> There are no stats:
> {code:sql}
> spark-sql> describe table extended tbl;
> col0  int NULL
> part  int NULL
> # Partition Information
> # col_name  data_type   comment
> part  int NULL
> # Detailed Table Information
> Database  default
> Table tbl
> Owner maximgekk
> Created Time  Tue Jan 12 12:00:03 MSK 2021
> Last Access   UNKNOWN
> Created By    Spark 3.2.0-SNAPSHOT
> Type  MANAGED
> Provider  hive
> Table Properties  [transient_lastDdlTime=1610442003]
> Location  
> file:/Users/maximgekk/proj/fix-stats-in-add-partition/spark-warehouse/tbl
> Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> InputFormat   org.apache.hadoop.mapred.TextInputFormat
> OutputFormat  org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> Storage Properties    [serialization.format=1]
> Partition Provider    Catalog
> {code}
> *As we can see, there are no stats.* By contrast, ALTER TABLE .. DROP 
> PARTITION does update the stats:
> {code:sql}
> spark-sql> alter table tbl drop partition (part = 1);
> spark-sql> describe table extended tbl;
> col0  int NULL
> part  int NULL
> # Partition Information
> # col_name  data_type   comment
> part  int NULL
> # Detailed Table Information
> ...
> Statistics    2 bytes
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34084) ALTER TABLE .. ADD PARTITION does not update table stats

2021-01-12 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-34084.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31149
[https://github.com/apache/spark/pull/31149]

> ALTER TABLE .. ADD PARTITION does not update table stats
> 
>
> Key: SPARK-34084
> URL: https://issues.apache.org/jira/browse/SPARK-34084
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.2.0, 3.1.1
> Environment: strong text
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.2.0
>
>
> The example below portrays the issue:
> {code:sql}
> spark-sql> create table tbl (col0 int, part int) partitioned by (part);
> spark-sql> insert into tbl partition (part = 0) select 0;
> spark-sql> set spark.sql.statistics.size.autoUpdate.enabled=true;
> spark-sql> alter table tbl add partition (part = 1);
> {code}
> There are no stats:
> {code:sql}
> spark-sql> describe table extended tbl;
> col0  int NULL
> part  int NULL
> # Partition Information
> # col_name  data_type   comment
> part  int NULL
> # Detailed Table Information
> Database  default
> Table tbl
> Owner maximgekk
> Created Time  Tue Jan 12 12:00:03 MSK 2021
> Last Access   UNKNOWN
> Created By    Spark 3.2.0-SNAPSHOT
> Type  MANAGED
> Provider  hive
> Table Properties  [transient_lastDdlTime=1610442003]
> Location  
> file:/Users/maximgekk/proj/fix-stats-in-add-partition/spark-warehouse/tbl
> Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> InputFormat   org.apache.hadoop.mapred.TextInputFormat
> OutputFormat  org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> Storage Properties    [serialization.format=1]
> Partition Provider    Catalog
> {code}
> *As we can see, there are no stats.* By contrast, ALTER TABLE .. DROP 
> PARTITION does update the stats:
> {code:sql}
> spark-sql> alter table tbl drop partition (part = 1);
> spark-sql> describe table extended tbl;
> col0  int NULL
> part  int NULL
> # Partition Information
> # col_name  data_type   comment
> part  int NULL
> # Detailed Table Information
> ...
> Statistics    2 bytes
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31144) Wrap java.lang.Error with an exception for QueryExecutionListener.onFailure

2021-01-12 Thread Alex Vayda (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263425#comment-17263425
 ] 

Alex Vayda commented on SPARK-31144:


Don't you think that wrapping an {{Error}} into an {{Exception}}, just to be able 
to pass it into a method that, strictly speaking, doesn't expect to be called 
with an {{Error}}, would break the method's semantics?

Wouldn't it be better to introduce another (third) method, say `onFatal(..., 
th: Throwable)`, with an empty default implementation (for API backward 
compatibility), that would be called on errors that are considered fatal 
from the Java/Scala perspective? See 
https://www.scala-lang.org/api/2.12.0/scala/util/control/NonFatal$.html
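
To make the suggested shape concrete, a rough sketch of such a third callback (the trait and method names here are illustrative only, not Spark's actual QueryExecutionListener API):

{code:scala}
// Illustrative only: a listener trait where fatal JVM errors get their own
// callback with a no-op default, so existing implementations keep compiling.
trait QueryListenerSketch {
  def onSuccess(funcName: String, durationNs: Long): Unit
  def onFailure(funcName: String, exception: Exception): Unit

  // New optional callback for java.lang.Error and other fatal throwables.
  // The empty default preserves source compatibility for existing listeners.
  def onFatal(funcName: String, error: Throwable): Unit = {}
}
{code}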

> Wrap java.lang.Error with an exception for QueryExecutionListener.onFailure
> ---
>
> Key: SPARK-31144
> URL: https://issues.apache.org/jira/browse/SPARK-31144
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
> Fix For: 2.4.6, 3.0.0
>
>
> SPARK-28556 changed the method QueryExecutionListener.onFailure to allow 
> Spark sending java.lang.Error to this method. As this change breaks APIs, we 
> cannot fix branch-2.4.
> [~marmbrus] suggested to wrap java.lang.Error with an exception instead to 
> avoid a breaking change. A bonus of this solution is we can also fix the 
> issue (if a query throws java.lang.Error, QueryExecutionListener doesn't get 
> notified) in branch-2.4.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34091) Shuffle batch fetch can't be disabled once it's enabled previously

2021-01-12 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-34091:
---

Assignee: wuyi

> Shuffle batch fetch can't be disabled once it's enabled previously
> --
>
> Key: SPARK-34091
> URL: https://issues.apache.org/jira/browse/SPARK-34091
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
>
> {code:java}
>   if (SQLConf.get.fetchShuffleBlocksInBatch) {
> dependency.rdd.context.setLocalProperty(
>   SortShuffleManager.FETCH_SHUFFLE_BLOCKS_IN_BATCH_ENABLED_KEY, "true")
>   }
> {code}
> The current code has a problem: once we set `fetchShuffleBlocksInBatch` to 
> true first, we can never disable batch fetch, even if we set 
> `fetchShuffleBlocksInBatch` to false later.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34091) Shuffle batch fetch can't be disabled once it's enabled previously

2021-01-12 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-34091.
-
Fix Version/s: 3.1.1
   Resolution: Fixed

Issue resolved by pull request 31155
[https://github.com/apache/spark/pull/31155]

> Shuffle batch fetch can't be disabled once it's enabled previously
> --
>
> Key: SPARK-34091
> URL: https://issues.apache.org/jira/browse/SPARK-34091
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
> Fix For: 3.1.1
>
>
> {code:java}
>   if (SQLConf.get.fetchShuffleBlocksInBatch) {
> dependency.rdd.context.setLocalProperty(
>   SortShuffleManager.FETCH_SHUFFLE_BLOCKS_IN_BATCH_ENABLED_KEY, "true")
>   }
> {code}
> The current code has a problem: once we set `fetchShuffleBlocksInBatch` to 
> true first, we can never disable batch fetch, even if we set 
> `fetchShuffleBlocksInBatch` to false later.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34087) a memory leak occurs when we clone the spark session

2021-01-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34087:


Assignee: Apache Spark

> a memory leak occurs when we clone the spark session
> 
>
> Key: SPARK-34087
> URL: https://issues.apache.org/jira/browse/SPARK-34087
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Fu Chen
>Assignee: Apache Spark
>Priority: Major
> Attachments: 1610451044690.jpg
>
>
> In Spark 3.0.1, a memory leak occurs when we keep cloning the Spark session, 
> because a new ExecutionListenerBus instance is added to the AsyncEventQueue 
> every time we clone a session.
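
A self-contained toy reproduction of the pattern being described, with made-up class names (it only mimics the shape of the leak, not Spark's actual session/listener classes):

{code:scala}
import scala.collection.mutable.ArrayBuffer

// Every "cloned" session registers one more listener on a shared bus and
// nothing ever unregisters it, so the listener list grows without bound.
class SharedEventBus {
  private val listeners = ArrayBuffer[AnyRef]()
  def register(listener: AnyRef): Unit = listeners += listener
  def size: Int = listeners.size
}

class Session(bus: SharedEventBus) {
  bus.register(new Object) // one listener per session instance
  def cloneSession(): Session = new Session(bus)
}

object LeakDemo extends App {
  val bus = new SharedEventBus
  var session = new Session(bus)
  (1 to 1000).foreach(_ => session = session.cloneSession())
  println(s"listeners registered: ${bus.size}") // grows with every clone
}
{code}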



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34087) a memory leak occurs when we clone the spark session

2021-01-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34087:


Assignee: (was: Apache Spark)

> a memory leak occurs when we clone the spark session
> 
>
> Key: SPARK-34087
> URL: https://issues.apache.org/jira/browse/SPARK-34087
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Fu Chen
>Priority: Major
> Attachments: 1610451044690.jpg
>
>
> In Spark 3.0.1, a memory leak occurs when we keep cloning the Spark session, 
> because a new ExecutionListenerBus instance is added to the AsyncEventQueue 
> every time we clone a session.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34087) a memory leak occurs when we clone the spark session

2021-01-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263498#comment-17263498
 ] 

Apache Spark commented on SPARK-34087:
--

User 'cfmcgrady' has created a pull request for this issue:
https://github.com/apache/spark/pull/31156

> a memory leak occurs when we clone the spark session
> 
>
> Key: SPARK-34087
> URL: https://issues.apache.org/jira/browse/SPARK-34087
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Fu Chen
>Priority: Major
> Attachments: 1610451044690.jpg
>
>
> In Spark 3.0.1, a memory leak occurs when we keep cloning the Spark session, 
> because a new ExecutionListenerBus instance is added to the AsyncEventQueue 
> every time we clone a session.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34087) a memory leak occurs when we clone the spark session

2021-01-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263499#comment-17263499
 ] 

Apache Spark commented on SPARK-34087:
--

User 'cfmcgrady' has created a pull request for this issue:
https://github.com/apache/spark/pull/31156

> a memory leak occurs when we clone the spark session
> 
>
> Key: SPARK-34087
> URL: https://issues.apache.org/jira/browse/SPARK-34087
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Fu Chen
>Priority: Major
> Attachments: 1610451044690.jpg
>
>
> In Spark 3.0.1, a memory leak occurs when we keep cloning the Spark session, 
> because a new ExecutionListenerBus instance is added to the AsyncEventQueue 
> every time we clone a session.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34084) ALTER TABLE .. ADD PARTITION does not update table stats

2021-01-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263571#comment-17263571
 ] 

Apache Spark commented on SPARK-34084:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/31157

> ALTER TABLE .. ADD PARTITION does not update table stats
> 
>
> Key: SPARK-34084
> URL: https://issues.apache.org/jira/browse/SPARK-34084
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.2.0, 3.1.1
> Environment: strong text
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.2.0
>
>
> The example below portrays the issue:
> {code:sql}
> spark-sql> create table tbl (col0 int, part int) partitioned by (part);
> spark-sql> insert into tbl partition (part = 0) select 0;
> spark-sql> set spark.sql.statistics.size.autoUpdate.enabled=true;
> spark-sql> alter table tbl add partition (part = 1);
> {code}
> There are no stats:
> {code:sql}
> spark-sql> describe table extended tbl;
> col0  int NULL
> part  int NULL
> # Partition Information
> # col_name  data_type   comment
> part  int NULL
> # Detailed Table Information
> Database  default
> Table tbl
> Owner maximgekk
> Created Time  Tue Jan 12 12:00:03 MSK 2021
> Last Access   UNKNOWN
> Created By    Spark 3.2.0-SNAPSHOT
> Type  MANAGED
> Provider  hive
> Table Properties  [transient_lastDdlTime=1610442003]
> Location  
> file:/Users/maximgekk/proj/fix-stats-in-add-partition/spark-warehouse/tbl
> Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> InputFormat   org.apache.hadoop.mapred.TextInputFormat
> OutputFormat  org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> Storage Properties    [serialization.format=1]
> Partition Provider    Catalog
> {code}
> *As we can see, there are no stats.* By contrast, ALTER TABLE .. DROP 
> PARTITION does update the stats:
> {code:sql}
> spark-sql> alter table tbl drop partition (part = 1);
> spark-sql> describe table extended tbl;
> col0  int NULL
> part  int NULL
> # Partition Information
> # col_name  data_type   comment
> part  int NULL
> # Detailed Table Information
> ...
> Statistics    2 bytes
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34084) ALTER TABLE .. ADD PARTITION does not update table stats

2021-01-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263581#comment-17263581
 ] 

Apache Spark commented on SPARK-34084:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/31158

> ALTER TABLE .. ADD PARTITION does not update table stats
> 
>
> Key: SPARK-34084
> URL: https://issues.apache.org/jira/browse/SPARK-34084
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.2.0, 3.1.1
> Environment: strong text
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.2.0
>
>
> The example below portrays the issue:
> {code:sql}
> spark-sql> create table tbl (col0 int, part int) partitioned by (part);
> spark-sql> insert into tbl partition (part = 0) select 0;
> spark-sql> set spark.sql.statistics.size.autoUpdate.enabled=true;
> spark-sql> alter table tbl add partition (part = 1);
> {code}
> There are no stats:
> {code:sql}
> spark-sql> describe table extended tbl;
> col0  int NULL
> part  int NULL
> # Partition Information
> # col_name  data_type   comment
> part  int NULL
> # Detailed Table Information
> Database  default
> Table tbl
> Owner maximgekk
> Created Time  Tue Jan 12 12:00:03 MSK 2021
> Last Access   UNKNOWN
> Created By    Spark 3.2.0-SNAPSHOT
> Type  MANAGED
> Provider  hive
> Table Properties  [transient_lastDdlTime=1610442003]
> Location  
> file:/Users/maximgekk/proj/fix-stats-in-add-partition/spark-warehouse/tbl
> Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> InputFormat   org.apache.hadoop.mapred.TextInputFormat
> OutputFormat  org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> Storage Properties    [serialization.format=1]
> Partition Provider    Catalog
> {code}
> *As we can see, there are no stats.* By contrast, ALTER TABLE .. DROP 
> PARTITION does update the stats:
> {code:sql}
> spark-sql> alter table tbl drop partition (part = 1);
> spark-sql> describe table extended tbl;
> col0  int NULL
> part  int NULL
> # Partition Information
> # col_name  data_type   comment
> part  int NULL
> # Detailed Table Information
> ...
> Statistics    2 bytes
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34084) ALTER TABLE .. ADD PARTITION does not update table stats

2021-01-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263582#comment-17263582
 ] 

Apache Spark commented on SPARK-34084:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/31158

> ALTER TABLE .. ADD PARTITION does not update table stats
> 
>
> Key: SPARK-34084
> URL: https://issues.apache.org/jira/browse/SPARK-34084
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.2.0, 3.1.1
> Environment: strong text
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.2.0
>
>
> The example below portrays the issue:
> {code:sql}
> spark-sql> create table tbl (col0 int, part int) partitioned by (part);
> spark-sql> insert into tbl partition (part = 0) select 0;
> spark-sql> set spark.sql.statistics.size.autoUpdate.enabled=true;
> spark-sql> alter table tbl add partition (part = 1);
> {code}
> There are no stats:
> {code:sql}
> spark-sql> describe table extended tbl;
> col0  int NULL
> part  int NULL
> # Partition Information
> # col_name  data_type   comment
> part  int NULL
> # Detailed Table Information
> Database  default
> Table tbl
> Owner maximgekk
> Created Time  Tue Jan 12 12:00:03 MSK 2021
> Last Access   UNKNOWN
> Created By    Spark 3.2.0-SNAPSHOT
> Type  MANAGED
> Provider  hive
> Table Properties  [transient_lastDdlTime=1610442003]
> Location  
> file:/Users/maximgekk/proj/fix-stats-in-add-partition/spark-warehouse/tbl
> Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> InputFormat   org.apache.hadoop.mapred.TextInputFormat
> OutputFormat  org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> Storage Properties    [serialization.format=1]
> Partition Provider    Catalog
> {code}
> *As we can see, there are no stats.* By contrast, ALTER TABLE .. DROP 
> PARTITION does update the stats:
> {code:sql}
> spark-sql> alter table tbl drop partition (part = 1);
> spark-sql> describe table extended tbl;
> col0  int NULL
> part  int NULL
> # Partition Information
> # col_name  data_type   comment
> part  int NULL
> # Detailed Table Information
> ...
> Statistics    2 bytes
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34069) Kill barrier tasks should respect SPARK_JOB_INTERRUPT_ON_CANCEL

2021-01-12 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-34069.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 31127
[https://github.com/apache/spark/pull/31127]

> Kill barrier tasks should respect SPARK_JOB_INTERRUPT_ON_CANCEL
> ---
>
> Key: SPARK-34069
> URL: https://issues.apache.org/jira/browse/SPARK-34069
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Major
> Fix For: 3.1.0
>
>
> We should interrupt the task thread if the user sets the local property 
> `SPARK_JOB_INTERRUPT_ON_CANCEL` to true.
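
For context, this is the flag a user normally toggles through the job-group API; a small example of opting in from the user side (the fix in this ticket is about barrier tasks honouring the same setting):

{code:scala}
import org.apache.spark.sql.SparkSession

object InterruptOnCancelDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("interrupt-on-cancel-demo")
      .getOrCreate()
    val sc = spark.sparkContext

    // interruptOnCancel = true sets the job-group local property so that
    // cancelling the group interrupts the running task threads.
    sc.setJobGroup("demo-group", "long running demo job", interruptOnCancel = true)

    // ... launch work here; from another thread one could later call:
    // sc.cancelJobGroup("demo-group")

    spark.stop()
  }
}
{code}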



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34069) Kill barrier tasks should respect SPARK_JOB_INTERRUPT_ON_CANCEL

2021-01-12 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-34069:
---

Assignee: ulysses you

> Kill barrier tasks should respect SPARK_JOB_INTERRUPT_ON_CANCEL
> ---
>
> Key: SPARK-34069
> URL: https://issues.apache.org/jira/browse/SPARK-34069
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Major
>
> We should interrupt the task thread if the user sets the local property 
> `SPARK_JOB_INTERRUPT_ON_CANCEL` to true.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32691) Update commons-crypto to v1.1.0

2021-01-12 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32691:
--
Fix Version/s: 3.0.2

> Update commons-crypto to v1.1.0
> ---
>
> Key: SPARK-32691
> URL: https://issues.apache.org/jira/browse/SPARK-32691
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 2.4.7, 3.0.0, 3.0.1, 3.1.0
> Environment: ARM64
>Reporter: huangtianhua
>Assignee: huangtianhua
>Priority: Major
> Fix For: 3.0.2, 3.1.0
>
> Attachments: Screen Shot 2020-09-28 at 8.49.04 AM.png, failure.log, 
> success.log
>
>
> Tests of org.apache.spark.DistributedSuite are failing on the arm64 Jenkins: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-arm/ 
> - caching in memory and disk, replicated (encryption = on) (with 
> replication as stream) *** FAILED ***
>   3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191)
> - caching in memory and disk, serialized, replicated (encryption = on) 
> (with replication as stream) *** FAILED ***
>   3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191)
> - caching in memory, serialized, replicated (encryption = on) (with 
> replication as stream) *** FAILED ***
>   3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191)
> ...
> 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32338) Add overload for slice that accepts Columns or Int

2021-01-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263725#comment-17263725
 ] 

Apache Spark commented on SPARK-32338:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/31159

> Add overload for slice that accepts Columns or Int
> --
>
> Key: SPARK-32338
> URL: https://issues.apache.org/jira/browse/SPARK-32338
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Nikolas Vanderhoof
>Assignee: Nikolas Vanderhoof
>Priority: Trivial
> Fix For: 3.1.0
>
>
> Add an overload for org.apache.spark.sql.functions.slice with the following 
> signature:
> {code:scala}
> def slice(x: Column, start: Any, length: Any): Column = ???
> {code}
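
A hedged usage sketch of what the overload enables: slice bounds taken from other columns rather than Int literals (the Column-based form shown here is an assumption about the final API; the Any/Any signature above was the proposal):

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object SliceOverloadDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("slice-demo").getOrCreate()
    import spark.implicits._

    val df = Seq((Seq(1, 2, 3, 4, 5), 2, 3)).toDF("xs", "start", "len")

    df.select(
      slice($"xs", 2, 3).as("int_args"),            // existing Int-based overload
      slice($"xs", $"start", $"len").as("col_args") // Column-based overload proposed here
    ).show()

    spark.stop()
  }
}
{code}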



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32338) Add overload for slice that accepts Columns or Int

2021-01-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263724#comment-17263724
 ] 

Apache Spark commented on SPARK-32338:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/31159

> Add overload for slice that accepts Columns or Int
> --
>
> Key: SPARK-32338
> URL: https://issues.apache.org/jira/browse/SPARK-32338
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Nikolas Vanderhoof
>Assignee: Nikolas Vanderhoof
>Priority: Trivial
> Fix For: 3.1.0
>
>
> Add an overload for org.apache.spark.sql.functions.slice with the following 
> signature:
> {code:scala}
> def slice(x: Column, start: Any, length: Any): Column = ???
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34080) Add UnivariateFeatureSelector to deprecate existing selectors

2021-01-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263754#comment-17263754
 ] 

Apache Spark commented on SPARK-34080:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/31160

> Add UnivariateFeatureSelector to deprecate existing selectors
> -
>
> Key: SPARK-34080
> URL: https://issues.apache.org/jira/browse/SPARK-34080
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 3.2.0, 3.1.1
>Reporter: Xiangrui Meng
>Priority: Major
>
> In SPARK-26111, we introduced a few univariate feature selectors, which share 
> a common set of params. And they are named after the underlying test, which 
> requires users to understand the test to find the matched scenarios. It would 
> be nice if we introduce a single class called UnivariateFeatureSelector that 
> accepts a selection criterion and a score method (string names). Then we can 
> deprecate all other univariate selectors.
> For the params, instead of asking users to provide the score function to use, 
> it is more friendly to ask users to specify the feature and label types 
> (continuous or categorical), and we set a default score function for each 
> combo. We can also detect the types from feature metadata if given. Advanced 
> users can override it (if there are multiple score functions compatible with 
> the feature type and label type combo). Example (param names 
> are not finalized):
> {code}
> selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], 
> labelCol=["target"], featureType="categorical", labelType="continuous", 
> select="bestK", k=100)
> {code}
> cc: [~huaxingao] [~ruifengz] [~weichenxu123]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34080) Add UnivariateFeatureSelector to deprecate existing selectors

2021-01-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34080:


Assignee: Apache Spark

> Add UnivariateFeatureSelector to deprecate existing selectors
> -
>
> Key: SPARK-34080
> URL: https://issues.apache.org/jira/browse/SPARK-34080
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 3.2.0, 3.1.1
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>Priority: Major
>
> In SPARK-26111, we introduced a few univariate feature selectors, which share 
> a common set of params. And they are named after the underlying test, which 
> requires users to understand the test to find the matched scenarios. It would 
> be nice if we introduce a single class called UnivariateFeatureSelector that 
> accepts a selection criterion and a score method (string names). Then we can 
> deprecate all other univariate selectors.
> For the params, instead of asking users to provide the score function to use, 
> it is more friendly to ask users to specify the feature and label types 
> (continuous or categorical), and we set a default score function for each 
> combo. We can also detect the types from feature metadata if given. Advanced 
> users can override it (if there are multiple score functions compatible with 
> the feature type and label type combo). Example (param names 
> are not finalized):
> {code}
> selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], 
> labelCol=["target"], featureType="categorical", labelType="continuous", 
> select="bestK", k=100)
> {code}
> cc: [~huaxingao] [~ruifengz] [~weichenxu123]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34080) Add UnivariateFeatureSelector to deprecate existing selectors

2021-01-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263752#comment-17263752
 ] 

Apache Spark commented on SPARK-34080:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/31160

> Add UnivariateFeatureSelector to deprecate existing selectors
> -
>
> Key: SPARK-34080
> URL: https://issues.apache.org/jira/browse/SPARK-34080
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 3.2.0, 3.1.1
>Reporter: Xiangrui Meng
>Priority: Major
>
> In SPARK-26111, we introduced a few univariate feature selectors, which share 
> a common set of params. And they are named after the underlying test, which 
> requires users to understand the test to find the matched scenarios. It would 
> be nice if we introduce a single class called UnivariateFeatureSelector that 
> accepts a selection criterion and a score method (string names). Then we can 
> deprecate all other univariate selectors.
> For the params, instead of asking users to provide the score function to use, 
> it is more friendly to ask users to specify the feature and label types 
> (continuous or categorical), and we set a default score function for each 
> combo. We can also detect the types from feature metadata if given. Advanced 
> users can override it (if there are multiple score functions compatible with 
> the feature type and label type combo). Example (param names 
> are not finalized):
> {code}
> selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], 
> labelCol=["target"], featureType="categorical", labelType="continuous", 
> select="bestK", k=100)
> {code}
> cc: [~huaxingao] [~ruifengz] [~weichenxu123]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34080) Add UnivariateFeatureSelector to deprecate existing selectors

2021-01-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34080:


Assignee: (was: Apache Spark)

> Add UnivariateFeatureSelector to deprecate existing selectors
> -
>
> Key: SPARK-34080
> URL: https://issues.apache.org/jira/browse/SPARK-34080
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 3.2.0, 3.1.1
>Reporter: Xiangrui Meng
>Priority: Major
>
> In SPARK-26111, we introduced a few univariate feature selectors, which share 
> a common set of params. And they are named after the underlying test, which 
> requires users to understand the test to find the matched scenarios. It would 
> be nice if we introduce a single class called UnivariateFeatureSelector that 
> accepts a selection criterion and a score method (string names). Then we can 
> deprecate all other univariate selectors.
> For the params, instead of asking users to provide the score function to use, 
> it is more friendly to ask users to specify the feature and label types 
> (continuous or categorical), and we set a default score function for each 
> combo. We can also detect the types from feature metadata if given. Advanced 
> users can override it (if there are multiple score functions compatible with 
> the feature type and label type combo). Example (param names 
> are not finalized):
> {code}
> selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], 
> labelCol=["target"], featureType="categorical", labelType="continuous", 
> select="bestK", k=100)
> {code}
> cc: [~huaxingao] [~ruifengz] [~weichenxu123]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34051) Support 32-bit unicode escape in string literals

2021-01-12 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-34051:
---
Priority: Minor  (was: Major)

> Support 32-bit unicode escape in string literals
> 
>
> Key: SPARK-34051
> URL: https://issues.apache.org/jira/browse/SPARK-34051
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> Currently, Spark supports 16-bit unicode escapes like "\u0041" in string 
> literals.
> I think it would be nice if 32-bit unicode escapes were also supported, as 
> PostgreSQL and modern programming languages do (e.g., C++11, Rust).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34090) HadoopDelegationTokenManager.isServiceEnabled used in KafkaTokenUtil.needTokenUpdate needs to be cached because it slows down Kafka stream processing in case of delegati

2021-01-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-34090:
-
Fix Version/s: (was: 3.1.0)
   3.1.1

> HadoopDelegationTokenManager.isServiceEnabled used in 
> KafkaTokenUtil.needTokenUpdate needs to be cached because it slows down Kafka 
> stream processing in case of delegation token
> -
>
> Key: SPARK-34090
> URL: https://issues.apache.org/jira/browse/SPARK-34090
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.1.1
>Reporter: Gabor Somogyi
>Assignee: Gabor Somogyi
>Priority: Major
> Fix For: 3.1.1
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34090) HadoopDelegationTokenManager.isServiceEnabled used in KafkaTokenUtil.needTokenUpdate needs to be cached because it slows down Kafka stream processing in case of delegat

2021-01-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-34090:


Assignee: Gabor Somogyi

> HadoopDelegationTokenManager.isServiceEnabled used in 
> KafkaTokenUtil.needTokenUpdate needs to be cached because it slows down Kafka 
> stream processing in case of delegation token
> -
>
> Key: SPARK-34090
> URL: https://issues.apache.org/jira/browse/SPARK-34090
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.1.1
>Reporter: Gabor Somogyi
>Assignee: Gabor Somogyi
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34090) HadoopDelegationTokenManager.isServiceEnabled used in KafkaTokenUtil.needTokenUpdate needs to be cached because it slows down Kafka stream processing in case of delegat

2021-01-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-34090.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 31154
[https://github.com/apache/spark/pull/31154]

> HadoopDelegationTokenManager.isServiceEnabled used in 
> KafkaTokenUtil.needTokenUpdate needs to be cached because it slows down Kafka 
> stream processing in case of delegation token
> -
>
> Key: SPARK-34090
> URL: https://issues.apache.org/jira/browse/SPARK-34090
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.1.1
>Reporter: Gabor Somogyi
>Assignee: Gabor Somogyi
>Priority: Major
> Fix For: 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25075) Build and test Spark against Scala 2.13

2021-01-12 Thread Stephen Kestle (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263837#comment-17263837
 ] 

Stephen Kestle commented on SPARK-25075:


[~dongjoon] can you add context about points 1 and 2 please (why can't/aren't 
they published)? 

I'm guessing that the current focus is on 3.1.0 RCs (which is not scala 2.13).

Does a 3.2.0_2.13 snapshot require 3.1.0 to be released first? 
When might it start? (And would this support land in an early milestone?)

A month ago, I told myself that this was likely to be 6-12 months away, but on 
recent inspection, perhaps I should expect _something_ a bit sooner (it would 
help me start to integrate my code bases).

> Build and test Spark against Scala 2.13
> ---
>
> Key: SPARK-25075
> URL: https://issues.apache.org/jira/browse/SPARK-25075
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, MLlib, Project Infra, Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Guillaume Massé
>Priority: Major
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.13 milestone.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34033) SparkR Daemon Initialization

2021-01-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263841#comment-17263841
 ] 

Apache Spark commented on SPARK-34033:
--

User 'WamBamBoozle' has created a pull request for this issue:
https://github.com/apache/spark/pull/31162

> SparkR Daemon Initialization
> 
>
> Key: SPARK-34033
> URL: https://issues.apache.org/jira/browse/SPARK-34033
> Project: Spark
>  Issue Type: Improvement
>  Components: R, SparkR
>Affects Versions: 3.2.0
> Environment: tested on centos 7 & spark 2.3.1 and on my mac & spark 
> at master
>Reporter: Tom Howland
>Priority: Major
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> Provide a way for users to initialize the sparkR daemon before it forks.
> I'm a contractor to Target, where we have several projects doing ML with 
> SparkR. The changes proposed here result in weeks of compute time saved with 
> every run:
> (40,000 partitions) * (5 seconds to load our R libraries) * (2 calls to gapply 
> in our app) / 60 / 60 = 111 hours.
> (from 
> [docs/sparkr.md|https://github.com/WamBamBoozle/spark/blob/daemon_init/docs/sparkr.md#daemon-initialization])
> h3. Daemon Initialization
> If your worker function has a lengthy initialization, and your
>  application has lots of partitions, you may find you are spending weeks
>  of compute time repeatedly doing something that should have taken a few
>  seconds during daemon initialization.
> Every Spark executor spawns a process running an R daemon. The daemon
>  "forks a copy" of itself whenever Spark finds work for it to do. It may
>  be applying a predefined method such as "max", or it may be applying
>  your worker function. SparkR::gapply arranges things so that your worker
>  function will be called with each group. A group is the pair
>  Key-Seq[Row]. In the absence of partitioning, the daemon will fork for
>  every group found. With partitioning, the daemon will fork for every
>  partition found. A partition may have several groups in it.
> All the initializations and library loading your worker function manages
>  is thrown away when the fork concludes. Every fork has to be
>  initialized.
> The configuration spark.r.daemonInit provides a way to avoid reloading
>  packages every time the daemon forks by having the daemon pre-load
>  packages. You do this by providing R code to initialize the daemon for
>  your application.
> h4. Examples
> Suppose we want library(wow) to be pre-loaded for our workers.
> {{sparkR.session(spark.r.daemonInit = 'library(wow)')}}
> of course, that would only work if we knew that library(wow) was on our
>  path and available on the executor. If we have to ship the library, we
>  can use YARN
> sparkR.session(
>    master = 'yarn',
>    spark.r.daemonInit = '.libPaths(c("wowTarget", .libPaths())); 
> library(wow)',
>    spark.submit.deployMode = 'client',
>    spark.yarn.dist.archives = 'wow.zip#wowTarget')
> YARN creates a directory for the new executor, unzips 'wow.zip' in some
>  other directory, and then provides a symlink to it called
>  ./wowTarget. When the executor starts the daemon, the daemon loads
>  library(wow) from the newly created wowTarget.
> Warning: if your initialization takes longer than 10 seconds, consider
>  increasing the configuration 
> [spark.r.daemonTimeout](configuration.md#sparkr).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34085) History server missing failed stage

2021-01-12 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-34085.
-
Resolution: Invalid

> History server missing failed stage
> ---
>
> Key: SPARK-34085
> URL: https://issues.apache.org/jira/browse/SPARK-34085
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: image-2021-01-12-18-30-45-862.png
>
>
> It is missing the failed stage (261716).
> !image-2021-01-12-18-30-45-862.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34092) Stage level RestFul API support filter by task status

2021-01-12 Thread angerszhu (Jira)
angerszhu created SPARK-34092:
-

 Summary: Stage level RestFul API support filter by task status
 Key: SPARK-34092
 URL: https://issues.apache.org/jira/browse/SPARK-34092
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.2.0
Reporter: angerszhu


Support filtering tasks by task status when details is true.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34093) param maxDepth should check upper bound

2021-01-12 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-34093:


 Summary: param maxDepth should check upper bound
 Key: SPARK-34093
 URL: https://issues.apache.org/jira/browse/SPARK-34093
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 3.2.0
Reporter: zhengruifeng
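
A hedged sketch of the kind of check being asked for, using the ML param validators; the concrete upper bound of 30 is an assumption for illustration (Spark's tree implementation has a depth limit, but the exact value the eventual fix enforces is not stated here):

{code:scala}
import org.apache.spark.ml.param.{IntParam, ParamValidators, Params}

// Validate both bounds when the param is defined, instead of only the lower one.
trait HasBoundedMaxDepth extends Params {
  // Assumed bound for illustration: depths above 30 are rejected up front
  // rather than failing later inside the tree learner.
  final val maxDepth: IntParam = new IntParam(this, "maxDepth",
    "Maximum depth of the tree (>= 0, <= 30)", ParamValidators.inRange(0, 30))
}
{code}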






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34093) param maxDepth should check upper bound

2021-01-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34093:


Assignee: (was: Apache Spark)

> param maxDepth should check upper bound
> ---
>
> Key: SPARK-34093
> URL: https://issues.apache.org/jira/browse/SPARK-34093
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.2.0
>Reporter: zhengruifeng
>Priority: Trivial
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34093) param maxDepth should check upper bound

2021-01-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263872#comment-17263872
 ] 

Apache Spark commented on SPARK-34093:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/31163

> param maxDepth should check upper bound
> ---
>
> Key: SPARK-34093
> URL: https://issues.apache.org/jira/browse/SPARK-34093
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.2.0
>Reporter: zhengruifeng
>Priority: Trivial
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34093) param maxDepth should check upper bound

2021-01-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263871#comment-17263871
 ] 

Apache Spark commented on SPARK-34093:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/31163

> param maxDepth should check upper bound
> ---
>
> Key: SPARK-34093
> URL: https://issues.apache.org/jira/browse/SPARK-34093
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.2.0
>Reporter: zhengruifeng
>Priority: Trivial
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34093) param maxDepth should check upper bound

2021-01-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34093:


Assignee: Apache Spark

> param maxDepth should check upper bound
> ---
>
> Key: SPARK-34093
> URL: https://issues.apache.org/jira/browse/SPARK-34093
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.2.0
>Reporter: zhengruifeng
>Assignee: Apache Spark
>Priority: Trivial
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34094) Extends StringTranslate to support unicode characters whose code point >= 0x10000

2021-01-12 Thread Kousuke Saruta (Jira)
Kousuke Saruta created SPARK-34094:
--

 Summary: Extends StringTranslate to support unicode characters 
whose code point >= 0x10000
 Key: SPARK-34094
 URL: https://issues.apache.org/jira/browse/SPARK-34094
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta


Currently, StringTranslate works with only unicode characters whose code point 
< 0x10000, so let's extend it to support code points >= 0x10000.
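
For a concrete feel of the inputs in question, a small self-contained setup (my own example, not from the ticket): U+1D538 lies outside the BMP and is stored as the surrogate pair \uD835\uDD38 in a JVM String, which is exactly the kind of character a code-point-aware translate has to handle.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object TranslateSupplementaryDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("translate-demo").getOrCreate()
    import spark.implicits._

    // The middle character is U+1D538, written here as its surrogate pair.
    val df = Seq("a\uD835\uDD38b").toDF("s")
    df.select(translate($"s", "\uD835\uDD38", "X").as("translated")).show()

    spark.stop()
  }
}
{code}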



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34094) Extends StringTranslate to support unicode characters whose code point >= U+10000

2021-01-12 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-34094:
---
Summary: Extends StringTranslate to support unicode characters whose code 
point >= U+10000  (was: Extends StringTranslate to support unicode characters 
whose code point >= 0x10000)

> Extends StringTranslate to support unicode characters whose code point >= 
> U+10000
> -
>
> Key: SPARK-34094
> URL: https://issues.apache.org/jira/browse/SPARK-34094
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> Currently, StringTranslate works with only unicode characters whose code 
> point < 0x10000, so let's extend it to support code points >= 0x10000.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


