[jira] [Commented] (SPARK-32380) sparksql cannot access hive table while data in hbase
[ https://issues.apache.org/jira/browse/SPARK-32380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263152#comment-17263152 ] Apache Spark commented on SPARK-32380: -- User 'yangBottle' has created a pull request for this issue: https://github.com/apache/spark/pull/31147 > sparksql cannot access hive table while data in hbase > - > > Key: SPARK-32380 > URL: https://issues.apache.org/jira/browse/SPARK-32380 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: ||component||version|| > |hadoop|2.8.5| > |hive|2.3.7| > |spark|3.0.0| > |hbase|1.4.9| >Reporter: deyzhong >Priority: Major > Original Estimate: 72h > Remaining Estimate: 72h > > * step1: create hbase table > {code:java} > hbase(main):001:0>create 'hbase_test1', 'cf1' > hbase(main):001:0> put 'hbase_test', 'r1', 'cf1:c1', '123' > {code} > * step2: create hive table related to hbase table > > {code:java} > hive> > CREATE EXTERNAL TABLE `hivetest.hbase_test`( > `key` string COMMENT '', > `value` string COMMENT '') > ROW FORMAT SERDE > 'org.apache.hadoop.hive.hbase.HBaseSerDe' > STORED BY > 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' > WITH SERDEPROPERTIES ( > 'hbase.columns.mapping'=':key,cf1:v1', > 'serialization.format'='1') > TBLPROPERTIES ( > 'hbase.table.name'='hbase_test') > {code} > * step3: sparksql query hive table while data in hbase > {code:java} > spark-sql --master yarn -e "select * from hivetest.hbase_test" > {code} > > The error log as follow: > java.io.IOException: Cannot create a record reader because of a previous > error. Please look at the previous logs lines from the task's full log for > more details. > at > org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:270) > at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:131) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:272) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:272) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:272) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:272) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:272) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2158) > at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1004) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:388) > at org.apache.spark.rdd.RDD.collect(RDD.scala:1003) > at > 
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:385) > at > org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:412) > at > org.apache.spark.sql.execution.HiveResult$.hiveResultString(HiveResult.scala:58) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.$anonfun$run$1(SparkSQLDriver.scala:65) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:65) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:377) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:496) > at scala.collection.Iterator.foreach(Iterator.scala:941) > at scala.collection.It
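The failure above is raised inside TableInputFormatBase.getSplits before any Spark logic runs, which is consistent with the hive-hbase-handler and HBase client classes (or hbase-site.xml) not being visible to the Spark session; the actual fix is in the linked pull request. Purely as a hedged workaround sketch (the root-cause diagnosis is an assumption and all jar paths are placeholders), the handler jars can be supplied explicitly:

{code:scala}
// Hedged workaround sketch for the repro above, NOT the fix from PR 31147.
// Assumption: split calculation fails because the hive-hbase-handler and
// HBase client jars are missing from the Spark classpath. Paths are placeholders.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-hbase-read")
  .enableHiveSupport()
  // Make the handler and HBase client classes visible to driver and executors.
  .config("spark.jars",
    "/path/to/hive-hbase-handler.jar,/path/to/hbase-client.jar,/path/to/hbase-common.jar,/path/to/hbase-server.jar")
  .getOrCreate()

// Same query as step 3 of the repro.
spark.sql("select * from hivetest.hbase_test").show()
{code}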
[jira] [Updated] (SPARK-33867) java.time.Instant and java.time.LocalDate not handled in org.apache.spark.sql.jdbc.JdbcDialect#compileValue
[ https://issues.apache.org/jira/browse/SPARK-33867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cristi updated SPARK-33867: --- Description: When using the new java time API (spark.sql.datetime.java8API.enabled=true) LocalDate and Instant aren't handled in org.apache.spark.sql.jdbc.JdbcDialect#compileValue so exceptions are thrown when they are used in filters since a filter condition would be translated to something like this: "valid_from" > 2020-12-21T11:40:24.413681Z. To reproduce you can write a simple filter like: dataset.filter(current_timestamp().gt(col(VALID_FROM))) The error and stacktrace: Caused by: org.postgresql.util.PSQLException: ERROR: syntax error at or near "T11"Caused by: org.postgresql.util.PSQLException: ERROR: syntax error at or near "T11" Position: 285 at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2103) at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1836) at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:257) at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:512) at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:388) at org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:273) at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:304) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) at org.apache.spark.scheduler.Task.run(Task.scala:127) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:834) was: When using the new java time API (spark.sql.datetime.java8API.enabled=true) LocalDate and Instant aren't handled in org.apache.spark.sql.jdbc.JdbcDialect#compileValue so exceptions are thrown when they are used in filters since a filter condition would be translated to something like this: "validity_end" > 2020-12-21T11:40:24.413681Z The error and stacktrace: Caused by: org.postgresql.util.PSQLException: ERROR: syntax error at or near "T11"Caused by: org.postgresql.util.PSQLException: ERROR: syntax error at or near "T11" Position: 285 at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2103) at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1836) at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:257) at 
org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:512) at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:388) at org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:273) at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:304) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) at org.apache.spark.scheduler.Task.run(Task.scala:127) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at org.apache.spark.executor.Executor$
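To make the report concrete, the sketch below shows the kind of handling the description asks for: rendering java.time.Instant and java.time.LocalDate the same way the dialect already renders java.sql.Timestamp and java.sql.Date. It is a standalone illustration, not the actual patch to JdbcDialect#compileValue.

{code:scala}
// Standalone illustration only (not the Spark patch): compile filter literals
// to SQL text, including the java.time cases the report says are missing.
import java.sql.{Date, Timestamp}
import java.time.{Instant, LocalDate}

def compileLiteral(value: Any): Any = value match {
  case stringValue: String       => s"'${stringValue.replace("'", "''")}'"
  case timestampValue: Timestamp => s"'$timestampValue'"
  case dateValue: Date           => s"'$dateValue'"
  // Missing cases from the report: convert to the java.sql types first, so the
  // literal becomes e.g. '2020-12-21 11:40:24.413681' instead of the ISO-8601
  // form 2020-12-21T11:40:24.413681Z that the database cannot parse.
  case instantValue: Instant     => s"'${Timestamp.from(instantValue)}'"
  case localDateValue: LocalDate => s"'${Date.valueOf(localDateValue)}'"
  case other                     => other
}

// Example: compileLiteral(Instant.parse("2020-12-21T11:40:24.413681Z"))
{code}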
[jira] [Commented] (SPARK-34079) Improvement CTE table scan
[ https://issues.apache.org/jira/browse/SPARK-34079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263168#comment-17263168 ] Peter Toth commented on SPARK-34079: Thanks for pinging me [~yumwang]. I'm happy to work on this if you haven't started it yet. > Improvement CTE table scan > -- > > Key: SPARK-34079 > URL: https://issues.apache.org/jira/browse/SPARK-34079 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Priority: Major > > Prepare table: > {code:sql} > CREATE TABLE store_sales ( ss_sold_date_sk INT, ss_sold_time_sk INT, > ss_item_sk INT, ss_customer_sk INT, ss_cdemo_sk INT, ss_hdemo_sk INT, > ss_addr_sk INT, ss_store_sk INT, ss_promo_sk INT, ss_ticket_number INT, > ss_quantity INT, ss_wholesale_cost DECIMAL(7,2), ss_list_price > DECIMAL(7,2), ss_sales_price DECIMAL(7,2), ss_ext_discount_amt > DECIMAL(7,2), ss_ext_sales_price DECIMAL(7,2), ss_ext_wholesale_cost > DECIMAL(7,2), ss_ext_list_price DECIMAL(7,2), ss_ext_tax DECIMAL(7,2), > ss_coupon_amt DECIMAL(7,2), ss_net_paid DECIMAL(7,2), ss_net_paid_inc_tax > DECIMAL(7,2),ss_net_profit DECIMAL(7,2)); > CREATE TABLE reason ( r_reason_sk INT, r_reason_id varchar(255), > r_reason_desc varchar(255)); > {code} > SQL: > {code:sql} > WITH bucket_result AS ( > SELECT > CASE WHEN (count (CASE WHEN ss_quantity BETWEEN 1 AND 20 THEN ss_quantity > END)) > 62316685 > THEN (avg(CASE WHEN ss_quantity BETWEEN 1 AND 20 THEN > ss_ext_discount_amt END)) > ELSE (avg(CASE WHEN ss_quantity BETWEEN 1 AND 20 THEN ss_net_paid END)) > END bucket1, > CASE WHEN (count (CASE WHEN ss_quantity BETWEEN 21 AND 40 THEN > ss_quantity END)) > 19045798 > THEN (avg(CASE WHEN ss_quantity BETWEEN 21 AND 40 THEN > ss_ext_discount_amt END)) > ELSE (avg(CASE WHEN ss_quantity BETWEEN 21 AND 40 THEN ss_net_paid END)) > END bucket2, > CASE WHEN (count (CASE WHEN ss_quantity BETWEEN 41 AND 60 THEN > ss_quantity END)) > 365541424 > THEN (avg(CASE WHEN ss_quantity BETWEEN 41 AND 60 THEN > ss_ext_discount_amt END)) > ELSE (avg(CASE WHEN ss_quantity BETWEEN 41 AND 60 THEN ss_net_paid END)) > END bucket3, > CASE WHEN (count (CASE WHEN ss_quantity BETWEEN 61 AND 80 THEN > ss_quantity END)) > 19045798 > THEN (avg(CASE WHEN ss_quantity BETWEEN 61 AND 80 THEN > ss_ext_discount_amt END)) > ELSE (avg(CASE WHEN ss_quantity BETWEEN 61 AND 80 THEN ss_net_paid END)) > END bucket4, > CASE WHEN (count (CASE WHEN ss_quantity BETWEEN 81 AND 100 THEN > ss_quantity END)) > 365541424 > THEN (avg(CASE WHEN ss_quantity BETWEEN 81 AND 100 THEN > ss_ext_discount_amt END)) > ELSE (avg(CASE WHEN ss_quantity BETWEEN 81 AND 100 THEN ss_net_paid END)) > END bucket5 > FROM store_sales > ) > SELECT > (SELECT bucket1 FROM bucket_result) as bucket1, > (SELECT bucket2 FROM bucket_result) as bucket2, > (SELECT bucket3 FROM bucket_result) as bucket3, > (SELECT bucket4 FROM bucket_result) as bucket4, > (SELECT bucket5 FROM bucket_result) as bucket5 > FROM reason > WHERE r_reason_sk = 1; > {code} > Plan of Spark SQL: > {noformat} > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=false > +- Project [Subquery subquery#0, [id=#23] AS bucket1#1, Subquery subquery#2, > [id=#34] AS bucket2#3, Subquery subquery#4, [id=#45] AS bucket3#5, Subquery > subquery#6, [id=#56] AS bucket4#7, Subquery subquery#8, [id=#67] AS bucket5#9] >: :- Subquery subquery#0, [id=#23] >: : +- AdaptiveSparkPlan isFinalPlan=false >: : +- HashAggregate(keys=[], functions=[count(if (((ss_quantity#28 > >= 1) AND (ss_quantity#28 <= 20))) 
ss_quantity#28 else null), > avg(UnscaledValue(if (((ss_quantity#28 >= 1) AND (ss_quantity#28 <= 20))) > ss_ext_discount_amt#32 else null)), avg(UnscaledValue(if (((ss_quantity#28 >= > 1) AND (ss_quantity#28 <= 20))) ss_net_paid#38 else null))]) >: :+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#21] >: : +- HashAggregate(keys=[], functions=[partial_count(if > (((ss_quantity#28 >= 1) AND (ss_quantity#28 <= 20))) ss_quantity#28 else > null), partial_avg(UnscaledValue(if (((ss_quantity#28 >= 1) AND > (ss_quantity#28 <= 20))) ss_ext_discount_amt#32 else null)), > partial_avg(UnscaledValue(if (((ss_quantity#28 >= 1) AND (ss_quantity#28 <= > 20))) ss_net_paid#38 else null))]) >: : +- FileScan parquet > default.store_sales[ss_quantity#28,ss_ext_discount_amt#32,ss_net_paid#38] > Batched: true, DataFilters: [], Format: Parquet, Location: > InMemoryFileIndex[file:/Users/yumwang/spark/SPARK-28169/spark-warehouse/org.apache.spark.sql.Data..., > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct >: :- Subquery subquery#2, [id=#34] >: : +- AdaptiveS
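The plan excerpt above (truncated in this digest) shows each scalar subquery over bucket_result planned independently, so store_sales is scanned once per bucket column. Purely as an illustration of the improvement being requested, the same result can be obtained with a single reference to the CTE; the sketch below (simplified to two of the five buckets, assuming the tables above and a spark-shell session providing `spark`) scans store_sales once:

{code:scala}
// Illustrative rewrite only, simplified to two buckets. Because the CTE is
// referenced through one join instead of five scalar subqueries, store_sales
// is scanned a single time.
spark.sql("""
  WITH bucket_result AS (
    SELECT
      CASE WHEN count(CASE WHEN ss_quantity BETWEEN 1 AND 20 THEN ss_quantity END) > 62316685
           THEN avg(CASE WHEN ss_quantity BETWEEN 1 AND 20 THEN ss_ext_discount_amt END)
           ELSE avg(CASE WHEN ss_quantity BETWEEN 1 AND 20 THEN ss_net_paid END) END AS bucket1,
      CASE WHEN count(CASE WHEN ss_quantity BETWEEN 21 AND 40 THEN ss_quantity END) > 19045798
           THEN avg(CASE WHEN ss_quantity BETWEEN 21 AND 40 THEN ss_ext_discount_amt END)
           ELSE avg(CASE WHEN ss_quantity BETWEEN 21 AND 40 THEN ss_net_paid END) END AS bucket2
    FROM store_sales
  )
  SELECT b.bucket1, b.bucket2
  FROM reason r CROSS JOIN bucket_result b
  WHERE r.r_reason_sk = 1
""").explain()
{code}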
[jira] [Created] (SPARK-34083) Using TPCDS original definitions for char/varchar columns
Kent Yao created SPARK-34083: Summary: Using TPCDS original definitions for char/varchar columns Key: SPARK-34083 URL: https://issues.apache.org/jira/browse/SPARK-34083 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0, 3.2.0 Reporter: Kent Yao Using TPCDS original definitions for char/varchar columns instead of the modified string -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
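For illustration only, the change means declaring TPC-DS columns with their original CHAR/VARCHAR types instead of STRING. The column widths in the sketch below are examples and must be taken from the TPC-DS specification, not from this ticket; a spark-shell session providing `spark` is assumed.

{code:scala}
// Illustrative sketch only: the shape of a TPC-DS table definition that keeps
// the original fixed/variable-width character types instead of STRING.
// Widths shown are examples, not taken from the ticket or the spec.
spark.sql("""
  CREATE TABLE reason (
    r_reason_sk   INT,
    r_reason_id   CHAR(16),
    r_reason_desc CHAR(100)
  ) USING parquet
""")
{code}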
[jira] [Updated] (SPARK-33867) java.time.Instant and java.time.LocalDate not handled in org.apache.spark.sql.jdbc.JdbcDialect#compileValue
[ https://issues.apache.org/jira/browse/SPARK-33867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cristi updated SPARK-33867: --- Description: When using the new java time API (spark.sql.datetime.java8API.enabled=true) LocalDate and Instant aren't handled in org.apache.spark.sql.jdbc.JdbcDialect#compileValue so exceptions are thrown when they are used in filters since a filter condition would be translated to something like this: "valid_from" > 2020-12-21T11:40:24.413681Z. To reproduce you can write a simple filter like where dataset is backed by a DB table (in b=my case PostgreSQL): dataset.filter(current_timestamp().gt(col(VALID_FROM))) The error and stacktrace: Caused by: org.postgresql.util.PSQLException: ERROR: syntax error at or near "T11"Caused by: org.postgresql.util.PSQLException: ERROR: syntax error at or near "T11" Position: 285 at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2103) at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1836) at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:257) at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:512) at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:388) at org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:273) at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:304) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) at org.apache.spark.scheduler.Task.run(Task.scala:127) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:834) was: When using the new java time API (spark.sql.datetime.java8API.enabled=true) LocalDate and Instant aren't handled in org.apache.spark.sql.jdbc.JdbcDialect#compileValue so exceptions are thrown when they are used in filters since a filter condition would be translated to something like this: "valid_from" > 2020-12-21T11:40:24.413681Z. 
To reproduce you can write a simple filter like: dataset.filter(current_timestamp().gt(col(VALID_FROM))) The error and stacktrace: Caused by: org.postgresql.util.PSQLException: ERROR: syntax error at or near "T11"Caused by: org.postgresql.util.PSQLException: ERROR: syntax error at or near "T11" Position: 285 at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2103) at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1836) at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:257) at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:512) at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:388) at org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:273) at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:304) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) at org.apache.spark.scheduler.Task.run(Task.scala:127) at org.apache.spark.e
[jira] [Updated] (SPARK-33867) java.time.Instant and java.time.LocalDate not handled in org.apache.spark.sql.jdbc.JdbcDialect#compileValue
[ https://issues.apache.org/jira/browse/SPARK-33867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cristi updated SPARK-33867: --- Priority: Major (was: Minor) > java.time.Instant and java.time.LocalDate not handled in > org.apache.spark.sql.jdbc.JdbcDialect#compileValue > --- > > Key: SPARK-33867 > URL: https://issues.apache.org/jira/browse/SPARK-33867 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: Cristi >Priority: Major > > When using the new java time API (spark.sql.datetime.java8API.enabled=true) > LocalDate and Instant aren't handled in > org.apache.spark.sql.jdbc.JdbcDialect#compileValue so exceptions are thrown > when they are used in filters since a filter condition would be translated to > something like this: "valid_from" > 2020-12-21T11:40:24.413681Z. > To reproduce you can write a simple filter like where dataset is backed by a > DB table (in b=my case PostgreSQL): > dataset.filter(current_timestamp().gt(col(VALID_FROM))) > The error and stacktrace: > Caused by: org.postgresql.util.PSQLException: ERROR: syntax error at or near > "T11"Caused by: org.postgresql.util.PSQLException: ERROR: syntax error at or > near "T11" Position: 285 at > org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2103) > at > org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1836) > at > org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:257) > at > org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:512) > at > org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:388) > at > org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:273) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:304) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at > org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) at > org.apache.spark.scheduler.Task.run(Task.scala:127) at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449) at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.base/java.lang.Thread.run(Thread.java:834) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34082) Window expression with alias inside WHERE and HAVING clauses fail with non-descriptive exceptions
[ https://issues.apache.org/jira/browse/SPARK-34082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lantao Jin resolved SPARK-34082. Resolution: Invalid Close it due to {{cannot resolve 'b' given input columns}} seems a correct error message. Filter should be resolved before Projection. I was confused with QUALIFY syntax in our internal Spark version. > Window expression with alias inside WHERE and HAVING clauses fail with > non-descriptive exceptions > - > > Key: SPARK-34082 > URL: https://issues.apache.org/jira/browse/SPARK-34082 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.2.0, 3.1.1 >Reporter: Lantao Jin >Priority: Minor > > SPARK-24575 prohibits window expressions inside WHERE and HAVING clauses. But > if the window expression with alias inside WHERE and HAVING clauses, Spark > does not handle this explicitly and will fail with non-descriptive exceptions. > {code} > SELECT a, RANK() OVER(ORDER BY b) AS s FROM testData2 WHERE b = 2 AND s = 1 > {code} > {code} > cannot resolve '`s`' given input columns: [testdata2.a, testdata2.b] > {code} > {code} > SELECT a, MAX(b), RANK() OVER(ORDER BY a) AS s > FROM testData2 > GROUP BY a > HAVING SUM(b) = 5 AND s = 1 > {code} > {code} > cannot resolve '`b`' given input columns: [testdata2.a, max(b)] > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
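Since SPARK-24575 deliberately rejects window expressions in WHERE and HAVING, the intent of the first failing query is normally expressed by computing the window in a subquery and filtering on its alias outside. A sketch, assuming the testData2(a, b) table and a spark-shell session providing `spark`:

{code:scala}
// Standard workaround sketch: compute the window expression in an inner query,
// then filter on its alias in the outer query.
spark.sql("""
  SELECT a, s
  FROM (
    SELECT a, RANK() OVER (ORDER BY b) AS s
    FROM testData2
    WHERE b = 2
  ) ranked
  WHERE s = 1
""").show()
{code}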
[jira] [Closed] (SPARK-34082) Window expression with alias inside WHERE and HAVING clauses fail with non-descriptive exceptions
[ https://issues.apache.org/jira/browse/SPARK-34082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lantao Jin closed SPARK-34082. -- > Window expression with alias inside WHERE and HAVING clauses fail with > non-descriptive exceptions > - > > Key: SPARK-34082 > URL: https://issues.apache.org/jira/browse/SPARK-34082 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.2.0, 3.1.1 >Reporter: Lantao Jin >Priority: Minor > > SPARK-24575 prohibits window expressions inside WHERE and HAVING clauses. But > if the window expression with alias inside WHERE and HAVING clauses, Spark > does not handle this explicitly and will fail with non-descriptive exceptions. > {code} > SELECT a, RANK() OVER(ORDER BY b) AS s FROM testData2 WHERE b = 2 AND s = 1 > {code} > {code} > cannot resolve '`s`' given input columns: [testdata2.a, testdata2.b] > {code} > {code} > SELECT a, MAX(b), RANK() OVER(ORDER BY a) AS s > FROM testData2 > GROUP BY a > HAVING SUM(b) = 5 AND s = 1 > {code} > {code} > cannot resolve '`b`' given input columns: [testdata2.a, max(b)] > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34067) PartitionPruning push down pruningHasBenefit function into insertPredicate function to decrease calculate time
[ https://issues.apache.org/jira/browse/SPARK-34067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiahong.li updated SPARK-34067: --- Affects Version/s: 3.1.1 3.2.0 3.1.0 3.0.2 > PartitionPruning push down pruningHasBenefit function into insertPredicate > function to decrease calculate time > -- > > Key: SPARK-34067 > URL: https://issues.apache.org/jira/browse/SPARK-34067 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.2.0, 3.1.1 >Reporter: jiahong.li >Priority: Minor > > In PartitionPruning's prune function, push the pruningHasBenefit check down > into the insertPredicate function, since `SQLConf.get.exchangeReuseEnabled` > and `SQLConf.get.dynamicPartitionPruningReuseBroadcastOnly` both default to > true. > By making hasBenefit lazy, we avoid invoking it when it is not needed, which > saves computation time. > Solved by #31122 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
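The mechanism the description relies on is shown below in isolation: a lazy val combined with short-circuit evaluation means the expensive benefit check is never computed when the configuration already decides the outcome. This is an illustrative sketch, not the actual PartitionPruning code or the exact condition it evaluates.

{code:scala}
// Illustrative sketch only (not Spark's PartitionPruning code): laziness plus
// short-circuiting skips the expensive benefit estimation when the configs
// already allow inserting the pruning predicate.
def expensiveBenefitEstimate(): Boolean = {
  println("estimating pruning benefit...") // stand-in for pruningHasBenefit
  true
}

val exchangeReuseEnabled = true // stand-in for SQLConf.get.exchangeReuseEnabled
val reuseBroadcastOnly   = true // stand-in for ...dynamicPartitionPruningReuseBroadcastOnly

lazy val hasBenefit = expensiveBenefitEstimate()

// With both configs at their defaults, hasBenefit is never evaluated.
val insertPruningPredicate = (exchangeReuseEnabled && reuseBroadcastOnly) || hasBenefit
{code}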
[jira] [Created] (SPARK-34084) ALTER TABLE .. ADD PARTITION does not update table stats
Maxim Gekk created SPARK-34084: -- Summary: ALTER TABLE .. ADD PARTITION does not update table stats Key: SPARK-34084 URL: https://issues.apache.org/jira/browse/SPARK-34084 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.2, 3.2.0, 3.1.1 Reporter: Maxim Gekk The example below illustrates the issue: {code:sql} spark-sql> create table tbl (col0 int, part int) partitioned by (part); spark-sql> insert into tbl partition (part = 0) select 0; spark-sql> set spark.sql.statistics.size.autoUpdate.enabled=true; spark-sql> alter table tbl add partition (part = 1); {code} There are no stats: {code:sql} spark-sql> describe table extended tbl; col0 int NULL part int NULL # Partition Information # col_name data_type comment part int NULL # Detailed Table Information Database default Table tbl Owner maximgekk Created Time Tue Jan 12 12:00:03 MSK 2021 Last Access UNKNOWN Created By Spark 3.2.0-SNAPSHOT Type MANAGED Provider hive Table Properties [transient_lastDdlTime=1610442003] Location file:/Users/maximgekk/proj/fix-stats-in-add-partition/spark-warehouse/tbl Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe InputFormat org.apache.hadoop.mapred.TextInputFormat OutputFormat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Storage Properties [serialization.format=1] Partition Provider Catalog {code} *As we can see, there are no statistics.* For comparison, ALTER TABLE .. DROP PARTITION updates stats: {code:sql} spark-sql> alter table tbl drop partition (part = 1); spark-sql> describe table extended tbl; col0 int NULL part int NULL # Partition Information # col_name data_type comment part int NULL # Detailed Table Information ... Statistics 2 bytes {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
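The same observation can be scripted against the catalog; the sketch below uses Spark's internal (unstable) session state API, assuming a spark-shell session with Hive support and the tbl table created as above:

{code:scala}
// Sketch: read the catalog statistics programmatically after the repro above.
// Uses internal APIs (sessionState), so treat it as illustration only.
import org.apache.spark.sql.catalyst.TableIdentifier

val meta = spark.sessionState.catalog.getTableMetadata(TableIdentifier("tbl"))
// The report's expectation: with spark.sql.statistics.size.autoUpdate.enabled,
// this should carry updated size stats after ALTER TABLE ... ADD PARTITION,
// just as it does after DROP PARTITION.
println(meta.stats)
{code}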
[jira] [Commented] (SPARK-28065) ntile only accepting positive (>0) values
[ https://issues.apache.org/jira/browse/SPARK-28065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263187#comment-17263187 ] jiaan.geng commented on SPARK-28065: After investigation, it turns out that most databases require the ntile argument to be a positive integer greater than 0. > ntile only accepting positive (>0) values > - > > Key: SPARK-28065 > URL: https://issues.apache.org/jira/browse/SPARK-28065 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Dylan Guedes >Priority: Major > > Currently, Spark does not accept NULL or zero as input to `ntile`; Postgres, > however, supports it. > Example: > {code:sql} > SELECT ntile(NULL) OVER (ORDER BY ten, four), ten, four FROM tenk1 LIMIT 2; > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
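A quick sketch of the behavior under discussion, assuming a spark-shell session providing `spark`; the exact analysis error text for the rejected cases may vary by version:

{code:scala}
// Sketch of current behavior: a positive constant bucket count is accepted,
// while NULL/zero/negative counts (which the description notes Postgres
// accepts for NULL) fail analysis in Spark.
spark.range(10).toDF("x").createOrReplaceTempView("t")

spark.sql("SELECT x, ntile(3) OVER (ORDER BY x) AS nt FROM t").show()

// Rejected today (uncomment to see the analysis error; message varies):
// spark.sql("SELECT x, ntile(0) OVER (ORDER BY x) AS nt FROM t").show()
{code}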
[jira] [Updated] (SPARK-33867) java.time.Instant and java.time.LocalDate not handled in org.apache.spark.sql.jdbc.JdbcDialect#compileValue
[ https://issues.apache.org/jira/browse/SPARK-33867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cristi updated SPARK-33867: --- Description: When using the new java time API (spark.sql.datetime.java8API.enabled=true) LocalDate and Instant aren't handled in org.apache.spark.sql.jdbc.JdbcDialect#compileValue so exceptions are thrown when they are used in filters since a filter condition would be translated to something like this: "valid_from" > 2020-12-21T11:40:24.413681Z. To reproduce you can write a simple filter like where dataset is backed by a DB table (in my case PostgreSQL): dataset.filter(current_timestamp().gt(col(VALID_FROM))) The error and stacktrace: Caused by: org.postgresql.util.PSQLException: ERROR: syntax error at or near "T11"Caused by: org.postgresql.util.PSQLException: ERROR: syntax error at or near "T11" Position: 285 at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2103) at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1836) at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:257) at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:512) at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:388) at org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:273) at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:304) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) at org.apache.spark.scheduler.Task.run(Task.scala:127) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:834) was: When using the new java time API (spark.sql.datetime.java8API.enabled=true) LocalDate and Instant aren't handled in org.apache.spark.sql.jdbc.JdbcDialect#compileValue so exceptions are thrown when they are used in filters since a filter condition would be translated to something like this: "valid_from" > 2020-12-21T11:40:24.413681Z. 
To reproduce you can write a simple filter like where dataset is backed by a DB table (in b=my case PostgreSQL): dataset.filter(current_timestamp().gt(col(VALID_FROM))) The error and stacktrace: Caused by: org.postgresql.util.PSQLException: ERROR: syntax error at or near "T11"Caused by: org.postgresql.util.PSQLException: ERROR: syntax error at or near "T11" Position: 285 at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2103) at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1836) at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:257) at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:512) at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:388) at org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:273) at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:304) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) at org.apache.
[jira] [Commented] (SPARK-34079) Improvement CTE table scan
[ https://issues.apache.org/jira/browse/SPARK-34079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263193#comment-17263193 ] Yuming Wang commented on SPARK-34079: - Thank you. Go ahead, please. > Improvement CTE table scan > -- > > Key: SPARK-34079 > URL: https://issues.apache.org/jira/browse/SPARK-34079 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Priority: Major > > Prepare table: > {code:sql} > CREATE TABLE store_sales ( ss_sold_date_sk INT, ss_sold_time_sk INT, > ss_item_sk INT, ss_customer_sk INT, ss_cdemo_sk INT, ss_hdemo_sk INT, > ss_addr_sk INT, ss_store_sk INT, ss_promo_sk INT, ss_ticket_number INT, > ss_quantity INT, ss_wholesale_cost DECIMAL(7,2), ss_list_price > DECIMAL(7,2), ss_sales_price DECIMAL(7,2), ss_ext_discount_amt > DECIMAL(7,2), ss_ext_sales_price DECIMAL(7,2), ss_ext_wholesale_cost > DECIMAL(7,2), ss_ext_list_price DECIMAL(7,2), ss_ext_tax DECIMAL(7,2), > ss_coupon_amt DECIMAL(7,2), ss_net_paid DECIMAL(7,2), ss_net_paid_inc_tax > DECIMAL(7,2),ss_net_profit DECIMAL(7,2)); > CREATE TABLE reason ( r_reason_sk INT, r_reason_id varchar(255), > r_reason_desc varchar(255)); > {code} > SQL: > {code:sql} > WITH bucket_result AS ( > SELECT > CASE WHEN (count (CASE WHEN ss_quantity BETWEEN 1 AND 20 THEN ss_quantity > END)) > 62316685 > THEN (avg(CASE WHEN ss_quantity BETWEEN 1 AND 20 THEN > ss_ext_discount_amt END)) > ELSE (avg(CASE WHEN ss_quantity BETWEEN 1 AND 20 THEN ss_net_paid END)) > END bucket1, > CASE WHEN (count (CASE WHEN ss_quantity BETWEEN 21 AND 40 THEN > ss_quantity END)) > 19045798 > THEN (avg(CASE WHEN ss_quantity BETWEEN 21 AND 40 THEN > ss_ext_discount_amt END)) > ELSE (avg(CASE WHEN ss_quantity BETWEEN 21 AND 40 THEN ss_net_paid END)) > END bucket2, > CASE WHEN (count (CASE WHEN ss_quantity BETWEEN 41 AND 60 THEN > ss_quantity END)) > 365541424 > THEN (avg(CASE WHEN ss_quantity BETWEEN 41 AND 60 THEN > ss_ext_discount_amt END)) > ELSE (avg(CASE WHEN ss_quantity BETWEEN 41 AND 60 THEN ss_net_paid END)) > END bucket3, > CASE WHEN (count (CASE WHEN ss_quantity BETWEEN 61 AND 80 THEN > ss_quantity END)) > 19045798 > THEN (avg(CASE WHEN ss_quantity BETWEEN 61 AND 80 THEN > ss_ext_discount_amt END)) > ELSE (avg(CASE WHEN ss_quantity BETWEEN 61 AND 80 THEN ss_net_paid END)) > END bucket4, > CASE WHEN (count (CASE WHEN ss_quantity BETWEEN 81 AND 100 THEN > ss_quantity END)) > 365541424 > THEN (avg(CASE WHEN ss_quantity BETWEEN 81 AND 100 THEN > ss_ext_discount_amt END)) > ELSE (avg(CASE WHEN ss_quantity BETWEEN 81 AND 100 THEN ss_net_paid END)) > END bucket5 > FROM store_sales > ) > SELECT > (SELECT bucket1 FROM bucket_result) as bucket1, > (SELECT bucket2 FROM bucket_result) as bucket2, > (SELECT bucket3 FROM bucket_result) as bucket3, > (SELECT bucket4 FROM bucket_result) as bucket4, > (SELECT bucket5 FROM bucket_result) as bucket5 > FROM reason > WHERE r_reason_sk = 1; > {code} > Plan of Spark SQL: > {noformat} > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=false > +- Project [Subquery subquery#0, [id=#23] AS bucket1#1, Subquery subquery#2, > [id=#34] AS bucket2#3, Subquery subquery#4, [id=#45] AS bucket3#5, Subquery > subquery#6, [id=#56] AS bucket4#7, Subquery subquery#8, [id=#67] AS bucket5#9] >: :- Subquery subquery#0, [id=#23] >: : +- AdaptiveSparkPlan isFinalPlan=false >: : +- HashAggregate(keys=[], functions=[count(if (((ss_quantity#28 > >= 1) AND (ss_quantity#28 <= 20))) ss_quantity#28 else null), > avg(UnscaledValue(if (((ss_quantity#28 >= 
1) AND (ss_quantity#28 <= 20))) > ss_ext_discount_amt#32 else null)), avg(UnscaledValue(if (((ss_quantity#28 >= > 1) AND (ss_quantity#28 <= 20))) ss_net_paid#38 else null))]) >: :+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#21] >: : +- HashAggregate(keys=[], functions=[partial_count(if > (((ss_quantity#28 >= 1) AND (ss_quantity#28 <= 20))) ss_quantity#28 else > null), partial_avg(UnscaledValue(if (((ss_quantity#28 >= 1) AND > (ss_quantity#28 <= 20))) ss_ext_discount_amt#32 else null)), > partial_avg(UnscaledValue(if (((ss_quantity#28 >= 1) AND (ss_quantity#28 <= > 20))) ss_net_paid#38 else null))]) >: : +- FileScan parquet > default.store_sales[ss_quantity#28,ss_ext_discount_amt#32,ss_net_paid#38] > Batched: true, DataFilters: [], Format: Parquet, Location: > InMemoryFileIndex[file:/Users/yumwang/spark/SPARK-28169/spark-warehouse/org.apache.spark.sql.Data..., > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct >: :- Subquery subquery#2, [id=#34] >: : +- AdaptiveSparkPlan isFinalPlan=false >: : +- HashAggregate(key
[jira] [Assigned] (SPARK-34083) Using TPCDS original definitions for char/varchar columns
[ https://issues.apache.org/jira/browse/SPARK-34083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34083: Assignee: (was: Apache Spark) > Using TPCDS original definitions for char/varchar columns > - > > Key: SPARK-34083 > URL: https://issues.apache.org/jira/browse/SPARK-34083 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0, 3.2.0 >Reporter: Kent Yao >Priority: Major > > Using TPCDS original definitions for char/varchar columns instead of the > modified string -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34083) Using TPCDS original definitions for char/varchar columns
[ https://issues.apache.org/jira/browse/SPARK-34083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263194#comment-17263194 ] Apache Spark commented on SPARK-34083: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/31012 > Using TPCDS original definitions for char/varchar columns > - > > Key: SPARK-34083 > URL: https://issues.apache.org/jira/browse/SPARK-34083 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0, 3.2.0 >Reporter: Kent Yao >Priority: Major > > Using TPCDS original definitions for char/varchar columns instead of the > modified string -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34083) Using TPCDS original definitions for char/varchar columns
[ https://issues.apache.org/jira/browse/SPARK-34083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34083: Assignee: Apache Spark > Using TPCDS original definitions for char/varchar columns > - > > Key: SPARK-34083 > URL: https://issues.apache.org/jira/browse/SPARK-34083 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0, 3.2.0 >Reporter: Kent Yao >Assignee: Apache Spark >Priority: Major > > Using TPCDS original definitions for char/varchar columns instead of the > modified string -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34075) Hidden directories are being listed for partition inference
[ https://issues.apache.org/jira/browse/SPARK-34075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-34075: - Target Version/s: 3.1.1 > Hidden directories are being listed for partition inference > --- > > Key: SPARK-34075 > URL: https://issues.apache.org/jira/browse/SPARK-34075 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Burak Yavuz >Priority: Blocker > > Marking this as a blocker since it seems to be a regression. We are running > Delta's tests against Spark 3.1 as part of QA here: > [https://github.com/delta-io/delta/pull/579] > > We have noticed that one of our tests regressed with: > {code:java} > java.lang.AssertionError: assertion failed: Conflicting directory structures > detected. Suspicious paths: > [info] > file:/private/var/folders/_2/xn1c9yr11_93wjdk2vkvmwm0gp/t/spark-18706bcc-23ea-4853-b8bc-c4cc2a5ed551 > [info] > file:/private/var/folders/_2/xn1c9yr11_93wjdk2vkvmwm0gp/t/spark-18706bcc-23ea-4853-b8bc-c4cc2a5ed551/_delta_log > [info] > [info] If provided paths are partition directories, please set "basePath" in > the options of the data source to specify the root directory of the table. If > there are multiple root directories, please load them separately and then > union them. > [info] at scala.Predef$.assert(Predef.scala:223) > [info] at > org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:172) > [info] at > org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:104) > [info] at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning(PartitioningAwareFileIndex.scala:158) > [info] at > org.apache.spark.sql.execution.datasources.InMemoryFileIndex.partitionSpec(InMemoryFileIndex.scala:73) > [info] at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.partitionSchema(PartitioningAwareFileIndex.scala:50) > [info] at > org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:167) > [info] at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:418) > [info] at > org.apache.spark.sql.execution.datasources.ResolveSQLOnFile$$anonfun$apply$1.applyOrElse(rules.scala:62) > [info] at > org.apache.spark.sql.execution.datasources.ResolveSQLOnFile$$anonfun$apply$1.applyOrElse(rules.scala:45) > [info] at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$2(AnalysisHelper.scala:108) > [info] at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73) > [info] at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$1(AnalysisHelper.scala:108) > [info] at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:221) > [info] at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown(AnalysisHelper.scala:106) > [info] at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown$(AnalysisHelper.scala:104) > [info] at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDown(LogicalPlan.scala:29) > [info] at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperators(AnalysisHelper.scala:73) > [info] at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperators$(AnalysisHelper.scala:72) > [info] at > 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:29) > [info] at > org.apache.spark.sql.execution.datasources.ResolveSQLOnFile.apply(rules.scala:45) > [info] at > org.apache.spark.sql.execution.datasources.ResolveSQLOnFile.apply(rules.scala:40) > [info] at > org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:216) > [info] at > scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126) > [info] at > scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122) > [info] at scala.collection.immutable.List.foldLeft(List.scala:89) > [info] at > org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:213) > [info] at > org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:205) > [info] at scala.collection.immutable.List.foreach(List.scala:392) > [info] at > org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:205) > [info] at > org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSam
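The conflicting path in the assertion is _delta_log, a hidden metadata directory; the regression is that partition inference now treats it as a candidate partition path. The sketch below illustrates the kind of filtering expected during listing; it is not Spark's actual InMemoryFileIndex code:

{code:scala}
// Illustrative sketch only: directories whose names start with "_" or "." are
// metadata (e.g. _delta_log) and should not participate in partition inference.
import org.apache.hadoop.fs.Path

def hiddenForPartitionInference(path: Path): Boolean = {
  val name = path.getName
  name.startsWith("_") || name.startsWith(".")
}

val listed = Seq(
  new Path("/tmp/spark-18706bcc-23ea-4853-b8bc-c4cc2a5ed551/_delta_log"),
  new Path("/tmp/spark-18706bcc-23ea-4853-b8bc-c4cc2a5ed551/part=0")
)

// Only part=0 should remain a candidate for partition inference.
println(listed.filterNot(hiddenForPartitionInference))
{code}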
[jira] [Assigned] (SPARK-33867) java.time.Instant and java.time.LocalDate not handled in org.apache.spark.sql.jdbc.JdbcDialect#compileValue
[ https://issues.apache.org/jira/browse/SPARK-33867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33867: Assignee: Apache Spark > java.time.Instant and java.time.LocalDate not handled in > org.apache.spark.sql.jdbc.JdbcDialect#compileValue > --- > > Key: SPARK-33867 > URL: https://issues.apache.org/jira/browse/SPARK-33867 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: Cristi >Assignee: Apache Spark >Priority: Major > > When using the new java time API (spark.sql.datetime.java8API.enabled=true) > LocalDate and Instant aren't handled in > org.apache.spark.sql.jdbc.JdbcDialect#compileValue so exceptions are thrown > when they are used in filters since a filter condition would be translated to > something like this: "valid_from" > 2020-12-21T11:40:24.413681Z. > To reproduce you can write a simple filter like where dataset is backed by a > DB table (in my case PostgreSQL): > dataset.filter(current_timestamp().gt(col(VALID_FROM))) > The error and stacktrace: > Caused by: org.postgresql.util.PSQLException: ERROR: syntax error at or near > "T11"Caused by: org.postgresql.util.PSQLException: ERROR: syntax error at or > near "T11" Position: 285 at > org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2103) > at > org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1836) > at > org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:257) > at > org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:512) > at > org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:388) > at > org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:273) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:304) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at > org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) at > org.apache.spark.scheduler.Task.run(Task.scala:127) at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449) at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.base/java.lang.Thread.run(Thread.java:834) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33867) java.time.Instant and java.time.LocalDate not handled in org.apache.spark.sql.jdbc.JdbcDialect#compileValue
[ https://issues.apache.org/jira/browse/SPARK-33867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33867: Assignee: (was: Apache Spark) > java.time.Instant and java.time.LocalDate not handled in > org.apache.spark.sql.jdbc.JdbcDialect#compileValue > --- > > Key: SPARK-33867 > URL: https://issues.apache.org/jira/browse/SPARK-33867 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: Cristi >Priority: Major > > When using the new java time API (spark.sql.datetime.java8API.enabled=true) > LocalDate and Instant aren't handled in > org.apache.spark.sql.jdbc.JdbcDialect#compileValue so exceptions are thrown > when they are used in filters since a filter condition would be translated to > something like this: "valid_from" > 2020-12-21T11:40:24.413681Z. > To reproduce you can write a simple filter like where dataset is backed by a > DB table (in my case PostgreSQL): > dataset.filter(current_timestamp().gt(col(VALID_FROM))) > The error and stacktrace: > Caused by: org.postgresql.util.PSQLException: ERROR: syntax error at or near > "T11"Caused by: org.postgresql.util.PSQLException: ERROR: syntax error at or > near "T11" Position: 285 at > org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2103) > at > org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1836) > at > org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:257) > at > org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:512) > at > org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:388) > at > org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:273) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:304) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at > org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) at > org.apache.spark.scheduler.Task.run(Task.scala:127) at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449) at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.base/java.lang.Thread.run(Thread.java:834) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33867) java.time.Instant and java.time.LocalDate not handled in org.apache.spark.sql.jdbc.JdbcDialect#compileValue
[ https://issues.apache.org/jira/browse/SPARK-33867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263207#comment-17263207 ] Apache Spark commented on SPARK-33867: -- User 'cristichircu' has created a pull request for this issue: https://github.com/apache/spark/pull/31148 > java.time.Instant and java.time.LocalDate not handled in > org.apache.spark.sql.jdbc.JdbcDialect#compileValue > --- > > Key: SPARK-33867 > URL: https://issues.apache.org/jira/browse/SPARK-33867 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: Cristi >Priority: Major > > When using the new java time API (spark.sql.datetime.java8API.enabled=true) > LocalDate and Instant aren't handled in > org.apache.spark.sql.jdbc.JdbcDialect#compileValue so exceptions are thrown > when they are used in filters since a filter condition would be translated to > something like this: "valid_from" > 2020-12-21T11:40:24.413681Z. > To reproduce you can write a simple filter like where dataset is backed by a > DB table (in my case PostgreSQL): > dataset.filter(current_timestamp().gt(col(VALID_FROM))) > The error and stacktrace: > Caused by: org.postgresql.util.PSQLException: ERROR: syntax error at or near > "T11"Caused by: org.postgresql.util.PSQLException: ERROR: syntax error at or > near "T11" Position: 285 at > org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2103) > at > org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1836) > at > org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:257) > at > org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:512) > at > org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:388) > at > org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:273) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:304) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at > org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) at > org.apache.spark.scheduler.Task.run(Task.scala:127) at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449) at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.base/java.lang.Thread.run(Thread.java:834) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
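For reference, a minimal sketch of the kind of handling the report asks for, under the assumption that the dialect turns java.time values into quoted JDBC literals before splicing them into the generated WHERE clause. The helper name compileJavaTimeValue is hypothetical and not Spark's API; Timestamp.from and Date.valueOf are standard JDK methods.
{code:scala}
// Illustrative sketch only (not the actual Spark patch): map java.time values to quoted
// SQL literals so the database never sees the bare ISO-8601 form with the 'T' separator.
// Time-zone handling is deliberately simplified here.
import java.sql.{Date, Timestamp}
import java.time.{Instant, LocalDate}

def compileJavaTimeValue(value: Any): Any = value match {
  case instant: Instant     => s"'${Timestamp.from(instant)}'"   // e.g. '2020-12-21 11:40:24.413681'
  case localDate: LocalDate => s"'${Date.valueOf(localDate)}'"   // e.g. '2020-12-21'
  case other                => other                             // leave existing literal handling untouched
}
{code}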
[jira] [Commented] (SPARK-34084) ALTER TABLE .. ADD PARTITION does not update table stats
[ https://issues.apache.org/jira/browse/SPARK-34084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263222#comment-17263222 ] Apache Spark commented on SPARK-34084: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/31149 > ALTER TABLE .. ADD PARTITION does not update table stats > > > Key: SPARK-34084 > URL: https://issues.apache.org/jira/browse/SPARK-34084 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.2.0, 3.1.1 > Environment: strong text >Reporter: Maxim Gekk >Priority: Major > > The example below portraits the issue: > {code:sql} > spark-sql> create table tbl (col0 int, part int) partitioned by (part); > spark-sql> insert into tbl partition (part = 0) select 0; > spark-sql> set spark.sql.statistics.size.autoUpdate.enabled=true; > spark-sql> alter table tbl add partition (part = 1); > {code} > There are no stats: > {code:sql} > spark-sql> describe table extended tbl; > col0 int NULL > part int NULL > # Partition Information > # col_namedata_type comment > part int NULL > # Detailed Table Information > Database default > Table tbl > Owner maximgekk > Created Time Tue Jan 12 12:00:03 MSK 2021 > Last Access UNKNOWN > Created BySpark 3.2.0-SNAPSHOT > Type MANAGED > Provider hive > Table Properties [transient_lastDdlTime=1610442003] > Location > file:/Users/maximgekk/proj/fix-stats-in-add-partition/spark-warehouse/tbl > Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > InputFormat org.apache.hadoop.mapred.TextInputFormat > OutputFormat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > Storage Properties[serialization.format=1] > Partition ProviderCatalog > {code} > *As we can see there is no stats.* For instance, ALTER TABLE .. DROP > PARTITION updates stats: > {code:sql} > spark-sql> alter table tbl drop partition (part = 1); > spark-sql> describe table extended tbl; > col0 int NULL > part int NULL > # Partition Information > # col_namedata_type comment > part int NULL > # Detailed Table Information > ... > Statistics2 bytes > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
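As a stop-gap until a fix lands, the table-level size statistics can be recomputed by hand. The sketch below is a workaround suggestion, not part of the report or of the eventual patch; it reuses the table tbl from the reproduction above and presumes a SparkSession with Hive support.
{code:scala}
// Possible manual workaround (assumption, not the merged fix): refresh table-level size
// stats after ADD PARTITION so that DESCRIBE EXTENDED shows a Statistics line again.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
spark.sql("ANALYZE TABLE tbl COMPUTE STATISTICS NOSCAN")          // size-only update, no row scan
spark.sql("DESCRIBE TABLE EXTENDED tbl").show(100, truncate = false)
{code}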
[jira] [Assigned] (SPARK-34084) ALTER TABLE .. ADD PARTITION does not update table stats
[ https://issues.apache.org/jira/browse/SPARK-34084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34084: Assignee: Apache Spark > ALTER TABLE .. ADD PARTITION does not update table stats > > > Key: SPARK-34084 > URL: https://issues.apache.org/jira/browse/SPARK-34084 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.2.0, 3.1.1 > Environment: strong text >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Major > > The example below portraits the issue: > {code:sql} > spark-sql> create table tbl (col0 int, part int) partitioned by (part); > spark-sql> insert into tbl partition (part = 0) select 0; > spark-sql> set spark.sql.statistics.size.autoUpdate.enabled=true; > spark-sql> alter table tbl add partition (part = 1); > {code} > There are no stats: > {code:sql} > spark-sql> describe table extended tbl; > col0 int NULL > part int NULL > # Partition Information > # col_namedata_type comment > part int NULL > # Detailed Table Information > Database default > Table tbl > Owner maximgekk > Created Time Tue Jan 12 12:00:03 MSK 2021 > Last Access UNKNOWN > Created BySpark 3.2.0-SNAPSHOT > Type MANAGED > Provider hive > Table Properties [transient_lastDdlTime=1610442003] > Location > file:/Users/maximgekk/proj/fix-stats-in-add-partition/spark-warehouse/tbl > Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > InputFormat org.apache.hadoop.mapred.TextInputFormat > OutputFormat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > Storage Properties[serialization.format=1] > Partition ProviderCatalog > {code} > *As we can see there is no stats.* For instance, ALTER TABLE .. DROP > PARTITION updates stats: > {code:sql} > spark-sql> alter table tbl drop partition (part = 1); > spark-sql> describe table extended tbl; > col0 int NULL > part int NULL > # Partition Information > # col_namedata_type comment > part int NULL > # Detailed Table Information > ... > Statistics2 bytes > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34084) ALTER TABLE .. ADD PARTITION does not update table stats
[ https://issues.apache.org/jira/browse/SPARK-34084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34084: Assignee: (was: Apache Spark) > ALTER TABLE .. ADD PARTITION does not update table stats > > > Key: SPARK-34084 > URL: https://issues.apache.org/jira/browse/SPARK-34084 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.2.0, 3.1.1 > Environment: strong text >Reporter: Maxim Gekk >Priority: Major > > The example below portraits the issue: > {code:sql} > spark-sql> create table tbl (col0 int, part int) partitioned by (part); > spark-sql> insert into tbl partition (part = 0) select 0; > spark-sql> set spark.sql.statistics.size.autoUpdate.enabled=true; > spark-sql> alter table tbl add partition (part = 1); > {code} > There are no stats: > {code:sql} > spark-sql> describe table extended tbl; > col0 int NULL > part int NULL > # Partition Information > # col_namedata_type comment > part int NULL > # Detailed Table Information > Database default > Table tbl > Owner maximgekk > Created Time Tue Jan 12 12:00:03 MSK 2021 > Last Access UNKNOWN > Created BySpark 3.2.0-SNAPSHOT > Type MANAGED > Provider hive > Table Properties [transient_lastDdlTime=1610442003] > Location > file:/Users/maximgekk/proj/fix-stats-in-add-partition/spark-warehouse/tbl > Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > InputFormat org.apache.hadoop.mapred.TextInputFormat > OutputFormat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > Storage Properties[serialization.format=1] > Partition ProviderCatalog > {code} > *As we can see there is no stats.* For instance, ALTER TABLE .. DROP > PARTITION updates stats: > {code:sql} > spark-sql> alter table tbl drop partition (part = 1); > spark-sql> describe table extended tbl; > col0 int NULL > part int NULL > # Partition Information > # col_namedata_type comment > part int NULL > # Detailed Table Information > ... > Statistics2 bytes > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34085) History server missing failed stage
Yuming Wang created SPARK-34085: --- Summary: History server missing failed stage Key: SPARK-34085 URL: https://issues.apache.org/jira/browse/SPARK-34085 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0 Reporter: Yuming Wang It is missing the failed stage(261716). !image-2021-01-12-18-28-34-153.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34085) History server missing failed stage
[ https://issues.apache.org/jira/browse/SPARK-34085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-34085: Description: It is missing the failed stage(261716). !image-2021-01-12-18-30-45-862.png! was: It is missing the failed stage(261716). !image-2021-01-12-18-28-34-153.png! > History server missing failed stage > --- > > Key: SPARK-34085 > URL: https://issues.apache.org/jira/browse/SPARK-34085 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > Attachments: image-2021-01-12-18-30-45-862.png > > > It is missing the failed stage(261716). > !image-2021-01-12-18-30-45-862.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34085) History server missing failed stage
[ https://issues.apache.org/jira/browse/SPARK-34085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-34085: Attachment: image-2021-01-12-18-30-45-862.png > History server missing failed stage > --- > > Key: SPARK-34085 > URL: https://issues.apache.org/jira/browse/SPARK-34085 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > Attachments: image-2021-01-12-18-30-45-862.png > > > It is missing the failed stage(261716). > !image-2021-01-12-18-28-34-153.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34086) RaiseError generates too much code and may fails codegen
Kent Yao created SPARK-34086: Summary: RaiseError generates too much code and may fails codegen Key: SPARK-34086 URL: https://issues.apache.org/jira/browse/SPARK-34086 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.0 Reporter: Kent Yao https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133928/testReport/org.apache.spark.sql.execution/LogicalPlanTagInSparkPlanSuite/q41/ We can reduce more than 8000 bytes by removing the unnecessary CONCAT expression. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
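To verify a size reduction like the ~8000 bytes mentioned above, the generated Java source for each whole-stage-codegen subtree can be dumped; the sketch assumes a DataFrame df whose plan contains the char/varchar length check.
{code:scala}
// debugCodegen() is provided by Spark's execution.debug package; it prints every
// WholeStageCodegen subtree together with the Java source it generates, which makes
// it possible to compare the generated code before and after removing the CONCAT.
import org.apache.spark.sql.execution.debug._

df.debugCodegen()
{code}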
[jira] [Updated] (SPARK-34086) RaiseError generates too much code and may fails codegen in length check for char varchar
[ https://issues.apache.org/jira/browse/SPARK-34086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-34086: - Summary: RaiseError generates too much code and may fails codegen in length check for char varchar (was: RaiseError generates too much code and may fails codegen) > RaiseError generates too much code and may fails codegen in length check for > char varchar > - > > Key: SPARK-34086 > URL: https://issues.apache.org/jira/browse/SPARK-34086 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Kent Yao >Priority: Major > > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133928/testReport/org.apache.spark.sql.execution/LogicalPlanTagInSparkPlanSuite/q41/ > We can reduce more than 8000 bytes by removing the unnecessary CONCAT > expression. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34086) RaiseError generates too much code and may fails codegen in length check for char varchar
[ https://issues.apache.org/jira/browse/SPARK-34086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263251#comment-17263251 ] Apache Spark commented on SPARK-34086: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/31150 > RaiseError generates too much code and may fails codegen in length check for > char varchar > - > > Key: SPARK-34086 > URL: https://issues.apache.org/jira/browse/SPARK-34086 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Kent Yao >Priority: Major > > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133928/testReport/org.apache.spark.sql.execution/LogicalPlanTagInSparkPlanSuite/q41/ > We can reduce more than 8000 bytes by removing the unnecessary CONCAT > expression. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34086) RaiseError generates too much code and may fails codegen in length check for char varchar
[ https://issues.apache.org/jira/browse/SPARK-34086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34086: Assignee: Apache Spark > RaiseError generates too much code and may fails codegen in length check for > char varchar > - > > Key: SPARK-34086 > URL: https://issues.apache.org/jira/browse/SPARK-34086 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Kent Yao >Assignee: Apache Spark >Priority: Major > > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133928/testReport/org.apache.spark.sql.execution/LogicalPlanTagInSparkPlanSuite/q41/ > We can reduce more than 8000 bytes by removing the unnecessary CONCAT > expression. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34086) RaiseError generates too much code and may fails codegen in length check for char varchar
[ https://issues.apache.org/jira/browse/SPARK-34086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34086: Assignee: (was: Apache Spark) > RaiseError generates too much code and may fails codegen in length check for > char varchar > - > > Key: SPARK-34086 > URL: https://issues.apache.org/jira/browse/SPARK-34086 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Kent Yao >Priority: Major > > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133928/testReport/org.apache.spark.sql.execution/LogicalPlanTagInSparkPlanSuite/q41/ > We can reduce more than 8000 bytes by removing the unnecessary CONCAT > expression. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34087) a memory leak occurs when we clone the spark session
Fu Chen created SPARK-34087: --- Summary: a memory leak occurs when we clone the spark session Key: SPARK-34087 URL: https://issues.apache.org/jira/browse/SPARK-34087 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.1 Reporter: Fu Chen In Spark 3.0.1, a memory leak occurs when we keep cloning the Spark session, because a new ExecutionListenerBus instance is added to the AsyncEventQueue each time a session is cloned. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
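The mechanism described above can be illustrated with a self-contained sketch. The classes below are simplified stand-ins, not Spark's real listener bus or session types; the point is only that every clone registers a listener on a shared, context-level bus and nothing ever removes it.
{code:scala}
// Simplified stand-ins for illustration: each clone adds one more listener to the shared
// bus, the listener captures the session that created it, and nothing unregisters it,
// so every cloned session stays reachable for the lifetime of the context.
import scala.collection.mutable.ArrayBuffer

class SharedListenerBus { val listeners = ArrayBuffer.empty[AnyRef] }

class Session(bus: SharedListenerBus) {
  bus.listeners += new AnyRef { val owner = Session.this } // listener retains the session
  def cloneSession(): Session = new Session(bus)           // every clone adds another listener
}

object LeakDemo extends App {
  val bus  = new SharedListenerBus
  val root = new Session(bus)
  (1 to 1000).foreach(_ => root.cloneSession())
  println(s"registered listeners: ${bus.listeners.size}")  // prints 1001; none can be GC'd
}
{code}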
[jira] [Updated] (SPARK-34087) a memory leak occurs when we clone the spark session
[ https://issues.apache.org/jira/browse/SPARK-34087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fu Chen updated SPARK-34087: Attachment: (was: 1610451044690.jpg) > a memory leak occurs when we clone the spark session > > > Key: SPARK-34087 > URL: https://issues.apache.org/jira/browse/SPARK-34087 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Fu Chen >Priority: Major > > In Spark-3.0.1, the memory leak occurs when we keep cloning the spark session > because a new ExecutionListenerBus instance will add to AsyncEventQueue when > we clone a new session. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34087) a memory leak occurs when we clone the spark session
[ https://issues.apache.org/jira/browse/SPARK-34087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fu Chen updated SPARK-34087: Attachment: 1610451044690.jpg > a memory leak occurs when we clone the spark session > > > Key: SPARK-34087 > URL: https://issues.apache.org/jira/browse/SPARK-34087 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Fu Chen >Priority: Major > > In Spark-3.0.1, the memory leak occurs when we keep cloning the spark session > because a new ExecutionListenerBus instance will add to AsyncEventQueue when > we clone a new session. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34087) a memory leak occurs when we clone the spark session
[ https://issues.apache.org/jira/browse/SPARK-34087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263264#comment-17263264 ] Fu Chen commented on SPARK-34087: - *bug replay* here is code for replay this bug: {code:java} test("bug replay") { (1 to 1000).foreach(i => { spark.cloneSession() }) val cnt = spark.sparkContext .listenerBus .listeners .asScala .collect{ case e: ExecutionListenerBus => e} .size println(s"total ExecutionListenerBus count ${cnt}.") Thread.sleep(Int.MaxValue) } {code} *output:* total ExecutionListenerBus count 1001. *jmap* *!1610451044690.jpg!* Each ExecutionListenerBus holds one SparkSession instance, so JVM can't collect these SparkSession object > a memory leak occurs when we clone the spark session > > > Key: SPARK-34087 > URL: https://issues.apache.org/jira/browse/SPARK-34087 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Fu Chen >Priority: Major > > In Spark-3.0.1, the memory leak occurs when we keep cloning the spark session > because a new ExecutionListenerBus instance will add to AsyncEventQueue when > we clone a new session. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34087) a memory leak occurs when we clone the spark session
[ https://issues.apache.org/jira/browse/SPARK-34087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263266#comment-17263266 ] Fu Chen commented on SPARK-34087: - *bug replay* here is code for replay this bug: {code:java} test("bug replay") { (1 to 1000).foreach(i => { spark.cloneSession() }) val cnt = spark.sparkContext .listenerBus .listeners .asScala .collect{ case e: ExecutionListenerBus => e} .size println(s"total ExecutionListenerBus count ${cnt}.") Thread.sleep(Int.MaxValue) } {code} *output:* total ExecutionListenerBus count 1001. *jmap* !1610451044690.jpg! Each ExecutionListenerBus holds one SparkSession instance, so JVM can't collect these SparkSession object > a memory leak occurs when we clone the spark session > > > Key: SPARK-34087 > URL: https://issues.apache.org/jira/browse/SPARK-34087 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Fu Chen >Priority: Major > Attachments: 1610451044690.jpg > > > In Spark-3.0.1, the memory leak occurs when we keep cloning the spark session > because a new ExecutionListenerBus instance will add to AsyncEventQueue when > we clone a new session. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-34087) a memory leak occurs when we clone the spark session
[ https://issues.apache.org/jira/browse/SPARK-34087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fu Chen updated SPARK-34087: Comment: was deleted (was: *bug replay* here is code for replay this bug: {code:java} test("bug replay") { (1 to 1000).foreach(i => { spark.cloneSession() }) val cnt = spark.sparkContext .listenerBus .listeners .asScala .collect{ case e: ExecutionListenerBus => e} .size println(s"total ExecutionListenerBus count ${cnt}.") Thread.sleep(Int.MaxValue) } {code} *output:* total ExecutionListenerBus count 1001. *jmap* *!1610451044690.jpg!* Each ExecutionListenerBus holds one SparkSession instance, so JVM can't collect these SparkSession object) > a memory leak occurs when we clone the spark session > > > Key: SPARK-34087 > URL: https://issues.apache.org/jira/browse/SPARK-34087 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Fu Chen >Priority: Major > Attachments: 1610451044690.jpg > > > In Spark-3.0.1, the memory leak occurs when we keep cloning the spark session > because a new ExecutionListenerBus instance will add to AsyncEventQueue when > we clone a new session. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34087) a memory leak occurs when we clone the spark session
[ https://issues.apache.org/jira/browse/SPARK-34087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fu Chen updated SPARK-34087: Attachment: 1610451044690.jpg > a memory leak occurs when we clone the spark session > > > Key: SPARK-34087 > URL: https://issues.apache.org/jira/browse/SPARK-34087 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Fu Chen >Priority: Major > Attachments: 1610451044690.jpg > > > In Spark-3.0.1, the memory leak occurs when we keep cloning the spark session > because a new ExecutionListenerBus instance will add to AsyncEventQueue when > we clone a new session. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-34087) a memory leak occurs when we clone the spark session
[ https://issues.apache.org/jira/browse/SPARK-34087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263266#comment-17263266 ] Fu Chen edited comment on SPARK-34087 at 1/12/21, 12:02 PM: *bug replay* here is code for replay this bug: {code:java} test("bug replay") { (1 to 1000).foreach(i => { spark.cloneSession() SparkSession.clearActiveSession() }) val cnt = spark.sparkContext .listenerBus .listeners .asScala .collect{ case e: ExecutionListenerBus => e} .size println(s"total ExecutionListenerBus count ${cnt}.") Thread.sleep(Int.MaxValue) } {code} *output:* total ExecutionListenerBus count 1001. *jmap* !1610451044690.jpg! Each ExecutionListenerBus holds one SparkSession instance, so JVM can't collect these SparkSession object was (Author: fchen): *bug replay* here is code for replay this bug: {code:java} test("bug replay") { (1 to 1000).foreach(i => { spark.cloneSession() }) val cnt = spark.sparkContext .listenerBus .listeners .asScala .collect{ case e: ExecutionListenerBus => e} .size println(s"total ExecutionListenerBus count ${cnt}.") Thread.sleep(Int.MaxValue) } {code} *output:* total ExecutionListenerBus count 1001. *jmap* !1610451044690.jpg! Each ExecutionListenerBus holds one SparkSession instance, so JVM can't collect these SparkSession object > a memory leak occurs when we clone the spark session > > > Key: SPARK-34087 > URL: https://issues.apache.org/jira/browse/SPARK-34087 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Fu Chen >Priority: Major > Attachments: 1610451044690.jpg > > > In Spark-3.0.1, the memory leak occurs when we keep cloning the spark session > because a new ExecutionListenerBus instance will add to AsyncEventQueue when > we clone a new session. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34088) Rename all decommission configurations to use the same namespace "spark.decommission.*"
[ https://issues.apache.org/jira/browse/SPARK-34088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi updated SPARK-34088: - Summary: Rename all decommission configurations to use the same namespace "spark.decommission.*" (was: Rename all decommission configurations to fix same namespace "spark.decommission.*") > Rename all decommission configurations to use the same namespace > "spark.decommission.*" > --- > > Key: SPARK-34088 > URL: https://issues.apache.org/jira/browse/SPARK-34088 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: wuyi >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34088) Rename all decommission configurations to fix same namespace "spark.decommission.*"
wuyi created SPARK-34088: Summary: Rename all decommission configurations to fix same namespace "spark.decommission.*" Key: SPARK-34088 URL: https://issues.apache.org/jira/browse/SPARK-34088 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.1.0 Reporter: wuyi -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-34087) a memory leak occurs when we clone the spark session
[ https://issues.apache.org/jira/browse/SPARK-34087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263266#comment-17263266 ] Fu Chen edited comment on SPARK-34087 at 1/12/21, 12:03 PM: *bug replay* here is code for replay this bug: {code:java} // run with spark-3.0.1 test("bug replay") { (1 to 1000).foreach(i => { spark.cloneSession() SparkSession.clearActiveSession() }) val cnt = spark.sparkContext .listenerBus .listeners .asScala .collect{ case e: ExecutionListenerBus => e} .size println(s"total ExecutionListenerBus count ${cnt}.") Thread.sleep(Int.MaxValue) } {code} *output:* total ExecutionListenerBus count 1001. *jmap* !1610451044690.jpg! Each ExecutionListenerBus holds one SparkSession instance, so JVM can't collect these SparkSession object was (Author: fchen): *bug replay* here is code for replay this bug: {code:java} test("bug replay") { (1 to 1000).foreach(i => { spark.cloneSession() SparkSession.clearActiveSession() }) val cnt = spark.sparkContext .listenerBus .listeners .asScala .collect{ case e: ExecutionListenerBus => e} .size println(s"total ExecutionListenerBus count ${cnt}.") Thread.sleep(Int.MaxValue) } {code} *output:* total ExecutionListenerBus count 1001. *jmap* !1610451044690.jpg! Each ExecutionListenerBus holds one SparkSession instance, so JVM can't collect these SparkSession object > a memory leak occurs when we clone the spark session > > > Key: SPARK-34087 > URL: https://issues.apache.org/jira/browse/SPARK-34087 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Fu Chen >Priority: Major > Attachments: 1610451044690.jpg > > > In Spark-3.0.1, the memory leak occurs when we keep cloning the spark session > because a new ExecutionListenerBus instance will add to AsyncEventQueue when > we clone a new session. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34088) Rename all decommission configurations to use the same namespace "spark.decommission.*"
[ https://issues.apache.org/jira/browse/SPARK-34088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi updated SPARK-34088: - Description: Currently, decommission configurations are using different namespaces, e.g., * spark.decommission * spark.storage.decommission * spark.executor.decommission which may introduce unnecessary overhead for end-users. It's better to keep them under the same namespace. > Rename all decommission configurations to use the same namespace > "spark.decommission.*" > --- > > Key: SPARK-34088 > URL: https://issues.apache.org/jira/browse/SPARK-34088 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: wuyi >Priority: Major > > Currently, decommission configurations are using different namespaces, e.g., > * spark.decommission > * spark.storage.decommission > * spark.executor.decommission > which may introduce unnecessary overhead for end-users. It's better to keep > them under the same namespace. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34088) Rename all decommission configurations to use the same namespace "spark.decommission.*"
[ https://issues.apache.org/jira/browse/SPARK-34088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34088: Assignee: Apache Spark > Rename all decommission configurations to use the same namespace > "spark.decommission.*" > --- > > Key: SPARK-34088 > URL: https://issues.apache.org/jira/browse/SPARK-34088 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: wuyi >Assignee: Apache Spark >Priority: Major > > Currently, decommission configurations are using difference namespaces, e.g., > * spark.decommission > * spark.storage.decommission > * spark.executor.decommission > which may introduce unnecessary overhead for end-users. It's better to keep > them under the same namespace. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34088) Rename all decommission configurations to use the same namespace "spark.decommission.*"
[ https://issues.apache.org/jira/browse/SPARK-34088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263289#comment-17263289 ] Apache Spark commented on SPARK-34088: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/31151 > Rename all decommission configurations to use the same namespace > "spark.decommission.*" > --- > > Key: SPARK-34088 > URL: https://issues.apache.org/jira/browse/SPARK-34088 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: wuyi >Priority: Major > > Currently, decommission configurations are using difference namespaces, e.g., > * spark.decommission > * spark.storage.decommission > * spark.executor.decommission > which may introduce unnecessary overhead for end-users. It's better to keep > them under the same namespace. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34088) Rename all decommission configurations to use the same namespace "spark.decommission.*"
[ https://issues.apache.org/jira/browse/SPARK-34088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34088: Assignee: (was: Apache Spark) > Rename all decommission configurations to use the same namespace > "spark.decommission.*" > --- > > Key: SPARK-34088 > URL: https://issues.apache.org/jira/browse/SPARK-34088 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: wuyi >Priority: Major > > Currently, decommission configurations are using difference namespaces, e.g., > * spark.decommission > * spark.storage.decommission > * spark.executor.decommission > which may introduce unnecessary overhead for end-users. It's better to keep > them under the same namespace. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34089) MemoryConsumer's memory mode should respect MemoryManager's memory mode
wuyi created SPARK-34089: Summary: MemoryConsumer's memory mode should respect MemoryManager's memory mode Key: SPARK-34089 URL: https://issues.apache.org/jira/browse/SPARK-34089 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0, 2.4.7, 3.1.0 Reporter: wuyi Currently, the memory mode is always set to ON_HEAP for a memory consumer when it is not explicitly set. However, we can actually determine the specific memory mode via taskMemoryManager.getTungstenMemoryMode(). [https://github.com/apache/spark/blob/3a299aa6480ac22501512cd0310d31a441d7dfdc/core/src/main/java/org/apache/spark/memory/MemoryConsumer.java#L43-L45] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
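A minimal sketch of the suggested direction: let the consumer derive its mode from the TaskMemoryManager rather than defaulting to ON_HEAP. ModeAwareConsumer is an illustrative name and not the actual change; the three-argument MemoryConsumer constructor, pageSizeBytes() and getTungstenMemoryMode() are existing Spark APIs.
{code:scala}
// Illustrative only: forward the TaskMemoryManager's Tungsten memory mode to the
// three-argument MemoryConsumer constructor instead of relying on the ON_HEAP default.
import org.apache.spark.memory.{MemoryConsumer, TaskMemoryManager}

abstract class ModeAwareConsumer(tmm: TaskMemoryManager)
  extends MemoryConsumer(tmm, tmm.pageSizeBytes(), tmm.getTungstenMemoryMode()) {
  // spill(size, trigger) is left to concrete subclasses, as for any MemoryConsumer
}
{code}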
[jira] [Commented] (SPARK-34089) MemoryConsumer's memory mode should respect MemoryManager's memory mode
[ https://issues.apache.org/jira/browse/SPARK-34089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263316#comment-17263316 ] Apache Spark commented on SPARK-34089: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/31152 > MemoryConsumer's memory mode should respect MemoryManager's memory mode > --- > > Key: SPARK-34089 > URL: https://issues.apache.org/jira/browse/SPARK-34089 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.7, 3.0.0, 3.1.0 >Reporter: wuyi >Priority: Major > > Currently, the memory mode always set to ON_HEAP for memory consumer when > it's not explicitly set. > However, we actually can know the specific memory mode by > taskMemoryManager.getTungstenMemoryMode(). > > [https://github.com/apache/spark/blob/3a299aa6480ac22501512cd0310d31a441d7dfdc/core/src/main/java/org/apache/spark/memory/MemoryConsumer.java#L43-L45] > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34089) MemoryConsumer's memory mode should respect MemoryManager's memory mode
[ https://issues.apache.org/jira/browse/SPARK-34089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34089: Assignee: (was: Apache Spark) > MemoryConsumer's memory mode should respect MemoryManager's memory mode > --- > > Key: SPARK-34089 > URL: https://issues.apache.org/jira/browse/SPARK-34089 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.7, 3.0.0, 3.1.0 >Reporter: wuyi >Priority: Major > > Currently, the memory mode always set to ON_HEAP for memory consumer when > it's not explicitly set. > However, we actually can know the specific memory mode by > taskMemoryManager.getTungstenMemoryMode(). > > [https://github.com/apache/spark/blob/3a299aa6480ac22501512cd0310d31a441d7dfdc/core/src/main/java/org/apache/spark/memory/MemoryConsumer.java#L43-L45] > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34089) MemoryConsumer's memory mode should respect MemoryManager's memory mode
[ https://issues.apache.org/jira/browse/SPARK-34089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34089: Assignee: Apache Spark > MemoryConsumer's memory mode should respect MemoryManager's memory mode > --- > > Key: SPARK-34089 > URL: https://issues.apache.org/jira/browse/SPARK-34089 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.7, 3.0.0, 3.1.0 >Reporter: wuyi >Assignee: Apache Spark >Priority: Major > > Currently, the memory mode always set to ON_HEAP for memory consumer when > it's not explicitly set. > However, we actually can know the specific memory mode by > taskMemoryManager.getTungstenMemoryMode(). > > [https://github.com/apache/spark/blob/3a299aa6480ac22501512cd0310d31a441d7dfdc/core/src/main/java/org/apache/spark/memory/MemoryConsumer.java#L43-L45] > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34089) MemoryConsumer's memory mode should respect MemoryManager's memory mode
[ https://issues.apache.org/jira/browse/SPARK-34089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263318#comment-17263318 ] Apache Spark commented on SPARK-34089: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/31152 > MemoryConsumer's memory mode should respect MemoryManager's memory mode > --- > > Key: SPARK-34089 > URL: https://issues.apache.org/jira/browse/SPARK-34089 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.7, 3.0.0, 3.1.0 >Reporter: wuyi >Priority: Major > > Currently, the memory mode always set to ON_HEAP for memory consumer when > it's not explicitly set. > However, we actually can know the specific memory mode by > taskMemoryManager.getTungstenMemoryMode(). > > [https://github.com/apache/spark/blob/3a299aa6480ac22501512cd0310d31a441d7dfdc/core/src/main/java/org/apache/spark/memory/MemoryConsumer.java#L43-L45] > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34090) HadoopDelegationTokenManager.isServiceEnabled used in KafkaTokenUtil.needTokenUpdate needs to be cached because it slows down Kafka stream processing in case of delegati
Gabor Somogyi created SPARK-34090: - Summary: HadoopDelegationTokenManager.isServiceEnabled used in KafkaTokenUtil.needTokenUpdate needs to be cached because it slows down Kafka stream processing in case of delegation token Key: SPARK-34090 URL: https://issues.apache.org/jira/browse/SPARK-34090 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 3.1.1 Reporter: Gabor Somogyi -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
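The title already states the remedy: cache the result of the expensive check. Below is a generic memoisation sketch; ServiceEnabledCache is a made-up name, not the actual HadoopDelegationTokenManager change.
{code:scala}
// Generic caching sketch (illustrative, not the Spark patch): evaluate the expensive
// check once and reuse the result on the per-record hot path of the Kafka stream.
class ServiceEnabledCache(compute: () => Boolean) {
  @volatile private var cached: Option[Boolean] = None
  def isEnabled: Boolean = cached match {
    case Some(v) => v
    case None =>
      val v = compute()
      cached = Some(v)   // benign race: compute() is assumed idempotent
      v
  }
}

// Usage sketch: val cache = new ServiceEnabledCache(() => expensiveIsServiceEnabledCheck())
{code}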
[jira] [Created] (SPARK-34091) Shuffle batch fetch can't be disabled once it's enabled previously
wuyi created SPARK-34091: Summary: Shuffle batch fetch can't be disabled once it's enabled previously Key: SPARK-34091 URL: https://issues.apache.org/jira/browse/SPARK-34091 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0, 3.1.0 Reporter: wuyi {code:java} if (SQLConf.get.fetchShuffleBlocksInBatch) { dependency.rdd.context.setLocalProperty( SortShuffleManager.FETCH_SHUFFLE_BLOCKS_IN_BATCH_ENABLED_KEY, "true") } {code} The current code has the problem that once we set `fetchShuffleBlocksInBatch` to true, we can never disable batch fetch again, even if we set `fetchShuffleBlocksInBatch` to false later. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
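One possible shape of a fix, sketched under the assumption that the surrounding context is the snippet quoted above (so `dependency`, SQLConf and SortShuffleManager are the same identifiers): write the current flag value into the local property unconditionally instead of only ever writing "true".
{code:scala}
// Sketch only, not necessarily the merged change: propagate the current value of the
// flag on every invocation, so switching it back to false actually disables batch fetch.
dependency.rdd.context.setLocalProperty(
  SortShuffleManager.FETCH_SHUFFLE_BLOCKS_IN_BATCH_ENABLED_KEY,
  SQLConf.get.fetchShuffleBlocksInBatch.toString)
{code}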
[jira] [Commented] (SPARK-34090) HadoopDelegationTokenManager.isServiceEnabled used in KafkaTokenUtil.needTokenUpdate needs to be cached because it slows down Kafka stream processing in case of delega
[ https://issues.apache.org/jira/browse/SPARK-34090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263336#comment-17263336 ] Apache Spark commented on SPARK-34090: -- User 'gaborgsomogyi' has created a pull request for this issue: https://github.com/apache/spark/pull/31154 > HadoopDelegationTokenManager.isServiceEnabled used in > KafkaTokenUtil.needTokenUpdate needs to be cached because it slows down Kafka > stream processing in case of delegation token > - > > Key: SPARK-34090 > URL: https://issues.apache.org/jira/browse/SPARK-34090 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.1.1 >Reporter: Gabor Somogyi >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34090) HadoopDelegationTokenManager.isServiceEnabled used in KafkaTokenUtil.needTokenUpdate needs to be cached because it slows down Kafka stream processing in case of delegat
[ https://issues.apache.org/jira/browse/SPARK-34090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34090: Assignee: Apache Spark > HadoopDelegationTokenManager.isServiceEnabled used in > KafkaTokenUtil.needTokenUpdate needs to be cached because it slows down Kafka > stream processing in case of delegation token > - > > Key: SPARK-34090 > URL: https://issues.apache.org/jira/browse/SPARK-34090 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.1.1 >Reporter: Gabor Somogyi >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34090) HadoopDelegationTokenManager.isServiceEnabled used in KafkaTokenUtil.needTokenUpdate needs to be cached because it slows down Kafka stream processing in case of delegat
[ https://issues.apache.org/jira/browse/SPARK-34090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34090: Assignee: (was: Apache Spark) > HadoopDelegationTokenManager.isServiceEnabled used in > KafkaTokenUtil.needTokenUpdate needs to be cached because it slows down Kafka > stream processing in case of delegation token > - > > Key: SPARK-34090 > URL: https://issues.apache.org/jira/browse/SPARK-34090 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.1.1 >Reporter: Gabor Somogyi >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34090) HadoopDelegationTokenManager.isServiceEnabled used in KafkaTokenUtil.needTokenUpdate needs to be cached because it slows down Kafka stream processing in case of delega
[ https://issues.apache.org/jira/browse/SPARK-34090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263338#comment-17263338 ] Apache Spark commented on SPARK-34090: -- User 'gaborgsomogyi' has created a pull request for this issue: https://github.com/apache/spark/pull/31154 > HadoopDelegationTokenManager.isServiceEnabled used in > KafkaTokenUtil.needTokenUpdate needs to be cached because it slows down Kafka > stream processing in case of delegation token > - > > Key: SPARK-34090 > URL: https://issues.apache.org/jira/browse/SPARK-34090 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.1.1 >Reporter: Gabor Somogyi >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34091) Shuffle batch fetch can't be disabled once it's enabled previously
[ https://issues.apache.org/jira/browse/SPARK-34091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34091: Assignee: (was: Apache Spark) > Shuffle batch fetch can't be disabled once it's enabled previously > -- > > Key: SPARK-34091 > URL: https://issues.apache.org/jira/browse/SPARK-34091 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: wuyi >Priority: Major > > {code:java} > if (SQLConf.get.fetchShuffleBlocksInBatch) { > dependency.rdd.context.setLocalProperty( > SortShuffleManager.FETCH_SHUFFLE_BLOCKS_IN_BATCH_ENABLED_KEY, "true") > } > {code} > The current code has a problem that once we set `fetchShuffleBlocksInBatch` > to true first, we can never disable the batch fetch even if set > `fetchShuffleBlocksInBatch` to false later. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34091) Shuffle batch fetch can't be disabled once it's enabled previously
[ https://issues.apache.org/jira/browse/SPARK-34091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34091: Assignee: Apache Spark > Shuffle batch fetch can't be disabled once it's enabled previously > -- > > Key: SPARK-34091 > URL: https://issues.apache.org/jira/browse/SPARK-34091 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: wuyi >Assignee: Apache Spark >Priority: Major > > {code:java} > if (SQLConf.get.fetchShuffleBlocksInBatch) { > dependency.rdd.context.setLocalProperty( > SortShuffleManager.FETCH_SHUFFLE_BLOCKS_IN_BATCH_ENABLED_KEY, "true") > } > {code} > The current code has a problem that once we set `fetchShuffleBlocksInBatch` > to true first, we can never disable the batch fetch even if set > `fetchShuffleBlocksInBatch` to false later. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34091) Shuffle batch fetch can't be disabled once it's enabled previously
[ https://issues.apache.org/jira/browse/SPARK-34091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263343#comment-17263343 ] Apache Spark commented on SPARK-34091: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/31155 > Shuffle batch fetch can't be disabled once it's enabled previously > -- > > Key: SPARK-34091 > URL: https://issues.apache.org/jira/browse/SPARK-34091 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: wuyi >Priority: Major > > {code:java} > if (SQLConf.get.fetchShuffleBlocksInBatch) { > dependency.rdd.context.setLocalProperty( > SortShuffleManager.FETCH_SHUFFLE_BLOCKS_IN_BATCH_ENABLED_KEY, "true") > } > {code} > The current code has a problem that once we set `fetchShuffleBlocksInBatch` > to true first, we can never disable the batch fetch even if set > `fetchShuffleBlocksInBatch` to false later. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34053) Please reduce GitHub Actions matrix or improve the build time
[ https://issues.apache.org/jira/browse/SPARK-34053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263347#comment-17263347 ] Apache Spark commented on SPARK-34053: -- User 'potiuk' has created a pull request for this issue: https://github.com/apache/spark/pull/31153 > Please reduce GitHub Actions matrix or improve the build time > - > > Key: SPARK-34053 > URL: https://issues.apache.org/jira/browse/SPARK-34053 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 3.0.1, 3.1.0, 3.2.0 >Reporter: Vladimir Sitnikov >Assignee: Kamil Bregula >Priority: Critical > Fix For: 3.2.0 > > Attachments: Screen Shot 2021-01-08 at 2.57.07 PM.png > > > GitHub Actions queue is very high for Apache projects, and it looks like a > significant number of executors are occupied by Spark jobs :-( > Note: all Apache projects share the same limit of shared GitHub Actions > runners, and based on the chart below Spark is consuming 20+ runners while > the total limit (for all ASF projects) is 180. > See > https://lists.apache.org/thread.html/r5303eec41cc1dfc51c15dbe44770e37369330f9644ef09813f649120%40%3Cbuilds.apache.org%3E > >number of GA workflows in/progress/queued per project and they clearly show > >the situation is getting worse by day: https://pasteboard.co/JIJa5Xg.png -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34053) Please reduce GitHub Actions matrix or improve the build time
[ https://issues.apache.org/jira/browse/SPARK-34053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263346#comment-17263346 ] Apache Spark commented on SPARK-34053: -- User 'potiuk' has created a pull request for this issue: https://github.com/apache/spark/pull/31153 > Please reduce GitHub Actions matrix or improve the build time > - > > Key: SPARK-34053 > URL: https://issues.apache.org/jira/browse/SPARK-34053 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 3.0.1, 3.1.0, 3.2.0 >Reporter: Vladimir Sitnikov >Assignee: Kamil Bregula >Priority: Critical > Fix For: 3.2.0 > > Attachments: Screen Shot 2021-01-08 at 2.57.07 PM.png > > > GitHub Actions queue is very high for Apache projects, and it looks like a > significant number of executors are occupied by Spark jobs :-( > Note: all Apache projects share the same limit of shared GitHub Actions > runners, and based on the chart below Spark is consuming 20+ runners while > the total limit (for all ASF projects) is 180. > See > https://lists.apache.org/thread.html/r5303eec41cc1dfc51c15dbe44770e37369330f9644ef09813f649120%40%3Cbuilds.apache.org%3E > >number of GA workflows in/progress/queued per project and they clearly show > >the situation is getting worse by day: https://pasteboard.co/JIJa5Xg.png -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34084) ALTER TABLE .. ADD PARTITION does not update table stats
[ https://issues.apache.org/jira/browse/SPARK-34084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-34084: --- Assignee: Maxim Gekk > ALTER TABLE .. ADD PARTITION does not update table stats > > > Key: SPARK-34084 > URL: https://issues.apache.org/jira/browse/SPARK-34084 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.2.0, 3.1.1 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > > The example below illustrates the issue: > {code:sql} > spark-sql> create table tbl (col0 int, part int) partitioned by (part); > spark-sql> insert into tbl partition (part = 0) select 0; > spark-sql> set spark.sql.statistics.size.autoUpdate.enabled=true; > spark-sql> alter table tbl add partition (part = 1); > {code} > There are no stats: > {code:sql} > spark-sql> describe table extended tbl; > col0 int NULL > part int NULL > # Partition Information > # col_name data_type comment > part int NULL > # Detailed Table Information > Database default > Table tbl > Owner maximgekk > Created Time Tue Jan 12 12:00:03 MSK 2021 > Last Access UNKNOWN > Created By Spark 3.2.0-SNAPSHOT > Type MANAGED > Provider hive > Table Properties [transient_lastDdlTime=1610442003] > Location > file:/Users/maximgekk/proj/fix-stats-in-add-partition/spark-warehouse/tbl > Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > InputFormat org.apache.hadoop.mapred.TextInputFormat > OutputFormat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > Storage Properties [serialization.format=1] > Partition Provider Catalog > {code} > *As we can see, there are no stats.* For instance, ALTER TABLE .. DROP > PARTITION updates stats: > {code:sql} > spark-sql> alter table tbl drop partition (part = 1); > spark-sql> describe table extended tbl; > col0 int NULL > part int NULL > # Partition Information > # col_name data_type comment > part int NULL > # Detailed Table Information > ... > Statistics 2 bytes > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34084) ALTER TABLE .. ADD PARTITION does not update table stats
[ https://issues.apache.org/jira/browse/SPARK-34084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-34084. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 31149 [https://github.com/apache/spark/pull/31149] > ALTER TABLE .. ADD PARTITION does not update table stats > > > Key: SPARK-34084 > URL: https://issues.apache.org/jira/browse/SPARK-34084 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.2.0, 3.1.1 > Environment: strong text >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.2.0 > > > The example below portraits the issue: > {code:sql} > spark-sql> create table tbl (col0 int, part int) partitioned by (part); > spark-sql> insert into tbl partition (part = 0) select 0; > spark-sql> set spark.sql.statistics.size.autoUpdate.enabled=true; > spark-sql> alter table tbl add partition (part = 1); > {code} > There are no stats: > {code:sql} > spark-sql> describe table extended tbl; > col0 int NULL > part int NULL > # Partition Information > # col_namedata_type comment > part int NULL > # Detailed Table Information > Database default > Table tbl > Owner maximgekk > Created Time Tue Jan 12 12:00:03 MSK 2021 > Last Access UNKNOWN > Created BySpark 3.2.0-SNAPSHOT > Type MANAGED > Provider hive > Table Properties [transient_lastDdlTime=1610442003] > Location > file:/Users/maximgekk/proj/fix-stats-in-add-partition/spark-warehouse/tbl > Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > InputFormat org.apache.hadoop.mapred.TextInputFormat > OutputFormat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > Storage Properties[serialization.format=1] > Partition ProviderCatalog > {code} > *As we can see there is no stats.* For instance, ALTER TABLE .. DROP > PARTITION updates stats: > {code:sql} > spark-sql> alter table tbl drop partition (part = 1); > spark-sql> describe table extended tbl; > col0 int NULL > part int NULL > # Partition Information > # col_namedata_type comment > part int NULL > # Detailed Table Information > ... > Statistics2 bytes > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31144) Wrap java.lang.Error with an exception for QueryExecutionListener.onFailure
[ https://issues.apache.org/jira/browse/SPARK-31144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263425#comment-17263425 ] Alex Vayda commented on SPARK-31144: Don't you think that wrapping an {{Error}} in an {{Exception}}, just to be able to pass it into the method that, strictly speaking, doesn't expect to be called with an {{Error}}, would break the method semantics? Wouldn't it be better to introduce another (third) method, say `onFatal(..., th: Throwable)`, with an empty default implementation (for API backward compatibility), that would be called on errors that are considered fatal from the Java/Scala perspective? See https://www.scala-lang.org/api/2.12.0/scala/util/control/NonFatal$.html > Wrap java.lang.Error with an exception for QueryExecutionListener.onFailure > --- > > Key: SPARK-31144 > URL: https://issues.apache.org/jira/browse/SPARK-31144 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Major > Fix For: 2.4.6, 3.0.0 > > > SPARK-28556 changed the method QueryExecutionListener.onFailure to allow > Spark to send java.lang.Error to this method. As this change breaks APIs, we > cannot fix branch-2.4. > [~marmbrus] suggested wrapping java.lang.Error with an exception instead to > avoid a breaking change. A bonus of this solution is that we can also fix the > issue (if a query throws java.lang.Error, QueryExecutionListener doesn't get > notified) in branch-2.4. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
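For context, here is a rough Scala sketch of the alternative suggested in this comment. The trait below paraphrases the listener interface rather than quoting it, and the {{onFatal}} name and signature are hypothetical, not an agreed API:

{code:scala}
import org.apache.spark.sql.execution.QueryExecution

trait QueryExecutionListenerSketch {
  def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit
  def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit

  // Hypothetical third callback: invoked for JVM-fatal Throwables (java.lang.Error).
  // The empty default implementation keeps existing listeners source-compatible,
  // which is the backward-compatibility concern raised above.
  def onFatal(funcName: String, qe: QueryExecution, error: Throwable): Unit = {}
}
{code}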
[jira] [Assigned] (SPARK-34091) Shuffle batch fetch can't be disabled once it's enabled previously
[ https://issues.apache.org/jira/browse/SPARK-34091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-34091: --- Assignee: wuyi > Shuffle batch fetch can't be disabled once it's enabled previously > -- > > Key: SPARK-34091 > URL: https://issues.apache.org/jira/browse/SPARK-34091 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > > {code:java} > if (SQLConf.get.fetchShuffleBlocksInBatch) { > dependency.rdd.context.setLocalProperty( > SortShuffleManager.FETCH_SHUFFLE_BLOCKS_IN_BATCH_ENABLED_KEY, "true") > } > {code} > The current code has a problem that once we set `fetchShuffleBlocksInBatch` > to true first, we can never disable the batch fetch even if set > `fetchShuffleBlocksInBatch` to false later. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34091) Shuffle batch fetch can't be disabled once it's enabled previously
[ https://issues.apache.org/jira/browse/SPARK-34091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-34091. - Fix Version/s: 3.1.1 Resolution: Fixed Issue resolved by pull request 31155 [https://github.com/apache/spark/pull/31155] > Shuffle batch fetch can't be disabled once it's enabled previously > -- > > Key: SPARK-34091 > URL: https://issues.apache.org/jira/browse/SPARK-34091 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > Fix For: 3.1.1 > > > {code:java} > if (SQLConf.get.fetchShuffleBlocksInBatch) { > dependency.rdd.context.setLocalProperty( > SortShuffleManager.FETCH_SHUFFLE_BLOCKS_IN_BATCH_ENABLED_KEY, "true") > } > {code} > The current code has a problem that once we set `fetchShuffleBlocksInBatch` > to true first, we can never disable the batch fetch even if set > `fetchShuffleBlocksInBatch` to false later. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34087) a memory leak occurs when we clone the spark session
[ https://issues.apache.org/jira/browse/SPARK-34087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34087: Assignee: Apache Spark > a memory leak occurs when we clone the spark session > > > Key: SPARK-34087 > URL: https://issues.apache.org/jira/browse/SPARK-34087 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Fu Chen >Assignee: Apache Spark >Priority: Major > Attachments: 1610451044690.jpg > > > In Spark 3.0.1, a memory leak occurs when we keep cloning the Spark session, > because a new ExecutionListenerBus instance is added to the AsyncEventQueue every time > we clone a session. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
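A hedged sketch of how the leak could be observed. {{SparkSession.cloneSession()}} is {{private[sql]}}, so this snippet assumes it runs from a helper placed in the {{org.apache.spark.sql}} package (or via any code path that clones the session internally):

{code:scala}
package org.apache.spark.sql

// Sketch only: every clone registers another ExecutionListenerBus on the shared
// AsyncEventQueue, so the listener count keeps growing even though the cloned
// sessions themselves are discarded right away.
object CloneLeakRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("clone-leak").getOrCreate()
    (1 to 1000).foreach { _ =>
      spark.cloneSession() // the clone is dropped immediately, but its listener is not
    }
    // A heap dump taken here should show on the order of 1000 ExecutionListenerBus
    // instances still registered.
    spark.stop()
  }
}
{code}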
[jira] [Assigned] (SPARK-34087) a memory leak occurs when we clone the spark session
[ https://issues.apache.org/jira/browse/SPARK-34087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34087: Assignee: (was: Apache Spark) > a memory leak occurs when we clone the spark session > > > Key: SPARK-34087 > URL: https://issues.apache.org/jira/browse/SPARK-34087 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Fu Chen >Priority: Major > Attachments: 1610451044690.jpg > > > In Spark-3.0.1, the memory leak occurs when we keep cloning the spark session > because a new ExecutionListenerBus instance will add to AsyncEventQueue when > we clone a new session. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34087) a memory leak occurs when we clone the spark session
[ https://issues.apache.org/jira/browse/SPARK-34087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263498#comment-17263498 ] Apache Spark commented on SPARK-34087: -- User 'cfmcgrady' has created a pull request for this issue: https://github.com/apache/spark/pull/31156 > a memory leak occurs when we clone the spark session > > > Key: SPARK-34087 > URL: https://issues.apache.org/jira/browse/SPARK-34087 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Fu Chen >Priority: Major > Attachments: 1610451044690.jpg > > > In Spark-3.0.1, the memory leak occurs when we keep cloning the spark session > because a new ExecutionListenerBus instance will add to AsyncEventQueue when > we clone a new session. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34084) ALTER TABLE .. ADD PARTITION does not update table stats
[ https://issues.apache.org/jira/browse/SPARK-34084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263571#comment-17263571 ] Apache Spark commented on SPARK-34084: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/31157 > ALTER TABLE .. ADD PARTITION does not update table stats > > > Key: SPARK-34084 > URL: https://issues.apache.org/jira/browse/SPARK-34084 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.2.0, 3.1.1 > Environment: strong text >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.2.0 > > > The example below portraits the issue: > {code:sql} > spark-sql> create table tbl (col0 int, part int) partitioned by (part); > spark-sql> insert into tbl partition (part = 0) select 0; > spark-sql> set spark.sql.statistics.size.autoUpdate.enabled=true; > spark-sql> alter table tbl add partition (part = 1); > {code} > There are no stats: > {code:sql} > spark-sql> describe table extended tbl; > col0 int NULL > part int NULL > # Partition Information > # col_namedata_type comment > part int NULL > # Detailed Table Information > Database default > Table tbl > Owner maximgekk > Created Time Tue Jan 12 12:00:03 MSK 2021 > Last Access UNKNOWN > Created BySpark 3.2.0-SNAPSHOT > Type MANAGED > Provider hive > Table Properties [transient_lastDdlTime=1610442003] > Location > file:/Users/maximgekk/proj/fix-stats-in-add-partition/spark-warehouse/tbl > Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > InputFormat org.apache.hadoop.mapred.TextInputFormat > OutputFormat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > Storage Properties[serialization.format=1] > Partition ProviderCatalog > {code} > *As we can see there is no stats.* For instance, ALTER TABLE .. DROP > PARTITION updates stats: > {code:sql} > spark-sql> alter table tbl drop partition (part = 1); > spark-sql> describe table extended tbl; > col0 int NULL > part int NULL > # Partition Information > # col_namedata_type comment > part int NULL > # Detailed Table Information > ... > Statistics2 bytes > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34084) ALTER TABLE .. ADD PARTITION does not update table stats
[ https://issues.apache.org/jira/browse/SPARK-34084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263581#comment-17263581 ] Apache Spark commented on SPARK-34084: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/31158 > ALTER TABLE .. ADD PARTITION does not update table stats > > > Key: SPARK-34084 > URL: https://issues.apache.org/jira/browse/SPARK-34084 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.2.0, 3.1.1 > Environment: strong text >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.2.0 > > > The example below portraits the issue: > {code:sql} > spark-sql> create table tbl (col0 int, part int) partitioned by (part); > spark-sql> insert into tbl partition (part = 0) select 0; > spark-sql> set spark.sql.statistics.size.autoUpdate.enabled=true; > spark-sql> alter table tbl add partition (part = 1); > {code} > There are no stats: > {code:sql} > spark-sql> describe table extended tbl; > col0 int NULL > part int NULL > # Partition Information > # col_namedata_type comment > part int NULL > # Detailed Table Information > Database default > Table tbl > Owner maximgekk > Created Time Tue Jan 12 12:00:03 MSK 2021 > Last Access UNKNOWN > Created BySpark 3.2.0-SNAPSHOT > Type MANAGED > Provider hive > Table Properties [transient_lastDdlTime=1610442003] > Location > file:/Users/maximgekk/proj/fix-stats-in-add-partition/spark-warehouse/tbl > Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > InputFormat org.apache.hadoop.mapred.TextInputFormat > OutputFormat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > Storage Properties[serialization.format=1] > Partition ProviderCatalog > {code} > *As we can see there is no stats.* For instance, ALTER TABLE .. DROP > PARTITION updates stats: > {code:sql} > spark-sql> alter table tbl drop partition (part = 1); > spark-sql> describe table extended tbl; > col0 int NULL > part int NULL > # Partition Information > # col_namedata_type comment > part int NULL > # Detailed Table Information > ... > Statistics2 bytes > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34069) Kill barrier tasks should respect SPARK_JOB_INTERRUPT_ON_CANCEL
[ https://issues.apache.org/jira/browse/SPARK-34069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan resolved SPARK-34069. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 31127 [https://github.com/apache/spark/pull/31127] > Kill barrier tasks should respect SPARK_JOB_INTERRUPT_ON_CANCEL > --- > > Key: SPARK-34069 > URL: https://issues.apache.org/jira/browse/SPARK-34069 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: ulysses you >Assignee: ulysses you >Priority: Major > Fix For: 3.1.0 > > > We should interrupt the task thread if the user sets the local property > `SPARK_JOB_INTERRUPT_ON_CANCEL` to true. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
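A small usage sketch, assuming a running {{SparkContext}} {{sc}} (for example in spark-shell). The public way to set this local property is {{setJobGroup}} with {{interruptOnCancel = true}}; with the fix above, cancelling the job group should also interrupt the barrier task threads:

{code:scala}
import org.apache.spark.BarrierTaskContext

// Mark jobs submitted from this thread as interrupt-on-cancel.
sc.setJobGroup("barrier-demo", "barrier stage honouring interruptOnCancel", interruptOnCancel = true)

val rdd = sc.parallelize(1 to 4, 4).barrier().mapPartitions { iter =>
  BarrierTaskContext.get().barrier() // all tasks wait for each other
  iter
}
// Calling sc.cancelJobGroup("barrier-demo") from another thread while this job runs
// should now interrupt the running barrier tasks instead of leaving them blocked.
rdd.collect()
{code}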
[jira] [Assigned] (SPARK-34069) Kill barrier tasks should respect SPARK_JOB_INTERRUPT_ON_CANCEL
[ https://issues.apache.org/jira/browse/SPARK-34069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan reassigned SPARK-34069: --- Assignee: ulysses you > Kill barrier tasks should respect SPARK_JOB_INTERRUPT_ON_CANCEL > --- > > Key: SPARK-34069 > URL: https://issues.apache.org/jira/browse/SPARK-34069 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: ulysses you >Assignee: ulysses you >Priority: Major > > We should interrupt task thread if user set local property > `SPARK_JOB_INTERRUPT_ON_CANCEL` to true. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32691) Update commons-crypto to v1.1.0
[ https://issues.apache.org/jira/browse/SPARK-32691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-32691: -- Fix Version/s: 3.0.2 > Update commons-crypto to v1.1.0 > --- > > Key: SPARK-32691 > URL: https://issues.apache.org/jira/browse/SPARK-32691 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 2.4.7, 3.0.0, 3.0.1, 3.1.0 > Environment: ARM64 >Reporter: huangtianhua >Assignee: huangtianhua >Priority: Major > Fix For: 3.0.2, 3.1.0 > > Attachments: Screen Shot 2020-09-28 at 8.49.04 AM.png, failure.log, > success.log > > > Tests of org.apache.spark.DistributedSuite are failed on arm64 jenkins: > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-arm/ > - caching in memory and disk, replicated (encryption = on) (with > replication as stream) *** FAILED *** > 3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191) > - caching in memory and disk, serialized, replicated (encryption = on) > (with replication as stream) *** FAILED *** > 3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191) > - caching in memory, serialized, replicated (encryption = on) (with > replication as stream) *** FAILED *** > 3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191) > ... > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32338) Add overload for slice that accepts Columns or Int
[ https://issues.apache.org/jira/browse/SPARK-32338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263725#comment-17263725 ] Apache Spark commented on SPARK-32338: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/31159 > Add overload for slice that accepts Columns or Int > -- > > Key: SPARK-32338 > URL: https://issues.apache.org/jira/browse/SPARK-32338 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Nikolas Vanderhoof >Assignee: Nikolas Vanderhoof >Priority: Trivial > Fix For: 3.1.0 > > > Add an overload for org.apache.spark.sql.functions.slice with the following > signature: > {code:scala} > def slice(x: Column, start: Any, length: Any): Column = ??? > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
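A hedged usage sketch, assuming the overload lands roughly as described (with {{start}} and {{length}} accepted as {{Column}}s) and a spark-shell session where {{spark}} and its implicits are available:

{code:scala}
import org.apache.spark.sql.functions.{col, slice}
import spark.implicits._

val df = Seq((Seq(1, 2, 3, 4, 5), 2, 3)).toDF("xs", "start", "len")

// Previously start/length had to be Int literals; with the new overload they can be
// driven by other columns. Expected result for this row: [2, 3, 4].
df.select(slice(col("xs"), col("start"), col("len")).as("sliced")).show()
{code}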
[jira] [Commented] (SPARK-34080) Add UnivariateFeatureSelector to deprecate existing selectors
[ https://issues.apache.org/jira/browse/SPARK-34080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263754#comment-17263754 ] Apache Spark commented on SPARK-34080: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/31160 > Add UnivariateFeatureSelector to deprecate existing selectors > - > > Key: SPARK-34080 > URL: https://issues.apache.org/jira/browse/SPARK-34080 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 3.2.0, 3.1.1 >Reporter: Xiangrui Meng >Priority: Major > > In SPARK-26111, we introduced a few univariate feature selectors, which share > a common set of params. They are named after the underlying test, which > requires users to understand the test to find the matching scenario. It would > be nice to introduce a single class called UnivariateFeatureSelector that > accepts a selection criterion and a score method (string names). Then we can > deprecate all other univariate selectors. > For the params, instead of asking users to specify which score function to use, > it is friendlier to ask them to specify the feature and label types > (continuous or categorical) and we set a default score function for each > combo. We can also detect the types from feature metadata if given. Advanced > users can override it (if there are multiple score functions that are > compatible with the feature type and label type combo). Example (param names > are not finalized): > {code} > selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], > labelCol=["target"], featureType="categorical", labelType="continuous", > select="bestK", k=100) > {code} > cc: [~huaxingao] [~ruifengz] [~weichenxu123] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34080) Add UnivariateFeatureSelector to deprecate existing selectors
[ https://issues.apache.org/jira/browse/SPARK-34080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34080: Assignee: Apache Spark > Add UnivariateFeatureSelector to deprecate existing selectors > - > > Key: SPARK-34080 > URL: https://issues.apache.org/jira/browse/SPARK-34080 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 3.2.0, 3.1.1 >Reporter: Xiangrui Meng >Assignee: Apache Spark >Priority: Major > > In SPARK-26111, we introduced a few univariate feature selectors, which share > a common set of params. And they are named after the underlying test, which > requires users to understand the test to find the matched scenarios. It would > be nice if we introduce a single class called UnivariateFeatureSelector that > accepts a selection criterion and a score method (string names). Then we can > deprecate all other univariate selectors. > For the params, instead of ask users to provide what score function to use, > it is more friendly to ask users to specify the feature and label types > (continuous or categorical) and we set a default score function for each > combo. We can also detect the types from feature metadata if given. Advanced > users can overwrite it (if there are multiple score function that is > compatible with the feature type and label type combo). Example (param names > are not finalized): > {code} > selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], > labelCol=["target"], featureType="categorical", labelType="continuous", > select="bestK", k=100) > {code} > cc: [~huaxingao] [~ruifengz] [~weichenxu123] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34080) Add UnivariateFeatureSelector to deprecate existing selectors
[ https://issues.apache.org/jira/browse/SPARK-34080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34080: Assignee: (was: Apache Spark) > Add UnivariateFeatureSelector to deprecate existing selectors > - > > Key: SPARK-34080 > URL: https://issues.apache.org/jira/browse/SPARK-34080 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 3.2.0, 3.1.1 >Reporter: Xiangrui Meng >Priority: Major > > In SPARK-26111, we introduced a few univariate feature selectors, which share > a common set of params. And they are named after the underlying test, which > requires users to understand the test to find the matched scenarios. It would > be nice if we introduce a single class called UnivariateFeatureSelector that > accepts a selection criterion and a score method (string names). Then we can > deprecate all other univariate selectors. > For the params, instead of ask users to provide what score function to use, > it is more friendly to ask users to specify the feature and label types > (continuous or categorical) and we set a default score function for each > combo. We can also detect the types from feature metadata if given. Advanced > users can overwrite it (if there are multiple score function that is > compatible with the feature type and label type combo). Example (param names > are not finalized): > {code} > selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], > labelCol=["target"], featureType="categorical", labelType="continuous", > select="bestK", k=100) > {code} > cc: [~huaxingao] [~ruifengz] [~weichenxu123] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34051) Support 32-bit unicode escape in string literals
[ https://issues.apache.org/jira/browse/SPARK-34051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-34051: --- Priority: Minor (was: Major) > Support 32-bit unicode escape in string literals > > > Key: SPARK-34051 > URL: https://issues.apache.org/jira/browse/SPARK-34051 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > > Currently, Spark supports 16-bit unicode escape like "\u0041" in string > literals. > I think It's nice if 32-bit unicode is also supported like PostgreSQL and > modern programming languages do (e.g, C++11, Rust). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34090) HadoopDelegationTokenManager.isServiceEnabled used in KafkaTokenUtil.needTokenUpdate needs to be cached because it slows down Kafka stream processing in case of delegati
[ https://issues.apache.org/jira/browse/SPARK-34090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-34090: - Fix Version/s: (was: 3.1.0) 3.1.1 > HadoopDelegationTokenManager.isServiceEnabled used in > KafkaTokenUtil.needTokenUpdate needs to be cached because it slows down Kafka > stream processing in case of delegation token > - > > Key: SPARK-34090 > URL: https://issues.apache.org/jira/browse/SPARK-34090 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.1.1 >Reporter: Gabor Somogyi >Assignee: Gabor Somogyi >Priority: Major > Fix For: 3.1.1 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34090) HadoopDelegationTokenManager.isServiceEnabled used in KafkaTokenUtil.needTokenUpdate needs to be cached because it slows down Kafka stream processing in case of delegat
[ https://issues.apache.org/jira/browse/SPARK-34090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-34090: Assignee: Gabor Somogyi > HadoopDelegationTokenManager.isServiceEnabled used in > KafkaTokenUtil.needTokenUpdate needs to be cached because it slows down Kafka > stream processing in case of delegation token > - > > Key: SPARK-34090 > URL: https://issues.apache.org/jira/browse/SPARK-34090 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.1.1 >Reporter: Gabor Somogyi >Assignee: Gabor Somogyi >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34090) HadoopDelegationTokenManager.isServiceEnabled used in KafkaTokenUtil.needTokenUpdate needs to be cached because it slows down Kafka stream processing in case of delegat
[ https://issues.apache.org/jira/browse/SPARK-34090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-34090. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 31154 [https://github.com/apache/spark/pull/31154] > HadoopDelegationTokenManager.isServiceEnabled used in > KafkaTokenUtil.needTokenUpdate needs to be cached because it slows down Kafka > stream processing in case of delegation token > - > > Key: SPARK-34090 > URL: https://issues.apache.org/jira/browse/SPARK-34090 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.1.1 >Reporter: Gabor Somogyi >Assignee: Gabor Somogyi >Priority: Major > Fix For: 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
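A minimal sketch of the caching idea, assuming the goal is simply to avoid re-evaluating the service check on the per-batch hot path; it is illustrative only and not necessarily how the pull request implements it:

{code:scala}
import java.util.concurrent.atomic.AtomicReference

// Sketch only: memoise the boolean answer the first time it is computed and reuse it
// afterwards, instead of re-deriving it from the configuration on every call.
object ServiceEnabledCache {
  private val cached = new AtomicReference[Option[Boolean]](None)

  def isServiceEnabledCached(compute: => Boolean): Boolean =
    cached.get() match {
      case Some(v) => v
      case None =>
        val v = compute
        cached.compareAndSet(None, Some(v))
        v
    }
}
{code}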
[jira] [Commented] (SPARK-25075) Build and test Spark against Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-25075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263837#comment-17263837 ] Stephen Kestle commented on SPARK-25075: [~dongjoon] can you add context about points 1 and 2 please (why can't/aren't they published)? I'm guessing that the current focus is on 3.1.0 RCs (which is not scala 2.13). Does a 3.2.0_2.13 snapshot require 3.1.0 being released? When might it start? (And would this support be in an early milestone)? A month ago, I told myself that this was likely to be 6-12 months away, but on recent inspection, perhaps I should expect _something_ a bit sooner (would help start to integrate my code bases) > Build and test Spark against Scala 2.13 > --- > > Key: SPARK-25075 > URL: https://issues.apache.org/jira/browse/SPARK-25075 > Project: Spark > Issue Type: Umbrella > Components: Build, MLlib, Project Infra, Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Guillaume Massé >Priority: Major > > This umbrella JIRA tracks the requirements for building and testing Spark > against the current Scala 2.13 milestone. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34033) SparkR Daemon Initialization
[ https://issues.apache.org/jira/browse/SPARK-34033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263841#comment-17263841 ] Apache Spark commented on SPARK-34033: -- User 'WamBamBoozle' has created a pull request for this issue: https://github.com/apache/spark/pull/31162 > SparkR Daemon Initialization > > > Key: SPARK-34033 > URL: https://issues.apache.org/jira/browse/SPARK-34033 > Project: Spark > Issue Type: Improvement > Components: R, SparkR >Affects Versions: 3.2.0 > Environment: tested on centos 7 & spark 2.3.1 and on my mac & spark > at master >Reporter: Tom Howland >Priority: Major > Original Estimate: 0h > Remaining Estimate: 0h > > Provide a way for users to initialize the sparkR daemon before it forks. > I'm a contractor to Target, where we have several projects doing ML with > sparkR. The changes proposed here results in weeks of compute-time saved with > every run. > (4 partitions) * (5 seconds to load our R libraries) * (2 calls to gapply > in our app) / 60 / 60 = 111 hours. > (from > [docs/sparkr.md|https://github.com/WamBamBoozle/spark/blob/daemon_init/docs/sparkr.md#daemon-initialization]) > h3. Daemon Initialization > If your worker function has a lengthy initialization, and your > application has lots of partitions, you may find you are spending weeks > of compute time repeatedly doing something that should have taken a few > seconds during daemon initialization. > Every Spark executor spawns a process running an R daemon. The daemon > "forks a copy" of itself whenever Spark finds work for it to do. It may > be applying a predefined method such as "max", or it may be applying > your worker function. SparkR::gapply arranges things so that your worker > function will be called with each group. A group is the pair > Key-Seq[Row]. In the absence of partitioning, the daemon will fork for > every group found. With partitioning, the daemon will fork for every > partition found. A partition may have several groups in it. > All the initializations and library loading your worker function manages > is thrown away when the fork concludes. Every fork has to be > initialized. > The configuration spark.r.daemonInit provides a way to avoid reloading > packages every time the daemon forks by having the daemon pre-load > packages. You do this by providing R code to initialize the daemon for > your application. > h4. Examples > Suppose we want library(wow) to be pre-loaded for our workers. > {{sparkR.session(spark.r.daemonInit = 'library(wow)')}} > of course, that would only work if we knew that library(wow) was on our > path and available on the executor. If we have to ship the library, we > can use YARN > sparkR.session( > master = 'yarn', > spark.r.daemonInit = '.libPaths(c("wowTarget", .libPaths())); > library(wow)', > spark.submit.deployMode = 'client', > spark.yarn.dist.archives = 'wow.zip#wowTarget') > YARN creates a directory for the new executor, unzips 'wow.zip' in some > other directory, and then provides a symlink to it called > ./wowTarget. When the executor starts the daemon, the daemon loads > library(wow) from the newly created wowTarget. > Warning: if your initialization takes longer than 10 seconds, consider > increasing the configuration > [spark.r.daemonTimeout](configuration.md#sparkr). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34085) History server missing failed stage
[ https://issues.apache.org/jira/browse/SPARK-34085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-34085. - Resolution: Invalid > History server missing failed stage > --- > > Key: SPARK-34085 > URL: https://issues.apache.org/jira/browse/SPARK-34085 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > Attachments: image-2021-01-12-18-30-45-862.png > > > It is missing the failed stage(261716). > !image-2021-01-12-18-30-45-862.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34092) Stage level RestFul API support filter by task status
angerszhu created SPARK-34092: - Summary: Stage level RestFul API support filter by task status Key: SPARK-34092 URL: https://issues.apache.org/jira/browse/SPARK-34092 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.2.0 Reporter: angerszhu Support filtering tasks by task status when details is true -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34093) param maxDepth should check upper bound
zhengruifeng created SPARK-34093: Summary: param maxDepth should check upper bound Key: SPARK-34093 URL: https://issues.apache.org/jira/browse/SPARK-34093 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 3.2.0 Reporter: zhengruifeng -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
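A hypothetical sketch of what the bound check could look like, using the existing {{ParamValidators.inRange}} helper; the concrete upper limit of 30 is an assumption for illustration, not taken from the actual patch:

{code:scala}
import org.apache.spark.ml.param.{IntParam, ParamValidators, Params}

// Sketch only: bound maxDepth from both sides instead of only checking >= 0.
trait HasBoundedMaxDepth extends Params {
  final val maxDepth: IntParam = new IntParam(this, "maxDepth",
    "Maximum depth of the tree (>= 0 and <= 30 in this sketch).",
    ParamValidators.inRange(0, 30))

  final def getMaxDepth: Int = $(maxDepth)
}
{code}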
[jira] [Assigned] (SPARK-34093) param maxDepth should check upper bound
[ https://issues.apache.org/jira/browse/SPARK-34093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34093: Assignee: (was: Apache Spark) > param maxDepth should check upper bound > --- > > Key: SPARK-34093 > URL: https://issues.apache.org/jira/browse/SPARK-34093 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.2.0 >Reporter: zhengruifeng >Priority: Trivial > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34093) param maxDepth should check upper bound
[ https://issues.apache.org/jira/browse/SPARK-34093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263872#comment-17263872 ] Apache Spark commented on SPARK-34093: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/31163 > param maxDepth should check upper bound > --- > > Key: SPARK-34093 > URL: https://issues.apache.org/jira/browse/SPARK-34093 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.2.0 >Reporter: zhengruifeng >Priority: Trivial > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34093) param maxDepth should check upper bound
[ https://issues.apache.org/jira/browse/SPARK-34093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34093: Assignee: Apache Spark > param maxDepth should check upper bound > --- > > Key: SPARK-34093 > URL: https://issues.apache.org/jira/browse/SPARK-34093 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.2.0 >Reporter: zhengruifeng >Assignee: Apache Spark >Priority: Trivial > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34094) Extends StringTranslate to support unicode characters whose code point >= 0x10000
Kousuke Saruta created SPARK-34094: -- Summary: Extends StringTranslate to support unicode characters whose code point >= 0x10000 Key: SPARK-34094 URL: https://issues.apache.org/jira/browse/SPARK-34094 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta Currently, StringTranslate works with only unicode characters whose code point < 0x10000 so let's extend it to support code points >= 0x10000. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34094) Extends StringTranslate to support unicode characters whose code point >= U+10000
[ https://issues.apache.org/jira/browse/SPARK-34094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-34094: --- Summary: Extends StringTranslate to support unicode characters whose code point >= U+10000 (was: Extends StringTranslate to support unicode characters whose code point >= 0x10000) > Extends StringTranslate to support unicode characters whose code point >= > U+10000 > - > > Key: SPARK-34094 > URL: https://issues.apache.org/jira/browse/SPARK-34094 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > > Currently, StringTranslate works with only unicode characters whose code > point < 0x10000 so let's extend it to support code points >= 0x10000. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
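A hedged illustration of the desired behaviour, assuming a spark-shell session where {{spark}} is available. The string below contains U+1D306 ({{𝌆}}), a single code point above U+FFFF that is encoded as a surrogate pair:

{code:scala}
// Sketch only: once StringTranslate handles code points >= U+10000, translating the
// supplementary character should replace the whole character, not just one
// surrogate half.
spark.sql("SELECT translate('A𝌆B', '𝌆', 'x') AS translated").show()
// Expected result once supplementary characters are supported: AxB
{code}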