[jira] [Commented] (SPARK-32376) Make unionByName null-filling behavior work with struct columns
[ https://issues.apache.org/jira/browse/SPARK-32376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17169241#comment-17169241 ] L. C. Hsieh commented on SPARK-32376: - Hi [~mukulmurthy], are you ok if I go to work on this? > Make unionByName null-filling behavior work with struct columns > --- > > Key: SPARK-32376 > URL: https://issues.apache.org/jira/browse/SPARK-32376 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Mukul Murthy >Priority: Major > > https://issues.apache.org/jira/browse/SPARK-29358 added support for > unionByName to work when the two datasets didn't necessarily have the same > schema, but it does not work with nested columns like structs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
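The null-filling behavior requested here can be sketched in plain Python, with no Spark required. The helpers below are hypothetical and only illustrate the intended semantics: union rows by field name and null-fill fields that are missing on one side, including fields nested inside struct-like values.

```python
def fill_missing(row, template):
    """Recursively add fields present in `template` but missing from `row`,
    filling them with None; nested dicts model struct columns."""
    out = dict(row)
    for key, value in template.items():
        if key not in out:
            out[key] = fill_missing({}, value) if isinstance(value, dict) else None
        elif isinstance(value, dict) and isinstance(out[key], dict):
            out[key] = fill_missing(out[key], value)
    return out

def union_by_name(rows_a, rows_b):
    """Union two row lists by field name, null-filling missing (nested) fields."""
    template = {}
    for row in rows_a + rows_b:
        template = fill_missing(template, row)
    return [fill_missing(r, template) for r in rows_a + rows_b]

a = [{"id": 1, "s": {"x": 10}}]
b = [{"id": 2, "s": {"x": 20, "y": 30}}]
print(union_by_name(a, b))
# -> [{'id': 1, 's': {'x': 10, 'y': None}}, {'id': 2, 's': {'x': 20, 'y': 30}}]
```

Note how `s.y` is filled with None for the first row, which is exactly the part that SPARK-29358 does not yet handle for struct columns.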
[jira] [Resolved] (SPARK-32467) Avoid encoding URL twice on https redirect
[ https://issues.apache.org/jira/browse/SPARK-32467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-32467. Resolution: Fixed The issue is resolved in https://github.com/apache/spark/pull/29271 > Avoid encoding URL twice on https redirect > -- > > Key: SPARK-32467 > URL: https://issues.apache.org/jira/browse/SPARK-32467 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.0.1, 3.1.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > Currently, on https redirect, the original URL is encoded as an HTTPS URL. > However, the original URL may already be encoded, so the result of > UriInfo.getQueryParameters will contain doubly-encoded keys and values. For example, > a parameter > order[0][dir] becomes order%255B0%255D%255Bdir%255D after being encoded > twice, and the decoded > key in the result of UriInfo.getQueryParameters will be > order%5B0%5D%5Bdir%5D. > To fix the problem, we decode the query parameters before encoding them. > This makes sure we encode the URL exactly once.
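The double-encoding failure mode, and the decode-before-encode fix described above, can be illustrated with Python's standard `urllib.parse` (the actual Spark fix lives in the Scala web UI code; this is only a sketch of the idea):

```python
from urllib.parse import quote, unquote

param = "order[0][dir]"

# Encoding an already-encoded value corrupts it: "[" -> "%5B" -> "%255B".
once = quote(param, safe="")
twice = quote(once, safe="")
print(once)   # order%5B0%5D%5Bdir%5D
print(twice)  # order%255B0%255D%255Bdir%255D

# The fix: decode first, then encode, so the result is encoded exactly
# once regardless of whether the input was already encoded.
def encode_exactly_once(value: str) -> str:
    return quote(unquote(value), safe="")

assert encode_exactly_once(param) == once
assert encode_exactly_once(once) == once
```

The same decode-then-encode idempotence is what the linked pull request aims for on the redirect path.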
[jira] [Updated] (SPARK-32467) Avoid encoding URL twice on https redirect
[ https://issues.apache.org/jira/browse/SPARK-32467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-32467: --- Affects Version/s: 3.0.1 > Avoid encoding URL twice on https redirect > -- > > Key: SPARK-32467 > URL: https://issues.apache.org/jira/browse/SPARK-32467 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.0.1, 3.1.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > Currently, on https redirect, the original URL is encoded as an HTTPS URL. > However, the original URL may already be encoded, so the result of > UriInfo.getQueryParameters will contain doubly-encoded keys and values. For example, > a parameter > order[0][dir] becomes order%255B0%255D%255Bdir%255D after being encoded > twice, and the decoded > key in the result of UriInfo.getQueryParameters will be > order%5B0%5D%5Bdir%5D. > To fix the problem, we decode the query parameters before encoding them. > This makes sure we encode the URL exactly once.
[jira] [Updated] (SPARK-32119) ExecutorPlugin doesn't work with Standalone Cluster and Kubernetes with --jars
[ https://issues.apache.org/jira/browse/SPARK-32119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-32119: --- Description: ExecutorPlugin doesn't work with Standalone Cluster or Kubernetes when a jar containing the plugins, and the files those plugins use, are added via the --jars and --files options of spark-submit. This is because jars and files added by --jars and --files are not loaded on Executor initialization. I confirmed it works with YARN because jars/files are distributed via the distributed cache. was: ExecutorPlugin can't work with Standalone Cluster (maybe with other cluster manager too except YARN. ) when a jar which contains plugins and files used by the plugins are added by --jars and --files option with spark-submit. This is because jars and files added by --jars and --files are not loaded on Executor initialization. I confirmed it works with YARN because jars/files are distributed as distributed cache. > ExecutorPlugin doesn't work with Standalone Cluster and Kubernetes with --jars > -- > > Key: SPARK-32119 > URL: https://issues.apache.org/jira/browse/SPARK-32119 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.1, 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > > ExecutorPlugin doesn't work with Standalone Cluster or Kubernetes > when a jar containing the plugins, and the files those plugins use, are added via the > --jars and --files options of spark-submit. > This is because jars and files added by --jars and --files are not loaded on > Executor initialization. > I confirmed it works with YARN because jars/files are distributed via the > distributed cache.
[jira] [Updated] (SPARK-32119) ExecutorPlugin doesn't work with Standalone Cluster and Kubernetes with --jars
[ https://issues.apache.org/jira/browse/SPARK-32119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-32119: --- Summary: ExecutorPlugin doesn't work with Standalone Cluster and Kubernetes with --jars (was: ExecutorPlugin doesn't work with Standalone Cluster) > ExecutorPlugin doesn't work with Standalone Cluster and Kubernetes with --jars > -- > > Key: SPARK-32119 > URL: https://issues.apache.org/jira/browse/SPARK-32119 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.1, 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > > ExecutorPlugin can't work with Standalone Cluster (maybe with other cluster > manager too except YARN. ) > when a jar which contains plugins and files used by the plugins are added by > --jars and --files option with spark-submit. > This is because jars and files added by --jars and --files are not loaded on > Executor initialization. > I confirmed it works with YARN because jars/files are distributed as > distributed cache.
[jira] [Comment Edited] (SPARK-32500) Query and Batch Id not set for Structured Streaming Jobs in case of ForeachBatch in PySpark
[ https://issues.apache.org/jira/browse/SPARK-32500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17169219#comment-17169219 ] JinxinTang edited comment on SPARK-32500 at 8/1/20, 2:24 AM: - Hi, [~abhishekd0907], I have just tested the code in master, branch-2.4.6 and branch-3.0.0; all seem to work fine in pyspark as follows: !image-2020-08-01-10-21-51-246.png! was (Author: jinxintang): Hi, [~abhishekd0907], I have test the code in master, branch-2.4.6 and brach-3.0.0, seems all can work fine in pyspark as follows: !image-2020-08-01-10-21-51-246.png! > Query and Batch Id not set for Structured Streaming Jobs in case of > ForeachBatch in PySpark > --- > > Key: SPARK-32500 > URL: https://issues.apache.org/jira/browse/SPARK-32500 > Project: Spark > Issue Type: Bug > Components: PySpark, Structured Streaming >Affects Versions: 2.4.6 >Reporter: Abhishek Dixit >Priority: Major > Attachments: Screen Shot 2020-07-30 at 9.04.21 PM.png, > image-2020-08-01-10-21-51-246.png > > > Query Id and Batch Id information is not available for jobs started by > structured streaming query when _foreachBatch_ API is used in PySpark. > This happens only with foreachBatch in pyspark. ForeachBatch in scala works > fine, and also other structured streaming sinks in pyspark work fine. I am > attaching a screenshot of jobs pages. > I think job group is not set properly when _foreachBatch_ is used via > pyspark. I have a framework that depends on the _queryId_ and _batchId_ > information available in the job properties and so my framework doesn't work > for pyspark-foreachBatch use case. >
[jira] [Commented] (SPARK-32500) Query and Batch Id not set for Structured Streaming Jobs in case of ForeachBatch in PySpark
[ https://issues.apache.org/jira/browse/SPARK-32500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17169219#comment-17169219 ] JinxinTang commented on SPARK-32500: Hi, [~abhishekd0907], I have tested the code in master, branch-2.4.6 and branch-3.0.0; all seem to work fine in pyspark as follows: !image-2020-08-01-10-21-51-246.png! > Query and Batch Id not set for Structured Streaming Jobs in case of > ForeachBatch in PySpark > --- > > Key: SPARK-32500 > URL: https://issues.apache.org/jira/browse/SPARK-32500 > Project: Spark > Issue Type: Bug > Components: PySpark, Structured Streaming >Affects Versions: 2.4.6 >Reporter: Abhishek Dixit >Priority: Major > Attachments: Screen Shot 2020-07-30 at 9.04.21 PM.png, > image-2020-08-01-10-21-51-246.png > > > Query Id and Batch Id information is not available for jobs started by > structured streaming query when _foreachBatch_ API is used in PySpark. > This happens only with foreachBatch in pyspark. ForeachBatch in scala works > fine, and also other structured streaming sinks in pyspark work fine. I am > attaching a screenshot of jobs pages. > I think job group is not set properly when _foreachBatch_ is used via > pyspark. I have a framework that depends on the _queryId_ and _batchId_ > information available in the job properties and so my framework doesn't work > for pyspark-foreachBatch use case. >
[jira] [Updated] (SPARK-32500) Query and Batch Id not set for Structured Streaming Jobs in case of ForeachBatch in PySpark
[ https://issues.apache.org/jira/browse/SPARK-32500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JinxinTang updated SPARK-32500: --- Attachment: image-2020-08-01-10-21-51-246.png > Query and Batch Id not set for Structured Streaming Jobs in case of > ForeachBatch in PySpark > --- > > Key: SPARK-32500 > URL: https://issues.apache.org/jira/browse/SPARK-32500 > Project: Spark > Issue Type: Bug > Components: PySpark, Structured Streaming >Affects Versions: 2.4.6 >Reporter: Abhishek Dixit >Priority: Major > Attachments: Screen Shot 2020-07-30 at 9.04.21 PM.png, > image-2020-08-01-10-21-51-246.png > > > Query Id and Batch Id information is not available for jobs started by > structured streaming query when _foreachBatch_ API is used in PySpark. > This happens only with foreachBatch in pyspark. ForeachBatch in scala works > fine, and also other structured streaming sinks in pyspark work fine. I am > attaching a screenshot of jobs pages. > I think job group is not set properly when _foreachBatch_ is used via > pyspark. I have a framework that depends on the _queryId_ and _batchId_ > information available in the job properties and so my framework doesn't work > for pyspark-foreachBatch use case. >
[jira] [Resolved] (SPARK-32514) Pyspark: Issue using sql query in foreachBatch sink
[ https://issues.apache.org/jira/browse/SPARK-32514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-32514. -- Resolution: Not A Problem Python doesn't allow abbreviating () with no param, whereas Scala does. Use `write()`, not `write`. > Pyspark: Issue using sql query in foreachBatch sink > --- > > Key: SPARK-32514 > URL: https://issues.apache.org/jira/browse/SPARK-32514 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.5, 2.4.6, 3.0.0 >Reporter: Muru >Priority: Major > > In a pyspark SS job, trying to use sql query instead of DF API methods in > foreachBatch sink > throws AttributeError: 'JavaMember' object has no attribute 'format' > exception. > > However, the same thing works in Scala job. > > Please note, I tested in spark 2.4.5/2.4.6 and 3.0.0 and got the same > exception. > > I noticed that I could perform other operations except the write method. > > Please, let me know how to fix this issue. > > See below code examples > # Spark Scala method > def processData(batchDF: DataFrame, batchId: Long) { > batchDF.createOrReplaceTempView("tbl") > val outdf=batchDF.sparkSession.sql("select action, count(*) as count from > tbl where date='2020-06-20' group by 1") > outdf.printSchema() > outdf.show > outdf.coalesce(1).write.format("csv").save("/tmp/agg") > } > > ## pyspark python method > def process_data(bdf, bid): > lspark = bdf._jdf.sparkSession() > bdf.createOrReplaceTempView("tbl") > outdf=lspark.sql("select action, count(*) as count from tbl where > date='2020-06-20' group by 1") > outdf.printSchema() > # it works > outdf.show() > # throws AttributeError: 'JavaMember' object has no attribute 'format' > exception > outdf.coalesce(1).write.format("csv").save("/tmp/agg1") > > Here is the full exception > 20/07/24 16:31:24 ERROR streaming.MicroBatchExecution: Query [id = > 854a39d0-b944-4b52-bf05-cacf998e2cbd, runId = > e3d4dc7d-80e1-4164-8310-805d7713fc96] terminated with error > 
py4j.Py4JException: An exception was raised by the Python Proxy. Return > Message: Traceback (most recent call last): > File > "/Users/muru/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line > 2381, in _call_proxy > return_value = getattr(self.pool[obj_id], method)(*params) > File "/Users/muru/spark/python/pyspark/sql/utils.py", line 191, in call > raise e > AttributeError: 'JavaMember' object has no attribute 'format' > at py4j.Protocol.getReturnValue(Protocol.java:473) > at py4j.reflection.PythonProxyHandler.invoke(PythonProxyHandler.java:108) > at com.sun.proxy.$Proxy20.call(Unknown Source) > at > org.apache.spark.sql.execution.streaming.sources.PythonForeachBatchHelper$$anonfun$callForeachBatch$1.apply(ForeachBatchSink.scala:55) > at > org.apache.spark.sql.execution.streaming.sources.PythonForeachBatchHelper$$anonfun$callForeachBatch$1.apply(ForeachBatchSink.scala:55) > at > org.apache.spark.sql.execution.streaming.sources.ForeachBatchSink.addBatch(ForeachBatchSink.scala:35) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5$$anonfun$apply$17.apply(MicroBatchExecution.scala:537) > at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5.apply(MicroBatchExecution.scala:535) > at > org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351) > at > org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58) > at > 
org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:534) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:198) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166) > at > org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTim
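The root cause can be reproduced without Spark: in Python, referencing a method without calling it yields the member object itself, which has no `format` attribute, whereas Scala allows omitting the empty parentheses. The classes below are hypothetical stand-ins for the py4j proxies returned by `bdf._jdf.sparkSession().sql(...)`:

```python
class JavaWriterProxy:
    """Hypothetical stand-in for the Java DataFrameWriter proxy."""
    def format(self, fmt):
        self.fmt = fmt
        return self

class JavaDataFrameProxy:
    """Hypothetical stand-in for the py4j proxy of a Java DataFrame.
    `write` is a method here, not a property as on a Python DataFrame."""
    def write(self):
        return JavaWriterProxy()

outdf = JavaDataFrameProxy()

# Scala allows `outdf.write.format(...)`; in Python, `outdf.write` without
# parentheses is just the method object, so attribute lookup fails:
try:
    outdf.write.format("csv")
except AttributeError as e:
    print(e)  # analogous to: 'JavaMember' object has no attribute 'format'

# Calling the member explicitly, as the resolution suggests, works:
writer = outdf.write().format("csv")
```

This is why `outdf.coalesce(1).write().format("csv").save(...)` succeeds where `outdf.coalesce(1).write.format(...)` raises.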
[jira] [Commented] (SPARK-32402) Implement ALTER TABLE in JDBC Table Catalog
[ https://issues.apache.org/jira/browse/SPARK-32402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17169209#comment-17169209 ] Apache Spark commented on SPARK-32402: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/29324 > Implement ALTER TABLE in JDBC Table Catalog > --- > > Key: SPARK-32402 > URL: https://issues.apache.org/jira/browse/SPARK-32402 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > The PR https://github.com/apache/spark/pull/29168 adds basic implementation > of JDBC Table Catalog. This ticket aims to support table altering in: > - JDBC dialects > - JDBCTableCatalog > and to add tests.
[jira] [Assigned] (SPARK-32402) Implement ALTER TABLE in JDBC Table Catalog
[ https://issues.apache.org/jira/browse/SPARK-32402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32402: Assignee: (was: Apache Spark) > Implement ALTER TABLE in JDBC Table Catalog > --- > > Key: SPARK-32402 > URL: https://issues.apache.org/jira/browse/SPARK-32402 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > The PR https://github.com/apache/spark/pull/29168 adds basic implementation > of JDBC Table Catalog. This ticket aims to support table altering in: > - JDBC dialects > - JDBCTableCatalog > and to add tests.
[jira] [Assigned] (SPARK-32402) Implement ALTER TABLE in JDBC Table Catalog
[ https://issues.apache.org/jira/browse/SPARK-32402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32402: Assignee: Apache Spark > Implement ALTER TABLE in JDBC Table Catalog > --- > > Key: SPARK-32402 > URL: https://issues.apache.org/jira/browse/SPARK-32402 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Major > > The PR https://github.com/apache/spark/pull/29168 adds basic implementation > of JDBC Table Catalog. This ticket aims to support table altering in: > - JDBC dialects > - JDBCTableCatalog > and to add tests.
[jira] [Commented] (SPARK-32402) Implement ALTER TABLE in JDBC Table Catalog
[ https://issues.apache.org/jira/browse/SPARK-32402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17169208#comment-17169208 ] Apache Spark commented on SPARK-32402: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/29324 > Implement ALTER TABLE in JDBC Table Catalog > --- > > Key: SPARK-32402 > URL: https://issues.apache.org/jira/browse/SPARK-32402 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > The PR https://github.com/apache/spark/pull/29168 adds basic implementation > of JDBC Table Catalog. This ticket aims to support table altering in: > - JDBC dialects > - JDBCTableCatalog > and to add tests.
[jira] [Created] (SPARK-32514) Pyspark: Issue using sql query in foreachBatch sink
Muru created SPARK-32514: Summary: Pyspark: Issue using sql query in foreachBatch sink Key: SPARK-32514 URL: https://issues.apache.org/jira/browse/SPARK-32514 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.0.0, 2.4.6, 2.4.5 Reporter: Muru In a pyspark SS job, trying to use sql query instead of DF API methods in foreachBatch sink throws AttributeError: 'JavaMember' object has no attribute 'format' exception. However, the same thing works in Scala job. Please note, I tested in spark 2.4.5/2.4.6 and 3.0.0 and got the same exception. I noticed that I could perform other operations except the write method. Please, let me know how to fix this issue. See below code examples # Spark Scala method def processData(batchDF: DataFrame, batchId: Long) { batchDF.createOrReplaceTempView("tbl") val outdf=batchDF.sparkSession.sql("select action, count(*) as count from tbl where date='2020-06-20' group by 1") outdf.printSchema() outdf.show outdf.coalesce(1).write.format("csv").save("/tmp/agg") } ## pyspark python method def process_data(bdf, bid): lspark = bdf._jdf.sparkSession() bdf.createOrReplaceTempView("tbl") outdf=lspark.sql("select action, count(*) as count from tbl where date='2020-06-20' group by 1") outdf.printSchema() # it works outdf.show() # throws AttributeError: 'JavaMember' object has no attribute 'format' exception outdf.coalesce(1).write.format("csv").save("/tmp/agg1") Here is the full exception 20/07/24 16:31:24 ERROR streaming.MicroBatchExecution: Query [id = 854a39d0-b944-4b52-bf05-cacf998e2cbd, runId = e3d4dc7d-80e1-4164-8310-805d7713fc96] terminated with error py4j.Py4JException: An exception was raised by the Python Proxy. 
Return Message: Traceback (most recent call last): File "/Users/muru/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 2381, in _call_proxy return_value = getattr(self.pool[obj_id], method)(*params) File "/Users/muru/spark/python/pyspark/sql/utils.py", line 191, in call raise e AttributeError: 'JavaMember' object has no attribute 'format' at py4j.Protocol.getReturnValue(Protocol.java:473) at py4j.reflection.PythonProxyHandler.invoke(PythonProxyHandler.java:108) at com.sun.proxy.$Proxy20.call(Unknown Source) at org.apache.spark.sql.execution.streaming.sources.PythonForeachBatchHelper$$anonfun$callForeachBatch$1.apply(ForeachBatchSink.scala:55) at org.apache.spark.sql.execution.streaming.sources.PythonForeachBatchHelper$$anonfun$callForeachBatch$1.apply(ForeachBatchSink.scala:55) at org.apache.spark.sql.execution.streaming.sources.ForeachBatchSink.addBatch(ForeachBatchSink.scala:35) at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5$$anonfun$apply$17.apply(MicroBatchExecution.scala:537) at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75) at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5.apply(MicroBatchExecution.scala:535) at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351) at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58) at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:534) at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:198) at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166) at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166) at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351) at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58) at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:166) at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56) at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:160) at org.apache.spark.sql.e
[jira] [Commented] (SPARK-32018) Fix UnsafeRow set overflowed decimal
[ https://issues.apache.org/jira/browse/SPARK-32018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17169132#comment-17169132 ] Sunitha Kambhampati commented on SPARK-32018: - [@cloud-fan|https://github.com/cloud-fan], I noticed the back ports now. This change is more far-reaching in its impact: callers of UnsafeRow.getDecimal that would previously have thrown an exception will now return null. For example, a caller like aggregate sum will need changes to account for this. Cases where sum would previously throw an error on overflow will *now return incorrect results*. The new tests that were added for sum overflow cases in the DataFrameSuite in master can be used to reproduce this. IMO, it would be better to not back port the setDecimal change in isolation. wdyt? Please share your thoughts. Thanks. I added a comment on the pr but since it is closed, adding a comment here. > Fix UnsafeRow set overflowed decimal > > > Key: SPARK-32018 > URL: https://issues.apache.org/jira/browse/SPARK-32018 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.6, 3.0.0 >Reporter: Allison Wang >Assignee: Wenchen Fan >Priority: Major > Fix For: 2.4.7, 3.0.1, 3.1.0 > > > There is a bug that writing an overflowed decimal into UnsafeRow is fine but > reading it out will throw ArithmeticException. This exception is thrown when > calling {{getDecimal}} on UnsafeRow with a decimal whose precision is greater > than the declared precision. Setting the value of the overflowed decimal to null > when writing into UnsafeRow should fix this issue.
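The semantics change under discussion can be sketched in plain Python with the decimal module; `set_decimal` below is a hypothetical helper, not Spark's actual UnsafeRow code:

```python
from decimal import Decimal

def set_decimal(value: Decimal, precision: int):
    """Sketch of the changed write semantics: a value whose digit count
    exceeds the declared precision is stored as null (None) instead of
    surfacing an ArithmeticException when later read back."""
    digits = len(value.as_tuple().digits)
    return None if digits > precision else value

assert set_decimal(Decimal("12345"), 5) == Decimal("12345")
assert set_decimal(Decimal("123456"), 5) is None  # overflowed -> null
```

The comment's concern follows directly from this sketch: any downstream consumer (such as a sum aggregate) that relied on the old exception now silently receives null, so back-porting only the write-side change can turn a loud failure into a wrong answer.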
[jira] [Updated] (SPARK-32511) Add dropFields method to Column class
[ https://issues.apache.org/jira/browse/SPARK-32511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] fqaiser94 updated SPARK-32511: -- Description: Based on the discussions in the parent ticket (SPARK-22231), add a new {{dropFields}} method to the {{Column}} class. This method should allow users to drop a column nested inside a StructType Column (with similar semantics to the existing {{drop}} method on {{Dataset}}). It should also be able to handle deeply nested columns through the same API. This is similar to the {{withField}} method that was recently added in SPARK-31317 and likely we can re-use some of that "infrastructure." The public-facing method signature should be something along the following lines: {noformat} def dropFields(fieldNames: String*): Column {noformat} was: Based on the discussions in the parent ticket (SPARK-22231), add a new {{dropFields}} method to the {{Column}} class. This method should allow users to drop a column {{nested inside another StructType Column}} (with similar semantics to the {{drop}} method on {{Dataset}}). This is similar to the {{withField}} that was recently added in SPARK-31317 and likely can re-use some of that infrastructure. The public-facing method signature should be something along the following lines: {noformat} def dropFields(fieldNames: String*): Column {noformat} > Add dropFields method to Column class > - > > Key: SPARK-32511 > URL: https://issues.apache.org/jira/browse/SPARK-32511 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: fqaiser94 >Priority: Major > > Based on the discussions in the parent ticket (SPARK-22231), add a new > {{dropFields}} method to the {{Column}} class. > This method should allow users to drop a column nested inside a StructType > Column (with similar semantics to the existing {{drop}} method on > {{Dataset}}). > It should also be able to handle deeply nested columns through the same API. 
> This is similar to the {{withField}} method that was recently added in > SPARK-31317 and likely we can re-use some of that "infrastructure." > The public-facing method signature should be something along the following > lines: > {noformat} > def dropFields(fieldNames: String*): Column > {noformat}
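A plain-Python sketch of the proposed semantics, using nested dicts in place of StructType columns (the helper below is hypothetical; the real method would build Spark Column expressions, and the dot-separated path syntax here is only illustrative):

```python
def drop_fields(struct: dict, *field_names: str) -> dict:
    """Drop (possibly nested) fields from a dict-based struct.
    Dot-separated names address deeply nested fields, mirroring the
    ticket's requirement that one API handle deep nesting."""
    out = {k: v for k, v in struct.items()}
    for name in field_names:
        head, _, rest = name.partition(".")
        if rest and isinstance(out.get(head), dict):
            out[head] = drop_fields(out[head], rest)
        else:
            out.pop(head, None)
    return out

row = {"a": 1, "s": {"x": 10, "y": {"z": 20, "w": 30}}}
print(drop_fields(row, "a", "s.y.z"))
# -> {'s': {'x': 10, 'y': {'w': 30}}}
```

As with `Dataset.drop`, dropping a field that does not exist is a no-op rather than an error.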
[jira] [Updated] (SPARK-32511) Add dropFields method to Column class
[ https://issues.apache.org/jira/browse/SPARK-32511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] fqaiser94 updated SPARK-32511: -- Description: Based on the discussions in the parent ticket (SPARK-22231), add a new {{dropFields}} method to the {{Column}} class. This method should allow users to drop a column {{nested inside another StructType Column}} (with similar semantics to the {{drop}} method on {{Dataset}}). This is similar to the {{withField}} that was recently added in SPARK-31317 and likely can re-use some of that infrastructure. The public-facing method signature should be something along the following lines: {noformat} def dropFields(fieldNames: String*): Column {noformat} was: Based on the discussions in the parent ticket (SPARK-22231), add a new {{dropFields}} method to the {{Column}} class. This method should allow users to drop a column {{nested inside another StructType Column}} (with similar semantics to the {{drop}} method on {{Dataset}}). This is similar to the {{withField}} that was recently added in SPARK-31317 and likely can re-use some of that infrastructure. > Add dropFields method to Column class > - > > Key: SPARK-32511 > URL: https://issues.apache.org/jira/browse/SPARK-32511 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: fqaiser94 >Priority: Major > > Based on the discussions in the parent ticket (SPARK-22231), add a new > {{dropFields}} method to the {{Column}} class. > This method should allow users to drop a column {{nested inside another > StructType Column}} (with similar semantics to the {{drop}} method on > {{Dataset}}). > This is similar to the {{withField}} that was recently added in SPARK-31317 > and likely can re-use some of that infrastructure. 
> The public-facing method signature should be something along the following > lines: > {noformat} > def dropFields(fieldNames: String*): Column > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
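The intended semantics of the proposed method can be sketched outside Spark with a small helper that drops a named field from a nested dict standing in for a StructType row. This is a hypothetical model only, not the actual `Column` API; the real implementation would rewrite Catalyst struct expressions, as `withField` does.

```python
# Toy model of the proposed dropFields semantics, using plain dicts as
# stand-ins for StructType rows (illustration only, not the Spark API).

def drop_fields(struct, *field_names):
    """Return a copy of `struct` without the named fields.

    Dotted names drop a field nested inside a child struct, mirroring the
    "arbitrary levels of nesting" goal discussed in the ticket.
    """
    result = {}
    for key, value in struct.items():
        heads = {name.split(".", 1)[0] for name in field_names}
        if key in heads:
            # Names like "a.b" recurse into the child struct "a".
            rest = [n.split(".", 1)[1] for n in field_names
                    if n.startswith(key + ".")]
            if rest and isinstance(value, dict):
                result[key] = drop_fields(value, *rest)
            # A bare match ("a") drops the field entirely.
            elif key not in field_names:
                result[key] = value
        else:
            result[key] = value
    return result

row = {"a": {"b": 1, "c": 2}, "d": 3}
print(drop_fields(row, "a.b"))  # {'a': {'c': 2}, 'd': 3}
print(drop_fields(row, "d"))    # {'a': {'b': 1, 'c': 2}}
```

As with `Dataset.drop`, fields that do not exist are simply ignored rather than raising an error.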
[jira] [Updated] (SPARK-32511) Add dropFields method to Column class
[ https://issues.apache.org/jira/browse/SPARK-32511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] fqaiser94 updated SPARK-32511: -- Description: Based on the discussions in the parent ticket (SPARK-22231), add a new {{dropFields}} method to the {{Column}} class. This method should allow users to drop a column {{nested inside another StructType Column}} (with similar semantics to the {{drop}} method on {{Dataset}}). This is similar to the {{withField}} that was recently added in SPARK-31317 and likely can re-use some of that infrastructure. was: Based on the discussions in the parent ticket (SPARK-22231), add a new {{dropFields}} method to the {{Column}} class. This method should allow users to drop a column {{nested inside another StructType Column}} (with similar semantics to the {{drop}} method on {{Dataset}}). > Add dropFields method to Column class > - > > Key: SPARK-32511 > URL: https://issues.apache.org/jira/browse/SPARK-32511 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: fqaiser94 >Priority: Major > > Based on the discussions in the parent ticket (SPARK-22231), add a new > {{dropFields}} method to the {{Column}} class. > This method should allow users to drop a column {{nested inside another > StructType Column}} (with similar semantics to the {{drop}} method on > {{Dataset}}). > This is similar to the {{withField}} that was recently added in SPARK-31317 > and likely can re-use some of that infrastructure. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32511) Add dropFields method to Column class
[ https://issues.apache.org/jira/browse/SPARK-32511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] fqaiser94 updated SPARK-32511: -- Description: Based on the discussions in the parent ticket (SPARK-22231), add a new {{dropFields}} method to the {{Column}} class. This method should allow users to drop a column {{nested inside another StructType Column}} (with similar semantics to the {{drop}} method on {{Dataset}}). was: Based on the discussions in the parent ticket (SPARK-22231) and following on, add a new {{dropFields}} method to the {{Column}} class. This method should allow users to drop a column {{nested inside another StructType Column}} (with similar semantics to the {{drop}} method on {{Dataset}}). > Add dropFields method to Column class > - > > Key: SPARK-32511 > URL: https://issues.apache.org/jira/browse/SPARK-32511 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: fqaiser94 >Priority: Major > > Based on the discussions in the parent ticket (SPARK-22231), add a new > {{dropFields}} method to the {{Column}} class. > This method should allow users to drop a column {{nested inside another > StructType Column}} (with similar semantics to the {{drop}} method on > {{Dataset}}). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32511) Add dropFields method to Column class
[ https://issues.apache.org/jira/browse/SPARK-32511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] fqaiser94 updated SPARK-32511: -- Description: Based on the discussions in the parent ticket (SPARK-22231) and following on, add a new {{dropFields}} method to the {{Column}} class. This method should allow users to drop a column {{nested inside another StructType Column}} (with similar semantics to the {{drop}} method on {{Dataset}}). was: Based on the discussions in the parent ticket (SPARK-22241) and following on, add a new {{dropFields}} method to the {{Column}} class. This method should allow users to drop a column {{nested inside another StructType Column}} (with similar semantics to the {{drop}} method on {{Dataset}}). > Add dropFields method to Column class > - > > Key: SPARK-32511 > URL: https://issues.apache.org/jira/browse/SPARK-32511 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: fqaiser94 >Priority: Major > > Based on the discussions in the parent ticket (SPARK-22231) and following on, > add a new {{dropFields}} method to the {{Column}} class. > This method should allow users to drop a column {{nested inside another > StructType Column}} (with similar semantics to the {{drop}} method on > {{Dataset}}). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32511) Add dropFields method to Column class
[ https://issues.apache.org/jira/browse/SPARK-32511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] fqaiser94 updated SPARK-32511: -- Description: Based on the extensive discussions in the parent ticket, it was determined we should add a new {{dropFields}} method to the {{Column}} class. This method should allow users to drop a column nested inside another StructType Column, with similar semantics to the {{drop}} method on {{Dataset}}. Ideally, this method should be able to handle dropping columns at arbitrary levels of nesting in a StructType Column. was: Based on the discussions in the parent ticket, add a new {{dropFields}} method to the {{Column}} class. This method should allow users to drop a column {{nested inside another StructType Column}} (with similar semantics to the {{drop}} method on {{Dataset}}). > Add dropFields method to Column class > - > > Key: SPARK-32511 > URL: https://issues.apache.org/jira/browse/SPARK-32511 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: fqaiser94 >Priority: Major > > Based on the extensive discussions in the parent ticket, it was determined we > should add a new {{dropFields}} method to the {{Column}} class. > This method should allow users to drop a column nested inside another > StructType Column, with similar semantics to the {{drop}} method on > {{Dataset}}. > Ideally, this method should be able to handle dropping columns at arbitrary > levels of nesting in a StructType Column. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32511) Add dropFields method to Column class
[ https://issues.apache.org/jira/browse/SPARK-32511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] fqaiser94 updated SPARK-32511: -- Description: Based on the discussions in the parent ticket (SPARK-22241) and following on, add a new {{dropFields}} method to the {{Column}} class. This method should allow users to drop a column {{nested inside another StructType Column}} (with similar semantics to the {{drop}} method on {{Dataset}}). was: Based on the extensive discussions in the parent ticket, it was determined we should add a new {{dropFields}} method to the {{Column}} class. This method should allow users to drop a column nested inside another StructType Column, with similar semantics to the {{drop}} method on {{Dataset}}. Ideally, this method should be able to handle dropping columns at arbitrary levels of nesting in a StructType Column. > Add dropFields method to Column class > - > > Key: SPARK-32511 > URL: https://issues.apache.org/jira/browse/SPARK-32511 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: fqaiser94 >Priority: Major > > Based on the discussions in the parent ticket (SPARK-22241) and following on, > add a new {{dropFields}} method to the {{Column}} class. > This method should allow users to drop a column {{nested inside another > StructType Column}} (with similar semantics to the {{drop}} method on > {{Dataset}}). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32511) Add dropFields method to Column class
[ https://issues.apache.org/jira/browse/SPARK-32511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] fqaiser94 updated SPARK-32511: -- Description: Based on the discussions in the parent ticket, add a new {{dropFields}} method to the {{Column}} class. This method should allow users to drop a column {{nested inside another StructType Column}} (with similar semantics to the {{drop}} method on {{Dataset}}). was: Based on the discussions in the parent ticket, Added a new {{dropFields}} method to the {{Column}} class. This method should allow users to drop a {{StructField}} in a {{StructType}} column (with similar semantics to the {{drop}} method on {{Dataset}}). > Add dropFields method to Column class > - > > Key: SPARK-32511 > URL: https://issues.apache.org/jira/browse/SPARK-32511 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: fqaiser94 >Priority: Major > > Based on the discussions in the parent ticket, add a new {{dropFields}} > method to the {{Column}} class. > This method should allow users to drop a column {{nested inside another > StructType Column}} (with similar semantics to the {{drop}} method on > {{Dataset}}). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32511) Add dropFields method to Column class
[ https://issues.apache.org/jira/browse/SPARK-32511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] fqaiser94 updated SPARK-32511: -- Description: Based on the discussions in the parent ticket, Added a new {{dropFields}} method to the {{Column}} class. This method should allow users to drop a {{StructField}} in a {{StructType}} column (with similar semantics to the {{drop}} method on {{Dataset}}). > Add dropFields method to Column class > - > > Key: SPARK-32511 > URL: https://issues.apache.org/jira/browse/SPARK-32511 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: fqaiser94 >Priority: Major > > Based on the discussions in the parent ticket, > Added a new {{dropFields}} method to the {{Column}} class. > This method should allow users to drop a {{StructField}} in a {{StructType}} > column (with similar semantics to the {{drop}} method on {{Dataset}}). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32511) Add dropFields method to Column class
[ https://issues.apache.org/jira/browse/SPARK-32511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17169087#comment-17169087 ] Rohit Mishra commented on SPARK-32511: -- [~fqaiser94], Can you please add a description? > Add dropFields method to Column class > - > > Key: SPARK-32511 > URL: https://issues.apache.org/jira/browse/SPARK-32511 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: fqaiser94 >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32513) Rename classes/files with the Jdbc prefix to JDBC
[ https://issues.apache.org/jira/browse/SPARK-32513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17169081#comment-17169081 ] Apache Spark commented on SPARK-32513: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/29323 > Rename classes/files with the Jdbc prefix to JDBC > - > > Key: SPARK-32513 > URL: https://issues.apache.org/jira/browse/SPARK-32513 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > I have found 7 files with Jdbc: > {code} > ➜ apache-spark git:(master) find . -name "Jdbc*.scala" -type f > ./core/src/test/scala/org/apache/spark/rdd/JdbcRDDSuite.scala > ./core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala > ./sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtilsSuite.scala > ./sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala > ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcRelationProvider.scala > ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala > ./sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/JdbcConnectionUriSuite.scala > {code} > and 8 starts from JDBC: > {code} > ➜ apache-spark git:(master) find . 
-name "JDBC*.scala" -type f > ./sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala > ./sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCWriteSuite.scala > ./sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTableCatalogSuite.scala > ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala > ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala > ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala > ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTableCatalog.scala > ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTable.scala > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32513) Rename classes/files with the Jdbc prefix to JDBC
[ https://issues.apache.org/jira/browse/SPARK-32513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32513: Assignee: Apache Spark > Rename classes/files with the Jdbc prefix to JDBC > - > > Key: SPARK-32513 > URL: https://issues.apache.org/jira/browse/SPARK-32513 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Major > > I have found 7 files with Jdbc: > {code} > ➜ apache-spark git:(master) find . -name "Jdbc*.scala" -type f > ./core/src/test/scala/org/apache/spark/rdd/JdbcRDDSuite.scala > ./core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala > ./sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtilsSuite.scala > ./sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala > ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcRelationProvider.scala > ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala > ./sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/JdbcConnectionUriSuite.scala > {code} > and 8 starts from JDBC: > {code} > ➜ apache-spark git:(master) find . 
-name "JDBC*.scala" -type f > ./sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala > ./sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCWriteSuite.scala > ./sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTableCatalogSuite.scala > ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala > ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala > ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala > ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTableCatalog.scala > ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTable.scala > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32513) Rename classes/files with the Jdbc prefix to JDBC
[ https://issues.apache.org/jira/browse/SPARK-32513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17169082#comment-17169082 ] Apache Spark commented on SPARK-32513: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/29323 > Rename classes/files with the Jdbc prefix to JDBC > - > > Key: SPARK-32513 > URL: https://issues.apache.org/jira/browse/SPARK-32513 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > I have found 7 files with Jdbc: > {code} > ➜ apache-spark git:(master) find . -name "Jdbc*.scala" -type f > ./core/src/test/scala/org/apache/spark/rdd/JdbcRDDSuite.scala > ./core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala > ./sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtilsSuite.scala > ./sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala > ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcRelationProvider.scala > ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala > ./sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/JdbcConnectionUriSuite.scala > {code} > and 8 starts from JDBC: > {code} > ➜ apache-spark git:(master) find . 
-name "JDBC*.scala" -type f > ./sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala > ./sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCWriteSuite.scala > ./sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTableCatalogSuite.scala > ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala > ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala > ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala > ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTableCatalog.scala > ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTable.scala > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32513) Rename classes/files with the Jdbc prefix to JDBC
[ https://issues.apache.org/jira/browse/SPARK-32513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32513: Assignee: (was: Apache Spark) > Rename classes/files with the Jdbc prefix to JDBC > - > > Key: SPARK-32513 > URL: https://issues.apache.org/jira/browse/SPARK-32513 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > I have found 7 files with Jdbc: > {code} > ➜ apache-spark git:(master) find . -name "Jdbc*.scala" -type f > ./core/src/test/scala/org/apache/spark/rdd/JdbcRDDSuite.scala > ./core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala > ./sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtilsSuite.scala > ./sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala > ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcRelationProvider.scala > ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala > ./sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/JdbcConnectionUriSuite.scala > {code} > and 8 starts from JDBC: > {code} > ➜ apache-spark git:(master) find . 
-name "JDBC*.scala" -type f > ./sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala > ./sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCWriteSuite.scala > ./sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTableCatalogSuite.scala > ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala > ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala > ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala > ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTableCatalog.scala > ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTable.scala > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32513) Rename classes/files with the Jdbc prefix to JDBC
Maxim Gekk created SPARK-32513: -- Summary: Rename classes/files with the Jdbc prefix to JDBC Key: SPARK-32513 URL: https://issues.apache.org/jira/browse/SPARK-32513 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk I have found 7 files with Jdbc: {code} ➜ apache-spark git:(master) find . -name "Jdbc*.scala" -type f ./core/src/test/scala/org/apache/spark/rdd/JdbcRDDSuite.scala ./core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala ./sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtilsSuite.scala ./sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcRelationProvider.scala ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala ./sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/JdbcConnectionUriSuite.scala {code} and 8 starts from JDBC: {code} ➜ apache-spark git:(master) find . -name "JDBC*.scala" -type f ./sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala ./sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCWriteSuite.scala ./sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTableCatalogSuite.scala ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTableCatalog.scala ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTable.scala {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
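The mechanical part of such a rename can be sketched as a small path-rewriting helper (illustrative only; the actual change would use `git mv` and also rename the classes inside the files):

```python
import os.path

def rename_jdbc_prefix(path):
    """Rewrite a file name starting with 'Jdbc' to start with 'JDBC'.

    Paths whose base name already uses the JDBC prefix (or no prefix at
    all) are returned unchanged.
    """
    head, name = os.path.split(path)
    if name.startswith("Jdbc"):
        name = "JDBC" + name[len("Jdbc"):]
    return os.path.join(head, name)

print(rename_jdbc_prefix(
    "./core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala"))
# → ./core/src/main/scala/org/apache/spark/rdd/JDBCRDD.scala
```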
[jira] [Commented] (SPARK-32405) Apply table options while creating tables in JDBC Table Catalog
[ https://issues.apache.org/jira/browse/SPARK-32405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17169050#comment-17169050 ] Maxim Gekk commented on SPARK-32405: properties passed to createTable are ignored, see [https://github.com/apache/spark/blob/8bc799f92005c903868ef209f5aec8deb6ccce5a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTableCatalog.scala#L116] > Apply table options while creating tables in JDBC Table Catalog > --- > > Key: SPARK-32405 > URL: https://issues.apache.org/jira/browse/SPARK-32405 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > We need to add an API to `JdbcDialect` to generate the SQL statement to > specify table options. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31470) Introduce SORTED BY clause in CREATE TABLE statement
[ https://issues.apache.org/jira/browse/SPARK-31470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17169042#comment-17169042 ] Cheng Su commented on SPARK-31470: -- [~yumwang] - thanks for creating this jira! Would like to know more details underneath. # For `SORT BY` here, are you more referring to local sort per spark task, or global sort, or some other techniques like z-ordering to handle multiple filters? (It seems that databricks delta already supported it - [https://databricks.com/blog/2018/07/31/processing-petabytes-of-data-in-seconds-with-databricks-delta.html]) # Where do we plan to store this metadata information in catalog, and what's the structure looking like? # Are you more targeting improving filter performance, or other operators like join and group-by? (where I think a global sort should help save shuffle and sort for join and group-by) # I feel it should be a minor change to support the feature, but how do we drive alignment across compute engines like presto and impala in the future? > Introduce SORTED BY clause in CREATE TABLE statement > > > Key: SPARK-31470 > URL: https://issues.apache.org/jira/browse/SPARK-31470 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > We usually sort on frequently filtered columns when writing data to improve > query performance. But this sort information is not recorded in the table > metadata. 
> > {code:sql} > CREATE TABLE t(day INT, hour INT, year INT, month INT) > USING parquet > PARTITIONED BY (year, month) > SORTED BY (day, hour); > {code} > > Impala, Oracle and redshift support this clause: > https://issues.apache.org/jira/browse/IMPALA-4166 > https://docs.oracle.com/database/121/DWHSG/attcluster.htm#GUID-DAECFBC5-FD1A-45A5-8C2C-DC9884D0857B > https://docs.aws.amazon.com/redshift/latest/dg/t_Sorting_data-compare-sort-styles.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32405) Apply table options while creating tables in JDBC Table Catalog
[ https://issues.apache.org/jira/browse/SPARK-32405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17169041#comment-17169041 ] Huaxin Gao commented on SPARK-32405: [~maxgekk] Hi Max, could you please explain what these table options are? I initially thought you mean createTableOptions, but createTableOptions is appended to the create table stmt and no need to generate in `JdbcDialect`. {code:java} val sql = s"CREATE TABLE $tableName ($strSchema) $createTableOptions" statement.executeUpdate(sql) {code} I guess you mean something else? Could you please give me an example? Thanks! > Apply table options while creating tables in JDBC Table Catalog > --- > > Key: SPARK-32405 > URL: https://issues.apache.org/jira/browse/SPARK-32405 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > We need to add an API to `JdbcDialect` to generate the SQL statement to > specify table options. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
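The current behavior described in the comment — user-supplied `createTableOptions` appended verbatim after the column list — can be modeled with a small string-building sketch (simplified; the real `JdbcUtils` derives the schema string from the Catalyst schema and the dialect):

```python
# Simplified model of how the JDBC writer assembles its CREATE TABLE
# statement: the options string is appended as-is after the column list.

def build_create_table(table_name, schema, create_table_options=""):
    """Build a CREATE TABLE statement from (name, type) pairs.

    `create_table_options` mirrors Spark's createTableOptions: an opaque
    suffix the user supplies, not something the dialect generates.
    """
    cols = ", ".join(f"{name} {typ}" for name, typ in schema)
    return f"CREATE TABLE {table_name} ({cols}) {create_table_options}".strip()

sql = build_create_table(
    "people",                               # hypothetical table
    [("name", "TEXT"), ("age", "INTEGER")],
    "ENGINE=InnoDB",                        # hypothetical MySQL option
)
print(sql)  # CREATE TABLE people (name TEXT, age INTEGER) ENGINE=InnoDB
```

The ticket's open question is whether the Table Catalog path should instead route the catalog `properties` map through a new `JdbcDialect` API rather than dropping it.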
[jira] [Updated] (SPARK-32332) AQE doesn't adequately allow for Columnar Processing extension
[ https://issues.apache.org/jira/browse/SPARK-32332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-32332: -- Fix Version/s: 3.0.1 > AQE doesn't adequately allow for Columnar Processing extension > --- > > Key: SPARK-32332 > URL: https://issues.apache.org/jira/browse/SPARK-32332 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.1, 3.1.0 > > > In SPARK-27396 we added support to extend Columnar Processing. We did the > initial work as to what we thought was sufficient but adaptive query > execution was being developed at the same time. > We have discovered that the changes made to AQE are not sufficient for users > to properly extend it for columnar processing because AQE hardcodes to look > for specific classes/execs. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32512) Add basic partition command for datasourcev2
[ https://issues.apache.org/jira/browse/SPARK-32512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jackey Lee updated SPARK-32512: --- Description: This Jira is trying to add basic partition command API, `AlterTableAddPartitionExec` and `AlterTableDropPartitionExec`, to support operating datasourcev2 partitions. This will use the new partition API defined in [SPARK-31694|https://issues.apache.org/jira/browse/SPARK-31694]. > Add basic partition command for datasourcev2 > > > Key: SPARK-32512 > URL: https://issues.apache.org/jira/browse/SPARK-32512 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Jackey Lee >Priority: Major > > This Jira is trying to add basic partition command API, > `AlterTableAddPartitionExec` and `AlterTableDropPartitionExec`, to support > operating datasourcev2 partitions. This will use the new partition API > defined in [SPARK-31694|https://issues.apache.org/jira/browse/SPARK-31694]. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32511) Add dropFields method to Column class
[ https://issues.apache.org/jira/browse/SPARK-32511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32511: Assignee: (was: Apache Spark) > Add dropFields method to Column class > - > > Key: SPARK-32511 > URL: https://issues.apache.org/jira/browse/SPARK-32511 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: fqaiser94 >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32511) Add dropFields method to Column class
[ https://issues.apache.org/jira/browse/SPARK-32511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32511: Assignee: Apache Spark > Add dropFields method to Column class > - > > Key: SPARK-32511 > URL: https://issues.apache.org/jira/browse/SPARK-32511 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: fqaiser94 >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32511) Add dropFields method to Column class
[ https://issues.apache.org/jira/browse/SPARK-32511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168966#comment-17168966 ] Apache Spark commented on SPARK-32511: -- User 'fqaiser94' has created a pull request for this issue: https://github.com/apache/spark/pull/29322 > Add dropFields method to Column class > - > > Key: SPARK-32511 > URL: https://issues.apache.org/jira/browse/SPARK-32511 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: fqaiser94 >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32512) Add basic partition command for datasourcev2
Jackey Lee created SPARK-32512: -- Summary: Add basic partition command for datasourcev2 Key: SPARK-32512 URL: https://issues.apache.org/jira/browse/SPARK-32512 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Jackey Lee -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32511) Add dropFields method to Column class
fqaiser94 created SPARK-32511: - Summary: Add dropFields method to Column class Key: SPARK-32511 URL: https://issues.apache.org/jira/browse/SPARK-32511 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: fqaiser94 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
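The dropFields semantics proposed above — removing a (possibly nested) field from a struct column — can be sketched in plain Python by modeling a struct value as a dict. This is only an illustration of the intended behavior under assumed semantics (the function name `drop_field` and the dotted-path convention are hypothetical), not Spark's implementation:

```python
def drop_field(struct, dotted_name):
    """Return a copy of `struct` (a dict modeling a struct value) with the
    field at `dotted_name` removed, e.g. drop_field(row, "a.b")."""
    head, _, rest = dotted_name.partition(".")
    out = dict(struct)  # shallow copy; the input is left untouched
    if rest:
        # Recurse into the nested struct; sibling fields are preserved.
        out[head] = drop_field(out[head], rest)
    else:
        out.pop(head, None)
    return out
```

For example, dropping "a.b" from {"a": {"b": 1, "c": 2}, "d": 3} keeps "a.c" and "d" intact, which is the behavior a Column-level dropFields would presumably provide.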
[jira] [Commented] (SPARK-32083) Unnecessary tasks are launched when input is empty with AQE
[ https://issues.apache.org/jira/browse/SPARK-32083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168946#comment-17168946 ] Apache Spark commented on SPARK-32083: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/29321 > Unnecessary tasks are launched when input is empty with AQE > --- > > Key: SPARK-32083 > URL: https://issues.apache.org/jira/browse/SPARK-32083 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Manu Zhang >Priority: Minor > > [https://github.com/apache/spark/pull/28226] meant to avoid launching > unnecessary tasks for 0-size partitions when AQE is enabled. However, when > all partitions are empty, the number of partitions will be > `spark.sql.adaptive.coalescePartitions.initialPartitionNum` and (a lot of) > unnecessary tasks are launched in this case. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22947) SPIP: as-of join in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-22947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168904#comment-17168904 ] Xuyan Xiao commented on SPARK-22947: [~icexelloss] Any update on this issue? > SPIP: as-of join in Spark SQL > - > > Key: SPARK-22947 > URL: https://issues.apache.org/jira/browse/SPARK-22947 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.1 >Reporter: Li Jin >Priority: Major > Attachments: SPIP_ as-of join in Spark SQL (1).pdf > > > h2. Background and Motivation > Time series analysis is one of the most common analyses on financial data. In > time series analysis, as-of join is a very common operation. Supporting as-of > join in Spark SQL will allow many use cases of using Spark SQL for time > series analysis. > As-of join is “join on time” with inexact time matching criteria. Various > libraries have implemented as-of join or similar functionality: > Kdb: https://code.kx.com/wiki/Reference/aj > Pandas: > http://pandas.pydata.org/pandas-docs/version/0.19.0/merging.html#merging-merge-asof > R: This functionality is called “Last Observation Carried Forward” > https://www.rdocumentation.org/packages/zoo/versions/1.8-0/topics/na.locf > JuliaDB: http://juliadb.org/latest/api/joins.html#IndexedTables.asofjoin > Flint: https://github.com/twosigma/flint#temporal-join-functions > This proposal advocates introducing a new API in Spark SQL to support as-of > join. > h2. Target Personas > Data scientists, data engineers > h2. Goals > * New API in Spark SQL that allows as-of join > * As-of join of multiple tables (>2) should be performant, because it’s very > common that users need to join multiple data sources together for further > analysis. > * Define Distribution, Partitioning and shuffle strategy for ordered time > series data > h2. 
Non-Goals > These are out of scope for the existing SPIP and should be considered in a future > SPIP as improvements to Spark’s time series analysis ability: > * Utilize partition information from the data source, i.e., begin/end of each > partition, to reduce sorting/shuffling > * Define an API for users to implement as-of join time specs in business calendars > (e.g. lookback of one business day; this is very common in financial data > analysis because of market calendars) > * Support broadcast join > h2. Proposed API Changes > h3. TimeContext > TimeContext is an object that defines the time scope of the analysis; it has a > begin time (inclusive) and an end time (exclusive). Users should be able to > change the time scope of the analysis (e.g., from one month to five years) by > just changing the TimeContext. > To the Spark engine, TimeContext is a hint that: > can be used to repartition data for joins > can serve as a predicate that can be pushed down to the storage layer > Time context is similar to filtering time by begin/end; the main difference > is that the time context can be expanded based on the operation taken (see > example in as-of join). > Time context example: > {code:java} > TimeContext timeContext = TimeContext("20160101", "20170101") > {code} > h3. asofJoin > h4. Use Case A (join without key) > Join two DataFrames on time, with one day lookback: > {code:java} > TimeContext timeContext = TimeContext("20160101", "20170101") > dfA = ... > dfB = ... > JoinSpec joinSpec = JoinSpec(timeContext).on("time").tolerance("-1day") > result = dfA.asofJoin(dfB, joinSpec) > {code} > Example input/output: > {code:java} > dfA: > time, quantity > 20160101, 100 > 20160102, 50 > 20160104, -50 > 20160105, 100 > dfB: > time, price > 20151231, 100.0 > 20160104, 105.0 > 20160105, 102.0 > output: > time, quantity, price > 20160101, 100, 100.0 > 20160102, 50, null > 20160104, -50, 105.0 > 20160105, 100, 102.0 > {code} > Note row (20160101, 100) of dfA is joined with (20151231, 100.0) of dfB. 
This > is an important illustration of the time context - it is able to expand the > context to 20151231 on dfB because of the 1 day lookback. > h4. Use Case B (join with key) > To join on time and another key (for instance, id), we use “by” to specify > the key. > {code:java} > TimeContext timeContext = TimeContext("20160101", "20170101") > dfA = ... > dfB = ... > JoinSpec joinSpec = > JoinSpec(timeContext).on("time").by("id").tolerance("-1day") > result = dfA.asofJoin(dfB, joinSpec) > {code} > Example input/output: > {code:java} > dfA: > time, id, quantity > 20160101, 1, 100 > 20160101, 2, 50 > 20160102, 1, -50 > 20160102, 2, 50 > dfB: > time, id, price > 20151231, 1, 100.0 > 20150102, 1, 105.0 > 20150102, 2, 195.0 > Output: > time, id, quantity, price > 20160101, 1, 100, 100.0 > 20160101, 2, 50, null > 20160102, 1, -50, 105.0 > 20160102, 2, 50, 195.0 > {code} > h2. Optional Design Sketch > h3. Implementation A > (This is just initial thought of how to implement
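The Use Case A semantics above (a backward as-of match with a one-day lookback tolerance) can be sketched in plain Python. This is only an illustration of the join semantics with hypothetical names, not the proposed Spark implementation:

```python
from bisect import bisect_right
from datetime import date, timedelta

def asof_join(left, right, tolerance):
    """For each (time, value) row of `left`, attach the value of the most
    recent `right` row whose time is <= the left time and within
    `tolerance`; attach None when no such row exists.
    `right` must be sorted by time."""
    right_times = [t for t, _ in right]
    out = []
    for t, v in left:
        i = bisect_right(right_times, t) - 1  # latest right time <= t
        if i >= 0 and t - right_times[i] <= tolerance:
            out.append((t, v, right[i][1]))
        else:
            out.append((t, v, None))
    return out

# The dfA/dfB example from Use Case A, with a one-day lookback:
dfA = [(date(2016, 1, 1), 100), (date(2016, 1, 2), 50),
       (date(2016, 1, 4), -50), (date(2016, 1, 5), 100)]
dfB = [(date(2015, 12, 31), 100.0), (date(2016, 1, 4), 105.0),
       (date(2016, 1, 5), 102.0)]
result = asof_join(dfA, dfB, timedelta(days=1))
```

As in the example table, 20160102 gets null because the closest earlier dfB row (20151231) falls outside the one-day tolerance, while 20160101 matches it.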
[jira] [Commented] (SPARK-32495) Update jackson versions from 2.4.6 and so on(2.4.x)
[ https://issues.apache.org/jira/browse/SPARK-32495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168868#comment-17168868 ] SHOBHIT SHUKLA commented on SPARK-32495: [~prashant] Here are the links where the jackson-databind community mentioned in which versions they fixed the above-mentioned CVEs. https://github.com/advisories/GHSA-h592-38cm-4ggp https://github.com/advisories/GHSA-w3f4-3q6j-rh82 > Update jackson versions from 2.4.6 and so on(2.4.x) > --- > > Key: SPARK-32495 > URL: https://issues.apache.org/jira/browse/SPARK-32495 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 2.4.6 >Reporter: SHOBHIT SHUKLA >Priority: Major > > As a vulnerability for Fasterxml Jackson version 2.6.7.3 is affected by > CVE-2017-15095 and CVE-2018-5968 CVEs > [https://nvd.nist.gov/vuln/detail/CVE-2018-5968], Would it be possible to > upgrade the jackson version for spark-2.4.6 and so on(2.4.x). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32495) Update jackson versions from 2.4.6 and so on(2.4.x)
[ https://issues.apache.org/jira/browse/SPARK-32495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHOBHIT SHUKLA updated SPARK-32495: --- Description: As a vulnerability for Fasterxml Jackson version 2.6.7.3 is affected by CVE-2017-15095 and CVE-2018-5968 CVEs [https://nvd.nist.gov/vuln/detail/CVE-2018-5968], Would it be possible to upgrade the jackson version for spark-2.4.6 and so on(2.4.x). was: As a vulnerability for Fasterxml Jackson version 2.6.7.3 is affected by CVE-2017-15095 and CVE-2018-5968 CVEs [as a vulnerability for jackson-databind 2.6.7.3], Would it be possible to upgrade the jackson version for spark-2.4.6 and so on(2.4.x). > Update jackson versions from 2.4.6 and so on(2.4.x) > --- > > Key: SPARK-32495 > URL: https://issues.apache.org/jira/browse/SPARK-32495 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 2.4.6 >Reporter: SHOBHIT SHUKLA >Priority: Major > > As a vulnerability for Fasterxml Jackson version 2.6.7.3 is affected by > CVE-2017-15095 and CVE-2018-5968 CVEs > [https://nvd.nist.gov/vuln/detail/CVE-2018-5968], Would it be possible to > upgrade the jackson version for spark-2.4.6 and so on(2.4.x). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-32495) Update jackson versions from 2.4.6 and so on(2.4.x)
[ https://issues.apache.org/jira/browse/SPARK-32495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168853#comment-17168853 ] SHOBHIT SHUKLA edited comment on SPARK-32495 at 7/31/20, 1:15 PM: -- Above description updated with actual CVEs what we are looking for fix, The NVD databases lists CVE-2017-15095 and CVE-2018-5968 as a vulnerability for jackson-databind 2.6.7.3 release. https://nvd.nist.gov/vuln/detail/CVE-2018-5968 was (Author: sshukla05): The NVD databases lists CVE-2017-15095 and CVE-2018-5968 as a vulnerability for jackson-databind 2.6.7.3 release. https://nvd.nist.gov/vuln/detail/CVE-2018-5968 > Update jackson versions from 2.4.6 and so on(2.4.x) > --- > > Key: SPARK-32495 > URL: https://issues.apache.org/jira/browse/SPARK-32495 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 2.4.6 >Reporter: SHOBHIT SHUKLA >Priority: Major > > As a vulnerability for Fasterxml Jackson version 2.6.7.3 is affected by > CVE-2017-15095 and CVE-2018-5968 CVEs [as a vulnerability for > jackson-databind 2.6.7.3], Would it be possible to upgrade the jackson > version for spark-2.4.6 and so on(2.4.x). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32495) Update jackson versions from 2.4.6 and so on(2.4.x)
[ https://issues.apache.org/jira/browse/SPARK-32495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHOBHIT SHUKLA updated SPARK-32495: --- Description: As a vulnerability for Fasterxml Jackson version 2.6.7.3 is affected by CVE-2017-15095 and CVE-2018-5968 CVEs [https://github.com/FasterXML/jackson-databind/issues/2186], Would it be possible to upgrade the jackson version to >= 2.9.8 for spark-2.4.6 and so on(2.4.x). was: Fasterxml Jackson version before 2.9.8 is affected by multiple CVEs [https://github.com/FasterXML/jackson-databind/issues/2186], Would it be possible to upgrade the jackson version to >= 2.9.8 for spark-2.4.6 and so on(2.4.x). > Update jackson versions from 2.4.6 and so on(2.4.x) > --- > > Key: SPARK-32495 > URL: https://issues.apache.org/jira/browse/SPARK-32495 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 2.4.6 >Reporter: SHOBHIT SHUKLA >Priority: Major > > As a vulnerability for Fasterxml Jackson version 2.6.7.3 is affected by > CVE-2017-15095 and CVE-2018-5968 CVEs > [https://github.com/FasterXML/jackson-databind/issues/2186], Would it be > possible to upgrade the jackson version to >= 2.9.8 for spark-2.4.6 and so > on(2.4.x). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32495) Update jackson versions from 2.4.6 and so on(2.4.x)
[ https://issues.apache.org/jira/browse/SPARK-32495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHOBHIT SHUKLA updated SPARK-32495: --- Description: As a vulnerability for Fasterxml Jackson version 2.6.7.3 is affected by CVE-2017-15095 and CVE-2018-5968 CVEs [as a vulnerability for jackson-databind 2.6.7.3], Would it be possible to upgrade the jackson version for spark-2.4.6 and so on(2.4.x). was: As a vulnerability for Fasterxml Jackson version 2.6.7.3 is affected by CVE-2017-15095 and CVE-2018-5968 CVEs [https://github.com/FasterXML/jackson-databind/issues/2186], Would it be possible to upgrade the jackson version to >= 2.9.8 for spark-2.4.6 and so on(2.4.x). > Update jackson versions from 2.4.6 and so on(2.4.x) > --- > > Key: SPARK-32495 > URL: https://issues.apache.org/jira/browse/SPARK-32495 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 2.4.6 >Reporter: SHOBHIT SHUKLA >Priority: Major > > As a vulnerability for Fasterxml Jackson version 2.6.7.3 is affected by > CVE-2017-15095 and CVE-2018-5968 CVEs [as a vulnerability for > jackson-databind 2.6.7.3], Would it be possible to upgrade the jackson > version for spark-2.4.6 and so on(2.4.x). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32509) Unused DPP Filter causes issue in canonicalization and prevents reuse exchange
[ https://issues.apache.org/jira/browse/SPARK-32509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168856#comment-17168856 ] Rohit Mishra commented on SPARK-32509: -- [~prakharjain09], While you are working on the pull request, could you please attach a code sample, environment details, or an output snapshot related to the bug, for the reference of the wider audience and to avoid any duplicate issues. Thanks. > Unused DPP Filter causes issue in canonicalization and prevents reuse exchange > -- > > Key: SPARK-32509 > URL: https://issues.apache.org/jira/browse/SPARK-32509 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1, 3.1.0 >Reporter: Prakhar Jain >Priority: Major > > As part of PlanDynamicPruningFilter rule, the unused DPP Filter are simply > replaced by `DynamicPruningExpression(TrueLiteral)` so that they can be > avoided. But these unnecessary`DynamicPruningExpression(TrueLiteral)` > partition filter inside the FileSourceScanExec affects the canonicalization > of the node and so in many cases, this can prevent ReuseExchange from > happening. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32495) Update jackson versions from 2.4.6 and so on(2.4.x)
[ https://issues.apache.org/jira/browse/SPARK-32495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168853#comment-17168853 ] SHOBHIT SHUKLA commented on SPARK-32495: The NVD databases lists CVE-2017-15095 and CVE-2018-5968 as a vulnerability for jackson-databind 2.6.7.3 release. https://nvd.nist.gov/vuln/detail/CVE-2018-5968 > Update jackson versions from 2.4.6 and so on(2.4.x) > --- > > Key: SPARK-32495 > URL: https://issues.apache.org/jira/browse/SPARK-32495 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 2.4.6 >Reporter: SHOBHIT SHUKLA >Priority: Major > > Fasterxml Jackson version before 2.9.8 is affected by multiple CVEs > [https://github.com/FasterXML/jackson-databind/issues/2186], Would it be > possible to upgrade the jackson version to >= 2.9.8 for spark-2.4.6 and so > on(2.4.x). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32507) Main Page
[ https://issues.apache.org/jira/browse/SPARK-32507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168850#comment-17168850 ] Apache Spark commented on SPARK-32507: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/29320 > Main Page > - > > Key: SPARK-32507 > URL: https://issues.apache.org/jira/browse/SPARK-32507 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > We should make a main package to overview PySpark properly. See the demo > example: > https://hyukjin-spark.readthedocs.io/en/latest/ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32507) Main Page
[ https://issues.apache.org/jira/browse/SPARK-32507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32507: Assignee: Hyukjin Kwon (was: Apache Spark) > Main Page > - > > Key: SPARK-32507 > URL: https://issues.apache.org/jira/browse/SPARK-32507 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > We should make a main package to overview PySpark properly. See the demo > example: > https://hyukjin-spark.readthedocs.io/en/latest/ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32507) Main Page
[ https://issues.apache.org/jira/browse/SPARK-32507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32507: Assignee: Apache Spark (was: Hyukjin Kwon) > Main Page > - > > Key: SPARK-32507 > URL: https://issues.apache.org/jira/browse/SPARK-32507 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > We should make a main package to overview PySpark properly. See the demo > example: > https://hyukjin-spark.readthedocs.io/en/latest/ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32480) Support insert overwrite to move the data to trash
[ https://issues.apache.org/jira/browse/SPARK-32480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168833#comment-17168833 ] Apache Spark commented on SPARK-32480: -- User 'Udbhav30' has created a pull request for this issue: https://github.com/apache/spark/pull/29319 > Support insert overwrite to move the data to trash > --- > > Key: SPARK-32480 > URL: https://issues.apache.org/jira/browse/SPARK-32480 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.1.0 >Reporter: jobit mathew >Priority: Minor > > Instead of deleting the data, move the data to trash. Data can then be deleted > permanently from the trash, based on configuration. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32509) Unused DPP Filter causes issue in canonicalization and prevents reuse exchange
[ https://issues.apache.org/jira/browse/SPARK-32509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168831#comment-17168831 ] Apache Spark commented on SPARK-32509: -- User 'prakharjain09' has created a pull request for this issue: https://github.com/apache/spark/pull/29318 > Unused DPP Filter causes issue in canonicalization and prevents reuse exchange > -- > > Key: SPARK-32509 > URL: https://issues.apache.org/jira/browse/SPARK-32509 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1, 3.1.0 >Reporter: Prakhar Jain >Priority: Major > > As part of PlanDynamicPruningFilter rule, the unused DPP Filter are simply > replaced by `DynamicPruningExpression(TrueLiteral)` so that they can be > avoided. But these unnecessary`DynamicPruningExpression(TrueLiteral)` > partition filter inside the FileSourceScanExec affects the canonicalization > of the node and so in many cases, this can prevent ReuseExchange from > happening. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32510) JDBC doesn't check duplicate column names in nested structures
[ https://issues.apache.org/jira/browse/SPARK-32510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168830#comment-17168830 ] Apache Spark commented on SPARK-32510: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/29317 > JDBC doesn't check duplicate column names in nested structures > --- > > Key: SPARK-32510 > URL: https://issues.apache.org/jira/browse/SPARK-32510 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > JdbcUtils.getCustomSchema calls checkColumnNameDuplication() which checks > duplicates on top-level but not in nested structures as other built-in > datasources do, see > [https://github.com/apache/spark/blob/8bc799f92005c903868ef209f5aec8deb6ccce5a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L822-L823] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32480) Support insert overwrite to move the data to trash
[ https://issues.apache.org/jira/browse/SPARK-32480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32480: Assignee: Apache Spark > Support insert overwrite to move the data to trash > --- > > Key: SPARK-32480 > URL: https://issues.apache.org/jira/browse/SPARK-32480 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.1.0 >Reporter: jobit mathew >Assignee: Apache Spark >Priority: Minor > > Instead of deleting the data, move the data to trash.So from trash based on > configuration data can be deleted permanently. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32509) Unused DPP Filter causes issue in canonicalization and prevents reuse exchange
[ https://issues.apache.org/jira/browse/SPARK-32509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32509: Assignee: (was: Apache Spark) > Unused DPP Filter causes issue in canonicalization and prevents reuse exchange > -- > > Key: SPARK-32509 > URL: https://issues.apache.org/jira/browse/SPARK-32509 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1, 3.1.0 >Reporter: Prakhar Jain >Priority: Major > > As part of PlanDynamicPruningFilter rule, the unused DPP Filter are simply > replaced by `DynamicPruningExpression(TrueLiteral)` so that they can be > avoided. But these unnecessary`DynamicPruningExpression(TrueLiteral)` > partition filter inside the FileSourceScanExec affects the canonicalization > of the node and so in many cases, this can prevent ReuseExchange from > happening. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32509) Unused DPP Filter causes issue in canonicalization and prevents reuse exchange
[ https://issues.apache.org/jira/browse/SPARK-32509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32509: Assignee: Apache Spark > Unused DPP Filter causes issue in canonicalization and prevents reuse exchange > -- > > Key: SPARK-32509 > URL: https://issues.apache.org/jira/browse/SPARK-32509 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1, 3.1.0 >Reporter: Prakhar Jain >Assignee: Apache Spark >Priority: Major > > As part of PlanDynamicPruningFilter rule, the unused DPP Filter are simply > replaced by `DynamicPruningExpression(TrueLiteral)` so that they can be > avoided. But these unnecessary`DynamicPruningExpression(TrueLiteral)` > partition filter inside the FileSourceScanExec affects the canonicalization > of the node and so in many cases, this can prevent ReuseExchange from > happening. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32510) JDBC doesn't check duplicate column names in nested structures
[ https://issues.apache.org/jira/browse/SPARK-32510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32510: Assignee: Apache Spark > JDBC doesn't check duplicate column names in nested structures > --- > > Key: SPARK-32510 > URL: https://issues.apache.org/jira/browse/SPARK-32510 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Major > > JdbcUtils.getCustomSchema calls checkColumnNameDuplication() which checks > duplicates on top-level but not in nested structures as other built-in > datasources do, see > [https://github.com/apache/spark/blob/8bc799f92005c903868ef209f5aec8deb6ccce5a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L822-L823] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32480) Support insert overwrite to move the data to trash
[ https://issues.apache.org/jira/browse/SPARK-32480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32480: Assignee: (was: Apache Spark) > Support insert overwrite to move the data to trash > --- > > Key: SPARK-32480 > URL: https://issues.apache.org/jira/browse/SPARK-32480 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.1.0 >Reporter: jobit mathew >Priority: Minor > > Instead of deleting the data, move the data to trash.So from trash based on > configuration data can be deleted permanently. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32510) JDBC doesn't check duplicate column names in nested structures
[ https://issues.apache.org/jira/browse/SPARK-32510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32510: Assignee: (was: Apache Spark) > JDBC doesn't check duplicate column names in nested structures > --- > > Key: SPARK-32510 > URL: https://issues.apache.org/jira/browse/SPARK-32510 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > JdbcUtils.getCustomSchema calls checkColumnNameDuplication() which checks > duplicates on top-level but not in nested structures as other built-in > datasources do, see > [https://github.com/apache/spark/blob/8bc799f92005c903868ef209f5aec8deb6ccce5a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L822-L823] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32495) Update jackson versions from 2.4.6 and so on(2.4.x)
[ https://issues.apache.org/jira/browse/SPARK-32495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168825#comment-17168825 ] Prashant Sharma commented on SPARK-32495: - Furthermore, according to https://github.com/FasterXML/jackson-databind/commits/2.6 , version 2.6.7.3 has fixes for all the CVEs up to version 2.9.10, which is >= 2.9.8. > Update jackson versions from 2.4.6 and so on(2.4.x) > --- > > Key: SPARK-32495 > URL: https://issues.apache.org/jira/browse/SPARK-32495 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 2.4.6 >Reporter: SHOBHIT SHUKLA >Priority: Major > > Fasterxml Jackson versions before 2.9.8 are affected by multiple CVEs > [https://github.com/FasterXML/jackson-databind/issues/2186]. Would it be > possible to upgrade the jackson version to >= 2.9.8 for spark-2.4.6 and so > on (2.4.x)? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32510) JDBC doesn't check duplicate column names in nested structures
Maxim Gekk created SPARK-32510: -- Summary: JDBC doesn't check duplicate column names in nested structures Key: SPARK-32510 URL: https://issues.apache.org/jira/browse/SPARK-32510 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk JdbcUtils.getCustomSchema calls checkColumnNameDuplication() which checks duplicates on top-level but not in nested structures as other built-in datasources do, see [https://github.com/apache/spark/blob/8bc799f92005c903868ef209f5aec8deb6ccce5a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L822-L823] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
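The gap SPARK-32510 describes — duplicate checking only at the top level, not inside structs — can be shown with a small Spark-free sketch. This is plain Python over a toy schema representation (a list of (name, type) pairs, where a struct type is itself such a list); `find_duplicate_columns` is a hypothetical name, not Spark's actual `checkColumnNameDuplication`.

```python
def find_duplicate_columns(schema, prefix=""):
    """Recursively collect duplicate (case-insensitive) field names,
    including fields nested inside struct types."""
    seen, dups = set(), []
    for name, dtype in schema:
        key = name.lower()
        full = f"{prefix}{name}"
        if key in seen:
            dups.append(full)
        seen.add(key)
        if isinstance(dtype, list):  # a struct: a nested list of (name, type) pairs
            dups.extend(find_duplicate_columns(dtype, prefix=full + "."))
    return dups

schema = [
    ("id", "long"),
    ("info", [("city", "string"), ("CITY", "string")]),  # duplicate inside a struct
]
print(find_duplicate_columns(schema))  # ['info.CITY']
```

A check that only scans the top-level names (as the linked JdbcUtils code does) would report nothing here, because `id` and `info` are unique; only the recursive walk catches `info.CITY`.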
[jira] [Created] (SPARK-32509) Unused DPP Filter causes issue in canonicalization and prevents reuse exchange
Prakhar Jain created SPARK-32509: Summary: Unused DPP Filter causes issue in canonicalization and prevents reuse exchange Key: SPARK-32509 URL: https://issues.apache.org/jira/browse/SPARK-32509 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0, 3.0.1, 3.1.0 Reporter: Prakhar Jain As part of the PlanDynamicPruningFilter rule, unused DPP filters are simply replaced by `DynamicPruningExpression(TrueLiteral)` so that they can be skipped. But these unnecessary `DynamicPruningExpression(TrueLiteral)` partition filters inside the FileSourceScanExec affect the canonicalization of the node, so in many cases this can prevent ReuseExchange from happening. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
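The fix this issue implies — strip always-true placeholder filters during canonicalization so that otherwise-identical scans compare equal — can be sketched without Spark. The tuple encoding of expressions below is purely illustrative; the point is that two plans differing only by a `TrueLiteral` filter should canonicalize identically, making them eligible for exchange reuse.

```python
TRUE_LITERAL = ("literal", True)

def canonicalize(partition_filters):
    """Drop always-true placeholder filters before comparing plans, so a
    scan carrying DynamicPruningExpression(TrueLiteral) canonicalizes the
    same as an identical scan without it."""
    return sorted(f for f in partition_filters if f != TRUE_LITERAL)

# One scan had its unused DPP filter replaced by a true literal; the other did not.
scan_a = [("eq", "date", "2020-07-31"), TRUE_LITERAL]
scan_b = [("eq", "date", "2020-07-31")]

# Without stripping, the filter lists differ and reuse is blocked;
# after canonicalization they match.
assert canonicalize(scan_a) == canonicalize(scan_b)
```

With the placeholder removed, a reuse rule that keys exchanges by canonicalized plan would recognize the two scans as the same node.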
[jira] [Assigned] (SPARK-32508) Disallow empty part col values in partition spec before static partition writing
[ https://issues.apache.org/jira/browse/SPARK-32508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32508: Assignee: (was: Apache Spark) > Disallow empty part col values in partition spec before static partition > writing > > > Key: SPARK-32508 > URL: https://issues.apache.org/jira/browse/SPARK-32508 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: dzcxzl >Priority: Trivial > > When writing to the current static partition, the partition field is empty, > and an error will be reported when all tasks are completed. > We can prevent such behavior before submitting the task. > > {code:java} > org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: get partition: Value for > key d is null or empty; > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:113) > at > org.apache.spark.sql.hive.HiveExternalCatalog.getPartitionOption(HiveExternalCatalog.scala:1212) > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.getPartitionOption(ExternalCatalogWithListener.scala:240) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.processInsert(InsertIntoHiveTable.scala:276) > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32508) Disallow empty part col values in partition spec before static partition writing
[ https://issues.apache.org/jira/browse/SPARK-32508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32508: Assignee: Apache Spark > Disallow empty part col values in partition spec before static partition > writing > > > Key: SPARK-32508 > URL: https://issues.apache.org/jira/browse/SPARK-32508 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: dzcxzl >Assignee: Apache Spark >Priority: Trivial > > When writing to the current static partition, the partition field is empty, > and an error will be reported when all tasks are completed. > We can prevent such behavior before submitting the task. > > {code:java} > org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: get partition: Value for > key d is null or empty; > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:113) > at > org.apache.spark.sql.hive.HiveExternalCatalog.getPartitionOption(HiveExternalCatalog.scala:1212) > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.getPartitionOption(ExternalCatalogWithListener.scala:240) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.processInsert(InsertIntoHiveTable.scala:276) > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32508) Disallow empty part col values in partition spec before static partition writing
[ https://issues.apache.org/jira/browse/SPARK-32508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168630#comment-17168630 ] Apache Spark commented on SPARK-32508: -- User 'cxzl25' has created a pull request for this issue: https://github.com/apache/spark/pull/29316 > Disallow empty part col values in partition spec before static partition > writing > > > Key: SPARK-32508 > URL: https://issues.apache.org/jira/browse/SPARK-32508 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: dzcxzl >Priority: Trivial > > When writing to the current static partition, the partition field is empty, > and an error will be reported when all tasks are completed. > We can prevent such behavior before submitting the task. > > {code:java} > org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: get partition: Value for > key d is null or empty; > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:113) > at > org.apache.spark.sql.hive.HiveExternalCatalog.getPartitionOption(HiveExternalCatalog.scala:1212) > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.getPartitionOption(ExternalCatalogWithListener.scala:240) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.processInsert(InsertIntoHiveTable.scala:276) > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32508) Disallow empty part col values in partition spec before static partition writing
dzcxzl created SPARK-32508: -- Summary: Disallow empty part col values in partition spec before static partition writing Key: SPARK-32508 URL: https://issues.apache.org/jira/browse/SPARK-32508 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: dzcxzl When writing with a static partition spec, if a partition column value is empty, the error is only reported after all tasks have completed. We can reject such a spec before submitting the job. {code:java} org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: get partition: Value for key d is null or empty; at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:113) at org.apache.spark.sql.hive.HiveExternalCatalog.getPartitionOption(HiveExternalCatalog.scala:1212) at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.getPartitionOption(ExternalCatalogWithListener.scala:240) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.processInsert(InsertIntoHiveTable.scala:276) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
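The fail-fast check the issue proposes amounts to validating the partition spec up front, before any task runs. A minimal sketch in plain Python, assuming a partition spec modeled as a dict; `check_partition_spec` is a hypothetical name, not Spark's analysis-time check.

```python
def check_partition_spec(spec):
    """Fail fast, before submitting any tasks, if a static partition
    column value is None or empty."""
    for col, value in spec.items():
        if value is None or str(value).strip() == "":
            raise ValueError(
                f"Partition spec contains an empty value for key '{col}'")

check_partition_spec({"d": "2020-08-01"})  # valid spec: no error
try:
    check_partition_spec({"d": ""})        # empty value: rejected immediately
except ValueError as e:
    print(e)  # Partition spec contains an empty value for key 'd'
```

Running this check at analysis time surfaces the problem immediately, instead of the HiveException above appearing only after every write task has finished.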
[jira] [Commented] (SPARK-32495) Update jackson versions from 2.4.6 and so on(2.4.x)
[ https://issues.apache.org/jira/browse/SPARK-32495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168628#comment-17168628 ] Prashant Sharma commented on SPARK-32495: - Is this not specific to, jackson-databind? And the link says, version 2.6.7.3 also has the fix to CVEs. It appears, the issue SPARK-30333 already fixed it. Did I miss anything here? > Update jackson versions from 2.4.6 and so on(2.4.x) > --- > > Key: SPARK-32495 > URL: https://issues.apache.org/jira/browse/SPARK-32495 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 2.4.6 >Reporter: SHOBHIT SHUKLA >Priority: Major > > Fasterxml Jackson version before 2.9.8 is affected by multiple CVEs > [https://github.com/FasterXML/jackson-databind/issues/2186], Would it be > possible to upgrade the jackson version to >= 2.9.8 for spark-2.4.6 and so > on(2.4.x). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32268) Bloom Filter Join
[ https://issues.apache.org/jira/browse/SPARK-32268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168582#comment-17168582 ] Yuming Wang commented on SPARK-32268: - {code:scala} import org.apache.spark.benchmark.Benchmark import org.apache.spark.util.sketch.BloomFilter val N = 2 val path = "/tmp/spark/bloomfilter" spark.range(N).write.mode("overwrite").parquet(path) val benchmark = new Benchmark(s"Benchmark bloom filter, HashSet and In", valuesPerIteration = N, minNumIters = 2) Seq(10, 100, 1000, 1, 10, 100, 500, 1000).foreach { items => benchmark.addCase(s"bloom filter with ${items}") { _ => val br = BloomFilter.create(items) (1 to items).foreach(br.putLong(_)) println(com.carrotsearch.sizeof.RamUsageEstimator.sizeOf(br)) spark.read.parquet(path).filter(r => br.mightContainLong(r.getLong(0))).write.format("noop").mode("overwrite").save() } benchmark.addCase(s"HashSet with ${items}") { _ => val hs = new java.util.HashSet[Long](items) (1 to items).foreach(hs.add(_)) println(com.carrotsearch.sizeof.RamUsageEstimator.sizeOf(hs)) spark.read.parquet(path).filter(r => hs.contains(r.getLong(0))).write.format("noop").mode("overwrite").save() } if (items <= 10) { benchmark.addCase(s"in with ${items}") { _ => spark.read.parquet(path).filter(s"id in (${(1 to items).mkString(", ")})").write.format("noop").mode("overwrite").save() } } } benchmark.run() } {code} |num Items|bloom filter(ms)|hash set(ms)|In predicate(ms)|sizeOf bloom filter(bytes)|sizeOf hash set(bytes)| |10|6635|4147|602|80|720| |100|7374|3908|3195|160|6720| |1000|7490|5095|4126|984|64288| |1|7533|4196|6081|9192|625632| |10|7838|5154|23406|91296|6648672| |100|8950|8425|N/A|912376|64388704| |500|10465|25624|N/A|4561592|313554528| |1000|19287|104596|N/A|9123120|627108960| > Bloom Filter Join > - > > Key: SPARK-32268 > URL: https://issues.apache.org/jira/browse/SPARK-32268 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming 
Wang >Assignee: Yuming Wang >Priority: Major > Attachments: q16-bloom-filter.jpg, q16-default.jpg > > > We can improve the performance of some joins by pre-filtering one side of a > join using a Bloom filter and IN predicate generated from the values from the > other side of the join. > For > example:[tpcds/q16.sql|https://github.com/apache/spark/blob/a78d6ce376edf2a8836e01f47b9dff5371058d4c/sql/core/src/test/resources/tpcds/q16.sql]. > [Before this > optimization|https://issues.apache.org/jira/secure/attachment/13007418/q16-default.jpg]. > [After this > optimization|https://issues.apache.org/jira/secure/attachment/13007416/q16-bloom-filter.jpg]. > *Query Performance Benchmarks: TPC-DS Performance Evaluation* > Our setup for running TPC-DS benchmark was as follows: TPC-DS 5T and > Partitioned Parquet table > > |Query|Default(Seconds)|Enable Bloom Filter Join(Seconds)| > |tpcds q16|84|46| > |tpcds q36|29|21| > |tpcds q57|39|28| > |tpcds q94|42|34| > |tpcds q95|306|288| -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
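The core idea of SPARK-32268 — pre-filter the probe side of a join with a Bloom filter built from the build side — can be shown with a small self-contained sketch. This is plain Python, not Spark's `org.apache.spark.util.sketch.BloomFilter`; the bit size and hash count are arbitrary illustrative choices.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: no false negatives, tunable false-positive rate."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        # Derive k bit positions from slices of one SHA-256 digest.
        digest = hashlib.sha256(str(item).encode()).digest()
        for i in range(self.num_hashes):
            yield int.from_bytes(digest[4 * i: 4 * i + 4], "big") % self.num_bits

    def put(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):
        # All k bits set => possibly present; any bit clear => definitely absent.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

# Build side: the small table's join keys.
build_keys = [3, 17, 42]
bf = BloomFilter()
for k in build_keys:
    bf.put(k)

# Probe side: filter the large table's rows before the actual join.
survivors = [v for v in range(100) if bf.might_contain(v)]
```

Every build-side key is guaranteed to survive the pre-filter (no false negatives), while most non-matching probe rows are dropped early, which is what shrinks the join input in the q16 benchmark above.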
[jira] [Commented] (SPARK-31894) Introduce UnsafeRow format validation for streaming state store
[ https://issues.apache.org/jira/browse/SPARK-31894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168564#comment-17168564 ] Apache Spark commented on SPARK-31894: -- User 'xuanyuanking' has created a pull request for this issue: https://github.com/apache/spark/pull/29315 > Introduce UnsafeRow format validation for streaming state store > --- > > Key: SPARK-31894 > URL: https://issues.apache.org/jira/browse/SPARK-31894 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.1.0 >Reporter: Yuanjian Li >Assignee: Yuanjian Li >Priority: Major > Fix For: 3.1.0 > > > Currently, Structured Streaming directly puts the UnsafeRow into StateStore > without any schema validation. It's a dangerous behavior when users reusing > the checkpoint file during migration. Any changes or bug fix related to the > aggregate function may cause random exceptions, even the wrong answer, e.g > SPARK-28067. > Here we introduce an UnsafeRow format validation for the state store. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
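The state-store validation SPARK-31894 introduces can be sketched in miniature: when a query restarts from an old checkpoint, each restored row should be checked against the schema the current query expects, so a mismatch fails loudly instead of silently producing wrong aggregates. This plain-Python sketch models a row as a tuple and a schema as (name, type) pairs; it is illustrative, not Spark's actual UnsafeRow byte-level validation.

```python
def validate_row(row, schema):
    """Check that a row restored from a checkpoint still matches the schema
    the current query expects; raise immediately on any mismatch."""
    if len(row) != len(schema):
        raise ValueError(
            f"expected {len(schema)} fields, state row has {len(row)}")
    for value, (name, expected_type) in zip(row, schema):
        if value is not None and not isinstance(value, expected_type):
            raise TypeError(
                f"field '{name}': expected {expected_type.__name__}, "
                f"got {type(value).__name__}")

schema = [("count", int), ("key", str)]
validate_row((1, "a"), schema)   # matches: no error
try:
    validate_row((1, 2), schema)  # second field changed type across versions
except TypeError as e:
    print(e)
```

Detecting the drift at restore time is the whole point: without the check, a reused checkpoint with a changed aggregate layout can yield random exceptions or wrong answers much later.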
[jira] [Updated] (SPARK-32507) Main Page
[ https://issues.apache.org/jira/browse/SPARK-32507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-32507: - Description: We should make a main package to overview PySpark properly. See the demo example: https://hyukjin-spark.readthedocs.io/en/latest/ was: We should make a main package to overview PySpark properly. See the demo example: https://spark.apache.org/docs/latest/api/python/index.html > Main Page > - > > Key: SPARK-32507 > URL: https://issues.apache.org/jira/browse/SPARK-32507 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > We should make a main package to overview PySpark properly. See the demo > example: > https://hyukjin-spark.readthedocs.io/en/latest/ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32507) Main Page
[ https://issues.apache.org/jira/browse/SPARK-32507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-32507: - Description: We should make a main package to overview PySpark properly. See the demo example: https://spark.apache.org/docs/latest/api/python/index.html was:We should make a main package to overview PySpark properly. See the demo example: > Main Page > - > > Key: SPARK-32507 > URL: https://issues.apache.org/jira/browse/SPARK-32507 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > We should make a main package to overview PySpark properly. See the demo > example: > https://spark.apache.org/docs/latest/api/python/index.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32507) Main Page
Hyukjin Kwon created SPARK-32507: Summary: Main Page Key: SPARK-32507 URL: https://issues.apache.org/jira/browse/SPARK-32507 Project: Spark Issue Type: Sub-task Components: Documentation Affects Versions: 3.1.0 Reporter: Hyukjin Kwon Assignee: Hyukjin Kwon We should make a main package to overview PySpark properly. See the demo example: -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32506) flaky test: pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests
[ https://issues.apache.org/jira/browse/SPARK-32506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168525#comment-17168525 ] Wenchen Fan commented on SPARK-32506: - cc [~ruifengz] [~weichenxu123] > flaky test: > pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests > > > Key: SPARK-32506 > URL: https://issues.apache.org/jira/browse/SPARK-32506 > Project: Spark > Issue Type: Test > Components: MLlib >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Priority: Major > > {code} > FAIL: test_train_prediction > (pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests) > Test that error on test data improves as model is trained. > -- > Traceback (most recent call last): > File > "/home/runner/work/spark/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 466, in test_train_prediction > eventually(condition, timeout=180.0) > File "/home/runner/work/spark/spark/python/pyspark/testing/utils.py", line > 81, in eventually > lastValue = condition() > File > "/home/runner/work/spark/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 461, in condition > self.assertGreater(errors[1] - errors[-1], 2) > AssertionError: 1.672640157855923 not greater than 2 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32506) flaky test: pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests
Wenchen Fan created SPARK-32506: --- Summary: flaky test: pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests Key: SPARK-32506 URL: https://issues.apache.org/jira/browse/SPARK-32506 Project: Spark Issue Type: Test Components: MLlib Affects Versions: 3.1.0 Reporter: Wenchen Fan {code} FAIL: test_train_prediction (pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests) Test that error on test data improves as model is trained. -- Traceback (most recent call last): File "/home/runner/work/spark/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 466, in test_train_prediction eventually(condition, timeout=180.0) File "/home/runner/work/spark/spark/python/pyspark/testing/utils.py", line 81, in eventually lastValue = condition() File "/home/runner/work/spark/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 461, in condition self.assertGreater(errors[1] - errors[-1], 2) AssertionError: 1.672640157855923 not greater than 2 {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32456) Check the Distinct by assuming it as Aggregate for Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-32456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuanjian Li updated SPARK-32456: Description: We want to fix 2 things here: 1. Give better error message for Distinct related operations in append mode that doesn't have a watermark Check the following example: {code:java} val s1 = spark.readStream.format("rate").option("rowsPerSecond", 1).load().createOrReplaceTempView("s1") val s2 = spark.readStream.format("rate").option("rowsPerSecond", 1).load().createOrReplaceTempView("s2") val unionRes = spark.sql("select s1.value , s1.timestamp from s1 union select s2.value, s2.timestamp from s2") unionResult.writeStream.option("checkpointLocation", ${pathA}).start(${pathB}){code} We'll get the following confusing exception: {code:java} java.util.NoSuchElementException: None.get at scala.None$.get(Option.scala:529) at scala.None$.get(Option.scala:527) at org.apache.spark.sql.execution.streaming.StateStoreSaveExec.$anonfun$doExecute$9(statefulOperators.scala:346) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.util.Utils$.timeTakenMs(Utils.scala:561) at org.apache.spark.sql.execution.streaming.StateStoreWriter.timeTakenMs(statefulOperators.scala:112) ... {code} The union clause in SQL has the requirement of deduplication, the parser will generate {{Distinct(Union)}} and the optimizer rule {{ReplaceDistinctWithAggregate}} will change it to {{Aggregate(Union)}}. So the root cause here is the checking logic for Aggregate is missing for Distinct. Actually it happens for all Distinct related operations in Structured Streaming, e.g {code:java} val df = spark.readStream.format("rate").load() df.createOrReplaceTempView("deduptest") val distinct = spark.sql("select distinct value from deduptest") distinct.writeStream.option("checkpointLocation", ${pathA}).start(${pathB}){code} 2. Make {{Distinct}} in complete mode runnable. 
The distinct in complete mode will throw the exception: {quote} {{Complete output mode not supported when there are no streaming aggregations on streaming DataFrames/Datasets;}} {quote} was: Check the following example: {code:java} val s1 = spark.readStream.format("rate").option("rowsPerSecond", 1).load().createOrReplaceTempView("s1") val s2 = spark.readStream.format("rate").option("rowsPerSecond", 1).load().createOrReplaceTempView("s2") val unionRes = spark.sql("select s1.value , s1.timestamp from s1 union select s2.value, s2.timestamp from s2") unionResult.writeStream.option("checkpointLocation", ${pathA}).start(${pathB}){code} We'll get the following confusing exception: {code:java} java.util.NoSuchElementException: None.get at scala.None$.get(Option.scala:529) at scala.None$.get(Option.scala:527) at org.apache.spark.sql.execution.streaming.StateStoreSaveExec.$anonfun$doExecute$9(statefulOperators.scala:346) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.util.Utils$.timeTakenMs(Utils.scala:561) at org.apache.spark.sql.execution.streaming.StateStoreWriter.timeTakenMs(statefulOperators.scala:112) ... {code} The union clause in SQL has the requirement of deduplication, the parser will generate {{Distinct(Union)}} and the optimizer rule {{ReplaceDistinctWithAggregate}} will change it to {{Aggregate(Union)}}. So the root cause here is the checking logic for Aggregate is missing for Distinct. 
Actually it happens for all Distinct related operations in Structured Streaming, e.g {code:java} val df = spark.readStream.format("rate").load() df.createOrReplaceTempView("deduptest") val distinct = spark.sql("select distinct value from deduptest") distinct.writeStream.option("checkpointLocation", ${pathA}).start(${pathB}){code} > Check the Distinct by assuming it as Aggregate for Structured Streaming > --- > > Key: SPARK-32456 > URL: https://issues.apache.org/jira/browse/SPARK-32456 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Yuanjian Li >Priority: Major > > We want to fix 2 things here: > 1. Give better error message for Distinct related operations in append mode > that doesn't have a watermark > Check the following example: > {code:java} > val s1 = spark.readStream.format("rate").option("rowsPerSecond", > 1).load().createOrReplaceTempView("s1") > val s2 = spark.readStream.format("rate").option("rowsPerSecond", > 1).load().createOrReplaceTempView("s2") > val unionRes = spark.sql("select s1.value , s1.timestamp from s1 union select > s2.value, s2.timestamp from s2") > unionResult.writeStream.option("checkpointLocation", > ${pathA}).start(
[jira] [Updated] (SPARK-32456) Check the Distinct by assuming it as Aggregate for Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-32456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuanjian Li updated SPARK-32456: Summary: Check the Distinct by assuming it as Aggregate for Structured Streaming (was: Give better error message for Distinct related operations in append mode without watermark) > Check the Distinct by assuming it as Aggregate for Structured Streaming > --- > > Key: SPARK-32456 > URL: https://issues.apache.org/jira/browse/SPARK-32456 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Yuanjian Li >Priority: Major > > Check the following example: > > {code:java} > val s1 = spark.readStream.format("rate").option("rowsPerSecond", > 1).load().createOrReplaceTempView("s1") > val s2 = spark.readStream.format("rate").option("rowsPerSecond", > 1).load().createOrReplaceTempView("s2") > val unionRes = spark.sql("select s1.value , s1.timestamp from s1 union select > s2.value, s2.timestamp from s2") > unionResult.writeStream.option("checkpointLocation", > ${pathA}).start(${pathB}){code} > We'll get the following confusing exception: > {code:java} > java.util.NoSuchElementException: None.get > at scala.None$.get(Option.scala:529) > at scala.None$.get(Option.scala:527) > at > org.apache.spark.sql.execution.streaming.StateStoreSaveExec.$anonfun$doExecute$9(statefulOperators.scala:346) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at org.apache.spark.util.Utils$.timeTakenMs(Utils.scala:561) > at > org.apache.spark.sql.execution.streaming.StateStoreWriter.timeTakenMs(statefulOperators.scala:112) > ... > {code} > The union clause in SQL has the requirement of deduplication, the parser will > generate {{Distinct(Union)}} and the optimizer rule > {{ReplaceDistinctWithAggregate}} will change it to {{Aggregate(Union)}}. So > the root cause here is the checking logic for Aggregate is missing for > Distinct. 
> > Actually it happens for all Distinct related operations in Structured > Streaming, e.g > {code:java} > val df = spark.readStream.format("rate").load() > df.createOrReplaceTempView("deduptest") > val distinct = spark.sql("select distinct value from deduptest") > distinct.writeStream.option("checkpointLocation", > ${pathA}).start(${pathB}){code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
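The root-cause chain described above — the parser produces Distinct(Union), and ReplaceDistinctWithAggregate turns Distinct into Aggregate, so any streaming check that only handles Aggregate misses Distinct — can be sketched with a toy plan rewrite. The tuple encoding of plan nodes here is purely illustrative, not Catalyst's API.

```python
def replace_distinct_with_aggregate(plan):
    """Mirror the effect of ReplaceDistinctWithAggregate: rewrite
    Distinct(child) as Aggregate(child, groupBy = all output columns).
    After this rewrite, any check that handles Aggregate also covers
    Distinct, including the Distinct(Union) the parser builds for UNION."""
    if plan[0] == "Distinct":
        _, child, columns = plan
        return ("Aggregate", child, columns)  # group by every column, no agg functions
    return plan

# UNION in SQL requires deduplication, so the parser wraps Union in Distinct.
union = ("Union", ["s1", "s2"])
distinct = ("Distinct", union, ["value", "timestamp"])
print(replace_distinct_with_aggregate(distinct))
# ('Aggregate', ('Union', ['s1', 's2']), ['value', 'timestamp'])
```

This is why the fix checks Distinct by assuming it is an Aggregate: after optimization the two are the same plan shape, and the streaming support checks should treat them identically.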