[jira] [Commented] (SPARK-32376) Make unionByName null-filling behavior work with struct columns

2020-07-31 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17169241#comment-17169241
 ] 

L. C. Hsieh commented on SPARK-32376:
-

Hi [~mukulmurthy], are you OK if I go ahead and work on this?

> Make unionByName null-filling behavior work with struct columns
> ---
>
> Key: SPARK-32376
> URL: https://issues.apache.org/jira/browse/SPARK-32376
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Mukul Murthy
>Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-29358 added support for
> unionByName to work when the two datasets don't necessarily have the same
> schema, but it does not work with nested columns such as structs.
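A minimal sketch of the gap (spark-shell, using the allowMissingColumns flag added by SPARK-29358 in 3.1.0; the column names are illustrative and the exact error depends on the Spark version):

{code:scala}
// Top-level missing columns are null-filled by unionByName(..., allowMissingColumns = true),
// but the struct column "s" below has different nested fields, so the union is rejected
// rather than the missing struct fields being null-filled.
val df1 = spark.sql("SELECT 1 AS id, named_struct('a', 1, 'b', 2) AS s")
val df2 = spark.sql("SELECT 2 AS id, named_struct('a', 3, 'c', 4) AS s")
val unioned = df1.unionByName(df2, allowMissingColumns = true)
{code}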






[jira] [Resolved] (SPARK-32467) Avoid encoding URL twice on https redirect

2020-07-31 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-32467.

Resolution: Fixed

The issue is resolved in https://github.com/apache/spark/pull/29271

> Avoid encoding URL twice on https redirect
> --
>
> Key: SPARK-32467
> URL: https://issues.apache.org/jira/browse/SPARK-32467
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Currently, on https redirect, the original URL is encoded as an HTTPS URL.
> However, the original URL could already be encoded, so the result of
> UriInfo.getQueryParameters will contain encoded keys and values. For example,
> a parameter order[0][dir] will become order%255B0%255D%255Bcolumn%255D after
> being encoded twice, and the decoded key in the result of
> UriInfo.getQueryParameters will be order%5B0%5D%5Bcolumn%5D.
> To fix the problem, we decode the query parameters before encoding them, to
> make sure the URL is encoded exactly once.
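A small standalone sketch of the double-encoding effect described above, using plain JDK APIs (illustration only, not the actual code path in the Spark UI):

{code:scala}
import java.net.{URLDecoder, URLEncoder}

val once  = URLEncoder.encode("order[0][dir]", "UTF-8") // "order%5B0%5D%5Bdir%5D" - encoded once
val twice = URLEncoder.encode(once, "UTF-8")            // "order%255B0%255D%255Bdir%255D" - encoded twice
// Decoding the twice-encoded value only once still leaves percent-escapes behind:
val stillEscaped = URLDecoder.decode(twice, "UTF-8")    // "order%5B0%5D%5Bdir%5D"
// Hence the fix: decode first, then encode, so the URL ends up encoded exactly once.
val exactlyOnce = URLEncoder.encode(URLDecoder.decode(once, "UTF-8"), "UTF-8")
{code}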






[jira] [Updated] (SPARK-32467) Avoid encoding URL twice on https redirect

2020-07-31 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-32467:
---
Affects Version/s: 3.0.1

> Avoid encoding URL twice on https redirect
> --
>
> Key: SPARK-32467
> URL: https://issues.apache.org/jira/browse/SPARK-32467
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Currently, on https redirect, the original URL is encoded as an HTTPS URL.
> However, the original URL could already be encoded, so the result of
> UriInfo.getQueryParameters will contain encoded keys and values. For example,
> a parameter order[0][dir] will become order%255B0%255D%255Bcolumn%255D after
> being encoded twice, and the decoded key in the result of
> UriInfo.getQueryParameters will be order%5B0%5D%5Bcolumn%5D.
> To fix the problem, we decode the query parameters before encoding them, to
> make sure the URL is encoded exactly once.






[jira] [Updated] (SPARK-32119) ExecutorPlugin doesn't work with Standalone Cluster and Kubernetes with --jars

2020-07-31 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-32119:
---
Description: 
ExecutorPlugin doesn't work with Standalone Cluster and Kubernetes
when a jar that contains the plugins, and files used by those plugins, are added
via the --jars and --files options of spark-submit.

This is because jars and files added via --jars and --files are not yet loaded
when the Executor is initialized.
I confirmed it works with YARN because there the jars/files are distributed via
the distributed cache.

  was:
ExecutorPlugin can't work with Standalone Cluster (maybe with other cluster 
manager too except YARN. ) 
 when a jar which contains plugins and files used by the plugins are added by 
--jars and --files option with spark-submit.

This is because jars and files added by --jars and --files are not loaded on 
Executor initialization.
 I confirmed it works with YARN because jars/files are distributed as 
distributed cache.


> ExecutorPlugin doesn't work with Standalone Cluster and Kubernetes with --jars
> --
>
> Key: SPARK-32119
> URL: https://issues.apache.org/jira/browse/SPARK-32119
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> ExecutorPlugin doesn't work with Standalone Cluster and Kubernetes
> when a jar that contains the plugins, and files used by those plugins, are added
> via the --jars and --files options of spark-submit.
> This is because jars and files added via --jars and --files are not yet loaded
> when the Executor is initialized.
> I confirmed it works with YARN because there the jars/files are distributed via
> the distributed cache.
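For context, a minimal plugin of the kind affected here, sketched against the org.apache.spark.api.plugin API (the class, package and jar names below are made up for illustration):

{code:scala}
package com.example

import java.util.{Map => JMap}

import org.apache.spark.api.plugin.{DriverPlugin, ExecutorPlugin, PluginContext, SparkPlugin}

class MyPlugin extends SparkPlugin {
  override def driverPlugin(): DriverPlugin = null

  override def executorPlugin(): ExecutorPlugin = new ExecutorPlugin {
    override def init(ctx: PluginContext, extraConf: JMap[String, String]): Unit = {
      // Executor-side initialization runs here. The problem described above is that, on
      // Standalone and Kubernetes, a jar shipped only via --jars is not yet on the executor
      // classpath at this point, so the plugin class cannot be loaded.
    }
  }
}
{code}

Such a plugin would be enabled with something like spark-submit --conf spark.plugins=com.example.MyPlugin --jars my-plugin.jar ... (paths and names hypothetical).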






[jira] [Updated] (SPARK-32119) ExecutorPlugin doesn't work with Standalone Cluster and Kubernetes with --jars

2020-07-31 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-32119:
---
Summary: ExecutorPlugin doesn't work with Standalone Cluster and Kubernetes 
with --jars  (was: ExecutorPlugin doesn't work with Standalone Cluster)

> ExecutorPlugin doesn't work with Standalone Cluster and Kubernetes with --jars
> --
>
> Key: SPARK-32119
> URL: https://issues.apache.org/jira/browse/SPARK-32119
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> ExecutorPlugin doesn't work with Standalone Cluster (and possibly with other
> cluster managers except YARN)
> when a jar that contains the plugins, and files used by those plugins, are added
> via the --jars and --files options of spark-submit.
> This is because jars and files added via --jars and --files are not yet loaded
> when the Executor is initialized.
> I confirmed it works with YARN because there the jars/files are distributed via
> the distributed cache.






[jira] [Comment Edited] (SPARK-32500) Query and Batch Id not set for Structured Streaming Jobs in case of ForeachBatch in PySpark

2020-07-31 Thread JinxinTang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17169219#comment-17169219
 ] 

JinxinTang edited comment on SPARK-32500 at 8/1/20, 2:24 AM:
-

Hi, [~abhishekd0907],

I have just tested the code on master, branch-2.4.6 and branch-3.0.0; it seems to
work fine in PySpark in all of them, as follows:

!image-2020-08-01-10-21-51-246.png!


was (Author: jinxintang):
Hi, [~abhishekd0907],

I have test the code in master, branch-2.4.6 and brach-3.0.0, seems all can 
work fine in pyspark as follows:

!image-2020-08-01-10-21-51-246.png!

> Query and Batch Id not set for Structured Streaming Jobs in case of 
> ForeachBatch in PySpark
> ---
>
> Key: SPARK-32500
> URL: https://issues.apache.org/jira/browse/SPARK-32500
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Structured Streaming
>Affects Versions: 2.4.6
>Reporter: Abhishek Dixit
>Priority: Major
> Attachments: Screen Shot 2020-07-30 at 9.04.21 PM.png, 
> image-2020-08-01-10-21-51-246.png
>
>
> Query Id and Batch Id information is not available for jobs started by a
> structured streaming query when the _foreachBatch_ API is used in PySpark.
> This happens only with foreachBatch in PySpark. ForeachBatch in Scala works
> fine, and other structured streaming sinks in PySpark also work fine. I am
> attaching a screenshot of the Jobs page.
> I think the job group is not set properly when _foreachBatch_ is used via
> PySpark. I have a framework that depends on the _queryId_ and _batchId_
> information available in the job properties, so my framework doesn't work
> for the PySpark foreachBatch use case.
>  
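For reference, the Scala-side foreachBatch shape that is reported to work correctly (a hedged sketch; the source, sink path and checkpoint location are made up). A named method is passed with `writeBatch _`, which also sidesteps an overload ambiguity some Scala 2.12 builds hit:

{code:scala}
import org.apache.spark.sql.DataFrame

// Sketch only: in Scala the batchId arrives as a plain argument, and the jobs this
// triggers are reported to carry the streaming query/batch information correctly.
def writeBatch(batchDF: DataFrame, batchId: Long): Unit = {
  batchDF.write.mode("append").format("parquet").save(s"/tmp/out/batch=$batchId") // hypothetical path
}

val inputStream = spark.readStream.format("rate").load() // any streaming source; "rate" used for illustration

val query = inputStream.writeStream
  .option("checkpointLocation", "/tmp/chk") // hypothetical checkpoint dir
  .foreachBatch(writeBatch _)
  .start()
{code}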






[jira] [Commented] (SPARK-32500) Query and Batch Id not set for Structured Streaming Jobs in case of ForeachBatch in PySpark

2020-07-31 Thread JinxinTang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17169219#comment-17169219
 ] 

JinxinTang commented on SPARK-32500:


Hi, [~abhishekd0907],

I have tested the code on master, branch-2.4.6 and branch-3.0.0; it seems to
work fine in PySpark in all of them, as follows:

!image-2020-08-01-10-21-51-246.png!

> Query and Batch Id not set for Structured Streaming Jobs in case of 
> ForeachBatch in PySpark
> ---
>
> Key: SPARK-32500
> URL: https://issues.apache.org/jira/browse/SPARK-32500
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Structured Streaming
>Affects Versions: 2.4.6
>Reporter: Abhishek Dixit
>Priority: Major
> Attachments: Screen Shot 2020-07-30 at 9.04.21 PM.png, 
> image-2020-08-01-10-21-51-246.png
>
>
> Query Id and Batch Id information is not available for jobs started by a
> structured streaming query when the _foreachBatch_ API is used in PySpark.
> This happens only with foreachBatch in PySpark. ForeachBatch in Scala works
> fine, and other structured streaming sinks in PySpark also work fine. I am
> attaching a screenshot of the Jobs page.
> I think the job group is not set properly when _foreachBatch_ is used via
> PySpark. I have a framework that depends on the _queryId_ and _batchId_
> information available in the job properties, so my framework doesn't work
> for the PySpark foreachBatch use case.
>  






[jira] [Updated] (SPARK-32500) Query and Batch Id not set for Structured Streaming Jobs in case of ForeachBatch in PySpark

2020-07-31 Thread JinxinTang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JinxinTang updated SPARK-32500:
---
Attachment: image-2020-08-01-10-21-51-246.png

> Query and Batch Id not set for Structured Streaming Jobs in case of 
> ForeachBatch in PySpark
> ---
>
> Key: SPARK-32500
> URL: https://issues.apache.org/jira/browse/SPARK-32500
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Structured Streaming
>Affects Versions: 2.4.6
>Reporter: Abhishek Dixit
>Priority: Major
> Attachments: Screen Shot 2020-07-30 at 9.04.21 PM.png, 
> image-2020-08-01-10-21-51-246.png
>
>
> Query Id and Batch Id information is not available for jobs started by a
> structured streaming query when the _foreachBatch_ API is used in PySpark.
> This happens only with foreachBatch in PySpark. ForeachBatch in Scala works
> fine, and other structured streaming sinks in PySpark also work fine. I am
> attaching a screenshot of the Jobs page.
> I think the job group is not set properly when _foreachBatch_ is used via
> PySpark. I have a framework that depends on the _queryId_ and _batchId_
> information available in the job properties, so my framework doesn't work
> for the PySpark foreachBatch use case.
>  






[jira] [Resolved] (SPARK-32514) Pyspark: Issue using sql query in foreachBatch sink

2020-07-31 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-32514.
--
Resolution: Not A Problem

Python doesn't allow omitting the parentheses on a no-argument method call,
whereas Scala does. Use `write()`, not `write`.


> Pyspark: Issue using sql query in foreachBatch sink
> ---
>
> Key: SPARK-32514
> URL: https://issues.apache.org/jira/browse/SPARK-32514
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.5, 2.4.6, 3.0.0
>Reporter: Muru
>Priority: Major
>
> In a PySpark Structured Streaming job, trying to use a SQL query instead of
> DataFrame API methods in a foreachBatch sink throws an
> AttributeError: 'JavaMember' object has no attribute 'format' exception.
>  
> However, the same thing works in a Scala job.
>  
> Please note, I tested on Spark 2.4.5/2.4.6 and 3.0.0 and got the same
> exception.
>  
> I noticed that I could perform other operations; only the write method fails.
>  
> Please let me know how to fix this issue.
>  
> See below code examples
> # Spark Scala method
> def processData(batchDF: DataFrame, batchId: Long) {
>    batchDF.createOrReplaceTempView("tbl")
>    val outdf=batchDF.sparkSession.sql("select action, count(*) as count from 
> tbl where date='2020-06-20' group by 1")
>    outdf.printSchema()
>    outdf.show
>    outdf.coalesce(1).write.format("csv").save("/tmp/agg")
> }
>  
> ## pyspark python method
> def process_data(bdf, bid):
>   lspark = bdf._jdf.sparkSession()
>   bdf.createOrReplaceTempView("tbl")
>   outdf=lspark.sql("select action, count(*) as count from tbl where 
> date='2020-06-20' group by 1")
>   outdf.printSchema()
>   # it works
>   outdf.show()
>   # throws AttributeError: 'JavaMember' object has no attribute 'format' 
> exception
>   outdf.coalesce(1).write.format("csv").save("/tmp/agg1") 
>  
> Here is the full exception 
> 20/07/24 16:31:24 ERROR streaming.MicroBatchExecution: Query [id = 
> 854a39d0-b944-4b52-bf05-cacf998e2cbd, runId = 
> e3d4dc7d-80e1-4164-8310-805d7713fc96] terminated with error
> py4j.Py4JException: An exception was raised by the Python Proxy. Return 
> Message: Traceback (most recent call last):
>   File 
> "/Users/muru/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 
> 2381, in _call_proxy
>     return_value = getattr(self.pool[obj_id], method)(*params)
>   File "/Users/muru/spark/python/pyspark/sql/utils.py", line 191, in call
>     raise e
> AttributeError: 'JavaMember' object has no attribute 'format'
> at py4j.Protocol.getReturnValue(Protocol.java:473)
> at py4j.reflection.PythonProxyHandler.invoke(PythonProxyHandler.java:108)
> at com.sun.proxy.$Proxy20.call(Unknown Source)
> at 
> org.apache.spark.sql.execution.streaming.sources.PythonForeachBatchHelper$$anonfun$callForeachBatch$1.apply(ForeachBatchSink.scala:55)
> at 
> org.apache.spark.sql.execution.streaming.sources.PythonForeachBatchHelper$$anonfun$callForeachBatch$1.apply(ForeachBatchSink.scala:55)
> at 
> org.apache.spark.sql.execution.streaming.sources.ForeachBatchSink.addBatch(ForeachBatchSink.scala:35)
> at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5$$anonfun$apply$17.apply(MicroBatchExecution.scala:537)
> at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
> at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5.apply(MicroBatchExecution.scala:535)
> at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
> at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
> at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:534)
> at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:198)
> at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
> at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
> at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTim

[jira] [Commented] (SPARK-32402) Implement ALTER TABLE in JDBC Table Catalog

2020-07-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17169209#comment-17169209
 ] 

Apache Spark commented on SPARK-32402:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/29324

> Implement ALTER TABLE in JDBC Table Catalog
> ---
>
> Key: SPARK-32402
> URL: https://issues.apache.org/jira/browse/SPARK-32402
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> The PR https://github.com/apache/spark/pull/29168 adds basic implementation 
> of JDBC Table Catalog. This ticket aims to support table altering in:
> - JDBC dialects
> - JDBCTableCatalog
> and to add tests.
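For context, the kind of statements this ticket would enable against a JDBC-backed v2 catalog (a hedged sketch; the catalog name, connection options and table are hypothetical):

{code:scala}
// Illustration only: point a catalog named "h2" at the JDBC Table Catalog added by the PR above,
// then issue ALTER TABLE statements that the catalog/dialect must translate into JDBC DDL.
spark.conf.set("spark.sql.catalog.h2", "org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog")
spark.conf.set("spark.sql.catalog.h2.url", "jdbc:h2:mem:testdb")   // hypothetical connection
spark.conf.set("spark.sql.catalog.h2.driver", "org.h2.Driver")

spark.sql("ALTER TABLE h2.test.people ADD COLUMNS (age INT)")
spark.sql("ALTER TABLE h2.test.people RENAME COLUMN age TO years")
{code}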






[jira] [Assigned] (SPARK-32402) Implement ALTER TABLE in JDBC Table Catalog

2020-07-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32402:


Assignee: (was: Apache Spark)

> Implement ALTER TABLE in JDBC Table Catalog
> ---
>
> Key: SPARK-32402
> URL: https://issues.apache.org/jira/browse/SPARK-32402
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> The PR https://github.com/apache/spark/pull/29168 adds basic implementation 
> of JDBC Table Catalog. This ticket aims to support table altering in:
> - JDBC dialects
> - JDBCTableCatalog
> and to add tests.






[jira] [Assigned] (SPARK-32402) Implement ALTER TABLE in JDBC Table Catalog

2020-07-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32402:


Assignee: Apache Spark

> Implement ALTER TABLE in JDBC Table Catalog
> ---
>
> Key: SPARK-32402
> URL: https://issues.apache.org/jira/browse/SPARK-32402
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>
> The PR https://github.com/apache/spark/pull/29168 adds basic implementation 
> of JDBC Table Catalog. This ticket aims to support table altering in:
> - JDBC dialects
> - JDBCTableCatalog
> and to add tests.






[jira] [Commented] (SPARK-32402) Implement ALTER TABLE in JDBC Table Catalog

2020-07-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17169208#comment-17169208
 ] 

Apache Spark commented on SPARK-32402:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/29324

> Implement ALTER TABLE in JDBC Table Catalog
> ---
>
> Key: SPARK-32402
> URL: https://issues.apache.org/jira/browse/SPARK-32402
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> The PR https://github.com/apache/spark/pull/29168 adds basic implementation 
> of JDBC Table Catalog. This ticket aims to support table altering in:
> - JDBC dialects
> - JDBCTableCatalog
> and to add tests.






[jira] [Created] (SPARK-32514) Pyspark: Issue using sql query in foreachBatch sink

2020-07-31 Thread Muru (Jira)
Muru created SPARK-32514:


 Summary: Pyspark: Issue using sql query in foreachBatch sink
 Key: SPARK-32514
 URL: https://issues.apache.org/jira/browse/SPARK-32514
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.0.0, 2.4.6, 2.4.5
Reporter: Muru


In a PySpark Structured Streaming job, trying to use a SQL query instead of
DataFrame API methods in a foreachBatch sink throws an
AttributeError: 'JavaMember' object has no attribute 'format' exception.
 
However, the same thing works in a Scala job.
 
Please note, I tested on Spark 2.4.5/2.4.6 and 3.0.0 and got the same exception.
 
I noticed that I could perform other operations; only the write method fails.
 
Please let me know how to fix this issue.
 
See below code examples
# Spark Scala method
def processData(batchDF: DataFrame, batchId: Long) {
   batchDF.createOrReplaceTempView("tbl")
   val outdf=batchDF.sparkSession.sql("select action, count(*) as count from 
tbl where date='2020-06-20' group by 1")
   outdf.printSchema()
   outdf.show
   outdf.coalesce(1).write.format("csv").save("/tmp/agg")
}
 
## pyspark python method
def process_data(bdf, bid):
  lspark = bdf._jdf.sparkSession()
  bdf.createOrReplaceTempView("tbl")
  outdf=lspark.sql("select action, count(*) as count from tbl where 
date='2020-06-20' group by 1")
  outdf.printSchema()
  # it works
  outdf.show()
  # throws AttributeError: 'JavaMember' object has no attribute 'format' 
exception
  outdf.coalesce(1).write.format("csv").save("/tmp/agg1") 
 
Here is the full exception 
20/07/24 16:31:24 ERROR streaming.MicroBatchExecution: Query [id = 
854a39d0-b944-4b52-bf05-cacf998e2cbd, runId = 
e3d4dc7d-80e1-4164-8310-805d7713fc96] terminated with error
py4j.Py4JException: An exception was raised by the Python Proxy. Return 
Message: Traceback (most recent call last):
  File "/Users/muru/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", 
line 2381, in _call_proxy
    return_value = getattr(self.pool[obj_id], method)(*params)
  File "/Users/muru/spark/python/pyspark/sql/utils.py", line 191, in call
    raise e
AttributeError: 'JavaMember' object has no attribute 'format'
at py4j.Protocol.getReturnValue(Protocol.java:473)
at py4j.reflection.PythonProxyHandler.invoke(PythonProxyHandler.java:108)
at com.sun.proxy.$Proxy20.call(Unknown Source)
at 
org.apache.spark.sql.execution.streaming.sources.PythonForeachBatchHelper$$anonfun$callForeachBatch$1.apply(ForeachBatchSink.scala:55)
at 
org.apache.spark.sql.execution.streaming.sources.PythonForeachBatchHelper$$anonfun$callForeachBatch$1.apply(ForeachBatchSink.scala:55)
at 
org.apache.spark.sql.execution.streaming.sources.ForeachBatchSink.addBatch(ForeachBatchSink.scala:35)
at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5$$anonfun$apply$17.apply(MicroBatchExecution.scala:537)
at 
org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5.apply(MicroBatchExecution.scala:535)
at 
org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
at 
org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:534)
at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:198)
at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
at 
org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
at 
org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:166)
at 
org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:160)
at 
[org.apache.spark.sql.e

[jira] [Commented] (SPARK-32018) Fix UnsafeRow set overflowed decimal

2020-07-31 Thread Sunitha Kambhampati (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17169132#comment-17169132
 ] 

Sunitha Kambhampati commented on SPARK-32018:
-

[@cloud-fan|https://github.com/cloud-fan], I noticed the backports now. This
change is more far-reaching in its impact, as callers of UnsafeRow.getDecimal
that would previously have thrown an exception will now return null.

For example, a caller like the aggregate sum will need changes to account for
this. Cases where sum would previously throw an error on overflow will *now
return incorrect results*. The new tests that were added for the sum overflow
cases in DataFrameSuite on master can be used to reproduce this.

IMO, it would be better not to backport the setDecimal change in isolation.
WDYT? Please share your thoughts. Thanks.

I added a comment on the PR, but since it is closed, I am adding a comment here
as well.
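A minimal sketch of the kind of overflow case being discussed (spark-shell; the observed result depends on the branch and on settings such as spark.sql.ansi.enabled, so the comments are illustrative rather than authoritative):

{code:scala}
import org.apache.spark.sql.functions._

// Ten copies of the largest decimal(38,0) value; their sum no longer fits into decimal(38,0).
val df = spark.range(10).select(lit(BigDecimal("9" * 38)).as("d"))
df.agg(sum("d")).show()
// Before the setDecimal change, such an overflow could surface as an ArithmeticException on read;
// with only the setDecimal change backported, the overflowed intermediate may instead come back
// as null, which is how sum can end up returning an incorrect result as described above.
{code}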

> Fix UnsafeRow set overflowed decimal
> 
>
> Key: SPARK-32018
> URL: https://issues.apache.org/jira/browse/SPARK-32018
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Allison Wang
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 2.4.7, 3.0.1, 3.1.0
>
>
> There is a bug where writing an overflowed decimal into UnsafeRow succeeds, but
> reading it back throws an ArithmeticException. The exception is thrown when
> {{getDecimal}} is called on the UnsafeRow and the stored decimal's precision is
> greater than the requested precision. Setting the value of the overflowed
> decimal to null when writing into UnsafeRow should fix this issue.






[jira] [Updated] (SPARK-32511) Add dropFields method to Column class

2020-07-31 Thread fqaiser94 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

fqaiser94 updated SPARK-32511:
--
Description: 
Based on the discussions in the parent ticket (SPARK-22231), add a new 
{{dropFields}} method to the {{Column}} class. 
This method should allow users to drop a column nested inside a StructType 
Column (with similar semantics to the existing {{drop}} method on {{Dataset}}).
It should also be able to handle deeply nested columns through the same API. 
This is similar to the {{withField}} method that was recently added in 
SPARK-31317 and likely we can re-use some of that "infrastructure."

The public-facing method signature should be something along the following 
lines: 

{noformat}
def dropFields(fieldNames: String*): Column
{noformat}


  was:
Based on the discussions in the parent ticket (SPARK-22231), add a new 
{{dropFields}} method to the {{Column}} class. 
This method should allow users to drop a column {{nested inside another 
StructType Column}} (with similar semantics to the {{drop}} method on 
{{Dataset}}).
This is similar to the {{withField}} that was recently added in SPARK-31317 and 
likely can re-use some of that infrastructure. 

The public-facing method signature should be something along the following 
lines: 

{noformat}
def dropFields(fieldNames: String*): Column
{noformat}



> Add dropFields method to Column class
> -
>
> Key: SPARK-32511
> URL: https://issues.apache.org/jira/browse/SPARK-32511
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: fqaiser94
>Priority: Major
>
> Based on the discussions in the parent ticket (SPARK-22231), add a new 
> {{dropFields}} method to the {{Column}} class. 
> This method should allow users to drop a column nested inside a StructType 
> Column (with similar semantics to the existing {{drop}} method on 
> {{Dataset}}).
> It should also be able to handle deeply nested columns through the same API. 
> This is similar to the {{withField}} method that was recently added in 
> SPARK-31317 and likely we can re-use some of that "infrastructure."
> The public-facing method signature should be something along the following 
> lines: 
> {noformat}
> def dropFields(fieldNames: String*): Column
> {noformat}
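A hedged sketch of how the proposed method might be used once it exists (dropFields is not part of Spark at the time of this ticket; the column and field names are made up):

{code:scala}
// Illustration of the proposed API only - this does not compile against current Spark.
import org.apache.spark.sql.functions.col

val df = spark.sql("SELECT named_struct('a', 1, 'b', 2, 'c', 3) AS s")

df.withColumn("s", col("s").dropFields("b"))       // drop one nested field
df.withColumn("s", col("s").dropFields("b", "c"))  // drop several at once, per the varargs signature
{code}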






[jira] [Updated] (SPARK-32511) Add dropFields method to Column class

2020-07-31 Thread fqaiser94 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

fqaiser94 updated SPARK-32511:
--
Description: 
Based on the discussions in the parent ticket (SPARK-22231), add a new 
{{dropFields}} method to the {{Column}} class. 
This method should allow users to drop a column {{nested inside another 
StructType Column}} (with similar semantics to the {{drop}} method on 
{{Dataset}}).
This is similar to the {{withField}} that was recently added in SPARK-31317 and 
likely can re-use some of that infrastructure. 

The public-facing method signature should be something along the following 
lines: 

{noformat}
def dropFields(fieldNames: String*): Column
{noformat}


  was:
Based on the discussions in the parent ticket (SPARK-22231), add a new 
{{dropFields}} method to the {{Column}} class. 
This method should allow users to drop a column {{nested inside another 
StructType Column}} (with similar semantics to the {{drop}} method on 
{{Dataset}}).
This is similar to the {{withField}} that was recently added in SPARK-31317 and 
likely can re-use some of that infrastructure. 


> Add dropFields method to Column class
> -
>
> Key: SPARK-32511
> URL: https://issues.apache.org/jira/browse/SPARK-32511
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: fqaiser94
>Priority: Major
>
> Based on the discussions in the parent ticket (SPARK-22231), add a new 
> {{dropFields}} method to the {{Column}} class. 
> This method should allow users to drop a column {{nested inside another 
> StructType Column}} (with similar semantics to the {{drop}} method on 
> {{Dataset}}).
> This is similar to the {{withField}} that was recently added in SPARK-31317 
> and likely can re-use some of that infrastructure. 
> The public-facing method signature should be something along the following 
> lines: 
> {noformat}
> def dropFields(fieldNames: String*): Column
> {noformat}






[jira] [Updated] (SPARK-32511) Add dropFields method to Column class

2020-07-31 Thread fqaiser94 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

fqaiser94 updated SPARK-32511:
--
Description: 
Based on the discussions in the parent ticket (SPARK-22231), add a new 
{{dropFields}} method to the {{Column}} class. 
This method should allow users to drop a column {{nested inside another 
StructType Column}} (with similar semantics to the {{drop}} method on 
{{Dataset}}).
This is similar to the {{withField}} that was recently added in SPARK-31317 and 
likely can re-use some of that infrastructure. 

  was:
Based on the discussions in the parent ticket (SPARK-22231), add a new 
{{dropFields}} method to the {{Column}} class.
 This method should allow users to drop a column {{nested inside another 
StructType Column}} (with similar semantics to the {{drop}} method on 
{{Dataset}}).


> Add dropFields method to Column class
> -
>
> Key: SPARK-32511
> URL: https://issues.apache.org/jira/browse/SPARK-32511
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: fqaiser94
>Priority: Major
>
> Based on the discussions in the parent ticket (SPARK-22231), add a new 
> {{dropFields}} method to the {{Column}} class. 
> This method should allow users to drop a column {{nested inside another 
> StructType Column}} (with similar semantics to the {{drop}} method on 
> {{Dataset}}).
> This is similar to the {{withField}} that was recently added in SPARK-31317 
> and likely can re-use some of that infrastructure. 






[jira] [Updated] (SPARK-32511) Add dropFields method to Column class

2020-07-31 Thread fqaiser94 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

fqaiser94 updated SPARK-32511:
--
Description: 
Based on the discussions in the parent ticket (SPARK-22231), add a new 
{{dropFields}} method to the {{Column}} class.
 This method should allow users to drop a column {{nested inside another 
StructType Column}} (with similar semantics to the {{drop}} method on 
{{Dataset}}).

  was:
Based on the discussions in the parent ticket (SPARK-22231) and following on, 
add a new {{dropFields}} method to the {{Column}} class.
 This method should allow users to drop a column {{nested inside another 
StructType Column}} (with similar semantics to the {{drop}} method on 
{{Dataset}}).


> Add dropFields method to Column class
> -
>
> Key: SPARK-32511
> URL: https://issues.apache.org/jira/browse/SPARK-32511
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: fqaiser94
>Priority: Major
>
> Based on the discussions in the parent ticket (SPARK-22231), add a new 
> {{dropFields}} method to the {{Column}} class.
>  This method should allow users to drop a column {{nested inside another 
> StructType Column}} (with similar semantics to the {{drop}} method on 
> {{Dataset}}).






[jira] [Updated] (SPARK-32511) Add dropFields method to Column class

2020-07-31 Thread fqaiser94 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

fqaiser94 updated SPARK-32511:
--
Description: 
Based on the discussions in the parent ticket (SPARK-22231) and following on, 
add a new {{dropFields}} method to the {{Column}} class.
 This method should allow users to drop a column {{nested inside another 
StructType Column}} (with similar semantics to the {{drop}} method on 
{{Dataset}}).

  was:
Based on the discussions in the parent ticket (SPARK-22241) and following on, 
add a new {{dropFields}} method to the {{Column}} class.
 This method should allow users to drop a column {{nested inside another 
StructType Column}} (with similar semantics to the {{drop}} method on 
{{Dataset}}).


> Add dropFields method to Column class
> -
>
> Key: SPARK-32511
> URL: https://issues.apache.org/jira/browse/SPARK-32511
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: fqaiser94
>Priority: Major
>
> Based on the discussions in the parent ticket (SPARK-22231) and following on, 
> add a new {{dropFields}} method to the {{Column}} class.
>  This method should allow users to drop a column {{nested inside another 
> StructType Column}} (with similar semantics to the {{drop}} method on 
> {{Dataset}}).






[jira] [Updated] (SPARK-32511) Add dropFields method to Column class

2020-07-31 Thread fqaiser94 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

fqaiser94 updated SPARK-32511:
--
Description: 
Based on the extensive discussions in the parent ticket, it was determined we 
should add a new {{dropFields}} method to the {{Column}} class.
This method should allow users to drop a column nested inside another 
StructType Column, with similar semantics to the {{drop}} method on {{Dataset}}.
Ideally, this method should be able to handle dropping columns at arbitrary 
levels of nesting in a StructType Column. 




  was:
Based on the discussions in the parent ticket, add a new {{dropFields}} method 
to the {{Column}} class.
 This method should allow users to drop a column {{nested inside another 
StructType Column}} (with similar semantics to the {{drop}} method on 
{{Dataset}}).


> Add dropFields method to Column class
> -
>
> Key: SPARK-32511
> URL: https://issues.apache.org/jira/browse/SPARK-32511
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: fqaiser94
>Priority: Major
>
> Based on the extensive discussions in the parent ticket, it was determined we 
> should add a new {{dropFields}} method to the {{Column}} class.
> This method should allow users to drop a column nested inside another 
> StructType Column, with similar semantics to the {{drop}} method on 
> {{Dataset}}.
> Ideally, this method should be able to handle dropping columns at arbitrary 
> levels of nesting in a StructType Column. 






[jira] [Updated] (SPARK-32511) Add dropFields method to Column class

2020-07-31 Thread fqaiser94 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

fqaiser94 updated SPARK-32511:
--
Description: 
Based on the discussions in the parent ticket (SPARK-22241) and following on, 
add a new {{dropFields}} method to the {{Column}} class.
 This method should allow users to drop a column {{nested inside another 
StructType Column}} (with similar semantics to the {{drop}} method on 
{{Dataset}}).

  was:
Based on the extensive discussions in the parent ticket, it was determined we 
should add a new {{dropFields}} method to the {{Column}} class.
This method should allow users to drop a column nested inside another 
StructType Column, with similar semantics to the {{drop}} method on {{Dataset}}.
Ideally, this method should be able to handle dropping columns at arbitrary 
levels of nesting in a StructType Column. 





> Add dropFields method to Column class
> -
>
> Key: SPARK-32511
> URL: https://issues.apache.org/jira/browse/SPARK-32511
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: fqaiser94
>Priority: Major
>
> Based on the discussions in the parent ticket (SPARK-22241) and following on, 
> add a new {{dropFields}} method to the {{Column}} class.
>  This method should allow users to drop a column {{nested inside another 
> StructType Column}} (with similar semantics to the {{drop}} method on 
> {{Dataset}}).






[jira] [Updated] (SPARK-32511) Add dropFields method to Column class

2020-07-31 Thread fqaiser94 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

fqaiser94 updated SPARK-32511:
--
Description: 
Based on the discussions in the parent ticket, add a new {{dropFields}} method 
to the {{Column}} class.
 This method should allow users to drop a column {{nested inside another 
StructType Column}} (with similar semantics to the {{drop}} method on 
{{Dataset}}).

  was:
Based on the discussions in the parent ticket, 

Added a new {{dropFields}} method to the {{Column}} class.
This method should allow users to drop a {{StructField}} in a {{StructType}} 
column (with similar semantics to the {{drop}} method on {{Dataset}}).


> Add dropFields method to Column class
> -
>
> Key: SPARK-32511
> URL: https://issues.apache.org/jira/browse/SPARK-32511
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: fqaiser94
>Priority: Major
>
> Based on the discussions in the parent ticket, add a new {{dropFields}} 
> method to the {{Column}} class.
>  This method should allow users to drop a column {{nested inside another 
> StructType Column}} (with similar semantics to the {{drop}} method on 
> {{Dataset}}).






[jira] [Updated] (SPARK-32511) Add dropFields method to Column class

2020-07-31 Thread fqaiser94 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

fqaiser94 updated SPARK-32511:
--
Description: 
Based on the discussions in the parent ticket, 

Added a new {{dropFields}} method to the {{Column}} class.
This method should allow users to drop a {{StructField}} in a {{StructType}} 
column (with similar semantics to the {{drop}} method on {{Dataset}}).

> Add dropFields method to Column class
> -
>
> Key: SPARK-32511
> URL: https://issues.apache.org/jira/browse/SPARK-32511
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: fqaiser94
>Priority: Major
>
> Based on the discussions in the parent ticket, 
> Added a new {{dropFields}} method to the {{Column}} class.
> This method should allow users to drop a {{StructField}} in a {{StructType}} 
> column (with similar semantics to the {{drop}} method on {{Dataset}}).






[jira] [Commented] (SPARK-32511) Add dropFields method to Column class

2020-07-31 Thread Rohit Mishra (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17169087#comment-17169087
 ] 

Rohit Mishra commented on SPARK-32511:
--

[~fqaiser94], Can you please add a description?

> Add dropFields method to Column class
> -
>
> Key: SPARK-32511
> URL: https://issues.apache.org/jira/browse/SPARK-32511
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: fqaiser94
>Priority: Major
>







[jira] [Commented] (SPARK-32513) Rename classes/files with the Jdbc prefix to JDBC

2020-07-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17169081#comment-17169081
 ] 

Apache Spark commented on SPARK-32513:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/29323

> Rename classes/files with the Jdbc prefix to JDBC
> -
>
> Key: SPARK-32513
> URL: https://issues.apache.org/jira/browse/SPARK-32513
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> I have found 7 files with Jdbc:
> {code}
> ➜ apache-spark git:(master) find . -name "Jdbc*.scala" -type f
> ./core/src/test/scala/org/apache/spark/rdd/JdbcRDDSuite.scala
> ./core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala
> ./sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtilsSuite.scala
> ./sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala
> ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcRelationProvider.scala
> ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala
> ./sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/JdbcConnectionUriSuite.scala
> {code}
> and 8 whose names start with JDBC:
> {code}
> ➜ apache-spark git:(master) find . -name "JDBC*.scala" -type f
> ./sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala
> ./sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCWriteSuite.scala
> ./sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTableCatalogSuite.scala
> ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala
> ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala
> ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala
> ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTableCatalog.scala
> ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTable.scala
> {code}






[jira] [Assigned] (SPARK-32513) Rename classes/files with the Jdbc prefix to JDBC

2020-07-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32513:


Assignee: Apache Spark

> Rename classes/files with the Jdbc prefix to JDBC
> -
>
> Key: SPARK-32513
> URL: https://issues.apache.org/jira/browse/SPARK-32513
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>
> I have found 7 files with Jdbc:
> {code}
> ➜ apache-spark git:(master) find . -name "Jdbc*.scala" -type f
> ./core/src/test/scala/org/apache/spark/rdd/JdbcRDDSuite.scala
> ./core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala
> ./sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtilsSuite.scala
> ./sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala
> ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcRelationProvider.scala
> ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala
> ./sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/JdbcConnectionUriSuite.scala
> {code}
> and 8 whose names start with JDBC:
> {code}
> ➜ apache-spark git:(master) find . -name "JDBC*.scala" -type f
> ./sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala
> ./sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCWriteSuite.scala
> ./sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTableCatalogSuite.scala
> ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala
> ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala
> ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala
> ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTableCatalog.scala
> ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTable.scala
> {code}






[jira] [Commented] (SPARK-32513) Rename classes/files with the Jdbc prefix to JDBC

2020-07-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17169082#comment-17169082
 ] 

Apache Spark commented on SPARK-32513:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/29323

> Rename classes/files with the Jdbc prefix to JDBC
> -
>
> Key: SPARK-32513
> URL: https://issues.apache.org/jira/browse/SPARK-32513
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> I have found 7 files with Jdbc:
> {code}
> ➜ apache-spark git:(master) find . -name "Jdbc*.scala" -type f
> ./core/src/test/scala/org/apache/spark/rdd/JdbcRDDSuite.scala
> ./core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala
> ./sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtilsSuite.scala
> ./sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala
> ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcRelationProvider.scala
> ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala
> ./sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/JdbcConnectionUriSuite.scala
> {code}
> and 8 whose names start with JDBC:
> {code}
> ➜ apache-spark git:(master) find . -name "JDBC*.scala" -type f
> ./sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala
> ./sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCWriteSuite.scala
> ./sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTableCatalogSuite.scala
> ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala
> ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala
> ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala
> ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTableCatalog.scala
> ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTable.scala
> {code}






[jira] [Assigned] (SPARK-32513) Rename classes/files with the Jdbc prefix to JDBC

2020-07-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32513:


Assignee: (was: Apache Spark)

> Rename classes/files with the Jdbc prefix to JDBC
> -
>
> Key: SPARK-32513
> URL: https://issues.apache.org/jira/browse/SPARK-32513
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> I have found 7 files with Jdbc:
> {code}
> ➜ apache-spark git:(master) find . -name "Jdbc*.scala" -type f
> ./core/src/test/scala/org/apache/spark/rdd/JdbcRDDSuite.scala
> ./core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala
> ./sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtilsSuite.scala
> ./sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala
> ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcRelationProvider.scala
> ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala
> ./sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/JdbcConnectionUriSuite.scala
> {code}
> and 8 whose names start with JDBC:
> {code}
> ➜ apache-spark git:(master) find . -name "JDBC*.scala" -type f
> ./sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala
> ./sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCWriteSuite.scala
> ./sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTableCatalogSuite.scala
> ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala
> ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala
> ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala
> ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTableCatalog.scala
> ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTable.scala
> {code}






[jira] [Created] (SPARK-32513) Rename classes/files with the Jdbc prefix to JDBC

2020-07-31 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-32513:
--

 Summary: Rename classes/files with the Jdbc prefix to JDBC
 Key: SPARK-32513
 URL: https://issues.apache.org/jira/browse/SPARK-32513
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Maxim Gekk


I have found 7 files with Jdbc:
{code}
➜ apache-spark git:(master) find . -name "Jdbc*.scala" -type f
./core/src/test/scala/org/apache/spark/rdd/JdbcRDDSuite.scala
./core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala
./sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtilsSuite.scala
./sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala
./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcRelationProvider.scala
./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala
./sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/JdbcConnectionUriSuite.scala
{code}
and 8 whose names start with JDBC:
{code}
➜ apache-spark git:(master) find . -name "JDBC*.scala" -type f
./sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala
./sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCWriteSuite.scala
./sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTableCatalogSuite.scala
./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala
./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala
./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala
./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTableCatalog.scala
./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTable.scala
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32405) Apply table options while creating tables in JDBC Table Catalog

2020-07-31 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17169050#comment-17169050
 ] 

Maxim Gekk commented on SPARK-32405:


Properties passed to createTable are ignored; see 
[https://github.com/apache/spark/blob/8bc799f92005c903868ef209f5aec8deb6ccce5a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTableCatalog.scala#L116]
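
For readers following along, a minimal, self-contained sketch of the kind of hook this ticket asks for; the object and method names below are hypothetical placeholders, not an existing Spark API:

{code:scala}
// Hypothetical sketch: render catalog table properties as a dialect-specific
// suffix for the generated CREATE TABLE statement.
object TableOptionsSketch {
  def createTableOptions(properties: Map[String, String]): String =
    if (properties.isEmpty) ""
    else properties.map { case (k, v) => s"$k = '$v'" }.mkString(" ", ", ", "")

  def main(args: Array[String]): Unit = {
    val sql = s"CREATE TABLE t (id BIGINT)${createTableOptions(Map("ENGINE" -> "InnoDB"))}"
    println(sql) // CREATE TABLE t (id BIGINT) ENGINE = 'InnoDB'
  }
}
{code}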

> Apply table options while creating tables in JDBC Table Catalog
> ---
>
> Key: SPARK-32405
> URL: https://issues.apache.org/jira/browse/SPARK-32405
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> We need to add an API to `JdbcDialect` to generate the SQL statement to 
> specify table options.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31470) Introduce SORTED BY clause in CREATE TABLE statement

2020-07-31 Thread Cheng Su (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17169042#comment-17169042
 ] 

Cheng Su commented on SPARK-31470:
--

[~yumwang] - thanks for creating this jira! I would like to know more details.
 # For `SORT BY` here, are you referring to a local sort per Spark task, a 
global sort, or some other technique like z-ordering to handle multiple 
filters? (It seems that Databricks Delta already supports the latter - 
[https://databricks.com/blog/2018/07/31/processing-petabytes-of-data-in-seconds-with-databricks-delta.html]) 
See the sketch after this list for the local vs. global distinction.
 # Where do we plan to store this metadata in the catalog, and what would the 
structure look like?
 # Are you mainly targeting filter performance, or also other operators like 
join and group-by? (I think a global sort should help save the shuffle and 
sort for join and group-by.)
 # Supporting the feature itself looks like a minor change, but how do we drive 
alignment across compute engines like Presto and Impala in the future?
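
A minimal sketch of the local vs. global distinction from question 1, using the existing DataFrame API (assumes a SparkSession `spark` and a table `t` with columns day and hour; the output paths are placeholders):

{code:scala}
val df = spark.table("t")

// Local sort: each task sorts only its own partition; no extra shuffle.
df.sortWithinPartitions("day", "hour")
  .write.mode("overwrite").parquet("/tmp/t_locally_sorted")

// Global sort: a range shuffle orders the data across all partitions.
df.orderBy("day", "hour")
  .write.mode("overwrite").parquet("/tmp/t_globally_sorted")
{code}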

> Introduce SORTED BY clause in CREATE TABLE statement
> 
>
> Key: SPARK-31470
> URL: https://issues.apache.org/jira/browse/SPARK-31470
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> We usually sort on frequently filtered columns when writing data to improve 
> query performance, but this information is not captured anywhere in the table 
> metadata.
>  
> {code:sql}
> CREATE TABLE t(day INT, hour INT, year INT, month INT)
> USING parquet
> PARTITIONED BY (year, month)
> SORTED BY (day, hour);
> {code}
>  
> Impala, Oracle and redshift support this clause:
> https://issues.apache.org/jira/browse/IMPALA-4166
> https://docs.oracle.com/database/121/DWHSG/attcluster.htm#GUID-DAECFBC5-FD1A-45A5-8C2C-DC9884D0857B
> https://docs.aws.amazon.com/redshift/latest/dg/t_Sorting_data-compare-sort-styles.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32405) Apply table options while creating tables in JDBC Table Catalog

2020-07-31 Thread Huaxin Gao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17169041#comment-17169041
 ] 

Huaxin Gao commented on SPARK-32405:


[~maxgekk] Hi Max, could you please explain what these table options are? I 
initially thought you meant createTableOptions, but createTableOptions is 
already appended to the CREATE TABLE statement and does not need to be 
generated in `JdbcDialect`.

{code:java}
val sql = s"CREATE TABLE $tableName ($strSchema) $createTableOptions"
statement.executeUpdate(sql)
{code}

I guess you mean something else? Could you please give me an example? Thanks!


> Apply table options while creating tables in JDBC Table Catalog
> ---
>
> Key: SPARK-32405
> URL: https://issues.apache.org/jira/browse/SPARK-32405
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> We need to add an API to `JdbcDialect` to generate the SQL statement to 
> specify table options.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32332) AQE doesn't adequately allow for Columnar Processing extension

2020-07-31 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-32332:
--
Fix Version/s: 3.0.1

> AQE doesn't adequately allow for Columnar Processing extension 
> ---
>
> Key: SPARK-32332
> URL: https://issues.apache.org/jira/browse/SPARK-32332
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.1, 3.1.0
>
>
> In SPARK-27396 we added support for extended Columnar Processing. We did the 
> initial work to the extent we thought was sufficient, but adaptive query 
> execution was being developed at the same time.
> We have discovered that the changes made to AQE are not sufficient for users 
> to properly extend it for columnar processing, because AQE hard-codes checks 
> for specific classes/execs.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32512) Add basic partition command for datasourcev2

2020-07-31 Thread Jackey Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jackey Lee updated SPARK-32512:
---
Description: This Jira proposes adding basic partition command APIs, 
`AlterTableAddPartitionExec` and `AlterTableDropPartitionExec`, to support 
operating on DataSourceV2 partitions. They will use the new partition API 
defined in [SPARK-31694|https://issues.apache.org/jira/browse/SPARK-31694].
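
For context, the SQL statements these exec nodes would back, shown from Scala; the catalog and table names are placeholders and assume a v2 catalog that implements the new partition API:

{code:scala}
spark.sql("ALTER TABLE testcat.db.t ADD PARTITION (p = '2020-07-31')")
spark.sql("ALTER TABLE testcat.db.t DROP PARTITION (p = '2020-07-31')")
{code}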

> Add basic partition command for datasourcev2
> 
>
> Key: SPARK-32512
> URL: https://issues.apache.org/jira/browse/SPARK-32512
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Jackey Lee
>Priority: Major
>
> This Jira proposes adding basic partition command APIs, 
> `AlterTableAddPartitionExec` and `AlterTableDropPartitionExec`, to support 
> operating on DataSourceV2 partitions. They will use the new partition API 
> defined in [SPARK-31694|https://issues.apache.org/jira/browse/SPARK-31694].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32511) Add dropFields method to Column class

2020-07-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32511:


Assignee: (was: Apache Spark)

> Add dropFields method to Column class
> -
>
> Key: SPARK-32511
> URL: https://issues.apache.org/jira/browse/SPARK-32511
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: fqaiser94
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32511) Add dropFields method to Column class

2020-07-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32511:


Assignee: Apache Spark

> Add dropFields method to Column class
> -
>
> Key: SPARK-32511
> URL: https://issues.apache.org/jira/browse/SPARK-32511
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: fqaiser94
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32511) Add dropFields method to Column class

2020-07-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168966#comment-17168966
 ] 

Apache Spark commented on SPARK-32511:
--

User 'fqaiser94' has created a pull request for this issue:
https://github.com/apache/spark/pull/29322
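
For readers skimming the thread, a sketch of the kind of usage being proposed; the `dropFields` call below is the new, not-yet-merged API whose exact signature is whatever the linked PR settles on (assumes a SparkSession `spark`):

{code:scala}
import org.apache.spark.sql.functions._

val df = spark.sql("SELECT named_struct('a', 1, 'b', 2, 'c', 3) AS s")

// Proposed: drop a nested field without rebuilding the struct by hand.
val trimmed = df.withColumn("s", col("s").dropFields("b"))

// Today the equivalent requires re-creating the struct explicitly.
val manual = df.withColumn("s", struct(col("s.a").as("a"), col("s.c").as("c")))
{code}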

> Add dropFields method to Column class
> -
>
> Key: SPARK-32511
> URL: https://issues.apache.org/jira/browse/SPARK-32511
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: fqaiser94
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32512) Add basic partition command for datasourcev2

2020-07-31 Thread Jackey Lee (Jira)
Jackey Lee created SPARK-32512:
--

 Summary: Add basic partition command for datasourcev2
 Key: SPARK-32512
 URL: https://issues.apache.org/jira/browse/SPARK-32512
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Jackey Lee






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32511) Add dropFields method to Column class

2020-07-31 Thread fqaiser94 (Jira)
fqaiser94 created SPARK-32511:
-

 Summary: Add dropFields method to Column class
 Key: SPARK-32511
 URL: https://issues.apache.org/jira/browse/SPARK-32511
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: fqaiser94






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32083) Unnecessary tasks are launched when input is empty with AQE

2020-07-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168946#comment-17168946
 ] 

Apache Spark commented on SPARK-32083:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/29321

> Unnecessary tasks are launched when input is empty with AQE
> ---
>
> Key: SPARK-32083
> URL: https://issues.apache.org/jira/browse/SPARK-32083
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Priority: Minor
>
> [https://github.com/apache/spark/pull/28226] meant to avoid launching 
> unnecessary tasks for 0-size partitions when AQE is enabled. However, when 
> all partitions are empty, the number of partitions will be 
> `spark.sql.adaptive.coalescePartitions.initialPartitionNum` and (a lot of) 
> unnecessary tasks are launched in this case.
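
A minimal way to see the behavior described above (the configuration names come from the description; assumes a SparkSession `spark` with AQE available):

{code:scala}
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.initialPartitionNum", "1000")

// Every shuffle partition is empty, yet the post-shuffle stage can still
// launch up to initialPartitionNum tasks instead of zero.
val empty = spark.range(0).selectExpr("id % 10 AS k", "id AS v")
empty.groupBy("k").count().collect()
{code}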



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32083) Unnecessary tasks are launched when input is empty with AQE

2020-07-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168945#comment-17168945
 ] 

Apache Spark commented on SPARK-32083:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/29321

> Unnecessary tasks are launched when input is empty with AQE
> ---
>
> Key: SPARK-32083
> URL: https://issues.apache.org/jira/browse/SPARK-32083
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Priority: Minor
>
> [https://github.com/apache/spark/pull/28226] meant to avoid launching 
> unnecessary tasks for 0-size partitions when AQE is enabled. However, when 
> all partitions are empty, the number of partitions will be 
> `spark.sql.adaptive.coalescePartitions.initialPartitionNum` and (a lot of) 
> unnecessary tasks are launched in this case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22947) SPIP: as-of join in Spark SQL

2020-07-31 Thread Xuyan Xiao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168904#comment-17168904
 ] 

Xuyan Xiao commented on SPARK-22947:


[~icexelloss] Any update on this issue?

> SPIP: as-of join in Spark SQL
> -
>
> Key: SPARK-22947
> URL: https://issues.apache.org/jira/browse/SPARK-22947
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Li Jin
>Priority: Major
> Attachments: SPIP_ as-of join in Spark SQL (1).pdf
>
>
> h2. Background and Motivation
> Time series analysis is one of the most common analyses of financial data. In 
> time series analysis, as-of join is a very common operation. Supporting as-of 
> join in Spark SQL will allow many use cases of using Spark SQL for time 
> series analysis.
> As-of join is “join on time” with inexact time matching criteria. Various 
> libraries have implemented as-of join or similar functionality:
> Kdb: https://code.kx.com/wiki/Reference/aj
> Pandas: 
> http://pandas.pydata.org/pandas-docs/version/0.19.0/merging.html#merging-merge-asof
> R: This functionality is called “Last Observation Carried Forward”
> https://www.rdocumentation.org/packages/zoo/versions/1.8-0/topics/na.locf
> JuliaDB: http://juliadb.org/latest/api/joins.html#IndexedTables.asofjoin
> Flint: https://github.com/twosigma/flint#temporal-join-functions
> This proposal advocates introducing new API in Spark SQL to support as-of 
> join.
> h2. Target Personas
> Data scientists, data engineers
> h2. Goals
> * New API in Spark SQL that allows as-of join
> * As-of join of multiple tables (>2) should be performant, because it’s very 
> common that users need to join multiple data sources together for further 
> analysis.
> * Define Distribution, Partitioning and shuffle strategy for ordered time 
> series data
> h2. Non-Goals
> These are out of scope for the existing SPIP, should be considered in future 
> SPIP as improvement to Spark’s time series analysis ability:
> * Utilize partition information from data source, i.e, begin/end of each 
> partition to reduce sorting/shuffling
> * Define API for user to implement asof join time spec in business calendar 
> (i.e. lookback one business day, this is very common in financial data 
> analysis because of market calendars)
> * Support broadcast join
> h2. Proposed API Changes
> h3. TimeContext
> TimeContext is an object that defines the time scope of the analysis; it has a 
> begin time (inclusive) and an end time (exclusive). Users should be able to 
> change the time scope of the analysis (e.g., from one month to five years) by 
> just changing the TimeContext. 
> To the Spark engine, TimeContext is a hint that:
> * can be used to repartition data for joins
> * serves as a predicate that can be pushed down to the storage layer
> Time context is similar to filtering time by begin/end; the main difference 
> is that the time context can be expanded based on the operation taken (see the 
> example in as-of join).
> Time context example:
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> {code}
> h3. asofJoin
> h4. Use Case A (join without key)
> Join two DataFrames on time, with one day lookback:
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> dfA = ...
> dfB = ...
> JoinSpec joinSpec = JoinSpec(timeContext).on("time").tolerance("-1day")
> result = dfA.asofJoin(dfB, joinSpec)
> {code}
> Example input/output:
> {code:java}
> dfA:
> time, quantity
> 20160101, 100
> 20160102, 50
> 20160104, -50
> 20160105, 100
> dfB:
> time, price
> 20151231, 100.0
> 20160104, 105.0
> 20160105, 102.0
> output:
> time, quantity, price
> 20160101, 100, 100.0
> 20160102, 50, null
> 20160104, -50, 105.0
> 20160105, 100, 102.0
> {code}
> Note row (20160101, 100) of dfA is joined with (20151231, 100.0) of dfB. This 
> is an important illustration of the time context - it is able to expand the 
> context to 20151231 on dfB because of the 1 day lookback.
> h4. Use Case B (join with key)
> To join on time and another key (for instance, id), we use “by” to specify 
> the key.
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> dfA = ...
> dfB = ...
> JoinSpec joinSpec = 
> JoinSpec(timeContext).on("time").by("id").tolerance("-1day")
> result = dfA.asofJoin(dfB, joinSpec)
> {code}
> Example input/output:
> {code:java}
> dfA:
> time, id, quantity
> 20160101, 1, 100
> 20160101, 2, 50
> 20160102, 1, -50
> 20160102, 2, 50
> dfB:
> time, id, price
> 20151231, 1, 100.0
> 20150102, 1, 105.0
> 20150102, 2, 195.0
> Output:
> time, id, quantity, price
> 20160101, 1, 100, 100.0
> 20160101, 2, 50, null
> 20160102, 1, -50, 105.0
> 20160102, 2, 50, 195.0
> {code}
> h2. Optional Design Sketch
> h3. Implementation A
> (This is just initial thought of how to implement

[jira] [Commented] (SPARK-32495) Update jackson versions from 2.4.6 and so on(2.4.x)

2020-07-31 Thread SHOBHIT SHUKLA (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168868#comment-17168868
 ] 

SHOBHIT SHUKLA commented on SPARK-32495:


[~prashant] Here is the link where Jackson databind community mentioned in 
which version they have fixed above mentioned CVEs.
https://github.com/advisories/GHSA-h592-38cm-4ggp
https://github.com/advisories/GHSA-w3f4-3q6j-rh82

> Update jackson versions from 2.4.6 and so on(2.4.x)
> ---
>
> Key: SPARK-32495
> URL: https://issues.apache.org/jira/browse/SPARK-32495
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 2.4.6
>Reporter: SHOBHIT SHUKLA
>Priority: Major
>
> FasterXML Jackson (jackson-databind) 2.6.7.3 is affected by CVE-2017-15095 
> and CVE-2018-5968 ([https://nvd.nist.gov/vuln/detail/CVE-2018-5968]). Would 
> it be possible to upgrade the Jackson version for Spark 2.4.6 and later 
> 2.4.x releases?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32495) Update jackson versions from 2.4.6 and so on(2.4.x)

2020-07-31 Thread SHOBHIT SHUKLA (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHOBHIT SHUKLA updated SPARK-32495:
---
Description: 
FasterXML Jackson (jackson-databind) 2.6.7.3 is affected by CVE-2017-15095 and 
CVE-2018-5968 ([https://nvd.nist.gov/vuln/detail/CVE-2018-5968]). Would it be 
possible to upgrade the Jackson version for Spark 2.4.6 and later 2.4.x 
releases?


  was:
As a vulnerability for Fasterxml Jackson version 2.6.7.3 is affected by 
CVE-2017-15095 and CVE-2018-5968 CVEs [as a vulnerability for jackson-databind 
2.6.7.3], Would it be possible to upgrade the jackson version for spark-2.4.6 
and so on(2.4.x).



> Update jackson versions from 2.4.6 and so on(2.4.x)
> ---
>
> Key: SPARK-32495
> URL: https://issues.apache.org/jira/browse/SPARK-32495
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 2.4.6
>Reporter: SHOBHIT SHUKLA
>Priority: Major
>
> FasterXML Jackson (jackson-databind) 2.6.7.3 is affected by CVE-2017-15095 
> and CVE-2018-5968 ([https://nvd.nist.gov/vuln/detail/CVE-2018-5968]). Would 
> it be possible to upgrade the Jackson version for Spark 2.4.6 and later 
> 2.4.x releases?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-32495) Update jackson versions from 2.4.6 and so on(2.4.x)

2020-07-31 Thread SHOBHIT SHUKLA (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168853#comment-17168853
 ] 

SHOBHIT SHUKLA edited comment on SPARK-32495 at 7/31/20, 1:15 PM:
--

The description above has been updated with the actual CVEs we are looking to 
have fixed. The NVD database lists CVE-2017-15095 and CVE-2018-5968 as 
vulnerabilities for the jackson-databind 2.6.7.3 release: 
https://nvd.nist.gov/vuln/detail/CVE-2018-5968


was (Author: sshukla05):
The NVD databases lists CVE-2017-15095 and CVE-2018-5968 as a vulnerability for 
jackson-databind 2.6.7.3 release. https://nvd.nist.gov/vuln/detail/CVE-2018-5968

> Update jackson versions from 2.4.6 and so on(2.4.x)
> ---
>
> Key: SPARK-32495
> URL: https://issues.apache.org/jira/browse/SPARK-32495
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 2.4.6
>Reporter: SHOBHIT SHUKLA
>Priority: Major
>
> FasterXML jackson-databind 2.6.7.3 is affected by CVE-2017-15095 and 
> CVE-2018-5968. Would it be possible to upgrade the Jackson version for 
> Spark 2.4.6 and later 2.4.x releases?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32495) Update jackson versions from 2.4.6 and so on(2.4.x)

2020-07-31 Thread SHOBHIT SHUKLA (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHOBHIT SHUKLA updated SPARK-32495:
---
Description: 
FasterXML jackson-databind 2.6.7.3 is affected by CVE-2017-15095 and 
CVE-2018-5968 ([https://github.com/FasterXML/jackson-databind/issues/2186]). 
Would it be possible to upgrade the Jackson version to >= 2.9.8 for Spark 2.4.6 
and later 2.4.x releases?


  was:
Fasterxml Jackson version before 2.9.8 is affected by multiple CVEs 
[https://github.com/FasterXML/jackson-databind/issues/2186], Would it be 
possible to upgrade the jackson version to >= 2.9.8 for spark-2.4.6 and so 
on(2.4.x).



> Update jackson versions from 2.4.6 and so on(2.4.x)
> ---
>
> Key: SPARK-32495
> URL: https://issues.apache.org/jira/browse/SPARK-32495
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 2.4.6
>Reporter: SHOBHIT SHUKLA
>Priority: Major
>
> FasterXML jackson-databind 2.6.7.3 is affected by CVE-2017-15095 and 
> CVE-2018-5968 ([https://github.com/FasterXML/jackson-databind/issues/2186]). 
> Would it be possible to upgrade the Jackson version to >= 2.9.8 for 
> Spark 2.4.6 and later 2.4.x releases?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32495) Update jackson versions from 2.4.6 and so on(2.4.x)

2020-07-31 Thread SHOBHIT SHUKLA (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHOBHIT SHUKLA updated SPARK-32495:
---
Description: 
FasterXML jackson-databind 2.6.7.3 is affected by CVE-2017-15095 and 
CVE-2018-5968. Would it be possible to upgrade the Jackson version for 
Spark 2.4.6 and later 2.4.x releases?


  was:
As a vulnerability for Fasterxml Jackson version 2.6.7.3 is affected by 
CVE-2017-15095 and CVE-2018-5968 CVEs 
[https://github.com/FasterXML/jackson-databind/issues/2186], Would it be 
possible to upgrade the jackson version to >= 2.9.8 for spark-2.4.6 and so 
on(2.4.x).



> Update jackson versions from 2.4.6 and so on(2.4.x)
> ---
>
> Key: SPARK-32495
> URL: https://issues.apache.org/jira/browse/SPARK-32495
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 2.4.6
>Reporter: SHOBHIT SHUKLA
>Priority: Major
>
> FasterXML jackson-databind 2.6.7.3 is affected by CVE-2017-15095 and 
> CVE-2018-5968. Would it be possible to upgrade the Jackson version for 
> Spark 2.4.6 and later 2.4.x releases?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32509) Unused DPP Filter causes issue in canonicalization and prevents reuse exchange

2020-07-31 Thread Rohit Mishra (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168856#comment-17168856
 ] 

Rohit Mishra commented on SPARK-32509:
--

[~prakharjain09], while you are working on the pull request, can you please 
attach a code sample, environment details, or an output snapshot related to the 
bug, for the reference of the wider audience and to avoid duplicate issues. 
Thanks.

> Unused DPP Filter causes issue in canonicalization and prevents reuse exchange
> --
>
> Key: SPARK-32509
> URL: https://issues.apache.org/jira/browse/SPARK-32509
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.1.0
>Reporter: Prakhar Jain
>Priority: Major
>
> As part of the PlanDynamicPruningFilter rule, unused DPP filters are simply 
> replaced by `DynamicPruningExpression(TrueLiteral)` so that they can be 
> avoided. But these unnecessary `DynamicPruningExpression(TrueLiteral)` 
> partition filters inside FileSourceScanExec affect the canonicalization of 
> the node, and so in many cases this can prevent ReuseExchange from 
> happening.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32495) Update jackson versions from 2.4.6 and so on(2.4.x)

2020-07-31 Thread SHOBHIT SHUKLA (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168853#comment-17168853
 ] 

SHOBHIT SHUKLA commented on SPARK-32495:


The NVD database lists CVE-2017-15095 and CVE-2018-5968 as vulnerabilities for 
the jackson-databind 2.6.7.3 release: 
https://nvd.nist.gov/vuln/detail/CVE-2018-5968

> Update jackson versions from 2.4.6 and so on(2.4.x)
> ---
>
> Key: SPARK-32495
> URL: https://issues.apache.org/jira/browse/SPARK-32495
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 2.4.6
>Reporter: SHOBHIT SHUKLA
>Priority: Major
>
> FasterXML Jackson versions before 2.9.8 are affected by multiple CVEs 
> ([https://github.com/FasterXML/jackson-databind/issues/2186]). Would it be 
> possible to upgrade the Jackson version to >= 2.9.8 for Spark 2.4.6 and 
> later 2.4.x releases?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32507) Main Page

2020-07-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168850#comment-17168850
 ] 

Apache Spark commented on SPARK-32507:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/29320

> Main Page
> -
>
> Key: SPARK-32507
> URL: https://issues.apache.org/jira/browse/SPARK-32507
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> We should make a main package to overview PySpark properly. See the demo 
> example:
>  https://hyukjin-spark.readthedocs.io/en/latest/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32507) Main Page

2020-07-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32507:


Assignee: Hyukjin Kwon  (was: Apache Spark)

> Main Page
> -
>
> Key: SPARK-32507
> URL: https://issues.apache.org/jira/browse/SPARK-32507
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> We should make a main package to overview PySpark properly. See the demo 
> example:
>  https://hyukjin-spark.readthedocs.io/en/latest/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32507) Main Page

2020-07-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32507:


Assignee: Apache Spark  (was: Hyukjin Kwon)

> Main Page
> -
>
> Key: SPARK-32507
> URL: https://issues.apache.org/jira/browse/SPARK-32507
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> We should make a main package to overview PySpark properly. See the demo 
> example:
>  https://hyukjin-spark.readthedocs.io/en/latest/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32507) Main Page

2020-07-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168848#comment-17168848
 ] 

Apache Spark commented on SPARK-32507:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/29320

> Main Page
> -
>
> Key: SPARK-32507
> URL: https://issues.apache.org/jira/browse/SPARK-32507
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> We should make a main package to overview PySpark properly. See the demo 
> example:
>  https://hyukjin-spark.readthedocs.io/en/latest/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32480) Support insert overwrite to move the data to trash

2020-07-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168833#comment-17168833
 ] 

Apache Spark commented on SPARK-32480:
--

User 'Udbhav30' has created a pull request for this issue:
https://github.com/apache/spark/pull/29319

> Support insert overwrite to move the data to trash 
> ---
>
> Key: SPARK-32480
> URL: https://issues.apache.org/jira/browse/SPARK-32480
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.1.0
>Reporter: jobit mathew
>Priority: Minor
>
> Instead of deleting the data, move it to the trash. Data in the trash can then 
> be deleted permanently, based on configuration.
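
A sketch of the mechanism under discussion, using Hadoop's existing trash support; the path and the retention interval below are assumptions for illustration:

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{Path, Trash}

val conf = new Configuration()
conf.set("fs.trash.interval", "1440") // keep trashed data for 24 hours

// Move the partition directory to the trash instead of deleting it outright.
val path = new Path("/warehouse/db.db/t/part=2020-07-31")
val fs = path.getFileSystem(conf)
val movedToTrash = Trash.moveToAppropriateTrash(fs, path, conf)

// Returns false when trash is disabled; a caller would then fall back to delete.
println(s"moved to trash: $movedToTrash")
{code}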



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32509) Unused DPP Filter causes issue in canonicalization and prevents reuse exchange

2020-07-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168831#comment-17168831
 ] 

Apache Spark commented on SPARK-32509:
--

User 'prakharjain09' has created a pull request for this issue:
https://github.com/apache/spark/pull/29318

> Unused DPP Filter causes issue in canonicalization and prevents reuse exchange
> --
>
> Key: SPARK-32509
> URL: https://issues.apache.org/jira/browse/SPARK-32509
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.1.0
>Reporter: Prakhar Jain
>Priority: Major
>
> As part of the PlanDynamicPruningFilter rule, unused DPP filters are simply 
> replaced by `DynamicPruningExpression(TrueLiteral)` so that they can be 
> avoided. But these unnecessary `DynamicPruningExpression(TrueLiteral)` 
> partition filters inside FileSourceScanExec affect the canonicalization of 
> the node, and so in many cases this can prevent ReuseExchange from 
> happening.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32480) Support insert overwrite to move the data to trash

2020-07-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168829#comment-17168829
 ] 

Apache Spark commented on SPARK-32480:
--

User 'Udbhav30' has created a pull request for this issue:
https://github.com/apache/spark/pull/29319

> Support insert overwrite to move the data to trash 
> ---
>
> Key: SPARK-32480
> URL: https://issues.apache.org/jira/browse/SPARK-32480
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.1.0
>Reporter: jobit mathew
>Priority: Minor
>
> Instead of deleting the data, move it to the trash. Data in the trash can then 
> be deleted permanently, based on configuration.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32510) JDBC doesn't check duplicate column names in nested structures

2020-07-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168830#comment-17168830
 ] 

Apache Spark commented on SPARK-32510:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/29317

> JDBC doesn't check duplicate column names in nested structures 
> ---
>
> Key: SPARK-32510
> URL: https://issues.apache.org/jira/browse/SPARK-32510
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> JdbcUtils.getCustomSchema calls checkColumnNameDuplication() which checks 
> duplicates on top-level but not in nested structures as other built-in 
> datasources do, see
> [https://github.com/apache/spark/blob/8bc799f92005c903868ef209f5aec8deb6ccce5a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L822-L823]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32480) Support insert overwrite to move the data to trash

2020-07-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32480:


Assignee: Apache Spark

> Support insert overwrite to move the data to trash 
> ---
>
> Key: SPARK-32480
> URL: https://issues.apache.org/jira/browse/SPARK-32480
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.1.0
>Reporter: jobit mathew
>Assignee: Apache Spark
>Priority: Minor
>
> Instead of deleting the data, move it to the trash. Data in the trash can then 
> be deleted permanently, based on configuration.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32509) Unused DPP Filter causes issue in canonicalization and prevents reuse exchange

2020-07-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32509:


Assignee: (was: Apache Spark)

> Unused DPP Filter causes issue in canonicalization and prevents reuse exchange
> --
>
> Key: SPARK-32509
> URL: https://issues.apache.org/jira/browse/SPARK-32509
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.1.0
>Reporter: Prakhar Jain
>Priority: Major
>
> As part of the PlanDynamicPruningFilter rule, unused DPP filters are simply 
> replaced by `DynamicPruningExpression(TrueLiteral)` so that they can be 
> avoided. But these unnecessary `DynamicPruningExpression(TrueLiteral)` 
> partition filters inside FileSourceScanExec affect the canonicalization of 
> the node, and so in many cases this can prevent ReuseExchange from 
> happening.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32509) Unused DPP Filter causes issue in canonicalization and prevents reuse exchange

2020-07-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32509:


Assignee: Apache Spark

> Unused DPP Filter causes issue in canonicalization and prevents reuse exchange
> --
>
> Key: SPARK-32509
> URL: https://issues.apache.org/jira/browse/SPARK-32509
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.1.0
>Reporter: Prakhar Jain
>Assignee: Apache Spark
>Priority: Major
>
> As part of the PlanDynamicPruningFilter rule, unused DPP filters are simply 
> replaced by `DynamicPruningExpression(TrueLiteral)` so that they can be 
> avoided. But these unnecessary `DynamicPruningExpression(TrueLiteral)` 
> partition filters inside FileSourceScanExec affect the canonicalization of 
> the node, and so in many cases this can prevent ReuseExchange from 
> happening.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32510) JDBC doesn't check duplicate column names in nested structures

2020-07-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32510:


Assignee: Apache Spark

> JDBC doesn't check duplicate column names in nested structures 
> ---
>
> Key: SPARK-32510
> URL: https://issues.apache.org/jira/browse/SPARK-32510
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>
> JdbcUtils.getCustomSchema calls checkColumnNameDuplication() which checks 
> duplicates on top-level but not in nested structures as other built-in 
> datasources do, see
> [https://github.com/apache/spark/blob/8bc799f92005c903868ef209f5aec8deb6ccce5a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L822-L823]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32480) Support insert overwrite to move the data to trash

2020-07-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32480:


Assignee: (was: Apache Spark)

> Support insert overwrite to move the data to trash 
> ---
>
> Key: SPARK-32480
> URL: https://issues.apache.org/jira/browse/SPARK-32480
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.1.0
>Reporter: jobit mathew
>Priority: Minor
>
> Instead of deleting the data, move it to the trash. Data in the trash can then 
> be deleted permanently, based on configuration.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32510) JDBC doesn't check duplicate column names in nested structures

2020-07-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32510:


Assignee: (was: Apache Spark)

> JDBC doesn't check duplicate column names in nested structures 
> ---
>
> Key: SPARK-32510
> URL: https://issues.apache.org/jira/browse/SPARK-32510
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> JdbcUtils.getCustomSchema calls checkColumnNameDuplication() which checks 
> duplicates on top-level but not in nested structures as other built-in 
> datasources do, see
> [https://github.com/apache/spark/blob/8bc799f92005c903868ef209f5aec8deb6ccce5a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L822-L823]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32495) Update jackson versions from 2.4.6 and so on(2.4.x)

2020-07-31 Thread Prashant Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168825#comment-17168825
 ] 

Prashant Sharma commented on SPARK-32495:
-

Furthermore, according to 
https://github.com/FasterXML/jackson-databind/commits/2.6, version 2.6.7.3 has 
the fixes for all the CVEs addressed up to version 2.9.10, which is >= 2.9.8.
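
A quick way to confirm which jackson-databind version a given Spark distribution actually ships, e.g. from spark-shell; PackageVersion is part of jackson-databind itself, and the printed value is simply whatever your build bundles:

{code:scala}
// Prints the jackson-databind version found on the classpath.
println(com.fasterxml.jackson.databind.cfg.PackageVersion.VERSION)
{code}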
 

> Update jackson versions from 2.4.6 and so on(2.4.x)
> ---
>
> Key: SPARK-32495
> URL: https://issues.apache.org/jira/browse/SPARK-32495
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 2.4.6
>Reporter: SHOBHIT SHUKLA
>Priority: Major
>
> FasterXML Jackson versions before 2.9.8 are affected by multiple CVEs 
> ([https://github.com/FasterXML/jackson-databind/issues/2186]). Would it be 
> possible to upgrade the Jackson version to >= 2.9.8 for Spark 2.4.6 and 
> later 2.4.x releases?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32510) JDBC doesn't check duplicate column names in nested structures

2020-07-31 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-32510:
--

 Summary: JDBC doesn't check duplicate column names in nested 
structures 
 Key: SPARK-32510
 URL: https://issues.apache.org/jira/browse/SPARK-32510
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.0
Reporter: Maxim Gekk


JdbcUtils.getCustomSchema calls checkColumnNameDuplication() which checks 
duplicates on top-level but not in nested structures as other built-in 
datasources do, see

[https://github.com/apache/spark/blob/8bc799f92005c903868ef209f5aec8deb6ccce5a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L822-L823]
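
A small illustration of the kind of schema that slips through a top-level-only check (plain StructType construction, not tied to any particular JDBC source):

{code:scala}
import org.apache.spark.sql.types._

// Top-level names are unique, so a top-level duplication check passes,
// but the nested struct repeats the field name "a".
val nested = StructType(Seq(
  StructField("a", IntegerType),
  StructField("a", StringType)))
val customSchema = StructType(Seq(
  StructField("id", LongType),
  StructField("st", nested)))

println(customSchema.catalogString) // struct<id:bigint,st:struct<a:int,a:string>>
{code}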



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32509) Unused DPP Filter causes issue in canonicalization and prevents reuse exchange

2020-07-31 Thread Prakhar Jain (Jira)
Prakhar Jain created SPARK-32509:


 Summary: Unused DPP Filter causes issue in canonicalization and 
prevents reuse exchange
 Key: SPARK-32509
 URL: https://issues.apache.org/jira/browse/SPARK-32509
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0, 3.0.1, 3.1.0
Reporter: Prakhar Jain


As part of the PlanDynamicPruningFilter rule, unused DPP filters are simply 
replaced by `DynamicPruningExpression(TrueLiteral)` so that they can be 
avoided. But these unnecessary `DynamicPruningExpression(TrueLiteral)` partition 
filters inside FileSourceScanExec affect the canonicalization of the node, and 
so in many cases this can prevent ReuseExchange from happening.
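
A purely illustrative sketch of why the leftover placeholder matters (simplified stand-in types, not Spark's internal code): two scans whose partition filters are semantically identical but differ by an always-true entry do not compare equal, so the exchange above them is not reused.

{code:scala}
case class Scan(partitionFilters: Seq[String])

val withoutPlaceholder = Scan(Seq("date_id IN (dpp subquery)"))
val withPlaceholder    = Scan(Seq("date_id IN (dpp subquery)", "TRUE")) // DynamicPruningExpression(TrueLiteral)

// Structurally different, so a naive canonical comparison treats them as different plans.
println(withoutPlaceholder == withPlaceholder) // false
{code}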



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32508) Disallow empty part col values in partition spec before static partition writing

2020-07-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32508:


Assignee: (was: Apache Spark)

> Disallow empty part col values in partition spec before static partition 
> writing
> 
>
> Key: SPARK-32508
> URL: https://issues.apache.org/jira/browse/SPARK-32508
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: dzcxzl
>Priority: Trivial
>
> When writing to a static partition whose partition column value is empty, the 
> error is only reported after all tasks have completed.
> We can reject such writes before submitting the tasks.
>  
> {code:java}
> org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: get partition: Value for 
> key d is null or empty;
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:113)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.getPartitionOption(HiveExternalCatalog.scala:1212)
> at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.getPartitionOption(ExternalCatalogWithListener.scala:240)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.processInsert(InsertIntoHiveTable.scala:276)
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32508) Disallow empty part col values in partition spec before static partition writing

2020-07-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32508:


Assignee: Apache Spark

> Disallow empty part col values in partition spec before static partition 
> writing
> 
>
> Key: SPARK-32508
> URL: https://issues.apache.org/jira/browse/SPARK-32508
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: dzcxzl
>Assignee: Apache Spark
>Priority: Trivial
>
> When writing to a static partition whose partition column value is empty, the 
> error is only reported after all tasks have completed.
> We can reject such writes before submitting the tasks.
>  
> {code:java}
> org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: get partition: Value for 
> key d is null or empty;
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:113)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.getPartitionOption(HiveExternalCatalog.scala:1212)
> at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.getPartitionOption(ExternalCatalogWithListener.scala:240)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.processInsert(InsertIntoHiveTable.scala:276)
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32508) Disallow empty part col values in partition spec before static partition writing

2020-07-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168630#comment-17168630
 ] 

Apache Spark commented on SPARK-32508:
--

User 'cxzl25' has created a pull request for this issue:
https://github.com/apache/spark/pull/29316

> Disallow empty part col values in partition spec before static partition 
> writing
> 
>
> Key: SPARK-32508
> URL: https://issues.apache.org/jira/browse/SPARK-32508
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: dzcxzl
>Priority: Trivial
>
> When writing to a static partition whose partition column value is empty, the 
> error is only reported after all tasks have completed.
> We can reject such writes before submitting the tasks.
>  
> {code:java}
> org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: get partition: Value for 
> key d is null or empty;
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:113)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.getPartitionOption(HiveExternalCatalog.scala:1212)
> at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.getPartitionOption(ExternalCatalogWithListener.scala:240)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.processInsert(InsertIntoHiveTable.scala:276)
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32508) Disallow empty part col values in partition spec before static partition writing

2020-07-31 Thread dzcxzl (Jira)
dzcxzl created SPARK-32508:
--

 Summary: Disallow empty part col values in partition spec before 
static partition writing
 Key: SPARK-32508
 URL: https://issues.apache.org/jira/browse/SPARK-32508
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: dzcxzl


When writing to a static partition whose partition column value is empty, the 
error is only reported after all tasks have completed.
We can reject such writes before submitting the tasks.

 
{code:java}
org.apache.spark.sql.AnalysisException: 
org.apache.hadoop.hive.ql.metadata.HiveException: get partition: Value for key 
d is null or empty;
at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:113)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.getPartitionOption(HiveExternalCatalog.scala:1212)
at 
org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.getPartitionOption(ExternalCatalogWithListener.scala:240)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.processInsert(InsertIntoHiveTable.scala:276)
{code}
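
A minimal way to hit the error above; the table name and layout are assumptions, and this requires a Hive-backed catalog (spark.sql.catalogImplementation=hive):

{code:scala}
spark.sql("CREATE TABLE t (id INT) PARTITIONED BY (d STRING) STORED AS PARQUET")

// The static partition value is empty, so today every task runs before the
// metastore lookup fails; the proposal is to reject this before submission.
spark.sql("INSERT OVERWRITE TABLE t PARTITION (d = '') SELECT 1")
{code}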
 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32495) Update jackson versions from 2.4.6 and so on(2.4.x)

2020-07-31 Thread Prashant Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168628#comment-17168628
 ] 

Prashant Sharma commented on SPARK-32495:
-

Is this not specific to jackson-databind?

And the link says version 2.6.7.3 also has the fixes for those CVEs.

It appears the issue SPARK-30333 already fixed it. Did I miss anything here?

> Update jackson versions from 2.4.6 and so on(2.4.x)
> ---
>
> Key: SPARK-32495
> URL: https://issues.apache.org/jira/browse/SPARK-32495
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 2.4.6
>Reporter: SHOBHIT SHUKLA
>Priority: Major
>
> FasterXML Jackson versions before 2.9.8 are affected by multiple CVEs 
> ([https://github.com/FasterXML/jackson-databind/issues/2186]). Would it be 
> possible to upgrade the Jackson version to >= 2.9.8 for Spark 2.4.6 and 
> later 2.4.x releases?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32268) Bloom Filter Join

2020-07-31 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168582#comment-17168582
 ] 

Yuming Wang commented on SPARK-32268:
-

{code:scala}
import org.apache.spark.benchmark.Benchmark
import org.apache.spark.util.sketch.BloomFilter

val N = 2
val path = "/tmp/spark/bloomfilter"
spark.range(N).write.mode("overwrite").parquet(path)

val benchmark = new Benchmark(s"Benchmark bloom filter, HashSet and In", 
valuesPerIteration = N, minNumIters = 2)

Seq(10, 100, 1000, 1, 10, 100, 500, 1000).foreach { items =>
  benchmark.addCase(s"bloom filter with ${items}") { _ =>
val br = BloomFilter.create(items)
(1 to items).foreach(br.putLong(_))
println(com.carrotsearch.sizeof.RamUsageEstimator.sizeOf(br))
spark.read.parquet(path).filter(r => 
br.mightContainLong(r.getLong(0))).write.format("noop").mode("overwrite").save()
  }

  benchmark.addCase(s"HashSet with ${items}") { _ =>
val hs = new java.util.HashSet[Long](items)
(1 to items).foreach(hs.add(_))
println(com.carrotsearch.sizeof.RamUsageEstimator.sizeOf(hs))
spark.read.parquet(path).filter(r => 
hs.contains(r.getLong(0))).write.format("noop").mode("overwrite").save()
  }

  if (items <= 10) {
benchmark.addCase(s"in with ${items}") { _ =>
      spark.read.parquet(path).filter(s"id in (${(1 to items).mkString(", ")})").write.format("noop").mode("overwrite").save()
}
  }
}
benchmark.run()
{code}
 
|num Items|bloom filter(ms)|hash set(ms)|In predicate(ms)|sizeOf bloom filter(bytes)|sizeOf hash set(bytes)|
|10|6635|4147|602|80|720|
|100|7374|3908|3195|160|6720|
|1000|7490|5095|4126|984|64288|
|1|7533|4196|6081|9192|625632|
|10|7838|5154|23406|91296|6648672|
|100|8950|8425|N/A|912376|64388704|
|500|10465|25624|N/A|4561592|313554528|
|1000|19287|104596|N/A|9123120|627108960|

> Bloom Filter Join
> -
>
> Key: SPARK-32268
> URL: https://issues.apache.org/jira/browse/SPARK-32268
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Attachments: q16-bloom-filter.jpg, q16-default.jpg
>
>
> We can improve the performance of some joins by pre-filtering one side of a 
> join using a Bloom filter and IN predicate generated from the values from the 
> other side of the join.
>  For 
> example:[tpcds/q16.sql|https://github.com/apache/spark/blob/a78d6ce376edf2a8836e01f47b9dff5371058d4c/sql/core/src/test/resources/tpcds/q16.sql].
>  [Before this 
> optimization|https://issues.apache.org/jira/secure/attachment/13007418/q16-default.jpg].
>  [After this 
> optimization|https://issues.apache.org/jira/secure/attachment/13007416/q16-bloom-filter.jpg].
> *Query Performance Benchmarks: TPC-DS Performance Evaluation*
>  Our setup for running TPC-DS benchmark was as follows: TPC-DS 5T and 
> Partitioned Parquet table
>  
> |Query|Default(Seconds)|Enable Bloom Filter Join(Seconds)|
> |tpcds q16|84|46|
> |tpcds q36|29|21|
> |tpcds q57|39|28|
> |tpcds q94|42|34|
> |tpcds q95|306|288|



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31894) Introduce UnsafeRow format validation for streaming state store

2020-07-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168564#comment-17168564
 ] 

Apache Spark commented on SPARK-31894:
--

User 'xuanyuanking' has created a pull request for this issue:
https://github.com/apache/spark/pull/29315

> Introduce UnsafeRow format validation for streaming state store
> ---
>
> Key: SPARK-31894
> URL: https://issues.apache.org/jira/browse/SPARK-31894
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Major
> Fix For: 3.1.0
>
>
> Currently, Structured Streaming puts the UnsafeRow directly into the StateStore 
> without any schema validation. This is dangerous when users reuse 
> the checkpoint file during migration: any change or bug fix related to the 
> aggregate function may cause random exceptions, or even wrong answers, e.g. 
> SPARK-28067.
> Here we introduce an UnsafeRow format validation for the state store.
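
As a rough illustration of what a structural check on a stored row could look like (a sketch against the public UnsafeRow API only, not the validation added by the linked PR):

{code:scala}
import org.apache.spark.sql.catalyst.expressions.UnsafeRow
import org.apache.spark.sql.types.StructType

// Reject rows whose shape cannot possibly match the expected state schema.
def looksStructurallyValid(row: UnsafeRow, schema: StructType): Boolean = {
  // The row must carry exactly as many fields as the schema expects.
  if (row.numFields != schema.length) return false
  // The fixed-length region alone needs the null bitset plus 8 bytes per field.
  val bitSetBytes = UnsafeRow.calculateBitSetWidthInWords(schema.length) * 8
  row.getSizeInBytes >= bitSetBytes + 8 * schema.length
}
{code}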



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32507) Main Page

2020-07-31 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32507:
-
Description: 
We should make a main package to overview PySpark properly. See the demo 
example:

 https://hyukjin-spark.readthedocs.io/en/latest/

  was:
We should make a main package to overview PySpark properly. See the demo 
example:

https://spark.apache.org/docs/latest/api/python/index.html


> Main Page
> -
>
> Key: SPARK-32507
> URL: https://issues.apache.org/jira/browse/SPARK-32507
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> We should make a main package to overview PySpark properly. See the demo 
> example:
>  https://hyukjin-spark.readthedocs.io/en/latest/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32507) Main Page

2020-07-31 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32507:
-
Description: 
We should make a main package to overview PySpark properly. See the demo 
example:

https://spark.apache.org/docs/latest/api/python/index.html

  was:We should make a main package to overview PySpark properly. See the demo 
example:


> Main Page
> -
>
> Key: SPARK-32507
> URL: https://issues.apache.org/jira/browse/SPARK-32507
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> We should make a main package to overview PySpark properly. See the demo 
> example:
> https://spark.apache.org/docs/latest/api/python/index.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32507) Main Page

2020-07-31 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-32507:


 Summary: Main Page
 Key: SPARK-32507
 URL: https://issues.apache.org/jira/browse/SPARK-32507
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation
Affects Versions: 3.1.0
Reporter: Hyukjin Kwon
Assignee: Hyukjin Kwon


We should make a main package to overview PySpark properly. See the demo 
example:



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32506) flaky test: pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests

2020-07-31 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168525#comment-17168525
 ] 

Wenchen Fan commented on SPARK-32506:
-

cc [~ruifengz] [~weichenxu123]

> flaky test: 
> pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests
> 
>
> Key: SPARK-32506
> URL: https://issues.apache.org/jira/browse/SPARK-32506
> Project: Spark
>  Issue Type: Test
>  Components: MLlib
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Priority: Major
>
> {code}
> FAIL: test_train_prediction 
> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests)
> Test that error on test data improves as model is trained.
> --
> Traceback (most recent call last):
>   File 
> "/home/runner/work/spark/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 466, in test_train_prediction
> eventually(condition, timeout=180.0)
>   File "/home/runner/work/spark/spark/python/pyspark/testing/utils.py", line 
> 81, in eventually
> lastValue = condition()
>   File 
> "/home/runner/work/spark/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 461, in condition
> self.assertGreater(errors[1] - errors[-1], 2)
> AssertionError: 1.672640157855923 not greater than 2
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32506) flaky test: pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests

2020-07-31 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-32506:
---

 Summary: flaky test: 
pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests
 Key: SPARK-32506
 URL: https://issues.apache.org/jira/browse/SPARK-32506
 Project: Spark
  Issue Type: Test
  Components: MLlib
Affects Versions: 3.1.0
Reporter: Wenchen Fan


{code}
FAIL: test_train_prediction 
(pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests)
Test that error on test data improves as model is trained.
--
Traceback (most recent call last):
  File 
"/home/runner/work/spark/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
 line 466, in test_train_prediction
eventually(condition, timeout=180.0)
  File "/home/runner/work/spark/spark/python/pyspark/testing/utils.py", line 
81, in eventually
lastValue = condition()
  File 
"/home/runner/work/spark/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
 line 461, in condition
self.assertGreater(errors[1] - errors[-1], 2)
AssertionError: 1.672640157855923 not greater than 2
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32456) Check the Distinct by assuming it as Aggregate for Structured Streaming

2020-07-31 Thread Yuanjian Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanjian Li updated SPARK-32456:

Description: 
We want to fix 2 things here:

1. Give a better error message for Distinct-related operations in append mode that don't have a watermark

Check the following example: 
{code:java}
val s1 = spark.readStream.format("rate").option("rowsPerSecond", 
1).load().createOrReplaceTempView("s1")
val s2 = spark.readStream.format("rate").option("rowsPerSecond", 
1).load().createOrReplaceTempView("s2")
val unionRes = spark.sql("select s1.value, s1.timestamp from s1 union select s2.value, s2.timestamp from s2")
unionRes.writeStream.option("checkpointLocation", 
${pathA}).start(${pathB}){code}
We'll get the following confusing exception:
{code:java}
java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:529)
at scala.None$.get(Option.scala:527)
at 
org.apache.spark.sql.execution.streaming.StateStoreSaveExec.$anonfun$doExecute$9(statefulOperators.scala:346)
at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.Utils$.timeTakenMs(Utils.scala:561)
at 
org.apache.spark.sql.execution.streaming.StateStoreWriter.timeTakenMs(statefulOperators.scala:112)
...
{code}
The union clause in SQL requires deduplication, so the parser generates 
{{Distinct(Union)}} and the optimizer rule {{ReplaceDistinctWithAggregate}} 
rewrites it to {{Aggregate(Union)}}. The root cause here is that the checking 
logic applied to Aggregate is not applied to Distinct.

Actually this happens for all Distinct-related operations in Structured 
Streaming, e.g.
{code:java}
val df = spark.readStream.format("rate").load()
df.createOrReplaceTempView("deduptest")
val distinct = spark.sql("select distinct value from deduptest")
distinct.writeStream.option("checkpointLocation", 
${pathA}).start(${pathB}){code}
 

2. Make {{Distinct}} in complete mode runnable.

Distinct in complete mode currently throws the exception:

{quote} 

{{Complete output mode not supported when there are no streaming aggregations 
on streaming DataFrames/Datasets;}}

{quote}

  was:
Check the following example:

 
{code:java}
val s1 = spark.readStream.format("rate").option("rowsPerSecond", 
1).load().createOrReplaceTempView("s1")
val s2 = spark.readStream.format("rate").option("rowsPerSecond", 
1).load().createOrReplaceTempView("s2")
val unionRes = spark.sql("select s1.value, s1.timestamp from s1 union select s2.value, s2.timestamp from s2")
unionRes.writeStream.option("checkpointLocation", 
${pathA}).start(${pathB}){code}
We'll get the following confusing exception:
{code:java}
java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:529)
at scala.None$.get(Option.scala:527)
at 
org.apache.spark.sql.execution.streaming.StateStoreSaveExec.$anonfun$doExecute$9(statefulOperators.scala:346)
at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.Utils$.timeTakenMs(Utils.scala:561)
at 
org.apache.spark.sql.execution.streaming.StateStoreWriter.timeTakenMs(statefulOperators.scala:112)
...
{code}
The union clause in SQL has the requirement of deduplication, the parser will 
generate {{Distinct(Union)}} and the optimizer rule 
{{ReplaceDistinctWithAggregate}} will change it to {{Aggregate(Union)}}. So the 
root cause here is the checking logic for Aggregate is missing for Distinct.

 

Actually it happens for all Distinct related operations in Structured 
Streaming, e.g
{code:java}
val df = spark.readStream.format("rate").load()
df.createOrReplaceTempView("deduptest")
val distinct = spark.sql("select distinct value from deduptest")
distinct.writeStream.option("checkpointLocation", 
${pathA}).start(${pathB}){code}
 


> Check the Distinct by assuming it as Aggregate for Structured Streaming
> ---
>
> Key: SPARK-32456
> URL: https://issues.apache.org/jira/browse/SPARK-32456
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>
> We want to fix 2 things here:
> 1. Give better error message for Distinct related operations in append mode 
> that doesn't have a watermark
> Check the following example: 
> {code:java}
> val s1 = spark.readStream.format("rate").option("rowsPerSecond", 
> 1).load().createOrReplaceTempView("s1")
> val s2 = spark.readStream.format("rate").option("rowsPerSecond", 
> 1).load().createOrReplaceTempView("s2")
> val unionRes = spark.sql("select s1.value, s1.timestamp from s1 union select 
> s2.value, s2.timestamp from s2")
> unionRes.writeStream.option("checkpointLocation", 
> ${pathA}).start(

[jira] [Updated] (SPARK-32456) Check the Distinct by assuming it as Aggregate for Structured Streaming

2020-07-31 Thread Yuanjian Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanjian Li updated SPARK-32456:

Summary: Check the Distinct by assuming it as Aggregate for Structured 
Streaming  (was: Give better error message for Distinct related operations in 
append mode without watermark)

> Check the Distinct by assuming it as Aggregate for Structured Streaming
> ---
>
> Key: SPARK-32456
> URL: https://issues.apache.org/jira/browse/SPARK-32456
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>
> Check the following example:
>  
> {code:java}
> val s1 = spark.readStream.format("rate").option("rowsPerSecond", 
> 1).load().createOrReplaceTempView("s1")
> val s2 = spark.readStream.format("rate").option("rowsPerSecond", 
> 1).load().createOrReplaceTempView("s2")
> val unionRes = spark.sql("select s1.value, s1.timestamp from s1 union select 
> s2.value, s2.timestamp from s2")
> unionRes.writeStream.option("checkpointLocation", 
> ${pathA}).start(${pathB}){code}
> We'll get the following confusing exception:
> {code:java}
> java.util.NoSuchElementException: None.get
>   at scala.None$.get(Option.scala:529)
>   at scala.None$.get(Option.scala:527)
>   at 
> org.apache.spark.sql.execution.streaming.StateStoreSaveExec.$anonfun$doExecute$9(statefulOperators.scala:346)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at org.apache.spark.util.Utils$.timeTakenMs(Utils.scala:561)
>   at 
> org.apache.spark.sql.execution.streaming.StateStoreWriter.timeTakenMs(statefulOperators.scala:112)
> ...
> {code}
> The union clause in SQL has the requirement of deduplication, the parser will 
> generate {{Distinct(Union)}} and the optimizer rule 
> {{ReplaceDistinctWithAggregate}} will change it to {{Aggregate(Union)}}. So 
> the root cause here is the checking logic for Aggregate is missing for 
> Distinct.
>  
> Actually it happens for all Distinct related operations in Structured 
> Streaming, e.g
> {code:java}
> val df = spark.readStream.format("rate").load()
> df.createOrReplaceTempView("deduptest")
> val distinct = spark.sql("select distinct value from deduptest")
> distinct.writeStream.option("checkpointLocation", 
> ${pathA}).start(${pathB}){code}
>  
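
For readers less familiar with the rewrite mentioned above, a paraphrased sketch of what {{ReplaceDistinctWithAggregate}} does (illustrative only; see the rule in Catalyst's optimizer for the real code): a {{Distinct}} node becomes a grouping {{Aggregate}} over all output columns, which is why a check that only looks for {{Aggregate}} misses plans written with {{Distinct}}.

{code:scala}
import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, Distinct, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

// Sketch of the rewrite: SELECT DISTINCT ... is treated as GROUP BY over all columns.
object ReplaceDistinctWithAggregateSketch extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan.transform {
    case Distinct(child) => Aggregate(child.output, child.output, child)
  }
}
{code}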



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org