[jira] [Updated] (SPARK-39605) PySpark df.count() operation works fine on DBR 7.3 LTS but fails in DBR 10.4 LTS

2022-06-26 Thread Manoj Chandrashekar (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manoj Chandrashekar updated SPARK-39605:

Description: 
I have a job that infers the schema from MongoDB and performs operations such as 
flattening and unwinding because there are nested fields. After the various 
transformations, when count() is finally performed, it works perfectly fine on 
Databricks Runtime 7.3 LTS but fails in 10.4 LTS.

*Below is the image that shows a successful run in 7.3 LTS:*

!https://docs.microsoft.com/answers/storage/attachments/215035-image.png|width=630,height=75!

*Below is the image that shows the failure in 10.4 LTS:*

!image-2022-06-27-11-00-50-119.png|width=624,height=64!

I have validated that no field in our schema has NullType. In fact, when the 
schema was inferred, there were null and void type fields, which were converted 
to string using my UDF. This issue persists even when I infer the schema on the 
complete dataset, that is, when samplePoolSize covers the full data set.

  was:
I have a job that infers the schema from MongoDB and performs operations such as 
flattening and unwinding because there are nested fields. After the various 
transformations, when count() is finally performed, it works perfectly fine on 
Databricks Runtime 7.3 LTS but fails in 10.4 LTS.

*Below is the image that shows a successful run in 7.3 LTS:*

!https://docs.microsoft.com/answers/storage/attachments/215035-image.png|width=672,height=80!

*Below is the image that shows the failure in 10.4 LTS:*

!https://docs.microsoft.com/answers/storage/attachments/215026-image.png|width=668,height=69!

I have validated that no field in our schema has NullType. In fact, when the 
schema was inferred, there were null and void type fields, which were converted 
to string using my UDF. This issue persists even when I infer the schema on the 
complete dataset, that is, when samplePoolSize covers the full data set.


> PySpark df.count() operation works fine on DBR 7.3 LTS but fails in DBR 10.4 
> LTS
> 
>
> Key: SPARK-39605
> URL: https://issues.apache.org/jira/browse/SPARK-39605
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.1
>Reporter: Manoj Chandrashekar
>Priority: Major
> Attachments: image-2022-06-27-11-00-50-119.png
>
>
> I have a job that infers the schema from MongoDB and performs operations such as 
> flattening and unwinding because there are nested fields. After the various 
> transformations, when count() is finally performed, it works perfectly fine on 
> Databricks Runtime 7.3 LTS but fails in 10.4 LTS.
> *Below is the image that shows a successful run in 7.3 LTS:*
> !https://docs.microsoft.com/answers/storage/attachments/215035-image.png|width=630,height=75!
> *Below is the image that shows the failure in 10.4 LTS:*
> !image-2022-06-27-11-00-50-119.png|width=624,height=64!
> I have validated that no field in our schema has NullType. In fact, when the 
> schema was inferred, there were null and void type fields, which were converted 
> to string using my UDF. This issue persists even when I infer the schema on the 
> complete dataset, that is, when samplePoolSize covers the full data set.
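
For reference, a minimal PySpark sketch of converting inferred NullType (void) columns to string before calling count(), as the description above does with a UDF. This is not the reporter's code; the DataFrame name and the restriction to top-level columns are assumptions:

{code:python}
from pyspark.sql import functions as F
from pyspark.sql.types import NullType, StringType

def cast_void_columns_to_string(df):
    # Replace every top-level NullType (void) column with a string column so
    # that a later action such as count() does not hit unsupported NullType
    # fields.
    for field in df.schema.fields:
        if isinstance(field.dataType, NullType):
            df = df.withColumn(field.name, F.col(field.name).cast(StringType()))
    return df

# flattened_df is assumed to be the DataFrame obtained after flattening and
# unwinding the nested MongoDB documents.
# cleaned_df = cast_void_columns_to_string(flattened_df)
# cleaned_df.count()
{code}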






[jira] [Updated] (SPARK-39605) PySpark df.count() operation works fine on DBR 7.3 LTS but fails in DBR 10.4 LTS

2022-06-26 Thread Manoj Chandrashekar (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manoj Chandrashekar updated SPARK-39605:

Attachment: image-2022-06-27-11-00-50-119.png

> PySpark df.count() operation works fine on DBR 7.3 LTS but fails in DBR 10.4 
> LTS
> 
>
> Key: SPARK-39605
> URL: https://issues.apache.org/jira/browse/SPARK-39605
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.1
>Reporter: Manoj Chandrashekar
>Priority: Major
> Attachments: image-2022-06-27-11-00-50-119.png
>
>
> I have a job that infers the schema from MongoDB and performs operations such as 
> flattening and unwinding because there are nested fields. After the various 
> transformations, when count() is finally performed, it works perfectly fine on 
> Databricks Runtime 7.3 LTS but fails in 10.4 LTS.
> *Below is the image that shows a successful run in 7.3 LTS:*
> !https://docs.microsoft.com/answers/storage/attachments/215035-image.png|width=672,height=80!
> *Below is the image that shows the failure in 10.4 LTS:*
> !https://docs.microsoft.com/answers/storage/attachments/215026-image.png|width=668,height=69!
> I have validated that no field in our schema has NullType. In fact, when the 
> schema was inferred, there were null and void type fields, which were converted 
> to string using my UDF. This issue persists even when I infer the schema on the 
> complete dataset, that is, when samplePoolSize covers the full data set.






[jira] [Updated] (SPARK-39612) The dataframe returned by exceptAll() can no longer perform operations such as count() or isEmpty(); an exception will be thrown.

2022-06-26 Thread Zhu JunYong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhu JunYong updated SPARK-39612:

Environment: 
OS: centos stream 8
{code:java}
$ uname -a
Linux thomaszhu1.fyre.ibm.com 4.18.0-348.7.1.el8_5.x86_64 #1 SMP Wed Dec 22 
13:25:12 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

$ python --version
Python 3.8.13 

$ pyspark --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.3.0
      /_/
                        
Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 11.0.11
Branch HEAD
Compiled by user ubuntu on 2022-06-09T19:58:58Z
Revision f74867bddfbcdd4d08076db36851e88b15e66556
Url https://github.com/apache/spark
Type --help for more information.

$ java --version
openjdk 11.0.11 2021-04-20
OpenJDK Runtime Environment AdoptOpenJDK-11.0.11+9 (build 11.0.11+9)
OpenJDK 64-Bit Server VM AdoptOpenJDK-11.0.11+9 (build 11.0.11+9, mixed mode) 
{code}
 

  was:
{code:java}
$ uname -a
Linux thomaszhu1.fyre.ibm.com 4.18.0-348.7.1.el8_5.x86_64 #1 SMP Wed Dec 22 
13:25:12 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

$ python --version
Python 3.8.13 

$ pyspark --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.3.0
      /_/
                        
Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 11.0.11
Branch HEAD
Compiled by user ubuntu on 2022-06-09T19:58:58Z
Revision f74867bddfbcdd4d08076db36851e88b15e66556
Url https://github.com/apache/spark
Type --help for more information.

$ java --version
openjdk 11.0.11 2021-04-20
OpenJDK Runtime Environment AdoptOpenJDK-11.0.11+9 (build 11.0.11+9)
OpenJDK 64-Bit Server VM AdoptOpenJDK-11.0.11+9 (build 11.0.11+9, mixed mode) 
{code}
 


> The dataframe returned by exceptAll() can no longer perform operations such 
> as count() or isEmpty(); an exception will be thrown.
> -
>
> Key: SPARK-39612
> URL: https://issues.apache.org/jira/browse/SPARK-39612
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0
> Environment: OS: centos stream 8
> {code:java}
> $ uname -a
> Linux thomaszhu1.fyre.ibm.com 4.18.0-348.7.1.el8_5.x86_64 #1 SMP Wed Dec 22 
> 13:25:12 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
> $ python --version
> Python 3.8.13 
> $ pyspark --version
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 3.3.0
>       /_/
>                         
> Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 11.0.11
> Branch HEAD
> Compiled by user ubuntu on 2022-06-09T19:58:58Z
> Revision f74867bddfbcdd4d08076db36851e88b15e66556
> Url https://github.com/apache/spark
> Type --help for more information.
> $ java --version
> openjdk 11.0.11 2021-04-20
> OpenJDK Runtime Environment AdoptOpenJDK-11.0.11+9 (build 11.0.11+9)
> OpenJDK 64-Bit Server VM AdoptOpenJDK-11.0.11+9 (build 11.0.11+9, mixed mode) 
> {code}
>  
>Reporter: Zhu JunYong
>Priority: Major
>
> As the title says, the dataframe returned by `exceptAll()` can no longer perform 
> operations such as `count()` or `isEmpty()`; an exception will be thrown.
>  
>  
> {code:java}
> >>> d1 = spark.createDataFrame([("a")], 'STRING')
> >>> d1.show()
> +-----+
> |value|
> +-----+
> |    a|
> +-----+
> >>> d2 = d1.exceptAll(d1)
> >>> d2.show()
> +-----+
> |value|
> +-----+
> +-----+
> >>> d2.count()
> 22/06/27 11:22:15 ERROR Executor: Exception in task 0.0 in stage 113.0 (TID 
> 525)
> java.lang.IllegalStateException: Couldn't find value#465 in [sum#494L]
>     at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80)
>     at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
>     at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589)
>     at scala.collection.immutable.List.map(List.scala:297)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:698)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:560)
>     at 
> 

[jira] [Created] (SPARK-39612) The dataframe returned by exceptAll() can no longer perform operations such as count() or isEmpty(); an exception will be thrown.

2022-06-26 Thread Zhu JunYong (Jira)
Zhu JunYong created SPARK-39612:
---

 Summary: The dataframe returned by exceptAll() can no longer 
perform operations such as count() or isEmpty(); an exception will be 
thrown.
 Key: SPARK-39612
 URL: https://issues.apache.org/jira/browse/SPARK-39612
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.3.0
 Environment: {code:java}
$ uname -a
Linux thomaszhu1.fyre.ibm.com 4.18.0-348.7.1.el8_5.x86_64 #1 SMP Wed Dec 22 
13:25:12 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

$ python --version
Python 3.8.13 

$ pyspark --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.3.0
      /_/
                        
Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 11.0.11
Branch HEAD
Compiled by user ubuntu on 2022-06-09T19:58:58Z
Revision f74867bddfbcdd4d08076db36851e88b15e66556
Url https://github.com/apache/spark
Type --help for more information.

$ java --version
openjdk 11.0.11 2021-04-20
OpenJDK Runtime Environment AdoptOpenJDK-11.0.11+9 (build 11.0.11+9)
OpenJDK 64-Bit Server VM AdoptOpenJDK-11.0.11+9 (build 11.0.11+9, mixed mode) 
{code}
 
Reporter: Zhu JunYong


As the title says, the dataframe returned by `exceptAll()` can no longer perform 
operations such as `count()` or `isEmpty()`; an exception will be thrown.

 

 
{code:java}
>>> d1 = spark.createDataFrame([("a")], 'STRING')
>>> d1.show()
+-----+
|value|
+-----+
|    a|
+-----+
>>> d2 = d1.exceptAll(d1)
>>> d2.show()
+-----+
|value|
+-----+
+-----+
>>> d2.count()
22/06/27 11:22:15 ERROR Executor: Exception in task 0.0 in stage 113.0 (TID 525)
java.lang.IllegalStateException: Couldn't find value#465 in [sum#494L]
    at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80)
    at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73)
    at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
    at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)
    at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584)
    at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589)
    at scala.collection.immutable.List.map(List.scala:297)
    at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:698)
    at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589)
    at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:560)
    at 
org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:528)
    at 
org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:73)
    at 
org.apache.spark.sql.execution.GenerateExec.boundGenerator$lzycompute(GenerateExec.scala:75)
    at 
org.apache.spark.sql.execution.GenerateExec.boundGenerator(GenerateExec.scala:75)
    at 
org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$10(GenerateExec.scala:114)
    at 
org.apache.spark.sql.execution.LazyIterator.results$lzycompute(GenerateExec.scala:36)
    at 
org.apache.spark.sql.execution.LazyIterator.results(GenerateExec.scala:36)
    at 
org.apache.spark.sql.execution.LazyIterator.hasNext(GenerateExec.scala:37)
    at scala.collection.Iterator$ConcatIterator.advance(Iterator.scala:199)
    at scala.collection.Iterator$ConcatIterator.hasNext(Iterator.scala:227)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.hashAgg_doAggregateWithoutKey_0$(Unknown
 Source)
    at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.processNext(Unknown
 Source)
    at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
    at 
org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
    at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
    at org.apache.spark.scheduler.Task.run(Task.scala:136)
    at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
    at 

[jira] [Commented] (SPARK-39253) Improve PySpark API reference to be more readable

2022-06-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17558972#comment-17558972
 ] 

Apache Spark commented on SPARK-39253:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/36997

> Improve PySpark API reference to be more readable
> -
>
> Key: SPARK-39253
> URL: https://issues.apache.org/jira/browse/SPARK-39253
> Project: Spark
>  Issue Type: Test
>  Components: Documentation, PySpark
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.3.0, 3.4.0
>
>
> Currently, the PySpark documentation, especially the ["Spark SQL" 
> part|https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#], 
> is not well organized, so it is a bit uncomfortable to read.
> For example, the 
> [pyspark.sql.SparkSession|https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.SparkSession.html] 
> class page shows every available method as its sub-contents, but 
> pyspark.sql.DataFrameReader has neither its own top-level class page nor a 
> listing of its available methods.
> So we might need to refine the documentation to make it more readable so that 
> users can easily find the methods they want.






[jira] [Commented] (SPARK-39253) Improve PySpark API reference to be more readable

2022-06-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17558970#comment-17558970
 ] 

Apache Spark commented on SPARK-39253:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/36997

> Improve PySpark API reference to be more readable
> -
>
> Key: SPARK-39253
> URL: https://issues.apache.org/jira/browse/SPARK-39253
> Project: Spark
>  Issue Type: Test
>  Components: Documentation, PySpark
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.3.0, 3.4.0
>
>
> Currently, the PySpark documentation, especially the ["Spark SQL" 
> part|https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#], 
> is not well organized, so it is a bit uncomfortable to read.
> For example, the 
> [pyspark.sql.SparkSession|https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.SparkSession.html] 
> class page shows every available method as its sub-contents, but 
> pyspark.sql.DataFrameReader has neither its own top-level class page nor a 
> listing of its available methods.
> So we might need to refine the documentation to make it more readable so that 
> users can easily find the methods they want.






[jira] [Commented] (SPARK-39253) Improve PySpark API reference to be more readable

2022-06-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17558971#comment-17558971
 ] 

Apache Spark commented on SPARK-39253:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/36997

> Improve PySpark API reference to be more readable
> -
>
> Key: SPARK-39253
> URL: https://issues.apache.org/jira/browse/SPARK-39253
> Project: Spark
>  Issue Type: Test
>  Components: Documentation, PySpark
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.3.0, 3.4.0
>
>
> Currently, the PySpark documentation, especially the ["Spark SQL" 
> part|https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#], 
> is not well organized, so it is a bit uncomfortable to read.
> For example, the 
> [pyspark.sql.SparkSession|https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.SparkSession.html] 
> class page shows every available method as its sub-contents, but 
> pyspark.sql.DataFrameReader has neither its own top-level class page nor a 
> listing of its available methods.
> So we might need to refine the documentation to make it more readable so that 
> users can easily find the methods they want.






[jira] [Commented] (SPARK-39394) Improve PySpark structured streaming page to be more readable

2022-06-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17558969#comment-17558969
 ] 

Apache Spark commented on SPARK-39394:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/36997

> Improve PySpark structured streaming page to be more readable
> ---
>
> Key: SPARK-39394
> URL: https://issues.apache.org/jira/browse/SPARK-39394
> Project: Spark
>  Issue Type: Test
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.3.0, 3.4.0
>
>
> Similar to SPARK-39253, the PySpark documentation for the "Structured Streaming" 
> part is not well organized, so it is a bit uncomfortable to read.
> So we might need to refine the documentation to make it more readable so that 
> users can easily find the methods they want.






[jira] [Commented] (SPARK-39394) Improve PySpark structured streaming page to be more readable

2022-06-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17558968#comment-17558968
 ] 

Apache Spark commented on SPARK-39394:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/36997

> Improve PySpark structured streaming page to be more readable
> ---
>
> Key: SPARK-39394
> URL: https://issues.apache.org/jira/browse/SPARK-39394
> Project: Spark
>  Issue Type: Test
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.3.0, 3.4.0
>
>
> Similar to SPARK-39253, the PySpark documentation for the "Structured Streaming" 
> part is not well organized, so it is a bit uncomfortable to read.
> So we might need to refine the documentation to make it more readable so that 
> users can easily find the methods they want.






[jira] [Commented] (SPARK-38934) Provider TemporaryAWSCredentialsProvider has no credentials

2022-06-26 Thread Lily Kim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17558967#comment-17558967
 ] 

Lily Kim commented on SPARK-38934:
--

Since our system sets WebIdentityTokenCredentialsProvider as the default 
provider, I had to explicitly set TemporaryAWSCredentialsProvider.

Otherwise, the AWS SDK returns an Access Denied error.
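
For context, a minimal PySpark sketch of selecting the temporary-credentials provider explicitly at session creation. The credential values are placeholders; the Scala snippet quoted below sets the same fs.s3a.* keys on hadoopConfiguration at runtime:

{code:python}
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Force the s3a connector to use STS session credentials instead of the
    # cluster default (e.g. WebIdentityTokenCredentialsProvider).
    .config("spark.hadoop.fs.s3a.aws.credentials.provider",
            "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
    .config("spark.hadoop.fs.s3a.access.key", "<ACCESS_KEY>")
    .config("spark.hadoop.fs.s3a.secret.key", "<SECRET_KEY>")
    .config("spark.hadoop.fs.s3a.session.token", "<SESSION_TOKEN>")
    .getOrCreate()
)
{code}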

> Provider TemporaryAWSCredentialsProvider has no credentials
> ---
>
> Key: SPARK-38934
> URL: https://issues.apache.org/jira/browse/SPARK-38934
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.2.1
>Reporter: Lily
>Priority: Major
>
>  
> We are using Jupyter Hub on K8s as a notebook-based development environment 
> and Spark on K8s as the backend cluster of Jupyter Hub on K8s, with Spark 3.2.1 
> and Hadoop 3.3.1.
> When we run code like the one below in Jupyter Hub on K8s,
>  
> {code:java}
> val perm = ... // get AWS temporary credential by AWS STS from AWS assumed 
> role
> // set AWS temporary credential
> spark.sparkContext.hadoopConfiguration.set("fs.s3a.aws.credentials.provider", 
> "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
> spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", 
> perm.credential.accessKeyID)
> spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", 
> perm.credential.secretAccessKey)
> spark.sparkContext.hadoopConfiguration.set("fs.s3a.session.token", 
> perm.credential.sessionToken)
> // execute simple Spark action
> spark.read.format("parquet").load("s3a:///*").show(1) {code}
>  
>  
> the first few executors left a warning like the one below on the first code 
> execution, but we were able to get the proper result thanks to Spark's task 
> retry function. 
> {code:java}
> 22/04/18 09:13:50 WARN TaskSetManager: Lost task 2.0 in stage 0.0 (TID 2) 
> (10.197.5.15 executor 1): java.nio.file.AccessDeniedException: 
> s3a:///.parquet: 
> org.apache.hadoop.fs.s3a.CredentialInitializationException: Provider 
> TemporaryAWSCredentialsProvider has no credentials
>   at 
> org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:206)
>   at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:117)
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.listStatus(S3AFileSystem.java:2810)
>   at 
> org.apache.spark.util.HadoopFSUtils$.listLeafFiles(HadoopFSUtils.scala:225)
>   at 
> org.apache.spark.util.HadoopFSUtils$.$anonfun$parallelListLeafFilesInternal$6(HadoopFSUtils.scala:136)
>   at scala.collection.immutable.Stream.map(Stream.scala:418)
>   at 
> org.apache.spark.util.HadoopFSUtils$.$anonfun$parallelListLeafFilesInternal$4(HadoopFSUtils.scala:126)
>   at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:863)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:863)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:131)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: org.apache.hadoop.fs.s3a.CredentialInitializationException: 
> Provider TemporaryAWSCredentialsProvider has no credentials
>   at 
> org.apache.hadoop.fs.s3a.auth.AbstractSessionCredentialsProvider.getCredentials(AbstractSessionCredentialsProvider.java:130)
>   at 
> org.apache.hadoop.fs.s3a.AWSCredentialProviderList.getCredentials(AWSCredentialProviderList.java:177)
>   at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.getCredentialsFromContext(AmazonHttpClient.java:1266)
>   at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.runBeforeRequestHandlers(AmazonHttpClient.java:842)
>   at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:792)
>   at 
> 

[jira] [Commented] (SPARK-34305) Unify v1 and v2 ALTER TABLE .. SET SERDE tests

2022-06-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17558965#comment-17558965
 ] 

Apache Spark commented on SPARK-34305:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/36996

> Unify v1 and v2 ALTER TABLE .. SET SERDE tests
> --
>
> Key: SPARK-34305
> URL: https://issues.apache.org/jira/browse/SPARK-34305
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Extract the ALTER TABLE .. SET SERDE tests to a common place to run them for V1 
> and V2 datasources. Some tests can be placed in V1- and V2-specific test 
> suites.






[jira] [Assigned] (SPARK-34305) Unify v1 and v2 ALTER TABLE .. SET SERDE tests

2022-06-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34305:


Assignee: Max Gekk  (was: Apache Spark)

> Unify v1 and v2 ALTER TABLE .. SET SERDE tests
> --
>
> Key: SPARK-34305
> URL: https://issues.apache.org/jira/browse/SPARK-34305
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Extract the ALTER TABLE .. SET SERDE tests to a common place to run them for V1 
> and V2 datasources. Some tests can be placed in V1- and V2-specific test 
> suites.






[jira] [Commented] (SPARK-34305) Unify v1 and v2 ALTER TABLE .. SET SERDE tests

2022-06-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17558964#comment-17558964
 ] 

Apache Spark commented on SPARK-34305:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/36996

> Unify v1 and v2 ALTER TABLE .. SET SERDE tests
> --
>
> Key: SPARK-34305
> URL: https://issues.apache.org/jira/browse/SPARK-34305
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Extract the ALTER TABLE .. SET SERDE tests to a common place to run them for V1 
> and V2 datasources. Some tests can be placed in V1- and V2-specific test 
> suites.






[jira] [Assigned] (SPARK-34305) Unify v1 and v2 ALTER TABLE .. SET SERDE tests

2022-06-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34305:


Assignee: Apache Spark  (was: Max Gekk)

> Unify v1 and v2 ALTER TABLE .. SET SERDE tests
> --
>
> Key: SPARK-34305
> URL: https://issues.apache.org/jira/browse/SPARK-34305
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Extract the ALTER TABLE .. SET SERDE tests to a common place to run them for V1 
> and V2 datasources. Some tests can be placed in V1- and V2-specific test 
> suites.






[jira] [Comment Edited] (SPARK-39515) Improve/recover scheduled jobs in GitHub Actions

2022-06-26 Thread Yikun Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17558950#comment-17558950
 ] 

Yikun Jiang edited comment on SPARK-39515 at 6/27/22 2:13 AM:
--

Maybe we could move SPARK-39609, SPARK-39610 and SPARK-39611 into a separate umbrella 
to support the latest image with cache speed-up. [~hyukjin.kwon] 


was (Author: yikunkero):
Maybe we could move SPARK-39609, SPARK-39610 and SPARK-39611 into a separate umbrella. 
[~hyukjin.kwon] 

> Improve/recover scheduled jobs in GitHub Actions
> 
>
> Key: SPARK-39515
> URL: https://issues.apache.org/jira/browse/SPARK-39515
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Project Infra
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Blocker
>
> There are five problems to address.
> *First*, the scheduled jobs are broken as below:
> https://github.com/apache/spark/actions/runs/2513261706
> https://github.com/apache/spark/actions/runs/2512750310
> https://github.com/apache/spark/actions/runs/2509238648
> https://github.com/apache/spark/actions/runs/2508246903
> https://github.com/apache/spark/actions/runs/2507327914
> https://github.com/apache/spark/actions/runs/2506654808
> https://github.com/apache/spark/actions/runs/2506143939
> https://github.com/apache/spark/actions/runs/2502449498
> https://github.com/apache/spark/actions/runs/2501400490
> https://github.com/apache/spark/actions/runs/2500407628
> https://github.com/apache/spark/actions/runs/2499722093
> https://github.com/apache/spark/actions/runs/2499196539
> https://github.com/apache/spark/actions/runs/2496544415
> https://github.com/apache/spark/actions/runs/2495444227
> https://github.com/apache/spark/actions/runs/2493402272
> https://github.com/apache/spark/actions/runs/2492759618
> https://github.com/apache/spark/actions/runs/2492227816
> See also https://github.com/apache/spark/pull/36899 or 
> https://github.com/apache/spark/pull/36890
> In the master branch, it seems at least the Hadoop 2 build is currently broken.
> *Second*, it is very difficult to navigate scheduled jobs now. We should use 
> https://github.com/apache/spark/actions/workflows/build_and_test.yml?query=event%3Aschedule
>  link and manually search one by one.
> Since GitHub added the feature to import other workflows, we should leverage 
> this feature, see also 
> https://github.com/apache/spark/blob/master/.github/workflows/build_and_test_ansi.yml
>  and https://docs.github.com/en/actions/using-workflows/reusing-workflows. 
> Once we can separate them, it will be defined as a separate workflow.
> Namely, each scheduled job should be classified under "All workflows" at 
> https://github.com/apache/spark/actions so other developers can easily track 
> them.
> *Third*, we should set the scheduled jobs for branch-3.3, see also 
> https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L78-L83
>  for branch-3.2 job.
> *Fourth*, we should improve the duplicated test skipping logic. See also 
> https://github.com/apache/spark/pull/36413#issuecomment-1157205469 and 
> https://github.com/apache/spark/pull/36888
> *Fifth*, we should probably replace the base image 
> (https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L302,
>  https://github.com/dongjoon-hyun/ApacheSparkGitHubActionImage) with a plain 
> ubuntu image w/ Docker image cache. See also 
> https://github.com/docker/build-push-action/blob/master/docs/advanced/cache.md






[jira] [Comment Edited] (SPARK-39515) Improve/recover scheduled jobs in GitHub Actions

2022-06-26 Thread Yikun Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17558950#comment-17558950
 ] 

Yikun Jiang edited comment on SPARK-39515 at 6/27/22 2:13 AM:
--

Maybe we could move SPARK-39609, SPARK-39610 and SPARK-39611 into a separate umbrella 
to support the latest image with cache speed-up, under SPARK-39522. [~hyukjin.kwon] 


was (Author: yikunkero):
Maybe we could move SPARK-39609, SPARK-39610 and SPARK-39611 into a separate umbrella 
to support the latest image with cache speed-up. [~hyukjin.kwon] 

> Improve/recover scheduled jobs in GitHub Actions
> 
>
> Key: SPARK-39515
> URL: https://issues.apache.org/jira/browse/SPARK-39515
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Project Infra
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Blocker
>
> There are five problems to address.
> *First*, the scheduled jobs are broken as below:
> https://github.com/apache/spark/actions/runs/2513261706
> https://github.com/apache/spark/actions/runs/2512750310
> https://github.com/apache/spark/actions/runs/2509238648
> https://github.com/apache/spark/actions/runs/2508246903
> https://github.com/apache/spark/actions/runs/2507327914
> https://github.com/apache/spark/actions/runs/2506654808
> https://github.com/apache/spark/actions/runs/2506143939
> https://github.com/apache/spark/actions/runs/2502449498
> https://github.com/apache/spark/actions/runs/2501400490
> https://github.com/apache/spark/actions/runs/2500407628
> https://github.com/apache/spark/actions/runs/2499722093
> https://github.com/apache/spark/actions/runs/2499196539
> https://github.com/apache/spark/actions/runs/2496544415
> https://github.com/apache/spark/actions/runs/2495444227
> https://github.com/apache/spark/actions/runs/2493402272
> https://github.com/apache/spark/actions/runs/2492759618
> https://github.com/apache/spark/actions/runs/2492227816
> See also https://github.com/apache/spark/pull/36899 or 
> https://github.com/apache/spark/pull/36890
> In the master branch, it seems at least the Hadoop 2 build is currently broken.
> *Second*, it is very difficult to navigate scheduled jobs now. We should use 
> https://github.com/apache/spark/actions/workflows/build_and_test.yml?query=event%3Aschedule
>  link and manually search one by one.
> Since GitHub added the feature to import other workflows, we should leverage 
> this feature, see also 
> https://github.com/apache/spark/blob/master/.github/workflows/build_and_test_ansi.yml
>  and https://docs.github.com/en/actions/using-workflows/reusing-workflows. 
> Once we can separate them, it will be defined as a separate workflow.
> Namely, each scheduled job should be classified under "All workflows" at 
> https://github.com/apache/spark/actions so other developers can easily track 
> them.
> *Third*, we should set the scheduled jobs for branch-3.3, see also 
> https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L78-L83
>  for branch-3.2 job.
> *Fourth*, we should improve the duplicated test skipping logic. See also 
> https://github.com/apache/spark/pull/36413#issuecomment-1157205469 and 
> https://github.com/apache/spark/pull/36888
> *Fifth*, we should probably replace the base image 
> (https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L302,
>  https://github.com/dongjoon-hyun/ApacheSparkGitHubActionImage) with a plain 
> ubuntu image w/ Docker image cache. See also 
> https://github.com/docker/build-push-action/blob/master/docs/advanced/cache.md






[jira] [Commented] (SPARK-39515) Improve/recover scheduled jobs in GitHub Actions

2022-06-26 Thread Yikun Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17558950#comment-17558950
 ] 

Yikun Jiang commented on SPARK-39515:
-

Maybe we could move SPARK-39609, SPARK-39610 and SPARK-39611 into a separate umbrella. 
[~hyukjin.kwon] 

> Improve/recover scheduled jobs in GitHub Actions
> 
>
> Key: SPARK-39515
> URL: https://issues.apache.org/jira/browse/SPARK-39515
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Project Infra
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Blocker
>
> There are five problems to address.
> *First*, the scheduled jobs are broken as below:
> https://github.com/apache/spark/actions/runs/2513261706
> https://github.com/apache/spark/actions/runs/2512750310
> https://github.com/apache/spark/actions/runs/2509238648
> https://github.com/apache/spark/actions/runs/2508246903
> https://github.com/apache/spark/actions/runs/2507327914
> https://github.com/apache/spark/actions/runs/2506654808
> https://github.com/apache/spark/actions/runs/2506143939
> https://github.com/apache/spark/actions/runs/2502449498
> https://github.com/apache/spark/actions/runs/2501400490
> https://github.com/apache/spark/actions/runs/2500407628
> https://github.com/apache/spark/actions/runs/2499722093
> https://github.com/apache/spark/actions/runs/2499196539
> https://github.com/apache/spark/actions/runs/2496544415
> https://github.com/apache/spark/actions/runs/2495444227
> https://github.com/apache/spark/actions/runs/2493402272
> https://github.com/apache/spark/actions/runs/2492759618
> https://github.com/apache/spark/actions/runs/2492227816
> See also https://github.com/apache/spark/pull/36899 or 
> https://github.com/apache/spark/pull/36890
> In the master branch, it seems at least the Hadoop 2 build is currently broken.
> *Second*, it is very difficult to navigate scheduled jobs now. We should use 
> https://github.com/apache/spark/actions/workflows/build_and_test.yml?query=event%3Aschedule
>  link and manually search one by one.
> Since GitHub added the feature to import other workflows, we should leverage 
> this feature, see also 
> https://github.com/apache/spark/blob/master/.github/workflows/build_and_test_ansi.yml
>  and https://docs.github.com/en/actions/using-workflows/reusing-workflows. 
> Once we can separate them, it will be defined as a separate workflow.
> Namely, each scheduled job should be classified under "All workflows" at 
> https://github.com/apache/spark/actions so other developers can easily track 
> them.
> *Third*, we should set the scheduled jobs for branch-3.3, see also 
> https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L78-L83
>  for branch-3.2 job.
> *Fourth*, we should improve the duplicated test skipping logic. See also 
> https://github.com/apache/spark/pull/36413#issuecomment-1157205469 and 
> https://github.com/apache/spark/pull/36888
> *Fifth*, we should probably replace the base image 
> (https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L302,
>  https://github.com/dongjoon-hyun/ApacheSparkGitHubActionImage) with a plain 
> ubuntu image w/ Docker image cache. See also 
> https://github.com/docker/build-push-action/blob/master/docs/advanced/cache.md






[jira] [Created] (SPARK-39611) PySpark support numpy 1.23.X

2022-06-26 Thread Yikun Jiang (Jira)
Yikun Jiang created SPARK-39611:
---

 Summary: PySpark support numpy 1.23.X
 Key: SPARK-39611
 URL: https://issues.apache.org/jira/browse/SPARK-39611
 Project: Spark
  Issue Type: Sub-task
  Components: Build
Affects Versions: 3.4.0
Reporter: Yikun Jiang


 
{code:java}
starting mypy annotations test...
annotations failed mypy checks:
python/pyspark/pandas/frame.py:9970: error: Need type annotation for "raveled_column_labels"  [var-annotated]
Found 1 error in 1 file (checked 337 source files)
{code}
{code:java}
======================================================================
ERROR [2.102s]: test_arithmetic_op_exceptions (pyspark.pandas.tests.test_series_datetime.SeriesDateTimeTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/__w/spark/spark/python/pyspark/pandas/tests/test_series_datetime.py", line 99, in test_arithmetic_op_exceptions
    self.assertRaisesRegex(TypeError, expected_err_msg, lambda: other / psser)
  File "/usr/lib/python3.9/unittest/case.py", line 1276, in assertRaisesRegex
    return context.handle('assertRaisesRegex', args, kwargs)
  File "/usr/lib/python3.9/unittest/case.py", line 201, in handle
    callable_obj(*args, **kwargs)
  File "/__w/spark/spark/python/pyspark/pandas/tests/test_series_datetime.py", line 99, in <lambda>
    self.assertRaisesRegex(TypeError, expected_err_msg, lambda: other / psser)
  File "/__w/spark/spark/python/pyspark/pandas/base.py", line 465, in __array_ufunc__
    raise NotImplementedError(
NotImplementedError: pandas-on-Spark objects currently do not support .
----------------------------------------------------------------------
{code}
 

 






[jira] [Assigned] (SPARK-39575) ByteBuffer forgets to rewind after get in AvroDeserializer

2022-06-26 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-39575:


Assignee: Frank Wong

> ByteBuffer forgets to rewind after get in AvroDeserializer
> -
>
> Key: SPARK-39575
> URL: https://issues.apache.org/jira/browse/SPARK-39575
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Frank Wong
>Assignee: Frank Wong
>Priority: Major
>
> {code:java}
> case (BYTES, BinaryType) => (updater, ordinal, value) =>
>   val bytes = value match {
> case b: ByteBuffer =>
>   val bytes = new Array[Byte](b.remaining)
>   b.get(bytes)
>   // Do not forget to reset the position
>   b.rewind()
>   bytes
> case b: Array[Byte] => b
> case other => throw new RuntimeException(s"$other is not a valid avro 
> binary.")
>   }
>   updater.set(ordinal, bytes) {code}
> After Avro data is converted to an InternalRow, the ByteBuffer is left at a 
> redundant (non-zero) position because ByteBuffer#get advances it.






[jira] [Resolved] (SPARK-39575) ByteBuffer forgets to rewind after get in AvroDeserializer

2022-06-26 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-39575.
--
Fix Version/s: 3.4.0
   3.3.1
   3.2.2
   Resolution: Fixed

Issue resolved by pull request 36973
[https://github.com/apache/spark/pull/36973]

> ByteBuffer forgets to rewind after get in AvroDeserializer
> -
>
> Key: SPARK-39575
> URL: https://issues.apache.org/jira/browse/SPARK-39575
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Frank Wong
>Assignee: Frank Wong
>Priority: Major
> Fix For: 3.3.1, 3.2.2, 3.4.0
>
>
> {code:java}
> case (BYTES, BinaryType) => (updater, ordinal, value) =>
>   val bytes = value match {
> case b: ByteBuffer =>
>   val bytes = new Array[Byte](b.remaining)
>   b.get(bytes)
>   // Do not forget to reset the position
>   b.rewind()
>   bytes
> case b: Array[Byte] => b
> case other => throw new RuntimeException(s"$other is not a valid avro 
> binary.")
>   }
>   updater.set(ordinal, bytes) {code}
> After Avro data is converted to an InternalRow, the ByteBuffer is left at a 
> redundant (non-zero) position because ByteBuffer#get advances it.






[jira] [Created] (SPARK-39610) Add safe.directory for container based job

2022-06-26 Thread Yikun Jiang (Jira)
Yikun Jiang created SPARK-39610:
---

 Summary: Add safe.directory for container based job
 Key: SPARK-39610
 URL: https://issues.apache.org/jira/browse/SPARK-39610
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Affects Versions: 3.4.0
Reporter: Yikun Jiang


{code:java}
fatal: unsafe repository ('/__w/spark/spark' is owned by someone else)
To add an exception for this directory, call:
    git config --global --add safe.directory /__w/spark/spark
fatal: unsafe repository ('/__w/spark/spark' is owned by someone else)
To add an exception for this directory, call:
    git config --global --add safe.directory /__w/spark/spark
Error: Process completed with exit code 128.
{code}
https://github.blog/2022-04-12-git-security-vulnerability-announced/
[https://github.com/actions/checkout/issues/760]

```yaml
    - name: Github Actions permissions workaround
      run: |
        git config --global --add safe.directory ${GITHUB_WORKSPACE}
```






[jira] [Created] (SPARK-39609) PySpark needs to support pypy3.8 to avoid "No module named '_pickle'"

2022-06-26 Thread Yikun Jiang (Jira)
Yikun Jiang created SPARK-39609:
---

 Summary: PySpark needs to support pypy3.8 to avoid "No module named 
'_pickle'"
 Key: SPARK-39609
 URL: https://issues.apache.org/jira/browse/SPARK-39609
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Yikun Jiang


{code:java}
Starting test(pypy3): pyspark.sql.tests.test_arrow (temp output: 
/tmp/pypy3__pyspark.sql.tests.test_arrow__jx96qdzs.log)
Traceback (most recent call last):
  File "/usr/lib/pypy3.8/runpy.py", line 188, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/usr/lib/pypy3.8/runpy.py", line 111, in _get_module_details
    __import__(pkg_name)
  File "/__w/spark/spark/python/pyspark/__init__.py", line 59, in 
    from pyspark.rdd import RDD, RDDBarrier
  File "/__w/spark/spark/python/pyspark/rdd.py", line 54, in 
    from pyspark.java_gateway import local_connect_and_auth
  File "/__w/spark/spark/python/pyspark/java_gateway.py", line 32, in 
    from pyspark.serializers import read_int, write_with_length, 
UTF8Deserializer
  File "/__w/spark/spark/python/pyspark/serializers.py", line 68, in 
    from pyspark import cloudpickle
  File "/__w/spark/spark/python/pyspark/cloudpickle/__init__.py", line 4, in 

    from pyspark.cloudpickle.cloudpickle import *  # noqa
  File "/__w/spark/spark/python/pyspark/cloudpickle/cloudpickle.py", line 57, 
in 
    from .compat import pickle
  File "/__w/spark/spark/python/pyspark/cloudpickle/compat.py", line 13, in 

    from _pickle import Pickler  # noqa: F401
ModuleNotFoundError: No module named '_pickle'
Had test failures in pyspark.sql.tests.test_arrow with pypy3; see logs. {code}
When building with the latest Dockerfile, pypy3 is upgraded to 3.8 (it was 
originally 3.7), but it seems cloudpickle has a bug.

This may be related: 
https://github.com/cloudpipe/cloudpickle/commit/8bbea3e140767f51dd935a3c8f21c9a8e8702b7c,
 but applying it also failed. This needs a deeper look; if you know the reason, 
please let me know.
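
For context, a minimal sketch (not the actual pyspark.cloudpickle code) of the kind of guard that avoids importing CPython's private _pickle accelerator module, which does not exist on PyPy:

{code:python}
import platform

if platform.python_implementation() == "CPython":
    # CPython ships the C-accelerated pickler in the private _pickle module.
    from _pickle import Pickler  # noqa: F401
else:
    # PyPy has no _pickle module, so fall back to the pure-Python pickler.
    from pickle import Pickler  # noqa: F401
{code}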






[jira] [Commented] (SPARK-39148) DS V2 aggregate push down can work with OFFSET or LIMIT

2022-06-26 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17558943#comment-17558943
 ] 

jiaan.geng commented on SPARK-39148:


I'm working on it.

> DS V2 aggregate push down can work with OFFSET or LIMIT
> ---
>
> Key: SPARK-39148
> URL: https://issues.apache.org/jira/browse/SPARK-39148
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> Currently, DS V2 aggregate push-down cannot work with OFFSET and LIMIT.
> If it can work with OFFSET or LIMIT, performance will be better.
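
For illustration, a hedged PySpark sketch of the kind of query shape this targets: an aggregate followed by LIMIT against the DS V2 JDBC catalog. The H2 catalog, table name, and connection details are hypothetical; pushDownAggregate/pushDownLimit are the option names used by the V2 JDBC source:

{code:python}
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    # Register a DS V2 JDBC catalog (hypothetical in-memory H2 database; the
    # H2 driver is assumed to be on the classpath).
    .config("spark.sql.catalog.h2",
            "org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog")
    .config("spark.sql.catalog.h2.url", "jdbc:h2:mem:testdb")
    .config("spark.sql.catalog.h2.driver", "org.h2.Driver")
    .config("spark.sql.catalog.h2.pushDownAggregate", "true")
    .config("spark.sql.catalog.h2.pushDownLimit", "true")
    .getOrCreate()
)

# An aggregate followed by LIMIT; the issue is about keeping the aggregate
# push-down effective when LIMIT (or OFFSET) is also present in the plan.
result = (
    spark.table("h2.test.sales")
    .groupBy("region")
    .agg(F.sum("amount").alias("total"))
    .limit(10)
)
{code}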






[jira] [Updated] (SPARK-39148) DS V2 aggregate push down can work with OFFSET or LIMIT

2022-06-26 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-39148:
---
Description: 
Currently, DS V2 aggregate push-down cannot work with OFFSET and LIMIT.
If it can work with OFFSET or LIMIT, performance will be better.

  was:
Currently, DS V2 aggregate push-down cannot work with OFFSET and LIMIT.
If it can work with OFFSET, performance will be better.


> DS V2 aggregate push down can work with OFFSET or LIMIT
> ---
>
> Key: SPARK-39148
> URL: https://issues.apache.org/jira/browse/SPARK-39148
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> Currently, DS V2 aggregate push-down cannot work with OFFSET and LIMIT.
> If it can work with OFFSET or LIMIT, performance will be better.






[jira] [Updated] (SPARK-39148) DS V2 aggregate push down can work with OFFSET or LIMIT

2022-06-26 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-39148:
---
Description: 
Currently, DS V2 aggregate push-down cannot work with OFFSET and LIMIT.
If we can push down OFFSET together with LIMIT to the JDBC data source, 
performance will be better.

  was:
Currently, DS V2 push-down supports LIMIT alone or OFFSET alone.
If we can push down OFFSET together with LIMIT to the JDBC data source, 
performance will be better.


> DS V2 aggregate push down can work with OFFSET or LIMIT
> ---
>
> Key: SPARK-39148
> URL: https://issues.apache.org/jira/browse/SPARK-39148
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> Currently, DS V2 aggregate push-down cannot work with OFFSET and LIMIT.
> If we can push down OFFSET together with LIMIT to the JDBC data source, 
> performance will be better.






[jira] [Resolved] (SPARK-39574) Better error message when `ps.Index` is used for DataFrame/Series creation

2022-06-26 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-39574.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 36981
[https://github.com/apache/spark/pull/36981]

> Better error message when `ps.Index` is used for DataFrame/Series creation
> --
>
> Key: SPARK-39574
> URL: https://issues.apache.org/jira/browse/SPARK-39574
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark, PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Better error message when `ps.Index` is used for DataFrame/Series creation.
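
For illustration, a hedged sketch of the usage pattern the title refers to; the exact constructor call is an assumption, only the general shape (a pandas-on-Spark Index passed into DataFrame/Series creation) is taken from the summary:

{code:python}
import pyspark.pandas as ps

idx = ps.Index([1, 2, 3])

# Passing a pandas-on-Spark Index into DataFrame/Series creation is the case
# the improved error message targets; previously the failure was hard to read.
psdf = ps.DataFrame({"a": [4, 5, 6]}, index=idx)
{code}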






[jira] [Updated] (SPARK-39148) DS V2 aggregate push down can work with OFFSET or LIMIT

2022-06-26 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-39148:
---
Description: 
Currently, DS V2 aggregate push-down cannot work with OFFSET and LIMIT.
If it can work with OFFSET, performance will be better.

  was:
Currently, DS V2 aggregate push-down cannot work with OFFSET and LIMIT.
If we can push down OFFSET together with LIMIT to the JDBC data source, 
performance will be better.


> DS V2 aggregate push down can work with OFFSET or LIMIT
> ---
>
> Key: SPARK-39148
> URL: https://issues.apache.org/jira/browse/SPARK-39148
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> Currently, DS V2 aggregate push-down cannot work with OFFSET and LIMIT.
> If it can work with OFFSET, performance will be better.






[jira] [Assigned] (SPARK-39574) Better error message when `ps.Index` is used for DataFrame/Series creation

2022-06-26 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-39574:


Assignee: Xinrong Meng

> Better error message when `ps.Index` is used for DataFrame/Series creation
> --
>
> Key: SPARK-39574
> URL: https://issues.apache.org/jira/browse/SPARK-39574
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark, PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
>
> Better error message when `ps.Index` is used for DataFrame/Series creation.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39148) DS V2 aggregate push down can work with OFFSET or LIMIT

2022-06-26 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-39148:
---
Summary: DS V2 aggregate push down can work with OFFSET or LIMIT  (was: 
Support push down OFFSET append LIMIT to JDBC data source V2)

> DS V2 aggregate push down can work with OFFSET or LIMIT
> ---
>
> Key: SPARK-39148
> URL: https://issues.apache.org/jira/browse/SPARK-39148
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> Currently, DS V2 push-down supports LIMIT alone or OFFSET alone.
> If we can push down OFFSET together with LIMIT to the JDBC data source, it
> will give better performance.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-39148) Support push down OFFSET append LIMIT to JDBC data source V2

2022-06-26 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng reopened SPARK-39148:


> Support push down OFFSET append LIMIT to JDBC data source V2
> 
>
> Key: SPARK-39148
> URL: https://issues.apache.org/jira/browse/SPARK-39148
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> Currently, DS V2 push-down supports LIMIT alone or OFFSET alone.
> If we can push down OFFSET together with LIMIT to the JDBC data source, it
> will give better performance.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-39148) Support push down OFFSET append LIMIT to JDBC data source V2

2022-06-26 Thread jiaan.geng (Jira)


[ https://issues.apache.org/jira/browse/SPARK-39148 ]


jiaan.geng deleted comment on SPARK-39148:


was (Author: beliefer):
I'm working on it.

> Support push down OFFSET append LIMIT to JDBC data source V2
> 
>
> Key: SPARK-39148
> URL: https://issues.apache.org/jira/browse/SPARK-39148
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> Currently, DS V2 push-down supports LIMIT alone or OFFSET alone.
> If we can push down OFFSET together with LIMIT to the JDBC data source, it
> will give better performance.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39148) Support push down OFFSET append LIMIT to JDBC data source V2

2022-06-26 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng resolved SPARK-39148.

Resolution: Duplicate

> Support push down OFFSET append LIMIT to JDBC data source V2
> 
>
> Key: SPARK-39148
> URL: https://issues.apache.org/jira/browse/SPARK-39148
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> Currently, DS V2 push-down supports LIMIT alone or OFFSET alone.
> If we can push down OFFSET together with LIMIT to the JDBC data source, it
> will give better performance.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39608) Upgrade to spark 3.3.0 is causing error "Cannot grow BufferHolder by size -179446840 because the size is negative"

2022-06-26 Thread Isaac Eliassi (Jira)
Isaac Eliassi created SPARK-39608:
-

 Summary: Upgrade to spark 3.3.0 is causing error "Cannot grow 
BufferHolder by size -179446840 because the size is negative"
 Key: SPARK-39608
 URL: https://issues.apache.org/jira/browse/SPARK-39608
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 3.3.0
Reporter: Isaac Eliassi


Hi,

 

We recently upgraded to version 3.3.0.
The upgrade causes the following error: "Cannot grow BufferHolder by size
-179446840 because the size is negative".

 

I can't find any information about this online; when reverting to Spark 3.2.1,
it works.

 

Full exception:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
stage 36.0 failed 4 times, most recent failure: Lost task 1.3 in stage 36.0 
(TID 2873) (172.24.214.133 executor 4): java.lang.IllegalArgumentException: 
Cannot grow BufferHolder by size -143657042 because the size is negative
        at 
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:67)
        at 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.grow(UnsafeWriter.java:63)
        at 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:165)
        at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage24.smj_consumeFullOuterJoinRow_0$(Unknown
 Source)
        at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage24.processNext(Unknown
 Source)
        at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$2.hasNext(WholeStageCodegenExec.scala:779)
        at 
org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer$$anon$1.hasNext(InMemoryRelation.scala:118)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at 
org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:223)
        at 
org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:302)
        at 
org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1508)
        at 
org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1435)
        at 
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1499)
        at 
org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1322)
        at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:376)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:327)
        at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at 
org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
        at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
        at org.apache.spark.scheduler.Task.run(Task.scala:136)
        at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown 
Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown 
Source)
        at java.base/java.lang.Thread.run(Unknown Source)

Driver stacktrace:
        at 
org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2672)
        at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2608)
        at 

[jira] [Commented] (SPARK-38288) Aggregate push down doesnt work using Spark SQL jdbc datasource with postgresql

2022-06-26 Thread SAVIO SALVARINO TELES DE OLIVEIRA (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17558879#comment-17558879
 ] 

SAVIO SALVARINO TELES DE OLIVEIRA commented on SPARK-38288:
---

I have the same problem with Spark 3.2.1.

Code to read orders table from PostgreSQL:

 
{code:java}
orders = spark. \
            read. \
            format('jdbc'). \
            option('url', 'jdbc:postgresql://...'). \
            option('driver', 'org.postgresql.Driver'). \
            option('dbtable', 'orders'). \
            option('user', '***'). \
            option('password', ''). \
            option('pushDownAggregate', 'true'). \
            load(){code}
 

Now, I'm trying to group by two columns (owner_name and client_id):
{code:java}
orders.groupby("owner_name", 
"client_id").agg(max('order_date').alias("max_order_date")).limit(10).explain('extended'){code}
 

But the query execution is still using Relation instead of RelationV2:

 
{code:java}
== Optimized Logical Plan ==
GlobalLimit 10
+- LocalLimit 10
   +- Aggregate [owner_name#2717, client_id#2724], [owner_name#2717, 
client_id#2724, max(order_date#2728) AS max_order_date#2910]
  +- Project [owner_name#2717, client_id#2724, order_date#2728]
 +- Relation [owner_name#2717,client_id#2724,order_date#2728] 
JDBCRelation(orders) [numPartitions=1]{code}
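
One hedged reading of that plan: spark.read.format('jdbc') still resolves to the
V1 JDBCRelation, while DS V2 aggregate push-down applies to tables resolved
through the V2 JDBC catalog. A minimal sketch of that path follows; the catalog
class ships with Spark's DS V2 JDBC support, but the catalog name, schema, table
and connection settings are placeholders and the option names should be verified
against your version.

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import max as max_

# Register a DS V2 JDBC catalog named "pg" (placeholder connection details).
spark = (SparkSession.builder
         .config("spark.sql.catalog.pg",
                 "org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog")
         .config("spark.sql.catalog.pg.url", "jdbc:postgresql://host:5432/db")
         .config("spark.sql.catalog.pg.driver", "org.postgresql.Driver")
         .config("spark.sql.catalog.pg.pushDownAggregate", "true")
         .getOrCreate())

# Resolving the table through the catalog yields a V2 scan, which is where
# group-by aggregates such as MAX can be pushed to PostgreSQL; the explain
# output should then show a scan with a non-empty PushedAggregates list.
orders = spark.table("pg.public.orders")
(orders.groupBy("owner_name", "client_id")
       .agg(max_("order_date").alias("max_order_date"))
       .limit(10)
       .explain("extended"))
{code}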
 

> Aggregate push down doesnt work using Spark SQL jdbc datasource with 
> postgresql
> ---
>
> Key: SPARK-38288
> URL: https://issues.apache.org/jira/browse/SPARK-38288
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Luis Lozano Coira
>Priority: Major
>  Labels: DataSource, Spark-SQL
>
> I am establishing a connection with postgresql using the Spark SQL jdbc 
> datasource. I have started the spark shell including the postgres driver and 
> I can connect and execute queries without problems. I am using this statement:
> {code:java}
> val df = spark.read.format("jdbc").option("url", 
> "jdbc:postgresql://host:port/").option("driver", 
> "org.postgresql.Driver").option("dbtable", "test").option("user", 
> "postgres").option("password", 
> "***").option("pushDownAggregate",true).load()
> {code}
> I am adding the pushDownAggregate option because I would like the
> aggregations to be delegated to the source. But for some reason this is not
> happening.
> Reviewing this pull request, it seems that this feature should be merged into 
> 3.2. [https://github.com/apache/spark/pull/29695]
> I am performing the aggregations considering the mentioned limitations. An
> example case where I don't see pushdown being done would be this one:
> {code:java}
> df.groupBy("name").max("age").show()
> {code}
> The results of the queryExecution are shown below:
> {code:java}
> scala> df.groupBy("name").max("age").queryExecution.executedPlan
> res19: org.apache.spark.sql.execution.SparkPlan =
> AdaptiveSparkPlan isFinalPlan=false
> +- HashAggregate(keys=[name#274], functions=[max(age#246)], output=[name#274, 
> max(age)#544])
>+- Exchange hashpartitioning(name#274, 200), ENSURE_REQUIREMENTS, [id=#205]
>   +- HashAggregate(keys=[name#274], functions=[partial_max(age#246)], 
> output=[name#274, max#548])
>  +- Scan JDBCRelation(test) [numPartitions=1] [age#246,name#274] 
> PushedAggregates: [], PushedFilters: [], PushedGroupby: [], ReadSchema: 
> struct
> scala> dfp.groupBy("name").max("age").queryExecution.toString
> res20: String =
> "== Parsed Logical Plan ==
> Aggregate [name#274], [name#274, max(age#246) AS max(age)#581]
> +- Relation [age#246] JDBCRelation(test) [numPartitions=1]
> == Analyzed Logical Plan ==
> name: string, max(age): int
> Aggregate [name#274], [name#274, max(age#246) AS max(age)#581]
> +- Relation [age#24...
> {code}
> What could be the problem? Should pushDownAggregate work in this case?



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39599) Upgrade maven to 3.8.6

2022-06-26 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-39599:


Assignee: Yang Jie

> Upgrade maven to 3.8.6
> --
>
> Key: SPARK-39599
> URL: https://issues.apache.org/jira/browse/SPARK-39599
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>
> [https://maven.apache.org/docs/3.8.5/release-notes.html]
> https://maven.apache.org/docs/3.8.6/release-notes.html



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39599) Upgrade maven to 3.8.6

2022-06-26 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-39599.
--
Fix Version/s: 3.3.1
   3.4.0
   Resolution: Fixed

Issue resolved by pull request 36978
[https://github.com/apache/spark/pull/36978]

> Upgrade maven to 3.8.6
> --
>
> Key: SPARK-39599
> URL: https://issues.apache.org/jira/browse/SPARK-39599
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.3.1, 3.4.0
>
>
> [https://maven.apache.org/docs/3.8.5/release-notes.html]
> https://maven.apache.org/docs/3.8.6/release-notes.html



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39607) DataSourceV2: Distribution and ordering support V2 function in writing

2022-06-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17558873#comment-17558873
 ] 

Apache Spark commented on SPARK-39607:
--

User 'pan3793' has created a pull request for this issue:
https://github.com/apache/spark/pull/36995

> DataSourceV2: Distribution and ordering support V2 function in writing
> --
>
> Key: SPARK-39607
> URL: https://issues.apache.org/jira/browse/SPARK-39607
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Cheng Pan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39607) DataSourceV2: Distribution and ordering support V2 function in writing

2022-06-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39607:


Assignee: Apache Spark

> DataSourceV2: Distribution and ordering support V2 function in writing
> --
>
> Key: SPARK-39607
> URL: https://issues.apache.org/jira/browse/SPARK-39607
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Cheng Pan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39607) DataSourceV2: Distribution and ordering support V2 function in writing

2022-06-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17558872#comment-17558872
 ] 

Apache Spark commented on SPARK-39607:
--

User 'pan3793' has created a pull request for this issue:
https://github.com/apache/spark/pull/36995

> DataSourceV2: Distribution and ordering support V2 function in writing
> --
>
> Key: SPARK-39607
> URL: https://issues.apache.org/jira/browse/SPARK-39607
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Cheng Pan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39607) DataSourceV2: Distribution and ordering support V2 function in writing

2022-06-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39607:


Assignee: (was: Apache Spark)

> DataSourceV2: Distribution and ordering support V2 function in writing
> --
>
> Key: SPARK-39607
> URL: https://issues.apache.org/jira/browse/SPARK-39607
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Cheng Pan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39607) DataSourceV2: Distribution and ordering support V2 function in writing

2022-06-26 Thread Cheng Pan (Jira)
Cheng Pan created SPARK-39607:
-

 Summary: DataSourceV2: Distribution and ordering support V2 
function in writing
 Key: SPARK-39607
 URL: https://issues.apache.org/jira/browse/SPARK-39607
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Cheng Pan






--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39606) Use child stats to estimate order operator

2022-06-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39606:


Assignee: (was: Apache Spark)

> Use child stats to estimate order operator
> --
>
> Key: SPARK-39606
> URL: https://issues.apache.org/jira/browse/SPARK-39606
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39606) Use child stats to estimate order operator

2022-06-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17558867#comment-17558867
 ] 

Apache Spark commented on SPARK-39606:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/36994

> Use child stats to estimate order operator
> --
>
> Key: SPARK-39606
> URL: https://issues.apache.org/jira/browse/SPARK-39606
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39606) Use child stats to estimate order operator

2022-06-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39606:


Assignee: Apache Spark

> Use child stats to estimate order operator
> --
>
> Key: SPARK-39606
> URL: https://issues.apache.org/jira/browse/SPARK-39606
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39606) Use child stats to estimate order operator

2022-06-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17558866#comment-17558866
 ] 

Apache Spark commented on SPARK-39606:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/36994

> Use child stats to estimate order operator
> --
>
> Key: SPARK-39606
> URL: https://issues.apache.org/jira/browse/SPARK-39606
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39606) Use child stats to estimate order operator

2022-06-26 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-39606:
---

 Summary: Use child stats to estimate order operator
 Key: SPARK-39606
 URL: https://issues.apache.org/jira/browse/SPARK-39606
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Yuming Wang






--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39605) PySpark df.count() operation works fine on DBR 7.3 LTS but fails in DBR 10.4 LTS

2022-06-26 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-39605:

Fix Version/s: (was: 3.0.1)

> PySpark df.count() operation works fine on DBR 7.3 LTS but fails in DBR 10.4 
> LTS
> 
>
> Key: SPARK-39605
> URL: https://issues.apache.org/jira/browse/SPARK-39605
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.1
>Reporter: Manoj Chandrashekar
>Priority: Major
>
> I have a job that infers schema from mongodb and does operations such as 
> flattening and unwinding because there are nested fields. After performing 
> various transformations, finally when the count() is performed, it works 
> perfectly fine in databricks runtime version 7.3 LTS but fails to perform the 
> same in 10.4 LTS.
> *Below is the image that shows successful run in 7.3 LTS:*
> !https://docs.microsoft.com/answers/storage/attachments/215035-image.png|width=672,height=80!
> *Below is the image that shows failure in 10.4 LTS:*
> !https://docs.microsoft.com/answers/storage/attachments/215026-image.png|width=668,height=69!
> And I have validated that there is no field in our schema that has NullType. 
> In fact when the schema was inferred, there were Null & void type fields 
> which were converted to string using my UDF. This issue persists even 
> when I infer schema on complete dataset, that is, samplePoolSize is on full 
> data set.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39605) PySpark df.count() operation works fine on DBR 7.3 LTS but fails in DBR 10.4 LTS

2022-06-26 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-39605:

Target Version/s:   (was: 3.2.1)

> PySpark df.count() operation works fine on DBR 7.3 LTS but fails in DBR 10.4 
> LTS
> 
>
> Key: SPARK-39605
> URL: https://issues.apache.org/jira/browse/SPARK-39605
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.1
>Reporter: Manoj Chandrashekar
>Priority: Major
>
> I have a job that infers schema from mongodb and does operations such as 
> flattening and unwinding because there are nested fields. After performing 
> various transformations, finally when the count() is performed, it works 
> perfectly fine in databricks runtime version 7.3 LTS but fails to perform the 
> same in 10.4 LTS.
> *Below is the image that shows successful run in 7.3 LTS:*
> !https://docs.microsoft.com/answers/storage/attachments/215035-image.png|width=672,height=80!
> *Below is the image that shows failure in 10.4 LTS:*
> !https://docs.microsoft.com/answers/storage/attachments/215026-image.png|width=668,height=69!
> And I have validated that there is no field in our schema that has NullType. 
> In fact when the schema was inferred, there were Null & void type fields 
> which were converted to string using my UDF. This issue persists even 
> when I infer schema on complete dataset, that is, samplePoolSize is on full 
> data set.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39605) PySpark df.count() operation works fine on DBR 7.3 LTS but fails in DBR 10.4 LTS

2022-06-26 Thread Manoj Chandrashekar (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manoj Chandrashekar updated SPARK-39605:

Description: 
I have a job that infers schema from mongodb and does operations such as 
flattening and unwinding because there are nested fields. After performing 
various transformations, finally when the count() is performed, it works 
perfectly fine in databricks runtime version 7.3 LTS but fails to perform the 
same in 10.4 LTS.

*Below is the image that shows successful run in 7.3 LTS:*

!https://docs.microsoft.com/answers/storage/attachments/215035-image.png|width=672,height=80!

*Below is the image that shows failure in 10.4 LTS:*

!https://docs.microsoft.com/answers/storage/attachments/215026-image.png|width=668,height=69!

And I have validated that there is no field in our schema that has NullType. In 
fact when the schema was inferred, there were Null & void type fields which 
were converted to string using my UDF. This issue persists even when I 
infer schema on complete dataset, that is, samplePoolSize is on full data set.

  was:
I have a job that infers schema from mongodb and does operations such as 
flattening and unwinding because there are nested fields. After performing 
various transformations, finally when the count() is performed, it works 
perfectly fine in databricks runtime version 7.3 LTS but fails to perform the 
same in 10.4 LTS.

*Below is the image that shows successful run in 7.3 LTS:*

!https://docs.microsoft.com/answers/storage/attachments/215035-image.png|width=672,height=80!

*Below is the image that shows failure in 10.4 LTS:*

!https://docs.microsoft.com/answers/storage/attachments/215026-image.png|width=668,height=69!

And I have validated that there is no field in our schema that has NullType. In 
fact when the schema was inferred, there were Null & void type fields which 
were converted to string. This issue persists even when we infer schema on 
complete dataset, that is, samplePoolSize is on full data set.


> PySpark df.count() operation works fine on DBR 7.3 LTS but fails in DBR 10.4 
> LTS
> 
>
> Key: SPARK-39605
> URL: https://issues.apache.org/jira/browse/SPARK-39605
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.1
>Reporter: Manoj Chandrashekar
>Priority: Major
> Fix For: 3.0.1
>
>
> I have a job that infers schema from mongodb and does operations such as 
> flattening and unwinding because there are nested fields. After performing 
> various transformations, finally when the count() is performed, it works 
> perfectly fine in databricks runtime version 7.3 LTS but fails to perform the 
> same in 10.4 LTS.
> *Below is the image that shows successful run in 7.3 LTS:*
> !https://docs.microsoft.com/answers/storage/attachments/215035-image.png|width=672,height=80!
> *Below is the image that shows failure in 10.4 LTS:*
> !https://docs.microsoft.com/answers/storage/attachments/215026-image.png|width=668,height=69!
> And I have validated that there is no field in our schema that has NullType. 
> In fact when the schema was inferred, there were Null & void type fields 
> which were converted to string using my UDF. This issue persists even 
> when I infer schema on complete dataset, that is, samplePoolSize is on full 
> data set.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39605) PySpark df.count() operation works fine on DBR 7.3 LTS but fails in DBR 10.4 LTS

2022-06-26 Thread Manoj Chandrashekar (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manoj Chandrashekar updated SPARK-39605:

Description: 
I have a job that infers schema from mongodb and does operations such as 
flattening and unwinding because there are nested fields. After performing 
various transformations, finally when the count() is performed, it works 
perfectly fine in databricks runtime version 7.3 LTS but fails to perform the 
same in 10.4 LTS.

*Below is the image that shows successful run in 7.3 LTS:*

!https://docs.microsoft.com/answers/storage/attachments/215035-image.png|width=672,height=80!

*Below is the image that shows failure in 10.4 LTS:*

!https://docs.microsoft.com/answers/storage/attachments/215026-image.png|width=668,height=69!

And I have validated that there is no field in our schema that has NullType. In 
fact when the schema was inferred, there were Null & void type fields which 
were converted to string. This issue persists even when we infer schema on 
complete dataset, that is, samplePoolSize is on full data set.

  was:
I have a job that infers schema from mongodb and does operations such as 
flattening and unwinding because there are nested fields. After performing 
various transformations, finally when the count() is performed, it works 
perfectly fine in databricks runtime version 7.3 LTS but fails to perform the 
same in 10.4 LTS.

*Below is the image that shows successful run in 7.3 LTS:*

!https://docs.microsoft.com/answers/storage/attachments/215035-image.png|width=672,height=80!

*Below is the image that shows failure in 10.4 LTS:*

!https://docs.microsoft.com/answers/storage/attachments/215026-image.png|width=668,height=69!

 


> PySpark df.count() operation works fine on DBR 7.3 LTS but fails in DBR 10.4 
> LTS
> 
>
> Key: SPARK-39605
> URL: https://issues.apache.org/jira/browse/SPARK-39605
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.1
>Reporter: Manoj Chandrashekar
>Priority: Major
> Fix For: 3.0.1
>
>
> I have a job that infers schema from mongodb and does operations such as 
> flattening and unwinding because there are nested fields. After performing 
> various transformations, finally when the count() is performed, it works 
> perfectly fine in databricks runtime version 7.3 LTS but fails to perform the 
> same in 10.4 LTS.
> *Below is the image that shows successful run in 7.3 LTS:*
> !https://docs.microsoft.com/answers/storage/attachments/215035-image.png|width=672,height=80!
> *Below is the image that shows failure in 10.4 LTS:*
> !https://docs.microsoft.com/answers/storage/attachments/215026-image.png|width=668,height=69!
> And I have validated that there is no field in our schema that has NullType. 
> In fact when the schema was inferred, there were Null & void type fields 
> which were converted to string. This issue persists even when we infer 
> schema on complete dataset, that is, samplePoolSize is on full data set.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39605) PySpark df.count() operation works fine on DBR 7.3 LTS but fails in DBR 10.4 LTS

2022-06-26 Thread Manoj Chandrashekar (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manoj Chandrashekar updated SPARK-39605:

Description: 
I have a job that infers schema from mongodb and does operations such as 
flattening and unwinding because there are nested fields. After performing 
various transformations, finally when the count() is performed, it works 
perfectly fine in databricks runtime version 7.3 LTS but fails to perform the 
same in 10.4 LTS.

*Below is the image that shows successful run in 7.3 LTS:*

!https://docs.microsoft.com/answers/storage/attachments/215035-image.png|width=672,height=80!

*Below is the image that shows failure in 10.4 LTS:*

!https://docs.microsoft.com/answers/storage/attachments/215026-image.png|width=668,height=69!

 

  was:
I have a job that infers schema from mongodb and does operations such as 
flattening and unwinding because there are nested fields. After performing 
various transformations, finally when the count() is performed, it works 
perfectly fine in databricks runtime version 7.3 LTS but fails to perform the 
same in 10.4 LTS.

Below is the image that shows successful run in 7.3 LTS:

!https://docs.microsoft.com/answers/storage/attachments/215035-image.png|width=672,height=80!

Below is the image that shows failure in 10.4 LTS:

!https://docs.microsoft.com/answers/storage/attachments/215026-image.png|width=668,height=69!

 


> PySpark df.count() operation works fine on DBR 7.3 LTS but fails in DBR 10.4 
> LTS
> 
>
> Key: SPARK-39605
> URL: https://issues.apache.org/jira/browse/SPARK-39605
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.1
>Reporter: Manoj Chandrashekar
>Priority: Major
> Fix For: 3.0.1
>
>
> I have a job that infers schema from mongodb and does operations such as 
> flattening and unwinding because there are nested fields. After performing 
> various transformations, finally when the count() is performed, it works 
> perfectly fine in databricks runtime version 7.3 LTS but fails to perform the 
> same in 10.4 LTS.
> *Below is the image that shows successful run in 7.3 LTS:*
> !https://docs.microsoft.com/answers/storage/attachments/215035-image.png|width=672,height=80!
> *Below is the image that shows failure in 10.4 LTS:*
> !https://docs.microsoft.com/answers/storage/attachments/215026-image.png|width=668,height=69!
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39605) PySpark df.count() operation works fine on DBR 7.3 LTS but fails in DBR 10.4 LTS

2022-06-26 Thread Manoj Chandrashekar (Jira)
Manoj Chandrashekar created SPARK-39605:
---

 Summary: PySpark df.count() operation works fine on DBR 7.3 LTS 
but fails in DBR 10.4 LTS
 Key: SPARK-39605
 URL: https://issues.apache.org/jira/browse/SPARK-39605
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.2.1
Reporter: Manoj Chandrashekar
 Fix For: 3.0.1


I have a job that infers schema from mongodb and does operations such as 
flattening and unwinding because there are nested fields. After performing 
various transformations, finally when the count() is performed, it works 
perfectly fine in databricks runtime version 7.3 LTS but fails to perform the 
same in 10.4 LTS.

Below is the image that shows successful run in 7.3 LTS:

!https://docs.microsoft.com/answers/storage/attachments/215035-image.png|width=672,height=80!

Below is the image that shows failure in 10.4 LTS:

!https://docs.microsoft.com/answers/storage/attachments/215026-image.png|width=668,height=69!

 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39604) Miss UT for DerbyDialet's getCatalystType

2022-06-26 Thread eugene (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17558857#comment-17558857
 ] 

eugene commented on SPARK-39604:


Opened https://github.com/apache/spark/pull/36993 for this ticket.

> Miss UT for DerbyDialet's getCatalystType
> -
>
> Key: SPARK-39604
> URL: https://issues.apache.org/jira/browse/SPARK-39604
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.3.0
>Reporter: eugene
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39604) Miss UT for DerbyDialet's getCatalystType

2022-06-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39604:


Assignee: (was: Apache Spark)

> Miss UT for DerbyDialet's getCatalystType
> -
>
> Key: SPARK-39604
> URL: https://issues.apache.org/jira/browse/SPARK-39604
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.3.0
>Reporter: eugene
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39604) Miss UT for DerbyDialet's getCatalystType

2022-06-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17558858#comment-17558858
 ] 

Apache Spark commented on SPARK-39604:
--

User 'Eugene-Mark' has created a pull request for this issue:
https://github.com/apache/spark/pull/36993

> Miss UT for DerbyDialet's getCatalystType
> -
>
> Key: SPARK-39604
> URL: https://issues.apache.org/jira/browse/SPARK-39604
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.3.0
>Reporter: eugene
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39604) Miss UT for DerbyDialet's getCatalystType

2022-06-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39604:


Assignee: Apache Spark

> Miss UT for DerbyDialet's getCatalystType
> -
>
> Key: SPARK-39604
> URL: https://issues.apache.org/jira/browse/SPARK-39604
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.3.0
>Reporter: eugene
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39604) Miss UT for DerbyDialet's getCatalystType

2022-06-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17558856#comment-17558856
 ] 

Apache Spark commented on SPARK-39604:
--

User 'Eugene-Mark' has created a pull request for this issue:
https://github.com/apache/spark/pull/36993

> Miss UT for DerbyDialet's getCatalystType
> -
>
> Key: SPARK-39604
> URL: https://issues.apache.org/jira/browse/SPARK-39604
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.3.0
>Reporter: eugene
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39604) Miss UT for DerbyDialet's getCatalystType

2022-06-26 Thread eugene (Jira)
eugene created SPARK-39604:
--

 Summary: Miss UT for DerbyDialet's getCatalystType
 Key: SPARK-39604
 URL: https://issues.apache.org/jira/browse/SPARK-39604
 Project: Spark
  Issue Type: Test
  Components: Tests
Affects Versions: 3.3.0
Reporter: eugene






--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org