[jira] [Assigned] (SPARK-39285) Spark should not check field name when reading data

2022-05-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39285:


Assignee: (was: Apache Spark)

> Spark should not check field name when reading data
> 
>
> Key: SPARK-39285
> URL: https://issues.apache.org/jira/browse/SPARK-39285
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: angerszhu
>Priority: Major
>
> Spark should not check field names when reading data






[jira] [Commented] (SPARK-39285) Spark should not check field name when reading data

2022-05-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541850#comment-17541850
 ] 

Apache Spark commented on SPARK-39285:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/36661

> Spark should not check field name when reading data
> 
>
> Key: SPARK-39285
> URL: https://issues.apache.org/jira/browse/SPARK-39285
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: angerszhu
>Priority: Major
>
> Spark should not check field names when reading data






[jira] [Assigned] (SPARK-39285) Spark should not check field name when reading data

2022-05-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39285:


Assignee: Apache Spark

> Spark should not check field name when reading data
> 
>
> Key: SPARK-39285
> URL: https://issues.apache.org/jira/browse/SPARK-39285
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: angerszhu
>Assignee: Apache Spark
>Priority: Major
>
> Spark should not check field names when reading data






[jira] [Created] (SPARK-39285) Spark should not check field name when reading data

2022-05-24 Thread angerszhu (Jira)
angerszhu created SPARK-39285:
-

 Summary: Spark should not check field name when reading data
 Key: SPARK-39285
 URL: https://issues.apache.org/jira/browse/SPARK-39285
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: angerszhu


Spark should not check field names when reading data






[jira] [Commented] (SPARK-39284) Implement Groupby.mad

2022-05-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541835#comment-17541835
 ] 

Apache Spark commented on SPARK-39284:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/36660

> Implement Groupby.mad
> -
>
> Key: SPARK-39284
> URL: https://issues.apache.org/jira/browse/SPARK-39284
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: zhengruifeng
>Priority: Major
>







[jira] [Assigned] (SPARK-39284) Implement Groupby.mad

2022-05-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39284:


Assignee: (was: Apache Spark)

> Implement Groupby.mad
> -
>
> Key: SPARK-39284
> URL: https://issues.apache.org/jira/browse/SPARK-39284
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: zhengruifeng
>Priority: Major
>







[jira] [Assigned] (SPARK-39284) Implement Groupby.mad

2022-05-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39284:


Assignee: Apache Spark

> Implement Groupby.mad
> -
>
> Key: SPARK-39284
> URL: https://issues.apache.org/jira/browse/SPARK-39284
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: zhengruifeng
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Created] (SPARK-39284) Implement Groupby.mad

2022-05-24 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-39284:


 Summary: Implement Groupby.mad
 Key: SPARK-39284
 URL: https://issues.apache.org/jira/browse/SPARK-39284
 Project: Spark
  Issue Type: Sub-task
  Components: Pandas API on Spark
Affects Versions: 3.4.0
Reporter: zhengruifeng
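
The ticket carries no description. For context, a minimal sketch of the intended semantics, assuming pandas' {{mad}} (mean absolute deviation per group) and expressed with plain Spark SQL aggregates rather than the actual pandas-on-Spark implementation:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{abs, avg}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(("a", 1.0), ("a", 2.0), ("a", 6.0), ("b", 4.0)).toDF("k", "v")

// mad(v) per group = mean(|v - mean(v)|), computed in two passes because
// aggregates cannot be nested inside a single groupBy.
val means = df.groupBy($"k").agg(avg($"v").as("mean_v"))
val mad = df.join(means, "k")
  .groupBy($"k")
  .agg(avg(abs($"v" - $"mean_v")).as("mad"))

mad.show()  // group "a": mean = 3.0, mad = (2 + 1 + 3) / 3 = 2.0
{code}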









[jira] [Commented] (SPARK-39282) Replace If-Else branch with bitwise operators in roundNumberOfBytesToNearestWord

2022-05-24 Thread xiangxiang Shen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541805#comment-17541805
 ] 

xiangxiang Shen commented on SPARK-39282:
-

CC [~ueshin], [~cloud_fan]. Thanks!

> Replace If-Else branch with bitwise operators in 
> roundNumberOfBytesToNearestWord
> 
>
> Key: SPARK-39282
> URL: https://issues.apache.org/jira/browse/SPARK-39282
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: xiangxiang Shen
>Priority: Major
>
> [https://github.com/apache/spark/blob/a6dd6076d708713d11585bf7f3401d522ea48822/common/unsafe/src/main/java/org/apache/spark/unsafe/array/ByteArrayMethods.java#L40-L47]
>  
> Here, bitwise operators can be used to avoid the if-else branch and improve 
> computation performance.






[jira] [Commented] (SPARK-39282) Replace If-Else branch with bitwise operators in roundNumberOfBytesToNearestWord

2022-05-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541804#comment-17541804
 ] 

Apache Spark commented on SPARK-39282:
--

User 'zhixingheyi-tian' has created a pull request for this issue:
https://github.com/apache/spark/pull/36659

> Replace If-Else branch with bitwise operators in 
> roundNumberOfBytesToNearestWord
> 
>
> Key: SPARK-39282
> URL: https://issues.apache.org/jira/browse/SPARK-39282
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: xiangxiang Shen
>Priority: Major
>
> [https://github.com/apache/spark/blob/a6dd6076d708713d11585bf7f3401d522ea48822/common/unsafe/src/main/java/org/apache/spark/unsafe/array/ByteArrayMethods.java#L40-L47]
>  
> Here, bitwise operators can be used to avoid the if-else branch and improve 
> computation performance.






[jira] [Updated] (SPARK-39283) Spark tasks stuck forever due to deadlock between TaskMemoryManager and UnsafeExternalSorter

2022-05-24 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal updated SPARK-39283:

Description: 
We are seeing this deadlock between {{TaskMemoryManager}} and 
{{UnsafeExternalSorter}} fairly often in our workload. Sometimes the retry 
succeeds, but sometimes we have to resort to hacky workarounds to break the 
deadlock, such as shutting down the worker machines explicitly.

Below is the thread dump from the Spark UI showing the deadlock:

!DeadlockSparkTasks.png!

I believe there was a related Jira about a similar deadlock between the same 
threads, and it was resolved: 
https://issues.apache.org/jira/browse/SPARK-27338

 

 

  was:
We are seeing this deadlock between {{TaskMemoryManager}} and 
{{UnsafeExternalSorter}} fairly often in our workload. Sometimes the retry 
succeeds, but sometimes we have to resort to hacky workarounds to break the 
deadlock, such as shutting down the worker machines explicitly.

Below is the thread dump from the Spark UI showing the deadlock:

I believe there was a related Jira about a similar deadlock between the same 
threads, and it was resolved: 
https://issues.apache.org/jira/browse/SPARK-27338

 

 


> Spark tasks stuck forever due to deadlock between TaskMemoryManager and 
> UnsafeExternalSorter
> 
>
> Key: SPARK-39283
> URL: https://issues.apache.org/jira/browse/SPARK-39283
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Sandeep Pal
>Priority: Critical
>  Labels: Deadlock, spark3.0
> Attachments: DeadlockSparkTasks.png
>
>
> We are seeing this deadlock between {{TaskMemoryManager}} and 
> {{UnsafeExternalSorter}} fairly often in our workload. Sometimes the retry 
> succeeds, but sometimes we have to resort to hacky workarounds to break the 
> deadlock, such as shutting down the worker machines explicitly.
> Below is the thread dump from the Spark UI showing the deadlock:
> !DeadlockSparkTasks.png!
> I believe there was a related Jira about a similar deadlock between the same 
> threads, and it was resolved: 
> https://issues.apache.org/jira/browse/SPARK-27338
>  
>  






[jira] [Updated] (SPARK-39283) Spark tasks stuck forever due to deadlock between TaskMemoryManager and UnsafeExternalSorter

2022-05-24 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal updated SPARK-39283:

Attachment: DeadlockSparkTasks.png

> Spark tasks stuck forever due to deadlock between TaskMemoryManager and 
> UnsafeExternalSorter
> 
>
> Key: SPARK-39283
> URL: https://issues.apache.org/jira/browse/SPARK-39283
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Sandeep Pal
>Priority: Critical
>  Labels: Deadlock, spark3.0
> Attachments: DeadlockSparkTasks.png
>
>
> We are seeing this deadlock between {{TaskMemoryManager}} and 
> {{UnsafeExternalSorter}} fairly often in our workload. Sometimes the retry 
> succeeds, but sometimes we have to resort to hacky workarounds to break the 
> deadlock, such as shutting down the worker machines explicitly.
> Below is the thread dump from the Spark UI showing the deadlock:
> I believe there was a related Jira about a similar deadlock between the same 
> threads, and it was resolved: 
> https://issues.apache.org/jira/browse/SPARK-27338
>  
>  






[jira] [Updated] (SPARK-39283) Spark tasks stuck forever due to deadlock between TaskMemoryManager and UnsafeExternalSorter

2022-05-24 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal updated SPARK-39283:

Affects Version/s: 3.1.2
   (was: 3.0.0)

> Spark tasks stuck forever due to deadlock between TaskMemoryManager and 
> UnsafeExternalSorter
> 
>
> Key: SPARK-39283
> URL: https://issues.apache.org/jira/browse/SPARK-39283
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.2
>Reporter: Sandeep Pal
>Priority: Critical
>  Labels: Deadlock, spark3.0
> Attachments: DeadlockSparkTasks.png
>
>
> We are seeing this deadlock between {{TaskMemoryManager}} and 
> {{UnsafeExternalSorter}} fairly often in our workload. Sometimes the retry 
> succeeds, but sometimes we have to resort to hacky workarounds to break the 
> deadlock, such as shutting down the worker machines explicitly.
> Below is the thread dump from the Spark UI showing the deadlock:
> !DeadlockSparkTasks.png!
> I believe there was a related Jira about a similar deadlock between the same 
> threads, and it was resolved: 
> https://issues.apache.org/jira/browse/SPARK-27338
>  
>  






[jira] [Commented] (SPARK-39282) Replace If-Else branch with bitwise operators in roundNumberOfBytesToNearestWord

2022-05-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541803#comment-17541803
 ] 

Apache Spark commented on SPARK-39282:
--

User 'zhixingheyi-tian' has created a pull request for this issue:
https://github.com/apache/spark/pull/36659

> Replace If-Else branch with bitwise operators in 
> roundNumberOfBytesToNearestWord
> 
>
> Key: SPARK-39282
> URL: https://issues.apache.org/jira/browse/SPARK-39282
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: xiangxiang Shen
>Priority: Major
>
> [https://github.com/apache/spark/blob/a6dd6076d708713d11585bf7f3401d522ea48822/common/unsafe/src/main/java/org/apache/spark/unsafe/array/ByteArrayMethods.java#L40-L47]
>  
> Here, bitwise operators can be used to avoid the if-else branch and improve 
> computation performance.






[jira] [Assigned] (SPARK-39282) Replace If-Else branch with bitwise operators in roundNumberOfBytesToNearestWord

2022-05-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39282:


Assignee: Apache Spark

> Replace If-Else branch with bitwise operators in 
> roundNumberOfBytesToNearestWord
> 
>
> Key: SPARK-39282
> URL: https://issues.apache.org/jira/browse/SPARK-39282
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: xiangxiang Shen
>Assignee: Apache Spark
>Priority: Major
>
> [https://github.com/apache/spark/blob/a6dd6076d708713d11585bf7f3401d522ea48822/common/unsafe/src/main/java/org/apache/spark/unsafe/array/ByteArrayMethods.java#L40-L47]
>  
> Here, bitwise operators can be used to avoid the if-else branch and improve 
> computation performance.






[jira] [Assigned] (SPARK-39282) Replace If-Else branch with bitwise operators in roundNumberOfBytesToNearestWord

2022-05-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39282:


Assignee: (was: Apache Spark)

> Replace If-Else branch with bitwise operators in 
> roundNumberOfBytesToNearestWord
> 
>
> Key: SPARK-39282
> URL: https://issues.apache.org/jira/browse/SPARK-39282
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: xiangxiang Shen
>Priority: Major
>
> [https://github.com/apache/spark/blob/a6dd6076d708713d11585bf7f3401d522ea48822/common/unsafe/src/main/java/org/apache/spark/unsafe/array/ByteArrayMethods.java#L40-L47]
>  
> Here, bitwise operators can be used to avoid the if-else branch and improve 
> computation performance.






[jira] [Updated] (SPARK-39283) Spark tasks stuck forever due to deadlock between TaskMemoryManager and UnsafeExternalSorter

2022-05-24 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal updated SPARK-39283:

Description: 
We are seeing this deadlock between {{TaskMemoryManager}} and 
{{UnsafeExternalSorter}} fairly often in our workload. Sometimes the retry 
succeeds, but sometimes we have to resort to hacky workarounds to break the 
deadlock, such as shutting down the worker machines explicitly.

Below is the thread dump from the Spark UI showing the deadlock:

I believe there was a related Jira about a similar deadlock between the same 
threads, and it was resolved: 
https://issues.apache.org/jira/browse/SPARK-27338

 

 

  was:
We are seeing this deadlock between {{TaskMemoryManager}} and 
{{UnsafeExternalSorter}} fairly often in our workload. Sometimes the retry 
succeeds, but sometimes we have to resort to hacky workarounds to break the 
deadlock, such as shutting down the worker machines explicitly.

Below is the thread dump from the Spark UI showing the deadlock:

!DeadlockSparkTasks.png!

I believe there was a related Jira about a similar deadlock between the same 
threads, and it was resolved: 
https://issues.apache.org/jira/browse/SPARK-27338

 

 


> Spark tasks stuck forever due to deadlock between TaskMemoryManager and 
> UnsafeExternalSorter
> 
>
> Key: SPARK-39283
> URL: https://issues.apache.org/jira/browse/SPARK-39283
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Sandeep Pal
>Priority: Critical
>  Labels: Deadlock, spark3.0
>
> We are seeing this deadlock between {{TaskMemoryManager}} and 
> {{UnsafeExternalSorter}} fairly often in our workload. Sometimes the retry 
> succeeds, but sometimes we have to resort to hacky workarounds to break the 
> deadlock, such as shutting down the worker machines explicitly.
> Below is the thread dump from the Spark UI showing the deadlock:
> I believe there was a related Jira about a similar deadlock between the same 
> threads, and it was resolved: 
> https://issues.apache.org/jira/browse/SPARK-27338
>  
>  






[jira] [Updated] (SPARK-39283) Spark tasks stuck forever due to deadlock between TaskMemoryManager and UnsafeExternalSorter

2022-05-24 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal updated SPARK-39283:

Attachment: (was: DeadlockSparkTasks.png)

> Spark tasks stuck forever due to deadlock between TaskMemoryManager and 
> UnsafeExternalSorter
> 
>
> Key: SPARK-39283
> URL: https://issues.apache.org/jira/browse/SPARK-39283
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Sandeep Pal
>Priority: Critical
>  Labels: Deadlock, spark3.0
>
> We are seeing this deadlock between {{TaskMemoryManager}} and 
> {{UnsafeExternalSorter}} fairly often in our workload. Sometimes the retry 
> succeeds, but sometimes we have to resort to hacky workarounds to break the 
> deadlock, such as shutting down the worker machines explicitly.
> Below is the thread dump from the Spark UI showing the deadlock:
> I believe there was a related Jira about a similar deadlock between the same 
> threads, and it was resolved: 
> https://issues.apache.org/jira/browse/SPARK-27338
>  
>  






[jira] [Updated] (SPARK-39283) Spark tasks stuck forever due to deadlock between TaskMemoryManager and UnsafeExternalSorter

2022-05-24 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal updated SPARK-39283:

Labels: Deadlock spark3.0  (was: )

> Spark tasks stuck forever due to deadlock between TaskMemoryManager and 
> UnsafeExternalSorter
> 
>
> Key: SPARK-39283
> URL: https://issues.apache.org/jira/browse/SPARK-39283
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Sandeep Pal
>Priority: Critical
>  Labels: Deadlock, spark3.0
>
> We are seeing this deadlock between {{TaskMemoryManager}} and 
> {{UnsafeExternalSorter}} fairly often in our workload. Sometimes the retry 
> succeeds, but sometimes we have to resort to hacky workarounds to break the 
> deadlock, such as shutting down the worker machines explicitly.
> Below is the thread dump from the Spark UI showing the deadlock:
> !DeadlockSparkTasks.png!
> I believe there was a related Jira about a similar deadlock between the same 
> threads, and it was resolved: 
> https://issues.apache.org/jira/browse/SPARK-27338
>  
>  






[jira] [Updated] (SPARK-39283) Spark tasks stuck forever due to deadlock between TaskMemoryManager and UnsafeExternalSorter

2022-05-24 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal updated SPARK-39283:

Description: 
We are seeing this deadlock between {{TaskMemoryManager}} and 
{{UnsafeExternalSorter}} fairly often in our workload. Sometimes the retry 
succeeds, but sometimes we have to resort to hacky workarounds to break the 
deadlock, such as shutting down the worker machines explicitly.

Below is the thread dump from the Spark UI showing the deadlock:

!DeadlockSparkTasks.png!

I believe there was a related Jira about a similar deadlock between the same 
threads, and it was resolved: 
https://issues.apache.org/jira/browse/SPARK-27338

 

 

  was:
We are seeing this deadlock between {{TaskMemoryManager}} and 
{{UnsafeExternalSorter}} fairly often in our workload. Sometimes the retry 
succeeds, but sometimes we have to resort to hacky workarounds to break the 
deadlock, such as shutting down the worker machines explicitly.

Below is the thread dump from the Spark UI showing the deadlock:
!image-2022-05-24-20-03-35-287.png!

I believe there was a related Jira about a similar deadlock between the same 
threads, and it was resolved: 
https://issues.apache.org/jira/browse/SPARK-27338

 

 


> Spark tasks stuck forever due to deadlock between TaskMemoryManager and 
> UnsafeExternalSorter
> 
>
> Key: SPARK-39283
> URL: https://issues.apache.org/jira/browse/SPARK-39283
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Sandeep Pal
>Priority: Critical
> Attachments: DeadlockSparkTasks.png
>
>
> We are seeing this deadlock between {{TaskMemoryManager}} and 
> {{UnsafeExternalSorter}} fairly often in our workload. Sometimes the retry 
> succeeds, but sometimes we have to resort to hacky workarounds to break the 
> deadlock, such as shutting down the worker machines explicitly.
> Below is the thread dump from the Spark UI showing the deadlock:
> !DeadlockSparkTasks.png!
> I believe there was a related Jira about a similar deadlock between the same 
> threads, and it was resolved: 
> https://issues.apache.org/jira/browse/SPARK-27338
>  
>  






[jira] [Created] (SPARK-39283) Spark tasks stuck forever due to deadlock between TaskMemoryManager and UnsafeExternalSorter

2022-05-24 Thread Sandeep Pal (Jira)
Sandeep Pal created SPARK-39283:
---

 Summary: Spark tasks stuck forever due to deadlock between 
TaskMemoryManager and UnsafeExternalSorter
 Key: SPARK-39283
 URL: https://issues.apache.org/jira/browse/SPARK-39283
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Sandeep Pal
 Attachments: DeadlockSparkTasks.png

We are seeing this deadlock between {{TaskMemoryManager}} and 
{{UnsafeExternalSorter}} fairly often in our workload. Sometimes the retry 
succeeds, but sometimes we have to resort to hacky workarounds to break the 
deadlock, such as shutting down the worker machines explicitly.

Below is the thread dump from the Spark UI showing the deadlock:
!image-2022-05-24-20-03-35-287.png!

I believe there was a related Jira about a similar deadlock between the same 
threads, and it was resolved: 
https://issues.apache.org/jira/browse/SPARK-27338
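
For readers unfamiliar with the failure mode, the following is a minimal, generic sketch of a lock-ordering deadlock between two threads; the lock names are purely illustrative and do not reflect Spark's actual internals:

{code:scala}
// Each thread takes one lock, then waits forever for the other: a classic
// lock-ordering deadlock. The sleeps just make the interleaving reliable.
val lockA = new Object  // stands in for the memory-manager side
val lockB = new Object  // stands in for the sorter side

val t1 = new Thread(() => lockA.synchronized {
  Thread.sleep(100)
  lockB.synchronized { println("t1 done") }  // never reached
})
val t2 = new Thread(() => lockB.synchronized {
  Thread.sleep(100)
  lockA.synchronized { println("t2 done") }  // never reached
})
t1.start(); t2.start()
{code}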

 

 






[jira] [Updated] (SPARK-39283) Spark tasks stuck forever due to deadlock between TaskMemoryManager and UnsafeExternalSorter

2022-05-24 Thread Sandeep Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal updated SPARK-39283:

Attachment: DeadlockSparkTasks.png

> Spark tasks stuck forever due to deadlock between TaskMemoryManager and 
> UnsafeExternalSorter
> 
>
> Key: SPARK-39283
> URL: https://issues.apache.org/jira/browse/SPARK-39283
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Sandeep Pal
>Priority: Critical
> Attachments: DeadlockSparkTasks.png
>
>
> We are seeing this deadlock between {{TaskMemoryManager}} and 
> {{UnsafeExternalSorter}} fairly often in our workload. Sometimes the retry 
> succeeds, but sometimes we have to resort to hacky workarounds to break the 
> deadlock, such as shutting down the worker machines explicitly.
> Below is the thread dump from the Spark UI showing the deadlock:
> !image-2022-05-24-20-03-35-287.png!
> I believe there was a related Jira about a similar deadlock between the same 
> threads, and it was resolved: 
> https://issues.apache.org/jira/browse/SPARK-27338
>  
>  






[jira] [Created] (SPARK-39282) Replace If-Else branch with bitwise operators in roundNumberOfBytesToNearestWord

2022-05-24 Thread xiangxiang Shen (Jira)
xiangxiang Shen created SPARK-39282:
---

 Summary: Replace If-Else branch with bitwise operators in 
roundNumberOfBytesToNearestWord
 Key: SPARK-39282
 URL: https://issues.apache.org/jira/browse/SPARK-39282
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.1
Reporter: xiangxiang Shen


[https://github.com/apache/spark/blob/a6dd6076d708713d11585bf7f3401d522ea48822/common/unsafe/src/main/java/org/apache/spark/unsafe/array/ByteArrayMethods.java#L40-L47]

 

Here, bitwise operators can be used to avoid the if-else branch and improve 
computation performance.
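
For illustration, a minimal sketch of the branch-free rounding, assuming the usual 8-byte word size; this is the standard bit trick, not necessarily the exact code in the pull request:

{code:scala}
// Round numBytes up to the nearest multiple of 8 (the word size).
def roundWithBranch(numBytes: Long): Long = {
  val remainder = numBytes % 8
  if (remainder == 0) numBytes else numBytes + (8 - remainder)
}

// Branch-free equivalent: (x + 7) & ~7 bumps x past the next boundary and
// then clears the low three bits.
def roundBranchFree(numBytes: Long): Long = (numBytes + 7) & ~7L

assert((0L to 64L).forall(n => roundWithBranch(n) == roundBranchFree(n)))
{code}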






[jira] [Created] (SPARK-39281) Speed up Timestamp type inference of legacy format in JSON/CSV data source

2022-05-24 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-39281:
--

 Summary: Speed up Timestamp type inference of legacy format in 
JSON/CSV data source
 Key: SPARK-39281
 URL: https://issues.apache.org/jira/browse/SPARK-39281
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Gengliang Wang









[jira] [Created] (SPARK-39280) Speed up Timestamp type inference with user-provided format in JSON/CSV data source

2022-05-24 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-39280:
--

 Summary: Speed up Timestamp type inference with user-provided format 
in JSON/CSV data source
 Key: SPARK-39280
 URL: https://issues.apache.org/jira/browse/SPARK-39280
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Gengliang Wang









[jira] [Updated] (SPARK-39193) Speed up Timestamp type inference of default format in JSON/CSV data source

2022-05-24 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-39193:
---
Parent: SPARK-39279
Issue Type: Sub-task  (was: Improvement)

> Speed up Timestamp type inference of default format in JSON/CSV data source
> -
>
> Key: SPARK-39193
> URL: https://issues.apache.org/jira/browse/SPARK-39193
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.3.0
>
>
> When reading JSON/CSV files with timestamp type inference enabled via 
> `.option("inferTimestamp", true)`, the Timestamp conversion throws and 
> catches exceptions. Since we put detailed error messages in the exceptions, 
> creating them is not cheap; it consumes more than 90% of the type inference 
> time. 
> We can use parsing methods that return optional results instead.
> Before the change, it takes 166 seconds to infer a JSON file of 624MB with 
> timestamp inference enabled.
> After the change, it takes only 16 seconds.






[jira] [Updated] (SPARK-39193) Speed up Timestamp type inference of default format in JSON/CSV data source

2022-05-24 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-39193:
---
Summary: Speed up Timestamp type inference of default format in JSON/CSV data 
source  (was: Improve the performance of inferring Timestamp type in JSON/CSV 
data source)

> Speed up Timestamp type inference of default format in JSON/CSV data source
> -
>
> Key: SPARK-39193
> URL: https://issues.apache.org/jira/browse/SPARK-39193
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.3.0
>
>
> When reading JSON/CSV files with timestamp type inference enabled via 
> `.option("inferTimestamp", true)`, the Timestamp conversion throws and 
> catches exceptions. Since we put detailed error messages in the exceptions, 
> creating them is not cheap; it consumes more than 90% of the type inference 
> time. 
> We can use parsing methods that return optional results instead.
> Before the change, it takes 166 seconds to infer a JSON file of 624MB with 
> timestamp inference enabled.
> After the change, it takes only 16 seconds.






[jira] [Created] (SPARK-39279) Speed up the schema inference of CSV/JSON data source

2022-05-24 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-39279:
--

 Summary: Speed up the schema inference of CSV/JSON data source
 Key: SPARK-39279
 URL: https://issues.apache.org/jira/browse/SPARK-39279
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Gengliang Wang


In the current implementation of the CSV/JSON data sources, schema inference 
relies on methods that throw exceptions when a field cannot be converted to a 
given data type.

Throwing and catching exceptions can be slow. We can improve this by adding 
methods that return optional results instead. A good example is 
[https://github.com/apache/spark/pull/36562], which reduces the schema 
inference time by 90%.
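
For illustration, a minimal sketch of the technique using {{DateTimeFormatter.parseUnresolved}}, which reports failure through a {{ParsePosition}} instead of throwing; the helper names are hypothetical, not Spark's actual API:

{code:scala}
import java.text.ParsePosition
import java.time.format.DateTimeFormatter
import java.time.temporal.TemporalAccessor

val fmt = DateTimeFormatter.ISO_LOCAL_DATE_TIME

// Exception-based probe: constructing a DateTimeParseException (message plus
// stack trace) dominates the cost when most sampled fields do not match.
def looksLikeTimestampSlow(s: String): Boolean =
  try { fmt.parse(s); true } catch { case _: Exception => false }

// Optional-result probe: a failed attempt just yields None, no exception.
def parseOptional(s: String): Option[TemporalAccessor] = {
  val pos = new ParsePosition(0)
  val parsed = fmt.parseUnresolved(s, pos)
  if (parsed == null || pos.getErrorIndex >= 0 || pos.getIndex < s.length) None
  else Some(parsed)
}

def looksLikeTimestampFast(s: String): Boolean = parseOptional(s).isDefined
{code}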






[jira] [Resolved] (SPARK-39252) Flaky Test: pyspark.sql.tests.test_dataframe.DataFrameTests test_df_is_empty

2022-05-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-39252.
--
Fix Version/s: 3.1.3
   3.2.2
   3.3.1
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/36656

> Flaky Test: pyspark.sql.tests.test_dataframe.DataFrameTests test_df_is_empty
> 
>
> Key: SPARK-39252
> URL: https://issues.apache.org/jira/browse/SPARK-39252
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 3.1.3, 3.3.0, 3.2.2
>Reporter: Hyukjin Kwon
>Priority: Major
> Fix For: 3.1.3, 3.2.2, 3.3.1
>
>
> {{test_df_is_empty}} is flaky. For example, a recent PR: 
> https://github.com/apache/spark/pull/36580
> https://github.com/panbingkun/spark/runs/6525997469?check_suite_focus=true
> Possibly introduced from SPARK-39084
> {code}
> test_df_is_empty (pyspark.sql.tests.test_dataframe.DataFrameTests) ... 
> [Stage 6:>  (0 + 1) / 
> 1]
>   
>   
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x7fd84a1486ff, pid=4021, tid=0x7fd8016a2700
> #
> # JRE version: OpenJDK Runtime Environment (Zulu 8.62.0.19-CA-linux64) 
> (8.0_332-b09) (build 1.8.0_332-b09)
> # Java VM: OpenJDK 64-Bit Server VM (25.332-b09 mixed mode linux-amd64 
> compressed oops)
> # Problematic frame:
> # J 9116 C2 
> org.apache.spark.unsafe.UnsafeAlignedOffset.getSize(Ljava/lang/Object;J)I (51 
> bytes) @ 0x7fd84a1486ff [0x7fd84a1486e0+0x1f]
> #
> # Core dump written. Default location: /__w/spark/spark/core or core.4021
> #
> # An error report file with more information is saved as:
> # /__w/spark/spark/hs_err_pid4021.log
> #
> # If you would like to submit a bug report, please visit:
> #   http://www.azul.com/support/
> #
> 
> Exception happened during processing of request from ('127.0.0.1', 36358)
> Traceback (most recent call last):
>   File "/usr/lib/pypy3/lib-python/3/socketserver.py", line 316, in 
> _handle_request_noblock
> self.process_request(request, client_address)
>   File "/usr/lib/pypy3/lib-python/3/socketserver.py", line 347, in 
> process_request
> self.finish_request(request, client_address)
>   File "/usr/lib/pypy3/lib-python/3/socketserver.py", line 360, in 
> finish_request
> self.RequestHandlerClass(request, client_address, self)
>   File "/usr/lib/pypy3/lib-python/3/socketserver.py", line 720, in __init__
> self.handle()
>   File "/__w/spark/spark/python/pyspark/accumulators.py", line 281, in handle
> poll(accum_updates)
>   File "/__w/spark/spark/python/pyspark/accumulators.py", line 253, in poll
> if func():
>   File "/__w/spark/spark/python/pyspark/accumulators.py", line 257, in 
> accum_updates
> num_updates = read_int(self.rfile)
>   File "/__w/spark/spark/python/pyspark/serializers.py", line 595, in read_int
> raise EOFError
> {code}






[jira] [Assigned] (SPARK-39252) Flaky Test: pyspark.sql.tests.test_dataframe.DataFrameTests test_df_is_empty

2022-05-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-39252:


Assignee: Ivan Sadikov

> Flaky Test: pyspark.sql.tests.test_dataframe.DataFrameTests test_df_is_empty
> 
>
> Key: SPARK-39252
> URL: https://issues.apache.org/jira/browse/SPARK-39252
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 3.1.3, 3.3.0, 3.2.2
>Reporter: Hyukjin Kwon
>Assignee: Ivan Sadikov
>Priority: Major
> Fix For: 3.1.3, 3.2.2, 3.3.1
>
>
> {{test_df_is_empty}} is flaky. For example, a recent PR: 
> https://github.com/apache/spark/pull/36580
> https://github.com/panbingkun/spark/runs/6525997469?check_suite_focus=true
> Possibly introduced from SPARK-39084
> {code}
> test_df_is_empty (pyspark.sql.tests.test_dataframe.DataFrameTests) ... 
> [Stage 6:>  (0 + 1) / 
> 1]
>   
>   
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x7fd84a1486ff, pid=4021, tid=0x7fd8016a2700
> #
> # JRE version: OpenJDK Runtime Environment (Zulu 8.62.0.19-CA-linux64) 
> (8.0_332-b09) (build 1.8.0_332-b09)
> # Java VM: OpenJDK 64-Bit Server VM (25.332-b09 mixed mode linux-amd64 
> compressed oops)
> # Problematic frame:
> # J 9116 C2 
> org.apache.spark.unsafe.UnsafeAlignedOffset.getSize(Ljava/lang/Object;J)I (51 
> bytes) @ 0x7fd84a1486ff [0x7fd84a1486e0+0x1f]
> #
> # Core dump written. Default location: /__w/spark/spark/core or core.4021
> #
> # An error report file with more information is saved as:
> # /__w/spark/spark/hs_err_pid4021.log
> #
> # If you would like to submit a bug report, please visit:
> #   http://www.azul.com/support/
> #
> 
> Exception happened during processing of request from ('127.0.0.1', 36358)
> Traceback (most recent call last):
>   File "/usr/lib/pypy3/lib-python/3/socketserver.py", line 316, in 
> _handle_request_noblock
> self.process_request(request, client_address)
>   File "/usr/lib/pypy3/lib-python/3/socketserver.py", line 347, in 
> process_request
> self.finish_request(request, client_address)
>   File "/usr/lib/pypy3/lib-python/3/socketserver.py", line 360, in 
> finish_request
> self.RequestHandlerClass(request, client_address, self)
>   File "/usr/lib/pypy3/lib-python/3/socketserver.py", line 720, in __init__
> self.handle()
>   File "/__w/spark/spark/python/pyspark/accumulators.py", line 281, in handle
> poll(accum_updates)
>   File "/__w/spark/spark/python/pyspark/accumulators.py", line 253, in poll
> if func():
>   File "/__w/spark/spark/python/pyspark/accumulators.py", line 257, in 
> accum_updates
> num_updates = read_int(self.rfile)
>   File "/__w/spark/spark/python/pyspark/serializers.py", line 595, in read_int
> raise EOFError
> {code}






[jira] [Assigned] (SPARK-39278) Alternative configs of Hadoop Filesystems to access break backward compatibility

2022-05-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39278:


Assignee: Apache Spark

> Alternative configs of Hadoop Filesystems to access break backward 
> compatibility
> 
>
> Key: SPARK-39278
> URL: https://issues.apache.org/jira/browse/SPARK-39278
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Manu Zhang
>Assignee: Apache Spark
>Priority: Minor
>
> Before [https://github.com/apache/spark/pull/23698], the precedence of the 
> configs for accessing Hadoop filesystems was
> {code:java}
> spark.yarn.access.hadoopFileSystems -> spark.yarn.access.namenodes{code}
> Afterwards, it's
> {code:java}
> spark.kerberos.access.hadoopFileSystems -> spark.yarn.access.namenodes -> 
> spark.yarn.access.hadoopFileSystems{code}
> When both spark.yarn.access.hadoopFileSystems and spark.yarn.access.namenodes 
> are configured with different values, the PR will break backward 
> compatibility and cause application failure.






[jira] [Assigned] (SPARK-39278) Alternative configs of Hadoop Filesystems to access break backward compatibility

2022-05-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39278:


Assignee: (was: Apache Spark)

> Alternative configs of Hadoop Filesystems to access break backward 
> compatibility
> 
>
> Key: SPARK-39278
> URL: https://issues.apache.org/jira/browse/SPARK-39278
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Manu Zhang
>Priority: Minor
>
> Before [https://github.com/apache/spark/pull/23698], the precedence of the 
> configs for accessing Hadoop filesystems was
> {code:java}
> spark.yarn.access.hadoopFileSystems -> spark.yarn.access.namenodes{code}
> Afterwards, it's
> {code:java}
> spark.kerberos.access.hadoopFileSystems -> spark.yarn.access.namenodes -> 
> spark.yarn.access.hadoopFileSystems{code}
> When both spark.yarn.access.hadoopFileSystems and spark.yarn.access.namenodes 
> are configured with different values, the PR will break backward 
> compatibility and cause application failure.






[jira] [Commented] (SPARK-39278) Alternative configs of Hadoop Filesystems to access break backward compatibility

2022-05-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541779#comment-17541779
 ] 

Apache Spark commented on SPARK-39278:
--

User 'manuzhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/36658

> Alternative configs of Hadoop Filesystems to access break backward 
> compatibility
> 
>
> Key: SPARK-39278
> URL: https://issues.apache.org/jira/browse/SPARK-39278
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Manu Zhang
>Priority: Minor
>
> Before [https://github.com/apache/spark/pull/23698], the precedence of the 
> configs for accessing Hadoop filesystems was
> {code:java}
> spark.yarn.access.hadoopFileSystems -> spark.yarn.access.namenodes{code}
> Afterwards, it's
> {code:java}
> spark.kerberos.access.hadoopFileSystems -> spark.yarn.access.namenodes -> 
> spark.yarn.access.hadoopFileSystems{code}
> When both spark.yarn.access.hadoopFileSystems and spark.yarn.access.namenodes 
> are configured with different values, the PR will break backward 
> compatibility and cause application failure.






[jira] [Commented] (SPARK-39278) Alternative configs of Hadoop Filesystems to access break backward compatibility

2022-05-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541781#comment-17541781
 ] 

Apache Spark commented on SPARK-39278:
--

User 'manuzhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/36658

> Alternative configs of Hadoop Filesystems to access break backward 
> compatibility
> 
>
> Key: SPARK-39278
> URL: https://issues.apache.org/jira/browse/SPARK-39278
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Manu Zhang
>Priority: Minor
>
> Before [https://github.com/apache/spark/pull/23698], the precedence of the 
> configs for accessing Hadoop filesystems was
> {code:java}
> spark.yarn.access.hadoopFileSystems -> spark.yarn.access.namenodes{code}
> Afterwards, it's
> {code:java}
> spark.kerberos.access.hadoopFileSystems -> spark.yarn.access.namenodes -> 
> spark.yarn.access.hadoopFileSystems{code}
> When both spark.yarn.access.hadoopFileSystems and spark.yarn.access.namenodes 
> are configured with different values, the PR will break backward 
> compatibility and cause application failure.






[jira] [Resolved] (SPARK-39220) codegen causes NullPointerException

2022-05-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-39220.
--
Resolution: Cannot Reproduce

> codegen causes NullPointerException
> 
>
> Key: SPARK-39220
> URL: https://issues.apache.org/jira/browse/SPARK-39220
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.6, 3.2.1
>Reporter: chenxusheng
>Priority: Major
>
> The following query raises a NullPointerException:
> {code:sql}
> SELECT
>   fk4c7a8cfc,
>   fka54f2a73,
>   fk37e266f7
> FROM
>   be2a04fad4a24848bee641825e5b3466
> WHERE
>   (
>     fk4c7a8cfc is not null
>     and fk4c7a8cfc<> ''
>   )
> LIMIT
>   1000
> {code}
> However, the following query works fine:
> {code:sql}
> SELECT
>   fk4c7a8cfc,
>   fka54f2a73,
>   fk37e266f7
> FROM
>   be2a04fad4a24848bee641825e5b3466
> WHERE
>   (
>     fk4c7a8cfc is not null
>     and '' <> fk4c7a8cfc
>   )
> LIMIT
>   1000
> {code}
> I just moved the '' operand to the front of the comparison in the WHERE clause.
> The reason for this problem is that the data contains null values.
> *_org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext#genEqual_*
> {code:scala}
>   def genEqual(dataType: DataType, c1: String, c2: String): String = dataType 
> match {
> case BinaryType => s"java.util.Arrays.equals($c1, $c2)"
> case FloatType =>
>   s"((java.lang.Float.isNaN($c1) && java.lang.Float.isNaN($c2)) || $c1 == 
> $c2)"
> case DoubleType =>
>   s"((java.lang.Double.isNaN($c1) && java.lang.Double.isNaN($c2)) || $c1 
> == $c2)"
> case dt: DataType if isPrimitiveType(dt) => s"$c1 == $c2"
> case dt: DataType if dt.isInstanceOf[AtomicType] => s"$c1.equals($c2)"
> case array: ArrayType => genComp(array, c1, c2) + " == 0"
> case struct: StructType => genComp(struct, c1, c2) + " == 0"
> case udt: UserDefinedType[_] => genEqual(udt.sqlType, c1, c2)
> case NullType => "false"
> case _ =>
>   throw new IllegalArgumentException(
> "cannot generate equality code for un-comparable type: " + 
> dataType.catalogString)
>   }
> {code}
> {code:scala}
> case dt: DataType if dt.isInstanceOf[AtomicType] => s"$c1.equals($c2)"
> {code}
> Is a null check missing here?
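
For illustration, a minimal standalone sketch (not Spark's actual generated code) of why the {{$c1.equals($c2)}} branch throws when the left-hand value is null, and the null-guarded form:

{code:scala}
// The generated comparison effectively does c1.equals(c2); if the column
// value on the left is null, that call itself throws.
val c1: String = null
val c2: String = ""
// c1.equals(c2)                            // NullPointerException
val guarded = c1 != null && c1.equals(c2)   // null-guarded: false, no throw
{code}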






[jira] [Commented] (SPARK-39274) AttributeError: 'datetime.time' object has no attribute 'timetuple'

2022-05-24 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541773#comment-17541773
 ] 

Hyukjin Kwon commented on SPARK-39274:
--

We don't currently have a corresponding mapping for datetime.time in PySpark <> 
Spark SQL.

> AttributeError: 'datetime.time' object has no attribute 'timetuple'
> ---
>
> Key: SPARK-39274
> URL: https://issues.apache.org/jira/browse/SPARK-39274
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.1
>Reporter: Andreas Fried
>Priority: Major
>
>  
> {code:java}
> import pandas as pd
> import datetime
> pdf = pd.DataFrame({'naive': [datetime.time(11, 30, 33, 0)]})
> print(pdf)
> print(pdf.info())
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.getOrCreate()
> sp_df2 = spark.createDataFrame(pdf)
> sp_df2.show(10)
> {code}
>  
> throws this error:
>  
> {code:java}
>   naive
> 0  11:30:33
> <class 'pandas.core.frame.DataFrame'>
> RangeIndex: 1 entries, 0 to 0
> Data columns (total 1 columns):
>  #   Column  Non-Null Count  Dtype 
> ---  --  --  - 
>  0   naive   1 non-null  object
> dtypes: object(1)
> memory usage: 136.0+ bytes
> None
> ---
> AttributeErrorTraceback (most recent call last)
> /usr/local/share/jupyter/kernels/python39/scripts/launch_ipykernel.py in 
> <module>
>  10 spark = SparkSession.builder.getOrCreate()
>  11 
> ---> 12 sp_df2 = spark.createDataFrame(pdf)
>  13 sp_df2.show(10)
> /opt/ibm/spark/python/lib/pyspark.zip/pyspark/sql/session.py in 
> createDataFrame(self, data, schema, samplingRatio, verifySchema)
> 671 if has_pandas and isinstance(data, pandas.DataFrame):
> 672 # Create a DataFrame from pandas DataFrame.
> --> 673 return super(SparkSession, self).createDataFrame(
> 674 data, schema, samplingRatio, verifySchema)
> 675 return self._create_dataframe(data, schema, samplingRatio, 
> verifySchema)
> /opt/ibm/spark/python/lib/pyspark.zip/pyspark/sql/pandas/conversion.py in 
> createDataFrame(self, data, schema, samplingRatio, verifySchema)
> 338 raise
> 339 data = self._convert_from_pandas(data, schema, timezone)
> --> 340 return self._create_dataframe(data, schema, samplingRatio, 
> verifySchema)
> 341 
> 342 def _convert_from_pandas(self, pdf, schema, timezone):
> /opt/ibm/spark/python/lib/pyspark.zip/pyspark/sql/session.py in 
> _create_dataframe(self, data, schema, samplingRatio, verifySchema)
> 698 rdd, schema = self._createFromRDD(data.map(prepare), 
> schema, samplingRatio)
> 699 else:
> --> 700 rdd, schema = self._createFromLocal(map(prepare, data), 
> schema)
> 701 jrdd = 
> self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
> 702 jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), 
> schema.json())
> /opt/ibm/spark/python/lib/pyspark.zip/pyspark/sql/session.py in 
> _createFromLocal(self, data, schema)
> 523 
> 524 # convert python objects to sql data
> --> 525 data = [schema.toInternal(row) for row in data]
> 526 return self._sc.parallelize(data), schema
> 527 
> /opt/ibm/spark/python/lib/pyspark.zip/pyspark/sql/session.py in <listcomp>(.0)
> 523 
> 524 # convert python objects to sql data
> --> 525 data = [schema.toInternal(row) for row in data]
> 526 return self._sc.parallelize(data), schema
> 527 
> /opt/ibm/spark/python/lib/pyspark.zip/pyspark/sql/types.py in 
> toInternal(self, obj)
> 624  for n, f, c in zip(self.names, 
> self.fields, self._needConversion))
> 625 elif isinstance(obj, (tuple, list)):
> --> 626 return tuple(f.toInternal(v) if c else v
> 627  for f, v, c in zip(self.fields, obj, 
> self._needConversion))
> 628 elif hasattr(obj, "__dict__"):
> /opt/ibm/spark/python/lib/pyspark.zip/pyspark/sql/types.py in <genexpr>(.0)
> 624  for n, f, c in zip(self.names, 
> self.fields, self._needConversion))
> 625 elif isinstance(obj, (tuple, list)):
> --> 626 return tuple(f.toInternal(v) if c else v
> 627  for f, v, c in zip(self.fields, obj, 
> self._needConversion))
> 628 elif hasattr(obj, "__dict__"):
> /opt/ibm/spark/python/lib/pyspark.zip/pyspark/sql/types.py in 
> toInternal(self, obj)
> 449 
> 450 def toInternal(self, obj):
> --> 451 return self.dataType.toInternal(obj)
> 452 
> 453 def fromInternal(self, obj)

[jira] [Updated] (SPARK-39278) Alternative configs of Hadoop Filesystems to access break backward compatibility

2022-05-24 Thread Manu Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manu Zhang updated SPARK-39278:
---
Description: 
Before [https://github.com/apache/spark/pull/23698], the precedence of the 
configs for accessing Hadoop filesystems was
{code:java}
spark.yarn.access.hadoopFileSystems -> spark.yarn.access.namenodes{code}
Afterwards, it's
{code:java}
spark.kerberos.access.hadoopFileSystems -> spark.yarn.access.namenodes -> 
spark.yarn.access.hadoopFileSystems{code}
When both spark.yarn.access.hadoopFileSystems and spark.yarn.access.namenodes 
are configured with different values, the PR will break backward compatibility 
and cause application failure.

  was:
Before [https://github.com/apache/spark/pull/23698], the precedence of the 
configs for accessing Hadoop filesystems was
{code:java}
spark.yarn.access.hadoopFileSystems -> spark.yarn.access.namenodes{code}
Afterwards, it's
{code:java}
spark.kerberos.access.hadoopFileSystems -> spark.yarn.access.namenodes -> 
spark.yarn.access.hadoopFileSystems{code}
When both spark.yarn.access.hadoopFileSystems and spark.yarn.access.namenodes 
are configured with different values, the PR breaks backward compatibility and 
causes application failure.


> Alternative configs of Hadoop Filesystems to access break backward 
> compatibility
> 
>
> Key: SPARK-39278
> URL: https://issues.apache.org/jira/browse/SPARK-39278
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Manu Zhang
>Priority: Minor
>
> Before [https://github.com/apache/spark/pull/23698], the precedence of the 
> configs for accessing Hadoop filesystems was
> {code:java}
> spark.yarn.access.hadoopFileSystems -> spark.yarn.access.namenodes{code}
> Afterwards, it's
> {code:java}
> spark.kerberos.access.hadoopFileSystems -> spark.yarn.access.namenodes -> 
> spark.yarn.access.hadoopFileSystems{code}
> When both spark.yarn.access.hadoopFileSystems and spark.yarn.access.namenodes 
> are configured with different values, the PR will break backward 
> compatibility and cause application failure.






[jira] [Created] (SPARK-39278) Alternative configs of Hadoop Filesystems to access break backward compatibility

2022-05-24 Thread Manu Zhang (Jira)
Manu Zhang created SPARK-39278:
--

 Summary: Alternative configs of Hadoop Filesystems to access break 
backward compatibility
 Key: SPARK-39278
 URL: https://issues.apache.org/jira/browse/SPARK-39278
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.3.0
Reporter: Manu Zhang


Before [https://github.com/apache/spark/pull/23698], the precedence of the 
configs for accessing Hadoop filesystems was
{code:java}
spark.yarn.access.hadoopFileSystems -> spark.yarn.access.namenodes{code}
Afterwards, it's
{code:java}
spark.kerberos.access.hadoopFileSystems -> spark.yarn.access.namenodes -> 
spark.yarn.access.hadoopFileSystems{code}
When both spark.yarn.access.hadoopFileSystems and spark.yarn.access.namenodes 
are configured with different values, the PR breaks backward compatibility and 
causes application failure.
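
For illustration, a minimal sketch of the post-change lookup order as {{Option}} fallbacks; the helper is hypothetical, not Spark's actual resolution code:

{code:scala}
// Post-#23698 lookup order, sketched as Option fallbacks: the first key
// that is set wins.
def filesystemsToAccess(conf: Map[String, String]): Option[String] =
  conf.get("spark.kerberos.access.hadoopFileSystems")
    .orElse(conf.get("spark.yarn.access.namenodes"))
    .orElse(conf.get("spark.yarn.access.hadoopFileSystems"))

// If the two legacy keys are set to different values,
// spark.yarn.access.namenodes now wins, whereas before the change
// spark.yarn.access.hadoopFileSystems took precedence.
{code}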






[jira] [Assigned] (SPARK-39277) Make Optimizer extend SQLConfHelper

2022-05-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39277:


Assignee: (was: Apache Spark)

> Make Optimizer extends SQLConfHelper
> 
>
> Key: SPARK-39277
> URL: https://issues.apache.org/jira/browse/SPARK-39277
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39277) Make Optimizer extends SQLConfHelper

2022-05-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39277:


Assignee: Apache Spark

> Make Optimizer extends SQLConfHelper
> 
>
> Key: SPARK-39277
> URL: https://issues.apache.org/jira/browse/SPARK-39277
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39277) Make Optimizer extends SQLConfHelper

2022-05-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541770#comment-17541770
 ] 

Apache Spark commented on SPARK-39277:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/36657

> Make Optimizer extends SQLConfHelper
> 
>
> Key: SPARK-39277
> URL: https://issues.apache.org/jira/browse/SPARK-39277
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39277) Make Optimizer extends SQLConfHelper

2022-05-24 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-39277:
---

 Summary: Make Optimizer extends SQLConfHelper
 Key: SPARK-39277
 URL: https://issues.apache.org/jira/browse/SPARK-39277
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Yuming Wang






--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39273) Make PandasOnSparkTestCase inherit ReusedSQLTestCase

2022-05-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-39273.
--
Fix Version/s: 3.3.0
   3.2.2
   Resolution: Fixed

Issue resolved by pull request 36652
[https://github.com/apache/spark/pull/36652]

> Make PandasOnSparkTestCase inherit ReusedSQLTestCase
> 
>
> Key: SPARK-39273
> URL: https://issues.apache.org/jira/browse/SPARK-39273
> Project: Spark
>  Issue Type: Test
>  Components: Pandas API on Spark, Tests
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.3.0, 3.2.2
>
>
> PandasOnSparkTestCase has some legacy code, e.g., not stopping the session in 
> {{tearDownClass}}. We don't need such logic anymore; it dates from when the 
> code lived in the Koalas repo, which ran the tests sequentially. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39273) Make PandasOnSparkTestCase inherit ReusedSQLTestCase

2022-05-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-39273:


Assignee: Hyukjin Kwon

> Make PandasOnSparkTestCase inherit ReusedSQLTestCase
> 
>
> Key: SPARK-39273
> URL: https://issues.apache.org/jira/browse/SPARK-39273
> Project: Spark
>  Issue Type: Test
>  Components: Pandas API on Spark, Tests
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> PandasOnSparkTestCase has some legacy code, e.g., not stopping the session in 
> {{tearDownClass}}. We don't need such logic anymore; it dates from when the 
> code lived in the Koalas repo, which ran the tests sequentially. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39053) test_multi_index_dtypes failed due to index mismatch

2022-05-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-39053.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 36391
[https://github.com/apache/spark/pull/36391]

> test_multi_index_dtypes failed due to index mismatch
> 
>
> Key: SPARK-39053
> URL: https://issues.apache.org/jira/browse/SPARK-39053
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
> Fix For: 3.4.0
>
>
> {code:java}
> DataFrameTest.test_multi_index_dtypesSeries.index are different
> Series.index classes are different
> [left]:  MultiIndex([('zero',  'first'),
> ( 'one', 'second')],
>)
> [right]: Index([('zero', 'first'), ('one', 'second')], dtype='object')
> Left:
> zero  first  int64
> one   secondobject
> dtype: object
> object
> Right:
> (zero, first) int64
> (one, second)object
> dtype: object
> object {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39053) test_multi_index_dtypes failed due to index mismatch

2022-05-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-39053:


Assignee: Yikun Jiang

> test_multi_index_dtypes failed due to index mismatch
> 
>
> Key: SPARK-39053
> URL: https://issues.apache.org/jira/browse/SPARK-39053
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
>
> {code:java}
> DataFrameTest.test_multi_index_dtypesSeries.index are different
> Series.index classes are different
> [left]:  MultiIndex([('zero',  'first'),
> ( 'one', 'second')],
>)
> [right]: Index([('zero', 'first'), ('one', 'second')], dtype='object')
> Left:
> zero  first  int64
> one   secondobject
> dtype: object
> object
> Right:
> (zero, first) int64
> (one, second)object
> dtype: object
> object {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39252) Flaky Test: pyspark.sql.tests.test_dataframe.DataFrameTests test_df_is_empty

2022-05-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39252:


Assignee: Apache Spark

> Flaky Test: pyspark.sql.tests.test_dataframe.DataFrameTests test_df_is_empty
> 
>
> Key: SPARK-39252
> URL: https://issues.apache.org/jira/browse/SPARK-39252
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 3.1.3, 3.3.0, 3.2.2
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> {{test_df_is_empty}} is flaky. For example, a recent PR: 
> https://github.com/apache/spark/pull/36580
> https://github.com/panbingkun/spark/runs/6525997469?check_suite_focus=true
> Possibly introduced from SPARK-39084
> {code}
> test_df_is_empty (pyspark.sql.tests.test_dataframe.DataFrameTests) ... 
> [Stage 6:>  (0 + 1) / 
> 1]
>   
>   
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x7fd84a1486ff, pid=4021, tid=0x7fd8016a2700
> #
> # JRE version: OpenJDK Runtime Environment (Zulu 8.62.0.19-CA-linux64) 
> (8.0_332-b09) (build 1.8.0_332-b09)
> # Java VM: OpenJDK 64-Bit Server VM (25.332-b09 mixed mode linux-amd64 
> compressed oops)
> # Problematic frame:
> # J 9116 C2 
> org.apache.spark.unsafe.UnsafeAlignedOffset.getSize(Ljava/lang/Object;J)I (51 
> bytes) @ 0x7fd84a1486ff [0x7fd84a1486e0+0x1f]
> #
> # Core dump written. Default location: /__w/spark/spark/core or core.4021
> #
> # An error report file with more information is saved as:
> # /__w/spark/spark/hs_err_pid4021.log
> #
> # If you would like to submit a bug report, please visit:
> #   http://www.azul.com/support/
> #
> 
> Exception happened during processing of request from ('127.0.0.1', 36358)
> Traceback (most recent call last):
>   File "/usr/lib/pypy3/lib-python/3/socketserver.py", line 316, in 
> _handle_request_noblock
> self.process_request(request, client_address)
>   File "/usr/lib/pypy3/lib-python/3/socketserver.py", line 347, in 
> process_request
> self.finish_request(request, client_address)
>   File "/usr/lib/pypy3/lib-python/3/socketserver.py", line 360, in 
> finish_request
> self.RequestHandlerClass(request, client_address, self)
>   File "/usr/lib/pypy3/lib-python/3/socketserver.py", line 720, in __init__
> self.handle()
>   File "/__w/spark/spark/python/pyspark/accumulators.py", line 281, in handle
> poll(accum_updates)
>   File "/__w/spark/spark/python/pyspark/accumulators.py", line 253, in poll
> if func():
>   File "/__w/spark/spark/python/pyspark/accumulators.py", line 257, in 
> accum_updates
> num_updates = read_int(self.rfile)
>   File "/__w/spark/spark/python/pyspark/serializers.py", line 595, in read_int
> raise EOFError
> {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39252) Flaky Test: pyspark.sql.tests.test_dataframe.DataFrameTests test_df_is_empty

2022-05-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39252:


Assignee: (was: Apache Spark)

> Flaky Test: pyspark.sql.tests.test_dataframe.DataFrameTests test_df_is_empty
> 
>
> Key: SPARK-39252
> URL: https://issues.apache.org/jira/browse/SPARK-39252
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 3.1.3, 3.3.0, 3.2.2
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {{test_df_is_empty}} is flaky. For example, a recent PR: 
> https://github.com/apache/spark/pull/36580
> https://github.com/panbingkun/spark/runs/6525997469?check_suite_focus=true
> Possibly introduced from SPARK-39084
> {code}
> test_df_is_empty (pyspark.sql.tests.test_dataframe.DataFrameTests) ... 
> [Stage 6:>  (0 + 1) / 
> 1]
>   
>   
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x7fd84a1486ff, pid=4021, tid=0x7fd8016a2700
> #
> # JRE version: OpenJDK Runtime Environment (Zulu 8.62.0.19-CA-linux64) 
> (8.0_332-b09) (build 1.8.0_332-b09)
> # Java VM: OpenJDK 64-Bit Server VM (25.332-b09 mixed mode linux-amd64 
> compressed oops)
> # Problematic frame:
> # J 9116 C2 
> org.apache.spark.unsafe.UnsafeAlignedOffset.getSize(Ljava/lang/Object;J)I (51 
> bytes) @ 0x7fd84a1486ff [0x7fd84a1486e0+0x1f]
> #
> # Core dump written. Default location: /__w/spark/spark/core or core.4021
> #
> # An error report file with more information is saved as:
> # /__w/spark/spark/hs_err_pid4021.log
> #
> # If you would like to submit a bug report, please visit:
> #   http://www.azul.com/support/
> #
> 
> Exception happened during processing of request from ('127.0.0.1', 36358)
> Traceback (most recent call last):
>   File "/usr/lib/pypy3/lib-python/3/socketserver.py", line 316, in 
> _handle_request_noblock
> self.process_request(request, client_address)
>   File "/usr/lib/pypy3/lib-python/3/socketserver.py", line 347, in 
> process_request
> self.finish_request(request, client_address)
>   File "/usr/lib/pypy3/lib-python/3/socketserver.py", line 360, in 
> finish_request
> self.RequestHandlerClass(request, client_address, self)
>   File "/usr/lib/pypy3/lib-python/3/socketserver.py", line 720, in __init__
> self.handle()
>   File "/__w/spark/spark/python/pyspark/accumulators.py", line 281, in handle
> poll(accum_updates)
>   File "/__w/spark/spark/python/pyspark/accumulators.py", line 253, in poll
> if func():
>   File "/__w/spark/spark/python/pyspark/accumulators.py", line 257, in 
> accum_updates
> num_updates = read_int(self.rfile)
>   File "/__w/spark/spark/python/pyspark/serializers.py", line 595, in read_int
> raise EOFError
> {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39252) Flaky Test: pyspark.sql.tests.test_dataframe.DataFrameTests test_df_is_empty

2022-05-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541751#comment-17541751
 ] 

Apache Spark commented on SPARK-39252:
--

User 'sadikovi' has created a pull request for this issue:
https://github.com/apache/spark/pull/36656

> Flaky Test: pyspark.sql.tests.test_dataframe.DataFrameTests test_df_is_empty
> 
>
> Key: SPARK-39252
> URL: https://issues.apache.org/jira/browse/SPARK-39252
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 3.1.3, 3.3.0, 3.2.2
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {{test_df_is_empty}} is flaky. For example, a recent PR: 
> https://github.com/apache/spark/pull/36580
> https://github.com/panbingkun/spark/runs/6525997469?check_suite_focus=true
> Possibly introduced from SPARK-39084
> {code}
> test_df_is_empty (pyspark.sql.tests.test_dataframe.DataFrameTests) ... 
> [Stage 6:>  (0 + 1) / 
> 1]
>   
>   
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x7fd84a1486ff, pid=4021, tid=0x7fd8016a2700
> #
> # JRE version: OpenJDK Runtime Environment (Zulu 8.62.0.19-CA-linux64) 
> (8.0_332-b09) (build 1.8.0_332-b09)
> # Java VM: OpenJDK 64-Bit Server VM (25.332-b09 mixed mode linux-amd64 
> compressed oops)
> # Problematic frame:
> # J 9116 C2 
> org.apache.spark.unsafe.UnsafeAlignedOffset.getSize(Ljava/lang/Object;J)I (51 
> bytes) @ 0x7fd84a1486ff [0x7fd84a1486e0+0x1f]
> #
> # Core dump written. Default location: /__w/spark/spark/core or core.4021
> #
> # An error report file with more information is saved as:
> # /__w/spark/spark/hs_err_pid4021.log
> #
> # If you would like to submit a bug report, please visit:
> #   http://www.azul.com/support/
> #
> 
> Exception happened during processing of request from ('127.0.0.1', 36358)
> Traceback (most recent call last):
>   File "/usr/lib/pypy3/lib-python/3/socketserver.py", line 316, in 
> _handle_request_noblock
> self.process_request(request, client_address)
>   File "/usr/lib/pypy3/lib-python/3/socketserver.py", line 347, in 
> process_request
> self.finish_request(request, client_address)
>   File "/usr/lib/pypy3/lib-python/3/socketserver.py", line 360, in 
> finish_request
> self.RequestHandlerClass(request, client_address, self)
>   File "/usr/lib/pypy3/lib-python/3/socketserver.py", line 720, in __init__
> self.handle()
>   File "/__w/spark/spark/python/pyspark/accumulators.py", line 281, in handle
> poll(accum_updates)
>   File "/__w/spark/spark/python/pyspark/accumulators.py", line 253, in poll
> if func():
>   File "/__w/spark/spark/python/pyspark/accumulators.py", line 257, in 
> accum_updates
> num_updates = read_int(self.rfile)
>   File "/__w/spark/spark/python/pyspark/serializers.py", line 595, in read_int
> raise EOFError
> {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39276) grouping_id() behavior changed between 3.1.x and 3.2.x

2022-05-24 Thread Martin Price (Jira)
Martin Price created SPARK-39276:


 Summary: grouping_id() behavior changed between 3.1.x and 3.2.x
 Key: SPARK-39276
 URL: https://issues.apache.org/jira/browse/SPARK-39276
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.1
Reporter: Martin Price


It appears that Spark 3.1.x respected the order of columns in the `group by` 
clause to determine what each bit in grouping_id() referred to.

In Spark 3.2.x it appears to use the order in which columns first appear in the 
`grouping sets` clause.

We use grouping_id() to route different levels of aggregation to different 
tables, so this change in behavior broke those pipelines.

3.1.3 behavior:

The grouping_id bitmaps between the two queries are the same:
{noformat}
--
Start test: Grouping sets in same order as group by

SELECT 'col1' as col1,
       'col2' as col2,
       'col3' as col3,
       grouping_id()            as grouping_id,
       count(1)                 as rowCount
from values(1)
GROUP BY col1, col2, col3
GROUPING SETS (
    (col1),
    (col2, col3)
)

++++---++
|col1|col2|col3|grouping_id|rowCount|
++++---++
|col1|null|null|          3|       1|
|col1|col2|col3|          4|       1|
++++---++
Grouping bitmap and associated dimensions: 3 col1
Grouping bitmap and associated dimensions: 4 col2, col3
End test: Grouping sets in same order as group by

--
Start test: Grouping sets in different order as group by

SELECT 'col1' as col1,
       'col2' as col2,
       'col3' as col3,
       grouping_id()            as grouping_id,
       count(1)                 as rowCount
from values(1)
GROUP BY col1, col2, col3
GROUPING SETS (
    (col2, col3),
    (col1)
)

++++---++
|col1|col2|col3|grouping_id|rowCount|
++++---++
|col1|null|null|          3|       1|
|col1|col2|col3|          4|       1|
++++---++
Grouping bitmap and associated dimensions: 3 col1
Grouping bitmap and associated dimensions: 4 col2, col3
End test: Grouping sets in different order as group by{noformat}

3.2.1 behavior:

The grouping_id bitmap changes between the two queries based on the order in 
which columns appear in the grouping sets clause.
{noformat}

--
Start test: Grouping sets in same order as group by

SELECT 'col1' as col1,
       'col2' as col2,
       'col3' as col3,
       grouping_id()            as grouping_id,
       count(1)                 as rowCount
from values(1)
GROUP BY col1, col2, col3
GROUPING SETS (
    (col1),
    (col2, col3)
)

++++---++
|col1|col2|col3|grouping_id|rowCount|
++++---++
|col1|null|null|          3|       1|
|col1|col2|col3|          4|       1|
++++---++

Grouping bitmap and associated dimensions: 3 col1
Grouping bitmap and associated dimensions: 4 col2, col3
End test: Grouping sets in same order as group by

--
Start test: Grouping sets in different order as group by

SELECT 'col1' as col1,
       'col2' as col2,
       'col3' as col3,
       grouping_id()            as grouping_id,
       count(1)                 as rowCount
from values(1)
GROUP BY col1, col2, col3
GROUPING SETS (
    (col2, col3),
    (col1)
)

++++---++
|col1|col2|col3|grouping_id|rowCount|
++++---++
|col1|col2|col3|          1|       1|
|col1|null|null|          6|       1|
++++---++

Grouping bitmap and associated dimensions: 1 col1, col2
Grouping bitmap and associated dimensions: 6 col3
End test: Grouping sets in different order as group by

{noformat}


Project that produces the above output:

https://github.com/mprice64/SparkGroupingIdBehaviorChange
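
A version-independent workaround sketch (not from this report): build the bitmap explicitly from grouping(), whose per-column semantics are stable, instead of depending on the bit order grouping_id() derives from the query. This should reproduce the 3.1.3 bitmaps (3 for the col1 set, 4 for the col2/col3 set) on both versions.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stable-grouping-bitmap").getOrCreate()

# grouping(col) is 1 when the column is aggregated away in the current grouping
# set, so this bitmap always follows the GROUP BY order col1, col2, col3,
# no matter how GROUPING SETS is written.
spark.sql("""
    SELECT col1, col2, col3,
           grouping(col1) * 4 + grouping(col2) * 2 + grouping(col3) AS stable_gid,
           count(1) AS rowCount
    FROM VALUES ('col1', 'col2', 'col3') AS t(col1, col2, col3)
    GROUP BY col1, col2, col3
    GROUPING SETS ((col2, col3), (col1))
""").show()
{code}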



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39048) Refactor `GroupBy._reduce_for_stat_function` on accepted data types

2022-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-39048:
-
Parent: SPARK-39076
Issue Type: Sub-task  (was: Improvement)

> Refactor `GroupBy._reduce_for_stat_function` on accepted data types 
> 
>
> Key: SPARK-39048
> URL: https://issues.apache.org/jira/browse/SPARK-39048
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
>
> `Groupby._reduce_for_stat_function` is a common helper function leveraged by 
> multiple statistical functions of GroupBy objects.
> It defines parameters `only_numeric` and `bool_as_numeric` to control 
> accepted Spark types.
> To be consistent with pandas API, we may also have to introduce 
> `str_as_numeric` for `sum` for example.
> Instead of introducing a parameter designated for each Spark type, the PR 
> proposes to introduce a parameter `accepted_spark_types` to specify the 
> accepted types of Spark columns to be aggregated.
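
A rough illustration of the proposed shape (the parameter name comes from this ticket; the body is only a sketch, not the actual implementation):

{code:python}
from typing import Optional, Tuple, Type

from pyspark.sql import DataFrame, functions as F
from pyspark.sql.types import BooleanType, DataType, NumericType

def reduce_for_stat(
    sdf: DataFrame,
    funcname: str,
    accepted_spark_types: Optional[Tuple[Type[DataType], ...]] = None,
) -> DataFrame:
    # One tuple of accepted types replaces per-type flags such as
    # `only_numeric` / `bool_as_numeric`; None means "aggregate every column".
    cols = [
        f.name
        for f in sdf.schema.fields
        if accepted_spark_types is None or isinstance(f.dataType, accepted_spark_types)
    ]
    return sdf.agg(*[getattr(F, funcname)(c).alias(c) for c in cols])

# `sum` with booleans treated as numeric, i.e. the old `bool_as_numeric=True`:
# reduce_for_stat(sdf, "sum", accepted_spark_types=(NumericType, BooleanType))
{code}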



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38880) Implement `numeric_only` parameter of `GroupBy.max/min`

2022-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38880:
-
Parent: SPARK-39076
Issue Type: Sub-task  (was: Improvement)

> Implement `numeric_only` parameter of `GroupBy.max/min`
> ---
>
> Key: SPARK-38880
> URL: https://issues.apache.org/jira/browse/SPARK-38880
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Implement `numeric_only` parameter of `GroupBy.max/min`



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39000) Convert bools to ints in basic statistical functions of GroupBy objects

2022-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-39000:
-
Parent: SPARK-39076
Issue Type: Sub-task  (was: Improvement)

> Convert bools to ints in basic statistical functions of GroupBy objects
> ---
>
> Key: SPARK-39000
> URL: https://issues.apache.org/jira/browse/SPARK-39000
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Convert bools to ints in basic statistical functions of GroupBy objects



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39227) Reach parity with pandas boolean cast

2022-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-39227:
-
Parent: SPARK-39076
Issue Type: Sub-task  (was: Improvement)

> Reach parity with pandas boolean cast
> -
>
> Key: SPARK-39227
> URL: https://issues.apache.org/jira/browse/SPARK-39227
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark, PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> There are pandas APIs that need boolean casts: all, any.
> Currently, pandas-on-Spark behaves differently from pandas on special inputs 
> to these APIs, for example, empty strings and lists, as mentioned in 
> https://github.com/apache/spark/pull/36547#issuecomment-1129228724 by 
> [~zero323].
> We shall match pandas behavior on boolean cast.
> Meanwhile, Series/Frames that contain empty strings and lists should be 
> added as test inputs to increase test coverage.
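
For reference, the pandas behavior being matched follows plain Python truthiness, so empty strings and empty containers are falsy:

{code:python}
import pandas as pd

print(pd.Series(["", "a"]).all())  # False: the empty string is falsy
print(pd.Series(["", ""]).any())   # False: no truthy element
print(bool(""), bool([]))          # both False under Python truthiness
{code}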



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38952) Implement `numeric_only` of `GroupBy.first` and `GroupBy.last`

2022-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38952:
-
Parent: SPARK-39076
Issue Type: Sub-task  (was: Improvement)

> Implement `numeric_only` of `GroupBy.first` and `GroupBy.last`
> --
>
> Key: SPARK-38952
> URL: https://issues.apache.org/jira/browse/SPARK-38952
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Implement `numeric_only` of `GroupBy.first` and `GroupBy.last`



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38763) Pandas API on Spark can't apply lambda to columns.

2022-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38763:
-
Parent: SPARK-39199
Issue Type: Sub-task  (was: Bug)

> Pandas API on Spark can't apply lambda to columns.  
> ---
>
> Key: SPARK-38763
> URL: https://issues.apache.org/jira/browse/SPARK-38763
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Bjørn Jørgensen
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.3.0
>
>
> When I use a Spark master build from 08 November 2021 I can use this code to 
> rename columns:
> {code:java}
> pf05 = pf05.rename(columns=lambda x: re.sub('DOFFIN_ESENDERS:', '', x))
> pf05 = pf05.rename(columns=lambda x: re.sub('FORM_SECTION:', '', x))
> pf05 = pf05.rename(columns=lambda x: re.sub('F05_2014:', '', x))
> {code}
> But now I get this error when I use this code:
> ---
> ValueError Traceback (most recent call last)
> Input In [5], in ()
> > 1 pf05 = pf05.rename(columns=lambda x: re.sub('DOFFIN_ESENDERS:', '', 
> x))
>   2 pf05 = pf05.rename(columns=lambda x: re.sub('FORM_SECTION:', '', x))
>   3 pf05 = pf05.rename(columns=lambda x: re.sub('F05_2014:', '', x))
> File /opt/spark/python/pyspark/pandas/frame.py:10636, in 
> DataFrame.rename(self, mapper, index, columns, axis, inplace, level, errors)
>   10632 index_mapper_fn, index_mapper_ret_dtype, index_mapper_ret_stype = 
> gen_mapper_fn(
>   10633 index
>   10634 )
>   10635 if columns:
> > 10636 columns_mapper_fn, _, _ = gen_mapper_fn(columns)
>   10638 if not index and not columns:
>   10639 raise ValueError("Either `index` or `columns` should be 
> provided.")
> File /opt/spark/python/pyspark/pandas/frame.py:10603, in 
> DataFrame.rename..gen_mapper_fn(mapper)
>   10601 elif callable(mapper):
>   10602 mapper_callable = cast(Callable, mapper)
> > 10603 return_type = cast(ScalarType, infer_return_type(mapper))
>   10604 dtype = return_type.dtype
>   10605 spark_return_type = return_type.spark_type
> File /opt/spark/python/pyspark/pandas/typedef/typehints.py:563, in 
> infer_return_type(f)
> 560 tpe = get_type_hints(f).get("return", None)
> 562 if tpe is None:
> --> 563 raise ValueError("A return value is required for the input 
> function")
> 565 if hasattr(tpe, "__origin__") and issubclass(tpe.__origin__, 
> SeriesType):
> 566 tpe = tpe.__args__[0]
> ValueError: A return value is required for the input function
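
A workaround sketch (not from this ticket): the error comes from infer_return_type reading the mapper's type hints, so an annotated named function still works where a bare lambda now fails.

{code:python}
import re

import pyspark.pandas as ps

def strip_prefixes(x: str) -> str:
    # The `-> str` annotation is what infer_return_type needs.
    return re.sub('DOFFIN_ESENDERS:|FORM_SECTION:|F05_2014:', '', x)

# Tiny stand-in frame; the real pf05 comes from the reporter's data.
pf05 = ps.DataFrame({'DOFFIN_ESENDERS:ID': [1]})
pf05 = pf05.rename(columns=strip_prefixes)
{code}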



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38766) Support lambda `column` parameter of `DataFrame.rename`

2022-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38766:
-
Parent: (was: SPARK-39199)
Issue Type: Improvement  (was: Sub-task)

> Support lambda `column` parameter of `DataFrame.rename`
> ---
>
> Key: SPARK-38766
> URL: https://issues.apache.org/jira/browse/SPARK-38766
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Support lambda `column` parameter of `DataFrame.rename`.
> The issue was detected in https://issues.apache.org/jira/browse/SPARK-38763.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38387) Support `na_action` and Series input correspondence in `Series.map`

2022-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38387:
-
Parent: SPARK-39199
Issue Type: Sub-task  (was: New Feature)

> Support `na_action` and Series input correspondence in `Series.map`
> ---
>
> Key: SPARK-38387
> URL: https://issues.apache.org/jira/browse/SPARK-38387
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.3.0
>
>
> Support `na_action` and Series input correspondence in `Series.map`, in order 
> to reach parity with the pandas API.
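
For reference, the pandas semantics being matched:

{code:python}
import pandas as pd

s = pd.Series(["cat", "dog", None])
# na_action="ignore" propagates missing values without applying the mapping.
print(s.map({"cat": "kitten"}, na_action="ignore"))
# A Series also works as the mapping, keyed by its index.
print(s.map(pd.Series({"cat": "kitten", "dog": "puppy"})))
{code}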



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38766) Support lambda `column` parameter of `DataFrame.rename`

2022-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38766:
-
Parent: SPARK-39199
Issue Type: Sub-task  (was: Bug)

> Support lambda `column` parameter of `DataFrame.rename`
> ---
>
> Key: SPARK-38766
> URL: https://issues.apache.org/jira/browse/SPARK-38766
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Support lambda `column` parameter of `DataFrame.rename`.
> The issue was detected in https://issues.apache.org/jira/browse/SPARK-38763.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38400) Enable Series.rename to change index labels

2022-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38400:
-
Parent: SPARK-39199
Issue Type: Sub-task  (was: Improvement)

> Enable Series.rename to change index labels
> ---
>
> Key: SPARK-38400
> URL: https://issues.apache.org/jira/browse/SPARK-38400
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.3.0
>
>
> Enable Series.rename to change index labels, with function `index` input.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38491) Support `ignore_index` of `Series.sort_values`

2022-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38491:
-
Parent: SPARK-39199
Issue Type: Sub-task  (was: Improvement)

> Support `ignore_index` of `Series.sort_values`
> --
>
> Key: SPARK-38491
> URL: https://issues.apache.org/jira/browse/SPARK-38491
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.3.0
>
>
> Support `ignore_index` of `Series.sort_values`



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38518) Implement `skipna` of `Series.all/Index.all` to exclude NA/null values

2022-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38518:
-
Parent: SPARK-39199
Issue Type: Sub-task  (was: Improvement)

> Implement `skipna` of `Series.all/Index.all` to exclude NA/null values
> --
>
> Key: SPARK-38518
> URL: https://issues.apache.org/jira/browse/SPARK-38518
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.3.0
>
>
> Implement `skipna` of `Series.all/Index.all` to exclude NA/null values.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38441) Support string and bool `regex` in `Series.replace`

2022-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38441:
-
Parent: SPARK-39199
Issue Type: Sub-task  (was: Improvement)

> Support string and bool `regex` in `Series.replace`
> ---
>
> Key: SPARK-38441
> URL: https://issues.apache.org/jira/browse/SPARK-38441
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Support string and bool `regex` in `Series.replace` in order to reach parity 
> with pandas.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38479) Add `Series.duplicated` to indicate duplicate Series values.

2022-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38479:
-
Parent: SPARK-39199
Issue Type: Sub-task  (was: New Feature)

> Add `Series.duplicated` to indicate duplicate Series values.
> 
>
> Key: SPARK-38479
> URL: https://issues.apache.org/jira/browse/SPARK-38479
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Add `Series.duplicated` to indicate duplicate Series values.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38576) Implement `numeric_only` parameter for `DataFrame/Series.rank` to rank numeric columns only

2022-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38576:
-
Parent: SPARK-39199
Issue Type: Sub-task  (was: Improvement)

> Implement `numeric_only` parameter for `DataFrame/Series.rank` to rank 
> numeric columns only
> ---
>
> Key: SPARK-38576
> URL: https://issues.apache.org/jira/browse/SPARK-38576
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Implement `numeric_only` parameter for `DataFrame/Series.rank` to rank 
> numeric columns only.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38608) Implement `bool_only` parameter of `DataFrame.all` and `DataFrame.any`

2022-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38608:
-
Parent: SPARK-39199
Issue Type: Sub-task  (was: Improvement)

> Implement `bool_only` parameter of `DataFrame.all` and `DataFrame.any`
> -
>
> Key: SPARK-38608
> URL: https://issues.apache.org/jira/browse/SPARK-38608
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Implement `bool_only` parameter of `DataFrame.all` and `DataFrame.any` to 
> include only boolean columns.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38552) Implement `keep` parameter of `frame.nlargest/nsmallest` to decide how to resolve ties

2022-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38552:
-
Parent: SPARK-39199
Issue Type: Sub-task  (was: Improvement)

> Implement `keep` parameter of `frame.nlargest/nsmallest` to decide how to 
> resolve ties
> --
>
> Key: SPARK-38552
> URL: https://issues.apache.org/jira/browse/SPARK-38552
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Implement `keep` parameter of `frame.nlargest/nsmallest` to decide how to 
> resolve ties



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38686) Implement `keep` parameter of `(Index/MultiIndex).drop_duplicates`

2022-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38686:
-
Parent: SPARK-39199
Issue Type: Sub-task  (was: Improvement)

> Implement `keep` parameter of `(Index/MultiIndex).drop_duplicates`
> --
>
> Key: SPARK-38686
> URL: https://issues.apache.org/jira/browse/SPARK-38686
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Implement `keep` parameter of `(Index/MultiIndex).drop_duplicates`



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39275) Pass SQL config values as parameters of error classes

2022-05-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541659#comment-17541659
 ] 

Apache Spark commented on SPARK-39275:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/36653

> Pass SQL config values as parameters of error classes
> -
>
> Key: SPARK-39275
> URL: https://issues.apache.org/jira/browse/SPARK-39275
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Pass SQL config values as parameters of error classes. This should align them 
> with SQL configs.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38704) Support string `inclusive` parameter of `Series.between`

2022-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38704:
-
Parent: SPARK-39199
Issue Type: Sub-task  (was: Improvement)

> Support string `inclusive` parameter of `Series.between`
> 
>
> Key: SPARK-38704
> URL: https://issues.apache.org/jira/browse/SPARK-38704
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Support string `inclusive` parameter of `Series.between`
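
For reference, the pandas semantics being matched (`inclusive` accepts "both", "neither", "left", and "right" since pandas 1.3):

{code:python}
import pandas as pd

s = pd.Series([1, 2, 3, 4])
print(s.between(2, 4, inclusive="left"))  # True for 2 and 3; 4 is excluded
print(s.between(2, 4, inclusive="both"))  # True for 2, 3, and 4
{code}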



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39275) Pass SQL config values as parameters of error classes

2022-05-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541660#comment-17541660
 ] 

Apache Spark commented on SPARK-39275:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/36653

> Pass SQL config values as parameters of error classes
> -
>
> Key: SPARK-39275
> URL: https://issues.apache.org/jira/browse/SPARK-39275
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Pass SQL config values as parameters of error classes. This should align them 
> with SQL configs.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39275) Pass SQL config values as parameters of error classes

2022-05-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39275:


Assignee: Apache Spark  (was: Max Gekk)

> Pass SQL config values as parameters of error classes
> -
>
> Key: SPARK-39275
> URL: https://issues.apache.org/jira/browse/SPARK-39275
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Pass SQL config values as parameters of error classes. This should align them 
> with SQL configs.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39275) Pass SQL config values as parameters of error classes

2022-05-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39275:


Assignee: Max Gekk  (was: Apache Spark)

> Pass SQL config values as parameters of error classes
> -
>
> Key: SPARK-39275
> URL: https://issues.apache.org/jira/browse/SPARK-39275
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Pass SQL config values as parameters of error classes. This should align them 
> with SQL configs.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39255) Improve error messages

2022-05-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541658#comment-17541658
 ] 

Apache Spark commented on SPARK-39255:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/36655

> Improve error messages
> --
>
> Key: SPARK-39255
> URL: https://issues.apache.org/jira/browse/SPARK-39255
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.4.0
>
>
> Improve the following error messages:
> 1. NON_PARTITION_COLUMN
> 2. UNSUPPORTED_SAVE_MODE
> 3. INVALID_FIELD_NAME
> 4. FAILED_SET_ORIGINAL_PERMISSION_BACK
> 5. NON_LITERAL_PIVOT_VALUES
> 6. INVALID_SYNTAX_FOR_CAST



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39275) Pass SQL config values as parameters of error classes

2022-05-24 Thread Max Gekk (Jira)
Max Gekk created SPARK-39275:


 Summary: Pass SQL config values as parameters of error classes
 Key: SPARK-39275
 URL: https://issues.apache.org/jira/browse/SPARK-39275
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Max Gekk
Assignee: Max Gekk


Pass SQL config values as parameters of error classes. This should align them 
with SQL configs.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38726) Support `how` parameter of `MultiIndex.dropna`

2022-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38726:
-
Parent: SPARK-39199
Issue Type: Sub-task  (was: Improvement)

> Support `how` parameter of `MultiIndex.dropna`
> --
>
> Key: SPARK-38726
> URL: https://issues.apache.org/jira/browse/SPARK-38726
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Support `how` parameter of `MultiIndex.dropna`



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38765) Implement `inplace` parameter of `Series.clip`

2022-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38765:
-
Parent: SPARK-39199
Issue Type: Sub-task  (was: Improvement)

> Implement `inplace` parameter of `Series.clip`
> --
>
> Key: SPARK-38765
> URL: https://issues.apache.org/jira/browse/SPARK-38765
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Implement `inplace` parameter of `Series.clip`



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38837) Implement `dropna` parameter of `SeriesGroupBy.value_counts`

2022-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38837:
-
Parent: SPARK-39199
Issue Type: Sub-task  (was: Improvement)

> Implement `dropna` parameter of `SeriesGroupBy.value_counts`
> 
>
> Key: SPARK-38837
> URL: https://issues.apache.org/jira/browse/SPARK-38837
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.3.0, 3.4.0
>
>
> Implement `dropna` parameter of `SeriesGroupBy.value_counts`



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38863) Implement `skipna` parameter of `DataFrame.all`

2022-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38863:
-
Parent: SPARK-39199
Issue Type: Sub-task  (was: Improvement)

> Implement `skipna` parameter of `DataFrame.all`
> ---
>
> Key: SPARK-38863
> URL: https://issues.apache.org/jira/browse/SPARK-38863
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Implement `skipna` parameter of `DataFrame.all`.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38793) Support `return_indexer` parameter of `Index/MultiIndex.sort_values`

2022-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38793:
-
Parent: SPARK-39199
Issue Type: Sub-task  (was: Improvement)

> Support `return_indexer` parameter of `Index/MultiIndex.sort_values`
> 
>
> Key: SPARK-38793
> URL: https://issues.apache.org/jira/browse/SPARK-38793
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Support `return_indexer` parameter of `Index/MultiIndex.sort_values`



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38903) Implement `ignore_index` of `Series.sort_values` and `Series.sort_index`

2022-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38903:
-
Parent: SPARK-39199
Issue Type: Sub-task  (was: Improvement)

> Implement `ignore_index` of `Series.sort_values` and `Series.sort_index`
> 
>
> Key: SPARK-38903
> URL: https://issues.apache.org/jira/browse/SPARK-38903
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Implement `ignore_index` of `Series.sort_values` and `Series.sort_index`



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38890) Implement `ignore_index` of `DataFrame.sort_index`.

2022-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38890:
-
Parent: SPARK-39199
Issue Type: Sub-task  (was: Improvement)

> Implement `ignore_index` of `DataFrame.sort_index`.
> ---
>
> Key: SPARK-38890
> URL: https://issues.apache.org/jira/browse/SPARK-38890
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Implement `ignore_index` of `DataFrame.sort_index`.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38938) Implement `inplace` and `columns` parameters of `Series.drop`

2022-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38938:
-
Parent: SPARK-39199
Issue Type: Sub-task  (was: Improvement)

> Implement `inplace` and `columns` parameters of `Series.drop`
> -
>
> Key: SPARK-38938
> URL: https://issues.apache.org/jira/browse/SPARK-38938
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Implement `inplace` and `columns` parameters of `Series.drop`



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38989) Implement `ignore_index` of `DataFrame/Series.sample`

2022-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38989:
-
Parent: SPARK-39199
Issue Type: Sub-task  (was: Improvement)

> Implement `ignore_index` of `DataFrame/Series.sample`
> -
>
> Key: SPARK-38989
> URL: https://issues.apache.org/jira/browse/SPARK-38989
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Implement `ignore_index` of `DataFrame/Series.sample`



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39201) Implement `ignore_index` of `DataFrame.explode` and `DataFrame.drop_duplicates`

2022-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-39201:
-
Parent: SPARK-39199
Issue Type: Sub-task  (was: Improvement)

> Implement `ignore_index` of `DataFrame.explode` and 
> `DataFrame.drop_duplicates`
> ---
>
> Key: SPARK-39201
> URL: https://issues.apache.org/jira/browse/SPARK-39201
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Implement `ignore_index` of `DataFrame.explode` and 
> `DataFrame.drop_duplicates`



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39201) Implement `ignore_index` of `DataFrame.explode` and `DataFrame.drop_duplicates`

2022-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-39201:
-
Issue Type: Improvement  (was: Umbrella)

> Implement `ignore_index` of `DataFrame.explode` and 
> `DataFrame.drop_duplicates`
> ---
>
> Key: SPARK-39201
> URL: https://issues.apache.org/jira/browse/SPARK-39201
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Implement `ignore_index` of `DataFrame.explode` and 
> `DataFrame.drop_duplicates`



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39104) Null Pointer Exception on unpersist call

2022-05-24 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-39104:
-
Fix Version/s: 3.3.0
   (was: 3.3.1)

> Null Pointer Exception on unpersist call
> ---
>
> Key: SPARK-39104
> URL: https://issues.apache.org/jira/browse/SPARK-39104
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.1
>Reporter: Denis
>Assignee: Cheng Pan
>Priority: Major
> Fix For: 3.3.0, 3.2.2, 3.4.0
>
>
> DataFrame.unpersist call fails with an NPE
>  
> {code:java}
> java.lang.NullPointerException
>     at 
> org.apache.spark.sql.execution.columnar.CachedRDDBuilder.isCachedRDDLoaded(InMemoryRelation.scala:247)
>     at 
> org.apache.spark.sql.execution.columnar.CachedRDDBuilder.isCachedColumnBuffersLoaded(InMemoryRelation.scala:241)
>     at 
> org.apache.spark.sql.execution.CacheManager.$anonfun$uncacheQuery$8(CacheManager.scala:189)
>     at 
> org.apache.spark.sql.execution.CacheManager.$anonfun$uncacheQuery$8$adapted(CacheManager.scala:176)
>     at 
> scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:304)
>     at scala.collection.Iterator.foreach(Iterator.scala:943)
>     at scala.collection.Iterator.foreach$(Iterator.scala:943)
>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>     at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>     at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>     at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>     at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:303)
>     at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:297)
>     at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108)
>     at scala.collection.TraversableLike.filter(TraversableLike.scala:395)
>     at scala.collection.TraversableLike.filter$(TraversableLike.scala:395)
>     at scala.collection.AbstractTraversable.filter(Traversable.scala:108)
>     at 
> org.apache.spark.sql.execution.CacheManager.recacheByCondition(CacheManager.scala:219)
>     at 
> org.apache.spark.sql.execution.CacheManager.uncacheQuery(CacheManager.scala:176)
>     at org.apache.spark.sql.Dataset.unpersist(Dataset.scala:3220)
>     at org.apache.spark.sql.Dataset.unpersist(Dataset.scala:3231){code}
> Looks like synchronization is required for 
> org.apache.spark.sql.execution.columnar.CachedRDDBuilder#isCachedColumnBuffersLoaded
>  
> {code:java}
> def isCachedColumnBuffersLoaded: Boolean = {
>   _cachedColumnBuffers != null && isCachedRDDLoaded
> }
>
> def isCachedRDDLoaded: Boolean = {
>   _cachedColumnBuffersAreLoaded || {
>     val bmMaster = SparkEnv.get.blockManager.master
>     val rddLoaded = _cachedColumnBuffers.partitions.forall { partition =>
>       bmMaster.getBlockStatus(RDDBlockId(_cachedColumnBuffers.id, partition.index), false)
>         .exists { case (_, blockStatus) => blockStatus.isCached }
>     }
>     if (rddLoaded) {
>       _cachedColumnBuffersAreLoaded = rddLoaded
>     }
>     rddLoaded
>   }
> } {code}
> isCachedRDDLoaded relies on the _cachedColumnBuffers != null check, but that 
> field can be changed concurrently by another thread. 
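>
> For reference, a hypothetical sketch of the kind of concurrent access that 
> could hit this race (the report does not include a reproducer; a 
> single-threaded unpersist will not reliably trigger it):
> {code:python}
> from concurrent.futures import ThreadPoolExecutor
> from pyspark.sql import SparkSession
>
> spark = SparkSession.builder.getOrCreate()
> df = spark.range(1000000).cache()
> df.count()  # materialize the cached data
>
> # Concurrent uncache calls may observe _cachedColumnBuffers as null
> # between the null check and the dereference shown above.
> with ThreadPoolExecutor(max_workers=8) as pool:
>     list(pool.map(lambda _: df.unpersist(), range(8)))
> {code}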



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38681) Support nested generic case classes

2022-05-24 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-38681:
-
Fix Version/s: 3.3.0
   (was: 3.3.1)

> Support nested generic case classes
> ---
>
> Key: SPARK-38681
> URL: https://issues.apache.org/jira/browse/SPARK-38681
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Emil Ejbyfeldt
>Assignee: Emil Ejbyfeldt
>Priority: Major
> Fix For: 3.3.0
>
>
> Spark fails to derive schemas when using nested case classes with generic 
> parameters. 
> Example
> {code:java}
> case class GenericData[A](
> genericField: A)
> {code}
> Spark will derive a correct schema for `GenericData[Int]`, but if the classes 
> are nested, e.g.
> {code:java}
> case class NestedGeneric[T](
>   generic: GenericData[T])
> {code}
> it will fail to derive a schema for `NestedGeneric[Int]`.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39187) Remove SparkIllegalStateException

2022-05-24 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-39187:
-
Fix Version/s: 3.3.0
   (was: 3.3.1)

> Remove SparkIllegalStateException
> -
>
> Key: SPARK-39187
> URL: https://issues.apache.org/jira/browse/SPARK-39187
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.3.0, 3.4.0
>
>
> Remove SparkIllegalStateException and replace it by IllegalStateException. 
> This will be consistent with other places where IllegalStateException is used.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39190) Provide query context for decimal precision overflow error when WSCG is off

2022-05-24 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-39190:
-
Fix Version/s: 3.3.0
   (was: 3.3.1)

> Provide query context for decimal precision overflow error when WSCG is off
> ---
>
> Key: SPARK-39190
> URL: https://issues.apache.org/jira/browse/SPARK-39190
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39183) Upgrade Apache Xerces Java to 2.12.2

2022-05-24 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-39183:
-
Fix Version/s: 3.3.0
   (was: 3.3.1)

> Upgrade Apache Xerces Java to 2.12.2
> 
>
> Key: SPARK-39183
> URL: https://issues.apache.org/jira/browse/SPARK-39183
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.2.1
>Reporter: Bjørn Jørgensen
>Assignee: Bjørn Jørgensen
>Priority: Minor
> Fix For: 3.3.0, 3.2.2, 3.4.0
>
>
> [Infinite Loop in Apache Xerces Java 
> |https://github.com/advisories/GHSA-h65f-jvqw-m9fj]
> There's a vulnerability in the Apache Xerces Java (XercesJ) XML parser when 
> handling specially crafted XML document payloads. It causes the XercesJ XML 
> parser to wait in an infinite loop, which may consume system resources for a 
> prolonged duration. The vulnerability is present in XercesJ version 2.12.1 
> and earlier versions.
> References
> https://nvd.nist.gov/vuln/detail/CVE-2022-23437
> https://lists.apache.org/thread/6pjwm10bb69kq955fzr1n0nflnjd27dl
> http://www.openwall.com/lists/oss-security/2022/01/24/3
> https://www.oracle.com/security-alerts/cpuapr2022.html



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39193) Improve the performance of inferring Timestamp type in JSON/CSV data source

2022-05-24 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-39193:
-
Fix Version/s: 3.3.0
   (was: 3.3.1)

> Improve the performance of inferring Timestamp type in JSON/CSV data source
> ---
>
> Key: SPARK-39193
> URL: https://issues.apache.org/jira/browse/SPARK-39193
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.3.0
>
>
> When reading JSON/CSV files with timestamp type inference enabled via 
> `.option("inferTimestamp", true)`, the Timestamp conversion throws and 
> catches exceptions. Since we put decent error messages in the exceptions, 
> creating them is actually not cheap: it consumes more than 90% of the type 
> inference time. 
> We can use the parsing methods that return optional results instead.
> Before the change, inferring the schema of a 624MB JSON file with timestamp 
> inference enabled takes 166 seconds.
> After the change, it takes only 16 seconds.
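>
> A minimal sketch of the read path being measured (the file path is a 
> placeholder):
> {code:python}
> from pyspark.sql import SparkSession
>
> spark = SparkSession.builder.getOrCreate()
>
> # Schema inference attempts a timestamp parse for every string value,
> # which is the hot path this change speeds up.
> df = spark.read.option("inferTimestamp", "true").json("/path/to/events.json")
> df.printSchema()
> {code}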



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39240) Source and binary releases use different tools to generate hashes for integrity

2022-05-24 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-39240:
-
Fix Version/s: 3.3.0
   (was: 3.3.1)

> Source and binary releases use different tools to generate hashes for 
> integrity
> -
>
> Key: SPARK-39240
> URL: https://issues.apache.org/jira/browse/SPARK-39240
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Project Infra
>Affects Versions: 3.2.1, 3.3.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Trivial
> Fix For: 3.3.0, 3.2.2
>
>
> The source release uses shasum to generate its hashes, while the binary 
> releases use gpg.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39216) Do not collapse projects in CombineUnions if it hasCorrelatedSubquery

2022-05-24 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-39216:
-
Fix Version/s: 3.3.0
   (was: 3.3.1)

> Do not collapse projects in CombineUnions if it hasCorrelatedSubquery
> -
>
> Key: SPARK-39216
> URL: https://issues.apache.org/jira/browse/SPARK-39216
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Allison Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.3.0
>
>
>  
> SPARK-37915 added CollapseProject to the CombineUnions rule, but it shouldn't 
> collapse projects that contain correlated subqueries, since they haven't been 
> de-correlated yet (by PullupCorrelatedPredicates).
> Here is a simple example to reproduce this issue
> {code:java}
> SELECT (SELECT IF(x, 1, 0)) AS a
> FROM (SELECT true) t(x)
> UNION 
> SELECT 1 AS a {code}
> Exception:
> {code:java}
> java.lang.IllegalStateException: Couldn't find x#4 in [] {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39214) Improve errors related to CAST

2022-05-24 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-39214:
-
Fix Version/s: 3.3.0
   (was: 3.3.1)

> Improve errors related to CAST
> --
>
> Key: SPARK-39214
> URL: https://issues.apache.org/jira/browse/SPARK-39214
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.3.0, 3.4.0
>
>
> 1. Rename the error classes INVALID_SYNTAX_FOR_CAST and CAST_CAUSES_OVERFLOW 
> to make them more precise and clear.
> 2. Improve the error messages of these error classes (use quotes around SQL 
> config and function names).
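>
> For context, a sketch of the two failure modes these error classes cover 
> (illustrative queries under ANSI mode; each of them raises):
> {code:python}
> from pyspark.sql import SparkSession
>
> spark = SparkSession.builder.getOrCreate()
> spark.conf.set("spark.sql.ansi.enabled", "true")
>
> # Input text is not valid syntax for the target type (INVALID_SYNTAX_FOR_CAST).
> spark.sql("SELECT CAST('not a number' AS INT)").show()
>
> # Value does not fit in the target type (CAST_CAUSES_OVERFLOW).
> spark.sql("SELECT CAST(2147483648L AS INT)").show()
> {code}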



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-38506) Push partial aggregation through join

2022-05-24 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541486#comment-17541486
 ] 

Yuming Wang edited comment on SPARK-38506 at 5/24/22 2:40 PM:
--

Benchmark result after [this 
commit|https://github.com/apache/spark/pull/36552/commits/d029bb2c7c003dff28e3af940fb06cd0b14fc6cb]:
|SQL|Before(ms)|With Partial Aggregation Optimization(ms)|
|v1.4 q4|72478|61261|
|v1.4 q5|23235|20971|
|v1.4 q10|13406|8422|
|v1.4 q11|37975|24236|
|v1.4 q14a|154484|52502|
|v1.4 q14b|128712|57363|
|v1.4 q23a|153233|58932|
|v1.4 q23b|159162|78401|
|v1.4 q24a|392441|84826|
|v1.4 q24b|408129|76384|
|v1.4 q31|14696|13766|
|v1.4 q35|29005|17662|
|v1.4 q37|20076|9218|
|v1.4 q47|36560|31299|
|v1.4 q54|12283|11306|
|v1.4 q57|38530|35303|
|v1.4 q69|15839|11968|
|v1.4 q82|24498|13451|
|v1.4 q95|69196|42653|
|v2.7 q6|9129|10527|
|v2.7 q10a|11778|9909|
|v2.7 q11|40113|29130|
|v2.7 q14|159807|38052|
|v2.7 q14a|238149|128097|
|v2.7 q22a|9344|5269|
|v2.7 q35|36910|14705|
|v2.7 q35a|32793|13303|
|v2.7 q47|49689|27308|
|v2.7 q57|26016|28775|
|v2.7 q74|42607|19340|
|modifiedQueries q10|11675|8628|
|modifiedQueries q98|6877|5219|


was (Author: q79969786):
Benchmark result:
 |SQL|Before(ms)|With Partial Aggregation Optimization(ms)|
|v1.4 q4|72478|61261|
|v1.4 q5|23235|20971|
|v1.4 q10|13406|8422|
|v1.4 q11|37975|24236|
|v1.4 q14a|154484|52502|
|v1.4 q14b|128712|57363|
|v1.4 q23a|153233|58932|
|v1.4 q23b|159162|78401|
|v1.4 q24a|392441|84826|
|v1.4 q24b|408129|76384|
|v1.4 q31|14696|13766|
|v1.4 q35|29005|17662|
|v1.4 q37|20076|9218|
|v1.4 q47|36560|31299|
|v1.4 q54|12283|11306|
|v1.4 q57|38530|35303|
|v1.4 q69|15839|11968|
|v1.4 q82|24498|13451|
|v1.4 q95|69196|42653|
|v2.7 q6|9129|10527|
|v2.7 q10a|11778|9909|
|v2.7 q11|40113|29130|
|v2.7 q14|159807|38052|
|v2.7 q14a|238149|128097|
|v2.7 q22a|9344|5269|
|v2.7 q35|36910|14705|
|v2.7 q35a|32793|13303|
|v2.7 q47|49689|27308|
|v2.7 q57|26016|28775|
|v2.7 q74|42607|19340|
|modifiedQueries q10|11675|8628|
|modifiedQueries q98|6877|5219|

> Push partial aggregation through join
> -
>
> Key: SPARK-38506
> URL: https://issues.apache.org/jira/browse/SPARK-38506
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Priority: Major
>
> Please see 
> https://docs.teradata.com/r/Teradata-VantageTM-SQL-Request-and-Transaction-Processing/March-2019/Join-Planning-and-Optimization/Partial-GROUP-BY-Block-Optimization
>  for more details.
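>
> For intuition, a hand-written version of the rewrite (a conceptual sketch 
> that assumes tables `a` and `b` exist; it is not the optimizer's actual 
> output):
> {code:python}
> from pyspark.sql import SparkSession
>
> spark = SparkSession.builder.getOrCreate()
>
> # Original shape: aggregate after the join.
> q1 = spark.sql(
>     "SELECT b.k, SUM(a.v) AS total FROM a JOIN b ON a.k = b.k GROUP BY b.k")
>
> # Equivalent shape with a partial aggregate pushed below the join,
> # shrinking the join input.
> q2 = spark.sql("""
>     SELECT b.k, SUM(pa.partial_sum) AS total
>     FROM (SELECT k, SUM(v) AS partial_sum FROM a GROUP BY k) pa
>     JOIN b ON pa.k = b.k
>     GROUP BY b.k
> """)
> {code}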



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39274) AttributeError: 'datetime.time' object has no attribute 'timetuple'

2022-05-24 Thread Andreas Fried (Jira)
Andreas Fried created SPARK-39274:
-

 Summary: AttributeError: 'datetime.time' object has no attribute 
'timetuple'
 Key: SPARK-39274
 URL: https://issues.apache.org/jira/browse/SPARK-39274
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.2.1
Reporter: Andreas Fried


 
{code:java}
import pandas as pd
import datetime
pdf = pd.DataFrame({'naive': [datetime.time(11, 30, 33, 0)]})
print(pdf)
print(pdf.info())

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
sp_df2 = spark.createDataFrame(pdf)
sp_df2.show(10)
{code}
 

throws this error:
 
{code:java}
  naive
0  11:30:33
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  --  --  - 
 0   naive   1 non-null  object
dtypes: object(1)
memory usage: 136.0+ bytes
None
---
AttributeErrorTraceback (most recent call last)
/usr/local/share/jupyter/kernels/python39/scripts/launch_ipykernel.py in <module>

 10 spark = SparkSession.builder.getOrCreate()
 11 
---> 12 sp_df2 = spark.createDataFrame(pdf)
 13 sp_df2.show(10)

/opt/ibm/spark/python/lib/pyspark.zip/pyspark/sql/session.py in 
createDataFrame(self, data, schema, samplingRatio, verifySchema)
671 if has_pandas and isinstance(data, pandas.DataFrame):
672 # Create a DataFrame from pandas DataFrame.
--> 673 return super(SparkSession, self).createDataFrame(
674 data, schema, samplingRatio, verifySchema)
675 return self._create_dataframe(data, schema, samplingRatio, 
verifySchema)

/opt/ibm/spark/python/lib/pyspark.zip/pyspark/sql/pandas/conversion.py in 
createDataFrame(self, data, schema, samplingRatio, verifySchema)
338 raise
339 data = self._convert_from_pandas(data, schema, timezone)
--> 340 return self._create_dataframe(data, schema, samplingRatio, 
verifySchema)
341 
342 def _convert_from_pandas(self, pdf, schema, timezone):

/opt/ibm/spark/python/lib/pyspark.zip/pyspark/sql/session.py in 
_create_dataframe(self, data, schema, samplingRatio, verifySchema)
698 rdd, schema = self._createFromRDD(data.map(prepare), 
schema, samplingRatio)
699 else:
--> 700 rdd, schema = self._createFromLocal(map(prepare, data), 
schema)
701 jrdd = 
self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
702 jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), 
schema.json())

/opt/ibm/spark/python/lib/pyspark.zip/pyspark/sql/session.py in 
_createFromLocal(self, data, schema)
523 
524 # convert python objects to sql data
--> 525 data = [schema.toInternal(row) for row in data]
526 return self._sc.parallelize(data), schema
527 

/opt/ibm/spark/python/lib/pyspark.zip/pyspark/sql/session.py in <listcomp>(.0)
523 
524 # convert python objects to sql data
--> 525 data = [schema.toInternal(row) for row in data]
526 return self._sc.parallelize(data), schema
527 

/opt/ibm/spark/python/lib/pyspark.zip/pyspark/sql/types.py in toInternal(self, 
obj)
624  for n, f, c in zip(self.names, 
self.fields, self._needConversion))
625 elif isinstance(obj, (tuple, list)):
--> 626 return tuple(f.toInternal(v) if c else v
627  for f, v, c in zip(self.fields, obj, 
self._needConversion))
628 elif hasattr(obj, "__dict__"):

/opt/ibm/spark/python/lib/pyspark.zip/pyspark/sql/types.py in <genexpr>(.0)
624  for n, f, c in zip(self.names, 
self.fields, self._needConversion))
625 elif isinstance(obj, (tuple, list)):
--> 626 return tuple(f.toInternal(v) if c else v
627  for f, v, c in zip(self.fields, obj, 
self._needConversion))
628 elif hasattr(obj, "__dict__"):

/opt/ibm/spark/python/lib/pyspark.zip/pyspark/sql/types.py in toInternal(self, 
obj)
449 
450 def toInternal(self, obj):
--> 451 return self.dataType.toInternal(obj)
452 
453 def fromInternal(self, obj):

/opt/ibm/spark/python/lib/pyspark.zip/pyspark/sql/types.py in toInternal(self, 
dt)
180 if dt is not None:
181 seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
--> 182else time.mktime(dt.timetuple()))
183 return int(seconds) * 100 + dt.microsecond
184 

AttributeError: 'datetime.time' object has no attribute 'timetuple'{code}
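
A possible workaround until this case is handled (or rejected with a clearer 
error): Spark SQL has no TIME type, so convert the column before calling 
createDataFrame. A minimal sketch:

{code:python}
import datetime

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
pdf = pd.DataFrame({"naive": [datetime.time(11, 30, 33, 0)]})

# Represent the time-of-day as a string (or combine it with a date to get
# a full timestamp) so Spark can map it to a supported type.
pdf["naive"] = pdf["naive"].astype(str)
spark.createDataFrame(pdf).show()
{code}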



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-

[jira] [Resolved] (SPARK-39256) Reduce multiple file attribute calls of JavaUtils#deleteRecursivelyUsingJavaIO

2022-05-24 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-39256.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 36636
[https://github.com/apache/spark/pull/36636]

> Reduce multiple file attribute calls of JavaUtils#deleteRecursivelyUsingJavaIO
> --
>
> Key: SPARK-39256
> URL: https://issues.apache.org/jira/browse/SPARK-39256
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.4.0
>
>
> JavaUtils#deleteRecursivelyUsingJavaIO makes multiple file attribute calls; 
> `Files.readAttributes` can merge these into a single call.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39256) Reduce multiple file attribute calls of JavaUtils#deleteRecursivelyUsingJavaIO

2022-05-24 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-39256:


Assignee: Yang Jie

> Reduce multiple file attribute calls of JavaUtils#deleteRecursivelyUsingJavaIO
> --
>
> Key: SPARK-39256
> URL: https://issues.apache.org/jira/browse/SPARK-39256
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>
> JavaUtils#deleteRecursivelyUsingJavaIO makes multiple file attribute calls; 
> `Files.readAttributes` can merge these into a single call.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38506) Push partial aggregation through join

2022-05-24 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541486#comment-17541486
 ] 

Yuming Wang commented on SPARK-38506:
-

Benchmark result:
 |SQL|Before(ms)|With Partial Aggregation Optimization(ms)|
|v1.4 q4|72478|61261|
|v1.4 q5|23235|20971|
|v1.4 q10|13406|8422|
|v1.4 q11|37975|24236|
|v1.4 q14a|154484|52502|
|v1.4 q14b|128712|57363|
|v1.4 q23a|153233|58932|
|v1.4 q23b|159162|78401|
|v1.4 q24a|392441|84826|
|v1.4 q24b|408129|76384|
|v1.4 q31|14696|13766|
|v1.4 q35|29005|17662|
|v1.4 q37|20076|9218|
|v1.4 q47|36560|31299|
|v1.4 q54|12283|11306|
|v1.4 q57|38530|35303|
|v1.4 q69|15839|11968|
|v1.4 q82|24498|13451|
|v1.4 q95|69196|42653|
|v2.7 q6|9129|10527|
|v2.7 q10a|11778|9909|
|v2.7 q11|40113|29130|
|v2.7 q14|159807|38052|
|v2.7 q14a|238149|128097|
|v2.7 q22a|9344|5269|
|v2.7 q35|36910|14705|
|v2.7 q35a|32793|13303|
|v2.7 q47|49689|27308|
|v2.7 q57|26016|28775|
|v2.7 q74|42607|19340|
|modifiedQueries q10|11675|8628|
|modifiedQueries q98|6877|5219|

> Push partial aggregation through join
> -
>
> Key: SPARK-38506
> URL: https://issues.apache.org/jira/browse/SPARK-38506
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Priority: Major
>
> Please see 
> https://docs.teradata.com/r/Teradata-VantageTM-SQL-Request-and-Transaction-Processing/March-2019/Join-Planning-and-Optimization/Partial-GROUP-BY-Block-Optimization
>  for more details.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



  1   2   >