[jira] [Assigned] (SPARK-39285) Spark should not check field name when reading data
[ https://issues.apache.org/jira/browse/SPARK-39285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39285: Assignee: (was: Apache Spark) > Spark should not check field name when reading data > > > Key: SPARK-39285 > URL: https://issues.apache.org/jira/browse/SPARK-39285 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: angerszhu >Priority: Major > > Spark should not check field names when reading data -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39285) Spark should not check field name when reading data
[ https://issues.apache.org/jira/browse/SPARK-39285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541850#comment-17541850 ] Apache Spark commented on SPARK-39285: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/36661 > Spark should not check field name when reading data > > > Key: SPARK-39285 > URL: https://issues.apache.org/jira/browse/SPARK-39285 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: angerszhu >Priority: Major > > Spark should not check field names when reading data
[jira] [Assigned] (SPARK-39285) Spark should not check field name when reading data
[ https://issues.apache.org/jira/browse/SPARK-39285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39285: Assignee: Apache Spark > Spark should not check field name when reading data > > > Key: SPARK-39285 > URL: https://issues.apache.org/jira/browse/SPARK-39285 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: angerszhu >Assignee: Apache Spark >Priority: Major > > Spark should not check field names when reading data
[jira] [Created] (SPARK-39285) Spark should not check field name when reading data
angerszhu created SPARK-39285: - Summary: Spark should not check field name when reading data Key: SPARK-39285 URL: https://issues.apache.org/jira/browse/SPARK-39285 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: angerszhu Spark should not check field names when reading data
[jira] [Commented] (SPARK-39284) Implement Groupby.mad
[ https://issues.apache.org/jira/browse/SPARK-39284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541835#comment-17541835 ] Apache Spark commented on SPARK-39284: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/36660 > Implement Groupby.mad > - > > Key: SPARK-39284 > URL: https://issues.apache.org/jira/browse/SPARK-39284 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: zhengruifeng >Priority: Major >
[jira] [Assigned] (SPARK-39284) Implement Groupby.mad
[ https://issues.apache.org/jira/browse/SPARK-39284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39284: Assignee: (was: Apache Spark) > Implement Groupby.mad > - > > Key: SPARK-39284 > URL: https://issues.apache.org/jira/browse/SPARK-39284 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: zhengruifeng >Priority: Major >
[jira] [Assigned] (SPARK-39284) Implement Groupby.mad
[ https://issues.apache.org/jira/browse/SPARK-39284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39284: Assignee: Apache Spark > Implement Groupby.mad > - > > Key: SPARK-39284 > URL: https://issues.apache.org/jira/browse/SPARK-39284 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: zhengruifeng >Assignee: Apache Spark >Priority: Major >
[jira] [Created] (SPARK-39284) Implement Groupby.mad
zhengruifeng created SPARK-39284: Summary: Implement Groupby.mad Key: SPARK-39284 URL: https://issues.apache.org/jira/browse/SPARK-39284 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 3.4.0 Reporter: zhengruifeng
[jira] [Commented] (SPARK-39282) Replace If-Else branch with bitwise operators in roundNumberOfBytesToNearestWord
[ https://issues.apache.org/jira/browse/SPARK-39282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541805#comment-17541805 ] xiangxiang Shen commented on SPARK-39282: - CC [~ueshin], [~cloud_fan], thanks! > Replace If-Else branch with bitwise operators in > roundNumberOfBytesToNearestWord > > > Key: SPARK-39282 > URL: https://issues.apache.org/jira/browse/SPARK-39282 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1 >Reporter: xiangxiang Shen >Priority: Major > > [https://github.com/apache/spark/blob/a6dd6076d708713d11585bf7f3401d522ea48822/common/unsafe/src/main/java/org/apache/spark/unsafe/array/ByteArrayMethods.java#L40-L47] > > Here, bitwise operators can be used to avoid the if-else branch and > improve computation performance.
[jira] [Commented] (SPARK-39282) Replace If-Else branch with bitwise operators in roundNumberOfBytesToNearestWord
[ https://issues.apache.org/jira/browse/SPARK-39282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541804#comment-17541804 ] Apache Spark commented on SPARK-39282: -- User 'zhixingheyi-tian' has created a pull request for this issue: https://github.com/apache/spark/pull/36659 > Replace If-Else branch with bitwise operators in > roundNumberOfBytesToNearestWord > > > Key: SPARK-39282 > URL: https://issues.apache.org/jira/browse/SPARK-39282 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1 >Reporter: xiangxiang Shen >Priority: Major > > [https://github.com/apache/spark/blob/a6dd6076d708713d11585bf7f3401d522ea48822/common/unsafe/src/main/java/org/apache/spark/unsafe/array/ByteArrayMethods.java#L40-L47] > > Here, bitwise operators can be used to avoid the if-else branch and > improve computation performance.
[jira] [Updated] (SPARK-39283) Spark tasks stuck forever due to deadlock between TaskMemoryManager and UnsafeExternalSorter
[ https://issues.apache.org/jira/browse/SPARK-39283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Pal updated SPARK-39283: Description: We are seeing this deadlock between {{TaskMemoryManager}} and {{UnsafeExternalSorter}} pretty often in our workload. Sometimes the retry succeeds, but sometimes we have to resort to hacks to break the deadlock, such as shutting down the worker machines explicitly. Below is the thread dump from the Spark UI showing the deadlock: !DeadlockSparkTasks.png! I believe there was a related Jira about a similar deadlock between the same threads, and it was resolved. https://issues.apache.org/jira/browse/SPARK-27338 was: We are seeing this deadlock between {{TaskMemoryManager}} and {{UnsafeExternalSorter}} pretty often in our workload. Sometimes the retry succeeds, but sometimes we have to resort to hacks to break the deadlock, such as shutting down the worker machines explicitly. Below is the thread dump from the Spark UI showing the deadlock: I believe there was a related Jira about a similar deadlock between the same threads, and it was resolved. https://issues.apache.org/jira/browse/SPARK-27338 > Spark tasks stuck forever due to deadlock between TaskMemoryManager and > UnsafeExternalSorter > > > Key: SPARK-39283 > URL: https://issues.apache.org/jira/browse/SPARK-39283 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Sandeep Pal >Priority: Critical > Labels: Deadlock, spark3.0 > Attachments: DeadlockSparkTasks.png > > > We are seeing this deadlock between {{TaskMemoryManager}} and > {{UnsafeExternalSorter}} pretty often in our workload. Sometimes the retry > succeeds, but sometimes we have to resort to hacks to break the deadlock, > such as shutting down the worker machines explicitly. > Below is the thread dump from the Spark UI showing the deadlock: > !DeadlockSparkTasks.png! > > I believe there was a related Jira about a similar deadlock between the same > threads, and it was resolved. > https://issues.apache.org/jira/browse/SPARK-27338 > >
[jira] [Updated] (SPARK-39283) Spark tasks stuck forever due to deadlock between TaskMemoryManager and UnsafeExternalSorter
[ https://issues.apache.org/jira/browse/SPARK-39283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Pal updated SPARK-39283: Attachment: DeadlockSparkTasks.png > Spark tasks stuck forever due to deadlock between TaskMemoryManager and > UnsafeExternalSorter > > > Key: SPARK-39283 > URL: https://issues.apache.org/jira/browse/SPARK-39283 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Sandeep Pal >Priority: Critical > Labels: Deadlock, spark3.0 > Attachments: DeadlockSparkTasks.png > > > We are seeing this deadlock between {{TaskMemoryManager}} and > {{UnsafeExternalSorter}} pretty often in our workload. Sometimes the retry > succeeds, but sometimes we have to resort to hacks to break the deadlock, > such as shutting down the worker machines explicitly. > Below is the thread dump from the Spark UI showing the deadlock: > > I believe there was a related Jira about a similar deadlock between the same > threads, and it was resolved. > https://issues.apache.org/jira/browse/SPARK-27338 > >
[jira] [Updated] (SPARK-39283) Spark tasks stuck forever due to deadlock between TaskMemoryManager and UnsafeExternalSorter
[ https://issues.apache.org/jira/browse/SPARK-39283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Pal updated SPARK-39283: Affects Version/s: 3.1.2 (was: 3.0.0) > Spark tasks stuck forever due to deadlock between TaskMemoryManager and > UnsafeExternalSorter > > > Key: SPARK-39283 > URL: https://issues.apache.org/jira/browse/SPARK-39283 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Sandeep Pal >Priority: Critical > Labels: Deadlock, spark3.0 > Attachments: DeadlockSparkTasks.png > > > We are seeing this deadlock between {{TaskMemoryManager}} and > {{UnsafeExternalSorter}} pretty often in our workload. Sometimes the retry > succeeds, but sometimes we have to resort to hacks to break the deadlock, > such as shutting down the worker machines explicitly. > Below is the thread dump from the Spark UI showing the deadlock: > !DeadlockSparkTasks.png! > > I believe there was a related Jira about a similar deadlock between the same > threads, and it was resolved. > https://issues.apache.org/jira/browse/SPARK-27338 > >
[jira] [Commented] (SPARK-39282) Replace If-Else branch with bitwise operators in roundNumberOfBytesToNearestWord
[ https://issues.apache.org/jira/browse/SPARK-39282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541803#comment-17541803 ] Apache Spark commented on SPARK-39282: -- User 'zhixingheyi-tian' has created a pull request for this issue: https://github.com/apache/spark/pull/36659 > Replace If-Else branch with bitwise operators in > roundNumberOfBytesToNearestWord > > > Key: SPARK-39282 > URL: https://issues.apache.org/jira/browse/SPARK-39282 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1 >Reporter: xiangxiang Shen >Priority: Major > > [https://github.com/apache/spark/blob/a6dd6076d708713d11585bf7f3401d522ea48822/common/unsafe/src/main/java/org/apache/spark/unsafe/array/ByteArrayMethods.java#L40-L47] > > Here, bitwise operators can be used to avoid the if-else branch and > improve computation performance.
[jira] [Assigned] (SPARK-39282) Replace If-Else branch with bitwise operators in roundNumberOfBytesToNearestWord
[ https://issues.apache.org/jira/browse/SPARK-39282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39282: Assignee: Apache Spark > Replace If-Else branch with bitwise operators in > roundNumberOfBytesToNearestWord > > > Key: SPARK-39282 > URL: https://issues.apache.org/jira/browse/SPARK-39282 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1 >Reporter: xiangxiang Shen >Assignee: Apache Spark >Priority: Major > > [https://github.com/apache/spark/blob/a6dd6076d708713d11585bf7f3401d522ea48822/common/unsafe/src/main/java/org/apache/spark/unsafe/array/ByteArrayMethods.java#L40-L47] > > Here, bitwise operators can be used to avoid the if-else branch and > improve computation performance.
[jira] [Assigned] (SPARK-39282) Replace If-Else branch with bitwise operators in roundNumberOfBytesToNearestWord
[ https://issues.apache.org/jira/browse/SPARK-39282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39282: Assignee: (was: Apache Spark) > Replace If-Else branch with bitwise operators in > roundNumberOfBytesToNearestWord > > > Key: SPARK-39282 > URL: https://issues.apache.org/jira/browse/SPARK-39282 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1 >Reporter: xiangxiang Shen >Priority: Major > > [https://github.com/apache/spark/blob/a6dd6076d708713d11585bf7f3401d522ea48822/common/unsafe/src/main/java/org/apache/spark/unsafe/array/ByteArrayMethods.java#L40-L47] > > Here, bitwise operators can be used to avoid the if-else branch and > improve computation performance.
[jira] [Updated] (SPARK-39283) Spark tasks stuck forever due to deadlock between TaskMemoryManager and UnsafeExternalSorter
[ https://issues.apache.org/jira/browse/SPARK-39283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Pal updated SPARK-39283: Description: We are seeing this deadlock between {{TaskMemoryManager}} and {{UnsafeExternalSorter}} pretty often in our workload. Sometimes the retry succeeds, but sometimes we have to resort to hacks to break the deadlock, such as shutting down the worker machines explicitly. Below is the thread dump from the Spark UI showing the deadlock: I believe there was a related Jira about a similar deadlock between the same threads, and it was resolved. https://issues.apache.org/jira/browse/SPARK-27338 was: We are seeing this deadlock between {{TaskMemoryManager}} and {{UnsafeExternalSorter}} pretty often in our workload. Sometimes the retry succeeds, but sometimes we have to resort to hacks to break the deadlock, such as shutting down the worker machines explicitly. Below is the thread dump from the Spark UI showing the deadlock: !DeadlockSparkTasks.png! I believe there was a related Jira about a similar deadlock between the same threads, and it was resolved. https://issues.apache.org/jira/browse/SPARK-27338 > Spark tasks stuck forever due to deadlock between TaskMemoryManager and > UnsafeExternalSorter > > > Key: SPARK-39283 > URL: https://issues.apache.org/jira/browse/SPARK-39283 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Sandeep Pal >Priority: Critical > Labels: Deadlock, spark3.0 > > We are seeing this deadlock between {{TaskMemoryManager}} and > {{UnsafeExternalSorter}} pretty often in our workload. Sometimes the retry > succeeds, but sometimes we have to resort to hacks to break the deadlock, > such as shutting down the worker machines explicitly. > Below is the thread dump from the Spark UI showing the deadlock: > > I believe there was a related Jira about a similar deadlock between the same > threads, and it was resolved. > https://issues.apache.org/jira/browse/SPARK-27338 > >
[jira] [Updated] (SPARK-39283) Spark tasks stuck forever due to deadlock between TaskMemoryManager and UnsafeExternalSorter
[ https://issues.apache.org/jira/browse/SPARK-39283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Pal updated SPARK-39283: Attachment: (was: DeadlockSparkTasks.png) > Spark tasks stuck forever due to deadlock between TaskMemoryManager and > UnsafeExternalSorter > > > Key: SPARK-39283 > URL: https://issues.apache.org/jira/browse/SPARK-39283 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Sandeep Pal >Priority: Critical > Labels: Deadlock, spark3.0 > > We are seeing this deadlock between {{TaskMemoryManager}} and > {{UnsafeExternalSorter}} pretty often in our workload. Sometimes the retry > succeeds, but sometimes we have to resort to hacks to break the deadlock, > such as shutting down the worker machines explicitly. > Below is the thread dump from the Spark UI showing the deadlock: > > I believe there was a related Jira about a similar deadlock between the same > threads, and it was resolved. > https://issues.apache.org/jira/browse/SPARK-27338 > >
[jira] [Updated] (SPARK-39283) Spark tasks stuck forever due to deadlock between TaskMemoryManager and UnsafeExternalSorter
[ https://issues.apache.org/jira/browse/SPARK-39283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Pal updated SPARK-39283: Labels: Deadlock spark3.0 (was: ) > Spark tasks stuck forever due to deadlock between TaskMemoryManager and > UnsafeExternalSorter > > > Key: SPARK-39283 > URL: https://issues.apache.org/jira/browse/SPARK-39283 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Sandeep Pal >Priority: Critical > Labels: Deadlock, spark3.0 > > We are seeing this deadlock between {{TaskMemoryManager}} and > {{UnsafeExternalSorter}} pretty often in our workload. Sometimes the retry > succeeds, but sometimes we have to resort to hacks to break the deadlock, > such as shutting down the worker machines explicitly. > Below is the thread dump from the Spark UI showing the deadlock: > !DeadlockSparkTasks.png! > I believe there was a related Jira about a similar deadlock between the same > threads, and it was resolved. > https://issues.apache.org/jira/browse/SPARK-27338 > >
[jira] [Updated] (SPARK-39283) Spark tasks stuck forever due to deadlock between TaskMemoryManager and UnsafeExternalSorter
[ https://issues.apache.org/jira/browse/SPARK-39283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Pal updated SPARK-39283: Description: We are seeing this deadlock between {{TaskMemoryManager}} and {{UnsafeExternalSorter}} pretty often in our workload. Sometimes the retry succeeds, but sometimes we have to resort to hacks to break the deadlock, such as shutting down the worker machines explicitly. Below is the thread dump from the Spark UI showing the deadlock: !DeadlockSparkTasks.png! I believe there was a related Jira about a similar deadlock between the same threads, and it was resolved. https://issues.apache.org/jira/browse/SPARK-27338 was: We are seeing this deadlock between {{TaskMemoryManager}} and {{UnsafeExternalSorter}} pretty often in our workload. Sometimes the retry succeeds, but sometimes we have to resort to hacks to break the deadlock, such as shutting down the worker machines explicitly. Below is the thread dump from the Spark UI showing the deadlock: !image-2022-05-24-20-03-35-287.png! I believe there was a related Jira about a similar deadlock between the same threads, and it was resolved. https://issues.apache.org/jira/browse/SPARK-27338 > Spark tasks stuck forever due to deadlock between TaskMemoryManager and > UnsafeExternalSorter > > > Key: SPARK-39283 > URL: https://issues.apache.org/jira/browse/SPARK-39283 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Sandeep Pal >Priority: Critical > Attachments: DeadlockSparkTasks.png > > > We are seeing this deadlock between {{TaskMemoryManager}} and > {{UnsafeExternalSorter}} pretty often in our workload. Sometimes the retry > succeeds, but sometimes we have to resort to hacks to break the deadlock, > such as shutting down the worker machines explicitly. > Below is the thread dump from the Spark UI showing the deadlock: > !DeadlockSparkTasks.png! > I believe there was a related Jira about a similar deadlock between the same > threads, and it was resolved. > https://issues.apache.org/jira/browse/SPARK-27338 > >
[jira] [Created] (SPARK-39283) Spark tasks stuck forever due to deadlock between TaskMemoryManager and UnsafeExternalSorter
Sandeep Pal created SPARK-39283: --- Summary: Spark tasks stuck forever due to deadlock between TaskMemoryManager and UnsafeExternalSorter Key: SPARK-39283 URL: https://issues.apache.org/jira/browse/SPARK-39283 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0 Reporter: Sandeep Pal Attachments: DeadlockSparkTasks.png We are seeing this deadlock between {{TaskMemoryManager}} and {{UnsafeExternalSorter}} pretty often in our workload. Sometimes the retry succeeds, but sometimes we have to resort to hacks to break the deadlock, such as shutting down the worker machines explicitly. Below is the thread dump from the Spark UI showing the deadlock: !image-2022-05-24-20-03-35-287.png! I believe there was a related Jira about a similar deadlock between the same threads, and it was resolved. https://issues.apache.org/jira/browse/SPARK-27338
[jira] [Updated] (SPARK-39283) Spark tasks stuck forever due to deadlock between TaskMemoryManager and UnsafeExternalSorter
[ https://issues.apache.org/jira/browse/SPARK-39283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Pal updated SPARK-39283: Attachment: DeadlockSparkTasks.png > Spark tasks stuck forever due to deadlock between TaskMemoryManager and > UnsafeExternalSorter > > > Key: SPARK-39283 > URL: https://issues.apache.org/jira/browse/SPARK-39283 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Sandeep Pal >Priority: Critical > Attachments: DeadlockSparkTasks.png > > > We are seeing this deadlock between {{TaskMemoryManager}} and > {{UnsafeExternalSorter}} pretty often in our workload. Sometimes the retry > succeeds, but sometimes we have to resort to hacks to break the deadlock, > such as shutting down the worker machines explicitly. > Below is the thread dump from the Spark UI showing the deadlock: > !image-2022-05-24-20-03-35-287.png! > > I believe there was a related Jira about a similar deadlock between the same > threads, and it was resolved. > https://issues.apache.org/jira/browse/SPARK-27338 > >
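The hang reported above is the classic lock-order inversion: two threads take the same pair of monitors in opposite orders. A minimal generic sketch of the pattern and its standard remedy (a single global acquisition order) follows; the class and lock names are hypothetical stand-ins, not Spark's actual TaskMemoryManager/UnsafeExternalSorter code:

```java
import java.util.concurrent.locks.ReentrantLock;

public class LockOrdering {
    // Stand-ins for the two monitors involved in the report.
    static final ReentrantLock memoryManagerLock = new ReentrantLock();
    static final ReentrantLock sorterLock = new ReentrantLock();

    // Deadlock-prone shape (avoid): one path locks memoryManagerLock then
    // sorterLock (e.g. the manager asking a consumer to spill), while another
    // locks sorterLock then memoryManagerLock (e.g. the sorter requesting more
    // memory). Once each thread holds its first lock, neither can proceed.

    // The standard fix: every code path acquires both locks in the same order.
    static int safeSpill(int bytes) {
        memoryManagerLock.lock();
        try {
            sorterLock.lock();
            try {
                return bytes; // spill work would happen here
            } finally {
                sorterLock.unlock();
            }
        } finally {
            memoryManagerLock.unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Two concurrent callers finish because neither can hold the locks
        // in a conflicting order.
        Thread a = new Thread(() -> safeSpill(64));
        Thread b = new Thread(() -> safeSpill(128));
        a.start();
        b.start();
        a.join();
        b.join();
        System.out.println("both threads finished without deadlock");
    }
}
```

SPARK-27338, referenced in the report, fixed an earlier instance of this same inversion between the same two classes.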
[jira] [Created] (SPARK-39282) Replace If-Else branch with bitwise operators in roundNumberOfBytesToNearestWord
xiangxiang Shen created SPARK-39282: --- Summary: Replace If-Else branch with bitwise operators in roundNumberOfBytesToNearestWord Key: SPARK-39282 URL: https://issues.apache.org/jira/browse/SPARK-39282 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.1 Reporter: xiangxiang Shen [https://github.com/apache/spark/blob/a6dd6076d708713d11585bf7f3401d522ea48822/common/unsafe/src/main/java/org/apache/spark/unsafe/array/ByteArrayMethods.java#L40-L47] Here, bitwise operators can be used to avoid the if-else branch and improve computation performance.
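The proposal can be sketched as follows. This is a hedged illustration of the branch-free form, assuming the 8-byte word size used by ByteArrayMethods; it is not necessarily the exact code in the linked PR:

```java
public class RoundToWord {
    // Current shape: a remainder check with an if/else branch.
    static long roundWithBranch(long numBytes) {
        long remainder = numBytes & 0x07; // numBytes % 8
        return remainder == 0 ? numBytes : numBytes + (8 - remainder);
    }

    // Branch-free alternative: add 7, then clear the low three bits,
    // rounding up to the next multiple of 8.
    static long roundBitwise(long numBytes) {
        return (numBytes + 7) & ~7L;
    }

    public static void main(String[] args) {
        // The two forms agree on every input in a small exhaustive range.
        for (long n = 0; n <= 64; n++) {
            if (roundWithBranch(n) != roundBitwise(n)) {
                throw new AssertionError("mismatch at " + n);
            }
        }
        System.out.println("bitwise form matches branch form for 0..64");
    }
}
```

The bitwise form trades a data-dependent branch (which the CPU may mispredict) for two cheap ALU operations, which is where the claimed performance benefit comes from.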
[jira] [Created] (SPARK-39281) Speed up Timestamp type inference of legacy format in JSON/CSV data source
Gengliang Wang created SPARK-39281: -- Summary: Speed up Timestamp type inference of legacy format in JSON/CSV data source Key: SPARK-39281 URL: https://issues.apache.org/jira/browse/SPARK-39281 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Gengliang Wang
[jira] [Created] (SPARK-39280) Speed up Timestamp type inference with user-provided format in JSON/CSV data source
Gengliang Wang created SPARK-39280: -- Summary: Speed up Timestamp type inference with user-provided format in JSON/CSV data source Key: SPARK-39280 URL: https://issues.apache.org/jira/browse/SPARK-39280 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Gengliang Wang
[jira] [Updated] (SPARK-39193) Speed up Timestamp type inference of default format in JSON/CSV data source
[ https://issues.apache.org/jira/browse/SPARK-39193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-39193: --- Parent: SPARK-39279 Issue Type: Sub-task (was: Improvement) > Speed up Timestamp type inference of default format in JSON/CSV data source > - > > Key: SPARK-39193 > URL: https://issues.apache.org/jira/browse/SPARK-39193 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.3.0 > > > When reading JSON/CSV files with timestamp type inference enabled > (`.option("inferTimestamp", true)`), the Timestamp conversion throws and > catches exceptions. Since we put decent error messages in the exceptions, > creating them is not cheap; it consumes more than > 90% of the type inference time. > We can use parsing methods that return optional results instead. > Before the change, it took 166 seconds to infer the schema of a 624MB JSON > file with timestamp inference enabled. > After the change, it takes only 16 seconds.
[jira] [Updated] (SPARK-39193) Speed up Timestamp type inference of default format in JSON/CSV data source
[ https://issues.apache.org/jira/browse/SPARK-39193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-39193: --- Summary: Speed up Timestamp type inference of default format in JSON/CSV data source (was: Improve the performance of inferring Timestamp type in JSON/CSV data source) > Speed up Timestamp type inference of default format in JSON/CSV data source > - > > Key: SPARK-39193 > URL: https://issues.apache.org/jira/browse/SPARK-39193 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.3.0 > > > When reading JSON/CSV files with timestamp type inference enabled > (`.option("inferTimestamp", true)`), the Timestamp conversion throws and > catches exceptions. Since we put decent error messages in the exceptions, > creating them is not cheap; it consumes more than > 90% of the type inference time. > We can use parsing methods that return optional results instead. > Before the change, it took 166 seconds to infer the schema of a 624MB JSON > file with timestamp inference enabled. > After the change, it takes only 16 seconds.
[jira] [Created] (SPARK-39279) Speed up the schema inference of CSV/JSON data source
Gengliang Wang created SPARK-39279: -- Summary: Speed up the schema inference of CSV/JSON data source Key: SPARK-39279 URL: https://issues.apache.org/jira/browse/SPARK-39279 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Gengliang Wang In the current implementation of the CSV/JSON data sources, schema inference relies on methods that throw exceptions when fields can't be converted to certain data types. Throwing and catching exceptions can be slow. We can improve this by adding methods that return optional results instead. A good example is https://github.com/apache/spark/pull/36562, which reduces the schema inference time by 90%.
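The optional-result parsing technique behind these sub-tasks can be illustrated with the JDK's non-throwing DateTimeFormatter.parseUnresolved, which reports failure through a ParsePosition instead of constructing an exception. This is a standalone sketch of the idea, not Spark's actual implementation:

```java
import java.text.ParsePosition;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.temporal.TemporalAccessor;
import java.util.Optional;

public class TryParseTimestamp {
    static final DateTimeFormatter FMT = DateTimeFormatter.ISO_LOCAL_DATE_TIME;

    // Exception-based check: building the DateTimeParseException (message plus
    // stack trace) dominates the cost when most sampled values don't match.
    static boolean isTimestampThrowing(String s) {
        try {
            FMT.parse(s);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    // Optional-returning variant: parseUnresolved signals failure via the
    // ParsePosition, so the hot path allocates no exception at all.
    static Optional<TemporalAccessor> tryParse(String s) {
        ParsePosition pos = new ParsePosition(0);
        TemporalAccessor parsed = FMT.parseUnresolved(s, pos);
        if (parsed == null || pos.getErrorIndex() >= 0 || pos.getIndex() < s.length()) {
            return Optional.empty();
        }
        return Optional.of(parsed);
    }

    public static void main(String[] args) {
        System.out.println(tryParse("2022-05-24T20:03:35").isPresent());
        System.out.println(tryParse("not a timestamp").isPresent());
    }
}
```

During schema inference most sampled strings are not timestamps, so skipping exception construction on the common (failing) path is exactly where the reported 90% reduction comes from.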
[jira] [Resolved] (SPARK-39252) Flaky Test: pyspark.sql.tests.test_dataframe.DataFrameTests test_df_is_empty
[ https://issues.apache.org/jira/browse/SPARK-39252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-39252. -- Fix Version/s: 3.1.3 3.2.2 3.3.1 Resolution: Fixed Fixed in https://github.com/apache/spark/pull/36656 > Flaky Test: pyspark.sql.tests.test_dataframe.DataFrameTests test_df_is_empty > > > Key: SPARK-39252 > URL: https://issues.apache.org/jira/browse/SPARK-39252 > Project: Spark > Issue Type: Test > Components: PySpark >Affects Versions: 3.1.3, 3.3.0, 3.2.2 >Reporter: Hyukjin Kwon >Priority: Major > Fix For: 3.1.3, 3.2.2, 3.3.1 > > > {{test_df_is_empty}} is flaky. For example, a recent PR: > https://github.com/apache/spark/pull/36580 > https://github.com/panbingkun/spark/runs/6525997469?check_suite_focus=true > Possibly introduced from SPARK-39084 > {code} > test_df_is_empty (pyspark.sql.tests.test_dataframe.DataFrameTests) ... > [Stage 6:> (0 + 1) / > 1] > > > # > # A fatal error has been detected by the Java Runtime Environment: > # > # SIGSEGV (0xb) at pc=0x7fd84a1486ff, pid=4021, tid=0x7fd8016a2700 > # > # JRE version: OpenJDK Runtime Environment (Zulu 8.62.0.19-CA-linux64) > (8.0_332-b09) (build 1.8.0_332-b09) > # Java VM: OpenJDK 64-Bit Server VM (25.332-b09 mixed mode linux-amd64 > compressed oops) > # Problematic frame: > # J 9116 C2 > org.apache.spark.unsafe.UnsafeAlignedOffset.getSize(Ljava/lang/Object;J)I (51 > bytes) @ 0x7fd84a1486ff [0x7fd84a1486e0+0x1f] > # > # Core dump written. 
Default location: /__w/spark/spark/core or core.4021 > # > # An error report file with more information is saved as: > # /__w/spark/spark/hs_err_pid4021.log > # > # If you would like to submit a bug report, please visit: > # http://www.azul.com/support/ > # > > Exception happened during processing of request from ('127.0.0.1', 36358) > Traceback (most recent call last): > File "/usr/lib/pypy3/lib-python/3/socketserver.py", line 316, in > _handle_request_noblock > self.process_request(request, client_address) > File "/usr/lib/pypy3/lib-python/3/socketserver.py", line 347, in > process_request > self.finish_request(request, client_address) > File "/usr/lib/pypy3/lib-python/3/socketserver.py", line 360, in > finish_request > self.RequestHandlerClass(request, client_address, self) > File "/usr/lib/pypy3/lib-python/3/socketserver.py", line 720, in __init__ > self.handle() > File "/__w/spark/spark/python/pyspark/accumulators.py", line 281, in handle > poll(accum_updates) > File "/__w/spark/spark/python/pyspark/accumulators.py", line 253, in poll > if func(): > File "/__w/spark/spark/python/pyspark/accumulators.py", line 257, in > accum_updates > num_updates = read_int(self.rfile) > File "/__w/spark/spark/python/pyspark/serializers.py", line 595, in read_int > raise EOFError > {code} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39252) Flaky Test: pyspark.sql.tests.test_dataframe.DataFrameTests test_df_is_empty
[ https://issues.apache.org/jira/browse/SPARK-39252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-39252: Assignee: Ivan Sadikov > Flaky Test: pyspark.sql.tests.test_dataframe.DataFrameTests test_df_is_empty > > > Key: SPARK-39252 > URL: https://issues.apache.org/jira/browse/SPARK-39252 > Project: Spark > Issue Type: Test > Components: PySpark >Affects Versions: 3.1.3, 3.3.0, 3.2.2 >Reporter: Hyukjin Kwon >Assignee: Ivan Sadikov >Priority: Major > Fix For: 3.1.3, 3.2.2, 3.3.1 > > > {{test_df_is_empty}} is flaky. For example, a recent PR: > https://github.com/apache/spark/pull/36580 > https://github.com/panbingkun/spark/runs/6525997469?check_suite_focus=true > Possibly introduced from SPARK-39084 > {code} > test_df_is_empty (pyspark.sql.tests.test_dataframe.DataFrameTests) ... > [Stage 6:> (0 + 1) / > 1] > > > # > # A fatal error has been detected by the Java Runtime Environment: > # > # SIGSEGV (0xb) at pc=0x7fd84a1486ff, pid=4021, tid=0x7fd8016a2700 > # > # JRE version: OpenJDK Runtime Environment (Zulu 8.62.0.19-CA-linux64) > (8.0_332-b09) (build 1.8.0_332-b09) > # Java VM: OpenJDK 64-Bit Server VM (25.332-b09 mixed mode linux-amd64 > compressed oops) > # Problematic frame: > # J 9116 C2 > org.apache.spark.unsafe.UnsafeAlignedOffset.getSize(Ljava/lang/Object;J)I (51 > bytes) @ 0x7fd84a1486ff [0x7fd84a1486e0+0x1f] > # > # Core dump written. 
Default location: /__w/spark/spark/core or core.4021 > # > # An error report file with more information is saved as: > # /__w/spark/spark/hs_err_pid4021.log > # > # If you would like to submit a bug report, please visit: > # http://www.azul.com/support/ > # > > Exception happened during processing of request from ('127.0.0.1', 36358) > Traceback (most recent call last): > File "/usr/lib/pypy3/lib-python/3/socketserver.py", line 316, in > _handle_request_noblock > self.process_request(request, client_address) > File "/usr/lib/pypy3/lib-python/3/socketserver.py", line 347, in > process_request > self.finish_request(request, client_address) > File "/usr/lib/pypy3/lib-python/3/socketserver.py", line 360, in > finish_request > self.RequestHandlerClass(request, client_address, self) > File "/usr/lib/pypy3/lib-python/3/socketserver.py", line 720, in __init__ > self.handle() > File "/__w/spark/spark/python/pyspark/accumulators.py", line 281, in handle > poll(accum_updates) > File "/__w/spark/spark/python/pyspark/accumulators.py", line 253, in poll > if func(): > File "/__w/spark/spark/python/pyspark/accumulators.py", line 257, in > accum_updates > num_updates = read_int(self.rfile) > File "/__w/spark/spark/python/pyspark/serializers.py", line 595, in read_int > raise EOFError > {code} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39278) Alternative configs of Hadoop Filesystems to access break backward compatibility
[ https://issues.apache.org/jira/browse/SPARK-39278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39278: Assignee: Apache Spark > Alternative configs of Hadoop Filesystems to access break backward > compatibility > > > Key: SPARK-39278 > URL: https://issues.apache.org/jira/browse/SPARK-39278 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Manu Zhang >Assignee: Apache Spark >Priority: Minor > > Before [https://github.com/apache/spark/pull/23698,] > The precedence of configuring Hadoop Filesystems to access is > {code:java} > spark.yarn.access.hadoopFileSystems -> spark.yarn.access.namenodes{code} > Afterwards, it's > {code:java} > spark.kerberos.access.hadoopFileSystems -> spark.yarn.access.namenodes -> > spark.yarn.access.hadoopFileSystems{code} > When both spark.yarn.access.hadoopFileSystems and spark.yarn.access.namenodes > are configured with different values, the PR will break backward > compatibility and cause application failure. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39278) Alternative configs of Hadoop Filesystems to access break backward compatibility
[ https://issues.apache.org/jira/browse/SPARK-39278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39278: Assignee: (was: Apache Spark) > Alternative configs of Hadoop Filesystems to access break backward > compatibility > > > Key: SPARK-39278 > URL: https://issues.apache.org/jira/browse/SPARK-39278 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Manu Zhang >Priority: Minor > > Before [https://github.com/apache/spark/pull/23698,] > The precedence of configuring Hadoop Filesystems to access is > {code:java} > spark.yarn.access.hadoopFileSystems -> spark.yarn.access.namenodes{code} > Afterwards, it's > {code:java} > spark.kerberos.access.hadoopFileSystems -> spark.yarn.access.namenodes -> > spark.yarn.access.hadoopFileSystems{code} > When both spark.yarn.access.hadoopFileSystems and spark.yarn.access.namenodes > are configured with different values, the PR will break backward > compatibility and cause application failure. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39278) Alternative configs of Hadoop Filesystems to access break backward compatibility
[ https://issues.apache.org/jira/browse/SPARK-39278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541779#comment-17541779 ] Apache Spark commented on SPARK-39278: -- User 'manuzhang' has created a pull request for this issue: https://github.com/apache/spark/pull/36658 > Alternative configs of Hadoop Filesystems to access break backward > compatibility > > > Key: SPARK-39278 > URL: https://issues.apache.org/jira/browse/SPARK-39278 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Manu Zhang >Priority: Minor > > Before [https://github.com/apache/spark/pull/23698,] > The precedence of configuring Hadoop Filesystems to access is > {code:java} > spark.yarn.access.hadoopFileSystems -> spark.yarn.access.namenodes{code} > Afterwards, it's > {code:java} > spark.kerberos.access.hadoopFileSystems -> spark.yarn.access.namenodes -> > spark.yarn.access.hadoopFileSystems{code} > When both spark.yarn.access.hadoopFileSystems and spark.yarn.access.namenodes > are configured with different values, the PR will break backward > compatibility and cause application failure. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39278) Alternative configs of Hadoop Filesystems to access break backward compatibility
[ https://issues.apache.org/jira/browse/SPARK-39278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541781#comment-17541781 ] Apache Spark commented on SPARK-39278: -- User 'manuzhang' has created a pull request for this issue: https://github.com/apache/spark/pull/36658 > Alternative configs of Hadoop Filesystems to access break backward > compatibility > > > Key: SPARK-39278 > URL: https://issues.apache.org/jira/browse/SPARK-39278 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Manu Zhang >Priority: Minor > > Before [https://github.com/apache/spark/pull/23698,] > The precedence of configuring Hadoop Filesystems to access is > {code:java} > spark.yarn.access.hadoopFileSystems -> spark.yarn.access.namenodes{code} > Afterwards, it's > {code:java} > spark.kerberos.access.hadoopFileSystems -> spark.yarn.access.namenodes -> > spark.yarn.access.hadoopFileSystems{code} > When both spark.yarn.access.hadoopFileSystems and spark.yarn.access.namenodes > are configured with different values, the PR will break backward > compatibility and cause application failure. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39220) codegen cause NullPointException
[ https://issues.apache.org/jira/browse/SPARK-39220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-39220. -- Resolution: Cannot Reproduce > codegen cause NullPointException > > > Key: SPARK-39220 > URL: https://issues.apache.org/jira/browse/SPARK-39220 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.4.6, 3.2.1 >Reporter: chenxusheng >Priority: Major > > The following code raises NullPointException > {code:sql} > SELECT > fk4c7a8cfc, > fka54f2a73, > fk37e266f7 > FROM > be2a04fad4a24848bee641825e5b3466 > WHERE > ( > fk4c7a8cfc is not null > and fk4c7a8cfc<> '' > ) > LIMIT > 1000 > {code} > However, if so, it is normal > {code:sql} > SELECT > fk4c7a8cfc, > fka54f2a73, > fk37e266f7 > FROM > be2a04fad4a24848bee641825e5b3466 > WHERE > ( > fk4c7a8cfc is not null > and '' <> fk4c7a8cfc > ) > LIMIT > 1000 > {code} > I just put the '' in where in front. > The reason for this problem is that the data contains null values. 
> *_org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext#genEqual_* > {code:scala} > def genEqual(dataType: DataType, c1: String, c2: String): String = dataType > match { > case BinaryType => s"java.util.Arrays.equals($c1, $c2)" > case FloatType => > s"((java.lang.Float.isNaN($c1) && java.lang.Float.isNaN($c2)) || $c1 == > $c2)" > case DoubleType => > s"((java.lang.Double.isNaN($c1) && java.lang.Double.isNaN($c2)) || $c1 > == $c2)" > case dt: DataType if isPrimitiveType(dt) => s"$c1 == $c2" > case dt: DataType if dt.isInstanceOf[AtomicType] => s"$c1.equals($c2)" > case array: ArrayType => genComp(array, c1, c2) + " == 0" > case struct: StructType => genComp(struct, c1, c2) + " == 0" > case udt: UserDefinedType[_] => genEqual(udt.sqlType, c1, c2) > case NullType => "false" > case _ => > throw new IllegalArgumentException( > "cannot generate equality code for un-comparable type: " + > dataType.catalogString) > } > {code} > {code:scala} > case dt: DataType if dt.isInstanceOf[AtomicType] => s"$c1.equals($c2)" > {code} > Missing null value judgment? -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
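The hazard the reporter points at is that the generated `$c1.equals($c2)` dispatches a method on `c1`, which throws a NullPointerException when `c1` is null. A Python analogy of the failure and of a null-safe variant (the `JString` class and helper names are invented for illustration; Spark's codegen emits Java, and real codegen guards nulls in separate checks with SQL's three-valued semantics):

```python
class JString:
    """Tiny stand-in for a Java String so the NPE hazard can be shown in Python."""
    def __init__(self, s):
        self.s = s
    def equals(self, other):
        return isinstance(other, JString) and self.s == other.s

def gen_equal_unsafe(c1, c2):
    # What the quoted codegen does: c1.equals(c2).
    # Raises (AttributeError here, NullPointerException in Java) if c1 is null.
    return c1.equals(c2)

def gen_equal_safe(c1, c2):
    # Null-safe comparison in the spirit of java.util.Objects.equals(c1, c2).
    # (SQL's `null = null -> null` semantics are handled by separate null
    # checks in real generated code; this just avoids the crash.)
    if c1 is None or c2 is None:
        return c1 is None and c2 is None
    return c1.equals(c2)

crashed = False
try:
    gen_equal_unsafe(None, JString(""))
except AttributeError:
    crashed = True
assert crashed
assert gen_equal_safe(None, JString("")) is False
assert gen_equal_safe(JString("a"), JString("a")) is True
```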
[jira] [Commented] (SPARK-39274) AttributeError: 'datetime.time' object has no attribute 'timetuple'
[ https://issues.apache.org/jira/browse/SPARK-39274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541773#comment-17541773 ] Hyukjin Kwon commented on SPARK-39274: -- We don't currently have a corresponding mapping for datetime.time in PySpark <> Spark SQL. > AttributeError: 'datetime.time' object has no attribute 'timetuple' > --- > > Key: SPARK-39274 > URL: https://issues.apache.org/jira/browse/SPARK-39274 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.1 >Reporter: Andreas Fried >Priority: Major > > > {code:java} > import pandas as pd > import datetime > pdf = pd.DataFrame({'naive': [datetime.time(11, 30, 33, 0)]}) > print(pdf) > print(pdf.info()) > from pyspark.sql import SparkSession > spark = SparkSession.builder.getOrCreate() > sp_df2 = spark.createDataFrame(pdf) > sp_df2.show(10) > {code} > > throws this error: > > {code:java} > naive > 0 11:30:33 > > RangeIndex: 1 entries, 0 to 0 > Data columns (total 1 columns): > # Column Non-Null Count Dtype > --- -- -- - > 0 naive 1 non-null object > dtypes: object(1) > memory usage: 136.0+ bytes > None > --- > AttributeErrorTraceback (most recent call last) > /usr/local/share/jupyter/kernels/python39/scripts/launch_ipykernel.py in > > 10 spark = SparkSession.builder.getOrCreate() > 11 > ---> 12 sp_df2 = spark.createDataFrame(pdf) > 13 sp_df2.show(10) > /opt/ibm/spark/python/lib/pyspark.zip/pyspark/sql/session.py in > createDataFrame(self, data, schema, samplingRatio, verifySchema) > 671 if has_pandas and isinstance(data, pandas.DataFrame): > 672 # Create a DataFrame from pandas DataFrame. 
> --> 673 return super(SparkSession, self).createDataFrame( > 674 data, schema, samplingRatio, verifySchema) > 675 return self._create_dataframe(data, schema, samplingRatio, > verifySchema) > /opt/ibm/spark/python/lib/pyspark.zip/pyspark/sql/pandas/conversion.py in > createDataFrame(self, data, schema, samplingRatio, verifySchema) > 338 raise > 339 data = self._convert_from_pandas(data, schema, timezone) > --> 340 return self._create_dataframe(data, schema, samplingRatio, > verifySchema) > 341 > 342 def _convert_from_pandas(self, pdf, schema, timezone): > /opt/ibm/spark/python/lib/pyspark.zip/pyspark/sql/session.py in > _create_dataframe(self, data, schema, samplingRatio, verifySchema) > 698 rdd, schema = self._createFromRDD(data.map(prepare), > schema, samplingRatio) > 699 else: > --> 700 rdd, schema = self._createFromLocal(map(prepare, data), > schema) > 701 jrdd = > self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd()) > 702 jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), > schema.json()) > /opt/ibm/spark/python/lib/pyspark.zip/pyspark/sql/session.py in > _createFromLocal(self, data, schema) > 523 > 524 # convert python objects to sql data > --> 525 data = [schema.toInternal(row) for row in data] > 526 return self._sc.parallelize(data), schema > 527 > /opt/ibm/spark/python/lib/pyspark.zip/pyspark/sql/session.py in (.0) > 523 > 524 # convert python objects to sql data > --> 525 data = [schema.toInternal(row) for row in data] > 526 return self._sc.parallelize(data), schema > 527 > /opt/ibm/spark/python/lib/pyspark.zip/pyspark/sql/types.py in > toInternal(self, obj) > 624 for n, f, c in zip(self.names, > self.fields, self._needConversion)) > 625 elif isinstance(obj, (tuple, list)): > --> 626 return tuple(f.toInternal(v) if c else v > 627 for f, v, c in zip(self.fields, obj, > self._needConversion)) > 628 elif hasattr(obj, "__dict__"): > /opt/ibm/spark/python/lib/pyspark.zip/pyspark/sql/types.py in (.0) > 624 for n, f, c in zip(self.names, > 
self.fields, self._needConversion)) > 625 elif isinstance(obj, (tuple, list)): > --> 626 return tuple(f.toInternal(v) if c else v > 627 for f, v, c in zip(self.fields, obj, > self._needConversion)) > 628 elif hasattr(obj, "__dict__"): > /opt/ibm/spark/python/lib/pyspark.zip/pyspark/sql/types.py in > toInternal(self, obj) > 449 > 450 def toInternal(self, obj): > --> 451 return self.dataType.toInternal(obj) > 452 > 453 def fromInternal(self, obj)
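Since PySpark has no Spark SQL type corresponding to `datetime.time`, one practical workaround is to convert such columns to ISO-8601 strings (or combine them with a date into full timestamps) before calling `createDataFrame`. A stdlib-only sketch of the conversion step (the pandas/Spark calls are omitted so the snippet stays self-contained):

```python
from datetime import time

def time_to_iso(value):
    """Convert a datetime.time to an ISO-8601 string Spark can store as StringType.

    Non-time values (including None) pass through unchanged.
    """
    if isinstance(value, time):
        return value.isoformat()
    return value

# Applied to the column from the report before handing it to createDataFrame:
rows = [time(11, 30, 33, 0), None]
converted = [time_to_iso(v) for v in rows]
assert converted == ["11:30:33", None]
```

In the reported script this would be `pdf['naive'] = pdf['naive'].map(time_to_iso)` just before `spark.createDataFrame(pdf)`.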
[jira] [Updated] (SPARK-39278) Alternative configs of Hadoop Filesystems to access break backward compatibility
[ https://issues.apache.org/jira/browse/SPARK-39278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manu Zhang updated SPARK-39278: --- Description: Before [https://github.com/apache/spark/pull/23698,] The precedence of configuring Hadoop Filesystems to access is {code:java} spark.yarn.access.hadoopFileSystems -> spark.yarn.access.namenodes{code} Afterwards, it's {code:java} spark.kerberos.access.hadoopFileSystems -> spark.yarn.access.namenodes -> spark.yarn.access.hadoopFileSystems{code} When both spark.yarn.access.hadoopFileSystems and spark.yarn.access.namenodes are configured with different values, the PR will break backward compatibility and cause application failure. was: Before [https://github.com/apache/spark/pull/23698,] The precedence of configuring Hadoop Filesystems to access is {code:java} spark.yarn.access.hadoopFileSystems -> spark.yarn.access.namenodes{code} Afterwards, it's {code:java} spark.kerberos.access.hadoopFileSystems -> spark.yarn.access.namenodes -> spark.yarn.access.hadoopFileSystems{code} When both spark.yarn.access.hadoopFileSystems and spark.yarn.access.namenodes are configured with different values, the PR breaks backward compatibility and cause application failure. 
> Alternative configs of Hadoop Filesystems to access break backward > compatibility > > > Key: SPARK-39278 > URL: https://issues.apache.org/jira/browse/SPARK-39278 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Manu Zhang >Priority: Minor > > Before [https://github.com/apache/spark/pull/23698,] > The precedence of configuring Hadoop Filesystems to access is > {code:java} > spark.yarn.access.hadoopFileSystems -> spark.yarn.access.namenodes{code} > Afterwards, it's > {code:java} > spark.kerberos.access.hadoopFileSystems -> spark.yarn.access.namenodes -> > spark.yarn.access.hadoopFileSystems{code} > When both spark.yarn.access.hadoopFileSystems and spark.yarn.access.namenodes > are configured with different values, the PR will break backward > compatibility and cause application failure. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39278) Alternative configs of Hadoop Filesystems to access break backward compatibility
Manu Zhang created SPARK-39278: -- Summary: Alternative configs of Hadoop Filesystems to access break backward compatibility Key: SPARK-39278 URL: https://issues.apache.org/jira/browse/SPARK-39278 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.3.0 Reporter: Manu Zhang Before [https://github.com/apache/spark/pull/23698,] The precedence of configuring Hadoop Filesystems to access is {code:java} spark.yarn.access.hadoopFileSystems -> spark.yarn.access.namenodes{code} Afterwards, it's {code:java} spark.kerberos.access.hadoopFileSystems -> spark.yarn.access.namenodes -> spark.yarn.access.hadoopFileSystems{code} When both spark.yarn.access.hadoopFileSystems and spark.yarn.access.namenodes are configured with different values, the PR breaks backward compatibility and cause application failure. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
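The precedence chains in SPARK-39278 amount to "use the first of these keys that is set," and the compatibility break is a reordering of that list. A minimal sketch (the `resolve_filesystems` helper is hypothetical; the real logic lives in Spark's config fallback machinery):

```python
# Keys in the post-PR-23698 precedence order described above.
PRECEDENCE = [
    "spark.kerberos.access.hadoopFileSystems",
    "spark.yarn.access.namenodes",
    "spark.yarn.access.hadoopFileSystems",
]

def resolve_filesystems(conf):
    """Return the value of the highest-precedence key present in conf, else None."""
    for key in PRECEDENCE:
        if key in conf:
            return conf[key]
    return None

# The compatibility break: before the PR, spark.yarn.access.hadoopFileSystems
# won over spark.yarn.access.namenodes; afterwards, namenodes wins when both
# are set, so an application relying on the old ordering picks up a
# different (possibly stale) value.
conf = {
    "spark.yarn.access.hadoopFileSystems": "hdfs://new-cluster",
    "spark.yarn.access.namenodes": "hdfs://old-cluster",
}
assert resolve_filesystems(conf) == "hdfs://old-cluster"
```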
[jira] [Assigned] (SPARK-39277) Make Optimizer extends SQLConfHelper
[ https://issues.apache.org/jira/browse/SPARK-39277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39277: Assignee: (was: Apache Spark) > Make Optimizer extends SQLConfHelper > > > Key: SPARK-39277 > URL: https://issues.apache.org/jira/browse/SPARK-39277 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39277) Make Optimizer extends SQLConfHelper
[ https://issues.apache.org/jira/browse/SPARK-39277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39277: Assignee: Apache Spark > Make Optimizer extends SQLConfHelper > > > Key: SPARK-39277 > URL: https://issues.apache.org/jira/browse/SPARK-39277 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39277) Make Optimizer extends SQLConfHelper
[ https://issues.apache.org/jira/browse/SPARK-39277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541770#comment-17541770 ] Apache Spark commented on SPARK-39277: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/36657 > Make Optimizer extends SQLConfHelper > > > Key: SPARK-39277 > URL: https://issues.apache.org/jira/browse/SPARK-39277 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39277) Make Optimizer extends SQLConfHelper
Yuming Wang created SPARK-39277: --- Summary: Make Optimizer extends SQLConfHelper Key: SPARK-39277 URL: https://issues.apache.org/jira/browse/SPARK-39277 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39273) Make PandasOnSparkTestCase inherit ReusedSQLTestCase
[ https://issues.apache.org/jira/browse/SPARK-39273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-39273. -- Fix Version/s: 3.3.0 3.2.2 Resolution: Fixed Issue resolved by pull request 36652 [https://github.com/apache/spark/pull/36652] > Make PandasOnSparkTestCase inherit ReusedSQLTestCase > > > Key: SPARK-39273 > URL: https://issues.apache.org/jira/browse/SPARK-39273 > Project: Spark > Issue Type: Test > Components: Pandas API on Spark, Tests >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.3.0, 3.2.2 > > > PandasOnSparkTestCase has some legacy codes e.g., not stopping in stop > {{tearDownClass}}. We don't need such logic anymore. That logic was when the > codes are in Koalas repo that runs the tests sequentially. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39273) Make PandasOnSparkTestCase inherit ReusedSQLTestCase
[ https://issues.apache.org/jira/browse/SPARK-39273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-39273: Assignee: Hyukjin Kwon > Make PandasOnSparkTestCase inherit ReusedSQLTestCase > > > Key: SPARK-39273 > URL: https://issues.apache.org/jira/browse/SPARK-39273 > Project: Spark > Issue Type: Test > Components: Pandas API on Spark, Tests >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > PandasOnSparkTestCase has some legacy codes e.g., not stopping in stop > {{tearDownClass}}. We don't need such logic anymore. That logic was when the > codes are in Koalas repo that runs the tests sequentially. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39053) test_multi_index_dtypes failed due to index mismatch
[ https://issues.apache.org/jira/browse/SPARK-39053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-39053. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 36391 [https://github.com/apache/spark/pull/36391] > test_multi_index_dtypes failed due to index mismatch > > > Key: SPARK-39053 > URL: https://issues.apache.org/jira/browse/SPARK-39053 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Assignee: Yikun Jiang >Priority: Major > Fix For: 3.4.0 > > > {code:java} > DataFrameTest.test_multi_index_dtypesSeries.index are different > Series.index classes are different > [left]: MultiIndex([('zero', 'first'), > ( 'one', 'second')], >) > [right]: Index([('zero', 'first'), ('one', 'second')], dtype='object') > Left: > zero first int64 > one secondobject > dtype: object > object > Right: > (zero, first) int64 > (one, second)object > dtype: object > object {code} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39053) test_multi_index_dtypes failed due to index mismatch
[ https://issues.apache.org/jira/browse/SPARK-39053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-39053: Assignee: Yikun Jiang > test_multi_index_dtypes failed due to index mismatch > > > Key: SPARK-39053 > URL: https://issues.apache.org/jira/browse/SPARK-39053 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Assignee: Yikun Jiang >Priority: Major > > {code:java} > DataFrameTest.test_multi_index_dtypesSeries.index are different > Series.index classes are different > [left]: MultiIndex([('zero', 'first'), > ( 'one', 'second')], >) > [right]: Index([('zero', 'first'), ('one', 'second')], dtype='object') > Left: > zero first int64 > one secondobject > dtype: object > object > Right: > (zero, first) int64 > (one, second)object > dtype: object > object {code} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39252) Flaky Test: pyspark.sql.tests.test_dataframe.DataFrameTests test_df_is_empty
[ https://issues.apache.org/jira/browse/SPARK-39252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39252: Assignee: Apache Spark > Flaky Test: pyspark.sql.tests.test_dataframe.DataFrameTests test_df_is_empty > > > Key: SPARK-39252 > URL: https://issues.apache.org/jira/browse/SPARK-39252 > Project: Spark > Issue Type: Test > Components: PySpark >Affects Versions: 3.1.3, 3.3.0, 3.2.2 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > {{test_df_is_empty}} is flaky. For example, a recent PR: > https://github.com/apache/spark/pull/36580 > https://github.com/panbingkun/spark/runs/6525997469?check_suite_focus=true > Possibly introduced from SPARK-39084 > {code} > test_df_is_empty (pyspark.sql.tests.test_dataframe.DataFrameTests) ... > [Stage 6:> (0 + 1) / > 1] > > > # > # A fatal error has been detected by the Java Runtime Environment: > # > # SIGSEGV (0xb) at pc=0x7fd84a1486ff, pid=4021, tid=0x7fd8016a2700 > # > # JRE version: OpenJDK Runtime Environment (Zulu 8.62.0.19-CA-linux64) > (8.0_332-b09) (build 1.8.0_332-b09) > # Java VM: OpenJDK 64-Bit Server VM (25.332-b09 mixed mode linux-amd64 > compressed oops) > # Problematic frame: > # J 9116 C2 > org.apache.spark.unsafe.UnsafeAlignedOffset.getSize(Ljava/lang/Object;J)I (51 > bytes) @ 0x7fd84a1486ff [0x7fd84a1486e0+0x1f] > # > # Core dump written. 
Default location: /__w/spark/spark/core or core.4021 > # > # An error report file with more information is saved as: > # /__w/spark/spark/hs_err_pid4021.log > # > # If you would like to submit a bug report, please visit: > # http://www.azul.com/support/ > # > > Exception happened during processing of request from ('127.0.0.1', 36358) > Traceback (most recent call last): > File "/usr/lib/pypy3/lib-python/3/socketserver.py", line 316, in > _handle_request_noblock > self.process_request(request, client_address) > File "/usr/lib/pypy3/lib-python/3/socketserver.py", line 347, in > process_request > self.finish_request(request, client_address) > File "/usr/lib/pypy3/lib-python/3/socketserver.py", line 360, in > finish_request > self.RequestHandlerClass(request, client_address, self) > File "/usr/lib/pypy3/lib-python/3/socketserver.py", line 720, in __init__ > self.handle() > File "/__w/spark/spark/python/pyspark/accumulators.py", line 281, in handle > poll(accum_updates) > File "/__w/spark/spark/python/pyspark/accumulators.py", line 253, in poll > if func(): > File "/__w/spark/spark/python/pyspark/accumulators.py", line 257, in > accum_updates > num_updates = read_int(self.rfile) > File "/__w/spark/spark/python/pyspark/serializers.py", line 595, in read_int > raise EOFError > {code} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39252) Flaky Test: pyspark.sql.tests.test_dataframe.DataFrameTests test_df_is_empty
[ https://issues.apache.org/jira/browse/SPARK-39252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-39252:
    Assignee: (was: Apache Spark)

> Flaky Test: pyspark.sql.tests.test_dataframe.DataFrameTests test_df_is_empty
>
>          Key: SPARK-39252
>          URL: https://issues.apache.org/jira/browse/SPARK-39252
>      Project: Spark
>   Issue Type: Test
>   Components: PySpark
> Affects Versions: 3.1.3, 3.3.0, 3.2.2
>     Reporter: Hyukjin Kwon
>     Priority: Major
>
> {{test_df_is_empty}} is flaky. For example, a recent PR:
> https://github.com/apache/spark/pull/36580
> https://github.com/panbingkun/spark/runs/6525997469?check_suite_focus=true
> Possibly introduced from SPARK-39084
> {code}
> test_df_is_empty (pyspark.sql.tests.test_dataframe.DataFrameTests) ...
> [Stage 6:>                                                          (0 + 1) / 1]
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> # SIGSEGV (0xb) at pc=0x7fd84a1486ff, pid=4021, tid=0x7fd8016a2700
> #
> # JRE version: OpenJDK Runtime Environment (Zulu 8.62.0.19-CA-linux64) (8.0_332-b09) (build 1.8.0_332-b09)
> # Java VM: OpenJDK 64-Bit Server VM (25.332-b09 mixed mode linux-amd64 compressed oops)
> # Problematic frame:
> # J 9116 C2 org.apache.spark.unsafe.UnsafeAlignedOffset.getSize(Ljava/lang/Object;J)I (51 bytes) @ 0x7fd84a1486ff [0x7fd84a1486e0+0x1f]
> #
> # Core dump written. Default location: /__w/spark/spark/core or core.4021
> #
> # An error report file with more information is saved as:
> # /__w/spark/spark/hs_err_pid4021.log
> #
> # If you would like to submit a bug report, please visit:
> # http://www.azul.com/support/
> #
>
> Exception happened during processing of request from ('127.0.0.1', 36358)
> Traceback (most recent call last):
>   File "/usr/lib/pypy3/lib-python/3/socketserver.py", line 316, in _handle_request_noblock
>     self.process_request(request, client_address)
>   File "/usr/lib/pypy3/lib-python/3/socketserver.py", line 347, in process_request
>     self.finish_request(request, client_address)
>   File "/usr/lib/pypy3/lib-python/3/socketserver.py", line 360, in finish_request
>     self.RequestHandlerClass(request, client_address, self)
>   File "/usr/lib/pypy3/lib-python/3/socketserver.py", line 720, in __init__
>     self.handle()
>   File "/__w/spark/spark/python/pyspark/accumulators.py", line 281, in handle
>     poll(accum_updates)
>   File "/__w/spark/spark/python/pyspark/accumulators.py", line 253, in poll
>     if func():
>   File "/__w/spark/spark/python/pyspark/accumulators.py", line 257, in accum_updates
>     num_updates = read_int(self.rfile)
>   File "/__w/spark/spark/python/pyspark/serializers.py", line 595, in read_int
>     raise EOFError
> {code}

--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39252) Flaky Test: pyspark.sql.tests.test_dataframe.DataFrameTests test_df_is_empty
[ https://issues.apache.org/jira/browse/SPARK-39252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541751#comment-17541751 ]

Apache Spark commented on SPARK-39252:
--------------------------------------

User 'sadikovi' has created a pull request for this issue:
https://github.com/apache/spark/pull/36656

> Flaky Test: pyspark.sql.tests.test_dataframe.DataFrameTests test_df_is_empty
>
>          Key: SPARK-39252
>          URL: https://issues.apache.org/jira/browse/SPARK-39252
>      Project: Spark
>   Issue Type: Test
>   Components: PySpark
> Affects Versions: 3.1.3, 3.3.0, 3.2.2
>     Reporter: Hyukjin Kwon
>     Priority: Major
>
> {{test_df_is_empty}} is flaky. For example, a recent PR:
> https://github.com/apache/spark/pull/36580
> https://github.com/panbingkun/spark/runs/6525997469?check_suite_focus=true
> Possibly introduced from SPARK-39084
> {code}
> test_df_is_empty (pyspark.sql.tests.test_dataframe.DataFrameTests) ...
> [Stage 6:>                                                          (0 + 1) / 1]
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> # SIGSEGV (0xb) at pc=0x7fd84a1486ff, pid=4021, tid=0x7fd8016a2700
> #
> # JRE version: OpenJDK Runtime Environment (Zulu 8.62.0.19-CA-linux64) (8.0_332-b09) (build 1.8.0_332-b09)
> # Java VM: OpenJDK 64-Bit Server VM (25.332-b09 mixed mode linux-amd64 compressed oops)
> # Problematic frame:
> # J 9116 C2 org.apache.spark.unsafe.UnsafeAlignedOffset.getSize(Ljava/lang/Object;J)I (51 bytes) @ 0x7fd84a1486ff [0x7fd84a1486e0+0x1f]
> #
> # Core dump written. Default location: /__w/spark/spark/core or core.4021
> #
> # An error report file with more information is saved as:
> # /__w/spark/spark/hs_err_pid4021.log
> #
> # If you would like to submit a bug report, please visit:
> # http://www.azul.com/support/
> #
>
> Exception happened during processing of request from ('127.0.0.1', 36358)
> Traceback (most recent call last):
>   File "/usr/lib/pypy3/lib-python/3/socketserver.py", line 316, in _handle_request_noblock
>     self.process_request(request, client_address)
>   File "/usr/lib/pypy3/lib-python/3/socketserver.py", line 347, in process_request
>     self.finish_request(request, client_address)
>   File "/usr/lib/pypy3/lib-python/3/socketserver.py", line 360, in finish_request
>     self.RequestHandlerClass(request, client_address, self)
>   File "/usr/lib/pypy3/lib-python/3/socketserver.py", line 720, in __init__
>     self.handle()
>   File "/__w/spark/spark/python/pyspark/accumulators.py", line 281, in handle
>     poll(accum_updates)
>   File "/__w/spark/spark/python/pyspark/accumulators.py", line 253, in poll
>     if func():
>   File "/__w/spark/spark/python/pyspark/accumulators.py", line 257, in accum_updates
>     num_updates = read_int(self.rfile)
>   File "/__w/spark/spark/python/pyspark/serializers.py", line 595, in read_int
>     raise EOFError
> {code}

--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39276) grouping_id() behavior changed between 3.1.x and 3.2.x
Martin Price created SPARK-39276:

             Summary: grouping_id() behavior changed between 3.1.x and 3.2.x
                 Key: SPARK-39276
                 URL: https://issues.apache.org/jira/browse/SPARK-39276
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.2.1
            Reporter: Martin Price

It appears that Spark 3.1.x respected the order of columns in the `group by` clause to determine what each bit in grouping_id() referred to. In Spark 3.2.x it appears to use the order in which columns first appear in the `grouping sets` clause. We use grouping_id() to direct different levels of aggregation to different tables, so this change in behavior broke those pipelines.

3.1.3 behavior: the grouping_id bitmaps between the two queries are the same:

{noformat}
-- Start test: Grouping sets in same order as group by
SELECT 'col1' as col1, 'col2' as col2, 'col3' as col3, grouping_id() as grouping_id, count(1) as rowCount
from values(1)
GROUP BY col1, col2, col3 GROUPING SETS ( (col1), (col2, col3) )

+----+----+----+-----------+--------+
|col1|col2|col3|grouping_id|rowCount|
+----+----+----+-----------+--------+
|col1|null|null|          3|       1|
|col1|col2|col3|          4|       1|
+----+----+----+-----------+--------+

Grouping bitmap and associated dimensions: 3 col1
Grouping bitmap and associated dimensions: 4 col2, col3
End test: Grouping sets in same order as group by

-- Start test: Grouping sets in different order as group by
SELECT 'col1' as col1, 'col2' as col2, 'col3' as col3, grouping_id() as grouping_id, count(1) as rowCount
from values(1)
GROUP BY col1, col2, col3 GROUPING SETS ( (col2, col3), (col1) )

+----+----+----+-----------+--------+
|col1|col2|col3|grouping_id|rowCount|
+----+----+----+-----------+--------+
|col1|null|null|          3|       1|
|col1|col2|col3|          4|       1|
+----+----+----+-----------+--------+

Grouping bitmap and associated dimensions: 3 col1
Grouping bitmap and associated dimensions: 4 col2, col3
End test: Grouping sets in different order as group by
{noformat}

3.2.1 output: the grouping_id bitmap changes between the two queries based on the order in which columns appear in the grouping sets clause.

{noformat}
-- Start test: Grouping sets in same order as group by
SELECT 'col1' as col1, 'col2' as col2, 'col3' as col3, grouping_id() as grouping_id, count(1) as rowCount
from values(1)
GROUP BY col1, col2, col3 GROUPING SETS ( (col1), (col2, col3) )

+----+----+----+-----------+--------+
|col1|col2|col3|grouping_id|rowCount|
+----+----+----+-----------+--------+
|col1|null|null|          3|       1|
|col1|col2|col3|          4|       1|
+----+----+----+-----------+--------+

Grouping bitmap and associated dimensions: 3 col1
Grouping bitmap and associated dimensions: 4 col2, col3
End test: Grouping sets in same order as group by

-- Start test: Grouping sets in different order as group by
SELECT 'col1' as col1, 'col2' as col2, 'col3' as col3, grouping_id() as grouping_id, count(1) as rowCount
from values(1)
GROUP BY col1, col2, col3 GROUPING SETS ( (col2, col3), (col1) )

+----+----+----+-----------+--------+
|col1|col2|col3|grouping_id|rowCount|
+----+----+----+-----------+--------+
|col1|col2|col3|          1|       1|
|col1|null|null|          6|       1|
+----+----+----+-----------+--------+

Grouping bitmap and associated dimensions: 1 col1, col2
Grouping bitmap and associated dimensions: 6 col3
End test: Grouping sets in different order as group by
{noformat}

Project that produces the above output: https://github.com/mprice64/SparkGroupingIdBehaviorChange

--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
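The bitmap flip reported above can be reproduced outside of Spark. The sketch below is a hypothetical stand-in (not Spark's implementation): a grouping_id bit is 1 when the corresponding column is absent from the grouping set, and the result depends entirely on which column ordering assigns the bit positions.

```python
# Hypothetical sketch of grouping_id bitmap derivation (not Spark's code):
# bit i, from the most significant end of `ordered_cols`, is 1 when the
# i-th column is NOT part of the grouping set.
def grouping_id(ordered_cols, grouping_set):
    gid = 0
    for col in ordered_cols:
        gid = (gid << 1) | (0 if col in grouping_set else 1)
    return gid

# Bit positions from the GROUP BY order (the 3.1.x behavior described above)
# give a stable bitmap regardless of grouping-set order:
group_by = ["col1", "col2", "col3"]
assert grouping_id(group_by, {"col1"}) == 3          # 0b011
assert grouping_id(group_by, {"col2", "col3"}) == 4  # 0b100

# Bit positions from first appearance in GROUPING SETS ( (col2, col3), (col1) )
# reproduce the 3.2.1 output above:
appearance = ["col2", "col3", "col1"]
assert grouping_id(appearance, {"col2", "col3"}) == 1  # 0b001
assert grouping_id(appearance, {"col1"}) == 6          # 0b110
```

This makes the reported pipeline breakage concrete: the same grouping set maps to a different integer once the bit-position ordering changes.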
[jira] [Updated] (SPARK-39048) Refactor `GroupBy._reduce_for_stat_function` on accepted data types
[ https://issues.apache.org/jira/browse/SPARK-39048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-39048: - Parent: SPARK-39076 Issue Type: Sub-task (was: Improvement) > Refactor `GroupBy._reduce_for_stat_function` on accepted data types > > > Key: SPARK-39048 > URL: https://issues.apache.org/jira/browse/SPARK-39048 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > > `Groupby._reduce_for_stat_function` is a common helper function leveraged by > multiple statistical functions of GroupBy objects. > It defines parameters `only_numeric` and `bool_as_numeric` to control > accepted Spark types. > To be consistent with pandas API, we may also have to introduce > `str_as_numeric` for `sum` for example. > Instead of introducing parameters designated for each Spark type, the PR is > proposed to introduce a parameter `accepted_spark_types` to specify accepted > types of Spark columns to be aggregated. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
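The refactor described in SPARK-39048 replaces per-type boolean flags with a single whitelist parameter. The helper below is a hypothetical sketch of that shape (the names `reduce_for_stat` and `accepted_types` are illustrative, not the pyspark.pandas API):

```python
# Hypothetical sketch: instead of flags like only_numeric/bool_as_numeric,
# the helper takes an explicit whitelist of Spark type names and keeps only
# columns whose type is in it (None means "accept everything").
def reduce_for_stat(columns, accepted_types=None):
    """columns: list of (name, spark_type_name) pairs."""
    if accepted_types is None:
        return [name for name, _ in columns]
    return [name for name, t in columns if t in accepted_types]

cols = [("price", "double"), ("flag", "boolean"), ("label", "string")]
assert reduce_for_stat(cols, accepted_types={"double"}) == ["price"]
# Adding "string" support for sum() then needs no new flag, just a wider set:
assert reduce_for_stat(cols, accepted_types={"double", "string"}) == ["price", "label"]
```

The design point is that each new accepted type extends a data value rather than the function signature.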
[jira] [Updated] (SPARK-38880) Implement `numeric_only` parameter of `GroupBy.max/min`
[ https://issues.apache.org/jira/browse/SPARK-38880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-38880: - Parent: SPARK-39076 Issue Type: Sub-task (was: Improvement) > Implement `numeric_only` parameter of `GroupBy.max/min` > --- > > Key: SPARK-38880 > URL: https://issues.apache.org/jira/browse/SPARK-38880 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.4.0 > > > Implement `numeric_only` parameter of `GroupBy.max/min` -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39000) Convert bools to ints in basic statistical functions of GroupBy objects
[ https://issues.apache.org/jira/browse/SPARK-39000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-39000: - Parent: SPARK-39076 Issue Type: Sub-task (was: Improvement) > Convert bools to ints in basic statistical functions of GroupBy objects > --- > > Key: SPARK-39000 > URL: https://issues.apache.org/jira/browse/SPARK-39000 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.4.0 > > > Convert bools to ints in basic statistical functions of GroupBy objects -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39227) Reach parity with pandas boolean cast
[ https://issues.apache.org/jira/browse/SPARK-39227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-39227: - Parent: SPARK-39076 Issue Type: Sub-task (was: Improvement) > Reach parity with pandas boolean cast > - > > Key: SPARK-39227 > URL: https://issues.apache.org/jira/browse/SPARK-39227 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark, PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > There are pandas APIs that need boolean casts: all, any. > Currently, pandas-on-Spark behaves differently from pandas for these APIs on special inputs, for example, empty strings and lists, as mentioned in https://github.com/apache/spark/pull/36547#issuecomment-1129228724 by [~zero323]. > We should match pandas behavior on boolean cast. > Meanwhile, Series/Frames that contain empty strings or lists should be used as test input to increase test coverage. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
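As a quick reference for the parity target, pandas' `all`/`any` follow Python truthiness, where empty strings and empty containers are falsy. This snippet only illustrates the Python-side semantics being matched, not pandas-on-Spark's current behavior:

```python
# Python truthiness that pandas' boolean cast follows for all()/any():
# empty strings and empty containers are falsy, non-empty ones are truthy.
values = ["", [], [0], "x", 0, 1]
casts = [bool(v) for v in values]
assert casts == [False, False, True, True, False, True]

# So a column of ["", "x"] has any() == True but all() == False:
assert any(bool(v) for v in ["", "x"]) is True
assert all(bool(v) for v in ["", "x"]) is False
```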
[jira] [Updated] (SPARK-38952) Implement `numeric_only` of `GroupBy.first` and `GroupBy.last`
[ https://issues.apache.org/jira/browse/SPARK-38952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-38952: - Parent: SPARK-39076 Issue Type: Sub-task (was: Improvement) > Implement `numeric_only` of `GroupBy.first` and `GroupBy.last` > -- > > Key: SPARK-38952 > URL: https://issues.apache.org/jira/browse/SPARK-38952 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.4.0 > > > Implement `numeric_only` of `GroupBy.first` and `GroupBy.last` -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38763) Pandas API on spark Can`t apply lamda to columns.
[ https://issues.apache.org/jira/browse/SPARK-38763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xinrong Meng updated SPARK-38763:
      Parent: SPARK-39199
  Issue Type: Sub-task (was: Bug)

> Pandas API on spark Can`t apply lamda to columns.
>
>          Key: SPARK-38763
>          URL: https://issues.apache.org/jira/browse/SPARK-38763
>      Project: Spark
>   Issue Type: Sub-task
>   Components: PySpark
> Affects Versions: 3.3.0, 3.4.0
>     Reporter: Bjørn Jørgensen
>     Assignee: Xinrong Meng
>     Priority: Major
>      Fix For: 3.3.0
>
> When I use a spark master build from 08 November 21 I can use this code to rename columns
> {code:java}
> pf05 = pf05.rename(columns=lambda x: re.sub('DOFFIN_ESENDERS:', '', x))
> pf05 = pf05.rename(columns=lambda x: re.sub('FORM_SECTION:', '', x))
> pf05 = pf05.rename(columns=lambda x: re.sub('F05_2014:', '', x))
> {code}
> But now I get this error when I use this code.
> ---------------------------------------------------------------------------
> ValueError                                Traceback (most recent call last)
> Input In [5], in ()
> ----> 1 pf05 = pf05.rename(columns=lambda x: re.sub('DOFFIN_ESENDERS:', '', x))
>       2 pf05 = pf05.rename(columns=lambda x: re.sub('FORM_SECTION:', '', x))
>       3 pf05 = pf05.rename(columns=lambda x: re.sub('F05_2014:', '', x))
> File /opt/spark/python/pyspark/pandas/frame.py:10636, in DataFrame.rename(self, mapper, index, columns, axis, inplace, level, errors)
>   10632 index_mapper_fn, index_mapper_ret_dtype, index_mapper_ret_stype = gen_mapper_fn(
>   10633     index
>   10634 )
>   10635 if columns:
> > 10636     columns_mapper_fn, _, _ = gen_mapper_fn(columns)
>   10638 if not index and not columns:
>   10639     raise ValueError("Either `index` or `columns` should be provided.")
> File /opt/spark/python/pyspark/pandas/frame.py:10603, in DataFrame.rename.<locals>.gen_mapper_fn(mapper)
>   10601 elif callable(mapper):
>   10602     mapper_callable = cast(Callable, mapper)
> > 10603     return_type = cast(ScalarType, infer_return_type(mapper))
>   10604     dtype = return_type.dtype
>   10605     spark_return_type = return_type.spark_type
> File /opt/spark/python/pyspark/pandas/typedef/typehints.py:563, in infer_return_type(f)
>     560 tpe = get_type_hints(f).get("return", None)
>     562 if tpe is None:
> --> 563     raise ValueError("A return value is required for the input function")
>     565 if hasattr(tpe, "__origin__") and issubclass(tpe.__origin__, SeriesType):
>     566     tpe = tpe.__args__[0]
> ValueError: A return value is required for the input function

--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
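The traceback above ends in `infer_return_type`, which asks `typing.get_type_hints` for a `return` hint. A lambda cannot carry annotations, so the hint is missing and the `ValueError` fires. This minimal reproduction shows the gap, and the kind of annotated plain function that does expose a return type:

```python
from typing import get_type_hints

# A lambda has no annotations, so get_type_hints() finds no "return" hint --
# this is exactly the condition that raises the ValueError in the traceback.
strip_prefix = lambda x: x.replace("DOFFIN_ESENDERS:", "")
assert get_type_hints(strip_prefix).get("return") is None

# A plain function with a return annotation does expose one, which is what
# infer_return_type looks for.
def strip_prefix_typed(x: str) -> str:
    return x.replace("DOFFIN_ESENDERS:", "")

assert get_type_hints(strip_prefix_typed)["return"] is str
```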
[jira] [Updated] (SPARK-38766) Support lambda `column` parameter of `DataFrame.rename`
[ https://issues.apache.org/jira/browse/SPARK-38766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-38766: - Parent: (was: SPARK-39199) Issue Type: Improvement (was: Sub-task) > Support lambda `column` parameter of `DataFrame.rename` > --- > > Key: SPARK-38766 > URL: https://issues.apache.org/jira/browse/SPARK-38766 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > Support lambda `column` parameter of `DataFrame.rename`. > The issue was detected in https://issues.apache.org/jira/browse/SPARK-38763. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38387) Support `na_action` and Series input correspondence in `Series.map`
[ https://issues.apache.org/jira/browse/SPARK-38387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-38387: - Parent: SPARK-39199 Issue Type: Sub-task (was: New Feature) > Support `na_action` and Series input correspondence in `Series.map` > --- > > Key: SPARK-38387 > URL: https://issues.apache.org/jira/browse/SPARK-38387 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.3.0 > > > Support `na_action` and Series input correspondence in `Series.map`, in order > to reach parity to pandas API. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38766) Support lambda `column` parameter of `DataFrame.rename`
[ https://issues.apache.org/jira/browse/SPARK-38766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-38766: - Parent: SPARK-39199 Issue Type: Sub-task (was: Bug) > Support lambda `column` parameter of `DataFrame.rename` > --- > > Key: SPARK-38766 > URL: https://issues.apache.org/jira/browse/SPARK-38766 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > Support lambda `column` parameter of `DataFrame.rename`. > The issue was detected in https://issues.apache.org/jira/browse/SPARK-38763. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38400) Enable Series.rename to change index labels
[ https://issues.apache.org/jira/browse/SPARK-38400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-38400: - Parent: SPARK-39199 Issue Type: Sub-task (was: Improvement) > Enable Series.rename to change index labels > --- > > Key: SPARK-38400 > URL: https://issues.apache.org/jira/browse/SPARK-38400 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.3.0 > > > Enable Series.rename to change index labels, with function `index` input. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38491) Support `ignore_index` of `Series.sort_values`
[ https://issues.apache.org/jira/browse/SPARK-38491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-38491: - Parent: SPARK-39199 Issue Type: Sub-task (was: Improvement) > Support `ignore_index` of `Series.sort_values` > -- > > Key: SPARK-38491 > URL: https://issues.apache.org/jira/browse/SPARK-38491 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.3.0 > > > Support `ignore_index` of `Series.sort_values` -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38518) Implement `skipna` of `Series.all/Index.all` to exclude NA/null values
[ https://issues.apache.org/jira/browse/SPARK-38518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-38518: - Parent: SPARK-39199 Issue Type: Sub-task (was: Improvement) > Implement `skipna` of `Series.all/Index.all` to exclude NA/null values > -- > > Key: SPARK-38518 > URL: https://issues.apache.org/jira/browse/SPARK-38518 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.3.0 > > > Implement `skipna` of `Series.all/Index.all` to exclude NA/null values. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
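The `skipna` semantics in SPARK-38518 can be sketched in plain Python. This is a simplified stand-in using `None` for NA, not pandas' exact NA handling: with `skipna=True` missing values are dropped before the reduction; with `skipna=False` they participate (and `None`, being falsy here, makes `all` return False):

```python
# Simplified sketch of skipna for an all() reduction, with None as the
# stand-in for NA/null values.
def all_with_skipna(values, skipna=True):
    if skipna:
        values = [v for v in values if v is not None]
    return all(values)

assert all_with_skipna([True, None, True]) is True                 # NA dropped
assert all_with_skipna([True, None, True], skipna=False) is False  # NA kept, falsy
```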
[jira] [Updated] (SPARK-38441) Support string and bool `regex` in `Series.replace`
[ https://issues.apache.org/jira/browse/SPARK-38441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-38441: - Parent: SPARK-39199 Issue Type: Sub-task (was: Improvement) > Support string and bool `regex` in `Series.replace` > --- > > Key: SPARK-38441 > URL: https://issues.apache.org/jira/browse/SPARK-38441 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.4.0 > > > Support string and bool `regex` in `Series.replace` in order to reach parity > with pandas. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38479) Add `Series.duplicated` to indicate duplicate Series values.
[ https://issues.apache.org/jira/browse/SPARK-38479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-38479: - Parent: SPARK-39199 Issue Type: Sub-task (was: New Feature) > Add `Series.duplicated` to indicate duplicate Series values. > > > Key: SPARK-38479 > URL: https://issues.apache.org/jira/browse/SPARK-38479 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.4.0 > > > Add `Series.duplicated` to indicate duplicate Series values. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38576) Implement `numeric_only` parameter for `DataFrame/Series.rank` to rank numeric columns only
[ https://issues.apache.org/jira/browse/SPARK-38576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-38576: - Parent: SPARK-39199 Issue Type: Sub-task (was: Improvement) > Implement `numeric_only` parameter for `DataFrame/Series.rank` to rank > numeric columns only > --- > > Key: SPARK-38576 > URL: https://issues.apache.org/jira/browse/SPARK-38576 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.4.0 > > > Implement `numeric_only` parameter for `DataFrame/Series.rank` to rank > numeric columns only. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38608) Implement `bool_only` parameter of `DataFrame.all` and`DataFrame.any`
[ https://issues.apache.org/jira/browse/SPARK-38608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-38608: - Parent: SPARK-39199 Issue Type: Sub-task (was: Improvement) > Implement `bool_only` parameter of `DataFrame.all` and`DataFrame.any` > - > > Key: SPARK-38608 > URL: https://issues.apache.org/jira/browse/SPARK-38608 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.4.0 > > > Implement `bool_only` parameter of `DataFrame.all` and`DataFrame.any` to > include only boolean columns. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38552) Implement `keep` parameter of `frame.nlargest/nsmallest` to decide how to resolve ties
[ https://issues.apache.org/jira/browse/SPARK-38552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-38552: - Parent: SPARK-39199 Issue Type: Sub-task (was: Improvement) > Implement `keep` parameter of `frame.nlargest/nsmallest` to decide how to > resolve ties > -- > > Key: SPARK-38552 > URL: https://issues.apache.org/jira/browse/SPARK-38552 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.4.0 > > > Implement `keep` parameter of `frame.nlargest/nsmallest` to decide how to > resolve ties -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
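The `keep` parameter in SPARK-38552 decides which tied rows survive the cut at position n. A hypothetical sketch of the "first"/"last" semantics, returning row indices so the tie resolution is visible:

```python
# Hypothetical sketch of `keep` for nlargest: among equal values, "first"
# prefers earlier-occurring rows and "last" prefers later ones. A stable
# descending sort over the chosen row order implements this directly.
def nlargest_indices(values, n, keep="first"):
    indexed = list(enumerate(values))
    if keep == "last":
        indexed.reverse()
    indexed.sort(key=lambda pair: pair[1], reverse=True)  # stable sort
    return [i for i, _ in indexed[:n]]

data = [3, 5, 5, 1]  # indices 1 and 2 tie for the largest value
assert nlargest_indices(data, 1, keep="first") == [1]
assert nlargest_indices(data, 1, keep="last") == [2]
```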
[jira] [Updated] (SPARK-38686) Implement `keep` parameter of `(Index/MultiIndex).drop_duplicates`
[ https://issues.apache.org/jira/browse/SPARK-38686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-38686: - Parent: SPARK-39199 Issue Type: Sub-task (was: Improvement) > Implement `keep` parameter of `(Index/MultiIndex).drop_duplicates` > -- > > Key: SPARK-38686 > URL: https://issues.apache.org/jira/browse/SPARK-38686 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.4.0 > > > Implement `keep` parameter of `(Index/MultiIndex).drop_duplicates` -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39275) Pass SQL config values as parameters of error classes
[ https://issues.apache.org/jira/browse/SPARK-39275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541659#comment-17541659 ] Apache Spark commented on SPARK-39275: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/36653 > Pass SQL config values as parameters of error classes > - > > Key: SPARK-39275 > URL: https://issues.apache.org/jira/browse/SPARK-39275 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > Pass SQL config values as parameters of error classes. This should align them with SQL configs. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38704) Support string `inclusive` parameter of `Series.between`
[ https://issues.apache.org/jira/browse/SPARK-38704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-38704: - Parent: SPARK-39199 Issue Type: Sub-task (was: Improvement) > Support string `inclusive` parameter of `Series.between` > > > Key: SPARK-38704 > URL: https://issues.apache.org/jira/browse/SPARK-38704 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.4.0 > > > Support string `inclusive` parameter of `Series.between` -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
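The string-valued `inclusive` parameter in SPARK-38704 mirrors the pandas `Series.between` signature ("both", "neither", "left", "right"). A scalar sketch of the semantics:

```python
# Sketch of string-valued `inclusive` semantics for a between() check:
# each side of the range is closed or open depending on the flag.
def between(value, left, right, inclusive="both"):
    lo = value >= left if inclusive in ("both", "left") else value > left
    hi = value <= right if inclusive in ("both", "right") else value < right
    return lo and hi

assert between(1, 1, 5) is True                      # "both": closed on both ends
assert between(1, 1, 5, inclusive="neither") is False
assert between(5, 1, 5, inclusive="left") is False   # right end open
```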
[jira] [Commented] (SPARK-39275) Pass SQL config values as parameters of error classes
[ https://issues.apache.org/jira/browse/SPARK-39275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541660#comment-17541660 ] Apache Spark commented on SPARK-39275: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/36653 > Pass SQL config values as parameters of error classes > - > > Key: SPARK-39275 > URL: https://issues.apache.org/jira/browse/SPARK-39275 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > Pass SQL config values as parameters of error classes. This should align them with SQL configs. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39275) Pass SQL config values as parameters of error classes
[ https://issues.apache.org/jira/browse/SPARK-39275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39275: Assignee: Apache Spark (was: Max Gekk) > Pass SQL config values as parameters of error classes > - > > Key: SPARK-39275 > URL: https://issues.apache.org/jira/browse/SPARK-39275 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > Pass SQL config values as parameters of error classes. This should align them with SQL configs. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39275) Pass SQL config values as parameters of error classes
[ https://issues.apache.org/jira/browse/SPARK-39275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39275: Assignee: Max Gekk (was: Apache Spark) > Pass SQL config values as parameters of error classes > - > > Key: SPARK-39275 > URL: https://issues.apache.org/jira/browse/SPARK-39275 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > Pass SQL config values as parameters of error classes. This should align them with SQL configs. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39255) Improve error messages
[ https://issues.apache.org/jira/browse/SPARK-39255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541658#comment-17541658 ] Apache Spark commented on SPARK-39255: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/36655 > Improve error messages > -- > > Key: SPARK-39255 > URL: https://issues.apache.org/jira/browse/SPARK-39255 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.4.0 > > > Improve the following error messages: > 1. NON_PARTITION_COLUMN > 2. UNSUPPORTED_SAVE_MODE > 3. INVALID_FIELD_NAME > 4. FAILED_SET_ORIGINAL_PERMISSION_BACK > 5. NON_LITERAL_PIVOT_VALUES > 6. INVALID_SYNTAX_FOR_CAST -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39275) Pass SQL config values as parameters of error classes
Max Gekk created SPARK-39275: Summary: Pass SQL config values as parameters of error classes Key: SPARK-39275 URL: https://issues.apache.org/jira/browse/SPARK-39275 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Max Gekk Assignee: Max Gekk Pass SQL config values as parameters of error classes. This should align them with SQL configs. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38726) Support `how` parameter of `MultiIndex.dropna`
[ https://issues.apache.org/jira/browse/SPARK-38726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-38726: - Parent: SPARK-39199 Issue Type: Sub-task (was: Improvement) > Support `how` parameter of `MultiIndex.dropna` > -- > > Key: SPARK-38726 > URL: https://issues.apache.org/jira/browse/SPARK-38726 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.4.0 > > > Support `how` parameter of `MultiIndex.dropna` -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38765) Implement `inplace` parameter of `Series.clip`
[ https://issues.apache.org/jira/browse/SPARK-38765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-38765: - Parent: SPARK-39199 Issue Type: Sub-task (was: Improvement) > Implement `inplace` parameter of `Series.clip` > -- > > Key: SPARK-38765 > URL: https://issues.apache.org/jira/browse/SPARK-38765 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.4.0 > > > Implement `inplace` parameter of `Series.clip` -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38837) Implement `dropna` parameter of `SeriesGroupBy.value_counts`
[ https://issues.apache.org/jira/browse/SPARK-38837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-38837: - Parent: SPARK-39199 Issue Type: Sub-task (was: Improvement) > Implement `dropna` parameter of `SeriesGroupBy.value_counts` > > > Key: SPARK-38837 > URL: https://issues.apache.org/jira/browse/SPARK-38837 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.3.0, 3.4.0 > > > Implement `dropna` parameter of `SeriesGroupBy.value_counts` -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38863) Implement `skipna` parameter of `DataFrame.all`
[ https://issues.apache.org/jira/browse/SPARK-38863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-38863: - Parent: SPARK-39199 Issue Type: Sub-task (was: Improvement) > Implement `skipna` parameter of `DataFrame.all` > --- > > Key: SPARK-38863 > URL: https://issues.apache.org/jira/browse/SPARK-38863 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.4.0 > > > Implement `skipna` parameter of `DataFrame.all`. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38793) Support `return_indexer` parameter of `Index/MultiIndex.sort_values`
[ https://issues.apache.org/jira/browse/SPARK-38793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-38793: - Parent: SPARK-39199 Issue Type: Sub-task (was: Improvement) > Support `return_indexer` parameter of `Index/MultiIndex.sort_values` > > > Key: SPARK-38793 > URL: https://issues.apache.org/jira/browse/SPARK-38793 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.4.0 > > > Support `return_indexer` parameter of `Index/MultiIndex.sort_values` -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38903) Implement `ignore_index` of `Series.sort_values` and `Series.sort_index`
[ https://issues.apache.org/jira/browse/SPARK-38903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-38903: - Parent: SPARK-39199 Issue Type: Sub-task (was: Improvement) > Implement `ignore_index` of `Series.sort_values` and `Series.sort_index` > > > Key: SPARK-38903 > URL: https://issues.apache.org/jira/browse/SPARK-38903 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.4.0 > > > Implement `ignore_index` of `Series.sort_values` and `Series.sort_index` -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
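The `ignore_index` tickets above all implement the same pandas convention: when `ignore_index=True`, the result's index labels are discarded and replaced by a fresh 0..n-1 range. A minimal pure-Python sketch of that semantics (an illustration only, not pandas-on-Spark's actual implementation, which operates on distributed Spark columns):

```python
def sort_values(index, values, ignore_index=False):
    """Sort (label, value) pairs by value; optionally reset labels to 0..n-1."""
    pairs = sorted(zip(index, values), key=lambda p: p[1])
    if ignore_index:
        # ignore_index=True discards the original labels, as in pandas.
        return list(range(len(pairs))), [v for _, v in pairs]
    # Default: original labels travel with their values.
    return [i for i, _ in pairs], [v for _, v in pairs]

idx, vals = sort_values(["b", "a", "c"], [2, 1, 3])
# idx == ["a", "b", "c"], vals == [1, 2, 3]
idx2, vals2 = sort_values(["b", "a", "c"], [2, 1, 3], ignore_index=True)
# idx2 == [0, 1, 2], vals2 == [1, 2, 3]
```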
[jira] [Updated] (SPARK-38890) Implement `ignore_index` of `DataFrame.sort_index`.
[ https://issues.apache.org/jira/browse/SPARK-38890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-38890: - Parent: SPARK-39199 Issue Type: Sub-task (was: Improvement) > Implement `ignore_index` of `DataFrame.sort_index`. > --- > > Key: SPARK-38890 > URL: https://issues.apache.org/jira/browse/SPARK-38890 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.4.0 > > > Implement `ignore_index` of `DataFrame.sort_index`. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38938) Implement `inplace` and `columns` parameters of `Series.drop`
[ https://issues.apache.org/jira/browse/SPARK-38938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-38938: - Parent: SPARK-39199 Issue Type: Sub-task (was: Improvement) > Implement `inplace` and `columns` parameters of `Series.drop` > - > > Key: SPARK-38938 > URL: https://issues.apache.org/jira/browse/SPARK-38938 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.4.0 > > > Implement `inplace` and `columns` parameters of `Series.drop` -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38989) Implement `ignore_index` of `DataFrame/Series.sample`
[ https://issues.apache.org/jira/browse/SPARK-38989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-38989: - Parent: SPARK-39199 Issue Type: Sub-task (was: Improvement) > Implement `ignore_index` of `DataFrame/Series.sample` > - > > Key: SPARK-38989 > URL: https://issues.apache.org/jira/browse/SPARK-38989 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.4.0 > > > Implement `ignore_index` of `DataFrame/Series.sample` -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39201) Implement `ignore_index` of `DataFrame.explode` and `DataFrame.drop_duplicates`
[ https://issues.apache.org/jira/browse/SPARK-39201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-39201: - Parent: SPARK-39199 Issue Type: Sub-task (was: Improvement) > Implement `ignore_index` of `DataFrame.explode` and > `DataFrame.drop_duplicates` > --- > > Key: SPARK-39201 > URL: https://issues.apache.org/jira/browse/SPARK-39201 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.4.0 > > > Implement `ignore_index` of `DataFrame.explode` and > `DataFrame.drop_duplicates` -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39201) Implement `ignore_index` of `DataFrame.explode` and `DataFrame.drop_duplicates`
[ https://issues.apache.org/jira/browse/SPARK-39201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-39201: - Issue Type: Improvement (was: Umbrella) > Implement `ignore_index` of `DataFrame.explode` and > `DataFrame.drop_duplicates` > --- > > Key: SPARK-39201 > URL: https://issues.apache.org/jira/browse/SPARK-39201 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.4.0 > > > Implement `ignore_index` of `DataFrame.explode` and > `DataFrame.drop_duplicates` -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39104) Null Pointer Exception on unpersist call
[ https://issues.apache.org/jira/browse/SPARK-39104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-39104: - Fix Version/s: 3.3.0 (was: 3.3.1) > Null Pointer Exception on unpersist call > --- > > Key: SPARK-39104 > URL: https://issues.apache.org/jira/browse/SPARK-39104 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.1 >Reporter: Denis >Assignee: Cheng Pan >Priority: Major > Fix For: 3.3.0, 3.2.2, 3.4.0 > > > DataFrame.unpersist call fails with NPE > > {code:java} > java.lang.NullPointerException > at > org.apache.spark.sql.execution.columnar.CachedRDDBuilder.isCachedRDDLoaded(InMemoryRelation.scala:247) > at > org.apache.spark.sql.execution.columnar.CachedRDDBuilder.isCachedColumnBuffersLoaded(InMemoryRelation.scala:241) > at > org.apache.spark.sql.execution.CacheManager.$anonfun$uncacheQuery$8(CacheManager.scala:189) > at > org.apache.spark.sql.execution.CacheManager.$anonfun$uncacheQuery$8$adapted(CacheManager.scala:176) > at > scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:304) > at scala.collection.Iterator.foreach(Iterator.scala:943) > at scala.collection.Iterator.foreach$(Iterator.scala:943) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:303) > at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:297) > at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108) > at scala.collection.TraversableLike.filter(TraversableLike.scala:395) > at scala.collection.TraversableLike.filter$(TraversableLike.scala:395) > at scala.collection.AbstractTraversable.filter(Traversable.scala:108) > at > 
org.apache.spark.sql.execution.CacheManager.recacheByCondition(CacheManager.scala:219) > at > org.apache.spark.sql.execution.CacheManager.uncacheQuery(CacheManager.scala:176) > at org.apache.spark.sql.Dataset.unpersist(Dataset.scala:3220) > at org.apache.spark.sql.Dataset.unpersist(Dataset.scala:3231){code} > Looks like synchronization is required for > org.apache.spark.sql.execution.columnar.CachedRDDBuilder#isCachedColumnBuffersLoaded > > {code:java} > def isCachedColumnBuffersLoaded: Boolean = { > _cachedColumnBuffers != null && isCachedRDDLoaded > } > def isCachedRDDLoaded: Boolean = { > _cachedColumnBuffersAreLoaded || { > val bmMaster = SparkEnv.get.blockManager.master > val rddLoaded = _cachedColumnBuffers.partitions.forall { partition => > bmMaster.getBlockStatus(RDDBlockId(_cachedColumnBuffers.id, > partition.index), false) > .exists { case(_, blockStatus) => blockStatus.isCached } > } > if (rddLoaded) { > _cachedColumnBuffersAreLoaded = rddLoaded > } > rddLoaded > } > } {code} > isCachedRDDLoaded relies on the _cachedColumnBuffers != null check while it can > be changed concurrently from another thread. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
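The bug in the SPARK-39104 report is a classic check-then-use race: one thread tests `_cachedColumnBuffers != null`, another thread nulls the field, and the first thread then dereferences it. A minimal Python sketch of the pattern the reporter suggests, with the writer and the check-then-use guarded by one lock (an illustration of the idea only, not Spark's actual fix):

```python
import threading

class CachedBuilder:
    """Toy stand-in for CachedRDDBuilder: `_buffers` may be cleared by one
    thread while another thread is between the null check and the use."""

    def __init__(self, buffers):
        self._buffers = buffers
        self._lock = threading.Lock()

    def clear(self):
        # Writer takes the same lock as the reader below.
        with self._lock:
            self._buffers = None

    def is_loaded(self):
        # The null check and the dereference happen atomically under the
        # lock, so clear() can never interleave between them.
        with self._lock:
            return self._buffers is not None and len(self._buffers) > 0

b = CachedBuilder([1, 2, 3])
loaded_before = b.is_loaded()   # True
b.clear()
loaded_after = b.is_loaded()    # False, and no AttributeError/NPE analogue
```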
[jira] [Updated] (SPARK-38681) Support nested generic case classes
[ https://issues.apache.org/jira/browse/SPARK-38681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-38681: - Fix Version/s: 3.3.0 (was: 3.3.1) > Support nested generic case classes > --- > > Key: SPARK-38681 > URL: https://issues.apache.org/jira/browse/SPARK-38681 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.4.0 >Reporter: Emil Ejbyfeldt >Assignee: Emil Ejbyfeldt >Priority: Major > Fix For: 3.3.0 > > > Spark fails to derive schemas when using nested case classes with generic > parameters. > Example > {code:java} > case class GenericData[A]( > genericField: A) > {code} > Will derive a correct schema for `GenericData[Int]` but if the classes are > nested e.g. > {code:java} > case class NestedGeneric[T]( > generic: GenericData[T]) > {code} > it will fail to derive a schema for `NestedGeneric[Int]`. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39187) Remove SparkIllegalStateException
[ https://issues.apache.org/jira/browse/SPARK-39187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-39187: - Fix Version/s: 3.3.0 (was: 3.3.1) > Remove SparkIllegalStateException > - > > Key: SPARK-39187 > URL: https://issues.apache.org/jira/browse/SPARK-39187 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0, 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.3.0, 3.4.0 > > > Remove SparkIllegalStateException and replace it by IllegalStateException. > This will be consistent with other places where IllegalStateException is used. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39190) Provide query context for decimal precision overflow error when WSCG is off
[ https://issues.apache.org/jira/browse/SPARK-39190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-39190: - Fix Version/s: 3.3.0 (was: 3.3.1) > Provide query context for decimal precision overflow error when WSCG is off > --- > > Key: SPARK-39190 > URL: https://issues.apache.org/jira/browse/SPARK-39190 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39183) Upgrade Apache Xerces Java to 2.12.2
[ https://issues.apache.org/jira/browse/SPARK-39183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-39183: - Fix Version/s: 3.3.0 (was: 3.3.1) > Upgrade Apache Xerces Java to 2.12.2 > > > Key: SPARK-39183 > URL: https://issues.apache.org/jira/browse/SPARK-39183 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.2.1 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Minor > Fix For: 3.3.0, 3.2.2, 3.4.0 > > > [Infinite Loop in Apache Xerces Java > |https://github.com/advisories/GHSA-h65f-jvqw-m9fj] > There's a vulnerability within the Apache Xerces Java (XercesJ) XML parser > when handling specially crafted XML document payloads. This causes the > XercesJ XML parser to wait in an infinite loop, which may sometimes consume > system resources for a prolonged duration. This vulnerability is present within > XercesJ version 2.12.1 and the previous versions. > References > https://nvd.nist.gov/vuln/detail/CVE-2022-23437 > https://lists.apache.org/thread/6pjwm10bb69kq955fzr1n0nflnjd27dl > http://www.openwall.com/lists/oss-security/2022/01/24/3 > https://www.oracle.com/security-alerts/cpuapr2022.html -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39193) Improve the performance of inferring Timestamp type in JSON/CSV data source
[ https://issues.apache.org/jira/browse/SPARK-39193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-39193: - Fix Version/s: 3.3.0 (was: 3.3.1) > Improve the performance of inferring Timestamp type in JSON/CSV data source > --- > > Key: SPARK-39193 > URL: https://issues.apache.org/jira/browse/SPARK-39193 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.3.0 > > > When reading JSON/CSV files with inferring timestamp types > `.option("inferTimestamp", true)`, the Timestamp conversion will throw and > catch exceptions. As we are putting decent error messages in the exception, > the creation of the exceptions is actually not cheap. It consumes more than > 90% of the type inference time. > We can use the parsing methods which return optional results instead. > Before the change, it takes 166 seconds to infer a JSON file of 624MB with > inferring timestamp enabled. > After the change, it takes only 16 seconds. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
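The SPARK-39193 entry above rests on a general point: when most candidate values fail to parse, an exception-throwing parser pays the cost of constructing the exception (message, stack trace) on every non-match, while an optional-result parser rejects cheaply. A small Python sketch of the two styles (the function names and the format pre-check are illustrative assumptions, not Spark's actual parser):

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

def parse_ts_exceptions(s):
    # Exception-based: every non-timestamp value constructs and unwinds a
    # ValueError, which dominates the cost during type inference.
    try:
        return datetime.strptime(s, FMT)
    except ValueError:
        return None

def parse_ts_optional(s):
    # Optional-result style: cheap structural checks reject obvious
    # non-timestamps without ever raising.
    if len(s) != 19 or s[4] != "-" or s[7] != "-" or s[10] != " ":
        return None
    try:
        return datetime.strptime(s, FMT)
    except ValueError:
        return None

ts = parse_ts_optional("2022-05-24 14:40:00")   # a real timestamp parses
miss = parse_ts_optional("not a timestamp")      # a non-match returns None cheaply
```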
[jira] [Updated] (SPARK-39240) Source and binary releases use different tools to generate hashes for integrity
[ https://issues.apache.org/jira/browse/SPARK-39240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-39240: - Fix Version/s: 3.3.0 (was: 3.3.1) > Source and binary releases use different tools to generate hashes for > integrity > - > > Key: SPARK-39240 > URL: https://issues.apache.org/jira/browse/SPARK-39240 > Project: Spark > Issue Type: Improvement > Components: Build, Project Infra >Affects Versions: 3.2.1, 3.3.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Trivial > Fix For: 3.3.0, 3.2.2 > > > shasum for source > gpg for binary -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39216) Do not collapse projects in CombineUnions if it hasCorrelatedSubquery
[ https://issues.apache.org/jira/browse/SPARK-39216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-39216: - Fix Version/s: 3.3.0 (was: 3.3.1) > Do not collapse projects in CombineUnions if it hasCorrelatedSubquery > - > > Key: SPARK-39216 > URL: https://issues.apache.org/jira/browse/SPARK-39216 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Allison Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.3.0 > > > > SPARK-37915 added CollapseProject in rule CombineUnions, but it shouldn't > collapse projects that contain correlated subqueries since they haven't been > de-correlated yet (by PullupCorrelatedPredicates). > Here is a simple example to reproduce this issue > {code:java} > SELECT (SELECT IF(x, 1, 0)) AS a > FROM (SELECT true) t(x) > UNION > SELECT 1 AS a {code} > Exception: > {code:java} > java.lang.IllegalStateException: Couldn't find x#4 in [] {code} > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39214) Improve errors related to CAST
[ https://issues.apache.org/jira/browse/SPARK-39214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-39214: - Fix Version/s: 3.3.0 (was: 3.3.1) > Improve errors related to CAST > -- > > Key: SPARK-39214 > URL: https://issues.apache.org/jira/browse/SPARK-39214 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0, 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.3.0, 3.4.0 > > > 1. Rename the error classes INVALID_SYNTAX_FOR_CAST and CAST_CAUSES_OVERFLOW > to make more precise and clear. > 2. Improve error messages of the error classes (use quotes for SQL config and > function names). -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-38506) Push partial aggregation through join
[ https://issues.apache.org/jira/browse/SPARK-38506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541486#comment-17541486 ] Yuming Wang edited comment on SPARK-38506 at 5/24/22 2:40 PM: -- Benchmark result after [this commit|https://github.com/apache/spark/pull/36552/commits/d029bb2c7c003dff28e3af940fb06cd0b14fc6cb]: |SQL|Before(ms)|With Partial Aggregation Optimization(ms)| |v1.4 q4|72478|61261| |v1.4 q5|23235|20971| |v1.4 q10|13406|8422| |v1.4 q11|37975|24236| |v1.4 q14a|154484|52502| |v1.4 q14b|128712|57363| |v1.4 q23a|153233|58932| |v1.4 q23b|159162|78401| |v1.4 q24a|392441|84826| |v1.4 q24b|408129|76384| |v1.4 q31|14696|13766| |v1.4 q35|29005|17662| |v1.4 q37|20076|9218| |v1.4 q47|36560|31299| |v1.4 q54|12283|11306| |v1.4 q57|38530|35303| |v1.4 q69|15839|11968| |v1.4 q82|24498|13451| |v1.4 q95|69196|42653| |v2.7 q6|9129|10527| |v2.7 q10a|11778|9909| |v2.7 q11|40113|29130| |v2.7 q14|159807|38052| |v2.7 q14a|238149|128097| |v2.7 q22a|9344|5269| |v2.7 q35|36910|14705| |v2.7 q35a|32793|13303| |v2.7 q47|49689|27308| |v2.7 q57|26016|28775| |v2.7 q74|42607|19340| |modifiedQueries q10|11675|8628| |modifiedQueries q98|6877|5219| was (Author: q79969786): Benchmark result: |SQL|Before(ms)|With Partial Aggregation Optimization(ms)| |v1.4 q4|72478|61261| |v1.4 q5|23235|20971| |v1.4 q10|13406|8422| |v1.4 q11|37975|24236| |v1.4 q14a|154484|52502| |v1.4 q14b|128712|57363| |v1.4 q23a|153233|58932| |v1.4 q23b|159162|78401| |v1.4 q24a|392441|84826| |v1.4 q24b|408129|76384| |v1.4 q31|14696|13766| |v1.4 q35|29005|17662| |v1.4 q37|20076|9218| |v1.4 q47|36560|31299| |v1.4 q54|12283|11306| |v1.4 q57|38530|35303| |v1.4 q69|15839|11968| |v1.4 q82|24498|13451| |v1.4 q95|69196|42653| |v2.7 q6|9129|10527| |v2.7 q10a|11778|9909| |v2.7 q11|40113|29130| |v2.7 q14|159807|38052| |v2.7 q14a|238149|128097| |v2.7 q22a|9344|5269| |v2.7 q35|36910|14705| |v2.7 q35a|32793|13303| |v2.7 q47|49689|27308| |v2.7 q57|26016|28775| |v2.7 q74|42607|19340| 
|modifiedQueries q10|11675|8628| |modifiedQueries q98|6877|5219| > Push partial aggregation through join > - > > Key: SPARK-38506 > URL: https://issues.apache.org/jira/browse/SPARK-38506 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Priority: Major > > Please see > https://docs.teradata.com/r/Teradata-VantageTM-SQL-Request-and-Transaction-Processing/March-2019/Join-Planning-and-Optimization/Partial-GROUP-BY-Block-Optimization > for more details. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39274) AttributeError: 'datetime.time' object has no attribute 'timetuple'
Andreas Fried created SPARK-39274: - Summary: AttributeError: 'datetime.time' object has no attribute 'timetuple' Key: SPARK-39274 URL: https://issues.apache.org/jira/browse/SPARK-39274 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.2.1 Reporter: Andreas Fried {code:java} import pandas as pd import datetime pdf = pd.DataFrame({'naive': [datetime.time(11, 30, 33, 0)]}) print(pdf) print(pdf.info()) from pyspark.sql import SparkSession spark = SparkSession.builder.getOrCreate() sp_df2 = spark.createDataFrame(pdf) sp_df2.show(10) {code} throws this error: {code:java} naive 0 11:30:33 RangeIndex: 1 entries, 0 to 0 Data columns (total 1 columns): # Column Non-Null Count Dtype --- -- -- - 0 naive 1 non-null object dtypes: object(1) memory usage: 136.0+ bytes None --- AttributeErrorTraceback (most recent call last) /usr/local/share/jupyter/kernels/python39/scripts/launch_ipykernel.py in 10 spark = SparkSession.builder.getOrCreate() 11 ---> 12 sp_df2 = spark.createDataFrame(pdf) 13 sp_df2.show(10) /opt/ibm/spark/python/lib/pyspark.zip/pyspark/sql/session.py in createDataFrame(self, data, schema, samplingRatio, verifySchema) 671 if has_pandas and isinstance(data, pandas.DataFrame): 672 # Create a DataFrame from pandas DataFrame. 
--> 673 return super(SparkSession, self).createDataFrame( 674 data, schema, samplingRatio, verifySchema) 675 return self._create_dataframe(data, schema, samplingRatio, verifySchema) /opt/ibm/spark/python/lib/pyspark.zip/pyspark/sql/pandas/conversion.py in createDataFrame(self, data, schema, samplingRatio, verifySchema) 338 raise 339 data = self._convert_from_pandas(data, schema, timezone) --> 340 return self._create_dataframe(data, schema, samplingRatio, verifySchema) 341 342 def _convert_from_pandas(self, pdf, schema, timezone): /opt/ibm/spark/python/lib/pyspark.zip/pyspark/sql/session.py in _create_dataframe(self, data, schema, samplingRatio, verifySchema) 698 rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio) 699 else: --> 700 rdd, schema = self._createFromLocal(map(prepare, data), schema) 701 jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd()) 702 jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json()) /opt/ibm/spark/python/lib/pyspark.zip/pyspark/sql/session.py in _createFromLocal(self, data, schema) 523 524 # convert python objects to sql data --> 525 data = [schema.toInternal(row) for row in data] 526 return self._sc.parallelize(data), schema 527 /opt/ibm/spark/python/lib/pyspark.zip/pyspark/sql/session.py in (.0) 523 524 # convert python objects to sql data --> 525 data = [schema.toInternal(row) for row in data] 526 return self._sc.parallelize(data), schema 527 /opt/ibm/spark/python/lib/pyspark.zip/pyspark/sql/types.py in toInternal(self, obj) 624 for n, f, c in zip(self.names, self.fields, self._needConversion)) 625 elif isinstance(obj, (tuple, list)): --> 626 return tuple(f.toInternal(v) if c else v 627 for f, v, c in zip(self.fields, obj, self._needConversion)) 628 elif hasattr(obj, "__dict__"): /opt/ibm/spark/python/lib/pyspark.zip/pyspark/sql/types.py in (.0) 624 for n, f, c in zip(self.names, self.fields, self._needConversion)) 625 elif isinstance(obj, (tuple, list)): --> 626 return 
tuple(f.toInternal(v) if c else v 627 for f, v, c in zip(self.fields, obj, self._needConversion)) 628 elif hasattr(obj, "__dict__"): /opt/ibm/spark/python/lib/pyspark.zip/pyspark/sql/types.py in toInternal(self, obj) 449 450 def toInternal(self, obj): --> 451 return self.dataType.toInternal(obj) 452 453 def fromInternal(self, obj): /opt/ibm/spark/python/lib/pyspark.zip/pyspark/sql/types.py in toInternal(self, dt) 180 if dt is not None: 181 seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo --> 182else time.mktime(dt.timetuple())) 183 return int(seconds) * 100 + dt.microsecond 184 AttributeError: 'datetime.time' object has no attribute 'timetuple'{code} -- This message was sent by Atlassian Jira (v8.20.7#820007) -
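The traceback in SPARK-39274 ends in PySpark's `TimestampType.toInternal`, which assumes a datetime-like object that carries a date. `datetime.time` has no date component and therefore no `timetuple()` method, while `datetime.datetime` does, which is exactly the attribute error shown above. A small stdlib-only demonstration, including a user-side workaround (combining the time with a date is an assumption about the user's intent, not an official fix):

```python
import datetime

# datetime.time deliberately lacks timetuple(): there is no date to convert.
t = datetime.time(11, 30, 33)
has_on_time = hasattr(t, "timetuple")        # False

# datetime.datetime provides it, so the failing code path works for datetimes.
dt = datetime.datetime(2022, 5, 24, 11, 30, 33)
has_on_datetime = hasattr(dt, "timetuple")   # True

# Workaround sketch: attach the time to an explicit date before handing the
# value to Spark, so the column becomes a proper timestamp.
combined = datetime.datetime.combine(datetime.date(2022, 5, 24), t)
```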
[jira] [Resolved] (SPARK-39256) Reduce multiple file attribute calls of JavaUtils#deleteRecursivelyUsingJavaIO
[ https://issues.apache.org/jira/browse/SPARK-39256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-39256. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 36636 [https://github.com/apache/spark/pull/36636] > Reduce multiple file attribute calls of JavaUtils#deleteRecursivelyUsingJavaIO > -- > > Key: SPARK-39256 > URL: https://issues.apache.org/jira/browse/SPARK-39256 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.4.0 > > > JavaUtils#deleteRecursivelyUsingJavaIO does multiple file attribute calls; it can > use `Files.readAttributes` to merge these into one -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
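The optimization in SPARK-39256 is a general filesystem pattern: instead of issuing a separate probe per question (exists? regular file? directory? size?), fetch the attributes once and answer every question from that single result. A Python analogue of replacing several `java.io.File` calls with one `Files.readAttributes` call (illustration of the idea, not the Spark patch itself):

```python
import os
import stat
import tempfile

# Create a small temporary file to inspect.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"hello")
    path = f.name

# One stat() system call fetches all attributes at once...
st = os.stat(path)

# ...and every question is answered from the cached result, with no
# further filesystem traffic.
is_file = stat.S_ISREG(st.st_mode)   # "is it a regular file?"
is_dir = stat.S_ISDIR(st.st_mode)    # "is it a directory?"
size = st.st_size                    # "how big is it?"

os.unlink(path)
```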
[jira] [Assigned] (SPARK-39256) Reduce multiple file attribute calls of JavaUtils#deleteRecursivelyUsingJavaIO
[ https://issues.apache.org/jira/browse/SPARK-39256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-39256: Assignee: Yang Jie > Reduce multiple file attribute calls of JavaUtils#deleteRecursivelyUsingJavaIO > -- > > Key: SPARK-39256 > URL: https://issues.apache.org/jira/browse/SPARK-39256 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > > JavaUtils#deleteRecursivelyUsingJavaIO does multiple file attribute calls; it can > use `Files.readAttributes` to merge these into one -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38506) Push partial aggregation through join
[ https://issues.apache.org/jira/browse/SPARK-38506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541486#comment-17541486 ] Yuming Wang commented on SPARK-38506: - Benchmark result: |SQL|Before(ms)|With Partial Aggregation Optimization(ms)| |v1.4 q4|72478|61261| |v1.4 q5|23235|20971| |v1.4 q10|13406|8422| |v1.4 q11|37975|24236| |v1.4 q14a|154484|52502| |v1.4 q14b|128712|57363| |v1.4 q23a|153233|58932| |v1.4 q23b|159162|78401| |v1.4 q24a|392441|84826| |v1.4 q24b|408129|76384| |v1.4 q31|14696|13766| |v1.4 q35|29005|17662| |v1.4 q37|20076|9218| |v1.4 q47|36560|31299| |v1.4 q54|12283|11306| |v1.4 q57|38530|35303| |v1.4 q69|15839|11968| |v1.4 q82|24498|13451| |v1.4 q95|69196|42653| |v2.7 q6|9129|10527| |v2.7 q10a|11778|9909| |v2.7 q11|40113|29130| |v2.7 q14|159807|38052| |v2.7 q14a|238149|128097| |v2.7 q22a|9344|5269| |v2.7 q35|36910|14705| |v2.7 q35a|32793|13303| |v2.7 q47|49689|27308| |v2.7 q57|26016|28775| |v2.7 q74|42607|19340| |modifiedQueries q10|11675|8628| |modifiedQueries q98|6877|5219| > Push partial aggregation through join > - > > Key: SPARK-38506 > URL: https://issues.apache.org/jira/browse/SPARK-38506 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Priority: Major > > Please see > https://docs.teradata.com/r/Teradata-VantageTM-SQL-Request-and-Transaction-Processing/March-2019/Join-Planning-and-Optimization/Partial-GROUP-BY-Block-Optimization > for more details. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org