[jira] [Created] (SPARK-40971) Imports more from connect proto package to avoid calling `proto.` for Connect DSL

2022-10-31 Thread Rui Wang (Jira)
Rui Wang created SPARK-40971:


 Summary: Imports more from connect proto package to avoid calling 
`proto.` for Connect DSL
 Key: SPARK-40971
 URL: https://issues.apache.org/jira/browse/SPARK-40971
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Rui Wang
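
The description is empty; as illustration only, here is a hypothetical Scala sketch of the
kind of change the title describes (the `Relation` message name is an assumption about
the generated proto classes):

{code:scala}
// Before: every DSL line carries the package qualifier.
// import org.apache.spark.connect.proto
// val rel: proto.Relation = proto.Relation.newBuilder().build()

// After: import the generated message types directly, so DSL code drops `proto.`.
import org.apache.spark.connect.proto.Relation

val rel: Relation = Relation.newBuilder().build()
{code}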









[jira] [Commented] (SPARK-40971) Imports more from connect proto package to avoid calling `proto.` for Connect DSL

2022-10-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626440#comment-17626440
 ] 

Apache Spark commented on SPARK-40971:
--

User 'amaliujia' has created a pull request for this issue:
https://github.com/apache/spark/pull/38445

> Imports more from connect proto package to avoid calling `proto.` for Connect 
> DSL
> -
>
> Key: SPARK-40971
> URL: https://issues.apache.org/jira/browse/SPARK-40971
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Priority: Major
>







[jira] [Assigned] (SPARK-40971) Imports more from connect proto package to avoid calling `proto.` for Connect DSL

2022-10-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40971:


Assignee: Apache Spark

> Imports more from connect proto package to avoid calling `proto.` for Connect 
> DSL
> -
>
> Key: SPARK-40971
> URL: https://issues.apache.org/jira/browse/SPARK-40971
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Assigned] (SPARK-40971) Imports more from connect proto package to avoid calling `proto.` for Connect DSL

2022-10-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40971:


Assignee: (was: Apache Spark)

> Imports more from connect proto package to avoid calling `proto.` for Connect 
> DSL
> -
>
> Key: SPARK-40971
> URL: https://issues.apache.org/jira/browse/SPARK-40971
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Priority: Major
>







[jira] [Created] (SPARK-40972) OptimizeLocalShuffleReader causing data skew

2022-10-31 Thread Mingming Ge (Jira)
Mingming Ge created SPARK-40972:
---

 Summary: OptimizeLocalShuffleReader causing data skew
 Key: SPARK-40972
 URL: https://issues.apache.org/jira/browse/SPARK-40972
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.0
Reporter: Mingming Ge


!image-2022-10-31-15-49-36-559.png!






[jira] [Updated] (SPARK-40972) OptimizeLocalShuffleReader causing data skew

2022-10-31 Thread Mingming Ge (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mingming Ge updated SPARK-40972:

Description: !image-2022-10-31-15-50-36-435.png!  (was: 
!image-2022-10-31-15-49-36-559.png!)

> OptimizeLocalShuffleReader causing data skew
> 
>
> Key: SPARK-40972
> URL: https://issues.apache.org/jira/browse/SPARK-40972
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Mingming Ge
>Priority: Major
> Attachments: image-2022-10-31-15-50-36-435.png
>
>
> !image-2022-10-31-15-50-36-435.png!






[jira] [Updated] (SPARK-40972) OptimizeLocalShuffleReader causing data skew

2022-10-31 Thread Mingming Ge (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mingming Ge updated SPARK-40972:

Attachment: image-2022-10-31-15-50-36-435.png

> OptimizeLocalShuffleReader causing data skew
> 
>
> Key: SPARK-40972
> URL: https://issues.apache.org/jira/browse/SPARK-40972
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Mingming Ge
>Priority: Major
> Attachments: image-2022-10-31-15-50-36-435.png
>
>
> !image-2022-10-31-15-49-36-559.png!






[jira] [Commented] (SPARK-40973) Rename _LEGACY_ERROR_TEMP_0055 to UNCLOSED_BRACKETED_COMMENT

2022-10-31 Thread Haejoon Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626449#comment-17626449
 ] 

Haejoon Lee commented on SPARK-40973:
-

I'm working on it

> Rename _LEGACY_ERROR_TEMP_0055 to UNCLOSED_BRACKETED_COMMENT
> 
>
> Key: SPARK-40973
> URL: https://issues.apache.org/jira/browse/SPARK-40973
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Update the `_LEGACY_ERROR_TEMP_0055` error class to use a proper name.






[jira] [Created] (SPARK-40973) Rename _LEGACY_ERROR_TEMP_0055 to UNCLOSED_BRACKETED_COMMENT

2022-10-31 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-40973:
---

 Summary: Rename _LEGACY_ERROR_TEMP_0055 to 
UNCLOSED_BRACKETED_COMMENT
 Key: SPARK-40973
 URL: https://issues.apache.org/jira/browse/SPARK-40973
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Haejoon Lee


Update the `_LEGACY_ERROR_TEMP_0055` error class to use a proper name.
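
For context, a bracketed comment is a `/* ... */` comment; a minimal query that leaves
one unclosed and should trigger this error, assuming a live SparkSession `spark`:

{code:scala}
// The parser reaches end-of-input while still inside the bracketed comment.
// Today this is reported under _LEGACY_ERROR_TEMP_0055; after the rename it
// should surface as UNCLOSED_BRACKETED_COMMENT.
spark.sql("SELECT 1 /* this comment is never closed")
{code}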






[jira] [Updated] (SPARK-40972) OptimizeLocalShuffleReader causing data skew

2022-10-31 Thread Mingming Ge (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mingming Ge updated SPARK-40972:

Attachment: image-2022-10-31-15-51-39-430.png

> OptimizeLocalShuffleReader causing data skew
> 
>
> Key: SPARK-40972
> URL: https://issues.apache.org/jira/browse/SPARK-40972
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Mingming Ge
>Priority: Major
> Attachments: image-2022-10-31-15-50-36-435.png, 
> image-2022-10-31-15-51-39-430.png
>
>
> !image-2022-10-31-15-50-36-435.png!






[jira] [Updated] (SPARK-40972) OptimizeLocalShuffleReader causing data skew

2022-10-31 Thread Mingming Ge (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mingming Ge updated SPARK-40972:

Description: 
 

!image-2022-10-31-15-53-19-751.png!

!image-2022-10-31-15-50-36-435.png!

 

 

!image-2022-10-31-15-51-39-430.png!

  was:!image-2022-10-31-15-50-36-435.png!


> OptimizeLocalShuffleReader causing data skew
> 
>
> Key: SPARK-40972
> URL: https://issues.apache.org/jira/browse/SPARK-40972
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Mingming Ge
>Priority: Major
> Attachments: image-2022-10-31-15-50-36-435.png, 
> image-2022-10-31-15-51-39-430.png, image-2022-10-31-15-53-19-751.png
>
>
>  
> !image-2022-10-31-15-53-19-751.png!
> !image-2022-10-31-15-50-36-435.png!
>  
>  
> !image-2022-10-31-15-51-39-430.png!






[jira] [Updated] (SPARK-40972) OptimizeLocalShuffleReader causing data skew

2022-10-31 Thread Mingming Ge (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mingming Ge updated SPARK-40972:

Attachment: image-2022-10-31-15-53-19-751.png

> OptimizeLocalShuffleReader causing data skew
> 
>
> Key: SPARK-40972
> URL: https://issues.apache.org/jira/browse/SPARK-40972
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Mingming Ge
>Priority: Major
> Attachments: image-2022-10-31-15-50-36-435.png, 
> image-2022-10-31-15-51-39-430.png, image-2022-10-31-15-53-19-751.png
>
>
> !image-2022-10-31-15-50-36-435.png!






[jira] [Updated] (SPARK-40972) OptimizeLocalShuffleReader causing data skew

2022-10-31 Thread Mingming Ge (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mingming Ge updated SPARK-40972:

Description: 
Because there are many empty files in the table, OptimizeLocalShuffleReader 
reduces the shuffle partition number to 1

!image-2022-10-31-15-53-19-751.png!

!image-2022-10-31-15-50-36-435.png!

 

 

!image-2022-10-31-15-51-39-430.png!

  was:
 

!image-2022-10-31-15-53-19-751.png!

!image-2022-10-31-15-50-36-435.png!

 

 

!image-2022-10-31-15-51-39-430.png!


> OptimizeLocalShuffleReader causing data skew
> 
>
> Key: SPARK-40972
> URL: https://issues.apache.org/jira/browse/SPARK-40972
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Mingming Ge
>Priority: Major
> Attachments: image-2022-10-31-15-50-36-435.png, 
> image-2022-10-31-15-51-39-430.png, image-2022-10-31-15-53-19-751.png
>
>
> Because there are many empty files in the table, OptimizeLocalShuffleReader 
> reduces the shuffle partition number to 1
> !image-2022-10-31-15-53-19-751.png!
> !image-2022-10-31-15-50-36-435.png!
>  
>  
> !image-2022-10-31-15-51-39-430.png!






[jira] [Updated] (SPARK-40972) OptimizeLocalShuffleReader causing data skew

2022-10-31 Thread Mingming Ge (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mingming Ge updated SPARK-40972:

Attachment: image-2022-10-31-15-57-41-599.png

> OptimizeLocalShuffleReader causing data skew
> 
>
> Key: SPARK-40972
> URL: https://issues.apache.org/jira/browse/SPARK-40972
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Mingming Ge
>Priority: Major
> Attachments: image-2022-10-31-15-50-36-435.png, 
> image-2022-10-31-15-51-39-430.png, image-2022-10-31-15-53-19-751.png, 
> image-2022-10-31-15-57-41-599.png
>
>
> Because there are many empty files in the table, OptimizeLocalShuffleReader 
> reduces the shuffle partition number to 1
> !image-2022-10-31-15-53-19-751.png!
> !image-2022-10-31-15-50-36-435.png!
>  
>  
> !image-2022-10-31-15-51-39-430.png!






[jira] [Updated] (SPARK-40972) OptimizeLocalShuffleReader causing data skew

2022-10-31 Thread Mingming Ge (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mingming Ge updated SPARK-40972:

Description: 
Because there are many empty files in the table, OptimizeLocalShuffleReader 
reduces the shuffle partition number to 1

!image-2022-10-31-15-53-19-751.png!

!image-2022-10-31-15-57-41-599.png!

!image-2022-10-31-15-50-36-435.png!

 

 

!image-2022-10-31-15-51-39-430.png!

  was:
Because there are many empty files in the table, OptimizeLocalShuffleReader 
reduces the shuffle partition number to 1

!image-2022-10-31-15-53-19-751.png!

!image-2022-10-31-15-50-36-435.png!

 

 

!image-2022-10-31-15-51-39-430.png!


> OptimizeLocalShuffleReader causing data skew
> 
>
> Key: SPARK-40972
> URL: https://issues.apache.org/jira/browse/SPARK-40972
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Mingming Ge
>Priority: Major
> Attachments: image-2022-10-31-15-50-36-435.png, 
> image-2022-10-31-15-51-39-430.png, image-2022-10-31-15-53-19-751.png, 
> image-2022-10-31-15-57-41-599.png
>
>
> Because there are many empty files in the table, OptimizeLocalShuffleReader 
> reduces the shuffle partition number to 1
> !image-2022-10-31-15-53-19-751.png!
> !image-2022-10-31-15-57-41-599.png!
> !image-2022-10-31-15-50-36-435.png!
>  
>  
> !image-2022-10-31-15-51-39-430.png!






[jira] [Updated] (SPARK-40972) OptimizeLocalShuffleReader causing data skew

2022-10-31 Thread Mingming Ge (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mingming Ge updated SPARK-40972:

Description: 
Because there are many empty files in the table, OptimizeLocalShuffleReader 
reduces the shuffle partition number to 1

!image-2022-10-31-15-53-19-751.png!

!image-2022-10-31-15-57-41-599.png!

!image-2022-10-31-15-50-36-435.png!

 

 

!image-2022-10-31-15-51-39-430.png!

  was:
Because there are many empty files in the table, OptimizeLocalShuffleReader 
reduces the shuffle partition number to 1

!image-2022-10-31-15-53-19-751.png!

!image-2022-10-31-15-57-41-599.png!

!image-2022-10-31-15-50-36-435.png!

 

 

!image-2022-10-31-15-51-39-430.png!


> OptimizeLocalShuffleReader causing data skew
> 
>
> Key: SPARK-40972
> URL: https://issues.apache.org/jira/browse/SPARK-40972
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Mingming Ge
>Priority: Major
> Attachments: image-2022-10-31-15-50-36-435.png, 
> image-2022-10-31-15-51-39-430.png, image-2022-10-31-15-53-19-751.png, 
> image-2022-10-31-15-57-41-599.png
>
>
> Because there are many empty files in the table, OptimizeLocalShuffleReader 
> reduces the shuffle partition number to 1
> !image-2022-10-31-15-53-19-751.png!
> !image-2022-10-31-15-57-41-599.png!
> !image-2022-10-31-15-50-36-435.png!
>  
>  
> !image-2022-10-31-15-51-39-430.png!
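
Beyond the screenshots, the report reduces to the sentence above; a hypothetical repro
sketch of that setup follows (all table and column names are made up):

{code:scala}
// AQE with the local shuffle reader enabled, matching Spark 3.2 defaults.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.localShuffleReader.enabled", "true")

// A table whose underlying files are mostly empty, joined to another table.
// Per the report, OptimizeLocalShuffleReader plans the post-shuffle read as a
// single partition, so one task processes all the non-empty data (data skew).
val sparse = spark.table("table_with_many_empty_files")
val dim    = spark.table("dim_table")
sparse.join(dim, "key").count()
{code}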






[jira] [Commented] (SPARK-40794) Upgrade Netty from 4.1.80 to 4.1.84

2022-10-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626455#comment-17626455
 ] 

Apache Spark commented on SPARK-40794:
--

User 'clairezhuang' has created a pull request for this issue:
https://github.com/apache/spark/pull/38446

> Upgrade Netty from 4.1.80 to 4.1.84
> ---
>
> Key: SPARK-40794
> URL: https://issues.apache.org/jira/browse/SPARK-40794
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.4.0
>
>
> * https://netty.io/news/2022/09/08/4-1-81-Final.html
>  * https://netty.io/news/2022/09/13/4-1-82-Final.html
>  * https://netty.io/news/2022/10/11/4-1-84-Final.html






[jira] [Assigned] (SPARK-40973) Rename _LEGACY_ERROR_TEMP_0055 to UNCLOSED_BRACKETED_COMMENT

2022-10-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40973:


Assignee: (was: Apache Spark)

> Rename _LEGACY_ERROR_TEMP_0055 to UNCLOSED_BRACKETED_COMMENT
> 
>
> Key: SPARK-40973
> URL: https://issues.apache.org/jira/browse/SPARK-40973
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Update the `_LEGACY_ERROR_TEMP_0055` error class to use a proper name.






[jira] [Assigned] (SPARK-40973) Rename _LEGACY_ERROR_TEMP_0055 to UNCLOSED_BRACKETED_COMMENT

2022-10-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40973:


Assignee: Apache Spark

> Rename _LEGACY_ERROR_TEMP_0055 to UNCLOSED_BRACKETED_COMMENT
> 
>
> Key: SPARK-40973
> URL: https://issues.apache.org/jira/browse/SPARK-40973
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Apache Spark
>Priority: Major
>
> Update the `_LEGACY_ERROR_TEMP_0055` error class to use a proper name.






[jira] [Commented] (SPARK-40973) Rename _LEGACY_ERROR_TEMP_0055 to UNCLOSED_BRACKETED_COMMENT

2022-10-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626456#comment-17626456
 ] 

Apache Spark commented on SPARK-40973:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/38447

> Rename _LEGACY_ERROR_TEMP_0055 to UNCLOSED_BRACKETED_COMMENT
> 
>
> Key: SPARK-40973
> URL: https://issues.apache.org/jira/browse/SPARK-40973
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Update the `_LEGACY_ERROR_TEMP_0055` error class to use a proper name.






[jira] [Created] (SPARK-40974) EXPLODE function selects outer column

2022-10-31 Thread Omar Ismail (Jira)
Omar Ismail created SPARK-40974:
---

 Summary: EXPLODE function selects outer column
 Key: SPARK-40974
 URL: https://issues.apache.org/jira/browse/SPARK-40974
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.0
Reporter: Omar Ismail


I'm trying to determine if indirectly selecting an outer column is a bug or an 
intended feature of the EXPLODE function.

If I run the following SQL statement:

```
SELECT
  (SELECT FIRST(name_element_)
   FROM LATERAL VIEW EXPLODE(name) AS name_element_
  )
FROM patient
```

it fails with:

```
Accessing outer query column is not allowed in:
Generate explode(outer(name#9628))
```

However, if I add a "cheeky select" (the innermost `SELECT name AS 
name_element_` below), the SQL query is valid and runs:

```
SELECT (
  SELECT FIRST(name_element_)
  FROM (SELECT EXPLODE(name_element_) AS name_element_
        FROM (SELECT name AS name_element_))
)
FROM patient
```

From the viewpoint of the EXPLODE function, it seems like the column 
name_element_ does not come from an outer column. Is this an intended feature 
or a bug?
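
To make the two behaviors concrete, here is a self-contained Scala sketch of the report;
the `patient` schema is an assumption (a single array-of-strings column `name`), and a
live SparkSession `spark` (e.g. spark-shell) is assumed:

{code:scala}
import spark.implicits._

// Made-up schema standing in for the reporter's table: patient(name: array<string>).
Seq(Seq("Alice", "Ally")).toDF("name").createOrReplaceTempView("patient")

// 1) Rejected: the lateral view references the outer column `name`, so analysis
//    fails with "Accessing outer query column is not allowed in:
//    Generate explode(outer(name#...))".
spark.sql("""
  SELECT (SELECT FIRST(name_element_)
          FROM LATERAL VIEW EXPLODE(name) AS name_element_)
  FROM patient""").show()

// 2) Accepted: the extra inner SELECT re-exposes `name` under a local alias
//    before EXPLODE sees it, so the same logical query analyzes and runs.
spark.sql("""
  SELECT (SELECT FIRST(name_element_)
          FROM (SELECT EXPLODE(name_element_) AS name_element_
                FROM (SELECT name AS name_element_)))
  FROM patient""").show()
{code}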






[jira] [Commented] (SPARK-40974) EXPLODE function selects outer column

2022-10-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626489#comment-17626489
 ] 

Apache Spark commented on SPARK-40974:
--

User 'clairezhuang' has created a pull request for this issue:
https://github.com/apache/spark/pull/38446

> EXPLODE function selects outer column
> 
>
> Key: SPARK-40974
> URL: https://issues.apache.org/jira/browse/SPARK-40974
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Omar Ismail
>Priority: Minor
>
> I'm trying to determine if indirectly selecting an outer column is a bug or an 
> intended feature of the EXPLODE function.
>  
> If I run the following SQL statement:
> ```
> SELECT
>   (SELECT FIRST(name_element_)
>    FROM LATERAL VIEW EXPLODE(name) AS name_element_
>   )
> FROM patient
> ```
>  
> it fails with:
> ```
> Accessing outer query column is not allowed in:
> Generate explode(outer(name#9628))
> ```
>  
> However, if I add a "cheeky select" (the innermost `SELECT name AS 
> name_element_` below), the SQL query is valid and runs:
> ```
> SELECT (
>   SELECT FIRST(name_element_)
>   FROM (SELECT EXPLODE(name_element_) AS name_element_
>         FROM (SELECT name AS name_element_))
> )
> FROM patient
> ```
> From the viewpoint of the EXPLODE function, it seems like the column 
> name_element_ does not come from an outer column. Is this an intended feature 
> or a bug?






[jira] [Assigned] (SPARK-40974) EXPLODE function selects outer column

2022-10-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40974:


Assignee: (was: Apache Spark)

> EXPLODE function selects outer column
> 
>
> Key: SPARK-40974
> URL: https://issues.apache.org/jira/browse/SPARK-40974
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Omar Ismail
>Priority: Minor
>
> I'm trying to determine if indirectly selecting an outer column is a bug or an 
> intended feature of the EXPLODE function.
>  
> If I run the following SQL statement:
> ```
> SELECT
>   (SELECT FIRST(name_element_)
>    FROM LATERAL VIEW EXPLODE(name) AS name_element_
>   )
> FROM patient
> ```
>  
> it fails with:
> ```
> Accessing outer query column is not allowed in:
> Generate explode(outer(name#9628))
> ```
>  
> However, if I add a "cheeky select" (the innermost `SELECT name AS 
> name_element_` below), the SQL query is valid and runs:
> ```
> SELECT (
>   SELECT FIRST(name_element_)
>   FROM (SELECT EXPLODE(name_element_) AS name_element_
>         FROM (SELECT name AS name_element_))
> )
> FROM patient
> ```
> From the viewpoint of the EXPLODE function, it seems like the column 
> name_element_ does not come from an outer column. Is this an intended feature 
> or a bug?






[jira] [Assigned] (SPARK-40974) EXPLODE function selects outer column

2022-10-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40974:


Assignee: Apache Spark

> EXPLODE function selects outer column
> 
>
> Key: SPARK-40974
> URL: https://issues.apache.org/jira/browse/SPARK-40974
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Omar Ismail
>Assignee: Apache Spark
>Priority: Minor
>
> I'm trying to determine if indirectly selecting an outer column is a bug or an 
> intended feature of the EXPLODE function.
>  
> If I run the following SQL statement:
> ```
> SELECT
>   (SELECT FIRST(name_element_)
>    FROM LATERAL VIEW EXPLODE(name) AS name_element_
>   )
> FROM patient
> ```
>  
> it fails with:
> ```
> Accessing outer query column is not allowed in:
> Generate explode(outer(name#9628))
> ```
>  
> However, if I add a "cheeky select" (the innermost `SELECT name AS 
> name_element_` below), the SQL query is valid and runs:
> ```
> SELECT (
>   SELECT FIRST(name_element_)
>   FROM (SELECT EXPLODE(name_element_) AS name_element_
>         FROM (SELECT name AS name_element_))
> )
> FROM patient
> ```
> From the viewpoint of the EXPLODE function, it seems like the column 
> name_element_ does not come from an outer column. Is this an intended feature 
> or a bug?






[jira] [Commented] (SPARK-40972) OptimizeLocalShuffleReader causing data skew

2022-10-31 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626490#comment-17626490
 ] 

Yuming Wang commented on SPARK-40972:
-

cc [~michaelzhang-db]

> OptimizeLocalShuffleReader causing data skew
> 
>
> Key: SPARK-40972
> URL: https://issues.apache.org/jira/browse/SPARK-40972
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Mingming Ge
>Priority: Major
> Attachments: image-2022-10-31-15-50-36-435.png, 
> image-2022-10-31-15-51-39-430.png, image-2022-10-31-15-53-19-751.png, 
> image-2022-10-31-15-57-41-599.png
>
>
> Because there are many empty files in the table, OptimizeLocalShuffleReader 
> reduces the shuffle partition number to 1
> !image-2022-10-31-15-53-19-751.png!
> !image-2022-10-31-15-57-41-599.png!
> !image-2022-10-31-15-50-36-435.png!
>  
>  
> !image-2022-10-31-15-51-39-430.png!






[jira] [Created] (SPARK-40975) Assign a name to the legacy error class _LEGACY_ERROR_TEMP_0021

2022-10-31 Thread Max Gekk (Jira)
Max Gekk created SPARK-40975:


 Summary: Assign a name to the legacy error class 
_LEGACY_ERROR_TEMP_0021
 Key: SPARK-40975
 URL: https://issues.apache.org/jira/browse/SPARK-40975
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Max Gekk
Assignee: Max Gekk
 Fix For: 3.4.0









[jira] [Assigned] (SPARK-40975) Assign a name to the legacy error class _LEGACY_ERROR_TEMP_0021

2022-10-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40975:


Assignee: Apache Spark  (was: Max Gekk)

> Assign a name to the legacy error class _LEGACY_ERROR_TEMP_0021
> ---
>
> Key: SPARK-40975
> URL: https://issues.apache.org/jira/browse/SPARK-40975
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Commented] (SPARK-40975) Assign a name to the legacy error class _LEGACY_ERROR_TEMP_0021

2022-10-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626532#comment-17626532
 ] 

Apache Spark commented on SPARK-40975:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/38448

> Assign a name to the legacy error class _LEGACY_ERROR_TEMP_0021
> ---
>
> Key: SPARK-40975
> URL: https://issues.apache.org/jira/browse/SPARK-40975
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Assigned] (SPARK-40975) Assign a name to the legacy error class _LEGACY_ERROR_TEMP_0021

2022-10-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40975:


Assignee: Max Gekk  (was: Apache Spark)

> Assign a name to the legacy error class _LEGACY_ERROR_TEMP_0021
> ---
>
> Key: SPARK-40975
> URL: https://issues.apache.org/jira/browse/SPARK-40975
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Commented] (SPARK-40975) Assign a name to the legacy error class _LEGACY_ERROR_TEMP_0021

2022-10-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626533#comment-17626533
 ] 

Apache Spark commented on SPARK-40975:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/38448

> Assign a name to the legacy error class _LEGACY_ERROR_TEMP_0021
> ---
>
> Key: SPARK-40975
> URL: https://issues.apache.org/jira/browse/SPARK-40975
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Resolved] (SPARK-40971) Imports more from connect proto package to avoid calling `proto.` for Connect DSL

2022-10-31 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-40971.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38445
[https://github.com/apache/spark/pull/38445]

> Imports more from connect proto package to avoid calling `proto.` for Connect 
> DSL
> -
>
> Key: SPARK-40971
> URL: https://issues.apache.org/jira/browse/SPARK-40971
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Assigned] (SPARK-40971) Imports more from connect proto package to avoid calling `proto.` for Connect DSL

2022-10-31 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-40971:
---

Assignee: Rui Wang

> Imports more from connect proto package to avoid calling `proto.` for Connect 
> DSL
> -
>
> Key: SPARK-40971
> URL: https://issues.apache.org/jira/browse/SPARK-40971
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
>







[jira] [Commented] (SPARK-40798) Alter partition should verify value

2022-10-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626556#comment-17626556
 ] 

Apache Spark commented on SPARK-40798:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/38449

> Alter partition should verify value
> ---
>
> Key: SPARK-40798
> URL: https://issues.apache.org/jira/browse/SPARK-40798
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
> Fix For: 3.4.0
>
>
>  
> {code:java}
> CREATE TABLE t (c int) USING PARQUET PARTITIONED BY(p int);
> -- This DDL should fail but worked:
> ALTER TABLE t ADD PARTITION(p='aaa'); {code}






[jira] [Commented] (SPARK-40663) Migrate execution errors onto error classes

2022-10-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626557#comment-17626557
 ] 

Apache Spark commented on SPARK-40663:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/38450

> Migrate execution errors onto error classes
> ---
>
> Key: SPARK-40663
> URL: https://issues.apache.org/jira/browse/SPARK-40663
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.4.0
>
>
> Use temporary error classes in the execution exceptions.






[jira] [Commented] (SPARK-40663) Migrate execution errors onto error classes

2022-10-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626558#comment-17626558
 ] 

Apache Spark commented on SPARK-40663:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/38450

> Migrate execution errors onto error classes
> ---
>
> Key: SPARK-40663
> URL: https://issues.apache.org/jira/browse/SPARK-40663
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.4.0
>
>
> Use temporary error classes in the execution exceptions.






[jira] [Commented] (SPARK-34210) Cannot create a record reader because of a previous error when spark accesses the hive on HBase table

2022-10-31 Thread Mehul Thakkar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626576#comment-17626576
 ] 

Mehul Thakkar commented on SPARK-34210:
---

Do you mean we have to download the Spark source code from the master branch and 
apply the fix to make it work for Spark 3? 

> Cannot create a record reader because of a previous error when spark accesses 
> the hive on HBase table 
> --
>
> Key: SPARK-34210
> URL: https://issues.apache.org/jira/browse/SPARK-34210
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: zhangzhanchang
>Priority: Major
>
> Using Spark SQL to access a Hive-on-HBase table works fine in version 2.4.6. 
> After upgrading to Spark 3.0.1, the following exception is thrown:
> java.io.IOException: Cannot create a record reader because of a previous 
> error. Please look at the previous logs lines from the task's full log for 
> more details.
>  at 
> org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:252)
>  at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:131)
>  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
>  at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
> Caused by: java.lang.IllegalStateException: The input format instance has not 
> been properly initialized. Ensure you call initializeTable either in your 
> constructor or initialize method
>  at 
> org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getTable(TableInputFormatBase.java:585)
>  at 
> org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:247)
>  ... 59 more
> java.io.IOException: Cannot create a record reader because of a previous 
> error. Please look at the previous logs lines from the task's full log for 
> more details.
>  
>  
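
The IllegalStateException above already names the fix; here is a hypothetical sketch of a
custom input format that follows it (the class, the table name, and the exact HBase 2.x
signatures are assumptions to double-check):

{code:scala}
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.mapreduce.TableInputFormatBase
import org.apache.hadoop.mapreduce.JobContext

// Per the error text, initializeTable must run in the constructor or in
// initialize(), before getSplits() asks for the table.
class MyTableInputFormat extends TableInputFormatBase {
  override protected def initialize(context: JobContext): Unit = {
    val conn = ConnectionFactory.createConnection(context.getConfiguration)
    initializeTable(conn, TableName.valueOf("my_hbase_table")) // made-up name
  }
}
{code}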






[jira] [Comment Edited] (SPARK-34210) Cannot create a record reader because of a previous error when spark accesses the hive on HBase table

2022-10-31 Thread Mehul Thakkar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626576#comment-17626576
 ] 

Mehul Thakkar edited comment on SPARK-34210 at 10/31/22 12:55 PM:
--

Do you mean we have to download the Spark source code from the master branch and 
apply the fix to make it work for Spark 3? 

Could you please elaborate more on the bug in Hadoop?


was (Author: JIRAUSER297345):
Do you mean we have to download the Spark source code from the master branch and 
apply the fix to make it work for Spark 3? 

> Cannot create a record reader because of a previous error when spark accesses 
> the hive on HBase table 
> --
>
> Key: SPARK-34210
> URL: https://issues.apache.org/jira/browse/SPARK-34210
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: zhangzhanchang
>Priority: Major
>
> Using Spark SQL to access a Hive-on-HBase table works fine in version 2.4.6. 
> After upgrading to Spark 3.0.1, the following exception is thrown:
> java.io.IOException: Cannot create a record reader because of a previous 
> error. Please look at the previous logs lines from the task's full log for 
> more details.
>  at 
> org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:252)
>  at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:131)
>  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
>  at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
> Caused by: java.lang.IllegalStateException: The input format instance has not 
> been properly initialized. Ensure you call initializeTable either in your 
> constructor or initialize method
>  at 
> org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getTable(TableInputFormatBase.java:585)
>  at 
> org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:247)
>  ... 59 more
> java.io.IOException: Cannot create a record reader because of a previous 
> error. Please look at the previous logs lines from the task's full log for 
> more details.
>  
>  






[jira] [Created] (SPARK-40976) Upgrade sbt to 1.7.3

2022-10-31 Thread Yang Jie (Jira)
Yang Jie created SPARK-40976:


 Summary: Upgrade sbt to 1.7.3
 Key: SPARK-40976
 URL: https://issues.apache.org/jira/browse/SPARK-40976
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.4.0
Reporter: Yang Jie


https://github.com/sbt/sbt/releases/tag/v1.7.3






[jira] [Assigned] (SPARK-40976) Upgrade sbt to 1.7.3

2022-10-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40976:


Assignee: (was: Apache Spark)

> Upgrade sbt to 1.7.3
> 
>
> Key: SPARK-40976
> URL: https://issues.apache.org/jira/browse/SPARK-40976
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> https://github.com/sbt/sbt/releases/tag/v1.7.3






[jira] [Commented] (SPARK-40976) Upgrade sbt to 1.7.3

2022-10-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626630#comment-17626630
 ] 

Apache Spark commented on SPARK-40976:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/38451

> Upgrade sbt to 1.7.3
> 
>
> Key: SPARK-40976
> URL: https://issues.apache.org/jira/browse/SPARK-40976
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> https://github.com/sbt/sbt/releases/tag/v1.7.3






[jira] [Assigned] (SPARK-40976) Upgrade sbt to 1.7.3

2022-10-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40976:


Assignee: Apache Spark

> Upgrade sbt to 1.7.3
> 
>
> Key: SPARK-40976
> URL: https://issues.apache.org/jira/browse/SPARK-40976
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
>
> https://github.com/sbt/sbt/releases/tag/v1.7.3






[jira] [Commented] (SPARK-40976) Upgrade sbt to 1.7.3

2022-10-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626631#comment-17626631
 ] 

Apache Spark commented on SPARK-40976:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/38451

> Upgrade sbt to 1.7.3
> 
>
> Key: SPARK-40976
> URL: https://issues.apache.org/jira/browse/SPARK-40976
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> https://github.com/sbt/sbt/releases/tag/v1.7.3






[jira] [Comment Edited] (SPARK-33807) Data Source V2: Remove read specific distributions

2022-10-31 Thread Nikhil Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626635#comment-17626635
 ] 

Nikhil Sharma edited comment on SPARK-33807 at 10/31/22 3:09 PM:
-

Thank you for sharing such good information. Very informative and effective 
post. 

[react native certification|https://www.igmguru.com/digital-marketing-programming/react-native-training/]


was (Author: JIRAUSER295436):
Thank you for sharing such good information. Very informative and effective 
post. 

[react native certification|https://www.igmguru.com/digital-marketing-programming/react-native-training/]

> Data Source V2: Remove read specific distributions
> --
>
> Key: SPARK-33807
> URL: https://issues.apache.org/jira/browse/SPARK-33807
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Anton Okolnychyi
>Priority: Blocker
>
> We should remove the read-specific distributions for DS V2 as discussed 
> [here|https://github.com/apache/spark/pull/30706#discussion_r543059827].






[jira] [Commented] (SPARK-33807) Data Source V2: Remove read specific distributions

2022-10-31 Thread Nikhil Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626635#comment-17626635
 ] 

Nikhil Sharma commented on SPARK-33807:
---

Thank you for sharing such good information. Very informative and effective 
post. 

[react native certification|https://www.igmguru.com/digital-marketing-programming/react-native-training/]

> Data Source V2: Remove read specific distributions
> --
>
> Key: SPARK-33807
> URL: https://issues.apache.org/jira/browse/SPARK-33807
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Anton Okolnychyi
>Priority: Blocker
>
> We should remove the read-specific distributions for DS V2 as discussed 
> [here|https://github.com/apache/spark/pull/30706#discussion_r543059827].






[jira] [Comment Edited] (SPARK-33807) Data Source V2: Remove read specific distributions

2022-10-31 Thread Nikhil Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626635#comment-17626635
 ] 

Nikhil Sharma edited comment on SPARK-33807 at 10/31/22 3:10 PM:
-

Thank you for sharing such good information. Very informative and effective 
post. 

[react native certification|https://www.igmguru.com/digital-marketing-programming/react-native-training/]


was (Author: JIRAUSER295436):
Thank you for sharing such good information. Very informative and effective 
post. 

[react native certification|https://www.igmguru.com/digital-marketing-programming/react-native-training/]

> Data Source V2: Remove read specific distributions
> --
>
> Key: SPARK-33807
> URL: https://issues.apache.org/jira/browse/SPARK-33807
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Anton Okolnychyi
>Priority: Blocker
>
> We should remove the read-specific distributions for DS V2 as discussed 
> [here|https://github.com/apache/spark/pull/30706#discussion_r543059827].






[jira] [Comment Edited] (SPARK-33807) Data Source V2: Remove read specific distributions

2022-10-31 Thread Nikhil Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626635#comment-17626635
 ] 

Nikhil Sharma edited comment on SPARK-33807 at 10/31/22 3:10 PM:
-

Thank you for sharing such good information. Very informative and effective 
post. 

https://www.igmguru.com/digital-marketing-programming/react-native-training/


was (Author: JIRAUSER295436):
Thank you for sharing such good information. Very informative and effective 
post. 

[react native certification|https://www.igmguru.com/digital-marketing-programming/react-native-training/]

> Data Source V2: Remove read specific distributions
> --
>
> Key: SPARK-33807
> URL: https://issues.apache.org/jira/browse/SPARK-33807
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Anton Okolnychyi
>Priority: Blocker
>
> We should remove the read-specific distributions for DS V2 as discussed 
> [here|https://github.com/apache/spark/pull/30706#discussion_r543059827].






[jira] [Commented] (SPARK-40974) EXPLODE function selects outer column

2022-10-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626642#comment-17626642
 ] 

Apache Spark commented on SPARK-40974:
--

User 'clairezhuang' has created a pull request for this issue:
https://github.com/apache/spark/pull/38446

> EXPLODE function selects outer column
> 
>
> Key: SPARK-40974
> URL: https://issues.apache.org/jira/browse/SPARK-40974
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Omar Ismail
>Priority: Minor
>
> I'm trying to determine if indirectly selecting an outer column is a bug or an 
> intended feature of the EXPLODE function.
>  
> If I run the following SQL statement:
> ```
> SELECT
>   (SELECT FIRST(name_element_)
>    FROM LATERAL VIEW EXPLODE(name) AS name_element_
>   )
> FROM patient
> ```
>  
> it fails with:
> ```
> Accessing outer query column is not allowed in:
> Generate explode(outer(name#9628))
> ```
>  
> However, if I add a "cheeky select" (the innermost `SELECT name AS 
> name_element_` below), the SQL query is valid and runs:
> ```
> SELECT (
>   SELECT FIRST(name_element_)
>   FROM (SELECT EXPLODE(name_element_) AS name_element_
>         FROM (SELECT name AS name_element_))
> )
> FROM patient
> ```
> From the viewpoint of the EXPLODE function, it seems like the column 
> name_element_ does not come from an outer column. Is this an intended feature 
> or a bug?






[jira] [Commented] (SPARK-40974) EXPLODE function selects outer column

2022-10-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626644#comment-17626644
 ] 

Apache Spark commented on SPARK-40974:
--

User 'clairezhuang' has created a pull request for this issue:
https://github.com/apache/spark/pull/38446

> EXPLODE function selects outer column
> 
>
> Key: SPARK-40974
> URL: https://issues.apache.org/jira/browse/SPARK-40974
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Omar Ismail
>Priority: Minor
>
> I'm trying to determine if indirectly selecting an outer column is a bug or an 
> intended feature of the EXPLODE function.
>  
> If I run the following SQL statement:
> ```
> SELECT
>   (SELECT FIRST(name_element_)
>    FROM LATERAL VIEW EXPLODE(name) AS name_element_
>   )
> FROM patient
> ```
>  
> it fails with:
> ```
> Accessing outer query column is not allowed in:
> Generate explode(outer(name#9628))
> ```
>  
> However, if I add a "cheeky select" (the innermost `SELECT name AS 
> name_element_` below), the SQL query is valid and runs:
> ```
> SELECT (
>   SELECT FIRST(name_element_)
>   FROM (SELECT EXPLODE(name_element_) AS name_element_
>         FROM (SELECT name AS name_element_))
> )
> FROM patient
> ```
> From the viewpoint of the EXPLODE function, it seems like the column 
> name_element_ does not come from an outer column. Is this an intended feature 
> or a bug?






[jira] [Updated] (SPARK-40916) udf could not filter null value cause npe

2022-10-31 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen updated SPARK-40916:

Description: 
{code:sql}
select
t22.uid
from
(
SELECT
code,
count(distinct uid) cnt
FROM
(
SELECT
uid,
code,
lng,
lat
FROM
(
select
 
riskmanage_dw.GEOHASH_ENCODE(manhattan_dw.aes_decode(lng),manhattan_dw.aes_decode(lat),8)
 as code,
uid,
lng,
lat,
dt as event_time 
from
(
select
param['timestamp'] as dt,

get_json_object(get_json_object(param['input'],'$.baseInfo'),'$.uid') uid,

get_json_object(get_json_object(param['input'],'$.envInfo'),'$.lng') lng,

get_json_object(get_json_object(param['input'],'$.envInfo'),'$.lat') lat 
from manhattan_ods.ods_log_manhattan_fbi_workflow_result_log
where 
get_json_object(get_json_object(param['input'],'$.bizExtents'),'$.productId')='2001'
 
)a
where lng is not null
and lat is not null
) t2
group by uid,code,lng,lat
) t1
GROUP BY code having count(DISTINCT uid)>=10
)t11
join
(
SELECT
uid,
code,
lng,
lat
FROM
(
select

riskmanage_dw.GEOHASH_ENCODE(manhattan_dw.aes_decode(lng),manhattan_dw.aes_decode(lat),8)
 as code,
uid,
lng,
lat,
dt as event_time
from
(
select
param['timestamp'] as dt,

get_json_object(get_json_object(param['input'],'$.baseInfo'),'$.uid') uid,

get_json_object(get_json_object(param['input'],'$.envInfo'),'$.lng') lng, 

get_json_object(get_json_object(param['input'],'$.envInfo'),'$.lat') lat 
from manhattan_ods.ods_log_manhattan_fbi_workflow_result_log 
where 
get_json_object(get_json_object(param['input'],'$.bizExtents'),'$.productId')='2001'
 
)a
where lng is not null
and lat is not null
) t2
where substr(code,0,6)<>'wx4ey3'
group by uid,code,lng,lat
) t22 on t11.code=t22.code
group by t22.uid
{code}
This SQL can't run because 
`riskmanage_dw.GEOHASH_ENCODE(manhattan_dw.aes_decode(lng),manhattan_dw.aes_decode(lat),8)`
 throws an NPE (`Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
Unable to execute method public java.lang.String 
com.xiaoju.automarket.GeohashEncode.evaluate(java.lang.Double,java.lang.Double,java.lang.Integer)
 with arguments {null,null,8}:null`), even though I filter out nulls in my 
conditions and the UDF manhattan_dw.aes_decode returns null if lng or lat is 
null. *However, after I remove the condition `where substr(code,0,6)<>'wx4ey3'`, 
it runs normally.*


Complete stack trace:
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to execute 
method public java.lang.String 
com.xiaoju.automarket.GeohashEncode.evaluate(java.lang.Double,java.lang.Double,java.lang.Integer)
 with arguments {null,null,8}:null
at 
org.apache.hadoop.hive.ql.exec.FunctionRegistry.invoke(FunctionRegistry.java:1049)
at org.apache.spark.sql.hive.HiveSimpleUDF.eval(hiveUDFs.scala:102)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.subExpr_3$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
 Source)
at 
org.apache.spark.sql.execution.FilterExec.$anonfun$doExecute$3(basicPhysicalOperators.scala:275)
at 
org.apache.spark.sql.execution.FilterExec.$anonfun$doExecute$3$adapted(basicPhysicalOperators.scala:274)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:515)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithKeys_0$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
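
One defensive workaround to try (a sketch, assuming the UDFs behave as described; `some_input` is a hypothetical stand-in for subquery `a` above): guard the UDF invocation itself with CASE WHEN so null coordinates never reach it, instead of relying on a separate filter that the generated predicate may evaluate together with the UDF:

{code:scala}
// Sketch only: assumes an active SparkSession `spark` and that the UDFs
// named in the report are registered. CASE WHEN guarantees the THEN branch
// is not evaluated unless both null checks pass.
val guarded = spark.sql("""
  SELECT CASE
           WHEN lng IS NOT NULL AND lat IS NOT NULL THEN
             riskmanage_dw.GEOHASH_ENCODE(
               manhattan_dw.aes_decode(lng),
               manhattan_dw.aes_decode(lat), 8)
         END AS code,
         uid, lng, lat
  FROM some_input
""")
{code}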

  was:
```
select
t22.uid,
from
(
SELECT
code,
count(distinct uid) cnt
FROM
(
SELECT
uid,
code,
lng,
lat
FROM
(
select
 
riskmanage_dw.GEOHASH_ENCODE(manhattan_dw.aes_decode(lng),manhattan_dw.aes_decode(lat),8)
 as code,
uid,
lng,
lat,
dt as event_time 
from
(
select
param['timestamp'] as dt,

get_json_object(get_json_object(param['input'],'$.baseInfo'),'$.uid') uid,
 

[jira] [Commented] (SPARK-40802) Enhance JDBC Connector to use PreparedStatement.getMetaData() to resolve schema instead of PreparedStatement.executeQuery()

2022-10-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626743#comment-17626743
 ] 

Apache Spark commented on SPARK-40802:
--

User 'Mingli-Rui' has created a pull request for this issue:
https://github.com/apache/spark/pull/38452

> Enhance JDBC Connector to use PreparedStatement.getMetaData() to resolve 
> schema instead of PreparedStatement.executeQuery()
> ---
>
> Key: SPARK-40802
> URL: https://issues.apache.org/jira/browse/SPARK-40802
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Mingli Rui
>Priority: Major
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Currently, the Spark JDBC connector uses *PreparedStatement.executeQuery()* to 
> resolve the JDBCRelation's schema, with a schema query like *s"SELECT * FROM 
> $table_or_query WHERE 1=0"*.
> But it is not necessary to execute the query; it is enough to *prepare* it. 
> Preparing the statement parses and compiles the query without executing it, 
> which is more efficient.
> So, it's better to use PreparedStatement.getMetaData() to resolve the schema.
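
For illustration, a minimal sketch of the difference using the plain java.sql API (URL, credentials, and table name below are placeholders, not from this ticket):

{code:scala}
import java.sql.{DriverManager, ResultSetMetaData}

// Placeholders: connection details and table are illustrative only.
val conn = DriverManager.getConnection("jdbc:postgresql://host/db", "user", "pass")
val stmt = conn.prepareStatement("SELECT * FROM some_table WHERE 1=0")

// Current approach: runs the (empty) query just to read its metadata.
//   val meta = stmt.executeQuery().getMetaData

// Proposed approach: the statement is parsed and compiled but never executed.
// Note: some drivers may return null here, so a fallback would still be needed.
val meta: ResultSetMetaData = stmt.getMetaData

(1 to meta.getColumnCount).foreach { i =>
  println(s"${meta.getColumnName(i)}: ${meta.getColumnTypeName(i)}")
}
conn.close()
{code}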



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40802) Enhance JDBC Connector to use PreparedStatement.getMetaData() to resolve schema instead of PreparedStatement.executeQuery()

2022-10-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626742#comment-17626742
 ] 

Apache Spark commented on SPARK-40802:
--

User 'Mingli-Rui' has created a pull request for this issue:
https://github.com/apache/spark/pull/38452

> Enhance JDBC Connector to use PreparedStatement.getMetaData() to resolve 
> schema instead of PreparedStatement.executeQuery()
> ---
>
> Key: SPARK-40802
> URL: https://issues.apache.org/jira/browse/SPARK-40802
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Mingli Rui
>Priority: Major
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Currently, the Spark JDBC connector uses *PreparedStatement.executeQuery()* to 
> resolve the JDBCRelation's schema, with a schema query like *s"SELECT * FROM 
> $table_or_query WHERE 1=0"*.
> But it is not necessary to execute the query; it is enough to *prepare* it. 
> Preparing the statement parses and compiles the query without executing it, 
> which is more efficient.
> So, it's better to use PreparedStatement.getMetaData() to resolve the schema.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40802) Enhance JDBC Connector to use PreparedStatement.getMetaData() to resolve schema instead of PreparedStatement.executeQuery()

2022-10-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40802:


Assignee: (was: Apache Spark)

> Enhance JDBC Connector to use PreparedStatement.getMetaData() to resolve 
> schema instead of PreparedStatement.executeQuery()
> ---
>
> Key: SPARK-40802
> URL: https://issues.apache.org/jira/browse/SPARK-40802
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Mingli Rui
>Priority: Major
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Currently, the Spark JDBC connector uses *PreparedStatement.executeQuery()* to 
> resolve the JDBCRelation's schema, with a schema query like *s"SELECT * FROM 
> $table_or_query WHERE 1=0"*.
> But it is not necessary to execute the query; it is enough to *prepare* it. 
> Preparing the statement parses and compiles the query without executing it, 
> which is more efficient.
> So, it's better to use PreparedStatement.getMetaData() to resolve the schema.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40802) Enhance JDBC Connector to use PreparedStatement.getMetaData() to resolve schema instead of PreparedStatement.executeQuery()

2022-10-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40802:


Assignee: Apache Spark

> Enhance JDBC Connector to use PreparedStatement.getMetaData() to resolve 
> schema instead of PreparedStatement.executeQuery()
> ---
>
> Key: SPARK-40802
> URL: https://issues.apache.org/jira/browse/SPARK-40802
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Mingli Rui
>Assignee: Apache Spark
>Priority: Major
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Currently, the Spark JDBC connector uses *PreparedStatement.executeQuery()* to 
> resolve the JDBCRelation's schema, with a schema query like *s"SELECT * FROM 
> $table_or_query WHERE 1=0"*.
> But it is not necessary to execute the query; it is enough to *prepare* it. 
> Preparing the statement parses and compiles the query without executing it, 
> which is more efficient.
> So, it's better to use PreparedStatement.getMetaData() to resolve the schema.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40569) Add smoke test in standalone cluster for spark-docker

2022-10-31 Thread Vivek Garg (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626753#comment-17626753
 ] 

Vivek Garg commented on SPARK-40569:


The Salesforce Marketing Cloud training offered by IgmGuru is created by 
instructors who are experts in the field using the most recent curriculum. The 
[Salesforce Marketing Cloud 
Certification|[https://www.igmguru.com/salesforce/salesforce-marketing-cloud-training/]]
 Course credential is intended for people who want to show that they have 
knowledge, expertise, and experience in the following areas: best practices for 
email marketing, message design, subscriber and data management, inbox 
delivery, email automation, and tracking and reporting metrics within the 
Marketing Cloud Email application.

> Add smoke test in standalone cluster for spark-docker
> -
>
> Key: SPARK-40569
> URL: https://issues.apache.org/jira/browse/SPARK-40569
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Qian Sun
>Assignee: Qian Sun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-40569) Add smoke test in standalone cluster for spark-docker

2022-10-31 Thread Vivek Garg (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626753#comment-17626753
 ] 

Vivek Garg edited comment on SPARK-40569 at 10/31/22 6:38 PM:
--

The Salesforce Marketing Cloud training offered by IgmGuru is created by 
instructors who are experts in the field using the most recent curriculum. The 
[[https://www.igmguru.com/salesforce/salesforce-marketing-cloud-training/]Salesforce
 Marketing Cloud Certification] Course credential is intended for people who 
want to show that they have knowledge, expertise, and experience in the 
following areas: best practices for email marketing, message design, subscriber 
and data management, inbox delivery, email automation, and tracking and 
reporting metrics within the Marketing Cloud Email application.


was (Author: JIRAUSER294516):
The Salesforce Marketing Cloud training offered by IgmGuru is created by 
instructors who are experts in the field using the most recent curriculum. The 
[Salesforce Marketing Cloud 
Certification|[https://www.igmguru.com/salesforce/salesforce-marketing-cloud-training/]]
 Course credential is intended for people who want to show that they have 
knowledge, expertise, and experience in the following areas: best practices for 
email marketing, message design, subscriber and data management, inbox 
delivery, email automation, and tracking and reporting metrics within the 
Marketing Cloud Email application.

> Add smoke test in standalone cluster for spark-docker
> -
>
> Key: SPARK-40569
> URL: https://issues.apache.org/jira/browse/SPARK-40569
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Qian Sun
>Assignee: Qian Sun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-40569) Add smoke test in standalone cluster for spark-docker

2022-10-31 Thread Vivek Garg (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626753#comment-17626753
 ] 

Vivek Garg edited comment on SPARK-40569 at 10/31/22 6:38 PM:
--

https://www.igmguru.com/erp-training/sac-analytics-cloud-online-training/";>SAP
 analytics cloud training
[https://www.igmguru.com/erp-training/sac-analytics-cloud-online-training/](SAP 
analytics cloud training)
(https://www.igmguru.com/erp-training/sac-analytics-cloud-online-training/)[SAP 
analytics cloud training]
[url=https://www.igmguru.com/erp-training/sac-analytics-cloud-online-training/]SAP
 analytics cloud training[/url]
[https://www.igmguru.com/erp-training/sac-analytics-cloud-online-training/ SAP 
analytics cloud training]
[SAP analytics cloud 
training](https://www.igmguru.com/erp-training/sac-analytics-cloud-online-training/)
(SAP analytics cloud 
training)[https://www.igmguru.com/erp-training/sac-analytics-cloud-online-training/]

 


was (Author: JIRAUSER294516):
The Salesforce Marketing Cloud training offered by IgmGuru is created by 
instructors who are experts in the field using the most recent curriculum. The 
[[https://www.igmguru.com/salesforce/salesforce-marketing-cloud-training/]Salesforce
 Marketing Cloud Certification] Course credential is intended for people who 
want to show that they have knowledge, expertise, and experience in the 
following areas: best practices for email marketing, message design, subscriber 
and data management, inbox delivery, email automation, and tracking and 
reporting metrics within the Marketing Cloud Email application.

> Add smoke test in standalone cluster for spark-docker
> -
>
> Key: SPARK-40569
> URL: https://issues.apache.org/jira/browse/SPARK-40569
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Qian Sun
>Assignee: Qian Sun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-40569) Add smoke test in standalone cluster for spark-docker

2022-10-31 Thread Vivek Garg (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626753#comment-17626753
 ] 

Vivek Garg edited comment on SPARK-40569 at 10/31/22 6:39 PM:
--

The Salesforce Marketing Cloud training offered by IgmGuru is created by 
instructors who are experts in the field using the most recent curriculum. The 
[Salesforce Marketing Cloud 
Certification|[http://www.igmguru.com/salesforce/salesforce-marketing-cloud-training/|https://www.igmguru.com/salesforce/salesforce-marketing-cloud-training/]]
 Course credential is intended for people who want to show that they have 
knowledge, expertise, and experience in the following areas: best practices for 
email marketing, message design, subscriber and data management, inbox 
delivery, email automation, and tracking and reporting metrics within the 
Marketing Cloud Email application.


was (Author: JIRAUSER294516):
https://www.igmguru.com/erp-training/sac-analytics-cloud-online-training/";>SAP
 analytics cloud training
[https://www.igmguru.com/erp-training/sac-analytics-cloud-online-training/](SAP 
analytics cloud training)
(https://www.igmguru.com/erp-training/sac-analytics-cloud-online-training/)[SAP 
analytics cloud training]
[url=https://www.igmguru.com/erp-training/sac-analytics-cloud-online-training/]SAP
 analytics cloud training[/url]
[https://www.igmguru.com/erp-training/sac-analytics-cloud-online-training/ SAP 
analytics cloud training]
[SAP analytics cloud 
training](https://www.igmguru.com/erp-training/sac-analytics-cloud-online-training/)
(SAP analytics cloud 
training)[https://www.igmguru.com/erp-training/sac-analytics-cloud-online-training/]

 

> Add smoke test in standalone cluster for spark-docker
> -
>
> Key: SPARK-40569
> URL: https://issues.apache.org/jira/browse/SPARK-40569
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Qian Sun
>Assignee: Qian Sun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-40569) Add smoke test in standalone cluster for spark-docker

2022-10-31 Thread Vivek Garg (Jira)


[ https://issues.apache.org/jira/browse/SPARK-40569 ]


Vivek Garg deleted comment on SPARK-40569:


was (Author: JIRAUSER294516):
The Salesforce Marketing Cloud training offered by IgmGuru is created by 
instructors who are experts in the field using the most recent curriculum. The 
[Salesforce Marketing Cloud 
Certification|[http://www.igmguru.com/salesforce/salesforce-marketing-cloud-training/|https://www.igmguru.com/salesforce/salesforce-marketing-cloud-training/]]
 Course credential is intended for people who want to show that they have 
knowledge, expertise, and experience in the following areas: best practices for 
email marketing, message design, subscriber and data management, inbox 
delivery, email automation, and tracking and reporting metrics within the 
Marketing Cloud Email application.

> Add smoke test in standalone cluster for spark-docker
> -
>
> Key: SPARK-40569
> URL: https://issues.apache.org/jira/browse/SPARK-40569
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Qian Sun
>Assignee: Qian Sun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33807) Data Source V2: Remove read specific distributions

2022-10-31 Thread Vivek Garg (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626759#comment-17626759
 ] 

Vivek Garg commented on SPARK-33807:


Thank 
[you|https://www.igmguru.com/salesforce/salesforce-marketing-cloud-training/].

> Data Source V2: Remove read specific distributions
> --
>
> Key: SPARK-33807
> URL: https://issues.apache.org/jira/browse/SPARK-33807
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Anton Okolnychyi
>Priority: Blocker
>
> We should remove the read-specific distributions for DS V2 as discussed 
> [here|https://github.com/apache/spark/pull/30706#discussion_r543059827].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-22588) SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values

2022-10-31 Thread Vivek Garg (Jira)


[ https://issues.apache.org/jira/browse/SPARK-22588 ]


Vivek Garg deleted comment on SPARK-22588:


was (Author: JIRAUSER294516):
Thank 
[you|https://www.igmguru.com/salesforce/salesforce-marketing-cloud-training/].

> SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values
> -
>
> Key: SPARK-22588
> URL: https://issues.apache.org/jira/browse/SPARK-22588
> Project: Spark
>  Issue Type: Question
>  Components: Deploy
>Affects Versions: 2.1.1
>Reporter: Saanvi Sharma
>Priority: Minor
>  Labels: dynamodb, spark
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I am using Spark 2.1 on EMR and I have a dataframe like this:
>  ClientNum  | Value_1  | Value_2 | Value_3  | Value_4
>  14 |A |B|   C  |   null
>  19 |X |Y|  null|   null
>  21 |R |   null  |  null|   null
> I want to load data into a DynamoDB table with ClientNum as key, following:
> Analyze Your Data on Amazon DynamoDB with Apache Spark
> Using Spark SQL for ETL
> Here is the code I tried:
>   var jobConf = new JobConf(sc.hadoopConfiguration)
>   jobConf.set("dynamodb.servicename", "dynamodb")
>   jobConf.set("dynamodb.input.tableName", "table_name")   
>   jobConf.set("dynamodb.output.tableName", "table_name")   
>   jobConf.set("dynamodb.endpoint", "dynamodb.eu-west-1.amazonaws.com")
>   jobConf.set("dynamodb.regionid", "eu-west-1")
>   jobConf.set("dynamodb.throughput.read", "1")
>   jobConf.set("dynamodb.throughput.read.percent", "1")
>   jobConf.set("dynamodb.throughput.write", "1")
>   jobConf.set("dynamodb.throughput.write.percent", "1")
>   
>   jobConf.set("mapred.output.format.class", 
> "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
>   jobConf.set("mapred.input.format.class", 
> "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")
>   #Import Data
>   val df = 
> sqlContext.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").load(path)
> I performed a transformation to have an RDD that matches the types that the 
> DynamoDB custom output format knows how to write. The custom output format 
> expects a tuple containing the Text and DynamoDBItemWritable types.
> Create a new RDD with those types in it, in the following map call:
>   #Convert the dataframe to rdd
>   val df_rdd = df.rdd
>   > df_rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> MapPartitionsRDD[10] at rdd at :41
>   
>   #Print first rdd
>   df_rdd.take(1)
>   > res12: Array[org.apache.spark.sql.Row] = Array([14,A,B,C,null])
>   var ddbInsertFormattedRDD = df_rdd.map(a => {
>   var ddbMap = new HashMap[String, AttributeValue]()
>   var ClientNum = new AttributeValue()
>   ClientNum.setN(a.get(0).toString)
>   ddbMap.put("ClientNum", ClientNum)
>   var Value_1 = new AttributeValue()
>   Value_1.setS(a.get(1).toString)
>   ddbMap.put("Value_1", Value_1)
>   var Value_2 = new AttributeValue()
>   Value_2.setS(a.get(2).toString)
>   ddbMap.put("Value_2", Value_2)
>   var Value_3 = new AttributeValue()
>   Value_3.setS(a.get(3).toString)
>   ddbMap.put("Value_3", Value_3)
>   var Value_4 = new AttributeValue()
>   Value_4.setS(a.get(4).toString)
>   ddbMap.put("Value_4", Value_4)
>   var item = new DynamoDBItemWritable()
>   item.setItem(ddbMap)
>   (new Text(""), item)
>   })
> This last call uses the job configuration that defines the EMR-DDB connector 
> to write out the new RDD you created in the expected format:
> ddbInsertFormattedRDD.saveAsHadoopDataset(jobConf)
> It fails with the following error:
> Caused by: java.lang.NullPointerException
> The null values caused the error; if I try with only ClientNum and Value_1, it 
> works and the data is correctly inserted into the DynamoDB table.
> Thanks for your help !!
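
Not from the thread, but a null-safe sketch of the map above (reusing `df_rdd` and `jobConf` as defined in the question): skip null cells instead of calling `.toString` on them, so `AttributeValue.setS` never receives null:

{code:scala}
import java.util.{HashMap => JHashMap}
import org.apache.hadoop.io.Text
import org.apache.hadoop.dynamodb.DynamoDBItemWritable
import com.amazonaws.services.dynamodbv2.model.AttributeValue

// Sketch: same mapping as above, but null cells are skipped rather than
// converted with .toString (the source of the NullPointerException).
val ddbInsertFormattedRDD = df_rdd.map { a =>
  val ddbMap = new JHashMap[String, AttributeValue]()

  val clientNum = new AttributeValue()
  clientNum.setN(a.get(0).toString) // key column, assumed non-null
  ddbMap.put("ClientNum", clientNum)

  Seq("Value_1", "Value_2", "Value_3", "Value_4").zipWithIndex.foreach {
    case (name, idx) =>
      val cell = a.get(idx + 1)
      if (cell != null) { // only write attributes that actually have a value
        val av = new AttributeValue()
        av.setS(cell.toString)
        ddbMap.put(name, av)
      }
  }

  val item = new DynamoDBItemWritable()
  item.setItem(ddbMap)
  (new Text(""), item)
}
ddbInsertFormattedRDD.saveAsHadoopDataset(jobConf)
{code}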



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22588) SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values

2022-10-31 Thread Vivek Garg (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626758#comment-17626758
 ] 

Vivek Garg commented on SPARK-22588:


Thank 
[you|https://www.igmguru.com/salesforce/salesforce-marketing-cloud-training/].

> SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values
> -
>
> Key: SPARK-22588
> URL: https://issues.apache.org/jira/browse/SPARK-22588
> Project: Spark
>  Issue Type: Question
>  Components: Deploy
>Affects Versions: 2.1.1
>Reporter: Saanvi Sharma
>Priority: Minor
>  Labels: dynamodb, spark
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I am using Spark 2.1 on EMR and I have a dataframe like this:
>  ClientNum  | Value_1  | Value_2 | Value_3  | Value_4
>  14 |A |B|   C  |   null
>  19 |X |Y|  null|   null
>  21 |R |   null  |  null|   null
> I want to load data into a DynamoDB table with ClientNum as key, following:
> Analyze Your Data on Amazon DynamoDB with Apache Spark
> Using Spark SQL for ETL
> Here is the code I tried:
>   var jobConf = new JobConf(sc.hadoopConfiguration)
>   jobConf.set("dynamodb.servicename", "dynamodb")
>   jobConf.set("dynamodb.input.tableName", "table_name")   
>   jobConf.set("dynamodb.output.tableName", "table_name")   
>   jobConf.set("dynamodb.endpoint", "dynamodb.eu-west-1.amazonaws.com")
>   jobConf.set("dynamodb.regionid", "eu-west-1")
>   jobConf.set("dynamodb.throughput.read", "1")
>   jobConf.set("dynamodb.throughput.read.percent", "1")
>   jobConf.set("dynamodb.throughput.write", "1")
>   jobConf.set("dynamodb.throughput.write.percent", "1")
>   
>   jobConf.set("mapred.output.format.class", 
> "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
>   jobConf.set("mapred.input.format.class", 
> "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")
>   #Import Data
>   val df = 
> sqlContext.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").load(path)
> I performed a transformation to have an RDD that matches the types that the 
> DynamoDB custom output format knows how to write. The custom output format 
> expects a tuple containing the Text and DynamoDBItemWritable types.
> Create a new RDD with those types in it, in the following map call:
>   #Convert the dataframe to rdd
>   val df_rdd = df.rdd
>   > df_rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> MapPartitionsRDD[10] at rdd at :41
>   
>   #Print first rdd
>   df_rdd.take(1)
>   > res12: Array[org.apache.spark.sql.Row] = Array([14,A,B,C,null])
>   var ddbInsertFormattedRDD = df_rdd.map(a => {
>   var ddbMap = new HashMap[String, AttributeValue]()
>   var ClientNum = new AttributeValue()
>   ClientNum.setN(a.get(0).toString)
>   ddbMap.put("ClientNum", ClientNum)
>   var Value_1 = new AttributeValue()
>   Value_1.setS(a.get(1).toString)
>   ddbMap.put("Value_1", Value_1)
>   var Value_2 = new AttributeValue()
>   Value_2.setS(a.get(2).toString)
>   ddbMap.put("Value_2", Value_2)
>   var Value_3 = new AttributeValue()
>   Value_3.setS(a.get(3).toString)
>   ddbMap.put("Value_3", Value_3)
>   var Value_4 = new AttributeValue()
>   Value_4.setS(a.get(4).toString)
>   ddbMap.put("Value_4", Value_4)
>   var item = new DynamoDBItemWritable()
>   item.setItem(ddbMap)
>   (new Text(""), item)
>   })
> This last call uses the job configuration that defines the EMR-DDB connector 
> to write out the new RDD you created in the expected format:
> ddbInsertFormattedRDD.saveAsHadoopDataset(jobConf)
> It fails with the following error:
> Caused by: java.lang.NullPointerException
> The null values caused the error; if I try with only ClientNum and Value_1, it 
> works and the data is correctly inserted into the DynamoDB table.
> Thanks for your help !!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-33807) Data Source V2: Remove read specific distributions

2022-10-31 Thread Vivek Garg (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626759#comment-17626759
 ] 

Vivek Garg edited comment on SPARK-33807 at 10/31/22 6:42 PM:
--

Thank [Salesforce Marketing Cloud 
Certification|https://www.igmguru.com/salesforce/salesforce-marketing-cloud-training/].


was (Author: JIRAUSER294516):
Thank 
[you|https://www.igmguru.com/salesforce/salesforce-marketing-cloud-training/].

> Data Source V2: Remove read specific distributions
> --
>
> Key: SPARK-33807
> URL: https://issues.apache.org/jira/browse/SPARK-33807
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Anton Okolnychyi
>Priority: Blocker
>
> We should remove the read-specific distributions for DS V2 as discussed 
> [here|https://github.com/apache/spark/pull/30706#discussion_r543059827].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-33807) Data Source V2: Remove read specific distributions

2022-10-31 Thread Vivek Garg (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626759#comment-17626759
 ] 

Vivek Garg edited comment on SPARK-33807 at 10/31/22 6:43 PM:
--

Great job. [Salesforce Marketing Cloud 
Certification|https://www.igmguru.com/salesforce/salesforce-marketing-cloud-training/].


was (Author: JIRAUSER294516):
Thank [Salesforce Marketing Cloud 
Certification|https://www.igmguru.com/salesforce/salesforce-marketing-cloud-training/].

> Data Source V2: Remove read specific distributions
> --
>
> Key: SPARK-33807
> URL: https://issues.apache.org/jira/browse/SPARK-33807
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Anton Okolnychyi
>Priority: Blocker
>
> We should remove the read-specific distributions for DS V2 as discussed 
> [here|https://github.com/apache/spark/pull/30706#discussion_r543059827].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23521) SPIP: Standardize SQL logical plans with DataSourceV2

2022-10-31 Thread Vivek Garg (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626762#comment-17626762
 ] 

Vivek Garg commented on SPARK-23521:


IgmGuru [Mulesoft Online 
Training|https://www.igmguru.com/digital-marketing-programming/mulesoft-training/]
 is created with the Mulesoft certification exam in mind to ensure that the 
applicant passes the test on their first try.

> SPIP: Standardize SQL logical plans with DataSourceV2
> -
>
> Key: SPARK-23521
> URL: https://issues.apache.org/jira/browse/SPARK-23521
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Priority: Major
>  Labels: SPIP
> Attachments: SPIP_ Standardize logical plans.pdf
>
>
> Executive Summary: This SPIP is based on [discussion about the DataSourceV2 
> implementation|https://lists.apache.org/thread.html/55676ec1f5039d3deaf347d391cf82fe8574b8fa4eeab70110ed5b2b@%3Cdev.spark.apache.org%3E]
>  on the dev list. The proposal is to standardize the logical plans used for 
> write operations to make the planner more maintainable and to make Spark's 
> write behavior predictable and reliable. It proposes the following principles:
>  # Use well-defined logical plan nodes for all high-level operations: insert, 
> create, CTAS, overwrite table, etc.
>  # Use planner rules that match on these high-level nodes, so that it isn’t 
> necessary to create rules to match each eventual code path individually.
>  # Clearly define Spark’s behavior for these logical plan nodes. Physical 
> nodes should implement that behavior so that all code paths eventually make 
> the same guarantees.
>  # Specialize implementation when creating a physical plan, not logical 
> plans. This will avoid behavior drift and ensure planner code is shared 
> across physical implementations.
> The SPIP doc presents a small but complete set of those high-level logical 
> operations, most of which are already defined in SQL or implemented by some 
> write path in Spark.
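
As a rough illustration of principle 1 (hypothetical shapes only, not Spark's actual classes): one dedicated logical node per high-level write operation, which planner rules can then match on directly instead of on individual code paths:

{code:scala}
// Hypothetical sketch of the idea; names and fields are illustrative.
trait LogicalWrite
case class AppendData(table: String, query: String) extends LogicalWrite
case class OverwriteByFilter(table: String, filter: String, query: String) extends LogicalWrite
case class CreateTableAsSelect(table: String, query: String) extends LogicalWrite

// A single rule can handle every high-level operation by matching the nodes.
def plan(write: LogicalWrite): String = write match {
  case AppendData(t, q)           => s"append $q into $t"
  case OverwriteByFilter(t, f, q) => s"overwrite rows of $t matching $f with $q"
  case CreateTableAsSelect(t, q)  => s"create $t as $q"
}
{code}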



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-33807) Data Source V2: Remove read specific distributions

2022-10-31 Thread Chao Sun (Jira)


[ https://issues.apache.org/jira/browse/SPARK-33807 ]


Chao Sun deleted comment on SPARK-33807:
--

was (Author: JIRAUSER294516):
Great job. [Salesforce Marketing Cloud 
Certification|https://www.igmguru.com/salesforce/salesforce-marketing-cloud-training/].

> Data Source V2: Remove read specific distributions
> --
>
> Key: SPARK-33807
> URL: https://issues.apache.org/jira/browse/SPARK-33807
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Anton Okolnychyi
>Priority: Blocker
>
> We should remove the read-specific distributions for DS V2 as discussed 
> [here|https://github.com/apache/spark/pull/30706#discussion_r543059827].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-33807) Data Source V2: Remove read specific distributions

2022-10-31 Thread Chao Sun (Jira)


[ https://issues.apache.org/jira/browse/SPARK-33807 ]


Chao Sun deleted comment on SPARK-33807:
--

was (Author: JIRAUSER295436):
Thank you for sharing such good information. Very informative and effective 
post. 

+[https://www.igmguru.com/digital-marketing-programming/react-native-training/]+

> Data Source V2: Remove read specific distributions
> --
>
> Key: SPARK-33807
> URL: https://issues.apache.org/jira/browse/SPARK-33807
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Anton Okolnychyi
>Priority: Blocker
>
> We should remove the read-specific distributions for DS V2 as discussed 
> [here|https://github.com/apache/spark/pull/30706#discussion_r543059827].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40977) Complete Support for Union in Python client

2022-10-31 Thread Rui Wang (Jira)
Rui Wang created SPARK-40977:


 Summary: Complete Support for Union in Python client
 Key: SPARK-40977
 URL: https://issues.apache.org/jira/browse/SPARK-40977
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Rui Wang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40977) Complete Support for Union in Python client

2022-10-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626799#comment-17626799
 ] 

Apache Spark commented on SPARK-40977:
--

User 'amaliujia' has created a pull request for this issue:
https://github.com/apache/spark/pull/38453

> Complete Support for Union in Python client
> ---
>
> Key: SPARK-40977
> URL: https://issues.apache.org/jira/browse/SPARK-40977
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40977) Complete Support for Union in Python client

2022-10-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626800#comment-17626800
 ] 

Apache Spark commented on SPARK-40977:
--

User 'amaliujia' has created a pull request for this issue:
https://github.com/apache/spark/pull/38453

> Complete Support for Union in Python client
> ---
>
> Key: SPARK-40977
> URL: https://issues.apache.org/jira/browse/SPARK-40977
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40977) Complete Support for Union in Python client

2022-10-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40977:


Assignee: Apache Spark

> Complete Support for Union in Python client
> ---
>
> Key: SPARK-40977
> URL: https://issues.apache.org/jira/browse/SPARK-40977
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40977) Complete Support for Union in Python client

2022-10-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40977:


Assignee: (was: Apache Spark)

> Complete Support for Union in Python client
> ---
>
> Key: SPARK-40977
> URL: https://issues.apache.org/jira/browse/SPARK-40977
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40947) Upgrade pandas to 1.5.1

2022-10-31 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-40947:
-

Assignee: Haejoon Lee

> Upgrade pandas to 1.5.1
> ---
>
> Key: SPARK-40947
> URL: https://issues.apache.org/jira/browse/SPARK-40947
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> Pandas 1.5.1 is released; we should support the latest pandas.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40947) Upgrade pandas to 1.5.1

2022-10-31 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-40947.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38420
[https://github.com/apache/spark/pull/38420]

> Upgrade pandas to 1.5.1
> ---
>
> Key: SPARK-40947
> URL: https://issues.apache.org/jira/browse/SPARK-40947
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.4.0
>
>
> Pandas 1.5.1 is released; we should support the latest pandas.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40966) FIX `read_parquet` with `pandas_metadata`

2022-10-31 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-40966:
-

Assignee: Haejoon Lee

> FIX `read_parquet` with `pandas_metadata`
> -
>
> Key: SPARK-40966
> URL: https://issues.apache.org/jira/browse/SPARK-40966
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> test_parquet_read_with_pandas_metadata is broken with pandas 1.5.1;
> we should fix it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40966) FIX `read_parquet` with `pandas_metadata`

2022-10-31 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-40966.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38420
[https://github.com/apache/spark/pull/38420]

> FIX `read_parquet` with `pandas_metadata`
> -
>
> Key: SPARK-40966
> URL: https://issues.apache.org/jira/browse/SPARK-40966
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.4.0
>
>
> test_parquet_read_with_pandas_metadata is broken with pandas 1.5.1;
> we should fix it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40976) Upgrade sbt to 1.7.3

2022-10-31 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-40976.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38451
[https://github.com/apache/spark/pull/38451]

> Upgrade sbt to 1.7.3
> 
>
> Key: SPARK-40976
> URL: https://issues.apache.org/jira/browse/SPARK-40976
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.4.0
>
>
> https://github.com/sbt/sbt/releases/tag/v1.7.3



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40976) Upgrade sbt to 1.7.3

2022-10-31 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-40976:
-

Assignee: Yang Jie

> Upgrade sbt to 1.7.3
> 
>
> Key: SPARK-40976
> URL: https://issues.apache.org/jira/browse/SPARK-40976
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>
> https://github.com/sbt/sbt/releases/tag/v1.7.3



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40978) Migrate failAnalysis() w/o context onto error classes

2022-10-31 Thread Max Gekk (Jira)
Max Gekk created SPARK-40978:


 Summary: Migrate failAnalysis() w/o context onto error classes
 Key: SPARK-40978
 URL: https://issues.apache.org/jira/browse/SPARK-40978
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Max Gekk
Assignee: Max Gekk
 Fix For: 3.4.0


Call `failAnalysis()` with an error class instead of `failAnalysis()` w/ a 
message.
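
For illustration, the shape of such a migration (a sketch with stub helpers; the real overloads live in Spark's CheckAnalysis and may differ from the signatures assumed here):

{code:scala}
object AnalysisErrors {
  // Stubs standing in for Spark's internal helpers; these signatures are
  // assumptions for illustration, not Spark's actual API.
  def failAnalysis(message: String): Nothing =
    throw new RuntimeException(message)
  def failAnalysis(errorClass: String, messageParameters: Map[String, String]): Nothing =
    throw new RuntimeException(s"[$errorClass] " + messageParameters.mkString(", "))
}

val relationName = "db.tbl"
// Before: a free-form message baked into the call site.
//   AnalysisErrors.failAnalysis(s"Table or view not found: $relationName")
// After: a stable error class plus parameters, so the message text is defined
// once centrally (the class name here is illustrative).
AnalysisErrors.failAnalysis("TABLE_OR_VIEW_NOT_FOUND", Map("relationName" -> relationName))
{code}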



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40978) Migrate failAnalysis() w/o context onto error classes

2022-10-31 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-40978:
-
Description: Call `failAnalysis()` w/o context but with an error class 
instead of `failAnalysis()` w/ a message.  (was: Call `failAnalysis()` with an 
error class instead of `failAnalysis()` w/ a message.)

> Migrate failAnalysis() w/o context onto error classes
> -
>
> Key: SPARK-40978
> URL: https://issues.apache.org/jira/browse/SPARK-40978
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.4.0
>
>
> Call `failAnalysis()` w/o context but with an error class instead of 
> `failAnalysis()` w/ a message.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40979) Keep removed executor info in decommission state

2022-10-31 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-40979:
-

 Summary: Keep removed executor info in decommission state
 Key: SPARK-40979
 URL: https://issues.apache.org/jira/browse/SPARK-40979
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.4.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40979) Keep removed executor info in decommission state

2022-10-31 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-40979:
--
Reporter: Zhongwei Zhu  (was: Dongjoon Hyun)

> Keep removed executor info in decommission state
> 
>
> Key: SPARK-40979
> URL: https://issues.apache.org/jira/browse/SPARK-40979
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Zhongwei Zhu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31776) Literal lit() supports lists and numpy arrays

2022-10-31 Thread Xinrong Meng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626844#comment-17626844
 ] 

Xinrong Meng commented on SPARK-31776:
--

`lit` supports Python lists and NumPy arrays as of 
https://issues.apache.org/jira/browse/SPARK-39405, in Spark 3.4.0.

> Literal lit() supports lists and numpy arrays
> -
>
> Key: SPARK-31776
> URL: https://issues.apache.org/jira/browse/SPARK-31776
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> In ML workloads, it is common to replace null feature vectors with some 
> default value. However, lit() does not support Python lists and numpy arrays 
> as input, so users cannot simply use fillna() to get the job done.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40979) Keep removed executor info in decommission state

2022-10-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40979:


Assignee: (was: Apache Spark)

> Keep removed executor info in decommission state
> 
>
> Key: SPARK-40979
> URL: https://issues.apache.org/jira/browse/SPARK-40979
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Zhongwei Zhu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40979) Keep removed executor info in decommission state

2022-10-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40979:


Assignee: Apache Spark

> Keep removed executor info in decommission state
> 
>
> Key: SPARK-40979
> URL: https://issues.apache.org/jira/browse/SPARK-40979
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Zhongwei Zhu
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40979) Keep removed executor info in decommission state

2022-10-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626845#comment-17626845
 ] 

Apache Spark commented on SPARK-40979:
--

User 'warrenzhu25' has created a pull request for this issue:
https://github.com/apache/spark/pull/38441

> Keep removed executor info in decommission state
> 
>
> Key: SPARK-40979
> URL: https://issues.apache.org/jira/browse/SPARK-40979
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Zhongwei Zhu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6857) Python SQL schema inference should support numpy types

2022-10-31 Thread Xinrong Meng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-6857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626847#comment-17626847
 ] 

Xinrong Meng commented on SPARK-6857:
-

Hi, we have NumPy input support via 
https://issues.apache.org/jira/browse/SPARK-39405 in Spark 3.4.0.

> Python SQL schema inference should support numpy types
> --
>
> Key: SPARK-6857
> URL: https://issues.apache.org/jira/browse/SPARK-6857
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark, SQL
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> **UPDATE**: Closing this JIRA since the better fix will be improved UDT 
> support. See discussion in the comments.
> If you try to use SQL's schema inference to create a DataFrame out of a list 
> or RDD of numpy types (such as numpy.float64), SQL will not recognize the 
> numpy types.  It would be handy if it did.
> E.g.:
> {code}
> import numpy
> from collections import namedtuple
> from pyspark.sql import SQLContext
> MyType = namedtuple('MyType', 'x')
> myValues = map(lambda x: MyType(x), numpy.random.randint(100, size=10))
> sqlContext = SQLContext(sc)
> data = sqlContext.createDataFrame(myValues)
> {code}
> The above code fails with:
> {code}
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/Users/josephkb/spark/python/pyspark/sql/context.py", line 331, in 
> createDataFrame
> return self.inferSchema(data, samplingRatio)
>   File "/Users/josephkb/spark/python/pyspark/sql/context.py", line 205, in 
> inferSchema
> schema = self._inferSchema(rdd, samplingRatio)
>   File "/Users/josephkb/spark/python/pyspark/sql/context.py", line 160, in 
> _inferSchema
> schema = _infer_schema(first)
>   File "/Users/josephkb/spark/python/pyspark/sql/types.py", line 660, in 
> _infer_schema
> fields = [StructField(k, _infer_type(v), True) for k, v in items]
>   File "/Users/josephkb/spark/python/pyspark/sql/types.py", line 637, in 
> _infer_type
> raise ValueError("not supported type: %s" % type(obj))
> ValueError: not supported type: 
> {code}
> But if we cast to int (not numpy types) first, it's OK:
> {code}
> myNativeValues = map(lambda x: MyType(int(x.x)), myValues)
> data = sqlContext.createDataFrame(myNativeValues) # OK
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37697) Make it easier to convert numpy arrays to Spark Dataframes

2022-10-31 Thread Xinrong Meng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626850#comment-17626850
 ] 

Xinrong Meng commented on SPARK-37697:
--

Hi, we have NumPy input support via 
https://issues.apache.org/jira/browse/SPARK-39405 in Spark 3.4.0.

> Make it easier to convert numpy arrays to Spark Dataframes
> --
>
> Key: SPARK-37697
> URL: https://issues.apache.org/jira/browse/SPARK-37697
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.2
>Reporter: Douglas Moore
>Priority: Major
>
> Make it easier to convert numpy arrays to dataframes.
> Often we receive errors:
>  
> {code:java}
> df = spark.createDataFrame(numpy.arange(10))
> Can not infer schema for type: 
> {code}
>  
> OR
> {code:java}
> df = spark.createDataFrame(numpy.arange(10.))
> Can not infer schema for type: 
> {code}
>  
> Today (Spark 3.x) we have to:
> {code:java}
> spark.createDataFrame(pd.DataFrame(numpy.arange(10.)))
> {code}
> Make this easier with a direct conversion from Numpy arrays to Spark 
> Dataframes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40979) Keep removed executor info in decommission state

2022-10-31 Thread Zhongwei Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhongwei Zhu updated SPARK-40979:
-
Description: 
Executors removed due to decommission should be kept in a separate set. To 
avoid OOM, the set size will be limited to 1K or 10K.

FetchFailed caused by a decommissioned executor falls into 2 categories:
 # When the FetchFailed reaches DAGScheduler, the executor is still alive, or 
is lost but the loss info hasn't reached TaskSchedulerImpl yet. This is already 
handled in SPARK-40979.
 # The FetchFailed is caused by the loss of the decommissioned executor, so the 
decommission info has already been removed from TaskSchedulerImpl. Keeping such 
info for a short period is good enough: even if we limit the set of removed 
executors to 10K entries, that is at most about 10MB of memory. In real cases 
it's rare to have a cluster of over 10K executors, and the chance that all of 
them are decommissioned and lost at the same time is small.
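
A minimal sketch of such a size-bounded set (insertion-ordered so that eviction drops the oldest entry; the 10K default matches the limit discussed above):

{code:scala}
import scala.collection.mutable

// Sketch: remember the last `maxSize` decommissioned-then-removed executor IDs.
class BoundedExecutorSet(maxSize: Int = 10000) {
  private val ids = mutable.LinkedHashSet.empty[String]

  def add(executorId: String): Unit = {
    ids += executorId
    if (ids.size > maxSize) ids -= ids.head // evict the oldest entry
  }

  def contains(executorId: String): Boolean = ids.contains(executorId)
}

val decommissioned = new BoundedExecutorSet(maxSize = 3)
Seq("1", "2", "3", "4").foreach(decommissioned.add)
assert(!decommissioned.contains("1") && decommissioned.contains("4"))
{code}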

> Keep removed executor info in decommission state
> 
>
> Key: SPARK-40979
> URL: https://issues.apache.org/jira/browse/SPARK-40979
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Zhongwei Zhu
>Priority: Major
>
> Executors removed due to decommission should be kept in a separate set. To 
> avoid OOM, the set size will be limited to 1K or 10K.
> FetchFailed caused by a decommissioned executor falls into 2 categories:
>  # When the FetchFailed reaches DAGScheduler, the executor is still alive, or 
> is lost but the loss info hasn't reached TaskSchedulerImpl yet. This is 
> already handled in SPARK-40979.
>  # The FetchFailed is caused by the loss of the decommissioned executor, so 
> the decommission info has already been removed from TaskSchedulerImpl. 
> Keeping such info for a short period is good enough: even if we limit the set 
> of removed executors to 10K entries, that is at most about 10MB of memory. In 
> real cases it's rare to have a cluster of over 10K executors, and the chance 
> that all of them are decommissioned and lost at the same time is small.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40978) Migrate failAnalysis() w/o context onto error classes

2022-10-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40978:


Assignee: Max Gekk  (was: Apache Spark)

> Migrate failAnalysis() w/o context onto error classes
> -
>
> Key: SPARK-40978
> URL: https://issues.apache.org/jira/browse/SPARK-40978
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.4.0
>
>
> Call `failAnalysis()` w/o context but with an error class instead of 
> `failAnalysis()` w/ a message.
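
Spark's `failAnalysis` lives in Scala internals; as a language-neutral sketch 
of the pattern being migrated to, the caller supplies a stable error class 
plus message parameters instead of a pre-formatted string (all names below 
are hypothetical):

{code:python}
# Hypothetical sketch of the error-class pattern, not Spark's actual code.
ERROR_CLASSES = {
    "UNRESOLVED_COLUMN": "Column {name} cannot be resolved.",
}

class AnalysisError(Exception):
    def __init__(self, error_class, **params):
        self.error_class = error_class
        self.params = params
        super().__init__(ERROR_CLASSES[error_class].format(**params))

def fail_analysis(error_class, **params):
    raise AnalysisError(error_class, **params)

try:
    fail_analysis("UNRESOLVED_COLUMN", name="`foo`")
except AnalysisError as e:
    # Tools can match on the machine-readable class, not the message text.
    print(e.error_class, e.params)
{code}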



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40978) Migrate failAnalysis() w/o context onto error classes

2022-10-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626853#comment-17626853
 ] 

Apache Spark commented on SPARK-40978:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/38454

> Migrate failAnalysis() w/o context onto error classes
> -
>
> Key: SPARK-40978
> URL: https://issues.apache.org/jira/browse/SPARK-40978
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.4.0
>
>
> Call `failAnalysis()` w/o context but with an error class instead of 
> `failAnalysis()` w/ a message.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40978) Migrate failAnalysis() w/o context onto error classes

2022-10-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40978:


Assignee: Apache Spark  (was: Max Gekk)

> Migrate failAnalysis() w/o context onto error classes
> -
>
> Key: SPARK-40978
> URL: https://issues.apache.org/jira/browse/SPARK-40978
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.4.0
>
>
> Call `failAnalysis()` w/o context but with an error class instead of 
> `failAnalysis()` w/ a message.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40978) Migrate failAnalysis() w/o context onto error classes

2022-10-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626854#comment-17626854
 ] 

Apache Spark commented on SPARK-40978:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/38454

> Migrate failAnalysis() w/o context onto error classes
> -
>
> Key: SPARK-40978
> URL: https://issues.apache.org/jira/browse/SPARK-40978
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.4.0
>
>
> Call `failAnalysis()` w/o context but with an error class instead of 
> `failAnalysis()` w/ a message.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37946) Use error classes in the execution errors related to partitions

2022-10-31 Thread Khalid Mammadov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626866#comment-17626866
 ] 

Khalid Mammadov commented on SPARK-37946:
-

Hi [~maxgekk], I see this one is not done yet: 
partitionColumnNotFoundInSchemaError

Can I look into it?

Also, there are some more waiting to be done in QueryExecutionErrors.scala, 
e.g.:
* stateNotDefinedOrAlreadyRemovedError
* cannotSetTimeoutDurationError
* cannotGetEventTimeWatermarkError
* cannotSetTimeoutTimestampError
* batchMetadataFileNotFoundError

Shall I look into these as well?

> Use error classes in the execution errors related to partitions
> ---
>
> Key: SPARK-37946
> URL: https://issues.apache.org/jira/browse/SPARK-37946
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> Migrate the following errors in QueryExecutionErrors:
> * unableToDeletePartitionPathError
> * unableToCreatePartitionPathError
> * unableToRenamePartitionPathError
> * notADatasourceRDDPartitionError
> * cannotClearPartitionDirectoryError
> * failedToCastValueToDataTypeForPartitionColumnError
> * unsupportedPartitionTransformError
> * cannotCreateJDBCTableWithPartitionsError
> * requestedPartitionsMismatchTablePartitionsError
> * dynamicPartitionKeyNotAmongWrittenPartitionPathsError
> * cannotRemovePartitionDirError
> * alterTableWithDropPartitionAndPurgeUnsupportedError
> * invalidPartitionFilterError
> * getPartitionMetadataByFilterError
> * illegalLocationClauseForViewPartitionError
> * partitionColumnNotFoundInSchemaError
> * cannotAddMultiPartitionsOnNonatomicPartitionTableError
> * cannotDropMultiPartitionsOnNonatomicPartitionTableError
> * truncateMultiPartitionUnsupportedError
> * dynamicPartitionOverwriteUnsupportedByTableError
> * writePartitionExceedConfigSizeWhenDynamicPartitionError
> onto error classes. Throw an implementation of SparkThrowable, and write a 
> test for every error in QueryExecutionErrorsSuite.
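
On the test side, each migrated error can be asserted by class and parameters 
rather than by message text. A minimal pytest-style sketch (hypothetical 
names; Spark's own suites are Scala):

{code:python}
import pytest

class SparkThrowableLike(Exception):
    """Stand-in for an exception carrying an error class and parameters."""
    def __init__(self, error_class, params):
        self.error_class = error_class
        self.params = params
        super().__init__(f"[{error_class}] {params}")

def unable_to_create_partition_path_error(path):
    raise SparkThrowableLike("UNABLE_TO_CREATE_PARTITION_PATH", {"path": path})

def test_unable_to_create_partition_path():
    with pytest.raises(SparkThrowableLike) as exc:
        unable_to_create_partition_path_error("/tmp/t/p=1")
    # Assert the machine-readable class, not the rendered message.
    assert exc.value.error_class == "UNABLE_TO_CREATE_PARTITION_PATH"
    assert exc.value.params == {"path": "/tmp/t/p=1"}
{code}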



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40815) SymlinkTextInputFormat returns incorrect result due to enabled spark.hadoopRDD.ignoreEmptySplits

2022-10-31 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-40815.
---
Fix Version/s: 3.4.0
 Assignee: Ivan Sadikov
   Resolution: Fixed

This is resolved via https://github.com/apache/spark/pull/38277

> SymlinkTextInputFormat returns incorrect result due to enabled 
> spark.hadoopRDD.ignoreEmptySplits
> 
>
> Key: SPARK-40815
> URL: https://issues.apache.org/jira/browse/SPARK-40815
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.2.2, 3.4.0
>Reporter: Ivan Sadikov
>Assignee: Ivan Sadikov
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40951) pyspark-connect tests should be skipped if pandas doesn't exist

2022-10-31 Thread Rui Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626878#comment-17626878
 ] 

Rui Wang commented on SPARK-40951:
--

[~dongjoon] Is this JIRA fully resolved already? Can we close this JIRA now? 

> pyspark-connect tests should be skipped if pandas doesn't exist
> ---
>
> Key: SPARK-40951
> URL: https://issues.apache.org/jira/browse/SPARK-40951
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Tests
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40944) Relax ordering constraint for CREATE TABLE column options

2022-10-31 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang reassigned SPARK-40944:
--

Assignee: Daniel

> Relax ordering constraint for CREATE TABLE column options
> -
>
> Key: SPARK-40944
> URL: https://issues.apache.org/jira/browse/SPARK-40944
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Daniel
>Assignee: Daniel
>Priority: Major
>
> Currently the grammar for each CREATE TABLE column is:
> createOrReplaceTableColType
> : colName=errorCapturingIdentifier dataType (NOT NULL)? 
> defaultExpression? commentSpec?
> ;
> This enforces a constraint on the order of: (NOT NULL, DEFAULT value, COMMENT 
> value). We can update the grammar to allow these options in any order 
> instead, to improve usability.
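
For example, with the relaxed grammar both statements below would parse; 
whether DEFAULT is then accepted also depends on the Spark version and data 
source, so treat this as a sketch of the ordering change only:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Accepted today: options in the fixed order NOT NULL, DEFAULT, COMMENT.
spark.sql("""
  CREATE TABLE t1 (
    id INT NOT NULL DEFAULT 0 COMMENT 'primary id'
  ) USING parquet
""")

# With the relaxed grammar, the same options may appear in any order.
spark.sql("""
  CREATE TABLE t2 (
    id INT COMMENT 'primary id' DEFAULT 0 NOT NULL
  ) USING parquet
""")
{code}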



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40944) Relax ordering constraint for CREATE TABLE column options

2022-10-31 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-40944.

Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38418
[https://github.com/apache/spark/pull/38418]

> Relax ordering constraint for CREATE TABLE column options
> -
>
> Key: SPARK-40944
> URL: https://issues.apache.org/jira/browse/SPARK-40944
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Daniel
>Assignee: Daniel
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently the grammar for each CREATE TABLE column is:
> createOrReplaceTableColType
> : colName=errorCapturingIdentifier dataType (NOT NULL)? 
> defaultExpression? commentSpec?
> ;
> This enforces a constraint on the order of: (NOT NULL, DEFAULT value, COMMENT 
> value). We can update the grammar to allow these options in any order 
> instead, to improve usability.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29683) Job failed due to executor failures all available nodes are blacklisted

2022-10-31 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626884#comment-17626884
 ] 

Attila Zsolt Piros commented on SPARK-29683:


[~srowen] I think we can close this as this commit solved the issue:
https://github.com/apache/spark/commit/e70df2cea46f71461d8d401a420e946f999862c1

What do you think?

> Job failed due to executor failures all available nodes are blacklisted
> ---
>
> Key: SPARK-29683
> URL: https://issues.apache.org/jira/browse/SPARK-29683
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 3.0.0
>Reporter: Genmao Yu
>Priority: Major
>
> My streaming job fails with the error *due to executor failures all 
> available nodes are blacklisted*. This exception is thrown only when all 
> nodes are blacklisted:
> {code:java}
> def isAllNodeBlacklisted: Boolean = currentBlacklistedYarnNodes.size >= 
> numClusterNodes
> val allBlacklistedNodes = excludeNodes ++ schedulerBlacklist ++ 
> allocatorBlacklist.keySet
> {code}
> After diving into the code, I found some critical conditions that are not 
> handled properly:
>  - unchecked `excludeNodes`: it comes from user config. If not set properly, 
> it may lead to "currentBlacklistedYarnNodes.size >= numClusterNodes". For 
> example, we may list nodes that are not in the YARN cluster at all (see the 
> sketch after this list):
> {code:java}
> excludeNodes = (invalid1, invalid2, invalid3)
> clusterNodes = (valid1, valid2)
> {code}
>  - `numClusterNodes` may equal 0: during a YARN HA failover, it takes some 
> time for all NodeManagers to register with the ResourceManager again. In 
> this window, `numClusterNodes` may equal 0 or some other small number, and 
> the Spark driver fails.
>  - too strong a condition check: the Spark driver fails as soon as 
> "currentBlacklistedYarnNodes.size >= numClusterNodes", but this condition 
> does not necessarily indicate an unrecoverable fatal error. For example, 
> some NodeManagers may simply be restarting, so we could allow some waiting 
> time before failing the job.
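
A small sketch of why unchecked `excludeNodes` can trip the all-blacklisted 
check (illustrative Python, not Spark's Scala code):

{code:python}
# Nodes actually registered with YARN.
cluster_nodes = {"valid1", "valid2"}

# User-configured excludes that do not exist in the cluster.
exclude_nodes = {"invalid1", "invalid2", "invalid3"}
scheduler_blacklist = set()
allocator_blacklist = set()

all_blacklisted = exclude_nodes | scheduler_blacklist | allocator_blacklist

# 3 >= 2, so the driver fails even though both real nodes are healthy.
is_all_nodes_blacklisted = len(all_blacklisted) >= len(cluster_nodes)
print(is_all_nodes_blacklisted)  # True
{code}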



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40933) Reimplement df.stat.{cov, corr} with built-in sql functions

2022-10-31 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng updated SPARK-40933:
--
Summary: Reimplement df.stat.{cov, corr} with built-in sql functions  (was: 
Make df.stat.{cov, corr} consistent with sql functions)

> Reimplement df.stat.{cov, corr} with built-in sql functions
> ---
>
> Key: SPARK-40933
> URL: https://issues.apache.org/jira/browse/SPARK-40933
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40827) Re-enable the DataFrame.corrwith test after fixing in future pandas.

2022-10-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40827:


Assignee: Apache Spark

> Re-enable the DataFrame.corrwith test after fixing in future pandas.
> 
>
> Key: SPARK-40827
> URL: https://issues.apache.org/jira/browse/SPARK-40827
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Apache Spark
>Priority: Major
>
> We should re-enable the skipped test that commented with "Regression in 
> pandas 1.5.0" after the behavior is fixed in future pandas.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


