[jira] [Updated] (SPARK-40355) Improve pushdown for orc & parquet when cast scenario

2022-09-06 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-40355:

Description: 
Table Schema:
| a| b| part|
| int|string  | string|

 

SQL:
select * from table where b = 1 and part = 1

A. The partition column 'part' has been pushed down, but column 'b' has not been 
pushed down.
!https://user-images.githubusercontent.com/15246973/188630971-4eed59c0-d134-4994-a7aa-0d057a49c5dc.png!

B. After applying the PR, both the partition column 'part' and column 'b' have been 
pushed down.
!https://user-images.githubusercontent.com/15246973/188631231-d30db19d-1d70-4712-bf75-bccaa2c790f4.png!
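A minimal way to reproduce the behavior (a sketch, not taken from the PR; the table and 
column names are made up for illustration):
{code:scala}
// The predicate on 'b' compares a string column with an int literal, so it involves an
// implicit cast; per the description above, such predicates were not pushed down to the
// Parquet/ORC scan before this change, while the partition predicate was.
sql("create table t(a int, b string, part string) using parquet partitioned by (part)")
sql("select * from t where b = 1 and part = 1").explain(true)
// Compare the PushedFilters / PartitionFilters of the FileScan node with and without the PR.
{code}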

  was:
Table Schema:
| a| b| part| |
| int|string  | string| |

 

SQL:
select * from table where b = 1 and part = 1

A.Partition column 'part' has been pushed down, but column 'b' not push down.
!https://user-images.githubusercontent.com/15246973/188630971-4eed59c0-d134-4994-a7aa-0d057a49c5dc.png!

B.After apply the pr, Partition column 'part' and column 'b' has been pushed 
down.
!https://user-images.githubusercontent.com/15246973/188631231-d30db19d-1d70-4712-bf75-bccaa2c790f4.png!


> Improve pushdown for orc & parquet when cast scenario
> -
>
> Key: SPARK-40355
> URL: https://issues.apache.org/jira/browse/SPARK-40355
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Minor
> Fix For: 3.4.0
>
>
> Table Schema:
> | a| b| part|
> | int|string  | string|
>  
> SQL:
> select * from table where b = 1 and part = 1
> A. The partition column 'part' has been pushed down, but column 'b' has not been 
> pushed down.
> !https://user-images.githubusercontent.com/15246973/188630971-4eed59c0-d134-4994-a7aa-0d057a49c5dc.png!
> B. After applying the PR, both the partition column 'part' and column 'b' have been 
> pushed down.
> !https://user-images.githubusercontent.com/15246973/188631231-d30db19d-1d70-4712-bf75-bccaa2c790f4.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40365) Bump ANTLR runtime version from 4.8 to 4.11.1

2022-09-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17601139#comment-17601139
 ] 

Apache Spark commented on SPARK-40365:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37814

> Bump ANTLR runtime version from 4.8 to 4.11.1
> -
>
> Key: SPARK-40365
> URL: https://issues.apache.org/jira/browse/SPARK-40365
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Minor
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40365) Bump ANTLR runtime version from 4.8 to 4.11.1

2022-09-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40365:


Assignee: (was: Apache Spark)

> Bump ANTLR runtime version from 4.8 to 4.11.1
> -
>
> Key: SPARK-40365
> URL: https://issues.apache.org/jira/browse/SPARK-40365
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Minor
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40365) Bump ANTLR runtime version from 4.8 to 4.11.1

2022-09-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17601138#comment-17601138
 ] 

Apache Spark commented on SPARK-40365:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37814

> Bump ANTLR runtime version from 4.8 to 4.11.1
> -
>
> Key: SPARK-40365
> URL: https://issues.apache.org/jira/browse/SPARK-40365
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Minor
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40365) Bump ANTLR runtime version from 4.8 to 4.11.1

2022-09-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40365:


Assignee: Apache Spark

> Bump ANTLR runtime version from 4.8 to 4.11.1
> -
>
> Key: SPARK-40365
> URL: https://issues.apache.org/jira/browse/SPARK-40365
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40365) Bump ANTLR runtime version from 4.8 to 4.11.1

2022-09-06 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-40365:
---

 Summary: Bump ANTLR runtime version from 4.8 to 4.11.1
 Key: SPARK-40365
 URL: https://issues.apache.org/jira/browse/SPARK-40365
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.4.0
Reporter: BingKun Pan
 Fix For: 3.4.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40355) Improve pushdown for orc & parquet when cast scenario

2022-09-06 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-40355:

Description: 
Table Schema:
| a| b| part| |
| int|string  | string| |

 

SQL:
select * from table where b = 1 and part = 1

A. The partition column 'part' has been pushed down, but column 'b' has not been 
pushed down.
!https://user-images.githubusercontent.com/15246973/188630971-4eed59c0-d134-4994-a7aa-0d057a49c5dc.png!

B. After applying the PR, both the partition column 'part' and column 'b' have been 
pushed down.
!https://user-images.githubusercontent.com/15246973/188631231-d30db19d-1d70-4712-bf75-bccaa2c790f4.png!

  was:
Table Schema:
| a| b| part| |
| int|string  | string| |

 

SQL:
select * from table where b = 1 and part = 1

A.Partition column 'part' has been pushed down, but column 'b' no push down.
!https://user-images.githubusercontent.com/15246973/188630971-4eed59c0-d134-4994-a7aa-0d057a49c5dc.png!

B.After apply the pr, Partition column 'part' and column 'b' has been pushed 
down.
!https://user-images.githubusercontent.com/15246973/188631231-d30db19d-1d70-4712-bf75-bccaa2c790f4.png!


> Improve pushdown for orc & parquet when cast scenario
> -
>
> Key: SPARK-40355
> URL: https://issues.apache.org/jira/browse/SPARK-40355
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Minor
> Fix For: 3.4.0
>
>
> Table Schema:
> | a| b| part| |
> | int|string  | string| |
>  
> SQL:
> select * from table where b = 1 and part = 1
> A. The partition column 'part' has been pushed down, but column 'b' has not been 
> pushed down.
> !https://user-images.githubusercontent.com/15246973/188630971-4eed59c0-d134-4994-a7aa-0d057a49c5dc.png!
> B. After applying the PR, both the partition column 'part' and column 'b' have been 
> pushed down.
> !https://user-images.githubusercontent.com/15246973/188631231-d30db19d-1d70-4712-bf75-bccaa2c790f4.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40364) Unify `initDB` method in `DBProvider`

2022-09-06 Thread Yang Jie (Jira)
Yang Jie created SPARK-40364:


 Summary: Unify `initDB` method in `DBProvider`
 Key: SPARK-40364
 URL: https://issues.apache.org/jira/browse/SPARK-40364
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, Tests, YARN
Affects Versions: 3.4.0
Reporter: Yang Jie


There are two `initDB` methods in `DBProvider`:
 # DB initDB(DBBackend dbBackend, File dbFile, StoreVersion version, ObjectMapper mapper)
 # DB initDB(DBBackend dbBackend, File file)

The second one is only used by the test helper ShuffleTestAccessor, so we can explore 
whether we can keep only the first method.
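A rough sketch of the direction (local stand-in types only, mirroring the signatures 
quoted above; this is not the actual Spark code):
{code:scala}
import java.io.File

// Stand-ins for the real classes, just to make the shape of the change concrete.
trait DB
class DBBackend
class StoreVersion(val major: Int, val minor: Int)
class ObjectMapper

object DBProviderSketch {
  // Keep only this overload.
  def initDB(dbBackend: DBBackend, dbFile: File, version: StoreVersion,
      mapper: ObjectMapper): DB = new DB {}

  // The former initDB(dbBackend, file) overload goes away; the single test caller
  // (ShuffleTestAccessor) would supply the StoreVersion and ObjectMapper itself instead.
  def exampleTestCall(backend: DBBackend, file: File): DB =
    initDB(backend, file, new StoreVersion(1, 0), new ObjectMapper)
}
{code}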

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40364) Unify `initDB` method in `DBProvider`

2022-09-06 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-40364:
-
Description: 
There are two `initDB` methods in `DBProvider`:
 # DB initDB(DBBackend dbBackend, File dbFile, StoreVersion version, ObjectMapper mapper)
 # DB initDB(DBBackend dbBackend, File file)

The second one is only used by the test helper ShuffleTestAccessor, so we can explore 
whether we can keep only the first method.

 

  was:
There are 2 `initDB` in `DBProvider`
 # `DB initDB(DBBackend dbBackend, File dbFile, StoreVersion version, 
ObjectMapper mapper)
 # DB initDB(DBBackend dbBackend, File file)

and the second one only used by test ShuffleTestAccessor, we can explore 
whether can keep only the first method

 


> Unify `initDB` method in `DBProvider`
> -
>
> Key: SPARK-40364
> URL: https://issues.apache.org/jira/browse/SPARK-40364
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Tests, YARN
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> There are two `initDB` methods in `DBProvider`:
>  # DB initDB(DBBackend dbBackend, File dbFile, StoreVersion version, ObjectMapper mapper)
>  # DB initDB(DBBackend dbBackend, File file)
> The second one is only used by the test helper ShuffleTestAccessor, so we can explore 
> whether we can keep only the first method.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40363) Add SQL misc function to assert/check column value

2022-09-06 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-40363.
--
Resolution: Won't Fix

> Add SQL misc function to assert/check column value
> --
>
> Key: SPARK-40363
> URL: https://issues.apache.org/jira/browse/SPARK-40363
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Rafal Wojdyla
>Priority: Major
>
> SQL function that allows to assert a condition on a column that:
> * fails when condition is not met
> * returns original value otherwise
> Related: SPARK-32793
> But {{assert_true}} and {{raise_error}} do not really cut it. In case of 
> {{assert_true}} you have to actually collect the empty column, and the check 
> might no happen if you drop the assertion column, which you will likely do 
> since it's empty. Having a function that returns some value as part of the 
> check, in most cases it would be the checked column would be handy.
> I'm working with pyspark, so here's python implementation:
> {code:python}
> @overload
> def assert_col_condition(
> col: Union[str, Column],
> cond: Callable[[Column], Column],
> error_msg: Optional[str] = None,
> ) -> Column:
> """Asserts condition on a column, IFF it holds returns the original value 
> under `col`"""
> ...
> @overload
> def assert_col_condition(
> col: Union[str, Column], cond: Column, error_msg: Optional[str] = None
> ) -> Column:
> """Asserts condition on a column, IFF it holds returns the original value 
> under `col`"""
> ...
> def assert_col_condition(
> col: Union[str, Column],
> cond: Union[Column, Callable[[Column], Column]],
> error_msg: Optional[str] = None,
> ) -> Column:
> col = str_to_col(col)
> if not isinstance(cond, Column):
> cond = cond(col)
> return F.when(
> ~cond, F.raise_error(error_msg or f"Assertion failed: {cond}")
> ).otherwise(col)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40363) Add SQL misc function to assert/check column value

2022-09-06 Thread Rafal Wojdyla (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17601090#comment-17601090
 ] 

Rafal Wojdyla commented on SPARK-40363:
---

Hey [~hyukjin.kwon], yeah, the implementation is simple, which is great. We find 
this useful; if you don't think this fits into Spark, feel free to close this 
issue. Otherwise I'm happy to PR it.

> Add SQL misc function to assert/check column value
> --
>
> Key: SPARK-40363
> URL: https://issues.apache.org/jira/browse/SPARK-40363
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Rafal Wojdyla
>Priority: Major
>
> SQL function that allows to assert a condition on a column that:
> * fails when condition is not met
> * returns original value otherwise
> Related: SPARK-32793
> But {{assert_true}} and {{raise_error}} do not really cut it. In case of 
> {{assert_true}} you have to actually collect the empty column, and the check 
> might no happen if you drop the assertion column, which you will likely do 
> since it's empty. Having a function that returns some value as part of the 
> check, in most cases it would be the checked column would be handy.
> I'm working with pyspark, so here's python implementation:
> {code:python}
> @overload
> def assert_col_condition(
> col: Union[str, Column],
> cond: Callable[[Column], Column],
> error_msg: Optional[str] = None,
> ) -> Column:
> """Asserts condition on a column, IFF it holds returns the original value 
> under `col`"""
> ...
> @overload
> def assert_col_condition(
> col: Union[str, Column], cond: Column, error_msg: Optional[str] = None
> ) -> Column:
> """Asserts condition on a column, IFF it holds returns the original value 
> under `col`"""
> ...
> def assert_col_condition(
> col: Union[str, Column],
> cond: Union[Column, Callable[[Column], Column]],
> error_msg: Optional[str] = None,
> ) -> Column:
> col = str_to_col(col)
> if not isinstance(cond, Column):
> cond = cond(col)
> return F.when(
> ~cond, F.raise_error(error_msg or f"Assertion failed: {cond}")
> ).otherwise(col)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40356) Upgrade pandas to 1.4.4

2022-09-06 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-40356:


Assignee: Yikun Jiang

> Upgrade pandas to 1.4.4
> ---
>
> Key: SPARK-40356
> URL: https://issues.apache.org/jira/browse/SPARK-40356
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40356) Upgrade pandas to 1.4.4

2022-09-06 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-40356.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37810
[https://github.com/apache/spark/pull/37810]

> Upgrade pandas to 1.4.4
> ---
>
> Key: SPARK-40356
> URL: https://issues.apache.org/jira/browse/SPARK-40356
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40228) Don't simplify multiLike if child is not attribute

2022-09-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17601078#comment-17601078
 ] 

Apache Spark commented on SPARK-40228:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/37813

> Don't simplify multiLike if child is not attribute
> --
>
> Key: SPARK-40228
> URL: https://issues.apache.org/jira/browse/SPARK-40228
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.4.0
>
>
> {code:scala}
> sql("create table t1(name string) using parquet")
> sql("select * from t1 where substr(name, 1, 5) like any('%a', 'b%', 
> '%c%')").explain(true)
> {code}
> {noformat}
> == Physical Plan ==
> *(1) Filter ((EndsWith(substr(name#0, 1, 5), a) OR StartsWith(substr(name#0, 
> 1, 5), b)) OR Contains(substr(name#0, 1, 5), c))
> +- *(1) ColumnarToRow
>+- FileScan parquet default.t1[name#0] Batched: true, DataFilters: 
> [((EndsWith(substr(name#0, 1, 5), a) OR StartsWith(substr(name#0, 1, 5), b)) 
> OR Contains(substr(n..., Format: Parquet, PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40228) Don't simplify multiLike if child is not attribute

2022-09-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17601077#comment-17601077
 ] 

Apache Spark commented on SPARK-40228:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/37813

> Don't simplify multiLike if child is not attribute
> --
>
> Key: SPARK-40228
> URL: https://issues.apache.org/jira/browse/SPARK-40228
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.4.0
>
>
> {code:scala}
> sql("create table t1(name string) using parquet")
> sql("select * from t1 where substr(name, 1, 5) like any('%a', 'b%', 
> '%c%')").explain(true)
> {code}
> {noformat}
> == Physical Plan ==
> *(1) Filter ((EndsWith(substr(name#0, 1, 5), a) OR StartsWith(substr(name#0, 
> 1, 5), b)) OR Contains(substr(name#0, 1, 5), c))
> +- *(1) ColumnarToRow
>+- FileScan parquet default.t1[name#0] Batched: true, DataFilters: 
> [((EndsWith(substr(name#0, 1, 5), a) OR StartsWith(substr(name#0, 1, 5), b)) 
> OR Contains(substr(n..., Format: Parquet, PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40363) Add SQL misc function to assert/check column value

2022-09-06 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17601074#comment-17601074
 ] 

Hyukjin Kwon commented on SPARK-40363:
--

So basically this is:

{code}
F.when(~cond, F.raise_error("...")).otherwise(col)
{code}

which can be very easily worked around. I doubt that we need this. Do other 
Python libraries or other DBMSes have such a feature?

> Add SQL misc function to assert/check column value
> --
>
> Key: SPARK-40363
> URL: https://issues.apache.org/jira/browse/SPARK-40363
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Rafal Wojdyla
>Priority: Major
>
> SQL function that allows to assert a condition on a column that:
> * fails when condition is not met
> * returns original value otherwise
> Related: SPARK-32793
> But {{assert_true}} and {{raise_error}} do not really cut it. In case of 
> {{assert_true}} you have to actually collect the empty column, and the check 
> might no happen if you drop the assertion column, which you will likely do 
> since it's empty. Having a function that returns some value as part of the 
> check, in most cases it would be the checked column would be handy.
> I'm working with pyspark, so here's python implementation:
> {code:python}
> @overload
> def assert_col_condition(
> col: Union[str, Column],
> cond: Callable[[Column], Column],
> error_msg: Optional[str] = None,
> ) -> Column:
> """Asserts condition on a column, IFF it holds returns the original value 
> under `col`"""
> ...
> @overload
> def assert_col_condition(
> col: Union[str, Column], cond: Column, error_msg: Optional[str] = None
> ) -> Column:
> """Asserts condition on a column, IFF it holds returns the original value 
> under `col`"""
> ...
> def assert_col_condition(
> col: Union[str, Column],
> cond: Union[Column, Callable[[Column], Column]],
> error_msg: Optional[str] = None,
> ) -> Column:
> col = str_to_col(col)
> if not isinstance(cond, Column):
> cond = cond(col)
> return F.when(
> ~cond, F.raise_error(error_msg or f"Assertion failed: {cond}")
> ).otherwise(col)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40362) Bug in Canonicalization of expressions like Add & Multiply i.e Commutative Operators

2022-09-06 Thread Asif (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17601067#comment-17601067
 ] 

Asif commented on SPARK-40362:
--

I will be generating a PR tomorrow.

> Bug in Canonicalization of expressions like Add & Multiply i.e Commutative 
> Operators
> 
>
> Key: SPARK-40362
> URL: https://issues.apache.org/jira/browse/SPARK-40362
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Asif
>Priority: Major
>  Labels: spark-sql
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> In the canonicalization code which is now in two stages, canonicalization 
> involving Commutative operators is broken, if they are subexpressions of 
> certain type of expressions which override precanonicalize.
> Consider following expression:
> a + b > 10
> This GreaterThan expression when canonicalized as a whole for first time,  
> will skip the call to Canonicalize.reorderCommutativeOperators for the Add 
> expression as the GreaterThan's canonicalization used precanonicalize on 
> children ( the Add expression).
>  
> so if create a new expression 
> b + a > 10 and invoke canonicalize it, the canonicalized versions of these 
> two expressions will not match.
> The test 
> {quote}
> test("bug X") {
>     val tr1 = LocalRelation('c.int, 'b.string, 'a.int)
>    val y = tr1.where('a.attr + 'c.attr > 10).analyze
>    val fullCond = y.asInstanceOf[Filter].condition.clone()
>   val addExpr = (fullCond match {
>     case GreaterThan(x: Add, _) => x
>     case LessThan(_, x: Add) => x
>   }).clone().asInstanceOf[Add]
> val canonicalizedFullCond = fullCond.canonicalized
> // swap the operands of add
> val newAddExpr = Add(addExpr.right, addExpr.left)
> // build a new condition which is same as the previous one, but with operands 
> of //Add reversed 
> val builtCondnCanonicalized = GreaterThan(newAddExpr, 
> Literal(10)).canonicalized
> assertEquals(canonicalizedFullCond, builtCondnCanonicalized)
> }
> {quote}
> This test fails.
> The fix which I propose is that for the commutative expressions,  the 
> precanonicalize should be overridden and   
> Canonicalize.reorderCommutativeOperators be invoked on the expression instead 
> of at place of canonicalize.  effectively for commutative operands ( add, or 
> , multiply , and etc)  canonicalize and precanonicalize should be same.
> I will file a PR for it



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-40362) Bug in Canonicalization of expressions like Add & Multiply i.e Commutative Operators

2022-09-06 Thread Asif (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17601066#comment-17601066
 ] 

Asif edited comment on SPARK-40362 at 9/7/22 12:43 AM:
---

I have locally added tests for &&, ||, &, | and *, and they all fail for the same reason.

I am in the process of adding tests for xor, greatest & least, and of testing the fix for 
them.

The fix is to make these expressions implement a trait:

 

trait CommutativeExpresionCanonicalization {
  this: Expression =>
  override lazy val canonicalized: Expression = preCanonicalized
  override lazy val preCanonicalized: Expression = {
    val canonicalizedChildren = children.map(_.preCanonicalized)
    Canonicalize.reorderCommutativeOperators {
      withNewChildren(canonicalizedChildren)
    }
  }
}


was (Author: ashahid7):
I have added locally test for 

&&, ||, &, | and *  and they all fail for the same reason..

in process of adding tests for xor, greatest & least and testing the fix for 
the same...

 

the fix is to make these expressions implement a trait

 
{quote}trait CommutativeExpresionCanonicalization {
this: Expression =>
override lazy val canonicalized: Expression = preCanonicalized
override lazy val preCanonicalized: Expression = {
val canonicalizedChildren = children.map(_.preCanonicalized)
Canonicalize.reorderCommutativeOperators {
withNewChildren(canonicalizedChildren)
}
}
}{quote}

> Bug in Canonicalization of expressions like Add & Multiply i.e Commutative 
> Operators
> 
>
> Key: SPARK-40362
> URL: https://issues.apache.org/jira/browse/SPARK-40362
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Asif
>Priority: Major
>  Labels: spark-sql
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> In the canonicalization code which is now in two stages, canonicalization 
> involving Commutative operators is broken, if they are subexpressions of 
> certain type of expressions which override precanonicalize.
> Consider following expression:
> a + b > 10
> This GreaterThan expression when canonicalized as a whole for first time,  
> will skip the call to Canonicalize.reorderCommutativeOperators for the Add 
> expression as the GreaterThan's canonicalization used precanonicalize on 
> children ( the Add expression).
>  
> so if create a new expression 
> b + a > 10 and invoke canonicalize it, the canonicalized versions of these 
> two expressions will not match.
> The test 
> {quote}
> test("bug X") {
>     val tr1 = LocalRelation('c.int, 'b.string, 'a.int)
>    val y = tr1.where('a.attr + 'c.attr > 10).analyze
>    val fullCond = y.asInstanceOf[Filter].condition.clone()
>   val addExpr = (fullCond match {
>     case GreaterThan(x: Add, _) => x
>     case LessThan(_, x: Add) => x
>   }).clone().asInstanceOf[Add]
> val canonicalizedFullCond = fullCond.canonicalized
> // swap the operands of add
> val newAddExpr = Add(addExpr.right, addExpr.left)
> // build a new condition which is same as the previous one, but with operands 
> of //Add reversed 
> val builtCondnCanonicalized = GreaterThan(newAddExpr, 
> Literal(10)).canonicalized
> assertEquals(canonicalizedFullCond, builtCondnCanonicalized)
> }
> {quote}
> This test fails.
> The fix which I propose is that for the commutative expressions,  the 
> precanonicalize should be overridden and   
> Canonicalize.reorderCommutativeOperators be invoked on the expression instead 
> of at place of canonicalize.  effectively for commutative operands ( add, or 
> , multiply , and etc)  canonicalize and precanonicalize should be same.
> I will file a PR for it



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-40362) Bug in Canonicalization of expressions like Add & Multiply i.e Commutative Operators

2022-09-06 Thread Asif (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17601066#comment-17601066
 ] 

Asif edited comment on SPARK-40362 at 9/7/22 12:42 AM:
---

I have added locally test for 

&&, ||, &, | and *  and they all fail for the same reason..

in process of adding tests for xor, greatest & least and testing the fix for 
the same...

 

the fix is to make these expressions implement a trait

 
{quote}trait CommutativeExpresionCanonicalization {
this: Expression =>
override lazy val canonicalized: Expression = preCanonicalized
override lazy val preCanonicalized: Expression = {
val canonicalizedChildren = children.map(_.preCanonicalized)
Canonicalize.reorderCommutativeOperators {
withNewChildren(canonicalizedChildren)
}
}
}{quote}


was (Author: ashahid7):
I have added locally test for 

&&, ||, &, | and *  and they all fail for the same reason..

in process of adding tests for xor, greatest & least and testing the fix for 
the same...

 

the fix is to make these expressions implement a trait

 
{quote}trait CommutativeExpresionCanonicalization {
  this: Expression =>
  override lazy val canonicalized: Expression = preCanonicalized
  override lazy val preCanonicalized: Expression = {
    val canonicalizedChildren = children.map(_.preCanonicalized)
    Canonicalize.reorderCommutativeOperators {
     withNewChildren(canonicalizedChildren)
    }
  }
}{quote}

> Bug in Canonicalization of expressions like Add & Multiply i.e Commutative 
> Operators
> 
>
> Key: SPARK-40362
> URL: https://issues.apache.org/jira/browse/SPARK-40362
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Asif
>Priority: Major
>  Labels: spark-sql
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> In the canonicalization code which is now in two stages, canonicalization 
> involving Commutative operators is broken, if they are subexpressions of 
> certain type of expressions which override precanonicalize.
> Consider following expression:
> a + b > 10
> This GreaterThan expression when canonicalized as a whole for first time,  
> will skip the call to Canonicalize.reorderCommutativeOperators for the Add 
> expression as the GreaterThan's canonicalization used precanonicalize on 
> children ( the Add expression).
>  
> so if create a new expression 
> b + a > 10 and invoke canonicalize it, the canonicalized versions of these 
> two expressions will not match.
> The test 
> {quote}
> test("bug X") {
>     val tr1 = LocalRelation('c.int, 'b.string, 'a.int)
>    val y = tr1.where('a.attr + 'c.attr > 10).analyze
>    val fullCond = y.asInstanceOf[Filter].condition.clone()
>   val addExpr = (fullCond match {
>     case GreaterThan(x: Add, _) => x
>     case LessThan(_, x: Add) => x
>   }).clone().asInstanceOf[Add]
> val canonicalizedFullCond = fullCond.canonicalized
> // swap the operands of add
> val newAddExpr = Add(addExpr.right, addExpr.left)
> // build a new condition which is same as the previous one, but with operands 
> of //Add reversed 
> val builtCondnCanonicalized = GreaterThan(newAddExpr, 
> Literal(10)).canonicalized
> assertEquals(canonicalizedFullCond, builtCondnCanonicalized)
> }
> {quote}
> This test fails.
> The fix which I propose is that for the commutative expressions,  the 
> precanonicalize should be overridden and   
> Canonicalize.reorderCommutativeOperators be invoked on the expression instead 
> of at place of canonicalize.  effectively for commutative operands ( add, or 
> , multiply , and etc)  canonicalize and precanonicalize should be same.
> I will file a PR for it



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-40362) Bug in Canonicalization of expressions like Add & Multiply i.e Commutative Operators

2022-09-06 Thread Asif (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17601066#comment-17601066
 ] 

Asif edited comment on SPARK-40362 at 9/7/22 12:42 AM:
---

I have added locally test for 

&&, ||, &, | and *  and they all fail for the same reason..

in process of adding tests for xor, greatest & least and testing the fix for 
the same...

 

the fix is to make these expressions implement a trait

 
{quote}trait CommutativeExpresionCanonicalization {
  this: Expression =>
  override lazy val canonicalized: Expression = preCanonicalized
  override lazy val preCanonicalized: Expression = {
    val canonicalizedChildren = children.map(_.preCanonicalized)
    Canonicalize.reorderCommutativeOperators {
     withNewChildren(canonicalizedChildren)
    }
  }
}{quote}


was (Author: ashahid7):
I have added locally test for 

&&, ||, &, | and *  and they all fail for the same reason..

in process of adding tests for xor, greatest & least and testing the fix for 
the same...

 

the fix is to make these expressions implement a trait

 
{quote}trait CommutativeExpresionCanonicalization {
  this: Expression =>
   override lazy val canonicalized: Expression = preCanonicalized
   override lazy val preCanonicalized: Expression = {
      val canonicalizedChildren = children.map(_.preCanonicalized)
      Canonicalize.reorderCommutativeOperators {
            withNewChildren(canonicalizedChildren)
     }
  }
}
{quote}

> Bug in Canonicalization of expressions like Add & Multiply i.e Commutative 
> Operators
> 
>
> Key: SPARK-40362
> URL: https://issues.apache.org/jira/browse/SPARK-40362
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Asif
>Priority: Major
>  Labels: spark-sql
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> In the canonicalization code which is now in two stages, canonicalization 
> involving Commutative operators is broken, if they are subexpressions of 
> certain type of expressions which override precanonicalize.
> Consider following expression:
> a + b > 10
> This GreaterThan expression when canonicalized as a whole for first time,  
> will skip the call to Canonicalize.reorderCommutativeOperators for the Add 
> expression as the GreaterThan's canonicalization used precanonicalize on 
> children ( the Add expression).
>  
> so if create a new expression 
> b + a > 10 and invoke canonicalize it, the canonicalized versions of these 
> two expressions will not match.
> The test 
> {quote}
> test("bug X") {
>     val tr1 = LocalRelation('c.int, 'b.string, 'a.int)
>    val y = tr1.where('a.attr + 'c.attr > 10).analyze
>    val fullCond = y.asInstanceOf[Filter].condition.clone()
>   val addExpr = (fullCond match {
>     case GreaterThan(x: Add, _) => x
>     case LessThan(_, x: Add) => x
>   }).clone().asInstanceOf[Add]
> val canonicalizedFullCond = fullCond.canonicalized
> // swap the operands of add
> val newAddExpr = Add(addExpr.right, addExpr.left)
> // build a new condition which is same as the previous one, but with operands 
> of //Add reversed 
> val builtCondnCanonicalized = GreaterThan(newAddExpr, 
> Literal(10)).canonicalized
> assertEquals(canonicalizedFullCond, builtCondnCanonicalized)
> }
> {quote}
> This test fails.
> The fix which I propose is that for the commutative expressions,  the 
> precanonicalize should be overridden and   
> Canonicalize.reorderCommutativeOperators be invoked on the expression instead 
> of at place of canonicalize.  effectively for commutative operands ( add, or 
> , multiply , and etc)  canonicalize and precanonicalize should be same.
> I will file a PR for it



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40362) Bug in Canonicalization of expressions like Add & Multiply i.e Commutative Operators

2022-09-06 Thread Asif (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17601066#comment-17601066
 ] 

Asif commented on SPARK-40362:
--

I have added locally test for 

&&, ||, &, | and *  and they all fail for the same reason..

in process of adding tests for xor, greatest & least and testing the fix for 
the same...

 

the fix is to make these expressions implement a trait

 
{quote}trait CommutativeExpresionCanonicalization {
  this: Expression =>
   override lazy val canonicalized: Expression = preCanonicalized
   override lazy val preCanonicalized: Expression = {
      val canonicalizedChildren = children.map(_.preCanonicalized)
      Canonicalize.reorderCommutativeOperators {
            withNewChildren(canonicalizedChildren)
     }
  }
}
{quote}

> Bug in Canonicalization of expressions like Add & Multiply i.e Commutative 
> Operators
> 
>
> Key: SPARK-40362
> URL: https://issues.apache.org/jira/browse/SPARK-40362
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Asif
>Priority: Major
>  Labels: spark-sql
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> In the canonicalization code which is now in two stages, canonicalization 
> involving Commutative operators is broken, if they are subexpressions of 
> certain type of expressions which override precanonicalize.
> Consider following expression:
> a + b > 10
> This GreaterThan expression when canonicalized as a whole for first time,  
> will skip the call to Canonicalize.reorderCommutativeOperators for the Add 
> expression as the GreaterThan's canonicalization used precanonicalize on 
> children ( the Add expression).
>  
> so if create a new expression 
> b + a > 10 and invoke canonicalize it, the canonicalized versions of these 
> two expressions will not match.
> The test 
> {quote}
> test("bug X") {
>     val tr1 = LocalRelation('c.int, 'b.string, 'a.int)
>    val y = tr1.where('a.attr + 'c.attr > 10).analyze
>    val fullCond = y.asInstanceOf[Filter].condition.clone()
>   val addExpr = (fullCond match {
>     case GreaterThan(x: Add, _) => x
>     case LessThan(_, x: Add) => x
>   }).clone().asInstanceOf[Add]
> val canonicalizedFullCond = fullCond.canonicalized
> // swap the operands of add
> val newAddExpr = Add(addExpr.right, addExpr.left)
> // build a new condition which is same as the previous one, but with operands 
> of //Add reversed 
> val builtCondnCanonicalized = GreaterThan(newAddExpr, 
> Literal(10)).canonicalized
> assertEquals(canonicalizedFullCond, builtCondnCanonicalized)
> }
> {quote}
> This test fails.
> The fix which I propose is that for the commutative expressions,  the 
> precanonicalize should be overridden and   
> Canonicalize.reorderCommutativeOperators be invoked on the expression instead 
> of at place of canonicalize.  effectively for commutative operands ( add, or 
> , multiply , and etc)  canonicalize and precanonicalize should be same.
> I will file a PR for it



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33605) Add gcs-connector to hadoop-cloud module

2022-09-06 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33605:
--
Affects Version/s: 3.4.0
   (was: 3.0.1)

> Add gcs-connector to hadoop-cloud module
> 
>
> Key: SPARK-33605
> URL: https://issues.apache.org/jira/browse/SPARK-33605
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 3.4.0
>Reporter: Rafal Wojdyla
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.4.0
>
>
> Spark comes with some S3 batteries included, which makes it easier to use 
> with S3, for GCS to work users are required to manually configure the jars. 
> This is especially problematic for python users who may not be accustomed to 
> java dependencies etc. This is an example of workaround for pyspark: 
> [pyspark_gcs|https://github.com/ravwojdyla/pyspark_gcs]. If we include the 
> [GCS 
> connector|https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage],
>  it would make things easier for GCS users.
> Please let me know what you think.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33605) Add gcs-connector to hadoop-cloud module

2022-09-06 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33605:
--
Component/s: Build
 (was: PySpark)
 (was: Spark Core)

> Add gcs-connector to hadoop-cloud module
> 
>
> Key: SPARK-33605
> URL: https://issues.apache.org/jira/browse/SPARK-33605
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Rafal Wojdyla
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.4.0
>
>
> Spark comes with some S3 batteries included, which makes it easier to use 
> with S3, for GCS to work users are required to manually configure the jars. 
> This is especially problematic for python users who may not be accustomed to 
> java dependencies etc. This is an example of workaround for pyspark: 
> [pyspark_gcs|https://github.com/ravwojdyla/pyspark_gcs]. If we include the 
> [GCS 
> connector|https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage],
>  it would make things easier for GCS users.
> Please let me know what you think.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33605) Add GCS FS/connector config (dependencies?) akin to S3

2022-09-06 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-33605:
-

Assignee: Dongjoon Hyun

> Add GCS FS/connector config (dependencies?) akin to S3
> --
>
> Key: SPARK-33605
> URL: https://issues.apache.org/jira/browse/SPARK-33605
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 3.0.1
>Reporter: Rafal Wojdyla
>Assignee: Dongjoon Hyun
>Priority: Major
>
> Spark comes with some S3 batteries included, which makes it easier to use 
> with S3, for GCS to work users are required to manually configure the jars. 
> This is especially problematic for python users who may not be accustomed to 
> java dependencies etc. This is an example of workaround for pyspark: 
> [pyspark_gcs|https://github.com/ravwojdyla/pyspark_gcs]. If we include the 
> [GCS 
> connector|https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage],
>  it would make things easier for GCS users.
> Please let me know what you think.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33605) Add GCS FS/connector config (dependencies?) akin to S3

2022-09-06 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-33605.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37745
[https://github.com/apache/spark/pull/37745]

> Add GCS FS/connector config (dependencies?) akin to S3
> --
>
> Key: SPARK-33605
> URL: https://issues.apache.org/jira/browse/SPARK-33605
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 3.0.1
>Reporter: Rafal Wojdyla
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.4.0
>
>
> Spark comes with some S3 batteries included, which makes it easier to use 
> with S3, for GCS to work users are required to manually configure the jars. 
> This is especially problematic for python users who may not be accustomed to 
> java dependencies etc. This is an example of workaround for pyspark: 
> [pyspark_gcs|https://github.com/ravwojdyla/pyspark_gcs]. If we include the 
> [GCS 
> connector|https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage],
>  it would make things easier for GCS users.
> Please let me know what you think.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33605) Add gcs-connector to hadoop-cloud module

2022-09-06 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33605:
--
Summary: Add gcs-connector to hadoop-cloud module  (was: Add GCS 
FS/connector config (dependencies?) akin to S3)

> Add gcs-connector to hadoop-cloud module
> 
>
> Key: SPARK-33605
> URL: https://issues.apache.org/jira/browse/SPARK-33605
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 3.0.1
>Reporter: Rafal Wojdyla
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.4.0
>
>
> Spark comes with some S3 batteries included, which makes it easier to use 
> with S3, for GCS to work users are required to manually configure the jars. 
> This is especially problematic for python users who may not be accustomed to 
> java dependencies etc. This is an example of workaround for pyspark: 
> [pyspark_gcs|https://github.com/ravwojdyla/pyspark_gcs]. If we include the 
> [GCS 
> connector|https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage],
>  it would make things easier for GCS users.
> Please let me know what you think.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40363) Add SQL misc function to assert/check column value

2022-09-06 Thread Rafal Wojdyla (Jira)
Rafal Wojdyla created SPARK-40363:
-

 Summary: Add SQL misc function to assert/check column value
 Key: SPARK-40363
 URL: https://issues.apache.org/jira/browse/SPARK-40363
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.3.0
Reporter: Rafal Wojdyla


A SQL function that allows asserting a condition on a column, such that it:
* fails when the condition is not met
* returns the original value otherwise

Related: SPARK-32793

But {{assert_true}} and {{raise_error}} do not really cut it. In the case of 
{{assert_true}} you have to actually collect the (empty) assertion column, and the check 
might not happen if you drop that column, which you will likely do since it's empty. It 
would be handy to have a function that returns some value as part of the check, in most 
cases the checked column itself.

I'm working with pyspark, so here's a Python implementation:

{code:python}
from typing import Callable, Optional, Union, overload

from pyspark.sql import Column
from pyspark.sql import functions as F

# str_to_col is assumed to be a small user-side helper that passes Columns through
# unchanged and converts column names via F.col; it is not part of pyspark.


@overload
def assert_col_condition(
    col: Union[str, Column],
    cond: Callable[[Column], Column],
    error_msg: Optional[str] = None,
) -> Column:
    """Asserts a condition on a column; IFF it holds, returns the original value
    under `col`."""
    ...


@overload
def assert_col_condition(
    col: Union[str, Column], cond: Column, error_msg: Optional[str] = None
) -> Column:
    """Asserts a condition on a column; IFF it holds, returns the original value
    under `col`."""
    ...


def assert_col_condition(
    col: Union[str, Column],
    cond: Union[Column, Callable[[Column], Column]],
    error_msg: Optional[str] = None,
) -> Column:
    col = str_to_col(col)
    if not isinstance(cond, Column):
        cond = cond(col)
    return F.when(
        ~cond, F.raise_error(error_msg or f"Assertion failed: {cond}")
    ).otherwise(col)
{code}
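
For reference, a rough equivalent in the Scala Column API, built from the existing 
{{when}} and {{raise_error}} functions added in SPARK-32793 (a sketch for illustration 
only, not part of this proposal):

{code:scala}
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, lit, raise_error, when}

// Return the original column value when the condition holds, otherwise raise an error.
def assertColCondition(c: Column, cond: Column, errorMsg: String): Column =
  when(!cond, raise_error(lit(errorMsg))).otherwise(c)

// Usage sketch: fail the query if any `age` value is negative, otherwise keep the value.
// df.select(assertColCondition(col("age"), col("age") >= 0, "age must be non-negative"))
{code}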



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40362) Bug in Canonicalization of expressions like Add & Multiply i.e Commutative Operators

2022-09-06 Thread Asif (Jira)
Asif created SPARK-40362:


 Summary: Bug in Canonicalization of expressions like Add & 
Multiply i.e Commutative Operators
 Key: SPARK-40362
 URL: https://issues.apache.org/jira/browse/SPARK-40362
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.0
Reporter: Asif


In the canonicalization code, which is now done in two stages, canonicalization 
involving commutative operators is broken if they are subexpressions of certain types 
of expressions which override precanonicalize.

Consider the following expression:

a + b > 10

When this GreaterThan expression is canonicalized as a whole for the first time, the 
call to Canonicalize.reorderCommutativeOperators is skipped for the Add expression, 
because GreaterThan's canonicalization calls precanonicalize on its children (the Add 
expression).

 

So if we create a new expression

b + a > 10 and invoke canonicalize on it, the canonicalized versions of these two 
expressions will not match.

The test 
{quote}
test("bug X") {
    val tr1 = LocalRelation('c.int, 'b.string, 'a.int)

   val y = tr1.where('a.attr + 'c.attr > 10).analyze
   val fullCond = y.asInstanceOf[Filter].condition.clone()
  val addExpr = (fullCond match {
    case GreaterThan(x: Add, _) => x
    case LessThan(_, x: Add) => x
  }).clone().asInstanceOf[Add]
val canonicalizedFullCond = fullCond.canonicalized

// swap the operands of add
val newAddExpr = Add(addExpr.right, addExpr.left)

// build a new condition which is same as the previous one, but with operands 
of //Add reversed 

val builtCondnCanonicalized = GreaterThan(newAddExpr, Literal(10)).canonicalized

assertEquals(canonicalizedFullCond, builtCondnCanonicalized)
}
{quote}

This test fails.

The fix which I propose is that, for the commutative expressions, precanonicalize 
should be overridden and Canonicalize.reorderCommutativeOperators should be invoked on 
the expression there, instead of in canonicalize. Effectively, for commutative 
operators (Add, Or, Multiply, And, etc.) canonicalize and precanonicalize should be 
the same.
I will file a PR for it.
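
Roughly, the proposed trait (also sketched in the comments above) would look like the 
following; this is a sketch of the idea, assuming catalyst's Expression and Canonicalize, 
not the final patch:
{code:scala}
trait CommutativeExpresionCanonicalization {
  this: Expression =>
  // For commutative operators, canonicalize and precanonicalize are the same:
  // the operand reordering already happens during precanonicalize, so parents that
  // only call precanonicalize on their children still get reordered operands.
  override lazy val canonicalized: Expression = preCanonicalized
  override lazy val preCanonicalized: Expression = {
    val canonicalizedChildren = children.map(_.preCanonicalized)
    Canonicalize.reorderCommutativeOperators(withNewChildren(canonicalizedChildren))
  }
}
{code}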



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-36901) ERROR exchange.BroadcastExchangeExec: Could not execute broadcast in 300 secs

2022-09-06 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-36901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17601030#comment-17601030
 ] 

Igor Uchôa edited comment on SPARK-36901 at 9/6/22 10:06 PM:
-

I'm facing this too. In my case, we have a shared cluster where multiple 
applications run on it at the same time and they use Dynamic Allocation. 
Depending on the cluster load, the application gets a broadcast timeout error. 
In my opinion, resource allocation delays shouldn't cause errors in the 
application, especially regarding broadcast, which seems unrelated to the 
problem at first glance.

The only parameter we have to fix this issue is `spark.sql.broadcastTimeout` 
which affects all broadcast operations. Therefore, if we give this value a huge 
number (or infinite), we will solve the broadcast timeout issue, but we will 
allow users to explicitly broadcast huge data sets in their queries, which 
seems really bad from my perspective.

Since the broadcast exchange is a job, it will trigger the lazy evaluation, and 
depending on resource availability, it will start immediately, or in the worst 
case, it will wait until the application has enough executors. The problem is 
that the timer starts counting the moment the job is created, not the moment the first 
task starts. So if there is a delay in resource allocation, the application will 
eventually fire the broadcast timeout error: 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala#L211

My suggestion is to change the timer approach and make it start only once the 
application has at least one running task. That way we will be able to use 
`spark.sql.broadcastTimeout` properly. Please let me know your thoughts.
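
For reference, this is the only knob available today (a small sketch of how it is set; 
the value is in seconds, the default being the 300 seconds seen in the error, and it 
applies to every broadcast in the session):
{code:scala}
// Blunt workaround discussed above: raise the SQL broadcast timeout at runtime.
spark.conf.set("spark.sql.broadcastTimeout", "600")
{code}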


was (Author: JIRAUSER293801):
I'm facing this too. In my case, we have a shared cluster where multiple 
applications run on it at the same time and they use Dynamic Allocation. 
Depending on the cluster load, the application gets a broadcast timeout error. 
In my opinion, resource allocation delays shouldn't cause errors in the 
application, especially regarding broadcast, which seems unrelated to the 
problem at first glance.

The only parameter we have to fix this issue is `spark.sql.broadcastTimeout` 
which affects all broadcast operations. Therefore, if we give this value a huge 
number (or infinite), we will solve the broadcast timeout issue, but we will 
allow users to explicitly broadcast huge data sets in their queries, which 
seems really bad from my perspective.

Since the broadcast exchange is a job, it will trigger the lazy evaluation, and 
depending on resource availability, it will start immediately, or in the worst 
case, it will wait until the application has enough executors. The problem is 
that the timer starts to count the moment the job is created and not at the 
moment it started the first task. So if we have a delay in the resource 
allocation, the application will eventually fire the broadcast timeout error: 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala#L211

My suggestion is to change the timer approach and make it starts only if the 
application has at least one running task. In this way we will be able to use 
`spark.sql.broadcastTimeout` properly.

> ERROR exchange.BroadcastExchangeExec: Could not execute broadcast in 300 secs
> -
>
> Key: SPARK-36901
> URL: https://issues.apache.org/jira/browse/SPARK-36901
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.4.0
>Reporter: Ranga Reddy
>Priority: Major
>
> While running Spark application, if there are no further resources to launch 
> executors, Spark application is failed after 5 mins with below exception.
> {code:java}
> 21/09/24 06:12:45 WARN cluster.YarnScheduler: Initial job has not accepted 
> any resources; check your cluster UI to ensure that workers are registered 
> and have sufficient resources
> ...
> 21/09/24 06:17:29 ERROR exchange.BroadcastExchangeExec: Could not execute 
> broadcast in 300 secs.
> java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]
> ...
> Caused by: java.util.concurrent.TimeoutException: Futures timed out after 
> [300 seconds]
>   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:146)
>   ... 71 more
> 21/09/24 06:17:30 INFO 

[jira] [Commented] (SPARK-36901) ERROR exchange.BroadcastExchangeExec: Could not execute broadcast in 300 secs

2022-09-06 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-36901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17601030#comment-17601030
 ] 

Igor Uchôa commented on SPARK-36901:


I'm facing this too. In my case, we have a shared cluster where multiple 
applications run at the same time and use Dynamic Allocation. Depending on the 
cluster load, an application gets a broadcast timeout error. In my opinion, 
resource allocation delays shouldn't cause errors in the application, especially 
broadcast errors, which seem unrelated to the problem at first glance.

The only parameter we have to fix this issue is `spark.sql.broadcastTimeout`, 
which affects all broadcast operations. Therefore, if we give this value a huge 
number (or make it effectively infinite), we solve the broadcast timeout issue, 
but we also allow users to explicitly broadcast huge data sets in their queries, 
which seems really bad from my perspective.

Since the broadcast exchange is a job, it triggers lazy evaluation and, depending 
on resource availability, it either starts immediately or, in the worst case, 
waits until the application has enough executors. The problem is that the timer 
starts counting the moment the job is created, not the moment its first task 
starts. So if there is a delay in resource allocation, the application will 
eventually hit the broadcast timeout error: 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala#L211

My suggestion is to change the timer approach and start it only once the 
application has at least one running task. That way we would be able to use 
`spark.sql.broadcastTimeout` properly.
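
Until the timer behaviour changes, a common operational mitigation (a sketch, 
not part of the proposed fix) is to raise the timeout or take automatic 
broadcast planning out of the picture for the affected job; assuming a 
SparkSession named `spark`:

{code:scala}
import org.apache.spark.sql.SparkSession

object BroadcastTimeoutMitigation {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("broadcast-timeout-sketch").getOrCreate()

    // Give the broadcast build side a larger budget; today the timer starts when
    // the broadcast job is submitted, so executor-allocation delays count against it.
    spark.conf.set("spark.sql.broadcastTimeout", "1200")

    // Or keep the planner from picking broadcast joins automatically, so a slow
    // executor ramp-up cannot trip the broadcast timer at all.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

    val t1 = spark.range(5)
    val t2 = spark.range(5)
    t1.join(t2, "id").show()

    spark.stop()
  }
}
{code}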

> ERROR exchange.BroadcastExchangeExec: Could not execute broadcast in 300 secs
> -
>
> Key: SPARK-36901
> URL: https://issues.apache.org/jira/browse/SPARK-36901
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.4.0
>Reporter: Ranga Reddy
>Priority: Major
>
> While running a Spark application, if there are no further resources to launch 
> executors, the application fails after 5 minutes with the exception below.
> {code:java}
> 21/09/24 06:12:45 WARN cluster.YarnScheduler: Initial job has not accepted 
> any resources; check your cluster UI to ensure that workers are registered 
> and have sufficient resources
> ...
> 21/09/24 06:17:29 ERROR exchange.BroadcastExchangeExec: Could not execute 
> broadcast in 300 secs.
> java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]
> ...
> Caused by: java.util.concurrent.TimeoutException: Futures timed out after 
> [300 seconds]
>   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:146)
>   ... 71 more
> 21/09/24 06:17:30 INFO spark.SparkContext: Invoking stop() from shutdown hook
> {code}
> *Expectation*: it should either throw a proper exception saying *"there are no 
> further resources to run the application"* or it should *"wait till it gets 
> resources"*.
> To reproduce the issue we have used following sample code.
> *PySpark Code (test_broadcast_timeout.py):*
> {code:java}
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.appName("Test Broadcast Timeout").getOrCreate()
> t1 = spark.range(5)
> t2 = spark.range(5)
> q = t1.join(t2,t1.id == t2.id)
> q.explain()
> q.show(){code}
> *Spark Submit Command:*
> {code:java}
> spark-submit --executor-memory 512M test_broadcast_timeout.py{code}
>  Note: We have tested the same code in Spark 3.1; we are able to reproduce the 
> issue in Spark 3 as well.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40262) Expensive UDF evaluation pushed down past a join leads to performance issues

2022-09-06 Thread Shardul Mahadik (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17601021#comment-17601021
 ] 

Shardul Mahadik commented on SPARK-40262:
-

[~cloud_fan] [~viirya] [~joshrosen] Gentle ping on this!

> Expensive UDF evaluation pushed down past a join leads to performance issues 
> -
>
> Key: SPARK-40262
> URL: https://issues.apache.org/jira/browse/SPARK-40262
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Shardul Mahadik
>Priority: Major
>
> Consider a Spark job with an expensive UDF which looks like follows:
> {code:scala}
> val expensive_udf = spark.udf.register("expensive_udf", (i: Int) => Option(i))
> spark.range(10).write.format("orc").save("/tmp/orc")
> val df = spark.read.format("orc").load("/tmp/orc").as("a")
> .join(spark.range(10).as("b"), "id")
> .withColumn("udf_op", expensive_udf($"a.id"))
> .join(spark.range(10).as("c"), $"udf_op" === $"c.id")
> {code}
> This creates a physical plan as follows:
> {code:java}
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- BroadcastHashJoin [cast(udf_op#338 as bigint)], [id#344L], Inner, 
> BuildRight, false
>:- Project [id#330L, if (isnull(cast(id#330L as int))) null else 
> expensive_udf(knownnotnull(cast(id#330L as int))) AS udf_op#338]
>:  +- BroadcastHashJoin [id#330L], [id#332L], Inner, BuildRight, false
>: :- Filter ((isnotnull(id#330L) AND isnotnull(cast(id#330L as int))) 
> AND isnotnull(expensive_udf(knownnotnull(cast(id#330L as int)
>: :  +- FileScan orc [id#330L] Batched: true, DataFilters: 
> [isnotnull(id#330L), isnotnull(cast(id#330L as int)), 
> isnotnull(expensive_udf(knownnotnull(cast(i..., Format: ORC, Location: 
> InMemoryFileIndex(1 paths)[file:/tmp/orc], PartitionFilters: [], 
> PushedFilters: [IsNotNull(id)], ReadSchema: struct
>: +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, 
> bigint, false]),false), [plan_id=416]
>:+- Range (0, 10, step=1, splits=16)
>+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, 
> false]),false), [plan_id=420]
>   +- Range (0, 10, step=1, splits=16)
> {code}
> In this case, the expensive UDF call is duplicated thrice. Since the UDF 
> output is used in a future join, `InferFiltersFromConstraints` adds an `IS 
> NOT NULL` filter on the UDF output. But the pushdown rules duplicate this UDF 
> call and push the UDF past a previous join. The duplication behaviour [is 
> documented|https://github.com/apache/spark/blob/c95ed826e23fdec6e1a779cfebde7b3364594fb5/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L196]
>  and in itself is not a huge issue. But given a highly restrictive join, the 
> UDF gets evaluated on orders of magnitude more rows than it should be, slowing 
> down the job.
> Can we avoid this duplication of UDF calls? In SPARK-37392, we made a 
> [similar change|https://github.com/apache/spark/pull/34823/files] where we 
> decided to only add inferred filters if the input is an attribute. Should we 
> use a similar strategy for `InferFiltersFromConstraints`?
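
One mitigation sketch (not the change proposed in this ticket, and with the 
obvious trade-off of disabling legitimate optimizations on that column) is to 
mark the UDF as non-deterministic, which keeps Catalyst from inferring 
constraints on its output and from pushing the call below the first join:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object NondeterministicUdfSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("udf-pushdown-sketch").getOrCreate()
    import spark.implicits._

    // Marking the UDF as non-deterministic prevents constraint inference on its
    // output and blocks pushdown/duplication of the call past the first join.
    val expensiveUdf = udf((i: Int) => Option(i)).asNondeterministic()

    spark.range(10).write.mode("overwrite").format("orc").save("/tmp/orc")
    val df = spark.read.format("orc").load("/tmp/orc").as("a")
      .join(spark.range(10).as("b"), "id")
      .withColumn("udf_op", expensiveUdf($"a.id"))
      .join(spark.range(10).as("c"), $"udf_op" === $"c.id")

    df.explain()
    spark.stop()
  }
}
{code}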



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40360) Convert some DDL exception to new error framework

2022-09-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40360:


Assignee: (was: Apache Spark)

> Convert some DDL exception to new error framework
> -
>
> Key: SPARK-40360
> URL: https://issues.apache.org/jira/browse/SPARK-40360
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Serge Rielau
>Priority: Major
>
> Tackling the following files:
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/AlreadyExistException.scala
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/NoSuchItemException.scala
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CannotReplaceMissingTableException.scala
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/NonEmptyException.scala
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala
> Here is the doc with proposed text:
> https://docs.google.com/document/d/1TpFx3AwcJZd3l7zB1ZDchvZ8j2dY6_uf5LHfW2gjE4A/edit?usp=sharing
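
For readers new to the error framework, the shape of the change is roughly the 
following; this is a self-contained sketch with made-up class and template names 
that only mimics how error classes and message parameters replace free-form 
strings, not the actual Spark exception constructors touched by the PR:

{code:scala}
object ErrorFrameworkSketch {
  // Tiny stand-in for the error-class registry (error-classes.json in Spark).
  val errorClasses: Map[String, String] = Map(
    "TABLE_OR_VIEW_ALREADY_EXISTS" ->
      "Cannot create table or view <relationName> because it already exists.")

  // Exceptions carry an error class plus named parameters instead of a free-form
  // message, so tests and docs can key off the class name rather than message text.
  class SketchAnalysisException(
      val errorClass: String,
      val messageParameters: Map[String, String])
    extends Exception(
      messageParameters.foldLeft(errorClasses(errorClass)) {
        case (msg, (key, value)) => msg.replace(s"<$key>", value)
      })

  def main(args: Array[String]): Unit = {
    val e = new SketchAnalysisException(
      "TABLE_OR_VIEW_ALREADY_EXISTS",
      Map("relationName" -> "`spark_catalog`.`default`.`t1`"))
    println(s"${e.errorClass}: ${e.getMessage}")
  }
}
{code}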



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40360) Convert some DDL exception to new error framework

2022-09-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40360:


Assignee: Apache Spark

> Convert some DDL exception to new error framework
> -
>
> Key: SPARK-40360
> URL: https://issues.apache.org/jira/browse/SPARK-40360
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Serge Rielau
>Assignee: Apache Spark
>Priority: Major
>
> Tackling the following files:
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/AlreadyExistException.scala
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/NoSuchItemException.scala
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CannotReplaceMissingTableException.scala
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/NonEmptyException.scala
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala
> Here is the doc with proposed text:
> https://docs.google.com/document/d/1TpFx3AwcJZd3l7zB1ZDchvZ8j2dY6_uf5LHfW2gjE4A/edit?usp=sharing



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40360) Convert some DDL exception to new error framework

2022-09-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17601003#comment-17601003
 ] 

Apache Spark commented on SPARK-40360:
--

User 'srielau' has created a pull request for this issue:
https://github.com/apache/spark/pull/37811

> Convert some DDL exception to new error framework
> -
>
> Key: SPARK-40360
> URL: https://issues.apache.org/jira/browse/SPARK-40360
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Serge Rielau
>Priority: Major
>
> Tackling the following files:
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/AlreadyExistException.scala
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/NoSuchItemException.scala
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CannotReplaceMissingTableException.scala
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/NonEmptyException.scala
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala
> Here is the doc with proposed text:
> https://docs.google.com/document/d/1TpFx3AwcJZd3l7zB1ZDchvZ8j2dY6_uf5LHfW2gjE4A/edit?usp=sharing



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40360) Convert some DDL exception to new error framework

2022-09-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40360:


Assignee: Apache Spark

> Convert some DDL exception to new error framework
> -
>
> Key: SPARK-40360
> URL: https://issues.apache.org/jira/browse/SPARK-40360
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Serge Rielau
>Assignee: Apache Spark
>Priority: Major
>
> Tackling the following files:
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/AlreadyExistException.scala
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/NoSuchItemException.scala
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CannotReplaceMissingTableException.scala
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/NonEmptyException.scala
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala
> Here is the doc with proposed text:
> https://docs.google.com/document/d/1TpFx3AwcJZd3l7zB1ZDchvZ8j2dY6_uf5LHfW2gjE4A/edit?usp=sharing



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40149) Star expansion after outer join asymmetrically includes joining key

2022-09-06 Thread Pankaj Nagla (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600948#comment-17600948
 ] 

Pankaj Nagla commented on SPARK-40149:
--

Thank you for sharing the information. [Vlocity 
Training|https://www.igmguru.com/salesforce/salesforce-vlocity-training-certification/]
 enhances CPQ and guided selling as well. Salesforce Vlocity is the pioneer 
assisting many tops and arising companies obtain their wanted progress 
utilizing its Omnichannel procedures.

> Star expansion after outer join asymmetrically includes joining key
> ---
>
> Key: SPARK-40149
> URL: https://issues.apache.org/jira/browse/SPARK-40149
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2
>Reporter: Otakar Truněček
>Priority: Blocker
>
> When star expansion is used on the left side of a join, the result includes the 
> joining key, while on the right side of the join it doesn't. I would expect the 
> behaviour to be symmetric (either included on both sides or on neither). 
> Example:
> {code:python}
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as f
> spark = SparkSession.builder.getOrCreate()
> df_left = spark.range(5).withColumn('val', f.lit('left'))
> df_right = spark.range(3, 7).withColumn('val', f.lit('right'))
> df_merged = (
> df_left
> .alias('left')
> .join(df_right.alias('right'), on='id', how='full_outer')
> .withColumn('left_all', f.struct('left.*'))
> .withColumn('right_all', f.struct('right.*'))
> )
> df_merged.show()
> {code}
> result:
> {code:java}
> +---+----+-----+------------+---------+
> | id| val|  val|    left_all|right_all|
> +---+----+-----+------------+---------+
> |  0|left| null|   {0, left}|   {null}|
> |  1|left| null|   {1, left}|   {null}|
> |  2|left| null|   {2, left}|   {null}|
> |  3|left|right|   {3, left}|  {right}|
> |  4|left|right|   {4, left}|  {right}|
> |  5|null|right|{null, null}|  {right}|
> |  6|null|right|{null, null}|  {right}|
> +---+----+-----+------------+---------+
> {code}
> This behaviour started with release 3.2.0. Previously the key was not 
> included on either side. 
> Result from Spark 3.1.3
> {code:java}
> +---+----+-----+--------+---------+
> | id| val|  val|left_all|right_all|
> +---+----+-----+--------+---------+
> |  0|left| null|  {left}|   {null}|
> |  6|null|right|  {null}|  {right}|
> |  5|null|right|  {null}|  {right}|
> |  1|left| null|  {left}|   {null}|
> |  3|left|right|  {left}|  {right}|
> |  2|left| null|  {left}|   {null}|
> |  4|left|right|  {left}|  {right}|
> +---+----+-----+--------+---------+ {code}
> I have a gut feeling this is related to these issues:
> https://issues.apache.org/jira/browse/SPARK-39376
> https://issues.apache.org/jira/browse/SPARK-34527
> https://issues.apache.org/jira/browse/SPARK-38603
>  
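
A possible workaround sketch (not from this report; written in Scala rather than 
the PySpark repro above, and assuming alias-qualified columns remain resolvable 
after the join): build both structs from explicit column lists, so the join key 
is treated the same on both sides regardless of how star expansion behaves:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, struct}

object SymmetricStructSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("star-expansion-sketch").getOrCreate()

    val dfLeft = spark.range(5).withColumn("val", lit("left"))
    val dfRight = spark.range(3, 7).withColumn("val", lit("right"))

    // Enumerate each side's columns explicitly instead of relying on 'left.*' /
    // 'right.*', so both structs either include or exclude the join key consistently.
    val leftAll = struct(dfLeft.columns.map(c => col(s"left.$c")): _*)
    val rightAll = struct(dfRight.columns.map(c => col(s"right.$c")): _*)

    val merged = dfLeft.alias("left")
      .join(dfRight.alias("right"), Seq("id"), "full_outer")
      .withColumn("left_all", leftAll)
      .withColumn("right_all", rightAll)

    merged.show()
    spark.stop()
  }
}
{code}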



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40318) try_avg() should throw the exceptions from its child

2022-09-06 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-40318.

Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37775
[https://github.com/apache/spark/pull/37775]

> try_avg() should throw the exceptions from its child
> 
>
> Key: SPARK-40318
> URL: https://issues.apache.org/jira/browse/SPARK-40318
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.4.0
>
>
> Similar to [https://github.com/apache/spark/pull/37486] and 
> [https://github.com/apache/spark/pull/37663], the errors from try_avg()'s 
> child should be shown instead of ignored.
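
For illustration, a sketch of the intended behaviour (assuming ANSI mode and 
semantics analogous to the two linked PRs; not taken from the actual change): an 
error raised while evaluating the child expression should surface, while 
overflow of the running average itself remains what try_avg suppresses.

{code:scala}
import org.apache.spark.sql.SparkSession

object TryAvgChildErrorSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("try-avg-sketch").getOrCreate()
    spark.conf.set("spark.sql.ansi.enabled", "true")

    // The child expression `x + 2147483647` overflows INT under ANSI mode; after the
    // change this error is expected to propagate instead of silently becoming NULL.
    val failing = spark.sql(
      "SELECT try_avg(x + 2147483647) FROM VALUES (1), (2) AS t(x)")

    try failing.show() catch {
      case e: Exception => println(s"child error surfaced: ${e.getMessage}")
    }

    spark.stop()
  }
}
{code}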



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22588) SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values

2022-09-06 Thread Nikhil Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600894#comment-17600894
 ] 

Nikhil Sharma edited comment on SPARK-22588 at 9/6/22 5:03 PM:
---

Thank you for sharing the information. [React Native Online 
Course|https://www.igmguru.com/digital-marketing-programming/react-native-training/]
 is an integrated professional course aimed at providing learners with the 
skills and knowledge of React Native, a mobile application framework used for 
the development of mobile applications for Android, iOS, UWP (Universal Windows 
Platform), and the web.


was (Author: JIRAUSER295436):
Thank you for sharing the information. 
[url=https://www.igmguru.com/digital-marketing-programming/react-native-training/]React
 Native Online Course[/url] is an integrated professional course aimed at 
providing learners with the skills and knowledge of React Native, a mobile 
application framework used for the development of mobile applications for 
Android, iOS, UWP (Universal Windows Platform), and web.

> SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values
> -
>
> Key: SPARK-22588
> URL: https://issues.apache.org/jira/browse/SPARK-22588
> Project: Spark
>  Issue Type: Question
>  Components: Deploy
>Affects Versions: 2.1.1
>Reporter: Saanvi Sharma
>Priority: Minor
>  Labels: dynamodb, spark
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I am using Spark 2.1 on EMR and I have a dataframe like this:
>  ClientNum | Value_1 | Value_2 | Value_3 | Value_4
>  14        | A       | B       | C       | null
>  19        | X       | Y       | null    | null
>  21        | R       | null    | null    | null
> I want to load data into a DynamoDB table with ClientNum as the key, following:
> Analyze Your Data on Amazon DynamoDB with Apache Spark
> Using Spark SQL for ETL
> Here is the code that I tried:
>   var jobConf = new JobConf(sc.hadoopConfiguration)
>   jobConf.set("dynamodb.servicename", "dynamodb")
>   jobConf.set("dynamodb.input.tableName", "table_name")   
>   jobConf.set("dynamodb.output.tableName", "table_name")   
>   jobConf.set("dynamodb.endpoint", "dynamodb.eu-west-1.amazonaws.com")
>   jobConf.set("dynamodb.regionid", "eu-west-1")
>   jobConf.set("dynamodb.throughput.read", "1")
>   jobConf.set("dynamodb.throughput.read.percent", "1")
>   jobConf.set("dynamodb.throughput.write", "1")
>   jobConf.set("dynamodb.throughput.write.percent", "1")
>   
>   jobConf.set("mapred.output.format.class", 
> "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
>   jobConf.set("mapred.input.format.class", 
> "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")
>   #Import Data
>   val df = 
> sqlContext.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").load(path)
> I performed a transformation to have an RDD that matches the types that the 
> DynamoDB custom output format knows how to write. The custom output format 
> expects a tuple containing the Text and DynamoDBItemWritable types.
> Create a new RDD with those types in it, in the following map call:
>   #Convert the dataframe to rdd
>   val df_rdd = df.rdd
>   > df_rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> MapPartitionsRDD[10] at rdd at :41
>   
>   #Print first rdd
>   df_rdd.take(1)
>   > res12: Array[org.apache.spark.sql.Row] = Array([14,A,B,C,null])
>   var ddbInsertFormattedRDD = df_rdd.map(a => {
>   var ddbMap = new HashMap[String, AttributeValue]()
>   var ClientNum = new AttributeValue()
>   ClientNum.setN(a.get(0).toString)
>   ddbMap.put("ClientNum", ClientNum)
>   var Value_1 = new AttributeValue()
>   Value_1.setS(a.get(1).toString)
>   ddbMap.put("Value_1", Value_1)
>   var Value_2 = new AttributeValue()
>   Value_2.setS(a.get(2).toString)
>   ddbMap.put("Value_2", Value_2)
>   var Value_3 = new AttributeValue()
>   Value_3.setS(a.get(3).toString)
>   ddbMap.put("Value_3", Value_3)
>   var Value_4 = new AttributeValue()
>   Value_4.setS(a.get(4).toString)
>   ddbMap.put("Value_4", Value_4)
>   var item = new DynamoDBItemWritable()
>   item.setItem(ddbMap)
>   (new Text(""), item)
>   })
> This last call uses the job configuration that defines the EMR-DDB connector 
> to write out the new RDD you created in the expected format:
> ddbInsertFormattedRDD.saveAsHadoopDataset(jobConf)
> fails with the following error:
> Caused by: java.lang.NullPointerException
> Null values caused the error; if I try with only ClientNum and Value_1, it works
> and data is correctly inserted into the DynamoDB table.
> Thanks for your help !!
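
A sketch of the null handling that typically resolves this kind of NPE 
(illustrative only; the column names follow the snippet above, and the AWS SDK 
and emr-dynamodb-hadoop classes are assumed to be on the classpath): skip 
attributes whose source value is null instead of calling setS on a null string.

{code:scala}
import java.util.HashMap

import com.amazonaws.services.dynamodbv2.model.AttributeValue
import org.apache.hadoop.dynamodb.DynamoDBItemWritable
import org.apache.hadoop.io.Text
import org.apache.spark.sql.Row

object NullSafeDynamoDbSketch {
  // Build a DynamoDB item from a Row, omitting attributes whose value is null so
  // that AttributeValue.setS is never called with null (the source of the NPE).
  def toDynamoDbItem(row: Row): (Text, DynamoDBItemWritable) = {
    val attributes = new HashMap[String, AttributeValue]()

    val clientNum = new AttributeValue()
    clientNum.setN(row.get(0).toString)
    attributes.put("ClientNum", clientNum)

    val stringColumns = Seq("Value_1", "Value_2", "Value_3", "Value_4")
    stringColumns.zipWithIndex.foreach { case (name, offset) =>
      val value = row.get(offset + 1)
      if (value != null) {          // never call setS with a null value
        val attr = new AttributeValue()
        attr.setS(value.toString)
        attributes.put(name, attr)
      }
    }

    val item = new DynamoDBItemWritable()
    item.setItem(attributes)
    (new Text(""), item)
  }
}
{code}

With the same jobConf as above, this can then be used roughly as 
`df.rdd.map(NullSafeDynamoDbSketch.toDynamoDbItem).saveAsHadoopDataset(jobConf)`.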



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (SPARK-22588) SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values

2022-09-06 Thread Nikhil Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600894#comment-17600894
 ] 

Nikhil Sharma edited comment on SPARK-22588 at 9/6/22 5:01 PM:
---

Thank you for sharing the information. 
[url=https://www.igmguru.com/digital-marketing-programming/react-native-training/]React
 Native Online Course[/url] is an integrated professional course aimed at 
providing learners with the skills and knowledge of React Native, a mobile 
application framework used for the development of mobile applications for 
Android, iOS, UWP (Universal Windows Platform), and web.


was (Author: JIRAUSER295436):
Thank you for sharing the information. https://www.igmguru.com/digital-marketing-programming/react-native-training/;>React
 Native Online Course is an integrated professional course aimed at 
providing learners with the skills and knowledge of React Native, a mobile 
application framework used for the development of mobile applications for 
Android, iOS, UWP (Universal Windows Platform), and web.

> SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values
> -
>
> Key: SPARK-22588
> URL: https://issues.apache.org/jira/browse/SPARK-22588
> Project: Spark
>  Issue Type: Question
>  Components: Deploy
>Affects Versions: 2.1.1
>Reporter: Saanvi Sharma
>Priority: Minor
>  Labels: dynamodb, spark
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I am using Spark 2.1 on EMR and I have a dataframe like this:
>  ClientNum | Value_1 | Value_2 | Value_3 | Value_4
>  14        | A       | B       | C       | null
>  19        | X       | Y       | null    | null
>  21        | R       | null    | null    | null
> I want to load data into a DynamoDB table with ClientNum as the key, following:
> Analyze Your Data on Amazon DynamoDB with Apache Spark
> Using Spark SQL for ETL
> Here is the code that I tried:
>   var jobConf = new JobConf(sc.hadoopConfiguration)
>   jobConf.set("dynamodb.servicename", "dynamodb")
>   jobConf.set("dynamodb.input.tableName", "table_name")   
>   jobConf.set("dynamodb.output.tableName", "table_name")   
>   jobConf.set("dynamodb.endpoint", "dynamodb.eu-west-1.amazonaws.com")
>   jobConf.set("dynamodb.regionid", "eu-west-1")
>   jobConf.set("dynamodb.throughput.read", "1")
>   jobConf.set("dynamodb.throughput.read.percent", "1")
>   jobConf.set("dynamodb.throughput.write", "1")
>   jobConf.set("dynamodb.throughput.write.percent", "1")
>   
>   jobConf.set("mapred.output.format.class", 
> "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
>   jobConf.set("mapred.input.format.class", 
> "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")
>   #Import Data
>   val df = 
> sqlContext.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").load(path)
> I performed a transformation to have an RDD that matches the types that the 
> DynamoDB custom output format knows how to write. The custom output format 
> expects a tuple containing the Text and DynamoDBItemWritable types.
> Create a new RDD with those types in it, in the following map call:
>   #Convert the dataframe to rdd
>   val df_rdd = df.rdd
>   > df_rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> MapPartitionsRDD[10] at rdd at :41
>   
>   #Print first rdd
>   df_rdd.take(1)
>   > res12: Array[org.apache.spark.sql.Row] = Array([14,A,B,C,null])
>   var ddbInsertFormattedRDD = df_rdd.map(a => {
>   var ddbMap = new HashMap[String, AttributeValue]()
>   var ClientNum = new AttributeValue()
>   ClientNum.setN(a.get(0).toString)
>   ddbMap.put("ClientNum", ClientNum)
>   var Value_1 = new AttributeValue()
>   Value_1.setS(a.get(1).toString)
>   ddbMap.put("Value_1", Value_1)
>   var Value_2 = new AttributeValue()
>   Value_2.setS(a.get(2).toString)
>   ddbMap.put("Value_2", Value_2)
>   var Value_3 = new AttributeValue()
>   Value_3.setS(a.get(3).toString)
>   ddbMap.put("Value_3", Value_3)
>   var Value_4 = new AttributeValue()
>   Value_4.setS(a.get(4).toString)
>   ddbMap.put("Value_4", Value_4)
>   var item = new DynamoDBItemWritable()
>   item.setItem(ddbMap)
>   (new Text(""), item)
>   })
> This last call uses the job configuration that defines the EMR-DDB connector 
> to write out the new RDD you created in the expected format:
> ddbInsertFormattedRDD.saveAsHadoopDataset(jobConf)
> fails with the following error:
> Caused by: java.lang.NullPointerException
> Null values caused the error; if I try with only ClientNum and Value_1, it works
> and data is correctly inserted into the DynamoDB table.
> Thanks for your help !!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-

[jira] [Comment Edited] (SPARK-22588) SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values

2022-09-06 Thread Nikhil Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600894#comment-17600894
 ] 

Nikhil Sharma edited comment on SPARK-22588 at 9/6/22 5:00 PM:
---

Thank you for sharing the information. https://www.igmguru.com/digital-marketing-programming/react-native-training/;>React
 Native Online Course is an integrated professional course aimed at 
providing learners with the skills and knowledge of React Native, a mobile 
application framework used for the development of mobile applications for 
Android, iOS, UWP (Universal Windows Platform), and web.


was (Author: JIRAUSER295436):
Thank you for sharing the information. [React Native Online 
Course|[https://www.igmguru.com/digital-marketing-programming/react-native-training/]]
 is an integrated professional course aimed at providing learners with the 
skills and knowledge of React Native, a mobile application framework used for 
the development of mobile applications for Android, iOS, UWP (Universal Windows 
Platform), and web.

> SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values
> -
>
> Key: SPARK-22588
> URL: https://issues.apache.org/jira/browse/SPARK-22588
> Project: Spark
>  Issue Type: Question
>  Components: Deploy
>Affects Versions: 2.1.1
>Reporter: Saanvi Sharma
>Priority: Minor
>  Labels: dynamodb, spark
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I am using Spark 2.1 on EMR and I have a dataframe like this:
>  ClientNum | Value_1 | Value_2 | Value_3 | Value_4
>  14        | A       | B       | C       | null
>  19        | X       | Y       | null    | null
>  21        | R       | null    | null    | null
> I want to load data into a DynamoDB table with ClientNum as the key, following:
> Analyze Your Data on Amazon DynamoDB with Apache Spark
> Using Spark SQL for ETL
> Here is the code that I tried:
>   var jobConf = new JobConf(sc.hadoopConfiguration)
>   jobConf.set("dynamodb.servicename", "dynamodb")
>   jobConf.set("dynamodb.input.tableName", "table_name")   
>   jobConf.set("dynamodb.output.tableName", "table_name")   
>   jobConf.set("dynamodb.endpoint", "dynamodb.eu-west-1.amazonaws.com")
>   jobConf.set("dynamodb.regionid", "eu-west-1")
>   jobConf.set("dynamodb.throughput.read", "1")
>   jobConf.set("dynamodb.throughput.read.percent", "1")
>   jobConf.set("dynamodb.throughput.write", "1")
>   jobConf.set("dynamodb.throughput.write.percent", "1")
>   
>   jobConf.set("mapred.output.format.class", 
> "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
>   jobConf.set("mapred.input.format.class", 
> "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")
>   #Import Data
>   val df = 
> sqlContext.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").load(path)
> I performed a transformation to have an RDD that matches the types that the 
> DynamoDB custom output format knows how to write. The custom output format 
> expects a tuple containing the Text and DynamoDBItemWritable types.
> Create a new RDD with those types in it, in the following map call:
>   #Convert the dataframe to rdd
>   val df_rdd = df.rdd
>   > df_rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> MapPartitionsRDD[10] at rdd at :41
>   
>   #Print first rdd
>   df_rdd.take(1)
>   > res12: Array[org.apache.spark.sql.Row] = Array([14,A,B,C,null])
>   var ddbInsertFormattedRDD = df_rdd.map(a => {
>   var ddbMap = new HashMap[String, AttributeValue]()
>   var ClientNum = new AttributeValue()
>   ClientNum.setN(a.get(0).toString)
>   ddbMap.put("ClientNum", ClientNum)
>   var Value_1 = new AttributeValue()
>   Value_1.setS(a.get(1).toString)
>   ddbMap.put("Value_1", Value_1)
>   var Value_2 = new AttributeValue()
>   Value_2.setS(a.get(2).toString)
>   ddbMap.put("Value_2", Value_2)
>   var Value_3 = new AttributeValue()
>   Value_3.setS(a.get(3).toString)
>   ddbMap.put("Value_3", Value_3)
>   var Value_4 = new AttributeValue()
>   Value_4.setS(a.get(4).toString)
>   ddbMap.put("Value_4", Value_4)
>   var item = new DynamoDBItemWritable()
>   item.setItem(ddbMap)
>   (new Text(""), item)
>   })
> This last call uses the job configuration that defines the EMR-DDB connector 
> to write out the new RDD you created in the expected format:
> ddbInsertFormattedRDD.saveAsHadoopDataset(jobConf)
> fails with the following error:
> Caused by: java.lang.NullPointerException
> Null values caused the error; if I try with only ClientNum and Value_1, it works
> and data is correctly inserted into the DynamoDB table.
> Thanks for your help !!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To 

[jira] [Updated] (SPARK-40361) Migrate arithmetic type check failures onto error classes

2022-09-06 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-40361:
-
Description: 
Replace TypeCheckFailure by DataTypeMismatch in type checks in arithmetic 
expressions:
1. Least (2): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala#L1188-L1191
2. Greatest (2): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala#L1266-L1269

  was:
Replace TypeCheckFailure by DataTypeMismatch in type checks in window 
expressions:
1. WindowSpecDefinition (4): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L68-L85
2. SpecifiedWindowFrame (3): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L216-L231
3. checkBoundary (2): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L264-L269
4. FrameLessOffsetWindowFunction (1): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L424
5. NthValue (2): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L697-L700
6. NTile (3): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L793-L804
  


> Migrate arithmetic type check failures onto error classes
> -
>
> Key: SPARK-40361
> URL: https://issues.apache.org/jira/browse/SPARK-40361
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Major
>
> Replace TypeCheckFailure by DataTypeMismatch in type checks in arithmetic 
> expressions:
> 1. Least (2): 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala#L1188-L1191
> 2. Greatest (2): 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala#L1266-L1269
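
For context, the shape of such a migration is roughly the following; a 
self-contained sketch with simplified stand-ins for the catalyst TypeCheckResult 
types and an illustrative error sub-class, not the actual Least/Greatest code:

{code:scala}
// Simplified stand-ins for the catalyst TypeCheckResult hierarchy; the real change
// lives in the Least/Greatest checkInputDataTypes implementations linked above.
sealed trait TypeCheckResult
case object TypeCheckSuccess extends TypeCheckResult
case class TypeCheckFailure(message: String) extends TypeCheckResult
case class DataTypeMismatch(
    errorSubClass: String,
    messageParameters: Map[String, String]) extends TypeCheckResult

object TypeCheckMigrationSketch {
  // Before: a free-form, hard-to-test message string.
  def checkLeastBefore(inputTypes: Seq[String]): TypeCheckResult =
    if (inputTypes.distinct.size <= 1) TypeCheckSuccess
    else TypeCheckFailure(
      s"The expressions should all have the same type, got LEAST(${inputTypes.mkString(", ")}).")

  // After: a structured error sub-class plus named parameters that the error
  // framework can render, document and verify consistently.
  def checkLeastAfter(inputTypes: Seq[String]): TypeCheckResult =
    if (inputTypes.distinct.size <= 1) TypeCheckSuccess
    else DataTypeMismatch(
      errorSubClass = "DATA_DIFF_TYPES",
      messageParameters = Map(
        "functionName" -> "`least`",
        "dataType" -> inputTypes.mkString("[", ", ", "]")))

  def main(args: Array[String]): Unit =
    println(checkLeastAfter(Seq("INT", "STRING")))
}
{code}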



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40361) Migrate arithmetic type check failures onto error classes

2022-09-06 Thread Max Gekk (Jira)
Max Gekk created SPARK-40361:


 Summary: Migrate arithmetic type check failures onto error classes
 Key: SPARK-40361
 URL: https://issues.apache.org/jira/browse/SPARK-40361
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Max Gekk


Replace TypeCheckFailure by DataTypeMismatch in type checks in window 
expressions:
1. WindowSpecDefinition (4): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L68-L85
2. SpecifiedWindowFrame (3): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L216-L231
3. checkBoundary (2): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L264-L269
4. FrameLessOffsetWindowFunction (1): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L424
5. NthValue (2): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L697-L700
6. NTile (3): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L793-L804
  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22588) SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values

2022-09-06 Thread Nikhil Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600894#comment-17600894
 ] 

Nikhil Sharma commented on SPARK-22588:
---

Thank you for sharing the information. [React Native Online 
Course|[https://www.igmguru.com/digital-marketing-programming/react-native-training/]]
 is an integrated professional course aimed at providing learners with the 
skills and knowledge of React Native, a mobile application framework used for 
the development of mobile applications for Android, iOS, UWP (Universal Windows 
Platform), and web.

> SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values
> -
>
> Key: SPARK-22588
> URL: https://issues.apache.org/jira/browse/SPARK-22588
> Project: Spark
>  Issue Type: Question
>  Components: Deploy
>Affects Versions: 2.1.1
>Reporter: Saanvi Sharma
>Priority: Minor
>  Labels: dynamodb, spark
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I am using Spark 2.1 on EMR and I have a dataframe like this:
>  ClientNum | Value_1 | Value_2 | Value_3 | Value_4
>  14        | A       | B       | C       | null
>  19        | X       | Y       | null    | null
>  21        | R       | null    | null    | null
> I want to load data into a DynamoDB table with ClientNum as the key, following:
> Analyze Your Data on Amazon DynamoDB with Apache Spark
> Using Spark SQL for ETL
> Here is the code that I tried:
>   var jobConf = new JobConf(sc.hadoopConfiguration)
>   jobConf.set("dynamodb.servicename", "dynamodb")
>   jobConf.set("dynamodb.input.tableName", "table_name")   
>   jobConf.set("dynamodb.output.tableName", "table_name")   
>   jobConf.set("dynamodb.endpoint", "dynamodb.eu-west-1.amazonaws.com")
>   jobConf.set("dynamodb.regionid", "eu-west-1")
>   jobConf.set("dynamodb.throughput.read", "1")
>   jobConf.set("dynamodb.throughput.read.percent", "1")
>   jobConf.set("dynamodb.throughput.write", "1")
>   jobConf.set("dynamodb.throughput.write.percent", "1")
>   
>   jobConf.set("mapred.output.format.class", 
> "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
>   jobConf.set("mapred.input.format.class", 
> "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")
>   #Import Data
>   val df = 
> sqlContext.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").load(path)
> I performed a transformation to have an RDD that matches the types that the 
> DynamoDB custom output format knows how to write. The custom output format 
> expects a tuple containing the Text and DynamoDBItemWritable types.
> Create a new RDD with those types in it, in the following map call:
>   #Convert the dataframe to rdd
>   val df_rdd = df.rdd
>   > df_rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> MapPartitionsRDD[10] at rdd at :41
>   
>   #Print first rdd
>   df_rdd.take(1)
>   > res12: Array[org.apache.spark.sql.Row] = Array([14,A,B,C,null])
>   var ddbInsertFormattedRDD = df_rdd.map(a => {
>   var ddbMap = new HashMap[String, AttributeValue]()
>   var ClientNum = new AttributeValue()
>   ClientNum.setN(a.get(0).toString)
>   ddbMap.put("ClientNum", ClientNum)
>   var Value_1 = new AttributeValue()
>   Value_1.setS(a.get(1).toString)
>   ddbMap.put("Value_1", Value_1)
>   var Value_2 = new AttributeValue()
>   Value_2.setS(a.get(2).toString)
>   ddbMap.put("Value_2", Value_2)
>   var Value_3 = new AttributeValue()
>   Value_3.setS(a.get(3).toString)
>   ddbMap.put("Value_3", Value_3)
>   var Value_4 = new AttributeValue()
>   Value_4.setS(a.get(4).toString)
>   ddbMap.put("Value_4", Value_4)
>   var item = new DynamoDBItemWritable()
>   item.setItem(ddbMap)
>   (new Text(""), item)
>   })
> This last call uses the job configuration that defines the EMR-DDB connector 
> to write out the new RDD you created in the expected format:
> ddbInsertFormattedRDD.saveAsHadoopDataset(jobConf)
> fails with the following error:
> Caused by: java.lang.NullPointerException
> Null values caused the error; if I try with only ClientNum and Value_1, it works
> and data is correctly inserted into the DynamoDB table.
> Thanks for your help !!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40357) Migrate window type check failures onto error classes

2022-09-06 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-40357:
-
Summary: Migrate window type check failures onto error classes  (was: 
Migrate window type checks onto error classes)

> Migrate window type check failures onto error classes
> -
>
> Key: SPARK-40357
> URL: https://issues.apache.org/jira/browse/SPARK-40357
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Major
>
> Replace TypeCheckFailure by DataTypeMismatch in type checks in window 
> expressions:
> 1. WindowSpecDefinition (4): 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L68-L85
> 2. SpecifiedWindowFrame (3): 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L216-L231
> 3. checkBoundary (2): 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L264-L269
> 4. FrameLessOffsetWindowFunction (1): 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L424
> 5. NthValue (2): 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L697-L700
> 6. NTile (3): 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L793-L804
>   



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40358) Migrate collection type check failures onto error classes

2022-09-06 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-40358:
-
Summary: Migrate collection type check failures onto error classes  (was: 
Migrate collection type checks onto error classes)

> Migrate collection type check failures onto error classes
> -
>
> Key: SPARK-40358
> URL: https://issues.apache.org/jira/browse/SPARK-40358
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Major
>
> Replace TypeCheckFailure by DataTypeMismatch in type checks in collection 
> expressions:
> 1. BinaryArrayExpressionWithImplicitCast (1): 
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L69]
> 2. MapContainsKey (2): 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L231-L237
> 3. MapConcat (1): 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L663
> 4. MapFromEntries (1): 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L801
> 5. SortArray (3): 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L1027-L1035
> 6. ArrayContains (2): 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L1259-L1264
> 7. ArrayPosition (1): 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2035
> 8. ElementAt (3): 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2177-L2187
> 9. Concat (1): 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2385-L2388
> 10. Flatten (1): 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2593-L2595
> 11. Sequence (1): 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2773
> 12. ArrayRemove (1): 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3445-L3447
> 13. ArrayDistinct (1): 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3642
> 14. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40359) Migrate JSON type check failures onto error classes

2022-09-06 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-40359:
-
Summary: Migrate JSON type check failures onto error classes  (was: Migrate 
JSON type checks onto error classes)

> Migrate JSON type check failures onto error classes
> ---
>
> Key: SPARK-40359
> URL: https://issues.apache.org/jira/browse/SPARK-40359
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Major
>
> Replace TypeCheckFailure by DataTypeMismatch in type checks in JSON 
> expressions:
> 1. JsonTuple (2): 
> https://github.com/apache/spark/blob/9b885ae3ba70cd97489e8768a335c9b9c10e9d87/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L395-L399
> 2. JsonToStructs (2): 
> https://github.com/apache/spark/blob/9b885ae3ba70cd97489e8768a335c9b9c10e9d87/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L570-L574
> 3. StructsToJson (4): 
> https://github.com/apache/spark/blob/9b885ae3ba70cd97489e8768a335c9b9c10e9d87/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L725-L743
> 4. SchemaOfJson (1): 
> https://github.com/apache/spark/blob/9b885ae3ba70cd97489e8768a335c9b9c10e9d87/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L806



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40359) Migrate JSON type checks onto error classes

2022-09-06 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-40359:
-
Description: 
Replace TypeCheckFailure by DataTypeMismatch in type checks in JSON expressions:
1. JsonTuple (2): 
https://github.com/apache/spark/blob/9b885ae3ba70cd97489e8768a335c9b9c10e9d87/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L395-L399
2. JsonToStructs (2): 
https://github.com/apache/spark/blob/9b885ae3ba70cd97489e8768a335c9b9c10e9d87/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L570-L574
3. StructsToJson (4): 
https://github.com/apache/spark/blob/9b885ae3ba70cd97489e8768a335c9b9c10e9d87/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L725-L743
4. SchemaOfJson (1): 
https://github.com/apache/spark/blob/9b885ae3ba70cd97489e8768a335c9b9c10e9d87/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L806

  was:
Replace TypeCheckFailure by DataTypeMismatch in type checks in window 
expressions:
1. BinaryArrayExpressionWithImplicitCast (1): 
[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L69]
2. MapContainsKey (2): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L231-L237
3. MapConcat (1): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L663
4. MapFromEntries (1): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L801
5. SortArray (3): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L1027-L1035
6. ArrayContains (2): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L1259-L1264
7. ArrayPosition (1): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2035
8. ElementAt (3): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2177-L2187
9. Concat (1): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2385-L2388
10. Flatten (1): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2593-L2595
11. Sequence (1): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2773
12. ArrayRemove (1): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3445-L3447
13. ArrayDistinct (1): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3642
14. 


> Migrate JSON type checks onto error classes
> ---
>
> Key: SPARK-40359
> URL: https://issues.apache.org/jira/browse/SPARK-40359
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Major
>
> Replace TypeCheckFailure by DataTypeMismatch in type checks in JSON 
> expressions:
> 1. JsonTuple (2): 
> https://github.com/apache/spark/blob/9b885ae3ba70cd97489e8768a335c9b9c10e9d87/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L395-L399
> 2. JsonToStructs (2): 
> https://github.com/apache/spark/blob/9b885ae3ba70cd97489e8768a335c9b9c10e9d87/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L570-L574
> 3. StructsToJson (4): 
> https://github.com/apache/spark/blob/9b885ae3ba70cd97489e8768a335c9b9c10e9d87/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L725-L743
> 4. SchemaOfJson (1): 
> https://github.com/apache/spark/blob/9b885ae3ba70cd97489e8768a335c9b9c10e9d87/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L806



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40360) Convert some DDL exception to new error framework

2022-09-06 Thread Serge Rielau (Jira)
Serge Rielau created SPARK-40360:


 Summary: Convert some DDL exception to new error framework
 Key: SPARK-40360
 URL: https://issues.apache.org/jira/browse/SPARK-40360
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.4.0
Reporter: Serge Rielau


Tackling the following files:

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/AlreadyExistException.scala
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/NoSuchItemException.scala

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CannotReplaceMissingTableException.scala

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/NonEmptyException.scala
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala

Here is the doc with proposed text:
https://docs.google.com/document/d/1TpFx3AwcJZd3l7zB1ZDchvZ8j2dY6_uf5LHfW2gjE4A/edit?usp=sharing



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40358) Migrate collection type checks onto error classes

2022-09-06 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-40358:
-
Description: 
Replace TypeCheckFailure by DataTypeMismatch in type checks in collection 
expressions:
1. BinaryArrayExpressionWithImplicitCast (1): 
[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L69]
2. MapContainsKey (2): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L231-L237
3. MapConcat (1): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L663
4. MapFromEntries (1): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L801
5. SortArray (3): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L1027-L1035
6. ArrayContains (2): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L1259-L1264
7. ArrayPosition (1): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2035
8. ElementAt (3): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2177-L2187
9. Concat (1): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2385-L2388
10. Flatten (1): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2593-L2595
11. Sequence (1): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2773
12. ArrayRemove (1): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3445-L3447
13. ArrayDistinct (1): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3642
14. 

  was:
Replace TypeCheckFailure by DataTypeMismatch in type checks in window 
expressions:
1. BinaryArrayExpressionWithImplicitCast (1): 
[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L69]
2. MapContainsKey (2): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L231-L237
3. MapConcat (1): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L663
4. MapFromEntries (1): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L801
5. SortArray (3): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L1027-L1035
6. ArrayContains (2): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L1259-L1264
7. ArrayPosition (1): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2035
8. ElementAt (3): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2177-L2187
9. Concat (1): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2385-L2388
10. Flatten (1): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2593-L2595
11. Sequence (1): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2773
12. ArrayRemove (1): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3445-L3447
13. ArrayDistinct (1): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3642
14. 


> Migrate collection type checks onto error classes
> -
>

[jira] [Created] (SPARK-40359) Migrate JSON type checks onto error classes

2022-09-06 Thread Max Gekk (Jira)
Max Gekk created SPARK-40359:


 Summary: Migrate JSON type checks onto error classes
 Key: SPARK-40359
 URL: https://issues.apache.org/jira/browse/SPARK-40359
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Max Gekk


Replace TypeCheckFailure by DataTypeMismatch in type checks in window 
expressions:
1. BinaryArrayExpressionWithImplicitCast (1): 
[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L69]
2. MapContainsKey (2): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L231-L237
3. MapConcat (1): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L663
4. MapFromEntries (1): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L801
5. SortArray (3): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L1027-L1035
6. ArrayContains (2): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L1259-L1264
7. ArrayPosition (1): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2035
8. ElementAt (3): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2177-L2187
9. Concat (1): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2385-L2388
10. Flatten (1): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2593-L2595
11. Sequence (1): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2773
12. ArrayRemove (1): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3445-L3447
13. ArrayDistinct (1): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3642
14. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40358) Migrate collection type checks onto error classes

2022-09-06 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-40358:
-
Description: 
Replace TypeCheckFailure by DataTypeMismatch in type checks in window 
expressions:
1. BinaryArrayExpressionWithImplicitCast (1): 
[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L69]
2. MapContainsKey (2): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L231-L237
3. MapConcat (1): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L663
4. MapFromEntries (1): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L801
5. SortArray (3): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L1027-L1035
6. ArrayContains (2): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L1259-L1264
7. ArrayPosition (1): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2035
8. ElementAt (3): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2177-L2187
9. Concat (1): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2385-L2388
10. Flatten (1): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2593-L2595
11. Sequence (1): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2773
12. ArrayRemove (1): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3445-L3447
13. ArrayDistinct (1): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3642
14. 

  was:
Replace TypeCheckFailure by DataTypeMismatch in type checks in window 
expressions:
1. WindowSpecDefinition (4): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L68-L85
2. SpecifiedWindowFrame (3): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L216-L231
3. checkBoundary (2): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L264-L269
4. FrameLessOffsetWindowFunction (1): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L424
5. NthValue (2): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L697-L700
6. NTile (3): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L793-L804
  


> Migrate collection type checks onto error classes
> -
>
> Key: SPARK-40358
> URL: https://issues.apache.org/jira/browse/SPARK-40358
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Major
>
> Replace TypeCheckFailure by DataTypeMismatch in type checks in window 
> expressions:
> 1. BinaryArrayExpressionWithImplicitCast (1): 
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L69]
> 2. MapContainsKey (2): 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L231-L237
> 3. MapConcat (1): 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L663
> 4. MapFromEntries (1): 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L801
> 5. SortArray (3): 
> 

[jira] [Created] (SPARK-40358) Migrate collection type checks onto error classes

2022-09-06 Thread Max Gekk (Jira)
Max Gekk created SPARK-40358:


 Summary: Migrate collection type checks onto error classes
 Key: SPARK-40358
 URL: https://issues.apache.org/jira/browse/SPARK-40358
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Max Gekk


Replace TypeCheckFailure by DataTypeMismatch in type checks in window 
expressions:
1. WindowSpecDefinition (4): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L68-L85
2. SpecifiedWindowFrame (3): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L216-L231
3. checkBoundary (2): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L264-L269
4. FrameLessOffsetWindowFunction (1): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L424
5. NthValue (2): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L697-L700
6. NTile (3): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L793-L804
  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40357) Migrate window type checks onto error classes

2022-09-06 Thread Max Gekk (Jira)
Max Gekk created SPARK-40357:


 Summary: Migrate window type checks onto error classes
 Key: SPARK-40357
 URL: https://issues.apache.org/jira/browse/SPARK-40357
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Max Gekk


Replace TypeCheckFailure by DataTypeMismatch in type checks in window 
expressions:
1. WindowSpecDefinition (4): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L68-L85
2. SpecifiedWindowFrame (3): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L216-L231
3. checkBoundary (2): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L264-L269
4. FrameLessOffsetWindowFunction (1): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L424
5. NthValue (2): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L697-L700
6. NTile (3): 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L793-L804
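
The change described here (and in the sibling tickets) is mechanical; below is a minimal, self-contained sketch of the pattern. The types only mimic Spark's internal TypeCheckResult classes, and the sub-class and parameter names are illustrative, not the exact ones used in the final patches:
{code:java}
// Toy stand-ins for Spark's internal type-check result classes (not the real API).
sealed trait TypeCheckResult
case object TypeCheckSuccess extends TypeCheckResult
case class TypeCheckFailure(message: String) extends TypeCheckResult            // legacy: free-form text
case class DataTypeMismatch(
    errorSubClass: String,
    messageParameters: Map[String, String]) extends TypeCheckResult             // error-class based

// Before: an unstructured message that is hard to test, document, or translate.
def checkBefore(inputType: String): TypeCheckResult =
  if (inputType == "ARRAY") TypeCheckSuccess
  else TypeCheckFailure(s"argument 1 requires array type, however it is of $inputType type")

// After: a named error sub-class plus structured message parameters.
def checkAfter(inputType: String): TypeCheckResult =
  if (inputType == "ARRAY") TypeCheckSuccess
  else DataTypeMismatch(
    errorSubClass = "UNEXPECTED_INPUT_TYPE",
    messageParameters = Map("paramIndex" -> "1", "requiredType" -> "ARRAY", "inputType" -> inputType))
{code}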
  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40320) When the Executor plugin fails to initialize, the Executor shows active but does not accept tasks forever, just like being hung

2022-09-06 Thread wuyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600777#comment-17600777
 ] 

wuyi commented on SPARK-40320:
--

> Actually the `CoarseGrainedExecutorBackend` JVM process is active, but the
> communication thread is no longer working (please see
> `MessageLoop#receiveLoopRunnable`; `receiveLoop()` was broken, so the executor
> doesn't receive any messages)

Hmm, shouldn't it bring up a new `receiveLoop()` to serve RPC messages, according to 
[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rpc/netty/MessageLoop.scala#L82-L89]
? Why doesn't it?

 

Besides, why doesn't SparkUncaughtExceptionHandler catch the fatal error?

 

cc [~tgraves] [~mridulm80] 

> When the Executor plugin fails to initialize, the Executor shows active but 
> does not accept tasks forever, just like being hung
> ---
>
> Key: SPARK-40320
> URL: https://issues.apache.org/jira/browse/SPARK-40320
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 3.0.0
>Reporter: Mars
>Priority: Major
>
> *Reproduce step:*
> set `spark.plugins=ErrorSparkPlugin`
> The `ErrorSparkPlugin` and `ErrorExecutorPlugin` classes are shown below (I abbreviated the 
> code to make it clearer):
> {code:java}
> class ErrorSparkPlugin extends SparkPlugin {
>   override def driverPlugin(): DriverPlugin = new ErrorDriverPlugin()
>
>   override def executorPlugin(): ExecutorPlugin = new ErrorExecutorPlugin()
> }{code}
> {code:java}
> class ErrorExecutorPlugin extends ExecutorPlugin {
>   private val checkingInterval: Long = 1
>   override def init(_ctx: PluginContext, extraConf: util.Map[String, String]): Unit = {
>     if (checkingInterval == 1) {
>       throw new UnsatisfiedLinkError("My Exception error")
>     }
>   }
> } {code}
> The Executor shows as active when we check the Spark UI; however, it is broken and 
> doesn't receive any tasks.
> *Root Cause:*
> I checked the code and found that `org.apache.spark.rpc.netty.Inbox#safelyCall` 
> throws a fatal error (`UnsatisfiedLinkError` is a fatal error) in the method 
> `dealWithFatalError`. Actually the `CoarseGrainedExecutorBackend` JVM 
> process is active, but the communication thread is no longer working (please 
> see `MessageLoop#receiveLoopRunnable`; `receiveLoop()` was broken, so the 
> executor doesn't receive any messages)
> Some ideas:
> I think it is very hard to know what happened here unless we check the 
> code. The Executor is active but it can't do anything, so we are left wondering 
> whether the driver is broken or the Executor has a problem. I think at least the 
> Executor status shouldn't be active here, or the Executor could call 
> exitExecutor (kill itself).
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40315) Non-deterministic hashCode() calculations for ArrayBasedMapData on equal objects

2022-09-06 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-40315:
---

Assignee: Carmen Kwan

> Non-deterministic hashCode() calculations for ArrayBasedMapData on equal 
> objects
> 
>
> Key: SPARK-40315
> URL: https://issues.apache.org/jira/browse/SPARK-40315
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.2
>Reporter: Carmen Kwan
>Assignee: Carmen Kwan
>Priority: Major
> Fix For: 3.1.4, 3.4.0, 3.3.1, 3.2.3
>
>
> There is no explicit `hashCode()` function override for the 
> `ArrayBasedMapData` LogicalPlan. As a result, the `hashCode()` computed for 
> `ArrayBasedMapData` can be different for two equal objects (objects with 
> equal keys and values).
> This error is non-deterministic and hard to reproduce, as we don't control 
> the default `hashCode()` function.
> We should override the `hashCode` function so that it works exactly as we 
> expect. We should also have an explicit `equals()` function for consistency 
> with how `Literals` check for equality of `ArrayBasedMapData`.
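
To make the contract concrete, here is a minimal, self-contained sketch (not the actual Spark patch; the class below only mimics ArrayBasedMapData) of a content-based equals/hashCode pair, so that structurally equal objects always hash identically:
{code:java}
// Illustrative stand-in for ArrayBasedMapData; not the real Spark class.
class SimpleMapData(val keyArray: Seq[Any], val valueArray: Seq[Any]) {
  // Equality is defined by the contents, mirroring how Literals compare map data.
  override def equals(other: Any): Boolean = other match {
    case that: SimpleMapData =>
      keyArray == that.keyArray && valueArray == that.valueArray
    case _ => false
  }
  // Derive the hash from the same contents, so equal objects always produce equal hashes.
  override def hashCode(): Int = 31 * keyArray.hashCode() + valueArray.hashCode()
}

// Two structurally equal instances now agree on both equals and hashCode.
val m1 = new SimpleMapData(Seq("a"), Seq(1))
val m2 = new SimpleMapData(Seq("a"), Seq(1))
assert(m1 == m2 && m1.hashCode == m2.hashCode)
{code}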



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40315) Non-deterministic hashCode() calculations for ArrayBasedMapData on equal objects

2022-09-06 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-40315.
-
Fix Version/s: 3.3.1
   3.1.4
   3.2.3
   3.4.0
   Resolution: Fixed

Issue resolved by pull request 37807
[https://github.com/apache/spark/pull/37807]

> Non-deterministic hashCode() calculations for ArrayBasedMapData on equal 
> objects
> 
>
> Key: SPARK-40315
> URL: https://issues.apache.org/jira/browse/SPARK-40315
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.2
>Reporter: Carmen Kwan
>Priority: Major
> Fix For: 3.3.1, 3.1.4, 3.2.3, 3.4.0
>
>
> There is no explicit `hashCode()` function override for the 
> `ArrayBasedMapData` LogicalPlan. As a result, the `hashCode()` computed for 
> `ArrayBasedMapData` can be different for two equal objects (objects with 
> equal keys and values).
> This error is non-deterministic and hard to reproduce, as we don't control 
> the default `hashCode()` function.
> We should override the `hashCode` function so that it works exactly as we 
> expect. We should also have an explicit `equals()` function for consistency 
> with how `Literals` check for equality of `ArrayBasedMapData`.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38330) Certificate doesn't match any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]

2022-09-06 Thread Jayita Panda (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600766#comment-17600766
 ] 

Jayita Panda commented on SPARK-38330:
--

Can someone elaborate on the resolution here? I'm facing the same problem. cc 

Exception in thread "main" org.apache.hadoop.fs.s3a.AWSClientIOException: 
getFileStatus on 
s3a://com.twilio.dev.warehouse/data/warehouse/sfdc-user/v1.0/1/partitions/dt=20201012T/sfdc-user0/2020-10-12T.json.gz:
 com.amazonaws.SdkClientException: Unable to execute HTTP request: Certificate 
for  doesn't match any of the 
subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]: Unable to 
execute HTTP request: Certificate for 
 doesn't match any of the subject 
alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]

> Certificate doesn't match any of the subject alternative names: 
> [*.s3.amazonaws.com, s3.amazonaws.com]
> --
>
> Key: SPARK-38330
> URL: https://issues.apache.org/jira/browse/SPARK-38330
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 3.2.1
> Environment: Spark 3.2.1 built with `hadoop-cloud` flag.
> Direct access to s3 using default file committer.
> JDK8.
>  
>Reporter: André F.
>Priority: Major
>
> Trying to run any job after bumping our Spark version from 3.1.2 to 3.2.1 
> led us to the following exception while reading files on s3:
> {code:java}
> org.apache.hadoop.fs.s3a.AWSClientIOException: getFileStatus on 
> s3a:///.parquet: com.amazonaws.SdkClientException: Unable to 
> execute HTTP request: Certificate for  doesn't match 
> any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]: 
> Unable to execute HTTP request: Certificate for  doesn't match any of 
> the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com] at 
> org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:208) at 
> org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:170) at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3351)
>  at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3185)
>  at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.isDirectory(S3AFileSystem.java:4277) 
> at 
> org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:54)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:370)
>  at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:274) 
> at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:245)
>  at scala.Option.getOrElse(Option.scala:189) at 
> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245) at 
> org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:596) {code}
>  
> {code:java}
> Caused by: javax.net.ssl.SSLPeerUnverifiedException: Certificate for 
>  doesn't match any of the subject alternative names: 
> [*.s3.amazonaws.com, s3.amazonaws.com]
>   at 
> com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.verifyHostname(SSLConnectionSocketFactory.java:507)
>   at 
> com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:437)
>   at 
> com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:384)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376)
>   at sun.reflect.GeneratedMethodAccessor36.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> com.amazonaws.http.conn.ClientConnectionManagerFactory$Handler.invoke(ClientConnectionManagerFactory.java:76)
>   at com.amazonaws.http.conn.$Proxy16.connect(Unknown Source)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
>   at 
> 

[jira] [Resolved] (SPARK-40352) Add function aliases: len, datepart, dateadd, date_diff and curdate

2022-09-06 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-40352.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37804
[https://github.com/apache/spark/pull/37804]

> Add function aliases: len, datepart, dateadd, date_diff and curdate
> ---
>
> Key: SPARK-40352
> URL: https://issues.apache.org/jira/browse/SPARK-40352
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.4.0
>
>
> The functions len, datepart, dateadd, date_diff and curdate exist in other 
> systems, and Spark SQL has similar functions. So, adding such aliases will 
> make the migration to Spark SQL easier.
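
As a quick illustration, here is a hedged spark-shell sketch, assuming a Spark version that ships these aliases (3.4.0+); each alias is expected to return the same result as the existing function it mirrors, and the argument orders below are assumptions based on those existing functions:
{code:java}
spark.sql("SELECT len('Spark'), length('Spark')").show()            // 5, 5
spark.sql("SELECT curdate(), current_date()").show()                // today's date, twice
spark.sql(
  "SELECT date_diff(DATE'2022-09-06', DATE'2022-09-01'), " +
  "datediff(DATE'2022-09-06', DATE'2022-09-01')").show()            // 5, 5
{code}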



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40327) Increase pandas API coverage for pandas API on Spark

2022-09-06 Thread Yikun Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600756#comment-17600756
 ] 

Yikun Jiang commented on SPARK-40327:
-

[~podongfeng] Thanks, will work on it!

> Increase pandas API coverage for pandas API on Spark
> 
>
> Key: SPARK-40327
> URL: https://issues.apache.org/jira/browse/SPARK-40327
> Project: Spark
>  Issue Type: Umbrella
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Increasing the pandas API coverage for Apache Spark 3.4.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40332) Implement `GroupBy.quantile`.

2022-09-06 Thread Yikun Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600755#comment-17600755
 ] 

Yikun Jiang commented on SPARK-40332:
-

working on this

> Implement `GroupBy.quantile`.
> -
>
> Key: SPARK-40332
> URL: https://issues.apache.org/jira/browse/SPARK-40332
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> We should implement `GroupBy.quantile` for increasing pandas API coverage.
> pandas docs: 
> https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.quantile.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40356) Upgrade pandas to 1.4.4

2022-09-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600754#comment-17600754
 ] 

Apache Spark commented on SPARK-40356:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37810

> Upgrade pandas to 1.4.4
> ---
>
> Key: SPARK-40356
> URL: https://issues.apache.org/jira/browse/SPARK-40356
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40356) Upgrade pandas to 1.4.4

2022-09-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40356:


Assignee: Apache Spark

> Upgrade pandas to 1.4.4
> ---
>
> Key: SPARK-40356
> URL: https://issues.apache.org/jira/browse/SPARK-40356
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40356) Upgrade pandas to 1.4.4

2022-09-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40356:


Assignee: (was: Apache Spark)

> Upgrade pandas to 1.4.4
> ---
>
> Key: SPARK-40356
> URL: https://issues.apache.org/jira/browse/SPARK-40356
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40356) Upgrade pandas to 1.4.4

2022-09-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600753#comment-17600753
 ] 

Apache Spark commented on SPARK-40356:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37810

> Upgrade pandas to 1.4.4
> ---
>
> Key: SPARK-40356
> URL: https://issues.apache.org/jira/browse/SPARK-40356
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40356) Upgrade pandas to 1.4.4

2022-09-06 Thread Yikun Jiang (Jira)
Yikun Jiang created SPARK-40356:
---

 Summary: Upgrade pandas to 1.4.4
 Key: SPARK-40356
 URL: https://issues.apache.org/jira/browse/SPARK-40356
 Project: Spark
  Issue Type: Sub-task
  Components: Pandas API on Spark
Affects Versions: 3.4.0
Reporter: Yikun Jiang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40355) Improve pushdown for orc & parquet when cast scenario

2022-09-06 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-40355:

Description: 
Table Schema:
| a| b| part| |
| int|string  | string| |

 

SQL:
select * from table where b = 1 and part = 1

A.Partition column 'part' has been pushed down, but column 'b' no push down.
!https://user-images.githubusercontent.com/15246973/188630971-4eed59c0-d134-4994-a7aa-0d057a49c5dc.png!

B.After apply the pr, Partition column 'part' and column 'b' has been pushed 
down.
!https://user-images.githubusercontent.com/15246973/188631231-d30db19d-1d70-4712-bf75-bccaa2c790f4.png!

  was:
 Table Schema:
|  a |  b |  part |
|---|---|---|
|  int | string   |  string | 

 SQL:
select * from table where b = 1 and part = 1

### A.Partition column 'part' has been pushed down, but column 'b' no push down.
https://user-images.githubusercontent.com/15246973/188630971-4eed59c0-d134-4994-a7aa-0d057a49c5dc.png;>

### B.After apply the pr, Partition column 'part' and column 'b' has been 
pushed down.
https://user-images.githubusercontent.com/15246973/188631231-d30db19d-1d70-4712-bf75-bccaa2c790f4.png;>


> Improve pushdown for orc & parquet when cast scenario
> -
>
> Key: SPARK-40355
> URL: https://issues.apache.org/jira/browse/SPARK-40355
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Minor
> Fix For: 3.4.0
>
>
> Table Schema:
> | a| b| part| |
> | int|string  | string| |
>  
> SQL:
> select * from table where b = 1 and part = 1
> A.Partition column 'part' has been pushed down, but column 'b' no push down.
> !https://user-images.githubusercontent.com/15246973/188630971-4eed59c0-d134-4994-a7aa-0d057a49c5dc.png!
> B.After apply the pr, Partition column 'part' and column 'b' has been pushed 
> down.
> !https://user-images.githubusercontent.com/15246973/188631231-d30db19d-1d70-4712-bf75-bccaa2c790f4.png!
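
For readers without access to the screenshots, here is a hedged spark-shell reproduction sketch of the scenario (the table name `t` is made up; column names follow the schema above). The point is to inspect the PushedFilters entry of the scan node in the physical plan: before the change only 'part' appears there, because comparing the string column 'b' with an int literal wraps 'b' in a cast.
{code:java}
// Build a small partitioned Parquet table matching the schema above.
spark.sql("CREATE TABLE t (a INT, b STRING, part STRING) USING parquet PARTITIONED BY (part)")
spark.sql("INSERT INTO t PARTITION (part = '1') VALUES (1, '1')")

// 'b = 1' inserts a cast around the string column 'b'; check PushedFilters in the scan node.
spark.sql("SELECT * FROM t WHERE b = 1 AND part = 1").explain(true)
{code}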



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40355) Improve pushdown for orc & parquet when cast scenario

2022-09-06 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-40355:

Description: 
 Table Schema:
|  a |  b |  part |
|---|---|---|
|  int | string   |  string | 

 SQL:
select * from table where b = 1 and part = 1

### A.Partition column 'part' has been pushed down, but column 'b' no push down.
https://user-images.githubusercontent.com/15246973/188630971-4eed59c0-d134-4994-a7aa-0d057a49c5dc.png;>

### B.After apply the pr, Partition column 'part' and column 'b' has been 
pushed down.
https://user-images.githubusercontent.com/15246973/188631231-d30db19d-1d70-4712-bf75-bccaa2c790f4.png;>

  was:I will supplement the scene later


> Improve pushdown for orc & parquet when cast scenario
> -
>
> Key: SPARK-40355
> URL: https://issues.apache.org/jira/browse/SPARK-40355
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Minor
> Fix For: 3.4.0
>
>
>  Table Schema:
> |  a |  b |  part |
> |---|---|---|
> |  int | string   |  string | 
>  SQL:
> select * from table where b = 1 and part = 1
> ### A.Partition column 'part' has been pushed down, but column 'b' no push 
> down.
>  src="https://user-images.githubusercontent.com/15246973/188630971-4eed59c0-d134-4994-a7aa-0d057a49c5dc.png;>
> ### B.After apply the pr, Partition column 'part' and column 'b' has been 
> pushed down.
>  src="https://user-images.githubusercontent.com/15246973/188631231-d30db19d-1d70-4712-bf75-bccaa2c790f4.png;>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40355) Improve pushdown for orc & parquet when cast scenario

2022-09-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600737#comment-17600737
 ] 

Apache Spark commented on SPARK-40355:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37809

> Improve pushdown for orc & parquet when cast scenario
> -
>
> Key: SPARK-40355
> URL: https://issues.apache.org/jira/browse/SPARK-40355
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Minor
> Fix For: 3.4.0
>
>
> I will supplement the scene later



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40355) Improve pushdown for orc & parquet when cast scenario

2022-09-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40355:


Assignee: Apache Spark

> Improve pushdown for orc & parquet when cast scenario
> -
>
> Key: SPARK-40355
> URL: https://issues.apache.org/jira/browse/SPARK-40355
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 3.4.0
>
>
> I will supplement the scene later



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40355) Improve pushdown for orc & parquet when cast scenario

2022-09-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600736#comment-17600736
 ] 

Apache Spark commented on SPARK-40355:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37809

> Improve pushdown for orc & parquet when cast scenario
> -
>
> Key: SPARK-40355
> URL: https://issues.apache.org/jira/browse/SPARK-40355
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Minor
> Fix For: 3.4.0
>
>
> I will supplement the scene later



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40355) Improve pushdown for orc & parquet when cast scenario

2022-09-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40355:


Assignee: (was: Apache Spark)

> Improve pushdown for orc & parquet when cast scenario
> -
>
> Key: SPARK-40355
> URL: https://issues.apache.org/jira/browse/SPARK-40355
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Minor
> Fix For: 3.4.0
>
>
> I will supplement the scene later



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40355) Improve pushdown for orc & parquet when cast scenario

2022-09-06 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-40355:
---

 Summary: Improve pushdown for orc & parquet when cast scenario
 Key: SPARK-40355
 URL: https://issues.apache.org/jira/browse/SPARK-40355
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: BingKun Pan
 Fix For: 3.4.0


I will supplement the scene later



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39830) Add a test case to read ORC table that requires type promotion

2022-09-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600731#comment-17600731
 ] 

Apache Spark commented on SPARK-39830:
--

User 'cxzl25' has created a pull request for this issue:
https://github.com/apache/spark/pull/37808

> Add a test case to read ORC table that requires type promotion
> --
>
> Key: SPARK-39830
> URL: https://issues.apache.org/jira/browse/SPARK-39830
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.3.0
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Trivial
> Fix For: 3.4.0
>
>
> We can add a UT to test the scenario after the ORC-1205 release.
>  
> bin/spark-shell
> {code:java}
> spark.sql("set orc.stripe.size=10240")
> spark.sql("set orc.rows.between.memory.checks=1")
> spark.sql("set spark.sql.orc.columnarWriterBatchSize=1")
> val df = spark.range(1, 1+512, 1, 1).map { i =>
>     if( i == 1 ){
>         (i, Array.fill[Byte](5 * 1024 * 1024)('X'))
>     } else {
>         (i,Array.fill[Byte](1)('X'))
>     }
>     }.toDF("c1","c2")
> df.write.format("orc").save("file:///tmp/test_table_orc_t1")
> spark.sql("create external table test_table_orc_t1 (c1 string ,c2 binary) 
> location 'file:///tmp/test_table_orc_t1' stored as orc ")
> spark.sql("select * from test_table_orc_t1").show() {code}
> Querying this table will get the following exception
> {code:java}
> java.lang.ArrayIndexOutOfBoundsException: 1
>         at 
> org.apache.orc.impl.TreeReaderFactory$TreeReader.nextVector(TreeReaderFactory.java:387)
>         at 
> org.apache.orc.impl.TreeReaderFactory$LongTreeReader.nextVector(TreeReaderFactory.java:740)
>         at 
> org.apache.orc.impl.ConvertTreeReaderFactory$StringGroupFromAnyIntegerTreeReader.nextVector(ConvertTreeReaderFactory.java:1069)
>         at 
> org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:65)
>         at 
> org.apache.orc.impl.reader.tree.StructBatchReader.nextBatchForLevel(StructBatchReader.java:100)
>         at 
> org.apache.orc.impl.reader.tree.StructBatchReader.nextBatch(StructBatchReader.java:77)
>         at 
> org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1371)
>         at 
> org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:84)
>         at 
> org.apache.orc.mapreduce.OrcMapreduceRecordReader.nextKeyValue(OrcMapreduceRecordReader.java:102)
>         at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>  {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40315) Non-deterministic hashCode() calculations for ArrayBasedMapData on equal objects

2022-09-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600708#comment-17600708
 ] 

Apache Spark commented on SPARK-40315:
--

User 'c27kwan' has created a pull request for this issue:
https://github.com/apache/spark/pull/37807

> Non-deterministic hashCode() calculations for ArrayBasedMapData on equal 
> objects
> 
>
> Key: SPARK-40315
> URL: https://issues.apache.org/jira/browse/SPARK-40315
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.2
>Reporter: Carmen Kwan
>Priority: Major
>
> There is no explicit `hashCode()` function override for the 
> `ArrayBasedMapData` LogicalPlan. As a result, the `hashCode()` computed for 
> `ArrayBasedMapData` can be different for two equal objects (objects with 
> equal keys and values).
> This error is non-deterministic and hard to reproduce, as we don't control 
> the default `hashCode()` function.
> We should override the `hashCode` function so that it works exactly as we 
> expect. We should also have an explicit `equals()` function for consistency 
> with how `Literals` check for equality of `ArrayBasedMapData`.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40354) Support eliminate dynamic partition for v1 writes

2022-09-06 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-40354:
--
Description: 
v1 writes will add an extra sort for dynamic columns, e.g.
{code:java}
INSERT INTO TABLE t1 PARTITION(p)
SELECT c1, c2, 'a' as p FROM t2 {code}
if the dynamic columns are foldable, we can optimize it to:
{code:java}
INSERT INTO TABLE t1 PARTITION(p='a')
SELECT c1, c2 FROM t2 {code}

  was:
v1 writes will add a extra sort for dynamic columns, e.g.
{code:java}
INSERT INTO TABLE t1 PARTITION(p)
SELECT c1, c2, 'a' as p FROM t2 {code}
if the dynamic columns are foldable, we can optimize it to:
{code:java}
INSERT INTO TABLE t1 PARTITION(p='a')
SELECT c1, c2 FROM t2 {code}


> Support eliminate dynamic partition for v1 writes
> -
>
> Key: SPARK-40354
> URL: https://issues.apache.org/jira/browse/SPARK-40354
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Priority: Major
>
> v1 writes will add an extra sort for dynamic columns, e.g.
> {code:java}
> INSERT INTO TABLE t1 PARTITION(p)
> SELECT c1, c2, 'a' as p FROM t2 {code}
> if the dynamic columns are foldable, we can optimize it to:
> {code:java}
> INSERT INTO TABLE t1 PARTITION(p='a')
> SELECT c1, c2 FROM t2 {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40354) Support eliminate dynamic partition for v1 writes

2022-09-06 Thread XiDuo You (Jira)
XiDuo You created SPARK-40354:
-

 Summary: Support eliminate dynamic partition for v1 writes
 Key: SPARK-40354
 URL: https://issues.apache.org/jira/browse/SPARK-40354
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: XiDuo You


v1 writes will add a extra sort for dynamic columns, e.g.
{code:java}
INSERT INTO TABLE t1 PARTITION(p)
SELECT c1, c2, 'a' as p FROM t2 {code}
if the dynamic columns are foldable, we can optimize it to:
{code:java}
INSERT INTO TABLE t1 PARTITION(p='a')
SELECT c1, c2 FROM t2 {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40324) Provide a query context of ParseException

2022-09-06 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-40324.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37794
[https://github.com/apache/spark/pull/37794]

> Provide a query context of ParseException
> -
>
> Key: SPARK-40324
> URL: https://issues.apache.org/jira/browse/SPARK-40324
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.4.0
>
>
> Extend the exception ParseException and add a queryContext to it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40300) Migrate onto the DATATYPE_MISMATCH error classes

2022-09-06 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-40300.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37744
[https://github.com/apache/spark/pull/37744]

> Migrate onto the DATATYPE_MISMATCH error classes
> 
>
> Key: SPARK-40300
> URL: https://issues.apache.org/jira/browse/SPARK-40300
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.4.0
>
>
> Introduce new error class DATATYPE_MISMATCH w/ some sub-classes, and convert 
> TypeCheckFailure to another class w/ error sub-classes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39546) Respect port defininitions on K8S pod templates for both driver and executor

2022-09-06 Thread Oliver Koeth (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600669#comment-17600669
 ] 

Oliver Koeth commented on SPARK-39546:
--

Yes, looks like this should do it. Thank you

> Respect port defininitions on K8S pod templates for both driver and executor
> 
>
> Key: SPARK-39546
> URL: https://issues.apache.org/jira/browse/SPARK-39546
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Oliver Koeth
>Priority: Minor
>
> *Description:*
> Spark on K8S allows opening additional ports for custom purposes on the 
> driver pod via the pod template, but ignores the port specification in the 
> executor pod template. Port specifications from the pod template should be 
> preserved (and extended) for both drivers and executors.
> *Scenario:*
> I want to run functionality in the executor that exposes data on an 
> additional port. In my case, this is monitoring data exposed by Spark's JMX 
> metrics sink via the JMX prometheus exporter java agent 
> https://github.com/prometheus/jmx_exporter -- the java agent opens an extra 
> port inside the container, but for prometheus to detect and scrape the port, 
> it must be exposed in the K8S pod resource.
> (More background if desired: This seems to be the "classic" Spark 2 way to 
> expose prometheus metrics. Spark 3 introduced a native equivalent servlet for 
> the driver, but for the executor, only a rather limited set of metrics is 
> forwarded via the driver, and that also follows a completely different naming 
> scheme. So the JMX + exporter approach still turns out to be more useful for 
> me, even in Spark 3)
> Expected behavior:
> I add the following to my pod template to expose the extra port opened by the 
> JMX exporter java agent
> spec:
>   containers:
>   - ...
>     ports:
>     - containerPort: 8090
>       name: jmx-prometheus
>       protocol: TCP
> Observed behavior:
> The port is exposed for driver pods but not for executor pods
> *Corresponding code:*
> driver pod creation just adds ports
> [https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala]
>  (currently line 115)
> val driverContainer = new ContainerBuilder(pod.container)
> ...
>   .addNewPort()
> ...
>   .addNewPort()
> while executor pod creation replaces the ports
> [https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicExecutorFeatureStep.scala]
>  (currently line 211)
> val executorContainer = new ContainerBuilder(pod.container)
> ...
>   .withPorts(requiredPorts.asJava)
> The current handling is inconsistent and unnecessarily limiting. It seems that 
> the executor creation could/should just as well preserve the ports from the 
> template and add the extra required ports.
> *Workaround:*
> It is possible to work around this limitation by adding a full sidecar 
> container to the executor pod spec which declares the port. Sidecar 
> containers are left unchanged by pod template handling.
> As all containers in a pod share the same network, it does not matter which 
> container actually declares to expose the port.
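
Regarding the "preserve and extend" suggestion above, a hedged sketch of what merging could look like, using the fabric8 ContainerBuilder API referenced in the excerpts. The variable names stand in for `pod.container` and `requiredPorts`, and any builder calls beyond those excerpts are assumptions; this is not the actual Spark patch:
{code:java}
import io.fabric8.kubernetes.api.model.{ContainerBuilder, ContainerPortBuilder}
import scala.collection.JavaConverters._

// Stand-in for the container coming from the executor pod template (pod.container above).
val templateContainer = new ContainerBuilder()
  .addNewPort().withName("jmx-prometheus").withContainerPort(8090).withProtocol("TCP").endPort()
  .build()

// Stand-in for Spark's own required executor ports (requiredPorts above).
val requiredPorts = Seq(
  new ContainerPortBuilder().withName("blockmanager").withContainerPort(7079).build())

// Preserve the template-declared ports and append the required ones, instead of overwriting them.
val executorContainer = new ContainerBuilder(templateContainer)
  .withPorts((templateContainer.getPorts.asScala ++ requiredPorts).asJava)
  .build()
{code}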



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40298) shuffle data recovery on the reused PVCs no effect

2022-09-06 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600631#comment-17600631
 ] 

Dongjoon Hyun commented on SPARK-40298:
---

Thank you for trying this Apache Spark feature, [~todd5167], but as [~hyukjin.kwon] 
mentioned, this is more like a question.

First, could you provide a reproducible test case for your case? I want to help 
you.

Second, I assume that you verified the KubernetesLocalDiskShuffleExecutorComponents 
logs correctly. However, the following could be a partial observation.
{quote}It can be confirmed that the PVC has been multiplexed by other pods, and 
the index and data information has been sent
{quote}
SPARK-35593 was designed to help the recovery and improve the stability with a 
best-effort approach, without any regressions, which means SPARK-35593 doesn't 
aim to block existing Spark features like re-computation or executor 
allocation with a new PVC. More specifically, there are two cases where 
Spark's processing is faster than the recovery.

Case 1. When the Spark executor termination is a little slow and the PVC is not 
cleanly available from the K8s control plane to the Spark driver for some reason, 
the Spark driver creates a new executor with a new PVC (of course, driver-owned). 
In this case, you can have more PVCs than executors. You can confirm this 
case with the `kubectl` command.

Case 2. When the Spark processing is faster than Spark's executor 
allocation (pod creation + PVC assignment + Docker image downloading + ...), Spark 
recomputes the lineage with the running executors without waiting for the new 
executor allocation (or recovery from it). This is Spark's original design, and it 
can always happen.

> shuffle data recovery on the reused PVCs  no effect
> ---
>
> Key: SPARK-40298
> URL: https://issues.apache.org/jira/browse/SPARK-40298
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.2.2
>Reporter: todd
>Priority: Major
> Attachments: 1662002808396.jpg, 1662002822097.jpg
>
>
> I used Spark 3.2.2 to test the [ Support shuffle data recovery on the reused 
> PVCs (SPARK-35593) ] feature. I found that when a shuffle read fails, data is 
> still read from the source.
> It can be confirmed that the PVC has been multiplexed by other pods, and the 
> index and data information has been sent
> *This is my spark configuration information:*
> --conf spark.driver.memory=5G 
> --conf spark.executor.memory=15G 
> --conf spark.executor.cores=1
> --conf spark.executor.instances=50
> --conf spark.sql.shuffle.partitions=50
> --conf spark.dynamicAllocation.enabled=false
> --conf spark.kubernetes.driver.reusePersistentVolumeClaim=true
> --conf spark.kubernetes.driver.ownPersistentVolumeClaim=true
> --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.claimName=OnDemand
> --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.storageClass=gp2
> --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.sizeLimit=100Gi
> --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.path=/tmp/data
> --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.readOnly=false
> --conf spark.executorEnv.SPARK_EXECUTOR_DIRS=/tmp/data
> --conf 
> spark.shuffle.sort.io.plugin.class=org.apache.spark.shuffle.KubernetesLocalDiskShuffleDataIO
> --conf spark.kubernetes.executor.missingPodDetectDelta=10s
> --conf spark.kubernetes.executor.apiPollingInterval=10s
> --conf spark.shuffle.io.retryWait=60s
> --conf spark.shuffle.io.maxRetries=5
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org