[jira] [Commented] (SPARK-40366) Add namespace for base ci image
[ https://issues.apache.org/jira/browse/SPARK-40366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601157#comment-17601157 ]

Apache Spark commented on SPARK-40366:
--------------------------------------

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37815

> Add namespace for base ci image
> -------------------------------
>
>                 Key: SPARK-40366
>                 URL: https://issues.apache.org/jira/browse/SPARK-40366
>             Project: Spark
>          Issue Type: Bug
>          Components: Project Infra
>    Affects Versions: 3.4.0
>            Reporter: Yikun Jiang
>            Priority: Major

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40366) Add namespace for base ci image
[ https://issues.apache.org/jira/browse/SPARK-40366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-40366:
------------------------------------

    Assignee: Apache Spark

> Add namespace for base ci image
> -------------------------------
>
>                 Key: SPARK-40366
>                 URL: https://issues.apache.org/jira/browse/SPARK-40366
>             Project: Spark
>          Issue Type: Bug
>          Components: Project Infra
>    Affects Versions: 3.4.0
>            Reporter: Yikun Jiang
>            Assignee: Apache Spark
>            Priority: Major
[jira] [Assigned] (SPARK-40366) Add namespace for base ci image
[ https://issues.apache.org/jira/browse/SPARK-40366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-40366:
------------------------------------

    Assignee: (was: Apache Spark)

> Add namespace for base ci image
> -------------------------------
>
>                 Key: SPARK-40366
>                 URL: https://issues.apache.org/jira/browse/SPARK-40366
>             Project: Spark
>          Issue Type: Bug
>          Components: Project Infra
>    Affects Versions: 3.4.0
>            Reporter: Yikun Jiang
>            Priority: Major
[jira] [Commented] (SPARK-40366) Add namespace for base ci image
[ https://issues.apache.org/jira/browse/SPARK-40366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601156#comment-17601156 ]

Apache Spark commented on SPARK-40366:
--------------------------------------

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37815

> Add namespace for base ci image
> -------------------------------
>
>                 Key: SPARK-40366
>                 URL: https://issues.apache.org/jira/browse/SPARK-40366
>             Project: Spark
>          Issue Type: Bug
>          Components: Project Infra
>    Affects Versions: 3.4.0
>            Reporter: Yikun Jiang
>            Priority: Major
[jira] [Created] (SPARK-40366) Add namespace for base ci image
Yikun Jiang created SPARK-40366:
-----------------------------------

             Summary: Add namespace for base ci image
                 Key: SPARK-40366
                 URL: https://issues.apache.org/jira/browse/SPARK-40366
             Project: Spark
          Issue Type: Bug
          Components: Project Infra
    Affects Versions: 3.4.0
            Reporter: Yikun Jiang
[jira] [Updated] (SPARK-40355) Improve pushdown for orc & parquet when cast scenario
[ https://issues.apache.org/jira/browse/SPARK-40355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

BingKun Pan updated SPARK-40355:
--------------------------------

    Description: 
Table schema:
|| a || b || part ||
| int | string | string |

SQL:
select * from table where b = 1 and part = 1

A. The partition column 'part' is pushed down, but column 'b' is not pushed down.
!https://user-images.githubusercontent.com/15246973/188630971-4eed59c0-d134-4994-a7aa-0d057a49c5dc.png!

B. After applying the PR, both the partition column 'part' and column 'b' are pushed down.
!https://user-images.githubusercontent.com/15246973/188631231-d30db19d-1d70-4712-bf75-bccaa2c790f4.png!

  was:
Table schema:
|| a || b || part || ||
| int | string | string | |

SQL:
select * from table where b = 1 and part = 1

A. The partition column 'part' is pushed down, but column 'b' is not pushed down.
!https://user-images.githubusercontent.com/15246973/188630971-4eed59c0-d134-4994-a7aa-0d057a49c5dc.png!

B. After applying the PR, both the partition column 'part' and column 'b' are pushed down.
!https://user-images.githubusercontent.com/15246973/188631231-d30db19d-1d70-4712-bf75-bccaa2c790f4.png!

> Improve pushdown for orc & parquet when cast scenario
> -----------------------------------------------------
>
>                 Key: SPARK-40355
>                 URL: https://issues.apache.org/jira/browse/SPARK-40355
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: BingKun Pan
>            Priority: Minor
>             Fix For: 3.4.0
>
> Table schema:
> || a || b || part ||
> | int | string | string |
>
> SQL:
> select * from table where b = 1 and part = 1
>
> A. The partition column 'part' is pushed down, but column 'b' is not pushed down.
> !https://user-images.githubusercontent.com/15246973/188630971-4eed59c0-d134-4994-a7aa-0d057a49c5dc.png!
>
> B. After applying the PR, both the partition column 'part' and column 'b' are pushed down.
> !https://user-images.githubusercontent.com/15246973/188631231-d30db19d-1d70-4712-bf75-bccaa2c790f4.png!
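The SPARK-40355 scenario above boils down to one rewrite: a filter like `CAST(b AS INT) = 1` on a string column cannot be handed to the ORC/Parquet reader as-is, but if the literal can be re-expressed in the column's native type, the predicate becomes a plain `column = literal` that the source can evaluate. A minimal, hypothetical pure-Python sketch of that shape (function names are illustrative, not Spark's API; the real rule must also prove the conversion is lossless, which this sketch does not attempt):

```python
# Illustrative sketch only, NOT Spark's implementation: unwrap a cast in an
# equality predicate so the comparison happens in the column's native type.

def unwrap_cast_eq(column_type, literal):
    """Rewrite CAST(col AS INT) = <int literal> into col = <string literal>
    for a string column. Returns the literal re-expressed in the column's
    native type, or None when this sketch has no rewrite for the pair.
    A production rule must additionally verify the round-trip is lossless
    (e.g. '01' also casts to 1 but is not equal to '1')."""
    if column_type == "string" and isinstance(literal, int):
        return str(literal)  # e.g. CAST(b AS INT) = 1  ->  b = '1'
    return None

def pushable(column_type, literal):
    # The predicate can be offered to the data source once the cast is gone.
    return unwrap_cast_eq(column_type, literal) is not None
```

With this sketch, the example filter `b = 1` on the string column b becomes pushable, matching the "after applying the PR" screenshot above.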
[jira] [Commented] (SPARK-40365) Bump ANTLR runtime version from 4.8 to 4.11.1
[ https://issues.apache.org/jira/browse/SPARK-40365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601139#comment-17601139 ]

Apache Spark commented on SPARK-40365:
--------------------------------------

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37814

> Bump ANTLR runtime version from 4.8 to 4.11.1
> ---------------------------------------------
>
>                 Key: SPARK-40365
>                 URL: https://issues.apache.org/jira/browse/SPARK-40365
>             Project: Spark
>          Issue Type: Improvement
>          Components: Build
>    Affects Versions: 3.4.0
>            Reporter: BingKun Pan
>            Priority: Minor
>             Fix For: 3.4.0
[jira] [Assigned] (SPARK-40365) Bump ANTLR runtime version from 4.8 to 4.11.1
[ https://issues.apache.org/jira/browse/SPARK-40365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-40365:
------------------------------------

    Assignee: (was: Apache Spark)

> Bump ANTLR runtime version from 4.8 to 4.11.1
> ---------------------------------------------
>
>                 Key: SPARK-40365
>                 URL: https://issues.apache.org/jira/browse/SPARK-40365
>             Project: Spark
>          Issue Type: Improvement
>          Components: Build
>    Affects Versions: 3.4.0
>            Reporter: BingKun Pan
>            Priority: Minor
>             Fix For: 3.4.0
[jira] [Commented] (SPARK-40365) Bump ANTLR runtime version from 4.8 to 4.11.1
[ https://issues.apache.org/jira/browse/SPARK-40365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601138#comment-17601138 ]

Apache Spark commented on SPARK-40365:
--------------------------------------

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37814

> Bump ANTLR runtime version from 4.8 to 4.11.1
> ---------------------------------------------
>
>                 Key: SPARK-40365
>                 URL: https://issues.apache.org/jira/browse/SPARK-40365
>             Project: Spark
>          Issue Type: Improvement
>          Components: Build
>    Affects Versions: 3.4.0
>            Reporter: BingKun Pan
>            Priority: Minor
>             Fix For: 3.4.0
[jira] [Assigned] (SPARK-40365) Bump ANTLR runtime version from 4.8 to 4.11.1
[ https://issues.apache.org/jira/browse/SPARK-40365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-40365:
------------------------------------

    Assignee: Apache Spark

> Bump ANTLR runtime version from 4.8 to 4.11.1
> ---------------------------------------------
>
>                 Key: SPARK-40365
>                 URL: https://issues.apache.org/jira/browse/SPARK-40365
>             Project: Spark
>          Issue Type: Improvement
>          Components: Build
>    Affects Versions: 3.4.0
>            Reporter: BingKun Pan
>            Assignee: Apache Spark
>            Priority: Minor
>             Fix For: 3.4.0
[jira] [Created] (SPARK-40365) Bump ANTLR runtime version from 4.8 to 4.11.1
BingKun Pan created SPARK-40365:
-----------------------------------

             Summary: Bump ANTLR runtime version from 4.8 to 4.11.1
                 Key: SPARK-40365
                 URL: https://issues.apache.org/jira/browse/SPARK-40365
             Project: Spark
          Issue Type: Improvement
          Components: Build
    Affects Versions: 3.4.0
            Reporter: BingKun Pan
             Fix For: 3.4.0
[jira] [Updated] (SPARK-40355) Improve pushdown for orc & parquet when cast scenario
[ https://issues.apache.org/jira/browse/SPARK-40355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

BingKun Pan updated SPARK-40355:
--------------------------------

    Description: 
Table schema:
|| a || b || part || ||
| int | string | string | |

SQL:
select * from table where b = 1 and part = 1

A. The partition column 'part' is pushed down, but column 'b' is not pushed down.
!https://user-images.githubusercontent.com/15246973/188630971-4eed59c0-d134-4994-a7aa-0d057a49c5dc.png!

B. After applying the PR, both the partition column 'part' and column 'b' are pushed down.
!https://user-images.githubusercontent.com/15246973/188631231-d30db19d-1d70-4712-bf75-bccaa2c790f4.png!

  was:
Table schema:
|| a || b || part || ||
| int | string | string | |

SQL:
select * from table where b = 1 and part = 1

A. Partition column 'part' has been pushed down, but column 'b' no push down.
!https://user-images.githubusercontent.com/15246973/188630971-4eed59c0-d134-4994-a7aa-0d057a49c5dc.png!

B. After apply the pr, Partition column 'part' and column 'b' has been pushed down.
!https://user-images.githubusercontent.com/15246973/188631231-d30db19d-1d70-4712-bf75-bccaa2c790f4.png!

> Improve pushdown for orc & parquet when cast scenario
> -----------------------------------------------------
>
>                 Key: SPARK-40355
>                 URL: https://issues.apache.org/jira/browse/SPARK-40355
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: BingKun Pan
>            Priority: Minor
>             Fix For: 3.4.0
>
> Table schema:
> || a || b || part || ||
> | int | string | string | |
>
> SQL:
> select * from table where b = 1 and part = 1
>
> A. The partition column 'part' is pushed down, but column 'b' is not pushed down.
> !https://user-images.githubusercontent.com/15246973/188630971-4eed59c0-d134-4994-a7aa-0d057a49c5dc.png!
>
> B. After applying the PR, both the partition column 'part' and column 'b' are pushed down.
> !https://user-images.githubusercontent.com/15246973/188631231-d30db19d-1d70-4712-bf75-bccaa2c790f4.png!
[jira] [Created] (SPARK-40364) Unify `initDB` method in `DBProvider`
Yang Jie created SPARK-40364:
--------------------------------

             Summary: Unify `initDB` method in `DBProvider`
                 Key: SPARK-40364
                 URL: https://issues.apache.org/jira/browse/SPARK-40364
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core, Tests, YARN
    Affects Versions: 3.4.0
            Reporter: Yang Jie

There are two `initDB` methods in `DBProvider`:
# `DB initDB(DBBackend dbBackend, File dbFile, StoreVersion version, ObjectMapper mapper)`
# `DB initDB(DBBackend dbBackend, File file)`

The second one is only used by the test helper `ShuffleTestAccessor`; we can explore whether we can keep only the first method.
[jira] [Updated] (SPARK-40364) Unify `initDB` method in `DBProvider`
[ https://issues.apache.org/jira/browse/SPARK-40364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yang Jie updated SPARK-40364:
-----------------------------

    Description: 
There are two `initDB` methods in `DBProvider`:
# `DB initDB(DBBackend dbBackend, File dbFile, StoreVersion version, ObjectMapper mapper)`
# `DB initDB(DBBackend dbBackend, File file)`

The second one is only used by the test helper `ShuffleTestAccessor`; we can explore whether we can keep only the first method.

  was:
There are 2 `initDB` in `DBProvider`
# `DB initDB(DBBackend dbBackend, File dbFile, StoreVersion version, ObjectMapper mapper)
# DB initDB(DBBackend dbBackend, File file)
and the second one only used by test ShuffleTestAccessor, we can explore whether can keep only the first method

> Unify `initDB` method in `DBProvider`
> -------------------------------------
>
>                 Key: SPARK-40364
>                 URL: https://issues.apache.org/jira/browse/SPARK-40364
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core, Tests, YARN
>    Affects Versions: 3.4.0
>            Reporter: Yang Jie
>            Priority: Minor
>
> There are two `initDB` methods in `DBProvider`:
> # `DB initDB(DBBackend dbBackend, File dbFile, StoreVersion version, ObjectMapper mapper)`
> # `DB initDB(DBBackend dbBackend, File file)`
>
> The second one is only used by the test helper `ShuffleTestAccessor`; we can explore whether we can keep only the first method.
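The consolidation proposed in SPARK-40364 can be sketched outside Spark: keep the four-argument `initDB` and let callers of the test-only two-argument overload simply omit the version and mapper. A hypothetical pure-Python sketch (Spark's real code is Java; `init_db` and the dict-based stand-in for `DB` are illustrative assumptions):

```python
# Hypothetical sketch of collapsing DBProvider's two initDB overloads into a
# single entry point with optional version/mapper arguments.

def init_db(db_backend, db_file, version=None, mapper=None):
    """Single initDB: callers that used the two-argument overload
    omit version and mapper; the full overload passes both."""
    db = {"backend": db_backend, "file": db_file}
    if version is not None and mapper is not None:
        # The full overload also records the serialized store version.
        db["version"] = mapper(version)
    return db

# Former overload #1: initDB(dbBackend, dbFile, version, mapper)
full = init_db("LEVELDB", "/tmp/registry.db", (1, 0),
               lambda v: f"{v[0]}.{v[1]}")
# Former overload #2 (test-only): initDB(dbBackend, file)
minimal = init_db("LEVELDB", "/tmp/test.db")
```

In Java the same effect would come from keeping only the four-argument method and having the test pass explicit nulls or defaults, which is presumably what "keep only the first method" means here.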
[jira] [Resolved] (SPARK-40363) Add SQL misc function to assert/check column value
[ https://issues.apache.org/jira/browse/SPARK-40363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-40363.
----------------------------------
    Resolution: Won't Fix

> Add SQL misc function to assert/check column value
> --------------------------------------------------
>
>                 Key: SPARK-40363
>                 URL: https://issues.apache.org/jira/browse/SPARK-40363
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Rafal Wojdyla
>            Priority: Major
>
> A SQL function that allows asserting a condition on a column:
> * fails when the condition is not met
> * returns the original value otherwise
>
> Related: SPARK-32793
> But {{assert_true}} and {{raise_error}} do not really cut it. In the case of {{assert_true}} you have to actually collect the empty column, and the check might not happen if you drop the assertion column, which you will likely do since it's empty. Having a function that returns a value as part of the check (in most cases the checked column) would be handy.
> I'm working with pyspark, so here's a Python implementation:
> {code:python}
> @overload
> def assert_col_condition(
>     col: Union[str, Column],
>     cond: Callable[[Column], Column],
>     error_msg: Optional[str] = None,
> ) -> Column:
>     """Asserts condition on a column; IFF it holds, returns the original value under `col`."""
>     ...
>
> @overload
> def assert_col_condition(
>     col: Union[str, Column], cond: Column, error_msg: Optional[str] = None
> ) -> Column:
>     """Asserts condition on a column; IFF it holds, returns the original value under `col`."""
>     ...
>
> def assert_col_condition(
>     col: Union[str, Column],
>     cond: Union[Column, Callable[[Column], Column]],
>     error_msg: Optional[str] = None,
> ) -> Column:
>     col = str_to_col(col)
>     if not isinstance(cond, Column):
>         cond = cond(col)
>     return F.when(
>         ~cond, F.raise_error(error_msg or f"Assertion failed: {cond}")
>     ).otherwise(col)
> {code}
[jira] [Commented] (SPARK-40363) Add SQL misc function to assert/check column value
[ https://issues.apache.org/jira/browse/SPARK-40363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601090#comment-17601090 ]

Rafal Wojdyla commented on SPARK-40363:
---------------------------------------

Hey [~hyukjin.kwon], yea, the implementation is simple, which is great. We find this useful; if you don't think this fits into Spark, feel free to close this issue. Otherwise I'm happy to PR it.

> Add SQL misc function to assert/check column value
> --------------------------------------------------
>
>                 Key: SPARK-40363
>                 URL: https://issues.apache.org/jira/browse/SPARK-40363
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Rafal Wojdyla
>            Priority: Major
>
> A SQL function that allows asserting a condition on a column:
> * fails when the condition is not met
> * returns the original value otherwise
>
> Related: SPARK-32793
> But {{assert_true}} and {{raise_error}} do not really cut it. In the case of {{assert_true}} you have to actually collect the empty column, and the check might not happen if you drop the assertion column, which you will likely do since it's empty. Having a function that returns a value as part of the check (in most cases the checked column) would be handy.
> I'm working with pyspark, so here's a Python implementation:
> {code:python}
> @overload
> def assert_col_condition(
>     col: Union[str, Column],
>     cond: Callable[[Column], Column],
>     error_msg: Optional[str] = None,
> ) -> Column:
>     """Asserts condition on a column; IFF it holds, returns the original value under `col`."""
>     ...
>
> @overload
> def assert_col_condition(
>     col: Union[str, Column], cond: Column, error_msg: Optional[str] = None
> ) -> Column:
>     """Asserts condition on a column; IFF it holds, returns the original value under `col`."""
>     ...
>
> def assert_col_condition(
>     col: Union[str, Column],
>     cond: Union[Column, Callable[[Column], Column]],
>     error_msg: Optional[str] = None,
> ) -> Column:
>     col = str_to_col(col)
>     if not isinstance(cond, Column):
>         cond = cond(col)
>     return F.when(
>         ~cond, F.raise_error(error_msg or f"Assertion failed: {cond}")
>     ).otherwise(col)
> {code}
[jira] [Assigned] (SPARK-40356) Upgrade pandas to 1.4.4
[ https://issues.apache.org/jira/browse/SPARK-40356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-40356:
------------------------------------

    Assignee: Yikun Jiang

> Upgrade pandas to 1.4.4
> -----------------------
>
>                 Key: SPARK-40356
>                 URL: https://issues.apache.org/jira/browse/SPARK-40356
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Pandas API on Spark
>    Affects Versions: 3.4.0
>            Reporter: Yikun Jiang
>            Assignee: Yikun Jiang
>            Priority: Major
[jira] [Resolved] (SPARK-40356) Upgrade pandas to 1.4.4
[ https://issues.apache.org/jira/browse/SPARK-40356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-40356.
----------------------------------
    Fix Version/s: 3.4.0
       Resolution: Fixed

Issue resolved by pull request 37810
[https://github.com/apache/spark/pull/37810]

> Upgrade pandas to 1.4.4
> -----------------------
>
>                 Key: SPARK-40356
>                 URL: https://issues.apache.org/jira/browse/SPARK-40356
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Pandas API on Spark
>    Affects Versions: 3.4.0
>            Reporter: Yikun Jiang
>            Assignee: Yikun Jiang
>            Priority: Major
>             Fix For: 3.4.0
[jira] [Commented] (SPARK-40228) Don't simplify multiLike if child is not attribute
[ https://issues.apache.org/jira/browse/SPARK-40228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601078#comment-17601078 ]

Apache Spark commented on SPARK-40228:
--------------------------------------

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/37813

> Don't simplify multiLike if child is not attribute
> --------------------------------------------------
>
>                 Key: SPARK-40228
>                 URL: https://issues.apache.org/jira/browse/SPARK-40228
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Yuming Wang
>            Assignee: Yuming Wang
>            Priority: Major
>             Fix For: 3.4.0
>
> {code:scala}
> sql("create table t1(name string) using parquet")
> sql("select * from t1 where substr(name, 1, 5) like any('%a', 'b%', '%c%')").explain(true)
> {code}
> {noformat}
> == Physical Plan ==
> *(1) Filter ((EndsWith(substr(name#0, 1, 5), a) OR StartsWith(substr(name#0, 1, 5), b)) OR Contains(substr(name#0, 1, 5), c))
> +- *(1) ColumnarToRow
>    +- FileScan parquet default.t1[name#0] Batched: true, DataFilters: [((EndsWith(substr(name#0, 1, 5), a) OR StartsWith(substr(name#0, 1, 5), b)) OR Contains(substr(n..., Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct
> {noformat}
[jira] [Commented] (SPARK-40228) Don't simplify multiLike if child is not attribute
[ https://issues.apache.org/jira/browse/SPARK-40228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601077#comment-17601077 ]

Apache Spark commented on SPARK-40228:
--------------------------------------

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/37813

> Don't simplify multiLike if child is not attribute
> --------------------------------------------------
>
>                 Key: SPARK-40228
>                 URL: https://issues.apache.org/jira/browse/SPARK-40228
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Yuming Wang
>            Assignee: Yuming Wang
>            Priority: Major
>             Fix For: 3.4.0
>
> {code:scala}
> sql("create table t1(name string) using parquet")
> sql("select * from t1 where substr(name, 1, 5) like any('%a', 'b%', '%c%')").explain(true)
> {code}
> {noformat}
> == Physical Plan ==
> *(1) Filter ((EndsWith(substr(name#0, 1, 5), a) OR StartsWith(substr(name#0, 1, 5), b)) OR Contains(substr(name#0, 1, 5), c))
> +- *(1) ColumnarToRow
>    +- FileScan parquet default.t1[name#0] Batched: true, DataFilters: [((EndsWith(substr(name#0, 1, 5), a) OR StartsWith(substr(name#0, 1, 5), b)) OR Contains(substr(n..., Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct
> {noformat}
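The plan in the SPARK-40228 report shows the cost being avoided: simplifying `like any` expands one `substr(name, 1, 5)` into three copies, so a non-trivial child is evaluated once per pattern instead of once per row. A small, hypothetical pure-Python model of that blow-up (a toy LIKE matcher, not Spark's rule):

```python
# Toy model of why LikeAny simplification is only safe for cheap children:
# expanding `expr LIKE ANY (p1, p2, p3)` into
# (expr LIKE p1) OR (expr LIKE p2) OR (expr LIKE p3)
# duplicates the evaluation of `expr` once per pattern.

calls = {"substr": 0}

def substr(s, start, length):
    calls["substr"] += 1          # count evaluations of the child expression
    return s[start - 1:start - 1 + length]

def match(v, pattern):
    # Toy LIKE supporting the three pattern shapes from the example above.
    if pattern.startswith("%") and pattern.endswith("%"):
        return pattern.strip("%") in v        # '%c%' -> Contains
    if pattern.startswith("%"):
        return v.endswith(pattern[1:])        # '%a'  -> EndsWith
    return v.startswith(pattern[:-1])         # 'b%'  -> StartsWith

def like_any_unsimplified(s, patterns):
    v = substr(s, 1, 5)                       # child evaluated once
    return any(match(v, p) for p in patterns)

def like_any_simplified(s, patterns):
    # Child re-evaluated inside every disjunct, as in the physical plan above.
    return any(match(substr(s, 1, 5), p) for p in patterns)
```

On a non-matching row all disjuncts run, so the simplified form calls `substr` three times where the unsimplified form calls it once, which is why the fix keeps the simplification only when the child is a plain attribute.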
[jira] [Commented] (SPARK-40363) Add SQL misc function to assert/check column value
[ https://issues.apache.org/jira/browse/SPARK-40363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601074#comment-17601074 ]

Hyukjin Kwon commented on SPARK-40363:
--------------------------------------

So basically this is:

{code}
F.when(~cond, F.raise_error("...")).otherwise(col)
{code}

which can be very easily worked around. I doubt we need this. Do other Python libraries or other DBMSes have such a feature?

> Add SQL misc function to assert/check column value
> --------------------------------------------------
>
>                 Key: SPARK-40363
>                 URL: https://issues.apache.org/jira/browse/SPARK-40363
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Rafal Wojdyla
>            Priority: Major
>
> A SQL function that allows asserting a condition on a column:
> * fails when the condition is not met
> * returns the original value otherwise
>
> Related: SPARK-32793
> But {{assert_true}} and {{raise_error}} do not really cut it. In the case of {{assert_true}} you have to actually collect the empty column, and the check might not happen if you drop the assertion column, which you will likely do since it's empty. Having a function that returns a value as part of the check (in most cases the checked column) would be handy.
> I'm working with pyspark, so here's a Python implementation:
> {code:python}
> @overload
> def assert_col_condition(
>     col: Union[str, Column],
>     cond: Callable[[Column], Column],
>     error_msg: Optional[str] = None,
> ) -> Column:
>     """Asserts condition on a column; IFF it holds, returns the original value under `col`."""
>     ...
>
> @overload
> def assert_col_condition(
>     col: Union[str, Column], cond: Column, error_msg: Optional[str] = None
> ) -> Column:
>     """Asserts condition on a column; IFF it holds, returns the original value under `col`."""
>     ...
>
> def assert_col_condition(
>     col: Union[str, Column],
>     cond: Union[Column, Callable[[Column], Column]],
>     error_msg: Optional[str] = None,
> ) -> Column:
>     col = str_to_col(col)
>     if not isinstance(cond, Column):
>         cond = cond(col)
>     return F.when(
>         ~cond, F.raise_error(error_msg or f"Assertion failed: {cond}")
>     ).otherwise(col)
> {code}
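The `F.when(~cond, F.raise_error(...)).otherwise(col)` workaround quoted in this thread rests on a simple contract: if the negated condition holds, raise; otherwise pass the original value through. A pure-Python emulation of that contract (no PySpark involved; `assert_value` is an illustrative name, not a Spark API):

```python
# Plain-Python emulation of the when/otherwise assert pattern discussed above:
# fail fast when the check does not hold, return the original value when it does.

def assert_value(value, cond, error_msg=None):
    """Return `value` unchanged iff cond(value) holds, else raise ValueError."""
    if not cond(value):
        raise ValueError(error_msg or f"Assertion failed for value: {value!r}")
    return value

# A passing value flows through untouched...
ok = assert_value(42, lambda v: v > 0)

# ...while a failing one raises instead of silently producing a row.
try:
    assert_value(-1, lambda v: v > 0, "expected a positive value")
    failed = False
except ValueError:
    failed = True
```

The column-level version behaves the same way per row, which is why it avoids the `assert_true` pitfall of the check disappearing when the (empty) assertion column is dropped.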
[jira] [Commented] (SPARK-40362) Bug in Canonicalization of expressions like Add & Multiply i.e Commutative Operators
[ https://issues.apache.org/jira/browse/SPARK-40362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601067#comment-17601067 ]

Asif commented on SPARK-40362:
------------------------------

Will be generating a PR tomorrow.

> Bug in Canonicalization of expressions like Add & Multiply i.e Commutative Operators
> ------------------------------------------------------------------------------------
>
>                 Key: SPARK-40362
>                 URL: https://issues.apache.org/jira/browse/SPARK-40362
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Asif
>            Priority: Major
>              Labels: spark-sql
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> In the canonicalization code, which is now in two stages, canonicalization involving commutative operators is broken if they are subexpressions of certain types of expressions which override precanonicalize.
>
> Consider the following expression: a + b > 10
> When this GreaterThan expression is canonicalized as a whole for the first time, it skips the call to Canonicalize.reorderCommutativeOperators for the Add expression, because GreaterThan's canonicalization uses precanonicalize on its children (the Add expression).
>
> So if we create a new expression b + a > 10 and canonicalize it, the canonicalized versions of these two expressions will not match.
> The test:
> {quote}
> test("bug X") {
>   val tr1 = LocalRelation('c.int, 'b.string, 'a.int)
>   val y = tr1.where('a.attr + 'c.attr > 10).analyze
>   val fullCond = y.asInstanceOf[Filter].condition.clone()
>   val addExpr = (fullCond match {
>     case GreaterThan(x: Add, _) => x
>     case LessThan(_, x: Add) => x
>   }).clone().asInstanceOf[Add]
>   val canonicalizedFullCond = fullCond.canonicalized
>   // swap the operands of add
>   val newAddExpr = Add(addExpr.right, addExpr.left)
>   // build a new condition which is the same as the previous one, but with the operands of Add reversed
>   val builtCondnCanonicalized = GreaterThan(newAddExpr, Literal(10)).canonicalized
>   assertEquals(canonicalizedFullCond, builtCondnCanonicalized)
> }
> {quote}
> This test fails.
> The fix I propose is that for commutative expressions, precanonicalize should be overridden and Canonicalize.reorderCommutativeOperators should be invoked on the expression there instead of in canonicalize. Effectively, for commutative operators (Add, Or, Multiply, And, etc.) canonicalize and precanonicalize should be the same.
> I will file a PR for it
[jira] [Comment Edited] (SPARK-40362) Bug in Canonicalization of expressions like Add & Multiply i.e Commutative Operators
[ https://issues.apache.org/jira/browse/SPARK-40362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601066#comment-17601066 ]

Asif edited comment on SPARK-40362 at 9/7/22 12:43 AM:
-------------------------------------------------------

I have locally added tests for &&, ||, &, | and *, and they all fail for the same reason. I am in the process of adding tests for xor, greatest & least, and of testing the fix for them. The fix is to make these expressions implement a trait:

trait CommutativeExpresionCanonicalization {
  this: Expression =>

  override lazy val canonicalized: Expression = preCanonicalized

  override lazy val preCanonicalized: Expression = {
    val canonicalizedChildren = children.map(_.preCanonicalized)
    Canonicalize.reorderCommutativeOperators {
      withNewChildren(canonicalizedChildren)
    }
  }
}

was (Author: ashahid7):
I have locally added tests for &&, ||, &, | and *, and they all fail for the same reason. I am in the process of adding tests for xor, greatest & least, and of testing the fix for them. The fix is to make these expressions implement a trait:
{quote}
trait CommutativeExpresionCanonicalization {
  this: Expression =>

  override lazy val canonicalized: Expression = preCanonicalized

  override lazy val preCanonicalized: Expression = {
    val canonicalizedChildren = children.map(_.preCanonicalized)
    Canonicalize.reorderCommutativeOperators {
      withNewChildren(canonicalizedChildren)
    }
  }
}
{quote}

> Bug in Canonicalization of expressions like Add & Multiply i.e Commutative Operators
> ------------------------------------------------------------------------------------
>
>                 Key: SPARK-40362
>                 URL: https://issues.apache.org/jira/browse/SPARK-40362
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Asif
>            Priority: Major
>              Labels: spark-sql
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> In the canonicalization code, which is now in two stages, canonicalization involving commutative operators is broken if they are subexpressions of certain types of expressions which override precanonicalize.
>
> Consider the following expression: a + b > 10
> When this GreaterThan expression is canonicalized as a whole for the first time, it skips the call to Canonicalize.reorderCommutativeOperators for the Add expression, because GreaterThan's canonicalization uses precanonicalize on its children (the Add expression).
>
> So if we create a new expression b + a > 10 and canonicalize it, the canonicalized versions of these two expressions will not match.
> The test:
> {quote}
> test("bug X") {
>   val tr1 = LocalRelation('c.int, 'b.string, 'a.int)
>   val y = tr1.where('a.attr + 'c.attr > 10).analyze
>   val fullCond = y.asInstanceOf[Filter].condition.clone()
>   val addExpr = (fullCond match {
>     case GreaterThan(x: Add, _) => x
>     case LessThan(_, x: Add) => x
>   }).clone().asInstanceOf[Add]
>   val canonicalizedFullCond = fullCond.canonicalized
>   // swap the operands of add
>   val newAddExpr = Add(addExpr.right, addExpr.left)
>   // build a new condition which is the same as the previous one, but with the operands of Add reversed
>   val builtCondnCanonicalized = GreaterThan(newAddExpr, Literal(10)).canonicalized
>   assertEquals(canonicalizedFullCond, builtCondnCanonicalized)
> }
> {quote}
> This test fails.
> The fix I propose is that for commutative expressions, precanonicalize should be overridden and Canonicalize.reorderCommutativeOperators should be invoked on the expression there instead of in canonicalize. Effectively, for commutative operators (Add, Or, Multiply, And, etc.) canonicalize and precanonicalize should be the same.
> I will file a PR for it
[jira] [Updated] (SPARK-33605) Add gcs-connector to hadoop-cloud module
[ https://issues.apache.org/jira/browse/SPARK-33605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-33605:
---
Affects Version/s: 3.4.0 (was: 3.0.1)

> Add gcs-connector to hadoop-cloud module
> ---
>
> Key: SPARK-33605
> URL: https://issues.apache.org/jira/browse/SPARK-33605
> Project: Spark
> Issue Type: Improvement
> Components: PySpark, Spark Core
> Affects Versions: 3.4.0
> Reporter: Rafal Wojdyla
> Assignee: Dongjoon Hyun
> Priority: Major
> Fix For: 3.4.0
>
> Spark comes with some S3 batteries included, which makes it easier to use with S3; for GCS to work, users are required to manually configure the jars. This is especially problematic for Python users, who may not be accustomed to Java dependencies, etc. An example of a workaround for pyspark is [pyspark_gcs|https://github.com/ravwojdyla/pyspark_gcs]. If we include the [GCS connector|https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage], it would make things easier for GCS users.
> Please let me know what you think.
[jira] [Updated] (SPARK-33605) Add gcs-connector to hadoop-cloud module
[ https://issues.apache.org/jira/browse/SPARK-33605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-33605:
---
Component/s: Build (was: PySpark) (was: Spark Core)
[jira] [Assigned] (SPARK-33605) Add GCS FS/connector config (dependencies?) akin to S3
[ https://issues.apache.org/jira/browse/SPARK-33605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-33605:
---
Assignee: Dongjoon Hyun
[jira] [Resolved] (SPARK-33605) Add GCS FS/connector config (dependencies?) akin to S3
[ https://issues.apache.org/jira/browse/SPARK-33605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-33605.
---
Fix Version/s: 3.4.0
Resolution: Fixed
Issue resolved by pull request 37745 [https://github.com/apache/spark/pull/37745]
[jira] [Updated] (SPARK-33605) Add gcs-connector to hadoop-cloud module
[ https://issues.apache.org/jira/browse/SPARK-33605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-33605:
---
Summary: Add gcs-connector to hadoop-cloud module (was: Add GCS FS/connector config (dependencies?) akin to S3)
[jira] [Created] (SPARK-40363) Add SQL misc function to assert/check column value
Rafal Wojdyla created SPARK-40363:
---
Summary: Add SQL misc function to assert/check column value
Key: SPARK-40363
URL: https://issues.apache.org/jira/browse/SPARK-40363
Project: Spark
Issue Type: New Feature
Components: SQL
Affects Versions: 3.3.0
Reporter: Rafal Wojdyla

A SQL function that allows asserting a condition on a column and that:
* fails when the condition is not met
* returns the original value otherwise

Related: SPARK-32793. But {{assert_true}} and {{raise_error}} do not really cut it. In the case of {{assert_true}} you have to actually collect the empty column, and the check might not happen if you drop the assertion column, which you will likely do since it is empty. Having a function that returns some value as part of the check (in most cases, the checked column) would be handy. I'm working with pyspark, so here's a Python implementation:
{code:python}
from typing import Callable, Optional, Union, overload

from pyspark.sql import Column
from pyspark.sql import functions as F


@overload
def assert_col_condition(
    col: Union[str, Column],
    cond: Callable[[Column], Column],
    error_msg: Optional[str] = None,
) -> Column:
    """Asserts condition on a column; IFF it holds, returns the original value under `col`"""
    ...


@overload
def assert_col_condition(
    col: Union[str, Column], cond: Column, error_msg: Optional[str] = None
) -> Column:
    """Asserts condition on a column; IFF it holds, returns the original value under `col`"""
    ...


def assert_col_condition(
    col: Union[str, Column],
    cond: Union[Column, Callable[[Column], Column]],
    error_msg: Optional[str] = None,
) -> Column:
    # str_to_col: author's helper that coerces a column name to a Column
    col = str_to_col(col)
    if not isinstance(cond, Column):
        cond = cond(col)
    return F.when(
        ~cond, F.raise_error(error_msg or f"Assertion failed: {cond}")
    ).otherwise(col)
{code}
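Outside pyspark, the proposed semantics (validate, then pass the original value through) can be sketched in plain Python. The names below are illustrative only, not part of any Spark API:

```python
# Plain-Python sketch of the proposed assert-and-pass-through semantics:
# validate a value against a predicate and return the value unchanged,
# raising when the predicate fails. Names are illustrative only.
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

def assert_value(value: T,
                 cond: Callable[[T], bool],
                 error_msg: Optional[str] = None) -> T:
    if not cond(value):
        raise ValueError(error_msg or f"Assertion failed for value: {value!r}")
    return value  # the checked value flows through, unlike assert_true

# Passing values flow through untouched; a failing value raises immediately.
rows = [1, 5, 9]
checked = [assert_value(v, lambda x: x < 10) for v in rows]
print(checked)  # [1, 5, 9]
```

The key design point, mirrored in the pyspark proposal above, is that the check composes with further use of the column instead of producing a throwaway assertion column.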
[jira] [Created] (SPARK-40362) Bug in Canonicalization of expressions like Add & Multiply i.e Commutative Operators
Asif created SPARK-40362:
---
Summary: Bug in Canonicalization of expressions like Add & Multiply i.e Commutative Operators
Key: SPARK-40362
URL: https://issues.apache.org/jira/browse/SPARK-40362
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.3.0
Reporter: Asif

In the canonicalization code, which is now in two stages, canonicalization involving commutative operators is broken if they are subexpressions of certain types of expressions that override preCanonicalized.

Consider the following expression: a + b > 10

When this GreaterThan expression is canonicalized as a whole for the first time, it skips the call to Canonicalize.reorderCommutativeOperators for the Add expression, because GreaterThan's canonicalization invoked preCanonicalized on its children (the Add expression).

So if we create a new expression, b + a > 10, and canonicalize it, the canonicalized versions of the two expressions will not match.

The test:
{code}
test("bug X") {
  val tr1 = LocalRelation('c.int, 'b.string, 'a.int)
  val y = tr1.where('a.attr + 'c.attr > 10).analyze
  val fullCond = y.asInstanceOf[Filter].condition.clone()
  val addExpr = (fullCond match {
    case GreaterThan(x: Add, _) => x
    case LessThan(_, x: Add) => x
  }).clone().asInstanceOf[Add]
  val canonicalizedFullCond = fullCond.canonicalized
  // swap the operands of Add
  val newAddExpr = Add(addExpr.right, addExpr.left)
  // build a new condition which is the same as the previous one, but with the operands of Add reversed
  val builtCondnCanonicalized = GreaterThan(newAddExpr, Literal(10)).canonicalized
  assertEquals(canonicalizedFullCond, builtCondnCanonicalized)
}
{code}
This test fails.

The fix I propose is that, for commutative expressions, preCanonicalized should be overridden and Canonicalize.reorderCommutativeOperators invoked there instead of in canonicalize. Effectively, for commutative operators (add, or, multiply, and, etc.), canonicalized and preCanonicalized should be the same.

I will file a PR for it
[jira] [Commented] (SPARK-36901) ERROR exchange.BroadcastExchangeExec: Could not execute broadcast in 300 secs
[ https://issues.apache.org/jira/browse/SPARK-36901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601030#comment-17601030 ] Igor Uchôa commented on SPARK-36901:
---
I'm facing this too. In my case, we have a shared cluster where multiple applications run at the same time using Dynamic Allocation. Depending on the cluster load, an application gets a broadcast timeout error. In my opinion, resource allocation delays shouldn't cause errors in the application, especially around broadcast, which seems unrelated to the problem at first glance.

The only parameter we have to fix this issue is `spark.sql.broadcastTimeout`, which affects all broadcast operations. Therefore, if we give it a huge (or infinite) value, we solve the broadcast timeout issue, but we also allow users to explicitly broadcast huge data sets in their queries, which seems really bad from my perspective.

Since the broadcast exchange is a job, it will trigger lazy evaluation and, depending on resource availability, it will either start immediately or, in the worst case, wait until the application has enough executors. The problem is that the timer starts counting the moment the job is created, not the moment the first task starts. So if there is a delay in resource allocation, the application will eventually fire the broadcast timeout error: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala#L211

My suggestion is to change the timer approach and make it start only when the application has at least one running task. This way we would be able to use `spark.sql.broadcastTimeout` properly.
> ERROR exchange.BroadcastExchangeExec: Could not execute broadcast in 300 secs
> ---
>
> Key: SPARK-36901
> URL: https://issues.apache.org/jira/browse/SPARK-36901
> Project: Spark
> Issue Type: Bug
> Components: Spark Core, SQL
> Affects Versions: 2.4.0
> Reporter: Ranga Reddy
> Priority: Major
>
> While running a Spark application, if there are no further resources to launch executors, the application fails after 5 minutes with the exception below.
> {code:java}
> 21/09/24 06:12:45 WARN cluster.YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
> ...
> 21/09/24 06:17:29 ERROR exchange.BroadcastExchangeExec: Could not execute broadcast in 300 secs.
> java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]
> ...
> Caused by: java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]
>     at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
>     at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
>     at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
>     at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:146)
>     ... 71 more
> 21/09/24 06:17:30 INFO spark.SparkContext: Invoking stop() from shutdown hook
> {code}
> The *expectation* is that it should either throw a proper exception saying *"there are no further resources to run the application"* or *"wait until it gets resources"*.
> To reproduce the issue we have used the following sample code.
> *PySpark Code (test_broadcast_timeout.py):*
> {code:python}
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.appName("Test Broadcast Timeout").getOrCreate()
> t1 = spark.range(5)
> t2 = spark.range(5)
> q = t1.join(t2, t1.id == t2.id)
> q.explain
> q.show()
> {code}
> *Spark Submit Command:*
> {code:java}
> spark-submit --executor-memory 512M test_broadcast_timeout.py
> {code}
> Note: We have tested the same code in Spark 3.1; we are able to reproduce the issue in Spark 3 as well.
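The timer change suggested in the comment above can be sketched generically: a deadline whose clock starts only when the first unit of real work begins, rather than at construction. This is a plain-Python illustration, not Spark's implementation:

```python
import time

class LazyDeadline:
    """Timeout whose clock starts at the first unit of real work,
    not at construction -- the behavior proposed for the broadcast
    timeout. Illustrative sketch only."""

    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self._started_at = None

    def task_started(self):
        # The first task arms the timer; later calls are no-ops.
        if self._started_at is None:
            self._started_at = time.monotonic()

    def expired(self) -> bool:
        if self._started_at is None:
            return False  # still waiting for resources: never times out
        return time.monotonic() - self._started_at > self.timeout_s

d = LazyDeadline(timeout_s=300.0)
print(d.expired())  # False, even after an arbitrary allocation delay
d.task_started()
print(d.expired())  # False right after the first task starts
```

With this shape, a resource-allocation stall before the first task can never trip the timeout, while a genuinely slow broadcast still can.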
[jira] [Commented] (SPARK-40262) Expensive UDF evaluation pushed down past a join leads to performance issues
[ https://issues.apache.org/jira/browse/SPARK-40262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601021#comment-17601021 ] Shardul Mahadik commented on SPARK-40262: - [~cloud_fan] [~viirya] [~joshrosen] Gentle ping on this! > Expensive UDF evaluation pushed down past a join leads to performance issues > - > > Key: SPARK-40262 > URL: https://issues.apache.org/jira/browse/SPARK-40262 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Shardul Mahadik >Priority: Major > > Consider a Spark job with an expensive UDF which looks like follows: > {code:scala} > val expensive_udf = spark.udf.register("expensive_udf", (i: Int) => Option(i)) > spark.range(10).write.format("orc").save("/tmp/orc") > val df = spark.read.format("orc").load("/tmp/orc").as("a") > .join(spark.range(10).as("b"), "id") > .withColumn("udf_op", expensive_udf($"a.id")) > .join(spark.range(10).as("c"), $"udf_op" === $"c.id") > {code} > This creates a physical plan as follows: > {code:java} > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=false > +- BroadcastHashJoin [cast(udf_op#338 as bigint)], [id#344L], Inner, > BuildRight, false >:- Project [id#330L, if (isnull(cast(id#330L as int))) null else > expensive_udf(knownnotnull(cast(id#330L as int))) AS udf_op#338] >: +- BroadcastHashJoin [id#330L], [id#332L], Inner, BuildRight, false >: :- Filter ((isnotnull(id#330L) AND isnotnull(cast(id#330L as int))) > AND isnotnull(expensive_udf(knownnotnull(cast(id#330L as int) >: : +- FileScan orc [id#330L] Batched: true, DataFilters: > [isnotnull(id#330L), isnotnull(cast(id#330L as int)), > isnotnull(expensive_udf(knownnotnull(cast(i..., Format: ORC, Location: > InMemoryFileIndex(1 paths)[file:/tmp/orc], PartitionFilters: [], > PushedFilters: [IsNotNull(id)], ReadSchema: struct >: +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, > bigint, false]),false), [plan_id=416] >:+- Range (0, 10, step=1, splits=16) >+- 
BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, > false]),false), [plan_id=420] > +- Range (0, 10, step=1, splits=16) > {code} > In this case, the expensive UDF call is duplicated thrice. Since the UDF > output is used in a later join, `InferFiltersFromConstraints` adds an `IS > NOT NULL` filter on the UDF output. But the pushdown rules duplicate this UDF > call and push the UDF past a previous join. The duplication behaviour [is > documented|https://github.com/apache/spark/blob/c95ed826e23fdec6e1a779cfebde7b3364594fb5/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L196] > and in itself is not a huge issue. But given a highly restrictive join, the > UDF gets evaluated on many orders of magnitude more rows than it should have > been, slowing down the job. > Can we avoid this duplication of UDF calls? In SPARK-37392, we made a > [similar change|https://github.com/apache/spark/pull/34823/files] where we > decided to only add inferred filters if the input is an attribute. Should we > use a similar strategy for `InferFiltersFromConstraints`? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
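The performance cliff described in this ticket can be illustrated outside Spark: evaluating an inferred filter below a restrictive join runs the expensive function on every scanned row instead of only on the rows that survive the join. A minimal pure-Python sketch (the row counts and `expensive_udf` are illustrative, not Spark APIs):

```python
# Count how many times an "expensive UDF" runs under two evaluation orders.
calls = {"n": 0}

def expensive_udf(i):
    calls["n"] += 1          # track invocations to compare strategies
    return i

rows = list(range(1_000))    # rows coming out of the file scan
join_keys = {3, 4}           # a highly restrictive join: only 2 rows match

# Strategy A: join first, then evaluate the UDF on surviving rows only.
calls["n"] = 0
surviving = [r for r in rows if r in join_keys]
after_join = [(r, expensive_udf(r)) for r in surviving]
calls_after_join = calls["n"]          # 2 calls

# Strategy B: the inferred IS NOT NULL filter, pushed below the join,
# evaluates the UDF once per scanned row, then the join runs it again.
calls["n"] = 0
filtered = [r for r in rows if expensive_udf(r) is not None]
pushed = [(r, expensive_udf(r)) for r in filtered if r in join_keys]
calls_before_join = calls["n"]         # 1_000 + 2 calls
```

In practice, marking the UDF nondeterministic (e.g. `asNondeterministic()` on a PySpark UDF) is a common workaround to keep the optimizer from duplicating and pushing it down, at the cost of disabling other optimizations.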
[jira] [Assigned] (SPARK-40360) Convert some DDL exception to new error framework
[ https://issues.apache.org/jira/browse/SPARK-40360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40360: Assignee: (was: Apache Spark) > Convert some DDL exception to new error framework > - > > Key: SPARK-40360 > URL: https://issues.apache.org/jira/browse/SPARK-40360 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Serge Rielau >Priority: Major > > Tackling the following files: > sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/AlreadyExistException.scala > sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/NoSuchItemException.scala > sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CannotReplaceMissingTableException.scala > sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/NonEmptyException.scala > sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala > Here is the doc with proposed text: > https://docs.google.com/document/d/1TpFx3AwcJZd3l7zB1ZDchvZ8j2dY6_uf5LHfW2gjE4A/edit?usp=sharing -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40360) Convert some DDL exception to new error framework
[ https://issues.apache.org/jira/browse/SPARK-40360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40360: Assignee: Apache Spark > Convert some DDL exception to new error framework > - > > Key: SPARK-40360 > URL: https://issues.apache.org/jira/browse/SPARK-40360 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Serge Rielau >Assignee: Apache Spark >Priority: Major > > Tackling the following files: > sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/AlreadyExistException.scala > sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/NoSuchItemException.scala > sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CannotReplaceMissingTableException.scala > sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/NonEmptyException.scala > sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala > Here is the doc with proposed text: > https://docs.google.com/document/d/1TpFx3AwcJZd3l7zB1ZDchvZ8j2dY6_uf5LHfW2gjE4A/edit?usp=sharing -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40360) Convert some DDL exception to new error framework
[ https://issues.apache.org/jira/browse/SPARK-40360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601003#comment-17601003 ] Apache Spark commented on SPARK-40360: -- User 'srielau' has created a pull request for this issue: https://github.com/apache/spark/pull/37811 > Convert some DDL exception to new error framework > - > > Key: SPARK-40360 > URL: https://issues.apache.org/jira/browse/SPARK-40360 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Serge Rielau >Priority: Major > > Tackling the following files: > sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/AlreadyExistException.scala > sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/NoSuchItemException.scala > sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CannotReplaceMissingTableException.scala > sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/NonEmptyException.scala > sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala > Here is the doc with proposed text: > https://docs.google.com/document/d/1TpFx3AwcJZd3l7zB1ZDchvZ8j2dY6_uf5LHfW2gjE4A/edit?usp=sharing -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40360) Convert some DDL exception to new error framework
[ https://issues.apache.org/jira/browse/SPARK-40360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40360: Assignee: Apache Spark > Convert some DDL exception to new error framework > - > > Key: SPARK-40360 > URL: https://issues.apache.org/jira/browse/SPARK-40360 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Serge Rielau >Assignee: Apache Spark >Priority: Major > > Tackling the following files: > sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/AlreadyExistException.scala > sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/NoSuchItemException.scala > sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CannotReplaceMissingTableException.scala > sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/NonEmptyException.scala > sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala > Here is the doc with proposed text: > https://docs.google.com/document/d/1TpFx3AwcJZd3l7zB1ZDchvZ8j2dY6_uf5LHfW2gjE4A/edit?usp=sharing -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40149) Star expansion after outer join asymmetrically includes joining key
[ https://issues.apache.org/jira/browse/SPARK-40149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600948#comment-17600948 ] Pankaj Nagla commented on SPARK-40149: -- Thank you for sharing the information. [Vlocity Training|https://www.igmguru.com/salesforce/salesforce-vlocity-training-certification/] enhances CPQ and guided selling as well. Salesforce Vlocity is the pioneer assisting many tops and arising companies obtain their wanted progress utilizing its Omnichannel procedures. > Star expansion after outer join asymmetrically includes joining key > --- > > Key: SPARK-40149 > URL: https://issues.apache.org/jira/browse/SPARK-40149 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2 >Reporter: Otakar Truněček >Priority: Blocker > > When star expansion is used on left side of a join, the result will include > joining key, while on the right side of join it doesn't. I would expect the > behaviour to be symmetric (either include on both sides or on neither). > Example: > {code:python} > from pyspark.sql import SparkSession > import pyspark.sql.functions as f > spark = SparkSession.builder.getOrCreate() > df_left = spark.range(5).withColumn('val', f.lit('left')) > df_right = spark.range(3, 7).withColumn('val', f.lit('right')) > df_merged = ( > df_left > .alias('left') > .join(df_right.alias('right'), on='id', how='full_outer') > .withColumn('left_all', f.struct('left.*')) > .withColumn('right_all', f.struct('right.*')) > ) > df_merged.show() > {code} > result: > {code:java}
> +---+----+-----+------------+---------+
> | id| val|  val|    left_all|right_all|
> +---+----+-----+------------+---------+
> |  0|left| null|   {0, left}|   {null}|
> |  1|left| null|   {1, left}|   {null}|
> |  2|left| null|   {2, left}|   {null}|
> |  3|left|right|   {3, left}|  {right}|
> |  4|left|right|   {4, left}|  {right}|
> |  5|null|right|{null, null}|  {right}|
> |  6|null|right|{null, null}|  {right}|
> +---+----+-----+------------+---------+
> {code} > This behaviour started with release 3.2.0.
Previously the key was not included on either side. > Result from Spark 3.1.3: > {code:java}
> +---+----+-----+--------+---------+
> | id| val|  val|left_all|right_all|
> +---+----+-----+--------+---------+
> |  0|left| null|  {left}|   {null}|
> |  6|null|right|  {null}|  {right}|
> |  5|null|right|  {null}|  {right}|
> |  1|left| null|  {left}|   {null}|
> |  3|left|right|  {left}|  {right}|
> |  2|left| null|  {left}|   {null}|
> |  4|left|right|  {left}|  {right}|
> +---+----+-----+--------+---------+
> {code} > I have a gut feeling this is related to these issues: > https://issues.apache.org/jira/browse/SPARK-39376 > https://issues.apache.org/jira/browse/SPARK-34527 > https://issues.apache.org/jira/browse/SPARK-38603 > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40318) try_avg() should throw the exceptions from its child
[ https://issues.apache.org/jira/browse/SPARK-40318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-40318. Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37775 [https://github.com/apache/spark/pull/37775] > try_avg() should throw the exceptions from its child > > > Key: SPARK-40318 > URL: https://issues.apache.org/jira/browse/SPARK-40318 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.4.0 > > > Similar to [https://github.com/apache/spark/pull/37486] and > [https://github.com/apache/spark/pull/37663], the errors from try_avg()'s > child should be shown instead of ignored. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
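The semantics being fixed here can be modeled outside Spark: a try_* expression should turn only its *own* runtime errors into NULL, while errors raised while evaluating its child still propagate to the caller. A hypothetical Python model (the exception name, `MAX` boundary, and helper functions are illustrative, not Spark code):

```python
MAX = 2**63 - 1  # pretend 64-bit long overflow boundary

class SparkArithmeticException(Exception):
    pass

def checked_add(a, b):
    s = a + b
    if s > MAX:
        raise SparkArithmeticException("long overflow")
    return s

def failing_child(v):
    # stand-in for a child expression that itself raises, e.g. 1/0
    raise SparkArithmeticException("divide by zero")

def try_avg(rows, child):
    # Evaluate the child first: its errors must propagate, not be swallowed.
    evaluated = [child(r) for r in rows]
    try:
        total = 0
        for v in evaluated:
            total = checked_add(total, v)
        return total / len(evaluated)
    except SparkArithmeticException:
        return None  # only the aggregation's own overflow becomes NULL
```

Before this fix, the try/except effectively wrapped the child evaluation as well, so a failure like a division error inside the child was silently converted to NULL.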
[jira] [Comment Edited] (SPARK-22588) SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values
[ https://issues.apache.org/jira/browse/SPARK-22588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600894#comment-17600894 ] Nikhil Sharma edited comment on SPARK-22588 at 9/6/22 5:03 PM: --- Thank you for sharing the information. [React Native Online Course|https://www.igmguru.com/digital-marketing-programming/react-native-training/] is an integrated professional course aimed at providing learners with the skills and knowledge of React Native, a mobile application framework used for the development of mobile applications for Android, iOS, UWP (Universal Windows Platform), and the web. was (Author: JIRAUSER295436): Thank you for sharing the information. [url=https://www.igmguru.com/digital-marketing-programming/react-native-training/]React Native Online Course[/url] is an integrated professional course aimed at providing learners with the skills and knowledge of React Native, a mobile application framework used for the development of mobile applications for Android, iOS, UWP (Universal Windows Platform), and web. 
> SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values > - > > Key: SPARK-22588 > URL: https://issues.apache.org/jira/browse/SPARK-22588 > Project: Spark > Issue Type: Question > Components: Deploy >Affects Versions: 2.1.1 >Reporter: Saanvi Sharma >Priority: Minor > Labels: dynamodb, spark > Original Estimate: 24h > Remaining Estimate: 24h > > I am using Spark 2.1 on EMR and I have a dataframe like this:
> ClientNum | Value_1 | Value_2 | Value_3 | Value_4
> 14        | A       | B       | C       | null
> 19        | X       | Y       | null    | null
> 21        | R       | null    | null    | null
> I want to load data into a DynamoDB table with ClientNum as key, following: > Analyze Your Data on Amazon DynamoDB with Apache Spark > Using Spark SQL for ETL > Here is the code I tried: > var jobConf = new JobConf(sc.hadoopConfiguration) > jobConf.set("dynamodb.servicename", "dynamodb") > jobConf.set("dynamodb.input.tableName", "table_name") > jobConf.set("dynamodb.output.tableName", "table_name") > jobConf.set("dynamodb.endpoint", "dynamodb.eu-west-1.amazonaws.com") > jobConf.set("dynamodb.regionid", "eu-west-1") > jobConf.set("dynamodb.throughput.read", "1") > jobConf.set("dynamodb.throughput.read.percent", "1") > jobConf.set("dynamodb.throughput.write", "1") > jobConf.set("dynamodb.throughput.write.percent", "1") > > jobConf.set("mapred.output.format.class", > "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat") > jobConf.set("mapred.input.format.class", > "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat") > #Import Data > val df = > sqlContext.read.format("com.databricks.spark.csv").option("header", > "true").option("inferSchema", "true").load(path) > I performed a transformation to have an RDD that matches the types that the > DynamoDB custom output format knows how to write. The custom output format > expects a tuple containing the Text and DynamoDBItemWritable types.
> Create a new RDD with those types in it, in the following map call: > #Convert the dataframe to rdd > val df_rdd = df.rdd > > df_rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = > MapPartitionsRDD[10] at rdd at :41 > > #Print first rdd > df_rdd.take(1) > > res12: Array[org.apache.spark.sql.Row] = Array([14,A,B,C,null]) > var ddbInsertFormattedRDD = df_rdd.map(a => { > var ddbMap = new HashMap[String, AttributeValue]() > var ClientNum = new AttributeValue() > ClientNum.setN(a.get(0).toString) > ddbMap.put("ClientNum", ClientNum) > var Value_1 = new AttributeValue() > Value_1.setS(a.get(1).toString) > ddbMap.put("Value_1", Value_1) > var Value_2 = new AttributeValue() > Value_2.setS(a.get(2).toString) > ddbMap.put("Value_2", Value_2) > var Value_3 = new AttributeValue() > Value_3.setS(a.get(3).toString) > ddbMap.put("Value_3", Value_3) > var Value_4 = new AttributeValue() > Value_4.setS(a.get(4).toString) > ddbMap.put("Value_4", Value_4) > var item = new DynamoDBItemWritable() > item.setItem(ddbMap) > (new Text(""), item) > }) > This last call uses the job configuration that defines the EMR-DDB connector > to write out the new RDD you created in the expected format: > ddbInsertFormattedRDD.saveAsHadoopDataset(jobConf) > fails with the following error: > Caused by: java.lang.NullPointerException > Null values caused the error; if I try with only ClientNum and Value_1 it works and > data is correctly inserted into the DynamoDB table. > Thanks for your help !! -- This message was sent by Atlassian Jira (v8.20.10#820010) ---
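The NullPointerException above comes from calling `setS` (or `toString`) on a null column value. A common fix is to skip null columns when building the item map, so the corresponding DynamoDB attribute is simply omitted. A minimal sketch of that null-skipping logic in plain Python (dicts stand in for the AttributeValue map; the names are illustrative, not the EMR-DDB connector API):

```python
def build_item(row, columns):
    """Build an attribute map for one row, skipping None values so
    no null ever reaches an AttributeValue setter."""
    item = {}
    for name, value in zip(columns, row):
        if value is None:
            continue                 # omit the attribute entirely
        item[name] = str(value)      # DynamoDB attributes carry typed strings
    return item

columns = ["ClientNum", "Value_1", "Value_2", "Value_3", "Value_4"]
row = (14, "A", "B", "C", None)      # the shape of row that crashed above
item = build_item(row, columns)      # "Value_4" is simply absent
```

In the Scala map call, the equivalent guard would be `if (a.get(i) != null)` around each `ddbMap.put`, so only non-null columns are added to the `ddbMap`.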
[jira] [Comment Edited] (SPARK-22588) SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values
[ https://issues.apache.org/jira/browse/SPARK-22588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600894#comment-17600894 ] Nikhil Sharma edited comment on SPARK-22588 at 9/6/22 5:01 PM: --- Thank you for sharing the information. [url=https://www.igmguru.com/digital-marketing-programming/react-native-training/]React Native Online Course[/url] is an integrated professional course aimed at providing learners with the skills and knowledge of React Native, a mobile application framework used for the development of mobile applications for Android, iOS, UWP (Universal Windows Platform), and web. was (Author: JIRAUSER295436): Thank you for sharing the information. https://www.igmguru.com/digital-marketing-programming/react-native-training/";>React Native Online Course is an integrated professional course aimed at providing learners with the skills and knowledge of React Native, a mobile application framework used for the development of mobile applications for Android, iOS, UWP (Universal Windows Platform), and web. 
[jira] [Comment Edited] (SPARK-22588) SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values
[ https://issues.apache.org/jira/browse/SPARK-22588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600894#comment-17600894 ] Nikhil Sharma edited comment on SPARK-22588 at 9/6/22 5:00 PM: --- Thank you for sharing the information. https://www.igmguru.com/digital-marketing-programming/react-native-training/";>React Native Online Course is an integrated professional course aimed at providing learners with the skills and knowledge of React Native, a mobile application framework used for the development of mobile applications for Android, iOS, UWP (Universal Windows Platform), and web. was (Author: JIRAUSER295436): Thank you for sharing the information. [React Native Online Course|[https://www.igmguru.com/digital-marketing-programming/react-native-training/]] is an integrated professional course aimed at providing learners with the skills and knowledge of React Native, a mobile application framework used for the development of mobile applications for Android, iOS, UWP (Universal Windows Platform), and web. 
[jira] [Updated] (SPARK-40361) Migrate arithmetic type check failures onto error classes
[ https://issues.apache.org/jira/browse/SPARK-40361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-40361: - Description: Replace TypeCheckFailure by DataTypeMismatch in type checks in arithmetic expressions: 1. Least (2): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala#L1188-L1191 2. Greatest (2): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala#L1266-L1269 was: Replace TypeCheckFailure by DataTypeMismatch in type checks in window expressions: 1. WindowSpecDefinition (4): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L68-L85 2. SpecifiedWindowFrame (3): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L216-L231 3. checkBoundary (2): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L264-L269 4. FrameLessOffsetWindowFunction (1): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L424 5. NthValue (2): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L697-L700 6. 
NTile (3): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L793-L804 > Migrate arithmetic type check failures onto error classes > - > > Key: SPARK-40361 > URL: https://issues.apache.org/jira/browse/SPARK-40361 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Major > > Replace TypeCheckFailure by DataTypeMismatch in type checks in arithmetic > expressions: > 1. Least (2): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala#L1188-L1191 > 2. Greatest (2): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala#L1266-L1269 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
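The migration tracked above replaces free-text TypeCheckFailure messages with structured error classes: a stable error-class name plus named message parameters, so tools and tests can match on the class rather than on message text. A language-agnostic sketch of that pattern in Python (the class name and template are illustrative; Spark's real definitions live in its error-classes JSON, and this mimics a check like Least's "same type" rule):

```python
# Structured errors: a stable class name plus message parameters,
# instead of an ad-hoc formatted string baked in at the raise site.
ERROR_CLASSES = {
    "DATATYPE_MISMATCH.DATA_DIFF_TYPES": (
        "Input to {functionName} should all be the same type, "
        "but it's {dataTypes}."
    ),
}

class DataTypeMismatch(Exception):
    def __init__(self, error_class, **params):
        self.error_class = error_class
        self.params = params
        super().__init__(ERROR_CLASSES[error_class].format(**params))

def check_same_type(function_name, arg_types):
    """Raise a structured error when the argument types differ."""
    if len(set(arg_types)) > 1:
        raise DataTypeMismatch(
            "DATATYPE_MISMATCH.DATA_DIFF_TYPES",
            functionName=function_name,
            dataTypes=str(arg_types),
        )
```

Callers can then assert on `e.error_class` in tests instead of string-matching the rendered message, which is the main benefit of the new framework.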
[jira] [Created] (SPARK-40361) Migrate arithmetic type check failures onto error classes
Max Gekk created SPARK-40361: Summary: Migrate arithmetic type check failures onto error classes Key: SPARK-40361 URL: https://issues.apache.org/jira/browse/SPARK-40361 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Max Gekk Replace TypeCheckFailure by DataTypeMismatch in type checks in window expressions: 1. WindowSpecDefinition (4): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L68-L85 2. SpecifiedWindowFrame (3): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L216-L231 3. checkBoundary (2): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L264-L269 4. FrameLessOffsetWindowFunction (1): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L424 5. NthValue (2): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L697-L700 6. NTile (3): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L793-L804 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22588) SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values
[ https://issues.apache.org/jira/browse/SPARK-22588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600894#comment-17600894 ] Nikhil Sharma commented on SPARK-22588: --- Thank you for sharing the information. [React Native Online Course|[https://www.igmguru.com/digital-marketing-programming/react-native-training/]] is an integrated professional course aimed at providing learners with the skills and knowledge of React Native, a mobile application framework used for the development of mobile applications for Android, iOS, UWP (Universal Windows Platform), and web. > SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values > - > > Key: SPARK-22588 > URL: https://issues.apache.org/jira/browse/SPARK-22588 > Project: Spark > Issue Type: Question > Components: Deploy >Affects Versions: 2.1.1 >Reporter: Saanvi Sharma >Priority: Minor > Labels: dynamodb, spark > Original Estimate: 24h > Remaining Estimate: 24h > > I am using spark 2.1 on EMR and i have a dataframe like this: > ClientNum | Value_1 | Value_2 | Value_3 | Value_4 > 14 |A |B| C | null > 19 |X |Y| null| null > 21 |R | null | null| null > I want to load data into DynamoDB table with ClientNum as key fetching: > Analyze Your Data on Amazon DynamoDB with apche Spark11 > Using Spark SQL for ETL3 > here is my code that I tried to solve: > var jobConf = new JobConf(sc.hadoopConfiguration) > jobConf.set("dynamodb.servicename", "dynamodb") > jobConf.set("dynamodb.input.tableName", "table_name") > jobConf.set("dynamodb.output.tableName", "table_name") > jobConf.set("dynamodb.endpoint", "dynamodb.eu-west-1.amazonaws.com") > jobConf.set("dynamodb.regionid", "eu-west-1") > jobConf.set("dynamodb.throughput.read", "1") > jobConf.set("dynamodb.throughput.read.percent", "1") > jobConf.set("dynamodb.throughput.write", "1") > jobConf.set("dynamodb.throughput.write.percent", "1") > > jobConf.set("mapred.output.format.class", > 
"org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat") > jobConf.set("mapred.input.format.class", > "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat") > #Import Data > val df = > sqlContext.read.format("com.databricks.spark.csv").option("header", > "true").option("inferSchema", "true").load(path) > I performed a transformation to have an RDD that matches the types that the > DynamoDB custom output format knows how to write. The custom output format > expects a tuple containing the Text and DynamoDBItemWritable types. > Create a new RDD with those types in it, in the following map call: > #Convert the dataframe to rdd > val df_rdd = df.rdd > > df_rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = > MapPartitionsRDD[10] at rdd at :41 > > #Print first rdd > df_rdd.take(1) > > res12: Array[org.apache.spark.sql.Row] = Array([14,A,B,C,null]) > var ddbInsertFormattedRDD = df_rdd.map(a => { > var ddbMap = new HashMap[String, AttributeValue]() > var ClientNum = new AttributeValue() > ClientNum.setN(a.get(0).toString) > ddbMap.put("ClientNum", ClientNum) > var Value_1 = new AttributeValue() > Value_1.setS(a.get(1).toString) > ddbMap.put("Value_1", Value_1) > var Value_2 = new AttributeValue() > Value_2.setS(a.get(2).toString) > ddbMap.put("Value_2", Value_2) > var Value_3 = new AttributeValue() > Value_3.setS(a.get(3).toString) > ddbMap.put("Value_3", Value_3) > var Value_4 = new AttributeValue() > Value_4.setS(a.get(4).toString) > ddbMap.put("Value_4", Value_4) > var item = new DynamoDBItemWritable() > item.setItem(ddbMap) > (new Text(""), item) > }) > This last call uses the job configuration that defines the EMR-DDB connector > to write out the new RDD you created in the expected format: > ddbInsertFormattedRDD.saveAsHadoopDataset(jobConf) > fails with the follwoing error: > Caused by: java.lang.NullPointerException > null values caused the error, if I try with ClientNum and Value_1 it works > data is correctly inserted on DynamoDB table. 
> Thanks for your help !! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
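The NullPointerException in the question above comes from calling `toString`/`setS` on null column values; DynamoDB does not store null attribute values, and an absent attribute can simply be omitted from the item. A minimal sketch of the fix in plain Scala, with a `Map[String, String]` standing in for the AWS `AttributeValue` map (the real code would guard each `setS` call the same way; the helper name `toItem` is made up for illustration):

```scala
// Build the item map while skipping null columns instead of writing them.
// In the EMR-DDB code from the question, the equivalent guard would be:
//   if (a.get(i) != null) { val v = new AttributeValue(); v.setS(a.get(i).toString); ddbMap.put(name, v) }
def toItem(names: Seq[String], row: Seq[Any]): Map[String, String] =
  names.zip(row).collect {
    case (name, value) if value != null => name -> value.toString
  }.toMap

val names = Seq("ClientNum", "Value_1", "Value_2", "Value_3", "Value_4")
val row   = Seq(14, "A", "B", "C", null) // mirrors the first row of the dataframe

val item = toItem(names, row)
println(item) // Value_4 is simply absent, so nothing null ever reaches setS
```

With this guard, rows like `21 | R | null | null | null` produce items containing only `ClientNum` and `Value_1`, which is exactly how DynamoDB represents missing attributes.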
[jira] [Updated] (SPARK-40357) Migrate window type check failures onto error classes
[ https://issues.apache.org/jira/browse/SPARK-40357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-40357: - Summary: Migrate window type check failures onto error classes (was: Migrate window type checks onto error classes) > Migrate window type check failures onto error classes > - > > Key: SPARK-40357 > URL: https://issues.apache.org/jira/browse/SPARK-40357 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Major > > Replace TypeCheckFailure by DataTypeMismatch in type checks in window > expressions: > 1. WindowSpecDefinition (4): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L68-L85 > 2. SpecifiedWindowFrame (3): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L216-L231 > 3. checkBoundary (2): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L264-L269 > 4. FrameLessOffsetWindowFunction (1): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L424 > 5. NthValue (2): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L697-L700 > 6. NTile (3): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L793-L804 > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
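The migration these sub-tasks describe follows one pattern: replace a free-form `TypeCheckFailure` message string with a structured `DataTypeMismatch` carrying an error subclass and named message parameters. A rough sketch of that pattern, with the Spark types modeled locally (the real ones live in `org.apache.spark.sql.catalyst.analysis.TypeCheckResult`; the subclass name and parameter keys below are assumptions for illustration, not the actual error-class registry entries):

```scala
// Local stand-ins for the Spark types involved in the migration.
sealed trait TypeCheckResult
case class TypeCheckFailure(message: String) extends TypeCheckResult // old style: free text
case class DataTypeMismatch(                                         // new style: error class
    errorSubClass: String,
    messageParameters: Map[String, String]) extends TypeCheckResult

// Before: the message is an opaque string, so tests and tools must pattern-match on prose.
def checkOld(foundType: String): TypeCheckResult =
  TypeCheckFailure(s"frame boundary must be an integer literal, found $foundType")

// After: a machine-readable subclass plus named parameters, which the error
// framework can render consistently and callers can match on structurally.
def checkNew(foundType: String): TypeCheckResult =
  DataTypeMismatch(
    errorSubClass = "SPECIFIED_WINDOW_FRAME_UNACCEPTED_TYPE", // assumed name
    messageParameters = Map("expectedType" -> "integer", "exprType" -> foundType))
```

The counts in the ticket (e.g. "WindowSpecDefinition (4)") are simply how many such `TypeCheckFailure` sites each expression contains.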
[jira] [Updated] (SPARK-40358) Migrate collection type check failures onto error classes
[ https://issues.apache.org/jira/browse/SPARK-40358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-40358: - Summary: Migrate collection type check failures onto error classes (was: Migrate collection type checks onto error classes) > Migrate collection type check failures onto error classes > - > > Key: SPARK-40358 > URL: https://issues.apache.org/jira/browse/SPARK-40358 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Major > > Replace TypeCheckFailure by DataTypeMismatch in type checks in collection > expressions: > 1. BinaryArrayExpressionWithImplicitCast (1): > [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L69] > 2. MapContainsKey (2): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L231-L237 > 3. MapConcat (1): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L663 > 4. MapFromEntries (1): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L801 > 5. SortArray (3): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L1027-L1035 > 6. ArrayContains (2): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L1259-L1264 > 7. ArrayPosition (1): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2035 > 8. 
ElementAt (3): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2177-L2187 > 9. Concat (1): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2385-L2388 > 10. Flatten (1): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2593-L2595 > 11. Sequence (1): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2773 > 12. ArrayRemove (1): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3445-L3447 > 13. ArrayDistinct (1): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3642 > 14. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40359) Migrate JSON type check failures onto error classes
[ https://issues.apache.org/jira/browse/SPARK-40359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-40359: - Summary: Migrate JSON type check failures onto error classes (was: Migrate JSON type checks onto error classes) > Migrate JSON type check failures onto error classes > --- > > Key: SPARK-40359 > URL: https://issues.apache.org/jira/browse/SPARK-40359 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Major > > Replace TypeCheckFailure by DataTypeMismatch in type checks in JSON > expressions: > 1. JsonTuple (2): > https://github.com/apache/spark/blob/9b885ae3ba70cd97489e8768a335c9b9c10e9d87/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L395-L399 > 2. JsonToStructs (2): > https://github.com/apache/spark/blob/9b885ae3ba70cd97489e8768a335c9b9c10e9d87/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L570-L574 > 3. StructsToJson (4): > https://github.com/apache/spark/blob/9b885ae3ba70cd97489e8768a335c9b9c10e9d87/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L725-L743 > 4. SchemaOfJson (1): > https://github.com/apache/spark/blob/9b885ae3ba70cd97489e8768a335c9b9c10e9d87/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L806 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40359) Migrate JSON type checks onto error classes
[ https://issues.apache.org/jira/browse/SPARK-40359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-40359: - Description: Replace TypeCheckFailure by DataTypeMismatch in type checks in JSON expressions: 1. JsonTuple (2): https://github.com/apache/spark/blob/9b885ae3ba70cd97489e8768a335c9b9c10e9d87/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L395-L399 2. JsonToStructs (2): https://github.com/apache/spark/blob/9b885ae3ba70cd97489e8768a335c9b9c10e9d87/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L570-L574 3. StructsToJson (4): https://github.com/apache/spark/blob/9b885ae3ba70cd97489e8768a335c9b9c10e9d87/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L725-L743 4. SchemaOfJson (1): https://github.com/apache/spark/blob/9b885ae3ba70cd97489e8768a335c9b9c10e9d87/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L806 was: Replace TypeCheckFailure by DataTypeMismatch in type checks in window expressions: 1. BinaryArrayExpressionWithImplicitCast (1): [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L69] 2. MapContainsKey (2): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L231-L237 3. MapConcat (1): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L663 4. MapFromEntries (1): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L801 5. 
SortArray (3): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L1027-L1035 6. ArrayContains (2): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L1259-L1264 7. ArrayPosition (1): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2035 8. ElementAt (3): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2177-L2187 9. Concat (1): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2385-L2388 10. Flatten (1): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2593-L2595 11. Sequence (1): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2773 12. ArrayRemove (1): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3445-L3447 13. ArrayDistinct (1): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3642 14. > Migrate JSON type checks onto error classes > --- > > Key: SPARK-40359 > URL: https://issues.apache.org/jira/browse/SPARK-40359 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Major > > Replace TypeCheckFailure by DataTypeMismatch in type checks in JSON > expressions: > 1. 
JsonTuple (2): > https://github.com/apache/spark/blob/9b885ae3ba70cd97489e8768a335c9b9c10e9d87/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L395-L399 > 2. JsonToStructs (2): > https://github.com/apache/spark/blob/9b885ae3ba70cd97489e8768a335c9b9c10e9d87/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L570-L574 > 3. StructsToJson (4): > https://github.com/apache/spark/blob/9b885ae3ba70cd97489e8768a335c9b9c10e9d87/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L725-L743 > 4. SchemaOfJson (1): > https://github.com/apache/spark/blob/9b885ae3ba70cd97489e8768a335c9b9c10e9d87/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L806 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40360) Convert some DDL exception to new error framework
Serge Rielau created SPARK-40360: Summary: Convert some DDL exception to new error framework Key: SPARK-40360 URL: https://issues.apache.org/jira/browse/SPARK-40360 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.4.0 Reporter: Serge Rielau Tackling the following files: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/AlreadyExistException.scala sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/NoSuchItemException.scala sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CannotReplaceMissingTableException.scala sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/NonEmptyException.scala sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala Here is the doc with proposed text: https://docs.google.com/document/d/1TpFx3AwcJZd3l7zB1ZDchvZ8j2dY6_uf5LHfW2gjE4A/edit?usp=sharing -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40358) Migrate collection type checks onto error classes
[ https://issues.apache.org/jira/browse/SPARK-40358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-40358: - Description: Replace TypeCheckFailure by DataTypeMismatch in type checks in collection expressions: 1. BinaryArrayExpressionWithImplicitCast (1): [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L69] 2. MapContainsKey (2): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L231-L237 3. MapConcat (1): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L663 4. MapFromEntries (1): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L801 5. SortArray (3): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L1027-L1035 6. ArrayContains (2): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L1259-L1264 7. ArrayPosition (1): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2035 8. ElementAt (3): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2177-L2187 9. Concat (1): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2385-L2388 10. Flatten (1): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2593-L2595 11. 
Sequence (1): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2773 12. ArrayRemove (1): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3445-L3447 13. ArrayDistinct (1): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3642 14. was: Replace TypeCheckFailure by DataTypeMismatch in type checks in window expressions: 1. BinaryArrayExpressionWithImplicitCast (1): [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L69] 2. MapContainsKey (2): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L231-L237 3. MapConcat (1): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L663 4. MapFromEntries (1): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L801 5. SortArray (3): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L1027-L1035 6. ArrayContains (2): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L1259-L1264 7. ArrayPosition (1): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2035 8. ElementAt (3): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2177-L2187 9. 
Concat (1): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2385-L2388 10. Flatten (1): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2593-L2595 11. Sequence (1): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2773 12. ArrayRemove (1): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3445-L3447 13. ArrayDistinct (1): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3642 14. > Migrate collection type checks onto error classes > - > >
[jira] [Created] (SPARK-40359) Migrate JSON type checks onto error classes
Max Gekk created SPARK-40359: Summary: Migrate JSON type checks onto error classes Key: SPARK-40359 URL: https://issues.apache.org/jira/browse/SPARK-40359 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Max Gekk Replace TypeCheckFailure by DataTypeMismatch in type checks in window expressions: 1. BinaryArrayExpressionWithImplicitCast (1): [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L69] 2. MapContainsKey (2): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L231-L237 3. MapConcat (1): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L663 4. MapFromEntries (1): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L801 5. SortArray (3): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L1027-L1035 6. ArrayContains (2): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L1259-L1264 7. ArrayPosition (1): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2035 8. ElementAt (3): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2177-L2187 9. Concat (1): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2385-L2388 10. 
Flatten (1): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2593-L2595 11. Sequence (1): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2773 12. ArrayRemove (1): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3445-L3447 13. ArrayDistinct (1): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3642 14. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40358) Migrate collection type checks onto error classes
[ https://issues.apache.org/jira/browse/SPARK-40358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-40358: - Description: Replace TypeCheckFailure by DataTypeMismatch in type checks in window expressions: 1. BinaryArrayExpressionWithImplicitCast (1): [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L69] 2. MapContainsKey (2): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L231-L237 3. MapConcat (1): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L663 4. MapFromEntries (1): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L801 5. SortArray (3): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L1027-L1035 6. ArrayContains (2): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L1259-L1264 7. ArrayPosition (1): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2035 8. ElementAt (3): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2177-L2187 9. Concat (1): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2385-L2388 10. Flatten (1): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2593-L2595 11. 
Sequence (1): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2773 12. ArrayRemove (1): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3445-L3447 13. ArrayDistinct (1): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3642 14. was: Replace TypeCheckFailure by DataTypeMismatch in type checks in window expressions: 1. WindowSpecDefinition (4): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L68-L85 2. SpecifiedWindowFrame (3): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L216-L231 3. checkBoundary (2): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L264-L269 4. FrameLessOffsetWindowFunction (1): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L424 5. NthValue (2): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L697-L700 6. NTile (3): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L793-L804 > Migrate collection type checks onto error classes > - > > Key: SPARK-40358 > URL: https://issues.apache.org/jira/browse/SPARK-40358 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Major > > Replace TypeCheckFailure by DataTypeMismatch in type checks in window > expressions: > 1. 
BinaryArrayExpressionWithImplicitCast (1): > [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L69] > 2. MapContainsKey (2): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L231-L237 > 3. MapConcat (1): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L663 > 4. MapFromEntries (1): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L801 > 5. SortArray (3): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catal
[jira] [Created] (SPARK-40358) Migrate collection type checks onto error classes
Max Gekk created SPARK-40358: Summary: Migrate collection type checks onto error classes Key: SPARK-40358 URL: https://issues.apache.org/jira/browse/SPARK-40358 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Max Gekk Replace TypeCheckFailure by DataTypeMismatch in type checks in window expressions: 1. WindowSpecDefinition (4): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L68-L85 2. SpecifiedWindowFrame (3): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L216-L231 3. checkBoundary (2): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L264-L269 4. FrameLessOffsetWindowFunction (1): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L424 5. NthValue (2): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L697-L700 6. NTile (3): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L793-L804 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40357) Migrate window type checks onto error classes
Max Gekk created SPARK-40357: Summary: Migrate window type checks onto error classes Key: SPARK-40357 URL: https://issues.apache.org/jira/browse/SPARK-40357 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Max Gekk Replace TypeCheckFailure by DataTypeMismatch in type checks in window expressions: 1. WindowSpecDefinition (4): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L68-L85 2. SpecifiedWindowFrame (3): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L216-L231 3. checkBoundary (2): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L264-L269 4. FrameLessOffsetWindowFunction (1): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L424 5. NthValue (2): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L697-L700 6. NTile (3): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L793-L804 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40320) When the Executor plugin fails to initialize, the Executor shows active but does not accept tasks forever, just like being hung
[ https://issues.apache.org/jira/browse/SPARK-40320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600777#comment-17600777 ] wuyi commented on SPARK-40320: -- > Actually the `CoarseGrainedExecutorBackend` JVM process is active but the > communication thread is no longer working ( please see > `MessageLoop#receiveLoopRunnable` , `receiveLoop()` was broken, so executor > doesn't receive any message) Hmm, shouldn't it bring up a new `receiveLoop()` to serve RPC messages according to [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rpc/netty/MessageLoop.scala#L82-L89] ? Why doesn't it? Besides, why doesn't SparkUncaughtExceptionHandler catch the fatal error? cc [~tgraves] [~mridulm80] > When the Executor plugin fails to initialize, the Executor shows active but > does not accept tasks forever, just like being hung > --- > > Key: SPARK-40320 > URL: https://issues.apache.org/jira/browse/SPARK-40320 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 3.0.0 >Reporter: Mars >Priority: Major > > *Reproduce steps:* > set `spark.plugins=ErrorSparkPlugin` > `ErrorSparkPlugin` && `ErrorExecutorPlugin` classes as below (I abbreviate the > code to make it clearer): > {code:java} > class ErrorSparkPlugin extends SparkPlugin { > /** >*/ > override def driverPlugin(): DriverPlugin = new ErrorDriverPlugin() > /** >*/ > override def executorPlugin(): ExecutorPlugin = new ErrorExecutorPlugin() > }{code} > {code:java} > class ErrorExecutorPlugin extends ExecutorPlugin { > private val checkingInterval: Long = 1 > override def init(_ctx: PluginContext, extraConf: util.Map[String, > String]): Unit = { > if (checkingInterval == 1) { > throw new UnsatisfiedLinkError("My Exception error") > } > } > } {code} > The Executor shows as active when we check the Spark UI; however, it is broken and > doesn't receive any tasks. 
> *Root Cause:* > I checked the code and found that in `org.apache.spark.rpc.netty.Inbox#safelyCall` > it will throw a fatal error (`UnsatisfiedLinkError` is a fatal error) in the method > `dealWithFatalError` . Actually the `CoarseGrainedExecutorBackend` JVM > process is active but the communication thread is no longer working ( > please see `MessageLoop#receiveLoopRunnable` ; `receiveLoop()` was broken, > so the executor doesn't receive any messages) > Some ideas: > I think it is very hard to know what happened here unless we check the > code. The Executor is active but it can't do anything. We will wonder whether the > driver is broken or it is an Executor problem. I think at least the Executor > status shouldn't be shown as active here, or the Executor should exitExecutor (kill itself) > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
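The behavior discussed above hinges on Scala's `NonFatal` extractor: `UnsatisfiedLinkError` is a `LinkageError`, which `NonFatal` deliberately excludes, so a restart path guarded by `NonFatal` never fires for it. A simplified sketch of that dynamic (the structure and names are assumptions loosely modeled on `MessageLoop#receiveLoopRunnable`, not the actual Spark code):

```scala
import scala.util.control.NonFatal

// A receive loop that is only re-scheduled for non-fatal throwables. A fatal
// error (UnsatisfiedLinkError, OutOfMemoryError, ...) falls through the catch:
// no restart happens, the dispatcher thread dies, and the JVM stays up --
// which is exactly the "active but hung" executor described in the ticket.
def receiveLoopRunnable(receiveLoop: () => Unit, restart: Runnable => Unit): Runnable =
  new Runnable {
    override def run(): Unit =
      try receiveLoop()
      catch {
        case NonFatal(e) =>
          restart(this) // non-fatal: a fresh loop is scheduled, RPC keeps working
          throw e
      }
  }
```

Under this model, a plugin's `UnsatisfiedLinkError` during init kills the message loop without any restart, matching the reported symptom.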
[jira] [Assigned] (SPARK-40315) Non-deterministic hashCode() calculations for ArrayBasedMapData on equal objects
[ https://issues.apache.org/jira/browse/SPARK-40315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-40315: --- Assignee: Carmen Kwan > Non-deterministic hashCode() calculations for ArrayBasedMapData on equal > objects > > > Key: SPARK-40315 > URL: https://issues.apache.org/jira/browse/SPARK-40315 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.2 >Reporter: Carmen Kwan >Assignee: Carmen Kwan >Priority: Major > Fix For: 3.1.4, 3.4.0, 3.3.1, 3.2.3 > > > There is no explicit `hashCode()` function override for the > `ArrayBasedMapData` LogicalPlan. As a result, the `hashCode()` computed for > `ArrayBasedMapData` can be different for two equal objects (objects with > equal keys and values). > This error is non-deterministic and hard to reproduce, as we don't control > the default `hashCode()` function. > We should override the `hashCode` function so that it works exactly as we > expect. We should also have an explicit `equals()` function for consistency > with how `Literals` check for equality of `ArrayBasedMapData`. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40315) Non-deterministic hashCode() calculations for ArrayBasedMapData on equal objects
[ https://issues.apache.org/jira/browse/SPARK-40315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-40315. - Fix Version/s: 3.3.1 3.1.4 3.2.3 3.4.0 Resolution: Fixed Issue resolved by pull request 37807 [https://github.com/apache/spark/pull/37807] > Non-deterministic hashCode() calculations for ArrayBasedMapData on equal > objects > > > Key: SPARK-40315 > URL: https://issues.apache.org/jira/browse/SPARK-40315 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.2 >Reporter: Carmen Kwan >Priority: Major > Fix For: 3.3.1, 3.1.4, 3.2.3, 3.4.0 > > > There is no explicit `hashCode()` function override for the > `ArrayBasedMapData` LogicalPlan. As a result, the `hashCode()` computed for > `ArrayBasedMapData` can be different for two equal objects (objects with > equal keys and values). > This error is non-deterministic and hard to reproduce, as we don't control > the default `hashCode()` function. > We should override the `hashCode` function so that it works exactly as we > expect. We should also have an explicit `equals()` function for consistency > with how `Literals` check for equality of `ArrayBasedMapData`. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
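The contract behind this fix can be illustrated outside Spark with a toy container (a hypothetical `MapDataLike`, not the real `ArrayBasedMapData`): once `equals` compares contents, `hashCode` must also be derived from the contents, or two equal objects can land in different hash buckets.

```scala
import java.util.Arrays

// Hypothetical content-equal container: equality is defined over the key and
// value arrays, so the hash must be computed over the same arrays.
class MapDataLike(val keys: Array[AnyRef], val values: Array[AnyRef]) {
  override def equals(o: Any): Boolean = o match {
    case other: MapDataLike =>
      keys.sameElements(other.keys) && values.sameElements(other.values)
    case _ => false
  }
  // Without this override, the default identity hash differs per instance,
  // so two equal objects could hash differently - the reported bug.
  override def hashCode(): Int =
    31 * Arrays.hashCode(keys) + Arrays.hashCode(values)
}
```

With the override in place, two content-equal instances collapse to a single entry in a hash-based `Set`; with only the default identity `hashCode`, they usually would not.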
[jira] [Commented] (SPARK-38330) Certificate doesn't match any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]
[ https://issues.apache.org/jira/browse/SPARK-38330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600766#comment-17600766 ] Jayita Panda commented on SPARK-38330: -- Can someone elaborate on the resolution here? I am facing the same problem. cc Exception in thread "main" org.apache.hadoop.fs.s3a.AWSClientIOException: getFileStatus on s3a://com.twilio.dev.warehouse/data/warehouse/sfdc-user/v1.0/1/partitions/dt=20201012T/sfdc-user0/2020-10-12T.json.gz: com.amazonaws.SdkClientException: Unable to execute HTTP request: Certificate for doesn't match any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]: Unable to execute HTTP request: Certificate for doesn't match any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com] > Certificate doesn't match any of the subject alternative names: > [*.s3.amazonaws.com, s3.amazonaws.com] > -- > > Key: SPARK-38330 > URL: https://issues.apache.org/jira/browse/SPARK-38330 > Project: Spark > Issue Type: Bug > Components: EC2 >Affects Versions: 3.2.1 > Environment: Spark 3.2.1 built with `hadoop-cloud` flag. > Direct access to s3 using default file committer. > JDK8. > >Reporter: André F. 
>Priority: Major > > Trying to run any job after bumping our Spark version from 3.1.2 to 3.2.1, > lead us to the current exception while reading files on s3: > {code:java} > org.apache.hadoop.fs.s3a.AWSClientIOException: getFileStatus on > s3a:///.parquet: com.amazonaws.SdkClientException: Unable to > execute HTTP request: Certificate for doesn't match > any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]: > Unable to execute HTTP request: Certificate for doesn't match any of > the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com] at > org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:208) at > org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:170) at > org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3351) > at > org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3185) > at > org.apache.hadoop.fs.s3a.S3AFileSystem.isDirectory(S3AFileSystem.java:4277) > at > org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:54) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:370) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:274) > at > org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:245) > at scala.Option.getOrElse(Option.scala:189) at > org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245) at > org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:596) {code} > > {code:java} > Caused by: javax.net.ssl.SSLPeerUnverifiedException: Certificate for > doesn't match any of the subject alternative names: > [*.s3.amazonaws.com, s3.amazonaws.com] > at > com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.verifyHostname(SSLConnectionSocketFactory.java:507) > at > 
com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:437) > at > com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:384) > at > com.amazonaws.thirdparty.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142) > at > com.amazonaws.thirdparty.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376) > at sun.reflect.GeneratedMethodAccessor36.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > com.amazonaws.http.conn.ClientConnectionManagerFactory$Handler.invoke(ClientConnectionManagerFactory.java:76) > at com.amazonaws.http.conn.$Proxy16.connect(Unknown Source) > at > com.amazonaws.thirdparty.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393) > at > com.amazonaws.thirdparty.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236) > at > com.amazonaws.thirdparty.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186) > at > com.amazonaws.thirdparty.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) > at > com.a
[jira] [Resolved] (SPARK-40352) Add function aliases: len, datepart, dateadd, date_diff and curdate
[ https://issues.apache.org/jira/browse/SPARK-40352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-40352. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37804 [https://github.com/apache/spark/pull/37804] > Add function aliases: len, datepart, dateadd, date_diff and curdate > --- > > Key: SPARK-40352 > URL: https://issues.apache.org/jira/browse/SPARK-40352 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.4.0 > > > The functions len, datepart, dateadd, date_diff and curdate exist in other > systems, and Spark SQL has similar functions. So, adding such aliases will > make the migration to Spark SQL easier. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
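As a rough sketch of what this change amounts to, each new name simply maps onto an existing Spark SQL function. The exact targets below are an assumption inferred from the similarly named built-in functions, not taken from the pull request:

```scala
// Assumed alias -> existing-function mapping (illustrative only).
val aliases: Map[String, String] = Map(
  "len"       -> "length",
  "datepart"  -> "date_part",
  "dateadd"   -> "date_add",
  "date_diff" -> "datediff",
  "curdate"   -> "current_date"
)
aliases.foreach { case (alias, target) => println(s"$alias -> $target") }
```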
[jira] [Commented] (SPARK-40327) Increase pandas API coverage for pandas API on Spark
[ https://issues.apache.org/jira/browse/SPARK-40327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600756#comment-17600756 ] Yikun Jiang commented on SPARK-40327: - [~podongfeng] Thanks, will work on it! > Increase pandas API coverage for pandas API on Spark > > > Key: SPARK-40327 > URL: https://issues.apache.org/jira/browse/SPARK-40327 > Project: Spark > Issue Type: Umbrella > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > Increasing the pandas API coverage for Apache Spark 3.4.0. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40332) Implement `GroupBy.quantile`.
[ https://issues.apache.org/jira/browse/SPARK-40332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600755#comment-17600755 ] Yikun Jiang commented on SPARK-40332: - working on this > Implement `GroupBy.quantile`. > - > > Key: SPARK-40332 > URL: https://issues.apache.org/jira/browse/SPARK-40332 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > We should implement `GroupBy.quantile` for increasing pandas API coverage. > pandas docs: > https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.quantile.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40356) Upgrade pandas to 1.4.4
[ https://issues.apache.org/jira/browse/SPARK-40356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600754#comment-17600754 ] Apache Spark commented on SPARK-40356: -- User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/37810 > Upgrade pandas to 1.4.4 > --- > > Key: SPARK-40356 > URL: https://issues.apache.org/jira/browse/SPARK-40356 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40356) Upgrade pandas to 1.4.4
[ https://issues.apache.org/jira/browse/SPARK-40356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40356: Assignee: Apache Spark > Upgrade pandas to 1.4.4 > --- > > Key: SPARK-40356 > URL: https://issues.apache.org/jira/browse/SPARK-40356 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40356) Upgrade pandas to 1.4.4
[ https://issues.apache.org/jira/browse/SPARK-40356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40356: Assignee: (was: Apache Spark) > Upgrade pandas to 1.4.4 > --- > > Key: SPARK-40356 > URL: https://issues.apache.org/jira/browse/SPARK-40356 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40356) Upgrade pandas to 1.4.4
[ https://issues.apache.org/jira/browse/SPARK-40356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600753#comment-17600753 ] Apache Spark commented on SPARK-40356: -- User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/37810 > Upgrade pandas to 1.4.4 > --- > > Key: SPARK-40356 > URL: https://issues.apache.org/jira/browse/SPARK-40356 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40356) Upgrade pandas to 1.4.4
Yikun Jiang created SPARK-40356: --- Summary: Upgrade pandas to 1.4.4 Key: SPARK-40356 URL: https://issues.apache.org/jira/browse/SPARK-40356 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 3.4.0 Reporter: Yikun Jiang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40355) Improve pushdown for orc & parquet when cast scenario
[ https://issues.apache.org/jira/browse/SPARK-40355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BingKun Pan updated SPARK-40355: Description: Table Schema: | a | b | part | | int | string | string | SQL: select * from table where b = 1 and part = 1 A. Partition column 'part' has been pushed down, but column 'b' was not pushed down. !https://user-images.githubusercontent.com/15246973/188630971-4eed59c0-d134-4994-a7aa-0d057a49c5dc.png! B. After applying the PR, partition column 'part' and column 'b' have both been pushed down. !https://user-images.githubusercontent.com/15246973/188631231-d30db19d-1d70-4712-bf75-bccaa2c790f4.png! was: Table Schema: | a | b | part | |---|---|---| | int | string | string | SQL: select * from table where b = 1 and part = 1 ### A.Partition column 'part' has been pushed down, but column 'b' no push down. https://user-images.githubusercontent.com/15246973/188630971-4eed59c0-d134-4994-a7aa-0d057a49c5dc.png";> ### B.After apply the pr, Partition column 'part' and column 'b' has been pushed down. https://user-images.githubusercontent.com/15246973/188631231-d30db19d-1d70-4712-bf75-bccaa2c790f4.png";> > Improve pushdown for orc & parquet when cast scenario > - > > Key: SPARK-40355 > URL: https://issues.apache.org/jira/browse/SPARK-40355 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Priority: Minor > Fix For: 3.4.0 > > > Table Schema: > | a | b | part | > | int | string | string | > > SQL: > select * from table where b = 1 and part = 1 > A. Partition column 'part' has been pushed down, but column 'b' was not pushed down. > !https://user-images.githubusercontent.com/15246973/188630971-4eed59c0-d134-4994-a7aa-0d057a49c5dc.png! > B. After applying the PR, partition column 'part' and column 'b' have both been pushed > down. > !https://user-images.githubusercontent.com/15246973/188631231-d30db19d-1d70-4712-bf75-bccaa2c790f4.png! 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40355) Improve pushdown for orc & parquet when cast scenario
[ https://issues.apache.org/jira/browse/SPARK-40355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BingKun Pan updated SPARK-40355: Description: Table Schema: | a | b | part | |---|---|---| | int | string | string | SQL: select * from table where b = 1 and part = 1 ### A.Partition column 'part' has been pushed down, but column 'b' no push down. https://user-images.githubusercontent.com/15246973/188630971-4eed59c0-d134-4994-a7aa-0d057a49c5dc.png";> ### B.After apply the pr, Partition column 'part' and column 'b' has been pushed down. https://user-images.githubusercontent.com/15246973/188631231-d30db19d-1d70-4712-bf75-bccaa2c790f4.png";> was:I will supplement the scene later > Improve pushdown for orc & parquet when cast scenario > - > > Key: SPARK-40355 > URL: https://issues.apache.org/jira/browse/SPARK-40355 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Priority: Minor > Fix For: 3.4.0 > > > Table Schema: > | a | b | part | > |---|---|---| > | int | string | string | > SQL: > select * from table where b = 1 and part = 1 > ### A.Partition column 'part' has been pushed down, but column 'b' no push > down. > src="https://user-images.githubusercontent.com/15246973/188630971-4eed59c0-d134-4994-a7aa-0d057a49c5dc.png";> > ### B.After apply the pr, Partition column 'part' and column 'b' has been > pushed down. > src="https://user-images.githubusercontent.com/15246973/188631231-d30db19d-1d70-4712-bf75-bccaa2c790f4.png";> -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40355) Improve pushdown for orc & parquet when cast scenario
[ https://issues.apache.org/jira/browse/SPARK-40355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600737#comment-17600737 ] Apache Spark commented on SPARK-40355: -- User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/37809 > Improve pushdown for orc & parquet when cast scenario > - > > Key: SPARK-40355 > URL: https://issues.apache.org/jira/browse/SPARK-40355 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Priority: Minor > Fix For: 3.4.0 > > > I will supplement the scene later -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40355) Improve pushdown for orc & parquet when cast scenario
[ https://issues.apache.org/jira/browse/SPARK-40355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40355: Assignee: Apache Spark > Improve pushdown for orc & parquet when cast scenario > - > > Key: SPARK-40355 > URL: https://issues.apache.org/jira/browse/SPARK-40355 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Assignee: Apache Spark >Priority: Minor > Fix For: 3.4.0 > > > I will supplement the scene later -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40355) Improve pushdown for orc & parquet when cast scenario
[ https://issues.apache.org/jira/browse/SPARK-40355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600736#comment-17600736 ] Apache Spark commented on SPARK-40355: -- User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/37809 > Improve pushdown for orc & parquet when cast scenario > - > > Key: SPARK-40355 > URL: https://issues.apache.org/jira/browse/SPARK-40355 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Priority: Minor > Fix For: 3.4.0 > > > I will supplement the scene later -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40355) Improve pushdown for orc & parquet when cast scenario
[ https://issues.apache.org/jira/browse/SPARK-40355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40355: Assignee: (was: Apache Spark) > Improve pushdown for orc & parquet when cast scenario > - > > Key: SPARK-40355 > URL: https://issues.apache.org/jira/browse/SPARK-40355 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Priority: Minor > Fix For: 3.4.0 > > > I will supplement the scene later -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40355) Improve pushdown for orc & parquet when cast scenario
BingKun Pan created SPARK-40355: --- Summary: Improve pushdown for orc & parquet when cast scenario Key: SPARK-40355 URL: https://issues.apache.org/jira/browse/SPARK-40355 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: BingKun Pan Fix For: 3.4.0 I will supplement the scene later -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39830) Add a test case to read ORC table that requires type promotion
[ https://issues.apache.org/jira/browse/SPARK-39830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600731#comment-17600731 ] Apache Spark commented on SPARK-39830: -- User 'cxzl25' has created a pull request for this issue: https://github.com/apache/spark/pull/37808 > Add a test case to read ORC table that requires type promotion > -- > > Key: SPARK-39830 > URL: https://issues.apache.org/jira/browse/SPARK-39830 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 3.3.0 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Trivial > Fix For: 3.4.0 > > > We can add a UT to test the scenario after the ORC-1205 release. > > bin/spark-shell > {code:java} > spark.sql("set orc.stripe.size=10240") > spark.sql("set orc.rows.between.memory.checks=1") > spark.sql("set spark.sql.orc.columnarWriterBatchSize=1") > val df = spark.range(1, 1+512, 1, 1).map { i => > if( i == 1 ){ > (i, Array.fill[Byte](5 * 1024 * 1024)('X')) > } else { > (i,Array.fill[Byte](1)('X')) > } > }.toDF("c1","c2") > df.write.format("orc").save("file:///tmp/test_table_orc_t1") > spark.sql("create external table test_table_orc_t1 (c1 string ,c2 binary) > location 'file:///tmp/test_table_orc_t1' stored as orc ") > spark.sql("select * from test_table_orc_t1").show() {code} > Querying this table will get the following exception > {code:java} > java.lang.ArrayIndexOutOfBoundsException: 1 > at > org.apache.orc.impl.TreeReaderFactory$TreeReader.nextVector(TreeReaderFactory.java:387) > at > org.apache.orc.impl.TreeReaderFactory$LongTreeReader.nextVector(TreeReaderFactory.java:740) > at > org.apache.orc.impl.ConvertTreeReaderFactory$StringGroupFromAnyIntegerTreeReader.nextVector(ConvertTreeReaderFactory.java:1069) > at > org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:65) > at > org.apache.orc.impl.reader.tree.StructBatchReader.nextBatchForLevel(StructBatchReader.java:100) > at > 
org.apache.orc.impl.reader.tree.StructBatchReader.nextBatch(StructBatchReader.java:77) > at > org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1371) > at > org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:84) > at > org.apache.orc.mapreduce.OrcMapreduceRecordReader.nextKeyValue(OrcMapreduceRecordReader.java:102) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40315) Non-deterministic hashCode() calculations for ArrayBasedMapData on equal objects
[ https://issues.apache.org/jira/browse/SPARK-40315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600708#comment-17600708 ] Apache Spark commented on SPARK-40315: -- User 'c27kwan' has created a pull request for this issue: https://github.com/apache/spark/pull/37807 > Non-deterministic hashCode() calculations for ArrayBasedMapData on equal > objects > > > Key: SPARK-40315 > URL: https://issues.apache.org/jira/browse/SPARK-40315 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.2 >Reporter: Carmen Kwan >Priority: Major > > There is no explicit `hashCode()` function override for the > `ArrayBasedMapData` LogicalPlan. As a result, the `hashCode()` computed for > `ArrayBasedMapData` can be different for two equal objects (objects with > equal keys and values). > This error is non-deterministic and hard to reproduce, as we don't control > the default `hashCode()` function. > We should override the `hashCode` function so that it works exactly as we > expect. We should also have an explicit `equals()` function for consistency > with how `Literals` check for equality of `ArrayBasedMapData`. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40354) Support eliminate dynamic partition for v1 writes
[ https://issues.apache.org/jira/browse/SPARK-40354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-40354: -- Description: v1 writes will add an extra sort for dynamic columns, e.g. {code:java} INSERT INTO TABLE t1 PARTITION(p) SELECT c1, c2, 'a' as p FROM t2 {code} if the dynamic columns are foldable, we can optimize it to: {code:java} INSERT INTO TABLE t1 PARTITION(p='a') SELECT c1, c2 FROM t2 {code} was: v1 writes will add a extra sort for dynamic columns, e.g. {code:java} INSERT INTO TABLE t1 PARTITION(p) SELECT c1, c2, 'a' as p FROM t2 {code} if the dynamic columns are foldable, we can optimize it to: {code:java} INSERT INTO TABLE t1 PARTITION(p='a') SELECT c1, c2 FROM t2 {code} > Support eliminate dynamic partition for v1 writes > - > > Key: SPARK-40354 > URL: https://issues.apache.org/jira/browse/SPARK-40354 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Priority: Major > > v1 writes will add an extra sort for dynamic columns, e.g. > {code:java} > INSERT INTO TABLE t1 PARTITION(p) > SELECT c1, c2, 'a' as p FROM t2 {code} > if the dynamic columns are foldable, we can optimize it to: > {code:java} > INSERT INTO TABLE t1 PARTITION(p='a') > SELECT c1, c2 FROM t2 {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40354) Support eliminate dynamic partition for v1 writes
XiDuo You created SPARK-40354: - Summary: Support eliminate dynamic partition for v1 writes Key: SPARK-40354 URL: https://issues.apache.org/jira/browse/SPARK-40354 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: XiDuo You v1 writes will add an extra sort for dynamic columns, e.g. {code:java} INSERT INTO TABLE t1 PARTITION(p) SELECT c1, c2, 'a' as p FROM t2 {code} if the dynamic columns are foldable, we can optimize it to: {code:java} INSERT INTO TABLE t1 PARTITION(p='a') SELECT c1, c2 FROM t2 {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
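The rewrite proposed in this issue can be sketched with a toy model (deliberately not Spark's internals; all names below are made up for illustration): when the value projected for a dynamic partition column is a foldable literal, it can be promoted into the static partition spec, so no sort over that column is needed.

```scala
// Toy expression and INSERT model (hypothetical, for illustration only).
sealed trait Expr
case class Col(name: String) extends Expr
case class Lit(value: String) extends Expr // foldable

case class Insert(
    staticParts: Map[String, String],
    dynamicParts: Seq[String],
    projections: Map[String, Expr])

// Move every dynamic partition column whose projected value is a literal
// into the static partition spec and drop it from the projection list.
def eliminateFoldableDynamicParts(i: Insert): Insert = {
  val (foldable, remaining) =
    i.dynamicParts.partition(p => i.projections.get(p).exists(_.isInstanceOf[Lit]))
  val promoted =
    foldable.map(p => p -> i.projections(p).asInstanceOf[Lit].value).toMap
  Insert(i.staticParts ++ promoted, remaining, i.projections -- foldable)
}

// INSERT INTO t1 PARTITION(p) SELECT c1, c2, 'a' AS p FROM t2
val query = Insert(Map.empty, Seq("p"),
  Map("c1" -> Col("c1"), "c2" -> Col("c2"), "p" -> Lit("a")))
val rewritten = eliminateFoldableDynamicParts(query)
println(rewritten) // p becomes a static partition, as in PARTITION(p='a')
```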
[jira] [Resolved] (SPARK-40324) Provide a query context of ParseException
[ https://issues.apache.org/jira/browse/SPARK-40324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-40324. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37794 [https://github.com/apache/spark/pull/37794] > Provide a query context of ParseException > - > > Key: SPARK-40324 > URL: https://issues.apache.org/jira/browse/SPARK-40324 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.4.0 > > > Extends the exception ParseException and add a queryContext into it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40300) Migrate onto the DATATYPE_MISMATCH error classes
[ https://issues.apache.org/jira/browse/SPARK-40300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-40300. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37744 [https://github.com/apache/spark/pull/37744] > Migrate onto the DATATYPE_MISMATCH error classes > > > Key: SPARK-40300 > URL: https://issues.apache.org/jira/browse/SPARK-40300 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.4.0 > > > Introduce new error class DATATYPE_MISMATCH w/ some sub-classes, and convert > TypeCheckFailure to another class w/ error sub-classes. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39546) Respect port definitions on K8S pod templates for both driver and executor
[ https://issues.apache.org/jira/browse/SPARK-39546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600669#comment-17600669 ] Oliver Koeth commented on SPARK-39546: -- Yes, looks like this should do it. Thank you > Respect port definitions on K8S pod templates for both driver and executor > > > Key: SPARK-39546 > URL: https://issues.apache.org/jira/browse/SPARK-39546 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Oliver Koeth >Priority: Minor > > *Description:* > Spark on K8S allows opening additional ports for custom purposes on the > driver pod via the pod template, but ignores the port specification in the > executor pod template. Port specifications from the pod template should be > preserved (and extended) for both drivers and executors. > *Scenario:* > I want to run functionality in the executor that exposes data on an > additional port. In my case, this is monitoring data exposed by Spark's JMX > metrics sink via the JMX prometheus exporter java agent > https://github.com/prometheus/jmx_exporter -- the java agent opens an extra > port inside the container, but for prometheus to detect and scrape the port, > it must be exposed in the K8S pod resource. > (More background if desired: This seems to be the "classic" Spark 2 way to > expose prometheus metrics. Spark 3 introduced a native equivalent servlet for > the driver, but for the executor, only a rather limited set of metrics is > forwarded via the driver, and that also follows a completely different naming > scheme. So the JMX + exporter approach still turns out to be more useful for > me, even in Spark 3) > Expected behavior: > I add the following to my pod template to expose the extra port opened by the > JMX exporter java agent > spec: > containers: > - ... 
> ports: > - containerPort: 8090 > name: jmx-prometheus > protocol: TCP > Observed behavior: > The port is exposed for driver pods but not for executor pods > *Corresponding code:* > driver pod creation just adds ports > [https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala] > (currently line 115) > val driverContainer = new ContainerBuilder(pod.container) > ... > .addNewPort() > ... > .addNewPort() > while executor pod creation replaces the ports > [https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicExecutorFeatureStep.scala] > (currently line 211) > val executorContainer = new ContainerBuilder(pod.container) > ... > .withPorts(requiredPorts.asJava) > The current handling is incosistent and unnecessarily limiting. It seems that > the executor creation could/should just as well preserve pods from the > template and add extra required ports. > *Workaround:* > It is possible to work around this limitation by adding a full sidecar > container to the executor pod spec which declares the port. Sidecar > containers are left unchanged by pod template handling. > As all containers in a pod share the same network, it does not matter which > container actually declares to expose the port. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
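The sidecar workaround described in the report could be sketched roughly as follows; the sidecar name, image, and file name here are invented for illustration and are not taken from the report:

```shell
# Sketch of the sidecar workaround: declare the JMX exporter port on a
# sidecar container, since Spark currently replaces the ports of the main
# executor container. Sidecar name, image, and file name are made up.
cat > executor-pod-template.yaml <<'EOF'
spec:
  containers:
  - name: spark-kubernetes-executor    # main container; Spark overwrites its ports
  - name: jmx-port-sidecar             # sidecar; left unchanged by pod template handling
    image: busybox:1.36
    command: ["sleep", "infinity"]
    ports:
    - containerPort: 8090
      name: jmx-prometheus
      protocol: TCP
EOF
echo "wrote executor-pod-template.yaml"
```

Because all containers in the pod share one network namespace, Prometheus can scrape port 8090 even though the JMX exporter agent actually runs in the main executor container.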
[jira] [Commented] (SPARK-40298) shuffle data recovery on the reused PVCs has no effect
[ https://issues.apache.org/jira/browse/SPARK-40298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600631#comment-17600631 ] Dongjoon Hyun commented on SPARK-40298: --- Thank you for trying this Apache Spark feature, [~todd5167], but as [~hyukjin.kwon] mentioned, this is more like a question. First, could you provide a reproducible test case for your case? I want to help you. Second, I assume that you verified the KubernetesLocalDiskShuffleExecutorComponents logs correctly. However, the following could be a partial observation. {quote}It can be confirmed that the pvc has been multiplexed by other pods, and the index and data information has been sent{quote} SPARK-35593 was designed to help the recovery and to improve stability on a best-effort basis without any regressions, which means SPARK-35593 doesn't aim to block existing Spark features like re-computation or executor allocation with a new PVC. More specifically, there exist two cases where Spark's processing is faster than the recovery.
Case 1. When Spark executor termination is a little slow and the PVC is not cleanly visible to the Spark driver from the K8s control plane for some reason, the Spark driver creates a new executor with a new PVC (of course, driver-owned). In this case, you can have more PVCs than executors. You can confirm this case with the `kubectl` command.
Case 2. When Spark's processing is faster than Spark's and K8s's executor allocation (pod creation + PVC assignment + Docker image downloading + ...), Spark recomputes the lineage with the running executors without waiting for new executor allocation (or recovery from it). This is Spark's original design, and it can always happen.
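The Case 1 check suggested above amounts to comparing PVC and executor pod counts. Since this sketch has no live cluster, it counts over canned `kubectl`-style output (resource names are invented); the real commands appear in the comments:

```shell
# Case 1 check: more driver-owned PVCs than executor pods suggests replacement
# executors were created with new PVCs. On a real cluster you would run
#   kubectl get pvc --no-headers | wc -l
#   kubectl get pods -l spark-role=executor --no-headers | wc -l
# Here we count canned example output instead (names are invented).
pvcs='spark-exec-1-pvc-0 Bound 100Gi
spark-exec-2-pvc-0 Bound 100Gi
spark-exec-3-pvc-0 Bound 100Gi'
pods='spark-exec-1 Running
spark-exec-2 Running'
pvc_count=$(printf '%s\n' "$pvcs" | wc -l)
pod_count=$(printf '%s\n' "$pods" | wc -l)
echo "PVCs: $pvc_count, executor pods: $pod_count"
if [ "$pvc_count" -gt "$pod_count" ]; then
  echo "More PVCs than executors: consistent with Case 1 (slow executor termination)"
fi
```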
> shuffle data recovery on the reused PVCs has no effect
>
> Key: SPARK-40298
> URL: https://issues.apache.org/jira/browse/SPARK-40298
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes
> Affects Versions: 3.2.2
> Reporter: todd
> Priority: Major
> Attachments: 1662002808396.jpg, 1662002822097.jpg
>
> I use Spark 3.2.2 to test the [Support shuffle data recovery on the reused PVCs (SPARK-35593)] feature. I found that when a shuffle read fails, data is still read from the source.
> It can be confirmed that the pvc has been multiplexed by other pods, and the index and data information has been sent.
> *This is my spark configuration information:*
> --conf spark.driver.memory=5G
> --conf spark.executor.memory=15G
> --conf spark.executor.cores=1
> --conf spark.executor.instances=50
> --conf spark.sql.shuffle.partitions=50
> --conf spark.dynamicAllocation.enabled=false
> --conf spark.kubernetes.driver.reusePersistentVolumeClaim=true
> --conf spark.kubernetes.driver.ownPersistentVolumeClaim=true
> --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.claimName=OnDemand
> --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.storageClass=gp2
> --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.sizeLimit=100Gi
> --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.path=/tmp/data
> --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.readOnly=false
> --conf spark.executorEnv.SPARK_EXECUTOR_DIRS=/tmp/data
> --conf spark.shuffle.sort.io.plugin.class=org.apache.spark.shuffle.KubernetesLocalDiskShuffleDataIO
> --conf spark.kubernetes.executor.missingPodDetectDelta=10s
> --conf spark.kubernetes.executor.apiPollingInterval=10s
> --conf spark.shuffle.io.retryWait=60s
> --conf spark.shuffle.io.maxRetries=5
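Of the flags in the report above, the subset that directly concerns shuffle recovery on reused PVCs can be sketched like this; the spark-submit invocation in the comment uses placeholder jar and class names, not anything from the report:

```shell
# Minimal subset of the reporter's flags that relate to shuffle recovery on
# reused PVCs (SPARK-35593). The commented invocation uses placeholder names.
RECOVERY_CONF='--conf spark.kubernetes.driver.reusePersistentVolumeClaim=true
--conf spark.kubernetes.driver.ownPersistentVolumeClaim=true
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.claimName=OnDemand
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.path=/tmp/data
--conf spark.executorEnv.SPARK_EXECUTOR_DIRS=/tmp/data
--conf spark.shuffle.sort.io.plugin.class=org.apache.spark.shuffle.KubernetesLocalDiskShuffleDataIO'
# spark-submit $RECOVERY_CONF --class com.example.App app.jar   # placeholder
printf '%s\n' "$RECOVERY_CONF"
```

The key pieces are the driver-owned, reusable on-demand PVC and the `KubernetesLocalDiskShuffleDataIO` plugin, with executor local dirs pointed at the PVC mount so shuffle files land on the reusable volume.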