[jira] [Assigned] (SPARK-36776) Partition filter of DataSourceV2ScanRelation can not push down when select none dataSchema from FileScan
[ https://issues.apache.org/jira/browse/SPARK-36776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36776: Assignee: (was: Apache Spark)
> Partition filter of DataSourceV2ScanRelation can not push down when select none dataSchema from FileScan
>
> Key: SPARK-36776
> URL: https://issues.apache.org/jira/browse/SPARK-36776
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.1.2
> Reporter: suheng.cloud
> Priority: Major
>
> In the PruneFileSourcePartitions rule, FileScan::withFilters is called to
> push down the partition-pruning filter (and this is the only place the
> function can be called), but it is guarded by the condition
> "scan.readDataSchema.nonEmpty"
> [source code here|https://github.com/apache/spark/blob/de351e30a90dd988b133b3d00fa6218bfcaba8b8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PruneFileSourcePartitions.scala#L114]
> We use Spark SQL with a custom catalog and execute a count query such as:
> select count(*) from catalog.db.tbl where dt='0812' (the same happens in any
> SQL that selects no column referencing tbl), where dt is a partition key.
> In this case scan.readDataSchema is indeed empty, so no partition pruning is
> performed for the scan, which causes all partitions to be scanned.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
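The reported behavior can be illustrated with a small, self-contained Python sketch. None of the names below come from Spark; they only model the guard the report describes: a pruning rule that bails out whenever the read schema is empty ends up scanning every partition for `SELECT count(*)`-style queries.

```python
# Hypothetical model of the partition-pruning guard described above
# (illustrative Python, not Spark code).

def prune_partitions(partitions, partition_filter, read_data_schema):
    """Return the partitions that survive pruning.

    Mirrors the reported guard: pruning is applied only when the scan
    reads at least one data column (read_data_schema is non-empty).
    """
    if not read_data_schema:          # the "scan.readDataSchema.nonEmpty" guard
        return list(partitions)      # no pruning: every partition is scanned
    return [p for p in partitions if partition_filter(p)]

partitions = [{"dt": "0811"}, {"dt": "0812"}, {"dt": "0813"}]
wanted = lambda p: p["dt"] == "0812"

# SELECT count(*) ... WHERE dt='0812' reads no data columns, so the guard
# disables pruning and all three partitions are scanned.
print(len(prune_partitions(partitions, wanted, read_data_schema=[])))      # 3

# A query that also selects a data column prunes as expected.
print(len(prune_partitions(partitions, wanted, read_data_schema=["c1"])))  # 1
```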
[jira] [Assigned] (SPARK-36776) Partition filter of DataSourceV2ScanRelation can not push down when select none dataSchema from FileScan
[ https://issues.apache.org/jira/browse/SPARK-36776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36776: Assignee: Apache Spark
[jira] [Commented] (SPARK-36776) Partition filter of DataSourceV2ScanRelation can not push down when select none dataSchema from FileScan
[ https://issues.apache.org/jira/browse/SPARK-36776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17417015#comment-17417015 ] Apache Spark commented on SPARK-36776: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/34037
[jira] [Commented] (SPARK-36796) Make all unit tests pass on Java 17
[ https://issues.apache.org/jira/browse/SPARK-36796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17417013#comment-17417013 ] Yang Jie commented on SPARK-36796: -- i'm working on this
> Make all unit tests pass on Java 17
>
> Key: SPARK-36796
> URL: https://issues.apache.org/jira/browse/SPARK-36796
> Project: Spark
> Issue Type: Sub-task
> Components: Tests
> Affects Versions: 3.3.0
> Reporter: Yuming Wang
> Priority: Major
[jira] [Comment Edited] (SPARK-36796) Make all unit tests pass on Java 17
[ https://issues.apache.org/jira/browse/SPARK-36796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17417013#comment-17417013 ] Yang Jie edited comment on SPARK-36796 at 9/18/21, 4:25 AM: I'm working on this was (Author: luciferyang): i'm working on this
[jira] [Created] (SPARK-36796) Make all unit tests pass on Java 17
Yuming Wang created SPARK-36796:
Summary: Make all unit tests pass on Java 17
Key: SPARK-36796
URL: https://issues.apache.org/jira/browse/SPARK-36796
Project: Spark
Issue Type: Sub-task
Components: Tests
Affects Versions: 3.3.0
Reporter: Yuming Wang
[jira] [Commented] (SPARK-36776) Partition filter of DataSourceV2ScanRelation can not push down when select none dataSchema from FileScan
[ https://issues.apache.org/jira/browse/SPARK-36776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416987#comment-17416987 ] suheng.cloud commented on SPARK-36776: -- Thank you Hyukjin & Huaxin~
[jira] [Resolved] (SPARK-36762) Fix Series.isin when Series has NaN values
[ https://issues.apache.org/jira/browse/SPARK-36762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-36762. --- Fix Version/s: 3.2.0 Assignee: dgd_contributor Resolution: Fixed Issue resolved by pull request 34005 https://github.com/apache/spark/pull/34005
> Fix Series.isin when Series has NaN values
>
> Key: SPARK-36762
> URL: https://issues.apache.org/jira/browse/SPARK-36762
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 3.2.0, 3.3.0
> Reporter: dgd_contributor
> Assignee: dgd_contributor
> Priority: Major
> Fix For: 3.2.0
>
> {code:python}
> >>> pser = pd.Series([None, 5, None, 3, 2, 1, None, 0, 0])
> >>> psser = ps.from_pandas(pser)
> >>> pser.isin([1, 3, 5, None])
> 0    False
> 1     True
> 2    False
> 3     True
> 4    False
> 5     True
> 6    False
> 7    False
> 8    False
> dtype: bool
> >>> psser.isin([1, 3, 5, None])
> 0    None
> 1    True
> 2    None
> 3    True
> 4    None
> 5    True
> 6    None
> 7    None
> 8    None
> dtype: object
> {code}
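The mismatch above can be reproduced in plain Python (a hypothetical sketch, not pandas or pyspark code). Under SQL three-valued logic, comparing any value with NULL yields NULL ("unknown"), so `x IN (1, 3, 5, NULL)` is never False: it is either True or unknown, which matches the all-None/True column from pandas-on-Spark. pandas instead treats NaN as matching nothing, so NaN rows and non-matching rows come back False.

```python
# Hypothetical model of the two isin semantics (illustrative, not library code).

def sql_eq(a, b):
    """Three-valued equality; None stands in for SQL NULL / NaN."""
    if a is None or b is None:
        return None
    return a == b

def sql_isin(values, candidates):
    """IN-list membership with SQL NULL semantics (the reported behavior)."""
    out = []
    for v in values:
        result = False
        for c in candidates:
            eq = sql_eq(v, c)
            if eq:                 # a definite match wins
                result = True
                break
            if eq is None:         # an unknown comparison propagates
                result = None
        out.append(result)
    return out

def pandas_like_isin(values, candidates):
    """pandas semantics from the report: a NaN/None row never matches,
    because NaN != NaN, even when None is among the candidates."""
    return [False if v is None else v in candidates for v in values]

vals = [None, 5, None, 3, 2, 1, None, 0, 0]
cands = [1, 3, 5, None]
print(sql_isin(vals, cands))
# [None, True, None, True, None, True, None, None, None]
print(pandas_like_isin(vals, cands))
# [False, True, False, True, False, True, False, False, False]
```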
[jira] [Commented] (SPARK-36776) Partition filter of DataSourceV2ScanRelation can not push down when select none dataSchema from FileScan
[ https://issues.apache.org/jira/browse/SPARK-36776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416944#comment-17416944 ] Huaxin Gao commented on SPARK-36776: This is fixed in Spark master/3.2 in this PR https://github.com/apache/spark/pull/33191. I will open a PR to backport the fix to 3.1.
[jira] [Assigned] (SPARK-36795) Explain Formatted has Duplicated Node IDs with InMemoryRelation Present
[ https://issues.apache.org/jira/browse/SPARK-36795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36795: Assignee: (was: Apache Spark)
> Explain Formatted has Duplicated Node IDs with InMemoryRelation Present
>
> Key: SPARK-36795
> URL: https://issues.apache.org/jira/browse/SPARK-36795
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.1.2
> Reporter: Michael Chen
> Priority: Major
>
> When a query contains an InMemoryRelation, the output of Explain Formatted
> will contain duplicate node IDs.
> {code:java}
> == Physical Plan ==
> AdaptiveSparkPlan (14)
> +- == Final Plan ==
>    * BroadcastHashJoin Inner BuildLeft (9)
>    :- BroadcastQueryStage (5)
>    :  +- BroadcastExchange (4)
>    :     +- * Filter (3)
>    :        +- * ColumnarToRow (2)
>    :           +- InMemoryTableScan (1)
>    :                 +- InMemoryRelation (2)
>    :                       +- * ColumnarToRow (4)
>    :                          +- Scan parquet default.t1 (3)
>    +- * Filter (8)
>       +- * ColumnarToRow (7)
>          +- Scan parquet default.t2 (6)
> +- == Initial Plan ==
>    BroadcastHashJoin Inner BuildLeft (13)
>    :- BroadcastExchange (11)
>    :  +- Filter (10)
>    :     +- InMemoryTableScan (1)
>    :           +- InMemoryRelation (2)
>    :                 +- * ColumnarToRow (4)
>    :                    +- Scan parquet default.t1 (3)
>    +- Filter (12)
>       +- Scan parquet default.t2 (6)
> {code}
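A minimal Python sketch (hypothetical; not Spark's actual ID-assignment code) of how duplicates like the ones in the plan above can arise: if the cached subplan inside an InMemoryRelation is numbered with its own fresh counter, its IDs restart at 1 and collide with the outer plan's IDs. Threading a single counter through both trees avoids the collision.

```python
import itertools

# Illustrative model of operator-ID assignment (not Spark code).

class Node:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

def number(root, counter=None):
    """Assign operator IDs in preorder. If no counter is passed, a fresh
    one is created - which is exactly how duplicates arise when a cached
    subplan is numbered independently of the main plan."""
    counter = counter if counter is not None else itertools.count(1)
    out = [(root.name, next(counter))]
    for c in root.children:
        out.extend(number(c, counter))
    return out

cached = Node("InMemoryRelation", [Node("Scan parquet default.t1")])
main = Node("Filter", [Node("InMemoryTableScan")])

# Buggy: the cached subplan gets its own counter, so IDs restart at 1.
buggy = number(main) + number(cached)
print([i for _, i in buggy])   # [1, 2, 1, 2] - duplicate node IDs

# Fixed: thread one counter through both trees.
c = itertools.count(1)
ok = number(main, c) + number(cached, c)
print([i for _, i in ok])      # [1, 2, 3, 4]
```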
[jira] [Commented] (SPARK-36795) Explain Formatted has Duplicated Node IDs with InMemoryRelation Present
[ https://issues.apache.org/jira/browse/SPARK-36795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416941#comment-17416941 ] Apache Spark commented on SPARK-36795: -- User 'ChenMichael' has created a pull request for this issue: https://github.com/apache/spark/pull/34036
[jira] [Assigned] (SPARK-36795) Explain Formatted has Duplicated Node IDs with InMemoryRelation Present
[ https://issues.apache.org/jira/browse/SPARK-36795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36795: Assignee: Apache Spark
[jira] [Created] (SPARK-36795) Explain Formatted has Duplicated Node IDs with InMemoryRelation Present
Michael Chen created SPARK-36795:
Summary: Explain Formatted has Duplicated Node IDs with InMemoryRelation Present
Key: SPARK-36795
URL: https://issues.apache.org/jira/browse/SPARK-36795
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.1.2
Reporter: Michael Chen
[jira] [Assigned] (SPARK-36793) [K8S] Support write container stdout/stderr to file
[ https://issues.apache.org/jira/browse/SPARK-36793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36793: Assignee: Apache Spark
> [K8S] Support write container stdout/stderr to file
>
> Key: SPARK-36793
> URL: https://issues.apache.org/jira/browse/SPARK-36793
> Project: Spark
> Issue Type: Improvement
> Components: Kubernetes
> Affects Versions: 3.1.2
> Reporter: Zhongwei Zhu
> Assignee: Apache Spark
> Priority: Minor
>
> Currently, the executor and driver pods only redirect stdout/stderr. If users
> want a sidecar logging agent to send stdout/stderr to external log storage,
> the only way is to change entrypoint.sh, which might break compatibility with
> the community version.
> We should support this feature, and it could be enabled by a Spark config.
> The related Spark configs are:
> ||Key||Default||Desc||
> |spark.kubernetes.logToFile.enabled|false|Whether to write executor/driver stdout/stderr to a log file|
> |spark.kubernetes.logToFile.path|/var/log/spark|The path where executor/driver stdout/stderr is written as a log file|
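A minimal Python sketch of the proposed behavior. The config keys above are a proposal from the issue, not existing Spark settings, and the `Tee` class below is illustrative rather than Spark code: when log-to-file is enabled, stdout is duplicated into a file that a sidecar logging agent could ship, while the console stream is kept so `kubectl logs` still works.

```python
import io
import os
import sys
import tempfile

class Tee(io.TextIOBase):
    """Write-through stream that duplicates writes to several targets."""
    def __init__(self, *streams):
        self.streams = streams
    def write(self, s):
        for st in self.streams:
            st.write(s)
        return len(s)
    def flush(self):
        for st in self.streams:
            st.flush()

log_to_file = True  # stand-in for the proposed spark.kubernetes.logToFile.enabled
# stand-in for the proposed spark.kubernetes.logToFile.path (temp dir for the demo)
log_path = os.path.join(tempfile.gettempdir(), "spark-stdout.log")

log_file = open(log_path, "w")
if log_to_file:
    sys.stdout = Tee(sys.__stdout__, log_file)

print("driver started")  # reaches both the console and the log file

sys.stdout = sys.__stdout__  # restore the plain console stream
log_file.close()
```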
[jira] [Assigned] (SPARK-36793) [K8S] Support write container stdout/stderr to file
[ https://issues.apache.org/jira/browse/SPARK-36793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36793: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-36793) [K8S] Support write container stdout/stderr to file
[ https://issues.apache.org/jira/browse/SPARK-36793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416896#comment-17416896 ] Apache Spark commented on SPARK-36793: -- User 'warrenzhu25' has created a pull request for this issue: https://github.com/apache/spark/pull/34035
[jira] [Assigned] (SPARK-36794) Ignore duplicated join keys when building relation for SEMI/ANTI hash join
[ https://issues.apache.org/jira/browse/SPARK-36794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36794: Assignee: (was: Apache Spark)
> Ignore duplicated join keys when building relation for SEMI/ANTI hash join
>
> Key: SPARK-36794
> URL: https://issues.apache.org/jira/browse/SPARK-36794
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.3.0
> Reporter: Cheng Su
> Priority: Minor
>
> For LEFT SEMI and LEFT ANTI hash equi-joins without an extra join condition,
> we only need to keep one row per unique join key(s) inside the hash table
> (`HashedRelation`) when building it. This can help reduce the size of the
> join's hash table.
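The idea can be sketched in plain Python (hypothetical; not Spark's `HashedRelation`): for LEFT SEMI / LEFT ANTI equi-joins with no extra condition, only key existence matters, so the build side can keep a set of distinct keys instead of every matching row.

```python
# Illustrative model of the build-side deduplication (not Spark code).

def build_keys(build_rows, key):
    # One entry per distinct join key - duplicate build rows add nothing
    # for SEMI/ANTI joins, so they are dropped at build time.
    return {key(r) for r in build_rows}

def left_semi(stream_rows, keys, key):
    # Keep stream rows whose key exists on the build side.
    return [r for r in stream_rows if key(r) in keys]

def left_anti(stream_rows, keys, key):
    # Keep stream rows whose key does NOT exist on the build side.
    return [r for r in stream_rows if key(r) not in keys]

build = [("a", 1), ("a", 2), ("b", 9)]    # duplicate join key "a"
stream = [("a", 10), ("c", 20), ("b", 30)]
k = lambda r: r[0]

keys = build_keys(build, k)
print(sorted(keys))                # ['a', 'b'] - 2 entries instead of 3 rows
print(left_semi(stream, keys, k))  # [('a', 10), ('b', 30)]
print(left_anti(stream, keys, k))  # [('c', 20)]
```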
[jira] [Commented] (SPARK-36794) Ignore duplicated join keys when building relation for SEMI/ANTI hash join
[ https://issues.apache.org/jira/browse/SPARK-36794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416879#comment-17416879 ] Apache Spark commented on SPARK-36794: -- User 'c21' has created a pull request for this issue: https://github.com/apache/spark/pull/34034
[jira] [Assigned] (SPARK-36794) Ignore duplicated join keys when building relation for SEMI/ANTI hash join
[ https://issues.apache.org/jira/browse/SPARK-36794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36794: Assignee: Apache Spark
[jira] [Updated] (SPARK-36794) Ignore duplicated join keys when building relation for SEMI/ANTI hash join
[ https://issues.apache.org/jira/browse/SPARK-36794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Su updated SPARK-36794: - Summary: Ignore duplicated join keys when building relation for SEMI/ANTI hash join (was: Ignore duplicated join keys when building relation for LEFT/ANTI hash join)
[jira] [Created] (SPARK-36794) Ignore duplicated join keys when building relation for LEFT/ANTI hash join
Cheng Su created SPARK-36794:
Summary: Ignore duplicated join keys when building relation for LEFT/ANTI hash join
Key: SPARK-36794
URL: https://issues.apache.org/jira/browse/SPARK-36794
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 3.3.0
Reporter: Cheng Su
[jira] [Comment Edited] (SPARK-23153) Support application dependencies in submission client's local file system
[ https://issues.apache.org/jira/browse/SPARK-23153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416846#comment-17416846 ] Stavros Kontopoulos edited comment on SPARK-23153 at 9/17/21, 6:25 PM: --- [~xuzhoyin] sorry for the late reply; the local scheme in the past meant local in the container, i.e. it had a different meaning (https://github.com/apache/spark/pull/21378), so this was intentional. Not sure of the status now. Btw, regarding the S3 prefix, if I remember correctly the idea was not to download files from a remote location locally and then store them again, e.g. to S3; this was intended for local files only. Feel free to add any other capabilities.
> Support application dependencies in submission client's local file system
>
> Key: SPARK-23153
> URL: https://issues.apache.org/jira/browse/SPARK-23153
> Project: Spark
> Issue Type: Improvement
> Components: Kubernetes, Spark Core
> Affects Versions: 2.4.0
> Reporter: Yinan Li
> Assignee: Stavros Kontopoulos
> Priority: Major
> Fix For: 3.0.0
>
> Currently local dependencies are not supported with Spark on K8S, i.e. if the
> user has code or dependencies only on the client where they run
> {{spark-submit}}, then the current implementation has no way to make those
> visible to the Spark application running inside the K8S pods that get
> launched. This limits users to running only applications whose code and
> dependencies are either baked into the Docker images used, or available via
> some external and globally accessible file system, e.g. HDFS, which are not
> viable options for many users and environments.
[jira] [Comment Edited] (SPARK-23153) Support application dependencies in submission client's local file system
[ https://issues.apache.org/jira/browse/SPARK-23153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416846#comment-17416846 ] Stavros Kontopoulos edited comment on SPARK-23153 at 9/17/21, 6:23 PM
[jira] [Comment Edited] (SPARK-23153) Support application dependencies in submission client's local file system
[ https://issues.apache.org/jira/browse/SPARK-23153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416846#comment-17416846 ] Stavros Kontopoulos edited comment on SPARK-23153 at 9/17/21, 6:21 PM
[jira] [Commented] (SPARK-23153) Support application dependencies in submission client's local file system
[ https://issues.apache.org/jira/browse/SPARK-23153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416846#comment-17416846 ] Stavros Kontopoulos commented on SPARK-23153: - [~xuzhoyin] sorry for the late reply, the local scheme in the past meant local in the container, had a different meaning (https://github.com/apache/spark/pull/21378). Not sure the status now. > Support application dependencies in submission client's local file system > - > > Key: SPARK-23153 > URL: https://issues.apache.org/jira/browse/SPARK-23153 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Spark Core >Affects Versions: 2.4.0 >Reporter: Yinan Li >Assignee: Stavros Kontopoulos >Priority: Major > Fix For: 3.0.0 > > > Currently local dependencies are not supported with Spark on K8S i.e. if the > user has code or dependencies only on the client where they run > {{spark-submit}} then the current implementation has no way to make those > visible to the Spark application running inside the K8S pods that get > launched. This limits users to only running applications where the code and > dependencies are either baked into the Docker images used or where those are > available via some external and globally accessible file system e.g. HDFS > which are not viable options for many users and environments -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36793) [K8S] Support write container stdout/stderr to file
Zhongwei Zhu created SPARK-36793: Summary: [K8S] Support write container stdout/stderr to file Key: SPARK-36793 URL: https://issues.apache.org/jira/browse/SPARK-36793 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 3.1.2 Reporter: Zhongwei Zhu Currently, executor and driver pods only redirect stdout/stderr. If users want to run a sidecar logging agent that sends stdout/stderr to external log storage, the only way is to change entrypoint.sh, which might break compatibility with the community version. We should support this feature, and it could be enabled by a Spark config. Related Spark configs are: |Key|Default|Desc| |spark.kubernetes.logToFile.enabled|false|Whether to write executor/driver stdout/stderr to a log file| |spark.kubernetes.logToFile.path|/var/log/spark|The path under which executor/driver stdout/stderr log files are written|
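A hedged sketch of how the proposed settings might be passed at submission time. The config keys are the ones proposed in this issue, not an existing Spark API:

```python
# Hypothetical: the configs proposed in SPARK-36793, as they might be passed
# on submission. These keys do not exist in released Spark; they are the
# proposal under discussion.
proposed_confs = {
    "spark.kubernetes.logToFile.enabled": "true",         # write stdout/stderr to file
    "spark.kubernetes.logToFile.path": "/var/log/spark",  # target directory
}

# e.g. rendered as spark-submit arguments:
args = [f"--conf {k}={v}" for k, v in sorted(proposed_confs.items())]
print(args)
```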
[jira] [Assigned] (SPARK-36792) Inset should handle Double.NaN and Float.NaN
[ https://issues.apache.org/jira/browse/SPARK-36792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36792: Assignee: Apache Spark > Inset should handle Double.NaN and Float.NaN > > > Key: SPARK-36792 > URL: https://issues.apache.org/jira/browse/SPARK-36792 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.2, 3.1.2, 3.2.0 >Reporter: angerszhu >Assignee: Apache Spark >Priority: Major > > InSet(Double.NaN, Seq(Double.NaN, 1d)) returns false
[jira] [Commented] (SPARK-36792) Inset should handle Double.NaN and Float.NaN
[ https://issues.apache.org/jira/browse/SPARK-36792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416799#comment-17416799 ] Apache Spark commented on SPARK-36792: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/34033 > Inset should handle Double.NaN and Float.NaN > > > Key: SPARK-36792 > URL: https://issues.apache.org/jira/browse/SPARK-36792 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.2, 3.1.2, 3.2.0 >Reporter: angerszhu >Priority: Major > > InSet(Double.NaN, Seq(Double.NaN, 1d)) returns false
[jira] [Assigned] (SPARK-36792) Inset should handle Double.NaN and Float.NaN
[ https://issues.apache.org/jira/browse/SPARK-36792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36792: Assignee: (was: Apache Spark) > Inset should handle Double.NaN and Float.NaN > > > Key: SPARK-36792 > URL: https://issues.apache.org/jira/browse/SPARK-36792 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.2, 3.1.2, 3.2.0 >Reporter: angerszhu >Priority: Major > > InSet(Double.NaN, Seq(Double.NaN, 1d)) returns false
[jira] [Commented] (SPARK-36673) Incorrect Unions of struct with mismatched field name case
[ https://issues.apache.org/jira/browse/SPARK-36673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416782#comment-17416782 ] Apache Spark commented on SPARK-36673: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/34032 > Incorrect Unions of struct with mismatched field name case > -- > > Key: SPARK-36673 > URL: https://issues.apache.org/jira/browse/SPARK-36673 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1, 3.2.0 >Reporter: Shardul Mahadik >Assignee: L. C. Hsieh >Priority: Major > Fix For: 3.2.0 > > > If a nested field has different casing on two sides of the union, the > resultant schema of the union will contain both fields in its schema > {code:java} > scala> val df1 = spark.range(2).withColumn("nested", struct(expr("id * 5 AS > INNER"))) > df1: org.apache.spark.sql.DataFrame = [id: bigint, nested: struct<INNER: bigint>] > val df2 = spark.range(2).withColumn("nested", struct(expr("id * 5 AS inner"))) > df2: org.apache.spark.sql.DataFrame = [id: bigint, nested: struct<inner: bigint>] > scala> df1.union(df2).printSchema > root > |-- id: long (nullable = false) > |-- nested: struct (nullable = false) > ||-- INNER: long (nullable = false) > ||-- inner: long (nullable = false) > {code} > This seems like a bug. I would expect that Spark SQL would either just union > by index or, if the user has requested {{unionByName}}, then it should match > fields case insensitively if {{spark.sql.caseSensitive}} is {{false}}. 
> However the output data only has one nested column > {code:java} > scala> df1.union(df2).show() > +---+--+ > | id|nested| > +---+--+ > | 0| {0}| > | 1| {5}| > | 0| {0}| > | 1| {5}| > +---+--+ > {code} > Trying to project fields of {{nested}} throws an error: > {code:java} > scala> df1.union(df2).select("nested.*").show() > java.lang.ArrayIndexOutOfBoundsException: 1 > at org.apache.spark.sql.types.StructType.apply(StructType.scala:414) > at > org.apache.spark.sql.catalyst.expressions.GetStructField.dataType(complexTypeExtractors.scala:108) > at > org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:192) > at > org.apache.spark.sql.catalyst.plans.logical.Project.$anonfun$output$1(basicLogicalOperators.scala:63) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.sql.catalyst.plans.logical.Project.output(basicLogicalOperators.scala:63) > at > org.apache.spark.sql.catalyst.plans.logical.Union.$anonfun$output$3(basicLogicalOperators.scala:260) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at scala.collection.immutable.List.foreach(List.scala:392) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.immutable.List.map(List.scala:298) > at > org.apache.spark.sql.catalyst.plans.logical.Union.output(basicLogicalOperators.scala:260) > at > 
org.apache.spark.sql.catalyst.plans.QueryPlan.outputSet$lzycompute(QueryPlan.scala:49) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.outputSet(QueryPlan.scala:49) > at > org.apache.spark.sql.catalyst.optimizer.ColumnPruning$$anonfun$apply$8.applyOrElse(Optimizer.scala:747) > at > org.apache.spark.sql.catalyst.optimizer.ColumnPruning$$anonfun$apply$8.applyOrElse(Optimizer.scala:695) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:316) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:316) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:171) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:169) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29) > at >
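The case-insensitive matching the reporter expects can be shown without Spark. A minimal Python sketch (a hypothetical helper, not Spark's actual name resolver) of resolving struct field names under a case-sensitivity flag:

```python
# Hypothetical sketch of case-(in)sensitive field resolution, mirroring what
# the reporter expects from unionByName when spark.sql.caseSensitive is false.
# This is illustrative Python, not Spark's resolver.
def resolve_field(fields, name, case_sensitive=False):
    """Return the struct fields that `name` resolves to."""
    if case_sensitive:
        return [f for f in fields if f == name]
    return [f for f in fields if f.lower() == name.lower()]

# Case-insensitive: "INNER" and "inner" collapse to a single field.
print(resolve_field(["INNER"], "inner"))                       # ['INNER']
# Case-sensitive: no match, so the union would keep both fields.
print(resolve_field(["INNER"], "inner", case_sensitive=True))  # []
```

Under the insensitive mode the union would merge the two spellings into one column, avoiding the duplicated-field schema shown above.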
[jira] [Updated] (SPARK-33772) Build and Run Spark on Java 17
[ https://issues.apache.org/jira/browse/SPARK-33772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-33772: -- Labels: releasenotes (was: ) > Build and Run Spark on Java 17 > -- > > Key: SPARK-33772 > URL: https://issues.apache.org/jira/browse/SPARK-33772 > Project: Spark > Issue Type: New Feature > Components: Build >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Major > Labels: releasenotes > > Apache Spark supports Java 8 and Java 11 (LTS). The next Java LTS version is > 17. > ||Version||Release Date|| > |Java 17 (LTS)|September 2021| > Apache Spark has a release plan, and the `Spark 3.2 Code freeze` was in July, along > with the release branch cut. > - https://spark.apache.org/versioning-policy.html > Supporting a new Java version is considered a new feature, which we cannot > backport.
[jira] [Updated] (SPARK-36772) FinalizeShuffleMerge fails with an exception due to attempt id not matching
[ https://issues.apache.org/jira/browse/SPARK-36772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-36772: -- Target Version/s: 3.2.0 > FinalizeShuffleMerge fails with an exception due to attempt id not matching > --- > > Key: SPARK-36772 > URL: https://issues.apache.org/jira/browse/SPARK-36772 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 3.2.0 >Reporter: Mridul Muralidharan >Priority: Blocker > > As part of driver request to external shuffle services (ESS) to finalize the > merge, it also passes its [application attempt > id|https://github.com/apache/spark/blob/3f09093a21306b0fbcb132d4c9f285e56ac6b43c/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalBlockStoreClient.java#L180] > so that ESS can validate the request is from the correct attempt. > This attempt id is fetched from the TransportConf passed in when creating the > [ExternalBlockStoreClient|https://github.com/apache/spark/blob/67421d80b8935d91b86e8cd3becb211fa2abd54f/core/src/main/scala/org/apache/spark/SparkEnv.scala#L352] > - and the transport conf leverages a [cloned > copy|https://github.com/apache/spark/blob/0494dc90af48ce7da0625485a4dc6917a244d580/core/src/main/scala/org/apache/spark/network/netty/SparkTransportConf.scala#L47] > of the SparkConf passed to it. > Application attempt id is set as part of SparkContext > [initialization|https://github.com/apache/spark/blob/67421d80b8935d91b86e8cd3becb211fa2abd54f/core/src/main/scala/org/apache/spark/SparkContext.scala#L586]. > But this happens after driver SparkEnv has [already been > created|https://github.com/apache/spark/blob/67421d80b8935d91b86e8cd3becb211fa2abd54f/core/src/main/scala/org/apache/spark/SparkContext.scala#L460]. 
> Hence the attempt id that ExternalBlockStoreClient uses will always end up > being -1, which will not match the attempt id at the ESS (which is based on > spark.app.attempt.id), resulting in merge finalization always failing > ("java.lang.IllegalArgumentException: The attempt id -1 in this > FinalizeShuffleMerge message does not match with the current attempt id 1 > stored in shuffle service for application ...")
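The initialization-order problem described above can be sketched in a few lines of plain Python (hypothetical class names; the real objects are Spark's SparkConf and the cloned TransportConf):

```python
# Sketch of the ordering bug: a transport conf is cloned from the app conf
# while the attempt id still holds its default (-1). Setting the real id
# afterwards does not propagate to the clone, so the finalize request
# carries the stale -1 to the external shuffle service.
class AppConf:
    def __init__(self):
        self.attempt_id = -1  # default before SparkContext sets it

    def clone(self):
        c = AppConf()
        c.attempt_id = self.attempt_id  # snapshot of the current value
        return c

conf = AppConf()
transport_conf = conf.clone()  # driver SparkEnv created here: clone taken early
conf.attempt_id = 1            # SparkContext sets the real attempt id later

print(transport_conf.attempt_id)  # -1: the stale value used in FinalizeShuffleMerge
```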
[jira] [Commented] (SPARK-36792) Inset should handle Double.NaN and Float.NaN
[ https://issues.apache.org/jira/browse/SPARK-36792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416743#comment-17416743 ] angerszhu commented on SPARK-36792: --- Will raise a PR soon. > Inset should handle Double.NaN and Float.NaN > > > Key: SPARK-36792 > URL: https://issues.apache.org/jira/browse/SPARK-36792 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.2, 3.1.2, 3.2.0 >Reporter: angerszhu >Priority: Major > > InSet(Double.NaN, Seq(Double.NaN, 1d)) returns false
[jira] [Created] (SPARK-36792) Inset should handle Double.NaN and Float.NaN
angerszhu created SPARK-36792: - Summary: Inset should handle Double.NaN and Float.NaN Key: SPARK-36792 URL: https://issues.apache.org/jira/browse/SPARK-36792 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.2, 3.0.2, 3.2.0 Reporter: angerszhu InSet(Double.NaN, Seq(Double.NaN, 1d)) returns false
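The underlying IEEE 754 behavior can be demonstrated without Spark. A minimal Python sketch of why any set-membership test built on plain equality misses NaN:

```python
import math

# IEEE 754: NaN compares unequal to everything, including itself, so a
# membership check built on == reports that NaN is "not in" a collection
# that contains NaN. This mirrors the InSet symptom described above
# (illustrative Python, not Spark code).
nan_a = float("nan")
nan_b = float("nan")

print(nan_a == nan_b)                         # False: NaN != NaN
print(any(nan_b == x for x in [nan_a, 1.0]))  # False: equality-based lookup misses NaN
# A NaN-aware check treats all NaNs as the same value:
print(any((math.isnan(nan_b) and math.isnan(x)) or nan_b == x
          for x in [nan_a, 1.0]))             # True
```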
[jira] [Assigned] (SPARK-36663) When the existing field name is a number, an error will be reported when reading the orc file
[ https://issues.apache.org/jira/browse/SPARK-36663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-36663: --- Assignee: Kousuke Saruta > When the existing field name is a number, an error will be reported when > reading the orc file > - > > Key: SPARK-36663 > URL: https://issues.apache.org/jira/browse/SPARK-36663 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.3, 3.1.2 >Reporter: mcdull_zhang >Assignee: Kousuke Saruta >Priority: Major > Fix For: 3.3.0 > > Attachments: image-2021-09-03-20-56-28-846.png > > > You can use the following methods to reproduce the problem: > {quote}val path = "file:///tmp/test_orc" > spark.range(1).withColumnRenamed("id", "100").repartition(1).write.orc(path) > spark.read.orc(path) > {quote} > The error message is like this: > {quote}org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input '100' expecting {'ADD', 'AFTER' > == SQL == > struct<100:bigint> > ---^^^ > {quote} > The error is actually issued by this line of code: > {quote}CatalystSqlParser.parseDataType("100:bigint") > {quote} > > The specific background is that spark calls the above code in the process of > converting the schema of the orc file into the catalyst schema. 
> {quote}// code in OrcUtils > private def toCatalystSchema(schema: TypeDescription): StructType = { > CharVarcharUtils.replaceCharVarcharWithStringInSchema(CatalystSqlParser.parseDataType(schema.toString).asInstanceOf[StructType]) > }{quote} > There are two solutions I currently think of: > # Modify the syntax analysis of Spark SQL to recognize this kind of schema > # The TypeDescription.toString method should add quote symbols to > numeric column names, because the following syntax is supported: > {quote}CatalystSqlParser.parseDataType("`100`:bigint") > {quote} > But currently TypeDescription does not support changing the UNQUOTED_NAMES > variable; should we first submit a PR to the ORC project to support the > configuration of this variable? > !image-2021-09-03-20-56-28-846.png! > > What do Spark members think about this issue?
[jira] [Resolved] (SPARK-36663) When the existing field name is a number, an error will be reported when reading the orc file
[ https://issues.apache.org/jira/browse/SPARK-36663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-36663. - Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 33915 [https://github.com/apache/spark/pull/33915] > When the existing field name is a number, an error will be reported when > reading the orc file > - > > Key: SPARK-36663 > URL: https://issues.apache.org/jira/browse/SPARK-36663 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.3, 3.1.2 >Reporter: mcdull_zhang >Priority: Major > Fix For: 3.3.0 > > Attachments: image-2021-09-03-20-56-28-846.png > > > You can use the following methods to reproduce the problem: > {quote}val path = "file:///tmp/test_orc" > spark.range(1).withColumnRenamed("id", "100").repartition(1).write.orc(path) > spark.read.orc(path) > {quote} > The error message is like this: > {quote}org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input '100' expecting {'ADD', 'AFTER' > == SQL == > struct<100:bigint> > ---^^^ > {quote} > The error is actually issued by this line of code: > {quote}CatalystSqlParser.parseDataType("100:bigint") > {quote} > > The specific background is that spark calls the above code in the process of > converting the schema of the orc file into the catalyst schema. 
> {quote}// code in OrcUtils > private def toCatalystSchema(schema: TypeDescription): StructType = { > CharVarcharUtils.replaceCharVarcharWithStringInSchema(CatalystSqlParser.parseDataType(schema.toString).asInstanceOf[StructType]) > }{quote} > There are two solutions I currently think of: > # Modify the syntax analysis of Spark SQL to recognize this kind of schema > # The TypeDescription.toString method should add quote symbols to > numeric column names, because the following syntax is supported: > {quote}CatalystSqlParser.parseDataType("`100`:bigint") > {quote} > But currently TypeDescription does not support changing the UNQUOTED_NAMES > variable; should we first submit a PR to the ORC project to support the > configuration of this variable? > !image-2021-09-03-20-56-28-846.png! > > What do Spark members think about this issue?
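Why the bare name fails while the backticked form parses can be sketched with a toy identifier grammar (a hypothetical regex stand-in; the real behavior lives in Catalyst's SQL parser):

```python
import re

# Hypothetical sketch: unquoted identifiers typically must not start with a
# digit, while backtick quoting makes any name unambiguous. This mirrors why
# parseDataType("100:bigint") fails but parseDataType("`100`:bigint") works.
UNQUOTED_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*\Z")
QUOTED_IDENT = re.compile(r"`[^`]+`\Z")

def parses_as_field_name(name: str) -> bool:
    """True if `name` would be accepted as a field name by this toy grammar."""
    return bool(UNQUOTED_IDENT.match(name) or QUOTED_IDENT.match(name))

print(parses_as_field_name("id"))     # True: ordinary identifier
print(parses_as_field_name("100"))    # False: leading digit rejected
print(parses_as_field_name("`100`"))  # True: quoting disambiguates
```

This is why option 2 above (having ORC's TypeDescription.toString emit quoted names) would make the round-trip through the parser safe.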
[jira] [Resolved] (SPARK-36767) ArrayMin/ArrayMax/SortArray/ArraySort add comment and UT
[ https://issues.apache.org/jira/browse/SPARK-36767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-36767. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 34008 [https://github.com/apache/spark/pull/34008] > ArrayMin/ArrayMax/SortArray/ArraySort add comment and UT > - > > Key: SPARK-36767 > URL: https://issues.apache.org/jira/browse/SPARK-36767 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.3, 3.1.2, 3.2.1 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.2.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36767) ArrayMin/ArrayMax/SortArray/ArraySort add comment and UT
[ https://issues.apache.org/jira/browse/SPARK-36767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-36767: --- Assignee: angerszhu > ArrayMin/ArrayMax/SortArray/ArraySort add comment and UT > - > > Key: SPARK-36767 > URL: https://issues.apache.org/jira/browse/SPARK-36767 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.3, 3.1.2, 3.2.1 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36673) Incorrect Unions of struct with mismatched field name case
[ https://issues.apache.org/jira/browse/SPARK-36673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-36673: --- Assignee: L. C. Hsieh > Incorrect Unions of struct with mismatched field name case > -- > > Key: SPARK-36673 > URL: https://issues.apache.org/jira/browse/SPARK-36673 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1, 3.2.0 >Reporter: Shardul Mahadik >Assignee: L. C. Hsieh >Priority: Major > > If a nested field has different casing on two sides of the union, the > resultant schema of the union will contain both fields in its schema > {code:java} > scala> val df1 = spark.range(2).withColumn("nested", struct(expr("id * 5 AS > INNER"))) > df1: org.apache.spark.sql.DataFrame = [id: bigint, nested: struct<INNER: bigint>] > val df2 = spark.range(2).withColumn("nested", struct(expr("id * 5 AS inner"))) > df2: org.apache.spark.sql.DataFrame = [id: bigint, nested: struct<inner: bigint>] > scala> df1.union(df2).printSchema > root > |-- id: long (nullable = false) > |-- nested: struct (nullable = false) > ||-- INNER: long (nullable = false) > ||-- inner: long (nullable = false) > {code} > This seems like a bug. I would expect that Spark SQL would either just union > by index or, if the user has requested {{unionByName}}, then it should match > fields case insensitively if {{spark.sql.caseSensitive}} is {{false}}. 
> However the output data only has one nested column > {code:java} > scala> df1.union(df2).show() > +---+--+ > | id|nested| > +---+--+ > | 0| {0}| > | 1| {5}| > | 0| {0}| > | 1| {5}| > +---+--+ > {code} > Trying to project fields of {{nested}} throws an error: > {code:java} > scala> df1.union(df2).select("nested.*").show() > java.lang.ArrayIndexOutOfBoundsException: 1 > at org.apache.spark.sql.types.StructType.apply(StructType.scala:414) > at > org.apache.spark.sql.catalyst.expressions.GetStructField.dataType(complexTypeExtractors.scala:108) > at > org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:192) > at > org.apache.spark.sql.catalyst.plans.logical.Project.$anonfun$output$1(basicLogicalOperators.scala:63) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.sql.catalyst.plans.logical.Project.output(basicLogicalOperators.scala:63) > at > org.apache.spark.sql.catalyst.plans.logical.Union.$anonfun$output$3(basicLogicalOperators.scala:260) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at scala.collection.immutable.List.foreach(List.scala:392) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.immutable.List.map(List.scala:298) > at > org.apache.spark.sql.catalyst.plans.logical.Union.output(basicLogicalOperators.scala:260) > at > 
org.apache.spark.sql.catalyst.plans.QueryPlan.outputSet$lzycompute(QueryPlan.scala:49) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.outputSet(QueryPlan.scala:49) > at > org.apache.spark.sql.catalyst.optimizer.ColumnPruning$$anonfun$apply$8.applyOrElse(Optimizer.scala:747) > at > org.apache.spark.sql.catalyst.optimizer.ColumnPruning$$anonfun$apply$8.applyOrElse(Optimizer.scala:695) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:316) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:316) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:171) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:169) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:321) > at >
[jira] [Resolved] (SPARK-36673) Incorrect Unions of struct with mismatched field name case
[ https://issues.apache.org/jira/browse/SPARK-36673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-36673. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 34025 [https://github.com/apache/spark/pull/34025] > Incorrect Unions of struct with mismatched field name case > -- > > Key: SPARK-36673 > URL: https://issues.apache.org/jira/browse/SPARK-36673 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1, 3.2.0 >Reporter: Shardul Mahadik >Assignee: L. C. Hsieh >Priority: Major > Fix For: 3.2.0 > > > If a nested field has different casing on two sides of the union, the > resultant schema of the union will contain both fields in its schema > {code:java} > scala> val df1 = spark.range(2).withColumn("nested", struct(expr("id * 5 AS > INNER"))) > df1: org.apache.spark.sql.DataFrame = [id: bigint, nested: struct<INNER: bigint>] > val df2 = spark.range(2).withColumn("nested", struct(expr("id * 5 AS inner"))) > df2: org.apache.spark.sql.DataFrame = [id: bigint, nested: struct<inner: bigint>] > scala> df1.union(df2).printSchema > root > |-- id: long (nullable = false) > |-- nested: struct (nullable = false) > ||-- INNER: long (nullable = false) > ||-- inner: long (nullable = false) > {code} > This seems like a bug. I would expect that Spark SQL would either just union > by index or, if the user has requested {{unionByName}}, then it should match > fields case insensitively if {{spark.sql.caseSensitive}} is {{false}}. 
> However the output data only has one nested column > {code:java} > scala> df1.union(df2).show() > +---+--+ > | id|nested| > +---+--+ > | 0| {0}| > | 1| {5}| > | 0| {0}| > | 1| {5}| > +---+--+ > {code} > Trying to project fields of {{nested}} throws an error: > {code:java} > scala> df1.union(df2).select("nested.*").show() > java.lang.ArrayIndexOutOfBoundsException: 1 > at org.apache.spark.sql.types.StructType.apply(StructType.scala:414) > at > org.apache.spark.sql.catalyst.expressions.GetStructField.dataType(complexTypeExtractors.scala:108) > at > org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:192) > at > org.apache.spark.sql.catalyst.plans.logical.Project.$anonfun$output$1(basicLogicalOperators.scala:63) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.sql.catalyst.plans.logical.Project.output(basicLogicalOperators.scala:63) > at > org.apache.spark.sql.catalyst.plans.logical.Union.$anonfun$output$3(basicLogicalOperators.scala:260) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at scala.collection.immutable.List.foreach(List.scala:392) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.immutable.List.map(List.scala:298) > at > org.apache.spark.sql.catalyst.plans.logical.Union.output(basicLogicalOperators.scala:260) > at > 
org.apache.spark.sql.catalyst.plans.QueryPlan.outputSet$lzycompute(QueryPlan.scala:49) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.outputSet(QueryPlan.scala:49) > at > org.apache.spark.sql.catalyst.optimizer.ColumnPruning$$anonfun$apply$8.applyOrElse(Optimizer.scala:747) > at > org.apache.spark.sql.catalyst.optimizer.ColumnPruning$$anonfun$apply$8.applyOrElse(Optimizer.scala:695) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:316) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:316) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:171) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:169) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29) > at >
[jira] [Resolved] (SPARK-36718) only collapse projects if we don't duplicate expensive expressions
[ https://issues.apache.org/jira/browse/SPARK-36718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-36718. - Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 33958 [https://github.com/apache/spark/pull/33958] > only collapse projects if we don't duplicate expensive expressions > -- > > Key: SPARK-36718 > URL: https://issues.apache.org/jira/browse/SPARK-36718 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36718) only collapse projects if we don't duplicate expensive expressions
[ https://issues.apache.org/jira/browse/SPARK-36718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-36718: --- Assignee: Wenchen Fan > only collapse projects if we don't duplicate expensive expressions > -- > > Key: SPARK-36718 > URL: https://issues.apache.org/jira/browse/SPARK-36718 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
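The issue title can be made concrete: collapsing two Project nodes substitutes the expensive expression into every use site, duplicating its evaluation. A minimal Python sketch counting calls to a stand-in expensive function (illustrative, not Spark's optimizer):

```python
# Sketch of why CollapseProject can duplicate work: Project(b = a + a) over
# Project(a = f(x)) collapses to Project(b = f(x) + f(x)), so the expensive
# expression f is evaluated once per use site instead of once overall.
calls = {"f": 0}

def f(x):  # stand-in for an expensive expression
    calls["f"] += 1
    return x * 10

x = 3
# Uncollapsed: materialize `a` once, then reuse it.
a = f(x)
b_uncollapsed = a + a      # f called once

# Collapsed: substitution duplicates f(x) at each use site.
b_collapsed = f(x) + f(x)  # f called twice more

print(b_uncollapsed == b_collapsed)  # same result either way
print(calls["f"])                    # but 3 evaluations total instead of 1 + reuse
```

Hence the fix: only collapse when the substitution does not duplicate expensive expressions.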
[jira] [Resolved] (SPARK-36764) Fix race-condition on "ensure continuous stream is being used" in KafkaContinuousTest
[ https://issues.apache.org/jira/browse/SPARK-36764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-36764. - Fix Version/s: 3.2.0 Assignee: Jungtaek Lim Resolution: Fixed > Fix race-condition on "ensure continuous stream is being used" in > KafkaContinuousTest > - > > Key: SPARK-36764 > URL: https://issues.apache.org/jira/browse/SPARK-36764 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.8, 3.0.3, 3.1.2, 3.2.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > Fix For: 3.2.0 > > > The test “ensure continuous stream is being used“ in > KafkaContinuousTest quickly checks the actual type of the execution and stops > the query. Stopping a streaming query in continuous mode is done by > interrupting the query execution thread and joining on it indefinitely. > In parallel, the started streaming query generates its execution plan, > including running the optimizer. Some parts of SessionState can be built at that > time, as they are defined as lazy. The problem is that some of them seem to be > able to “swallow” the InterruptedException and let the thread keep > running. > As a result, the query cannot tell that there is a request to stop it, > so the query won’t stop. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
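The swallowed-interrupt failure mode described in SPARK-36764 can be reproduced in a few lines of plain Java. This is a sketch, not Spark code: if a broad catch consumes the InterruptedException without rethrowing or restoring the interrupt flag, the stop request is lost and the loop runs to completion.

```java
// Sketch of a worker loop whose stop signal arrives as an interrupt.
public class SwallowInterrupt {
    // Returns true if the loop noticed the stop request.
    static boolean runLoop(boolean restoreFlag, int maxIters) {
        for (int i = 0; i < maxIters; i++) {
            try {
                Thread.sleep(1);                        // interruptible work
            } catch (InterruptedException e) {          // e.g. a lazy-init path catching broadly
                if (restoreFlag) {
                    Thread.currentThread().interrupt(); // keep the stop request visible
                }
                // else: swallowed -- the interrupt status stays cleared
            }
            if (Thread.currentThread().isInterrupted()) {
                return true;                            // stop request observed
            }
        }
        return false;                                   // ran to completion, signal lost
    }

    public static void main(String[] args) {
        Thread.currentThread().interrupt();             // deliver the stop request
        System.out.println("restoring flag, observed stop: " + runLoop(true, 50));
        Thread.interrupted();                           // clear leftover status
        Thread.currentThread().interrupt();
        System.out.println("swallowing,     observed stop: " + runLoop(false, 50));
    }
}
```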
[jira] [Resolved] (SPARK-36741) array_distinct should not return duplicated NaN
[ https://issues.apache.org/jira/browse/SPARK-36741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-36741. - Fix Version/s: 3.1.3 3.2.0 3.0.4 Resolution: Fixed Issue resolved by pull request 33993 [https://github.com/apache/spark/pull/33993] > array_distinct should not return duplicated NaN > --- > > Key: SPARK-36741 > URL: https://issues.apache.org/jira/browse/SPARK-36741 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.0.4, 3.2.0, 3.1.3 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
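The NaN pitfall behind SPARK-36741 comes down to equality semantics. The following plain-Java sketch (not Spark's implementation) contrasts a dedup written with the primitive `==` operator, under which NaN never equals itself, with one based on `Double#equals`, which treats NaN as equal to itself:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Two ways to deduplicate an array of doubles that contains NaN.
public class NanDistinct {
    // == comparison: NaN == NaN is false, so every NaN survives the dedup.
    static List<Double> distinctWithPrimitiveEq(double[] xs) {
        List<Double> out = new ArrayList<>();
        for (double x : xs) {
            boolean seen = false;
            for (double y : out) {
                if (x == y) { seen = true; break; }   // never true for NaN
            }
            if (!seen) out.add(x);
        }
        return out;
    }

    // Double.equals: NaN equals NaN, so duplicated NaNs collapse,
    // which is the behavior array_distinct should have.
    static List<Double> distinctWithEquals(double[] xs) {
        LinkedHashSet<Double> set = new LinkedHashSet<>();
        for (double x : xs) set.add(x);
        return new ArrayList<>(set);
    }

    public static void main(String[] args) {
        double[] xs = { Double.NaN, 1.0, Double.NaN };
        System.out.println("== dedup size:     " + distinctWithPrimitiveEq(xs).size()); // 3
        System.out.println("equals dedup size: " + distinctWithEquals(xs).size());      // 2
    }
}
```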
[jira] [Assigned] (SPARK-36741) array_distinct should not return duplicated NaN
[ https://issues.apache.org/jira/browse/SPARK-36741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-36741: --- Assignee: angerszhu > array_distinct should not return duplicated NaN > --- > > Key: SPARK-36741 > URL: https://issues.apache.org/jira/browse/SPARK-36741 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31646) Remove unused registeredConnections counter from ShuffleMetrics
[ https://issues.apache.org/jira/browse/SPARK-31646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416656#comment-17416656 ] Manu Zhang commented on SPARK-31646: [~yzhangal], Please check this comment [https://github.com/apache/spark/pull/28416#discussion_r418357988] for more background. The counter reverted in this PR was simply never used; the PR just removed some dead code. I didn't mean to use registeredConnections for anything different. It's eventually registered into ShuffleMetrics here. [https://github.com/apache/spark/blob/master/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java#L248] {code:java} blockHandler.getAllMetrics().getMetrics().put("numRegisteredConnections", shuffleServer.getRegisteredConnections()); {code} As I understand it, registeredConnections (and IdleConnections) is monitored at the channel level (TransportChannelHandler) while activeConnections (blockTransferRateBytes, etc.) is monitored at the RPC level (ExternalShuffleBlockHandler). Hence, these metrics are kept in two places. You may register your backloggedConnections in ShuffleMetrics and update it with "registeredConnections - activeConnections" in ShuffleMetrics#getMetrics. Your understanding of executors registering with the Shuffle Service is correct, but I don't see how it's related to your question. > Remove unused registeredConnections counter from ShuffleMetrics > --- > > Key: SPARK-31646 > URL: https://issues.apache.org/jira/browse/SPARK-31646 > Project: Spark > Issue Type: Improvement > Components: Deploy, Shuffle, Spark Core >Affects Versions: 3.0.0 >Reporter: Manu Zhang >Assignee: Manu Zhang >Priority: Minor > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-31646) Remove unused registeredConnections counter from ShuffleMetrics
[ https://issues.apache.org/jira/browse/SPARK-31646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416656#comment-17416656 ] Manu Zhang edited comment on SPARK-31646 at 9/17/21, 12:40 PM: --- [~yzhangal], Please check this comment [https://github.com/apache/spark/pull/28416#discussion_r418357988] for more background. The counter reverted in this PR was simply never used; the PR just removed some dead code. I didn't mean to use registeredConnections for anything different. It's eventually registered into ShuffleMetrics here. [https://github.com/apache/spark/blob/master/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java#L248] {code:java} blockHandler.getAllMetrics().getMetrics().put("numRegisteredConnections", shuffleServer.getRegisteredConnections()); {code} As I understand it, registeredConnections (and IdleConnections) is monitored at the channel level (TransportChannelHandler) while activeConnections (blockTransferRateBytes, etc.) is monitored at the RPC level (ExternalShuffleBlockHandler). Hence, these metrics are kept in two places. You may register your backloggedConnections in ShuffleMetrics and update it with "registeredConnections - activeConnections" in ShuffleMetrics#getMetrics. Your understanding of executors registering with the Shuffle Service is correct, but I don't see how it's related to your question. > Remove unused registeredConnections counter from ShuffleMetrics > --- > > Key: SPARK-31646 > URL: https://issues.apache.org/jira/browse/SPARK-31646 > Project: Spark > Issue Type: Improvement > Components: Deploy, Shuffle, Spark Core >Affects Versions: 3.0.0 >Reporter: Manu Zhang >Assignee: Manu Zhang >Priority: Minor > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
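The derived-gauge suggestion in the comment above can be sketched as follows. The `Gauge` interface and the metric map here are simplified stand-ins for the Codahale/Spark shuffle-service types, not the real classes; only the idea — report backlogged connections as registered minus active — comes from the comment.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Simplified model of ShuffleMetrics#getMetrics with a derived gauge.
public class BackloggedMetric {
    interface Gauge<T> { T getValue(); }   // stand-in for the Codahale Gauge

    static Map<String, Gauge<?>> shuffleMetrics(Supplier<Long> registered,
                                                Supplier<Long> active) {
        Map<String, Gauge<?>> m = new HashMap<>();
        m.put("numRegisteredConnections", (Gauge<Long>) () -> registered.get());
        m.put("numActiveConnections", (Gauge<Long>) () -> active.get());
        // Derived metric: connections accepted at the channel level but not
        // yet active at the RPC level, recomputed on every read of the gauge.
        m.put("numBackloggedConnections",
              (Gauge<Long>) () -> registered.get() - active.get());
        return m;
    }

    public static void main(String[] args) {
        Map<String, Gauge<?>> m = shuffleMetrics(() -> 10L, () -> 7L);
        System.out.println("backlogged = " + m.get("numBackloggedConnections").getValue());
    }
}
```

Because the gauge holds suppliers rather than snapshots, the derived value stays consistent with the two source counters at read time.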
[jira] [Commented] (SPARK-36765) Spark Support for MS Sql JDBC connector with Kerberos/Keytab
[ https://issues.apache.org/jira/browse/SPARK-36765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416635#comment-17416635 ] Jakub Pawlowski commented on SPARK-36765: - As per the JDBC driver documentation, the sqljdbc_auth lib should not be needed and authentication should happen using pure Java libraries. This library was needed only for older versions of the driver. [https://docs.microsoft.com/en-us/sql/connect/jdbc/using-kerberos-integrated-authentication-to-connect-to-sql-server?view=sql-server-ver15] I could make this work from vanilla Java code, but Spark programmatically creates the JAAS configuration, so maybe that's where something gets broken? > Spark Support for MS Sql JDBC connector with Kerberos/Keytab > > > Key: SPARK-36765 > URL: https://issues.apache.org/jira/browse/SPARK-36765 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2 > Environment: Unix Redhat Environment >Reporter: Dilip Thallam Sridhar >Priority: Major > Fix For: 3.1.2 > > > Hi Team, > > We are using the Spark-3.0.2 to connect to MS SqlServer with the following > instruction > Also tried with the Spark-3.1.2 Version, > > 1) download mssql-jdbc-9.4.0.jre8.jar > 2) Generated Keytab using kinit > 3) Validate Keytab using klist > 4) Run the spark job with jdbc_library, principal and keytabs passed > .config("spark.driver.extraClassPath", spark_jar_lib) \ > .config("spark.executor.extraClassPath", spark_jar_lib) \ > 5) connection_url = > "jdbc:sqlserver://{}:{};databaseName={};integratedSecurity=true;authenticationSchema=JavaKerberos"\ > .format(jdbc_host_name, jdbc_port, jdbc_database_name) > Note: without integratedSecurity=true;authenticationSchema=JavaKerberos it > looks for the usual username/password option to connect > 6) passing the following options during spark read. 
> .option("principal", database_principal) \ > .option("files", database_keytab) \ > .option("keytab", database_keytab) \ > > tried with files and keytab, just files, and with all above 3 parameters > > We are unable to connect to SqlServer from Spark and getting the following > error shown below. > > A) Wanted to know if anybody was successful Spark to SqlServer? (as I see > the previous Jira has been closed) > https://issues.apache.org/jira/browse/SPARK-12312 > https://issues.apache.org/jira/browse/SPARK-31337 > > B) If yes, could you let us know if there are any additional configs needed > for Spark to connect to SqlServer please? > Appreciate if we can get inputs to resolve this error. > > > Full Stack Trace > {code} > Caused by: com.microsoft.sqlserver.jdbc.SQLServerException: This driver is > not configured for integrated authentication. at > com.microsoft.sqlserver.jdbc.SQLServerConnection.terminate(SQLServerConnection.java:1352) > at > com.microsoft.sqlserver.jdbc.SQLServerConnection.sendLogon(SQLServerConnection.java:2329) > at > com.microsoft.sqlserver.jdbc.SQLServerConnection.logon(SQLServerConnection.java:1905) > at > com.microsoft.sqlserver.jdbc.SQLServerConnection.access$000(SQLServerConnection.java:41) > at > com.microsoft.sqlserver.jdbc.SQLServerConnection$LogonCommand.doExecute(SQLServerConnection.java:1893) > at > com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:4575) > at > com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:1400) > at > com.microsoft.sqlserver.jdbc.SQLServerConnection.connectHelper(SQLServerConnection.java:1045) > at > com.microsoft.sqlserver.jdbc.SQLServerConnection.login(SQLServerConnection.java:817) > at > com.microsoft.sqlserver.jdbc.SQLServerConnection.connect(SQLServerConnection.java:700) > at > com.microsoft.sqlserver.jdbc.SQLServerDriver.connect(SQLServerDriver.java:842) > at > 
org.apache.spark.sql.execution.datasources.jdbc.connection.BasicConnectionProvider.getConnection(BasicConnectionProvider.scala:49) > at > org.apache.spark.sql.execution.datasources.jdbc.connection.SecureConnectionProvider.getConnection(SecureConnectionProvider.scala:44) > at > org.apache.spark.sql.execution.datasources.jdbc.connection.MSSQLConnectionProvider.org$apache$spark$sql$execution$datasources$jdbc$connection$MSSQLConnectionProvider$$super$getConnection(MSSQLConnectionProvider.scala:69) > at > org.apache.spark.sql.execution.datasources.jdbc.connection.MSSQLConnectionProvider$$anon$1.run(MSSQLConnectionProvider.scala:69) > at > org.apache.spark.sql.execution.datasources.jdbc.connection.MSSQLConnectionProvider$$anon$1.run(MSSQLConnectionProvider.scala:67) > at
[jira] [Assigned] (SPARK-36778) Support ILIKE API on Scala(dataframe)
[ https://issues.apache.org/jira/browse/SPARK-36778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-36778: Assignee: Leona Yoda > Support ILIKE API on Scala(dataframe) > - > > Key: SPARK-36778 > URL: https://issues.apache.org/jira/browse/SPARK-36778 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Leona Yoda >Assignee: Leona Yoda >Priority: Major > > Support Scala(dataframe) API on ILIKE (case-insensitive LIKE) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36778) Support ILIKE API on Scala(dataframe)
[ https://issues.apache.org/jira/browse/SPARK-36778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-36778. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34027 [https://github.com/apache/spark/pull/34027] > Support ILIKE API on Scala(dataframe) > - > > Key: SPARK-36778 > URL: https://issues.apache.org/jira/browse/SPARK-36778 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Leona Yoda >Assignee: Leona Yoda >Priority: Major > Fix For: 3.3.0 > > > Support Scala(dataframe) API on ILIKE (case-insensitive LIKE) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
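ILIKE is the case-insensitive variant of LIKE. The following plain-Java sketch models the matching semantics only; the `likeToRegex` helper is invented for this example and is not Spark's implementation.

```java
import java.util.regex.Pattern;

// Model of SQL ILIKE matching: translate the LIKE pattern to a regex and
// compile it case-insensitively.
public class IlikeDemo {
    static String likeToRegex(String pattern) {
        StringBuilder sb = new StringBuilder();
        for (char c : pattern.toCharArray()) {
            if (c == '%') sb.append(".*");                          // % matches any sequence
            else if (c == '_') sb.append(".");                      // _ matches one character
            else sb.append(Pattern.quote(String.valueOf(c)));       // everything else is literal
        }
        return sb.toString();
    }

    static boolean ilike(String value, String pattern) {
        return Pattern.compile(likeToRegex(pattern), Pattern.CASE_INSENSITIVE)
                      .matcher(value).matches();
    }

    public static void main(String[] args) {
        System.out.println(ilike("Spark SQL", "spark%"));  // true: case folded
        System.out.println(ilike("Spark SQL", "sql%"));    // false: must match from the start
    }
}
```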
[jira] [Assigned] (SPARK-36791) There is a spelling mistake in the running-on-yarn.md file where JHS_POST should be JHS_HOST
[ https://issues.apache.org/jira/browse/SPARK-36791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36791: Assignee: Apache Spark > this is a spelling mistakes in running-on-yarn.md file where JHS_POST should > be JHS_HOST > - > > Key: SPARK-36791 > URL: https://issues.apache.org/jira/browse/SPARK-36791 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.1.2, 3.2.0 >Reporter: qingbo jiao >Assignee: Apache Spark >Priority: Minor > Fix For: 3.1.2, 3.2.0 > > Attachments: error_message.png > > > {code:java} > NOTE: you need to replace and with actual value > the JHS_POST should be JHS_HOST > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36791) There is a spelling mistake in the running-on-yarn.md file where JHS_POST should be JHS_HOST
[ https://issues.apache.org/jira/browse/SPARK-36791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416612#comment-17416612 ] Apache Spark commented on SPARK-36791: -- User 'jiaoqingbo' has created a pull request for this issue: https://github.com/apache/spark/pull/34031 > this is a spelling mistakes in running-on-yarn.md file where JHS_POST should > be JHS_HOST > - > > Key: SPARK-36791 > URL: https://issues.apache.org/jira/browse/SPARK-36791 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.1.2, 3.2.0 >Reporter: qingbo jiao >Priority: Minor > Fix For: 3.1.2, 3.2.0 > > Attachments: error_message.png > > > {code:java} > NOTE: you need to replace and with actual value > the JHS_POST should be JHS_HOST > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36791) There is a spelling mistake in the running-on-yarn.md file where JHS_POST should be JHS_HOST
[ https://issues.apache.org/jira/browse/SPARK-36791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36791: Assignee: (was: Apache Spark) > this is a spelling mistakes in running-on-yarn.md file where JHS_POST should > be JHS_HOST > - > > Key: SPARK-36791 > URL: https://issues.apache.org/jira/browse/SPARK-36791 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.1.2, 3.2.0 >Reporter: qingbo jiao >Priority: Minor > Fix For: 3.1.2, 3.2.0 > > Attachments: error_message.png > > > {code:java} > NOTE: you need to replace and with actual value > the JHS_POST should be JHS_HOST > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36791) There is a spelling mistake in the running-on-yarn.md file where JHS_POST should be JHS_HOST
[ https://issues.apache.org/jira/browse/SPARK-36791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] qingbo jiao updated SPARK-36791: Target Version/s: 3.1.2, 3.2.0 (was: 3.1.2) > this is a spelling mistakes in running-on-yarn.md file where JHS_POST should > be JHS_HOST > - > > Key: SPARK-36791 > URL: https://issues.apache.org/jira/browse/SPARK-36791 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.1.2, 3.2.0 >Reporter: qingbo jiao >Priority: Minor > Fix For: 3.1.2, 3.2.0 > > Attachments: error_message.png > > > {code:java} > NOTE: you need to replace and with actual value > the JHS_POST should be JHS_HOST > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36791) There is a spelling mistake in the running-on-yarn.md file where JHS_POST should be JHS_HOST
[ https://issues.apache.org/jira/browse/SPARK-36791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] qingbo jiao updated SPARK-36791: Description: {code:java} NOTE: you need to replace and with actual value the JHS_POST should be JHS_HOST {code} was: {code:java} NOTE: you need to replace and with actual value the JHS_POST should be JHS_HOST {code} > this is a spelling mistakes in running-on-yarn.md file where JHS_POST should > be JHS_HOST > - > > Key: SPARK-36791 > URL: https://issues.apache.org/jira/browse/SPARK-36791 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.1.2, 3.2.0 >Reporter: qingbo jiao >Priority: Minor > Fix For: 3.1.2 > > Attachments: error_message.png > > > {code:java} > NOTE: you need to replace and with actual value > the JHS_POST should be JHS_HOST > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36791) There is a spelling mistake in the running-on-yarn.md file where JHS_POST should be JHS_HOST
[ https://issues.apache.org/jira/browse/SPARK-36791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] qingbo jiao updated SPARK-36791: Fix Version/s: 3.2.0 > this is a spelling mistakes in running-on-yarn.md file where JHS_POST should > be JHS_HOST > - > > Key: SPARK-36791 > URL: https://issues.apache.org/jira/browse/SPARK-36791 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.1.2, 3.2.0 >Reporter: qingbo jiao >Priority: Minor > Fix For: 3.1.2, 3.2.0 > > Attachments: error_message.png > > > {code:java} > NOTE: you need to replace and with actual value > the JHS_POST should be JHS_HOST > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36791) There is a spelling mistake in the running-on-yarn.md file where JHS_POST should be JHS_HOST
[ https://issues.apache.org/jira/browse/SPARK-36791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] qingbo jiao updated SPARK-36791: Description: {code:java} NOTE: you need to replace and with actual value the JHS_POST should be JHS_HOST {code} was: {code:java} // code placeholder NOTE: you need to replace and with actual value the JHS_POST should be JHS_HOST {code} > this is a spelling mistakes in running-on-yarn.md file where JHS_POST should > be JHS_HOST > - > > Key: SPARK-36791 > URL: https://issues.apache.org/jira/browse/SPARK-36791 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.1.2, 3.2.0 >Reporter: qingbo jiao >Priority: Minor > Fix For: 3.1.2 > > Attachments: error_message.png > > > {code:java} > NOTE: you need to replace and with actual value > the JHS_POST should be JHS_HOST > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36791) There is a spelling mistake in the running-on-yarn.md file where JHS_POST should be JHS_HOST
[ https://issues.apache.org/jira/browse/SPARK-36791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] qingbo jiao updated SPARK-36791: Description: {code:java} // code placeholder NOTE: you need to replace and with actual value the JHS_POST should be JHS_HOST {code} > this is a spelling mistakes in running-on-yarn.md file where JHS_POST should > be JHS_HOST > - > > Key: SPARK-36791 > URL: https://issues.apache.org/jira/browse/SPARK-36791 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.1.2, 3.2.0 >Reporter: qingbo jiao >Priority: Minor > Fix For: 3.1.2 > > Attachments: error_message.png > > > {code:java} > // code placeholder > NOTE: you need to replace and with actual value > the JHS_POST should be JHS_HOST > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36791) There is a spelling mistake in the running-on-yarn.md file where JHS_POST should be JHS_HOST
[ https://issues.apache.org/jira/browse/SPARK-36791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] qingbo jiao updated SPARK-36791: Attachment: error_message.png > this is a spelling mistakes in running-on-yarn.md file where JHS_POST should > be JHS_HOST > - > > Key: SPARK-36791 > URL: https://issues.apache.org/jira/browse/SPARK-36791 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.1.2, 3.2.0 >Reporter: qingbo jiao >Priority: Minor > Fix For: 3.1.2 > > Attachments: error_message.png > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36791) There is a spelling mistake in the running-on-yarn.md file where JHS_POST should be JHS_HOST
[ https://issues.apache.org/jira/browse/SPARK-36791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] qingbo jiao updated SPARK-36791: Attachment: 微信截图_20210917181324.png > this is a spelling mistakes in running-on-yarn.md file where JHS_POST should > be JHS_HOST > - > > Key: SPARK-36791 > URL: https://issues.apache.org/jira/browse/SPARK-36791 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.1.2, 3.2.0 >Reporter: qingbo jiao >Priority: Minor > Fix For: 3.1.2 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36791) There is a spelling mistake in the running-on-yarn.md file where JHS_POST should be JHS_HOST
[ https://issues.apache.org/jira/browse/SPARK-36791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] qingbo jiao updated SPARK-36791: Attachment: (was: 微信截图_20210917181324.png) > this is a spelling mistakes in running-on-yarn.md file where JHS_POST should > be JHS_HOST > - > > Key: SPARK-36791 > URL: https://issues.apache.org/jira/browse/SPARK-36791 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.1.2, 3.2.0 >Reporter: qingbo jiao >Priority: Minor > Fix For: 3.1.2 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36791) There is a spelling mistake in the running-on-yarn.md file where JHS_POST should be JHS_HOST
qingbo jiao created SPARK-36791: --- Summary: this is a spelling mistakes in running-on-yarn.md file where JHS_POST should be JHS_HOST Key: SPARK-36791 URL: https://issues.apache.org/jira/browse/SPARK-36791 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 3.1.2, 3.2.0 Reporter: qingbo jiao Fix For: 3.1.2 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36727) Support sql overwrite a path that is also being read from when partitionOverwriteMode is dynamic
[ https://issues.apache.org/jira/browse/SPARK-36727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tongwei updated SPARK-36727: Priority: Major (was: Minor) > Support sql overwrite a path that is also being read from when > partitionOverwriteMode is dynamic > > > Key: SPARK-36727 > URL: https://issues.apache.org/jira/browse/SPARK-36727 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2 >Reporter: Tongwei >Priority: Major > > {code:java} > // non-partitioned table overwrite > CREATE TABLE tbl (col1 INT, col2 STRING) USING PARQUET; > INSERT OVERWRITE TABLE tbl SELECT 0,1; > INSERT OVERWRITE TABLE tbl SELECT * FROM tbl; > // partitioned table static overwrite > CREATE TABLE tbl (col1 INT, col2 STRING) USING PARQUET PARTITIONED BY (p1 > INT); > INSERT OVERWRITE TABLE tbl PARTITION(p1=2021) SELECT 0 AS col1,1 AS col2; > INSERT OVERWRITE TABLE tbl PARTITION(p1=2021) SELECT col1, col2 FROM tbl WHERE > p1=2021; > {code} > When we run the above queries, an error will be thrown: "Cannot overwrite a > path that is also being read from" > We need to support this operation when > spark.sql.sources.partitionOverwriteMode is dynamic -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
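The difference between the two partitionOverwriteMode settings discussed above can be modeled in a few lines of plain Java. This is a conceptual model of the overwrite semantics only, not Spark internals: static overwrite wipes the matching partitions before writing, while dynamic overwrite replaces only the partitions that actually appear in the incoming data.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model: a table is a map from partition spec to its rows.
public class OverwriteModes {
    static Map<String, List<Integer>> overwrite(Map<String, List<Integer>> table,
                                                Map<String, List<Integer>> incoming,
                                                boolean dynamic) {
        Map<String, List<Integer>> result = new TreeMap<>(table);
        if (!dynamic) {
            result.clear();          // static: drop every existing partition first
        }
        result.putAll(incoming);     // dynamic: only written partitions are replaced
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> table =
            new TreeMap<>(Map.of("p1=2020", List.of(1, 2), "p1=2021", List.of(3)));
        Map<String, List<Integer>> incoming = Map.of("p1=2021", List.of(9));
        System.out.println("static : " + overwrite(table, incoming, false)); // only p1=2021 remains
        System.out.println("dynamic: " + overwrite(table, incoming, true));  // p1=2020 untouched
    }
}
```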
[jira] [Resolved] (SPARK-36789) use the correct constant type as the null value holder in array functions
[ https://issues.apache.org/jira/browse/SPARK-36789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-36789. - Fix Version/s: 3.0.4 3.1.3 3.2.0 Resolution: Fixed > use the correct constant type as the null value holder in array functions > - > > Key: SPARK-36789 > URL: https://issues.apache.org/jira/browse/SPARK-36789 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.2.0, 3.1.3, 3.0.4 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36765) Spark Support for MS Sql JDBC connector with Kerberos/Keytab
[ https://issues.apache.org/jira/browse/SPARK-36765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416519#comment-17416519 ] Gabor Somogyi commented on SPARK-36765: --- It was a long time ago that I did this, and AFAIR it took me almost a month to make it work, so it is definitely a horror task! My memory is cloudy because it was not yesterday, but I remember something like this: the exception generally indicates that the driver cannot find the appropriate sqljdbc_auth lib on the JVM library path. To correct the problem, one can use the java -D option to specify the "java.library.path" system property value. Worth mentioning that the full path must be set, otherwise it was not working. All in all, I faced at least 5-6 different issues which were extremely hard to address. Hope others need less time to solve them. > Spark Support for MS Sql JDBC connector with Kerberos/Keytab > > > Key: SPARK-36765 > URL: https://issues.apache.org/jira/browse/SPARK-36765 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2 > Environment: Unix Redhat Environment >Reporter: Dilip Thallam Sridhar >Priority: Major > Fix For: 3.1.2 > > > Hi Team, > > We are using the Spark-3.0.2 to connect to MS SqlServer with the following > instruction > Also tried with the Spark-3.1.2 Version, > > 1) download mssql-jdbc-9.4.0.jre8.jar > 2) Generated Keytab using kinit > 3) Validate Keytab using klist > 4) Run the spark job with jdbc_library, principal and keytabs passed > .config("spark.driver.extraClassPath", spark_jar_lib) \ > .config("spark.executor.extraClassPath", spark_jar_lib) \ > 5) connection_url = > "jdbc:sqlserver://{}:{};databaseName={};integratedSecurity=true;authenticationSchema=JavaKerberos"\ > .format(jdbc_host_name, jdbc_port, jdbc_database_name) > Note: without integratedSecurity=true;authenticationSchema=JavaKerberos it > looks for the usual username/password option to connect > 6) passing the following options during spark 
read. > .option("principal", database_principal) \ > .option("files", database_keytab) \ > .option("keytab", database_keytab) \ > > tried with files and keytab, just files, and with all above 3 parameters > > We are unable to connect to SqlServer from Spark and getting the following > error shown below. > > A) Wanted to know if anybody was successful Spark to SqlServer? (as I see > the previous Jira has been closed) > https://issues.apache.org/jira/browse/SPARK-12312 > https://issues.apache.org/jira/browse/SPARK-31337 > > B) If yes, could you let us know if there are any additional configs needed > for Spark to connect to SqlServer please? > Appreciate if we can get inputs to resolve this error. > > > Full Stack Trace > {code} > Caused by: com.microsoft.sqlserver.jdbc.SQLServerException: This driver is > not configured for integrated authentication. at > com.microsoft.sqlserver.jdbc.SQLServerConnection.terminate(SQLServerConnection.java:1352) > at > com.microsoft.sqlserver.jdbc.SQLServerConnection.sendLogon(SQLServerConnection.java:2329) > at > com.microsoft.sqlserver.jdbc.SQLServerConnection.logon(SQLServerConnection.java:1905) > at > com.microsoft.sqlserver.jdbc.SQLServerConnection.access$000(SQLServerConnection.java:41) > at > com.microsoft.sqlserver.jdbc.SQLServerConnection$LogonCommand.doExecute(SQLServerConnection.java:1893) > at > com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:4575) > at > com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:1400) > at > com.microsoft.sqlserver.jdbc.SQLServerConnection.connectHelper(SQLServerConnection.java:1045) > at > com.microsoft.sqlserver.jdbc.SQLServerConnection.login(SQLServerConnection.java:817) > at > com.microsoft.sqlserver.jdbc.SQLServerConnection.connect(SQLServerConnection.java:700) > at > com.microsoft.sqlserver.jdbc.SQLServerDriver.connect(SQLServerDriver.java:842) > at > 
org.apache.spark.sql.execution.datasources.jdbc.connection.BasicConnectionProvider.getConnection(BasicConnectionProvider.scala:49) > at > org.apache.spark.sql.execution.datasources.jdbc.connection.SecureConnectionProvider.getConnection(SecureConnectionProvider.scala:44) > at > org.apache.spark.sql.execution.datasources.jdbc.connection.MSSQLConnectionProvider.org$apache$spark$sql$execution$datasources$jdbc$connection$MSSQLConnectionProvider$$super$getConnection(MSSQLConnectionProvider.scala:69) > at > org.apache.spark.sql.execution.datasources.jdbc.connection.MSSQLConnectionProvider$$anon$1.run(MSSQLConnectionProvider.scala:69) > at >
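One detail worth checking in the report above: the Microsoft JDBC driver's documented connection property is spelled authenticationScheme, while the reporter's URL uses authenticationSchema, which the driver would silently ignore and fall back to non-integrated authentication. The sketch below is a minimal, hedged illustration of building the connection URL with the documented spelling; the host, port, and database values are hypothetical, and this does not claim to reproduce or resolve the reporter's full Kerberos setup (keytab, principal, and sqljdbc_auth library path still matter).

```python
def build_mssql_kerberos_url(host: str, port: int, database: str) -> str:
    """Build a SQL Server JDBC URL for Kerberos (JavaKerberos) authentication.

    Note: the Microsoft JDBC driver's documented property name is
    'authenticationScheme'; the 'authenticationSchema' spelling seen in the
    report above would be ignored by the driver.
    """
    return (
        "jdbc:sqlserver://{}:{};databaseName={};"
        "integratedSecurity=true;authenticationScheme=JavaKerberos"
    ).format(host, port, database)

# Hypothetical values for illustration only:
url = build_mssql_kerberos_url("dbhost.example.com", 1433, "sales")
print(url)
# -> jdbc:sqlserver://dbhost.example.com:1433;databaseName=sales;integratedSecurity=true;authenticationScheme=JavaKerberos
```

In a Spark job this URL would then be passed as the `url` option of `spark.read.format("jdbc")`, alongside the `keytab` and `principal` options the reporter already uses.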
[jira] [Commented] (SPARK-36790) Update user-facing catalog to adapt CatalogPlugin
[ https://issues.apache.org/jira/browse/SPARK-36790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416490#comment-17416490 ] Apache Spark commented on SPARK-36790: -- User 'Peng-Lei' has created a pull request for this issue: https://github.com/apache/spark/pull/34030 > Update user-facing catalog to adapt CatalogPlugin > - > > Key: SPARK-36790 > URL: https://issues.apache.org/jira/browse/SPARK-36790 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.3.0 >Reporter: PengLei >Priority: Minor > Fix For: 3.3.0 > > > Currently, SparkSession.catalog always returns a CatalogImpl backed by a > SessionCatalog, namely SparkSession.sessionState.catalog > {code:java} > @transient lazy val catalog: Catalog = new CatalogImpl(self) > {code} > {code:java} > private def sessionCatalog: SessionCatalog = sparkSession.sessionState.catalog > {code} > So actions can only be performed against the SessionCatalog; they cannot be > performed against a user-defined CatalogPlugin. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36790) Update user-facing catalog to adapt CatalogPlugin
[ https://issues.apache.org/jira/browse/SPARK-36790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36790: Assignee: Apache Spark > Update user-facing catalog to adapt CatalogPlugin > - > > Key: SPARK-36790 > URL: https://issues.apache.org/jira/browse/SPARK-36790 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.3.0 >Reporter: PengLei >Assignee: Apache Spark >Priority: Minor > Fix For: 3.3.0 > > > Currently, SparkSession.catalog always returns a CatalogImpl backed by a > SessionCatalog, namely SparkSession.sessionState.catalog > {code:java} > @transient lazy val catalog: Catalog = new CatalogImpl(self) > {code} > {code:java} > private def sessionCatalog: SessionCatalog = sparkSession.sessionState.catalog > {code} > So actions can only be performed against the SessionCatalog; they cannot be > performed against a user-defined CatalogPlugin. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36790) Update user-facing catalog to adapt CatalogPlugin
[ https://issues.apache.org/jira/browse/SPARK-36790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36790: Assignee: (was: Apache Spark) > Update user-facing catalog to adapt CatalogPlugin > - > > Key: SPARK-36790 > URL: https://issues.apache.org/jira/browse/SPARK-36790 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.3.0 >Reporter: PengLei >Priority: Minor > Fix For: 3.3.0 > > > Currently, SparkSession.catalog always returns a CatalogImpl backed by a > SessionCatalog, namely SparkSession.sessionState.catalog > {code:java} > @transient lazy val catalog: Catalog = new CatalogImpl(self) > {code} > {code:java} > private def sessionCatalog: SessionCatalog = sparkSession.sessionState.catalog > {code} > So actions can only be performed against the SessionCatalog; they cannot be > performed against a user-defined CatalogPlugin. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32709) Write Hive ORC/Parquet bucketed table with hivehash (for Hive 1,2)
[ https://issues.apache.org/jira/browse/SPARK-32709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-32709. - Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 33432 [https://github.com/apache/spark/pull/33432] > Write Hive ORC/Parquet bucketed table with hivehash (for Hive 1,2) > -- > > Key: SPARK-32709 > URL: https://issues.apache.org/jira/browse/SPARK-32709 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Cheng Su >Assignee: Cheng Su >Priority: Minor > Fix For: 3.3.0 > > Attachments: 91275701_stage6_metrics.png > > > The Hive ORC/Parquet write code path is the same as the data source v1 code > path (FileFormatWriter). This JIRA is to add support for writing Hive > ORC/Parquet bucketed tables with hivehash. The change is to customize > `bucketIdExpression` to use hivehash when the table is a Hive bucketed table > and the Hive version is 1.x.y or 2.x.y. > > This will allow us to write Hive/Presto-compatible bucketed tables for Hive 1 > and 2. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32709) Write Hive ORC/Parquet bucketed table with hivehash (for Hive 1,2)
[ https://issues.apache.org/jira/browse/SPARK-32709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-32709: --- Assignee: Cheng Su > Write Hive ORC/Parquet bucketed table with hivehash (for Hive 1,2) > -- > > Key: SPARK-32709 > URL: https://issues.apache.org/jira/browse/SPARK-32709 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Cheng Su >Assignee: Cheng Su >Priority: Minor > Attachments: 91275701_stage6_metrics.png > > > The Hive ORC/Parquet write code path is the same as the data source v1 code > path (FileFormatWriter). This JIRA is to add support for writing Hive > ORC/Parquet bucketed tables with hivehash. The change is to customize > `bucketIdExpression` to use hivehash when the table is a Hive bucketed table > and the Hive version is 1.x.y or 2.x.y. > > This will allow us to write Hive/Presto-compatible bucketed tables for Hive 1 > and 2. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
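The SPARK-32709 change above hinges on Hive's bucket-id convention rather than Spark's Murmur3-based one. As a rough, simplified sketch (not Spark's actual HiveHash implementation), Hive assigns a row to a bucket by hashing the key, masking off the sign bit, and taking the result modulo the bucket count; the hash of a 32-bit int is the value itself, and for strings it is a Java-style 31-multiplier polynomial over the UTF-8 bytes. All function names below are illustrative, not Spark or Hive API.

```python
def hive_hash_int(v: int) -> int:
    # Hive's hash of a 32-bit int is the value itself.
    return v

def hive_hash_string(s: str) -> int:
    # 31-multiplier polynomial over the UTF-8 bytes, wrapped to signed
    # 32 bits to mirror Java integer overflow semantics.
    h = 0
    for b in s.encode("utf-8"):
        h = (h * 31 + b) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h

def bucket_id(hash_value: int, num_buckets: int) -> int:
    # Mask the sign bit so negative hashes still map to a valid bucket,
    # then take the value modulo the bucket count.
    return (hash_value & 0x7FFFFFFF) % num_buckets

print(bucket_id(hive_hash_int(5), 8))   # 5
print(bucket_id(hive_hash_int(-3), 8))  # 5
```

A writer following this convention places each row in the file for its computed bucket id, which is what lets Hive 1/2 and Presto read the buckets back correctly.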
[jira] [Updated] (SPARK-36790) Update user-facing catalog to adapt CatalogPlugin
[ https://issues.apache.org/jira/browse/SPARK-36790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] PengLei updated SPARK-36790: Description: Currently, SparkSession.catalog always returns a CatalogImpl backed by a SessionCatalog, namely SparkSession.sessionState.catalog {code:java} @transient lazy val catalog: Catalog = new CatalogImpl(self) {code} {code:java} private def sessionCatalog: SessionCatalog = sparkSession.sessionState.catalog {code} So actions can only be performed against the SessionCatalog; they cannot be performed against a user-defined CatalogPlugin. > Update user-facing catalog to adapt CatalogPlugin > - > > Key: SPARK-36790 > URL: https://issues.apache.org/jira/browse/SPARK-36790 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.3.0 >Reporter: PengLei >Priority: Minor > Fix For: 3.3.0 > > > Currently, SparkSession.catalog always returns a CatalogImpl backed by a > SessionCatalog, namely SparkSession.sessionState.catalog > {code:java} > @transient lazy val catalog: Catalog = new CatalogImpl(self) > {code} > {code:java} > private def sessionCatalog: SessionCatalog = sparkSession.sessionState.catalog > {code} > So actions can only be performed against the SessionCatalog; they cannot be > performed against a user-defined CatalogPlugin. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
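The limitation PengLei describes can be pictured with a small toy model. This is not Spark's real API — the class names echo Spark's but the code is a hypothetical Python sketch — it only shows why a user-facing catalog hard-wired to one session catalog can never surface objects from a registered plugin catalog.

```python
# Toy model of the SPARK-36790 limitation: the user-facing catalog
# delegates to one fixed session catalog, so a registered plugin
# catalog is never consulted.

class SessionCatalog:
    def list_tables(self):
        return ["default.t1"]

class MyCatalogPlugin:
    def list_tables(self):
        return ["custom.t2"]

class CatalogImpl:
    """Mirrors the hard-wired delegation: always the session catalog."""
    def __init__(self, session_catalog):
        self._session_catalog = session_catalog

    def list_tables(self):
        # No lookup of user-registered plugin catalogs happens here.
        return self._session_catalog.list_tables()

registered_plugins = {"custom": MyCatalogPlugin()}  # registered but unreachable
catalog = CatalogImpl(SessionCatalog())
print(catalog.list_tables())  # ['default.t1'] -- the plugin's tables are invisible
```

The proposed fix is to make the user-facing catalog resolve names through the CatalogPlugin mechanism instead of delegating only to the session catalog.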
[jira] [Created] (SPARK-36790) Update user-facing catalog to adapt CatalogPlugin
PengLei created SPARK-36790: --- Summary: Update user-facing catalog to adapt CatalogPlugin Key: SPARK-36790 URL: https://issues.apache.org/jira/browse/SPARK-36790 Project: Spark Issue Type: Task Components: SQL Affects Versions: 3.3.0 Reporter: PengLei Fix For: 3.3.0 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36789) use the correct constant type as the null value holder in array functions
[ https://issues.apache.org/jira/browse/SPARK-36789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416476#comment-17416476 ] Apache Spark commented on SPARK-36789: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/34029 > use the correct constant type as the null value holder in array functions > - > > Key: SPARK-36789 > URL: https://issues.apache.org/jira/browse/SPARK-36789 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36789) use the correct constant type as the null value holder in array functions
[ https://issues.apache.org/jira/browse/SPARK-36789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36789: Assignee: Wenchen Fan (was: Apache Spark) > use the correct constant type as the null value holder in array functions > - > > Key: SPARK-36789 > URL: https://issues.apache.org/jira/browse/SPARK-36789 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36789) use the correct constant type as the null value holder in array functions
[ https://issues.apache.org/jira/browse/SPARK-36789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36789: Assignee: Apache Spark (was: Wenchen Fan) > use the correct constant type as the null value holder in array functions > - > > Key: SPARK-36789 > URL: https://issues.apache.org/jira/browse/SPARK-36789 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36789) use the correct constant type as the null value holder in array functions
Wenchen Fan created SPARK-36789: --- Summary: use the correct constant type as the null value holder in array functions Key: SPARK-36789 URL: https://issues.apache.org/jira/browse/SPARK-36789 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org