[jira] [Updated] (SPARK-32623) Set the downloaded artifact location to explicitly glob test reports
[ https://issues.apache.org/jira/browse/SPARK-32623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-32623:
---------------------------------
    Attachment: Screen Shot 2020-08-14 at 3.15.47 PM.png

> Set the downloaded artifact location to explicitly glob test reports
> --------------------------------------------------------------------
>
>                 Key: SPARK-32623
>                 URL: https://issues.apache.org/jira/browse/SPARK-32623
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Project Infra
>    Affects Versions: 3.1.0
>            Reporter: Hyukjin Kwon
>            Priority: Major
>         Attachments: Screen Shot 2020-08-14 at 3.15.47 PM.png
>
> After SPARK-32357, GitHub Actions now reports the test results. However, it
> seems some tests are not reported correctly; I attached an image.
> When tests are run with tags to exclude or include them, empty JUnit test
> reports are still generated. So, when tests are split by tags, an empty
> report can overwrite the earlier proper report.
> We can avoid this problem by explicitly setting the downloaded artifact
> directory.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32623) Set the downloaded artifact location to explicitly glob test reports
[ https://issues.apache.org/jira/browse/SPARK-32623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-32623:
---------------------------------
    Summary: Set the downloaded artifact location to explicitly glob test reports  (was: Set the downloaded artifact location to explicitly glob test reportss)
[jira] [Created] (SPARK-32623) Set the downloaded artifact location to explicitly glob test reportss
Hyukjin Kwon created SPARK-32623:
------------------------------------

             Summary: Set the downloaded artifact location to explicitly glob test reportss
                 Key: SPARK-32623
                 URL: https://issues.apache.org/jira/browse/SPARK-32623
             Project: Spark
          Issue Type: Sub-task
          Components: Project Infra
    Affects Versions: 3.1.0
            Reporter: Hyukjin Kwon
[jira] [Commented] (SPARK-32606) Remove the fork of action-download-artifact in test_report.yml
[ https://issues.apache.org/jira/browse/SPARK-32606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178162#comment-17178162 ]

Hyukjin Kwon commented on SPARK-32606:
--------------------------------------

Here I raised the PR: https://github.com/dawidd6/action-download-artifact/pull/24

> Remove the fork of action-download-artifact in test_report.yml
> --------------------------------------------------------------
>
>                 Key: SPARK-32606
>                 URL: https://issues.apache.org/jira/browse/SPARK-32606
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Project Infra
>    Affects Versions: 3.1.0
>            Reporter: Hyukjin Kwon
>            Priority: Major
>
> It was forked to have a custom fix
> https://github.com/HyukjinKwon/action-download-artifact/commit/750b71af351aba467757d7be6924199bb08db4ed
> in order to add support for downloading all artifacts. It should be
> contributed back to the original plugin to avoid using the fork.
> Alternatively, we can use the official actions/download-artifact once it
> supports downloading artifacts between different workflows; see also
> https://github.com/actions/download-artifact/issues/3
[jira] [Commented] (SPARK-32605) Remove the fork of action-surefire-report in test_report.yml
[ https://issues.apache.org/jira/browse/SPARK-32605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178160#comment-17178160 ]

Hyukjin Kwon commented on SPARK-32605:
--------------------------------------

Here I raised the PR: https://github.com/ScaCap/action-surefire-report/pull/14

> Remove the fork of action-surefire-report in test_report.yml
> ------------------------------------------------------------
>
>                 Key: SPARK-32605
>                 URL: https://issues.apache.org/jira/browse/SPARK-32605
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Project Infra
>    Affects Versions: 3.1.0
>            Reporter: Hyukjin Kwon
>            Priority: Major
>
> It was forked to have a custom fix
> https://github.com/HyukjinKwon/action-surefire-report/commit/c96094cc35061fcf154a7cb46807f2f3e2339476
> in order to add support for a custom target commit SHA. It should be
> contributed back to the original plugin to avoid using the fork.
[jira] [Commented] (SPARK-32622) Add case-sensitivity test for ORC predicate pushdown
[ https://issues.apache.org/jira/browse/SPARK-32622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178157#comment-17178157 ]

Apache Spark commented on SPARK-32622:
--------------------------------------

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/29427

> Add case-sensitivity test for ORC predicate pushdown
> ----------------------------------------------------
>
>                 Key: SPARK-32622
>                 URL: https://issues.apache.org/jira/browse/SPARK-32622
>             Project: Spark
>          Issue Type: Test
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: L. C. Hsieh
>            Assignee: L. C. Hsieh
>            Priority: Major
>
> We should add a case-sensitivity test for ORC predicate pushdown to increase
> its test coverage.
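[Editor's aside] A sketch of what a case-sensitivity test for ORC reads could look like. This is illustrative only, not the test added by the linked PR; the object name and temp path are made up, and the assertions reflect the expected default behavior of `spark.sql.caseSensitive`:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative sketch: write an ORC file with a mixed-case column, then query
// it with a differently-cased column name under both case-sensitivity modes.
object OrcCaseSensitivitySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    val path = java.nio.file.Files.createTempDirectory("orc-pushdown").toString
    // Physical column name in the ORC file is "aB".
    spark.range(10).toDF("aB").write.mode("overwrite").orc(path)

    // Case-insensitive (the default): a filter on "AB" resolves to the
    // physical column "aB", so the predicate can be evaluated (and pushed down).
    spark.conf.set("spark.sql.caseSensitive", "false")
    assert(spark.read.orc(path).where("AB < 5").count() == 5)

    // Case-sensitive: "AB" should no longer resolve, so the query fails
    // during analysis instead of silently filtering on the wrong column.
    spark.conf.set("spark.sql.caseSensitive", "true")
    val failed = scala.util.Try(spark.read.orc(path).where("AB < 5").count()).isFailure
    assert(failed)

    spark.stop()
  }
}
```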
[jira] [Assigned] (SPARK-32622) Add case-sensitivity test for ORC predicate pushdown
[ https://issues.apache.org/jira/browse/SPARK-32622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-32622:
------------------------------------

    Assignee: L. C. Hsieh  (was: Apache Spark)
[jira] [Assigned] (SPARK-32622) Add case-sensitivity test for ORC predicate pushdown
[ https://issues.apache.org/jira/browse/SPARK-32622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-32622:
------------------------------------

    Assignee: Apache Spark  (was: L. C. Hsieh)
[jira] [Created] (SPARK-32622) Add case-sensitivity test for ORC predicate pushdown
L. C. Hsieh created SPARK-32622:
-----------------------------------

             Summary: Add case-sensitivity test for ORC predicate pushdown
                 Key: SPARK-32622
                 URL: https://issues.apache.org/jira/browse/SPARK-32622
             Project: Spark
          Issue Type: Test
          Components: SQL
    Affects Versions: 3.1.0
            Reporter: L. C. Hsieh
            Assignee: L. C. Hsieh
[jira] [Updated] (SPARK-32542) add a batch for optimizing logicalPlan
[ https://issues.apache.org/jira/browse/SPARK-32542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

karl wang updated SPARK-32542:
------------------------------
    Description:
Split an Expand into several small Expands, each containing the specified number of projections. For instance, take this SQL: select a, b, c, d, count(1) from table1 group by a, b, c, d with cube. It can expand the data to 2^4 times its original size. If we specify spark.sql.optimizer.projections.size=4, the Expand will be split into 2^4/4 small Expands. This can reduce the shuffle pressure and improve performance in multidimensional analysis when data is huge.

{code:scala}
runBenchmark("cube multianalysis agg") {
  val N = 20 << 20
  val benchmark = new Benchmark("cube multianalysis agg", N, output = output)

  def f(): Unit = {
    val df = spark.range(N).cache()
    df.selectExpr(
      "id",
      "(id & 1023) as k1",
      "cast(id & 1023 as string) as k2",
      "cast(id & 1023 as int) as k3",
      "cast(id & 1023 as double) as k4",
      "cast(id & 1023 as float) as k5")
      .cube("k1", "k2", "k3", "k4", "k5")
      .sum()
      .noop()
    df.unpersist()
  }

  benchmark.addCase("grouping = F") { _ =>
    withSQLConf(
      SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key -> "true",
      SQLConf.GROUPING_WITH_UNION.key -> "false") {
      f()
    }
  }

  benchmark.addCase("grouping = T projectionSize = 16") { _ =>
    withSQLConf(
      SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key -> "true",
      SQLConf.GROUPING_WITH_UNION.key -> "true",
      SQLConf.GROUPING_EXPAND_PROJECTIONS.key -> "16") {
      f()
    }
  }

  benchmark.addCase("grouping = T projectionSize = 8") { _ =>
    withSQLConf(
      SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key -> "true",
      SQLConf.GROUPING_WITH_UNION.key -> "true",
      SQLConf.GROUPING_EXPAND_PROJECTIONS.key -> "8") {
      f()
    }
  }

  benchmark.run()
}
{code}

Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Mac OS X 10.15
Intel(R) Core(TM) i5-7267U CPU @ 3.10GHz

Running benchmark: cube multianalysis agg : cube 5 fields k1, k2, k3, k4, k5

cube multianalysis agg:            Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
grouping = F                               54329         54931        852        0.4       2590.6      1.0X
grouping = T projectionSize = 16           44584         44781        278        0.5       2125.9      1.2X
grouping = T projectionSize = 8            42764         43272        718        0.5       2039.1      1.3X

Running benchmark: cube multianalysis agg : cube 6 fields k1, k2, k3, k4, k5, k6

cube multianalysis agg:            Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
grouping = F                              141607        143424       2569        0.1       6752.4      1.0X
grouping = T projectionSize = 32          109465        109603        196        0.2       5219.7      1.3X
grouping = T projectionSize = 16           99752        100411        933        0.2       4756.5      1.4X

Running benchmark: cube multianalysis agg : cube 7 fields k1, k2, k3, k4, k5, k6, k7
Running case: GROUPING_WITH_UNION off
Running case: GROUPING_WITH_UNION on and GROUPING_EXPAND_PROJECTIONS = 64
Running case: GROUPING_WITH_UNION on and GROUPING_EXPAND_PROJECTIONS = 32
Running case: GROUPING_WITH_UNION on and GROUPING_EXPAND_PROJECTIONS = 16
Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Mac OS X 10.15
Intel(R) Core(TM) i5-7267U CPU @ 3.10GHz

cube multianalysis agg:
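[Editor's aside] The projection arithmetic in the description above can be sketched in a few lines (the object and method names below are hypothetical, for illustration only): a CUBE over n columns produces 2^n projections in a single Expand, and capping each Expand at spark.sql.optimizer.projections.size projections splits it into ceil(2^n / size) smaller Expands.

```scala
// Hypothetical helper illustrating the Expand-split arithmetic.
object ExpandSplitArithmetic {
  // A CUBE over `n` grouping columns generates 2^n projections.
  def cubeProjections(n: Int): Int = 1 << n

  // Number of small Expands when each may hold at most `size` projections.
  def numExpands(n: Int, size: Int): Int = {
    val total = cubeProjections(n)
    (total + size - 1) / size // ceiling division
  }

  def main(args: Array[String]): Unit = {
    println(numExpands(4, 4)) // cube over a, b, c, d with size 4 -> 4 Expands
    println(numExpands(5, 8)) // the 5-field benchmark with size 8 -> 4 Expands
  }
}
```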
[jira] [Updated] (SPARK-32542) Add an optimizer rule to split an Expand into multiple Expands for aggregates
[ https://issues.apache.org/jira/browse/SPARK-32542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

karl wang updated SPARK-32542:
------------------------------
    Summary: Add an optimizer rule to split an Expand into multiple Expands for aggregates  (was: add a batch for optimizing logicalPlan)

> Add an optimizer rule to split an Expand into multiple Expands for aggregates
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-32542
>                 URL: https://issues.apache.org/jira/browse/SPARK-32542
>             Project: Spark
>          Issue Type: Improvement
>          Components: Optimizer
>    Affects Versions: 3.0.0
>            Reporter: karl wang
>            Priority: Major
>             Fix For: 3.0.0
[jira] [Updated] (SPARK-32542) add a batch for optimizing logicalPlan
[ https://issues.apache.org/jira/browse/SPARK-32542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

karl wang updated SPARK-32542:
------------------------------
    Description: Split an Expand into several small Expands, each containing the specified number of projections. For instance, take this SQL: select a, b, c, d, count(1) from table1 group by a, b, c, d with cube. It can expand the data to 2^4 times its original size. If we specify spark.sql.optimizer.projections.size=4, the Expand will be split into 2^4/4 small Expands. This can reduce the shuffle pressure and improve performance in multidimensional analysis when data is huge.

Running benchmark: cube multianalysis agg : cube 7 fields k1, k2, k3, k4, k5, k6, k7
Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Mac OS X 10.15
Intel(R) Core(TM) i5-7267U CPU @ 3.10GHz

cube multianalysis agg:            Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
grouping = F                              516941        519658        NaN        0.0      24649.7      1.0X
grouping = T projectionSize = 64          267170        267547        533        0.1      12739.6      1.9X
grouping = T project
[jira] [Updated] (SPARK-32542) add a batch for optimizing logicalPlan
[ https://issues.apache.org/jira/browse/SPARK-32542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

karl wang updated SPARK-32542:
------------------------------
    Priority: Major  (was: Minor)
[jira] [Commented] (SPARK-32613) DecommissionWorkerSuite has started failing sporadically again
[ https://issues.apache.org/jira/browse/SPARK-32613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178148#comment-17178148 ]

Rohit Mishra commented on SPARK-32613:
--------------------------------------

[~dagrawal3409], please avoid populating "Fix Version/s", as it is reserved for committers.

> DecommissionWorkerSuite has started failing sporadically again
> --------------------------------------------------------------
>
>                 Key: SPARK-32613
>                 URL: https://issues.apache.org/jira/browse/SPARK-32613
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.1.0
>            Reporter: Devesh Agrawal
>            Priority: Major
>
> The test "decommission workers ensure that fetch failures lead to rerun" is
> failing:
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127357/testReport/org.apache.spark.deploy/DecommissionWorkerSuite/decommission_workers_ensure_that_fetch_failures_lead_to_rerun/
> https://github.com/apache/spark/pull/29367/checks?check_run_id=972990200#step:14:13579
[jira] [Updated] (SPARK-32613) DecommissionWorkerSuite has started failing sporadically again
[ https://issues.apache.org/jira/browse/SPARK-32613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rohit Mishra updated SPARK-32613:
---------------------------------
    Fix Version/s:     (was: 3.1.0)
[jira] [Commented] (SPARK-32619) converting dataframe to dataset for the json schema
[ https://issues.apache.org/jira/browse/SPARK-32619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178147#comment-17178147 ]

Rohit Mishra commented on SPARK-32619:
--------------------------------------

[~Manjay7869], can you please add the environment details and reproducible steps? Please also rewrite the description using code blocks wherever you have tried to explain a code snippet. If you are not sure this is a bug, please try Stack Overflow and the user mailing list to get an answer and probably a solution. Please read the guidelines for future reference: https://spark.apache.org/community.html

> converting dataframe to dataset for the json schema
> ---------------------------------------------------
>
>                 Key: SPARK-32619
>                 URL: https://issues.apache.org/jira/browse/SPARK-32619
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.3
>            Reporter: Manjay Kumar
>            Priority: Minor
>
> I have a schema:
>
> {
>   Details: [{
>     phone: "98977999",
>     contacts: [{
>       name: "manjay"
>       -- has missing street block in json
>     }]
>   }]
> }
>
> and case classes based on the schema:
>
> case class Details(
>   phone: String,
>   contacts: Array[Adress]
> )
>
> case class Adress(
>   name: String,
>   street: String
> )
>
> Calling dataframe.as[Details] throws: No such struct field street - AnalysisException.
>
> Is this a bug? Or is there a resolution for this?
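[Editor's aside] Not an official answer to the report above, but the usual resolution: Spark's encoder requires every case-class field to resolve against the schema, and a field that is entirely absent from the inferred JSON schema fails analysis. Declaring the possibly-missing field as an Option and supplying the case-class schema on read makes the missing field come back as null, which decodes to None. A minimal sketch under those assumptions, reusing the reporter's class names (the `Root` wrapper is hypothetical):

```scala
import org.apache.spark.sql.{Encoders, SparkSession}

// The reporter's classes, with the possibly-missing field as an Option.
case class Adress(name: String, street: Option[String])
case class Details(phone: String, contacts: Array[Adress])
case class Root(Details: Array[Details]) // hypothetical top-level wrapper

object MissingFieldSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // JSON input where `street` is absent from the contact object.
    val json = Seq("""{"Details":[{"phone":"98977999","contacts":[{"name":"manjay"}]}]}""")

    // Supply the schema derived from the case classes instead of inferring it:
    // the absent `street` column is then present in the schema and read as
    // null, which the Option[String] field decodes as None.
    val ds = spark.read
      .schema(Encoders.product[Root].schema)
      .json(json.toDS())
      .as[Root]

    ds.collect().foreach(r => println(r.Details.head.contacts.head.street)) // prints None
    spark.stop()
  }
}
```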
[jira] [Assigned] (SPARK-32621) "path" option is added again to input paths during infer()
[ https://issues.apache.org/jira/browse/SPARK-32621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-32621:
------------------------------------

    Assignee: Apache Spark

> "path" option is added again to input paths during infer()
> ----------------------------------------------------------
>
>                 Key: SPARK-32621
>                 URL: https://issues.apache.org/jira/browse/SPARK-32621
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.6, 3.0.0, 3.0.1, 3.1.0
>            Reporter: Terry Kim
>            Assignee: Apache Spark
>            Priority: Minor
>
> When the "path" option is used when creating a DataFrame, it can cause
> issues during infer.
> {code:java}
> class TestFileFilter extends PathFilter {
>   override def accept(path: Path): Boolean = path.getParent.getName != "p=2"
> }
>
> val path = "/tmp"
> val df = spark.range(2)
> df.write.json(path + "/p=1")
> df.write.json(path + "/p=2")
>
> val extraOptions = Map(
>   "mapred.input.pathFilter.class" -> classOf[TestFileFilter].getName,
>   "mapreduce.input.pathFilter.class" -> classOf[TestFileFilter].getName
> )
>
> // This works fine.
> assert(spark.read.options(extraOptions).json(path).count == 2)
>
> // The following with the "path" option fails with:
> //   assertion failed: Conflicting directory structures detected. Suspicious paths
> //     file:/tmp
> //     file:/tmp/p=1
> assert(spark.read.options(extraOptions).format("json").option("path", path).load.count() === 2)
> {code}
[jira] [Commented] (SPARK-32621) "path" option is added again to input paths during infer()
[ https://issues.apache.org/jira/browse/SPARK-32621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178125#comment-17178125 ]

Apache Spark commented on SPARK-32621:
--------------------------------------

User 'imback82' has created a pull request for this issue:
https://github.com/apache/spark/pull/29437
[jira] [Assigned] (SPARK-32621) "path" option is added again to input paths during infer()
[ https://issues.apache.org/jira/browse/SPARK-32621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-32621:
------------------------------------

    Assignee:     (was: Apache Spark)
[jira] [Updated] (SPARK-32621) "path" option is added again to input paths during infer()
[ https://issues.apache.org/jira/browse/SPARK-32621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Terry Kim updated SPARK-32621:
------------------------------
    Priority: Minor  (was: Major)
[jira] [Updated] (SPARK-32620) Reset the numPartitions metric when DPP is enabled
[ https://issues.apache.org/jira/browse/SPARK-32620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-32620:
--------------------------------
    Summary: Reset the numPartitions metric when DPP is enabled  (was: Fix number of partitions read metric when DPP enabled)

> Reset the numPartitions metric when DPP is enabled
> --------------------------------------------------
>
>                 Key: SPARK-32620
>                 URL: https://issues.apache.org/jira/browse/SPARK-32620
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Yuming Wang
>            Priority: Major
[jira] [Created] (SPARK-32621) "path" option is added again to input paths during infer()
Terry Kim created SPARK-32621:
---------------------------------

             Summary: "path" option is added again to input paths during infer()
                 Key: SPARK-32621
                 URL: https://issues.apache.org/jira/browse/SPARK-32621
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.0.0, 2.4.6, 3.0.1, 3.1.0
            Reporter: Terry Kim
[jira] [Assigned] (SPARK-32620) Fix number of partitions read metric when DPP enabled
[ https://issues.apache.org/jira/browse/SPARK-32620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32620: Assignee: (was: Apache Spark) > Fix number of partitions read metric when DPP enabled > - > > Key: SPARK-32620 > URL: https://issues.apache.org/jira/browse/SPARK-32620 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32620) Fix number of partitions read metric when DPP enabled
[ https://issues.apache.org/jira/browse/SPARK-32620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32620: Assignee: Apache Spark > Fix number of partitions read metric when DPP enabled > - > > Key: SPARK-32620 > URL: https://issues.apache.org/jira/browse/SPARK-32620 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32620) Fix number of partitions read metric when DPP enabled
[ https://issues.apache.org/jira/browse/SPARK-32620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178118#comment-17178118 ] Apache Spark commented on SPARK-32620: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/29436 > Fix number of partitions read metric when DPP enabled > - > > Key: SPARK-32620 > URL: https://issues.apache.org/jira/browse/SPARK-32620 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32620) Fix number of partitions read metric when DPP enabled
[ https://issues.apache.org/jira/browse/SPARK-32620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-32620: Summary: Fix number of partitions read metric when DPP enabled (was: Fix number of partitions read when DPP enabled) > Fix number of partitions read metric when DPP enabled > - > > Key: SPARK-32620 > URL: https://issues.apache.org/jira/browse/SPARK-32620 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32620) Fix number of partitions read when DPP enabled
Yuming Wang created SPARK-32620: --- Summary: Fix number of partitions read when DPP enabled Key: SPARK-32620 URL: https://issues.apache.org/jira/browse/SPARK-32620 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32092) CrossvalidatorModel does not save all submodels (it saves only 3)
[ https://issues.apache.org/jira/browse/SPARK-32092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178101#comment-17178101 ] Zirui Xu commented on SPARK-32092: -- I think CrossValidatorModel.copy() is affected by a similar issue of losing numFolds too. I will attempt a fix. > CrossvalidatorModel does not save all submodels (it saves only 3) > - > > Key: SPARK-32092 > URL: https://issues.apache.org/jira/browse/SPARK-32092 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.4.0, 2.4.5 > Environment: Ran on two systems: > * Local pyspark installation (Windows): spark 2.4.5 > * Spark 2.4.0 on a cluster >Reporter: An De Rijdt >Priority: Major > > When saving a CrossValidatorModel with more than 3 subModels and loading > again, a different amount of subModels is returned. It seems every time 3 > subModels are returned. > With less than two submodels (so 2 folds) writing plainly fails. > Issue seems to be (but I am not so familiar with the scala/java side) > * python object is converted to scala/java > * in scala we save subModels until numFolds: > > {code:java} > val subModelsPath = new Path(path, "subModels") >for (splitIndex <- 0 until instance.getNumFolds) { > val splitPath = new Path(subModelsPath, > s"fold${splitIndex.toString}") > for (paramIndex <- 0 until instance.getEstimatorParamMaps.length) { > val modelPath = new Path(splitPath, paramIndex.toString).toString > > instance.subModels(splitIndex)(paramIndex).asInstanceOf[MLWritable].save(modelPath) > } > {code} > * numFolds is not available on the CrossValidatorModel in pyspark > * default numFolds is 3 so somehow it tries to save 3 subModels. > The first issue can be reproduced by following failing tests, where spark is > a SparkSession and tmp_path is a (temporary) directory. 
> {code:java} > from pyspark.ml.tuning import ParamGridBuilder, CrossValidator, > CrossValidatorModel > from pyspark.ml.classification import LogisticRegression > from pyspark.ml.evaluation import BinaryClassificationEvaluator > from pyspark.ml.linalg import Vectors > def test_save_load_cross_validator(spark, tmp_path): > temp_path = str(tmp_path) > dataset = spark.createDataFrame( > [ > (Vectors.dense([0.0]), 0.0), > (Vectors.dense([0.4]), 1.0), > (Vectors.dense([0.5]), 0.0), > (Vectors.dense([0.6]), 1.0), > (Vectors.dense([1.0]), 1.0), > ] > * 10, > ["features", "label"], > ) > lr = LogisticRegression() > grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1]).build() > evaluator = BinaryClassificationEvaluator() > cv = CrossValidator( > estimator=lr, > estimatorParamMaps=grid, > evaluator=evaluator, > collectSubModels=True, > numFolds=4, > ) > cvModel = cv.fit(dataset) > # test save/load of CrossValidatorModel > cvModelPath = temp_path + "/cvModel" > cvModel.write().overwrite().save(cvModelPath) > loadedModel = CrossValidatorModel.load(cvModelPath) > assert len(loadedModel.subModels) == len(cvModel.subModels) > {code} > > The second as follows (will fail writing): > {code:java} > from pyspark.ml.tuning import ParamGridBuilder, CrossValidator, > CrossValidatorModel > from pyspark.ml.classification import LogisticRegression > from pyspark.ml.evaluation import BinaryClassificationEvaluator > from pyspark.ml.linalg import Vectors > def test_save_load_cross_validator(spark, tmp_path): > temp_path = str(tmp_path) > dataset = spark.createDataFrame( > [ > (Vectors.dense([0.0]), 0.0), > (Vectors.dense([0.4]), 1.0), > (Vectors.dense([0.5]), 0.0), > (Vectors.dense([0.6]), 1.0), > (Vectors.dense([1.0]), 1.0), > ] > * 10, > ["features", "label"], > ) > lr = LogisticRegression() > grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1]).build() > evaluator = BinaryClassificationEvaluator() > cv = CrossValidator( > estimator=lr, > estimatorParamMaps=grid, > evaluator=evaluator, > 
collectSubModels=True, > numFolds=2, > ) > cvModel = cv.fit(dataset) > # test save/load of CrossValidatorModel > cvModelPath = temp_path + "/cvModel" > cvModel.write().overwrite().save(cvModelPath) > loadedModel = CrossValidatorModel.load(cvModelPath) > assert len(loadedModel.subModels) == len(cvModel.subModels) > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...
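Both failing tests above follow from the same mechanism the reporter describes: the Scala writer loops over `0 until instance.getNumFolds`, while the pyspark-side model does not carry numFolds and falls back to the default of 3. That interaction can be modeled in a few lines of plain Python (a conceptual sketch with invented names, not pyspark code):

```python
DEFAULT_NUM_FOLDS = 3  # pyspark's default for CrossValidator's numFolds


def save_sub_models(sub_models, num_folds=None):
    """Model of the writer loop: saves submodels for fold indices
    0 until numFolds. When the model doesn't carry numFolds (as described
    for the pyspark CrossValidatorModel), the default of 3 is used."""
    n = num_folds if num_folds is not None else DEFAULT_NUM_FOLDS
    # Raises IndexError when n exceeds the actual number of submodels,
    # mirroring the outright write failure with numFolds=2.
    return [sub_models[i] for i in range(n)]
```

With 4 folds only 3 submodels survive a save/load round trip (test 1), and with 2 folds the loop indexes past the end and saving fails (test 2).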
[jira] [Resolved] (SPARK-32119) ExecutorPlugin doesn't work with Standalone Cluster and Kubernetes with --jars
[ https://issues.apache.org/jira/browse/SPARK-32119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan resolved SPARK-32119. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 28939 [https://github.com/apache/spark/pull/28939] > ExecutorPlugin doesn't work with Standalone Cluster and Kubernetes with --jars > -- > > Key: SPARK-32119 > URL: https://issues.apache.org/jira/browse/SPARK-32119 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.1, 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > Fix For: 3.1.0 > > > ExecutorPlugin doesn't work with Standalone Cluster and Kubernetes > when the jar that contains the plugins, and the files those plugins use, are added via the > --jars and --files options of spark-submit. > This is because jars and files added via --jars and --files are not loaded during > executor initialization. > I confirmed it works with YARN because jars/files are distributed there via the > distributed cache. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-25390) Data source V2 API refactoring
[ https://issues.apache.org/jira/browse/SPARK-25390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178052#comment-17178052 ] Rafael edited comment on SPARK-25390 at 8/14/20, 8:45 PM: -- Hey guys [~cloud_fan] [~b...@cloudera.com] I'm trying to migrate the package where I was using *import org.apache.spark.sql.sources.v2._ into Spark3.0.0* and haven't found a good guide so may I ask my questions here. Here my migration plan can you highlight what interfaces should I use now 1. Cannot find what should I use instead of ReadSupport, ReadSupport, DataSourceReader, if instead of ReadSupport we have to use now Scan then what happened to method createReader? {code:java} class DefaultSource extends ReadSupport { override def createReader(options: DataSourceOptions): DataSourceReader = new GeneratingReader() } {code} 2. Here instead of {code:java} import org.apache.spark.sql.sources.v2.reader.partitioning.{Distribution, Partitioning} import org.apache.spark.sql.sources.v2.reader.{InputPartition, SupportsReportPartitioning} {code} I should use {code:java} import org.apache.spark.sql.connector.read.partitioning.{Distribution, Partitioning} import org.apache.spark.sql.connector.read.{InputPartition, SupportsReportPartitioning} {code} right? {code:java} class GeneratingReader() extends DataSourceReader { override def readSchema(): StructType = {...} override def planInputPartitions(): util.List[InputPartition[InternalRow]] = { val partitions = new util.ArrayList[InputPartition[InternalRow]]() ... partitions.add(new GeneratingInputPartition(...)) } override def outputPartitioning(): Partitioning = {...} } {code} 3. Haven't found what should I use instead of {code:java} import org.apache.spark.sql.sources.v2.reader.InputPartition import org.apache.spark.sql.sources.v2.reader.InputPartitionReader{code} if a new interface is from package read then it has totally different new contract. 
{code:java} import org.apache.spark.sql.connector.read.InputPartition{code} {code:java} class GeneratingInputPartition() extends InputPartition[InternalRow] { override def createPartitionReader(): InputPartitionReader[InternalRow] = new GeneratingInputPartitionReader(...) } class GeneratingInputPartitionReader() extends InputPartitionReader[InternalRow] { override def next(): Boolean = ... override def get(): InternalRow = ... override def close(): Unit = ... } {code} was (Author: kyrdan): Hey guys [~cloud_fan] [~b...@cloudera.com] I'm trying to migrate the package where I was using *import org.apache.spark.sql.sources.v2._ into Spark3.0.0* and haven't found a good guide so may I ask my questions here. Here my migration plan can you highlight what interfaces should I use now 1. Cannot find what should I use instead of ReadSupport, ReadSupport, DataSourceReader, if instead of ReadSupport we have to use now Scan then what happened to method createReader? {code:java} class DefaultSource extends ReadSupport { override def createReader(options: DataSourceOptions): DataSourceReader = new GeneratingReader() } {code} 2. Here instead of {code:java} import org.apache.spark.sql.sources.v2.reader.partitioning.{Distribution, Partitioning} import org.apache.spark.sql.sources.v2.reader.{InputPartition, SupportsReportPartitioning} {code} I should use {code:java} import org.apache.spark.sql.connector.read.partitioning.{Distribution, Partitioning} import org.apache.spark.sql.connector.read.{InputPartition, SupportsReportPartitioning} {code} right? {code:java} class GeneratingReader() extends DataSourceReader { override def readSchema(): StructType = {...} override def planInputPartitions(): util.List[InputPartition[InternalRow]] = { val partitions = new util.ArrayList[InputPartition[InternalRow]]() ... partitions.add(new GeneratingInputPartition(...)) } override def outputPartitioning(): Partitioning = {...} } {code} 3. 
Haven't found what should I use instead of {code:java} import org.apache.spark.sql.sources.v2.reader.InputPartition import org.apache.spark.sql.sources.v2.reader.InputPartitionReader{code} Interface like it has totally different contract {code:java} import org.apache.spark.sql.connector.read.InputPartition{code} {code:java} class GeneratingInputPartition() extends InputPartition[InternalRow] { override def createPartitionReader(): InputPartitionReader[InternalRow] = new GeneratingInputPartitionReader(...) } class GeneratingInputPartitionReader() extends InputPartitionReader[InternalRow] { override def next(): Boolean = ... override def get(): InternalRow = ... override def close(): Unit = ... } {code} > Data source V2 API refactoring > -- > > Key: SPARK-25390 > URL: https://issues.apache.org/jira/browse/SPARK-25390 > Project: Spark > Issue Type:
[jira] [Commented] (SPARK-25390) Data source V2 API refactoring
[ https://issues.apache.org/jira/browse/SPARK-25390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178052#comment-17178052 ] Rafael commented on SPARK-25390: Hey guys, I'm trying to migrate the package where I was using *import org.apache.spark.sql.sources.v2._ into Spark3.0.0* and haven't found a good guide so may I ask my questions here. Here my migration plan can you highlight what interfaces should I use now 1. Cannot find what should I use instead of ReadSupport, ReadSupport, DataSourceReader, if instead of ReadSupport we have to use now Scan then what happened to method createReader? {code:java} class DefaultSource extends ReadSupport { override def createReader(options: DataSourceOptions): DataSourceReader = new GeneratingReader() } {code} 2. Here instead of {code:java} import org.apache.spark.sql.sources.v2.reader.partitioning.{Distribution, Partitioning} import org.apache.spark.sql.sources.v2.reader.{InputPartition, SupportsReportPartitioning} {code} I should use {code:java} import org.apache.spark.sql.connector.read.partitioning.{Distribution, Partitioning} import org.apache.spark.sql.connector.read.{InputPartition, SupportsReportPartitioning} {code} right? {code:java} class GeneratingReader() extends DataSourceReader { override def readSchema(): StructType = {...} override def planInputPartitions(): util.List[InputPartition[InternalRow]] = { val partitions = new util.ArrayList[InputPartition[InternalRow]]() ... partitions.add(new GeneratingInputPartition(...)) } override def outputPartitioning(): Partitioning = {...} } {code} 3. 
Haven't found what should I use instead of {code:java} import org.apache.spark.sql.sources.v2.reader.InputPartition import org.apache.spark.sql.sources.v2.reader.InputPartitionReader{code} Interface like it has totally different contract {code:java} import org.apache.spark.sql.connector.read.InputPartition{code} {code:java} class GeneratingInputPartition() extends InputPartition[InternalRow] { override def createPartitionReader(): InputPartitionReader[InternalRow] = new GeneratingInputPartitionReader(...) } class GeneratingInputPartitionReader() extends InputPartitionReader[InternalRow] { override def next(): Boolean = ... override def get(): InternalRow = ... override def close(): Unit = ... } {code} > Data source V2 API refactoring > -- > > Key: SPARK-25390 > URL: https://issues.apache.org/jira/browse/SPARK-25390 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > > Currently it's not very clear how we should abstract data source v2 API. The > abstraction should be unified between batch and streaming, or similar but > have a well-defined difference between batch and streaming. And the > abstraction should also include catalog/table. > An example of the abstraction: > {code} > batch: catalog -> table -> scan > streaming: catalog -> table -> stream -> scan > {code} > We should refactor the data source v2 API according to the abstraction -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
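Regarding question 3 above: to the best of my reading of the Spark 3.0 connector API (treat the names here as assumptions to verify against the 3.0 Javadoc), the old contract where each `InputPartition` created its own reader was split in two: an `InputPartition` is now an inert, serializable descriptor, and a separate `PartitionReaderFactory` (obtained from the `Batch`) builds the `PartitionReader[InternalRow]` on executors. A toy model of that split, in plain Python with invented names rather than the real Scala interfaces:

```python
class RangePartition:
    """Inert partition descriptor: plain data, no reader logic attached."""
    def __init__(self, start, end):
        self.start, self.end = start, end


class ReaderFactory:
    """Single factory shipped to executors; builds one reader per partition."""
    def create_reader(self, partition):
        # Stands in for PartitionReader's next()/get() loop.
        return iter(range(partition.start, partition.end))


def run_scan(partitions, factory):
    # The driver plans inert partitions; executors turn each descriptor
    # into a reader via the shared factory and drain it.
    rows = []
    for part in partitions:
        rows.extend(factory.create_reader(part))
    return rows
```

The design point is that serializing one factory is cheaper and cleaner than serializing reader-creation logic inside every partition object.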
[jira] [Commented] (SPARK-26132) Remove support for Scala 2.11 in Spark 3.0.0
[ https://issues.apache.org/jira/browse/SPARK-26132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178042#comment-17178042 ] Sean R. Owen commented on SPARK-26132: -- Yes exactly. Unfortunately you have to do that. What that bought us was better backwards-compatibility in Spark 3. > Remove support for Scala 2.11 in Spark 3.0.0 > > > Key: SPARK-26132 > URL: https://issues.apache.org/jira/browse/SPARK-26132 > Project: Spark > Issue Type: Improvement > Components: Build, Spark Core >Affects Versions: 3.0.0 >Reporter: Sean R. Owen >Assignee: Sean R. Owen >Priority: Major > Labels: release-notes > Fix For: 3.0.0 > > > Per some discussion on the mailing list, we are _considering_ formally not > supporting Scala 2.11 in Spark 3.0. This JIRA tracks that discussion. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26132) Remove support for Scala 2.11 in Spark 3.0.0
[ https://issues.apache.org/jira/browse/SPARK-26132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178035#comment-17178035 ] Rafael commented on SPARK-26132: Thank you [~srowen] yes it works. {code:java} val f: Iterator[Row] => Unit = (iterator: Iterator[Row]) => {} dataFrame.foreachPartition(f){code} > Remove support for Scala 2.11 in Spark 3.0.0 > > > Key: SPARK-26132 > URL: https://issues.apache.org/jira/browse/SPARK-26132 > Project: Spark > Issue Type: Improvement > Components: Build, Spark Core >Affects Versions: 3.0.0 >Reporter: Sean R. Owen >Assignee: Sean R. Owen >Priority: Major > Labels: release-notes > Fix For: 3.0.0 > > > Per some discussion on the mailing list, we are _considering_ formally not > supporting Scala 2.11 in Spark 3.0. This JIRA tracks that discussion. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32609) Incorrect exchange reuse with DataSourceV2
[ https://issues.apache.org/jira/browse/SPARK-32609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178032#comment-17178032 ] Apache Spark commented on SPARK-32609: -- User 'mingjialiu' has created a pull request for this issue: https://github.com/apache/spark/pull/29435 > Incorrect exchange reuse with DataSourceV2 > -- > > Key: SPARK-32609 > URL: https://issues.apache.org/jira/browse/SPARK-32609 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 >Reporter: Mingjia Liu >Priority: Major > Labels: correctness > Original Estimate: 48h > Remaining Estimate: 48h > > > {code:java} > spark.conf.set("spark.sql.exchange.reuse","true") > spark.read.format("com.google.cloud.spark.bigquery.v2.BigQueryDataSourceV2").option("table",'tpcds_1G.date_dim').load() > df.createOrReplaceTempView(table) > > df = spark.sql(""" > WITH t1 AS ( > SELECT > d_year, d_month_seq > FROM ( > SELECT t1.d_year , t2.d_month_seq > FROM > date_dim t1 > cross join > date_dim t2 > where t1.d_day_name = "Monday" and t1.d_fy_year > 2000 > and t2.d_day_name = "Monday" and t2.d_fy_year > 2000 > ) > GROUP BY d_year, d_month_seq) > > SELECT > prev_yr.d_year AS prev_year, curr_yr.d_year AS year, curr_yr.d_month_seq > FROM t1 curr_yr cross join t1 prev_yr > WHERE curr_yr.d_year=2002 AND prev_yr.d_year=2002-1 > ORDER BY d_month_seq > LIMIT 100 > > """) > df.explain() > df.show(){code} > > the repro query : > A. defines a temp table t1 > B. cross join t1 (year 2002) and t2 (year 2001) > With reuse exchange enabled, the plan incorrectly "decides" to re-use > persisted shuffle writes of A filtering on year 2002 , for year 2001. 
> {code:java} > == Physical Plan == > TakeOrderedAndProject(limit=100, orderBy=[d_month_seq#24371L ASC NULLS > FIRST], output=[prev_year#24366L,year#24367L,d_month_seq#24371L]) > +- *(9) Project [d_year#24402L AS prev_year#24366L, d_year#23551L AS > year#24367L, d_month_seq#24371L] >+- CartesianProduct > :- *(4) HashAggregate(keys=[d_year#23551L, d_month_seq#24371L], > functions=[]) > : +- Exchange hashpartitioning(d_year#23551L, d_month_seq#24371L, 200) > : +- *(3) HashAggregate(keys=[d_year#23551L, d_month_seq#24371L], > functions=[]) > :+- BroadcastNestedLoopJoin BuildRight, Cross > : :- *(1) Project [d_year#23551L] > : : +- *(1) ScanV2 BigQueryDataSourceV2[d_year#23551L] > (Filters: [isnotnull(d_day_name#23559), (d_day_name#23559 = Monday), > isnotnull(d_fy_year#23556L), (d_fy_yea..., Options: > [table=tpcds_1G.date_dim,paths=[]]) > : +- BroadcastExchange IdentityBroadcastMode > : +- *(2) Project [d_month_seq#24371L] > : +- *(2) ScanV2 > BigQueryDataSourceV2[d_month_seq#24371L] (Filters: > [isnotnull(d_day_name#24382), (d_day_name#24382 = Monday), > isnotnull(d_fy_year#24379L), (d_fy_yea..., Options: > [table=tpcds_1G.date_dim,paths=[]]) > +- *(8) HashAggregate(keys=[d_year#24402L, d_month_seq#24455L], > functions=[]) > +- ReusedExchange [d_year#24402L, d_month_seq#24455L], Exchange > hashpartitioning(d_year#23551L, d_month_seq#24371L, 200){code} > > > And the result is obviously incorrect because prev_year should be 2001 > {code:java} > +-++---+ > |prev_year|year|d_month_seq| > +-++---+ > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > +-++---+ > only showing top 20 rows > {code} > > -- This 
message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
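The reuse bug above can be illustrated without Spark at all. The sketch below is a pure-Python model (not Spark code; all names are illustrative): an "exchange" cache keyed on a canonicalized plan fragment wrongly reuses a computed result when the canonical key omits the pushed-down filter, which is essentially what happens when the DSv2 scan for year 2001 picks up the shuffle output produced for year 2002.

```python
# Toy model of incorrect exchange reuse (SPARK-32609), not Spark internals.
rows = [{"d_year": y, "d_month_seq": 1200 + y % 100} for y in (2001, 2002)]

def scan(year):
    """Pretend DSv2 scan with a pushed-down filter d_year == year."""
    return [r for r in rows if r["d_year"] == year]

def plan_key(year, *, include_filters):
    # Buggy canonicalization drops the pushed filter, so scans over the same
    # table with different predicates compare as equal.
    return ("scan", "date_dim", year if include_filters else None)

def run(include_filters):
    exchange_cache = {}          # models the ReuseExchange lookup table
    out = []
    for year in (2002, 2001):
        key = plan_key(year, include_filters=include_filters)
        if key not in exchange_cache:
            exchange_cache[key] = scan(year)
        out.append(exchange_cache[key])
    return out

buggy = run(include_filters=False)   # year 2001 silently reuses the 2002 rows
fixed = run(include_filters=True)    # each year performs its own scan
```

With the filter excluded from the key, both "years" return the 2002 rows, mirroring the prev_year=2002 column in the reported output; including the filter in plan equality restores correct results.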
[jira] [Commented] (SPARK-26132) Remove support for Scala 2.11 in Spark 3.0.0
[ https://issues.apache.org/jira/browse/SPARK-26132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178024#comment-17178024 ] Sean R. Owen commented on SPARK-26132: -- I think you need to cast your whole lambda function expression as {{(Iterator[T] => Unit)}} or similar. > Remove support for Scala 2.11 in Spark 3.0.0 > > > Key: SPARK-26132 > URL: https://issues.apache.org/jira/browse/SPARK-26132 > Project: Spark > Issue Type: Improvement > Components: Build, Spark Core >Affects Versions: 3.0.0 >Reporter: Sean R. Owen >Assignee: Sean R. Owen >Priority: Major > Labels: release-notes > Fix For: 3.0.0 > > > Per some discussion on the mailing list, we are _considering_ formally not > supporting Scala 2.11 in Spark 3.0. This JIRA tracks that discussion. 
[jira] [Commented] (SPARK-26132) Remove support for Scala 2.11 in Spark 3.0.0
[ https://issues.apache.org/jira/browse/SPARK-26132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178019#comment-17178019 ] Rafael commented on SPARK-26132: [~srowen] In the release notes for Spark 3.0.0 they mentioned your ticket {quote}Due to the upgrade of Scala 2.12, {{DataStreamWriter.foreachBatch}} is not source compatible for Scala program. You need to update your Scala source code to disambiguate between Scala function and Java lambda. (SPARK-26132) {quote} so maybe you know how we should use *foreachPartition* now in Scala code: {code:java} dataFrame.foreachPartition(partition => { partition .grouped(Config.BATCH_SIZE) .foreach(batch => { }) }) {code} Right now any call to a method like grouped or foreach causes the exception *value grouped is not a member of Object* > Remove support for Scala 2.11 in Spark 3.0.0 > > > Key: SPARK-26132 > URL: https://issues.apache.org/jira/browse/SPARK-26132 > Project: Spark > Issue Type: Improvement > Components: Build, Spark Core >Affects Versions: 3.0.0 >Reporter: Sean R. Owen >Assignee: Sean R. Owen >Priority: Major > Labels: release-notes > Fix For: 3.0.0 > > > Per some discussion on the mailing list, we are _considering_ formally not > supporting Scala 2.11 in Spark 3.0. This JIRA tracks that discussion. 
[jira] [Updated] (SPARK-32618) ORC writer doesn't support colon in column names
[ https://issues.apache.org/jira/browse/SPARK-32618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pierre Gramme updated SPARK-32618: -- Description: Hi, I'm getting an {{IllegalArgumentException: Can't parse category at 'struct'}} when exporting to ORC a dataframe whose column names contain colon ({{:}}). Reproducible as hereunder. Same problem also occurs if the name with colon appears nested as member of a struct. Seems related with SPARK-21791(which was solved in 2.3.0). In my real-life case, the column was actually {{xsi:type}}, coming from some parsed xml. Thus other users may be affected too. Has it been fixed after Spark 2.3.0? (sorry, can't test easily) Any workaround? Would be acceptable for me to find and replace all colons with underscore in column names, but not easy to do in a big set of nested struct columns... Thanks {code:java} spark.conf.set("spark.sql.orc.impl", "native") val dfColon = Seq(1).toDF("a:b") dfColon.printSchema() dfColon.show() dfColon.write.orc("test_colon") // Fails with IllegalArgumentException: Can't parse category at 'struct' import org.apache.spark.sql.functions.struct val dfColonStruct = dfColon.withColumn("x", struct($"a:b")).drop("a:b") dfColonStruct.printSchema() dfColonStruct.show() dfColon.write.orc("test_colon_struct") // Fails with IllegalArgumentException: Can't parse category at 'struct>' {code} was: Hi, I'm getting an {{IllegalArgumentException: Can't parse category at 'struct'}} when exporting to ORC a dataframe whose column names contain colon ({{:}}). Reproducible as hereunder. Same problem also occurs if the name with colon appears nested as member of a struct. Seems related with [#SPARK-21791] (which was solved in 2.3.0). In my real-life case, the column was actually {{xsi:type}}, coming from some parsed xml. Thus other users may be affected too. Has it been fixed after Spark 2.3.0? (sorry, can't test easily) Any workaround? 
Would be acceptable for me to find and replace all colons with underscore in column names, but not easy to do in a big set of nested struct columns... Thanks {code:java} spark.conf.set("spark.sql.orc.impl", "native") val dfColon = Seq(1).toDF("a:b") dfColon.printSchema() dfColon.show() dfColon.write.orc("test_colon") // Fails with IllegalArgumentException: Can't parse category at 'struct' import org.apache.spark.sql.functions.struct val dfColonStruct = dfColon.withColumn("x", struct($"a:b")).drop("a:b") dfColonStruct.printSchema() dfColonStruct.show() dfColon.write.orc("test_colon_struct") // Fails with IllegalArgumentException: Can't parse category at 'struct>' {code} > ORC writer doesn't support colon in column names > > > Key: SPARK-32618 > URL: https://issues.apache.org/jira/browse/SPARK-32618 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.3.0 >Reporter: Pierre Gramme >Priority: Major > > Hi, > I'm getting an {{IllegalArgumentException: Can't parse category at > 'struct'}} when exporting to ORC a dataframe whose column names > contain colon ({{:}}). Reproducible as hereunder. Same problem also occurs if > the name with colon appears nested as member of a struct. > Seems related with SPARK-21791(which was solved in 2.3.0). > In my real-life case, the column was actually {{xsi:type}}, coming from some > parsed xml. Thus other users may be affected too. > Has it been fixed after Spark 2.3.0? (sorry, can't test easily) > Any workaround? Would be acceptable for me to find and replace all colons > with underscore in column names, but not easy to do in a big set of nested > struct columns... 
> Thanks > > > {code:java} > spark.conf.set("spark.sql.orc.impl", "native") > val dfColon = Seq(1).toDF("a:b") > dfColon.printSchema() > dfColon.show() > dfColon.write.orc("test_colon") > // Fails with IllegalArgumentException: Can't parse category at > 'struct' > > import org.apache.spark.sql.functions.struct > val dfColonStruct = dfColon.withColumn("x", struct($"a:b")).drop("a:b") > dfColonStruct.printSchema() > dfColonStruct.show() > dfColon.write.orc("test_colon_struct") > // Fails with IllegalArgumentException: Can't parse category at > 'struct>' > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32618) ORC writer doesn't support colon in column names
[ https://issues.apache.org/jira/browse/SPARK-32618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pierre Gramme updated SPARK-32618: -- Description: Hi, I'm getting an {{IllegalArgumentException: Can't parse category at 'struct'}} when exporting to ORC a dataframe whose column names contain colon ({{:}}). Reproducible as hereunder. Same problem also occurs if the name with colon appears nested as member of a struct. Seems related with [#SPARK-21791] (which was solved in 2.3.0). In my real-life case, the column was actually {{xsi:type}}, coming from some parsed xml. Thus other users may be affected too. Has it been fixed after Spark 2.3.0? (sorry, can't test easily) Any workaround? Would be acceptable for me to find and replace all colons with underscore in column names, but not easy to do in a big set of nested struct columns... Thanks {code:java} spark.conf.set("spark.sql.orc.impl", "native") val dfColon = Seq(1).toDF("a:b") dfColon.printSchema() dfColon.show() dfColon.write.orc("test_colon") // Fails with IllegalArgumentException: Can't parse category at 'struct' import org.apache.spark.sql.functions.struct val dfColonStruct = dfColon.withColumn("x", struct($"a:b")).drop("a:b") dfColonStruct.printSchema() dfColonStruct.show() dfColon.write.orc("test_colon_struct") // Fails with IllegalArgumentException: Can't parse category at 'struct>' {code} was: Hi, I'm getting an {{IllegalArgumentException: Can't parse category at 'struct'}} when exporting to ORC a dataframe whose column names contain colon ({{:}}). Reproducible as hereunder. Same problem also occurs if the name with colon appears nested as member of a struct. In my real-life case, the column was actually {{xsi:type}}, coming from some parsed xml. Thus other users may be affected too. Has it been fixed after Spark 2.3.0? (sorry, can't test easily) Any workaround? 
Would be acceptable for me to find and replace all colons with underscore in column names, but not easy to do in a big set of nested struct columns... Thanks {code:java} spark.conf.set("spark.sql.orc.impl", "native") val dfColon = Seq(1).toDF("a:b") dfColon.printSchema() dfColon.show() dfColon.write.orc("test_colon") // Fails with IllegalArgumentException: Can't parse category at 'struct' import org.apache.spark.sql.functions.struct val dfColonStruct = dfColon.withColumn("x", struct($"a:b")).drop("a:b") dfColonStruct.printSchema() dfColonStruct.show() dfColon.write.orc("test_colon_struct") // Fails with IllegalArgumentException: Can't parse category at 'struct>' {code} > ORC writer doesn't support colon in column names > > > Key: SPARK-32618 > URL: https://issues.apache.org/jira/browse/SPARK-32618 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.3.0 >Reporter: Pierre Gramme >Priority: Major > > Hi, > I'm getting an {{IllegalArgumentException: Can't parse category at > 'struct'}} when exporting to ORC a dataframe whose column names > contain colon ({{:}}). Reproducible as hereunder. Same problem also occurs if > the name with colon appears nested as member of a struct. > Seems related with [#SPARK-21791] (which was solved in 2.3.0). > In my real-life case, the column was actually {{xsi:type}}, coming from some > parsed xml. Thus other users may be affected too. > Has it been fixed after Spark 2.3.0? (sorry, can't test easily) > Any workaround? Would be acceptable for me to find and replace all colons > with underscore in column names, but not easy to do in a big set of nested > struct columns... 
> Thanks > > > {code:java} > spark.conf.set("spark.sql.orc.impl", "native") > val dfColon = Seq(1).toDF("a:b") > dfColon.printSchema() > dfColon.show() > dfColon.write.orc("test_colon") > // Fails with IllegalArgumentException: Can't parse category at > 'struct' > > import org.apache.spark.sql.functions.struct > val dfColonStruct = dfColon.withColumn("x", struct($"a:b")).drop("a:b") > dfColonStruct.printSchema() > dfColonStruct.show() > dfColon.write.orc("test_colon_struct") > // Fails with IllegalArgumentException: Can't parse category at > 'struct>' > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32618) ORC writer doesn't support colon in column names
[ https://issues.apache.org/jira/browse/SPARK-32618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pierre Gramme updated SPARK-32618: -- Description: Hi, I'm getting an {{IllegalArgumentException: Can't parse category at 'struct'}} when exporting to ORC a dataframe whose column names contain colon ({{:}}). Reproducible as hereunder. Same problem also occurs if the name with colon appears nested as member of a struct. In my real-life case, the column was actually {{xsi:type}}, coming from some parsed xml. Thus other users may be affected too. Has it been fixed after Spark 2.3.0? (sorry, can't test easily) Any workaround? Would be acceptable for me to find and replace all colons with underscore in column names, but not easy to do in a big set of nested struct columns... Thanks {code:java} spark.conf.set("spark.sql.orc.impl", "native") val dfColon = Seq(1).toDF("a:b") dfColon.printSchema() dfColon.show() dfColon.write.orc("test_colon") // Fails with IllegalArgumentException: Can't parse category at 'struct' import org.apache.spark.sql.functions.struct val dfColonStruct = dfColon.withColumn("x", struct($"a:b")).drop("a:b") dfColonStruct.printSchema() dfColonStruct.show() dfColon.write.orc("test_colon_struct") // Fails with IllegalArgumentException: Can't parse category at 'struct>' {code} was: Hi, I'm getting an {{IllegalArgumentException: Can't parse category at 'struct'}} when exporting to ORC a dataframe whose column names contain colon ({{:}}). Reproducible as hereunder. Same problem also occurs if the name with colon appears nested as member of a struct. In my real-life case, the column was actually {{xsi:type}}, coming from some parsed xml. Thus other users may be affected too. Has it been fixed after Spark 2.3.0? (sorry, can't test easily) Any workaround? Would be acceptable for me to find and replace all colons with underscore in column names, but not easy to do in a big set of nested struct columns... 
Thanks {code:java} spark.conf.set("spark.sql.orc.impl", "native") val dfColon = Seq(1).toDF("a:b") dfColon.printSchema() dfColon.show() dfColon.write.orc("test_colon") // Fails with IllegalArgumentException: Can't parse category at 'struct' import org.apache.spark.sql.functions.struct val dfColonStruct = dfColon.withColumn("x", struct($"a:b")).drop("a:b") dfColonStruct.printSchema() dfColonStruct.show() dfColon.write.orc("test_colon_struct") // Fails with IllegalArgumentException: Can't parse category at 'struct' {code} > ORC writer doesn't support colon in column names > > > Key: SPARK-32618 > URL: https://issues.apache.org/jira/browse/SPARK-32618 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.3.0 >Reporter: Pierre Gramme >Priority: Major > > Hi, > I'm getting an {{IllegalArgumentException: Can't parse category at > 'struct'}} when exporting to ORC a dataframe whose column names > contain colon ({{:}}). Reproducible as hereunder. Same problem also occurs if > the name with colon appears nested as member of a struct. > In my real-life case, the column was actually {{xsi:type}}, coming from some > parsed xml. Thus other users may be affected too. > Has it been fixed after Spark 2.3.0? (sorry, can't test easily) > Any workaround? Would be acceptable for me to find and replace all colons > with underscore in column names, but not easy to do in a big set of nested > struct columns... 
> Thanks > > > {code:java} > spark.conf.set("spark.sql.orc.impl", "native") > val dfColon = Seq(1).toDF("a:b") > dfColon.printSchema() > dfColon.show() > dfColon.write.orc("test_colon") > // Fails with IllegalArgumentException: Can't parse category at > 'struct' > > import org.apache.spark.sql.functions.struct > val dfColonStruct = dfColon.withColumn("x", struct($"a:b")).drop("a:b") > dfColonStruct.printSchema() > dfColonStruct.show() > dfColon.write.orc("test_colon_struct") > // Fails with IllegalArgumentException: Can't parse category at > 'struct>' > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32619) converting dataframe to dataset for the json schema
Manjay Kumar created SPARK-32619: Summary: converting dataframe to dataset for the json schema Key: SPARK-32619 URL: https://issues.apache.org/jira/browse/SPARK-32619 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.3 Reporter: Manjay Kumar I have a schema: {code:java} { "Details": [{ "phone": "98977999", "contacts": [{ "name": "manjay" }] }] } // the "street" block is missing in the JSON {code} and case classes based on the schema: {code:java} case class Details( phone: String, contacts: Array[Adress] ) case class Adress( name: String, street: String ) {code} Calling dataframe.as[Details] throws: No such struct field street (AnalysisException). Is this a bug? Or is there a resolution for this?
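The failure in SPARK-32619 can be modelled without Spark. The sketch below is illustrative Python, not Spark API: binding a record to a class that declares a field absent from the data fails, just like Spark's "No such struct field street". A commonly used workaround in Spark is to add the missing column as a typed null (e.g. withColumn("street", lit(null).cast("string"))) before calling .as[Details]; here that is modelled by defaulting absent keys to None.

```python
# Toy model of strict schema binding vs. null-filling (names are illustrative).
class Address:
    FIELDS = ("name", "street")
    def __init__(self, name, street):
        self.name, self.street = name, street

def bind_strict(record):
    # Mimics the encoder: every declared field must exist in the input schema.
    missing = [f for f in Address.FIELDS if f not in record]
    if missing:
        raise ValueError(f"No such struct field {missing[0]}")
    return Address(**record)

def bind_with_null_fill(record):
    # Workaround: supply null (None) for declared fields the record lacks.
    return Address(**{f: record.get(f) for f in Address.FIELDS})
```

bind_strict({"name": "manjay"}) raises the modelled error, while bind_with_null_fill accepts the record and leaves street as None.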
[jira] [Created] (SPARK-32618) ORC writer doesn't support colon in column names
Pierre Gramme created SPARK-32618: - Summary: ORC writer doesn't support colon in column names Key: SPARK-32618 URL: https://issues.apache.org/jira/browse/SPARK-32618 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 2.3.0 Reporter: Pierre Gramme Hi, I'm getting an {{IllegalArgumentException: Can't parse category at 'struct'}} when exporting to ORC a dataframe whose column names contain colon ({{:}}). Reproducible as hereunder. Same problem also occurs if the name with colon appears nested as member of a struct. In my real-life case, the column was actually {{xsi:type}}, coming from some parsed xml. Thus other users may be affected too. Has it been fixed after Spark 2.3.0? (sorry, can't test easily) Any workaround? Would be acceptable for me to find and replace all colons with underscore in column names, but not easy to do in a big set of nested struct columns... Thanks {code:java} spark.conf.set("spark.sql.orc.impl", "native") val dfColon = Seq(1).toDF("a:b") dfColon.printSchema() dfColon.show() dfColon.write.orc("test_colon") // Fails with IllegalArgumentException: Can't parse category at 'struct' import org.apache.spark.sql.functions.struct val dfColonStruct = dfColon.withColumn("x", struct($"a:b")).drop("a:b") dfColonStruct.printSchema() dfColonStruct.show() dfColon.write.orc("test_colon_struct") // Fails with IllegalArgumentException: Can't parse category at 'struct' {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
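The workaround the reporter mentions, renaming every colon-bearing column including nested struct members, can be sketched as a recursive walk. This is illustrative Python over a schema modelled as plain nested dicts (name -> type, where a type may itself be a dict for a nested struct); in real Spark code the same walk would traverse a StructType and rebuild each StructField with a sanitized name.

```python
# Recursively replace colons with underscores in (possibly nested) field names
# before writing to ORC; a sketch of the manual workaround, not Spark API.
def sanitize(schema):
    out = {}
    for name, dtype in schema.items():
        clean = name.replace(":", "_")
        out[clean] = sanitize(dtype) if isinstance(dtype, dict) else dtype
    return out

schema = {"a:b": "bigint", "x": {"xsi:type": "string", "ok": "int"}}
print(sanitize(schema))
# {'a_b': 'bigint', 'x': {'xsi_type': 'string', 'ok': 'int'}}
```

This is lossy (distinct names like a:b and a_b would collide), so a production version should check for duplicates after renaming.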
[jira] [Commented] (SPARK-32526) Let sql/catalyst module tests pass for Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-32526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177820#comment-17177820 ] Apache Spark commented on SPARK-32526: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/29434 > Let sql/catalyst module tests pass for Scala 2.13 > - > > Key: SPARK-32526 > URL: https://issues.apache.org/jira/browse/SPARK-32526 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yang Jie >Priority: Minor > Attachments: failed-and-aborted-20200806 > > > sql/catalyst module has following compile errors with scala-2.13 profile: > {code:java} > [ERROR] [Error] > /Users/yangjie01/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1284: > type mismatch; > found : > scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] > required: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] > [INFO] [Info] : > scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] <: > Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)]? 
> [INFO] [Info] : false > [ERROR] [Error] > /Users/baidu/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1289: > type mismatch; > found : > scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] > required: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, ?)] > [INFO] [Info] : > scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] <: > Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, ?)]? > [INFO] [Info] : false > [ERROR] [Error] > /Users/yangjie01/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1297: > type mismatch; > found : > scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] > required: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] > [INFO] [Info] : > scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] <: > Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)]? > [INFO] [Info] : false > [ERROR] [Error] > /Users/baidu/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala:952: > type mismatch; > found : > scala.collection.mutable.ArrayBuffer[org.apache.spark.sql.catalyst.plans.logical.LogicalPlan] > required: Seq[org.apache.spark.sql.catalyst.plans.logical.LogicalPlan] > {code} > Similar to https://issues.apache.org/jira/browse/SPARK-29292 , call .toSeq > on these to ensure they still work on 2.12. 
[jira] [Updated] (SPARK-32615) Fix AQE aggregateMetrics java.util.NoSuchElementException
[ https://issues.apache.org/jira/browse/SPARK-32615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Leanken.Lin updated SPARK-32615: Description: {code:java} // Reproduce Step sql/test-only org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite -- -z "SPARK-32573: Eliminate NAAJ when BuildSide is EmptyHashedRelationWithAllNullKeys" {code} {code:java} // Error Message 14:40:44.089 ERROR org.apache.spark.util.Utils: Uncaught exception in thread element-tracking-store-worker 14:40:44.089 ERROR org.apache.spark.util.Utils: Uncaught exception in thread element-tracking-store-worker java.util.NoSuchElementException: key not found: 12 at scala.collection.immutable.Map$Map1.apply(Map.scala:114) at org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$aggregateMetrics$11(SQLAppStatusListener.scala:257) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) at scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149) at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) at scala.collection.mutable.HashMap.foreach(HashMap.scala:149) at scala.collection.TraversableLike.map(TraversableLike.scala:238) at scala.collection.TraversableLike.map$(TraversableLike.scala:231) at scala.collection.AbstractTraversable.map(Traversable.scala:108) at org.apache.spark.sql.execution.ui.SQLAppStatusListener.aggregateMetrics(SQLAppStatusListener.scala:256) at org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$onExecutionEnd$2(SQLAppStatusListener.scala:365) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.util.Utils$.tryLog(Utils.scala:1971) at org.apache.spark.status.ElementTrackingStore$$anon$1.run(ElementTrackingStore.scala:117) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at 
java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748)[info] - SPARK-32573: Eliminate NAAJ when BuildSide is EmptyHashedRelationWithAllNullKeys (2 seconds, 14 milliseconds) {code} This issue occurs mainly because, during AQE, when a sub-plan changes, the metrics update is overwritten. For example, in this UT the plan changes from a BroadcastHashJoinExec into a LocalTableScanExec, and in the onExecutionEnd action it tries to aggregate all metrics recorded during the execution, including old ones, which causes a NoSuchElementException because the metric types have already been updated for the rewritten plan. So we need to filter out those outdated metrics. was: {code:java} // Reproduce Step sql/test-only org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite -- -z "SPARK-32573: Eliminate NAAJ when BuildSide is EmptyHashedRelationWithAllNullKeys" {code} {code:java} // Error Message 14:40:44.089 ERROR org.apache.spark.util.Utils: Uncaught exception in thread element-tracking-store-worker 14:40:44.089 ERROR org.apache.spark.util.Utils: Uncaught exception in thread element-tracking-store-worker java.util.NoSuchElementException: key not found: 12 at scala.collection.immutable.Map$Map1.apply(Map.scala:114) at org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$aggregateMetrics$11(SQLAppStatusListener.scala:257) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) at scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149) at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) at scala.collection.mutable.HashMap.foreach(HashMap.scala:149) at 
scala.collection.TraversableLike.map(TraversableLike.scala:238) at scala.collection.TraversableLike.map$(TraversableLike.scala:231) at scala.collection.AbstractTraversable.map(Traversable.scala:108) at org.apache.spark.sql.execution.ui.SQLAppStatusListener.aggregateMetrics(SQLAppStatusListener.scala:256) at org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$onExecutionEnd$2(SQLAppStatusListener.scala:365) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.util.Utils$.tryLog(Utils.scala:1971) at org.apache.spark.status.ElementTrackingStore$$anon$1.run(ElementTrackingStore.scala:117) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:62
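The proposed fix, skipping accumulator updates whose metric ids no longer exist after the plan rewrite, can be sketched in a few lines. This is a simplified Python model of the aggregation step, not the actual SQLAppStatusListener code; the KeyError below plays the role of the reported NoSuchElementException ("key not found: 12").

```python
# Toy model of metric aggregation after an AQE plan rewrite (illustrative).
metric_types = {10: "sum", 11: "sum"}          # metrics of the rewritten plan
task_updates = {10: [3, 4], 11: [1], 12: [7]}  # id 12 belongs to the old plan

def aggregate_unsafe(updates, types):
    # Bare map lookup: blows up on ids the rewritten plan no longer declares.
    return {m: (types[m], sum(v)) for m, v in updates.items()}

def aggregate_safe(updates, types):
    # Fix: filter out metric ids whose type is unknown (outdated after rewrite).
    return {m: (types[m], sum(v)) for m, v in updates.items() if m in types}
```

aggregate_unsafe raises on id 12, while aggregate_safe quietly drops the stale update and aggregates the rest.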
[jira] [Updated] (SPARK-32616) Window operators should be added determinedly
[ https://issues.apache.org/jira/browse/SPARK-32616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-32616: - Issue Type: Improvement (was: Bug) > Window operators should be added determinedly > - > > Key: SPARK-32616 > URL: https://issues.apache.org/jira/browse/SPARK-32616 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > Fix For: 3.1.0 > > > Currently, the addWindow() method doesn't add window operators determinedly. > The same query could result in different plans (different window order) > because of it. 
[jira] [Assigned] (SPARK-32616) Window operators should be added determinedly
[ https://issues.apache.org/jira/browse/SPARK-32616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-32616: --- Assignee: wuyi > Window operators should be added determinedly > - > > Key: SPARK-32616 > URL: https://issues.apache.org/jira/browse/SPARK-32616 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > > Currently, the addWindow() method doesn't add window operators determinedly. > The same query could result in different plans (different window order) > because of it. 
[jira] [Resolved] (SPARK-32616) Window operators should be added determinedly
[ https://issues.apache.org/jira/browse/SPARK-32616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-32616. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29432 [https://github.com/apache/spark/pull/29432] > Window operators should be added determinedly > - > > Key: SPARK-32616 > URL: https://issues.apache.org/jira/browse/SPARK-32616 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > Fix For: 3.1.0 > > > Currently, the addWindow() method doesn't add window operators determinedly. > The same query could result in different plans (different window order) > because of it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
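The determinism issue above is easiest to see in miniature. Below is a purely illustrative Python sketch (Spark's actual addWindow() is Scala code operating on logical plans, and the names `add_windows` and the spec strings here are invented): when window operators are collected from an unordered container, the order in which they wrap the plan can vary between runs; sorting by a stable key before adding them makes the resulting plan deterministic.

```python
# Illustrative only: shows why plan shape can vary when window operators
# are added from an unordered collection, and how a stable sort fixes it.
# Names (add_windows, the spec strings) are hypothetical, not Spark internals.

def add_windows(plan, window_specs, deterministic=True):
    """Wrap `plan` in one nested ("Window", spec, child) node per spec."""
    # With deterministic=False, iteration order over a set is unspecified,
    # so the same query could produce differently ordered Window operators.
    specs = sorted(window_specs) if deterministic else list(window_specs)
    for spec in specs:
        plan = ("Window", spec, plan)
    return plan

specs = {"w_rank", "w_sum", "w_avg"}  # unordered container of window specs

# With the stable sort, the same inputs always yield the same plan:
plan_a = add_windows("Scan", specs)
plan_b = add_windows("Scan", specs)
assert plan_a == plan_b
```

The fix in the pull request serves the same end: make the operator-insertion order a function of the query alone, not of internal collection iteration order.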
[jira] [Comment Edited] (SPARK-32187) User Guide - Shipping Python Package
[ https://issues.apache.org/jira/browse/SPARK-32187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177765#comment-17177765 ] Fabian Höring edited comment on SPARK-32187 at 8/14/20, 1:24 PM: - About this ticket: https://issues.apache.org/jira/browse/SPARK-13587 and those settings: {code:java} spark-submit --deploy-mode cluster --master yarn --py-files parallelisation_hack-0.1-py2.7.egg --conf spark.pyspark.virtualenv.enabled=true --conf spark.pyspark.virtualenv.type=native --conf spark.pyspark.virtualenv.requirements=requirements.txt --conf spark.pyspark.virtualenv.bin.path=virtualenv --conf spark.pyspark.python=python3 pyspark_poc_runner.py {code} I don't know if they still work but personally I would close the ticket and not put this in the doc. I think it is not the right way to do it as it doesn't scale to 100s of executors and can produce race conditions for the tasks running on the same executor (multiple pip installs at the same time on the same node) was (Author: fhoering): About this ticket: https://issues.apache.org/jira/browse/SPARK-13587 and those settings: {code} spark-submit --deploy-mode cluster --master yarn --py-files parallelisation_hack-0.1-py2.7.egg --conf spark.pyspark.virtualenv.enabled=true --conf spark.pyspark.virtualenv.type=native --conf spark.pyspark.virtualenv.requirements=requirements.txt --conf spark.pyspark.virtualenv.bin.path=virtualenv --conf spark.pyspark.python=python3 pyspark_poc_runner.py {code} I don't know they still work but personally I would close the ticket and not put this in the doc. 
I think it is not the right way to to it as it doens't scale to 100 executors and can produce race conditions for the task running on the same executor (multiple pip installs at the same time on the same node) > User Guide - Shipping Python Package > > > Key: SPARK-32187 > URL: https://issues.apache.org/jira/browse/SPARK-32187 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > - Zipped file > - Python files > - PEX \(?\) (see also SPARK-25433) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-32187) User Guide - Shipping Python Package
[ https://issues.apache.org/jira/browse/SPARK-32187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177765#comment-17177765 ] Fabian Höring edited comment on SPARK-32187 at 8/14/20, 1:23 PM: - About this ticket: https://issues.apache.org/jira/browse/SPARK-13587 and those settings: {code} spark-submit --deploy-mode cluster --master yarn --py-files parallelisation_hack-0.1-py2.7.egg --conf spark.pyspark.virtualenv.enabled=true --conf spark.pyspark.virtualenv.type=native --conf spark.pyspark.virtualenv.requirements=requirements.txt --conf spark.pyspark.virtualenv.bin.path=virtualenv --conf spark.pyspark.python=python3 pyspark_poc_runner.py {code} I don't know they still work but personally I would close the ticket and not put this in the doc. I think it is not the right way to to it as it doens't scale to 100 executors and can produce race conditions for the task running on the same executor (multiple pip installs at the same time on the same node) was (Author: fhoering): About this ticket: https://issues.apache.org/jira/browse/SPARK-13587 and those settings: spark-submit --deploy-mode cluster --master yarn --py-files parallelisation_hack-0.1-py2.7.egg --conf spark.pyspark.virtualenv.enabled=true --conf spark.pyspark.virtualenv.type=native --conf spark.pyspark.virtualenv.requirements=requirements.txt --conf spark.pyspark.virtualenv.bin.path=virtualenv --conf spark.pyspark.python=python3 pyspark_poc_runner.py I don't know they still work but personally I would close the ticket and not put this in the doc. 
I think it is not the right way to to it as it doens't scale to 100 executors and can produce race conditions for the task running on the same executor (multiple pip installs at the same time on the same node) > User Guide - Shipping Python Package > > > Key: SPARK-32187 > URL: https://issues.apache.org/jira/browse/SPARK-32187 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > - Zipped file > - Python files > - PEX \(?\) (see also SPARK-25433) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32187) User Guide - Shipping Python Package
[ https://issues.apache.org/jira/browse/SPARK-32187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177765#comment-17177765 ] Fabian Höring commented on SPARK-32187: --- About this ticket: https://issues.apache.org/jira/browse/SPARK-13587 and those settings: spark-submit --deploy-mode cluster --master yarn --py-files parallelisation_hack-0.1-py2.7.egg --conf spark.pyspark.virtualenv.enabled=true --conf spark.pyspark.virtualenv.type=native --conf spark.pyspark.virtualenv.requirements=requirements.txt --conf spark.pyspark.virtualenv.bin.path=virtualenv --conf spark.pyspark.python=python3 pyspark_poc_runner.py I don't know they still work but personally I would close the ticket and not put this in the doc. I think it is not the right way to to it as it doens't scale to 100 executors and can produce race conditions for the task running on the same executor (multiple pip installs at the same time on the same node) > User Guide - Shipping Python Package > > > Key: SPARK-32187 > URL: https://issues.apache.org/jira/browse/SPARK-32187 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > - Zipped file > - Python files > - PEX \(?\) (see also SPARK-25433) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-32187) User Guide - Shipping Python Package
[ https://issues.apache.org/jira/browse/SPARK-32187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177760#comment-17177760 ] Fabian Höring edited comment on SPARK-32187 at 8/14/20, 1:10 PM: - [~hyukjin.kwon] I started working on it. The new doc looks pretty nice! Thanks for the effort on this. I think I can also write about py-files and zipped envs. Here is a first (in progress) draft. I will make it consistent across the examples. All links target the current doc. [https://github.com/fhoering/spark/commit/843b1caa27594bc4bc3cb9637da6f8695db66fbe] I will be on holiday for 2 weeks, so no progress will be made. It would be nice if you have time to have a look and give some feedback on the comments below. Some considerations: It is structured around the vectorized udf example: - Using PEX - Using a zipped virtual environment - Using py files - What about the Spark jars? I referenced those external tools. I don't have any affiliation with those tools: - [https://github.com/pantsbuild/pex] - [https://conda.github.io/conda-pack/spark.html] => seems the only alternative for conda for now afaik - [https://jcristharif.com/venv-pack/spark.html] => it handles venv zip, personally I would recommend using pex because it is self contained but for completeness I added it I also referenced my docker spark standalone e2e example => I don't really want to promote my own stuff here but I think it could probably be helpful for people to have something running directly, the examples always strip some code, if you think it should not be there we can remove it. I don't mind also moving it to the spark repo. Some stuff I'm not sure about: {quote}The unzip will be done by Spark when using the ``--archives`` option in spark-submit or setting the ``spark.yarn.dist.archives`` configuration. {quote} It seems like there is no way to set the archives as a config param when not running on YARN. I checked the doc and the spark code. 
So it seems inconsistent. Can you check or confirm? {quote}It doesn't allow adding packages built as `Wheels <[https://www.python.org/dev/peps/pep-0427/]>`_ and therefore doesn't allow including dependencies with native code. {quote} I think it is the case but we need to check to be sure that it doesn't say something wrong. I can try by adding some wheel and see if it works. There is maybe one sentence to say about docker also. Basically what is described here is the lightweight Python way to do it. was (Author: fhoering): [~hyukjin.kwon] I started working on it. The new doc looks pretty nice ! Thanks for the effort on this. I think I can also write about py-files and zipped envs. Here is a first (in progress) draft. I will make it consistent across the examples. All links target the current doc. [https://github.com/fhoering/spark/commit/843b1caa27594bc4bc3cb9637da6f8695db66fbe] I will be in holidays for 2 weeks. So no progress will be done. It would be nice if you have time to have a look and give some feedback on the comments below. Some considerations: It is structured around the vectorized udf example: - Using PEX - Using a zipped virtual environment - Using py files - What about the Spark jars ? I references those external tools. I don't have any affiliation to those tools: - [https://github.com/pantsbuild/pex] - [https://conda.github.io/conda-pack/spark.html] => seems the only alternative for conda for now afaik - [https://jcristharif.com/venv-pack/spark.html] => it handles venv zip, personally I would recommend to use pex because it is self contained but for completeness I added it I also referenced my docker spark standalone e2e example => I don't really want to promote my own stuff here but I think it could probably be helpful for people to have something running directly, the examples always strip some code, if you think it should not be there we can remove it. I don't mind also moving it to the spark repo. 
Some stuff I'm not sure about: {quote}The unzip will be done by Spark when using target ``--archives`` option in spark-submit or setting ``spark.yarn.dist.archives`` configuration. {quote} I seems like there is no way to set the archives as a config param when not running on YARN. I checked the doc the the spark code. So it seems inconsistent. Can you check or confirm ? {quote}It doesn't allow to add packages built as `Wheels <[https://www.python.org/dev/peps/pep-0427/]>`_ and therefore doesn't allowing to include dependencies with native code. {quote} I think it is the case but we need to check to be sure that it doesn't say something wrong. I can try by adding some wheel and see if it works. There is maybe one sentence to say about docker also. Basically what is described here is the lightweight Python way to do it. > User Guide - Shipping Python Package > >
[jira] [Comment Edited] (SPARK-32187) User Guide - Shipping Python Package
[ https://issues.apache.org/jira/browse/SPARK-32187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177760#comment-17177760 ] Fabian Höring edited comment on SPARK-32187 at 8/14/20, 1:08 PM: - [~hyukjin.kwon] I started working on it. The new doc looks pretty nice ! Thanks for the effort on this. I think I can also write about py-files and zipped envs. Here is a first (in progress) draft. I will make it consistent across the examples. All links target the current doc. [https://github.com/fhoering/spark/commit/843b1caa27594bc4bc3cb9637da6f8695db66fbe] I will be in holidays for 2 weeks. So no progress will be done. It would be nice if you have time to have a look and give some feedback on the comments below. Some considerations: It is structured around the vectorized udf example: - Using PEX - Using a zipped virtual environment - Using py files - What about the Spark jars ? I references those external tools. I don't have any affiliation to those tools: - [https://github.com/pantsbuild/pex] - [https://conda.github.io/conda-pack/spark.html] => seems the only alternative for conda for now afaik - [https://jcristharif.com/venv-pack/spark.html] => it handles venv zip, personally I would recommend to use pex because it is self contained but for completeness I added it I also referenced my docker spark standalone e2e example => I don't really want to promote my own stuff here but I think it could probably be helpful for people to have something running directly, the examples always strip some code, if you think it should not be there we can remove it. I don't mind also moving it to the spark repo. Some stuff I'm not sure about: {quote}The unzip will be done by Spark when using target ``--archives`` option in spark-submit or setting ``spark.yarn.dist.archives`` configuration. {quote} I seems like there is no way to set the archives as a config param when not running on YARN. I checked the doc the the spark code. 
So it seems inconsistent. Can you check or confirm ? {quote}It doesn't allow to add packages built as `Wheels <[https://www.python.org/dev/peps/pep-0427/]>`_ and therefore doesn't allowing to include dependencies with native code. {quote} I think it is the case but we need to check to be sure that it doesn't say something wrong. I can try by adding some wheel and see if it works. There is maybe one sentence to say about docker also. Basically what is described here is the lightweight Python way to do it. was (Author: fhoering): [~hyukjin.kwon] I started working on it. The new doc looks pretty nice ! Thanks for the effort on this. I think I can also write about py-files and zipped envs. Here is a first (in progress) draft. I will make it consistent across the examples. All links target the current doc. [https://github.com/fhoering/spark/commit/843b1caa27594bc4bc3cb9637da6f8695db66fbe] I will be in holidays for 2 weeks. So no progress will be done. It would be nice if you have time have a look and give some feedback on the comments below. Some considerations: It is structured around the vectorized udf example: - Using PEX - Using a zipped virtual environment - Using py files - What about the Spark jars ? I references those external tools. I don't have any affiliation to those tools: - [https://github.com/pantsbuild/pex] - [https://conda.github.io/conda-pack/spark.html] => seems the only alternative for conda for now afaik - [https://jcristharif.com/venv-pack/spark.html] => it handles venv zip, personally I would recommend to use pex because it is self contained but for completeness I added it I also referenced my docker spark standalone e2e example => I don't really want to promote my own stuff here but I think it could probably be helpful for people to have something running directly, the examples always strip some code, if you think it should not be there we can remove it. I don't mind also moving it to the spark repo. 
Some stuff I'm not sure about: {quote}The unzip will be done by Spark when using target ``--archives`` option in spark-submit or setting ``spark.yarn.dist.archives`` configuration. {quote} I seems like there is no way to set the archives as a config param when not running on YARN. I checked the doc the the spark code. So it seems inconsistent. Can you check or confirm ? {quote}It doesn't allow to add packages built as `Wheels <[https://www.python.org/dev/peps/pep-0427/]>`_ and therefore doesn't allowing to include dependencies with native code. {quote} I think it is the case but we need to check to be sure that it doesn't say something wrong. I can try by adding some wheel and see if it works. There is maybe one sentence to say about docker also. Basically what is described here is the lightweight Python way to do it. > User Guide - Shipping Python Package > > >
[jira] [Commented] (SPARK-32187) User Guide - Shipping Python Package
[ https://issues.apache.org/jira/browse/SPARK-32187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177760#comment-17177760 ] Fabian Höring commented on SPARK-32187: --- [~hyukjin.kwon] I started working on it. The new doc looks pretty nice ! Thanks for the effort on this. I think I can also write about py-files and zipped envs. Here is a first (in progress) draft. I will make it consistent across the examples. All links target the current doc. [https://github.com/fhoering/spark/commit/843b1caa27594bc4bc3cb9637da6f8695db66fbe] I will be in holidays for 2 weeks. So no progress will be done. It would be nice if you have time have a look and give some feedback on the comments below. Some considerations: It is structured around the vectorized udf example: - Using PEX - Using a zipped virtual environment - Using py files - What about the Spark jars ? I references those external tools. I don't have any affiliation to those tools: - [https://github.com/pantsbuild/pex] - [https://conda.github.io/conda-pack/spark.html] => seems the only alternative for conda for now afaik - [https://jcristharif.com/venv-pack/spark.html] => it handles venv zip, personally I would recommend to use pex because it is self contained but for completeness I added it I also referenced my docker spark standalone e2e example => I don't really want to promote my own stuff here but I think it could probably be helpful for people to have something running directly, the examples always strip some code, if you think it should not be there we can remove it. I don't mind also moving it to the spark repo. Some stuff I'm not sure about: {quote}The unzip will be done by Spark when using target ``--archives`` option in spark-submit or setting ``spark.yarn.dist.archives`` configuration. {quote} I seems like there is no way to set the archives as a config param when not running on YARN. I checked the doc the the spark code. So it seems inconsistent. 
Can you check or confirm ? {quote}It doesn't allow to add packages built as `Wheels <[https://www.python.org/dev/peps/pep-0427/]>`_ and therefore doesn't allowing to include dependencies with native code. {quote} I think it is the case but we need to check to be sure that it doesn't say something wrong. I can try by adding some wheel and see if it works. There is maybe one sentence to say about docker also. Basically what is described here is the lightweight Python way to do it. > User Guide - Shipping Python Package > > > Key: SPARK-32187 > URL: https://issues.apache.org/jira/browse/SPARK-32187 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > - Zipped file > - Python files > - PEX \(?\) (see also SPARK-25433) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
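The alternative the comments above recommend over per-executor pip installs is to pack the environment once (PEX, conda-pack, or venv-pack) and ship the archive to every node. The sketch below builds such a spark-submit invocation in the shape the conda-pack Spark guide linked above describes; the archive and script names (environment.tar.gz, app.py) are placeholders, not values from this thread.

```python
# Sketch of the packed-environment approach discussed above (placeholders:
# environment.tar.gz, app.py). YARN unpacks the archive under the alias
# given after "#", so PYSPARK_PYTHON can point inside it on every node,
# avoiding concurrent pip installs on the same machine entirely.

def submit_command(archive="environment.tar.gz", app="app.py"):
    alias = "environment"
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--conf", f"spark.yarn.appMasterEnv.PYSPARK_PYTHON=./{alias}/bin/python",
        "--archives", f"{archive}#{alias}",
        app,
    ]

print(" ".join(submit_command()))
```

Because the environment is resolved once at build time, this scales to hundreds of executors and sidesteps the race condition of multiple pip installs on one node.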
[jira] [Commented] (SPARK-32526) Let sql/catalyst module tests pass for Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-32526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177749#comment-17177749 ] Apache Spark commented on SPARK-32526: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/29433 > Let sql/catalyst module tests pass for Scala 2.13 > - > > Key: SPARK-32526 > URL: https://issues.apache.org/jira/browse/SPARK-32526 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yang Jie >Priority: Minor > Attachments: failed-and-aborted-20200806 > > > sql/catalyst module has following compile errors with scala-2.13 profile: > {code:java} > [ERROR] [Error] > /Users/yangjie01/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1284: > type mismatch; > found : > scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] > required: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] > [INFO] [Info] : > scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] <: > Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)]? 
> [INFO] [Info] : false > [ERROR] [Error] > /Users/baidu/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1289: > type mismatch; > found : > scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] > required: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, ?)] > [INFO] [Info] : > scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] <: > Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, ?)]? > [INFO] [Info] : false > [ERROR] [Error] > /Users/yangjie01/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1297: > type mismatch; > found : > scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] > required: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] > [INFO] [Info] : > scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] <: > Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)]? > [INFO] [Info] : false > [ERROR] [Error] > /Users/baidu/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala:952: > type mismatch; > found : > scala.collection.mutable.ArrayBuffer[org.apache.spark.sql.catalyst.plans.logical.LogicalPlan] > required: Seq[org.apache.spark.sql.catalyst.plans.logical.LogicalPlan] > {code} > Similar to https://issues.apache.org/jira/browse/SPARK-29292 , call .toSeq > on these to ensure they still work on 2.12. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
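The compile errors above stem from Scala 2.13 making scala.Seq an alias for the immutable Seq, so a mutable ArrayBuffer no longer satisfies a Seq-typed position; the fix is an explicit .toSeq at the boundary. The real change is Scala code, but as a loose Python analogy (all names here invented for illustration): accumulate in a mutable buffer internally and hand back an immutable copy at the API boundary.

```python
# Loose Python analogy of the Scala 2.13 ".toSeq" fix: build results in a
# mutable accumulator (like ArrayBuffer), but convert to an immutable type
# before returning, so the declared immutable contract is actually honored.

def paired_attributes(left, right):
    buffer = []                      # mutable accumulator, like ArrayBuffer
    for a, b in zip(left, right):
        buffer.append((a, b))
    return tuple(buffer)             # immutable copy, like .toSeq

pairs = paired_attributes(["a#1", "b#2"], ["a#3", "b#4"])
assert isinstance(pairs, tuple)
```

As the issue notes, the conversion is a no-op in spirit on 2.12 while satisfying 2.13's stricter type, which is why the same .toSeq call keeps both versions compiling.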
[jira] [Comment Edited] (SPARK-32234) Spark sql commands are failing on select Queries for the orc tables
[ https://issues.apache.org/jira/browse/SPARK-32234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177708#comment-17177708 ] Ramakrishna Prasad K S edited comment on SPARK-32234 at 8/14/20, 11:22 AM: --- Thanks [~saurabhc100] I am going ahead and merging these changes to our product which is on Spark_3.0. I hope there is no regression or side effects due to these changes. Just wanted to know why this bug is still in resolved state. Is any test still pending to be run? Thank you. was (Author: ramks): Thanks [~saurabhc100] I am going ahead and merging these changes to my local Spark_3.0 setup. I hope there is no regression or side effects due to these changes. Just wanted to know why this bug is still in resolved state. Is any test still pending to be run? Thank you. > Spark sql commands are failing on select Queries for the orc tables > > > Key: SPARK-32234 > URL: https://issues.apache.org/jira/browse/SPARK-32234 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Saurabh Chawla >Assignee: Saurabh Chawla >Priority: Blocker > Fix For: 3.0.1, 3.1.0 > > Attachments: e17f6887c06d47f6a62c0140c1ad569c_00 > > > Spark sql commands are failing on select Queries for the orc tables > Steps to reproduce > > {code:java} > val table = """CREATE TABLE `date_dim` ( > `d_date_sk` INT, > `d_date_id` STRING, > `d_date` TIMESTAMP, > `d_month_seq` INT, > `d_week_seq` INT, > `d_quarter_seq` INT, > `d_year` INT, > `d_dow` INT, > `d_moy` INT, > `d_dom` INT, > `d_qoy` INT, > `d_fy_year` INT, > `d_fy_quarter_seq` INT, > `d_fy_week_seq` INT, > `d_day_name` STRING, > `d_quarter_name` STRING, > `d_holiday` STRING, > `d_weekend` STRING, > `d_following_holiday` STRING, > `d_first_dom` INT, > `d_last_dom` INT, > `d_same_day_ly` INT, > `d_same_day_lq` INT, > `d_current_day` STRING, > `d_current_week` STRING, > `d_current_month` STRING, > `d_current_quarter` STRING, > `d_current_year` STRING) > USING orc > LOCATION 
'/Users/test/tpcds_scale5data/date_dim' > TBLPROPERTIES ( > 'transient_lastDdlTime' = '1574682806')""" > spark.sql(table).collect > val u = """select date_dim.d_date_id from date_dim limit 5""" > spark.sql(u).collect > {code} > > > Exception > > {code:java} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 > (TID 2, 192.168.0.103, executor driver): > java.lang.ArrayIndexOutOfBoundsException: 1 > at > org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initBatch(OrcColumnarBatchReader.java:156) > at > org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$7(OrcFileFormat.scala:258) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:141) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:203) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116) > at > org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:620) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:343) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:895) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:895) > at > 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:372) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:336) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:133) > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:445) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1489) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:448) >
[jira] [Commented] (SPARK-32234) Spark sql commands are failing on select Queries for the orc tables
[ https://issues.apache.org/jira/browse/SPARK-32234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177708#comment-17177708 ] Ramakrishna Prasad K S commented on SPARK-32234: Thanks [~saurabhc100] I am going ahead and merging these changes to my local Spark_3.0 setup. I hope there is no regression or side effects due to these changes. Just wanted to know why this bug is still in resolved state. Is any test still pending to be run? Thank you. > Spark sql commands are failing on select Queries for the orc tables > > > Key: SPARK-32234 > URL: https://issues.apache.org/jira/browse/SPARK-32234 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Saurabh Chawla >Assignee: Saurabh Chawla >Priority: Blocker > Fix For: 3.0.1, 3.1.0 > > Attachments: e17f6887c06d47f6a62c0140c1ad569c_00 > > > Spark sql commands are failing on select Queries for the orc tables > Steps to reproduce > > {code:java} > val table = """CREATE TABLE `date_dim` ( > `d_date_sk` INT, > `d_date_id` STRING, > `d_date` TIMESTAMP, > `d_month_seq` INT, > `d_week_seq` INT, > `d_quarter_seq` INT, > `d_year` INT, > `d_dow` INT, > `d_moy` INT, > `d_dom` INT, > `d_qoy` INT, > `d_fy_year` INT, > `d_fy_quarter_seq` INT, > `d_fy_week_seq` INT, > `d_day_name` STRING, > `d_quarter_name` STRING, > `d_holiday` STRING, > `d_weekend` STRING, > `d_following_holiday` STRING, > `d_first_dom` INT, > `d_last_dom` INT, > `d_same_day_ly` INT, > `d_same_day_lq` INT, > `d_current_day` STRING, > `d_current_week` STRING, > `d_current_month` STRING, > `d_current_quarter` STRING, > `d_current_year` STRING) > USING orc > LOCATION '/Users/test/tpcds_scale5data/date_dim' > TBLPROPERTIES ( > 'transient_lastDdlTime' = '1574682806')""" > spark.sql(table).collect > val u = """select date_dim.d_date_id from date_dim limit 5""" > spark.sql(u).collect > {code} > > > Exception > > {code:java} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 > (TID 2, 192.168.0.103, executor driver): > java.lang.ArrayIndexOutOfBoundsException: 1 > at > org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initBatch(OrcColumnarBatchReader.java:156) > at > org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$7(OrcFileFormat.scala:258) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:141) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:203) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116) > at > org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:620) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:343) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:895) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:895) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:372) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:336) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:133) > at > 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:445) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1489) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:448) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > {code} > > > The reason behind this initBatch is not getting the schema that is needed to > find out the column value in OrcFi
[jira] [Created] (SPARK-32617) Upgrade kubernetes client version to support latest minikube version.
Prashant Sharma created SPARK-32617: --- Summary: Upgrade kubernetes client version to support latest minikube version. Key: SPARK-32617 URL: https://issues.apache.org/jira/browse/SPARK-32617 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 3.1.0 Reporter: Prashant Sharma The following error occurs when the k8s integration tests are run against a minikube cluster with version 1.2.1: {code:java} Run starting. Expected test count is: 18 KubernetesSuite: org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite *** ABORTED *** io.fabric8.kubernetes.client.KubernetesClientException: An error has occurred. at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64) at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:53) at io.fabric8.kubernetes.client.utils.HttpClientUtils.createHttpClient(HttpClientUtils.java:196) at io.fabric8.kubernetes.client.utils.HttpClientUtils.createHttpClient(HttpClientUtils.java:62) at io.fabric8.kubernetes.client.BaseClient.<init>(BaseClient.java:51) at io.fabric8.kubernetes.client.DefaultKubernetesClient.<init>(DefaultKubernetesClient.java:105) at org.apache.spark.deploy.k8s.integrationtest.backend.minikube.Minikube$.getKubernetesClient(Minikube.scala:81) at org.apache.spark.deploy.k8s.integrationtest.backend.minikube.MinikubeTestBackend$.initialize(MinikubeTestBackend.scala:33) at org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite.beforeAll(KubernetesSuite.scala:131) at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212) ... 
Cause: java.nio.file.NoSuchFileException: /root/.minikube/apiserver.crt at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86) at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214) at java.nio.file.Files.newByteChannel(Files.java:361) at java.nio.file.Files.newByteChannel(Files.java:407) at java.nio.file.Files.readAllBytes(Files.java:3152) at io.fabric8.kubernetes.client.internal.CertUtils.getInputStreamFromDataOrFile(CertUtils.java:72) at io.fabric8.kubernetes.client.internal.CertUtils.createKeyStore(CertUtils.java:242) at io.fabric8.kubernetes.client.internal.SSLUtils.keyManagers(SSLUtils.java:128) ... Run completed in 1 second, 821 milliseconds. Total number of tests run: 0 Suites: completed 1, aborted 1 Tests: succeeded 0, failed 0, canceled 0, ignored 0, pending 0 *** 1 SUITE ABORTED *** [INFO] [INFO] Reactor Summary for Spark Project Parent POM 3.1.0-SNAPSHOT: [INFO] [INFO] Spark Project Parent POM ... SUCCESS [ 4.454 s] [INFO] Spark Project Tags . SUCCESS [ 4.768 s] [INFO] Spark Project Local DB . SUCCESS [ 2.961 s] [INFO] Spark Project Networking ... SUCCESS [ 4.258 s] [INFO] Spark Project Shuffle Streaming Service SUCCESS [ 5.703 s] [INFO] Spark Project Unsafe ... SUCCESS [ 3.239 s] [INFO] Spark Project Launcher . SUCCESS [ 3.224 s] [INFO] Spark Project Core . SUCCESS [02:25 min] [INFO] Spark Project Kubernetes Integration Tests . FAILURE [ 17.244 s] [INFO] [INFO] BUILD FAILURE [INFO] [INFO] Total time: 03:12 min [INFO] Finished at: 2020-08-11T06:26:15-05:00 [INFO] [ERROR] Failed to execute goal org.scalatest:scalatest-maven-plugin:2.0.0:test (integration-test) on project spark-kubernetes-integration-tests_2.12: There are test failures -> [Help 1] [ERROR] [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch. 
[ERROR] Re-run Maven using the -X switch to enable full debug logging. [ERROR] [ERROR] For more information about the errors and possible solutions, please read the following articles: [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException [ERROR] {code} Newer minikube releases support profiles; this support is enabled simply by upgrading the minikube version. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32616) Window operators should be added determinedly
[ https://issues.apache.org/jira/browse/SPARK-32616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177663#comment-17177663 ] Apache Spark commented on SPARK-32616: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/29432 > Window operators should be added determinedly > - > > Key: SPARK-32616 > URL: https://issues.apache.org/jira/browse/SPARK-32616 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: wuyi >Priority: Major > > Currently, the addWindow() method doesn't add window operators deterministically. > The same query could result in different plans (different window orders) > because of this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32616) Window operators should be added determinedly
[ https://issues.apache.org/jira/browse/SPARK-32616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32616: Assignee: Apache Spark > Window operators should be added determinedly > - > > Key: SPARK-32616 > URL: https://issues.apache.org/jira/browse/SPARK-32616 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: wuyi >Assignee: Apache Spark >Priority: Major > > Currently, the addWindow() method doesn't add window operators deterministically. > The same query could result in different plans (different window orders) > because of this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32616) Window operators should be added determinedly
[ https://issues.apache.org/jira/browse/SPARK-32616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32616: Assignee: (was: Apache Spark) > Window operators should be added determinedly > - > > Key: SPARK-32616 > URL: https://issues.apache.org/jira/browse/SPARK-32616 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: wuyi >Priority: Major > > Currently, the addWindow() method doesn't add window operators deterministically. > The same query could result in different plans (different window orders) > because of this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32616) Window operators should be added determinedly
[ https://issues.apache.org/jira/browse/SPARK-32616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi updated SPARK-32616: - Issue Type: Bug (was: Improvement) > Window operators should be added determinedly > - > > Key: SPARK-32616 > URL: https://issues.apache.org/jira/browse/SPARK-32616 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: wuyi >Priority: Major > > Currently, the addWindow() method doesn't add window operators deterministically. > The same query could result in different plans (different window orders) > because of this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32616) Window operators should be added determinedly
wuyi created SPARK-32616: Summary: Window operators should be added determinedly Key: SPARK-32616 URL: https://issues.apache.org/jira/browse/SPARK-32616 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: wuyi Currently, the addWindow() method doesn't add window operators deterministically. The same query could result in different plans (different window orders) because of this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
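As a hypothetical, Spark-independent sketch of the problem described above: if operators are stacked by iterating an unordered collection of window specs, the resulting operator order is not guaranteed, whereas imposing a total order before stacking makes the plan deterministic. The class and method names below are illustrative only, not Spark APIs.

```java
// Illustrative only, not Spark code: stacking "Window" operators by iterating
// an unordered collection yields no guaranteed order; sorting by a stable key
// first makes the resulting operator order deterministic.
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class WindowOrderSketch {
    // Stack one Window operator per spec name, in the iteration order given.
    static List<String> stackWindows(Collection<String> specNames) {
        List<String> plan = new ArrayList<>();
        for (String name : specNames) {
            plan.add("Window(" + name + ")");
        }
        return plan;
    }

    // Deterministic variant: impose a total order before adding operators.
    static List<String> stackWindowsDeterministically(Set<String> specNames) {
        List<String> sorted = new ArrayList<>(specNames);
        sorted.sort(null); // natural (lexicographic) order
        return stackWindows(sorted);
    }

    public static void main(String[] args) {
        Set<String> specs = new HashSet<>(List.of("w2", "w3", "w1"));
        // HashSet iteration order is unspecified; the sorted version is stable.
        System.out.println(stackWindowsDeterministically(specs));
        // prints [Window(w1), Window(w2), Window(w3)]
    }
}
```

Both plans are semantically equivalent; the point is only that a plan comparison (or a test asserting an exact plan) fails when the order varies run to run.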
[jira] [Commented] (SPARK-32046) current_timestamp called in a cache dataframe freezes the time for all future calls
[ https://issues.apache.org/jira/browse/SPARK-32046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177584#comment-17177584 ] Dustin Smith commented on SPARK-32046: -- [~maropu] yes, one would expect the current timestamp to be applied per dataframe for each call. My Jira ticket is about the fact that all dataframes get the same timestamp. However, this isn't the case: once current_timestamp has been called once, that is the time. That is, even if we have a new dataframe with a new execution query and a new query plan, the time will be the one from the first call in the shell. On Jupyter and ZP, it will increment twice before freezing. > current_timestamp called in a cache dataframe freezes the time for all future > calls > --- > > Key: SPARK-32046 > URL: https://issues.apache.org/jira/browse/SPARK-32046 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.4.4, 3.0.0 >Reporter: Dustin Smith >Priority: Minor > Labels: caching, sql, time > > If I call current_timestamp 3 times while caching the dataframe variable in > order to freeze that dataframe's time, the 3rd dataframe time and beyond > (4th, 5th, ...) will be frozen to the 2nd dataframe's time. The 1st dataframe > and the 2nd will differ in time but will become static on the 3rd usage and > beyond (when running on Zeppelin or Jupyter). > Additionally, caching only caused 2 dataframes to cache skipping the 3rd. > However, > {code:java} > val df = Seq(java.time.LocalDateTime.now.toString).toDF("datetime").cache > df.count > // this can be run 3 times no issue. > // then later cast to TimestampType{code} > doesn't have this problem and all 3 dataframes cache with correct times > displaying. > Running the code in shell and Jupyter or Zeppelin (ZP) also produces > different results. In the shell, you only get 1 unique time no matter how > many times you run it, current_timestamp. However, in ZP or Jupyter I have > always received 2 unique times before it froze. 
> > {code:java} > val df1 = spark.range(1).select(current_timestamp as "datetime").cache > df1.count > df1.show(false) > Thread.sleep(9500) > val df2 = spark.range(1).select(current_timestamp as "datetime").cache > df2.count > df2.show(false) > Thread.sleep(9500) > val df3 = spark.range(1).select(current_timestamp as "datetime").cache > df3.count > df3.show(false){code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
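A language-level analogy for the freezing behavior discussed above (this is not Spark itself, and the class name is hypothetical): a value captured once when an object is built stays fixed, like a timestamp folded into a cached plan, while a method recomputes on every call.

```java
// Illustrative analogy, not Spark: the field is captured exactly once at
// construction time (like current_timestamp evaluated while preparing a
// cached plan), while the method re-reads the clock on every invocation.
public class TimestampFreezeSketch {
    // Captured once when the "plan" (object) is built; never changes after.
    final long capturedAtPlanTime = System.nanoTime();

    // Re-evaluated on every call, so it keeps moving forward.
    long evaluatedPerCall() {
        return System.nanoTime();
    }
}
```

Under this analogy, re-showing a cached dataframe re-reads the captured field, which is why the displayed time never advances.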
[jira] [Assigned] (SPARK-32590) Remove fullOutput from RowDataSourceScanExec
[ https://issues.apache.org/jira/browse/SPARK-32590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-32590: --- Assignee: Huaxin Gao > Remove fullOutput from RowDataSourceScanExec > > > Key: SPARK-32590 > URL: https://issues.apache.org/jira/browse/SPARK-32590 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > > `RowDataSourceScanExec` requires the full output instead of the scan output > after column pruning. However, in v2 code path, we don't have the full output > anymore so we just pass the pruned output. `RowDataSourceScanExec.fullOutput` > is actually meaningless so we should remove it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32590) Remove fullOutput from RowDataSourceScanExec
[ https://issues.apache.org/jira/browse/SPARK-32590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-32590. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29415 [https://github.com/apache/spark/pull/29415] > Remove fullOutput from RowDataSourceScanExec > > > Key: SPARK-32590 > URL: https://issues.apache.org/jira/browse/SPARK-32590 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > Fix For: 3.1.0 > > > `RowDataSourceScanExec` requires the full output instead of the scan output > after column pruning. However, in v2 code path, we don't have the full output > anymore so we just pass the pruned output. `RowDataSourceScanExec.fullOutput` > is actually meaningless so we should remove it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32615) Fix AQE aggregateMetrics java.util.NoSuchElementException
[ https://issues.apache.org/jira/browse/SPARK-32615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Leanken.Lin updated SPARK-32615: Description: {code:java} // Reproduce Step sql/test-only org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite -- -z "SPARK-32573: Eliminate NAAJ when BuildSide is EmptyHashedRelationWithAllNullKeys" {code} {code:java} // Error Message 14:40:44.089 ERROR org.apache.spark.util.Utils: Uncaught exception in thread element-tracking-store-worker 14:40:44.089 ERROR org.apache.spark.util.Utils: Uncaught exception in thread element-tracking-store-worker java.util.NoSuchElementException: key not found: 12 at scala.collection.immutable.Map$Map1.apply(Map.scala:114) at org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$aggregateMetrics$11(SQLAppStatusListener.scala:257) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) at scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149) at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) at scala.collection.mutable.HashMap.foreach(HashMap.scala:149) at scala.collection.TraversableLike.map(TraversableLike.scala:238) at scala.collection.TraversableLike.map$(TraversableLike.scala:231) at scala.collection.AbstractTraversable.map(Traversable.scala:108) at org.apache.spark.sql.execution.ui.SQLAppStatusListener.aggregateMetrics(SQLAppStatusListener.scala:256) at org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$onExecutionEnd$2(SQLAppStatusListener.scala:365) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.util.Utils$.tryLog(Utils.scala:1971) at org.apache.spark.status.ElementTrackingStore$$anon$1.run(ElementTrackingStore.scala:117) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at 
java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748)[info] - SPARK-32573: Eliminate NAAJ when BuildSide is EmptyHashedRelationWithAllNullKeys (2 seconds, 14 milliseconds) {code} This issue occurs mainly because, during AQE, when a sub-plan changes, the metrics update is overwritten. For example, in this UT a BroadcastHashJoinExec is changed into a LocalTableScanExec, and the onExecutionEnd action then tries to aggregate all metrics collected during the execution, which causes a NoSuchElementException was: {code:java} // Reproduce Step sql/test-only org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite -- -z "SPARK-32573: Eliminate NAAJ when BuildSide is EmptyHashedRelationWithAllNullKeys" {code} {code:java} // Error Message 14:40:44.089 ERROR org.apache.spark.util.Utils: Uncaught exception in thread element-tracking-store-worker14:40:44.089 ERROR org.apache.spark.util.Utils: Uncaught exception in thread element-tracking-store-workerjava.util.NoSuchElementException: key not found: 12 at scala.collection.immutable.Map$Map1.apply(Map.scala:114) at org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$aggregateMetrics$11(SQLAppStatusListener.scala:257) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) at scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149) at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) at scala.collection.mutable.HashMap.foreach(HashMap.scala:149) at scala.collection.TraversableLike.map(TraversableLike.scala:238) at scala.collection.TraversableLike.map$(TraversableLike.scala:231) at 
scala.collection.AbstractTraversable.map(Traversable.scala:108) at org.apache.spark.sql.execution.ui.SQLAppStatusListener.aggregateMetrics(SQLAppStatusListener.scala:256) at org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$onExecutionEnd$2(SQLAppStatusListener.scala:365) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.util.Utils$.tryLog(Utils.scala:1971) at org.apache.spark.status.ElementTrackingStore$$anon$1.run(ElementTrackingStore.scala:117) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748)[info] - SPARK-32573: Eliminate NAAJ when BuildSide is EmptyHashedRelationWithAllNullKeys
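A hypothetical sketch of the failure mode described above (this is not the actual SQLAppStatusListener code, and the names are illustrative): aggregation resolves each seen accumulator id against a map of metric types. If AQE replaces a sub-plan and that map is rebuilt or overwritten, an id recorded under the old plan is no longer present, and the lookup fails in the same way as the "key not found: 12" in the stack trace.

```java
// Illustrative only: models metric aggregation as a strict map lookup per
// accumulator id, which throws NoSuchElementException for a stale id left
// over from a replaced sub-plan (e.g. BroadcastHashJoinExec replaced by
// LocalTableScanExec under AQE).
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.stream.Collectors;

public class MetricsLookupSketch {
    static List<String> aggregate(Map<Long, String> metricTypes, List<Long> seenIds) {
        return seenIds.stream()
                .map(id -> {
                    String type = metricTypes.get(id);
                    if (type == null) {
                        // Mirrors Scala's Map.apply on a missing key.
                        throw new NoSuchElementException("key not found: " + id);
                    }
                    return type;
                })
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<Long, String> afterReplan = Map.of(10L, "sum", 11L, "size");
        // Id 12 was registered by the replaced sub-plan and is now missing:
        try {
            aggregate(afterReplan, List.of(10L, 11L, 12L));
        } catch (NoSuchElementException e) {
            System.out.println(e.getMessage()); // prints key not found: 12
        }
    }
}
```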
[jira] [Updated] (SPARK-32615) Fix AQE aggregateMetrics java.util.NoSuchElementException
[ https://issues.apache.org/jira/browse/SPARK-32615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Leanken.Lin updated SPARK-32615: Description: {code:java} // Reproduce Step sql/test-only org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite -- -z "SPARK-32573: Eliminate NAAJ when BuildSide is EmptyHashedRelationWithAllNullKeys" {code} {code:java} // Error Message 14:40:44.089 ERROR org.apache.spark.util.Utils: Uncaught exception in thread element-tracking-store-worker14:40:44.089 ERROR org.apache.spark.util.Utils: Uncaught exception in thread element-tracking-store-workerjava.util.NoSuchElementException: key not found: 12 at scala.collection.immutable.Map$Map1.apply(Map.scala:114) at org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$aggregateMetrics$11(SQLAppStatusListener.scala:257) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) at scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149) at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) at scala.collection.mutable.HashMap.foreach(HashMap.scala:149) at scala.collection.TraversableLike.map(TraversableLike.scala:238) at scala.collection.TraversableLike.map$(TraversableLike.scala:231) at scala.collection.AbstractTraversable.map(Traversable.scala:108) at org.apache.spark.sql.execution.ui.SQLAppStatusListener.aggregateMetrics(SQLAppStatusListener.scala:256) at org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$onExecutionEnd$2(SQLAppStatusListener.scala:365) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.util.Utils$.tryLog(Utils.scala:1971) at org.apache.spark.status.ElementTrackingStore$$anon$1.run(ElementTrackingStore.scala:117) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at 
java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748)[info] - SPARK-32573: Eliminate NAAJ when BuildSide is EmptyHashedRelationWithAllNullKeys (2 seconds, 14 milliseconds) {code} This issue occurs mainly because, during AQE, when a sub-plan changes, the metrics update is overwritten. For example, in this UT a BroadcastHashJoinExec is changed into a LocalTableScanExec, and the onExecutionEnd action then tries to aggregate all metrics collected during the execution, which causes a NoSuchElementException was: Reproduce Step {code:java} // code placeholder sql/test-only org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite -- -z "SPARK-32573: Eliminate NAAJ when BuildSide is EmptyHashedRelationWithAllNullKeys" {code} {code:java} // code placeholder 14:40:44.089 ERROR org.apache.spark.util.Utils: Uncaught exception in thread element-tracking-store-worker14:40:44.089 ERROR org.apache.spark.util.Utils: Uncaught exception in thread element-tracking-store-workerjava.util.NoSuchElementException: key not found: 12 at scala.collection.immutable.Map$Map1.apply(Map.scala:114) at org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$aggregateMetrics$11(SQLAppStatusListener.scala:257) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) at scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149) at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) at scala.collection.mutable.HashMap.foreach(HashMap.scala:149) at scala.collection.TraversableLike.map(TraversableLike.scala:238) at scala.collection.TraversableLike.map$(TraversableLike.scala:231) at 
scala.collection.AbstractTraversable.map(Traversable.scala:108) at org.apache.spark.sql.execution.ui.SQLAppStatusListener.aggregateMetrics(SQLAppStatusListener.scala:256) at org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$onExecutionEnd$2(SQLAppStatusListener.scala:365) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.util.Utils$.tryLog(Utils.scala:1971) at org.apache.spark.status.ElementTrackingStore$$anon$1.run(ElementTrackingStore.scala:117) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748)[info] - SPARK-32573: Eliminate NAAJ when BuildSide is EmptyHashedRelationWithAllNullKeys (2 seco
[jira] [Comment Edited] (SPARK-32046) current_timestamp called in a cache dataframe freezes the time for all future calls
[ https://issues.apache.org/jira/browse/SPARK-32046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177398#comment-17177398 ] Takeshi Yamamuro edited comment on SPARK-32046 at 8/14/20, 7:28 AM: This issue depends on ZP or Jupyter? If so, I think it is hard to reproduce this issue... The -Non-deterministic- current_timestamp expr can change output if cache broken in the case. So, I think it is not a good idea for applications to depend on those kinds of values, either way. was (Author: maropu): This issue depends on ZP or Jupyter? If so, I think it is hard to reproduce this issue... The -Non-deterministic- exprs can change output if cache broken in the case. So, I think it is not a good idea for applications to depend on those kinds of values, either way. > current_timestamp called in a cache dataframe freezes the time for all future > calls > --- > > Key: SPARK-32046 > URL: https://issues.apache.org/jira/browse/SPARK-32046 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.4.4, 3.0.0 >Reporter: Dustin Smith >Priority: Minor > Labels: caching, sql, time > > If I call current_timestamp 3 times while caching the dataframe variable in > order to freeze that dataframe's time, the 3rd dataframe time and beyond > (4th, 5th, ...) will be frozen to the 2nd dataframe's time. The 1st dataframe > and the 2nd will differ in time but will become static on the 3rd usage and > beyond (when running on Zeppelin or Jupyter). > Additionally, caching only caused 2 dataframes to cache skipping the 3rd. > However, > {code:java} > val df = Seq(java.time.LocalDateTime.now.toString).toDF("datetime").cache > df.count > // this can be run 3 times no issue. > // then later cast to TimestampType{code} > doesn't have this problem and all 3 dataframes cache with correct times > displaying. > Running the code in shell and Jupyter or Zeppelin (ZP) also produces > different results. 
In the shell, you only get 1 unique time no matter how > many times you run it, current_timestamp. However, in ZP or Jupyter I have > always received 2 unique times before it froze. > > {code:java} > val df1 = spark.range(1).select(current_timestamp as "datetime").cache > df1.count > df1.show(false) > Thread.sleep(9500) > val df2 = spark.range(1).select(current_timestamp as "datetime").cache > df2.count > df2.show(false) > Thread.sleep(9500) > val df3 = spark.range(1).select(current_timestamp as "datetime").cache > df3.count > df3.show(false){code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-32046) current_timestamp called in a cache dataframe freezes the time for all future calls
[ https://issues.apache.org/jira/browse/SPARK-32046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177562#comment-17177562 ] Takeshi Yamamuro edited comment on SPARK-32046 at 8/14/20, 7:28 AM: >> The question when does a query evaluation start and stop? Do mutual >> exclusive dataframes being processed consist of the same query evaluation? >> If yes, then current timestamp's behavior in spark shell is correct; >> however, as user, that would be extremely undesirable behavior. I would >> rather cache the current timestamp and call it again for a new time. The evaluation of current_timestamp happens per dataframe just before invoking Spark jobs (more specifically, its done at the optimization stage in a driver side). >> Now if a query evaluation stops once it is executed and starts anew when >> another dataframe or action is called, then the behavior in shell and >> notebooks are incorrect. The notebooks are only correct for a few runs and >> then default to not changing. In normal cases, I think the behaviour of spark-shell is correct. But, I'm not sure what's going on ZP/Jupyter. If you want to make it robust, I think its better to use checkpoint instead of cache though. was (Author: maropu): >> The question when does a query evaluation start and stop? Do mutual >> exclusive dataframes being processed consist of the same query evaluation? >> If yes, then current timestamp's behavior in spark shell is correct; >> however, as user, that would be extremely undesirable behavior. I would >> rather cache the current timestamp and call it again for a new time. The evaluation of current_timestamp happens per dataframe just before invoking Spark jobs (more specifically, its done at the optimization stage in a driver side). >> Now if a query evaluation stops once it is executed and starts anew when >> another dataframe or action is called, then the behavior in shell and >> notebooks are incorrect. 
The notebooks are only correct for a few runs and >> then default to not changing. In normal cases, I think the behaviour of spark-shell is correct. But, I'm not sure what's going on ZP/Jupyter. If you want to make it robust, I think its better to use checkpoint instead of cache though. > current_timestamp called in a cache dataframe freezes the time for all future > calls > --- > > Key: SPARK-32046 > URL: https://issues.apache.org/jira/browse/SPARK-32046 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.4.4, 3.0.0 >Reporter: Dustin Smith >Priority: Minor > Labels: caching, sql, time > > If I call current_timestamp 3 times while caching the dataframe variable in > order to freeze that dataframe's time, the 3rd dataframe time and beyond > (4th, 5th, ...) will be frozen to the 2nd dataframe's time. The 1st dataframe > and the 2nd will differ in time but will become static on the 3rd usage and > beyond (when running on Zeppelin or Jupyter). > Additionally, caching only caused 2 dataframes to cache skipping the 3rd. > However, > {code:java} > val df = Seq(java.time.LocalDateTime.now.toString).toDF("datetime").cache > df.count > // this can be run 3 times no issue. > // then later cast to TimestampType{code} > doesn't have this problem and all 3 dataframes cache with correct times > displaying. > Running the code in shell and Jupyter or Zeppelin (ZP) also produces > different results. In the shell, you only get 1 unique time no matter how > many times you run it, current_timestamp. However, in ZP or Jupyter I have > always received 2 unique times before it froze. 
> > {code:java} > val df1 = spark.range(1).select(current_timestamp as "datetime").cache > df1.count > df1.show(false) > Thread.sleep(9500) > val df2 = spark.range(1).select(current_timestamp as "datetime").cache > df2.count > df2.show(false) > Thread.sleep(9500) > val df3 = spark.range(1).select(current_timestamp as "datetime").cache > df3.count > df3.show(false){code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
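The checkpoint-based workaround suggested in the comment above can be sketched as follows. This is a minimal, hypothetical illustration (not from the ticket): it assumes a running SparkSession named `spark`, and the checkpoint directory path is an example. Because checkpoint() materializes the rows and truncates the plan's lineage, a broken or evicted cache can no longer trigger re-evaluation of current_timestamp:

```scala
// Hedged sketch: freeze current_timestamp via checkpoint() instead of cache().
// Assumes an existing SparkSession `spark`; the directory path is an example.
import org.apache.spark.sql.functions.current_timestamp

spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

// checkpoint() (eager by default) writes the materialized rows to the
// checkpoint directory and truncates the lineage, so later actions replay
// the saved data rather than re-running the plan on the driver.
val df1 = spark.range(1).select(current_timestamp as "datetime").checkpoint()
df1.show(false)
```

Unlike cache(), which keeps the plan around and may recompute it if blocks are evicted, the checkpointed dataframe cannot drift back to a new timestamp.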
[jira] [Comment Edited] (SPARK-32046) current_timestamp called in a cache dataframe freezes the time for all future calls
[ https://issues.apache.org/jira/browse/SPARK-32046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177398#comment-17177398 ] Takeshi Yamamuro edited comment on SPARK-32046 at 8/14/20, 7:25 AM: Does this issue depend on ZP or Jupyter? If so, I think it is hard to reproduce this issue... Non-deterministic exprs can change output if the cache is broken in this case. So, I think it is not a good idea for applications to depend on those kinds of values, either way. was (Author: maropu): This issue depends on ZP or Jupyter? If so, I think it is hard to reproduce this issue... Non-deterministic exprs can change output if cache broken in the case. So, I think it is not a good idea for applications to depend on those kinds of values, either way. > current_timestamp called in a cache dataframe freezes the time for all future > calls > --- > > Key: SPARK-32046 > URL: https://issues.apache.org/jira/browse/SPARK-32046 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.4.4, 3.0.0 >Reporter: Dustin Smith >Priority: Minor > Labels: caching, sql, time > > If I call current_timestamp 3 times while caching the dataframe variable in > order to freeze that dataframe's time, the 3rd dataframe time and beyond > (4th, 5th, ...) will be frozen to the 2nd dataframe's time. The 1st dataframe > and the 2nd will differ in time but will become static on the 3rd usage and > beyond (when running on Zeppelin or Jupyter). > Additionally, caching only caused 2 dataframes to cache skipping the 3rd. > However, > {code:java} > val df = Seq(java.time.LocalDateTime.now.toString).toDF("datetime").cache > df.count > // this can be run 3 times no issue. > // then later cast to TimestampType{code} > doesn't have this problem and all 3 dataframes cache with correct times > displaying. > Running the code in shell and Jupyter or Zeppelin (ZP) also produces > different results. 
In the shell, you only get 1 unique time no matter how > many times you run it, current_timestamp. However, in ZP or Jupyter I have > always received 2 unique times before it froze. > > {code:java} > val df1 = spark.range(1).select(current_timestamp as "datetime").cache > df1.count > df1.show(false) > Thread.sleep(9500) > val df2 = spark.range(1).select(current_timestamp as "datetime").cache > df2.count > df2.show(false) > Thread.sleep(9500) > val df3 = spark.range(1).select(current_timestamp as "datetime").cache > df3.count > df3.show(false){code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32046) current_timestamp called in a cache dataframe freezes the time for all future calls
[ https://issues.apache.org/jira/browse/SPARK-32046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177562#comment-17177562 ] Takeshi Yamamuro commented on SPARK-32046: -- >> The question is: when does a query evaluation start and stop? Do mutually >> exclusive dataframes being processed consist of the same query evaluation? >> If yes, then current_timestamp's behavior in the Spark shell is correct; >> however, as a user, that would be extremely undesirable behavior. I would >> rather cache the current timestamp and call it again for a new time. The evaluation of current_timestamp happens per dataframe, just before invoking Spark jobs (more specifically, it's done at the optimization stage on the driver side). >> Now if a query evaluation stops once it is executed and starts anew when >> another dataframe or action is called, then the behavior in the shell and >> notebooks is incorrect. The notebooks are only correct for a few runs and >> then default to not changing. In normal cases, I think the behaviour of spark-shell is correct. But I'm not sure what's going on in ZP/Jupyter. If you want to make it robust, I think it's better to use checkpoint instead of cache, though. > current_timestamp called in a cache dataframe freezes the time for all future > calls > --- > > Key: SPARK-32046 > URL: https://issues.apache.org/jira/browse/SPARK-32046 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.4.4, 3.0.0 >Reporter: Dustin Smith >Priority: Minor > Labels: caching, sql, time > > If I call current_timestamp 3 times while caching the dataframe variable in > order to freeze that dataframe's time, the 3rd dataframe time and beyond > (4th, 5th, ...) will be frozen to the 2nd dataframe's time. The 1st dataframe > and the 2nd will differ in time but will become static on the 3rd usage and > beyond (when running on Zeppelin or Jupyter). > Additionally, caching only caused 2 dataframes to cache skipping the 3rd. 
> However, > {code:java} > val df = Seq(java.time.LocalDateTime.now.toString).toDF("datetime").cache > df.count > // this can be run 3 times no issue. > // then later cast to TimestampType{code} > doesn't have this problem and all 3 dataframes cache with correct times > displaying. > Running the code in shell and Jupyter or Zeppelin (ZP) also produces > different results. In the shell, you only get 1 unique time no matter how > many times you run it, current_timestamp. However, in ZP or Jupyter I have > always received 2 unique times before it froze. > > {code:java} > val df1 = spark.range(1).select(current_timestamp as "datetime").cache > df1.count > df1.show(false) > Thread.sleep(9500) > val df2 = spark.range(1).select(current_timestamp as "datetime").cache > df2.count > df2.show(false) > Thread.sleep(9500) > val df3 = spark.range(1).select(current_timestamp as "datetime").cache > df3.count > df3.show(false){code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32615) Fix AQE aggregateMetrics java.util.NoSuchElementException
[ https://issues.apache.org/jira/browse/SPARK-32615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32615: Assignee: (was: Apache Spark) > Fix AQE aggregateMetrics java.util.NoSuchElementException > - > > Key: SPARK-32615 > URL: https://issues.apache.org/jira/browse/SPARK-32615 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Leanken.Lin >Priority: Minor > > Reproduction steps > {code:java} > sql/test-only org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite > -- -z "SPARK-32573: Eliminate NAAJ when BuildSide is > EmptyHashedRelationWithAllNullKeys" > {code} > {code:java} > 14:40:44.089 ERROR org.apache.spark.util.Utils: Uncaught exception in thread > element-tracking-store-worker14:40:44.089 ERROR org.apache.spark.util.Utils: > Uncaught exception in thread > element-tracking-store-workerjava.util.NoSuchElementException: key not found: > 12 at scala.collection.immutable.Map$Map1.apply(Map.scala:114) at > org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$aggregateMetrics$11(SQLAppStatusListener.scala:257) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) at > scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149) at > scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) at > scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) at > scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) at > scala.collection.mutable.HashMap.foreach(HashMap.scala:149) at > scala.collection.TraversableLike.map(TraversableLike.scala:238) at > scala.collection.TraversableLike.map$(TraversableLike.scala:231) at > scala.collection.AbstractTraversable.map(Traversable.scala:108) at > org.apache.spark.sql.execution.ui.SQLAppStatusListener.aggregateMetrics(SQLAppStatusListener.scala:256) > at > 
org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$onExecutionEnd$2(SQLAppStatusListener.scala:365) > at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at > org.apache.spark.util.Utils$.tryLog(Utils.scala:1971) at > org.apache.spark.status.ElementTrackingStore$$anon$1.run(ElementTrackingStore.scala:117) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748)[info] - SPARK-32573: Eliminate NAAJ > when BuildSide is EmptyHashedRelationWithAllNullKeys (2 seconds, 14 > milliseconds) > {code} > This issue occurs mainly because, during AQE, the metrics update is > overwritten when a sub-plan changes. For example, in this UT the plan changes > from a BroadcastHashJoinExec into a LocalTableScanExec, and the onExecutionEnd > action then tries to aggregate all metrics collected during the execution, > which causes a NoSuchElementException. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
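The `key not found: 12` at `Map.scala:114` in the trace above is Scala's `Map#apply` throwing on a missing key: aggregation looks up an accumulator ID registered by the old sub-plan, but the overwritten metrics map from the replaced plan no longer contains it. A minimal, Spark-independent illustration of that failure mode (the IDs and names here are hypothetical, chosen only to mirror the trace):

```scala
// Hypothetical metric map built for the original sub-plan
// (accumulatorId -> accumulated value).
val liveMetrics = Map(10L -> 100L, 11L -> 3L)

// After the AQE re-plan, aggregation still asks for an ID (12) that the
// overwritten map never recorded. Map#apply throws:
//   liveMetrics(12L)  // java.util.NoSuchElementException: key not found: 12

// A defensive lookup skips unknown IDs instead of crashing the listener:
val value = liveMetrics.getOrElse(12L, -1L)
println(value) // -1
```

The actual fix in SQLAppStatusListener may differ; this only shows why the exception type is NoSuchElementException rather than a Spark-specific error.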
[jira] [Assigned] (SPARK-32615) Fix AQE aggregateMetrics java.util.NoSuchElementException
[ https://issues.apache.org/jira/browse/SPARK-32615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32615: Assignee: Apache Spark > Fix AQE aggregateMetrics java.util.NoSuchElementException > - > > Key: SPARK-32615 > URL: https://issues.apache.org/jira/browse/SPARK-32615 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Leanken.Lin >Assignee: Apache Spark >Priority: Minor > > Reproduction steps > {code:java} > sql/test-only org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite > -- -z "SPARK-32573: Eliminate NAAJ when BuildSide is > EmptyHashedRelationWithAllNullKeys" > {code} > {code:java} > 14:40:44.089 ERROR org.apache.spark.util.Utils: Uncaught exception in thread > element-tracking-store-worker14:40:44.089 ERROR org.apache.spark.util.Utils: > Uncaught exception in thread > element-tracking-store-workerjava.util.NoSuchElementException: key not found: > 12 at scala.collection.immutable.Map$Map1.apply(Map.scala:114) at > org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$aggregateMetrics$11(SQLAppStatusListener.scala:257) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) at > scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149) at > scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) at > scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) at > scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) at > scala.collection.mutable.HashMap.foreach(HashMap.scala:149) at > scala.collection.TraversableLike.map(TraversableLike.scala:238) at > scala.collection.TraversableLike.map$(TraversableLike.scala:231) at > scala.collection.AbstractTraversable.map(Traversable.scala:108) at > org.apache.spark.sql.execution.ui.SQLAppStatusListener.aggregateMetrics(SQLAppStatusListener.scala:256) > at > 
org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$onExecutionEnd$2(SQLAppStatusListener.scala:365) > at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at > org.apache.spark.util.Utils$.tryLog(Utils.scala:1971) at > org.apache.spark.status.ElementTrackingStore$$anon$1.run(ElementTrackingStore.scala:117) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748)[info] - SPARK-32573: Eliminate NAAJ > when BuildSide is EmptyHashedRelationWithAllNullKeys (2 seconds, 14 > milliseconds) > {code} > This issue occurs mainly because, during AQE, the metrics update is > overwritten when a sub-plan changes. For example, in this UT the plan changes > from a BroadcastHashJoinExec into a LocalTableScanExec, and the onExecutionEnd > action then tries to aggregate all metrics collected during the execution, > which causes a NoSuchElementException. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32615) Fix AQE aggregateMetrics java.util.NoSuchElementException
[ https://issues.apache.org/jira/browse/SPARK-32615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177551#comment-17177551 ] Apache Spark commented on SPARK-32615: -- User 'leanken' has created a pull request for this issue: https://github.com/apache/spark/pull/29431 > Fix AQE aggregateMetrics java.util.NoSuchElementException > - > > Key: SPARK-32615 > URL: https://issues.apache.org/jira/browse/SPARK-32615 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Leanken.Lin >Priority: Minor > > Reproduction steps > {code:java} > sql/test-only org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite > -- -z "SPARK-32573: Eliminate NAAJ when BuildSide is > EmptyHashedRelationWithAllNullKeys" > {code} > {code:java} > 14:40:44.089 ERROR org.apache.spark.util.Utils: Uncaught exception in thread > element-tracking-store-worker14:40:44.089 ERROR org.apache.spark.util.Utils: > Uncaught exception in thread > element-tracking-store-workerjava.util.NoSuchElementException: key not found: > 12 at scala.collection.immutable.Map$Map1.apply(Map.scala:114) at > org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$aggregateMetrics$11(SQLAppStatusListener.scala:257) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) at > scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149) at > scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) at > scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) at > scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) at > scala.collection.mutable.HashMap.foreach(HashMap.scala:149) at > scala.collection.TraversableLike.map(TraversableLike.scala:238) at > scala.collection.TraversableLike.map$(TraversableLike.scala:231) at > scala.collection.AbstractTraversable.map(Traversable.scala:108) at > 
org.apache.spark.sql.execution.ui.SQLAppStatusListener.aggregateMetrics(SQLAppStatusListener.scala:256) > at > org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$onExecutionEnd$2(SQLAppStatusListener.scala:365) > at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at > org.apache.spark.util.Utils$.tryLog(Utils.scala:1971) at > org.apache.spark.status.ElementTrackingStore$$anon$1.run(ElementTrackingStore.scala:117) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748)[info] - SPARK-32573: Eliminate NAAJ > when BuildSide is EmptyHashedRelationWithAllNullKeys (2 seconds, 14 > milliseconds) > {code} > This issue occurs mainly because, during AQE, the metrics update is > overwritten when a sub-plan changes. For example, in this UT the plan changes > from a BroadcastHashJoinExec into a LocalTableScanExec, and the onExecutionEnd > action then tries to aggregate all metrics collected during the execution, > which causes a NoSuchElementException. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32615) Fix AQE aggregateMetrics java.util.NoSuchElementException
[ https://issues.apache.org/jira/browse/SPARK-32615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Leanken.Lin updated SPARK-32615: Summary: Fix AQE aggregateMetrics java.util.NoSuchElementException (was: AQE aggregateMetrics java.util.NoSuchElementException) > Fix AQE aggregateMetrics java.util.NoSuchElementException > - > > Key: SPARK-32615 > URL: https://issues.apache.org/jira/browse/SPARK-32615 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Leanken.Lin >Priority: Minor > > Reproduction steps > {code:java} > sql/test-only org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite > -- -z "SPARK-32573: Eliminate NAAJ when BuildSide is > EmptyHashedRelationWithAllNullKeys" > {code} > {code:java} > 14:40:44.089 ERROR org.apache.spark.util.Utils: Uncaught exception in thread > element-tracking-store-worker14:40:44.089 ERROR org.apache.spark.util.Utils: > Uncaught exception in thread > element-tracking-store-workerjava.util.NoSuchElementException: key not found: > 12 at scala.collection.immutable.Map$Map1.apply(Map.scala:114) at > org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$aggregateMetrics$11(SQLAppStatusListener.scala:257) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) at > scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149) at > scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) at > scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) at > scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) at > scala.collection.mutable.HashMap.foreach(HashMap.scala:149) at > scala.collection.TraversableLike.map(TraversableLike.scala:238) at > scala.collection.TraversableLike.map$(TraversableLike.scala:231) at > scala.collection.AbstractTraversable.map(Traversable.scala:108) at > 
org.apache.spark.sql.execution.ui.SQLAppStatusListener.aggregateMetrics(SQLAppStatusListener.scala:256) > at > org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$onExecutionEnd$2(SQLAppStatusListener.scala:365) > at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at > org.apache.spark.util.Utils$.tryLog(Utils.scala:1971) at > org.apache.spark.status.ElementTrackingStore$$anon$1.run(ElementTrackingStore.scala:117) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748)[info] - SPARK-32573: Eliminate NAAJ > when BuildSide is EmptyHashedRelationWithAllNullKeys (2 seconds, 14 > milliseconds) > {code} > This issue occurs mainly because, during AQE, the metrics update is > overwritten when a sub-plan changes. For example, in this UT the plan changes > from a BroadcastHashJoinExec into a LocalTableScanExec, and the onExecutionEnd > action then tries to aggregate all metrics collected during the execution, > which causes a NoSuchElementException. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org