[jira] [Commented] (SPARK-29606) Improve EliminateOuterJoin performance
[ https://issues.apache.org/jira/browse/SPARK-29606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224007#comment-17224007 ]

Asif commented on SPARK-29606:
--

I have proposed the following PR, which completely solves the issue: [PR-30185|https://github.com/apache/spark/pull/30185]

> Improve EliminateOuterJoin performance
> --
>
> Key: SPARK-29606
> URL: https://issues.apache.org/jira/browse/SPARK-29606
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Yuming Wang
> Priority: Major
>
> How to reproduce:
> {code:scala}
> spark.sql(
>   """
>     |CREATE TABLE `big_table1`(
>     |  `adj_type_id` tinyint, `byr_cntry_id` decimal(4,0), `sap_category_id` decimal(9,0),
>     |  `lstg_site_id` decimal(9,0), `lstg_type_code` decimal(4,0), `offrd_slng_chnl_grp_id` smallint,
>     |  `slr_cntry_id` decimal(4,0), `sold_slng_chnl_grp_id` smallint, `bin_lstg_yn_id` tinyint,
>     |  `bin_sold_yn_id` tinyint, `lstg_curncy_id` decimal(4,0), `blng_curncy_id` decimal(4,0),
>     |  `bid_count` decimal(18,0), `ck_trans_count` decimal(18,0), `ended_bid_count` decimal(18,0),
>     |  `new_lstg_count` decimal(18,0), `ended_lstg_count` decimal(18,0), `ended_success_lstg_count` decimal(18,0),
>     |  `item_sold_count` decimal(18,0), `gmv_us_amt` decimal(18,2), `gmv_byr_lc_amt` decimal(18,2),
>     |  `gmv_slr_lc_amt` decimal(18,2), `gmv_lstg_curncy_amt` decimal(18,2), `gmv_us_m_amt` decimal(18,2),
>     |  `rvnu_insrtn_fee_us_amt` decimal(18,6), `rvnu_insrtn_fee_lc_amt` decimal(18,6), `rvnu_insrtn_fee_bc_amt` decimal(18,6),
>     |  `rvnu_insrtn_fee_us_m_amt` decimal(18,6), `rvnu_insrtn_crd_us_amt` decimal(18,6), `rvnu_insrtn_crd_lc_amt` decimal(18,6),
>     |  `rvnu_insrtn_crd_bc_amt` decimal(18,6), `rvnu_insrtn_crd_us_m_amt` decimal(18,6), `rvnu_fetr_fee_us_amt` decimal(18,6),
>     |  `rvnu_fetr_fee_lc_amt` decimal(18,6), `rvnu_fetr_fee_bc_amt` decimal(18,6), `rvnu_fetr_fee_us_m_amt` decimal(18,6),
>     |  `rvnu_fetr_crd_us_amt` decimal(18,6), `rvnu_fetr_crd_lc_amt` decimal(18,6), `rvnu_fetr_crd_bc_amt` decimal(18,6),
>     |  `rvnu_fetr_crd_us_m_amt` decimal(18,6), `rvnu_fv_fee_us_amt` decimal(18,6), `rvnu_fv_fee_slr_lc_amt` decimal(18,6),
>     |  `rvnu_fv_fee_byr_lc_amt` decimal(18,6), `rvnu_fv_fee_bc_amt` decimal(18,6), `rvnu_fv_fee_us_m_amt` decimal(18,6),
>     |  `rvnu_fv_crd_us_amt` decimal(18,6), `rvnu_fv_crd_byr_lc_amt` decimal(18,6), `rvnu_fv_crd_slr_lc_amt` decimal(18,6),
>     |  `rvnu_fv_crd_bc_amt` decimal(18,6), `rvnu_fv_crd_us_m_amt` decimal(18,6), `rvnu_othr_l_fee_us_amt` decimal(18,6),
>     |  `rvnu_othr_l_fee_lc_amt` decimal(18,6), `rvnu_othr_l_fee_bc_amt` decimal(18,6), `rvnu_othr_l_fee_us_m_amt` decimal(18,6),
>     |  `rvnu_othr_l_crd_us_amt` decimal(18,6), `rvnu_othr_l_crd_lc_amt` decimal(18,6), `rvnu_othr_l_crd_bc_amt` decimal(18,6),
>     |  `rvnu_othr_l_crd_us_m_amt` decimal(18,6), `rvnu_othr_nl_fee_us_amt` decimal(18,6), `rvnu_othr_nl_fee_lc_amt` decimal(18,6),
>     |  `rvnu_othr_nl_fee_bc_amt` decimal(18,6), `rvnu_othr_nl_fee_us_m_amt` decimal(18,6), `rvnu_othr_nl_crd_us_amt` decimal(18,6),
>     |  `rvnu_othr_nl_crd_lc_amt` decimal(18,6), `rvnu_othr_nl_crd_bc_amt` decimal(18,6), `rvnu_othr_nl_crd_us_m_amt` decimal(18,6),
>     |  `rvnu_slr_tools_fee_us_amt` decimal(18,6), `rvnu_slr_tools_fee_lc_amt` decimal(18,6), `rvnu_slr_tools_fee_bc_amt` decimal(18,6),
>     |  `rvnu_slr_tools_fee_us_m_amt` decimal(18,6), `rvnu_slr_tools_crd_us_amt` decimal(18,6), `rvnu_slr_tools_crd_lc_amt` decimal(18,6),
>     |  `rvnu_slr_tools_crd_bc_amt` decimal(18,6), `rvnu_slr_tools_crd_us_m_amt` decimal(18,6), `rvnu_unasgnd_us_amt` decimal(18,6),
>     |  `rvnu_unasgnd_lc_amt` decimal(18,6), `rvnu_unasgnd_bc_amt` decimal(18,6), `rvnu_unasgnd_us_m_amt` decimal(18,6),
>     |  `rvnu_ad_fee_us_amt` decimal(18,6), `rvnu_ad_fee_lc_amt` decimal(18,6), `rvnu_ad_fee_bc_amt` decimal(18,6),
>     |  `rvnu_ad_fee_us_m_amt` decimal(18,6), `rvnu_ad_crd_us_amt` decimal(18,6), `rvnu_ad_crd_lc_amt` decimal(18,6),
>     |  `rvnu_ad_crd_bc_amt` decimal(18,6), `rvnu_ad_crd_us_m_amt` decimal(18,6), `rvnu_othr_ad_fee_us_amt` decimal(18,6),
>     |  `rvnu_othr_ad_fee_lc_amt` decimal(18,6), `rvnu_othr_ad_fee_bc_amt` decimal(18,6), `rvnu_othr_ad_fee_us_m_amt` decimal(18,6),
>     |  `cre_date` date, `cre_user` string, `upd_date` timestamp, `upd_user` string, `cmn_mtrc_summ_dt` date)
>     |USING parquet
>     |PARTITIONED BY (`cmn_mtrc_summ_dt`)
>     |""".stripMargin)
> spark.sql(
>   """
>     |CREATE TABLE `small_table1` (`CURNCY_ID` DECIMAL(9,0), `CURNCY_PLAN_RATE` DECIMAL(18,6), `CRE_DATE` DATE,
>     |  `CRE_USER` STRING, `UPD_DATE` TIMESTAMP, `UPD_USER` STRING)
>     |USING parquet
>     |CLUSTERED BY (CURNCY_ID)
>     |SORTED BY (CURNCY_ID)
>     |INTO 1 BUCKETS
>     |""".stripMargin)
> spark.sql(
>   """
>     |CREATE TABLE `small_table2` (`cntry_id` DECIMAL(4,0), `curncy_id` DECIMAL(4,0), `cntry_desc` STRIN
[jira] [Commented] (SPARK-33308) support CUBE(...) and ROLLUP(...), GROUPING SETS(...) as group by expr in parser level
[ https://issues.apache.org/jira/browse/SPARK-33308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223992#comment-17223992 ]

Apache Spark commented on SPARK-33308:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/30212

> support CUBE(...) and ROLLUP(...), GROUPING SETS(...) as group by expr in parser level
> --
>
> Key: SPARK-33308
> URL: https://issues.apache.org/jira/browse/SPARK-33308
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: angerszhu
> Priority: Major
>
> support CUBE(...) and ROLLUP(...), GROUPING SETS(...) as group by expr in parser level

--
This message was sent by Atlassian Jira (v8.3.4#803005)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
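For readers unfamiliar with what these clauses denote, the grouping-set algebra that a parser-level CUBE/ROLLUP would ultimately have to produce can be sketched in plain Python. This is only an illustration of the expansion semantics under standard SQL definitions, not Spark's actual parser or analyzer code:

```python
from itertools import combinations

def cube(*cols):
    # CUBE(a, b) groups by every subset of {a, b}: (a, b), (a), (b), ().
    return [list(c) for n in range(len(cols), -1, -1) for c in combinations(cols, n)]

def rollup(*cols):
    # ROLLUP(a, b) groups by every prefix: (a, b), (a), ().
    return [list(cols[:n]) for n in range(len(cols), -1, -1)]

print(cube("a", "b"))    # [['a', 'b'], ['a'], ['b'], []]
print(rollup("a", "b"))  # [['a', 'b'], ['a'], []]
```

GROUPING SETS simply lists the desired groupings explicitly, so CUBE over n columns yields 2^n grouping sets while ROLLUP yields n+1 prefixes.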
[jira] [Assigned] (SPARK-33308) support CUBE(...) and ROLLUP(...), GROUPING SETS(...) as group by expr in parser level
[ https://issues.apache.org/jira/browse/SPARK-33308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-33308:
--
Assignee: (was: Apache Spark)

> support CUBE(...) and ROLLUP(...), GROUPING SETS(...) as group by expr in parser level
> --
>
> Key: SPARK-33308
> URL: https://issues.apache.org/jira/browse/SPARK-33308
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: angerszhu
> Priority: Major
>
> support CUBE(...) and ROLLUP(...), GROUPING SETS(...) as group by expr in parser level
[jira] [Assigned] (SPARK-33308) support CUBE(...) and ROLLUP(...), GROUPING SETS(...) as group by expr in parser level
[ https://issues.apache.org/jira/browse/SPARK-33308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-33308:
--
Assignee: Apache Spark

> support CUBE(...) and ROLLUP(...), GROUPING SETS(...) as group by expr in parser level
> --
>
> Key: SPARK-33308
> URL: https://issues.apache.org/jira/browse/SPARK-33308
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: angerszhu
> Assignee: Apache Spark
> Priority: Major
>
> support CUBE(...) and ROLLUP(...), GROUPING SETS(...) as group by expr in parser level
[jira] [Updated] (SPARK-33229) UnsupportedOperationException when group by with cube
[ https://issues.apache.org/jira/browse/SPARK-33229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

angerszhu updated SPARK-33229:
--
Parent: SPARK-33307
Issue Type: Sub-task (was: Improvement)

> UnsupportedOperationException when group by with cube
> --
>
> Key: SPARK-33229
> URL: https://issues.apache.org/jira/browse/SPARK-33229
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Yuming Wang
> Priority: Major
>
> How to reproduce this issue:
> {code:sql}
> create table test_cube using parquet as select id as a, id as b, id as c from range(10);
> select a, b, c, count(*) from test_cube group by 1, cube(2, 3);
> {code}
> {noformat}
> spark-sql> select a, b, c, count(*) from test_cube group by 1, cube(2, 3);
> 20/10/23 06:31:51 ERROR SparkSQLDriver: Failed in [select a, b, c, count(*) from test_cube group by 1, cube(2, 3)]
> java.lang.UnsupportedOperationException
>   at org.apache.spark.sql.catalyst.expressions.GroupingSet.dataType(grouping.scala:35)
>   at org.apache.spark.sql.catalyst.expressions.GroupingSet.dataType$(grouping.scala:35)
>   at org.apache.spark.sql.catalyst.expressions.Cube.dataType(grouping.scala:60)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkValidGroupingExprs$1(CheckAnalysis.scala:268)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$12(CheckAnalysis.scala:284)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$12$adapted(CheckAnalysis.scala:284)
>   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:284)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:92)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:177)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:92)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:89)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:130)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:156)
>   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:153)
>   at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:68)
>   at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
>   at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:133)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>   at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:133)
>   at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:68)
>   at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:66)
>   at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:58)
>   at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:99)
> {noformat}
[jira] [Updated] (SPARK-33233) CUBE/ROLLUP can't support UnresolvedOrdinal
[ https://issues.apache.org/jira/browse/SPARK-33233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

angerszhu updated SPARK-33233:
--
Parent: SPARK-33307
Issue Type: Sub-task (was: Improvement)

> CUBE/ROLLUP can't support UnresolvedOrdinal
> --
>
> Key: SPARK-33233
> URL: https://issues.apache.org/jira/browse/SPARK-33233
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: angerszhu
> Priority: Major
>
> Spark currently supports GROUP BY ordinal, but CUBE/ROLLUP/GROUPING SETS do not. This PR makes CUBE/ROLLUP/GROUPING SETS support GROUP BY ordinal as well.
[jira] [Updated] (SPARK-33296) Format exception when use cube func && with cube
[ https://issues.apache.org/jira/browse/SPARK-33296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

angerszhu updated SPARK-33296:
--
Parent: SPARK-33307
Issue Type: Sub-task (was: Improvement)

> Format exception when use cube func && with cube
> --
>
> Key: SPARK-33296
> URL: https://issues.apache.org/jira/browse/SPARK-33296
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: angerszhu
> Priority: Major
>
> {noformat}
> spark-sql> explain extended select a, b, c from x group by cube(a, b, c) with cube;
> 20/10/30 11:16:50 ERROR SparkSQLDriver: Failed in [explain extended select a, b, c from x group by cube(a, b, c) with cube]
> java.lang.UnsupportedOperationException
>   at org.apache.spark.sql.catalyst.expressions.GroupingSet.dataType(grouping.scala:36)
>   at org.apache.spark.sql.catalyst.expressions.GroupingSet.dataType$(grouping.scala:36)
>   at org.apache.spark.sql.catalyst.expressions.Cube.dataType(grouping.scala:61)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkValidGroupingExprs$1(CheckAnalysis.scala:269)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$12(CheckAnalysis.scala:285)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$12$adapted(CheckAnalysis.scala:285)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:285)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:92)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:184)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:92)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:89)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:132)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:162)
>   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:214)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:159)
>   at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:73)
>   at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
>   at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:138)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:769)
>   at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:138)
>   at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:73)
>   at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:71)
>   at org.apache.spark.sql.execution.QueryExecution.writePlans(QueryExecution.scala:213)
>   at org.apache.spark.sql.execution.QueryExecution.toString(QueryExecution.scala:235)
>   at org.apache.spark.sql.execution.QueryExecution.org$apache$spark$sql$execution$QueryExecution$$explainString(QueryExecution.scala:191)
>   at org.apache.spark.sql.execution.QueryExecution.explainString(QueryExecution.scala:170)
>   at org.apache.spark.sql.execution.command.ExplainCommand.run(commands.scala:158)
>   at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>   at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>   at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
> {noformat}
[jira] [Commented] (SPARK-33307) Refact GROUPING ANALYTICS
[ https://issues.apache.org/jira/browse/SPARK-33307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223990#comment-17223990 ]

angerszhu commented on SPARK-33307:
--

FYI [~maropu] [~cloud_fan]

> Refact GROUPING ANALYTICS
> --
>
> Key: SPARK-33307
> URL: https://issues.apache.org/jira/browse/SPARK-33307
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: angerszhu
> Priority: Major
>
> Support CUBE(...) and ROLLUP(...) in the parser level and support GROUPING SETS(...) as group by expression
[jira] [Created] (SPARK-33309) Replace origin GROUPING SETS with new expression
angerszhu created SPARK-33309:
--

Summary: Replace origin GROUPING SETS with new expression
Key: SPARK-33309
URL: https://issues.apache.org/jira/browse/SPARK-33309
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 3.1.0
Reporter: angerszhu

Replace the current GroupingSets node with a plain expression-based one:

```
case class GroupingSets(
    selectedGroupByExprs: Seq[Seq[Expression]],
    groupByExprs: Seq[Expression],
    child: LogicalPlan,
    aggregations: Seq[NamedExpression]) extends UnaryNode {

  override def output: Seq[Attribute] = aggregations.map(_.toAttribute)

  // Needs to be unresolved before it's translated to Aggregate + Expand because
  // output attributes will change in analysis.
  override lazy val resolved: Boolean = false
}
```
[jira] [Created] (SPARK-33308) support CUBE(...) and ROLLUP(...), GROUPING SETS(...) as group by expr in parser level
angerszhu created SPARK-33308:
--

Summary: support CUBE(...) and ROLLUP(...), GROUPING SETS(...) as group by expr in parser level
Key: SPARK-33308
URL: https://issues.apache.org/jira/browse/SPARK-33308
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 3.1.0
Reporter: angerszhu

support CUBE(...) and ROLLUP(...), GROUPING SETS(...) as group by expr in parser level
[jira] [Updated] (SPARK-33307) Refact GROUPING ANALYTICS
[ https://issues.apache.org/jira/browse/SPARK-33307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

angerszhu updated SPARK-33307:
--
Description: Support CUBE(...) and ROLLUP(...) in the parser level and support GROUPING SETS(...) as group by expression

> Refact GROUPING ANALYTICS
> --
>
> Key: SPARK-33307
> URL: https://issues.apache.org/jira/browse/SPARK-33307
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: angerszhu
> Priority: Major
>
> Support CUBE(...) and ROLLUP(...) in the parser level and support GROUPING SETS(...) as group by expression
[jira] [Created] (SPARK-33307) Refact GROUPING ANALYTICS
angerszhu created SPARK-33307:
--

Summary: Refact GROUPING ANALYTICS
Key: SPARK-33307
URL: https://issues.apache.org/jira/browse/SPARK-33307
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.1.0
Reporter: angerszhu
[jira] [Assigned] (SPARK-33305) DSv2: DROP TABLE command should also invalidate cache
[ https://issues.apache.org/jira/browse/SPARK-33305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-33305:
--
Assignee: (was: Apache Spark)

> DSv2: DROP TABLE command should also invalidate cache
> --
>
> Key: SPARK-33305
> URL: https://issues.apache.org/jira/browse/SPARK-33305
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.1
> Reporter: Chao Sun
> Priority: Major
>
> Different from DSv1, {{DROP TABLE}} command in DSv2 currently only drops the table but doesn't invalidate all caches referencing the table. We should make the behavior consistent between v1 and v2.
[jira] [Assigned] (SPARK-33305) DSv2: DROP TABLE command should also invalidate cache
[ https://issues.apache.org/jira/browse/SPARK-33305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-33305:
--
Assignee: Apache Spark

> DSv2: DROP TABLE command should also invalidate cache
> --
>
> Key: SPARK-33305
> URL: https://issues.apache.org/jira/browse/SPARK-33305
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.1
> Reporter: Chao Sun
> Assignee: Apache Spark
> Priority: Major
>
> Different from DSv1, {{DROP TABLE}} command in DSv2 currently only drops the table but doesn't invalidate all caches referencing the table. We should make the behavior consistent between v1 and v2.
[jira] [Commented] (SPARK-33305) DSv2: DROP TABLE command should also invalidate cache
[ https://issues.apache.org/jira/browse/SPARK-33305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223974#comment-17223974 ]

Apache Spark commented on SPARK-33305:
--

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/30211

> DSv2: DROP TABLE command should also invalidate cache
> --
>
> Key: SPARK-33305
> URL: https://issues.apache.org/jira/browse/SPARK-33305
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.1
> Reporter: Chao Sun
> Priority: Major
>
> Different from DSv1, {{DROP TABLE}} command in DSv2 currently only drops the table but doesn't invalidate all caches referencing the table. We should make the behavior consistent between v1 and v2.
[jira] [Created] (SPARK-33306) TimezoneID is needed when there cast from Date to String
EdisonWang created SPARK-33306:
--

Summary: TimezoneID is needed when there cast from Date to String
Key: SPARK-33306
URL: https://issues.apache.org/jira/browse/SPARK-33306
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.0.0
Reporter: EdisonWang
[jira] [Commented] (SPARK-33150) Groupby key may not be unique when using window
[ https://issues.apache.org/jira/browse/SPARK-33150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223971#comment-17223971 ]

Aoyuan Liao commented on SPARK-33150:
--

[~DieterDP] The time difference is passed via the fold attribute. As you can see, self.collect() is equal to output_collect in your code.
https://github.com/apache/spark/blob/master/python/pyspark/sql/pandas/conversion.py#L133

From this step on, the two tz-naive datetimes are treated the same:
{code:java}
>>> print(output_collect)
[Row(window=datetime.datetime(2019, 10, 27, 2, 54), min_value=1), Row(window=datetime.datetime(2019, 10, 27, 2, 54, fold=1), min_value=3)]
>>> pdf = pd.DataFrame.from_records(output_collect, columns=output.columns)
>>> pdf
               window  min_value
0 2019-10-27 02:54:00          1
1 2019-10-27 02:54:00          3
>>> pdf.dtypes
window       datetime64[ns]
min_value             int64
dtype: object
{code}
If this function doesn't respect the fold attribute, we would have to save an array of fold values and use it when converting to the machine timezone. That means iterating over every datetime entry one by one again, and I am not sure how that would affect performance. As I said, there is a workaround: enable Arrow in PySpark.
{code:java}
>>> spark = (pyspark
...     .sql
...     .SparkSession
...     .builder
...     .master('local[1]')
...     .config("spark.sql.session.timeZone", "UTC")
...     .config('spark.driver.extraJavaOptions', '-Duser.timezone=UTC')
...     .config('spark.executor.extraJavaOptions', '-Duser.timezone=UTC')
...     .config('spark.sql.execution.arrow.pyspark.enabled', True)
...     .getOrCreate()
... )
>>> output_collect = output.collect()
>>> output_pandas = output.toPandas()
>>> print(output_collect)
[Row(window=datetime.datetime(2019, 10, 27, 2, 54), min_value=1), Row(window=datetime.datetime(2019, 10, 27, 2, 54, fold=1), min_value=3)]
>>> print(output_pandas)
               window  min_value
0 2019-10-27 00:54:00          1
1 2019-10-27 01:54:00          3
{code}

> Groupby key may not be unique when using window
> --
>
> Key: SPARK-33150
> URL: https://issues.apache.org/jira/browse/SPARK-33150
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.3.3, 3.0.0
> Reporter: Dieter De Paepe
> Priority: Major
>
> Due to the way Spark converts dates to local times, it may end up losing details that allow it to differentiate instants when those times fall in the transition for daylight savings time. Setting the Spark timezone to UTC does not resolve the issue.
> This issue is somewhat related to SPARK-32123, but seems independent enough to consider this a separate issue.
> A minimal example is below. I tested these on Spark 3.0.0 and 2.3.3 (I could not get 2.4.x to work on my system). My machine is located in timezone "Europe/Brussels".
> {code:java}
> import pyspark
> import pyspark.sql.functions as f
> spark = (pyspark
>     .sql
>     .SparkSession
>     .builder
>     .master('local[1]')
>     .config("spark.sql.session.timeZone", "UTC")
>     .config('spark.driver.extraJavaOptions', '-Duser.timezone=UTC')
>     .config('spark.executor.extraJavaOptions', '-Duser.timezone=UTC')
>     .getOrCreate()
> )
> debug_df = spark.createDataFrame([
>     (1572137640, 1),
>     (1572137640, 2),
>     (1572141240, 3),
>     (1572141240, 4)
> ], ['epochtime', 'value'])
> debug_df \
>     .withColumn('time', f.from_unixtime('epochtime')) \
>     .withColumn('window', f.window('time', '1 minute').start) \
>     .collect()
> {code}
> Output; here we see the window function internally transforms the times to local time, and as such has to disambiguate between the Belgian winter and summer hour transition by setting the "fold" attribute:
> {code:java}
> [Row(epochtime=1572137640, value=1, time='2019-10-27 00:54:00', window=datetime.datetime(2019, 10, 27, 2, 54)),
>  Row(epochtime=1572137640, value=2, time='2019-10-27 00:54:00', window=datetime.datetime(2019, 10, 27, 2, 54)),
>  Row(epochtime=1572141240, value=3, time='2019-10-27 01:54:00', window=datetime.datetime(2019, 10, 27, 2, 54, fold=1)),
>  Row(epochtime=1572141240, value=4, time='2019-10-27 01:54:00', window=datetime.datetime(2019, 10, 27, 2, 54, fold=1))]{code}
> Now, this has severe implications when we use the window function for a groupby operation:
> {code:java}
> output = debug_df \
>     .withColumn('time', f.from_unixtime('epochtime')) \
>     .groupby(f.window('time', '1 minute').start.alias('window')).agg(
>         f.min('value').alias('min_value')
>     )
> output_collect = output.collect()
> output_pandas = output.toPandas()
> print(output_collect)
> print(output_pandas)
> {code}
> Output:
> {code:java}
> [Row(window=datetime.datetime(2019, 10, 27, 2, 54), min_value=1),
>  Row(window=datetime.datetime(2019, 10, 27, 2, 54, f
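The ambiguity discussed above can be reproduced with nothing but the Python standard library (zoneinfo requires Python 3.9+). This is a minimal sketch of the PEP 495 fold semantics the collected rows rely on, independent of Spark: the same naive wall-clock time occurs twice when DST ends in Europe/Brussels, and only fold distinguishes the two instants.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # stdlib since Python 3.9

brussels = ZoneInfo("Europe/Brussels")

# 2019-10-27 02:54 local time happens twice: once in CEST (UTC+2), once in CET (UTC+1).
first = datetime(2019, 10, 27, 2, 54, tzinfo=brussels)           # fold=0 -> earlier offset (CEST)
second = datetime(2019, 10, 27, 2, 54, fold=1, tzinfo=brussels)  # fold=1 -> later offset (CET)

print(first.astimezone(timezone.utc))   # 2019-10-27 00:54:00+00:00
print(second.astimezone(timezone.utc))  # 2019-10-27 01:54:00+00:00

# PEP 495: fold is ignored in comparisons, so as naive datetimes the two keys collide.
print(first.replace(tzinfo=None) == second.replace(tzinfo=None))  # True
```

This is why any conversion path that drops the fold attribute (such as building a tz-naive pandas column) collapses the two distinct group-by keys into one.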
[jira] [Created] (SPARK-33305) DSv2: DROP TABLE command should also invalidate cache
Chao Sun created SPARK-33305:
--

Summary: DSv2: DROP TABLE command should also invalidate cache
Key: SPARK-33305
URL: https://issues.apache.org/jira/browse/SPARK-33305
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.0.1
Reporter: Chao Sun

Different from DSv1, {{DROP TABLE}} command in DSv2 currently only drops the table but doesn't invalidate all caches referencing the table. We should make the behavior consistent between v1 and v2.
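The desired v2 behavior can be illustrated with a toy catalog (purely hypothetical Python, not Spark's CacheManager or DSv2 API): dropping a table should also evict any cached query result that reads from it, rather than leaving stale entries behind.

```python
# Hypothetical in-memory catalog sketching the invariant the issue asks for:
# DROP TABLE invalidates every cached plan that references the dropped table.
class Catalog:
    def __init__(self):
        self.tables = {}  # table name -> table object
        self.cache = {}   # cache key -> set of table names the cached plan reads

    def cache_query(self, key, referenced_tables):
        self.cache[key] = set(referenced_tables)

    def drop_table(self, name):
        self.tables.pop(name, None)
        # Keep only cache entries that do not reference the dropped table.
        self.cache = {k: refs for k, refs in self.cache.items() if name not in refs}

cat = Catalog()
cat.tables["t1"] = object()
cat.tables["t2"] = object()
cat.cache_query("q1", ["t1"])        # reads t1 -> must be invalidated
cat.cache_query("q2", ["t2"])        # reads t2 -> must survive
cat.drop_table("t1")
print(sorted(cat.cache))  # ['q2']
```

Without the invalidation step in drop_table, a later lookup of "q1" would serve results for a table that no longer exists, which is exactly the v1/v2 inconsistency described.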
[jira] [Resolved] (SPARK-32037) Rename blacklisting feature to avoid language with racist connotation
[ https://issues.apache.org/jira/browse/SPARK-32037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves resolved SPARK-32037.
--
Fix Version/s: 3.1.0
Resolution: Fixed

> Rename blacklisting feature to avoid language with racist connotation
> --
>
> Key: SPARK-32037
> URL: https://issues.apache.org/jira/browse/SPARK-32037
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.0.1
> Reporter: Erik Krogen
> Assignee: Thomas Graves
> Priority: Minor
> Fix For: 3.1.0
>
> As per [discussion on the Spark dev list|https://lists.apache.org/thread.html/rf6b2cdcba4d3875350517a2339619e5d54e12e66626a88553f9fe275%40%3Cdev.spark.apache.org%3E], it will be beneficial to remove references to problematic language that can alienate potential community members. One such reference is "blacklist". While it seems to me that there is some valid debate as to whether this term has racist origins, the cultural connotations are inescapable in today's world.
> I've created a separate task, SPARK-32036, to remove references outside of this feature. Given the large surface area of this feature and the public-facing UI / configs / etc., more care will need to be taken here.
> I'd like to start by opening up debate on what the best replacement name would be. Reject-/deny-/ignore-/block-list are common replacements for "blacklist", but I'm not sure that any of them work well for this situation.
[jira] [Assigned] (SPARK-32037) Rename blacklisting feature to avoid language with racist connotation
[ https://issues.apache.org/jira/browse/SPARK-32037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves reassigned SPARK-32037:
--
Assignee: Thomas Graves

> Rename blacklisting feature to avoid language with racist connotation
> --
>
> Key: SPARK-32037
> URL: https://issues.apache.org/jira/browse/SPARK-32037
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.0.1
> Reporter: Erik Krogen
> Assignee: Thomas Graves
> Priority: Minor
>
> As per [discussion on the Spark dev list|https://lists.apache.org/thread.html/rf6b2cdcba4d3875350517a2339619e5d54e12e66626a88553f9fe275%40%3Cdev.spark.apache.org%3E], it will be beneficial to remove references to problematic language that can alienate potential community members. One such reference is "blacklist". While it seems to me that there is some valid debate as to whether this term has racist origins, the cultural connotations are inescapable in today's world.
> I've created a separate task, SPARK-32036, to remove references outside of this feature. Given the large surface area of this feature and the public-facing UI / configs / etc., more care will need to be taken here.
> I'd like to start by opening up debate on what the best replacement name would be. Reject-/deny-/ignore-/block-list are common replacements for "blacklist", but I'm not sure that any of them work well for this situation.
[jira] [Commented] (SPARK-29683) Job failed due to executor failures all available nodes are blacklisted
[ https://issues.apache.org/jira/browse/SPARK-29683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223929#comment-17223929 ] Dongwook Kwon commented on SPARK-29683: --- I agree with Genmao: the logic added by SPARK-16630 is too strong a condition for failing the application. {code:java} def isAllNodeBlacklisted: Boolean = currentBlacklistedYarnNodes.size >= numClusterNodes val allBlacklistedNodes = excludeNodes ++ schedulerBlacklist ++ allocatorBlacklist.keySet {code} The logic above only works for partial failures or intermittent issues where numClusterNodes does not change. When a scheduler or allocator failure becomes permanent and the ResourceManager has to remove the node from its pool, numClusterNodes changes and the check can fail the application incorrectly. For example, say a cluster has 2 NodeManagers (numClusterNodes = 2), and one NodeManager (N1) has issues that cause scheduling failures, increasing schedulerBlacklist.size to 1. If N1 later becomes unrecoverable from the ResourceManager's perspective (due to a hardware failure, operator decommissioning, or otherwise), numClusterNodes drops to 1, which makes isAllNodeBlacklisted true even though 1 NodeManager is still available and "spark.yarn.blacklist.executor.launch.blacklisting.enabled" is set to false. In cloud environments in particular, cluster resizes happen all the time: for a long-running Spark application that goes through many resizes, schedulerBlacklist.size can keep increasing while numClusterNodes fluctuates. In addition, even when currentBlacklistedYarnNodes.size >= numClusterNodes is momentarily true, new nodes may be added again quickly. I found [e70df2cea46f71461d8d401a420e946f999862c1|https://github.com/apache/spark/commit/e70df2cea46f71461d8d401a420e946f999862c1] was added to handle the case of numClusterNodes = 0.
However, for the other cases mentioned in this JIRA, I think it would make more sense to simply remove the following part from [ApplicationMaster|https://github.com/apache/spark/blob/branch-2.4/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L535-L538], because isAllNodeBlacklisted does not necessarily mean the running application needs to fail. {code:java} } else if (allocator.isAllNodeBlacklisted) { finish(FinalApplicationStatus.FAILED, ApplicationMaster.EXIT_MAX_EXECUTOR_FAILURES, "Due to executor failures all available nodes are blacklisted") {code} Or, at the least, the above condition should be made optional via "spark.yarn.blacklist.executor.launch.blacklisting.enabled" or some new configuration, because SPARK-16630 was added as optional while the logic above takes effect regardless of any configuration. I would like to hear others' opinions on this. > Job failed due to executor failures all available nodes are blacklisted > --- > > Key: SPARK-29683 > URL: https://issues.apache.org/jira/browse/SPARK-29683 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Affects Versions: 3.0.0 >Reporter: Genmao Yu >Priority: Major > > My streaming job will fail *due to executor failures all available nodes are > blacklisted*. This exception is thrown only when all nodes are blacklisted: > {code:java} > def isAllNodeBlacklisted: Boolean = currentBlacklistedYarnNodes.size >= > numClusterNodes > val allBlacklistedNodes = excludeNodes ++ schedulerBlacklist ++ > allocatorBlacklist.keySet > {code} > After diving into the code, I found some critical conditions that are not handled > properly: > - unchecked `excludeNodes`: it comes from user config. If not set properly, > it may lead to "currentBlacklistedYarnNodes.size >= numClusterNodes". For > example, we may set some nodes that are not in the Yarn cluster.
> {code:java} > excludeNodes = (invalid1, invalid2, invalid3) > clusterNodes = (valid1, valid2) > {code} > - `numClusterNodes` may equal 0: when YARN HA fails over, it takes some > time for all NodeManagers to register with the ResourceManager again. In this case, > `numClusterNodes` may equal 0 or some other small number, and the Spark driver fails. > - too strong a condition check: the Spark driver will fail as long as > "currentBlacklistedYarnNodes.size >= numClusterNodes". This condition does not > necessarily indicate an unrecoverable fatal error. For example, some NodeManagers may > just be restarting, so we could allow some waiting time before failing the job. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
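To make the failure mode discussed in this thread concrete, here is a minimal standalone Scala sketch of the size-versus-node-count check. This is a simplified model, not Spark's actual YARN allocator code; the object and method names are illustrative only.

```scala
// Simplified model of the blacklist check discussed above; this is NOT
// Spark's actual YARN allocator code, just an illustration of the condition.
object BlacklistModel {
  // The check from SPARK-16630: the application is failed once every
  // known cluster node is considered blacklisted.
  def isAllNodeBlacklisted(blacklistedNodes: Set[String], numClusterNodes: Int): Boolean =
    blacklistedNodes.size >= numClusterNodes

  def main(args: Array[String]): Unit = {
    // Two NodeManagers; N1 is blacklisted after scheduling failures.
    val blacklisted = Set("N1")
    assert(!isAllNodeBlacklisted(blacklisted, numClusterNodes = 2))

    // N1 is later decommissioned, so the cluster shrinks to one node.
    // The stale blacklist entry now satisfies size >= numClusterNodes,
    // and the ApplicationMaster would fail the job even though the
    // remaining NodeManager (N2) is perfectly healthy.
    assert(isAllNodeBlacklisted(blacklisted, numClusterNodes = 1))
    println("check trips after cluster shrink despite a healthy node")
  }
}
```

The sketch shows why the check is sensitive to numClusterNodes shrinking: the blacklist set is never reconciled against the nodes that still exist, so a stale entry for a removed node can satisfy the condition on its own.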
[jira] [Commented] (SPARK-33259) Joining 3 streams results in incorrect output
[ https://issues.apache.org/jira/browse/SPARK-33259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223925#comment-17223925 ] Apache Spark commented on SPARK-33259: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/30210 > Joining 3 streams results in incorrect output > - > > Key: SPARK-33259 > URL: https://issues.apache.org/jira/browse/SPARK-33259 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.1 >Reporter: Michael >Priority: Critical > > I encountered an issue with Structured Streaming when doing a ((A LEFT JOIN > B) INNER JOIN C) operation. Below you can see the example code I [posted on > Stackoverflow|https://stackoverflow.com/questions/64503539/]... > I created a minimal example of "sessions" that have "start" and "end" events > and optionally some "metadata". > The script generates two outputs: {{sessionStartsWithMetadata}} results from > "start" events that are left-joined with the "metadata" events, based on > {{sessionId}}. A "left join" is used, since we'd like to get an output event > even when no corresponding metadata exists. > Additionally, a DataFrame {{endedSessionsWithMetadata}} is created by joining > "end" events to the previously created DataFrame. Here an "inner join" is > used, since we only want output once a session has definitely ended. 
> This code can be executed in {{spark-shell}}: > {code:scala} > import java.sql.Timestamp > import org.apache.spark.sql.execution.streaming.{MemoryStream, > StreamingQueryWrapper} > import org.apache.spark.sql.streaming.StreamingQuery > import org.apache.spark.sql.{DataFrame, SQLContext} > import org.apache.spark.sql.functions.{col, expr, lit} > import spark.implicits._ > implicit val sqlContext: SQLContext = spark.sqlContext > // Main data processing, regardless whether batch or stream processing > def process( > sessionStartEvents: DataFrame, > sessionOptionalMetadataEvents: DataFrame, > sessionEndEvents: DataFrame > ): (DataFrame, DataFrame) = { > val sessionStartsWithMetadata: DataFrame = sessionStartEvents > .join( > sessionOptionalMetadataEvents, > sessionStartEvents("sessionId") === > sessionOptionalMetadataEvents("sessionId") && > sessionStartEvents("sessionStartTimestamp").between( > > sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp").minus(expr(s"INTERVAL > 1 seconds")), > > sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp").plus(expr(s"INTERVAL > 1 seconds")) > ), > "left" // metadata is optional > ) > .select( > sessionStartEvents("sessionId"), > sessionStartEvents("sessionStartTimestamp"), > sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp") > ) > val endedSessionsWithMetadata = sessionStartsWithMetadata.join( > sessionEndEvents, > sessionStartsWithMetadata("sessionId") === sessionEndEvents("sessionId") > && > sessionStartsWithMetadata("sessionStartTimestamp").between( > sessionEndEvents("sessionEndTimestamp").minus(expr(s"INTERVAL 10 > seconds")), > sessionEndEvents("sessionEndTimestamp") > ) > ) > (sessionStartsWithMetadata, endedSessionsWithMetadata) > } > def streamProcessing( > sessionStartData: Seq[(Timestamp, Int)], > sessionOptionalMetadata: Seq[(Timestamp, Int)], > sessionEndData: Seq[(Timestamp, Int)] > ): (StreamingQuery, StreamingQuery) = { > val sessionStartEventsStream: 
MemoryStream[(Timestamp, Int)] = > MemoryStream[(Timestamp, Int)] > sessionStartEventsStream.addData(sessionStartData) > val sessionStartEvents: DataFrame = sessionStartEventsStream > .toDS() > .toDF("sessionStartTimestamp", "sessionId") > .withWatermark("sessionStartTimestamp", "1 second") > val sessionOptionalMetadataEventsStream: MemoryStream[(Timestamp, Int)] = > MemoryStream[(Timestamp, Int)] > sessionOptionalMetadataEventsStream.addData(sessionOptionalMetadata) > val sessionOptionalMetadataEvents: DataFrame = > sessionOptionalMetadataEventsStream > .toDS() > .toDF("sessionOptionalMetadataTimestamp", "sessionId") > .withWatermark("sessionOptionalMetadataTimestamp", "1 second") > val sessionEndEventsStream: MemoryStream[(Timestamp, Int)] = > MemoryStream[(Timestamp, Int)] > sessionEndEventsStream.addData(sessionEndData) > val sessionEndEvents: DataFrame = sessionEndEventsStream > .toDS() > .toDF("sessionEndTimestamp", "sessionId") > .withWatermark("sessionEndTimestamp", "1 second") > val (sessionStartsWithMetadata, endedSessionsWithMetadata) = > process(sessionStartEvents, sessionOptionalMetadataEvents, > sessionEndEvents) > val sessionStartsWithMetadataQuery = sessi
[jira] [Assigned] (SPARK-33259) Joining 3 streams results in incorrect output
[ https://issues.apache.org/jira/browse/SPARK-33259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33259: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-33259) Joining 3 streams results in incorrect output
[ https://issues.apache.org/jira/browse/SPARK-33259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223924#comment-17223924 ] Apache Spark commented on SPARK-33259: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/30210
[jira] [Assigned] (SPARK-33259) Joining 3 streams results in incorrect output
[ https://issues.apache.org/jira/browse/SPARK-33259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33259: Assignee: Apache Spark
[jira] [Closed] (SPARK-32661) Spark executors on K8S do not request extra memory for off-heap allocations
[ https://issues.apache.org/jira/browse/SPARK-32661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-32661. - > Spark executors on K8S do not request extra memory for off-heap allocations > --- > > Key: SPARK-32661 > URL: https://issues.apache.org/jira/browse/SPARK-32661 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.0.0, 3.0.1, 3.1.0 >Reporter: Luca Canali >Priority: Minor > > Off-heap memory allocations are configured using > `spark.memory.offHeap.enabled=true` and `conf > spark.memory.offHeap.size=`. Spark on YARN adds the off-heap memory > size to the executor container resources. Spark on Kubernetes does not > request the allocation of the off-heap memory. Currently, this can be worked > around by using spark.executor.memoryOverhead to reserve memory for off-heap > allocations. This proposes making Spark on Kubernetes behave as it does on > YARN, that is, adding spark.memory.offHeap.size to the memory request for > executor containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32661) Spark executors on K8S do not request extra memory for off-heap allocations
[ https://issues.apache.org/jira/browse/SPARK-32661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-32661. --- Resolution: Duplicate This is superseded by SPARK-33288. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
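The memoryOverhead workaround mentioned in the SPARK-32661 description can be sketched numerically. The sketch below is illustrative only: the object and helper names are invented here, and the max(10%, 384 MiB) default overhead formula is an assumption based on Spark's documented defaults, not the actual Kubernetes resource-request code.

```scala
// Illustrative sketch (not Spark's actual Kubernetes resource code) of the
// workaround above: folding spark.memory.offHeap.size into
// spark.executor.memoryOverhead so the container request covers it.
object K8sMemoryWorkaround {
  // Total container memory request = heap + overhead (all values in MiB).
  def containerRequestMiB(executorMemoryMiB: Int, memoryOverheadMiB: Int): Int =
    executorMemoryMiB + memoryOverheadMiB

  def main(args: Array[String]): Unit = {
    val executorMemory = 4096   // spark.executor.memory=4g
    val offHeapSize    = 1024   // spark.memory.offHeap.size=1g
    // Assumed default overhead: max(10% of executor memory, 384 MiB).
    val defaultOverhead = math.max((executorMemory * 0.10).toInt, 384)
    // Workaround: explicitly add the off-heap size to the overhead,
    // since the K8s backend does not add it to the request itself.
    val overhead = defaultOverhead + offHeapSize
    println(containerRequestMiB(executorMemory, overhead))
  }
}
```

With these illustrative numbers the container request grows from 4096 + 409 to 4096 + 1433 MiB, which is exactly the extra room the off-heap allocation needs.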
[jira] [Closed] (SPARK-26888) Upgrade to Log4j 2.x
[ https://issues.apache.org/jira/browse/SPARK-26888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-26888. - > Upgrade to Log4j 2.x > > > Key: SPARK-26888 > URL: https://issues.apache.org/jira/browse/SPARK-26888 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Priority: Major > > 5 August 2015 --The Apache Logging Services™ Project Management Committee > (PMC) has announced that the Log4j™ 1.x logging framework has reached its end > of life (EOL) and is no longer officially supported. > Please see the related blog post: > https://blogs.apache.org/foundation/entry/apache_logging_services_project_announces > Migration guide how to migrate from Log4j 1.x to 2.x: > https://logging.apache.org/log4j/2.x/manual/migration.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33261) Allow people to extend the pod feature steps
[ https://issues.apache.org/jira/browse/SPARK-33261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-33261: -- Parent: SPARK-33005 Issue Type: Sub-task (was: Improvement) > Allow people to extend the pod feature steps > > > Key: SPARK-33261 > URL: https://issues.apache.org/jira/browse/SPARK-33261 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.1.0 >Reporter: Holden Karau >Priority: Major > > While we allow people to specify pod templates, some deployments could > benefit from being able to add a feature step. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33098) Explicit or implicit casts involving partition columns can sometimes result in a MetaException.
[ https://issues.apache.org/jira/browse/SPARK-33098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223812#comment-17223812 ] Apache Spark commented on SPARK-33098: -- User 'bersprockets' has created a pull request for this issue: https://github.com/apache/spark/pull/30207 > Explicit or implicit casts involving partition columns can sometimes result > in a MetaException. > --- > > Key: SPARK-33098 > URL: https://issues.apache.org/jira/browse/SPARK-33098 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Bruce Robbins >Priority: Major > > The following cases throw > {{MetaException(message:Filtering is supported only on partition keys of type > string)}} > {noformat} > sql("create table test (a int) partitioned by (b int) stored as parquet") > sql("insert into test values (1, 1), (1, 2), (2, 2)") > // These throw MetaExceptions > sql("select * from test where b in ('2')").show(false) > sql("select * from test where cast(b as string) = '2'").show(false) > sql("select * from test where cast(b as string) in ('2')").show(false) > sql("select * from test where cast(b as string) in (2)").show(false) > sql("select cast(b as string) as b from test where b in ('2')").show(false) > sql("select cast(b as string) as b from test").filter("b = '2'").show(false) > // [1] > sql("select cast(b as string) as b from test").filter("b in (2)").show(false) > // [2] > sql("select cast(b as string) as b from test").filter("b in > ('2')").show(false) > sql("select * from test where cast(b as string) > '1'").show(false) > sql("select cast(b as string) b from test").filter("b > '1'").show(false) // > [3] > // [1] but not sql("select cast(b as string) as b from test where b = > '2'").show(false) > // [2] but not sql("select cast(b as string) as b from test where b in > (2)").show(false) > // [3] but not sql("select cast(b as string) b from test where b > > '1'").show(false) > {noformat} > The message ("Filtering is 
supported only on partition keys of type string") > is misleading. Filter *is* supported on integer columns, for example. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33262) Keep pending pods in account while scheduling new pods
[ https://issues.apache.org/jira/browse/SPARK-33262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223807#comment-17223807 ] Apache Spark commented on SPARK-33262: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/30205 > Keep pending pods in account while scheduling new pods > -- > > Key: SPARK-33262 > URL: https://issues.apache.org/jira/browse/SPARK-33262 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.1.0 >Reporter: Holden Karau >Assignee: Holden Karau >Priority: Major > Fix For: 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33261) Allow people to extend the pod feature steps
[ https://issues.apache.org/jira/browse/SPARK-33261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223806#comment-17223806 ] Apache Spark commented on SPARK-33261: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/30206 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33261) Allow people to extend the pod feature steps
[ https://issues.apache.org/jira/browse/SPARK-33261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33261: Assignee: (was: Apache Spark) > Allow people to extend the pod feature steps > > > Key: SPARK-33261 > URL: https://issues.apache.org/jira/browse/SPARK-33261 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.1.0 >Reporter: Holden Karau >Priority: Major > > While we allow people to specify pod templates, some deployments could > benefit from being able to add a feature step. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33261) Allow people to extend the pod feature steps
[ https://issues.apache.org/jira/browse/SPARK-33261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33261: Assignee: Apache Spark > Allow people to extend the pod feature steps > > > Key: SPARK-33261 > URL: https://issues.apache.org/jira/browse/SPARK-33261 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.1.0 >Reporter: Holden Karau >Assignee: Apache Spark >Priority: Major > > While we allow people to specify pod templates, some deployments could > benefit from being able to add a feature step. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33259) Joining 3 streams results in incorrect output
[ https://issues.apache.org/jira/browse/SPARK-33259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-33259: -- Priority: Critical (was: Major) > Joining 3 streams results in incorrect output > - > > Key: SPARK-33259 > URL: https://issues.apache.org/jira/browse/SPARK-33259 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.1 >Reporter: Michael >Priority: Critical > > I encountered an issue with Structured Streaming when doing a ((A LEFT JOIN > B) INNER JOIN C) operation. Below you can see example code I [posted on > Stackoverflow|https://stackoverflow.com/questions/64503539/]... > I created a minimal example of "sessions" that have "start" and "end" events > and optionally some "metadata". > The script generates two outputs: {{sessionStartsWithMetadata}} results from > "start" events that are left-joined with the "metadata" events, based on > {{sessionId}}. A "left join" is used, since we would like to get an output event > even when no corresponding metadata exists. > Additionally, a DataFrame {{endedSessionsWithMetadata}} is created by joining > "end" events to the previously created DataFrame. Here an "inner join" is > used, since we only want output when a session has ended for sure. 
> This code can be executed in {{spark-shell}}: > {code:scala} > import java.sql.Timestamp > import org.apache.spark.sql.execution.streaming.{MemoryStream, > StreamingQueryWrapper} > import org.apache.spark.sql.streaming.StreamingQuery > import org.apache.spark.sql.{DataFrame, SQLContext} > import org.apache.spark.sql.functions.{col, expr, lit} > import spark.implicits._ > implicit val sqlContext: SQLContext = spark.sqlContext > // Main data processing, regardless whether batch or stream processing > def process( > sessionStartEvents: DataFrame, > sessionOptionalMetadataEvents: DataFrame, > sessionEndEvents: DataFrame > ): (DataFrame, DataFrame) = { > val sessionStartsWithMetadata: DataFrame = sessionStartEvents > .join( > sessionOptionalMetadataEvents, > sessionStartEvents("sessionId") === > sessionOptionalMetadataEvents("sessionId") && > sessionStartEvents("sessionStartTimestamp").between( > > sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp").minus(expr(s"INTERVAL > 1 seconds")), > > sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp").plus(expr(s"INTERVAL > 1 seconds")) > ), > "left" // metadata is optional > ) > .select( > sessionStartEvents("sessionId"), > sessionStartEvents("sessionStartTimestamp"), > sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp") > ) > val endedSessionsWithMetadata = sessionStartsWithMetadata.join( > sessionEndEvents, > sessionStartsWithMetadata("sessionId") === sessionEndEvents("sessionId") > && > sessionStartsWithMetadata("sessionStartTimestamp").between( > sessionEndEvents("sessionEndTimestamp").minus(expr(s"INTERVAL 10 > seconds")), > sessionEndEvents("sessionEndTimestamp") > ) > ) > (sessionStartsWithMetadata, endedSessionsWithMetadata) > } > def streamProcessing( > sessionStartData: Seq[(Timestamp, Int)], > sessionOptionalMetadata: Seq[(Timestamp, Int)], > sessionEndData: Seq[(Timestamp, Int)] > ): (StreamingQuery, StreamingQuery) = { > val sessionStartEventsStream: 
MemoryStream[(Timestamp, Int)] = > MemoryStream[(Timestamp, Int)] > sessionStartEventsStream.addData(sessionStartData) > val sessionStartEvents: DataFrame = sessionStartEventsStream > .toDS() > .toDF("sessionStartTimestamp", "sessionId") > .withWatermark("sessionStartTimestamp", "1 second") > val sessionOptionalMetadataEventsStream: MemoryStream[(Timestamp, Int)] = > MemoryStream[(Timestamp, Int)] > sessionOptionalMetadataEventsStream.addData(sessionOptionalMetadata) > val sessionOptionalMetadataEvents: DataFrame = > sessionOptionalMetadataEventsStream > .toDS() > .toDF("sessionOptionalMetadataTimestamp", "sessionId") > .withWatermark("sessionOptionalMetadataTimestamp", "1 second") > val sessionEndEventsStream: MemoryStream[(Timestamp, Int)] = > MemoryStream[(Timestamp, Int)] > sessionEndEventsStream.addData(sessionEndData) > val sessionEndEvents: DataFrame = sessionEndEventsStream > .toDS() > .toDF("sessionEndTimestamp", "sessionId") > .withWatermark("sessionEndTimestamp", "1 second") > val (sessionStartsWithMetadata, endedSessionsWithMetadata) = > process(sessionStartEvents, sessionOptionalMetadataEvents, > sessionEndEvents) > val sessionStartsWithMetadataQuery = sessionStartsWithMetadata > .select(lit("sessionStartsWithMetadata"), col("*")) // Add label to see > which query's out
[jira] [Commented] (SPARK-33259) Joining 3 streams results in incorrect output
[ https://issues.apache.org/jira/browse/SPARK-33259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223779#comment-17223779 ] Dongjoon Hyun commented on SPARK-33259: --- Is there a way to block this at the analysis stage instead of giving an incorrect output? cc [~viirya] and [~dbtsai] > Joining 3 streams results in incorrect output > - > > Key: SPARK-33259 > URL: https://issues.apache.org/jira/browse/SPARK-33259 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.1 >Reporter: Michael >Priority: Major > > I encountered an issue with Structured Streaming when doing a ((A LEFT JOIN > B) INNER JOIN C) operation. Below you can see example code I [posted on > Stackoverflow|https://stackoverflow.com/questions/64503539/]... > I created a minimal example of "sessions" that have "start" and "end" events > and optionally some "metadata". > The script generates two outputs: {{sessionStartsWithMetadata}} results from > "start" events that are left-joined with the "metadata" events, based on > {{sessionId}}. A "left join" is used, since we would like to get an output event > even when no corresponding metadata exists. > Additionally, a DataFrame {{endedSessionsWithMetadata}} is created by joining > "end" events to the previously created DataFrame. Here an "inner join" is > used, since we only want output when a session has ended for sure. 
> This code can be executed in {{spark-shell}}: > {code:scala} > import java.sql.Timestamp > import org.apache.spark.sql.execution.streaming.{MemoryStream, > StreamingQueryWrapper} > import org.apache.spark.sql.streaming.StreamingQuery > import org.apache.spark.sql.{DataFrame, SQLContext} > import org.apache.spark.sql.functions.{col, expr, lit} > import spark.implicits._ > implicit val sqlContext: SQLContext = spark.sqlContext > // Main data processing, regardless whether batch or stream processing > def process( > sessionStartEvents: DataFrame, > sessionOptionalMetadataEvents: DataFrame, > sessionEndEvents: DataFrame > ): (DataFrame, DataFrame) = { > val sessionStartsWithMetadata: DataFrame = sessionStartEvents > .join( > sessionOptionalMetadataEvents, > sessionStartEvents("sessionId") === > sessionOptionalMetadataEvents("sessionId") && > sessionStartEvents("sessionStartTimestamp").between( > > sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp").minus(expr(s"INTERVAL > 1 seconds")), > > sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp").plus(expr(s"INTERVAL > 1 seconds")) > ), > "left" // metadata is optional > ) > .select( > sessionStartEvents("sessionId"), > sessionStartEvents("sessionStartTimestamp"), > sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp") > ) > val endedSessionsWithMetadata = sessionStartsWithMetadata.join( > sessionEndEvents, > sessionStartsWithMetadata("sessionId") === sessionEndEvents("sessionId") > && > sessionStartsWithMetadata("sessionStartTimestamp").between( > sessionEndEvents("sessionEndTimestamp").minus(expr(s"INTERVAL 10 > seconds")), > sessionEndEvents("sessionEndTimestamp") > ) > ) > (sessionStartsWithMetadata, endedSessionsWithMetadata) > } > def streamProcessing( > sessionStartData: Seq[(Timestamp, Int)], > sessionOptionalMetadata: Seq[(Timestamp, Int)], > sessionEndData: Seq[(Timestamp, Int)] > ): (StreamingQuery, StreamingQuery) = { > val sessionStartEventsStream: 
MemoryStream[(Timestamp, Int)] = > MemoryStream[(Timestamp, Int)] > sessionStartEventsStream.addData(sessionStartData) > val sessionStartEvents: DataFrame = sessionStartEventsStream > .toDS() > .toDF("sessionStartTimestamp", "sessionId") > .withWatermark("sessionStartTimestamp", "1 second") > val sessionOptionalMetadataEventsStream: MemoryStream[(Timestamp, Int)] = > MemoryStream[(Timestamp, Int)] > sessionOptionalMetadataEventsStream.addData(sessionOptionalMetadata) > val sessionOptionalMetadataEvents: DataFrame = > sessionOptionalMetadataEventsStream > .toDS() > .toDF("sessionOptionalMetadataTimestamp", "sessionId") > .withWatermark("sessionOptionalMetadataTimestamp", "1 second") > val sessionEndEventsStream: MemoryStream[(Timestamp, Int)] = > MemoryStream[(Timestamp, Int)] > sessionEndEventsStream.addData(sessionEndData) > val sessionEndEvents: DataFrame = sessionEndEventsStream > .toDS() > .toDF("sessionEndTimestamp", "sessionId") > .withWatermark("sessionEndTimestamp", "1 second") > val (sessionStartsWithMetadata, endedSessionsWithMetadata) = > process(sessionStartEvents, sessionOptionalMetadataEvents, > sessionEndEvents) > val sessionStartsWithMet
[jira] [Commented] (SPARK-33301) Code grows beyond 64 KB when compiling large CaseWhen
[ https://issues.apache.org/jira/browse/SPARK-33301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223776#comment-17223776 ] Dongjoon Hyun commented on SPARK-33301: --- Thank you, [~yumwang]. > Code grows beyond 64 KB when compiling large CaseWhen > - > > Key: SPARK-33301 > URL: https://issues.apache.org/jira/browse/SPARK-33301 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > > How to reproduce this issue: > {code:sql} > CREATE TABLE TRANS_BASE (BRAND STRING, SUB_BRAND STRING, MODEL STRING, > AUCT_TITL STRING, PRODUCT_LINE STRING, MODELS STRING) USING PARQUET; > SELECT T.*, > CASE > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%SPARK%', > '%4D%') THEN 'SPARK 4D' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%SPARK%', > '%BASKETBALL%') THEN 'SPARK Basketball' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%SPARK%', > '%BASKETBALL%') THEN 'SPARK Basketball' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%EQT%') THEN > 'EQT' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%NMD%') THEN > 'NMD' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%NMD%') THEN > 'NMD' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%NMD%') THEN > 'NMD' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%NMD%') THEN > 'NMD' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%NMD%') THEN > 'NMD' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%NMD%') THEN > 'NMD' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%ORIGINALS%') > THEN 'Originals' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', > '%SPARK%') THEN 'Other SPARK' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', > '%SPARK%') THEN 'Other SPARK' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', > '%SPARK%') THEN 'Other SPARK' > WHEN Upper(BRAND) = 'SPARK' AND 
Upper(MODEL) LIKE ALL ('%OTHER%', > '%SPARK%') THEN 'Other SPARK' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', > '%SPARK%') THEN 'Other SPARK' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', > '%SPARK%') THEN 'Other SPARK' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', > '%SPARK%') THEN 'Other SPARK' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', > '%SPARK%') THEN 'Other SPARK' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', > '%SPARK%') THEN 'Other SPARK' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', > '%SPARK%') THEN 'Other SPARK' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', > '%SPARK%') THEN 'Other SPARK' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', > '%SPARK%') THEN 'Other SPARK' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', > '%SPARK%') THEN 'Other SPARK' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', > '%SPARK%') THEN 'Other SPARK' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%PHARELL%', > '%HUMAN%', '%RACE%') THEN 'Pharell Human Race' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%PHARELL%', > '%HUMAN%', '%RACE%') THEN 'Pharell Human Race' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%PHARELL%', > '%HUMAN%', '%RACE%') THEN 'Pharell Human Race' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%PROPHERE%') > THEN 'Prophere' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%ULTRA%', > '%BOOST%') THEN 'Ultra Boost' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%ULTRA%', > '%BOOST%') THEN 'Ultra Boost' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%ULTRA%', > '%BOOST%') THEN 'Ultra Boost' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%ULTRA%', > '%BOOST%') THEN 'Ultra Boost' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%ULTRA%', > '%BOOST%') THEN 
'Ultra Boost' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%YUNG%', > '%1%') THEN 'Yung 1' > WHEN Upper(BRAND) = 'HIVE' AND Upper(MODEL) LIKE ALL ('%AIR%', > '%FORCE%', '%1%') THEN 'Air Force 1' > WHEN Upper(BRAND) = 'HIVE' AND Upper(MODEL) LIKE ALL ('%AIR%', > '%FORCE%', '%1%') THEN 'Air Force 1' > WHEN Upper(BRAND) = 'HIVE' AND Upper(MODEL) LIKE ALL ('%AIR%', > '%FORCE%', '%1%') THEN 'Air Force 1' > WHEN Upper(BRAND) = 'HIVE' AND Upper(MODEL) LIKE
[jira] [Closed] (SPARK-33279) Spark 3.0 failure due to lack of Arrow dependency
[ https://issues.apache.org/jira/browse/SPARK-33279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-33279. - > Spark 3.0 failure due to lack of Arrow dependency > - > > Key: SPARK-33279 > URL: https://issues.apache.org/jira/browse/SPARK-33279 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Liya Fan >Priority: Major > > A recent change in Arrow has split the arrow-memory module into three, so client > code must add a dependency on arrow-memory-netty (or arrow-memory-unsafe). > This has been done in the master branch of Spark, but not in the branch-3.0 > branch; this is causing the build in branch-3.0 to fail > (https://github.com/ursa-labs/crossbow/actions?query=branch:actions-681-github-test-conda-python-3.7-spark-branch-3.0) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33280) Spark 3.0 serialization issue
[ https://issues.apache.org/jira/browse/SPARK-33280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223762#comment-17223762 ] L. C. Hsieh commented on SPARK-33280: - Yeah, I think you can work around this with a {{lazy}} or {{transient}} on {{sc}}. It seems like ClosureCleaner cannot clean up {{sc}} in {{trait Stage}}. You can also try to inline the {{trait}} into {{SampleStage}}. > Spark 3.0 serialization issue > - > > Key: SPARK-33280 > URL: https://issues.apache.org/jira/browse/SPARK-33280 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.1 > Environment: OS: MacOS Catalina 10.15.7 > +Environment 1:+ > Spark: spark-3.0.1-bin-hadoop2.7 > code compiled with: scala 2.12 > +Environment 2:+ > spark-2.4.7-bin-hadoop2.7 > code compiled with: scala 2.12 > +Environment 3 (This succeeds):+ > spark-2.4.7-bin-hadoop2.7 > code compiled with: scala 2.11 > Reproducible on YARN cluster with 3.0.1 >Reporter: Kiran Kumar Joseph >Priority: Major > > This code, which does a simple RDD map operation and saves the data, fails on > the executor with a deserialization error. 
> {noformat} > package test.pkg > import org.apache.spark.SparkContext > import org.apache.spark.rdd.RDD > trait Stage { > val sc = SparkContext.getOrCreate() > sc.setCheckpointDir(sc.getConf.get("spark.checkpoint.dir", > "/tmp/checkpoints")) > val app = new App(sc) > } > class App(sc: SparkContext) { > def write(data: RDD[String]): Unit = { > if (!data.isEmpty()) > data.saveAsTextFile("/tmp/test-output") > } > } > object SampleStage extends Stage { > def main(args: Array[String]): Unit = { > val sampleList = Seq("a", "b", "c") > writeList(sampleList) > sc.stop() > } > def writeList(data: Seq[String]) = { > val dataRDD = sc.parallelize(data).map(x => x.toLowerCase()) > app.write(dataRDD) > } > } {noformat} > > If I change the line > {noformat} > val dataRDD = sc.parallelize(data).map(x => x.toLowerCase()) {noformat} > to > {noformat} > val dataRDD = sc.parallelize(data) {noformat} > there is no exception and job succeeds > Sparksubmit command: > {noformat} > $SPARK_HOME/bin/spark-submit --master spark://masterhost:port --conf > spark.app.name=SampleStage --executor-memory 2g --num-executors 2 --class > test.pkg.SampleStage --deploy-mode client --conf spark.driver.cores=4 > sampleStage.jar {noformat} > > Executor stacktrace: > {noformat} > Driver stacktrace: > 20/10/28 22:55:34 INFO DAGScheduler: Job 0 failed: isEmpty at App.scala:13, > took 1.931078 s > Exception in thread "main" org.apache.spark.SparkException: Job aborted due > to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: > Lost task 0.3 in stage 0.0 (TID 3, 192.168.86.20, executor 0): > java.lang.NoClassDefFoundError: test.pkg.SampleStage$ (initialization failure) > at > java.lang.J9VMInternals.initializationAlreadyFailed(J9VMInternals.java:98) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
> at java.lang.reflect.Method.invoke(Method.java:498) > at > java.lang.invoke.SerializedLambda.readResolve(SerializedLambda.java:230) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1274) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2258) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1746) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2472) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2394) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2247) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1746) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2472) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2394) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2247) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1746) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2472) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2394) > at > java.io.Objec
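A minimal sketch of the workaround suggested in the comment above (marking {{sc}} and {{app}} as {{@transient lazy}} so the executor re-resolves the context instead of deserializing it with the closure). This is an illustrative rewrite of the reporter's {{trait Stage}}, not a verified fix:

```scala
import org.apache.spark.SparkContext

// Sketch of the suggested workaround: lazy + transient fields in the trait
// keep the SparkContext out of the serialized lambda; each JVM resolves
// its own context via getOrCreate() on first access.
trait Stage extends Serializable {
  @transient lazy val sc: SparkContext = SparkContext.getOrCreate()
  @transient lazy val app: App = new App(sc)
}
```

Alternatively, as also suggested, the trait body can simply be inlined into the {{SampleStage}} object.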
[jira] [Commented] (SPARK-33279) Spark 3.0 failure due to lack of Arrow dependency
[ https://issues.apache.org/jira/browse/SPARK-33279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223761#comment-17223761 ] Dongjoon Hyun commented on SPARK-33279: --- Thank you, [~fan_li_ya] and [~hyukjin.kwon]. > Spark 3.0 failure due to lack of Arrow dependency > - > > Key: SPARK-33279 > URL: https://issues.apache.org/jira/browse/SPARK-33279 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Liya Fan >Priority: Major > > A recent change in Arrow has split the arrow-memory module into three, so client > code must add a dependency on arrow-memory-netty (or arrow-memory-unsafe). > This has been done in the master branch of Spark, but not in the branch-3.0 > branch; this is causing the build in branch-3.0 to fail > (https://github.com/ursa-labs/crossbow/actions?query=branch:actions-681-github-test-conda-python-3.7-spark-branch-3.0) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33295) Upgrade ORC to 1.6.6
[ https://issues.apache.org/jira/browse/SPARK-33295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-33295: -- Summary: Upgrade ORC to 1.6.6 (was: Upgrade ORC to 1.6.1) > Upgrade ORC to 1.6.6 > - > > Key: SPARK-33295 > URL: https://issues.apache.org/jira/browse/SPARK-33295 > Project: Spark > Issue Type: New Feature > Components: Build >Affects Versions: 3.1.0 >Reporter: Jaanai Zhang >Priority: Major > Fix For: 3.1.0 > > > support zstd compression algorithm for ORC format -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33295) Upgrade ORC to 1.6.1
[ https://issues.apache.org/jira/browse/SPARK-33295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223754#comment-17223754 ] Dongjoon Hyun commented on SPARK-33295: --- Until ORC 1.6.5, it's technically infeasible. > Upgrade ORC to 1.6.1 > - > > Key: SPARK-33295 > URL: https://issues.apache.org/jira/browse/SPARK-33295 > Project: Spark > Issue Type: New Feature > Components: Build >Affects Versions: 3.1.0 >Reporter: Jaanai Zhang >Priority: Major > Fix For: 3.1.0 > > > support zstd compression algorithm for ORC format -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33295) Upgrade ORC to 1.6.1
[ https://issues.apache.org/jira/browse/SPARK-33295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223753#comment-17223753 ] Dongjoon Hyun commented on SPARK-33295: --- Hi, [~jaanai]. Yes. I've been working on that and it's ready for the Apache Spark `sql/core` module. - ORC-669 - ORC-670 - ORC-671 - ORC-676 - ORC-677 However, it turns out that Apache ORC still breaks many things in Apache Hive, so it doesn't work with the Apache Spark `sql/hive` and `sql/hive-thriftserver` modules. > Upgrade ORC to 1.6.1 > - > > Key: SPARK-33295 > URL: https://issues.apache.org/jira/browse/SPARK-33295 > Project: Spark > Issue Type: New Feature > Components: Build >Affects Versions: 3.1.0 >Reporter: Jaanai Zhang >Priority: Major > Fix For: 3.1.0 > > > support zstd compression algorithm for ORC format -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33290) REFRESH TABLE should invalidate cache even though the table itself may not be cached
[ https://issues.apache.org/jira/browse/SPARK-33290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-33290: -- Issue Type: Bug (was: Improvement) > REFRESH TABLE should invalidate cache even though the table itself may not be > cached > > > Key: SPARK-33290 > URL: https://issues.apache.org/jira/browse/SPARK-33290 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: Chao Sun >Priority: Major > > For the following example: > {code} > CREATE TABLE t ...; > CREATE VIEW t1 AS SELECT * FROM t; > REFRESH TABLE t > {code} > If t is cached, t1 will be invalidated. However, if t is not cached as above, > the REFRESH command won't invalidate view t1. This could lead to incorrect > results if the view is used later. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33290) REFRESH TABLE should invalidate cache even though the table itself may not be cached
[ https://issues.apache.org/jira/browse/SPARK-33290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-33290: -- Labels: correctness (was: ) > REFRESH TABLE should invalidate cache even though the table itself may not be > cached > > > Key: SPARK-33290 > URL: https://issues.apache.org/jira/browse/SPARK-33290 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: Chao Sun >Priority: Major > Labels: correctness > > For the following example: > {code} > CREATE TABLE t ...; > CREATE VIEW t1 AS SELECT * FROM t; > REFRESH TABLE t > {code} > If t is cached, t1 will be invalidated. However, if t is not cached as above, > the REFRESH command won't invalidate view t1. This could lead to incorrect > results if the view is used later. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
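The scenario in the description can be sketched as a spark-shell session (table and view names are the report's own; the specific column and the CACHE statement are illustrative assumptions, since the report only states that the view's cache should be invalidated):

```scala
// Sketch of the reported scenario, assuming t1's result is cached while
// t itself is not.
spark.sql("CREATE TABLE t (id INT) USING parquet")
spark.sql("CREATE VIEW t1 AS SELECT * FROM t")
spark.sql("CACHE TABLE t1")           // cache the view, not the base table

// ... t's underlying files change outside Spark ...

spark.sql("REFRESH TABLE t")          // per this report, t1's cache survives
spark.sql("SELECT * FROM t1").show()  // may serve stale cached rows
```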
[jira] [Commented] (SPARK-32661) Spark executors on K8S do not request extra memory for off-heap allocations
[ https://issues.apache.org/jira/browse/SPARK-32661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223727#comment-17223727 ] Apache Spark commented on SPARK-32661: -- User 'tgravescs' has created a pull request for this issue: https://github.com/apache/spark/pull/30204 > Spark executors on K8S do not request extra memory for off-heap allocations > --- > > Key: SPARK-32661 > URL: https://issues.apache.org/jira/browse/SPARK-32661 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.0.0, 3.0.1, 3.1.0 >Reporter: Luca Canali >Priority: Minor > > Off-heap memory allocations are configured using > `spark.memory.offHeap.enabled=true` and `conf > spark.memory.offHeap.size=`. Spark on YARN adds the off-heap memory > size to the executor container resources. Spark on Kubernetes does not > request the allocation of the off-heap memory. Currently, this can be worked > around by using spark.executor.memoryOverhead to reserve memory for off-heap > allocations. This proposes making Spark on Kubernetes behave as in the case of > YARN, that is, adding spark.memory.offHeap.size to the memory request for > executor containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
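The workaround described above can be sketched as a configuration fragment (the sizes are illustrative assumptions, not values from the report):

```scala
import org.apache.spark.SparkConf

// Illustrative sizes. On YARN, spark.memory.offHeap.size is added to the
// container request automatically; on K8s (before this change) it is not,
// so the same amount is reserved manually via executor memoryOverhead.
val conf = new SparkConf()
  .set("spark.executor.memory", "4g")
  .set("spark.memory.offHeap.enabled", "true")
  .set("spark.memory.offHeap.size", "2g")
  .set("spark.executor.memoryOverhead", "2g") // K8s workaround from the description
```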
[jira] [Assigned] (SPARK-33288) Support k8s cluster manager with stage level scheduling
[ https://issues.apache.org/jira/browse/SPARK-33288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33288: Assignee: Apache Spark > Support k8s cluster manager with stage level scheduling > --- > > Key: SPARK-33288 > URL: https://issues.apache.org/jira/browse/SPARK-33288 > Project: Spark > Issue Type: New Feature > Components: Kubernetes, Spark Core >Affects Versions: 3.1.0 >Reporter: Thomas Graves >Assignee: Apache Spark >Priority: Major > > Kubernetes supports dynamic allocation via the > {{spark.dynamicAllocation.shuffleTracking.enabled}} config; we can add > support for stage-level scheduling when that is turned on. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33288) Support k8s cluster manager with stage level scheduling
[ https://issues.apache.org/jira/browse/SPARK-33288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223723#comment-17223723 ] Apache Spark commented on SPARK-33288: -- User 'tgravescs' has created a pull request for this issue: https://github.com/apache/spark/pull/30204 > Support k8s cluster manager with stage level scheduling > --- > > Key: SPARK-33288 > URL: https://issues.apache.org/jira/browse/SPARK-33288 > Project: Spark > Issue Type: New Feature > Components: Kubernetes, Spark Core >Affects Versions: 3.1.0 >Reporter: Thomas Graves >Priority: Major > > Kubernetes supports dynamic allocation via the > {{spark.dynamicAllocation.shuffleTracking.enabled}} config; we can add > support for stage-level scheduling when that is turned on. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33288) Support k8s cluster manager with stage level scheduling
[ https://issues.apache.org/jira/browse/SPARK-33288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33288: Assignee: (was: Apache Spark) > Support k8s cluster manager with stage level scheduling > --- > > Key: SPARK-33288 > URL: https://issues.apache.org/jira/browse/SPARK-33288 > Project: Spark > Issue Type: New Feature > Components: Kubernetes, Spark Core >Affects Versions: 3.1.0 >Reporter: Thomas Graves >Priority: Major > > Kubernetes supports dynamic allocation via the > {{spark.dynamicAllocation.shuffleTracking.enabled}} config; we can add > support for stage-level scheduling when that is turned on. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33304) Add from_avro and to_avro functions to SparkR
Maciej Szymkiewicz created SPARK-33304: -- Summary: Add from_avro and to_avro functions to SparkR Key: SPARK-33304 URL: https://issues.apache.org/jira/browse/SPARK-33304 Project: Spark Issue Type: Improvement Components: R, SQL Affects Versions: 3.1.0 Reporter: Maciej Szymkiewicz {{from_avro}} and {{to_avro}} have been added to Scala / Java / Python in 3.0, but are still missing from the R API. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33303) Deduplicate deterministic UDF calls
[ https://issues.apache.org/jira/browse/SPARK-33303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223674#comment-17223674 ] Apache Spark commented on SPARK-33303: -- User 'peter-toth' has created a pull request for this issue: https://github.com/apache/spark/pull/30203 > Deduplicate deterministic UDF calls > --- > > Key: SPARK-33303 > URL: https://issues.apache.org/jira/browse/SPARK-33303 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Peter Toth >Priority: Major > > We ran into an issue where a customer created a column with an expensive UDF > call and built very complex logic on top of that column as new derived > columns. Due to the `CollapseProject` and `ExtractPythonUDFs` rules, the UDF > is called ~1000 times per row, which degraded query performance significantly. > The `ExtractPythonUDFs` rule could deduplicate deterministic UDFs to avoid > this performance degradation.
[jira] [Assigned] (SPARK-33303) Deduplicate deterministic UDF calls
[ https://issues.apache.org/jira/browse/SPARK-33303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33303: Assignee: Apache Spark > Deduplicate deterministic UDF calls > --- > > Key: SPARK-33303 > URL: https://issues.apache.org/jira/browse/SPARK-33303 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Peter Toth >Assignee: Apache Spark >Priority: Major > > We ran into an issue where a customer created a column with an expensive UDF > call and built very complex logic on top of that column as new derived > columns. Due to the `CollapseProject` and `ExtractPythonUDFs` rules, the UDF > is called ~1000 times per row, which degraded query performance significantly. > The `ExtractPythonUDFs` rule could deduplicate deterministic UDFs to avoid > this performance degradation.
[jira] [Assigned] (SPARK-33303) Deduplicate deterministic UDF calls
[ https://issues.apache.org/jira/browse/SPARK-33303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33303: Assignee: (was: Apache Spark) > Deduplicate deterministic UDF calls > --- > > Key: SPARK-33303 > URL: https://issues.apache.org/jira/browse/SPARK-33303 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Peter Toth >Priority: Major > > We ran into an issue where a customer created a column with an expensive UDF > call and built very complex logic on top of that column as new derived > columns. Due to the `CollapseProject` and `ExtractPythonUDFs` rules, the UDF > is called ~1000 times per row, which degraded query performance significantly. > The `ExtractPythonUDFs` rule could deduplicate deterministic UDFs to avoid > this performance degradation.
[jira] [Created] (SPARK-33303) Deduplicate deterministic UDF calls
Peter Toth created SPARK-33303: -- Summary: Deduplicate deterministic UDF calls Key: SPARK-33303 URL: https://issues.apache.org/jira/browse/SPARK-33303 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Peter Toth We ran into an issue where a customer created a column with an expensive UDF call and built very complex logic on top of that column as new derived columns. Due to the `CollapseProject` and `ExtractPythonUDFs` rules, the UDF is called ~1000 times per row, which degraded query performance significantly. The `ExtractPythonUDFs` rule could deduplicate deterministic UDFs to avoid this performance degradation.
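The deduplication proposed above can be illustrated outside Spark: when a deterministic function feeds several derived columns, the plan can evaluate it once per row and reuse the result instead of re-invoking it per column. A minimal pure-Python sketch of that idea (hypothetical names, not the actual `ExtractPythonUDFs` implementation):

```python
# Sketch of deduplicating deterministic function calls: identical calls over
# the same input are evaluated once and the result is shared, mirroring how
# ExtractPythonUDFs could reuse one UDF result across derived columns.

calls = {"count": 0}  # invocation counter to make the difference visible

def expensive_udf(x):
    calls["count"] += 1
    return x * x  # stand-in for an expensive computation

def eval_row_naive(x):
    # Each derived column re-invokes the UDF: 3 calls per row.
    return (expensive_udf(x) + 1, expensive_udf(x) * 2, expensive_udf(x) - 3)

def eval_row_deduped(x):
    # The UDF is deterministic, so one call can feed every derived column.
    v = expensive_udf(x)
    return (v + 1, v * 2, v - 3)

calls["count"] = 0
naive = [eval_row_naive(x) for x in range(100)]
naive_calls = calls["count"]      # 300 invocations for 100 rows

calls["count"] = 0
deduped = [eval_row_deduped(x) for x in range(100)]
deduped_calls = calls["count"]    # 100 invocations for 100 rows

assert naive == deduped           # same results, 3x fewer UDF calls
print(naive_calls, deduped_calls)
```

With ~1000 duplicated call sites per row, as in the report, the same reuse turns ~1000 invocations per row into one.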
[jira] [Updated] (SPARK-32591) Add better api docs for stage level scheduling Resources
[ https://issues.apache.org/jira/browse/SPARK-32591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-32591: -- Description: A question came up when we added offheap memory to the resources settable in a ResourceProfile's executor resources. [https://github.com/apache/spark/pull/28972/] Based on that discussion we should add better api docs to explain what each one does. Perhaps point to the corresponding configuration. Also specify that the default config is used if not specified in the profiles, for those that fall back. was: A question came up when we added offheap memory to be able to set in a ResourceProfile executor resources. [https://github.com/apache/spark/pull/28972/] Based on that discussion we should add better api docs to explain what each one does. Perhaps point to the corresponding configuration. > Add better api docs for stage level scheduling Resources > > > Key: SPARK-32591 > URL: https://issues.apache.org/jira/browse/SPARK-32591 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.1.0 >Reporter: Thomas Graves >Priority: Major > > A question came up when we added offheap memory to the resources settable in > a ResourceProfile's executor resources. > [https://github.com/apache/spark/pull/28972/] > Based on that discussion we should add better api docs to explain what each > one does. Perhaps point to the corresponding configuration. Also specify > that the default config is used if not specified in the profiles, for those > that fall back.
[jira] [Updated] (SPARK-33302) Failed to push filters through Expand
[ https://issues.apache.org/jira/browse/SPARK-33302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-33302: Issue Type: Improvement (was: Bug) > Failed to push filters through Expand > - > > Key: SPARK-33302 > URL: https://issues.apache.org/jira/browse/SPARK-33302 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.4, 3.0.1, 3.1.0 >Reporter: Yuming Wang >Priority: Major > > How to reproduce this issue: > {code:sql} > create table SPARK_33302_1(pid int, uid int, sid int, dt date, suid int) > using parquet; > create table SPARK_33302_2(pid int, vs int, uid int, csid int) using parquet; > SELECT >years, >appversion, >SUM(uusers) AS users > FROM (SELECT >Date_trunc('year', dt) AS years, >CASE > WHEN h.pid = 3 THEN 'iOS' > WHEN h.pid = 4 THEN 'Android' > ELSE 'Other' >END AS viewport, >h.vsAS appversion, >Count(DISTINCT u.uid) AS uusers >,Count(DISTINCT u.suid) AS srcusers > FROM SPARK_33302_1 u >join SPARK_33302_2 h > ON h.uid = u.uid > GROUP BY 1, > 2, > 3) AS a > WHERE viewport = 'iOS' > GROUP BY 1, > 2 > {code} > {noformat} > == Physical Plan == > *(5) HashAggregate(keys=[years#30, appversion#32], > functions=[sum(uusers#33L)]) > +- Exchange hashpartitioning(years#30, appversion#32, 200), true, [id=#251] >+- *(4) HashAggregate(keys=[years#30, appversion#32], > functions=[partial_sum(uusers#33L)]) > +- *(4) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS > TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN > 'Android' ELSE 'Other' END#46, vs#12], functions=[count(if ((gid#44 = 1)) > u.`uid`#47 else null)]) > +- Exchange hashpartitioning(date_trunc('year', CAST(u.`dt` AS > TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN > 'Android' ELSE 'Other' END#46, vs#12, 200), true, [id=#246] > +- *(3) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS > TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN > 'Android' ELSE 
'Other' END#46, vs#12], functions=[partial_count(if ((gid#44 = > 1)) u.`uid`#47 else null)]) >+- *(3) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS > TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN > 'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44], > functions=[]) > +- Exchange hashpartitioning(date_trunc('year', CAST(u.`dt` > AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN > 'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44, 200), > true, [id=#241] > +- *(2) HashAggregate(keys=[date_trunc('year', > CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN > (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47, > u.`suid`#48, gid#44], functions=[]) > +- *(2) Filter (CASE WHEN (h.`pid` = 3) THEN 'iOS' > WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46 = iOS) >+- *(2) Expand [ArrayBuffer(date_trunc(year, > cast(dt#9 as timestamp), Some(Etc/GMT+7)), CASE WHEN (pid#11 = 3) THEN iOS > WHEN (pid#11 = 4) THEN Android ELSE Other END, vs#12, uid#7, null, 1), > ArrayBuffer(date_trunc(year, cast(dt#9 as timestamp), Some(Etc/GMT+7)), CASE > WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END, > vs#12, null, suid#10, 2)], [date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, > CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE > 'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44] > +- *(2) Project [uid#7, dt#9, suid#10, pid#11, > vs#12] > +- *(2) BroadcastHashJoin [uid#7], [uid#13], > Inner, BuildRight > :- *(2) Project [uid#7, dt#9, suid#10] > : +- *(2) Filter isnotnull(uid#7) >
[jira] [Updated] (SPARK-33302) Failed to push filters through Expand
[ https://issues.apache.org/jira/browse/SPARK-33302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-33302: Description: How to reproduce this issue: {code:sql} create table SPARK_33302_1(pid int, uid int, sid int, dt date, suid int) using parquet; create table SPARK_33302_2(pid int, vs int, uid int, csid int) using parquet; SELECT years, appversion, SUM(uusers) AS users FROM (SELECT Date_trunc('year', dt) AS years, CASE WHEN h.pid = 3 THEN 'iOS' WHEN h.pid = 4 THEN 'Android' ELSE 'Other' END AS viewport, h.vsAS appversion, Count(DISTINCT u.uid) AS uusers ,Count(DISTINCT u.suid) AS srcusers FROM SPARK_33302_1 u join SPARK_33302_2 h ON h.uid = u.uid GROUP BY 1, 2, 3) AS a WHERE viewport = 'iOS' GROUP BY 1, 2 {code} {noformat} == Physical Plan == *(5) HashAggregate(keys=[years#30, appversion#32], functions=[sum(uusers#33L)]) +- Exchange hashpartitioning(years#30, appversion#32, 200), true, [id=#251] +- *(4) HashAggregate(keys=[years#30, appversion#32], functions=[partial_sum(uusers#33L)]) +- *(4) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12], functions=[count(if ((gid#44 = 1)) u.`uid`#47 else null)]) +- Exchange hashpartitioning(date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12, 200), true, [id=#246] +- *(3) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12], functions=[partial_count(if ((gid#44 = 1)) u.`uid`#47 else null)]) +- *(3) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44], functions=[]) +- Exchange hashpartitioning(date_trunc('year', 
CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44, 200), true, [id=#241] +- *(2) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44], functions=[]) +- *(2) Filter (CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46 = iOS) +- *(2) Expand [ArrayBuffer(date_trunc(year, cast(dt#9 as timestamp), Some(Etc/GMT+7)), CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END, vs#12, uid#7, null, 1), ArrayBuffer(date_trunc(year, cast(dt#9 as timestamp), Some(Etc/GMT+7)), CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END, vs#12, null, suid#10, 2)], [date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44] +- *(2) Project [uid#7, dt#9, suid#10, pid#11, vs#12] +- *(2) BroadcastHashJoin [uid#7], [uid#13], Inner, BuildRight :- *(2) Project [uid#7, dt#9, suid#10] : +- *(2) Filter isnotnull(uid#7) : +- *(2) ColumnarToRow :+- FileScan parquet default.spark_33301_1[uid#7,dt#9,suid#10] Batched: true, DataFilters: [isnotnull(uid#7)], Format: Parquet, Location: InMemoryFileIndex[file:/root/spark-3.0.0-bin-hadoop3.2/spark-warehouse/spark_33301_1], PartitionFilters: [], PushedFilters: [IsNotNull(uid)], ReadSchema: struct +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[2, int, true] as bigint))), [id=#233] +- *
[jira] [Updated] (SPARK-33302) Failed to push down filters through Expand
[ https://issues.apache.org/jira/browse/SPARK-33302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-33302: Summary: Failed to push down filters through Expand (was: Failed to push filters through Expand) > Failed to push down filters through Expand > -- > > Key: SPARK-33302 > URL: https://issues.apache.org/jira/browse/SPARK-33302 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.4, 3.0.1, 3.1.0 >Reporter: Yuming Wang >Priority: Major > > How to reproduce this issue: > {code:sql} > create table SPARK_33302_1(pid int, uid int, sid int, dt date, suid int) > using parquet; > create table SPARK_33302_2(pid int, vs int, uid int, csid int) using parquet; > SELECT >years, >appversion, >SUM(uusers) AS users > FROM (SELECT >Date_trunc('year', dt) AS years, >CASE > WHEN h.pid = 3 THEN 'iOS' > WHEN h.pid = 4 THEN 'Android' > ELSE 'Other' >END AS viewport, >h.vsAS appversion, >Count(DISTINCT u.uid) AS uusers >,Count(DISTINCT u.suid) AS srcusers > FROM SPARK_33302_1 u >join SPARK_33302_2 h > ON h.uid = u.uid > GROUP BY 1, > 2, > 3) AS a > WHERE viewport = 'iOS' > GROUP BY 1, > 2 > {code} > {noformat} > == Physical Plan == > *(5) HashAggregate(keys=[years#30, appversion#32], > functions=[sum(uusers#33L)]) > +- Exchange hashpartitioning(years#30, appversion#32, 200), true, [id=#251] >+- *(4) HashAggregate(keys=[years#30, appversion#32], > functions=[partial_sum(uusers#33L)]) > +- *(4) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS > TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN > 'Android' ELSE 'Other' END#46, vs#12], functions=[count(if ((gid#44 = 1)) > u.`uid`#47 else null)]) > +- Exchange hashpartitioning(date_trunc('year', CAST(u.`dt` AS > TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN > 'Android' ELSE 'Other' END#46, vs#12, 200), true, [id=#246] > +- *(3) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS > TIMESTAMP))#45, CASE 
WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN > 'Android' ELSE 'Other' END#46, vs#12], functions=[partial_count(if ((gid#44 = > 1)) u.`uid`#47 else null)]) >+- *(3) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS > TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN > 'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44], > functions=[]) > +- Exchange hashpartitioning(date_trunc('year', CAST(u.`dt` > AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN > 'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44, 200), > true, [id=#241] > +- *(2) HashAggregate(keys=[date_trunc('year', > CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN > (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47, > u.`suid`#48, gid#44], functions=[]) > +- *(2) Filter (CASE WHEN (h.`pid` = 3) THEN 'iOS' > WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46 = iOS) >+- *(2) Expand [ArrayBuffer(date_trunc(year, > cast(dt#9 as timestamp), Some(Etc/GMT+7)), CASE WHEN (pid#11 = 3) THEN iOS > WHEN (pid#11 = 4) THEN Android ELSE Other END, vs#12, uid#7, null, 1), > ArrayBuffer(date_trunc(year, cast(dt#9 as timestamp), Some(Etc/GMT+7)), CASE > WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END, > vs#12, null, suid#10, 2)], [date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, > CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE > 'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44] > +- *(2) Project [uid#7, dt#9, suid#10, pid#11, > vs#12] > +- *(2) BroadcastHashJoin [uid#7], [uid#13], > Inner, BuildRight > :- *(2) Project [uid#7, dt#9, suid#10] >
[jira] [Created] (SPARK-33302) Failed to push filters through Expand
Yuming Wang created SPARK-33302: --- Summary: Failed to push filters through Expand Key: SPARK-33302 URL: https://issues.apache.org/jira/browse/SPARK-33302 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.1, 2.3.4, 3.1.0 Reporter: Yuming Wang How to reproduce this issue: {code:sql} create table SPARK_33301_1(pid int, uid int, sid int, dt date, suid int) using parquet; create table SPARK_33301_2(pid int, vs int, uid int, csid int) using parquet; SELECT years, appversion, SUM(uusers) AS users FROM (SELECT Date_trunc('year', dt) AS years, CASE WHEN h.pid = 3 THEN 'iOS' WHEN h.pid = 4 THEN 'Android' ELSE 'Other' END AS viewport, h.vsAS appversion, Count(DISTINCT u.uid) AS uusers ,Count(DISTINCT u.suid) AS srcusers FROM SPARK_33301_1 u join SPARK_33301_2 h ON h.uid = u.uid GROUP BY 1, 2, 3) AS a WHERE viewport = 'iOS' GROUP BY 1, 2 {code} {noformat} == Physical Plan == *(5) HashAggregate(keys=[years#30, appversion#32], functions=[sum(uusers#33L)]) +- Exchange hashpartitioning(years#30, appversion#32, 200), true, [id=#251] +- *(4) HashAggregate(keys=[years#30, appversion#32], functions=[partial_sum(uusers#33L)]) +- *(4) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12], functions=[count(if ((gid#44 = 1)) u.`uid`#47 else null)]) +- Exchange hashpartitioning(date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12, 200), true, [id=#246] +- *(3) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12], functions=[partial_count(if ((gid#44 = 1)) u.`uid`#47 else null)]) +- *(3) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12, 
u.`uid`#47, u.`suid`#48, gid#44], functions=[]) +- Exchange hashpartitioning(date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44, 200), true, [id=#241] +- *(2) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44], functions=[]) +- *(2) Filter (CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46 = iOS) +- *(2) Expand [ArrayBuffer(date_trunc(year, cast(dt#9 as timestamp), Some(Etc/GMT+7)), CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END, vs#12, uid#7, null, 1), ArrayBuffer(date_trunc(year, cast(dt#9 as timestamp), Some(Etc/GMT+7)), CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END, vs#12, null, suid#10, 2)], [date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44] +- *(2) Project [uid#7, dt#9, suid#10, pid#11, vs#12] +- *(2) BroadcastHashJoin [uid#7], [uid#13], Inner, BuildRight :- *(2) Project [uid#7, dt#9, suid#10] : +- *(2) Filter isnotnull(uid#7) : +- *(2) ColumnarToRow :+- FileScan parquet default.spark_33301_1[uid#7,dt#9,suid#10] Batched: true, DataFilters: [isnotnull(uid#7)], Format: Parquet, Location: InMemoryFileIndex[file:/root/spark-3.0.0-bin-hadoop3.2/spark-warehouse/spark_33301_1], PartitionFilters: [], PushedFilters: [IsNotNull(uid)], ReadSchema: struct
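The pushdown failure reported above concerns the `Expand` operator: `Expand` emits one output row per projection for every input row (here two projections, for the two distinct aggregates), so a `Filter` sitting above it runs on the multiplied data. When the filtered column (the `viewport` CASE WHEN) is computed identically in every projection, the predicate can instead be evaluated once below `Expand`. A pure-Python sketch of that transformation (illustrative names, not Catalyst code):

```python
# Sketch of pushing a filter below an Expand-style operator. Expand emits one
# output row per projection for every input row (as used for grouping sets /
# multiple count-distincts). When the filtered column is computed the same way
# in every projection, the predicate can run once on each input row instead of
# on every expanded copy, shrinking the data before it is multiplied.

def viewport(row):
    # Common expression shared by every projection
    # (the CASE WHEN pid = 3 THEN 'iOS' ... from the report).
    return {3: "iOS", 4: "Android"}.get(row["pid"], "Other")

def expand(rows, num_projections=2):
    out = []
    for r in rows:
        for gid in range(1, num_projections + 1):
            out.append({**r, "viewport": viewport(r), "gid": gid})
    return out

rows = [{"pid": 3, "uid": 1}, {"pid": 4, "uid": 2}, {"pid": 7, "uid": 3}]

# Filter above Expand: the predicate runs on all 6 expanded rows.
after = [r for r in expand(rows) if r["viewport"] == "iOS"]

# Filter pushed below Expand: the predicate runs on 3 input rows, then expand.
pushed = expand([r for r in rows if viewport(r) == "iOS"])

assert after == pushed  # same result, less data flowed through Expand
```

Pushing the predicate below `Expand` also lets it keep moving down toward the scan, as the filter on `viewport` ultimately depends only on `pid`.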
[jira] [Updated] (SPARK-33301) Code grows beyond 64 KB when compiling large CaseWhen
[ https://issues.apache.org/jira/browse/SPARK-33301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-33301: Summary: Code grows beyond 64 KB when compiling large CaseWhen (was: Code grows beyond 64 KB when compiling large CASE_WHEN) > Code grows beyond 64 KB when compiling large CaseWhen > - > > Key: SPARK-33301 > URL: https://issues.apache.org/jira/browse/SPARK-33301 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > > How to reproduce this issue: > {code:sql} > CREATE TABLE TRANS_BASE (BRAND STRING, SUB_BRAND STRING, MODEL STRING, > AUCT_TITL STRING, PRODUCT_LINE STRING, MODELS STRING) USING PARQUET; > SELECT T.*, > CASE > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%SPARK%', > '%4D%') THEN 'SPARK 4D' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%SPARK%', > '%BASKETBALL%') THEN 'SPARK Basketball' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%SPARK%', > '%BASKETBALL%') THEN 'SPARK Basketball' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%EQT%') THEN > 'EQT' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%NMD%') THEN > 'NMD' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%NMD%') THEN > 'NMD' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%NMD%') THEN > 'NMD' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%NMD%') THEN > 'NMD' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%NMD%') THEN > 'NMD' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%NMD%') THEN > 'NMD' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%ORIGINALS%') > THEN 'Originals' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', > '%SPARK%') THEN 'Other SPARK' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', > '%SPARK%') THEN 'Other SPARK' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', > '%SPARK%') THEN 'Other 
SPARK' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', > '%SPARK%') THEN 'Other SPARK' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', > '%SPARK%') THEN 'Other SPARK' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', > '%SPARK%') THEN 'Other SPARK' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', > '%SPARK%') THEN 'Other SPARK' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', > '%SPARK%') THEN 'Other SPARK' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', > '%SPARK%') THEN 'Other SPARK' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', > '%SPARK%') THEN 'Other SPARK' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', > '%SPARK%') THEN 'Other SPARK' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', > '%SPARK%') THEN 'Other SPARK' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', > '%SPARK%') THEN 'Other SPARK' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', > '%SPARK%') THEN 'Other SPARK' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%PHARELL%', > '%HUMAN%', '%RACE%') THEN 'Pharell Human Race' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%PHARELL%', > '%HUMAN%', '%RACE%') THEN 'Pharell Human Race' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%PHARELL%', > '%HUMAN%', '%RACE%') THEN 'Pharell Human Race' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%PROPHERE%') > THEN 'Prophere' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%ULTRA%', > '%BOOST%') THEN 'Ultra Boost' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%ULTRA%', > '%BOOST%') THEN 'Ultra Boost' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%ULTRA%', > '%BOOST%') THEN 'Ultra Boost' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%ULTRA%', > '%BOOST%') THEN 'Ultra Boost' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) 
LIKE ALL ('%ULTRA%', > '%BOOST%') THEN 'Ultra Boost' > WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%YUNG%', > '%1%') THEN 'Yung 1' > WHEN Upper(BRAND) = 'HIVE' AND Upper(MODEL) LIKE ALL ('%AIR%', > '%FORCE%', '%1%') THEN 'Air Force 1' > WHEN Upper(BRAND) = 'HIVE' AND Upper(MODEL) LIKE ALL ('%AIR%', > '%FORCE%', '%1%') THEN 'Air Force 1' > WHEN Upper(BRAND) = 'HIVE' AND Upper(MODEL) LIKE ALL ('%AIR%', > '%FORCE%', '%1%') THEN 'Air Force 1' > WHE
[jira] [Commented] (SPARK-33301) Code grows beyond 64 KB when compiling large CASE_WHEN
[ https://issues.apache.org/jira/browse/SPARK-33301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223603#comment-17223603 ] Yuming Wang commented on SPARK-33301: - {noformat} 20/10/30 05:01:40 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.InternalCompilerException: Compiling "GeneratedClass" in File 'generated.java', Line 1, Column 1: Compiling "project_doConsume_0(UTF8String project_expr_0_0, boolean project_exprIsNull_0_0, UTF8String project_expr_1_0, boolean project_exprIsNull_1_0, UTF8String project_expr_2_0, boolean project_exprIsNull_2_0, UTF8String project_expr_3_0, boolean project_exprIsNull_3_0, UTF8String project_expr_4_0, boolean project_exprIsNull_4_0, UTF8String project_expr_5_0, boolean project_exprIsNull_5_0)": Code grows beyond 64 KB org.codehaus.commons.compiler.InternalCompilerException: Compiling "GeneratedClass" in File 'generated.java', Line 1, Column 1: Compiling "project_doConsume_0(UTF8String project_expr_0_0, boolean project_exprIsNull_0_0, UTF8String project_expr_1_0, boolean project_exprIsNull_1_0, UTF8String project_expr_2_0, boolean project_exprIsNull_2_0, UTF8String project_expr_3_0, boolean project_exprIsNull_3_0, UTF8String project_expr_4_0, boolean project_exprIsNull_4_0, UTF8String project_expr_5_0, boolean project_exprIsNull_5_0)": Code grows beyond 64 KB at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:358) at org.codehaus.janino.UnitCompiler.access$000(UnitCompiler.java:231) at org.codehaus.janino.UnitCompiler$1.visitCompilationUnit(UnitCompiler.java:322) at org.codehaus.janino.UnitCompiler$1.visitCompilationUnit(UnitCompiler.java:319) at org.codehaus.janino.Java$CompilationUnit.accept(Java.java:367) at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:319) at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:237) at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:278) at 
org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:272) at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:252) at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:82) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1389) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1487) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1484) at org.sparkproject.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599) at org.sparkproject.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379) at org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342) at org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2257) at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4000) at org.sparkproject.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004) at org.sparkproject.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1337) at org.apache.spark.sql.execution.WholeStageCodegenExec.liftedTree1$1(WholeStageCodegenExec.scala:719) at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:718) at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175) at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171) at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:316) at 
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:382) at org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:412) at org.apache.spark.sql.execution.HiveResult$.hiveResultString(HiveResult.scala:76) at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.$anonfun$run$1(SparkSQLDriver.scala:67) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) at org.apache.spark.sql.SparkS
[jira] [Created] (SPARK-33301) Code grows beyond 64 KB when compiling large CASE_WHEN
Yuming Wang created SPARK-33301: --- Summary: Code grows beyond 64 KB when compiling large CASE_WHEN Key: SPARK-33301 URL: https://issues.apache.org/jira/browse/SPARK-33301 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.0 Reporter: Yuming Wang How to reproduce this issue: {code:sql} CREATE TABLE TRANS_BASE (BRAND STRING, SUB_BRAND STRING, MODEL STRING, AUCT_TITL STRING, PRODUCT_LINE STRING, MODELS STRING) USING PARQUET; SELECT T.*, CASE WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%SPARK%', '%4D%') THEN 'SPARK 4D' WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%SPARK%', '%BASKETBALL%') THEN 'SPARK Basketball' WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%SPARK%', '%BASKETBALL%') THEN 'SPARK Basketball' WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%EQT%') THEN 'EQT' WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%NMD%') THEN 'NMD' WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%NMD%') THEN 'NMD' WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%NMD%') THEN 'NMD' WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%NMD%') THEN 'NMD' WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%NMD%') THEN 'NMD' WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%NMD%') THEN 'NMD' WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%ORIGINALS%') THEN 'Originals' WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', '%SPARK%') THEN 'Other SPARK' WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', '%SPARK%') THEN 'Other SPARK' WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', '%SPARK%') THEN 'Other SPARK' WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', '%SPARK%') THEN 'Other SPARK' WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', '%SPARK%') THEN 'Other SPARK' WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', '%SPARK%') THEN 'Other SPARK' WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) 
LIKE ALL ('%OTHER%', '%SPARK%') THEN 'Other SPARK' WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', '%SPARK%') THEN 'Other SPARK' WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', '%SPARK%') THEN 'Other SPARK' WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', '%SPARK%') THEN 'Other SPARK' WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', '%SPARK%') THEN 'Other SPARK' WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', '%SPARK%') THEN 'Other SPARK' WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', '%SPARK%') THEN 'Other SPARK' WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', '%SPARK%') THEN 'Other SPARK' WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%PHARELL%', '%HUMAN%', '%RACE%') THEN 'Pharell Human Race' WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%PHARELL%', '%HUMAN%', '%RACE%') THEN 'Pharell Human Race' WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%PHARELL%', '%HUMAN%', '%RACE%') THEN 'Pharell Human Race' WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%PROPHERE%') THEN 'Prophere' WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%ULTRA%', '%BOOST%') THEN 'Ultra Boost' WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%ULTRA%', '%BOOST%') THEN 'Ultra Boost' WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%ULTRA%', '%BOOST%') THEN 'Ultra Boost' WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%ULTRA%', '%BOOST%') THEN 'Ultra Boost' WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%ULTRA%', '%BOOST%') THEN 'Ultra Boost' WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%YUNG%', '%1%') THEN 'Yung 1' WHEN Upper(BRAND) = 'HIVE' AND Upper(MODEL) LIKE ALL ('%AIR%', '%FORCE%', '%1%') THEN 'Air Force 1' WHEN Upper(BRAND) = 'HIVE' AND Upper(MODEL) LIKE ALL ('%AIR%', '%FORCE%', '%1%') THEN 'Air Force 1' WHEN Upper(BRAND) = 'HIVE' AND Upper(MODEL) LIKE ALL ('%AIR%', '%FORCE%', '%1%') 
THEN 'Air Force 1' WHEN Upper(BRAND) = 'HIVE' AND Upper(MODEL) LIKE ALL ('%AIR%', '%FORCE%', '%1%') THEN 'Air Force 1' WHEN Upper(BRAND) = 'HIVE' AND Upper(MODEL) LIKE ALL ('%AIR%', '%FORCE%', '%1%') THEN 'Air Force 1' WHEN Upper(BRAND) = 'HIVE' AND Upper(MODEL) LIKE ALL ('%AIR%', '%MAX%') THEN 'Air Max' WHEN Upper(BRAND) = 'HIVE' AND Upper(MODEL) LIKE ALL ('%AIR%', '%MAX%') THEN 'Air Max' WHEN Upper(BRAND) = 'HIVE' AND Upper(MODEL) LIKE ALL ('%AIR%', '%MAX%') THEN 'Air Max'
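One reason this query is pathological: the JVM caps a single method's bytecode at 64 KB, and whole-stage codegen compiles every WHEN branch of the CASE into one generated Java method. A common workaround is to express the branches as data rather than code, e.g. a small mapping table joined to the input instead of a giant CASE. The sketch below is a plain-Python toy of that idea, not Spark code; the rule list and helper name are illustrative assumptions:

```python
# Toy model: the big CASE WHEN rewritten as an ordered rule table.
# Each rule: (brand, substrings that must ALL appear in the model, label),
# mimicking `Upper(MODEL) LIKE ALL ('%X%', '%Y%')`.
RULES = [
    ("SPARK", ("SPARK", "4D"), "SPARK 4D"),
    ("SPARK", ("SPARK", "BASKETBALL"), "SPARK Basketball"),
    ("SPARK", ("ULTRA", "BOOST"), "Ultra Boost"),
    ("HIVE", ("AIR", "FORCE", "1"), "Air Force 1"),
    ("HIVE", ("AIR", "MAX"), "Air Max"),
]

def classify(brand, model):
    """Return the first matching label, like CASE WHEN ... evaluated in order."""
    b, m = brand.upper(), model.upper()
    for rule_brand, needles, label in RULES:
        if b == rule_brand and all(n in m for n in needles):
            return label
    return None  # CASE ... END with no ELSE yields NULL

print(classify("Hive", "air max 97"))   # -> Air Max
print(classify("Spark", "ultra boost")) # -> Ultra Boost
```

In Spark this shape becomes a join against the rule table (or, more bluntly, disabling whole-stage codegen for the query), so the generated method stays small no matter how many rules there are.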
[jira] [Updated] (SPARK-33300) Rule SimplifyCasts will not work for nested columns
[ https://issues.apache.org/jira/browse/SPARK-33300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chendihao updated SPARK-33300: -- Description: We use SparkSQL and Catalyst to optimize the Spark job. We have read the source code and tested the SimplifyCasts rule, which works for simple SQL without a nested cast. The SQL "select cast(string_date as string) from t1" will be optimized. {code:java} == Analyzed Logical Plan == string_date: string Project [cast(string_date#12 as string) AS string_date#24] +- SubqueryAlias t1 +- LogicalRDD [name#8, c1#9, c2#10, c5#11L, string_date#12, string_timestamp#13, timestamp_field#14, bool_field#15], false == Optimized Logical Plan == Project [string_date#12] +- LogicalRDD [name#8, c1#9, c2#10, c5#11L, string_date#12, string_timestamp#13, timestamp_field#14, bool_field#15], false {code} However, it fails to optimize a nested cast like "select cast(cast(string_date as string) as string) from t1". {code:java} == Analyzed Logical Plan == CAST(CAST(string_date AS STRING) AS STRING): string Project [cast(cast(string_date#12 as string) as string) AS CAST(CAST(string_date AS STRING) AS STRING)#24] +- SubqueryAlias t1 +- LogicalRDD [name#8, c1#9, c2#10, c5#11L, string_date#12, string_timestamp#13, timestamp_field#14, bool_field#15], false == Optimized Logical Plan == Project [string_date#12 AS CAST(CAST(string_date AS STRING) AS STRING)#24] +- LogicalRDD [name#8, c1#9, c2#10, c5#11L, string_date#12, string_timestamp#13, timestamp_field#14, bool_field#15], false {code} was: We use SparkSQL and Catalyst to optimize the Spark job. We have read the source code and tested the SimplifyCasts rule, which works for simple SQL without a nested cast. The SQL "select cast(string_date as string) from t1" will be optimized. 
{code:java} == Analyzed Logical Plan == string_date: string Project [cast(string_date#12 as string) AS string_date#24] +- SubqueryAlias t1 +- LogicalRDD [name#8, c1#9, c2#10, c5#11L, string_date#12, string_timestamp#13, timestamp_field#14, bool_field#15], false == Optimized Logical Plan == Project [string_date#12] +- LogicalRDD [name#8, c1#9, c2#10, c5#11L, string_date#12, string_timestamp#13, timestamp_field#14, bool_field#15], false {code} However, it fails to optimize a nested cast like "select cast(cast(string_date as string) as string) from t1". {code:java} == Analyzed Logical Plan == CAST(CAST(string_date AS STRING) AS STRING): string Project [cast(cast(string_date#12 as string) as string) AS CAST(CAST(string_date AS STRING) AS STRING)#24] +- SubqueryAlias t1 +- LogicalRDD [name#8, c1#9, c2#10, c5#11L, string_date#12, string_timestamp#13, timestamp_field#14, bool_field#15], false == Optimized Logical Plan == Project [string_date#12 AS CAST(CAST(string_date AS STRING) AS STRING)#24] +- LogicalRDD [name#8, c1#9, c2#10, c5#11L, string_date#12, string_timestamp#13, timestamp_field#14, bool_field#15], false {code} > Rule SimplifyCasts will not work for nested columns > --- > > Key: SPARK-33300 > URL: https://issues.apache.org/jira/browse/SPARK-33300 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 3.0.0 >Reporter: chendihao >Priority: Minor > > We use SparkSQL and Catalyst to optimize the Spark job. We have read the > source code and tested the SimplifyCasts rule, which works for simple SQL > without a nested cast. > The SQL "select cast(string_date as string) from t1" will be optimized. 
> {code:java} > == Analyzed Logical Plan == > string_date: string > Project [cast(string_date#12 as string) AS string_date#24] > +- SubqueryAlias t1 > +- LogicalRDD [name#8, c1#9, c2#10, c5#11L, string_date#12, > string_timestamp#13, timestamp_field#14, bool_field#15], false > == Optimized Logical Plan == > Project [string_date#12] > +- LogicalRDD [name#8, c1#9, c2#10, c5#11L, string_date#12, > string_timestamp#13, timestamp_field#14, bool_field#15], false > {code} > However, it fails to optimize a nested cast like "select > cast(cast(string_date as string) as string) from t1". > {code:java} > == Analyzed Logical Plan == > CAST(CAST(string_date AS STRING) AS STRING): string > Project [cast(cast(string_date#12 as string) as string) AS > CAST(CAST(string_date AS STRING) AS STRING)#24] > +- SubqueryAlias t1 > +- LogicalRDD [name#8, c1#9, c2#10, c5#11L, string_date#12, > string_timestamp#13, timestamp_field#14, bool_field#15], false > == Optimized Logical Plan == > Project [string_date#12 AS CAST(CAST(string_date AS STRING) AS STRING)#24] > +- LogicalRDD [name#8, c1#9, c2#10, c5#11L, string_date#12, > string_timestamp#13, timestamp_field#14, bool_field#15], false > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) --
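Note that in the second optimized plan both casts have in fact been removed; only the auto-generated output alias (`CAST(CAST(string_date AS STRING) AS STRING)#24`) still spells out the original expression. How a SimplifyCasts-style rule collapses nested redundant casts can be modeled with a toy expression tree (plain Python with illustrative class names, not Catalyst itself): applying the rewrite bottom-up removes both layers in one pass.

```python
from dataclasses import dataclass

@dataclass
class Attr:          # a column reference with a known data type
    name: str
    dtype: str

@dataclass
class Cast:          # cast(child AS dtype)
    child: object
    dtype: str

def simplify_casts(e):
    """Toy SimplifyCasts: drop a cast whose child already has the target type.
    The recursion visits children first (bottom-up, like transformUp), so
    nested redundant casts collapse in a single pass."""
    if isinstance(e, Cast):
        child = simplify_casts(e.child)
        return child if child.dtype == e.dtype else Cast(child, e.dtype)
    return e

expr = Cast(Cast(Attr("string_date", "string"), "string"), "string")
print(simplify_casts(expr))  # -> Attr(name='string_date', dtype='string')
```

A useful cast (e.g. int to string) is kept, since the child's type no longer matches the target type.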
[jira] [Created] (SPARK-33300) Rule SimplifyCasts will not work for nested columns
chendihao created SPARK-33300: - Summary: Rule SimplifyCasts will not work for nested columns Key: SPARK-33300 URL: https://issues.apache.org/jira/browse/SPARK-33300 Project: Spark Issue Type: Bug Components: Optimizer, SQL Affects Versions: 3.0.0 Reporter: chendihao We use SparkSQL and Catalyst to optimize the Spark job. We have read the source code and tested the SimplifyCasts rule, which works for simple SQL without a nested cast. The SQL "select cast(string_date as string) from t1" will be optimized. {code:java} == Analyzed Logical Plan == string_date: string Project [cast(string_date#12 as string) AS string_date#24] +- SubqueryAlias t1 +- LogicalRDD [name#8, c1#9, c2#10, c5#11L, string_date#12, string_timestamp#13, timestamp_field#14, bool_field#15], false == Optimized Logical Plan == Project [string_date#12] +- LogicalRDD [name#8, c1#9, c2#10, c5#11L, string_date#12, string_timestamp#13, timestamp_field#14, bool_field#15], false {code} However, it fails to optimize a nested cast like "select cast(cast(string_date as string) as string) from t1". {code:java} == Analyzed Logical Plan == CAST(CAST(string_date AS STRING) AS STRING): string Project [cast(cast(string_date#12 as string) as string) AS CAST(CAST(string_date AS STRING) AS STRING)#24] +- SubqueryAlias t1 +- LogicalRDD [name#8, c1#9, c2#10, c5#11L, string_date#12, string_timestamp#13, timestamp_field#14, bool_field#15], false == Optimized Logical Plan == Project [string_date#12 AS CAST(CAST(string_date AS STRING) AS STRING)#24] +- LogicalRDD [name#8, c1#9, c2#10, c5#11L, string_date#12, string_timestamp#13, timestamp_field#14, bool_field#15], false {code}
[jira] [Commented] (SPARK-33248) Add a configuration to control the legacy behavior of whether to pad null values when value size is less than schema size
[ https://issues.apache.org/jira/browse/SPARK-33248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223508#comment-17223508 ] Apache Spark commented on SPARK-33248: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/30202 > Add a configuration to control the legacy behavior of whether to pad > null values when value size is less than schema size > -- > > Key: SPARK-33248 > URL: https://issues.apache.org/jira/browse/SPARK-33248 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.1 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.1.0 > > > Add a configuration to control the legacy behavior of whether to pad > null values when value size is less than schema size > > For comment [https://github.com/apache/spark/pull/29421#discussion_r511684691]
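The behavior being gated here can be sketched as a small helper. This is a plain-Python illustration of the semantics, not Spark's implementation, and the function name is an assumption: when a parsed row carries fewer values than the schema has columns, the legacy mode pads the missing trailing columns with NULLs instead of raising an error.

```python
def conform_row(values, schema, pad_nulls=True):
    """Fit a parsed row to a schema of column names.

    With pad_nulls=True (the legacy behavior this ticket puts behind a
    config), short rows are padded with None; otherwise they are an error.
    Extra values beyond the schema are dropped either way.
    """
    if len(values) < len(schema):
        if not pad_nulls:
            raise ValueError(
                f"row has {len(values)} values but schema has {len(schema)} columns")
        values = values + [None] * (len(schema) - len(values))
    return dict(zip(schema, values[:len(schema)]))

print(conform_row(["a", "1"], ["name", "c1", "c2"]))
# -> {'name': 'a', 'c1': '1', 'c2': None}
```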
[jira] [Resolved] (SPARK-33297) Intermittent Compilation failure In GitHub Actions after SBT upgrade
[ https://issues.apache.org/jira/browse/SPARK-33297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-33297. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30198 [https://github.com/apache/spark/pull/30198] > Intermittent Compilation failure In GitHub Actions after SBT upgrade > > > Key: SPARK-33297 > URL: https://issues.apache.org/jira/browse/SPARK-33297 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > Fix For: 3.1.0 > > > https://github.com/apache/spark/runs/1314691686 > {code} > Error: java.util.MissingResourceException: Can't find bundle for base name > org.scalactic.ScalacticBundle, locale en > Error:at > java.util.ResourceBundle.throwMissingResourceException(ResourceBundle.java:1581) > Error:at > java.util.ResourceBundle.getBundleImpl(ResourceBundle.java:1396) > Error:at java.util.ResourceBundle.getBundle(ResourceBundle.java:782) > Error:at > org.scalactic.Resources$.resourceBundle$lzycompute(Resources.scala:8) > Error:at org.scalactic.Resources$.resourceBundle(Resources.scala:8) > Error:at > org.scalactic.Resources$.pleaseDefineScalacticFillFilePathnameEnvVar(Resources.scala:256) > Error:at > org.scalactic.source.PositionMacro$PositionMacroImpl.apply(PositionMacro.scala:65) > Error:at > org.scalactic.source.PositionMacro$.genPosition(PositionMacro.scala:85) > Error:at sun.reflect.GeneratedMethodAccessor34.invoke(Unknown Source) > Error:at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > Error:at java.lang.reflect.Method.invoke(Method.java:498) > {code}
[jira] [Updated] (SPARK-33183) Bug in optimizer rule EliminateSorts
[ https://issues.apache.org/jira/browse/SPARK-33183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-33183: Fix Version/s: 3.0.2 2.4.8 > Bug in optimizer rule EliminateSorts > > > Key: SPARK-33183 > URL: https://issues.apache.org/jira/browse/SPARK-33183 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.8, 3.0.2, 3.1.0 >Reporter: Allison Wang >Assignee: Allison Wang >Priority: Major > Labels: correctness > Fix For: 2.4.8, 3.0.2, 3.1.0 > > > Currently, the rule {{EliminateSorts}} removes a global sort node if its > child plan already satisfies the required sort order without checking if the > child plan's ordering is local or global. For example, in the following > scenario, the first sort shouldn't be removed because it has a stronger > guarantee than the second sort even if the sort orders are the same for both > sorts. > {code:java} > Sort(orders, global = True, ...) > Sort(orders, global = False, ...){code}
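The scenario in that description can be modeled with a toy plan node (plain Python with illustrative names, not Spark's actual rule): the fix is that eliminating the parent sort must require not just matching sort orders but an equal-or-stronger guarantee from the child, so a per-partition (local) sort cannot stand in for a global one.

```python
from dataclasses import dataclass

@dataclass
class Sort:
    orders: tuple        # e.g. ("a ASC",)
    is_global: bool      # True = total order, False = per-partition only
    child: object = None

def eliminate_sort(parent):
    """Toy EliminateSorts with the fix: drop `parent` only if its child
    provides the same ordering with an equal-or-stronger guarantee."""
    child = parent.child
    if (isinstance(child, Sort)
            and child.orders == parent.orders
            and (child.is_global or not parent.is_global)):
        return child
    return parent

# Global sort over a local sort: must NOT be removed (local < global).
outer = Sort(("a ASC",), is_global=True, child=Sort(("a ASC",), is_global=False))
assert eliminate_sort(outer) is outer

# Local sort over a global sort: safe to remove (global satisfies local).
outer2 = Sort(("a ASC",), is_global=False, child=Sort(("a ASC",), is_global=True))
assert eliminate_sort(outer2) is outer2.child
```

The buggy version omitted the `child.is_global or not parent.is_global` check and compared only the sort orders.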
[jira] [Assigned] (SPARK-33294) Add query resolved check before analyze InsertIntoDir
[ https://issues.apache.org/jira/browse/SPARK-33294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-33294: --- Assignee: ulysses you > Add query resolved check before analyze InsertIntoDir > - > > Key: SPARK-33294 > URL: https://issues.apache.org/jira/browse/SPARK-33294 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: ulysses you >Assignee: ulysses you >Priority: Minor > > Add query resolved check before analyze InsertIntoDir.
[jira] [Resolved] (SPARK-33294) Add query resolved check before analyze InsertIntoDir
[ https://issues.apache.org/jira/browse/SPARK-33294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-33294. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30197 [https://github.com/apache/spark/pull/30197] > Add query resolved check before analyze InsertIntoDir > - > > Key: SPARK-33294 > URL: https://issues.apache.org/jira/browse/SPARK-33294 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: ulysses you >Assignee: ulysses you >Priority: Minor > Fix For: 3.1.0 > > > Add query resolved check before analyze InsertIntoDir.
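The improvement amounts to a fail-fast guard, sketched generically below (plain Python with hypothetical names, not Spark's actual API): refuse to analyze the command while its child query plan is still unresolved, so the user gets a clear analysis error instead of a confusing internal one later.

```python
class UnresolvedPlanError(Exception):
    """Stands in for Spark's analysis exception in this toy sketch."""

def analyze_insert_into_dir(query_plan):
    # Guard first: analyzing an unresolved child produces confusing
    # downstream errors, so fail fast with a clear message instead.
    if not getattr(query_plan, "resolved", False):
        raise UnresolvedPlanError("cannot analyze InsertIntoDir: query is unresolved")
    return {"op": "InsertIntoDir", "child": query_plan}

class Plan:
    def __init__(self, resolved):
        self.resolved = resolved

print(analyze_insert_into_dir(Plan(resolved=True))["op"])  # -> InsertIntoDir
```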
[jira] [Assigned] (SPARK-33299) Unify schema parsing in from_json/from_csv across all APIs
[ https://issues.apache.org/jira/browse/SPARK-33299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33299: Assignee: Apache Spark > Unify schema parsing in from_json/from_csv across all APIs > -- > > Key: SPARK-33299 > URL: https://issues.apache.org/jira/browse/SPARK-33299 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Major > > Currently, from_json() has an extra capability in the Scala API. It accepts > schema in JSON format, but the other APIs (SQL, Python, R) lack the feature. > The ticket aims to unify all APIs and support schemas in JSON format everywhere.
[jira] [Assigned] (SPARK-33299) Unify schema parsing in from_json/from_csv across all APIs
[ https://issues.apache.org/jira/browse/SPARK-33299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33299: Assignee: (was: Apache Spark) > Unify schema parsing in from_json/from_csv across all APIs > -- > > Key: SPARK-33299 > URL: https://issues.apache.org/jira/browse/SPARK-33299 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > Currently, from_json() has an extra capability in the Scala API. It accepts > schema in JSON format, but the other APIs (SQL, Python, R) lack the feature. > The ticket aims to unify all APIs and support schemas in JSON format everywhere.
[jira] [Commented] (SPARK-33299) Unify schema parsing in from_json/from_csv across all APIs
[ https://issues.apache.org/jira/browse/SPARK-33299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223485#comment-17223485 ] Apache Spark commented on SPARK-33299: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/30201 > Unify schema parsing in from_json/from_csv across all APIs > -- > > Key: SPARK-33299 > URL: https://issues.apache.org/jira/browse/SPARK-33299 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > Currently, from_json() has an extra capability in the Scala API. It accepts > schema in JSON format, but the other APIs (SQL, Python, R) lack the feature. > The ticket aims to unify all APIs and support schemas in JSON format everywhere.
[jira] [Created] (SPARK-33299) Unify schema parsing in from_json/from_csv across all APIs
Maxim Gekk created SPARK-33299: -- Summary: Unify schema parsing in from_json/from_csv across all APIs Key: SPARK-33299 URL: https://issues.apache.org/jira/browse/SPARK-33299 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk Currently, from_json() has an extra capability in the Scala API. It accepts schema in JSON format, but the other APIs (SQL, Python, R) lack the feature. The ticket aims to unify all APIs and support schemas in JSON format everywhere.
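For context, a "schema in JSON format" here refers to the serialized struct representation the Scala API already accepts (the kind produced by `StructType.json`). A minimal plain-Python sketch of reading such a schema without Spark follows; the field layout shown matches the standard struct serialization, but treat the exact sample as illustrative:

```python
import json

# A schema string in Spark's JSON format (serialized StructType).
schema_json = """
{"type": "struct", "fields": [
  {"name": "id",   "type": "long",   "nullable": false, "metadata": {}},
  {"name": "name", "type": "string", "nullable": true,  "metadata": {}}
]}
"""

def fields_of(schema_str):
    """Extract (name, type, nullable) triples from a JSON-format struct schema."""
    parsed = json.loads(schema_str)
    assert parsed["type"] == "struct", "expected a top-level struct"
    return [(f["name"], f["type"], f["nullable"]) for f in parsed["fields"]]

print(fields_of(schema_json))
# -> [('id', 'long', False), ('name', 'string', True)]
```

The ticket's goal is that a string like `schema_json` could be passed to from_json/from_csv in SQL, Python, and R, not only in Scala.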
[jira] [Commented] (SPARK-33166) Provide Search Function in Spark docs site
[ https://issues.apache.org/jira/browse/SPARK-33166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223466#comment-17223466 ] jiaan.geng commented on SPARK-33166: I will take a look! > Provide Search Function in Spark docs site > -- > > Key: SPARK-33166 > URL: https://issues.apache.org/jira/browse/SPARK-33166 > Project: Spark > Issue Type: New Feature > Components: Documentation >Affects Versions: 3.1.0 >Reporter: Xiao Li >Priority: Major > > In the last few releases, our Spark documentation > https://spark.apache.org/docs/latest/ has become richer. It would be nice to > provide a search function to help our users find content faster.