[jira] [Commented] (SPARK-29606) Improve EliminateOuterJoin performance

2020-10-30 Thread Asif (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224007#comment-17224007
 ] 

Asif commented on SPARK-29606:
--

I have proposed the following PR, which completely solves the issue:

 

[PR-33152|https://github.com/apache/spark/pull/30185]

> Improve EliminateOuterJoin performance
> --
>
> Key: SPARK-29606
> URL: https://issues.apache.org/jira/browse/SPARK-29606
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce:
> {code:scala}
> spark.sql(
>   """
> |CREATE TABLE `big_table1`(`adj_type_id` tinyint, `byr_cntry_id` 
> decimal(4,0), `sap_category_id` decimal(9,0), `lstg_site_id` decimal(9,0), 
> `lstg_type_code` decimal(4,0), `offrd_slng_chnl_grp_id` smallint, 
> `slr_cntry_id` decimal(4,0), `sold_slng_chnl_grp_id` smallint, 
> `bin_lstg_yn_id` tinyint, `bin_sold_yn_id` tinyint, `lstg_curncy_id` 
> decimal(4,0), `blng_curncy_id` decimal(4,0), `bid_count` decimal(18,0), 
> `ck_trans_count` decimal(18,0), `ended_bid_count` decimal(18,0), 
> `new_lstg_count` decimal(18,0), `ended_lstg_count` decimal(18,0), 
> `ended_success_lstg_count` decimal(18,0), `item_sold_count` decimal(18,0), 
> `gmv_us_amt` decimal(18,2), `gmv_byr_lc_amt` decimal(18,2), `gmv_slr_lc_amt` 
> decimal(18,2), `gmv_lstg_curncy_amt` decimal(18,2), `gmv_us_m_amt` 
> decimal(18,2), `rvnu_insrtn_fee_us_amt` decimal(18,6), 
> `rvnu_insrtn_fee_lc_amt` decimal(18,6), `rvnu_insrtn_fee_bc_amt` 
> decimal(18,6), `rvnu_insrtn_fee_us_m_amt` decimal(18,6), 
> `rvnu_insrtn_crd_us_amt` decimal(18,6), `rvnu_insrtn_crd_lc_amt` 
> decimal(18,6), `rvnu_insrtn_crd_bc_amt` decimal(18,6), 
> `rvnu_insrtn_crd_us_m_amt` decimal(18,6), `rvnu_fetr_fee_us_amt` 
> decimal(18,6), `rvnu_fetr_fee_lc_amt` decimal(18,6), `rvnu_fetr_fee_bc_amt` 
> decimal(18,6), `rvnu_fetr_fee_us_m_amt` decimal(18,6), `rvnu_fetr_crd_us_amt` 
> decimal(18,6), `rvnu_fetr_crd_lc_amt` decimal(18,6), `rvnu_fetr_crd_bc_amt` 
> decimal(18,6), `rvnu_fetr_crd_us_m_amt` decimal(18,6), `rvnu_fv_fee_us_amt` 
> decimal(18,6), `rvnu_fv_fee_slr_lc_amt` decimal(18,6), 
> `rvnu_fv_fee_byr_lc_amt` decimal(18,6), `rvnu_fv_fee_bc_amt` decimal(18,6), 
> `rvnu_fv_fee_us_m_amt` decimal(18,6), `rvnu_fv_crd_us_amt` decimal(18,6), 
> `rvnu_fv_crd_byr_lc_amt` decimal(18,6), `rvnu_fv_crd_slr_lc_amt` 
> decimal(18,6), `rvnu_fv_crd_bc_amt` decimal(18,6), `rvnu_fv_crd_us_m_amt` 
> decimal(18,6), `rvnu_othr_l_fee_us_amt` decimal(18,6), 
> `rvnu_othr_l_fee_lc_amt` decimal(18,6), `rvnu_othr_l_fee_bc_amt` 
> decimal(18,6), `rvnu_othr_l_fee_us_m_amt` decimal(18,6), 
> `rvnu_othr_l_crd_us_amt` decimal(18,6), `rvnu_othr_l_crd_lc_amt` 
> decimal(18,6), `rvnu_othr_l_crd_bc_amt` decimal(18,6), 
> `rvnu_othr_l_crd_us_m_amt` decimal(18,6), `rvnu_othr_nl_fee_us_amt` 
> decimal(18,6), `rvnu_othr_nl_fee_lc_amt` decimal(18,6), 
> `rvnu_othr_nl_fee_bc_amt` decimal(18,6), `rvnu_othr_nl_fee_us_m_amt` 
> decimal(18,6), `rvnu_othr_nl_crd_us_amt` decimal(18,6), 
> `rvnu_othr_nl_crd_lc_amt` decimal(18,6), `rvnu_othr_nl_crd_bc_amt` 
> decimal(18,6), `rvnu_othr_nl_crd_us_m_amt` decimal(18,6), 
> `rvnu_slr_tools_fee_us_amt` decimal(18,6), `rvnu_slr_tools_fee_lc_amt` 
> decimal(18,6), `rvnu_slr_tools_fee_bc_amt` decimal(18,6), 
> `rvnu_slr_tools_fee_us_m_amt` decimal(18,6), `rvnu_slr_tools_crd_us_amt` 
> decimal(18,6), `rvnu_slr_tools_crd_lc_amt` decimal(18,6), 
> `rvnu_slr_tools_crd_bc_amt` decimal(18,6), `rvnu_slr_tools_crd_us_m_amt` 
> decimal(18,6), `rvnu_unasgnd_us_amt` decimal(18,6), `rvnu_unasgnd_lc_amt` 
> decimal(18,6), `rvnu_unasgnd_bc_amt` decimal(18,6), `rvnu_unasgnd_us_m_amt` 
> decimal(18,6), `rvnu_ad_fee_us_amt` decimal(18,6), `rvnu_ad_fee_lc_amt` 
> decimal(18,6), `rvnu_ad_fee_bc_amt` decimal(18,6), `rvnu_ad_fee_us_m_amt` 
> decimal(18,6), `rvnu_ad_crd_us_amt` decimal(18,6), `rvnu_ad_crd_lc_amt` 
> decimal(18,6), `rvnu_ad_crd_bc_amt` decimal(18,6), `rvnu_ad_crd_us_m_amt` 
> decimal(18,6), `rvnu_othr_ad_fee_us_amt` decimal(18,6), 
> `rvnu_othr_ad_fee_lc_amt` decimal(18,6), `rvnu_othr_ad_fee_bc_amt` 
> decimal(18,6), `rvnu_othr_ad_fee_us_m_amt` decimal(18,6), `cre_date` date, 
> `cre_user` string, `upd_date` timestamp, `upd_user` string, 
> `cmn_mtrc_summ_dt` date)
> |USING parquet
> |PARTITIONED BY (`cmn_mtrc_summ_dt`)
> |""".stripMargin)
> spark.sql(
>   """
> |CREATE TABLE `small_table1` (`CURNCY_ID` DECIMAL(9,0), 
> `CURNCY_PLAN_RATE` DECIMAL(18,6), `CRE_DATE` DATE, `CRE_USER` STRING, 
> `UPD_DATE` TIMESTAMP, `UPD_USER` STRING)
> |USING parquet
> |CLUSTERED BY (CURNCY_ID)
> |SORTED BY (CURNCY_ID)
> |INTO 1 BUCKETS
> |""".stripMargin)
> spark.sql(
>   """
> |CREATE TABLE `small_table2` (`cntry_id` DECIMAL(4,0), `curncy_id` 
> DECIMAL(4,0), `cntry_desc` STRIN

[jira] [Commented] (SPARK-33308) support CUBE(...) and ROLLUP(...), GROUPING SETS(...) as group by expr in parser level

2020-10-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223992#comment-17223992
 ] 

Apache Spark commented on SPARK-33308:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/30212

> support CUBE(...) and ROLLUP(...), GROUPING SETS(...) as group by expr in 
> parser level
> --
>
> Key: SPARK-33308
> URL: https://issues.apache.org/jira/browse/SPARK-33308
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: angerszhu
>Priority: Major
>
> support CUBE(...) and ROLLUP(...), GROUPING SETS(...) as group by expr in 
> parser level



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33308) support CUBE(...) and ROLLUP(...), GROUPING SETS(...) as group by expr in parser level

2020-10-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33308:


Assignee: (was: Apache Spark)

> support CUBE(...) and ROLLUP(...), GROUPING SETS(...) as group by expr in 
> parser level
> --
>
> Key: SPARK-33308
> URL: https://issues.apache.org/jira/browse/SPARK-33308
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: angerszhu
>Priority: Major
>
> support CUBE(...) and ROLLUP(...), GROUPING SETS(...) as group by expr in 
> parser level






[jira] [Assigned] (SPARK-33308) support CUBE(...) and ROLLUP(...), GROUPING SETS(...) as group by expr in parser level

2020-10-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33308:


Assignee: Apache Spark

> support CUBE(...) and ROLLUP(...), GROUPING SETS(...) as group by expr in 
> parser level
> --
>
> Key: SPARK-33308
> URL: https://issues.apache.org/jira/browse/SPARK-33308
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: angerszhu
>Assignee: Apache Spark
>Priority: Major
>
> support CUBE(...) and ROLLUP(...), GROUPING SETS(...) as group by expr in 
> parser level






[jira] [Updated] (SPARK-33229) UnsupportedOperationException when group by with cube

2020-10-30 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-33229:
--
Parent: SPARK-33307
Issue Type: Sub-task  (was: Improvement)

> UnsupportedOperationException when group by with cube
> -
>
> Key: SPARK-33229
> URL: https://issues.apache.org/jira/browse/SPARK-33229
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce this issue:
> {code:sql}
> create table test_cube using parquet as select id as a, id as b, id as c from 
> range(10);
> select a, b, c, count(*) from test_cube group by 1, cube(2, 3);
> {code}
> {noformat}
> spark-sql> select a, b, c, count(*) from test_cube group by 1, cube(2, 3);
> 20/10/23 06:31:51 ERROR SparkSQLDriver: Failed in [select a, b, c, count(*) 
> from test_cube group by 1, cube(2, 3)]
> java.lang.UnsupportedOperationException
>   at 
> org.apache.spark.sql.catalyst.expressions.GroupingSet.dataType(grouping.scala:35)
>   at 
> org.apache.spark.sql.catalyst.expressions.GroupingSet.dataType$(grouping.scala:35)
>   at 
> org.apache.spark.sql.catalyst.expressions.Cube.dataType(grouping.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkValidGroupingExprs$1(CheckAnalysis.scala:268)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$12(CheckAnalysis.scala:284)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$12$adapted(CheckAnalysis.scala:284)
>   at 
> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at 
> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:284)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:92)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:177)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:92)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:89)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:130)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:156)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:153)
>   at 
> org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:68)
>   at 
> org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
>   at 
> org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:133)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:133)
>   at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:68)
>   at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:66)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:58)
>   at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:99)
> {noformat}
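For contrast, a minimal sketch (assuming the test_cube table from the reproduction above) of the long-standing WITH CUBE syntax, which spells out the grouping columns instead of mixing ordinals with cube(...) and so does not hit the GroupingSet.dataType path:

{code:scala}
// Minimal sketch, assuming the test_cube table created in the reproduction above.
// Spelling out the columns with WITH CUBE (this cubes all three columns, so it is
// not a drop-in equivalent of "group by 1, cube(2, 3)") avoids the unresolved path
// that throws UnsupportedOperationException in GroupingSet.dataType.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cube-contrast").getOrCreate()

spark.sql(
  """SELECT a, b, c, count(*)
    |FROM test_cube
    |GROUP BY a, b, c WITH CUBE
    |""".stripMargin).show()
{code}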






[jira] [Updated] (SPARK-33233) CUBE/ROLLUP can't support UnresolvedOrdinal

2020-10-30 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-33233:
--
Parent: SPARK-33307
Issue Type: Sub-task  (was: Improvement)

> CUBE/ROLLUP can't support UnresolvedOrdinal
> ---
>
> Key: SPARK-33233
> URL: https://issues.apache.org/jira/browse/SPARK-33233
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: angerszhu
>Priority: Major
>
> Spark currently supports GROUP BY ordinal, but CUBE/ROLLUP/GROUPING SETS do 
> not. This PR makes CUBE/ROLLUP/GROUPING SETS support GROUP BY ordinals.
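A hedged sketch of the target behaviour described above (hypothetical until the change lands): ordinals inside CUBE would resolve against the select list the same way plain GROUP BY ordinals already do.

{code:scala}
// Sketch of the intended behaviour; the second statement is the target of this
// ticket and is expected to fail until the change is in place.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cube-ordinals").getOrCreate()
spark.range(10).selectExpr("id AS a", "id % 2 AS b").createOrReplaceTempView("t")

spark.sql("SELECT a, b, count(*) FROM t GROUP BY 1, 2").show()        // ordinal GROUP BY: already supported
spark.sql("SELECT a, b, count(*) FROM t GROUP BY cube(1, 2)").show()  // ordinals inside cube: what this ticket adds
{code}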






[jira] [Updated] (SPARK-33296) Format exception when using cube func && with cube

2020-10-30 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-33296:
--
Parent: SPARK-33307
Issue Type: Sub-task  (was: Improvement)

> Format exception when using cube func && with cube
> 
>
> Key: SPARK-33296
> URL: https://issues.apache.org/jira/browse/SPARK-33296
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: angerszhu
>Priority: Major
>
> {noformat}
> spark-sql> explain extended select a, b, c from x group by cube(a, b, c) with cube;
> 20/10/30 11:16:50 ERROR SparkSQLDriver: Failed in [explain extended select a, b, c from x group by cube(a, b, c) with cube]
> java.lang.UnsupportedOperationException
>   at org.apache.spark.sql.catalyst.expressions.GroupingSet.dataType(grouping.scala:36)
>   at org.apache.spark.sql.catalyst.expressions.GroupingSet.dataType$(grouping.scala:36)
>   at org.apache.spark.sql.catalyst.expressions.Cube.dataType(grouping.scala:61)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkValidGroupingExprs$1(CheckAnalysis.scala:269)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$12(CheckAnalysis.scala:285)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$12$adapted(CheckAnalysis.scala:285)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:285)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:92)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:184)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:92)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:89)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:132)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:162)
>   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:214)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:159)
>   at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:73)
>   at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
>   at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:138)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:769)
>   at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:138)
>   at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:73)
>   at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:71)
>   at org.apache.spark.sql.execution.QueryExecution.writePlans(QueryExecution.scala:213)
>   at org.apache.spark.sql.execution.QueryExecution.toString(QueryExecution.scala:235)
>   at org.apache.spark.sql.execution.QueryExecution.org$apache$spark$sql$execution$QueryExecution$$explainString(QueryExecution.scala:191)
>   at org.apache.spark.sql.execution.QueryExecution.explainString(QueryExecution.scala:170)
>   at org.apache.spark.sql.execution.command.ExplainCommand.run(commands.scala:158)
>   at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>   at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>   at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
> {noformat}






[jira] [Commented] (SPARK-33307) Refactor GROUPING ANALYTICS

2020-10-30 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223990#comment-17223990
 ] 

angerszhu commented on SPARK-33307:
---

FYI [~maropu] [~cloud_fan]

> Refactor GROUPING ANALYTICS
> -
>
> Key: SPARK-33307
> URL: https://issues.apache.org/jira/browse/SPARK-33307
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: angerszhu
>Priority: Major
>
> Support CUBE(...) and ROLLUP(...) at the parser level and support GROUPING 
> SETS(...) as a group-by expression






[jira] [Created] (SPARK-33309) Replace original GROUPING SETS with a new expression

2020-10-30 Thread angerszhu (Jira)
angerszhu created SPARK-33309:
-

 Summary: Replace original GROUPING SETS with a new expression
 Key: SPARK-33309
 URL: https://issues.apache.org/jira/browse/SPARK-33309
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: angerszhu


Replace the current GroupingSets logical plan, shown below, with a plain expression.

{code:scala}
case class GroupingSets(
    selectedGroupByExprs: Seq[Seq[Expression]],
    groupByExprs: Seq[Expression],
    child: LogicalPlan,
    aggregations: Seq[NamedExpression]) extends UnaryNode {

  override def output: Seq[Attribute] = aggregations.map(_.toAttribute)

  // Needs to be unresolved before it is translated to Aggregate + Expand because
  // output attributes will change in analysis.
  override lazy val resolved: Boolean = false
}
{code}
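A rough sketch (names and shape hypothetical, not the actual patch) of what an expression-based replacement could look like: grouping sets become an unevaluable Catalyst expression carried inside an ordinary Aggregate, instead of a dedicated unresolved logical plan node.

{code:scala}
// Hypothetical sketch only -- not the actual Spark change. The idea is to model
// GROUPING SETS as an expression the analyzer expands, mirroring Cube/Rollup.
import org.apache.spark.sql.catalyst.expressions.{Expression, Unevaluable}
import org.apache.spark.sql.types.DataType

case class GroupingSetsExpr(
    selectedGroupByExprs: Seq[Seq[Expression]],
    groupByExprs: Seq[Expression]) extends Expression with Unevaluable {
  // Expose every referenced expression so analysis can resolve them in one pass.
  override def children: Seq[Expression] = groupByExprs ++ selectedGroupByExprs.flatten
  // Like Cube/Rollup, this node never survives analysis, so it carries no data type.
  override def dataType: DataType = throw new UnsupportedOperationException
  override def nullable: Boolean = false
}
{code}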






[jira] [Created] (SPARK-33308) support CUBE(...) and ROLLUP(...), GROUPING SETS(...) as group by expr in parser level

2020-10-30 Thread angerszhu (Jira)
angerszhu created SPARK-33308:
-

 Summary: support CUBE(...) and ROLLUP(...), GROUPING SETS(...) as 
group by expr in parser level
 Key: SPARK-33308
 URL: https://issues.apache.org/jira/browse/SPARK-33308
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: angerszhu


support CUBE(...) and ROLLUP(...), GROUPING SETS(...) as group by expr in 
parser level
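For reference, a hedged sketch of the GROUP BY forms this sub-task is about (table and column names illustrative; whether each form parses today or only after this change depends on the final grammar work):

{code:scala}
// Sketch of the grouping syntaxes named in the summary (names illustrative).
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("grouping-syntax").getOrCreate()
spark.range(10).selectExpr("id AS a", "id % 2 AS b").createOrReplaceTempView("t")

spark.sql("SELECT a, b, count(*) FROM t GROUP BY CUBE(a, b)").show()
spark.sql("SELECT a, b, count(*) FROM t GROUP BY ROLLUP(a, b)").show()
spark.sql("SELECT a, b, count(*) FROM t GROUP BY GROUPING SETS((a, b), (a), ())").show()
{code}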






[jira] [Updated] (SPARK-33307) Refactor GROUPING ANALYTICS

2020-10-30 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-33307:
--
Description: Support CUBE(...) and ROLLUP(...) at the parser level and support 
GROUPING SETS(...) as a group-by expression

> Refactor GROUPING ANALYTICS
> -
>
> Key: SPARK-33307
> URL: https://issues.apache.org/jira/browse/SPARK-33307
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: angerszhu
>Priority: Major
>
> Support CUBE(...) and ROLLUP(...) at the parser level and support GROUPING 
> SETS(...) as a group-by expression






[jira] [Created] (SPARK-33307) Refactor GROUPING ANALYTICS

2020-10-30 Thread angerszhu (Jira)
angerszhu created SPARK-33307:
-

 Summary: Refactor GROUPING ANALYTICS
 Key: SPARK-33307
 URL: https://issues.apache.org/jira/browse/SPARK-33307
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: angerszhu









[jira] [Assigned] (SPARK-33305) DSv2: DROP TABLE command should also invalidate cache

2020-10-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33305:


Assignee: (was: Apache Spark)

> DSv2: DROP TABLE command should also invalidate cache
> -
>
> Key: SPARK-33305
> URL: https://issues.apache.org/jira/browse/SPARK-33305
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Chao Sun
>Priority: Major
>
> Different from DSv1, {{DROP TABLE}} command in DSv2 currently only drops the 
> table but doesn't invalidate all caches referencing the table. We should make 
> the behavior consistent between v1 and v2.






[jira] [Assigned] (SPARK-33305) DSv2: DROP TABLE command should also invalidate cache

2020-10-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33305:


Assignee: Apache Spark

> DSv2: DROP TABLE command should also invalidate cache
> -
>
> Key: SPARK-33305
> URL: https://issues.apache.org/jira/browse/SPARK-33305
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Chao Sun
>Assignee: Apache Spark
>Priority: Major
>
> Different from DSv1, {{DROP TABLE}} command in DSv2 currently only drops the 
> table but doesn't invalidate all caches referencing the table. We should make 
> the behavior consistent between v1 and v2.






[jira] [Commented] (SPARK-33305) DSv2: DROP TABLE command should also invalidate cache

2020-10-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223974#comment-17223974
 ] 

Apache Spark commented on SPARK-33305:
--

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/30211

> DSv2: DROP TABLE command should also invalidate cache
> -
>
> Key: SPARK-33305
> URL: https://issues.apache.org/jira/browse/SPARK-33305
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Chao Sun
>Priority: Major
>
> Different from DSv1, {{DROP TABLE}} command in DSv2 currently only drops the 
> table but doesn't invalidate all caches referencing the table. We should make 
> the behavior consistent between v1 and v2.






[jira] [Created] (SPARK-33306) TimezoneID is needed when casting from Date to String

2020-10-30 Thread EdisonWang (Jira)
EdisonWang created SPARK-33306:
--

 Summary: TimezoneID is needed when casting from Date to String
 Key: SPARK-33306
 URL: https://issues.apache.org/jira/browse/SPARK-33306
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: EdisonWang









[jira] [Commented] (SPARK-33150) Groupby key may not be unique when using window

2020-10-30 Thread Aoyuan Liao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223971#comment-17223971
 ] 

Aoyuan Liao commented on SPARK-33150:
-

[~DieterDP] The time difference is passed via the fold attribute. As you can see, 
self.collect() is equal to output_collect in your code.

[https://github.com/apache/spark/blob/master/python/pyspark/sql/pandas/conversion.py#L133]

From this step on, the two tz-naive datetimes are treated as the same:
{code:java}
>>> print(output_collect)
[Row(window=datetime.datetime(2019, 10, 27, 2, 54), min_value=1), 
Row(window=datetime.datetime(2019, 10, 27, 2, 54, fold=1), min_value=3)]
>>> pdf = pd.DataFrame.from_records(output_collect, columns=output.columns)
>>> pdf
   window  min_value
0 2019-10-27 02:54:00  1
1 2019-10-27 02:54:00  3
>>> pdf.dtypes
window   datetime64[ns]
min_value int64
dtype: object
{code}
If this function doesn't respect the fold attribute, we would have to save an array 
of fold values and use it to convert to the machine timezone. That means iterating 
over every datetime entry one by one again, and I am not sure how that would affect 
performance.

As I said, there is a workaround: enabling Arrow in PySpark.
{code:java}
>>> spark = (pyspark
...  .sql
...  .SparkSession
...  .builder
...  .master('local[1]')
...  .config("spark.sql.session.timeZone", "UTC")
...  .config('spark.driver.extraJavaOptions', '-Duser.timezone=UTC') \
...  .config('spark.executor.extraJavaOptions', '-Duser.timezone=UTC') \
...  .config('spark.sql.execution.arrow.pyspark.enabled', True)
...  .getOrCreate()
... )
>>> output_collect = output.collect()
>>> output_pandas = output.toPandas()
>>> print(output_collect)
[Row(window=datetime.datetime(2019, 10, 27, 2, 54), min_value=1), 
Row(window=datetime.datetime(2019, 10, 27, 2, 54, fold=1), min_value=3)]
>>> print(output_pandas)
   window  min_value
0 2019-10-27 00:54:00  1
1 2019-10-27 01:54:00  3

{code}

> Groupby key may not be unique when using window
> ---
>
> Key: SPARK-33150
> URL: https://issues.apache.org/jira/browse/SPARK-33150
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.3, 3.0.0
>Reporter: Dieter De Paepe
>Priority: Major
>
>  
> Due to the way spark converts dates to local times, it may end up losing 
> details that allow it to differentiate instants when those times fall in the 
> transition for daylight savings time. Setting the spark timezone to UTC does 
> not resolve the issue.
> This issue is somewhat related to SPARK-32123, but seems independent enough 
> to consider this a separate issue.
> A minimal example is below. I tested these on Spark 3.0.0 and 2.3.3 (I could 
> not get 2.4.x to work on my system). My machine is located in timezone 
> "Europe/Brussels".
>  
> {code:java}
> import pyspark
> import pyspark.sql.functions as f
> spark = (pyspark
>  .sql
>  .SparkSession
>  .builder
>  .master('local[1]')
>  .config("spark.sql.session.timeZone", "UTC")
>  .config('spark.driver.extraJavaOptions', '-Duser.timezone=UTC') \
>  .config('spark.executor.extraJavaOptions', '-Duser.timezone=UTC')
>  .getOrCreate()
> )
> debug_df = spark.createDataFrame([
>  (1572137640, 1),
>  (1572137640, 2),
>  (1572141240, 3),
>  (1572141240, 4)
> ],['epochtime', 'value'])
> debug_df \
>  .withColumn('time', f.from_unixtime('epochtime')) \
>  .withColumn('window', f.window('time', '1 minute').start) \
>  .collect()
> {code}
>  
> Output, here we see the window function internally transforms the times to 
> local time, and as such has to disambiguate between the Belgian winter and 
> summer hour transition by setting the "fold" attribute:
>  
> {code:java}
> [Row(epochtime=1572137640, value=1, time='2019-10-27 00:54:00', 
> window=datetime.datetime(2019, 10, 27, 2, 54)),
>  Row(epochtime=1572137640, value=2, time='2019-10-27 00:54:00', 
> window=datetime.datetime(2019, 10, 27, 2, 54)),
>  Row(epochtime=1572141240, value=3, time='2019-10-27 01:54:00', 
> window=datetime.datetime(2019, 10, 27, 2, 54, fold=1)),
>  Row(epochtime=1572141240, value=4, time='2019-10-27 01:54:00', 
> window=datetime.datetime(2019, 10, 27, 2, 54, fold=1))]{code}
>  
> Now, this has severe implications when we use the window function for a 
> groupby operation:
>  
> {code:java}
> output = debug_df \
>  .withColumn('time', f.from_unixtime('epochtime')) \
>  .groupby(f.window('time', '1 minute').start.alias('window')).agg(
>f.min('value').alias('min_value')
>  )
> output_collect = output.collect()
> output_pandas = output.toPandas()
> print(output_collect)
> print(output_pandas)
> {code}
> Output:
>  
> {code:java}
> [Row(window=datetime.datetime(2019, 10, 27, 2, 54), min_value=1), 
> Row(window=datetime.datetime(2019, 10, 27, 2, 54, f

[jira] [Created] (SPARK-33305) DSv2: DROP TABLE command should also invalidate cache

2020-10-30 Thread Chao Sun (Jira)
Chao Sun created SPARK-33305:


 Summary: DSv2: DROP TABLE command should also invalidate cache
 Key: SPARK-33305
 URL: https://issues.apache.org/jira/browse/SPARK-33305
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.1
Reporter: Chao Sun


Different from DSv1, {{DROP TABLE}} command in DSv2 currently only drops the 
table but doesn't invalidate all caches referencing the table. We should make 
the behavior consistent between v1 and v2.
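Until the command itself invalidates caches, one user-level workaround sketch (catalog and table names illustrative) is to drop the cached data explicitly before issuing the v2 DROP TABLE:

{code:scala}
// Workaround sketch (names illustrative): invalidate the cached data manually,
// since the DSv2 DROP TABLE does not do it yet.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("dsv2-drop-cache").getOrCreate()

val df = spark.table("testcat.ns.t")
df.cache()
df.count()                            // materialize the cache
df.unpersist(blocking = true)         // manual invalidation before the drop
spark.sql("DROP TABLE testcat.ns.t")  // without the unpersist, a stale cache entry would linger
{code}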






[jira] [Resolved] (SPARK-32037) Rename blacklisting feature to avoid language with racist connotation

2020-10-30 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-32037.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

> Rename blacklisting feature to avoid language with racist connotation
> -
>
> Key: SPARK-32037
> URL: https://issues.apache.org/jira/browse/SPARK-32037
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Erik Krogen
>Assignee: Thomas Graves
>Priority: Minor
> Fix For: 3.1.0
>
>
> As per [discussion on the Spark dev 
> list|https://lists.apache.org/thread.html/rf6b2cdcba4d3875350517a2339619e5d54e12e66626a88553f9fe275%40%3Cdev.spark.apache.org%3E],
>  it will be beneficial to remove references to problematic language that can 
> alienate potential community members. One such reference is "blacklist". 
> While it seems to me that there is some valid debate as to whether this term 
> has racist origins, the cultural connotations are inescapable in today's 
> world.
> I've created a separate task, SPARK-32036, to remove references outside of 
> this feature. Given the large surface area of this feature and the 
> public-facing UI / configs / etc., more care will need to be taken here.
> I'd like to start by opening up debate on what the best replacement name 
> would be. Reject-/deny-/ignore-/block-list are common replacements for 
> "blacklist", but I'm not sure that any of them work well for this situation.






[jira] [Assigned] (SPARK-32037) Rename blacklisting feature to avoid language with racist connotation

2020-10-30 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves reassigned SPARK-32037:
-

Assignee: Thomas Graves

> Rename blacklisting feature to avoid language with racist connotation
> -
>
> Key: SPARK-32037
> URL: https://issues.apache.org/jira/browse/SPARK-32037
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Erik Krogen
>Assignee: Thomas Graves
>Priority: Minor
>
> As per [discussion on the Spark dev 
> list|https://lists.apache.org/thread.html/rf6b2cdcba4d3875350517a2339619e5d54e12e66626a88553f9fe275%40%3Cdev.spark.apache.org%3E],
>  it will be beneficial to remove references to problematic language that can 
> alienate potential community members. One such reference is "blacklist". 
> While it seems to me that there is some valid debate as to whether this term 
> has racist origins, the cultural connotations are inescapable in today's 
> world.
> I've created a separate task, SPARK-32036, to remove references outside of 
> this feature. Given the large surface area of this feature and the 
> public-facing UI / configs / etc., more care will need to be taken here.
> I'd like to start by opening up debate on what the best replacement name 
> would be. Reject-/deny-/ignore-/block-list are common replacements for 
> "blacklist", but I'm not sure that any of them work well for this situation.






[jira] [Commented] (SPARK-29683) Job failed due to executor failures all available nodes are blacklisted

2020-10-30 Thread Dongwook Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223929#comment-17223929
 ] 

Dongwook Kwon commented on SPARK-29683:
---

I agree with Genmao: the logic added by SPARK-16630 is too strong a condition for 
failing the application.

 
{code:java}
def isAllNodeBlacklisted: Boolean = currentBlacklistedYarnNodes.size >= 
numClusterNodes
val allBlacklistedNodes = excludeNodes ++ schedulerBlacklist ++ 
allocatorBlacklist.keySet
{code}
 

I think the above logic only works for a partial failure or an intermittent issue 
where numClusterNodes does not change. When a scheduler or allocator failure becomes 
permanent and the ResourceManager has to remove the node from its pool, 
numClusterNodes changes and the above logic can fail the application.

For example, say a cluster has 2 NodeManagers (numClusterNodes = 2), and one 
NodeManager (N1) has issues that cause scheduling failures, which increases 
schedulerBlacklist.size to 1. If N1 later cannot recover from the ResourceManager's 
perspective (a hardware failure, operator decommissioning, or anything else), 
numClusterNodes becomes 1, which makes isAllNodeBlacklisted true even though there 
is still 1 healthy NodeManager available and 
"spark.yarn.blacklist.executor.launch.blacklisting.enabled" is set to false.

Particularly in cloud environments, cluster resizing happens all the time. For a 
long-running Spark application that sees many resize operations, 
schedulerBlacklist.size can keep increasing while numClusterNodes fluctuates; in 
addition, even when currentBlacklistedYarnNodes.size >= numClusterNodes is 
momentarily true, new nodes may be added again quickly.

I found 
[e70df2cea46f71461d8d401a420e946f999862c1|https://github.com/apache/spark/commit/e70df2cea46f71461d8d401a420e946f999862c1]
 was added to handle the case of numClusterNodes = 0.

However, for the other cases mentioned in this JIRA, I think it would make more 
sense to simply remove the following part from 
[ApplicationMaster|https://github.com/apache/spark/blob/branch-2.4/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L535-L538],
 because isAllNodeBlacklisted does not necessarily mean the running application 
needs to fail.

 
{code:java}
} else if (allocator.isAllNodeBlacklisted) {
 finish(FinalApplicationStatus.FAILED,
 ApplicationMaster.EXIT_MAX_EXECUTOR_FAILURES,
 "Due to executor failures all available nodes are blacklisted")
{code}
 

Or, at the very least, the above condition should be optional, gated by 
"spark.yarn.blacklist.executor.launch.blacklisting.enabled" or some new 
configuration: SPARK-16630 was added as an optional feature, but the logic above 
takes effect regardless of any configuration.
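A self-contained sketch of the gating being proposed above (the boolean flag is purely illustrative, not an existing Spark configuration):

{code:scala}
// Sketch of the proposed check: only treat "all nodes blacklisted" as fatal when an
// explicit (hypothetical) flag says so. Not actual Spark code.
object BlacklistFailureCheck {
  def shouldFailApplication(
      currentBlacklistedYarnNodes: Set[String],
      numClusterNodes: Int,
      failOnAllNodesBlacklisted: Boolean): Boolean = {
    val isAllNodeBlacklisted =
      numClusterNodes > 0 && currentBlacklistedYarnNodes.size >= numClusterNodes
    isAllNodeBlacklisted && failOnAllNodesBlacklisted
  }

  def main(args: Array[String]): Unit = {
    // One of two nodes is blacklisted, then the cluster shrinks to a single node:
    // with the flag off, the application keeps running instead of failing.
    println(shouldFailApplication(Set("n1"), numClusterNodes = 1,
      failOnAllNodesBlacklisted = false)) // prints: false
  }
}
{code}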

I would like to hear others' opinions on this.

 

> Job failed due to executor failures all available nodes are blacklisted
> ---
>
> Key: SPARK-29683
> URL: https://issues.apache.org/jira/browse/SPARK-29683
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 3.0.0
>Reporter: Genmao Yu
>Priority: Major
>
> My streaming job will fail *due to executor failures all available nodes are 
> blacklisted*. This exception is thrown only when all nodes are blacklisted:
> {code:java}
> def isAllNodeBlacklisted: Boolean = currentBlacklistedYarnNodes.size >= 
> numClusterNodes
> val allBlacklistedNodes = excludeNodes ++ schedulerBlacklist ++ 
> allocatorBlacklist.keySet
> {code}
> After diving into the code, I found some critical conditions that are not handled 
> properly:
>  - unchecked `excludeNodes`: it comes from user config. If not set properly, 
> it may lead to "currentBlacklistedYarnNodes.size >= numClusterNodes". For 
> example, we may set some nodes not in Yarn cluster.
> {code:java}
> excludeNodes = (invalid1, invalid2, invalid3)
> clusterNodes = (valid1, valid2)
> {code}
>  - `numClusterNodes` may equal 0: during a YARN HA failover, it takes some 
> time for all NodeManagers to register with the ResourceManager again. In this 
> case, `numClusterNodes` may equal 0 or some other number, and the Spark driver fails.
>  - too strong a condition check: the Spark driver fails as soon as 
> "currentBlacklistedYarnNodes.size >= numClusterNodes" holds. This condition does 
> not necessarily indicate an unrecoverable fatal state; for example, some 
> NodeManagers may simply be restarting, so we could allow some waiting time 
> before failing the job.






[jira] [Commented] (SPARK-33259) Joining 3 streams results in incorrect output

2020-10-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223925#comment-17223925
 ] 

Apache Spark commented on SPARK-33259:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/30210

> Joining 3 streams results in incorrect output
> -
>
> Key: SPARK-33259
> URL: https://issues.apache.org/jira/browse/SPARK-33259
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.1
>Reporter: Michael
>Priority: Critical
>
> I encountered an issue with Structured Streaming when doing a ((A LEFT JOIN 
> B) INNER JOIN C) operation. Below you can see example code I [posted on 
> Stackoverflow|https://stackoverflow.com/questions/64503539/]...
> I created a minimal example of "sessions", that have "start" and "end" events 
> and optionally some "metadata".
> The script generates two outputs: {{sessionStartsWithMetadata}} result from 
> "start" events that are left-joined with the "metadata" events, based on 
> {{sessionId}}. A "left join" is used, since we like to get an output event 
> even when no corresponding metadata exists.
> Additionally a DataFrame {{endedSessionsWithMetadata}} is created by joining 
> "end" events to the previously created DataFrame. Here an "inner join" is 
> used, since we only want some output when a session has ended for sure.
> This code can be executed in {{spark-shell}}:
> {code:scala}
> import java.sql.Timestamp
> import org.apache.spark.sql.execution.streaming.{MemoryStream, 
> StreamingQueryWrapper}
> import org.apache.spark.sql.streaming.StreamingQuery
> import org.apache.spark.sql.{DataFrame, SQLContext}
> import org.apache.spark.sql.functions.{col, expr, lit}
> import spark.implicits._
> implicit val sqlContext: SQLContext = spark.sqlContext
> // Main data processing, regardless whether batch or stream processing
> def process(
> sessionStartEvents: DataFrame,
> sessionOptionalMetadataEvents: DataFrame,
> sessionEndEvents: DataFrame
> ): (DataFrame, DataFrame) = {
>   val sessionStartsWithMetadata: DataFrame = sessionStartEvents
> .join(
>   sessionOptionalMetadataEvents,
>   sessionStartEvents("sessionId") === 
> sessionOptionalMetadataEvents("sessionId") &&
> sessionStartEvents("sessionStartTimestamp").between(
>   
> sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp").minus(expr(s"INTERVAL
>  1 seconds")),
>   
> sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp").plus(expr(s"INTERVAL
>  1 seconds"))
> ),
>   "left" // metadata is optional
> )
> .select(
>   sessionStartEvents("sessionId"),
>   sessionStartEvents("sessionStartTimestamp"),
>   sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp")
> )
>   val endedSessionsWithMetadata = sessionStartsWithMetadata.join(
> sessionEndEvents,
> sessionStartsWithMetadata("sessionId") === sessionEndEvents("sessionId") 
> &&
>   sessionStartsWithMetadata("sessionStartTimestamp").between(
> sessionEndEvents("sessionEndTimestamp").minus(expr(s"INTERVAL 10 
> seconds")),
> sessionEndEvents("sessionEndTimestamp")
>   )
>   )
>   (sessionStartsWithMetadata, endedSessionsWithMetadata)
> }
> def streamProcessing(
> sessionStartData: Seq[(Timestamp, Int)],
> sessionOptionalMetadata: Seq[(Timestamp, Int)],
> sessionEndData: Seq[(Timestamp, Int)]
> ): (StreamingQuery, StreamingQuery) = {
>   val sessionStartEventsStream: MemoryStream[(Timestamp, Int)] = 
> MemoryStream[(Timestamp, Int)]
>   sessionStartEventsStream.addData(sessionStartData)
>   val sessionStartEvents: DataFrame = sessionStartEventsStream
> .toDS()
> .toDF("sessionStartTimestamp", "sessionId")
> .withWatermark("sessionStartTimestamp", "1 second")
>   val sessionOptionalMetadataEventsStream: MemoryStream[(Timestamp, Int)] = 
> MemoryStream[(Timestamp, Int)]
>   sessionOptionalMetadataEventsStream.addData(sessionOptionalMetadata)
>   val sessionOptionalMetadataEvents: DataFrame = 
> sessionOptionalMetadataEventsStream
> .toDS()
> .toDF("sessionOptionalMetadataTimestamp", "sessionId")
> .withWatermark("sessionOptionalMetadataTimestamp", "1 second")
>   val sessionEndEventsStream: MemoryStream[(Timestamp, Int)] = 
> MemoryStream[(Timestamp, Int)]
>   sessionEndEventsStream.addData(sessionEndData)
>   val sessionEndEvents: DataFrame = sessionEndEventsStream
> .toDS()
> .toDF("sessionEndTimestamp", "sessionId")
> .withWatermark("sessionEndTimestamp", "1 second")
>   val (sessionStartsWithMetadata, endedSessionsWithMetadata) =
> process(sessionStartEvents, sessionOptionalMetadataEvents, 
> sessionEndEvents)
>   val sessionStartsWithMetadataQuery = sessi

[jira] [Assigned] (SPARK-33259) Joining 3 streams results in incorrect output

2020-10-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33259:


Assignee: (was: Apache Spark)

> Joining 3 streams results in incorrect output
> -
>
> Key: SPARK-33259
> URL: https://issues.apache.org/jira/browse/SPARK-33259
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.1
>Reporter: Michael
>Priority: Critical
>
> I encountered an issue with Structured Streaming when doing a ((A LEFT JOIN 
> B) INNER JOIN C) operation. Below you can see example code I [posted on 
> Stackoverflow|https://stackoverflow.com/questions/64503539/]...
> I created a minimal example of "sessions", that have "start" and "end" events 
> and optionally some "metadata".
> The script generates two outputs: {{sessionStartsWithMetadata}} result from 
> "start" events that are left-joined with the "metadata" events, based on 
> {{sessionId}}. A "left join" is used, since we like to get an output event 
> even when no corresponding metadata exists.
> Additionally a DataFrame {{endedSessionsWithMetadata}} is created by joining 
> "end" events to the previously created DataFrame. Here an "inner join" is 
> used, since we only want some output when a session has ended for sure.
> This code can be executed in {{spark-shell}}:
> {code:scala}
> import java.sql.Timestamp
> import org.apache.spark.sql.execution.streaming.{MemoryStream, 
> StreamingQueryWrapper}
> import org.apache.spark.sql.streaming.StreamingQuery
> import org.apache.spark.sql.{DataFrame, SQLContext}
> import org.apache.spark.sql.functions.{col, expr, lit}
> import spark.implicits._
> implicit val sqlContext: SQLContext = spark.sqlContext
> // Main data processing, regardless whether batch or stream processing
> def process(
> sessionStartEvents: DataFrame,
> sessionOptionalMetadataEvents: DataFrame,
> sessionEndEvents: DataFrame
> ): (DataFrame, DataFrame) = {
>   val sessionStartsWithMetadata: DataFrame = sessionStartEvents
> .join(
>   sessionOptionalMetadataEvents,
>   sessionStartEvents("sessionId") === 
> sessionOptionalMetadataEvents("sessionId") &&
> sessionStartEvents("sessionStartTimestamp").between(
>   
> sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp").minus(expr(s"INTERVAL
>  1 seconds")),
>   
> sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp").plus(expr(s"INTERVAL
>  1 seconds"))
> ),
>   "left" // metadata is optional
> )
> .select(
>   sessionStartEvents("sessionId"),
>   sessionStartEvents("sessionStartTimestamp"),
>   sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp")
> )
>   val endedSessionsWithMetadata = sessionStartsWithMetadata.join(
> sessionEndEvents,
> sessionStartsWithMetadata("sessionId") === sessionEndEvents("sessionId") 
> &&
>   sessionStartsWithMetadata("sessionStartTimestamp").between(
> sessionEndEvents("sessionEndTimestamp").minus(expr(s"INTERVAL 10 
> seconds")),
> sessionEndEvents("sessionEndTimestamp")
>   )
>   )
>   (sessionStartsWithMetadata, endedSessionsWithMetadata)
> }
> def streamProcessing(
> sessionStartData: Seq[(Timestamp, Int)],
> sessionOptionalMetadata: Seq[(Timestamp, Int)],
> sessionEndData: Seq[(Timestamp, Int)]
> ): (StreamingQuery, StreamingQuery) = {
>   val sessionStartEventsStream: MemoryStream[(Timestamp, Int)] = 
> MemoryStream[(Timestamp, Int)]
>   sessionStartEventsStream.addData(sessionStartData)
>   val sessionStartEvents: DataFrame = sessionStartEventsStream
> .toDS()
> .toDF("sessionStartTimestamp", "sessionId")
> .withWatermark("sessionStartTimestamp", "1 second")
>   val sessionOptionalMetadataEventsStream: MemoryStream[(Timestamp, Int)] = 
> MemoryStream[(Timestamp, Int)]
>   sessionOptionalMetadataEventsStream.addData(sessionOptionalMetadata)
>   val sessionOptionalMetadataEvents: DataFrame = 
> sessionOptionalMetadataEventsStream
> .toDS()
> .toDF("sessionOptionalMetadataTimestamp", "sessionId")
> .withWatermark("sessionOptionalMetadataTimestamp", "1 second")
>   val sessionEndEventsStream: MemoryStream[(Timestamp, Int)] = 
> MemoryStream[(Timestamp, Int)]
>   sessionEndEventsStream.addData(sessionEndData)
>   val sessionEndEvents: DataFrame = sessionEndEventsStream
> .toDS()
> .toDF("sessionEndTimestamp", "sessionId")
> .withWatermark("sessionEndTimestamp", "1 second")
>   val (sessionStartsWithMetadata, endedSessionsWithMetadata) =
> process(sessionStartEvents, sessionOptionalMetadataEvents, 
> sessionEndEvents)
>   val sessionStartsWithMetadataQuery = sessionStartsWithMetadata
> .select(lit("sessionStartsWithMetadata"), col("*")) // Add label to see 
> which query

[jira] [Commented] (SPARK-33259) Joining 3 streams results in incorrect output

2020-10-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223924#comment-17223924
 ] 

Apache Spark commented on SPARK-33259:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/30210

> Joining 3 streams results in incorrect output
> -
>
> Key: SPARK-33259
> URL: https://issues.apache.org/jira/browse/SPARK-33259
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.1
>Reporter: Michael
>Priority: Critical
>
> I encountered an issue with Structured Streaming when doing a ((A LEFT JOIN 
> B) INNER JOIN C) operation. Below you can see example code I [posted on 
> Stackoverflow|https://stackoverflow.com/questions/64503539/]...
> I created a minimal example of "sessions", that have "start" and "end" events 
> and optionally some "metadata".
> The script generates two outputs: {{sessionStartsWithMetadata}} result from 
> "start" events that are left-joined with the "metadata" events, based on 
> {{sessionId}}. A "left join" is used, since we like to get an output event 
> even when no corresponding metadata exists.
> Additionally a DataFrame {{endedSessionsWithMetadata}} is created by joining 
> "end" events to the previously created DataFrame. Here an "inner join" is 
> used, since we only want some output when a session has ended for sure.
> This code can be executed in {{spark-shell}}:
> {code:scala}
> import java.sql.Timestamp
> import org.apache.spark.sql.execution.streaming.{MemoryStream, 
> StreamingQueryWrapper}
> import org.apache.spark.sql.streaming.StreamingQuery
> import org.apache.spark.sql.{DataFrame, SQLContext}
> import org.apache.spark.sql.functions.{col, expr, lit}
> import spark.implicits._
> implicit val sqlContext: SQLContext = spark.sqlContext
> // Main data processing, regardless whether batch or stream processing
> def process(
> sessionStartEvents: DataFrame,
> sessionOptionalMetadataEvents: DataFrame,
> sessionEndEvents: DataFrame
> ): (DataFrame, DataFrame) = {
>   val sessionStartsWithMetadata: DataFrame = sessionStartEvents
> .join(
>   sessionOptionalMetadataEvents,
>   sessionStartEvents("sessionId") === 
> sessionOptionalMetadataEvents("sessionId") &&
> sessionStartEvents("sessionStartTimestamp").between(
>   
> sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp").minus(expr(s"INTERVAL
>  1 seconds")),
>   
> sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp").plus(expr(s"INTERVAL
>  1 seconds"))
> ),
>   "left" // metadata is optional
> )
> .select(
>   sessionStartEvents("sessionId"),
>   sessionStartEvents("sessionStartTimestamp"),
>   sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp")
> )
>   val endedSessionsWithMetadata = sessionStartsWithMetadata.join(
> sessionEndEvents,
> sessionStartsWithMetadata("sessionId") === sessionEndEvents("sessionId") 
> &&
>   sessionStartsWithMetadata("sessionStartTimestamp").between(
> sessionEndEvents("sessionEndTimestamp").minus(expr(s"INTERVAL 10 
> seconds")),
> sessionEndEvents("sessionEndTimestamp")
>   )
>   )
>   (sessionStartsWithMetadata, endedSessionsWithMetadata)
> }
> def streamProcessing(
> sessionStartData: Seq[(Timestamp, Int)],
> sessionOptionalMetadata: Seq[(Timestamp, Int)],
> sessionEndData: Seq[(Timestamp, Int)]
> ): (StreamingQuery, StreamingQuery) = {
>   val sessionStartEventsStream: MemoryStream[(Timestamp, Int)] = 
> MemoryStream[(Timestamp, Int)]
>   sessionStartEventsStream.addData(sessionStartData)
>   val sessionStartEvents: DataFrame = sessionStartEventsStream
> .toDS()
> .toDF("sessionStartTimestamp", "sessionId")
> .withWatermark("sessionStartTimestamp", "1 second")
>   val sessionOptionalMetadataEventsStream: MemoryStream[(Timestamp, Int)] = 
> MemoryStream[(Timestamp, Int)]
>   sessionOptionalMetadataEventsStream.addData(sessionOptionalMetadata)
>   val sessionOptionalMetadataEvents: DataFrame = 
> sessionOptionalMetadataEventsStream
> .toDS()
> .toDF("sessionOptionalMetadataTimestamp", "sessionId")
> .withWatermark("sessionOptionalMetadataTimestamp", "1 second")
>   val sessionEndEventsStream: MemoryStream[(Timestamp, Int)] = 
> MemoryStream[(Timestamp, Int)]
>   sessionEndEventsStream.addData(sessionEndData)
>   val sessionEndEvents: DataFrame = sessionEndEventsStream
> .toDS()
> .toDF("sessionEndTimestamp", "sessionId")
> .withWatermark("sessionEndTimestamp", "1 second")
>   val (sessionStartsWithMetadata, endedSessionsWithMetadata) =
> process(sessionStartEvents, sessionOptionalMetadataEvents, 
> sessionEndEvents)
>   val sessionStartsWithMetadataQuery = sessi

[jira] [Assigned] (SPARK-33259) Joining 3 streams results in incorrect output

2020-10-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33259:


Assignee: Apache Spark

> Joining 3 streams results in incorrect output
> -
>
> Key: SPARK-33259
> URL: https://issues.apache.org/jira/browse/SPARK-33259
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.1
>Reporter: Michael
>Assignee: Apache Spark
>Priority: Critical
>
> I encountered an issue with Structured Streaming when doing a ((A LEFT JOIN 
> B) INNER JOIN C) operation. Below you can see example code I [posted on 
> Stackoverflow|https://stackoverflow.com/questions/64503539/]...
> I created a minimal example of "sessions", that have "start" and "end" events 
> and optionally some "metadata".
> The script generates two outputs: {{sessionStartsWithMetadata}} result from 
> "start" events that are left-joined with the "metadata" events, based on 
> {{sessionId}}. A "left join" is used, since we like to get an output event 
> even when no corresponding metadata exists.
> Additionally a DataFrame {{endedSessionsWithMetadata}} is created by joining 
> "end" events to the previously created DataFrame. Here an "inner join" is 
> used, since we only want some output when a session has ended for sure.
> This code can be executed in {{spark-shell}}:
> {code:scala}
> import java.sql.Timestamp
> import org.apache.spark.sql.execution.streaming.{MemoryStream, 
> StreamingQueryWrapper}
> import org.apache.spark.sql.streaming.StreamingQuery
> import org.apache.spark.sql.{DataFrame, SQLContext}
> import org.apache.spark.sql.functions.{col, expr, lit}
> import spark.implicits._
> implicit val sqlContext: SQLContext = spark.sqlContext
> // Main data processing, regardless whether batch or stream processing
> def process(
> sessionStartEvents: DataFrame,
> sessionOptionalMetadataEvents: DataFrame,
> sessionEndEvents: DataFrame
> ): (DataFrame, DataFrame) = {
>   val sessionStartsWithMetadata: DataFrame = sessionStartEvents
> .join(
>   sessionOptionalMetadataEvents,
>   sessionStartEvents("sessionId") === 
> sessionOptionalMetadataEvents("sessionId") &&
> sessionStartEvents("sessionStartTimestamp").between(
>   
> sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp").minus(expr(s"INTERVAL
>  1 seconds")),
>   
> sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp").plus(expr(s"INTERVAL
>  1 seconds"))
> ),
>   "left" // metadata is optional
> )
> .select(
>   sessionStartEvents("sessionId"),
>   sessionStartEvents("sessionStartTimestamp"),
>   sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp")
> )
>   val endedSessionsWithMetadata = sessionStartsWithMetadata.join(
> sessionEndEvents,
> sessionStartsWithMetadata("sessionId") === sessionEndEvents("sessionId") 
> &&
>   sessionStartsWithMetadata("sessionStartTimestamp").between(
> sessionEndEvents("sessionEndTimestamp").minus(expr(s"INTERVAL 10 
> seconds")),
> sessionEndEvents("sessionEndTimestamp")
>   )
>   )
>   (sessionStartsWithMetadata, endedSessionsWithMetadata)
> }
> def streamProcessing(
> sessionStartData: Seq[(Timestamp, Int)],
> sessionOptionalMetadata: Seq[(Timestamp, Int)],
> sessionEndData: Seq[(Timestamp, Int)]
> ): (StreamingQuery, StreamingQuery) = {
>   val sessionStartEventsStream: MemoryStream[(Timestamp, Int)] = 
> MemoryStream[(Timestamp, Int)]
>   sessionStartEventsStream.addData(sessionStartData)
>   val sessionStartEvents: DataFrame = sessionStartEventsStream
> .toDS()
> .toDF("sessionStartTimestamp", "sessionId")
> .withWatermark("sessionStartTimestamp", "1 second")
>   val sessionOptionalMetadataEventsStream: MemoryStream[(Timestamp, Int)] = 
> MemoryStream[(Timestamp, Int)]
>   sessionOptionalMetadataEventsStream.addData(sessionOptionalMetadata)
>   val sessionOptionalMetadataEvents: DataFrame = 
> sessionOptionalMetadataEventsStream
> .toDS()
> .toDF("sessionOptionalMetadataTimestamp", "sessionId")
> .withWatermark("sessionOptionalMetadataTimestamp", "1 second")
>   val sessionEndEventsStream: MemoryStream[(Timestamp, Int)] = 
> MemoryStream[(Timestamp, Int)]
>   sessionEndEventsStream.addData(sessionEndData)
>   val sessionEndEvents: DataFrame = sessionEndEventsStream
> .toDS()
> .toDF("sessionEndTimestamp", "sessionId")
> .withWatermark("sessionEndTimestamp", "1 second")
>   val (sessionStartsWithMetadata, endedSessionsWithMetadata) =
> process(sessionStartEvents, sessionOptionalMetadataEvents, 
> sessionEndEvents)
>   val sessionStartsWithMetadataQuery = sessionStartsWithMetadata
> .select(lit("sessionStartsWithMetadata"), col("*")) // Add la

[jira] [Closed] (SPARK-32661) Spark executors on K8S do not request extra memory for off-heap allocations

2020-10-30 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-32661.
-

> Spark executors on K8S do not request extra memory for off-heap allocations
> ---
>
> Key: SPARK-32661
> URL: https://issues.apache.org/jira/browse/SPARK-32661
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.0.0, 3.0.1, 3.1.0
>Reporter: Luca Canali
>Priority: Minor
>
> Off-heap memory allocations are configured using 
> `spark.memory.offHeap.enabled=true` and `conf 
> spark.memory.offHeap.size=`. Spark on YARN adds the off-heap memory 
> size to the executor container resources. Spark on Kubernetes does not 
> request the allocation of the off-heap memory. Currently, this can be worked 
> around by using spark.executor.memoryOverhead to reserve memory for off-heap 
> allocations. This proposes to make Spark on Kubernetes behave as in the case of 
> YARN, that is, adding spark.memory.offHeap.size to the memory request for 
> executor containers.
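
For reference, a minimal sketch of the configuration described above (the sizes are illustrative placeholders, not values from the report):

{code:scala}
import org.apache.spark.SparkConf

// Off-heap settings from the report plus the memoryOverhead workaround for
// Kubernetes executors. The "2g" values below are illustrative only.
val conf = new SparkConf()
  .set("spark.memory.offHeap.enabled", "true")
  .set("spark.memory.offHeap.size", "2g")
  // Workaround until the K8S backend adds offHeap.size to the container request:
  .set("spark.executor.memoryOverhead", "2g")
{code}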






[jira] [Resolved] (SPARK-32661) Spark executors on K8S do not request extra memory for off-heap allocations

2020-10-30 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-32661.
---
Resolution: Duplicate

This is superseded by SPARK-33288.

> Spark executors on K8S do not request extra memory for off-heap allocations
> ---
>
> Key: SPARK-32661
> URL: https://issues.apache.org/jira/browse/SPARK-32661
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.0.0, 3.0.1, 3.1.0
>Reporter: Luca Canali
>Priority: Minor
>
> Off-heap memory allocations are configured using 
> `spark.memory.offHeap.enabled=true` and `conf 
> spark.memory.offHeap.size=`. Spark on YARN adds the off-heap memory 
> size to the executor container resources. Spark on Kubernetes does not 
> request the allocation of the off-heap memory. Currently, this can be worked 
> around by using spark.executor.memoryOverhead to reserve memory for off-heap 
> allocations. This proposes to make Spark on Kubernetes behave as in the case of 
> YARN, that is, adding spark.memory.offHeap.size to the memory request for 
> executor containers.






[jira] [Closed] (SPARK-26888) Upgrade to Log4j 2.x

2020-10-30 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-26888.
-

> Upgrade to Log4j 2.x
> 
>
> Key: SPARK-26888
> URL: https://issues.apache.org/jira/browse/SPARK-26888
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> 5 August 2015 -- The Apache Logging Services™ Project Management Committee 
> (PMC) has announced that the Log4j™ 1.x logging framework has reached its end 
> of life (EOL) and is no longer officially supported.
> Please see the related blog post: 
> https://blogs.apache.org/foundation/entry/apache_logging_services_project_announces
> Migration guide on how to migrate from Log4j 1.x to 2.x: 
> https://logging.apache.org/log4j/2.x/manual/migration.html






[jira] [Updated] (SPARK-33261) Allow people to extend the pod feature steps

2020-10-30 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33261:
--
Parent: SPARK-33005
Issue Type: Sub-task  (was: Improvement)

> Allow people to extend the pod feature steps
> 
>
> Key: SPARK-33261
> URL: https://issues.apache.org/jira/browse/SPARK-33261
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Holden Karau
>Priority: Major
>
> While we allow people to specify pod templates, some deployments could 
> benefit from being able to add a feature step.






[jira] [Commented] (SPARK-33098) Explicit or implicit casts involving partition columns can sometimes result in a MetaException.

2020-10-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223812#comment-17223812
 ] 

Apache Spark commented on SPARK-33098:
--

User 'bersprockets' has created a pull request for this issue:
https://github.com/apache/spark/pull/30207

> Explicit or implicit casts involving partition columns can sometimes result 
> in a MetaException.
> ---
>
> Key: SPARK-33098
> URL: https://issues.apache.org/jira/browse/SPARK-33098
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Bruce Robbins
>Priority: Major
>
> The following cases throw
> {{MetaException(message:Filtering is supported only on partition keys of type 
> string)}}
> {noformat}
> sql("create table test (a int) partitioned by (b int) stored as parquet")
> sql("insert into test values (1, 1), (1, 2), (2, 2)")
> // These throw MetaExceptions
> sql("select * from test where b in ('2')").show(false)
> sql("select * from test where cast(b as string) = '2'").show(false)
> sql("select * from test where cast(b as string) in ('2')").show(false)
> sql("select * from test where cast(b as string) in (2)").show(false)
> sql("select cast(b as string) as b from test where b in ('2')").show(false)
> sql("select cast(b as string) as b from test").filter("b = '2'").show(false) 
> // [1]
> sql("select cast(b as string) as b from test").filter("b in (2)").show(false) 
> // [2]
> sql("select cast(b as string) as b from test").filter("b in 
> ('2')").show(false)
> sql("select * from test where cast(b as string) > '1'").show(false)
> sql("select cast(b as string) b from test").filter("b > '1'").show(false) // 
> [3]
> // [1] but not sql("select cast(b as string) as b from test where b = 
> '2'").show(false)
> // [2] but not sql("select cast(b as string) as b from test where b in 
> (2)").show(false)
> // [3] but not sql("select cast(b as string) b from test where b > 
> '1'").show(false)
> {noformat}
> The message ("Filtering is supported only on partition keys of type string") 
> is misleading. Filtering *is* supported on integer columns, for example.






[jira] [Commented] (SPARK-33262) Keep pending pods in account while scheduling new pods

2020-10-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223807#comment-17223807
 ] 

Apache Spark commented on SPARK-33262:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/30205

> Keep pending pods in account while scheduling new pods
> --
>
> Key: SPARK-33262
> URL: https://issues.apache.org/jira/browse/SPARK-33262
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Major
> Fix For: 3.1.0
>
>







[jira] [Commented] (SPARK-33261) Allow people to extend the pod feature steps

2020-10-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223806#comment-17223806
 ] 

Apache Spark commented on SPARK-33261:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/30206

> Allow people to extend the pod feature steps
> 
>
> Key: SPARK-33261
> URL: https://issues.apache.org/jira/browse/SPARK-33261
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Holden Karau
>Priority: Major
>
> While we allow people to specify pod templates, some deployments could 
> benefit from being able to add a feature step.






[jira] [Assigned] (SPARK-33261) Allow people to extend the pod feature steps

2020-10-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33261:


Assignee: (was: Apache Spark)

> Allow people to extend the pod feature steps
> 
>
> Key: SPARK-33261
> URL: https://issues.apache.org/jira/browse/SPARK-33261
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Holden Karau
>Priority: Major
>
> While we allow people to specify pod templates, some deployments could 
> benefit from being able to add a feature step.






[jira] [Assigned] (SPARK-33261) Allow people to extend the pod feature steps

2020-10-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33261:


Assignee: Apache Spark

> Allow people to extend the pod feature steps
> 
>
> Key: SPARK-33261
> URL: https://issues.apache.org/jira/browse/SPARK-33261
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Holden Karau
>Assignee: Apache Spark
>Priority: Major
>
> While we allow people to specify pod templates, some deployments could 
> benefit from being able to add a feature step.






[jira] [Updated] (SPARK-33259) Joining 3 streams results in incorrect output

2020-10-30 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33259:
--
Priority: Critical  (was: Major)

> Joining 3 streams results in incorrect output
> -
>
> Key: SPARK-33259
> URL: https://issues.apache.org/jira/browse/SPARK-33259
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.1
>Reporter: Michael
>Priority: Critical
>
> I encountered an issue with Structured Streaming when doing a ((A LEFT JOIN 
> B) INNER JOIN C) operation. Below you can see example code I [posted on 
> Stackoverflow|https://stackoverflow.com/questions/64503539/]...
> I created a minimal example of "sessions" that have "start" and "end" events 
> and optionally some "metadata".
> The script generates two outputs: {{sessionStartsWithMetadata}}, the result of 
> "start" events that are left-joined with the "metadata" events, based on 
> {{sessionId}}. A "left join" is used, since we would like to get an output event 
> even when no corresponding metadata exists.
> Additionally, a DataFrame {{endedSessionsWithMetadata}} is created by joining 
> "end" events to the previously created DataFrame. Here an "inner join" is 
> used, since we only want output when a session has ended for sure.
> This code can be executed in {{spark-shell}}:
> {code:scala}
> import java.sql.Timestamp
> import org.apache.spark.sql.execution.streaming.{MemoryStream, 
> StreamingQueryWrapper}
> import org.apache.spark.sql.streaming.StreamingQuery
> import org.apache.spark.sql.{DataFrame, SQLContext}
> import org.apache.spark.sql.functions.{col, expr, lit}
> import spark.implicits._
> implicit val sqlContext: SQLContext = spark.sqlContext
> // Main data processing, regardless whether batch or stream processing
> def process(
> sessionStartEvents: DataFrame,
> sessionOptionalMetadataEvents: DataFrame,
> sessionEndEvents: DataFrame
> ): (DataFrame, DataFrame) = {
>   val sessionStartsWithMetadata: DataFrame = sessionStartEvents
> .join(
>   sessionOptionalMetadataEvents,
>   sessionStartEvents("sessionId") === 
> sessionOptionalMetadataEvents("sessionId") &&
> sessionStartEvents("sessionStartTimestamp").between(
>   
> sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp").minus(expr(s"INTERVAL
>  1 seconds")),
>   
> sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp").plus(expr(s"INTERVAL
>  1 seconds"))
> ),
>   "left" // metadata is optional
> )
> .select(
>   sessionStartEvents("sessionId"),
>   sessionStartEvents("sessionStartTimestamp"),
>   sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp")
> )
>   val endedSessionsWithMetadata = sessionStartsWithMetadata.join(
> sessionEndEvents,
> sessionStartsWithMetadata("sessionId") === sessionEndEvents("sessionId") 
> &&
>   sessionStartsWithMetadata("sessionStartTimestamp").between(
> sessionEndEvents("sessionEndTimestamp").minus(expr(s"INTERVAL 10 
> seconds")),
> sessionEndEvents("sessionEndTimestamp")
>   )
>   )
>   (sessionStartsWithMetadata, endedSessionsWithMetadata)
> }
> def streamProcessing(
> sessionStartData: Seq[(Timestamp, Int)],
> sessionOptionalMetadata: Seq[(Timestamp, Int)],
> sessionEndData: Seq[(Timestamp, Int)]
> ): (StreamingQuery, StreamingQuery) = {
>   val sessionStartEventsStream: MemoryStream[(Timestamp, Int)] = 
> MemoryStream[(Timestamp, Int)]
>   sessionStartEventsStream.addData(sessionStartData)
>   val sessionStartEvents: DataFrame = sessionStartEventsStream
> .toDS()
> .toDF("sessionStartTimestamp", "sessionId")
> .withWatermark("sessionStartTimestamp", "1 second")
>   val sessionOptionalMetadataEventsStream: MemoryStream[(Timestamp, Int)] = 
> MemoryStream[(Timestamp, Int)]
>   sessionOptionalMetadataEventsStream.addData(sessionOptionalMetadata)
>   val sessionOptionalMetadataEvents: DataFrame = 
> sessionOptionalMetadataEventsStream
> .toDS()
> .toDF("sessionOptionalMetadataTimestamp", "sessionId")
> .withWatermark("sessionOptionalMetadataTimestamp", "1 second")
>   val sessionEndEventsStream: MemoryStream[(Timestamp, Int)] = 
> MemoryStream[(Timestamp, Int)]
>   sessionEndEventsStream.addData(sessionEndData)
>   val sessionEndEvents: DataFrame = sessionEndEventsStream
> .toDS()
> .toDF("sessionEndTimestamp", "sessionId")
> .withWatermark("sessionEndTimestamp", "1 second")
>   val (sessionStartsWithMetadata, endedSessionsWithMetadata) =
> process(sessionStartEvents, sessionOptionalMetadataEvents, 
> sessionEndEvents)
>   val sessionStartsWithMetadataQuery = sessionStartsWithMetadata
> .select(lit("sessionStartsWithMetadata"), col("*")) // Add label to see 
> which query's out

[jira] [Commented] (SPARK-33259) Joining 3 streams results in incorrect output

2020-10-30 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223779#comment-17223779
 ] 

Dongjoon Hyun commented on SPARK-33259:
---

Is there a way to block this at the analysis stage instead of giving an 
incorrect output?

cc [~viirya] and [~dbtsai]

> Joining 3 streams results in incorrect output
> -
>
> Key: SPARK-33259
> URL: https://issues.apache.org/jira/browse/SPARK-33259
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.1
>Reporter: Michael
>Priority: Major
>
> I encountered an issue with Structured Streaming when doing a ((A LEFT JOIN 
> B) INNER JOIN C) operation. Below you can see example code I [posted on 
> Stackoverflow|https://stackoverflow.com/questions/64503539/]...
> I created a minimal example of "sessions" that have "start" and "end" events 
> and optionally some "metadata".
> The script generates two outputs: {{sessionStartsWithMetadata}}, the result of 
> "start" events that are left-joined with the "metadata" events, based on 
> {{sessionId}}. A "left join" is used, since we would like to get an output event 
> even when no corresponding metadata exists.
> Additionally, a DataFrame {{endedSessionsWithMetadata}} is created by joining 
> "end" events to the previously created DataFrame. Here an "inner join" is 
> used, since we only want output when a session has ended for sure.
> This code can be executed in {{spark-shell}}:
> {code:scala}
> import java.sql.Timestamp
> import org.apache.spark.sql.execution.streaming.{MemoryStream, 
> StreamingQueryWrapper}
> import org.apache.spark.sql.streaming.StreamingQuery
> import org.apache.spark.sql.{DataFrame, SQLContext}
> import org.apache.spark.sql.functions.{col, expr, lit}
> import spark.implicits._
> implicit val sqlContext: SQLContext = spark.sqlContext
> // Main data processing, regardless whether batch or stream processing
> def process(
> sessionStartEvents: DataFrame,
> sessionOptionalMetadataEvents: DataFrame,
> sessionEndEvents: DataFrame
> ): (DataFrame, DataFrame) = {
>   val sessionStartsWithMetadata: DataFrame = sessionStartEvents
> .join(
>   sessionOptionalMetadataEvents,
>   sessionStartEvents("sessionId") === 
> sessionOptionalMetadataEvents("sessionId") &&
> sessionStartEvents("sessionStartTimestamp").between(
>   
> sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp").minus(expr(s"INTERVAL
>  1 seconds")),
>   
> sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp").plus(expr(s"INTERVAL
>  1 seconds"))
> ),
>   "left" // metadata is optional
> )
> .select(
>   sessionStartEvents("sessionId"),
>   sessionStartEvents("sessionStartTimestamp"),
>   sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp")
> )
>   val endedSessionsWithMetadata = sessionStartsWithMetadata.join(
> sessionEndEvents,
> sessionStartsWithMetadata("sessionId") === sessionEndEvents("sessionId") 
> &&
>   sessionStartsWithMetadata("sessionStartTimestamp").between(
> sessionEndEvents("sessionEndTimestamp").minus(expr(s"INTERVAL 10 
> seconds")),
> sessionEndEvents("sessionEndTimestamp")
>   )
>   )
>   (sessionStartsWithMetadata, endedSessionsWithMetadata)
> }
> def streamProcessing(
> sessionStartData: Seq[(Timestamp, Int)],
> sessionOptionalMetadata: Seq[(Timestamp, Int)],
> sessionEndData: Seq[(Timestamp, Int)]
> ): (StreamingQuery, StreamingQuery) = {
>   val sessionStartEventsStream: MemoryStream[(Timestamp, Int)] = 
> MemoryStream[(Timestamp, Int)]
>   sessionStartEventsStream.addData(sessionStartData)
>   val sessionStartEvents: DataFrame = sessionStartEventsStream
> .toDS()
> .toDF("sessionStartTimestamp", "sessionId")
> .withWatermark("sessionStartTimestamp", "1 second")
>   val sessionOptionalMetadataEventsStream: MemoryStream[(Timestamp, Int)] = 
> MemoryStream[(Timestamp, Int)]
>   sessionOptionalMetadataEventsStream.addData(sessionOptionalMetadata)
>   val sessionOptionalMetadataEvents: DataFrame = 
> sessionOptionalMetadataEventsStream
> .toDS()
> .toDF("sessionOptionalMetadataTimestamp", "sessionId")
> .withWatermark("sessionOptionalMetadataTimestamp", "1 second")
>   val sessionEndEventsStream: MemoryStream[(Timestamp, Int)] = 
> MemoryStream[(Timestamp, Int)]
>   sessionEndEventsStream.addData(sessionEndData)
>   val sessionEndEvents: DataFrame = sessionEndEventsStream
> .toDS()
> .toDF("sessionEndTimestamp", "sessionId")
> .withWatermark("sessionEndTimestamp", "1 second")
>   val (sessionStartsWithMetadata, endedSessionsWithMetadata) =
> process(sessionStartEvents, sessionOptionalMetadataEvents, 
> sessionEndEvents)
>   val sessionStartsWithMet

[jira] [Commented] (SPARK-33301) Code grows beyond 64 KB when compiling large CaseWhen

2020-10-30 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223776#comment-17223776
 ] 

Dongjoon Hyun commented on SPARK-33301:
---

Thank you, [~yumwang].

> Code grows beyond 64 KB when compiling large CaseWhen
> -
>
> Key: SPARK-33301
> URL: https://issues.apache.org/jira/browse/SPARK-33301
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce this issue:
> {code:sql}
> CREATE TABLE TRANS_BASE (BRAND STRING, SUB_BRAND STRING, MODEL STRING, 
> AUCT_TITL STRING, PRODUCT_LINE STRING, MODELS STRING) USING PARQUET;
> SELECT T.*,
> CASE
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%SPARK%', 
> '%4D%') THEN 'SPARK 4D'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%SPARK%', 
> '%BASKETBALL%') THEN 'SPARK Basketball'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%SPARK%', 
> '%BASKETBALL%') THEN 'SPARK Basketball'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%EQT%') THEN 
> 'EQT'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%NMD%') THEN 
> 'NMD'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%NMD%') THEN 
> 'NMD'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%NMD%') THEN 
> 'NMD'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%NMD%') THEN 
> 'NMD'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%NMD%') THEN 
> 'NMD'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%NMD%') THEN 
> 'NMD'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%ORIGINALS%') 
> THEN 'Originals'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', 
> '%SPARK%') THEN 'Other SPARK'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', 
> '%SPARK%') THEN 'Other SPARK'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', 
> '%SPARK%') THEN 'Other SPARK'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', 
> '%SPARK%') THEN 'Other SPARK'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', 
> '%SPARK%') THEN 'Other SPARK'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', 
> '%SPARK%') THEN 'Other SPARK'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', 
> '%SPARK%') THEN 'Other SPARK'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', 
> '%SPARK%') THEN 'Other SPARK'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', 
> '%SPARK%') THEN 'Other SPARK'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', 
> '%SPARK%') THEN 'Other SPARK'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', 
> '%SPARK%') THEN 'Other SPARK'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', 
> '%SPARK%') THEN 'Other SPARK'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', 
> '%SPARK%') THEN 'Other SPARK'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', 
> '%SPARK%') THEN 'Other SPARK'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%PHARELL%', 
> '%HUMAN%', '%RACE%') THEN 'Pharell Human Race'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%PHARELL%', 
> '%HUMAN%', '%RACE%') THEN 'Pharell Human Race'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%PHARELL%', 
> '%HUMAN%', '%RACE%') THEN 'Pharell Human Race'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%PROPHERE%') 
> THEN 'Prophere'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%ULTRA%', 
> '%BOOST%') THEN 'Ultra Boost'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%ULTRA%', 
> '%BOOST%') THEN 'Ultra Boost'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%ULTRA%', 
> '%BOOST%') THEN 'Ultra Boost'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%ULTRA%', 
> '%BOOST%') THEN 'Ultra Boost'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%ULTRA%', 
> '%BOOST%') THEN 'Ultra Boost'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%YUNG%', 
> '%1%') THEN 'Yung 1'
> WHEN Upper(BRAND) = 'HIVE' AND Upper(MODEL) LIKE ALL ('%AIR%', 
> '%FORCE%', '%1%') THEN 'Air Force 1'
> WHEN Upper(BRAND) = 'HIVE' AND Upper(MODEL) LIKE ALL ('%AIR%', 
> '%FORCE%', '%1%') THEN 'Air Force 1'
> WHEN Upper(BRAND) = 'HIVE' AND Upper(MODEL) LIKE ALL ('%AIR%', 
> '%FORCE%', '%1%') THEN 'Air Force 1'
> WHEN Upper(BRAND) = 'HIVE' AND Upper(MODEL) LIKE 

[jira] [Closed] (SPARK-33279) Spark 3.0 failure due to lack of Arrow dependency

2020-10-30 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-33279.
-

> Spark 3.0 failure due to lack of Arrow dependency
> -
>
> Key: SPARK-33279
> URL: https://issues.apache.org/jira/browse/SPARK-33279
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Liya Fan
>Priority: Major
>
> A recent change in Arrow has split the arrow-memory module into three, so client 
> code must add a dependency on arrow-memory-netty (or arrow-memory-unsafe).
> This has been done in the master branch of Spark, but not in branch-3.0, 
> which is causing the branch-3.0 build to fail 
> (https://github.com/ursa-labs/crossbow/actions?query=branch:actions-681-github-test-conda-python-3.7-spark-branch-3.0)
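
For illustration, an sbt sketch of the extra dependency described above (the version shown is an assumption; match it to the Arrow version Spark is built against):

{code:scala}
// build.sbt fragment: add the split-out Arrow memory implementation explicitly.
// "1.0.1" is a placeholder version, not taken from the ticket.
libraryDependencies += "org.apache.arrow" % "arrow-memory-netty" % "1.0.1"
{code}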






[jira] [Commented] (SPARK-33280) Spark 3.0 serialization issue

2020-10-30 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223762#comment-17223762
 ] 

L. C. Hsieh commented on SPARK-33280:
-

Yeah, I think you can work around this with a {{lazy}} or {{transient}} on {{sc}}. 
It seems like ClosureCleaner cannot clean up {{sc}} in {{trait Stage}}. You can 
also try to inline the {{trait}} into {{SampleStage}}.
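
A minimal sketch of that workaround, reusing the names from the report quoted below (untested; just to show the shape):

{code:scala}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

class App(sc: SparkContext) {
  def write(data: RDD[String]): Unit =
    if (!data.isEmpty()) data.saveAsTextFile("/tmp/test-output")
}

trait Stage {
  // Marked @transient lazy so the SparkContext and App references are not
  // captured when the map closure below is serialized.
  @transient lazy val sc: SparkContext = SparkContext.getOrCreate()
  @transient lazy val app: App = new App(sc)
}

object SampleStage extends Stage {
  def main(args: Array[String]): Unit = {
    val dataRDD = sc.parallelize(Seq("a", "b", "c")).map(_.toLowerCase)
    app.write(dataRDD)
    sc.stop()
  }
}
{code}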

> Spark 3.0 serialization issue
> -
>
> Key: SPARK-33280
> URL: https://issues.apache.org/jira/browse/SPARK-33280
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1
> Environment: OS: MacOS Catalina 10.15.7
> +Environment 1:+
> Spark: spark-3.0.1-bin-hadoop2.7
> code compiled with : scala 2.12
> +Environment 2:+
> spark-2.4.7-bin-hadoop2.7
> code compiled with: scala 2.12
> +Environment 3 (This succeeds):+
> spark-2.4.7-bin-hadoop2.7
> code compiled with: scala 2.11
> Reproducible on YARN cluster with 3.0.1
>Reporter: Kiran Kumar Joseph
>Priority: Major
>
> This code, which does a simple RDD map operation and saves the data, fails on 
> the executor with a deserialization error.
> {noformat}
> package test.pkg
> import org.apache.spark.SparkContext
> import org.apache.spark.rdd.RDD
> trait Stage  {
>   val sc = SparkContext.getOrCreate()
>   sc.setCheckpointDir(sc.getConf.get("spark.checkpoint.dir",
> "/tmp/checkpoints"))
>   val app = new App(sc)
> }
> class App(sc: SparkContext) {
>   def write(data: RDD[String]): Unit = {
> if (!data.isEmpty())
>   data.saveAsTextFile("/tmp/test-output")
>   }
> }
> object SampleStage extends Stage {
>   def main(args: Array[String]): Unit = {
> val sampleList = Seq("a", "b", "c")
> writeList(sampleList)
> sc.stop()
>   }
>   def writeList(data: Seq[String]) = {
> val dataRDD = sc.parallelize(data).map(x => x.toLowerCase())
> app.write(dataRDD)
>   }
> } {noformat}
>  
> If I change the line
> {noformat}
> val dataRDD = sc.parallelize(data).map(x => x.toLowerCase()) {noformat}
> to
> {noformat}
> val dataRDD = sc.parallelize(data) {noformat}
> there is no exception and the job succeeds.
> spark-submit command:
> {noformat}
> $SPARK_HOME/bin/spark-submit --master spark://masterhost:port --conf 
> spark.app.name=SampleStage --executor-memory 2g --num-executors 2 --class 
> test.pkg.SampleStage --deploy-mode client --conf spark.driver.cores=4 
> sampleStage.jar {noformat}
>  
> Executor stacktrace:
> {noformat}
> Driver stacktrace:
> 20/10/28 22:55:34 INFO DAGScheduler: Job 0 failed: isEmpty at App.scala:13, 
> took 1.931078 s
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: 
> Lost task 0.3 in stage 0.0 (TID 3, 192.168.86.20, executor 0): 
> java.lang.NoClassDefFoundError: test.pkg.SampleStage$ (initialization failure)
>   at 
> java.lang.J9VMInternals.initializationAlreadyFailed(J9VMInternals.java:98)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> java.lang.invoke.SerializedLambda.readResolve(SerializedLambda.java:230)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1274)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2258)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1746)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2472)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2394)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2247)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1746)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2472)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2394)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2247)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1746)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2472)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2394)
>   at 
> java.io.Objec

[jira] [Commented] (SPARK-33279) Spark 3.0 failure due to lack of Arrow dependency

2020-10-30 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223761#comment-17223761
 ] 

Dongjoon Hyun commented on SPARK-33279:
---

Thank you, [~fan_li_ya] and [~hyukjin.kwon].

> Spark 3.0 failure due to lack of Arrow dependency
> -
>
> Key: SPARK-33279
> URL: https://issues.apache.org/jira/browse/SPARK-33279
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Liya Fan
>Priority: Major
>
> A recent change in Arrow has split the arrow-memory module into three, so client 
> code must add a dependency on arrow-memory-netty (or arrow-memory-unsafe).
> This has been done in the master branch of Spark, but not in branch-3.0, 
> which is causing the branch-3.0 build to fail 
> (https://github.com/ursa-labs/crossbow/actions?query=branch:actions-681-github-test-conda-python-3.7-spark-branch-3.0)






[jira] [Updated] (SPARK-33295) Upgrade ORC to 1.6.6

2020-10-30 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33295:
--
Summary:  Upgrade ORC to 1.6.6  (was:  Upgrade ORC to 1.6.1)

>  Upgrade ORC to 1.6.6
> -
>
> Key: SPARK-33295
> URL: https://issues.apache.org/jira/browse/SPARK-33295
> Project: Spark
>  Issue Type: New Feature
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Jaanai Zhang
>Priority: Major
> Fix For: 3.1.0
>
>
> support zstd compression algorithm for ORC format
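
Purely as an illustration of the goal, a hypothetical usage once the upgrade lands (the "zstd" codec value is the assumption here; it is not available before the ORC 1.6.x upgrade):

{code:scala}
// Hypothetical: write an ORC file with zstd compression after the upgrade.
spark.range(10).write.option("compression", "zstd").orc("/tmp/orc_zstd")
{code}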






[jira] [Commented] (SPARK-33295) Upgrade ORC to 1.6.1

2020-10-30 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223754#comment-17223754
 ] 

Dongjoon Hyun commented on SPARK-33295:
---

Up to ORC 1.6.5, it is technically infeasible.

>  Upgrade ORC to 1.6.1
> -
>
> Key: SPARK-33295
> URL: https://issues.apache.org/jira/browse/SPARK-33295
> Project: Spark
>  Issue Type: New Feature
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Jaanai Zhang
>Priority: Major
> Fix For: 3.1.0
>
>
> support zstd compression algorithm for ORC format






[jira] [Commented] (SPARK-33295) Upgrade ORC to 1.6.1

2020-10-30 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223753#comment-17223753
 ] 

Dongjoon Hyun commented on SPARK-33295:
---

Hi, [~jaanai]. Yes, I've been working on that and it's ready for the Apache Spark 
`sql/core` module: 
- ORC-669
- ORC-670
- ORC-671
- ORC-676
- ORC-677

However, it turns out that Apache ORC still breaks many things in Apache Hive, 
so it doesn't work with the Apache Spark `sql/hive` and `sql/hive-thriftserver` 
modules.

>  Upgrade ORC to 1.6.1
> -
>
> Key: SPARK-33295
> URL: https://issues.apache.org/jira/browse/SPARK-33295
> Project: Spark
>  Issue Type: New Feature
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Jaanai Zhang
>Priority: Major
> Fix For: 3.1.0
>
>
> support zstd compression algorithm for ORC format






[jira] [Updated] (SPARK-33290) REFRESH TABLE should invalidate cache even though the table itself may not be cached

2020-10-30 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33290:
--
Issue Type: Bug  (was: Improvement)

> REFRESH TABLE should invalidate cache even though the table itself may not be 
> cached
> 
>
> Key: SPARK-33290
> URL: https://issues.apache.org/jira/browse/SPARK-33290
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Chao Sun
>Priority: Major
>
> For the following example:
> {code}
> CREATE TABLE t ...;
> CREATE VIEW t1 AS SELECT * FROM t;
> REFRESH TABLE t
> {code}
> If t is cached, t1 will be invalidated. However, if t is not cached as above, 
> the REFRESH command won't invalidate view t1. This could lead to incorrect 
> results if the view is used later.






[jira] [Updated] (SPARK-33290) REFRESH TABLE should invalidate cache even though the table itself may not be cached

2020-10-30 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33290:
--
Labels: correctness  (was: )

> REFRESH TABLE should invalidate cache even though the table itself may not be 
> cached
> 
>
> Key: SPARK-33290
> URL: https://issues.apache.org/jira/browse/SPARK-33290
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Chao Sun
>Priority: Major
>  Labels: correctness
>
> For the following example:
> {code}
> CREATE TABLE t ...;
> CREATE VIEW t1 AS SELECT * FROM t;
> REFRESH TABLE t
> {code}
> If t is cached, t1 will be invalidated. However, if t is not cached as above, 
> the REFRESH command won't invalidate view t1. This could lead to incorrect 
> results if the view is used later.






[jira] [Commented] (SPARK-32661) Spark executors on K8S do not request extra memory for off-heap allocations

2020-10-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223727#comment-17223727
 ] 

Apache Spark commented on SPARK-32661:
--

User 'tgravescs' has created a pull request for this issue:
https://github.com/apache/spark/pull/30204

> Spark executors on K8S do not request extra memory for off-heap allocations
> ---
>
> Key: SPARK-32661
> URL: https://issues.apache.org/jira/browse/SPARK-32661
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.0.0, 3.0.1, 3.1.0
>Reporter: Luca Canali
>Priority: Minor
>
> Off-heap memory allocations are configured using 
> `spark.memory.offHeap.enabled=true` and `conf 
> spark.memory.offHeap.size=`. Spark on YARN adds the off-heap memory 
> size to the executor container resources. Spark on Kubernetes does not 
> request the allocation of the off-heap memory. Currently, this can be worked 
> around by using spark.executor.memoryOverhead to reserve memory for off-heap 
> allocations. This proposes to make Spark on Kubernetes behave as in the case of 
> YARN, that is, adding spark.memory.offHeap.size to the memory request for 
> executor containers.






[jira] [Assigned] (SPARK-33288) Support k8s cluster manager with stage level scheduling

2020-10-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33288:


Assignee: Apache Spark

> Support k8s cluster manager with stage level scheduling
> ---
>
> Key: SPARK-33288
> URL: https://issues.apache.org/jira/browse/SPARK-33288
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.1.0
>Reporter: Thomas Graves
>Assignee: Apache Spark
>Priority: Major
>
> Kubernetes supports dynamic allocation via the 
> {{spark.dynamicAllocation.shuffleTracking.enabled}} config; we can add support 
> for stage level scheduling when that is turned on.
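
For context, a short sketch of the existing stage-level scheduling API that the Kubernetes backend would need to honor (resource amounts are illustrative only):

{code:scala}
import org.apache.spark.SparkContext
import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}

// Build a per-stage resource profile and attach it to an RDD; with dynamic
// allocation and shuffle tracking enabled, executors matching the profile
// are requested for the stages that compute this RDD.
def withProfile(sc: SparkContext): Unit = {
  val execReqs = new ExecutorResourceRequests().cores(4).memory("6g")
  val taskReqs = new TaskResourceRequests().cpus(2)
  val profile = new ResourceProfileBuilder().require(execReqs).require(taskReqs).build()

  val rdd = sc.parallelize(1 to 100, 10).withResources(profile)
  rdd.map(_ * 2).collect()
}
{code}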






[jira] [Commented] (SPARK-33288) Support k8s cluster manager with stage level scheduling

2020-10-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223723#comment-17223723
 ] 

Apache Spark commented on SPARK-33288:
--

User 'tgravescs' has created a pull request for this issue:
https://github.com/apache/spark/pull/30204

> Support k8s cluster manager with stage level scheduling
> ---
>
> Key: SPARK-33288
> URL: https://issues.apache.org/jira/browse/SPARK-33288
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.1.0
>Reporter: Thomas Graves
>Priority: Major
>
> Kubernetes supports dynamic allocation via the 
> {{spark.dynamicAllocation.shuffleTracking.enabled}} config; we can add support 
> for stage level scheduling when that is turned on.






[jira] [Assigned] (SPARK-33288) Support k8s cluster manager with stage level scheduling

2020-10-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33288:


Assignee: (was: Apache Spark)

> Support k8s cluster manager with stage level scheduling
> ---
>
> Key: SPARK-33288
> URL: https://issues.apache.org/jira/browse/SPARK-33288
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.1.0
>Reporter: Thomas Graves
>Priority: Major
>
> Kubernetes supports dynamic allocation via the 
> {{spark.dynamicAllocation.shuffleTracking.enabled}} config; we can add support 
> for stage level scheduling when that is turned on.






[jira] [Created] (SPARK-33304) Add from_avro and to_avro functions to SparkR

2020-10-30 Thread Maciej Szymkiewicz (Jira)
Maciej Szymkiewicz created SPARK-33304:
--

 Summary: Add from_avro and to_avro functions to SparkR
 Key: SPARK-33304
 URL: https://issues.apache.org/jira/browse/SPARK-33304
 Project: Spark
  Issue Type: Improvement
  Components: R, SQL
Affects Versions: 3.1.0
Reporter: Maciej Szymkiewicz


{{from_avro}} and {{to_avro}} have been added to the Scala / Java / Python APIs in 3.0, 
but are still missing from the R API.
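
For reference, a sketch of the Scala API that the proposed SparkR functions would mirror (the schema and column names are made up for the example; it assumes the spark-avro module is on the classpath):

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.avro.functions.{from_avro, to_avro}
import org.apache.spark.sql.functions.struct

object AvroRoundTrip {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("avro-roundtrip").getOrCreate()
    import spark.implicits._

    // Illustrative Avro schema for a single-field record.
    val schema = """{"type":"record","name":"event","fields":[{"name":"id","type":"long"}]}"""

    val df = Seq(1L, 2L, 3L).toDF("id").select(struct($"id").as("event"))
    val encoded = df.select(to_avro($"event").as("payload"))                 // struct -> Avro binary
    val decoded = encoded.select(from_avro($"payload", schema).as("event"))  // Avro binary -> struct

    decoded.show(false)
    spark.stop()
  }
}
{code}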






[jira] [Commented] (SPARK-33303) Deduplicate deterministic UDF calls

2020-10-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223674#comment-17223674
 ] 

Apache Spark commented on SPARK-33303:
--

User 'peter-toth' has created a pull request for this issue:
https://github.com/apache/spark/pull/30203

> Deduplicate deterministic UDF calls
> ---
>
> Key: SPARK-33303
> URL: https://issues.apache.org/jira/browse/SPARK-33303
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Peter Toth
>Priority: Major
>
> We ran into an issue where a customer created a column with an expensive UDF 
> call and built very complex logic on top of that column as new 
> derived columns. Due to the `CollapseProject` and `ExtractPythonUDFs` rules, the 
> UDF is called ~1000 times for each row, which degraded the performance of the 
> query significantly.
> The `ExtractPythonUDFs` rule could deduplicate deterministic UDFs so as to 
> avoid performance degradation.
>  






[jira] [Assigned] (SPARK-33303) Deduplicate deterministic UDF calls

2020-10-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33303:


Assignee: Apache Spark

> Deduplicate deterministic UDF calls
> ---
>
> Key: SPARK-33303
> URL: https://issues.apache.org/jira/browse/SPARK-33303
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Peter Toth
>Assignee: Apache Spark
>Priority: Major
>
> We ran into an issue where a customer created a column with an expensive UDF 
> call and built very complex logic on top of that column as new 
> derived columns. Due to the `CollapseProject` and `ExtractPythonUDFs` rules, the 
> UDF is called ~1000 times for each row, which degraded the performance of the 
> query significantly.
> The `ExtractPythonUDFs` rule could deduplicate deterministic UDFs so as to 
> avoid performance degradation.
>  






[jira] [Commented] (SPARK-33303) Deduplicate deterministic UDF calls

2020-10-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223675#comment-17223675
 ] 

Apache Spark commented on SPARK-33303:
--

User 'peter-toth' has created a pull request for this issue:
https://github.com/apache/spark/pull/30203

> Deduplicate deterministic UDF calls
> ---
>
> Key: SPARK-33303
> URL: https://issues.apache.org/jira/browse/SPARK-33303
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Peter Toth
>Priority: Major
>
> We ran into an issue where a customer created a column with an expensive UDF 
> call and built very complex logic on top of that column as new 
> derived columns. Due to the `CollapseProject` and `ExtractPythonUDFs` rules, the 
> UDF is called ~1000 times for each row, which degraded the performance of the 
> query significantly.
> The `ExtractPythonUDFs` rule could deduplicate deterministic UDFs so as to 
> avoid performance degradation.
>  






[jira] [Assigned] (SPARK-33303) Deduplicate deterministic UDF calls

2020-10-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33303:


Assignee: Apache Spark

> Deduplicate deterministic UDF calls
> ---
>
> Key: SPARK-33303
> URL: https://issues.apache.org/jira/browse/SPARK-33303
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Peter Toth
>Assignee: Apache Spark
>Priority: Major
>
> We ran into an issue where a customer created a column with an expensive UDF 
> call and built very complex logic on top of that column as new 
> derived columns. Due to the `CollapseProject` and `ExtractPythonUDFs` rules, the 
> UDF is called ~1000 times for each row, which degraded the performance of the 
> query significantly.
> The `ExtractPythonUDFs` rule could deduplicate deterministic UDFs so as to 
> avoid performance degradation.
>  






[jira] [Assigned] (SPARK-33303) Deduplicate deterministic UDF calls

2020-10-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33303:


Assignee: (was: Apache Spark)

> Deduplicate deterministic UDF calls
> ---
>
> Key: SPARK-33303
> URL: https://issues.apache.org/jira/browse/SPARK-33303
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Peter Toth
>Priority: Major
>
> We ran into an issue where a customer created a column with an expensive UDF 
> call and built very complex logic on top of that column as new 
> derived columns. Due to the `CollapseProject` and `ExtractPythonUDFs` rules, the 
> UDF is called ~1000 times for each row, which degraded the performance of the 
> query significantly.
> The `ExtractPythonUDFs` rule could deduplicate deterministic UDFs so as to 
> avoid performance degradation.
>  






[jira] [Created] (SPARK-33303) Deduplicate deterministic UDF calls

2020-10-30 Thread Peter Toth (Jira)
Peter Toth created SPARK-33303:
--

 Summary: Deduplicate deterministic UDF calls
 Key: SPARK-33303
 URL: https://issues.apache.org/jira/browse/SPARK-33303
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Peter Toth


We ran into an issue where a customer created a column with an expensive UDF 
call and built very complex logic on top of that column as new 
derived columns. Due to the `CollapseProject` and `ExtractPythonUDFs` rules, the UDF 
is called ~1000 times for each row, which degraded the performance of the query 
significantly.

The `ExtractPythonUDFs` rule could deduplicate deterministic UDFs so as to 
avoid performance degradation.
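
For illustration, a hypothetical Scala sketch of the reported pattern (the ticket concerns Python UDFs, but the column-reuse shape is the same; all names are made up):

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object ExpensiveUdfReuse {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("udf-dedup").getOrCreate()
    import spark.implicits._

    // Deterministic but expensive UDF.
    val expensive = udf { (s: String) => Thread.sleep(10); s.length }

    val base = Seq("a", "bb", "ccc").toDF("value")
      .withColumn("score", expensive($"value"))

    // Each derived column references "score"; after projection collapsing the
    // expensive UDF expression can end up duplicated into every derived column.
    val derived = base
      .withColumn("twice", $"score" * 2)
      .withColumn("thrice", $"score" * 3)
      .withColumn("combined", $"score" * 2 + $"score" * 3)

    derived.explain(true) // inspect how many times the UDF appears in the plan
    spark.stop()
  }
}
{code}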

 






[jira] [Updated] (SPARK-32591) Add better api docs for stage level scheduling Resources

2020-10-30 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-32591:
--
Description: 
A question came up when we added the ability to set off-heap memory in 
ResourceProfile executor resources.

[https://github.com/apache/spark/pull/28972/]

Based on that discussion we should add better API docs to explain what each one 
does, perhaps pointing to the corresponding configuration. Also specify that the 
default config is used if not specified in the profiles, for those that fall back.

  was:
A question came up when we added offheap memory to be able to set in a 
ResourceProfile executor resources.

[https://github.com/apache/spark/pull/28972/]

Based on that discussion we should add better api docs to explain what each one 
does. Perhaps point to the corresponding configuration .


> Add better api docs for stage level scheduling Resources
> 
>
> Key: SPARK-32591
> URL: https://issues.apache.org/jira/browse/SPARK-32591
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.1.0
>Reporter: Thomas Graves
>Priority: Major
>
> A question came up when we added the ability to set off-heap memory in 
> ResourceProfile executor resources.
> [https://github.com/apache/spark/pull/28972/]
> Based on that discussion we should add better API docs to explain what each 
> one does, perhaps pointing to the corresponding configuration. Also specify 
> that the default config is used if not specified in the profiles, for those 
> that fall back.






[jira] [Updated] (SPARK-33302) Failed to push filters through Expand

2020-10-30 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-33302:

Issue Type: Improvement  (was: Bug)

> Failed to push filters through Expand
> -
>
> Key: SPARK-33302
> URL: https://issues.apache.org/jira/browse/SPARK-33302
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.4, 3.0.1, 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce this issue:
> {code:sql}
> create table SPARK_33302_1(pid int, uid int, sid int, dt date, suid int) 
> using parquet;
> create table SPARK_33302_2(pid int, vs int, uid int, csid int) using parquet;
> SELECT
>years,
>appversion,   
>SUM(uusers) AS users  
> FROM   (SELECT
>Date_trunc('year', dt)  AS years,
>CASE  
>  WHEN h.pid = 3 THEN 'iOS'   
>  WHEN h.pid = 4 THEN 'Android'   
>  ELSE 'Other'
>END AS viewport,  
>h.vsAS appversion,
>Count(DISTINCT u.uid)   AS uusers
>,Count(DISTINCT u.suid) AS srcusers
> FROM   SPARK_33302_1 u   
>join SPARK_33302_2 h  
>  ON h.uid = u.uid
> GROUP  BY 1, 
>   2, 
>   3) AS a
> WHERE  viewport = 'iOS'  
> GROUP  BY 1, 
>   2
> {code}
> {noformat}
> == Physical Plan ==
> *(5) HashAggregate(keys=[years#30, appversion#32], 
> functions=[sum(uusers#33L)])
> +- Exchange hashpartitioning(years#30, appversion#32, 200), true, [id=#251]
>+- *(4) HashAggregate(keys=[years#30, appversion#32], 
> functions=[partial_sum(uusers#33L)])
>   +- *(4) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS 
> TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 
> 'Android' ELSE 'Other' END#46, vs#12], functions=[count(if ((gid#44 = 1)) 
> u.`uid`#47 else null)])
>  +- Exchange hashpartitioning(date_trunc('year', CAST(u.`dt` AS 
> TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 
> 'Android' ELSE 'Other' END#46, vs#12, 200), true, [id=#246]
> +- *(3) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS 
> TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 
> 'Android' ELSE 'Other' END#46, vs#12], functions=[partial_count(if ((gid#44 = 
> 1)) u.`uid`#47 else null)])
>+- *(3) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS 
> TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 
> 'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44], 
> functions=[])
>   +- Exchange hashpartitioning(date_trunc('year', CAST(u.`dt` 
> AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 
> 'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44, 200), 
> true, [id=#241]
>  +- *(2) HashAggregate(keys=[date_trunc('year', 
> CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN 
> (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47, 
> u.`suid`#48, gid#44], functions=[])
> +- *(2) Filter (CASE WHEN (h.`pid` = 3) THEN 'iOS' 
> WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46 = iOS)
>+- *(2) Expand [ArrayBuffer(date_trunc(year, 
> cast(dt#9 as timestamp), Some(Etc/GMT+7)), CASE WHEN (pid#11 = 3) THEN iOS 
> WHEN (pid#11 = 4) THEN Android ELSE Other END, vs#12, uid#7, null, 1), 
> ArrayBuffer(date_trunc(year, cast(dt#9 as timestamp), Some(Etc/GMT+7)), CASE 
> WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END, 
> vs#12, null, suid#10, 2)], [date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, 
> CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 
> 'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44]
>   +- *(2) Project [uid#7, dt#9, suid#10, pid#11, 
> vs#12]
>  +- *(2) BroadcastHashJoin [uid#7], [uid#13], 
> Inner, BuildRight
> :- *(2) Project [uid#7, dt#9, suid#10]
> :  +- *(2) Filter isnotnull(uid#7)
>   

[jira] [Updated] (SPARK-33302) Failed to push filters through Expand

2020-10-30 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-33302:

Description: 
How to reproduce this issue:

{code:sql}
create table SPARK_33302_1(pid int, uid int, sid int, dt date, suid int) using 
parquet;
create table SPARK_33302_2(pid int, vs int, uid int, csid int) using parquet;

SELECT
   years,
   appversion,   
   SUM(uusers) AS users  
FROM   (SELECT
   Date_trunc('year', dt)  AS years,
   CASE  
 WHEN h.pid = 3 THEN 'iOS'   
 WHEN h.pid = 4 THEN 'Android'   
 ELSE 'Other'
   END AS viewport,  
   h.vs AS appversion,
   Count(DISTINCT u.uid)   AS uusers
   ,Count(DISTINCT u.suid) AS srcusers
FROM   SPARK_33302_1 u   
   join SPARK_33302_2 h  
 ON h.uid = u.uid
GROUP  BY 1, 
  2, 
  3) AS a
WHERE  viewport = 'iOS'  
GROUP  BY 1, 
  2
{code}


{noformat}
== Physical Plan ==
*(5) HashAggregate(keys=[years#30, appversion#32], functions=[sum(uusers#33L)])
+- Exchange hashpartitioning(years#30, appversion#32, 200), true, [id=#251]
   +- *(4) HashAggregate(keys=[years#30, appversion#32], 
functions=[partial_sum(uusers#33L)])
  +- *(4) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS 
TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 
'Android' ELSE 'Other' END#46, vs#12], functions=[count(if ((gid#44 = 1)) 
u.`uid`#47 else null)])
 +- Exchange hashpartitioning(date_trunc('year', CAST(u.`dt` AS 
TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 
'Android' ELSE 'Other' END#46, vs#12, 200), true, [id=#246]
+- *(3) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS 
TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 
'Android' ELSE 'Other' END#46, vs#12], functions=[partial_count(if ((gid#44 = 
1)) u.`uid`#47 else null)])
   +- *(3) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS 
TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 
'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44], 
functions=[])
  +- Exchange hashpartitioning(date_trunc('year', CAST(u.`dt` 
AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 
'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44, 200), 
true, [id=#241]
 +- *(2) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` 
AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 
'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44], 
functions=[])
+- *(2) Filter (CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN 
(h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46 = iOS)
   +- *(2) Expand [ArrayBuffer(date_trunc(year, 
cast(dt#9 as timestamp), Some(Etc/GMT+7)), CASE WHEN (pid#11 = 3) THEN iOS WHEN 
(pid#11 = 4) THEN Android ELSE Other END, vs#12, uid#7, null, 1), 
ArrayBuffer(date_trunc(year, cast(dt#9 as timestamp), Some(Etc/GMT+7)), CASE 
WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END, 
vs#12, null, suid#10, 2)], [date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, 
CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 
'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44]
  +- *(2) Project [uid#7, dt#9, suid#10, pid#11, 
vs#12]
 +- *(2) BroadcastHashJoin [uid#7], [uid#13], 
Inner, BuildRight
:- *(2) Project [uid#7, dt#9, suid#10]
:  +- *(2) Filter isnotnull(uid#7)
: +- *(2) ColumnarToRow
:+- FileScan parquet 
default.spark_33301_1[uid#7,dt#9,suid#10] Batched: true, DataFilters: 
[isnotnull(uid#7)], Format: Parquet, Location: 
InMemoryFileIndex[file:/root/spark-3.0.0-bin-hadoop3.2/spark-warehouse/spark_33301_1],
 PartitionFilters: [], PushedFilters: [IsNotNull(uid)], ReadSchema: 
struct
+- BroadcastExchange 
HashedRelationBroadcastMode(List(cast(input[2, int, true] as bigint))), 
[id=#233]
   +- *
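
A minimal Scala sketch, assuming the two tables from the reproduction script above already exist, for checking where the filter lands relative to the Expand node in the optimized logical plan (the object name and plan-walking code are illustrative, not from the ticket):

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.{Expand, Filter}

object FilterThroughExpandCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    // Assumes SPARK_33302_1 and SPARK_33302_2 were created with the DDL above.
    val optimized = spark.sql(
      """SELECT years, appversion, SUM(uusers) AS users
        |FROM   (SELECT date_trunc('year', dt)  AS years,
        |               CASE WHEN h.pid = 3 THEN 'iOS'
        |                    WHEN h.pid = 4 THEN 'Android'
        |                    ELSE 'Other' END   AS viewport,
        |               h.vs                    AS appversion,
        |               count(DISTINCT u.uid)   AS uusers,
        |               count(DISTINCT u.suid)  AS srcusers
        |        FROM SPARK_33302_1 u JOIN SPARK_33302_2 h ON h.uid = u.uid
        |        GROUP BY 1, 2, 3) AS a
        |WHERE  viewport = 'iOS'
        |GROUP  BY 1, 2""".stripMargin).queryExecution.optimizedPlan

    // Walk the optimized plan: currently the Filter sits above the Expand;
    // a fix would push a rewritten form of the predicate below it.
    optimized.foreach {
      case f: Filter => println(s"Filter: ${f.condition.sql}")
      case e: Expand => println(s"Expand with ${e.projections.length} projections")
      case _         => ()
    }

    spark.stop()
  }
}
{code}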

[jira] [Updated] (SPARK-33302) Failed to push down filters through Expand

2020-10-30 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-33302:

Summary: Failed to push down filters through Expand  (was: Failed to push 
filters through Expand)

> Failed to push down filters through Expand
> --
>
> Key: SPARK-33302
> URL: https://issues.apache.org/jira/browse/SPARK-33302
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.4, 3.0.1, 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce this issue:
> {code:sql}
> create table SPARK_33302_1(pid int, uid int, sid int, dt date, suid int) 
> using parquet;
> create table SPARK_33302_2(pid int, vs int, uid int, csid int) using parquet;
> SELECT
>years,
>appversion,   
>SUM(uusers) AS users  
> FROM   (SELECT
>Date_trunc('year', dt)  AS years,
>CASE  
>  WHEN h.pid = 3 THEN 'iOS'   
>  WHEN h.pid = 4 THEN 'Android'   
>  ELSE 'Other'
>END AS viewport,  
>h.vs AS appversion,
>Count(DISTINCT u.uid)   AS uusers
>,Count(DISTINCT u.suid) AS srcusers
> FROM   SPARK_33302_1 u   
>join SPARK_33302_2 h  
>  ON h.uid = u.uid
> GROUP  BY 1, 
>   2, 
>   3) AS a
> WHERE  viewport = 'iOS'  
> GROUP  BY 1, 
>   2
> {code}
> {noformat}
> == Physical Plan ==
> *(5) HashAggregate(keys=[years#30, appversion#32], 
> functions=[sum(uusers#33L)])
> +- Exchange hashpartitioning(years#30, appversion#32, 200), true, [id=#251]
>+- *(4) HashAggregate(keys=[years#30, appversion#32], 
> functions=[partial_sum(uusers#33L)])
>   +- *(4) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS 
> TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 
> 'Android' ELSE 'Other' END#46, vs#12], functions=[count(if ((gid#44 = 1)) 
> u.`uid`#47 else null)])
>  +- Exchange hashpartitioning(date_trunc('year', CAST(u.`dt` AS 
> TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 
> 'Android' ELSE 'Other' END#46, vs#12, 200), true, [id=#246]
> +- *(3) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS 
> TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 
> 'Android' ELSE 'Other' END#46, vs#12], functions=[partial_count(if ((gid#44 = 
> 1)) u.`uid`#47 else null)])
>+- *(3) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS 
> TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 
> 'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44], 
> functions=[])
>   +- Exchange hashpartitioning(date_trunc('year', CAST(u.`dt` 
> AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 
> 'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44, 200), 
> true, [id=#241]
>  +- *(2) HashAggregate(keys=[date_trunc('year', 
> CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN 
> (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47, 
> u.`suid`#48, gid#44], functions=[])
> +- *(2) Filter (CASE WHEN (h.`pid` = 3) THEN 'iOS' 
> WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46 = iOS)
>+- *(2) Expand [ArrayBuffer(date_trunc(year, 
> cast(dt#9 as timestamp), Some(Etc/GMT+7)), CASE WHEN (pid#11 = 3) THEN iOS 
> WHEN (pid#11 = 4) THEN Android ELSE Other END, vs#12, uid#7, null, 1), 
> ArrayBuffer(date_trunc(year, cast(dt#9 as timestamp), Some(Etc/GMT+7)), CASE 
> WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END, 
> vs#12, null, suid#10, 2)], [date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, 
> CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 
> 'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44]
>   +- *(2) Project [uid#7, dt#9, suid#10, pid#11, 
> vs#12]
>  +- *(2) BroadcastHashJoin [uid#7], [uid#13], 
> Inner, BuildRight
> :- *(2) Project [uid#7, dt#9, suid#10]
>   

[jira] [Created] (SPARK-33302) Failed to push filters through Expand

2020-10-30 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-33302:
---

 Summary: Failed to push filters through Expand
 Key: SPARK-33302
 URL: https://issues.apache.org/jira/browse/SPARK-33302
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.1, 2.3.4, 3.1.0
Reporter: Yuming Wang


How to reproduce this issue:

{code:sql}
create table SPARK_33301_1(pid int, uid int, sid int, dt date, suid int) using 
parquet;
create table SPARK_33301_2(pid int, vs int, uid int, csid int) using parquet;

SELECT
   years,
   appversion,   
   SUM(uusers) AS users  
FROM   (SELECT
   Date_trunc('year', dt)  AS years,
   CASE  
 WHEN h.pid = 3 THEN 'iOS'   
 WHEN h.pid = 4 THEN 'Android'   
 ELSE 'Other'
   END AS viewport,  
   h.vs AS appversion,
   Count(DISTINCT u.uid)   AS uusers
   ,Count(DISTINCT u.suid) AS srcusers
FROM   SPARK_33301_1 u   
   join SPARK_33301_2 h  
 ON h.uid = u.uid
GROUP  BY 1, 
  2, 
  3) AS a
WHERE  viewport = 'iOS'  
GROUP  BY 1, 
  2
{code}


{noformat}
== Physical Plan ==
*(5) HashAggregate(keys=[years#30, appversion#32], functions=[sum(uusers#33L)])
+- Exchange hashpartitioning(years#30, appversion#32, 200), true, [id=#251]
   +- *(4) HashAggregate(keys=[years#30, appversion#32], 
functions=[partial_sum(uusers#33L)])
  +- *(4) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS 
TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 
'Android' ELSE 'Other' END#46, vs#12], functions=[count(if ((gid#44 = 1)) 
u.`uid`#47 else null)])
 +- Exchange hashpartitioning(date_trunc('year', CAST(u.`dt` AS 
TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 
'Android' ELSE 'Other' END#46, vs#12, 200), true, [id=#246]
+- *(3) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS 
TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 
'Android' ELSE 'Other' END#46, vs#12], functions=[partial_count(if ((gid#44 = 
1)) u.`uid`#47 else null)])
   +- *(3) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS 
TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 
'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44], 
functions=[])
  +- Exchange hashpartitioning(date_trunc('year', CAST(u.`dt` 
AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 
'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44, 200), 
true, [id=#241]
 +- *(2) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` 
AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 
'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44], 
functions=[])
+- *(2) Filter (CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN 
(h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46 = iOS)
   +- *(2) Expand [ArrayBuffer(date_trunc(year, 
cast(dt#9 as timestamp), Some(Etc/GMT+7)), CASE WHEN (pid#11 = 3) THEN iOS WHEN 
(pid#11 = 4) THEN Android ELSE Other END, vs#12, uid#7, null, 1), 
ArrayBuffer(date_trunc(year, cast(dt#9 as timestamp), Some(Etc/GMT+7)), CASE 
WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END, 
vs#12, null, suid#10, 2)], [date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, 
CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 
'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44]
  +- *(2) Project [uid#7, dt#9, suid#10, pid#11, 
vs#12]
 +- *(2) BroadcastHashJoin [uid#7], [uid#13], 
Inner, BuildRight
:- *(2) Project [uid#7, dt#9, suid#10]
:  +- *(2) Filter isnotnull(uid#7)
: +- *(2) ColumnarToRow
:+- FileScan parquet 
default.spark_33301_1[uid#7,dt#9,suid#10] Batched: true, DataFilters: 
[isnotnull(uid#7)], Format: Parquet, Location: 
InMemoryFileIndex[file:/root/spark-3.0.0-bin-hadoop3.2/spark-warehouse/spark_33301_1],
 PartitionFilters: [], PushedFilters: [IsNotNull(uid)], ReadSchema: 
struct

[jira] [Updated] (SPARK-33301) Code grows beyond 64 KB when compiling large CaseWhen

2020-10-30 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-33301:

Summary: Code grows beyond 64 KB when compiling large CaseWhen  (was: Code 
grows beyond 64 KB when compiling large CASE_WHEN)

> Code grows beyond 64 KB when compiling large CaseWhen
> -
>
> Key: SPARK-33301
> URL: https://issues.apache.org/jira/browse/SPARK-33301
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce this issue:
> {code:sql}
> CREATE TABLE TRANS_BASE (BRAND STRING, SUB_BRAND STRING, MODEL STRING, 
> AUCT_TITL STRING, PRODUCT_LINE STRING, MODELS STRING) USING PARQUET;
> SELECT T.*,
> CASE
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%SPARK%', 
> '%4D%') THEN 'SPARK 4D'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%SPARK%', 
> '%BASKETBALL%') THEN 'SPARK Basketball'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%SPARK%', 
> '%BASKETBALL%') THEN 'SPARK Basketball'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%EQT%') THEN 
> 'EQT'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%NMD%') THEN 
> 'NMD'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%NMD%') THEN 
> 'NMD'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%NMD%') THEN 
> 'NMD'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%NMD%') THEN 
> 'NMD'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%NMD%') THEN 
> 'NMD'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%NMD%') THEN 
> 'NMD'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%ORIGINALS%') 
> THEN 'Originals'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', 
> '%SPARK%') THEN 'Other SPARK'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', 
> '%SPARK%') THEN 'Other SPARK'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', 
> '%SPARK%') THEN 'Other SPARK'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', 
> '%SPARK%') THEN 'Other SPARK'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', 
> '%SPARK%') THEN 'Other SPARK'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', 
> '%SPARK%') THEN 'Other SPARK'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', 
> '%SPARK%') THEN 'Other SPARK'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', 
> '%SPARK%') THEN 'Other SPARK'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', 
> '%SPARK%') THEN 'Other SPARK'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', 
> '%SPARK%') THEN 'Other SPARK'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', 
> '%SPARK%') THEN 'Other SPARK'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', 
> '%SPARK%') THEN 'Other SPARK'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', 
> '%SPARK%') THEN 'Other SPARK'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', 
> '%SPARK%') THEN 'Other SPARK'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%PHARELL%', 
> '%HUMAN%', '%RACE%') THEN 'Pharell Human Race'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%PHARELL%', 
> '%HUMAN%', '%RACE%') THEN 'Pharell Human Race'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%PHARELL%', 
> '%HUMAN%', '%RACE%') THEN 'Pharell Human Race'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%PROPHERE%') 
> THEN 'Prophere'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%ULTRA%', 
> '%BOOST%') THEN 'Ultra Boost'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%ULTRA%', 
> '%BOOST%') THEN 'Ultra Boost'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%ULTRA%', 
> '%BOOST%') THEN 'Ultra Boost'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%ULTRA%', 
> '%BOOST%') THEN 'Ultra Boost'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%ULTRA%', 
> '%BOOST%') THEN 'Ultra Boost'
> WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%YUNG%', 
> '%1%') THEN 'Yung 1'
> WHEN Upper(BRAND) = 'HIVE' AND Upper(MODEL) LIKE ALL ('%AIR%', 
> '%FORCE%', '%1%') THEN 'Air Force 1'
> WHEN Upper(BRAND) = 'HIVE' AND Upper(MODEL) LIKE ALL ('%AIR%', 
> '%FORCE%', '%1%') THEN 'Air Force 1'
> WHEN Upper(BRAND) = 'HIVE' AND Upper(MODEL) LIKE ALL ('%AIR%', 
> '%FORCE%', '%1%') THEN 'Air Force 1'
> WHE

[jira] [Commented] (SPARK-33301) Code grows beyond 64 KB when compiling large CASE_WHEN

2020-10-30 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223603#comment-17223603
 ] 

Yuming Wang commented on SPARK-33301:
-


{noformat}
20/10/30 05:01:40 ERROR CodeGenerator: failed to compile: 
org.codehaus.commons.compiler.InternalCompilerException: Compiling 
"GeneratedClass" in File 'generated.java', Line 1, Column 1: Compiling 
"project_doConsume_0(UTF8String project_expr_0_0, boolean 
project_exprIsNull_0_0, UTF8String project_expr_1_0, boolean 
project_exprIsNull_1_0, UTF8String project_expr_2_0, boolean 
project_exprIsNull_2_0, UTF8String project_expr_3_0, boolean 
project_exprIsNull_3_0, UTF8String project_expr_4_0, boolean 
project_exprIsNull_4_0, UTF8String project_expr_5_0, boolean 
project_exprIsNull_5_0)": Code grows beyond 64 KB
org.codehaus.commons.compiler.InternalCompilerException: Compiling 
"GeneratedClass" in File 'generated.java', Line 1, Column 1: Compiling 
"project_doConsume_0(UTF8String project_expr_0_0, boolean 
project_exprIsNull_0_0, UTF8String project_expr_1_0, boolean 
project_exprIsNull_1_0, UTF8String project_expr_2_0, boolean 
project_exprIsNull_2_0, UTF8String project_expr_3_0, boolean 
project_exprIsNull_3_0, UTF8String project_expr_4_0, boolean 
project_exprIsNull_4_0, UTF8String project_expr_5_0, boolean 
project_exprIsNull_5_0)": Code grows beyond 64 KB
at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:358)
at org.codehaus.janino.UnitCompiler.access$000(UnitCompiler.java:231)
at 
org.codehaus.janino.UnitCompiler$1.visitCompilationUnit(UnitCompiler.java:322)
at 
org.codehaus.janino.UnitCompiler$1.visitCompilationUnit(UnitCompiler.java:319)
at org.codehaus.janino.Java$CompilationUnit.accept(Java.java:367)
at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:319)
at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:237)
at 
org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:278)
at 
org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:272)
at 
org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:252)
at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:82)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1389)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1487)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1484)
at 
org.sparkproject.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
at 
org.sparkproject.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
at 
org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
at 
org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4000)
at 
org.sparkproject.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
at 
org.sparkproject.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1337)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec.liftedTree1$1(WholeStageCodegenExec.scala:719)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:718)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
at 
org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:316)
at 
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:382)
at 
org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:412)
at 
org.apache.spark.sql.execution.HiveResult$.hiveResultString(HiveResult.scala:76)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.$anonfun$run$1(SparkSQLDriver.scala:67)
at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
at org.apache.spark.sql.SparkS
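
Until the codegen limit itself is addressed, one hedged mitigation (not mentioned in the ticket) is to turn off whole-stage code generation for the offending statement, so the large CaseWhen is evaluated without emitting a single oversized Java method. A minimal sketch, with a placeholder standing in for the query above:

{code:scala}
import org.apache.spark.sql.SparkSession

object LargeCaseWhenWorkaround {
  def main(args: Array[String]): Unit = {
    // Disabling whole-stage codegen is a workaround, not a fix: it trades the
    // 64 KB method-size failure for slower, non-fused execution of this query.
    val spark = SparkSession.builder()
      .master("local[*]")
      .config("spark.sql.codegen.wholeStage", "false")
      .getOrCreate()

    // Placeholder: substitute the CREATE TABLE and the large CASE WHEN query
    // from the reproduction steps above.
    val largeCaseWhenQuery = "SELECT 1"
    spark.sql(largeCaseWhenQuery).show()

    spark.stop()
  }
}
{code}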

[jira] [Created] (SPARK-33301) Code grows beyond 64 KB when compiling large CASE_WHEN

2020-10-30 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-33301:
---

 Summary: Code grows beyond 64 KB when compiling large CASE_WHEN
 Key: SPARK-33301
 URL: https://issues.apache.org/jira/browse/SPARK-33301
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.0
Reporter: Yuming Wang


How to reproduce this issue:
{code:sql}
CREATE TABLE TRANS_BASE (BRAND STRING, SUB_BRAND STRING, MODEL STRING, 
AUCT_TITL STRING, PRODUCT_LINE STRING, MODELS STRING) USING PARQUET;

SELECT T.*,
CASE
WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%SPARK%', 
'%4D%') THEN 'SPARK 4D'
WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%SPARK%', 
'%BASKETBALL%') THEN 'SPARK Basketball'
WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%SPARK%', 
'%BASKETBALL%') THEN 'SPARK Basketball'
WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%EQT%') THEN 
'EQT'
WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%NMD%') THEN 
'NMD'
WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%NMD%') THEN 
'NMD'
WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%NMD%') THEN 
'NMD'
WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%NMD%') THEN 
'NMD'
WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%NMD%') THEN 
'NMD'
WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%NMD%') THEN 
'NMD'
WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%ORIGINALS%') 
THEN 'Originals'
WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', 
'%SPARK%') THEN 'Other SPARK'
WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', 
'%SPARK%') THEN 'Other SPARK'
WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', 
'%SPARK%') THEN 'Other SPARK'
WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', 
'%SPARK%') THEN 'Other SPARK'
WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', 
'%SPARK%') THEN 'Other SPARK'
WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', 
'%SPARK%') THEN 'Other SPARK'
WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', 
'%SPARK%') THEN 'Other SPARK'
WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', 
'%SPARK%') THEN 'Other SPARK'
WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', 
'%SPARK%') THEN 'Other SPARK'
WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', 
'%SPARK%') THEN 'Other SPARK'
WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', 
'%SPARK%') THEN 'Other SPARK'
WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', 
'%SPARK%') THEN 'Other SPARK'
WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', 
'%SPARK%') THEN 'Other SPARK'
WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%OTHER%', 
'%SPARK%') THEN 'Other SPARK'
WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%PHARELL%', 
'%HUMAN%', '%RACE%') THEN 'Pharell Human Race'
WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%PHARELL%', 
'%HUMAN%', '%RACE%') THEN 'Pharell Human Race'
WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%PHARELL%', 
'%HUMAN%', '%RACE%') THEN 'Pharell Human Race'
WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%PROPHERE%') 
THEN 'Prophere'
WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%ULTRA%', 
'%BOOST%') THEN 'Ultra Boost'
WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%ULTRA%', 
'%BOOST%') THEN 'Ultra Boost'
WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%ULTRA%', 
'%BOOST%') THEN 'Ultra Boost'
WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%ULTRA%', 
'%BOOST%') THEN 'Ultra Boost'
WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%ULTRA%', 
'%BOOST%') THEN 'Ultra Boost'
WHEN Upper(BRAND) = 'SPARK' AND Upper(MODEL) LIKE ALL ('%YUNG%', '%1%') 
THEN 'Yung 1'
WHEN Upper(BRAND) = 'HIVE' AND Upper(MODEL) LIKE ALL ('%AIR%', 
'%FORCE%', '%1%') THEN 'Air Force 1'
WHEN Upper(BRAND) = 'HIVE' AND Upper(MODEL) LIKE ALL ('%AIR%', 
'%FORCE%', '%1%') THEN 'Air Force 1'
WHEN Upper(BRAND) = 'HIVE' AND Upper(MODEL) LIKE ALL ('%AIR%', 
'%FORCE%', '%1%') THEN 'Air Force 1'
WHEN Upper(BRAND) = 'HIVE' AND Upper(MODEL) LIKE ALL ('%AIR%', 
'%FORCE%', '%1%') THEN 'Air Force 1'
WHEN Upper(BRAND) = 'HIVE' AND Upper(MODEL) LIKE ALL ('%AIR%', 
'%FORCE%', '%1%') THEN 'Air Force 1'
WHEN Upper(BRAND) = 'HIVE' AND Upper(MODEL) LIKE ALL ('%AIR%', '%MAX%') 
THEN 'Air Max'
WHEN Upper(BRAND) = 'HIVE' AND Upper(MODEL) LIKE ALL ('%AIR%', '%MAX%') 
THEN 'Air Max'
WHEN Upper(BRAND) = 'HIVE' AND Upper(MODEL) LIKE ALL ('%AIR%', '%MAX%') 
THEN 'Air Max'
 

[jira] [Updated] (SPARK-33300) Rule SimplifyCasts will not work for nested columns

2020-10-30 Thread chendihao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chendihao updated SPARK-33300:
--
Description: 
We use SparkSQL and Catalyst to optimize our Spark jobs. We have read the source 
code and tested the SimplifyCasts rule, which works for simple SQL without 
nested casts.

The SQL "select cast(string_date as string) from t1" will be optimized.

{code:java}
== Analyzed Logical Plan ==
string_date: string
Project [cast(string_date#12 as string) AS string_date#24]
+- SubqueryAlias t1
 +- LogicalRDD [name#8, c1#9, c2#10, c5#11L, string_date#12, 
string_timestamp#13, timestamp_field#14, bool_field#15], false

== Optimized Logical Plan ==
Project [string_date#12]
+- LogicalRDD [name#8, c1#9, c2#10, c5#11L, string_date#12, 
string_timestamp#13, timestamp_field#14, bool_field#15], false
{code}

However, it fails to optimize a nested cast such as "select 
cast(cast(string_date as string) as string) from t1".

{code:java}
== Analyzed Logical Plan ==
CAST(CAST(string_date AS STRING) AS STRING): string
Project [cast(cast(string_date#12 as string) as string) AS 
CAST(CAST(string_date AS STRING) AS STRING)#24]
+- SubqueryAlias t1
 +- LogicalRDD [name#8, c1#9, c2#10, c5#11L, string_date#12, 
string_timestamp#13, timestamp_field#14, bool_field#15], false

== Optimized Logical Plan ==
Project [string_date#12 AS CAST(CAST(string_date AS STRING) AS STRING)#24]
+- LogicalRDD [name#8, c1#9, c2#10, c5#11L, string_date#12, 
string_timestamp#13, timestamp_field#14, bool_field#15], false
{code}

  was:
We use SparkSQL and Catalyst to optimize our Spark jobs. We have read the source 
code and tested the SimplifyCasts rule, which works for simple SQL without 
nested casts.

The SQL "select cast(string_date as string) from t1" will be optimized.

```
== Analyzed Logical Plan ==
string_date: string
Project [cast(string_date#12 as string) AS string_date#24]
+- SubqueryAlias t1
 +- LogicalRDD [name#8, c1#9, c2#10, c5#11L, string_date#12, 
string_timestamp#13, timestamp_field#14, bool_field#15], false

== Optimized Logical Plan ==
Project [string_date#12]
+- LogicalRDD [name#8, c1#9, c2#10, c5#11L, string_date#12, 
string_timestamp#13, timestamp_field#14, bool_field#15], false
```

However, it fails to optimize a nested cast such as "select 
cast(cast(string_date as string) as string) from t1".

```
== Analyzed Logical Plan ==
CAST(CAST(string_date AS STRING) AS STRING): string
Project [cast(cast(string_date#12 as string) as string) AS 
CAST(CAST(string_date AS STRING) AS STRING)#24]
+- SubqueryAlias t1
 +- LogicalRDD [name#8, c1#9, c2#10, c5#11L, string_date#12, 
string_timestamp#13, timestamp_field#14, bool_field#15], false

== Optimized Logical Plan ==
Project [string_date#12 AS CAST(CAST(string_date AS STRING) AS STRING)#24]
+- LogicalRDD [name#8, c1#9, c2#10, c5#11L, string_date#12, 
string_timestamp#13, timestamp_field#14, bool_field#15], false
```

 


> Rule SimplifyCasts will not work for nested columns
> ---
>
> Key: SPARK-33300
> URL: https://issues.apache.org/jira/browse/SPARK-33300
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 3.0.0
>Reporter: chendihao
>Priority: Minor
>
> We use SparkSQL and Catalyst to optimize our Spark jobs. We have read the 
> source code and tested the SimplifyCasts rule, which works for simple SQL 
> without nested casts.
> The SQL "select cast(string_date as string) from t1" will be optimized.
> {code:java}
> == Analyzed Logical Plan ==
> string_date: string
> Project [cast(string_date#12 as string) AS string_date#24]
> +- SubqueryAlias t1
>  +- LogicalRDD [name#8, c1#9, c2#10, c5#11L, string_date#12, 
> string_timestamp#13, timestamp_field#14, bool_field#15], false
> == Optimized Logical Plan ==
> Project [string_date#12]
> +- LogicalRDD [name#8, c1#9, c2#10, c5#11L, string_date#12, 
> string_timestamp#13, timestamp_field#14, bool_field#15], false
> {code}
> However, it fails to optimize a nested cast such as "select 
> cast(cast(string_date as string) as string) from t1".
> {code:java}
> == Analyzed Logical Plan ==
> CAST(CAST(string_date AS STRING) AS STRING): string
> Project [cast(cast(string_date#12 as string) as string) AS 
> CAST(CAST(string_date AS STRING) AS STRING)#24]
> +- SubqueryAlias t1
>  +- LogicalRDD [name#8, c1#9, c2#10, c5#11L, string_date#12, 
> string_timestamp#13, timestamp_field#14, bool_field#15], false
> == Optimized Logical Plan ==
> Project [string_date#12 AS CAST(CAST(string_date AS STRING) AS STRING)#24]
> +- LogicalRDD [name#8, c1#9, c2#10, c5#11L, string_date#12, 
> string_timestamp#13, timestamp_field#14, bool_field#15], false
> {code}
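
A minimal Scala sketch for comparing the two optimized plans side by side, assuming a single-column temporary view stands in for t1 (the object name is illustrative):

{code:scala}
import org.apache.spark.sql.SparkSession

object SimplifyCastsCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // A one-column view standing in for table t1 from the description.
    Seq("2020-10-30").toDF("string_date").createOrReplaceTempView("t1")

    val simple = spark.sql("select cast(string_date as string) from t1")
    val nested = spark.sql("select cast(cast(string_date as string) as string) from t1")

    // Per the plans above: the simple cast is dropped entirely, while the
    // nested form keeps the original expression as the output alias.
    println(simple.queryExecution.optimizedPlan)
    println(nested.queryExecution.optimizedPlan)

    spark.stop()
  }
}
{code}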




[jira] [Created] (SPARK-33300) Rule SimplifyCasts will not work for nested columns

2020-10-30 Thread chendihao (Jira)
chendihao created SPARK-33300:
-

 Summary: Rule SimplifyCasts will not work for nested columns
 Key: SPARK-33300
 URL: https://issues.apache.org/jira/browse/SPARK-33300
 Project: Spark
  Issue Type: Bug
  Components: Optimizer, SQL
Affects Versions: 3.0.0
Reporter: chendihao


We use SparkSQL and Catalyst to optimize our Spark jobs. We have read the source 
code and tested the SimplifyCasts rule, which works for simple SQL without 
nested casts.

The SQL "select cast(string_date as string) from t1" will be optimized.

```
== Analyzed Logical Plan ==
string_date: string
Project [cast(string_date#12 as string) AS string_date#24]
+- SubqueryAlias t1
 +- LogicalRDD [name#8, c1#9, c2#10, c5#11L, string_date#12, 
string_timestamp#13, timestamp_field#14, bool_field#15], false

== Optimized Logical Plan ==
Project [string_date#12]
+- LogicalRDD [name#8, c1#9, c2#10, c5#11L, string_date#12, 
string_timestamp#13, timestamp_field#14, bool_field#15], false
```

However, it fails to optimize a nested cast such as "select 
cast(cast(string_date as string) as string) from t1".

```
== Analyzed Logical Plan ==
CAST(CAST(string_date AS STRING) AS STRING): string
Project [cast(cast(string_date#12 as string) as string) AS 
CAST(CAST(string_date AS STRING) AS STRING)#24]
+- SubqueryAlias t1
 +- LogicalRDD [name#8, c1#9, c2#10, c5#11L, string_date#12, 
string_timestamp#13, timestamp_field#14, bool_field#15], false

== Optimized Logical Plan ==
Project [string_date#12 AS CAST(CAST(string_date AS STRING) AS STRING)#24]
+- LogicalRDD [name#8, c1#9, c2#10, c5#11L, string_date#12, 
string_timestamp#13, timestamp_field#14, bool_field#15], false
```

 






[jira] [Commented] (SPARK-33248) Add a configuration to control the legacy behavior of whether to pad null values when the value size is less than the schema size

2020-10-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223508#comment-17223508
 ] 

Apache Spark commented on SPARK-33248:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/30202

> Add a configuration to control the legacy behavior of whether to pad null 
> values when the value size is less than the schema size
> --
>
> Key: SPARK-33248
> URL: https://issues.apache.org/jira/browse/SPARK-33248
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.1.0
>
>
> Add a configuration to control the legacy behavior of whether to pad null 
> values when the value size is less than the schema size.
>  
> See the review comment: [https://github.com/apache/spark/pull/29421#discussion_r511684691]






[jira] [Resolved] (SPARK-33297) Intermittent Compilation failure In GitHub Actions after SBT upgrade

2020-10-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-33297.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30198
[https://github.com/apache/spark/pull/30198]

> Intermittent Compilation failure In GitHub Actions after SBT upgrade
> 
>
> Key: SPARK-33297
> URL: https://issues.apache.org/jira/browse/SPARK-33297
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.1.0
>
>
> https://github.com/apache/spark/runs/1314691686
> {code}
> Error:  java.util.MissingResourceException: Can't find bundle for base name 
> org.scalactic.ScalacticBundle, locale en
> Error:at 
> java.util.ResourceBundle.throwMissingResourceException(ResourceBundle.java:1581)
> Error:at 
> java.util.ResourceBundle.getBundleImpl(ResourceBundle.java:1396)
> Error:at java.util.ResourceBundle.getBundle(ResourceBundle.java:782)
> Error:at 
> org.scalactic.Resources$.resourceBundle$lzycompute(Resources.scala:8)
> Error:at org.scalactic.Resources$.resourceBundle(Resources.scala:8)
> Error:at 
> org.scalactic.Resources$.pleaseDefineScalacticFillFilePathnameEnvVar(Resources.scala:256)
> Error:at 
> org.scalactic.source.PositionMacro$PositionMacroImpl.apply(PositionMacro.scala:65)
> Error:at 
> org.scalactic.source.PositionMacro$.genPosition(PositionMacro.scala:85)
> Error:at sun.reflect.GeneratedMethodAccessor34.invoke(Unknown Source)
> Error:at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> Error:at java.lang.reflect.Method.invoke(Method.java:498)
> {code}






[jira] [Updated] (SPARK-33183) Bug in optimizer rule EliminateSorts

2020-10-30 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-33183:

Fix Version/s: 3.0.2
   2.4.8

> Bug in optimizer rule EliminateSorts
> 
>
> Key: SPARK-33183
> URL: https://issues.apache.org/jira/browse/SPARK-33183
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.8, 3.0.2, 3.1.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
>  Labels: correctness
> Fix For: 2.4.8, 3.0.2, 3.1.0
>
>
> Currently, the rule {{EliminateSorts}} removes a global sort node if its 
> child plan already satisfies the required sort order without checking if the 
> child plan's ordering is local or global. For example, in the following 
> scenario, the first sort shouldn't be removed because it has a stronger 
> guarantee than the second sort even if the sort orders are the same for both 
> sorts. 
> {code:java}
> Sort(orders, global = True, ...)
>   Sort(orders, global = False, ...){code}
>  
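
A hedged illustration (not from the ticket) of how this plan shape can arise from the DataFrame API: a global orderBy stacked on a partition-local sort with the same ordering, where the outer global sort must be kept:

{code:scala}
import org.apache.spark.sql.SparkSession

object GlobalOverLocalSort {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(3, 1, 2).toDF("orders")
    df.sortWithinPartitions("orders") // Sort(orders, global = false, ...)
      .orderBy("orders")              // Sort(orders, global = true, ...)
      .explain(true)                  // the outer, global Sort must survive optimization

    spark.stop()
  }
}
{code}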






[jira] [Assigned] (SPARK-33294) Add query resolved check before analyze InsertIntoDir

2020-10-30 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-33294:
---

Assignee: ulysses you

> Add query resolved check before analyze InsertIntoDir
> -
>
> Key: SPARK-33294
> URL: https://issues.apache.org/jira/browse/SPARK-33294
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Minor
>
> Add a query-resolved check before analyzing InsertIntoDir.






[jira] [Resolved] (SPARK-33294) Add query resolved check before analyze InsertIntoDir

2020-10-30 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-33294.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30197
[https://github.com/apache/spark/pull/30197]

> Add query resolved check before analyze InsertIntoDir
> -
>
> Key: SPARK-33294
> URL: https://issues.apache.org/jira/browse/SPARK-33294
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Minor
> Fix For: 3.1.0
>
>
> Add a query-resolved check before analyzing InsertIntoDir.






[jira] [Assigned] (SPARK-33299) Unify schema parsing in from_json/from_csv across all APIs

2020-10-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33299:


Assignee: Apache Spark

> Unify schema parsing in from_json/from_csv across all APIs
> --
>
> Key: SPARK-33299
> URL: https://issues.apache.org/jira/browse/SPARK-33299
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Currently, from_json() has an extra capability in the Scala API: it accepts a 
> schema in JSON format, but the other APIs (SQL, Python, R) lack this feature. 
> This ticket aims to unify all APIs and support schemas in JSON format everywhere.






[jira] [Assigned] (SPARK-33299) Unify schema parsing in from_json/from_csv across all APIs

2020-10-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33299:


Assignee: (was: Apache Spark)

> Unify schema parsing in from_json/from_csv across all APIs
> --
>
> Key: SPARK-33299
> URL: https://issues.apache.org/jira/browse/SPARK-33299
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Currently, from_json() has an extra capability in the Scala API: it accepts a 
> schema in JSON format, but the other APIs (SQL, Python, R) lack this feature. 
> This ticket aims to unify all APIs and support schemas in JSON format everywhere.






[jira] [Commented] (SPARK-33299) Unify schema parsing in from_json/from_csv across all APIs

2020-10-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223485#comment-17223485
 ] 

Apache Spark commented on SPARK-33299:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/30201

> Unify schema parsing in from_json/from_csv across all APIs
> --
>
> Key: SPARK-33299
> URL: https://issues.apache.org/jira/browse/SPARK-33299
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Currently, from_json() has an extra capability in the Scala API: it accepts a 
> schema in JSON format, but the other APIs (SQL, Python, R) lack this feature. 
> This ticket aims to unify all APIs and support schemas in JSON format everywhere.






[jira] [Created] (SPARK-33299) Unify schema parsing in from_json/from_csv across all APIs

2020-10-30 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33299:
--

 Summary: Unify schema parsing in from_json/from_csv across all APIs
 Key: SPARK-33299
 URL: https://issues.apache.org/jira/browse/SPARK-33299
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.1.0
Reporter: Maxim Gekk


Currently, from_json() has an extra capability in the Scala API: it accepts a 
schema in JSON format, but the other APIs (SQL, Python, R) lack this feature. 
This ticket aims to unify all APIs and support schemas in JSON format everywhere.
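
For reference, a minimal sketch of the Scala-only behavior described here, passing the schema to from_json as a JSON string (produced by StructType.json) rather than a DDL string; the object and column names are illustrative:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object FromJsonWithJsonSchema {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Schema serialized to its JSON representation instead of a DDL string.
    val jsonSchema = StructType(Seq(StructField("a", StringType))).json

    Seq("""{"a": "1"}""").toDF("value")
      .select(from_json($"value", jsonSchema, Map.empty[String, String]).as("parsed"))
      .show(truncate = false)

    spark.stop()
  }
}
{code}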






[jira] [Commented] (SPARK-33166) Provide Search Function in Spark docs site

2020-10-30 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223466#comment-17223466
 ] 

jiaan.geng commented on SPARK-33166:


I will take a look!

> Provide Search Function in Spark docs site
> --
>
> Key: SPARK-33166
> URL: https://issues.apache.org/jira/browse/SPARK-33166
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation
>Affects Versions: 3.1.0
>Reporter: Xiao Li
>Priority: Major
>
> In the last few releases, our Spark documentation at 
> https://spark.apache.org/docs/latest/ has become richer. It would be nice to 
> provide a search function to help our users find content faster.


