This is definitely a bug in the CollapseWindow optimizer rule. I think we can use SPARK-20086 <https://issues.apache.org/jira/browse/SPARK-20086> to track this.
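
For reference, the collapse looks roughly like the sketch below. This is not the actual patch (see SPARK-20086 for that); it is a hypothetical illustration of the guard the rule appears to be missing: two adjacent Window operators with identical partitioning and ordering may only be merged when the upper one does not reference the lower one's output (here, AmtPaidCumSum#366).

import org.apache.spark.sql.catalyst.expressions.AttributeSet
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Window}
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical sketch, not the committed fix: merge two stacked Window
// operators only when the merge cannot break attribute resolution.
object CollapseWindowSketch extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
    case w1 @ Window(we1, ps1, os1, w2 @ Window(we2, ps2, os2, grandChild))
        if ps1 == ps2 && os1 == os2 &&
          // Missing guard: the upper window must not consume an attribute
          // produced by the lower window (e.g. max over AmtPaidCumSum).
          w1.references.intersect(AttributeSet(we2.map(_.toAttribute))).isEmpty =>
      w1.copy(windowExpressions = we2 ++ we1, child = grandChild)
  }
}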
On Fri, Mar 24, 2017 at 9:28 PM, Maciej Szymkiewicz <mszymkiew...@gmail.com> wrote:

> Forwarded from SO (http://stackoverflow.com/q/43007433). Looks like a
> regression compared to 2.0.2.
>
> scala> import org.apache.spark.sql.expressions.Window
> import org.apache.spark.sql.expressions.Window
>
> scala> val win_spec_max = Window.partitionBy("x").orderBy("AmtPaid").rowsBetween(Window.unboundedPreceding, 0)
> win_spec_max: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@3433e418
>
> scala> val df = Seq((1, 2.0), (1, 3.0), (1, 1.0), (1, -2.0), (1, -1.0)).toDF("x", "AmtPaid")
> df: org.apache.spark.sql.DataFrame = [x: int, AmtPaid: double]
>
> scala> val df_with_sum = df.withColumn("AmtPaidCumSum", sum(col("AmtPaid")).over(win_spec_max))
> df_with_sum: org.apache.spark.sql.DataFrame = [x: int, AmtPaid: double ... 1 more field]
>
> scala> val df_with_max = df_with_sum.withColumn("AmtPaidCumSumMax", max(col("AmtPaidCumSum")).over(win_spec_max))
> df_with_max: org.apache.spark.sql.DataFrame = [x: int, AmtPaid: double ... 2 more fields]
>
> scala> df_with_max.explain
> == Physical Plan ==
> !Window [sum(AmtPaid#361) windowspecdefinition(x#360, AmtPaid#361 ASC NULLS FIRST, ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS AmtPaidCumSum#366, max(AmtPaidCumSum#366) windowspecdefinition(x#360, AmtPaid#361 ASC NULLS FIRST, ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS AmtPaidCumSumMax#372], [x#360], [AmtPaid#361 ASC NULLS FIRST]
> +- *Sort [x#360 ASC NULLS FIRST, AmtPaid#361 ASC NULLS FIRST], false, 0
>    +- Exchange hashpartitioning(x#360, 200)
>       +- LocalTableScan [x#360, AmtPaid#361]
>
> scala> df_with_max.printSchema
> root
>  |-- x: integer (nullable = false)
>  |-- AmtPaid: double (nullable = false)
>  |-- AmtPaidCumSum: double (nullable = true)
>  |-- AmtPaidCumSumMax: double (nullable = true)
>
> scala> df_with_max.show
> 17/03/24 21:22:32 ERROR Executor: Exception in task 0.0 in stage 19.0 (TID 234)
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: AmtPaidCumSum#366
>         at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>         ...
> Caused by: java.lang.RuntimeException: Couldn't find AmtPaidCumSum#366 in [sum#385,max#386,x#360,AmtPaid#361]
>         ...
>
> Is it a known issue or do we need a JIRA?
>
> --
> Best,
> Maciej Szymkiewicz

--
Herman van Hövell
Software Engineer
Databricks Inc.
hvanhov...@databricks.com
+31 6 420 590 27
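
Until a fix for SPARK-20086 lands, one possible user-side workaround, shown here as an untested sketch against the spark-shell session quoted above, is to cache the intermediate DataFrame: the cached fragment is replaced with an InMemoryRelation before the optimizer runs, so CollapseWindow never sees the two Window operators stacked directly on each other.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, max, sum}

val win_spec_max = Window.partitionBy("x").orderBy("AmtPaid").rowsBetween(Window.unboundedPreceding, 0)
val df = Seq((1, 2.0), (1, 3.0), (1, 1.0), (1, -2.0), (1, -1.0)).toDF("x", "AmtPaid")

// Caching materializes the running sum, so the second window is planned
// on top of an InMemoryRelation instead of the first Window node.
val df_with_sum = df
  .withColumn("AmtPaidCumSum", sum(col("AmtPaid")).over(win_spec_max))
  .cache()
val df_with_max = df_with_sum
  .withColumn("AmtPaidCumSumMax", max(col("AmtPaidCumSum")).over(win_spec_max))
df_with_max.show()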