[ https://issues.apache.org/jira/browse/SPARK-12835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102386#comment-15102386 ]
Herman van Hovell commented on SPARK-12835:
-------------------------------------------

I can reproduce your problem with the following Scala code:
{noformat}
// Run in spark-shell: toDF and $ need the SQLContext implicits in scope.
import java.sql.Date
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{avg, datediff, lag}

val df = Seq(
    (Date.valueOf("2014-01-01")),
    (Date.valueOf("2014-02-01")),
    (Date.valueOf("2014-03-01")),
    (Date.valueOf("2014-03-06")),
    (Date.valueOf("2014-08-23")),
    (Date.valueOf("2014-10-01"))).
    map(Tuple1.apply).
    toDF("ts")

// This doesn't work:
df.select(avg(datediff($"ts", lag($"ts", 1).over(Window.orderBy($"ts"))))).show

// This does work:
df.select(datediff($"ts", lag($"ts", 1).over(Window.orderBy($"ts"))).as("diff"))
  .select(avg($"diff"))
  .show
{noformat}

It seems there is a small bug in the analyzer.

> StackOverflowError when aggregating over column from window function
> --------------------------------------------------------------------
>
>                 Key: SPARK-12835
>                 URL: https://issues.apache.org/jira/browse/SPARK-12835
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.6.0
>            Reporter: Kalle Jepsen
>
> I am encountering a StackOverflowError with a very long traceback when I try
> to aggregate directly on a column created by a window function.
> For example, I am trying to determine the average timespan between dates in a
> DataFrame column by using a window function:
> {code}
> from pyspark import SparkContext
> from pyspark.sql import HiveContext, Window, functions
> from datetime import datetime
> sc = SparkContext()
> sq = HiveContext(sc)
> data = [
>     [datetime(2014,1,1)],
>     [datetime(2014,2,1)],
>     [datetime(2014,3,1)],
>     [datetime(2014,3,6)],
>     [datetime(2014,8,23)],
>     [datetime(2014,10,1)],
> ]
> df = sq.createDataFrame(data, schema=['ts'])
> ts = functions.col('ts')
>
> w = Window.orderBy(ts)
> diff = functions.datediff(
>     ts,
>     functions.lag(ts, count=1).over(w)
> )
> avg_diff = functions.avg(diff)
> {code}
> While {{df.select(diff.alias('diff')).show()}} correctly renders as
> {noformat}
> +----+
> |diff|
> +----+
> |null|
> |  31|
> |  28|
> |   5|
> | 170|
> |  39|
> +----+
> {noformat}
> doing
> {code}
> df.select(avg_diff).show()
> {code}
> throws a {{java.lang.StackOverflowError}}.
> When I say
> {code}
> df2 = df.select(diff.alias('diff'))
> df2.select(functions.avg('diff'))
> {code}
> however, there's no error.
> Am I wrong to assume that the above should work?
> I've already described the same in [this question on
> stackoverflow.com|http://stackoverflow.com/questions/34793999/averaging-over-window-function-leads-to-stackoverflowerror].
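
For reference, the value that the failing query is expected to return can be checked without Spark. The following is a minimal plain-Python sketch of the same calculation, not Spark code. The helper names {{gaps_in_days}} and {{average_gap_days}} are hypothetical, and the sketch assumes that {{avg}} skips the null that {{lag}} produces for the first row, as SQL aggregates normally do.

```python
from datetime import date

# Same dates as in the report above.
DATES = [
    date(2014, 1, 1),
    date(2014, 2, 1),
    date(2014, 3, 1),
    date(2014, 3, 6),
    date(2014, 8, 23),
    date(2014, 10, 1),
]


def gaps_in_days(dates):
    """Day differences between consecutive dates (hypothetical helper).

    Sorting stands in for Window.orderBy(ts); pairing each date with its
    predecessor stands in for lag(ts, 1). The first date has no predecessor
    and therefore contributes no gap (the null row in the Spark output).
    """
    ordered = sorted(dates)
    return [(cur - prev).days for prev, cur in zip(ordered, ordered[1:])]


def average_gap_days(dates):
    """Mean of the gaps, or None when there are fewer than two dates."""
    gaps = gaps_in_days(dates)
    if not gaps:
        return None
    return sum(gaps) / len(gaps)


print(gaps_in_days(DATES))      # [31, 28, 5, 170, 39]
print(average_gap_days(DATES))  # 54.6
```

The gaps match the non-null rows of the {{diff}} column shown in the report, so under the stated assumption about nulls, both the two-step workaround and the direct {{avg}} over the window expression should yield 54.6 once the analyzer bug is fixed.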