Raviteja Lokineni created SPARK-18388:
-----------------------------------------
             Summary: Running aggregation on many columns throws SOE
                 Key: SPARK-18388
                 URL: https://issues.apache.org/jira/browse/SPARK-18388
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.0.1, 1.6.2, 1.5.2
         Environment: PySpark 2.0.1, Jupyter
            Reporter: Raviteja Lokineni
            Priority: Critical

Use case: I am generating weekly rolling aggregates (mean, count, sum, min, max) of every column of the data; with many columns, the query throws a StackOverflowError (SOE).

{code:python}
from pyspark.sql.window import Window
from pyspark.sql.functions import col, count, max, mean, min, sum

timeSeries = sqlContext.read.option("header", "true")\
    .format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")\
    .load("file:///tmp/spark-bug.csv")

# Hive timestamp is interpreted as UNIX timestamp in seconds
days = lambda i: i * 86400

# 7-day rolling window per id, ordered by the date column in epoch seconds
w = (Window
     .partitionBy("id")
     .orderBy(col("dt").cast("timestamp").cast("long"))
     .rangeBetween(-days(6), 0))

cols = ["id", "dt"]
skipCols = ["id", "dt"]

# Loop variable named "c" so it does not shadow pyspark.sql.functions.col
for c in timeSeries.columns:
    if c in skipCols:
        continue
    cols.append(mean(c).over(w).alias("mean_7_" + c))
    cols.append(count(c).over(w).alias("count_7_" + c))
    cols.append(sum(c).over(w).alias("sum_7_" + c))
    cols.append(min(c).over(w).alias("min_7_" + c))
    cols.append(max(c).over(w).alias("max_7_" + c))

df = timeSeries.select(cols)
df.orderBy("id", "dt").write\
    .format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")\
    .save("file:///tmp/spark-bug-out.csv")
{code}
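For a self-contained reproduction that does not depend on the attached CSV, a sketch along the following lines should exercise the same code path. The synthetic rows, the column names {{c0..cN}}, and the {{numCols}} knob are placeholders introduced here for illustration; only the window spec and the five aggregates come from the report above.

{code:python}
from pyspark.sql.window import Window
from pyspark.sql.functions import col, count, max, mean, min, sum

# Hypothetical repro: build an in-memory DataFrame with many numeric columns
# instead of reading /tmp/spark-bug.csv. numCols and c0..cN are invented here.
numCols = 50  # increase to test how the failure scales with width

names = ["id", "dt"] + ["c%d" % i for i in range(numCols)]
rows = [tuple(["a", "2016-11-0%d" % (d + 1)] + [float(d)] * numCols)
        for d in range(9)]
timeSeries = sqlContext.createDataFrame(rows, names)

days = lambda i: i * 86400
w = (Window
     .partitionBy("id")
     .orderBy(col("dt").cast("timestamp").cast("long"))
     .rangeBetween(-days(6), 0))

cols = ["id", "dt"]
for c in names[2:]:
    for fn, label in [(mean, "mean"), (count, "count"), (sum, "sum"),
                      (min, "min"), (max, "max")]:
        cols.append(fn(c).over(w).alias(label + "_7_" + c))

# With a sufficiently large numCols, this wide select is expected to hit
# the StackOverflowError described above
timeSeries.select(cols).orderBy("id", "dt").collect()
{code}

Raising {{numCols}} makes the select list, and hence the analyzed plan, proportionally wider, which is where the overflow would be expected to surface.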