I'm trying to use the Spark SQL built-in window function: https://spark.apache.org/docs/2.0.2/api/java/org/apache/spark/sql/functions.html#window(org.apache.spark.sql.Column,%20java.lang.String)
I run it with a step of 1 second and a window of 3 minutes (a ratio of 180), and it runs extremely slowly compared to other methods (join & filter, for example).

Example dataset (there is also a third column, `features`, which isn't shown here):

```
+-----+-------------------+
|data |timestamp          |
+-----+-------------------+
|data1|2017-12-28 11:23:10|
|data1|2017-12-28 11:23:11|
|data1|2017-12-28 11:23:19|
|data2|2017-12-28 11:23:13|
|data2|2017-12-28 11:23:14|
+-----+-------------------+
```

Code:

```java
private static String TIME_STEP_STRING = "1 seconds";
private static String TIME_WINDOW_STRING = "3 minutes";

Column slidingWindow = functions.window(data.col("timestamp"), TIME_WINDOW_STRING, TIME_STEP_STRING);
Dataset<Row> data2 = data.withColumn("slide", slidingWindow);
Dataset<Row> finalRes = data2
        .groupBy(slidingWindow, data2.col("data"))
        .agg(functions.collect_set("features").as("feature_set"))
        .cache();
```

Am I using it wrong? The situation is so bad that I get `java.lang.OutOfMemoryError: Java heap space`.
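As a sanity check on what that ratio means in practice, here is a quick back-of-the-envelope sketch (plain Python, no Spark): with sliding windows, each event belongs to every window whose span contains its timestamp, i.e. roughly window ÷ step windows, so each input row is replicated that many times before the `groupBy` even runs. The row count used below is hypothetical, just to illustrate the scale.

```python
from datetime import timedelta

step = timedelta(seconds=1)    # TIME_STEP_STRING = "1 seconds"
window = timedelta(minutes=3)  # TIME_WINDOW_STRING = "3 minutes"

# Each event falls into every sliding window containing its timestamp;
# with a new window starting every `step`, that is window / step windows.
windows_per_event = int(window / step)
print(windows_per_event)  # 180

# So each input row is expanded ~180x into intermediate rows:
rows = 1_000_000  # hypothetical input size
print(rows * windows_per_event)  # 180000000
```

That 180x expansion of the intermediate data (on top of `collect_set` buffering and the `.cache()`) is consistent with the slowdown and the heap-space error described above.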