Hello Spark community - I am running a Spark SQL query to calculate the difference in time between consecutive events, using lag(unix_time) over a window:

    SELECT device_id, unix_time, event_id,
           unix_time - lag(unix_time) OVER (PARTITION BY device_id ORDER BY unix_time, event_id) AS seconds_since_last_event
    FROM ios_d_events;

This gives me some strange results when the first two events for a particular device_id have the same timestamp. I used the following query to take a look at what value was being returned by lag():

    SELECT device_id, event_time, unix_time, event_id,
           lag(event_time) OVER (PARTITION BY device_id ORDER BY unix_time, event_id) AS lag_time
    FROM ios_d_events;

In these cases I'm seeing something like 1970-01-03 … instead of a null value, and the following lag times all follow the same format. I posted a section of this output in this SO question: http://stackoverflow.com/questions/33482167/spark-sql-window-function-lag-giving-unexpected-resutls

The errant results are labeled with device_id 99999999999999999999999. Any idea why this is occurring?

- Ross
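P.S. For reference, here is a plain-Python sketch (not Spark, and with made-up sample data) of the lag() semantics I'm expecting: within each device_id partition, rows ordered by (unix_time, event_id), the first row's lag should be null (None) and every later row's lag should be the previous row's value - even when the first two rows share a timestamp:

```python
from itertools import groupby
from operator import itemgetter

def lag_per_partition(rows):
    """rows: list of dicts with device_id, unix_time, event_id keys.
    Returns the rows with a lag_time field holding the previous row's
    unix_time within the same device_id partition, or None for the
    first row of each partition."""
    rows = sorted(rows, key=itemgetter("device_id", "unix_time", "event_id"))
    out = []
    for _, group in groupby(rows, key=itemgetter("device_id")):
        prev = None
        for r in group:
            out.append({**r, "lag_time": prev})
            prev = r["unix_time"]
    return out

# Hypothetical sample: the first two events for a device share a timestamp.
events = [
    {"device_id": "a", "unix_time": 100, "event_id": 1},
    {"device_id": "a", "unix_time": 100, "event_id": 2},
    {"device_id": "a", "unix_time": 160, "event_id": 3},
]
print([r["lag_time"] for r in lag_per_partition(events)])  # [None, 100, 100]
```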