Hello Spark community -
I am running a Spark SQL query to calculate the time difference between
consecutive events, using lag() over a window:

SELECT device_id,
       unix_time,
       event_id,
       unix_time - lag(unix_time)
           OVER (PARTITION BY device_id ORDER BY unix_time, event_id)
           AS seconds_since_last_event
FROM ios_d_events;

This gives me some strange results when the first two events for a
particular device_id have the same timestamp. I used the following query to
take a look at what value lag() was returning:

SELECT device_id,
       event_time,
       unix_time,
       event_id,
       lag(event_time)
           OVER (PARTITION BY device_id ORDER BY unix_time, event_id)
           AS lag_time
FROM ios_d_events;

I’m seeing that in these cases I get something like 1970-01-03 … instead of
a null value, and the lag times on the following rows show the same
1970-era values.
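
To make the failing case concrete, here is a minimal repro of the shape of
data involved (the device_id, timestamps, and inline table are made up for
illustration; the point is that the first two events for a device share a
timestamp, so I would expect lag() to return NULL on the first row):

SELECT device_id,
       event_time,
       lag(event_time)
           OVER (PARTITION BY device_id ORDER BY unix_time, event_id)
           AS lag_time
FROM (
    -- two events for the same (made-up) device with identical timestamps,
    -- distinguished only by event_id
    SELECT 42 AS device_id, 1446422400 AS unix_time, 1 AS event_id,
           CAST('2015-11-01 00:00:00' AS timestamp) AS event_time
    UNION ALL
    SELECT 42, 1446422400, 2, CAST('2015-11-01 00:00:00' AS timestamp)
) events;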

I posted a section of this output in this SO question: 
http://stackoverflow.com/questions/33482167/spark-sql-window-function-lag-giving-unexpected-resutls

The errant results are labeled with device_id 99999999999999999999999.

Any idea why this is occurring?
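
In the meantime, a possible stopgap (sketch only; it merely masks the first
row per device and would not correct the later rows if they are shifted
too) is to null out the first event explicitly with row_number():

SELECT device_id,
       unix_time,
       event_id,
       CASE
           -- the first event for a device has no previous event, so force
           -- the difference to NULL regardless of what lag() returns
           WHEN row_number()
                    OVER (PARTITION BY device_id ORDER BY unix_time, event_id) = 1
               THEN NULL
           ELSE unix_time - lag(unix_time)
                    OVER (PARTITION BY device_id ORDER BY unix_time, event_id)
       END AS seconds_since_last_event
FROM ios_d_events;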

- Ross
