Re: Spark SQL lag() window function, strange behavior

2015-11-02 Thread Yin Huai
Hi Ross,

What version of Spark are you using? There were two issues that affected
the results of window functions in the Spark 1.5 branch. Both issues have
been fixed and will be released with Spark 1.5.2 (that release will happen
soon). For more details on these two issues, you can take a look at
https://issues.apache.org/jira/browse/SPARK-11135 and
https://issues.apache.org/jira/browse/SPARK-11009.

Thanks,

Yin

On Mon, Nov 2, 2015 at 12:07 PM, Ross.Cramblit wrote:

> Hello Spark community -
> I am running a Spark SQL query to calculate the difference in time between
> consecutive events, using lag(event_time) over a window -
>
> SELECT device_id,
>        unix_time,
>        event_id,
>        unix_time - lag(unix_time)
>            OVER (PARTITION BY device_id ORDER BY unix_time, event_id)
>            AS seconds_since_last_event
> FROM ios_d_events;
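>
> (For what it's worth, the same query can also be expressed through the
> DataFrame API. The snippet below is only a minimal sketch, assuming a
> DataFrame named `events` with the same columns as ios_d_events; the
> DataFrame name is made up.)
>
> import org.apache.spark.sql.expressions.Window
> import org.apache.spark.sql.functions.lag
>
> // Same partitioning and ordering as the SQL above.
> val w = Window.partitionBy("device_id").orderBy("unix_time", "event_id")
>
> // lag(col, 1) is the previous row's value within the window;
> // it should be NULL for the first row of each partition.
> val withDiff = events.withColumn(
>   "seconds_since_last_event",
>   events("unix_time") - lag(events("unix_time"), 1).over(w))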
>
> This is giving me some strange results in the case where the first two
> events for a particular device_id have the same timestamp.
> I used the following query to take a look at what value was being returned
> by lag():
>
> SELECT device_id,
>        event_time,
>        unix_time,
>        event_id,
>        lag(event_time)
>            OVER (PARTITION BY device_id ORDER BY unix_time, event_id)
>            AS lag_time
> FROM ios_d_events;
>
> I’m seeing that in these cases, I am getting something like 1970-01-03 …
> instead of a null value, and the subsequent lag times all follow the same
> format.
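>
> A timestamp around 1970-01-03 is only a couple of days past the Unix epoch,
> so it looks as if the missing lag value is being treated as 0 rather than
> NULL. As a sanity check, here is a minimal sketch with made-up rows for
> device_id 999 that mirrors the shape of the data - two leading events with
> the same unix_time - to see whether lag() returns NULL for the first row of
> the partition. (It uses the sqlContext from spark-shell; window functions in
> 1.5 need a HiveContext, which the shell's sqlContext is when Spark is built
> with Hive support.)
>
> import org.apache.spark.sql.expressions.Window
> import org.apache.spark.sql.functions.lag
>
> // Three toy events for one device; the first two share a timestamp.
> val df = sqlContext.createDataFrame(Seq(
>   ("999", 1446480000L, 1),
>   ("999", 1446480000L, 2),
>   ("999", 1446480060L, 3)
> )).toDF("device_id", "unix_time", "event_id")
>
> val w = Window.partitionBy("device_id").orderBy("unix_time", "event_id")
>
> // Expected: lag_time is NULL on the first row of the partition.
> df.withColumn("lag_time", lag(df("unix_time"), 1).over(w)).show()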
>
> I posted a section of this output in this SO question:
> http://stackoverflow.com/questions/33482167/spark-sql-window-function-lag-giving-unexpected-resutls
>
> The errant results are labeled with device_id 999.
>
> Any idea why this is occurring?
>
> - Ross
>


Re: Spark SQL lag() window function, strange behavior

2015-11-02 Thread Ross.Cramblit
I am using Spark 1.5.0 on YARN
