I am using Spark 1.5.0 on YARN.

On Nov 2, 2015, at 3:16 PM, Yin Huai <yh...@databricks.com> wrote:

Hi Ross,

What version of Spark are you using? There were two issues that affected the 
results of window functions in the Spark 1.5 branch. Both issues have been 
fixed and will be released with Spark 1.5.2 (this release will happen soon). 
For more details on these two issues, you can take a look at 
https://issues.apache.org/jira/browse/SPARK-11135 and 
https://issues.apache.org/jira/browse/SPARK-11009.

Thanks,

Yin

On Mon, Nov 2, 2015 at 12:07 PM, 
<ross.cramb...@thomsonreuters.com> wrote:
Hello Spark community -
I am running a Spark SQL query to calculate the difference in time between 
consecutive events, using lag(unix_time) over a window:


SELECT device_id,
       unix_time,
       event_id,
       unix_time - lag(unix_time)
           OVER (PARTITION BY device_id ORDER BY unix_time, event_id)
           AS seconds_since_last_event
FROM ios_d_events;
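
For context, here is the behavior I expect from lag() on a tiny made-up data 
set (just a sketch; the inline UNION ALL table is illustrative and the exact 
syntax may need adjusting on Spark 1.5):

-- Hypothetical sample rows, invented for illustration only.
WITH sample_events AS (
  SELECT 'dev1' AS device_id, 100 AS unix_time, 1 AS event_id
  UNION ALL SELECT 'dev1', 160, 2
  UNION ALL SELECT 'dev2', 200, 3  -- first two events share a timestamp
  UNION ALL SELECT 'dev2', 200, 4
)
SELECT device_id,
       unix_time,
       event_id,
       unix_time - lag(unix_time)
           OVER (PARTITION BY device_id ORDER BY unix_time, event_id)
           AS seconds_since_last_event
FROM sample_events;

Expected: NULL then 60 for dev1, and NULL then 0 for dev2, since lag() should 
return NULL on the first row of each partition.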

This is giving me some strange results in the case where the first two events 
for a particular device_id have the same timestamp.
I used the following query to take a look at what value was being returned by 
lag():

SELECT device_id,
       event_time,
       unix_time,
       event_id,
       lag(event_time)
           OVER (PARTITION BY device_id ORDER BY unix_time, event_id)
           AS lag_time
FROM ios_d_events;

I’m seeing that in these cases I am getting something like 1970-01-03 … 
instead of a null value, and the subsequent lag times all follow the same 
format.

I posted a section of this output in this SO question: 
http://stackoverflow.com/questions/33482167/spark-sql-window-function-lag-giving-unexpected-resutls

The errant results are labeled with device_id 99999999999999999999999.
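
In the meantime I am considering a stopgap (just an untested sketch, and it 
assumes row_number() is not affected by the same problem): explicitly null 
out the result on each partition's first row instead of trusting lag()'s 
boundary value:

SELECT device_id,
       unix_time,
       event_id,
       -- Force NULL on the first row of each partition rather than relying
       -- on lag() returning NULL there.
       CASE WHEN row_number()
                 OVER (PARTITION BY device_id ORDER BY unix_time, event_id) = 1
            THEN NULL
            ELSE unix_time - lag(unix_time)
                 OVER (PARTITION BY device_id ORDER BY unix_time, event_id)
       END AS seconds_since_last_event
FROM ios_d_events;

Since the subsequent lag values look wrong as well, this probably only papers 
over the first row rather than fixing the whole partition.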

Any idea why this is occurring?

- Ross

