Pete Baker created SPARK-17297:
----------------------------------

             Summary: window function generates unexpected results due to 
startTime being relative to UTC
                 Key: SPARK-17297
                 URL: https://issues.apache.org/jira/browse/SPARK-17297
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.0.0
            Reporter: Pete Baker


In Spark 2.0.0, the {{window(Column timeColumn, String windowDuration, String 
slideDuration, String startTime)}} function {{startTime}} parameter behaves as 
follows:

{quote}
startTime - The offset with respect to 1970-01-01 00:00:00 UTC with which to 
start window intervals. For example, in order to have hourly tumbling windows 
that start 15 minutes past the hour, e.g. 12:15-13:15, 13:15-14:15... provide 
startTime as 15 minutes.
{quote}
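To make the documented semantics concrete, here is a minimal sketch (plain Python, not Spark's actual implementation) of the described bucketing rule: floor the epoch timestamp, shifted by {{startTime}}, onto the window grid. The function name {{window_start}} is just for illustration.

```python
from datetime import datetime, timezone

HOUR = 3600  # hourly tumbling windows, in seconds


def window_start(epoch_s: float, start_time_s: int = 15 * 60) -> datetime:
    """Start of the window containing epoch_s, per the documented rule:
    shift by startTime, floor to a window-length multiple, shift back."""
    shifted = epoch_s - start_time_s
    return datetime.fromtimestamp(shifted - shifted % HOUR + start_time_s,
                                  tz=timezone.utc)


# An event at 12:40 UTC falls into the 12:15-13:15 window, as documented.
ts = datetime(2016, 8, 30, 12, 40, tzinfo=timezone.utc).timestamp()
print(window_start(ts))  # 2016-08-30 12:15:00+00:00
```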

Given a {{windowDuration}} of {{1 day}} and a {{startTime}} of {{0h}}, I'd 
expect events from each day to fall into that day's bucket. This doesn't 
happen in every case, however, due to the way this feature interacts with 
timestamp / timezone support.

Because windows are aligned to a fixed UTC reference, the implementation 
assumes that all days are the same length ({{1 day === 24 h}}). This is not 
the case for most timezones, where the offset from UTC changes by 1h for 6 
months of the year: on the days that clocks go forward/back, one day is 23h 
long and one day is 25h long.

The result is that, around daylight saving time transitions, some rows within 
1h of midnight are aggregated to the wrong day.

Either:

* This is the expected behavior, in which case the {{window}} function should 
not be used for this type of aggregation with a long window length, and its 
documentation should be updated to say so, or
* The window function should respect timezones and work on the assumption that 
{{1 day !== 24 h}}.  The {{startTime}} should be updated to snap to the local 
timezone, rather than UTC.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
