[jira] [Updated] (SPARK-17297) window function generates unexpected results due to startTime being relative to UTC

Pete Baker (JIRA) Mon, 29 Aug 2016 09:37:45 -0700

     [ 
https://issues.apache.org/jira/browse/SPARK-17297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Pete Baker updated SPARK-17297:
-------------------------------
    Description: 
In Spark 2.0.0, the {{window(Column timeColumn, String windowDuration, String 
slideDuration, String startTime)}} function {{startTime}} parameter behaves as 
follows:

{quote}
startTime - The offset with respect to 1970-01-01 00:00:00 UTC with which to 
start window intervals. For example, in order to have hourly tumbling windows 
that start 15 minutes past the hour, e.g. 12:15-13:15, 13:15-14:15... provide 
startTime as 15 minutes.
{quote}

Given a {{windowDuration}} of {{1 day}} and a {{startTime}} of {{0h}}, I'd 
expect to see events from each day fall into the correct day's bucket.   This 
doesn't happen as expected in every case, however, due to the way that this 
feature and timestamp / timezone support interact. 

Using a fixed UTC reference, there is an assumption that all days are the same 
length ({{1 day === 24 h}}}).  This is not the case for most timezones where 
the offset from UTC changes by 1h for 6 months out of the year.  In this case, 
on the days that clocks go forward/back, one day is 23h long, 1 day is 25h long.

The result of this is that, for daylight savings time, some rows within 1h of 
midnight are aggregated to the wrong day.

Options for a fix are:

* This is the expected behavior, and the window() function should not be used 
for this type of aggregation with a long window length and the {{window}} 
function documentation should be updated as such, or
* The window function should respect timezones and work on the assumption that 
{{1 day !== 24 h}}.  The {{startTime}} should be updated to snap to the local 
timezone, rather than UTC.
* Support for both absolute and relative window lengths may be added

  was:
In Spark 2.0.0, the {{window(Column timeColumn, String windowDuration, String 
slideDuration, String startTime)}} function {{startTime}} parameter behaves as 
follows:

{quote}
startTime - The offset with respect to 1970-01-01 00:00:00 UTC with which to 
start window intervals. For example, in order to have hourly tumbling windows 
that start 15 minutes past the hour, e.g. 12:15-13:15, 13:15-14:15... provide 
startTime as 15 minutes.
{quote}

Given a {{windowDuration}} of {{1 day}} and a {{startTime}} of {{0h}}, I'd 
expect to see events from each day fall into the correct day's bucket.   This 
doesn't happen as expected in every case, however, due to the way that this 
feature and timestamp / timezone support interact. 

Using a fixed UTC reference, there is an assumption that all days are the same 
length ({{1 day === 24 h}}}).  This is not the case for most timezones where 
the offset from UTC changes by 1h for 6 months out of the year.  In this case, 
on the days that clocks go forward/back, one day is 23h long, 1 day is 25h long.

The result of this is that, for daylight savings time, some rows within 1h of 
midnight are aggregated to the wrong day.

Either:

* This is the expected behavior, and the window() function should not be used 
for this type of aggregation with a long window length and the {{window}} 
function documentation should be updated as such, or
* The window function should respect timezones and work on the assumption that 
{{1 day !== 24 h}}.  The {{startTime}} should be updated to snap to the local 
timezone, rather than UTC.


> window function generates unexpected results due to startTime being relative 
> to UTC
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-17297
>                 URL: https://issues.apache.org/jira/browse/SPARK-17297
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Pete Baker
>
> In Spark 2.0.0, the {{window(Column timeColumn, String windowDuration, String 
> slideDuration, String startTime)}} function {{startTime}} parameter behaves 
> as follows:
> {quote}
> startTime - The offset with respect to 1970-01-01 00:00:00 UTC with which to 
> start window intervals. For example, in order to have hourly tumbling windows 
> that start 15 minutes past the hour, e.g. 12:15-13:15, 13:15-14:15... provide 
> startTime as 15 minutes.
> {quote}
> Given a {{windowDuration}} of {{1 day}} and a {{startTime}} of {{0h}}, I'd 
> expect to see events from each day fall into the correct day's bucket.   This 
> doesn't happen as expected in every case, however, due to the way that this 
> feature and timestamp / timezone support interact. 
> Using a fixed UTC reference, there is an assumption that all days are the 
> same length ({{1 day === 24 h}}}).  This is not the case for most timezones 
> where the offset from UTC changes by 1h for 6 months out of the year.  In 
> this case, on the days that clocks go forward/back, one day is 23h long, 1 
> day is 25h long.
> The result of this is that, for daylight savings time, some rows within 1h of 
> midnight are aggregated to the wrong day.
> Options for a fix are:
> * This is the expected behavior, and the window() function should not be used 
> for this type of aggregation with a long window length and the {{window}} 
> function documentation should be updated as such, or
> * The window function should respect timezones and work on the assumption 
> that {{1 day !== 24 h}}.  The {{startTime}} should be updated to snap to the 
> local timezone, rather than UTC.
> * Support for both absolute and relative window lengths may be added



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-17297) window function generates unexpected results due to startTime being relative to UTC

Reply via email to