[ https://issues.apache.org/jira/browse/SPARK-17297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15446391#comment-15446391 ]

Sean Owen commented on SPARK-17297:
-----------------------------------

I don't think there's an assumption about what a day is here, because 
underneath this is all done in terms of absolute time in microseconds since 
the epoch. With a straight window() call you can't do aggregations whose 
windows stretch or shrink to start and end on day boundaries of some calendar. 
I think you could propose a documentation update noting that, for example, 
"1 day" just means 24*60*60*1000*1000 microseconds, not a day according to a 
calendar.
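
As a concrete illustration, here is a minimal sketch (Spark 2.x Scala API, with 
made-up timestamps) of that fixed-duration behaviour: every "1 day" bucket is a 
fixed 86,400,000,000-microsecond interval aligned to the Unix epoch, regardless 
of timezone or DST.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, window}

val spark = SparkSession.builder().master("local[*]").appName("window-demo").getOrCreate()
import spark.implicits._

// Two events either side of local midnight on a DST-transition day (illustrative values)
val events = Seq("2016-03-26 23:30:00", "2016-03-27 00:30:00")
  .toDF("ts_string")
  .select(col("ts_string").cast("timestamp").as("ts"))

// Each row lands in the fixed 24h, epoch-aligned bucket containing its instant,
// not in a calendar day of any particular timezone.
events
  .groupBy(window(col("ts"), "1 day"))
  .agg(count("*").as("n"))
  .show(truncate = false)
{code}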

There are other ways to get the effect you need, but they entail watching a 
somewhat larger window of time, so that you will always eventually find one 
batch interval in which you see the whole day of data.
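
One way to read that (my own sketch, not a confirmed recipe, reusing the 
{{events}} frame from the sketch above): slide a window that is longer than any 
local day, so that one of the generated windows always starts at local midnight 
and covers the whole local day, even a 25h one; downstream you keep only the 
window whose start corresponds to local midnight.

{code:scala}
import org.apache.spark.sql.functions.{col, count, window}

// 25h windows sliding every hour: since local midnight falls on an hour
// boundary in most timezones, some window starts exactly at local midnight
// and contains the entire local day, whether it is 23h, 24h or 25h long.
val candidates = events
  .groupBy(window(col("ts"), "25 hours", "1 hour"))
  .agg(count("*").as("n"))
// then filter on window.start to keep only the bucket anchored at local midnight
{code}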

> window function generates unexpected results due to startTime being relative 
> to UTC
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-17297
>                 URL: https://issues.apache.org/jira/browse/SPARK-17297
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Pete Baker
>
> In Spark 2.0.0, the {{window(Column timeColumn, String windowDuration, String 
> slideDuration, String startTime)}} function's {{startTime}} parameter behaves 
> as follows:
> {quote}
> startTime - The offset with respect to 1970-01-01 00:00:00 UTC with which to 
> start window intervals. For example, in order to have hourly tumbling windows 
> that start 15 minutes past the hour, e.g. 12:15-13:15, 13:15-14:15... provide 
> startTime as 15 minutes.
> {quote}
> Given a {{windowDuration}} of {{1 day}} and a {{startTime}} of {{0h}}, I'd 
> expect events from each day to fall into the correct day's bucket. This 
> doesn't happen in every case, however, because of the way this feature 
> interacts with timestamp / timezone support.
> Using a fixed UTC reference assumes that all days are the same length 
> ({{1 day === 24 h}}). This is not the case for most timezones, where the 
> offset from UTC changes by 1h for 6 months of the year: on the days that 
> clocks go forward/back, one day is 23h long and one day is 25h long.
> The result is that, in timezones with daylight saving time, some rows within 
> 1h of midnight are aggregated to the wrong day.
> Options for a fix are:
> * Treat this as the expected behavior: the {{window}} function should not be 
> used for this type of aggregation with a long window length, and its 
> documentation should be updated to say so, or
> * Make the {{window}} function respect timezones, on the assumption that 
> {{1 day !== 24 h}}; {{startTime}} would then snap to the local timezone 
> rather than UTC, or
> * Add support for both absolute and relative window lengths.
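
To make the quoted report concrete, here is a sketch of the kind of daily 
aggregation it describes (my own illustration, assuming a hypothetical UTC-8 
zone and the {{events}} DataFrame from the sketches above):

{code:scala}
import org.apache.spark.sql.functions.{col, count, window}

// startTime "8 hours" aligns the 24h buckets with local midnight while the
// zone is at UTC-8 (00:00 local == 08:00 UTC). Once DST moves the zone to
// UTC-7, local midnight is 07:00 UTC but the buckets still start at 08:00 UTC,
// so rows in the first local hour of the day land in the previous day's bucket.
val perDay = events
  .groupBy(window(col("ts"), "1 day", "1 day", "8 hours"))
  .agg(count("*").as("n"))
{code}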


