+1, we should just fix the error to explain why months aren't allowed and
suggest that you manually specify some number of days.

On Wed, Jan 18, 2017 at 9:52 AM, Maciej Szymkiewicz <mszymkiew...@gmail.com>
wrote:

> Thanks for the response Burak,
>
> As any sane person I try to steer away from the objects which have both
> calendar and unsafe in their fully qualified names but if there is no
> bigger picture I missed here I would go with 1 as well. And of course fix
> the error message. I understand this has been introduced with structured
> streaming in mind, but it is an useful feature in general, not only in high
> precision scale. To be honest I would love to see some generalized version
> which could be used (I mean without hacking) with arbitrary numeric
> sequence. It could address at least some scenarios in which people try to
> use window functions without PARTITION BY clause and fail miserably.
>
> Regarding ambiguity... Sticking with days doesn't really resolve the
> problem, does it? If one were to nitpick it doesn't look like this
> implementation even touches all the subtleties of DST or leap second.
>
>
>
>
> On 01/18/2017 05:52 PM, Burak Yavuz wrote:
>
> Hi Maciej,
>
> I believe it would be useful to either fix the documentation or fix the
> implementation. I'll leave it to the community to comment on. The code
> right now disallows intervals provided in months and years, because they
> are not a "consistently" fixed amount of time. A month can be 28, 29, 30,
> or 31 days. A year is 12 months for sure, but is it 360 days (sometimes
> used in finance), 365 days or 366 days?
>
> Therefore we could either:
>   1) Allow windowing when intervals are given in days and less, even
> though it could be 365 days, and fix the documentation.
>   2) Explicitly disallow it as there may be a lot of data for a given
> window, but partial aggregations should help with that.
>
> My thoughts are to go with 1. What do you think?
>
> Best,
> Burak
>
> On Wed, Jan 18, 2017 at 10:18 AM, Maciej Szymkiewicz <
> mszymkiew...@gmail.com> wrote:
>
>> Hi,
>>
>> Can I ask for some clarifications regarding intended behavior of window /
>> TimeWindow?
>>
>> PySpark documentation states that "Windows in the order of months are not
>> supported". This is further confirmed by the checks in
>> TimeWindow.getIntervalInMicroseconds (https://git.io/vMP5l).
>>
>> Surprisingly enough we can pass interval much larger than a month by
>> expressing interval in days or another unit of a higher precision. So this
>> fails:
>>
>> Seq("2017-01-01").toDF("date").groupBy(window($"date", "1 month"))
>>
>> while following is accepted:
>>
>> Seq("2017-01-01").toDF("date").groupBy(window($"date", "999 days"))
>>
>> with results which look sensible at first glance.
>>
>> Is it a matter of a faulty validation logic (months will be assigned only
>> if there is a match against years or months https://git.io/vMPdi) or
>> expected behavior and I simply misunderstood the intentions?
>>
>> --
>> Best,
>> Maciej
>>
>>
>
>

Reply via email to