it seems pretty fast, but if i have 2 partitions and 10mm records i still
have to dedupe (distinct) all 10mm records

a direct way to just find out what the 2 partitions are would be much
faster. spark already knows them, but that info is not exposed.
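
one possible workaround, for now, is to list the partition directories
directly with the hadoop FileSystem api instead of asking spark to distinct
the column. just an untested sketch, assuming the data sits under a single
root path like data/ (as in the layout below) and a SparkContext sc:

import org.apache.hadoop.fs.{FileSystem, Path}

// list the date=... subdirectories under the data root ourselves,
// so we never have to touch the 10mm records
val fs = FileSystem.get(sc.hadoopConfiguration)
val dates = fs.listStatus(new Path("data"))
  .map(_.getPath.getName)          // e.g. "date=2015-01-03"
  .filter(_.startsWith("date="))
  .map(_.stripPrefix("date="))
val latestDate = dates.max         // string max works for yyyy-MM-dd

df.filter(df("date") === latestDate) should then only have to read that one
directory (assuming spark's partition pruning kicks in).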

On Sun, Nov 1, 2015 at 4:08 PM, Koert Kuipers <ko...@tresata.com> wrote:

> it seems to work, but i am not sure whether it is scanning the whole
> dataset or not. let me dig into the tasks a bit
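>
> a quick way to check (just a sketch, assuming the data is registered as a
> temp table named "table" like in your query) is to print the plan and
> watch the task count in the web ui:
>
> sqlContext
>   .sql("select distinct(date) as dt from table order by dt desc limit 1")
>   .explain(true)  // prints the extended logical and physical plans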
>
> On Sun, Nov 1, 2015 at 3:18 PM, Jerry Lam <chiling...@gmail.com> wrote:
>
>> Hi Koert,
>>
>> If the partitioned table is implemented properly, I would think "select
>> distinct(date) as dt from table order by dt DESC limit 1" would return the
>> latest date without scanning the whole dataset. I haven't tried it myself.
>> It would be great if you could report back whether this actually works or
>> not. :)
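>>
>> Untested sketch of how I would run it, assuming the partitioned data is
>> already registered as a temp table named "table":
>>
>> val latest = sqlContext
>>   .sql("select distinct(date) as dt from table order by dt DESC limit 1")
>>   .first()           // a single Row holding the latest date string
>>   .getString(0)
>>
>> // then filter on it, e.g. df.filter(df("date") === latest)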
>>
>> Best Regards,
>>
>> Jerry
>>
>>
>> On Sun, Nov 1, 2015 at 3:03 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>
>>> hello all,
>>> i am trying to get familiar with spark sql partitioning support.
>>>
>>> my data is partitioned by date, so like this:
>>> data/date=2015-01-01
>>> data/date=2015-01-02
>>> data/date=2015-01-03
>>> ...
>>>
>>> let's say i would like a batch process to read data for the latest date
>>> only. how do i proceed?
>>> generally the latest date will be yesterday, but it could be a day or two
>>> older.
>>>
>>> i understand that i will have to do something like:
>>> df.filter(df("date") === some_date_string_here)
>>>
>>> however i do not know what some_date_string_here should be. i would like
>>> to inspect the available dates and pick the latest. is there an efficient
>>> way to find out what the available partitions are?
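>>>
>>> the brute-force approach (just a sketch) would be to distinct the dates
>>> and take the max:
>>>
>>> val latestDate =
>>>   df.select("date").distinct().collect().map(_.getString(0)).max
>>> val latest = df.filter(df("date") === latestDate)
>>>
>>> but i am hoping there is something cheaper that uses the partition info
>>> directly.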
>>>
>>> thanks! koert
>>>
>>>
>>>
>>
>
