I agree the max date will satisfy the latest-date requirement, but it does
not satisfy the second-latest-date requirement you mentioned.

Just for your information, before you invest too much in partitioned
tables, I want to warn you that they have memory issues (on both the
executor and driver side). A simple experiment shows that with over 10
years of dates (3650+ directories), the table takes a long time to
initialize. I learned about this limitation after I tried to partition
user events by user_id. It was a disaster (>10000 distinct user_ids).
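
For reference, the write that caused the disaster looked roughly like this
(a simplified sketch; the DataFrame name and output path are made up):

  // each distinct user_id becomes its own directory under the output path,
  // so >10000 user_ids means >10000 directories for partition discovery
  events.write.partitionBy("user_id").parquet("data/events")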

I hope the Spark developers can address the memory limitations, because
partitioned tables are very useful in many cases.

Cheers ~



On Sun, Nov 1, 2015 at 4:39 PM, Koert Kuipers <ko...@tresata.com> wrote:

> i was going for the distinct approach, since i want it to be general
> enough to also solve other related problems later. the max-date approach
> is likely to be faster though.
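>
> something like this (an untested sketch), which also gives me the second
> last date:
>
>   import org.apache.spark.sql.functions.desc
>   val dates = df.select("date").distinct
>     .orderBy(desc("date")).limit(2)
>     .collect.map(_.getString(0))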
>
> On Sun, Nov 1, 2015 at 4:36 PM, Jerry Lam <chiling...@gmail.com> wrote:
>
>> Hi Koert,
>>
>> You should be able to see whether it requires scanning the whole dataset
>> by running "explain" on the query. The physical plan should say something
>> about it. I wonder if you are trying the distinct-sort-by-limit approach
>> or the max-date approach?
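>>
>> For example, calling explain prints the physical plan (a sketch, assuming
>> your DataFrame is called df):
>>
>>   import org.apache.spark.sql.functions.desc
>>   df.select("date").distinct
>>     .orderBy(desc("date")).limit(1).explain()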
>>
>> Best Regards,
>>
>> Jerry
>>
>>
>> On Sun, Nov 1, 2015 at 4:25 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>
>>> it seems pretty fast, but if i have 2 partitions and 10mm records i do
>>> have to dedupe (distinct) 10mm records
>>>
>>> a direct way to just find out what the 2 partitions are would be much
>>> faster. spark knows it, but it's not exposed.
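>>>
>>> as a workaround i could list the partition directories myself with the
>>> hadoop filesystem api (a rough sketch, the path is made up):
>>>
>>>   import org.apache.hadoop.fs.{FileSystem, Path}
>>>   val fs = FileSystem.get(sc.hadoopConfiguration)
>>>   // partition directories are named date=<value>
>>>   val dates = fs.listStatus(new Path("data"))
>>>     .map(_.getPath.getName)
>>>     .filter(_.startsWith("date="))
>>>     .map(_.stripPrefix("date="))
>>>     .sorted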
>>>
>>> On Sun, Nov 1, 2015 at 4:08 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>>
>>>> it seems to work, but i am not sure it's not scanning the whole
>>>> dataset. let me dig into the tasks a bit
>>>>
>>>> On Sun, Nov 1, 2015 at 3:18 PM, Jerry Lam <chiling...@gmail.com> wrote:
>>>>
>>>>> Hi Koert,
>>>>>
>>>>> If the partitioned table is implemented properly, I would think
>>>>> "select distinct(date) as dt from table order by dt DESC limit 1" would
>>>>> return the latest date without scanning the whole dataset. I haven't
>>>>> tried that myself. It would be great if you could report back whether
>>>>> this actually works. :)
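>>>>>
>>>>> In code it would be something like this (an untested sketch, assuming
>>>>> the table is registered as "events"):
>>>>>
>>>>>   val latest = sqlContext
>>>>>     .sql("select distinct(date) as dt from events order by dt DESC limit 1")
>>>>>     .head.getString(0)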
>>>>>
>>>>> Best Regards,
>>>>>
>>>>> Jerry
>>>>>
>>>>>
>>>>> On Sun, Nov 1, 2015 at 3:03 PM, Koert Kuipers <ko...@tresata.com>
>>>>> wrote:
>>>>>
>>>>>> hello all,
>>>>>> i am trying to get familiar with spark sql partitioning support.
>>>>>>
>>>>>> my data is partitioned by date, so like this:
>>>>>> data/date=2015-01-01
>>>>>> data/date=2015-01-02
>>>>>> data/date=2015-01-03
>>>>>> ...
>>>>>>
>>>>>> let's say i would like a batch process to read data for the latest
>>>>>> date only. how do i proceed?
>>>>>> generally the latest date will be yesterday, but it could be a day
>>>>>> older or maybe 2.
>>>>>>
>>>>>> i understand that i will have to do something like:
>>>>>> df.filter(df("date") === some_date_string_here)
>>>>>>
>>>>>> however i do not know what some_date_string_here should be. i would
>>>>>> like to inspect the available dates and pick the latest. is there an
>>>>>> efficient way to find out what the available partitions are?
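>>>>>>
>>>>>> the best i have come up with so far is something like this, but i do
>>>>>> not know if it scans everything (a sketch):
>>>>>>
>>>>>>   import org.apache.spark.sql.functions.max
>>>>>>   val latest = df.agg(max(df("date"))).head.getString(0)
>>>>>>   val batch = df.filter(df("date") === latest)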
>>>>>>
>>>>>> thanks! koert
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
