Hi Koert,

the physical plan looks like it is doing the right thing:

it reads the partitioned table hdfs://user/koert/test, takes the date from
the directory names, hash partitions and aggregates the dates to find the
distinct dates, and finally shuffles the dates for the sort and limit 1
operations.

This is my understanding of the physical plan; you can navigate the actual
execution in the web UI to see how much data is actually read to satisfy
this request. I hope it only requires a few bytes for the few dates.

Best Regards,

Jerry


On Sun, Nov 1, 2015 at 5:56 PM, Jerry Lam <chiling...@gmail.com> wrote:

> I agree the max date will satisfy the latest-date requirement, but it does
> not satisfy the second-last-date requirement you mentioned.
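>
> If you need the second last date too, a limit 2 variant of the same query
> should work. A minimal untested sketch, assuming the data is registered as
> a temp table I am calling "events" here:
>
>   // the two most recent dates; row 0 is the latest, row 1 the second last
>   // (assumes at least two date partitions exist)
>   val dates = sqlContext.sql(
>     "select distinct(date) as dt from events order by dt desc limit 2"
>   ).collect()
>   val secondLast = dates(1).getString(0)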
>
> Just for your information, before you invest too much in the partitioned
> table approach, I want to warn you that it has memory issues (on both the
> executor and driver side). A simple experiment shows that if you have over
> 10 years of dates (3650 directories), it takes a long time to initialize.
> I learned of this limitation after I tried to partition user events by
> user_id. It was a disaster (>10000 user_ids).
>
> I hope the Spark developers can address the memory limitations, because
> partitioned tables are very useful in many cases.
>
> Cheers ~
>
>
>
> On Sun, Nov 1, 2015 at 4:39 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> i was going for the distinct approach, since i want it to be general
>> enough to also solve other related problems later. the max-date approach
>> is likely to be faster though.
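>>
>> for reference, the max-date version i am comparing against would look
>> something like this (untested sketch):
>>
>>   // single aggregation over the partition column, no distinct needed;
>>   // max on a yyyy-MM-dd string is lexicographic, which matches
>>   // chronological order
>>   import org.apache.spark.sql.functions.max
>>   val latest = df.agg(max("date")).collect()(0).getString(0)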
>>
>> On Sun, Nov 1, 2015 at 4:36 PM, Jerry Lam <chiling...@gmail.com> wrote:
>>
>>> Hi Koert,
>>>
>>> You should be able to see if it requires scanning the whole dataset by
>>> "explain"-ing the query. The physical plan should say something about it.
>>> Are you trying the distinct-sort-by-limit approach or the max-date
>>> approach?
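>>>
>>> For example, something like this should print the plan (assuming your
>>> DataFrame is called df):
>>>
>>>   // extended explain shows both the logical and physical plans
>>>   import org.apache.spark.sql.functions.col
>>>   df.select("date").distinct
>>>     .orderBy(col("date").desc).limit(1).explain(true)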
>>>
>>> Best Regards,
>>>
>>> Jerry
>>>
>>>
>>> On Sun, Nov 1, 2015 at 4:25 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>>
>>>> it seems pretty fast, but if i have 2 partitions and 10mm records i do
>>>> have to dedupe (distinct) 10mm records
>>>>
>>>> a direct way to just find out what the 2 partitions are would be much
>>>> faster. spark knows it, but it's not exposed.
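>>>>
>>>> as a workaround i could probably list the partition directories myself
>>>> with the hadoop filesystem api. an untested sketch, assuming the table
>>>> lives under "data" and dates are yyyy-MM-dd:
>>>>
>>>>   import org.apache.hadoop.fs.{FileSystem, Path}
>>>>   // directory names look like date=2015-01-01, so the lexicographic
>>>>   // max of the suffixes is also the latest date
>>>>   val fs = FileSystem.get(sc.hadoopConfiguration)
>>>>   val latest = fs.listStatus(new Path("data"))
>>>>     .map(_.getPath.getName)
>>>>     .filter(_.startsWith("date="))
>>>>     .map(_.stripPrefix("date="))
>>>>     .sorted.last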
>>>>
>>>> On Sun, Nov 1, 2015 at 4:08 PM, Koert Kuipers <ko...@tresata.com>
>>>> wrote:
>>>>
>>>>> it seems to work but i am not sure if it's not scanning the whole
>>>>> dataset. let me dig into the tasks a bit
>>>>>
>>>>> On Sun, Nov 1, 2015 at 3:18 PM, Jerry Lam <chiling...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Koert,
>>>>>>
>>>>>> If the partitioned table is implemented properly, I would think
>>>>>> "select distinct(date) as dt from table order by dt DESC limit 1" would
>>>>>> return the latest date without scanning the whole dataset. I haven't
>>>>>> tried it myself. It would be great if you could report back whether
>>>>>> this actually works or not. :)
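>>>>>>
>>>>>> In code that would be something like this (untested, assuming the
>>>>>> DataFrame is registered as a temp table I am calling "events"):
>>>>>>
>>>>>>   // register the partitioned data and query for the latest date
>>>>>>   df.registerTempTable("events")
>>>>>>   val latest = sqlContext.sql(
>>>>>>     "select distinct(date) as dt from events order by dt DESC limit 1"
>>>>>>   ).collect()(0).getString(0)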
>>>>>>
>>>>>> Best Regards,
>>>>>>
>>>>>> Jerry
>>>>>>
>>>>>>
>>>>>> On Sun, Nov 1, 2015 at 3:03 PM, Koert Kuipers <ko...@tresata.com>
>>>>>> wrote:
>>>>>>
>>>>>>> hello all,
>>>>>>> i am trying to get familiar with spark sql partitioning support.
>>>>>>>
>>>>>>> my data is partitioned by date, so like this:
>>>>>>> data/date=2015-01-01
>>>>>>> data/date=2015-01-02
>>>>>>> data/date=2015-01-03
>>>>>>> ...
>>>>>>>
>>>>>>> let's say i would like a batch process to read data for the latest
>>>>>>> date only. how do i proceed?
>>>>>>> generally the latest date will be yesterday, but it could be a day
>>>>>>> older or maybe 2.
>>>>>>>
>>>>>>> i understand that i will have to do something like:
>>>>>>> df.filter(df("date") === some_date_string_here)
>>>>>>>
>>>>>>> however i do not know what some_date_string_here should be. i would
>>>>>>> like to inspect the available dates and pick the latest. is there an
>>>>>>> efficient way to find out what the available partitions are?
>>>>>>>
>>>>>>> thanks! koert
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
