Hi Koert,

the physical plan looks like it is doing the right thing: it reads the
partitioned table hdfs://user/koert/test, gets the dates from the
directory names, hash-partitions and aggregates them to find the distinct
dates, and finally shuffles the dates for the sort and limit 1 operations.
That is my understanding of the physical plan; you can inspect the actual
execution in the web UI to see how much data is read to satisfy the
request. I would hope it only requires a few bytes for a few dates.

Best Regards,

Jerry
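For reference, a minimal sketch of the distinct-sort-by-limit query in the
DataFrame API, with explain() to print the physical plan discussed above.
The Parquet format and the sqlContext name are assumptions, not from the
thread:

import org.apache.spark.sql.functions.desc

// read the partitioned table (format assumed; substitute the actual one)
val df = sqlContext.read.parquet("hdfs://user/koert/test")

// distinct the partition column, sort descending, keep the first row
val latest = df.select("date").distinct()
  .orderBy(desc("date"))
  .limit(1)

latest.explain() // prints the physical plan described above
latest.show()    // should return just the latest date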
On Sun, Nov 1, 2015 at 5:56 PM, Jerry Lam <chiling...@gmail.com> wrote:

> I agree the max date will satisfy the latest-date requirement, but it
> does not satisfy the second-to-last-date requirement you mentioned.
>
> Just for your information, before you invest too much in partitioned
> tables, I want to warn you that they have memory issues (on both the
> executor and driver side). A simple experiment shows that with over 10
> years of dates (3650 directories), the table takes a long time to
> initialize. I learned about this limitation after I tried to partition
> user events by user_id. It was a disaster (>10000 user_ids).
>
> I hope the Spark developers can address the memory limitations, because
> partitioned tables are very useful in many cases.
>
> Cheers ~
>
> On Sun, Nov 1, 2015 at 4:39 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> i was going for the distinct approach, since i want it to be general
>> enough to also solve other related problems later. the max-date
>> approach is likely to be faster though.
>>
>> On Sun, Nov 1, 2015 at 4:36 PM, Jerry Lam <chiling...@gmail.com> wrote:
>>
>>> Hi Koert,
>>>
>>> You should be able to see whether it requires scanning the whole
>>> dataset by running "explain" on the query. The physical plan should
>>> say something about it. I wonder whether you are trying the
>>> distinct-sort-by-limit approach or the max-date approach?
>>>
>>> Best Regards,
>>>
>>> Jerry
>>>
>>> On Sun, Nov 1, 2015 at 4:25 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>>
>>>> it seems pretty fast, but if i have 2 partitions and 10mm records i
>>>> do have to dedupe (distinct) 10mm records.
>>>>
>>>> a direct way to just find out what the 2 partitions are would be
>>>> much faster. spark knows it, but it's not exposed.
>>>>
>>>> On Sun, Nov 1, 2015 at 4:08 PM, Koert Kuipers <ko...@tresata.com>
>>>> wrote:
>>>>
>>>>> it seems to work, but i am not sure whether it's scanning the whole
>>>>> dataset. let me dig into the tasks a bit.
>>>>>
>>>>> On Sun, Nov 1, 2015 at 3:18 PM, Jerry Lam <chiling...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Koert,
>>>>>>
>>>>>> If the partitioned table is implemented properly, I would think
>>>>>> "select distinct(date) as dt from table order by dt DESC limit 1"
>>>>>> would return the latest date without scanning the whole dataset.
>>>>>> I haven't tried it myself. It would be great if you could report
>>>>>> back whether this actually works. :)
>>>>>>
>>>>>> Best Regards,
>>>>>>
>>>>>> Jerry
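Koert's remark above that "spark knows it, but it's not exposed" can be
worked around by listing the partition directories directly with the
Hadoop FileSystem API, avoiding a Spark job over the data entirely. A
minimal sketch, assuming the date=YYYY-MM-DD directory layout from the
original question below and a SparkContext named sc:

import org.apache.hadoop.fs.Path

val dir = new Path("hdfs://user/koert/test")
val fs = dir.getFileSystem(sc.hadoopConfiguration)

// each partition is a subdirectory named date=YYYY-MM-DD
val latestDate = fs.listStatus(dir)
  .filter(_.isDirectory)
  .map(_.getPath.getName)
  .filter(_.startsWith("date="))
  .map(_.stripPrefix("date="))
  .max // ISO dates sort lexicographically, so max is the latest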
>>>>>> On Sun, Nov 1, 2015 at 3:03 PM, Koert Kuipers <ko...@tresata.com>
>>>>>> wrote:
>>>>>>
>>>>>>> hello all,
>>>>>>> i am trying to get familiar with spark sql partitioning support.
>>>>>>>
>>>>>>> my data is partitioned by date, like this:
>>>>>>> data/date=2015-01-01
>>>>>>> data/date=2015-01-02
>>>>>>> data/date=2015-01-03
>>>>>>> ...
>>>>>>>
>>>>>>> let's say i would like a batch process to read data for the
>>>>>>> latest date only. how do i proceed?
>>>>>>> generally the latest date will be yesterday, but it could be a
>>>>>>> day or maybe two older.
>>>>>>>
>>>>>>> i understand that i will have to do something like:
>>>>>>> df.filter(df("date") === some_date_string_here)
>>>>>>>
>>>>>>> however i do not know what some_date_string_here should be. i
>>>>>>> would like to inspect the available dates and pick the latest.
>>>>>>> is there an efficient way to find out what the available
>>>>>>> partitions are?
>>>>>>>
>>>>>>> thanks! koert
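Putting the pieces together, a sketch of what the batch job could look
like end to end: find the latest date with the distinct-sort-by-limit
query, then filter down to that single partition. The format and names are
again assumptions, and the partition column may be inferred as a string or
a date by partition discovery:

val df = sqlContext.read.parquet("hdfs://user/koert/test")

// one small job over the partition column to find the latest date
val latestDate = df.select("date").distinct()
  .orderBy(df("date").desc)
  .limit(1)
  .first()
  .get(0) // type depends on how partition discovery inferred the column

// filtering on the partition column should prune the other date=... dirs
val latest = df.filter(df("date") === latestDate)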