i was going for the distinct approach, since i want it to be general enough to also solve other related problems later. the max-date approach is likely to be faster though.
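for reference, a minimal sketch of the two approaches being compared here, assuming a DataFrame df loaded from the partitioned path (the val names are just illustrative, not from this thread):

    import org.apache.spark.sql.functions.max

    // distinct approach: generalizes to related problems (e.g. latest n dates)
    val latestViaDistinct = df.select("date").distinct()
      .orderBy(df("date").desc)
      .limit(1)

    // max-date approach: a single aggregate, likely cheaper than a full distinct
    val latestViaMax = df.agg(max("date"))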
On Sun, Nov 1, 2015 at 4:36 PM, Jerry Lam <chiling...@gmail.com> wrote:

> Hi Koert,
>
> You should be able to see whether it requires scanning the whole dataset
> by running "explain" on the query. The physical plan should say something
> about it. I wonder if you are trying the distinct-sort-by-limit approach
> or the max-date approach?
>
> Best Regards,
>
> Jerry
>
>
> On Sun, Nov 1, 2015 at 4:25 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> it seems pretty fast, but if i have 2 partitions and 10mm records i do
>> have to dedupe (distinct) 10mm records.
>>
>> a direct way to just find out what the 2 partitions are would be much
>> faster. spark knows it, but it's not exposed.
>>
>> On Sun, Nov 1, 2015 at 4:08 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>
>>> it seems to work, but i am not sure whether it is scanning the whole
>>> dataset. let me dig into the tasks a bit.
>>>
>>> On Sun, Nov 1, 2015 at 3:18 PM, Jerry Lam <chiling...@gmail.com> wrote:
>>>
>>>> Hi Koert,
>>>>
>>>> If the partitioned table is implemented properly, I would think "select
>>>> distinct(date) as dt from table order by dt DESC limit 1" would return
>>>> the latest date without scanning the whole dataset. I haven't tried it
>>>> myself. It would be great if you could report back whether this
>>>> actually works or not. :)
>>>>
>>>> Best Regards,
>>>>
>>>> Jerry
>>>>
>>>>
>>>> On Sun, Nov 1, 2015 at 3:03 PM, Koert Kuipers <ko...@tresata.com>
>>>> wrote:
>>>>
>>>>> hello all,
>>>>> i am trying to get familiar with spark sql partitioning support.
>>>>>
>>>>> my data is partitioned by date, like this:
>>>>> data/date=2015-01-01
>>>>> data/date=2015-01-02
>>>>> data/date=2015-01-03
>>>>> ...
>>>>>
>>>>> let's say i would like a batch process to read data for the latest
>>>>> date only. how do i proceed?
>>>>> generally the latest date will be yesterday, but it could be a day
>>>>> older, or maybe 2.
>>>>>
>>>>> i understand that i will have to do something like:
>>>>> df.filter(df("date") === some_date_string_here)
>>>>>
>>>>> however i do not know what some_date_string_here should be. i would
>>>>> like to inspect the available dates and pick the latest. is there an
>>>>> efficient way to find out what the available partitions are?
>>>>>
>>>>> thanks! koert
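and for the "direct way" mentioned above, a sketch that lists the partition directories with the hadoop FileSystem api instead of going through spark sql, assuming sc is the SparkContext and "data" is the table root (names are illustrative, untested):

    import org.apache.hadoop.fs.{FileSystem, Path}

    // list the date=... partition directories directly; no data is read
    val fs = FileSystem.get(sc.hadoopConfiguration)
    val latestDate = fs.listStatus(new Path("data"))
      .map(_.getPath.getName)            // e.g. "date=2015-01-01"
      .filter(_.startsWith("date="))
      .map(_.stripPrefix("date="))
      .max                               // lexicographic max == latest for ISO dates

    // filtering on the partition column then prunes the scan to that one directory
    val latest = df.filter(df("date") === latestDate)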