it seems pretty fast, but if i have 2 partitions and 10mm records i still have to dedupe (distinct) all 10mm records.
a direct way to just find out what the 2 partitions are would be much faster. spark knows it, but it's not exposed.
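
for the record, here is a sketch of that more direct route: go around spark and list the partition directories with the hadoop FileSystem api, so no row data is read at all. this assumes the table root is data/ as in the thread below and a SparkContext sc as in the shell; the val names are just for illustration.

    import org.apache.hadoop.fs.{FileSystem, Path}

    // list the date=... subdirectories of the table root directly,
    // so none of the row data is ever touched
    val fs = FileSystem.get(sc.hadoopConfiguration)
    val dates = fs.listStatus(new Path("data"))
      .map(_.getPath.getName)            // e.g. "date=2015-01-03"
      .filter(_.startsWith("date="))
      .map(_.stripPrefix("date="))
    val latest = dates.max               // iso dates sort lexicographically
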
On Sun, Nov 1, 2015 at 4:08 PM, Koert Kuipers <ko...@tresata.com> wrote:

> it seems to work but i am not sure it isn't scanning the whole dataset.
> let me dig into the tasks a bit
>
> On Sun, Nov 1, 2015 at 3:18 PM, Jerry Lam <chiling...@gmail.com> wrote:
>
>> Hi Koert,
>>
>> If the partitioned table is implemented properly, I would think "select
>> distinct(date) as dt from table order by dt DESC limit 1" would return the
>> latest date without scanning the whole dataset. I haven't tried it
>> myself. It would be great if you can report back if this actually works or
>> not. :)
>>
>> Best Regards,
>>
>> Jerry
>>
>> On Sun, Nov 1, 2015 at 3:03 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>
>>> hello all,
>>> i am trying to get familiar with spark sql partitioning support.
>>>
>>> my data is partitioned by date, so like this:
>>> data/date=2015-01-01
>>> data/date=2015-01-02
>>> data/date=2015-01-03
>>> ...
>>>
>>> let's say i would like a batch process to read data for the latest date
>>> only. how do i proceed?
>>> generally the latest date will be yesterday, but it could be a day older
>>> or maybe 2.
>>>
>>> i understand that i will have to do something like:
>>> df.filter(df("date") === some_date_string_here)
>>>
>>> however i do not know what some_date_string_here should be. i would like
>>> to inspect the available dates and pick the latest. is there an efficient
>>> way to find out what the available partitions are?
>>>
>>> thanks! koert
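
for completeness, here is jerry's suggestion spelled out against the DataFrame api, followed by the filter from the original question. this is a sketch only: it assumes the table is parquet under data/ and that the discovered partition column date comes back as a string; latestDate and latestDf are illustrative names.

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions.col

    val sqlContext = new SQLContext(sc)

    // spark discovers the "date" partition column from the directory layout
    val df = sqlContext.read.parquet("data")

    // jerry's query: distinct partition values, newest first, take one
    val latestDate = df.select("date").distinct()
      .orderBy(col("date").desc)
      .limit(1)
      .collect().head.getString(0)

    // then read only that partition
    val latestDf = df.filter(df("date") === latestDate)

whether the distinct actually avoids scanning the row data is exactly the open question in this thread; checking the job's input size in the spark UI is the quickest way to verify.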