Hi Koert,

If the partitioned table is implemented properly, I would think "select distinct(date) as dt from table order by dt DESC limit 1" would return the latest date without scanning the whole dataset. I haven't tried that myself. It would be great if you could report back whether this actually works. :)
Best Regards,
Jerry

On Sun, Nov 1, 2015 at 3:03 PM, Koert Kuipers <ko...@tresata.com> wrote:
> hello all,
> i am trying to get familiar with spark sql partitioning support.
>
> my data is partitioned by date, like this:
> data/date=2015-01-01
> data/date=2015-01-02
> data/date=2015-01-03
> ...
>
> lets say i would like a batch process to read data for the latest date
> only. how do i proceed?
> generally the latest date will be yesterday, but it could be a day older,
> or maybe two.
>
> i understand that i will have to do something like:
> df.filter(df("date") === some_date_string_here)
>
> however i do not know what some_date_string_here should be. i would like
> to inspect the available dates and pick the latest. is there an efficient
> way to find out what the available partitions are?
>
> thanks! koert
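One way to inspect the available partitions without touching Spark at all is to list the date=... directories directly, since the partition values are encoded in the directory names. Below is a minimal sketch in Python, assuming the data/date=YYYY-MM-DD layout from the question; the helper name latest_partition is made up for illustration. ISO dates sort correctly as plain strings, so max() picks the latest.

```python
import os
import tempfile

def latest_partition(base_dir):
    """Return the newest date among date=YYYY-MM-DD partition dirs, or None."""
    dates = [
        name.split("=", 1)[1]
        for name in os.listdir(base_dir)
        if name.startswith("date=")
    ]
    # ISO-formatted dates compare lexicographically in chronological order
    return max(dates) if dates else None

# demo with a throwaway directory mimicking the layout from the question
with tempfile.TemporaryDirectory() as base:
    for d in ("2015-01-01", "2015-01-02", "2015-01-03"):
        os.mkdir(os.path.join(base, "date=" + d))
    print(latest_partition(base))  # → 2015-01-03
```

The resulting string can then be plugged into df.filter(df("date") === latest). If the table is registered in a Hive metastore, the HiveQL command "show partitions tablename" should give the same information without a filesystem walk.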