Re: spark sql partitioned by date... read last date

2015-11-01 Thread Koert Kuipers
Good idea. With the dates sorting correctly alphabetically, I should be able to do something similar with strings. On Sun, Nov 1, 2015 at 4:06 PM, Jörn Franke wrote: …
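Koert's point holds because ISO-8601 date strings (zero-padded, year-month-day) sort lexicographically in the same order as they do chronologically. A minimal self-contained check, outside Spark:

```python
# ISO-8601 dates (YYYY-MM-DD, zero-padded) sort the same way as strings
# as they do chronologically, so string sort/max is safe for picking
# the latest partition value.
from datetime import date

dates = ["2015-01-02", "2014-12-31", "2015-01-10", "2015-01-03"]

# Sorting the raw strings...
by_string = sorted(dates)
# ...matches sorting the parsed dates.
by_date = sorted(dates, key=date.fromisoformat)

assert by_string == by_date
print(max(dates))  # latest date, found purely by string comparison
```

The caveat is the zero padding: a format like `2015-1-2` would break the equivalence.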

Re: spark sql partitioned by date... read last date

2015-11-01 Thread Koert Kuipers
It seems pretty fast, but if I have 2 partitions and 10mm records, I do have to dedupe (distinct) 10mm records. A direct way to just find out what the 2 partitions are would be much faster. Spark knows it, but it's not exposed. On Sun, Nov 1, 2015 at 4:08 PM, Koert Kuipers …
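The contrast Koert describes can be sketched outside Spark: a distinct over the partition column touches every record, while the partition directory names already encode exactly the distinct values. A toy illustration in plain Python (the `date=...` layout is from the thread; the record counts are stand-ins):

```python
import os
import tempfile

# Simulate the partitioned layout from the thread: <root>/date=YYYY-MM-DD
root = tempfile.mkdtemp()
for d in ("2015-01-01", "2015-01-02"):
    os.makedirs(os.path.join(root, "date=" + d))

# The "slow" route: distinct over every record's date column.
records = [{"date": "2015-01-01"}] * 5 + [{"date": "2015-01-02"}] * 5  # imagine 10mm of these
distinct_from_records = {r["date"] for r in records}

# The "direct" route: read the two directory names; no record is touched.
distinct_from_dirs = {name.split("=", 1)[1]
                      for name in os.listdir(root) if name.startswith("date=")}

assert distinct_from_records == distinct_from_dirs
```

The directory listing is O(number of partitions), the dedupe is O(number of records), which is the gap the thread is complaining about.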

Re: spark sql partitioned by date... read last date

2015-11-01 Thread Jerry Lam
I agree the max date will satisfy the latest-date requirement, but it does not satisfy the second-last-date requirement you mentioned. Just for your information, before you invest in the partitioned table too much, I want to warn you that it has memory issues (both on executors and driver) …

Re: spark sql partitioned by date... read last date

2015-11-01 Thread Jerry Lam
Hi Koert, You should be able to see whether it requires scanning the whole dataset by running "explain" on the query. The physical plan should say something about it. I wonder whether you are trying the distinct-sort-by-limit approach or the max-date approach? Best Regards, Jerry On Sun, Nov 1, 2015 at 4:25 PM, …

Re: spark sql partitioned by date... read last date

2015-11-01 Thread Koert Kuipers
I was going for the distinct approach, since I want it to be general enough to also solve other related problems later. The max-date approach is likely to be faster, though. On Sun, Nov 1, 2015 at 4:36 PM, Jerry Lam wrote: …

spark sql partitioned by date... read last date

2015-11-01 Thread Koert Kuipers
Hello all, I am trying to get familiar with Spark SQL partitioning support. My data is partitioned by date, like this:

data/date=2015-01-01
data/date=2015-01-02
data/date=2015-01-03
...

Let's say I would like a batch process to read data for the latest date only. How do I proceed? Generally …
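One answer implied later in the thread: since the partition values are visible in the directory names, the latest date's path can be chosen before any data is read. A hypothetical sketch in plain Python (the `data/date=...` layout is from the post; in practice the names would come from an HDFS listing):

```python
# Given the directory names from the post, pick the latest partition
# and build the single path a batch job would read.
dirs = ["date=2015-01-01", "date=2015-01-02", "date=2015-01-03"]

# ISO dates sort correctly as strings, so max() on the values is enough.
latest = max(name.split("=", 1)[1] for name in dirs)
path_to_read = "data/date=" + latest

assert latest == "2015-01-03"
print(path_to_read)  # data/date=2015-01-03
```

The batch job then loads only that path (or filters on the partition column with that value), instead of scanning every date.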

Re: spark sql partitioned by date... read last date

2015-11-01 Thread Jörn Franke
Try with max date; in your case it could make more sense to represent the date as an int. Sent from my iPhone > On 01 Nov 2015, at 21:03, Koert Kuipers wrote: …
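Jörn's int representation would encode a date like 2015-01-03 as the number 20150103, which preserves ordering and makes max a cheap numeric comparison. A minimal sketch of the round trip (the helper names are illustrative, not from the thread):

```python
def date_to_int(s: str) -> int:
    """'2015-01-03' -> 20150103; numeric order matches date order."""
    return int(s.replace("-", ""))

def int_to_date(n: int) -> str:
    """20150103 -> '2015-01-03' (inverse of date_to_int)."""
    s = str(n)
    return f"{s[0:4]}-{s[4:6]}-{s[6:8]}"

dates = ["2015-01-01", "2015-01-02", "2015-01-03"]
latest = max(date_to_int(d) for d in dates)

assert latest == 20150103
assert int_to_date(latest) == "2015-01-03"
```

The encoding works because the fields are fixed-width and ordered most-significant first, the same property that makes the string form sort correctly.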

Re: spark sql partitioned by date... read last date

2015-11-01 Thread Jerry Lam
Hi Koert, the physical plan looks like it is doing the right thing: partitioned table hdfs://user/koert/test, read the date from the directory names, hash partition and aggregate the date to find the distinct dates. Finally, shuffle the dates for the sort and limit-1 operations. This is my understanding of the …

Re: spark sql partitioned by date... read last date

2015-11-01 Thread Jerry Lam
Hi Koert, If the partitioned table is implemented properly, I would think "select distinct(date) as dt from table order by dt DESC limit 1" would return the latest date without scanning the whole dataset. I haven't tried it myself. It would be great if you could report back whether this actually …
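The shape of Jerry's query can be exercised outside Spark; below it runs against an in-memory SQLite table. SQLite is only a stand-in for the SQL semantics here (distinct, then descending sort, then limit 1); whether Spark answers it from partition metadata without scanning records is exactly what the thread is probing.

```python
import sqlite3

# A tiny table standing in for the partitioned dataset from the thread.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (date TEXT, payload TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("2015-01-01", "a"), ("2015-01-02", "b"), ("2015-01-02", "c")],
)

# Same distinct / order by / limit 1 shape as the query in the thread.
row = conn.execute(
    "SELECT DISTINCT date AS dt FROM t ORDER BY dt DESC LIMIT 1"
).fetchone()

assert row == ("2015-01-02",)  # the latest date, deduplicated
```

Dropping the `LIMIT 1` (or using `LIMIT 2`) gives the second-last date as well, which is the follow-up requirement mentioned earlier in the thread.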