good idea. with the dates sorting correctly alphabetically i should be able
to do something similar with strings
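The observation above can be sketched outside Spark: ISO-8601 dates (yyyy-MM-dd) are zero-padded, so plain string comparison gives the same order as date comparison, and a lexicographic max finds the latest one. Function name and values below are illustrative:

```python
def latest_date(dates):
    """Return the latest date string, or None for an empty list.

    ISO-8601 dates are zero-padded, so string order equals date order.
    """
    return max(dates) if dates else None
```

For example, `latest_date(["2015-01-01", "2015-01-03", "2015-01-02"])` returns `"2015-01-03"`. The same property is what lets Spark's max-over-strings answer the latest-partition question.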
On Sun, Nov 1, 2015 at 4:06 PM, Jörn Franke wrote:
> Try with max date; in your case it could make more sense to represent the
> date as an int.
>
> Sent from my iPhone
it seems pretty fast, but if i have 2 partitions and 10mm records i do have
to dedupe (distinct) 10mm records.
a direct way to just find out what the 2 partitions are would be much
faster. spark knows it, but it's not exposed.
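One workaround for "spark knows it, but it's not exposed" is to read the Hive-style `date=<value>` partition directory names straight from storage, since they alone determine the partition values — no scan of the rows is needed. A sketch against the local filesystem (on HDFS the same idea works through Hadoop's `FileSystem.listStatus`); all names are illustrative:

```python
import os

def date_of(dir_name):
    """Extract the value from a Hive-style 'date=<value>' directory name."""
    prefix = "date="
    return dir_name[len(prefix):] if dir_name.startswith(prefix) else None

def partition_dates(table_dir):
    """List partition values from directory names alone -- no data scan."""
    values = (date_of(name) for name in os.listdir(table_dir)
              if os.path.isdir(os.path.join(table_dir, name)))
    return sorted(v for v in values if v is not None)
```

With a layout like `data/date=2015-01-01`, `data/date=2015-01-02`, ..., `partition_dates("data")` returns the sorted date values, so the last element is the latest partition.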
On Sun, Nov 1, 2015 at 4:08 PM, Koert Kuipers wrote:
I agree that max date will satisfy the latest-date requirement, but it does
not satisfy the second-latest-date requirement you mentioned.
Just for your information, before you invest too much in the partitioned
table, I want to warn you that it has memory issues (both on executors and
the driver).
Hi Koert,
You should be able to see whether it requires scanning the whole data by
running "explain" on the query. The physical plan should say something about
it. I wonder whether you are trying the distinct-sort-by-limit approach or
the max-date approach?
Best Regards,
Jerry
On Sun, Nov 1, 2015 at 4:25 PM,
i was going for the distinct approach, since i want it to be general enough
to also solve other related problems later. the max-date approach is likely
to be faster, though.
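The two approaches under discussion agree on the answer for a single date column; only their shape differs. A plain-Python sketch of both shapes (in Spark SQL these correspond to distinct + order by + limit 1 versus a max aggregate; names are illustrative):

```python
def via_distinct(dates):
    """distinct-sort-limit shape: dedupe, sort descending, take the first."""
    distinct = sorted(set(dates), reverse=True)
    return distinct[0] if distinct else None

def via_max(dates):
    """max-date shape: a single aggregate, no full dedupe-and-sort."""
    return max(dates) if dates else None
```

The distinct shape generalizes (e.g. take the second-latest by indexing further into the sorted distinct values), while the max shape does less work, which matches the trade-off described in the thread.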
On Sun, Nov 1, 2015 at 4:36 PM, Jerry Lam wrote:
> Hi Koert,
>
> You should be able to see if it requires
hello all,
i am trying to get familiar with spark sql partitioning support.
my data is partitioned by date, so like this:
data/date=2015-01-01
data/date=2015-01-02
data/date=2015-01-03
...
let's say i would like a batch process to read data for the latest date
only. how do i proceed?
generally
Try with max date; in your case it could make more sense to represent the
date as an int.
Sent from my iPhone
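The int representation suggested above can be sketched as encoding yyyy-MM-dd as the integer yyyyMMdd; because the fields are zero-padded, the int ordering matches the date ordering, and ints are cheaper to compare and aggregate than strings. Helper name is illustrative:

```python
def date_to_int(date):
    """'2015-01-03' -> 20150103; zero padding keeps int order == date order."""
    return int(date.replace("-", ""))
```

For example, `date_to_int("2015-01-03")` returns `20150103`, and taking the max over the int-encoded dates picks the same partition as the string max.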
> On 01 Nov 2015, at 21:03, Koert Kuipers wrote:
>
> hello all,
> i am trying to get familiar with spark sql partitioning support.
>
> my data is partitioned by date,
Hi Koert,
the physical plan looks like it is doing the right thing: for the
partitioned table hdfs://user/koert/test it reads the date from the
directory names, hash-partitions and aggregates to find the distinct dates,
and finally shuffles the dates for the sort and limit-1 operations.
This is my understanding of the
Hi Koert,
If the partitioned table is implemented properly, I would think "select
distinct(date) as dt from table order by dt DESC limit 1" would return the
latest date without scanning the whole dataset. I haven't tried it myself.
It would be great if you could report back whether this actually