Re: spark sql partitioned by date... read last date

2015-11-01 Thread Koert Kuipers
good idea. since the dates sort correctly alphabetically, i should be able
to do something similar with strings.
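
a rough sketch of what i have in mind (untested; assumes the dataframe is called df and the partition column is the string "date"):

import org.apache.spark.sql.functions.max

// zero-padded dates like 2015-01-01 sort the same lexicographically as
// chronologically, so max on the string column gives the latest date
val latest = df.agg(max("date")).head().getString(0)
val latestData = df.filter(df("date") === latest)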

On Sun, Nov 1, 2015 at 4:06 PM, Jörn Franke  wrote:

> Try with max date, in your case it could make more sense to represent the
> date as int
>
> Sent from my iPhone
>
> On 01 Nov 2015, at 21:03, Koert Kuipers  wrote:
>
> hello all,
> i am trying to get familiar with spark sql partitioning support.
>
> my data is partitioned by date, so like this:
> data/date=2015-01-01
> data/date=2015-01-02
> data/date=2015-01-03
> ...
>
> lets say i would like a batch process to read data for the latest date
> only. how do i proceed?
> generally the latest date will be yesterday, but it could be a day older
> or maybe 2.
>
> i understand that i will have to do something like:
> df.filter(df("date") === some_date_string_here)
>
> however i do now know what some_date_string_here should be. i would like
> to inspect the available dates and pick the latest. is there an efficient
> way to  find out what the available partitions are?
>
> thanks! koert
>
>
>


Re: spark sql partitioned by date... read last date

2015-11-01 Thread Koert Kuipers
it seems pretty fast, but if i have 2 partitions and 10mm records i do have
to dedupe (distinct) 10mm records.

a direct way to just find out what the 2 partitions are would be much
faster. spark knows it, but it's not exposed.
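
as a workaround i can of course list the partition directories on hdfs myself, something like this (rough sketch, untested; assumes the data lives under "data" and sc is the SparkContext):

import org.apache.hadoop.fs.{FileSystem, Path}

// each partition is a directory named date=yyyy-MM-dd, so just list them
val fs = FileSystem.get(sc.hadoopConfiguration)
val dates = fs.listStatus(new Path("data"))
  .filter(_.isDirectory)
  .map(_.getPath.getName)
  .filter(_.startsWith("date="))
  .map(_.stripPrefix("date="))
val latest = dates.max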

On Sun, Nov 1, 2015 at 4:08 PM, Koert Kuipers  wrote:

> it seems to work but i am not sure if its not scanning the whole dataset.
> let me dig into tasks a a bit
>
> On Sun, Nov 1, 2015 at 3:18 PM, Jerry Lam  wrote:
>
>> Hi Koert,
>>
>> If the partitioned table is implemented properly, I would think "select
>> distinct(date) as dt from table order by dt DESC limit 1" would return the
>> latest dates without scanning the whole dataset. I haven't try it that
>> myself. It would be great if you can report back if this actually works or
>> not. :)
>>
>> Best Regards,
>>
>> Jerry
>>
>>
>> On Sun, Nov 1, 2015 at 3:03 PM, Koert Kuipers  wrote:
>>
>>> hello all,
>>> i am trying to get familiar with spark sql partitioning support.
>>>
>>> my data is partitioned by date, so like this:
>>> data/date=2015-01-01
>>> data/date=2015-01-02
>>> data/date=2015-01-03
>>> ...
>>>
>>> lets say i would like a batch process to read data for the latest date
>>> only. how do i proceed?
>>> generally the latest date will be yesterday, but it could be a day older
>>> or maybe 2.
>>>
>>> i understand that i will have to do something like:
>>> df.filter(df("date") === some_date_string_here)
>>>
>>> however i do now know what some_date_string_here should be. i would like
>>> to inspect the available dates and pick the latest. is there an efficient
>>> way to  find out what the available partitions are?
>>>
>>> thanks! koert
>>>
>>>
>>>
>>
>


Re: spark sql partitioned by date... read last date

2015-11-01 Thread Jerry Lam
I agree that max date will satisfy the latest-date requirement, but it does
not satisfy the second-to-last date requirement you mentioned.

Just for your information, before you invest too much in partitioned tables,
I want to warn you that they have memory issues (on both the executor and
driver side). A simple experiment shows that with over 10 years of dates
(3650 directories), initialization takes a long time. I learned about the
limitation after I tried to partition user events by user_id. It was a
disaster (>1 user_id).

I hope the spark developers can address the memory limitations, because
partitioned tables are very useful in many cases.
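
If the number of directories becomes a problem, one workaround is to point Spark at a single partition directory so it never has to discover all of them (just a sketch; note that the date column will then not show up in the schema, since it is only inferred during partition discovery):

// reads exactly one directory; no discovery over thousands of date=... dirs
val oneDay = sqlContext.read.parquet("data/date=2015-01-02")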

Cheers ~



On Sun, Nov 1, 2015 at 4:39 PM, Koert Kuipers  wrote:

> i was going for the distinct approach, since i want it to be general
> enough to also solve other related problems later. the max-date is likely
> to be faster though.
>
> On Sun, Nov 1, 2015 at 4:36 PM, Jerry Lam  wrote:
>
>> Hi Koert,
>>
>> You should be able to see if it requires scanning the whole data by
>> "explain" the query. The physical plan should say something about it. I
>> wonder if you are trying the distinct-sort-by-limit approach or the
>> max-date approach?
>>
>> Best Regards,
>>
>> Jerry
>>
>>
>> On Sun, Nov 1, 2015 at 4:25 PM, Koert Kuipers  wrote:
>>
>>> it seems pretty fast, but if i have 2 partitions and 10mm records i do
>>> have to dedupe (distinct) 10mm records
>>>
>>> a direct way to just find out what the 2 partitions are would be much
>>> faster. spark knows it, but its not exposed.
>>>
>>> On Sun, Nov 1, 2015 at 4:08 PM, Koert Kuipers  wrote:
>>>
 it seems to work but i am not sure if its not scanning the whole
 dataset. let me dig into tasks a a bit

 On Sun, Nov 1, 2015 at 3:18 PM, Jerry Lam  wrote:

> Hi Koert,
>
> If the partitioned table is implemented properly, I would think
> "select distinct(date) as dt from table order by dt DESC limit 1" would
> return the latest dates without scanning the whole dataset. I haven't try
> it that myself. It would be great if you can report back if this actually
> works or not. :)
>
> Best Regards,
>
> Jerry
>
>
> On Sun, Nov 1, 2015 at 3:03 PM, Koert Kuipers 
> wrote:
>
>> hello all,
>> i am trying to get familiar with spark sql partitioning support.
>>
>> my data is partitioned by date, so like this:
>> data/date=2015-01-01
>> data/date=2015-01-02
>> data/date=2015-01-03
>> ...
>>
>> lets say i would like a batch process to read data for the latest
>> date only. how do i proceed?
>> generally the latest date will be yesterday, but it could be a day
>> older or maybe 2.
>>
>> i understand that i will have to do something like:
>> df.filter(df("date") === some_date_string_here)
>>
>> however i do now know what some_date_string_here should be. i would
>> like to inspect the available dates and pick the latest. is there an
>> efficient way to  find out what the available partitions are?
>>
>> thanks! koert
>>
>>
>>
>

>>>
>>
>


Re: spark sql partitioned by date... read last date

2015-11-01 Thread Jerry Lam
Hi Koert,

You should be able to see whether it requires scanning the whole dataset by
running "explain" on the query. The physical plan should say something about
it. I wonder whether you are trying the distinct-sort-by-limit approach or
the max-date approach?
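
For example, something along these lines should print the plans without executing the query (just a sketch; assumes the data is loaded into a DataFrame called df):

import org.apache.spark.sql.functions.desc

// distinct-sort-by-limit approach; explain(true) prints logical and physical plans
val latestDate = df.select("date").distinct().orderBy(desc("date")).limit(1)
latestDate.explain(true)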

Best Regards,

Jerry


On Sun, Nov 1, 2015 at 4:25 PM, Koert Kuipers  wrote:

> it seems pretty fast, but if i have 2 partitions and 10mm records i do
> have to dedupe (distinct) 10mm records
>
> a direct way to just find out what the 2 partitions are would be much
> faster. spark knows it, but its not exposed.
>
> On Sun, Nov 1, 2015 at 4:08 PM, Koert Kuipers  wrote:
>
>> it seems to work but i am not sure if its not scanning the whole dataset.
>> let me dig into tasks a a bit
>>
>> On Sun, Nov 1, 2015 at 3:18 PM, Jerry Lam  wrote:
>>
>>> Hi Koert,
>>>
>>> If the partitioned table is implemented properly, I would think "select
>>> distinct(date) as dt from table order by dt DESC limit 1" would return the
>>> latest dates without scanning the whole dataset. I haven't try it that
>>> myself. It would be great if you can report back if this actually works or
>>> not. :)
>>>
>>> Best Regards,
>>>
>>> Jerry
>>>
>>>
>>> On Sun, Nov 1, 2015 at 3:03 PM, Koert Kuipers  wrote:
>>>
 hello all,
 i am trying to get familiar with spark sql partitioning support.

 my data is partitioned by date, so like this:
 data/date=2015-01-01
 data/date=2015-01-02
 data/date=2015-01-03
 ...

 lets say i would like a batch process to read data for the latest date
 only. how do i proceed?
 generally the latest date will be yesterday, but it could be a day
 older or maybe 2.

 i understand that i will have to do something like:
 df.filter(df("date") === some_date_string_here)

 however i do now know what some_date_string_here should be. i would
 like to inspect the available dates and pick the latest. is there an
 efficient way to  find out what the available partitions are?

 thanks! koert



>>>
>>
>


Re: spark sql partitioned by date... read last date

2015-11-01 Thread Koert Kuipers
i was going for the distinct approach, since i want it to be general enough
to also solve other related problems later. the max-date approach is likely
to be faster though.
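
e.g. with distinct i can also grab the last two dates in case the latest one is incomplete (rough sketch, untested):

import org.apache.spark.sql.functions.desc

// keep the two most recent dates instead of just the max
val lastTwo = df.select("date").distinct().orderBy(desc("date"))
  .take(2).map(_.getString(0))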

On Sun, Nov 1, 2015 at 4:36 PM, Jerry Lam  wrote:

> Hi Koert,
>
> You should be able to see if it requires scanning the whole data by
> "explain" the query. The physical plan should say something about it. I
> wonder if you are trying the distinct-sort-by-limit approach or the
> max-date approach?
>
> Best Regards,
>
> Jerry
>
>
> On Sun, Nov 1, 2015 at 4:25 PM, Koert Kuipers  wrote:
>
>> it seems pretty fast, but if i have 2 partitions and 10mm records i do
>> have to dedupe (distinct) 10mm records
>>
>> a direct way to just find out what the 2 partitions are would be much
>> faster. spark knows it, but its not exposed.
>>
>> On Sun, Nov 1, 2015 at 4:08 PM, Koert Kuipers  wrote:
>>
>>> it seems to work but i am not sure if its not scanning the whole
>>> dataset. let me dig into tasks a a bit
>>>
>>> On Sun, Nov 1, 2015 at 3:18 PM, Jerry Lam  wrote:
>>>
 Hi Koert,

 If the partitioned table is implemented properly, I would think "select
 distinct(date) as dt from table order by dt DESC limit 1" would return the
 latest dates without scanning the whole dataset. I haven't try it that
 myself. It would be great if you can report back if this actually works or
 not. :)

 Best Regards,

 Jerry


 On Sun, Nov 1, 2015 at 3:03 PM, Koert Kuipers 
 wrote:

> hello all,
> i am trying to get familiar with spark sql partitioning support.
>
> my data is partitioned by date, so like this:
> data/date=2015-01-01
> data/date=2015-01-02
> data/date=2015-01-03
> ...
>
> lets say i would like a batch process to read data for the latest date
> only. how do i proceed?
> generally the latest date will be yesterday, but it could be a day
> older or maybe 2.
>
> i understand that i will have to do something like:
> df.filter(df("date") === some_date_string_here)
>
> however i do now know what some_date_string_here should be. i would
> like to inspect the available dates and pick the latest. is there an
> efficient way to  find out what the available partitions are?
>
> thanks! koert
>
>
>

>>>
>>
>


spark sql partitioned by date... read last date

2015-11-01 Thread Koert Kuipers
hello all,
i am trying to get familiar with spark sql partitioning support.

my data is partitioned by date, so like this:
data/date=2015-01-01
data/date=2015-01-02
data/date=2015-01-03
...

let's say i would like a batch process to read data for the latest date
only. how do i proceed?
generally the latest date will be yesterday, but it could be a day older or
maybe 2.

i understand that i will have to do something like:
df.filter(df("date") === some_date_string_here)

however i do not know what some_date_string_here should be. i would like to
inspect the available dates and pick the latest. is there an efficient way
to find out what the available partitions are?
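
to make it concrete, this is roughly my setup (a sketch; assumes parquet data under "data" and a sqlContext):

// spark sql discovers the date=... directories as a partition column
val df = sqlContext.read.parquet("data")
df.printSchema() // shows the inferred date partition column

// this works if i already know the date string, which is the part i am missing:
val someDate = "2015-01-02" // placeholder
df.filter(df("date") === someDate).count()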

thanks! koert


Re: spark sql partitioned by date... read last date

2015-11-01 Thread Jörn Franke
Try with max date; in your case it could make more sense to represent the
date as an int.
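
For example something like this (just a sketch; assumes the partition value is written as an int such as 20150101, so spark infers an integer column):

import org.apache.spark.sql.functions.max

// with the date stored as a yyyyMMdd int, max gives the latest date
val latest = df.agg(max("date")).head().getInt(0)
val latestData = df.filter(df("date") === latest)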

Sent from my iPhone

> On 01 Nov 2015, at 21:03, Koert Kuipers  wrote:
> 
> hello all,
> i am trying to get familiar with spark sql partitioning support.
> 
> my data is partitioned by date, so like this:
> data/date=2015-01-01
> data/date=2015-01-02
> data/date=2015-01-03
> ...
> 
> lets say i would like a batch process to read data for the latest date only. 
> how do i proceed? 
> generally the latest date will be yesterday, but it could be a day older or 
> maybe 2. 
> 
> i understand that i will have to do something like:
> df.filter(df("date") === some_date_string_here)
> 
> however i do now know what some_date_string_here should be. i would like to 
> inspect the available dates and pick the latest. is there an efficient way to 
>  find out what the available partitions are?
> 
> thanks! koert
> 
> 


Re: spark sql partitioned by date... read last date

2015-11-01 Thread Jerry Lam
Hi Koert,

the physical plan looks like it is doing the right thing:

it reads the partitioned table hdfs://user/koert/test, gets the date from
the directory names, does a hash-partitioned aggregation on the date to find
the distinct dates, and finally shuffles the dates for the sort and limit 1
operations.

This is my understanding of the physical plan; you can navigate the actual
execution in the web UI to see how much data is actually read to satisfy
this request. I hope it only requires a few bytes for the few dates.

Best Regards,

Jerry


On Sun, Nov 1, 2015 at 5:56 PM, Jerry Lam  wrote:

> I agreed the max date will satisfy the latest date requirement but it does
> not satisfy the second last date requirement you mentioned.
>
> Just for your information, before you invested in the partitioned table
> too much, I want to warn you that it has memory issues (both on executors
> and driver side). A simple experiment can show that if you have over 10
> years of date (3650 directories), it takes a long time to initialize. I got
> to know the limitation after I tried to partition user events per their
> user_id. It was a disaster (>1 user_id).
>
> I hope the spark developer can address the memory limitations because
> partitioned table is very useful in many cases.
>
> Cheers ~
>
>
>
> On Sun, Nov 1, 2015 at 4:39 PM, Koert Kuipers  wrote:
>
>> i was going for the distinct approach, since i want it to be general
>> enough to also solve other related problems later. the max-date is likely
>> to be faster though.
>>
>> On Sun, Nov 1, 2015 at 4:36 PM, Jerry Lam  wrote:
>>
>>> Hi Koert,
>>>
>>> You should be able to see if it requires scanning the whole data by
>>> "explain" the query. The physical plan should say something about it. I
>>> wonder if you are trying the distinct-sort-by-limit approach or the
>>> max-date approach?
>>>
>>> Best Regards,
>>>
>>> Jerry
>>>
>>>
>>> On Sun, Nov 1, 2015 at 4:25 PM, Koert Kuipers  wrote:
>>>
 it seems pretty fast, but if i have 2 partitions and 10mm records i do
 have to dedupe (distinct) 10mm records

 a direct way to just find out what the 2 partitions are would be much
 faster. spark knows it, but its not exposed.

 On Sun, Nov 1, 2015 at 4:08 PM, Koert Kuipers 
 wrote:

> it seems to work but i am not sure if its not scanning the whole
> dataset. let me dig into tasks a a bit
>
> On Sun, Nov 1, 2015 at 3:18 PM, Jerry Lam 
> wrote:
>
>> Hi Koert,
>>
>> If the partitioned table is implemented properly, I would think
>> "select distinct(date) as dt from table order by dt DESC limit 1" would
>> return the latest dates without scanning the whole dataset. I haven't try
>> it that myself. It would be great if you can report back if this actually
>> works or not. :)
>>
>> Best Regards,
>>
>> Jerry
>>
>>
>> On Sun, Nov 1, 2015 at 3:03 PM, Koert Kuipers 
>> wrote:
>>
>>> hello all,
>>> i am trying to get familiar with spark sql partitioning support.
>>>
>>> my data is partitioned by date, so like this:
>>> data/date=2015-01-01
>>> data/date=2015-01-02
>>> data/date=2015-01-03
>>> ...
>>>
>>> lets say i would like a batch process to read data for the latest
>>> date only. how do i proceed?
>>> generally the latest date will be yesterday, but it could be a day
>>> older or maybe 2.
>>>
>>> i understand that i will have to do something like:
>>> df.filter(df("date") === some_date_string_here)
>>>
>>> however i do now know what some_date_string_here should be. i would
>>> like to inspect the available dates and pick the latest. is there an
>>> efficient way to  find out what the available partitions are?
>>>
>>> thanks! koert
>>>
>>>
>>>
>>
>

>>>
>>
>


Re: spark sql partitioned by date... read last date

2015-11-01 Thread Jerry Lam
Hi Koert,

If the partitioned table is implemented properly, I would think "select
distinct(date) as dt from table order by dt DESC limit 1" would return the
latest date without scanning the whole dataset. I haven't tried it myself.
It would be great if you could report back on whether this actually works or
not. :)
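
Concretely, something along these lines (a sketch I have not run; "events" is just a placeholder name for the registered temp table):

df.registerTempTable("events")
val latestDate = sqlContext.sql(
  "select distinct(date) as dt from events order by dt desc limit 1")
latestDate.show()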

Best Regards,

Jerry


On Sun, Nov 1, 2015 at 3:03 PM, Koert Kuipers  wrote:

> hello all,
> i am trying to get familiar with spark sql partitioning support.
>
> my data is partitioned by date, so like this:
> data/date=2015-01-01
> data/date=2015-01-02
> data/date=2015-01-03
> ...
>
> lets say i would like a batch process to read data for the latest date
> only. how do i proceed?
> generally the latest date will be yesterday, but it could be a day older
> or maybe 2.
>
> i understand that i will have to do something like:
> df.filter(df("date") === some_date_string_here)
>
> however i do now know what some_date_string_here should be. i would like
> to inspect the available dates and pick the latest. is there an efficient
> way to  find out what the available partitions are?
>
> thanks! koert
>
>
>