Re: spark sql partitioned by date... read last date
good idea. with the dates sorting correctly alphabetically i should be able to do something similar with strings

On Sun, Nov 1, 2015 at 4:06 PM, Jörn Franke wrote:
> Try with max date, in your case it could make more sense to represent the
> date as int
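a minimal sketch of the string-max idea from the message above (the data/ layout and the date column are from the thread; reading as parquet and the exact api calls are my assumptions):

    import org.apache.spark.sql.functions.max

    // ISO dates (yyyy-MM-dd) sort lexicographically in chronological order,
    // so max on the string partition column picks the latest date
    val df = sqlContext.read.parquet("data")
    val latest = df.agg(max("date")).head().getString(0)

    // read only the latest partition
    val latestDf = df.filter(df("date") === latest)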
Re: spark sql partitioned by date... read last date
it seems pretty fast, but if i have 2 partitions and 10mm records i do have to dedupe (distinct) 10mm records.

a direct way to just find out what the 2 partitions are would be much faster. spark knows it, but it's not exposed.

On Sun, Nov 1, 2015 at 4:08 PM, Koert Kuipers wrote:
> it seems to work but i am not sure if it's not scanning the whole dataset.
> let me dig into the tasks a bit
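one possible workaround for the "not exposed" part, not something the thread confirms: list the partition directories yourself with the hadoop FileSystem api and parse the dates out of the directory names. a hedged sketch (the data/ path is from the thread; everything else is assumption, written for a spark-shell where sc is in scope):

    import org.apache.hadoop.fs.{FileSystem, Path}

    // enumerate data/date=... directories instead of scanning any rows
    val fs = FileSystem.get(sc.hadoopConfiguration)
    val dates = fs.listStatus(new Path("data"))
      .filter(_.isDirectory)
      .map(_.getPath.getName)                          // e.g. "date=2015-01-03"
      .collect { case s if s.startsWith("date=") => s.stripPrefix("date=") }
      .sorted

    val latestDate = dates.last   // lexicographic max == latest ISO date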
Re: spark sql partitioned by date... read last date
I agreed the max date will satisfy the latest date requirement, but it does not satisfy the second-last date requirement you mentioned.

Just for your information, before you invest too much in the partitioned table, I want to warn you that it has memory issues (both on the executor and the driver side). A simple experiment can show that if you have over 10 years of dates (3650 directories), it takes a long time to initialize. I got to know the limitation after I tried to partition user events by user_id. It was a disaster (>1 user_id).

I hope the spark developers can address the memory limitations, because partitioned tables are very useful in many cases.

Cheers ~

On Sun, Nov 1, 2015 at 4:39 PM, Koert Kuipers wrote:
> i was going for the distinct approach, since i want it to be general
> enough to also solve other related problems later. the max-date is likely
> to be faster though.
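for the second-last date requirement, the distinct approach extends directly; a sketch under the same assumptions as the rest of the thread (df is the dataframe read from the partitioned path):

    import org.apache.spark.sql.functions.col

    // latest and second-latest dates from the small set of distinct values
    val topTwo = df.select("date").distinct()
      .orderBy(col("date").desc)
      .limit(2)
      .collect()
      .map(_.getString(0))

    val latest = topTwo(0)
    val secondLast = if (topTwo.length > 1) Some(topTwo(1)) else None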
Re: spark sql partitioned by date... read last date
Hi Koert,

You should be able to see if it requires scanning the whole dataset by calling "explain" on the query. The physical plan should say something about it. I wonder if you are trying the distinct-sort-by-limit approach or the max-date approach?

Best Regards,

Jerry

On Sun, Nov 1, 2015 at 4:25 PM, Koert Kuipers wrote:
> a direct way to just find out what the 2 partitions are would be much
> faster. spark knows it, but it's not exposed.
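for reference, checking the plan looks roughly like this (the query shape is the distinct-sort-by-limit one from earlier in the thread; df is assumed as before):

    // print the logical and physical plans without running the query; a scan
    // node that touches only the partition column suggests no full data scan
    val q = df.select("date").distinct()
      .orderBy(df("date").desc)
      .limit(1)
    q.explain(true)   // true = extended output, includes analyzed/optimized plans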
Re: spark sql partitioned by date... read last date
i was going for the distinct approach, since i want it to be general enough to also solve other related problems later. the max-date is likely to be faster though.

On Sun, Nov 1, 2015 at 4:36 PM, Jerry Lam wrote:
> You should be able to see if it requires scanning the whole dataset by
> calling "explain" on the query. The physical plan should say something
> about it.
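a sketch of the more general version of the distinct approach (same df assumption as above): collect all available dates once, then pick whatever subset a given job needs.

    // all available partition dates, sorted ascending
    val availableDates = df.select("date").distinct()
      .collect()
      .map(_.getString(0))
      .sorted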
spark sql partitioned by date... read last date
hello all,
i am trying to get familiar with spark sql partitioning support.

my data is partitioned by date, like this:
data/date=2015-01-01
data/date=2015-01-02
data/date=2015-01-03
...

lets say i would like a batch process to read data for the latest date only. how do i proceed? generally the latest date will be yesterday, but it could be a day older or maybe 2.

i understand that i will have to do something like:
df.filter(df("date") === some_date_string_here)

however i do not know what some_date_string_here should be. i would like to inspect the available dates and pick the latest. is there an efficient way to find out what the available partitions are?

thanks! koert
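for concreteness, the filter sketched above would look roughly like this once the latest date string is known (discovering that string is what the replies in this thread address; reading as parquet is an assumption):

    val df = sqlContext.read.parquet("data")

    // with a known date string this filter should be satisfied by partition
    // pruning, i.e. only data/date=<latest> gets read
    val latestDf = df.filter(df("date") === "2015-01-03")   // placeholder date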
Re: spark sql partitioned by date... read last date
Try with max date; in your case it could make more sense to represent the date as an int.

Sent from my iPhone

On 01 Nov 2015, at 21:03, Koert Kuipers wrote:
> i would like to inspect the available dates and pick the latest. is there
> an efficient way to find out what the available partitions are?
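a sketch of the int representation idea (the date_int column name and the directory layout are my assumptions, not from the message):

    import org.apache.spark.sql.functions.max

    // with partitions like data/date_int=20150101 spark infers an integer
    // partition column, and the latest date becomes a plain numeric max
    val df = sqlContext.read.parquet("data")
    val latest = df.agg(max("date_int")).head().getInt(0)
    val latestDf = df.filter(df("date_int") === latest)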
Re: spark sql partitioned by date... read last date
Hi Koert,

the physical plan looks like it is doing the right thing: it reads the partitioned table hdfs://user/koert/test, gets the dates from the directory names, hash-partitions and aggregates them to find the distinct dates, and finally shuffles the dates for the sort and limit 1 operations. This is my understanding of the physical plan; you can navigate the actual execution in the web UI to see how much data is actually read to satisfy this request. I hope it only requires a few bytes for the few dates.

Best Regards,

Jerry

On Sun, Nov 1, 2015 at 5:56 PM, Jerry Lam wrote:
> Just for your information, before you invest too much in the partitioned
> table, I want to warn you that it has memory issues (both on the executor
> and the driver side).
Re: spark sql partitioned by date... read last date
Hi Koert,

If the partitioned table is implemented properly, I would think "select distinct(date) as dt from table order by dt DESC limit 1" would return the latest date without scanning the whole dataset. I haven't tried it myself. It would be great if you can report back whether this actually works or not. :)

Best Regards,

Jerry

On Sun, Nov 1, 2015 at 3:03 PM, Koert Kuipers wrote:
> i would like to inspect the available dates and pick the latest. is there
> an efficient way to find out what the available partitions are?
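a runnable sketch of that query (the table name events is made up; in spark 1.x the dataframe needs a temp table registered first):

    val df = sqlContext.read.parquet("data")
    df.registerTempTable("events")

    // only the partition column is referenced, so ideally no data files
    // need to be read to answer this
    val latest = sqlContext
      .sql("select distinct(date) as dt from events order by dt DESC limit 1")
      .head()
      .getString(0)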