The query window is between now and 2 months ago, with a granularity of
one minute. So 15/02/2011 09:28:59 is treated the same as 15/02/2011 09:28:01.
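Concretely, that just means truncating the seconds off the timestamp
before comparing; a tiny Python sketch:

    from datetime import datetime

    def to_minute(dt):
        # Drop seconds and microseconds so timestamps compare at minute resolution.
        return dt.replace(second=0, microsecond=0)

    # Both normalise to 15/02/2011 09:28.
    a = to_minute(datetime(2011, 2, 15, 9, 28, 59))
    b = to_minute(datetime(2011, 2, 15, 9, 28, 1))
    assert a == b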

The number of cars is likely to increase very slowly; we probably
won't see big jumps at all. The majority will start at the beginning
of our beta period. Once a car starts reporting journeys, it's likely
to continue reporting journeys at the same rate (which could be
anywhere between every 3 seconds and every 24 hours) indefinitely.

Both Stephen's idea and yours sound like really good steps. I'll
discuss them with my team later and try to get one implemented.
Fingers crossed!

And of course - thanks for the help!

On Feb 15, 3:42 am, Robert Kluin <robert.kl...@gmail.com> wrote:
> Hi Ed,
>   I think Stephen's most recent idea (key name as date + car id) might
> work well.  I have a suggestion that is somewhat similar to Nick's
> idea.
>
>   How wide and variable is the query window?  How variable are the
> occurrences of car journey data?  Once a car starts having journeys,
> will it continue having journeys for a (relatively) long time?  You
> said there would be up to 2,000 cars, are new cars frequently
> introduced and old cars removed?
>
>   Depending on those questions, perhaps you could build your own
> index.  Think about an entity that stores a (non-indexed) list of cars
> / car journeys.  If you make the key_name for that entity "lower
> resolution" than the start time, you store the most recent journey for
> each car.  If you construct these entities properly, then you could
> easily build a list of the keys needed to completely enclose a
> particular query window by fetching a small number of these lists.  If
> you also create higher-level groups, you can also efficiently query
> over very large periods of time.
>
>   The key names might look something like:
>    201102151430:  15 Feb 2011 14:30:  Journeys that occurred between 14:25 and 14:30
>    2011021514:    15 Feb 2011 14:     Journeys that occurred from 13:01 to 14:00
>    20110215:      15 Feb 2011:        Journeys that occurred on 15 Feb 2011
>    etc...
>
>   You might use a list of tuples on the entity that stores the car id,
> journey id, and start time.  That will allow you to efficiently fetch
> the cars and the journeys, and to correctly combine and filter the
> results in memory when you need more than one index entity.
>
>   You would need to construct the key names based on your query
> windows.  With some planning, the technique works very well even for
> 'continuous' query windows that are highly variable in size.  I've
> used this technique for reporting and am very pleased with it.  It is
> also relatively easy to implement.  You just need a background process
> that maintains your indexes; it can run (almost) in real time or as an
> offline batch job.
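A rough sketch of how such an index could look with the Python db API
(the kind, property names, bucket sizes and 'car|journey|start' entry
format are my own inventions to keep the sketch short; key_names here
mark the start of each bucket, and only hour and day buckets are used):

    from datetime import timedelta
    from google.appengine.ext import db

    class JourneyIndex(db.Model):
        # key_name is a truncated timestamp: '2011021514' covers one hour,
        # '20110215' covers one whole day.
        # Each entry is 'car_id|journey_id|start_iso' (start stored as isoformat()).
        entries = db.StringListProperty(indexed=False)

    def covering_key_names(start, end):
        # Walk the window hour by hour, collapsing full days into a single
        # day-level key so wide windows still need only a handful of fetches.
        names = []
        cur = start.replace(minute=0, second=0, microsecond=0)
        while cur <= end:
            if cur.hour == 0 and cur + timedelta(days=1) <= end:
                names.append(cur.strftime('%Y%m%d'))    # whole-day bucket
                cur += timedelta(days=1)
            else:
                names.append(cur.strftime('%Y%m%d%H'))  # hour bucket
                cur += timedelta(hours=1)
        return names

    def latest_journey_per_car(start, end):
        keys = [db.Key.from_path('JourneyIndex', name)
                for name in covering_key_names(start, end)]
        latest = {}  # car_id -> (start_iso, journey_id)
        for index in db.get(keys):  # one batch get covers the whole window
            if index is None:
                continue
            for entry in index.entries:
                car_id, journey_id, start_iso = entry.split('|')
                # ISO strings sort chronologically, so plain string comparison
                # is enough to filter the edges and pick the newest per car.
                if start.isoformat() <= start_iso <= end.isoformat():
                    if car_id not in latest or start_iso > latest[car_id][0]:
                        latest[car_id] = (start_iso, journey_id)
        return latest

The write side would be the background process Robert mentions: as
journeys come in, append their entries to the matching hour and day
buckets (batched, or from a task queue), so a read only touches the
actual CarJourney entities in a final batch get.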
>
> Robert
>
> On Mon, Feb 14, 2011 at 18:53, nickmilon <nickmi...@gmail.com> wrote:
> > I faced the same kind of problem some time ago.
> > I tried some of the solutions suggested here (in-memory sort and
> > filtering, encoding things into keys, etc.) and benchmarked them for
> > both latency and CPU cycles using some test data of around 100K
> > entities.
> > Another approach I have taken is encoding the date as an integer (days
> > since the start of an epoch, or since the start of the year, and the
> > same for hour of day or month, depending on how much detail you need
> > in your output) and saving this into a property. This way you turn
> > your date filter into an equality-only filter, which does not even
> > need a custom index, and you can then sort or filter on other
> > properties.
> > Benchmarking this last solution, I found that when the filtered result
> > set is a small fraction of the unfiltered set, it is an order of
> > magnitude (or more) faster and more CPU-efficient. In the worst case,
> > when filtering does not reduce the result set at all, latency and CPU
> > usage were comparable to the previous solutions.
>
> > Hope this helps, or did I miss something?
> > Happy coding :-)
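A quick sketch of this integer-day encoding as I understand it (the
model and property names here are invented for illustration), using the
Python db API:

    import datetime
    from google.appengine.ext import db

    EPOCH = datetime.date(2010, 1, 1)  # arbitrary start of our own epoch

    class Journey(db.Model):
        car = db.StringProperty()
        start = db.DateTimeProperty()
        day = db.IntegerProperty()  # days since EPOCH, derived on write

    def day_number(dt):
        return (dt.date() - EPOCH).days

    # Write side: derive the integer once when the journey is stored.
    start = datetime.datetime(2011, 2, 15, 9, 28)
    Journey(car='123', start=start, day=day_number(start)).put()

    # Read side: the date restriction is now an equality filter, so it
    # composes with other equality filters using only the built-in
    # per-property indexes; sort the (much smaller) result in memory.
    journeys = (Journey.all()
                .filter('day =', day_number(start))
                .filter('car =', '123')
                .fetch(1000))
    journeys.sort(key=lambda j: j.start, reverse=True)

For a multi-day window you would run one such equality query per day
value (or small IN batches of day numbers) and merge the results in
memory.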
>
> > On Feb 14, 11:51 pm, Stephen Johnson <onepagewo...@gmail.com> wrote:
> >> Okay, I think I've got something that might work. Reverse the StartDate
> >> and CarId in the key from what I said above, so the key would look like
> >> 2011:02:14:17:13:33:123, and the KEYS ONLY query then is:
>
> >> select __key__ where __key__ >= MAKEKEY(StartDate + CarId) && __key__ <=
> >> MAKEKEY(EndDate + CarId) order by __key__ DESC
>
> >> Now you can use the async query to start processing. You're going to get
> >> entries that you're not interested in, but you're only getting the key
> >> field back, not the whole CarJourney entity, and that key/id contains
> >> the date and the Car ID. So the first time you hit a Car ID for each
> >> car, you have the ID of the latest CarJourney for that car. Once you've
> >> found all the car IDs you're looking for, you can abort the query;
> >> otherwise you'll reach the end of the query results. As you're looping,
> >> store the keys of the entries you're looking for, then do a batch GET
> >> on memcache to retrieve as many Car (you've got the car id) and
> >> CarJourney (you've got the carjourney id) entries as might be stored
> >> there, and for any that you didn't get from memcache, you can do a
> >> batch GET on the datastore using the keys/ids that you have.
>
> >> I think that if you memcache things appropriately and use batch gets
> >> for memcache and the datastore, this might just work for you.
>
> >> Let me know what you think. It's an interesting problem,
> >> Stephen
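Here's roughly how I picture that keys-only scan, as a Python db API
sketch rather than Stephen's own code (the key format, kind and helper
names are just for illustration; the Java equivalents are analogous):

    from google.appengine.api import memcache
    from google.appengine.ext import db

    class CarJourney(db.Model):
        # key_name = '<start>:<car_id>', e.g. '2011:02:14:17:13:33:123',
        # so keys sort chronologically first, then by car id.
        start = db.DateTimeProperty()
        # ... other journey fields ...

    def stamp(dt):
        return dt.strftime('%Y:%m:%d:%H:%M:%S')

    def latest_journey_keys(window_start, window_end, wanted_car_ids):
        lo = db.Key.from_path('CarJourney', stamp(window_start))
        # '~' sorts after the digits, so every car id within the final
        # second of the window still falls inside the key range.
        hi = db.Key.from_path('CarJourney', stamp(window_end) + ':~')
        query = (db.Query(CarJourney, keys_only=True)
                 .filter('__key__ >=', lo)
                 .filter('__key__ <=', hi)
                 .order('-__key__'))          # newest keys come back first
        latest, remaining = {}, set(wanted_car_ids)
        for key in query.run():
            car_id = key.name().rsplit(':', 1)[-1]
            if car_id in remaining:
                latest[car_id] = key          # first hit = latest journey
                remaining.discard(car_id)
                if not remaining:             # found them all, stop early
                    break
        return latest

    def resolve(keys):
        # Batch GET on memcache first (assuming whole entities were cached
        # under str(key)), then one batch GET on the datastore for the misses.
        cached = memcache.get_multi([str(k) for k in keys])
        missing = [k for k in keys if str(k) not in cached]
        return cached.values() + db.get(missing)

The same resolve pattern works for the Car entities. The scan still
walks every journey key in the window for every car, so it pays off when
the window is modest and the entities are well cached.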
>
> >> On Mon, Feb 14, 2011 at 2:12 PM, Stephen Johnson 
> >> <onepagewo...@gmail.com> wrote:
>
> >> > Or maybe it blocks on different result sets just not on getting the next
> >> > fetch block?? Hmmm. Sounds like a tough problem.
>
> >> > On Mon, Feb 14, 2011 at 2:09 PM, Stephen Johnson 
> >> > <onepagewo...@gmail.com> wrote:
>
> >> >> Are you using .asList? I think that blocks like you describe, but I
> >> >> thought asIterable or asIterator wasn't supposed to (if you're using
> >> >> Java).
>
> >> >> On Mon, Feb 14, 2011 at 12:38 PM, Edward Hartwell Goose <
> >> >> ed.go...@gmail.com> wrote:
>
> >> >>> Hi Calvin & Stephen,
>
> >> >>> Thanks for the ideas.
>
> >> >>> Calvin:
> >> >>> We can't do the filtering in memory. We potentially have a car (the
> >> >>> car analogy isn't so good...) making a journey every 3 seconds, and
> >> >>> we could have up to 2,000 cars.
>
> >> >>> We need to be able to look back up to 2 months, so it could be up to
> >> >>> 1.8 billion rows in this table.
>
> >> >>> Stephen:
> >> >>> That's an interesting idea. However, the asynchronous API actually
> >> >>> fires the requests sequentially; it just doesn't block. (Or at
> >> >>> least, that's my experience.)
>
> >> >>> So, at the moment we fire off 1 query (which Google turns into 2) for
> >> >>> each site. And although the method call returns instantly, it still
> >> >>> takes ~5 seconds in total with basic test data. If each call takes
> >> >>> 12ms, we still have to wait 24 seconds for 2,000 sites.
>
> >> >>> So the first call starts at time 0, the second call starts at
> >> >>> 0+12ms, the third at 0+12+12ms... etc. With 2,000 sites, this works
> >> >>> out to about 24 seconds. Once you add in the overheads and the time
> >> >>> to get the list of Cars in the first place, it's too long.
>
> >> >>> If we could start even 100 queries at the same time at time 0,
> >> >>> that'd be superb. We thought we could do it with multithreading, but
> >> >>> that's not allowed on App Engine.
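For what it's worth, the queries can be made to overlap without threads:
the experimental ndb library (and the Java async datastore service) lets
you issue all the per-car queries up front and wait on them together.
A minimal Python sketch, with an assumed CarJourney model:

    from google.appengine.ext import ndb

    class CarJourney(ndb.Model):
        car = ndb.StringProperty()
        start = ndb.DateTimeProperty()

    def latest_per_car(car_ids, window_start, window_end):
        # Fire every query first; the underlying RPCs run concurrently
        # instead of each one starting ~12ms after the previous one.
        futures = {}
        for car_id in car_ids:
            q = (CarJourney.query(CarJourney.car == car_id,
                                  CarJourney.start >= window_start,
                                  CarJourney.start <= window_end)
                 .order(-CarJourney.start))    # needs a (car, -start) index
            futures[car_id] = q.fetch_async(1)  # just the newest journey
        ndb.Future.wait_all(futures.values())
        results = {}
        for car_id, future in futures.items():
            journeys = future.get_result()
            if journeys:
                results[car_id] = journeys[0]
        return results

You would still want to batch the 2,000 cars into smaller groups to stay
within RPC limits, but the waiting then overlaps instead of adding up.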
>
> >> >>> Finally - I've also posted this on StackOverflow -
>
> >> >>>http://stackoverflow.com/questions/4993744/selecting-distinct-entitie...
>
> >> >>> I'll try and keep both updated.
>
> >> >>> Any more thoughts welcome!
> >> >>> Ed
>
> >> >>> On Feb 14, 6:47 pm, Calvin <calvin.r...@gmail.com> wrote:
> >> >>> > Can you do filtering in memory?
>
> >> >>> > This query would give you all of the journeys for a list of cars
> >> >>> > within the date range:
> >> >>> > import datetime
> >> >>> > carlist = ['123','333','543','753','963','1236']
> >> >>> > start_date = datetime.datetime(2011, 1, 30)
> >> >>> > end_date = datetime.datetime(2011, 2, 10)
>
> >> >>> > journeys = Journey.all().filter('start >', start_date).filter(
> >> >>> >     'start <', end_date).filter('car IN', carlist).order(
> >> >>> >     '-start').fetch(100)
> >> >>> > len(journeys)
> >> >>> > 43  # <- since it's less than 100 I know I've gotten them all
>
> >> >>> > then since the list is sorted, I know the first entry per car is
> >> >>> > the most recent journey:
>
> >> >>> > results = {}
> >> >>> > for journey in journeys:
> >> >>> > ...   if journey.car in results:
> >> >>> > ...     continue
> >> >>> > ...   results[journey.car] = journey
>
> >> >>> > len(results)
> >> >>> > 6
>
> >> >>> > for result in results.values():
> >> >>> > ...   print("%s : %s" % (result.car, result.start))
> >> >>> > 753 : 2011-02-09 12:38:48.887976
> >> >>> > 1236 : 2011-02-06 13:59:35.221003
> >> >>> > 963 : 2011-02-08 14:03:54.587609
> >> >>> > 333 : 2011-02-09 10:40:09.466700
> >> >>> > 543 : 2011-02-09 15:28:53.197123
> >> >>> > 123 : 2011-02-09 14:09:02.680870
>
