Hi Ed,
One issue you'll run into is a large time frame to query over, with some
cars that make journeys infrequently and others that make them very
frequently, so you'll have to scan a lot of key entries to find all the
cars. As a further optimization, if you can somehow classify a car by its
general frequency of journey entries, you could assign slot numbers to
the cars. It wouldn't matter if they sometimes deviate from this general
frequency. For example, create five slots, call them A, B, C, D and E.
If car 123 makes journeys infrequently, it goes in slot A; if car 789
makes journeys frequently, it goes in slot E. Then you can add that slot
number to the Car entity information and prepend it to the CarJourney
keys, so a key now looks like:

    SLOT:DATE:CAR-ID
    A:2011:02:14:17:13:33:123

Now, when querying, first fetch the cars you're interested in, determine
the slot each one is in, group the cars by slot, and run the following
query once per slot for the cars in that slot:

    select __key__ where __key__ >= MAKEKEY(Slot + StartDate + CarId)
      && __key__ <= MAKEKEY(Slot + EndDate + CarId) order by __key__ DESC

Now you'll have more queries to run (up to five, or three if you choose
three slots, etc.), but if your classification of cars is even mildly
accurate over a long period, you'll probably save yourself a lot of
scanning.
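
In case it helps, here's a rough Python sketch of the slot scheme (the
model and helper names here are placeholders, not your actual code, so
treat it as an illustration only):

    from google.appengine.ext import db

    class CarJourney(db.Model):
        # key_name = "SLOT:YYYY:MM:DD:HH:MM:SS:CAR-ID"
        start = db.DateTimeProperty()

    def journey_key_name(slot, when, car_id):
        # Zero-padded timestamp, so lexicographic key order matches
        # chronological order.
        return '%s:%s:%s' % (slot, when.strftime('%Y:%m:%d:%H:%M:%S'), car_id)

    def latest_journeys(slot, start, end, car_ids):
        # Keys-only scan, newest first; the first key seen for a car is
        # its latest journey in the window.
        q = db.Query(CarJourney, keys_only=True)
        q.filter('__key__ >=', db.Key.from_path(
            'CarJourney', journey_key_name(slot, start, '')))
        q.filter('__key__ <=', db.Key.from_path(
            'CarJourney', journey_key_name(slot, end, '~')))
        q.order('-__key__')
        latest, remaining = {}, set(car_ids)
        for key in q:
            car_id = key.name().rsplit(':', 1)[1]
            if car_id in remaining:
                latest[car_id] = key
                remaining.remove(car_id)
            if not remaining:
                break
        return latest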


Let us know how your testing goes and what you end up doing,
Stephen

On Tue, Feb 15, 2011 at 2:32 AM, Edward Hartwell Goose
<ed.go...@gmail.com> wrote:

> The query window is between now and 2 months ago, with granularity of
> a minute. So 15/02/2011 09:28:59 is the same as 15/02/2011 09:28:01.
>
> The number of cars is likely to increase very slowly; we probably
> won't see big jumps at all. The majority will start at the beginning
> of our beta period. Once a car starts reporting journeys, it's likely
> to continue reporting journeys at the same rate (which could be
> somewhere between every 3 seconds and every 24 hours) indefinitely.
>
> Both Stephen's and your ideas sound like really good steps. I'll discuss
> them with my team later and try to get one implemented. Fingers
> crossed!
>
> And of course - thanks for the help!
>
> On Feb 15, 3:42 am, Robert Kluin <robert.kl...@gmail.com> wrote:
> > Hi Ed,
> >   I think Stephen's most recent idea (key name as date + car id) might
> > work well.  I have a suggestion that is somewhat similar to Nick's
> > idea.
> >
> >   How wide and variable is the query window?  How variable are the
> > occurrences of car journey data?  Once a car starts having journeys,
> > will it continue having journeys for a (relatively) long time?  You
> > said there would be up to 2,000 cars; are new cars frequently
> > introduced and old cars removed?
> >
> >   Depending on those questions, perhaps you could build your own
> > index.  Think about an entity that stores a (non-indexed) list of cars
> > / car journeys.  If you make the key_name for that entity "lower
> > resolution" than the start time, you store the most recent journey for
> > each car.  If you construct these entities properly, then you could
> > easily build a list of the keys needed to completely enclose a
> > particular query window by fetching a small number of these lists.  If
> > you also create higher-level groups, you can also efficiently query
> > over very large periods of time.
> >
> >   The key names might look something like:
> >    201102151430:  15 Feb 2011 14:30:  Journeys that occurred between
> > 14:25 and 14:30.
> >    2011021514:  15 Feb 2011 14: Journeys that occurred from 13:01 to
> > 14:00
> >    20110215:  15 Feb 2011: Journeys that occurred on 15 Feb 2011
> >    etc...
> >
> >   You might use a list of tuples on the entity that stores the car id,
> > journey id, and start time.  That will allow you to efficiently fetch
> > the cars and the journeys, and correctly combine and filter the results
> > in memory when you need more than one index entity.
> >
> >   You would need to construct the key names based on your query
> > windows.  With some planning, the technique works very well even for
> > 'continuous' query windows that are highly variable in size.  I've
> > used this technique for reporting and am very pleased with it.  It is
> > also relatively easy to implement; you just need a background process
> > that maintains your indexes, which can run (almost) real-time or as an
> > offline batch job.
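> >
> >   A rough sketch of what such an index entity might look like (all the
> > names here are hypothetical, and this assumes the Python db API):
> >
> >     from google.appengine.ext import db
> >
> >     class JourneyIndex(db.Model):
> >         # key_name is a time bucket such as "201102151430",
> >         # "2011021514", or "20110215", depending on resolution.
> >         # Each list entry is a flattened (car id, journey id,
> >         # start time) tuple, e.g. "car_id|journey_id|start_iso".
> >         entries = db.StringListProperty(indexed=False)
> >
> >     def get_index_buckets(bucket_names):
> >         # One batch get fetches every bucket covering the query window.
> >         keys = [db.Key.from_path('JourneyIndex', n) for n in bucket_names]
> >         return [b for b in db.get(keys) if b is not None]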
> >
> > Robert
> >
> > On Mon, Feb 14, 2011 at 18:53, nickmilon <nickmi...@gmail.com> wrote:
> > > I have faced the same kind of problem some time ago.
> > > I tried some of the solutions suggested here (in-memory sort and
> > > filtering, encoding things into keys, etc.) and benchmarked those
> > > for both latency and CPU cycles using some test data of around 100K
> > > entities.
> > > Another approach I have taken is encoding the date as an integer
> > > (days since start of epoch, or days since start of year, and the same
> > > for hour of day or month, depending on how much detail you need in
> > > your output) and saving this into a property. This way you turn your
> > > date query filter into an equality-only filter, which does not even
> > > need an index to be specified; then you can sort or filter on other
> > > properties.
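> > >
> > > A minimal sketch of that encoding (the model and property names here
> > > are hypothetical):
> > >
> > >     import datetime
> > >     from google.appengine.ext import db
> > >
> > >     EPOCH = datetime.date(1970, 1, 1)
> > >
> > >     class Journey(db.Model):
> > >         start = db.DateTimeProperty()
> > >         day = db.IntegerProperty()  # days since epoch, set on put()
> > >
> > >     def day_number(dt):
> > >         return (dt.date() - EPOCH).days
> > >
> > >     # The date filter is now equality-only, so no inequality filter
> > >     # is spent on the date and other properties remain filterable.
> > >     q = Journey.all().filter('day =', day_number(
> > >         datetime.datetime(2011, 2, 14)))
> > >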
> > > Benchmarking this latest solution I have found that when the filtered
> > > result set is a small fraction of the unfiltered original set, it is
> > > 1+ order of magnitude faster and more CPU-efficient. In the worst
> > > case, when filtering does not reduce the result set, latency and CPU
> > > usage were comparable to the previous solutions.
> >
> > > Hope this helps, or did I miss something?
> > > Happy coding :-)
> >
> > > On Feb 14, 11:51 pm, Stephen Johnson <onepagewo...@gmail.com> wrote:
> > >> Okay I think I got something that might work. Reverse the StartDate
> > >> and CarId for the key from what I said above so the key would look
> > >> like this: 2011:02:14:17:13:33:123 and the KEYS ONLY query then is:
> >
> > >> select __key__ where __key__ >= MAKEKEY(StartDate + CarId)
> > >>   && __key__ <= MAKEKEY(EndDate + CarId) order by __key__ DESC
> >
> > >> Now, you can use the Async query to start processing. You're going
> > >> to get entries that you're not interested in, but you're only
> > >> getting the key field back and not the whole CarJourney entry, and
> > >> this key/id has the Date and Car ID. So the first time you hit a Car
> > >> ID for each car, you have the ID of the latest CarJourney for that
> > >> car. Once you've found all the car IDs you're looking for, you can
> > >> abort the query, or you'll reach the end of the query results. As
> > >> you're looping, store the KEYs of the entries you're looking for,
> > >> then do a batch GET on memcache to retrieve as many Car (you've got
> > >> the car id) and CarJourney (you've got the carjourney id) entries as
> > >> might be stored there, and then for any that you didn't get from
> > >> memcache, do a batch GET on the datastore using the keys/ids you have.
> >
> > >> I think that if you memcache things appropriately and use the batch
> > >> gets for memcache and datastore, then this might just work for you.
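> >
> > >> A rough sketch of that batch-get fallback (the helper name is made
> > >> up, and this assumes the Python APIs rather than Java):
> >
> > >>     from google.appengine.api import memcache
> > >>     from google.appengine.ext import db
> >
> > >>     def get_with_memcache(keys):
> > >>         # One batch get against memcache, then one batch datastore
> > >>         # get for the misses; refresh the cache with what we fetched.
> > >>         ids = [str(k) for k in keys]
> > >>         cached = memcache.get_multi(ids)
> > >>         misses = [k for k, i in zip(keys, ids) if i not in cached]
> > >>         fetched = [e for e in db.get(misses) if e is not None]
> > >>         memcache.set_multi(dict((str(e.key()), e) for e in fetched))
> > >>         return list(cached.values()) + fetched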
> >
> > >> Let me know what you think. It's an interesting problem,
> > >> Stephen
> >
> > >> On Mon, Feb 14, 2011 at 2:12 PM, Stephen Johnson
> > >> <onepagewo...@gmail.com> wrote:
> >
> > >> > Or maybe it blocks on different result sets, just not on getting
> > >> > the next fetch block? Hmmm. Sounds like a tough problem.
> >
> > >> > On Mon, Feb 14, 2011 at 2:09 PM, Stephen Johnson
> > >> > <onepagewo...@gmail.com> wrote:
> >
> > >> >> Are you using .asList (which I think blocks like you describe)? I
> > >> >> thought asIterable or asIterator wasn't supposed to (if you're
> > >> >> using Java).
> >
> > >> >> On Mon, Feb 14, 2011 at 12:38 PM, Edward Hartwell Goose
> > >> >> <ed.go...@gmail.com> wrote:
> >
> > >> >>> Hi Calvin & Stephen,
> >
> > >> >>> Thanks for the ideas.
> >
> > >> >>> Calvin:
> > >> >>> We can't do the filtering in memory. We potentially have a car
> > >> >>> making a journey every 3 seconds (the car analogy isn't so
> > >> >>> good...), and we could have up to 2,000 cars.
> >
> > >> >>> We need to be able to look back up to 2 months, so it could be
> > >> >>> up to 1.8 billion rows in this table.
> >
> > >> >>> Stephen:
> > >> >>> That's an interesting idea. However, the asynchronous API
> > >> >>> actually fires the requests sequentially, it just doesn't block
> > >> >>> (or at least, that's my experience).
> >
> > >> >>> So, at the moment we fire off 1 query (which Google turns into
> > >> >>> 2) for each site. And although the method call returns instantly,
> > >> >>> it still takes ~5 seconds in total with basic test data. If each
> > >> >>> call takes 12ms, we still have to wait 24 seconds for 2,000 sites.
> >
> > >> >>> So, the first call starts at time 0, the second call starts at
> > >> >>> 0+12, the third at 0+12+12... etc. With 2,000 sites, this works
> > >> >>> out to about 24 seconds. Once you've added in the overheads and
> > >> >>> getting the list of Cars in the first place, it's too long.
> >
> > >> >>> If we could start even 100 queries at the same time at time 0,
> > >> >>> that'd be superb. We thought we could do it with multithreading,
> > >> >>> but that's not allowed on App Engine.
> >
> > >> >>> Finally - I've also posted this on StackOverflow -
> > >> >>> http://stackoverflow.com/questions/4993744/selecting-distinct-entitie...
> >
> > >> >>> I'll try and keep both updated.
> >
> > >> >>> Any more thoughts welcome!
> > >> >>> Ed
> >
> > >> >>> On Feb 14, 6:47 pm, Calvin <calvin.r...@gmail.com> wrote:
> > >> >>> > Can you do filtering in memory?
> >
> > >> >>> > This query would give you all of the journeys for a list of
> > >> >>> > cars within the date range:
> > >> >>> > carlist = ['123','333','543','753','963','1236']
> > >> >>> > start_date = datetime.datetime(2011, 1, 30)
> > >> >>> > end_date = datetime.datetime(2011, 2, 10)
> >
> > >> >>> > journeys = Journey.all().filter('start >', start_date).filter(
> > >> >>> >     'start <', end_date).filter('car IN', carlist).order(
> > >> >>> >     '-start').fetch(100)
> > >> >>> > len(journeys)
> > >> >>> > 43 # <- since it's less than 100 I know I've gotten them all
> >
> > >> >>> > then since the list is sorted I know the first entry per car
> > >> >>> > is the most recent journey:
> >
> > >> >>> > results = {}
> > >> >>> > for journey in journeys:
> > >> >>> > ...   if journey.car in results:
> > >> >>> > ...     continue
> > >> >>> > ...   results[journey.car] = journey
> >
> > >> >>> > len(results)
> > >> >>> > 6
> >
> > >> >>> > for result in results.values():
> > >> >>> > ...   print("%s : %s" % (result.car, result.start))
> > >> >>> > 753 : 2011-02-09 12:38:48.887976
> > >> >>> > 1236 : 2011-02-06 13:59:35.221003
> > >> >>> > 963 : 2011-02-08 14:03:54.587609
> > >> >>> > 333 : 2011-02-09 10:40:09.466700
> > >> >>> > 543 : 2011-02-09 15:28:53.197123
> > >> >>> > 123 : 2011-02-09 14:09:02.680870
> >
