[google-appengine] A difficult app engine optimisation problem - selecting distinct entities across a large table

Edward Hartwell Goose Mon, 14 Feb 2011 06:59:48 -0800

Hi,

I was wondering if anyone can help me with this problem. We have an
idea we'd like to implement, and we're currently unable to do this
efficiently.


I've anonymised the data as best as possible, but the structure is the
same.

We have two entities, Car and CarJourney.

Each Car has 0 to many CarJourney's.

Each Car Journey has (amongst other properties) a date associated with
it - the date the journey was started.

I wish to query by time over car journeys. I'll have two times, a
start date and an end date, where start date <= endDate, and I want to
receive the most recently started journey in that period.

So, if I had a particular car in mind, say car 123, I'd write a query
that limits by Car.key and Car.startDate, where Car.key == 123 and
Journey.startDate >= startDate and Journey.startDate <= endDate with
an ordering on Journey.startDate descending and a limit of 1.

e.g. Car A has 3 journeys, taken on 1st, 2nd and the 3rd of the month.
The query start date is 1st and the query end date is the 2nd. The
result of this query would be one Car journey, the 2nd.

Once the result of that query is returned, a very small amount of
processing is done to return a result to the user.

That's the easy bit.

But, instead of over 1 Car, I want a list of cars, where the list
contains N keys to cars.

So, I want to run the above query N times, once for every car. And I
want the latest journey for each car.

Because the time range is flexible (and thus can't be known
beforehand) we can't implement a "isMostRecent" flag, because while it
might be the most recent for now, it might not be the most recent for
the specified date parameters.

We also need to ensure that this returns promptly (current queries are
around the 3-5 second mark for a small set of data) as this goes
straight back to the user. This means that we can't use task queues,
and because the specified dates are arbitrary we can't implement mass
indexing of "isWithinDate" fields.

We tried using an async query, but because the amount of processing is
negligible the bottleneck is still the queries on the datastore
(because the async api still sends the requests synchronously, it just
doesn't block).

Ideally, we'd implement this as a select on car journeys ordered by
startDate where the Car.key is distinct, but we can't seem to pull
this off in GAE.

There are numerous small optimisations we can make (for example, some
MemCaching of repeated queries) but none have made a significant dent
in our query time. And MemCaching can only help for a maximum of 1-2
minutes (due to the inevitable forward march of time!)

Any ideas are most welcome and highly appreciated.

Thanks,
Ed




-- 
You received this message because you are subscribed to the Google Groups 
"Google App Engine" group.
To post to this group, send email to google-appengine@googlegroups.com.
To unsubscribe from this group, send email to 
google-appengine+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/google-appengine?hl=en.

[google-appengine] A difficult app engine optimisation problem - selecting distinct entities across a large table

Reply via email to