On Dec 1, 9:12 pm, Eric Rannaud <eric.rann...@gmail.com> wrote:
> On Tue, Dec 1, 2009 at 11:02 AM, Stephen <sdea...@gmail.com> wrote:
> > On Dec 1, 9:55 am, Eric Rannaud <eric.rann...@gmail.com> wrote:
> >>     long t0 = System.currentTimeMillis();
> >>     qmsgr = (List<MessageS>) qmsg.execute(lo, hi);
> >>     System.err.println("getCMIdRange:qmsg: "
> >>         + (System.currentTimeMillis() - t0));
>
> > Are you fetching all 128 entities in one batch? If you don't, the
> > results are fetched in batches of 20, incurring extra disk reads and
> > RPC overhead.
>
> > Not sure how you do that with the Java API, but with Python you pass
> > 128 to the .fetch() method of a query object.
>
> As far as I can tell, there is no equivalent in the Java API. The
> query.execute() call returns a collection that is meant to contain
> all the results. I don't know how the Collection returned by
> query.execute() is implemented. Google may well manage the fetching
> in batches internally, behind the List<MessageS> interface, but that
> would be nasty for performance.


Something like this?


import com.google.appengine.api.datastore.*;
import java.util.List;

DatastoreService datastore =
    DatastoreServiceFactory.getDatastoreService();

Query query = new Query("MessageS");
query.addFilter("id", Query.FilterOperator.GREATER_THAN_OR_EQUAL, 0);

// withLimit(128) asks for all 128 results instead of the default batches.
List<Entity> messages = datastore.prepare(query)
    .asList(FetchOptions.Builder.withLimit(128));

You might also have to tweak chunkSize and/or prefetchSize, or ask on
the Java list.
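
For instance, a rough sketch (untested; chunkSize and prefetchSize are
real knobs on FetchOptions, and 128 is just chosen to match the limit):

List<Entity> messages = datastore.prepare(query)
    .asList(FetchOptions.Builder.withLimit(128)
        .chunkSize(128)       // fetch the results in one chunk of 128
        .prefetchSize(128));  // prefetch all 128 with the first RPC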


> A last point: the field 'id' and the PrimaryKey of the entity MessageS
> are effectively uncorrelated (with respect to their ordering). The
> PrimaryKey is a String containing an MD5 hash of the content; the 'id'
> is a long assigned incrementally.


If you are querying mostly by id, then it may make sense to make the
primary key an integer id and store the hash as an ordinary property.
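
Since you seem to be using JDO, that might look something like this
(just a sketch; 'contentMd5' is a made-up field name):

import javax.jdo.annotations.IdGeneratorStrategy;
import javax.jdo.annotations.PersistenceCapable;
import javax.jdo.annotations.Persistent;
import javax.jdo.annotations.PrimaryKey;

@PersistenceCapable
public class MessageS {
    // Numeric primary key: with IDENTITY the datastore assigns the id;
    // to keep assigning your own incremental ids, you would use a
    // Key-typed key and build it from the long yourself.
    @PrimaryKey
    @Persistent(valueStrategy = IdGeneratorStrategy.IDENTITY)
    private Long id;

    // The MD5 hash of the content, demoted to an ordinary property.
    @Persistent
    private String contentMd5;
}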

Then you would have two options for fetching: a query like before,
but this time against the __key__ pseudo-property, which should be
faster than going through the index on the 'id' property; or a direct
'get' by key, which should be faster than any query:

messages = MessageS.get_by_id(range(1, 129))  # datastore ids start at 1

(Java has similar APIs)
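
In Java, the same two options look roughly like this with the low-level
datastore API (untested sketch; the 1..128 id range is just carried over
from the example above):

import com.google.appengine.api.datastore.*;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();

// Option 1: a key-range query against the __key__ pseudo-property.
Query q = new Query("MessageS");
q.addFilter(Entity.KEY_RESERVED_PROPERTY,
    Query.FilterOperator.GREATER_THAN_OR_EQUAL,
    KeyFactory.createKey("MessageS", 1L));
q.addFilter(Entity.KEY_RESERVED_PROPERTY,
    Query.FilterOperator.LESS_THAN_OR_EQUAL,
    KeyFactory.createKey("MessageS", 128L));
List<Entity> byQuery = datastore.prepare(q)
    .asList(FetchOptions.Builder.withLimit(128));

// Option 2: a direct batch get by key, no query at all.
List<Key> keys = new ArrayList<Key>();
for (long id = 1; id <= 128; id++) {
    keys.add(KeyFactory.createKey("MessageS", id));
}
Map<Key, Entity> byGet = datastore.get(keys);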



> I should say that a query with 1 result takes about 30 ms. 128 * 30 =
> 3840 ms. That's pretty close to what I'm seeing for 128, indicating
> linear scaling in the number of entities, which would be really bad,
> and unexpected.
>
> It's really hard to guess what's going on internally, without any
> visibility of the architecture.


This is a recent change. It used to be that you were charged whatever
api_cpu it took to run your query, as measured on the machines. Now
there seems to be an algorithm that generates the cost based on your
entities and query type, so it will be the same from query to query.

This is good because your costs no longer jump 30% just because
Google's infrastructure is having a bad day; the incentives used to be
all wrong. The bad part is that Google didn't announce the change. Are
the api_cpu costs exactly the same as before? If not, it is an
unannounced price increase (or decrease).


> Has anybody looked (publicly) at datastore performance depending on
> query size, locality, etc? If not, I might try to gather some
> extensive data, and write it up.


It would be nice to work out what the algorithm for api_cpu is...
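
If you do gather that data, the QuotaService in the SDK can read the
CPU counters from inside a request; a rough sketch (assuming the
megacycle counters behave as documented):

import com.google.appengine.api.quota.QuotaService;
import com.google.appengine.api.quota.QuotaServiceFactory;

QuotaService quota = QuotaServiceFactory.getQuotaService();

// API megacycles consumed so far in this request.
long before = quota.getApiTimeInMegaCycles();

// ... run the query under test here ...

long cost = quota.getApiTimeInMegaCycles() - before;
System.err.println("api_cpu: " + cost + " megacycles ("
    + quota.convertMegacyclesToCpuSeconds(cost) + " cpu-s)");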
