Re: [google-appengine] Re: Datastore is slow on queries involving many entities, but a smallish dataset

Eric Rannaud Tue, 01 Dec 2009 13:12:50 -0800

On Tue, Dec 1, 2009 at 11:02 AM, Stephen <sdea...@gmail.com> wrote:
> On Dec 1, 9:55 am, Eric Rannaud <eric.rann...@gmail.com> wrote:
>>     long t0 = System.currentTimeMillis();
>>     qmsgr = (List<MessageS>) qmsg.execute(lo, hi);
>>     long elapsed = System.currentTimeMillis() - t0;
>>     System.err.println("getCMIdRange:qmsg: " + elapsed);
>
> Are you fetching all 128 entities in one batch? If you don't, the
> result is fetched in batches of 20, incurring extra disk reads and rpc
> overhead.
>
> Not sure how you do that with the Java API, but with python you pass
> '128' to the .fetch() method of a query object.


As far as I can tell, there is no such equivalent in the Java API. The
query.execute() statement returns a collection that is meant to
contain all the results. I don't know how they implement the
Collection object returned by query.execute(). Google may well manage
that in batches internally, inside the object with interface
List<MessageS>, but that would be nasty for performance.
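To put a number on why batching would hurt: if results really do come back in fixed-size batches, the number of round trips is just a ceiling division. Here is a minimal sketch, assuming the batch size of 20 that Stephen mentions for the Python API (whether the Java API behaves the same way is exactly what I don't know). The class and method names are made up for illustration.

```java
class BatchRoundTrips {
    /** Number of fetches needed to return `results` entities at `batchSize` entities per fetch. */
    static int roundTrips(int results, int batchSize) {
        if (results < 0 || batchSize <= 0) {
            throw new IllegalArgumentException("results must be >= 0 and batchSize > 0");
        }
        return (results + batchSize - 1) / batchSize; // ceiling division
    }

    public static void main(String[] args) {
        // Batch size 20 is an assumption taken from the quoted message.
        System.out.println(roundTrips(128, 20));  // 7
        System.out.println(roundTrips(128, 128)); // 1
    }
}
```

So 128 entities in batches of 20 would mean 7 fetches instead of 1.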

I should say that a query with 1 result takes about 30ms. 128*30 =
3840 ms. That's pretty close to what I'm seeing for 128, indicating a
linear scaling in the number of entities. Which would be really bad,
and unexpected.

It's really hard to guess what's going on internally, without any
visibility into the architecture.

To see the impact of number of entities on response time, I did some
systematic testing:

Querying elements [0,10), [0,10), [0,10), [0,20), [0,20), [0,20),
[0,30), [0,30), [0,30), ... [0,260), [0,260), [0,260), i.e., growing
the range by increments of 10 and running each query three times in
quick succession, actually performs pretty well: the largest query,
returning 260 entities, takes 300ms. So there may be some kind of
caching happening. I didn't see that caching behavior earlier, but I
wasn't issuing queries in such quick succession.

But if I hit the datastore at random, i.e., [X+0,X+10), [X+0,X+20),
[X+0,X+30), ... [X+0,X+260), where X is random and different for
each request, 0 <= X < 500000, then pretty much all the queries take
between 1s and 4s, and we're back to more or less linear scaling in
the number of entities fetched. (With a query returning a single
entity taking 3s every so often.)
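For anyone who wants to reproduce the random-offset test, the request pattern can be sketched roughly as follows. This is only an outline under assumptions: runQuery() is a hypothetical placeholder for the actual datastore query on 'id', and the class and helper names are invented for this example. The timer reads the clock a second time after the call returns.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class RangeTimingSketch {
    /** Half-open range [lo, hi) over the 'id' field. */
    static final class Range {
        final long lo;
        final long hi;
        Range(long lo, long hi) { this.lo = lo; this.hi = hi; }
    }

    /** Builds ranges of size step, 2*step, ..., maxSize, drawing a fresh random X for each one. */
    static List<Range> randomRanges(Random rnd, int step, int maxSize, int maxOffset) {
        List<Range> ranges = new ArrayList<>();
        for (int size = step; size <= maxSize; size += step) {
            long x = rnd.nextInt(maxOffset); // 0 <= x < maxOffset
            ranges.add(new Range(x, x + size));
        }
        return ranges;
    }

    /** Wall-clock time of one call in milliseconds. */
    static long timeMillis(Runnable call) {
        long t0 = System.nanoTime();
        call.run();
        return (System.nanoTime() - t0) / 1_000_000L;
    }

    /** Hypothetical placeholder: a real test would query entities with lo <= id < hi here. */
    static void runQuery(long lo, long hi) {
        // intentionally empty in this sketch
    }

    public static void main(String[] args) {
        for (Range r : randomRanges(new Random(), 10, 260, 500000)) {
            long ms = timeMillis(() -> runQuery(r.lo, r.hi));
            System.err.println("[" + r.lo + "," + r.hi + "): " + ms + " ms");
        }
    }
}
```

With the placeholder left empty, the printed times will all be near zero; the point is only the shape of the requests.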

It does make some sense for random queries to take longer than a bunch
of queries in the same area of the datastore (except that there is no
guarantee that locality in the datastore is related to the ordering
with respect to the field 'id'). But with response time scaling nearly
linearly in the number of entities, at say 30 ms per entity of average
size 463 B, that's an implied bandwidth in the backend of about
15 KB/s (roughly 120 kbit/s). Which is not very good.
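The bandwidth estimate is a single division, spelled out below so the units are unambiguous. The inputs (463 B per entity, 30 ms per entity) are the rough figures from this message, not precise measurements, and the class and method names are made up for illustration.

```java
class ImpliedBandwidth {
    /** Bytes per second if one entity of `entityBytes` arrives every `msPerEntity` milliseconds. */
    static double bytesPerSecond(double entityBytes, double msPerEntity) {
        return entityBytes / (msPerEntity / 1000.0);
    }

    /** Converts bytes per second to kilobits per second (1 kbit = 1000 bits). */
    static double kilobitsPerSecond(double bytesPerSecond) {
        return bytesPerSecond * 8.0 / 1000.0;
    }

    public static void main(String[] args) {
        double bps = bytesPerSecond(463, 30);
        System.out.println(bps / 1000.0 + " KB/s");
        System.out.println(kilobitsPerSecond(bps) + " kbit/s");
    }
}
```

That works out to roughly 15.4 KB/s, or about 123 kbit/s.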

One last point: the field 'id' and the PrimaryKey of the entity
MessageS are effectively uncorrelated (with respect to their
ordering). The PrimaryKey is a String containing an MD5 hash of the
content; the 'id' is a long assigned incrementally.

Has anybody looked (publicly) at how datastore performance depends on
query size, locality, etc.? If not, I might try to gather more
extensive data and write it up.

Thanks,
Eric.

