Preface: Please note that I'm not speaking for Google at all in this
note, and a lot of what I've written is speculation based on what I've
read in various GAE docs plus some meager knowledge of how relational
DBs generally work.  And yes, I know the datastore isn't a relational
DB, but I believe its indexing implementation likely runs into many
of the same problems you have when indexing relational data, although
that assumption could be completely wrong.

From what I can tell, the update bottleneck you're referring to applies
to updating what you would think of as a single record, i.e. one
instance of your User persisted as a single denormalized row in a
relational schema.  I suspect this bottleneck is due to the datastore
architecture and the way that data updates are accumulated (possibly
grouped/keyed by PK) in a queue.  That queue is probably also read from
like a cache when read requests come in before the data has been
flushed to the actual storage medium and replicated to the other
datacenters.

So if each of your users were updating their own User records, I don't
believe you'd experience that limitation, which may be an artifact of
how those in-memory queue/cache structures are managed/locked during
updates (i.e. a new update for a record may be held until the previous
one has been flushed from the queue to the storage medium, to avoid
having to merge/reconcile records in the queue).  If they were all
updating a single shared record, then I think you'd hit this pretty
quickly.
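
To put rough numbers on that difference, here is a toy model in plain Python (nothing GAE-specific).  It assumes updates to one record/entity group commit strictly one at a time while different groups proceed independently, which is my guess rather than documented behavior.  The 5/s default is just the midpoint of the 1-10/s range quoted further down the thread, and seconds_to_drain is a made-up name.

```python
from collections import Counter


def seconds_to_drain(update_targets, writes_per_sec_per_group=5.0):
    """Estimate how long until every pending update has committed.

    Assumption (a guess, not documented behavior): updates aimed at the
    same group commit serially, while different groups run in parallel.
    """
    if not update_targets:
        return 0.0
    pending = Counter(update_targets)
    # The busiest group determines when the last update lands.
    return max(pending.values()) / writes_per_sec_per_group


# 25 users all writing to one shared record/group.
shared = seconds_to_drain(["shared"] * 25)

# 25 users each writing to their own User group.
separate = seconds_to_drain(["user%d" % i for i in range(25)])

print(shared, separate)
```

Under those assumptions the shared case takes 5 seconds to drain and the separate case a fraction of a second, no matter how many users there are.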

Let's say, though, that your users are updating separate records.  As
your data size grows, you will probably see your update throughput
decrease as other factors become dominant, and I believe this will
depend primarily on the number and composition of the indexes on
your data as well as the number of entities persisted.  To me, this is
the much riskier unknown: your average index structure is hard to
update piecewise in parallel because the index must let you
order/search all of the records' indexed columns.  In an RDBMS
like SQL Server or Oracle, you'd see some level of index locking take
place during each transaction (maybe one page of an index), which
allows concurrent updates to different sections of an index.  Once the
updates are committed, the transaction ends and the locks are
released.
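
As a rough illustration of page-level locking, here is a small Python sketch.  It is a toy, not how SQL Server, Oracle or the datastore actually do it: the PagedIndex class, the fixed key ranges and the integer keys are all invented for the example, and each lock is held only for a single put rather than for a whole transaction.

```python
import threading


class PagedIndex:
    """Toy ordered index split into fixed key ranges ("pages"), one lock each."""

    def __init__(self, page_count=4, key_space=100):
        self.page_count = page_count
        self.page_size = key_space // page_count
        self.pages = [dict() for _ in range(page_count)]
        self.locks = [threading.Lock() for _ in range(page_count)]

    def page_for(self, key):
        # Non-negative integer keys map to a page by range; overflow goes last.
        return min(key // self.page_size, self.page_count - 1)

    def put(self, key, value):
        page = self.page_for(key)
        # Only this page is locked, so writers on other pages are not blocked.
        with self.locks[page]:
            self.pages[page][key] = value

    def items(self):
        # Pages cover ascending key ranges, so this yields globally sorted order.
        out = []
        for page, lock in zip(self.pages, self.locks):
            with lock:
                out.extend(sorted(page.items()))
        return out


index = PagedIndex()
writers = [
    threading.Thread(target=index.put, args=(k, "row%d" % k))
    for k in (3, 30, 60, 90)
]
for w in writers:
    w.start()
for w in writers:
    w.join()
print(index.items())
```

The four writers land on four different pages, so none of them waits on another.  Two writers hitting keys 3 and 4 would serialize on the same page lock.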

In relational persistence systems, this gets slower as the indexes
become larger and is usually overcome with a technique like
partitioning.  If you aren't familiar with it, partitioning sort of
gives you a top level of your index tree where the data is actually
spread into n groups of tables/indexes depending on some value in each
record.  You usually pick a partition key so that the data volume in
each partition is naturally balanced, because rebalancing across
partitions is expensive.  I'm not sure that any similar mechanism has
been exposed in the GAE datastore right now, so a single index declared
for an entity type is probably realized as one big index.  I would
hope that there's sub-index granularity for locking during updates,
but I'm actually guessing that's not the case for a couple of reasons:

1) With most relational systems, you need to periodically rebuild the
index or at least refresh the index statistics.  I like to simplify
this and think of rebuilding as rebalancing the data tree for optimal
access speed, while refreshing statistics typically just helps the
query optimizer decide whether use of an index should be preferred.  On
GAE, though, they require you to have an index for each combination of
query parameters, so I suspect that statistics don't come into play.
I also haven't seen a "rebuild my indexes" function in the admin UI,
although admittedly I haven't looked for one too hard.  So I wonder
whether they're trying to keep the data tree somewhat well balanced
during each data update, which would potentially require the entire
index to be locked.

2) I also haven't read anything yet about deadlock situations on GAE
which can happen surprisingly easily if you're updating multiple
indexes with enough concurrency and are using page locking.  If you
were designing the GAE datastore service, the way to avoid that
situation would be to lock all indexes on each data update in the same
order every time.  You'd sacrifice a lot of throughput, but you'd
never hit a deadlock so I suspect they've done something like this
behind the scenes unless people just aren't using GAE heavily enough
yet or the good people of the GAE have used some special sauce in the
datastore service impl.
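
The lock-ordering idea in 2) can be sketched in a few lines of Python.  This is only an illustration of the general technique, not a claim about what the datastore does: the IndexSet class and the index names are hypothetical.  It locks just the indexes an update touches, but always in sorted name order; locking every index on every update is the blunter version of the same idea.

```python
import threading


class IndexSet:
    """Toy set of indexes whose locks are always taken in one global order."""

    def __init__(self, names):
        self.locks = {name: threading.Lock() for name in names}
        self.entries = {name: [] for name in names}

    def update(self, touched, value):
        # Sorting gives every writer the same acquisition order, so two
        # writers can never each hold a lock the other is waiting for.
        ordered = sorted(set(touched))
        locks = [self.locks[name] for name in ordered]  # fails before locking
        for lock in locks:
            lock.acquire()
        try:
            for name in ordered:
                self.entries[name].append(value)
        finally:
            for lock in reversed(locks):
                lock.release()


indexes = IndexSet(["by_email", "by_name"])


def writer(order, tag):
    for i in range(1000):
        indexes.update(order, (tag, i))


# The two writers name the same indexes in opposite orders.
a = threading.Thread(target=writer, args=(["by_name", "by_email"], "a"))
b = threading.Thread(target=writer, args=(["by_email", "by_name"], "b"))
a.start()
b.start()
a.join()
b.join()
print(len(indexes.entries["by_email"]), len(indexes.entries["by_name"]))
```

Without the sort, each writer would acquire locks in the order it was given, and the two threads above could end up waiting on each other forever.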

So I guess what I'm trying to say is that I don't believe you
should be satisfied with any particular bit of performance data from
another application, because your mileage will almost certainly vary.
If you really want to know how your application would
perform, and want to find out before writing the whole app and sharing
it with a billion users, I would recommend a very empirical approach:

I'd write a sample app with an entity group whose entity widths and
indexes are those that you think will be representative of your
deployed application, and then add a simple test harness that will:

a) seed data to a point that you think is representative
b) update and query your data in what you believe will be a worst case
scenario and then record the times

I think the resulting curve of performance you see will be highly
dependent on how you vary the seed data size and the number of
indexes.  Of course there are more dimensions than that, such as the #
of concurrent read operations and the # of concurrent write
operations, that you can vary as well depending on what your
performance requirements are.
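
The harness described above could start out as something like the Python sketch below.  Everything in it is a stand-in: FakeStore is an in-memory toy, not the datastore API, and run_trial and sweep are made-up names.  To measure GAE itself you would replace the FakeStore calls with real datastore puts and queries from inside your app.

```python
import bisect
import time


class FakeStore:
    """In-memory stand-in for a datastore with several sorted indexes."""

    def __init__(self, index_count):
        self.rows = {}
        # Each "index" is a sorted list of (indexed_value, key) pairs.
        self.indexes = [[] for _ in range(index_count)]

    def put(self, key, values):
        old = self.rows.get(key)
        for i, index in enumerate(self.indexes):
            if old is not None:
                # Drop the stale index entry before adding the new one.
                del index[bisect.bisect_left(index, (old[i], key))]
            bisect.insort(index, (values[i], key))
        self.rows[key] = values


def run_trial(seed_size, index_count, update_count=200):
    """Seed the store, then return the average seconds per timed update."""
    store = FakeStore(index_count)
    for k in range(seed_size):  # (a) seed data
        store.put(k, [(k * 7919 + i) % 1000 for i in range(index_count)])
    started = time.perf_counter()
    for n in range(update_count):  # (b) timed updates
        store.put(n % seed_size,
                  [(n * 104729 + i) % 1000 for i in range(index_count)])
    return (time.perf_counter() - started) / update_count


def sweep(seed_sizes, index_counts):
    """Run one trial per (seed size, index count) combination."""
    return {(s, c): run_trial(s, c) for s in seed_sizes for c in index_counts}


results = sweep([100, 1000], [1, 3])
for (size, count), secs in sorted(results.items()):
    print(size, count, secs)
```

Plotting the results against seed size, one line per index count, gives the kind of curve I mean.  Concurrency would be a further dimension to add once the single-writer numbers look sane.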

I hope this is somewhat helpful and I also hope that it's not totally
incorrect and misleading since, as I said, it's all rampant
speculation based on somewhat limited publicly available data.

-Michael

P.S.  Of course, if anyone has data including # of records, # and
composition of indexes, # reads per hour, # writes per hour and
latency per txn, I'd be fascinated to hear about it too!

On Oct 19, 4:01 pm, Diana Cruise <diana.l.cru...@gmail.com> wrote:
> This is exactly what I am talking about...in my case the User and
> UserAddr are both in the same Entity Group.  So, are you saying that
> my application which has a global presence in GAE can only support 25
> simultaneous Users performing this update in under 5 seconds?
>
> Again, I take 1-10 requests per second response and go with the avg of
> 5/s.  Add up 25 Users simultaneously hitting this Entity Group and
> that consumes a full 5 seconds.  So, if you have 25 Users doing the
> same update over and over they will each have about a 5 second
> response.
>
> I know I am wrong because this is way LOW for a Google platform or any
> other...I just am NOT hearing or seeing numbers that say otherwise.
>
> If you clarify for me that this Entity Group performance stat of 1-10/
> s is granular to the Row then we're on to something...  That would
> tell me that my scenario above only applies if ALL Users were logged
> into the same account!!!  If the Entity Group performance stat is
> granular to the Row then that would mean an infinite number of Users
> would average 5 updates per second.  Please tell me this is TRUE!
>
> Otherwise, if this Entity Group performance stat of 1-10/s is granular
> to the whole group (i.e. ALL rows) then the performance is dire as I
> described originally.  Please tell me it isn't so!
>
> On Oct 19, 11:10 am, Don Schwarz <schwa...@google.com> wrote:
>
> > It's 1-10 updates per second per Entity Group:
> > http://code.google.com/appengine/docs/java/datastore/transactions.htm...
>
> > You need to break your design up into Entity Groups according to which
> > pieces will need to be updated in a single transaction.  In the best case,
> > each entity can be its own entity group and the only restriction is that you
> > update each entity no more often than 1-10 times per second.
>
> > For example, it would not be a good idea to store a global counter in one
> > entity unless you planned to update it no more than 1-10 times per second.
> >  The solution to this is to use sharded counters:
>
> >http://code.google.com/appengine/articles/sharding_counters.html
>
> > On Mon, Oct 19, 2009 at 11:06 AM, Diana Cruise <diana.l.cru...@gmail.com> wrote:
>
> > > Shawn, the 1-10s per Update was cited from Max Ross' I/O Video and
> > > I've seen it in various talks/docs along the way...
>
> > > On Oct 19, 11:03 am, Diana Cruise <diana.l.cru...@gmail.com> wrote:
> > > > Shawn, the docs link you cite is riddled with numbers (easy to get
> > > > lost in them and what they truly mean)...which is why I included the
> > > > simplest of scenarios above, that being to simply add a home
> > > > addressbook entry attached to a User.  Surely someone has a sizeable
> > > > production system today in GAE that could share load results and real
> > > > costs.  If no one does, then that is also very troubling.
>
> > > > Gaurav, I assume too that reading is NOT the problem and by this post
> > > > am hoping to get real-world numbers to a simple update transaction.
> > > > But, we need production app feedback from the most popular apps out
> > > > there.  Is there such a list for Java for GAE yet?  Surely, there are
> > > > large production apps by now?
>
> > > > On Oct 19, 2:17 am, Gaurav <ano...@gmail.com> wrote:
>
> > > > > GAE performs best for simultaneous read operations. So there could be
> > > > > virtually any no.
> > > > > of users reading at the same time, no issue. But when it comes to
> > > > > making updates
> > > > > performance degradation is significant.
> > > > > To get a better understanding of how gae performs under heavy load, I
> > > > > recommend
> > > > > this video :http://www.youtube.com/watch?v=AgaL6NGpkB8
>
> > > > > On Oct 19, 7:09 am, Shawn Brown <big.coffee.lo...@gmail.com> wrote:
>
> > > > > > Hi,
>
> > > > > > > I have read that a particular User can process 1-10 request per
> > > > > > > second.  Is this a limit of the free quota or does paid quota also
> > > > > > > have this limitation.
>
> > > > > > Where and what did you read?
>
> > > > > > That doesn't seem consistent with the published limits.  I guess it
> > > > > > depends on what they were doing with the app.
> > > > > > > http://code.google.com/appengine/docs/quotas.html
>
> > > > > > > Shawn
--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"Google App Engine for Java" group.
To post to this group, send email to google-appengine-java@googlegroups.com
To unsubscribe from this group, send email to 
google-appengine-java+unsubscr...@googlegroups.com
For more options, visit this group at 
http://groups.google.com/group/google-appengine-java?hl=en
-~----------~----~----~----~------~----~------~--~---
