[google-appengine] Re: Bulk data deletion woe

Erik Sun, 14 Nov 2010 20:47:42 -0800

Thanks for the well-thought-out response, numbers, and reality check,
Stephen!  That makes a lot of sense when you consider parallel deletes
and datastore CPU time.


On Nov 14, 9:37 pm, Stephen Johnson <onepagewo...@gmail.com> wrote:
> Thank you for sharing your numbers with us. I think it's a good way for all
> of us to get an idea of how much things cost on the cloud, so here are my
> thoughts.
>
> Even though you had only one shard executing, the shard should be doing batch
> deletes and not one delete at a time. According to the documentation, batch
> deletes can do up to 500 entities in one call and would execute in parallel
> (perhaps not all 500 at once, but with parallelism nonetheless). I would
> assume the shard does about 100 or so at a time (maybe more, maybe less).
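
A minimal sketch of the batching idea described above, written in plain Python so it runs without the App Engine SDK. The helpers `chunked` and `delete_in_batches` and the `delete_fn` callback are hypothetical names used only for illustration; in a real handler `delete_fn` would wrap whatever batch delete call the datastore API provides. The 500 figure is the per-call limit quoted above; the batch size the mapreduce library actually uses is not stated in this thread.

```python
def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def delete_in_batches(keys, delete_fn, batch_size=500):
    """Call delete_fn once per batch and return the number of calls made.

    delete_fn is a hypothetical callback standing in for a datastore
    batch delete; batch_size=500 is the per-call limit quoted above.
    """
    calls = 0
    for batch in chunked(keys, batch_size):
        delete_fn(batch)
        calls += 1
    return calls


# Stand-in for the datastore: just record which keys were "deleted".
deleted = []
n_calls = delete_in_batches(list(range(300000)), deleted.extend)
print(n_calls)  # number of batch calls instead of 300000 single deletes
```

With 300000 keys and 500 per call, the loop issues 600 calls rather than 300000, which is why per-entity latency alone cannot explain the observed wall-clock time.
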
>
> Anyway, a good way to show that some parallelism must be occurring is a
> proof by contradiction. So, let's assume that the shard is in fact doing one
> delete at a time. Looking at the System Status, the latency of a single
> delete on an entity (probably a very simple entity with no composite indexes,
> which would add overhead) is approximately 50ms to 100ms or so.
> If we assume 50ms of latency per delete, and no overhead for mapreduce/shard
> maintenance, spawning additional tasks, etc. (which would add even more
> time), we end up with:
>
>     300000 entities * 0.05 seconds per entity = 15000 seconds
>     15000 seconds / 60 seconds per minute = 250 minutes, or 4 hours 10 minutes
>
> Additionally, if a delete takes approximately 100 milliseconds, then 300000
> entities would take 8 hours 20 minutes to complete.
> Even an unrealistic 25ms per delete is still over two hours.
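
The serial-delete arithmetic above can be checked with a few lines of Python. This is only a sketch of the same back-of-the-envelope model (one delete at a time, no mapreduce overhead); the function names are made up for illustration.

```python
def serial_delete_seconds(entities, latency_ms):
    """Wall-clock seconds if deletes run strictly one at a time."""
    return entities * latency_ms / 1000.0


def as_hours_minutes(seconds):
    """Convert seconds to an (hours, minutes) tuple, rounding down."""
    minutes = int(seconds // 60)
    return divmod(minutes, 60)


# The three per-delete latencies discussed above, for 300000 entities.
for latency_ms in (25, 50, 100):
    secs = serial_delete_seconds(300000, latency_ms)
    hours, minutes = as_hours_minutes(secs)
    print("%d ms per delete: %d h %d min" % (latency_ms, hours, minutes))
```
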
>
> Now remember, this is latency (real time) and not CPU time. Even if
> something has a latency of 50ms, it could still eat up 100ms of API CPU
> time: for example, 50ms to delete the entity and 50ms to update the indexes
> (done in parallel). So if latency time is 4 hours 10 minutes and we just
> double it to approximate API CPU time, we get 8 hours 20 minutes of CPU
> time. If the average delete time for your job was 75ms, then latency time is
> 6 hours 15 minutes and CPU time 12 hours 30 minutes. Your total was 11 hours
> of billed time, so if my logic is sound, the amount you were billed seems
> reasonable.
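
A rough sketch of the CPU-time estimate above. The factor of 2 between latency and API CPU time is the approximation made in the paragraph, not a measured or documented value, and the names below are illustrative only.

```python
ENTITIES = 300000
CPU_FACTOR = 2.0  # assumption from the text: API CPU time is about 2x latency


def estimated_cpu_hours(latency_ms, entities=ENTITIES, factor=CPU_FACTOR):
    """Estimated API CPU hours for `entities` deletes at a given latency."""
    latency_seconds = entities * latency_ms / 1000.0
    return factor * latency_seconds / 3600.0


for latency_ms in (50, 75, 100):
    print("%d ms -> %.2f CPU hours" % (latency_ms, estimated_cpu_hours(latency_ms)))

# Per-delete latency that would yield the 11 billed hours under this model.
implied_ms = 11 * 3600.0 / CPU_FACTOR / ENTITIES * 1000.0
print("11 CPU hours -> about %.0f ms per delete" % implied_ms)
```

Under these assumptions, 11 billed hours corresponds to roughly 66ms per delete, which falls inside the 50ms to 100ms range read off the System Status page.
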
>
> Furthermore, if we look at this from another angle, we find that
> if your delete job took 15 minutes to complete, then:
>
>     300000 entities / 15 minutes = 20000 entities per minute
>     20000 entities per minute / 60 seconds per minute = 333.33 entities per second
>
> So, if 333.33 entities are being deleted per second serially, then the
> average latency would be 3ms per delete, which seems rather unlikely.
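
The reverse calculation can be sketched the same way. Given the observed 15 minutes, the snippet below computes the latency a purely serial job would need, and, as an extra step not in the message above, the average number of deletes that would have to be in flight at once if each one really took 50ms. The function names are hypothetical.

```python
def implied_serial_latency_ms(entities, wall_clock_minutes):
    """Average per-delete latency if all deletes ran one after another."""
    per_second = entities / (wall_clock_minutes * 60.0)
    return 1000.0 / per_second


def implied_parallelism(entities, wall_clock_minutes, latency_ms):
    """Average number of deletes in flight, given a real per-delete latency."""
    return latency_ms / implied_serial_latency_ms(entities, wall_clock_minutes)


print("serial latency needed: %.2f ms" % implied_serial_latency_ms(300000, 15))
print("deletes in flight at 50 ms: %.1f" % implied_parallelism(300000, 15, 50))
```

At 50ms per delete this works out to roughly 17 deletes in flight on average, which is consistent with batched, parallel deletes rather than one delete at a time.
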
>
> My thoughts. Hope it helps (and I hope my math is right),
> Steve
>
> On Sun, Nov 14, 2010 at 2:57 PM, Erik <erik.e.wil...@gmail.com> wrote:
>
> > On Nov 14, 1:32 pm, Stephen Johnson <onepagewo...@gmail.com> wrote:
> > > Why do you say that's silly? If your map reduce task does bulk deletes
> > > and let's say they do 100 at a time, then those 100 deletes are done in
> > > parallel. So that's 100x. So for each second of delete real time you're
> > > getting 100 seconds of CPU time.  You should be pleased that instead of
> > > your task taking 11 hours to delete all your data it took only 15 minutes.
> > > Isn't that scalability? Isn't that what you're looking for? How many
> > > entities did you delete? How many indexes did you have (composite and
> > > single property)?
>
> > This was using only 1 shard per kind that was being deleted, so
> > effectively there should be no parallelism occurring, unless there is
> > something I am missing?
> > Deleted about 300k entities, each with a single indexed collection.
>
> > > On Sun, Nov 14, 2010 at 10:29 AM, Erik <erik.e.wil...@gmail.com> wrote:
>
> > > > If you check in the datastore viewer you might be able to find and
> > > > delete your jobs from one of the tables.  You may also need to go into
> > > > your task queues and purge the default.
>
> > > > On this topic, why does deleting data have such a large difference
> > > > between actual time spent and billed time?
>
> > > > For instance, I had two mapreduce shards running to delete data, which
> > > > took a combined total of 15 minutes, but I was actually charged for
> > > > 11(!) hours.  I know there isn't a 1:1 correlation but a >40x
> > > > difference is a little silly!
>
> > > > On Nov 14, 4:25 am, Justin <justin.worr...@gmail.com> wrote:
> > > > > I've been trying to bulk delete data from my application as described
> > > > > here:
> > > > > http://code.google.com/appengine/docs/python/datastore/creatinggettin...
>
> > > > > This seems to have kicked off a series of mapreduce workers, whose
> > > > > execution is killing my CPU - approximately 5 mins later I have
> > > > > reached 100% CPU time and am locked out for the rest of the day.
>
> > > > > I figure I'll just delete by hand; create some appropriate :delete
> > > > > controllers and wait till the next day.
>
> > > > > Unfortunately the mapreduce process still seems to be running - 10
> > > > > past midnight and my CPU has reached 100% again.
>
> > > > > Is there some way to kill these processes and get back control of my
> > > > > app?
>
> > > > --
> > > > You received this message because you are subscribed to the Google
> > > > Groups "Google App Engine" group.
> > > > To post to this group, send email to google-appengine@googlegroups.com.
> > > > To unsubscribe from this group, send email to
> > > > google-appengine+unsubscr...@googlegroups.com.
> > > > For more options, visit this group at
> > > > http://groups.google.com/group/google-appengine?hl=en.

