In the Java SDK, I think one could subclass DatastoreRecordReader to do a
keys-only query and return null for the entity value, then use that class in
lieu of the normal DatastoreRecordReader when needed. Something similar is
probably possible in Python.
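
For the Python side, here is a rough, untested sketch of that keys-only
approach using the DatastoreKeyInputReader Robert mentions below. The
start_map parameter names, module paths, handler path, and the MyModel kind
are assumptions from memory rather than anything verified against the current
library:

    from mapreduce import control
    from mapreduce import operation as op

    def delete_key(key):
        # The keys-only reader hands the mapper a key, so the full entity
        # is never fetched; just emit a delete operation for it.
        yield op.db.Delete(key)

    # Hypothetical kickoff; adjust the handler/model paths for your app.
    control.start_map(
        name="Delete MyModel by key",
        handler_spec="myapp.delete_key",
        reader_spec="mapreduce.input_readers.DatastoreKeyInputReader",
        mapper_parameters={"entity_kind": "myapp.MyModel"},
        shard_count=4)

That should give the map reduce path the same keys-only benefit as the
deferred approach Eli describes further down.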

On Mon, Nov 15, 2010 at 11:47 AM, Robert Kluin <robert.kl...@gmail.com> wrote:

> In the Python MR libs, there is a DatastoreKeyInputReader input
> reader.  It looks like that is what's used to iterate over the
> entities.
>
> http://code.google.com/p/googleappengine/source/browse/trunk/python/google/appengine/ext/datastore_admin/delete_handler.py#148
>
> Robert
>
> On Mon, Nov 15, 2010 at 13:27, Stephen Johnson <onepagewo...@gmail.com> wrote:
> > Yes, I see what you're saying. Map Reduce would bring over the whole
> > entity even though it isn't needed, and would consume more CPU fetching
> > the entity rather than just the key. It seems like it would be nice to
> > have an option for Map Reduce to hand off only the keys and leave out the
> > entity.
> >
> > On Sun, Nov 14, 2010 at 11:18 PM, Eli Jones <eli.jo...@gmail.com> wrote:
> >>
> >> This is just an anecdotal aside (in other words, I have not bothered to
> >> do any testing or comparison of performance), but I have my own utility
> >> code that I use for batch deletes.
> >> Recently, I decided to wipe out all of the entities for one of my models,
> >> but I was too lazy to look up the exact command I needed to use in the
> >> remote console. So, I just used the new Datastore Admin page to delete
> >> them.  This page uses map reduce jobs to perform deletes.
> >> From what I could tell, the map reduce delete job took up several times
> >> more CPU time (and wall clock time) than my custom delete job usually
> >> took.
> >> My usual utility class uses this method for deletes:
> >> 1. Create a query for all entities in a model with keys_only = True.
> >> 2. Fetch 100 keys.
> >> 3. Issue a deferred task to delete those 100 keys.
> >> 4. Use a cursor to fetch 100 more, and issue deferred deletes until the
> >> query returns no more entities.
> >> This is usually pretty fast, since the only bottleneck is the time it
> >> takes to fetch 100 keys and add the deferred task (a rough sketch of the
> >> pattern follows below).
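> >>
> >> Roughly, the pattern looks like the sketch below (old db API; MyModel is
> >> a placeholder kind, and this is an approximation of the idea rather than
> >> the exact code from my utility class):
> >>
> >>     from google.appengine.ext import db, deferred
> >>
> >>     class MyModel(db.Model):
> >>         """Placeholder kind; substitute the model being wiped."""
> >>
> >>     def delete_all(batch_size=100):
> >>         # 1. Keys-only query, so full entities are never fetched.
> >>         q = MyModel.all(keys_only=True)
> >>         # 2. Fetch the first batch of keys.
> >>         keys = q.fetch(batch_size)
> >>         while keys:
> >>             # 3. Hand the actual delete off to a deferred task.
> >>             deferred.defer(db.delete, keys)
> >>             # 4. Advance with the query cursor and fetch the next batch.
> >>             q.with_cursor(q.cursor())
> >>             keys = q.fetch(batch_size)
> >>
> >> (This assumes the deferred builtin is enabled in app.yaml.)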
> >> The surprising fact was that the default map reduce delete from the
> >> Datastore Admin page took so much more CPU.
> >> So, if you think you'll be doing more bulk deletes in the future, it
> >> might be useful to compare the CPU usage of a map reduce delete (using
> >> keys only and not full entities) to a method that deletes batches of 100
> >> keys using deferred with a query cursor.
> >> Though, deleting 300,000 entities will take up a lot of CPU hours no
> >> matter what method you use.
> >> Like I said, this is anecdotal and there may be no real difference in
> >> performance, but the Datastore Admin delete took up way more CPU time
> >> than it seemed it should have, and I didn't bother to use it or test it
> >> again.
> >>
> >> On Sun, Nov 14, 2010 at 11:47 PM, Erik <erik.e.wil...@gmail.com> wrote:
> >>>
> >>> Thanks for the well-thought-out response, numbers, and reality check,
> >>> Stephen!  That makes a lot of sense when you consider parallel deletes
> >>> and datastore CPU time.
> >>>
> >>> On Nov 14, 9:37 pm, Stephen Johnson <onepagewo...@gmail.com> wrote:
> >>> > Thank you for sharing your numbers with us. I think it's a good way
> >>> > for all of us to get an idea of how much things cost on the cloud, so
> >>> > here are my thoughts.
> >>> >
> >>> > Even though you had one shard executing, the shard should be doing
> >>> > batch deletes and not one delete at a time. From the documentation,
> >>> > batch deletes can do up to 500 entities in one call and would execute
> >>> > in parallel (perhaps not 500 all at once, but with parallelism
> >>> > nonetheless). I would assume the shard would probably do about 100 or
> >>> > so at a time (maybe more, maybe less).
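> >>> >
> >>> > Just to illustrate what a batch delete call looks like (Dummy is a
> >>> > made-up kind purely for this example):
> >>> >
> >>> >     from google.appengine.ext import db
> >>> >
> >>> >     class Dummy(db.Model):  # made-up kind, purely for the example
> >>> >         pass
> >>> >
> >>> >     # Grab up to 500 keys and delete them in one batch call; the
> >>> >     # datastore applies the individual deletes with parallelism.
> >>> >     keys = Dummy.all(keys_only=True).fetch(500)
> >>> >     db.delete(keys)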
> >>> >
> >>> > Anyway, a good way to prove some parallelism must be occurring would
> >>> > be to do a proof by negation. So, let's assume that the shard is in
> >>> > fact doing one delete at a time. Looking at the System Status page,
> >>> > the latency of a single delete on an entity (probably a very simple
> >>> > entity with no composite indexes, which would add additional overhead)
> >>> > is approximately 50ms to 100ms or so. If we assume 50ms of latency per
> >>> > delete, we end up with the following (assuming no overhead for
> >>> > mapreduce/shard maintenance, spawning additional tasks, etc., which
> >>> > would add even more time):
> >>> >
> >>> >     300000 entities * .05 seconds per entity = 15000 seconds
> >>> >     15000 seconds / 60 = 250 minutes, or 4 hours 10 minutes
> >>> >
> >>> > Additionally, if a delete takes approximately 100 milliseconds, then
> >>> > 300000 entities would take 8 hours 20 minutes to complete.
> >>> > Even an unrealistic 25ms per delete is still over two hours.
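> >>> >
> >>> > As a quick sanity check of those per-delete estimates:
> >>> >
> >>> >     ENTITIES = 300000
> >>> >     for latency in (0.025, 0.05, 0.100):  # seconds per delete
> >>> >         hours = ENTITIES * latency / 3600.0
> >>> >         print("%dms per delete -> %.1f hours" % (latency * 1000, hours))
> >>> >     # prints: 25ms -> 2.1 hours, 50ms -> 4.2 hours, 100ms -> 8.3 hours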
> >>> >
> >>> > Now remember, this is latency (real time) and not CPU time. So even
> >>> > if something has a latency of 50ms, it could still eat up 100ms of API
> >>> > CPU time; for example, 50ms to delete the entity and 50ms to update
> >>> > the indexes (done in parallel). So if latency time is 4 hours 10
> >>> > minutes and we just double it to approximate API CPU time, we get over
> >>> > 8 hours of CPU time. If the average delete time for your job was 75ms,
> >>> > then latency time is approximately 6 hours and CPU time 12 hours. Your
> >>> > total was 11 hours of billed time, so if my logic is sound it seems
> >>> > reasonable that the amount you were billed could be correct.
> >>> >
> >>> > Furthermore, if we look at this from another angle, we find that if
> >>> > your delete job took 15 minutes to complete, then:
> >>> >
> >>> > 300000 entities / 15 minutes = 20000 entities per minute
> >>> > 20000 entities per minute / 60 = 333.33 entities per second
> >>> >
> >>> > So, if 333.33 entities are being deleted per second serially, then
> >>> > the average latency would be 3ms per delete, which seems rather
> >>> > unlikely.
> >>> >
> >>> > My thoughts. Hope it helps (and I hope my math is right),
> >>> > Steve
> >>> >
> >>> > On Sun, Nov 14, 2010 at 2:57 PM, Erik <erik.e.wil...@gmail.com> wrote:
> >>> >
> >>> > > On Nov 14, 1:32 pm, Stephen Johnson <onepagewo...@gmail.com> wrote:
> >>> > > > Why do you say that's silly? If your map reduce task does bulk
> >>> > > > deletes, and let's say they do 100 at a time, then those 100
> >>> > > > deletes are done in parallel. So that's 100x. So for each second
> >>> > > > of real (wall clock) delete time you're getting 100 seconds of CPU
> >>> > > > time.  You should be pleased that instead of your task taking 11
> >>> > > > hours to delete all your data, it took only 15 minutes. Isn't that
> >>> > > > scalability? Isn't that what you're looking for? How many entities
> >>> > > > did you delete? How many indexes did you have (composite and
> >>> > > > single property)?
> >>> >
> >>> > > This was using only 1 shard per kind being deleted, so effectively
> >>> > > there should be no parallelism occurring, unless there is something
> >>> > > I am missing?
> >>> > > I deleted about 300k entities, each with a single indexed collection.
> >>> >
> >>> > > > On Sun, Nov 14, 2010 at 10:29 AM, Erik <erik.e.wil...@gmail.com> wrote:
> >>> >
> >>> > > > > If you check the Datastore Viewer, you might be able to find
> >>> > > > > and delete your jobs from one of the tables.  You may also need
> >>> > > > > to go into your task queues and purge the default queue.
> >>> >
> >>> > > > > On this topic, why does deleting data have such a large
> >>> > > > > difference between actual time spent and billed time?
> >>> >
> >>> > > > > For instance, I had two mapreduce shards running to delete
> >>> > > > > data, which took a combined total of 15 minutes, but I was
> >>> > > > > actually charged for 11(!) hours.  I know there isn't a 1:1
> >>> > > > > correlation, but a >40x difference is a little silly!
> >>> >
> >>> > > > > On Nov 14, 4:25 am, Justin <justin.worr...@gmail.com> wrote:
> >>> > > > > > I've been trying to bulk delete data from my application as
> >>> > > > > > described here:
> >>> > > > > > http://code.google.com/appengine/docs/python/datastore/creatinggettin...
> >>> >
> >>> > > > > > This seems to have kicked off a series of mapreduce workers,
> >>> > > > > > whose execution is killing my CPU - approximately 5 minutes
> >>> > > > > > later I have reached 100% CPU time and am locked out for the
> >>> > > > > > rest of the day.
> >>> >
> >>> > > > > > I figure I'll just delete by hand; create some appropriate
> >>> > > > > > :delete controllers and wait till the next day.
> >>> >
> >>> > > > > > Unfortunately, the mapreduce process still seems to be
> >>> > > > > > running - 10 past midnight and my CPU has reached 100% again.
> >>> >
> >>> > > > > > Is there some way to kill these processes and get back
> >>> > > > > > control of my app?
> >>> >
