[google-appengine] Re: Bulk data deletion woe

2010-12-06 Thread Beruk
I am having similar problems using the bulkupdate library, which was
sort of a precursor to MapReduce: it iterates over a query instead of
fetching in batches, and I've found that to be buggy and unreliable:

http://code.google.com/p/googleappengine/issues/detail?id=4046

Could it be that MapReduce uses the iterator interface to the query
(for i in q) instead of fetching batches of entities, which would
explain why your custom delete job, which uses fetch, takes less time
to complete than the MR job?
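
For concreteness, here is the difference I mean, in the Python db API
(a minimal sketch; MyModel, process, process_batch, and the batch size
of 100 are placeholders, not anything from your job):

    from google.appengine.ext import db

    # iterator interface: the SDK pulls results lazily, in small
    # internal batches, as the loop advances
    for key in MyModel.all(keys_only=True):
        process(key)

    # explicit batching: one fetch RPC per 100 keys, resumed via a cursor
    q = MyModel.all(keys_only=True)
    keys = q.fetch(100)
    while keys:
        process_batch(keys)
        q.with_cursor(q.cursor())
        keys = q.fetch(100)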

Pascal


On Nov 15, 1:18 am, Eli Jones  wrote:
> From what I could tell, the map reduce delete job took up several times more
> CPU time (and wall clock time) than my custom delete job usually took.
>
> My usual utility class uses this method for deletes:
>
> 1. Create a query for all entities in a model with keys_only = True.
> 2. Fetch 100 keys.
> 3. Issue a deferred task to delete those 100 key names.
> 4. Use a cursor to fetch 100 more, and issue deferred deletes until the
> query returns no more entities.
>
> This is usually pretty fast.. since the only bottleneck is the time it
> takes to fetch 100 key names and add the deferred task.  The surprising fact
> was that the default map reduce delete from the Datastore Admin page took so
> much CPU.




[google-appengine] Re: Bulk data deletion woe

2010-11-17 Thread Erik
Hi Everyone,

So I have been going through the process of deleting more things and
wanted to share some results.  Previously I had deleted about 300k
entities each with a single indexed list, with an average cost of
132ms per delete, using the mapreduce method.

Now I realise I need to change my schema again, and needed to delete
288373 entries from a kind with the following properties:
public class Pw {
    @Persistent private String[] a;
    @Persistent private String[] b;
}

This time I used the Java remote_api library to batch delete with the
following Python code:

from google.appengine.ext import db

class deleter(object):
    def run(self, kind, batch_size=200):
        # keys-only query, so no entity payloads are fetched
        q = db.GqlQuery("select __key__ from %s" % kind)
        entities = q.fetch(batch_size)
        while entities:
            db.delete(entities)
            # resume the query where the last fetch left off
            q.with_cursor(q.cursor())
            entities = q.fetch(batch_size)

This job completed with 8.8 billed hours, or a 110ms average per
delete.  However, I had to restart the deletion several times, as I
would receive an "Unknown Java Error" from remote_api, which
correlated with a spike in the latency of the instance running the
remote_api but left nothing in the dashboard logs.

Later I decided to delete another 288373 entries from a kind, with the
following properties:
public class Pi {
    @Persistent private String n;
    @Persistent
    @Extension(vendorName = "datanucleus", key = "gae.unindexed", value = "true")
    private Long[] a;
    @Persistent
    @Extension(vendorName = "datanucleus", key = "gae.unindexed", value = "true")
    private Long[] b;
}

This time I uploaded a Python project with the remote_api enabled and
fired up a shell, then deleted those entries with only 3.5 billed
hours, for an average of 44ms per delete!
The delete process using the Python remote_api occurred without
incident; no "Unknown Errors" were received.
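
For reference, wiring up the Python remote_api at the time amounted to
the standard handler in app.yaml plus the SDK shell (a sketch following
the era's documentation; the app id is a placeholder):

    handlers:
    - url: /remote_api
      script: $PYTHON_LIB/google/appengine/ext/remote_api/handler.py
      login: admin

and then, from the SDK:

    remote_api_shell.py your-app-id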

Also, interestingly, according to the datastore status during this
time period it should have been costing between 100ms and 500ms per
delete!

Not sure if my 110ms average was due to more indexed properties or
using the Java remote_api?

Hopefully I won't have to delete much else in the future but at least
44ms is much more tolerable.
It would be nice to be able to mark a kind as "lazy delete" and allow
Google to delete the data at their convenience without being billed
(nightly maintenance?).

Cheers!

On Nov 15, 1:18 am, Eli Jones  wrote:
> This is just an anecdotal aside (in other words, I have not bothered to do
> any testing or comparison of performance).. but.. I have my own utility code
> that I use for batch deletes.
>
> Recently, I decided to wipe out all of the entities for one of my models,
> but I was too lazy to look up the exact command I needed to use in the
> remote console.
>
> So, I just used the new Datastore Admin page to delete them.  This page uses
> map reduce jobs to perform deletes.
>
> From what I could tell, the map reduce delete job took up several times more
> CPU time (and wall clock time) than my custom delete job usually took.
>
> My usual utility class uses this method for deletes:
>
> 1. Create a query for all entities in a model with keys_only = True.
> 2. Fetch 100 keys.
> 3. Issue a deferred task to delete those 100 key names.
> 4. Use a cursor to fetch 100 more, and issue deferred deletes until the
> query returns no more entities.
>
> This is usually pretty fast.. since the only bottleneck is the time it
> takes to fetch 100 key names and add the deferred task.  The surprising fact
> was that the default map reduce delete from the Datastore Admin page took so
> much CPU.
>
> So, if you think you'll be doing more bulk deletes in the future, it might
> be useful to compare the CPU usage of a map reduce delete (using keys only
> and not full entities) to a method that deletes batches of 100 key names
> using deferred with a query cursor.
>
> Though, deleting 300,000 entities will take up a lot of CPU hours no matter
> what method you use.
>
> Like I said.. this is anecdotal and there could be no real difference in
> performance.. but the Datastore Admin delete took up way more CPU time than
> it seemed it should have, and I didn't bother to use it or test it again.
>
> On Sun, Nov 14, 2010 at 11:47 PM, Erik  wrote:
> > Thanks for the well-thought-out response, numbers, and reality check
> > Stephen!  That makes a lot of sense when you consider parallel deletes
> > and datastore CPU time.
>
> > On Nov 14, 9:37 pm, Stephen Johnson  wrote:
> > > Thank you for sharing your numbers with us. I think it's a good way for
> > all
> > > of us to get an idea of how much things cost on the cloud, so here's my
> > > thoughts.
>
> > > Even though you had one shard executing, the shard should be doing batch
> > > deletes and not one delete at a time. From the documentation batch
> > deletes
> > > can do up to 500 entities in one call and would execute in parallel
> > (perhaps
> > > not 500 all at once but with parallelism nonetheless). I would assume
> > > the shard would probably do about 100 or so at a time (maybe more /
> > > maybe less).

[google-appengine] Re: Bulk data deletion woe

2010-11-17 Thread Chris Prinos
Here's what I have done that seems to work:

1) Select the entities you want to delete from the datastore admin
page as you described.
2) Manually monitor the CPU usage in the dashboard or quota page.
3) Once the quota reaches a high value (say 75-80%), go to
the "Task Queues" page of the admin console.
4) You should see a "default" queue that contains the worker tasks
that are deleting your data.
5) Use the "Pause Queue" button to pause the default queue.
6) Wait until your quota is reset (done once per day).
7) With a new quota, resume the default task queue.  Repeat
from step 2. If you are deleting less than 1GB of data, you can
probably do this in one or two passes.

Note: How high you let the quota go in step #3 above depends on how
much CPU your app normally needs... make sure to leave enough for your
app to function.


On Nov 14, 4:25 am, Justin  wrote:
> I've been trying to bulk delete data from my application as described
> here
>
> http://code.google.com/appengine/docs/python/datastore/creatinggettin...
>
> This seems to have kicked off a series of mapreduce workers, whose
> execution is killing my CPU - approximately 5 mins later I have
> reached 100% CPU time and am locked out for the rest of the day.
>
> I figure I'll just delete by hand; create some appropriate :delete
> controllers and wait till the next day.
>
> Unfortunately the mapreduce process still seems to be running - 10
> past midnight and my CPU has reached 100% again.
>
> Is there some way to kill these processes and get back control of my
> app?




Re: [google-appengine] Re: Bulk data deletion woe

2010-11-15 Thread Stephen Johnson
In the Java SDK I think one could subclass the DatastoreRecordReader to do a
keys-only query and return null for the entity value and use this class in
lieu of the normal DatastoreRecordReader when needed. Probably similar in
Python.

On Mon, Nov 15, 2010 at 11:47 AM, Robert Kluin wrote:

> In the Python MR libs, there is a DatastoreKeyInputReader input
> reader.  It looks like that is what's used to iterate over the
> entities.
>
> http://code.google.com/p/googleappengine/source/browse/trunk/python/google/appengine/ext/datastore_admin/delete_handler.py#148
>
>
>
> Robert
>
> On Mon, Nov 15, 2010 at 13:27, Stephen Johnson 
> wrote:
> > Yes I see what you're saying. Map Reduce would bring over the whole
> entity
> > even though it isn't needed and would consume more CPU in fetching the
> > entity not just the key. Seems almost like it would be nice to have an
> > option of Map Reduce only handing off keys and leaving out the entity.
> >
> > On Sun, Nov 14, 2010 at 11:18 PM, Eli Jones  wrote:
> >>
> >> This is just an anecdotal aside (in other words, I have not bothered to
> do
> >> any testing or comparison of performance).. but.. I have my own utility
> code
> >> that I use for batch deletes.
> >> Recently, I decided to wipe out all of the entities for one of my
> models,
> >> but I was too lazy to look up the exact command I needed to use in the
> >> remote console.
> >> So, I just used the new Datastore Admin page to delete them.  This page
> >> uses map reduce jobs to perform deletes.
> >> From what I could tell, the map reduce delete job took up several times
> >> more CPU time (and wall clock time) than my custom delete job usually
> took.
> >> My usual utility class uses this method for deletes:
> >> 1. Create a query for all entities in a model with keys_only = True.
> >> 2. Fetch 100 keys.
> >> 3. Issue a deferred task to delete those 100 key names.
> >> 4. Use a cursor to fetch 100 more, and issue deferred deletes until the
> >> query returns no more entities.
> >> This is usually pretty fast.. since the only bottleneck is the time it
> >> takes to fetch 100 key names and add the deferred task.  The surprising
> >> fact was that the default map reduce delete from the Datastore Admin page
> >> took so much CPU.
> >> So, if you think you'll be doing more bulk deletes in the future, it
> might
> >> be useful to compare the CPU usage of a map reduce delete (using keys
> only
> >> and not full entities) to a method that deletes batches of 100 key names
> >> using deferred with a query cursor.
> >> Though, deleting 300,000 entities will take up a lot of CPU hours no
> >> matter what method you use.
> >> Like I said.. this is anecdotal and there could be no real difference in
> >> performance.. but the Datastore Admin delete took up way more CPU time
> than
> >> it seemed it should have, and I didn't bother to use it or test it
> again.
> >>
> >> On Sun, Nov 14, 2010 at 11:47 PM, Erik  wrote:
> >>>
> >>> Thanks for the well-thought-out response, numbers, and reality check
> >>> Stephen!  That makes a lot of sense when you consider parallel deletes
> >>> and datastore CPU time.
> >>>
> >>> On Nov 14, 9:37 pm, Stephen Johnson  wrote:
> >>> > Thank you for sharing your numbers with us. I think it's a good way
> for
> >>> > all
> >>> > of us to get an idea of how much things cost on the cloud, so here's
> my
> >>> > thoughts.
> >>> >
> >>> > Even though you had one shard executing, the shard should be doing
> batch
> >>> > deletes and not one delete at a time. From the documentation batch
> >>> > deletes
> >>> > can do up to 500 entities in one call and would execute in parallel
> >>> > (perhaps
> >>> > not 500 all at once but with parallelism nonetheless). I would
> assume
> >>> > the
> >>> > shard would probably do about 100 or so at a time (maybe more / maybe
> >>> > less).
> >>> >
> >>> > Anyway, a good way to prove some parallelism must be occurring would
> be
> >>> > to
> >>> > do a proof by negation. So, let's assume that in fact the shard is
> >>> > doing one
> >>> > delete at a time. Looking at the System Status the latency of a
> single
> >>> > delete on an entity (probably a very simple entity with no composite
> >>> > indexes
> >>> > which would add additional overhead) is approximately 50ms to 100ms
> or
> >>> > so.
> >>> > If we assume 50ms per delete for latency we end up with (assuming no
> >>> > overhead for mapreduce/shard maintenance and spawning additional
> tasks,
> >>> > etc.
> >>> > which would add even more additional time).
> >>> >
> >>> > 300,000 entities * .05 seconds per entity = 15000 seconds
> >>> > 15000 seconds / 60 seconds per minute = 250 minutes or 4 hours 10
> >>> > minutes
> >>> >
> >>> > Additionally if a delete takes approximately 100 milliseconds then
> >>> > 300,000 entities would take 8 hours 20 minutes to complete.
> >>> > Even an unrealistic 25ms per delete is still over two hours.
> >>> >
> >>> > Now remember this is latency (real time) and not CPU time.

Re: [google-appengine] Re: Bulk data deletion woe

2010-11-15 Thread Robert Kluin
In the Python MR libs, there is a DatastoreKeyInputReader input
reader.  It looks like that is what's used to iterate over the
entities.
http://code.google.com/p/googleappengine/source/browse/trunk/python/google/appengine/ext/datastore_admin/delete_handler.py#148
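
That handler boils down to a keys-only mapper; roughly (a minimal
sketch of the same idea, assuming the MR library's operation wrappers):

    from mapreduce import operation as op

    # mapper fed by DatastoreKeyInputReader: each call receives only a
    # key, never the entity, and yields a delete that the framework
    # batches up
    def delete_entity(key):
        yield op.db.Delete(key)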



Robert

On Mon, Nov 15, 2010 at 13:27, Stephen Johnson  wrote:
> Yes I see what you're saying. Map Reduce would bring over the whole entity
> even though it isn't needed and would consume more CPU in fetching the
> entity not just the key. Seems almost like it would be nice to have an
> option of Map Reduce only handing off keys and leaving out the entity.
>
> On Sun, Nov 14, 2010 at 11:18 PM, Eli Jones  wrote:
>>
>> This is just an anecdotal aside (in other words, I have not bothered to do
>> any testing or comparison of performance).. but.. I have my own utility code
>> that I use for batch deletes.
>> Recently, I decided to wipe out all of the entities for one of my models,
>> but I was too lazy to look up the exact command I needed to use in the
>> remote console.
>> So, I just used the new Datastore Admin page to delete them.  This page
>> uses map reduce jobs to perform deletes.
>> From what I could tell, the map reduce delete job took up several times
>> more CPU time (and wall clock time) than my custom delete job usually took.
>> My usual utility class uses this method for deletes:
>> 1. Create a query for all entities in a model with keys_only = True.
>> 2. Fetch 100 keys.
>> 3. Issue a deferred task to delete those 100 key names.
>> 4. Use a cursor to fetch 100 more, and issue deferred deletes until the
>> query returns no more entities.
>> This is usually pretty fast.. since the only bottleneck is the time it
>> takes to fetch 100 key names and add the deferred task.  The surprising fact
>> was that the default map reduce delete from the Datastore Admin page took so
>> much CPU.
>> So, if you think you'll be doing more bulk deletes in the future, it might
>> be useful to compare the CPU usage of a map reduce delete (using keys only
>> and not full entities) to a method that deletes batches of 100 key names
>> using deferred with a query cursor.
>> Though, deleting 300,000 entities will take up a lot of CPU hours no
>> matter what method you use.
>> Like I said.. this is anecdotal and there could be no real difference in
>> performance.. but the Datastore Admin delete took up way more CPU time than
>> it seemed it should have, and I didn't bother to use it or test it again.
>>
>> On Sun, Nov 14, 2010 at 11:47 PM, Erik  wrote:
>>>
>>> Thanks for the well-thought-out response, numbers, and reality check
>>> Stephen!  That makes a lot of sense when you consider parallel deletes
>>> and datastore CPU time.
>>>
>>> On Nov 14, 9:37 pm, Stephen Johnson  wrote:
>>> > Thank you for sharing your numbers with us. I think it's a good way for
>>> > all
>>> > of us to get an idea of how much things cost on the cloud, so here's my
>>> > thoughts.
>>> >
>>> > Even though you had one shard executing, the shard should be doing batch
>>> > deletes and not one delete at a time. From the documentation batch
>>> > deletes
>>> > can do up to 500 entities in one call and would execute in parallel
>>> > (perhaps
>>> > not 500 all at once but with parallelism nonetheless). I would assume
>>> > the
>>> > shard would probably do about 100 or so at a time (maybe more / maybe
>>> > less).
>>> >
>>> > Anyway, a good way to prove some parallelism must be occurring would be
>>> > to
>>> > do a proof by negation. So, let's assume that in fact the shard is
>>> > doing one
>>> > delete at a time. Looking at the System Status the latency of a single
>>> > delete on an entity (probably a very simple entity with no composite
>>> > indexes
>>> > which would add additional overhead) is approximately 50ms to 100ms or
>>> > so.
>>> > If we assume 50ms per delete for latency we end up with (assuming no
>>> > overhead for mapreduce/shard maintenance and spawning additional tasks,
>>> > etc.
>>> > which would add even more additional time).
>>> >
>>> >     300,000 entities * .05 seconds per entity = 15000 seconds
>>> >     15000 seconds / 60 seconds per minute = 250 minutes or 4 hours 10
>>> > minutes
>>> >
>>> > Additionally if a delete takes approximately 100 milliseconds then
>>> > 300,000 entities would take 8 hours 20 minutes to complete.
>>> > Even an unrealistic 25ms per delete is still over two hours.
>>> >
>>> > Now remember this is latency (real time) and not CPU time. So even if
>>> > something has latency time of 50ms it could still eat up 100ms of API
>>> > CPU
>>> > time. For example 50ms to delete the entity and 50ms to update the
>>> > indexes
>>> > (done in parallel). So if latency time is 4 hours 10 minutes and we
>>> > just
>>> > double latency time to approximate API CPU time we get over 8 hours of
>>> > CPU
>>> > time. If average delete time for your job was 75ms then latency time is
>>> > approximately 6 hours and CPU time 12 hours. Your total was 11 hours
>>> > billed time so if my logic is sound it seems reasonable the amount you
>>> > were billed could be correct.

Re: [google-appengine] Re: Bulk data deletion woe

2010-11-15 Thread Stephen Johnson
Yes I see what you're saying. Map Reduce would bring over the whole entity
even though it isn't needed and would consume more CPU in fetching the
entity not just the key. Seems almost like it would be nice to have an
option of Map Reduce only handing off keys and leaving out the entity.

On Sun, Nov 14, 2010 at 11:18 PM, Eli Jones  wrote:

> This is just an anecdotal aside (in other words, I have not bothered to do
> any testing or comparison of performance).. but.. I have my own utility code
> that I use for batch deletes.
>
> Recently, I decided to wipe out all of the entities for one of my models,
> but I was too lazy to look up the exact command I needed to use in the
> remote console.
>
> So, I just used the new Datastore Admin page to delete them.  This page
> uses map reduce jobs to perform deletes.
>
> From what I could tell, the map reduce delete job took up several times
> more CPU time (and wall clock time) than my custom delete job usually took.
>
> My usual utility class uses this method for deletes:
>
> 1. Create a query for all entities in a model with keys_only = True.
> 2. Fetch 100 keys.
> 3. Issue a deferred task to delete those 100 key names.
> 4. Use a cursor to fetch 100 more, and issue deferred deletes until the
> query returns no more entities.
>
> This is usually pretty fast.. since the only bottleneck is the time it
> takes to fetch 100 key names and add the deferred task.  The surprising fact
> was that the default map reduce delete from the Datastore Admin page took so
> much CPU.
>
> So, if you think you'll be doing more bulk deletes in the future, it might
> be useful to compare the CPU usage of a map reduce delete (using keys only
> and not full entities) to a method that deletes batches of 100 key names
> using deferred with a query cursor.
>
> Though, deleting 300,000 entities will take up a lot of CPU hours no matter
> what method you use.
>
> Like I said.. this is anecdotal and there could be no real difference in
> performance.. but the Datastore Admin delete took up way more CPU time than
> it seemed it should have, and I didn't bother to use it or test it again.
>
>
> On Sun, Nov 14, 2010 at 11:47 PM, Erik  wrote:
>
>> Thanks for the well-thought-out response, numbers, and reality check
>> Stephen!  That makes a lot of sense when you consider parallel deletes
>> and datastore CPU time.
>>
>> On Nov 14, 9:37 pm, Stephen Johnson  wrote:
>> > Thank you for sharing your numbers with us. I think it's a good way for
>> all
>> > of us to get an idea of how much things cost on the cloud, so here's my
>> > thoughts.
>> >
>> > Even though you had one shard executing, the shard should be doing batch
>> > deletes and not one delete at a time. From the documentation batch
>> deletes
>> > can do up to 500 entities in one call and would execute in parallel
>> (perhaps
>> > not 500 all at once but with parallelism nonetheless). I would assume
>> the
>> > shard would probably do about 100 or so at a time (maybe more / maybe
>> less).
>> >
>> > Anyway, a good way to prove some parallelism must be occurring would be
>> to
>> > do a proof by negation. So, let's assume that in fact the shard is doing
>> one
>> > delete at a time. Looking at the System Status the latency of a single
>> > delete on an entity (probably a very simple entity with no composite
>> indexes
>> > which would add additional overhead) is approximately 50ms to 100ms or
>> so.
>> > If we assume 50ms per delete for latency we end up with (assuming no
>> > overhead for mapreduce/shard maintenance and spawning additional tasks,
>> etc.
>> > which would add even more additional time).
>> >
>> > 300,000 entities * .05 seconds per entity = 15000 seconds
>> > 15000 seconds / 60 seconds per minute = 250 minutes or 4 hours 10
>> > minutes
>> >
>> > Additionally if a delete takes approximately 100 milliseconds then
>> > 300,000 entities would take 8 hours 20 minutes to complete.
>> > Even an unrealistic 25ms per delete is still over two hours.
>> >
>> > Now remember this is latency (real time) and not CPU time. So even if
>> > something has latency time of 50ms it could still eat up 100ms of API
>> CPU
>> > time. For example 50ms to delete the entity and 50ms to update the
>> indexes
>> > (done in parallel). So if latency time is 4 hours 10 minutes and we just
>> > double latency time to approximate API CPU time we get over 8 hours of
>> CPU
>> > time. If average delete time for your job was 75ms then latency time is
>> > approximately 6 hours and CPU time 12 hours. Your total was 11 hours
>> billed
>> > time so if my logic is sound it seems reasonable the amount you were
>> billed
>> > could be correct.
>> >
>> > Furthermore if we take another look at this from another angle we find
>> that
>> > if your delete job took 15 minutes to complete then:
>> >
>> > 300,000 entities / 15 minutes = 20,000 entities per minute
>> > 20,000 entities per minute / 60 seconds per minute = 333.33 entities per
>> > second
>> >
>> > So, if 333.33 entities are being deleted per second serially then the
>> > average latency would be 3ms per delete which seems rather unlikely.

Re: [google-appengine] Re: Bulk data deletion woe

2010-11-14 Thread Eli Jones
This is just an anecdotal aside (in other words, I have not bothered to do
any testing or comparison of performance).. but.. I have my own utility code
that I use for batch deletes.

Recently, I decided to wipe out all of the entities for one of my models,
but I was too lazy to look up the exact command I needed to use in the
remote console.

So, I just used the new Datastore Admin page to delete them.  This page uses
map reduce jobs to perform deletes.

From what I could tell, the map reduce delete job took up several times more
CPU time (and wall clock time) than my custom delete job usually took.

My usual utility class uses this method for deletes:

1. Create a query for all entities in a model with keys_only = True.
2. Fetch 100 keys.
3. Issue a deferred task to delete those 100 key names.
4. Use a cursor to fetch 100 more, and issue deferred deletes until the
query returns no more entities.

This is usually pretty fast.. since the only bottleneck is the time it
takes to fetch 100 key names and add the deferred task.  The surprising fact
was that the default map reduce delete from the Datastore Admin page took so
much CPU.
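
In code, the pattern boils down to something like this (a minimal
sketch, not my actual utility class; the model argument and batch size
are placeholders):

    from google.appengine.ext import db
    from google.appengine.ext import deferred

    def delete_all(model_class, batch_size=100):
        # keys-only query: full entities are never fetched
        q = model_class.all(keys_only=True)
        keys = q.fetch(batch_size)
        while keys:
            # hand the actual delete off to the task queue
            deferred.defer(db.delete, keys)
            # resume where the last fetch stopped
            q.with_cursor(q.cursor())
            keys = q.fetch(batch_size)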

So, if you think you'll be doing more bulk deletes in the future, it might
be useful to compare the CPU usage of a map reduce delete (using keys only
and not full entities) to a method that deletes batches of 100 key names
using deferred with a query cursor.

Though, deleting 300,000 entities will take up a lot of CPU hours no matter
what method you use.

Like I said.. this is anecdotal and there could be no real difference in
performance.. but the Datastore Admin delete took up way more CPU time than
it seemed it should have, and I didn't bother to use it or test it again.

On Sun, Nov 14, 2010 at 11:47 PM, Erik  wrote:

> Thanks for the well-thought-out response, numbers, and reality check
> Stephen!  That makes a lot of sense when you consider parallel deletes
> and datastore CPU time.
>
> On Nov 14, 9:37 pm, Stephen Johnson  wrote:
> > Thank you for sharing your numbers with us. I think it's a good way for
> all
> > of us to get an idea of how much things cost on the cloud, so here's my
> > thoughts.
> >
> > Even though you had one shard executing, the shard should be doing batch
> > deletes and not one delete at a time. From the documentation batch
> deletes
> > can do up to 500 entities in one call and would execute in parallel
> (perhaps
> > not 500 all at once but with parallelism nonetheless). I would assume
> the
> > shard would probably do about 100 or so at a time (maybe more / maybe
> less).
> >
> > Anyway, a good way to prove some parallelism must be occurring would be
> to
> > do a proof by negation. So, let's assume that in fact the shard is doing
> one
> > delete at a time. Looking at the System Status the latency of a single
> > delete on an entity (probably a very simple entity with no composite
> indexes
> > which would add additional overhead) is approximately 50ms to 100ms or
> so.
> > If we assume 50ms per delete for latency we end up with (assuming no
> > overhead for mapreduce/shard maintenance and spawning additional tasks,
> etc.
> > which would add even more additional time).
> >
> > 300,000 entities * .05 seconds per entity = 15000 seconds
> > 15000 seconds / 60 seconds per minute = 250 minutes or 4 hours 10
> > minutes
> >
> > Additionally if a delete takes approximately 100 milliseconds then 300,000
> > entities would take 8 hours 20 minutes to complete.
> > Even an unrealistic 25ms per delete is still over two hours.
> >
> > Now remember this is latency (real time) and not CPU time. So even if
> > something has latency time of 50ms it could still eat up 100ms of API CPU
> > time. For example 50ms to delete the entity and 50ms to update the
> indexes
> > (done in parallel). So if latency time is 4 hours 10 minutes and we just
> > double latency time to approximate API CPU time we get over 8 hours of
> CPU
> > time. If average delete time for your job was 75ms then latency time is
> > approximately 6 hours and CPU time 12 hours. Your total was 11 hours
> billed
> > time so if my logic is sound it seems reasonable the amount you were
> billed
> > could be correct.
> >
> > Furthermore if we take another look at this from another angle we find
> that
> > if your delete job took 15 minutes to complete then:
> >
> > 300,000 entities / 15 minutes = 20,000 entities per minute
> > 20,000 entities per minute / 60 seconds per minute = 333.33 entities per
> > second
> >
> > So, if 333.33 entities are being deleted per second serially then the
> > average latency would be 3ms per delete which seems rather unlikely.
> >
> > My thoughts. Hope it helps (and I hope my math is right),
> > Steve
> >
> > On Sun, Nov 14, 2010 at 2:57 PM, Erik  wrote:
> >
> > > On Nov 14, 1:32 pm, Stephen Johnson  wrote:
> > > > Why do you say that's silly? If your map reduce task does bulk
> deletes
> > > and
> > > > let's say they do 100 at a time, then those 100 deletes are done in
> > > > parallel.

[google-appengine] Re: Bulk data deletion woe

2010-11-14 Thread Erik
Thanks for the well-thought-out response, numbers, and reality check
Stephen!  That makes a lot of sense when you consider parallel deletes
and datastore CPU time.

On Nov 14, 9:37 pm, Stephen Johnson  wrote:
> Thank you for sharing your numbers with us. I think it's a good way for all
> of us to get an idea of how much things cost on the cloud, so here's my
> thoughts.
>
> Even though you had one shard executing, the shard should be doing batch
> deletes and not one delete at a time. From the documentation batch deletes
> can do up to 500 entities in one call and would execute in parallel (perhaps
> not 500 all at once but with parallelism nonetheless). I would assume the
> shard would probably do about 100 or so at a time (maybe more / maybe less).
>
> Anyway, a good way to prove some parallelism must be occurring would be to
> do a proof by negation. So, let's assume that in fact the shard is doing one
> delete at a time. Looking at the System Status the latency of a single
> delete on an entity (probably a very simple entity with no composite indexes
> which would add additional overhead) is approximately 50ms to 100ms or so.
> If we assume 50ms per delete for latency we end up with (assuming no
> overhead for mapreduce/shard maintenance and spawning additional tasks, etc.
> which would add even more additional time).
>
>     300,000 entities * .05 seconds per entity = 15000 seconds
>     15000 seconds / 60 seconds per minute = 250 minutes or 4 hours 10
> minutes
>
> Additionally if a delete takes approximately 100 milliseconds then 300,000
> entities would take 8 hours 20 minutes to complete.
> Even an unrealistic 25ms per delete is still over two hours.
>
> Now remember this is latency (real time) and not CPU time. So even if
> something has latency time of 50ms it could still eat up 100ms of API CPU
> time. For example 50ms to delete the entity and 50ms to update the indexes
> (done in parallel). So if latency time is 4 hours 10 minutes and we just
> double latency time to approximate API CPU time we get over 8 hours of CPU
> time. If average delete time for your job was 75ms then latency time is
> approximately 6 hours and CPU time 12 hours. Your total was 11 hours billed
> time so if my logic is sound it seems reasonable the amount you were billed
> could be correct.
>
> Furthermore if we take another look at this from another angle we find that
> if your delete job took 15 minutes to complete then:
>
> 300,000 entities / 15 minutes = 20,000 entities per minute
> 20,000 entities per minute / 60 seconds per minute = 333.33 entities per
> second
>
> So, if 333.33 entities are being deleted per second serially then the
> average latency would be 3ms per delete which seems rather unlikely.
>
> My thoughts. Hope it helps (and I hope my math is right),
> Steve
>
> On Sun, Nov 14, 2010 at 2:57 PM, Erik  wrote:
>
> > On Nov 14, 1:32 pm, Stephen Johnson  wrote:
> > > Why do you say that's silly? If your map reduce task does bulk deletes
> > and
> > > let's say they do 100 at a time, then those 100 deletes are done in
> > > parallel. So that's 100x. So for each second of delete real time you're
> > > getting 100 seconds of CPU time.  You should be pleased that instead of
> > your
> > > task taking 11 hours to delete all your data it took only 15 minutes.
> > Isn't
> > > that scalability? Isn't that what you're looking for? How many entities
> > did
> > > you delete? How many indexes did you have (composite and single
> > property)?
>
> > This was using only 1 shard per kind that was being deleted, so
> > effectively there should be no parallelism occurring, unless there is
> > something I am missing?
> > Deleted about 300k entities, each with a single indexed collection.
>
> > > On Sun, Nov 14, 2010 at 10:29 AM, Erik  wrote:
>
> > > > If you check in the datastore viewer you might be able to find and
> > > > delete your jobs from one of the tables.  You may also need to go into
> > > > your task queues and purge the default.
>
> > > > On this topic, why does deleting data have such a large difference
> > > > between actual time spent and billed time?
>
> > > > For instance, I had two mapreduce shards running to delete data, which
> > > > took a combined total of 15 minutes, but I was actually charged for
> > > > 11(!) hours.  I know there isn't a 1:1 correlation but a >40x
> > > > difference is a little silly!
>
> > > > On Nov 14, 4:25 am, Justin  wrote:
> > > > > I've been trying to bulk delete data from my application as described
> > > > > here
>
> >http://code.google.com/appengine/docs/python/datastore/creatinggettin...
>
> > > > > This seems to have kicked off a series of mapreduce workers, whose
> > > > > execution is killing my CPU - approximately 5 mins later I have
> > > > > reached 100% CPU time and am locked out for the rest of the day.
>
> > > > > I figure I'll just delete by hand; create some appropriate :delete
> > > > > controllers and wait till the next day.
>
> > > > > Unfortunately the mapreduce process still seems to be running - 10
> > > > > past midnight and my CPU has reached 100% again.

Re: [google-appengine] Re: Bulk data deletion woe

2010-11-14 Thread Stephen Johnson
Thank you for sharing your numbers with us. I think it's a good way for all
of us to get an idea of how much things cost on the cloud, so here's my
thoughts.

Even though you had one shard executing, the shard should be doing batch
deletes and not one delete at a time. From the documentation batch deletes
can do up to 500 entities in one call and would execute in parallel (perhaps
not 500 all at once but with parallelism nonetheless). I would assume the
shard would probably do about 100 or so at a time (maybe more / maybe less).

Anyway, a good way to prove some parallelism must be occurring would be to
do a proof by negation. So, let's assume that in fact the shard is doing one
delete at a time. Looking at the System Status the latency of a single
delete on an entity (probably a very simple entity with no composite indexes
which would add additional overhead) is approximately 50ms to 100ms or so.
If we assume 50ms per delete for latency we end up with (assuming no
overhead for mapreduce/shard maintenance and spawning additional tasks, etc.
which would add even more additional time).

    300,000 entities * .05 seconds per entity = 15000 seconds
    15000 seconds / 60 seconds per minute = 250 minutes or 4 hours 10
minutes

Additionally if a delete takes approximately 100 milliseconds then 300,000
entities would take 8 hours 20 minutes to complete.
Even an unrealistic 25ms per delete is still over two hours.

Now remember this is latency (real time) and not CPU time. So even if
something has latency time of 50ms it could still eat up 100ms of API CPU
time. For example 50ms to delete the entity and 50ms to update the indexes
(done in parallel). So if latency time is 4 hours 10 minutes and we just
double latency time to approximate API CPU time we get over 8 hours of CPU
time. If average delete time for your job was 75ms then latency time is
approximately 6 hours and CPU time 12 hours. Your total was 11 hours billed
time so if my logic is sound it seems reasonable the amount you were billed
could be correct.

Furthermore if we take another look at this from another angle we find that
if your delete job took 15 minutes to complete then:

300,000 entities / 15 minutes = 20,000 entities per minute
20,000 entities per minute / 60 seconds per minute = 333.33 entities per
second

So, if 333.33 entities are being deleted per second serially then the
average latency would be 3ms per delete which seems rather unlikely.

My thoughts. Hope it helps (and I hope my math is right),
Steve


On Sun, Nov 14, 2010 at 2:57 PM, Erik  wrote:

>
>
> On Nov 14, 1:32 pm, Stephen Johnson  wrote:
> > Why do you say that's silly? If your map reduce task does bulk deletes
> and
> > let's say they do 100 at a time, then those 100 deletes are done in
> > parallel. So that's 100x. So for each second of delete real time you're
> > getting 100 seconds of CPU time.  You should be pleased that instead of
> your
> > task taking 11 hours to delete all your data it took only 15 minutes.
> Isn't
> > that scalability? Isn't that what you're looking for? How many entities
> did
> > you delete? How many indexes did you have (composite and single
> property)?
>
> This was using only 1 shard per kind that was being deleted, so
> effectively there should be no parallelism occurring, unless there is
> something I am missing?
> Deleted about 300k entities, each with a single indexed collection.
>
> > On Sun, Nov 14, 2010 at 10:29 AM, Erik  wrote:
> >
> > > If you check in the datastore viewer you might be able to find and
> > > delete your jobs from one of the tables.  You may also need to go into
> > > your task queues and purge the default.
> >
> > > On this topic, why does deleting data have such a large difference
> > > between actual time spent and billed time?
> >
> > > For instance, I had two mapreduce shards running to delete data, which
> > > took a combined total of 15 minutes, but I was actually charged for
> > > 11(!) hours.  I know there isn't a 1:1 correlation but a >40x
> > > difference is a little silly!
> >
> > > On Nov 14, 4:25 am, Justin  wrote:
> > > > I've been trying to bulk delete data from my application as described
> > > > here
> >
> > > >
> http://code.google.com/appengine/docs/python/datastore/creatinggettin...
> >
> > > > This seems to have kicked off a series of mapreduce workers, whose
> > > > execution is killing my CPU - approximately 5 mins later I have
> > > > reached 100% CPU time and am locked out for the rest of the day.
> >
> > > > I figure I'll just delete by hand; create some appropriate :delete
> > > > controllers and wait till the next day.
> >
> > > > Unfortunately the mapreduce process still seems to be running - 10
> > > > past midnight and my CPU has reached 100% again.
> >
> > > > Is there some way to kill these processes and get back control of my
> > > > app?
> >

[google-appengine] Re: Bulk data deletion woe

2010-11-14 Thread Erik


On Nov 14, 1:32 pm, Stephen Johnson  wrote:
> Why do you say that's silly? If your map reduce task does bulk deletes and
> let's say they do 100 at a time, then those 100 deletes are done in
> parallel. So that's 100x. So for each second of delete real time you're
> getting 100 seconds of CPU time.  You should be pleased that instead of your
> task taking 11 hours to delete all your data it took only 15 minutes. Isn't
> that scalability? Isn't that what you're looking for? How many entities did
> you delete? How many indexes did you have (composite and single property)?

This was using only 1 shard per kind that was being deleted, so
effectively there should be no parallelism occurring, unless there is
something I am missing?
Deleted about 300k entities, each with a single indexed collection.

> On Sun, Nov 14, 2010 at 10:29 AM, Erik  wrote:
>
> > If you check in the datastore viewer you might be able to find and
> > delete your jobs from one of the tables.  You may also need to go into
> > your task queues and purge the default.
>
> > On this topic, why does deleting data have such a large difference
> > between actual time spent and billed time?
>
> > For instance, I had two mapreduce shards running to delete data, which
> > took a combined total of 15 minutes, but I was actually charged for
> > 11(!) hours.  I know there isn't a 1:1 correlation but a >40x
> > difference is a little silly!
>
> > On Nov 14, 4:25 am, Justin  wrote:
> > > I've been trying to bulk delete data from my application as described
> > > here
>
> > >http://code.google.com/appengine/docs/python/datastore/creatinggettin...
>
> > > This seems to have kicked off a series of mapreduce workers, whose
> > > execution is killing my CPU - approximately 5 mins later I have
> > > reached 100% CPU time and am locked out for the rest of the day.
>
> > > I figure I'll just delete by hand; create some appropriate :delete
> > > controllers and wait till the next day.
>
> > > Unfortunately the mapreduce process still seems to be running - 10
> > > past midnight and my CPU has reached 100% again.
>
> > > Is there some way to kill these processes and get back control of my
> > > app?
>




Re: [google-appengine] Re: Bulk data deletion woe

2010-11-14 Thread Jeff Schwartz
I think there might have been a misunderstanding regarding parallel seconds
of API time & how that actually plays out as far as billing is concerned.
With that confusion out of the way, I'd answer your question regarding the
merits of AppEngine's scalability with a big YES. Like my mom used to say,
"nothing in life is really free so why should this one be?"

:)

Jeff

On Sun, Nov 14, 2010 at 1:32 PM, Stephen Johnson wrote:

> Why do you say that's silly? If your map reduce task does bulk deletes and
> let's say they do 100 at a time, then those 100 deletes are done in
> parallel. So that's 100x. So for each second of delete real time you're
> getting 100 seconds of CPU time.  You should be pleased that instead of your
> task taking 11 hours to delete all your data it took only 15 minutes. Isn't
> that scalability? Isn't that what you're looking for? How many entities did
> you delete? How many indexes did you have (composite and single property)?
>
> On Sun, Nov 14, 2010 at 10:29 AM, Erik  wrote:
>
>>
>> If you check in the datastore viewer you might be able to find and
>> delete your jobs from one of the tables.  You may also need to go into
>> your task queues and purge the default.
>>
>> On this topic, why does deleting data have such a large difference
>> between actual time spent and billed time?
>>
>> For instance, I had two mapreduce shards running to delete data, which
>> took a combined total of 15 minutes, but I was actually charged for
>> 11(!) hours.  I know there isn't a 1:1 correlation but a >40x
>> difference is a little silly!
>>
>>
>> On Nov 14, 4:25 am, Justin  wrote:
>> > I've been trying to bulk delete data from my application as described
>> > here
>> >
>> > http://code.google.com/appengine/docs/python/datastore/creatinggettin...
>> >
>> > This seems to have kicked off a series of mapreduce workers, whose
>> > execution is killing my CPU - approximately 5 mins later I have
>> > reached 100% CPU time and am locked out for the rest of the day.
>> >
>> > I figure I'll just delete by hand; create some appropriate :delete
>> > controllers and wait till the next day.
>> >
>> > Unfortunately the mapreduce process still seems to be running - 10
>> > past midnight and my CPU has reached 100% again.
>> >
>> > Is there some way to kill these processes and get back control of my
>> > app?
>>



-- 
*Jeff Schwartz*




Re: [google-appengine] Re: Bulk data deletion woe

2010-11-14 Thread Stephen Johnson
Why do you say that's silly? If your map reduce task does bulk deletes and
let's say they do 100 at a time, then those 100 deletes are done in
parallel. So that's 100x. So for each second of delete real time you're
getting 100 seconds of CPU time.  You should be pleased that instead of your
task taking 11 hours to delete all your data it took only 15 minutes. Isn't
that scalability? Isn't that what you're looking for? How many entities did
you delete? How many indexes did you have (composite and single property)?
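
Concretely, a batch delete is one call on a whole page of keys, and the
datastore applies the per-entity work in parallel on its side (a minimal
sketch; MyModel is a placeholder):

    from google.appengine.ext import db

    # a single RPC: the datastore performs the individual deletes and
    # index updates in parallel
    keys = MyModel.all(keys_only=True).fetch(100)
    db.delete(keys)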

On Sun, Nov 14, 2010 at 10:29 AM, Erik  wrote:

>
> If you check in the datastore viewer you might be able to find and
> delete your jobs from one of the tables.  You may also need to go into
> your task queues and purge the default.
>
> On this topic, why does deleting data have such a large difference
> between actual time spent and billed time?
>
> For instance, I had two mapreduce shards running to delete data, which
> took a combined total of 15 minutes, but I was actually charged for
> 11(!) hours.  I know there isn't a 1:1 correlation but a >40x
> difference is a little silly!
>
>
> On Nov 14, 4:25 am, Justin  wrote:
> > I've been trying to bulk delete data from my application as described
> > here
> >
> > http://code.google.com/appengine/docs/python/datastore/creatinggettin...
> >
> > This seems to have kicked off a series of mapreduce workers, whose
> > execution is killing my CPU - approximately 5 mins later I have
> > reached 100% CPU time and am locked out for the rest of the day.
> >
> > I figure I'll just delete by hand; create some appropriate :delete
> > controllers and wait till the next day.
> >
> > Unfortunately the mapreduce process still seems to be running - 10
> > past midnight and my CPU has reached 100% again.
> >
> > Is there some way to kill these processes and get back control of my
> > app?
>




[google-appengine] Re: Bulk data deletion woe

2010-11-14 Thread Justin
I can't get into the datastore viewer - whenever the CPU quota is
exceeded all the datastore admin interfaces return HTTP 503 -
something else to fix?

On Nov 14, 5:29 pm, Erik  wrote:
> If you check in the datastore viewer you might be able to find and
> delete your jobs from one of the tables.  You may also need to go into
> your task queues and purge the default.
>
> On this topic, why does deleting data have such a large difference
> between actual time spent and billed time?
>
> For instance, I had two mapreduce shards running to delete data, which
> took a combined total of 15 minutes, but I was actually charged for
> 11(!) hours.  I know there isn't a 1:1 correlation but a >40x
> difference is a little silly!
>
> On Nov 14, 4:25 am, Justin  wrote:
>
>
>
> > I've been trying to bulk delete data from my application as described
> > here
>
> >http://code.google.com/appengine/docs/python/datastore/creatinggettin...
>
> > This seems to have kicked off a series of mapreduce workers, whose
> > execution is killing my CPU - approximately 5 mins later I have
> > reached 100% CPU time and am locked out for the rest of the day.
>
> > I figure I'll just delete by hand; create some appropriate :delete
> > controllers and wait till the next day.
>
> > Unfortunately the mapreduce process still seems to be running - 10
> > past midnight and my CPU has reached 100% again.
>
> > Is there some way to kill these processes and get back control of my
> > app?




[google-appengine] Re: Bulk data deletion woe

2010-11-14 Thread Erik

If you check in the datastore viewer you might be able to find and
delete your jobs from one of the tables.  You may also need to go into
your task queues and purge the default.

On this topic, why does deleting data have such a large difference
between actual time spent and billed time?

For instance, I had two mapreduce shards running to delete data, which
took a combined total of 15 minutes, but I was actually charged for
11(!) hours.  I know there isn't a 1:1 correlation but a >40x
difference is a little silly!


On Nov 14, 4:25 am, Justin  wrote:
> I've been trying to bulk delete data from my application as described
> here
>
> http://code.google.com/appengine/docs/python/datastore/creatinggettin...
>
> This seems to have kicked off a series of mapreduce workers, whose
> execution is killing my CPU - approximately 5 mins later I have
> reached 100% CPU time and am locked out for the rest of the day.
>
> I figure I'll just delete by hand; create some appropriate :delete
> controllers and wait till the next day.
>
> Unfortunately the mapreduce process still seems to be running - 10
> past midnight and my CPU has reached 100% again.
>
> Is there some way to kill these processes and get back control of my
> app?
