Re: [google-appengine] Re: ~7 GB of ghost data???
Hi,

On Tue, Mar 23, 2010 at 10:25 AM, homunq jameson.qu...@gmail.com wrote:
> On Mar 22, 3:48 pm, Nick Johnson (Google) nick.john...@google.com wrote:
>> On Mon, Mar 22, 2010 at 8:45 PM, homunq jameson.qu...@gmail.com wrote:
>>> OK, after hashing it out on IRC, I see that I have to erase my data and start again.
>>
>> Why is that? Wouldn't updating the data be a better option?
>
> Because everything about it is wrong for saving space - the key names, the field names, the indexes, and even, in one case, the fact of breaking a string out into a list (something I did for better searching in several cases, one of which is not worth it now that I realize 10X overhead is easy to hit). And because the data import runs smoothly, and I already have code for it.
>
> Watching my deletion process start to get trapped in molasses, as Eli Jones mentions above, I have to ask two things again:
>
> 1. Is there ANY way to delete all indexes on a given property name? Without worrying about keeping indexes in order when I'm just paring them down to zero, I'd just be running through key names and deleting them. It seems that would be much faster. (If it's any help, I strongly suspect that most of my key names are globally unique across all of Google.)

No - that would violate the constraint that indexes are always kept in sync with the data they refer to.

> 2. What is the reason for the slowdown? If I understand his suggestion to delete every 10th record, Eli Jones seems to suspect that it's because there's some kind of resource contention on specific sections of storage, so the solution is to spread your load across machines. I don't see why that would cause a gradual slowdown. My best theory is that write-then-delete leaves the index somehow a little messier (for instance, maybe the index doesn't fully recover the unused space because it expects you to fill it again), and that when you do it on a massive scale you get massively messy and slow indexes.
> Thus, again, I suspect this question reduces to question 1, although I guess that if my theory is right, a compress/garbage-collect/degunk call for the indexes would be (for me) second best after a way to nuke them.

Deletes using the naive approach slow down because when a record is deleted in Bigtable, a 'tombstone' record is simply inserted to indicate that the original record is deleted - the record isn't actually removed from the datastore until the tablet it's on does its next compaction cycle. Until then, every subsequent query has to skip over the tombstone records to find the live ones. This is easy to avoid: use cursors to delete records sequentially. That way, your queries won't be skipping the same tombstoned records over and over again - O(n) instead of O(n^2)!

-Nick Johnson

--
Nick Johnson, Developer Programs Engineer, App Engine
Google Ireland Ltd. :: Registered in Dublin, Ireland, Registration Number: 368047

--
You received this message because you are subscribed to the Google Groups "Google App Engine" group.
To post to this group, send email to google-appeng...@googlegroups.com.
To unsubscribe from this group, send email to google-appengine+unsubscr...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/google-appengine?hl=en.
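[Editor's note: Nick's O(n) vs O(n^2) point can be made concrete with a small cost model. This is an illustrative sketch, not App Engine code - the batch size and unit-cost accounting are made up - but it captures why re-running "fetch the first N and delete them" degrades while a cursor does not.]

```python
def naive_delete_cost(n, batch=100):
    """Cost of deleting n records by repeatedly querying from the start:
    each batch must first skip every tombstone left by earlier batches."""
    cost, deleted = 0, 0
    while deleted < n:
        k = min(batch, n - deleted)
        cost += deleted + k   # skip 'deleted' tombstones, then read k live rows
        deleted += k
    return cost

def cursor_delete_cost(n, batch=100):
    """Cost when each query resumes from a cursor: no tombstone re-scanning,
    so total work stays proportional to n."""
    cost, deleted = 0, 0
    while deleted < n:
        k = min(batch, n - deleted)
        cost += k             # read only the k live rows after the cursor
        deleted += k
    return cost
```

For 1,000 records in batches of 100, the naive approach scans 5,500 rows (live plus tombstones) while the cursor approach scans exactly 1,000 - and the gap widens quadratically as n grows.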
Re: [google-appengine] Re: ~7 GB of ghost data???
On Tue, Mar 23, 2010 at 1:57 PM, homunq jameson.qu...@gmail.com wrote:
>>> Watching my deletion process start to get trapped in molasses, as Eli Jones mentions above, I have to ask two things again:
>>>
>>> 1. Is there ANY way to delete all indexes on a given property name? Without worrying about keeping indexes in order when I'm just paring them down to zero, I'd just be running through key names and deleting them. It seems that would be much faster. (If it's any help, I strongly suspect that most of my key names are globally unique across all of Google.)
>>
>> No - that would violate the constraint that indexes are always kept in sync with the data they refer to.
>
> It seems to me that having no index at all is the same situation as if the property had been indexed=False from the beginning. If that's so, it can't be violating a hard constraint.

Internally, indexed fields are stored in the 'properties' list in the Entity protocol buffer, while unindexed fields are stored in the 'unindexed_properties' list. The only way to change a property's indexing is to fetch the entities and store them again.

>>> 2. What is the reason for the slowdown? If I understand his suggestion to delete every 10th record, Eli Jones seems to suspect that it's because there's some kind of resource contention on specific sections of storage, so the solution is to spread your load across machines. I don't see why that would cause a gradual slowdown. My best theory is that write-then-delete leaves the index somehow a little messier (for instance, maybe the index doesn't fully recover the unused space because it expects you to fill it again), and that when you do it on a massive scale you get massively messy and slow indexes.
>>>
>>> Thus, again, I suspect this question reduces to question 1, although I guess that if my theory is right, a compress/garbage-collect/degunk call for the indexes would be (for me) second best after a way to nuke them.
>> Deletes using the naive approach slow down because when a record is deleted in Bigtable, a 'tombstone' record is simply inserted to indicate that the original record is deleted - the record isn't actually removed from the datastore until the tablet it's on does its next compaction cycle. Until then, every subsequent query has to skip over the tombstone records to find the live ones. This is easy to avoid: use cursors to delete records sequentially. That way, your queries won't be skipping the same tombstoned records over and over again - O(n) instead of O(n^2)!
>
> Thanks for explaining. Can you say anything about how often the compaction cycles are? Just an order of magnitude - hours, days, or weeks?

They're based on the quantity of modifications to the data in a given tablet. Doing many inserts, updates or deletes will, sooner or later, trigger a compaction.

-Nick Johnson

> Thanks, Jameson
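[Editor's note: a minimal model of Nick's point about where properties live. The dict-based "entity" below is hypothetical - the real Entity protocol buffer is more involved - but it shows why flipping a property to unindexed necessarily means rewriting the stored entity: the value has to move from one list to the other.]

```python
def mark_unindexed(entity, name):
    """Return a copy of the entity with property `name` moved from the
    indexed 'properties' list to 'unindexed_properties' - the moral
    equivalent of setting indexed=False on the model and re-putting."""
    props = dict(entity["properties"])
    unindexed = dict(entity["unindexed_properties"])
    if name in props:
        unindexed[name] = props.pop(name)
    return {"properties": props, "unindexed_properties": unindexed}
```

The migration Nick describes is then: fetch each entity, rewrite it in the new shape, and store it again - there is no server-side shortcut that edits the index without touching the entity.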
Re: [google-appengine] Re: ~7 GB of ghost data???
Hey Nick,

Just out of curiosity, how many properties would it take to get that amount of wasted space in overhead? Are we talking about entities with properties on the order of tens, hundreds, or thousands?

On Mon, Mar 22, 2010 at 9:07 AM, homunq jameson.qu...@gmail.com wrote:
> OK, I guess I'm guilty on all counts. Clearly, I can fix that going forward, though it will cost me a lot of CPU to fix the data I've already entered. But as a short-term stopgap, is there any way to delete entire default indexes for a given property? (I mean, anything besides setting indexed=False and then touching each entity one by one.) You can vacuum custom indexes - can you do it with the indexes created by default?
>
> Thanks, Jameson
>
> On 22 mar, 03:42, Nick Johnson (Google) nick.john...@google.com wrote:
>> Hi,
>>
>> The discrepancy between datastore stats volume and stored data is generally due to indexing overhead, which is not included in the datastore stats. This can be very high for entities with many properties, or with long entity and property names or entity keys. Do you have reason to suppose that's not the case in your situation?
>>
>> -Nick Johnson
>>
>> On Sun, Mar 21, 2010 at 3:39 AM, homunq jameson.qu...@gmail.com wrote:
>>> Something is wrong. My app is showing 7.42GB of total stored data, but only 615 MB of datastore. There is only one version uploaded, which is almost 150MB, and nothing in the blobstore. This discrepancy has been getting worse - several hours ago (longer ago than the last datastore-statistics update, if you're wondering), there were the same 615 MB in the datastore, and only 3.09GB of total stored data. (At that time, my theory was that it was old uploads of tweaks to the same version - but the numbers have gone far, far beyond that explanation now.) It's not some exploding index; the only non-default index I have is on an entity type with just 33 entities.
>>> Here's the line from my dashboard:
>>>
>>>   Total Stored Data  $0.005/GByte-day  82%  7.42 of 9.00 GBytes  $0.04 / $0.04
>>>
>>> And here is the word from my datastore statistics:
>>>
>>>   Last updated: 1:32:13 ago  Total number of entities: 232,867  Size of all entities: 615 MBytes (metadata 11%, if that matters)
>>>
>>> Please, can someone help me figure out this issue? I'd be happy to share any info or code which would help track this down. My app id is vulahealth.

--
Patrick H. Twohig.
Namazu Studios
P.O. Box 34161
San Diego, CA 92163-4161
Re: [google-appengine] Re: ~7 GB of ghost data???
Hi Patrick,

An overhead factor of 12 (as observed below) is high, but not outrageous. With long model names and property names, this could happen with relatively few indexed properties - on the order of tens, at most.

-Nick Johnson

On Mon, Mar 22, 2010 at 8:07 PM, Patrick Twohig patr...@namazustudios.com wrote:
> Hey Nick,
>
> Just out of curiosity, how many properties would it take to get that amount of wasted space in overhead? Are we talking about entities with properties on the order of tens, hundreds, or thousands?
>
> On Mon, Mar 22, 2010 at 9:07 AM, homunq jameson.qu...@gmail.com wrote:
>> OK, I guess I'm guilty on all counts. Clearly, I can fix that going forward, though it will cost me a lot of CPU to fix the data I've already entered. But as a short-term stopgap, is there any way to delete entire default indexes for a given property? (I mean, anything besides setting indexed=False and then touching each entity one by one.) You can vacuum custom indexes - can you do it with the indexes created by default?
>>
>> Thanks, Jameson
>>
>> On 22 mar, 03:42, Nick Johnson (Google) nick.john...@google.com wrote:
>>> Hi,
>>>
>>> The discrepancy between datastore stats volume and stored data is generally due to indexing overhead, which is not included in the datastore stats. This can be very high for entities with many properties, or with long entity and property names or entity keys. Do you have reason to suppose that's not the case in your situation?
>>>
>>> -Nick Johnson
>>>
>>> On Sun, Mar 21, 2010 at 3:39 AM, homunq jameson.qu...@gmail.com wrote:
>>>> Something is wrong. My app is showing 7.42GB of total stored data, but only 615 MB of datastore. There is only one version uploaded, which is almost 150MB, and nothing in the blobstore. This discrepancy has been getting worse - several hours ago (longer ago than the last datastore-statistics update, if you're wondering), there were the same 615 MB in the datastore, and only 3.09GB of total stored data.
>>>> (At that time, my theory was that it was old uploads of tweaks to the same version - but the numbers have gone far, far beyond that explanation now.) It's not some exploding index; the only non-default index I have is on an entity type with just 33 entities.
>>>>
>>>> Here's the line from my dashboard:
>>>>
>>>>   Total Stored Data  $0.005/GByte-day  82%  7.42 of 9.00 GBytes  $0.04 / $0.04
>>>>
>>>> And here is the word from my datastore statistics:
>>>>
>>>>   Last updated: 1:32:13 ago  Total number of entities: 232,867  Size of all entities: 615 MBytes (metadata 11%, if that matters)
>>>>
>>>> Please, can someone help me figure out this issue? I'd be happy to share any info or code which would help track this down. My app id is vulahealth.
Re: [google-appengine] Re: ~7 GB of ghost data???
I'd use a cursor on the task queue. Do bulk deletes in blocks of 500 (I think that's the most keys you can pass to delete in a single call), and it shouldn't be that hard to wipe it out.

Cheers!

On Mon, Mar 22, 2010 at 1:45 PM, homunq jameson.qu...@gmail.com wrote:
> OK, after hashing it out on IRC, I see that I have to erase my data and start again. Since it took me 3 days of CPU quota to add the data, I want to know if I can erase it quickly.
>
> 1. Is the overhead for erasing data (and thus whittling down indexes) over half the overhead of adding it? Under 10%? Or what? (I don't need exact numbers, just approximations.)
>
> 2. If it's more like half - is there some way to just nuke all my data and start over?
>
> Thanks, Jameson
>
> On 22 mar, 03:42, Nick Johnson (Google) nick.john...@google.com wrote:
>> Hi,
>>
>> The discrepancy between datastore stats volume and stored data is generally due to indexing overhead, which is not included in the datastore stats. This can be very high for entities with many properties, or with long entity and property names or entity keys. Do you have reason to suppose that's not the case in your situation?
>>
>> -Nick Johnson
>>
>> On Sun, Mar 21, 2010 at 3:39 AM, homunq jameson.qu...@gmail.com wrote:
>>> Something is wrong. My app is showing 7.42GB of total stored data, but only 615 MB of datastore. There is only one version uploaded, which is almost 150MB, and nothing in the blobstore. This discrepancy has been getting worse - several hours ago (longer ago than the last datastore-statistics update, if you're wondering), there were the same 615 MB in the datastore, and only 3.09GB of total stored data. (At that time, my theory was that it was old uploads of tweaks to the same version - but the numbers have gone far, far beyond that explanation now.) It's not some exploding index; the only non-default index I have is on an entity type with just 33 entities.
>>> Here's the line from my dashboard:
>>>
>>>   Total Stored Data  $0.005/GByte-day  82%  7.42 of 9.00 GBytes  $0.04 / $0.04
>>>
>>> And here is the word from my datastore statistics:
>>>
>>>   Last updated: 1:32:13 ago  Total number of entities: 232,867  Size of all entities: 615 MBytes (metadata 11%, if that matters)
>>>
>>> Please, can someone help me figure out this issue? I'd be happy to share any info or code which would help track this down. My app id is vulahealth.
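[Editor's note: the batching half of Patrick's suggestion is plain list-slicing. The sketch below is not App Engine code - the key strings are made up - but it shows how a keys-only result set would be split into blocks of at most 500 before each would be handed to a single bulk-delete call.]

```python
BATCH = 500  # the per-call delete limit Patrick refers to

def chunked(keys, size=BATCH):
    """Yield successive blocks of at most `size` keys."""
    for i in range(0, len(keys), size):
        yield keys[i:i + size]

# Hypothetical key list standing in for a keys-only query result.
keys = ["key%d" % i for i in range(1234)]
blocks = list(chunked(keys))   # 500 + 500 + 234
```

On the task queue, each task would delete one block and then enqueue the next task with the query cursor, so no single request has to outlive the deadline.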
Re: [google-appengine] Re: ~7 GB of ghost data???
Oh man... well, he's going to be wiping out 7GB of junk... :)

When I went through the process of deleting something like 400MB of junk, it was not fun. First I started off deleting by __key__ in batches of 500, then I had to limit down to 200... then down to 100... then 50... then 10... then it stopped responding for hours (I could not even fetch(1) from the Model).

There must be a sanctioned way to remove 100,000s of entities based on how the datastore is structured. For example, does it make sense to do something like this? Use a cursor to:

1. Select __key__ from Model Order By __key__.
2. Append every 10th (or 100th) result to a list, and delete that list for every 100 or 200 or 500 entities added.
3. Once at the end of the cursor, start over at the beginning.

That way, you wouldn't be deleting everything on the same table at the same time? The datastore completely died on me when I tried to straight delete by __key__ using GqlQuery in a loop - it just kept getting slower and slower. (I think directly deleting by key_name might work better, but I never had to do a bulk delete again, so I haven't tested that theory.)

On Mon, Mar 22, 2010 at 5:19 PM, Patrick Twohig patr...@namazustudios.com wrote:
> I'd use a cursor on the task queue. Do bulk deletes in blocks of 500 (I think that's the most keys you can pass to delete in a single call), and it shouldn't be that hard to wipe it out.
>
> Cheers!
>
> On Mon, Mar 22, 2010 at 1:45 PM, homunq jameson.qu...@gmail.com wrote:
>> OK, after hashing it out on IRC, I see that I have to erase my data and start again. Since it took me 3 days of CPU quota to add the data, I want to know if I can erase it quickly.
>>
>> 1. Is the overhead for erasing data (and thus whittling down indexes) over half the overhead of adding it? Under 10%? Or what? (I don't need exact numbers, just approximations.)
>>
>> 2. If it's more like half - is there some way to just nuke all my data and start over?
>> Thanks, Jameson
>>
>> On 22 mar, 03:42, Nick Johnson (Google) nick.john...@google.com wrote:
>>> Hi,
>>>
>>> The discrepancy between datastore stats volume and stored data is generally due to indexing overhead, which is not included in the datastore stats. This can be very high for entities with many properties, or with long entity and property names or entity keys. Do you have reason to suppose that's not the case in your situation?
>>>
>>> -Nick Johnson
>>>
>>> On Sun, Mar 21, 2010 at 3:39 AM, homunq jameson.qu...@gmail.com wrote:
>>>> Something is wrong. My app is showing 7.42GB of total stored data, but only 615 MB of datastore. There is only one version uploaded, which is almost 150MB, and nothing in the blobstore. This discrepancy has been getting worse - several hours ago (longer ago than the last datastore-statistics update, if you're wondering), there were the same 615 MB in the datastore, and only 3.09GB of total stored data. (At that time, my theory was that it was old uploads of tweaks to the same version - but the numbers have gone far, far beyond that explanation now.) It's not some exploding index; the only non-default index I have is on an entity type with just 33 entities.
>>>>
>>>> Here's the line from my dashboard:
>>>>
>>>>   Total Stored Data  $0.005/GByte-day  82%  7.42 of 9.00 GBytes  $0.04 / $0.04
>>>>
>>>> And here is the word from my datastore statistics:
>>>>
>>>>   Last updated: 1:32:13 ago  Total number of entities: 232,867  Size of all entities: 615 MBytes (metadata 11%, if that matters)
>>>>
>>>> Please, can someone help me figure out this issue? I'd be happy to share any info or code which would help track this down. My app id is vulahealth.
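[Editor's note: Eli's every-10th-key scheme can be sketched as pure list manipulation. This is an illustration of the proposed delete ordering only - the integer "keys" are hypothetical, and whether striding actually relieves tablet hotspots is Eli's conjecture, not an established fact.]

```python
def strided_delete_order(keys, stride=10):
    """Return the order in which keys would be deleted under Eli's scheme:
    take every `stride`-th key from an ordered scan, delete those, then
    restart the scan over the survivors until nothing remains."""
    order, remaining = [], list(keys)
    while remaining:
        picked = remaining[::stride]       # every stride-th key this pass
        order.extend(picked)
        picked_set = set(picked)
        remaining = [k for k in remaining if k not in picked_set]
    return order
```

Every key is eventually deleted, but consecutive deletes land roughly `stride` keys apart in the keyspace instead of hammering one contiguous range.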
Re: [google-appengine] Re: ~7 GB of ghost data???
Hi,

On Mon, Mar 22, 2010 at 8:45 PM, homunq jameson.qu...@gmail.com wrote:
> OK, after hashing it out on IRC, I see that I have to erase my data and start again.

Why is that? Wouldn't updating the data be a better option?

> Since it took me 3 days of CPU quota to add the data, I want to know if I can erase it quickly.
>
> 1. Is the overhead for erasing data (and thus whittling down indexes) over half the overhead of adding it? Under 10%? Or what? (I don't need exact numbers, just approximations.)

It should be significantly lower - you can do a keys-only query, and delete the returned keys.

-Nick Johnson

> 2. If it's more like half - is there some way to just nuke all my data and start over?
>
> Thanks, Jameson
>
> On 22 mar, 03:42, Nick Johnson (Google) nick.john...@google.com wrote:
>> Hi,
>>
>> The discrepancy between datastore stats volume and stored data is generally due to indexing overhead, which is not included in the datastore stats. This can be very high for entities with many properties, or with long entity and property names or entity keys. Do you have reason to suppose that's not the case in your situation?
>>
>> -Nick Johnson
>>
>> On Sun, Mar 21, 2010 at 3:39 AM, homunq jameson.qu...@gmail.com wrote:
>>> Something is wrong. My app is showing 7.42GB of total stored data, but only 615 MB of datastore. There is only one version uploaded, which is almost 150MB, and nothing in the blobstore. This discrepancy has been getting worse - several hours ago (longer ago than the last datastore-statistics update, if you're wondering), there were the same 615 MB in the datastore, and only 3.09GB of total stored data. (At that time, my theory was that it was old uploads of tweaks to the same version - but the numbers have gone far, far beyond that explanation now.) It's not some exploding index; the only non-default index I have is on an entity type with just 33 entities.
>>> Here's the line from my dashboard:
>>>
>>>   Total Stored Data  $0.005/GByte-day  82%  7.42 of 9.00 GBytes  $0.04 / $0.04
>>>
>>> And here is the word from my datastore statistics:
>>>
>>>   Last updated: 1:32:13 ago  Total number of entities: 232,867  Size of all entities: 615 MBytes (metadata 11%, if that matters)
>>>
>>> Please, can someone help me figure out this issue? I'd be happy to share any info or code which would help track this down. My app id is vulahealth.
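[Editor's note: Nick's keys-only-then-delete advice, sketched against an in-memory dict standing in for the datastore. The store and key names are hypothetical; on App Engine this pattern would pair a keys-only query with batched bulk deletes rather than Python `del`.]

```python
def delete_all(store, batch=500):
    """Fetch keys only (never deserializing entity bodies), then delete
    them in batches of up to `batch`. Returns the number deleted."""
    deleted = 0
    keys = list(store)                 # keys-only "query": bodies untouched
    for i in range(0, len(keys), batch):
        block = keys[i:i + batch]
        for k in block:                # stand-in for one bulk-delete call
            del store[k]
        deleted += len(block)
    return deleted
```

The saving Nick describes comes from never reading or re-serializing the entity data on the way out - only the keys cross the wire before the deletes are issued.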