Re: [google-appengine] Re: Cost of mapreduce was $6,500 to update a ListProperty on 14.1 million entities
I find Google's posted solution (http://code.google.com/apis/maps/articles/geospatial.html) quite suboptimal, as it is too expensive. There is a simple trick to get rid of this problem. Instead of indexing all geocells in a StringListProperty, you index only the most detailed cell: instead of [7, 7e, 7e3, 7e3a, 7e3a4] you index only 7e3a4, converted to an int64.

To search, you do range scans. Finding all items in the cell 7e3 is a range scan like geohash >= 7e3 and geohash < 7e4.

I have a library in Python that does all this, plus some more performance tricks such as merging two adjacent cells into a single range scan.

I found that my solution performs a tiny bit better and is much cheaper, because I don't need a StringListProperty in my index, just a simple IntegerProperty.

Of course my solution has one major drawback: you cannot do additional inequality searches, since my range scans already use the inequality (but you can still do bucketing to solve this issue). And of course you can still apply additional filters.

If enough people are interested in my solution I'll open source it.

Cheers,
-Andrin

On Fri, Jan 6, 2012 at 4:28 AM, Vivek Puri v...@vivekpuri.com wrote:

I also have a table, with 1.5TB of data. I need to truncate it, but I don't want to pay thousands to delete data (I paid thousands under the old pricing model for another table; not sure how much more it will cost now) while I pay hundreds for the data to stay there. The AppEngine team really needs to offer a cheaper way to delete data.

On Jan 5, 6:57 pm, Yohan yohan.lau...@gmail.com wrote:

Hi, I feel your pain. It cost me a few thousand dollars to delete my millions of entities from the datastore after a migration job (Ikai never replied to my post, though...) and I'm still paying, since the deletion is not completed yet (spending $100-300 a day for the past 2 weeks now!!). I'm not doing much, just running the delete-all mapreduce job from the admin panel.

There is definitely something wrong with the way datastore writes are priced, and Google should seriously do something about it before they lose their big customers (i.e. the ones affected by this problem). It is simply too costly to go through your data to change an index, update things, or delete your data. And in your case (like mine), even if you want to take your data out to externalize your custom search and storage, it will cost you $X,000+ to take it out and another $XX,000 to clean up behind you (you seem to have a lot of indexed properties in your dataset).

Please keep me posted on how things go with you, as I'm still hoping I can get some credit/refund/assistance from Google at this stage, although I haven't heard from them.

On Jan 6, 7:24 am, Corey [Firespotter] co...@firespotter.com wrote:

I work with Petey on this and can help clarify some of the details.

The Entities:
We have a lot of entities (~14 million), each of which has a StringListProperty called geoboxes. Like so:

    class Place(search.SearchableModel):
        name = db.StringProperty()
        ...
        # Location specific fields.
        coordinates = db.GeoPtProperty(default=None)
        geohash = db.StringProperty()
        geoboxes = db.StringListProperty()

Background (details on geoboxing at bottom):
We're running a mapreduce to change the geobox sizes/precision for a large number of entities. These entities currently have a 'geoboxes' StringListProperty with ~20 strings. For example:

    geoboxes = [u'37.341|-121.894|37.339|-121.892', u'37.341|-121.892|37.339|-121.891', ...]

We are changing those 20 strings to 20 new strings. Example:

    geoboxes = [u'37.3411|-121.8940|37.3395|-121.8926', u'37.3411|-121.8929|37.3395|-121.8916', ...]

The Cost:
We did almost this same mapreduce when we first added the geoboxes back in July. In that case we were populating the list for the first time, so we can assume half as many operations were required (no removing of old values). Total cost in July was ~$160 for the CPU time.

When we ran the mapreduce again this week to change the box sizes, the cost was $18 for Frontend Instance Hours, $15 for Datastore Reads (21 million) and $2,500 for Datastore Writes (2,500 million). This was not a complete run of the mapreduce. We aborted it after 5.4 million (38%) of the entities were updated. Hence Petey's estimate that the full update would cost $6,500.

The Operations:
Each entity update removes ~20 existing strings from the geoboxes StringList and adds 20 more. The geobox property is indexed (and has to be) and is involved in 3 composite indexes, so as best I understand it each string change results in 10 writes (4 + 2 * 3). So on every entity where we update the geoboxes we perform 401 write operations (1 + 10 * 40). This agrees pretty well with the charges: (2,500,000,000 ops / 5,424,000 entities) = 460 ops per entity. That's a lot of
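The range-scan idea that opens this message (store only the finest cell as an integer, then query a half-open range) can be sketched roughly as below. This is not Andrin's library, which is not published in this thread. It assumes cells are strings of hex digits, as the example 7e3a4 suggests, and that every stored value is padded to one fixed resolution; `MAX_RESOLUTION`, `cell_to_int`, `cell_range` and `merge_ranges` are hypothetical names chosen for illustration, and the datastore query itself is left out.

```python
# Sketch only: assumes geocells are hex-digit strings (e.g. "7e3a4") and
# that every entity stores its finest cell at one fixed resolution.
MAX_RESOLUTION = 13  # hex digits; 13 * 4 = 52 bits, which fits in an int64


def cell_to_int(cell, resolution=MAX_RESOLUTION):
    """Encode a cell as an integer, left-aligned to `resolution` hex digits."""
    if not 0 < len(cell) <= resolution:
        raise ValueError("cell length must be between 1 and %d" % resolution)
    return int(cell, 16) << (4 * (resolution - len(cell)))


def cell_range(cell, resolution=MAX_RESOLUTION):
    """Return (low, high) so that low <= value < high matches every
    finest-resolution cell contained in `cell`."""
    shift = 4 * (resolution - len(cell))
    prefix = int(cell, 16)
    return prefix << shift, (prefix + 1) << shift


def merge_ranges(ranges):
    """Merge touching or overlapping half-open ranges into fewer scans."""
    merged = []
    for low, high in sorted(ranges):
        if merged and low <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], high))
        else:
            merged.append((low, high))
    return merged


if __name__ == "__main__":
    # With a resolution of 5 digits, cell "7e3" covers 0x7e300 up to
    # (but not including) 0x7e400, so the stored value for "7e3a4" matches.
    low, high = cell_range("7e3", resolution=5)
    print(low <= cell_to_int("7e3a4", resolution=5) < high)
    # Two neighbouring cells collapse into one range scan.
    print(merge_ranges([cell_range("7e3", 5), cell_range("7e4", 5)]))
```

The pair returned by `cell_range` maps onto a single query of the form geohash >= low and geohash < high, which is why this approach uses up the one inequality filter the datastore allows per query.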
Re: [google-appengine] Re: Cost of mapreduce was $6,500 to update a ListProperty on 14.1 million entities
On Fri, Jan 6, 2012 at 7:47 AM, Richard Watson richard.wat...@gmail.com wrote:

What if you had the GPS data as children of each entry, then used a keys-only query to match and fetched the parents? I forget the technique's name; maybe someone else remembers. The benefit is that when you need to edit GPS coords you leave the parent alone. Data in the parent isn't duplicated and all changes happen only to the children. No parent data is re-indexed, so you reduce datastore charges on updates. I'm not 100% sure it'd help, but it might be worth testing.

This shouldn't help. Re-putting an entity won't cause index updates if the indexed values don't change. The relation index entity pattern is only useful when you have very large numbers of index items (many thousands). You wouldn't want to do it for 20 short strings.

Jeff

--
You received this message because you are subscribed to the Google Groups "Google App Engine" group.
To post to this group, send email to google-appengine@googlegroups.com.
To unsubscribe from this group, send email to google-appengine+unsubscr...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/google-appengine?hl=en.
Re: [google-appengine] Re: Cost of mapreduce was $6,500 to update a ListProperty on 14.1 million entities
Brian (apologies if that is not your name),

How much of the costs are instance hours versus datastore writes? There's probably something going on here. The largest costs are to update indexes, not entities. Assuming $6500 is the cost of datastore writes alone, that breaks down to:

~$0.0004 per entity put

Pricing is $0.10 per 100k operations, so that means using this equation:

(6500.00 / 1400) / (0.10 / 10)

You're doing about 464 write operations per put, which roughly translates to 6.5 billion writes.

I'm trying to extrapolate what you are doing, and it sounds like you are doing full text indexing or something similar ... and having to update all the indexes. When you update a property, it takes a certain number of writes. Assuming you are changing String properties, each property you update takes this many writes:

- 2 indexes deleted (ascending and descending)
- 2 indexes updated (ascending and descending)

So if you were only updating list properties, that means you are updating about 100 list property values per entity (464 / 4 is roughly 116).

Given that this is a regular thing you need to do, perhaps there is an engineering solution for what you are trying to do that will be more cost effective. Can you describe why you're running this job? What features does this support in your product?

--
Ikai Lan
Developer Programs Engineer, Google App Engine
plus.ikailan.com | twitter.com/ikai

On Thu, Jan 5, 2012 at 10:08 AM, Petey brianpeter...@gmail.com wrote:

In this one case we had to change all of the items in the listproperty. In our most common case we might have to add and delete a couple of items in the list property every once in a while. That would still cost us well over $1,000 each time. Most of the reason for this type of data in our product is to compensate for the fact that there isn't full text search yet. I know they are beta testing full text, but I'm still worried that it also might be too expensive per write.

On Jan 5, 6:54 am, Richard Watson richard.wat...@gmail.com wrote:

A couple of thoughts.

Maybe the GAE team should borrow the idea of spot prices from Amazon. That's a great way to have lower-priority jobs that can run when there are instances available. We set the price we're willing to pay; if the spot cost drops below that, we get the resources. It creates a market where more urgent jobs get done sooner and Google makes better use of quiet periods.

On your issue: do you need to update every entity when you do this? How many items on the listproperty need to be changed? Could you tell us a bit more about what the data looks like? I'm thinking that 14 million entities x 18 items each is the number of entries you really have, each distributed across at least 3 servers and then indexed. That seems like a lot of writes if you're re-writing everything. It's likely a bad idea to rely on an infrastructure change to fix this (recurring) issue, but there is hopefully a way to reduce the number of writes you have to do.

Also, could you maybe run your mapreduce on smaller sets of the data to spread it out over multiple days and avoid adding too many instances? Has anyone done anything like this?
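The back-of-the-envelope estimate at the top of this message, and the per-string breakdown Corey gives elsewhere in the thread, can be checked with a few lines of arithmetic. The sketch below uses only figures quoted in the thread ($0.10 per 100k write operations, 14 million entities in the estimate, 14.1 million in the subject line, Corey's "4 + 2 * 3" and "1 + 10 * 40"). The function names are made up for illustration, and the per-value write model is Corey's stated understanding, not something confirmed here against the billing documentation.

```python
# Sketch only: reproduces the arithmetic quoted in this thread.
PRICE_PER_WRITE = 0.10 / 100000  # USD, from "$0.10 per 100k operations"


def writes_per_put_from_bill(total_cost, entities):
    """Write ops per entity put implied by a bill for datastore writes."""
    return total_cost / entities / PRICE_PER_WRITE


def writes_per_changed_value(composite_indexes):
    """Corey's model: 4 writes for the built-in index rows of one changed
    value, plus 2 writes for each composite index it participates in."""
    return 4 + 2 * composite_indexes


def writes_per_entity(values_changed, composite_indexes):
    """Corey's model: 1 write for the entity, plus the index writes for
    every list value that was removed or added."""
    return 1 + writes_per_changed_value(composite_indexes) * values_changed


if __name__ == "__main__":
    # The estimate in this message: $6,500 over 14 million entities.
    # (6500.00 / 1400) / (0.10 / 10) is the same ratio with a common
    # factor of 10,000 cancelled from both divisors.
    print(writes_per_put_from_bill(6500.00, 14000000))  # about 464

    # Corey's breakdown: 20 strings removed + 20 added, 3 composite indexes.
    print(writes_per_entity(values_changed=40, composite_indexes=3))

    # Rate observed on the aborted run, scaled to all 14.1 million entities.
    observed = 2500000000 / 5424000
    print(observed * 14100000 * PRICE_PER_WRITE)  # close to $6,500
```

Under this model the cost scales with the number of list values that change, not with the number of entities alone, which is why the first run in July (values only added) and a run that replaces every value are priced so differently.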
Re: [google-appengine] Re: Cost of mapreduce was $6,500 to update a ListProperty on 14.1 million entities
I think your problem is similar to mine: http://groups.google.com/group/google-appengine-java/browse_thread/thread/1ace5bd8658d89d/a62d0b3f2b3c4e74#a62d0b3f2b3c4e74

Ikai, could you please explain how many write ops we should expect when updating an indexed list property by adding X items to the list?

For example:

Model (Objectify annotations):

    @Entity
    class RelationIndex {
        @Parent Key<User> ownerKey;
        @Indexed List<Key> receiverKeyList;
    }

Define:

X = number of new items to add to the list.
Y = number of entities to update (same entity group), with 1 indexed list property per entity.
Z = number of items in the list before the update.

Magic calculator:

Total write ops = Y *
Re: [google-appengine] Re: Cost of mapreduce was $6,500 to update a ListProperty on 14.1 million entities
Iván,

2012/1/6 Iván Rodríguez ivan.rd...@gmail.com

I think your problem is similar to mine: http://groups.google.com/group/google-appengine-java/browse_thread/thread/1ace5bd8658d89d/a62d0b3f2b3c4e74#a62d0b3f2b3c4e74

Ikai, could you please explain how many write ops we should expect when updating an indexed list property by adding X items to the list?

This page can help you work out the costs for your particular entities and indexes: http://code.google.com/appengine/docs/billing.html#Billable_Resource_Unit_Cost

E.g., it details the costs for the different datastore operations given an entity's properties and indexes.
Re: [google-appengine] Re: Cost of mapreduce was $6,500 to update a ListProperty on 14.1 million entities
On Fri, Jan 6, 2012 at 8:00 AM, Amy Unruh amyu+gro...@google.com wrote:

This page can help you work out the costs for your particular entities and indexes: http://code.google.com/appengine/docs/billing.html#Billable_Resource_Unit_Cost

E.g., it details the costs for the different datastore operations given an entity's properties and indexes.

See this as well: http://code.google.com/appengine/docs/python/datastore/entities.html#Understanding_Write_Costs, which discusses multi-value properties. These can lead to expensive indexes.

-Amy