Re: [google-appengine] Re: Cost of mapreduce was $6,500 to update a ListProperty on 14.1 million entities

2012-01-06 Thread Andrin von Rechenberg
I find Google's posted solution
(http://code.google.com/apis/maps/articles/geospatial.html) quite
suboptimal, as it is too expensive.

There is a simple trick to get rid of this problem. Instead of indexing
all geocells in a StringListProperty, you only index the most detailed
cell: instead of [7, 7e, 7e3, 7e3a, 7e3a4] you only index 7e3a4,
converted to an int64. To search, you do range scans. Finding all
items in the cell 7e3 is a range scan like:
geohash >= '7e3' AND geohash < '7e4'
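
The trick can be sketched in a few lines of Python (my own reconstruction, not Andrin's library; the hex geocell alphabet and the maximum resolution of 13 are assumptions):

```python
# Reconstruction of the single-IntegerProperty geocell trick described
# above (not Andrin's actual library). Geocell characters are assumed
# to be hex digits, so each character contributes 4 bits.
MAX_RESOLUTION = 13  # assumed maximum cell depth

def cell_to_int64(cell):
    """Pack a geocell string into an integer, left-aligned to MAX_RESOLUTION."""
    padded = cell + '0' * (MAX_RESOLUTION - len(cell))
    return int(padded, 16)

def cell_range(prefix):
    """Half-open [lo, hi) integer range covering every cell under `prefix`."""
    lo = cell_to_int64(prefix)
    hi = lo + 16 ** (MAX_RESOLUTION - len(prefix))
    return lo, hi

# Finding everything in cell 7e3 then becomes one range scan on a single
# IntegerProperty:  WHERE geohash >= lo AND geohash < hi
lo, hi = cell_range('7e3')
```

Two cells whose ranges are adjacent collapse into a single scan, which is presumably the merging trick mentioned below.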

I have a library in Python that does all this, plus some more performance
tricks like merging two adjacent cells into a single range scan,
etc. I found that my solution performs a tiny bit better and is
much cheaper, because I don't need a StringListProperty in my index, just
a simple IntegerProperty. Of course my solution has one major
drawback: you cannot do additional inequality searches, since my
range scan already uses the inequality (though you can still do bucketing to
solve this issue), and of course you can do additional equality filters.
If enough people are interested in my solution I'll open-source it.

Cheers,
-Andrin

On Fri, Jan 6, 2012 at 4:28 AM, Vivek Puri v...@vivekpuri.com wrote:

 Even I have a table with 1.5TB of data. I need to truncate it but don't
 want to pay thousands to delete the data (I had paid thousands under the old
 pricing model for another table; not sure how much more it will cost
 now), while I pay hundreds for the data just to sit there. The App Engine
 team really needs to have a cheaper way to delete data.


 On Jan 5, 6:57 pm, Yohan yohan.lau...@gmail.com wrote:
  Hi,
 
  I feel your pain. It cost me a few thousand dollars to delete my
  millions of entities from the datastore after a migration job (Ikai never
  replied to my post though...) and I'm still paying, since the deletion is
  not complete yet (spending $100-300 a day for the past 2 weeks
  now!!). I'm not doing much, just running the delete-all mapreduce job
  from the admin panel.
 
  There is something totally wrong with the way datastore writes are
  priced, and Google should seriously do something about it before they
  lose their big customers (i.e. the ones affected by this problem).
 
  It is simply too costly to go through your data to change an index,
  update stuff, or delete your data. And in your case (like mine), even if
  you want to take your data out to externalize your custom search and
  storage, it will cost you $X,000+ to take it out and another $XX,000 to
  clean up behind you (you seem to have a lot of indexed properties in your
  dataset).
 
  Please keep me posted on how things go for you, as I'm still hoping I
  can get some credit/refund/assistance from Google at this stage,
  although I haven't heard from them.
 
  On Jan 6, 7:24 am, Corey [Firespotter] co...@firespotter.com
  wrote:

    I work with Petey on this and can help clarify some of the details.
 
    The Entities:
    We have a lot of entities (~14M), each of which has a
    StringListProperty called geoboxes, like so:

    class Place(search.SearchableModel):
        name = db.StringProperty()
        ...
        # Location specific fields.
        coordinates = db.GeoPtProperty(default=None)
        geohash = db.StringProperty()
        geoboxes = db.StringListProperty()
 
   Background (details on geoboxing at bottom):
   We're running a mapreduce to change the geobox sizes/precision for a
   large number of entities.  These entities currently have a 'geoboxes'
   StringListProperty with ~20 strings.  For example:
    geoboxes = [u'37.341|-121.894|37.339|-121.892',
                u'37.341|-121.892|37.339|-121.891', ...]
    We are changing those 20 strings to 20 new strings.  Example:
    geoboxes = [u'37.3411|-121.8940|37.3395|-121.8926',
                u'37.3411|-121.8929|37.3395|-121.8916', ...]
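
For reference, a geobox string like the ones above can be produced by snapping a point outward to a fixed grid. This is a hedged sketch of the geoboxing idea from the Google article, not Corey's actual code; the slice size and number formatting are assumptions:

```python
import math

def geobox(lat, lon, slice_deg):
    """Return a 'north|west|south|east' box of side slice_deg that
    contains (lat, lon), aligned to a slice_deg-degree grid."""
    north = math.ceil(lat / slice_deg) * slice_deg
    west = math.floor(lon / slice_deg) * slice_deg
    return '%g|%g|%g|%g' % (north, west, north - slice_deg, west + slice_deg)
```

Storing boxes at several sizes and offsets is presumably what produces the ~20 strings per entity, and changing those sizes is what forces the re-index discussed next.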
 
   The Cost:
   We did almost this same mapreduce when we first added the geoboxes
   back in July.  In that case we were populating the list for the first
   time so we can assume half as many operations were required (no
    removing of old values).  Total cost in July was ~$160 for the CPU
   time.
 
   When we ran the mapreduce again this week to change the box sizes the
   cost was $18 for Frontend Instance Hours, $15 for Datastore Reads
   (21mil) and $2,500 for Datastore Writes (2500mil).  This was not a
   complete run of the mapreduce.  We aborted it after 5.4mil (38%) of
   the entities were updated.  Hence Petey's estimate that the full
   update would cost $6,500.
 
   The Operations:
   Each entity update is removing ~20 existing strings from the geoboxes
   StringList and adding 20 more.  The geobox property is indexed (and
    has to be) and is involved in 3 composite indexes, so as best I
    understand it this means each string change results in 10 writes (4 +
    2 * 3).  So for every entity whose geoboxes we update, we perform 401
    write operations (1 + 10 * 40).
 
   This agrees pretty well with the charges (2,500,000,000 ops /
   5,424,000 entities) = 460 ops per entity.
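
That arithmetic can be double-checked with a tiny calculator. This is a sketch of the thread's own assumptions (4 built-in index writes plus 2 per composite index for each list-value change, plus 1 write for the entity itself), not official billing code:

```python
def writes_per_entity(value_changes, composite_indexes):
    """Estimated write ops for one entity put, under the assumptions above."""
    per_change = 4 + 2 * composite_indexes   # built-in + composite index rows
    return 1 + per_change * value_changes    # +1 for the entity write itself

# 20 strings removed + 20 added = 40 value changes, 3 composite indexes:
ops = writes_per_entity(40, 3)            # 401
cost = 5424000 * ops * 0.10 / 100000      # ~$2,175 for the 5.4M entities
```

That lands in the same ballpark as the $2,500 billed; the gap matches the 460-vs-401 ops-per-entity difference noted above.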
 
   That's a lot of 

Re: [google-appengine] Re: Cost of mapreduce was $6,500 to update a ListProperty on 14.1 million entities

2012-01-06 Thread Jeff Schnitzer
On Fri, Jan 6, 2012 at 7:47 AM, Richard Watson richard.wat...@gmail.com wrote:
 What if you had the GPS data as children of each entity and then used a
 keys-only query to match, and then fetched the parents? I forget the
 technique's name; maybe someone else remembers.  The benefit is that when
 you need to edit GPS coords you leave the parent alone. Data in the parent
 isn't duplicated and all changes only happen to the children. No parent data
 is re-indexed, so you reduce datastore charges on updates. I'm not 100% sure
 it'd help but it might be worth testing.

This shouldn't help.  Re-putting an entity won't cause index updates if
the indexed values don't change.  The relation index entity pattern
is only useful when you have a very large number of index items (many
thousands).  You wouldn't want to do it for 20 short strings.

Jeff

-- 
You received this message because you are subscribed to the Google Groups 
Google App Engine group.
To post to this group, send email to google-appengine@googlegroups.com.
To unsubscribe from this group, send email to 
google-appengine+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/google-appengine?hl=en.



Re: [google-appengine] Re: Cost of mapreduce was $6,500 to update a ListProperty on 14.1 million entities

2012-01-05 Thread Ikai Lan (Google)
Brian (apologies if that is not your name),

How much of the costs are instance hours versus datastore writes? There's
probably something going on here. The largest costs are to update indexes,
not entities. Assuming $6500 is the cost of datastore writes alone, that
breaks down to:

~$0.0004 a write

Pricing is $0.10 per 100k operations, so that means using this equation:

(6500.00 / 14,000,000) / (0.10 / 100,000) ≈ 464

You're doing about 464 write operations per put, which across ~14 million
entities roughly translates to 6.5 billion writes.
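
Restated as runnable arithmetic (assumed inputs: $6,500 total, 14 million entities, $0.10 per 100,000 write ops):

```python
# Rough check of the write-op arithmetic above (assumed inputs: $6,500
# total bill, 14 million entities, $0.10 per 100,000 write ops).
cost_per_put = 6500.00 / 14000000          # dollars per entity put
cost_per_op = 0.10 / 100000                # dollars per write op
ops_per_put = cost_per_put / cost_per_op   # ~464 write ops per put
total_ops = ops_per_put * 14000000         # ~6.5 billion writes
```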

I'm trying to extrapolate what you are doing, and it sounds like you are
doing full text indexing or something similar ... and having to update all
the indexes. When you update a property, it takes a certain amount of
writes. Assuming you are changing String properties, each property you
update takes this many writes:

- 2 index rows deleted (ascending and descending)
- 2 index rows updated (ascending and descending)

So if you were only updating the list properties, that means you are
updating on the order of 100 list property values per entity.

Given that this is a regular thing you need to do, perhaps there is an
engineering solution for what you are trying to do that will be more cost
effective. Can you describe why you're running this job? What features does
this support in your product?

--
Ikai Lan
Developer Programs Engineer, Google App Engine
plus.ikailan.com | twitter.com/ikai



On Thu, Jan 5, 2012 at 10:08 AM, Petey brianpeter...@gmail.com wrote:

 In this one case we had to change all of the items in the
 listproperty. In our most common case we might have to add and delete
 a couple of items in the list property every once in a while. That would
 still cost us well over $1,000 each time.

 Most of the reason for this type of data in our product is to
 compensate for the fact that there isn't full-text search yet. I know
 they are beta testing full text, but I'm still worried that it also
 might be too expensive per write.

 On Jan 5, 6:54 am, Richard Watson richard.wat...@gmail.com wrote:
  A couple of thoughts.
 
  Maybe the GAE team should borrow the idea of spot prices from Amazon.
  That's a great way to have lower-priority jobs that run when there are
  instances available. We set the price we're willing to pay; if the spot
  cost drops below that, we get the resources. It creates a market where more
  urgent jobs get done sooner and Google makes better use of quiet periods.
 
  On your issue:
  Do you need to update every entity when you do this? How many items in the
  listproperty need to be changed? Could you tell us a bit more about what the
  data looks like?
 
  I'm thinking that 14 million entities x 18 items each is the amount of
  entries you really have, each distributed across at least 3 servers and
  then indexed. That seems like a lot of writes if you're re-writing
  everything.  It's likely a bad idea to rely on an infrastructure change to
  fix this (recurring) issue, but there is hopefully a way to reduce the
  amount of writes you have to do.
 
  Also, could you maybe run your mapreduce on smaller sets of the data to
  spread it out over multiple days and avoid adding too many instances? Has
  anyone done anything like this?







Re: [google-appengine] Re: Cost of mapreduce was $6,500 to update a ListProperty on 14.1 million entities

2012-01-05 Thread Iván Rodríguez
I think your problem is similar to mine.

http://groups.google.com/group/google-appengine-java/browse_thread/thread/1ace5bd8658d89d/a62d0b3f2b3c4e74#a62d0b3f2b3c4e74

Ikai, could you please explain how many write ops we should expect when
updating an indexed list property by adding X items to the list?

For example

Modeling (Objectify annotations):

@Entity
class RelationIndex {
    @Parent
    Key<User> ownerKey;
    @Indexed
    List<Key<User>> receiverKeyList;
}

Define:

X = number of new items added to the list.
Y = number of entities to update (same entity group), 1 indexed list
property per entity.
Z = number of items in the list before the update.


Magic calculator

Total write ops = Y * 
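
For what it's worth, a hedged guess at the missing formula, based on the billing rule that each indexed property value written costs two write ops (one ascending and one descending index row) plus one op for the entity itself; composite indexes, which this ignores, would add more:

```python
def total_write_ops(Y, X):
    """Estimated write ops to add X items to an indexed list property
    on each of Y entities (no composite indexes assumed)."""
    per_entity = 1 + 2 * X   # 1 entity write + 2 index rows per new item
    return Y * per_entity
```

Z (the pre-existing items) shouldn't matter for pure additions, since unchanged index rows aren't rewritten.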







Re: [google-appengine] Re: Cost of mapreduce was $6,500 to update a ListProperty on 14.1 million entities

2012-01-05 Thread Amy Unruh
Iván,

2012/1/6 Iván Rodríguez ivan.rd...@gmail.com

 I think your problem is similar to mine.


 http://groups.google.com/group/google-appengine-java/browse_thread/thread/1ace5bd8658d89d/a62d0b3f2b3c4e74#a62d0b3f2b3c4e74

 Ikai, could you please explain how many write ops we should expect when
 updating an indexed list property by adding X items to the list?


This page can help you work out the costs for your particular entities and
indexes:

http://code.google.com/appengine/docs/billing.html#Billable_Resource_Unit_Cost
E.g., it details the costs for the different datastore operations given an
entity's properties and indexes.






Re: [google-appengine] Re: Cost of mapreduce was $6,500 to update a ListProperty on 14.1 million entities

2012-01-05 Thread Amy Unruh
On Fri, Jan 6, 2012 at 8:00 AM, Amy Unruh amyu+gro...@google.com wrote:

 Iván,

 2012/1/6 Iván Rodríguez ivan.rd...@gmail.com

 I think your problem is similar to mine.


 http://groups.google.com/group/google-appengine-java/browse_thread/thread/1ace5bd8658d89d/a62d0b3f2b3c4e74#a62d0b3f2b3c4e74

 Ikai, could you please explain how many write ops we should expect when
 updating an indexed list property by adding X items to the list?


 This page can help you work out the costs for your particular entities and
 indexes:

 http://code.google.com/appengine/docs/billing.html#Billable_Resource_Unit_Cost
 E.g., it details the costs for the different datastore operations given an
 entity's properties and indexes.


See this as well:
http://code.google.com/appengine/docs/python/datastore/entities.html#Understanding_Write_Costs,
which discusses multi-value properties.  These can lead to expensive
indexes.

 -Amy






