I find Google's posted solution quite suboptimal, as it is too expensive:
(http://code.google.com/apis/maps/articles/geospatial.html)

There is a simple trick to get around this problem. Instead of indexing
all geocells in a StringListProperty, you index only the most detailed
cell: instead of [7, 7e, 7e3, 7e3a, 7e3a4] you index only "7e3a4",
converted to an int64. To search, you do range scans. Finding all
items in the cell 7e3 is a range scan like
"geohash >= 7e3 and geohash < 7e4"

I have a Python library that does all this, plus some more performance
tricks, like merging two adjacent cells into a single range scan,
etc. I found that my solution performs a tiny bit better and is
much cheaper, because I don't need a StringListProperty in my index,
just a simple IntegerProperty. Of course, my solution has one major
drawback: you cannot do additional inequality searches, since my
range scans already use the inequality (though you can still do bucketing to
solve this issue), and of course you can do additional equality filters.
If enough people are interested in my solution, I'll open-source it.
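
The trick above can be sketched in a few lines. This is my own minimal illustration, not the library itself: it assumes geocells use the 16-character alphabet 0-9a-f (as in Google's geomodel library) and a fixed maximum resolution, and the function names are hypothetical.

```python
MAX_RESOLUTION = 13  # 13 hex digits (52 bits) fit comfortably in a signed int64

def cell_to_int64(cell):
    """Encode the most detailed geocell (e.g. '7e3a4') as an int64 by
    left-aligning its hex digits at MAX_RESOLUTION."""
    pad = MAX_RESOLUTION - len(cell)
    return int(cell, 16) << (4 * pad)

def cell_range(prefix):
    """Return the half-open [lo, hi) int64 range covering every geocell
    whose name starts with `prefix` (e.g. '7e3')."""
    lo = cell_to_int64(prefix)
    hi = lo + (1 << (4 * (MAX_RESOLUTION - len(prefix))))
    return lo, hi

# Finding all items in cell 7e3 then becomes a single range scan, e.g.
# Place.all().filter('geohash >=', lo).filter('geohash <', hi)
lo, hi = cell_range('7e3')
assert lo <= cell_to_int64('7e3a4') < hi   # 7e3a4 lies inside 7e3
assert cell_to_int64('7e4') == hi          # 7e4 starts the next range
```

Because every descendant cell encodes to a value inside its ancestor's range, one IntegerProperty replaces the whole list of prefix cells.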

Cheers,
-Andrin

On Fri, Jan 6, 2012 at 4:28 AM, Vivek Puri <v...@vivekpuri.com> wrote:

> I too have a table with 1.5TB of data. I need to truncate it, but I don't
> want to pay thousands of dollars to delete the data (I already paid
> thousands under the old pricing model to delete another table; I'm not
> sure how much more it would cost now), while I pay hundreds for the data
> just to sit there. The App Engine team really needs to offer a cheaper
> way to delete data.
>
>
> On Jan 5, 6:57 pm, Yohan <yohan.lau...@gmail.com> wrote:
> > Hi,
> >
> > I feel your pain. It cost me a few thousand dollars to delete my
> > millions of entities from the datastore after a migration job (Ikai
> > never replied to my post, though...), and I'm still paying since the
> > deletion is not complete yet (I've been spending $100-300 a day for
> > the past two weeks now!). I'm not doing anything special, just running
> > the "delete all" mapreduce job from the admin panel.
> >
> > There is something seriously wrong with the way datastore writes are
> > priced, and Google should do something about it before they lose
> > their big customers (i.e. the ones affected by this problem).
> >
> > It is simply too costly to go through your data to change an index,
> > update fields, or delete your data. And in your case (like mine), even
> > if you want to take your data out to externalize your custom search
> > and storage, it will cost you $X,000+ to take it out and another
> > $XX,000 to clean up behind you (you seem to have a lot of indexed
> > properties in your dataset).
> >
> > Please keep me posted on how things go for you, as I'm still hoping I
> > can get some credit/refund/assistance from Google at this stage,
> > although I haven't heard from them.
> >
> > On Jan 6, 7:24 am, "Corey [Firespotter]" <co...@firespotter.com>
> > wrote:
> >
> > > I work with Petey on this and can help clarify some of the details.
> >
> > > The Entities:
> > > We have a lot of entities (~14 million), each of which has a
> > > StringListProperty called "geoboxes".  Like so:
> > >     class Place(search.SearchableModel):
> > >       name = db.StringProperty()
> > >       ...
> > >       # Location specific fields.
> > >       coordinates = db.GeoPtProperty(default=None)
> > >       geohash = db.StringProperty()
> > >       geoboxes = db.StringListProperty()
> >
> > > Background (details on geoboxing at bottom):
> > > We're running a mapreduce to change the geobox sizes/precision for a
> > > large number of entities.  These entities currently have a 'geoboxes'
> > > StringListProperty with ~20 strings.  For example:
> > > geoboxes = [u'37.341|-121.894|37.339|-121.892', u'37.341|-121.892|
> > > 37.339|-121.891', ...]
> > > We are changing those 20 strings to 20 new strings.  Example:
> > > geoboxes = [u'37.3411|-121.8940|37.3395|-121.8926',
> > > u'37.3411|-121.8929|37.3395|-121.8916', ...]
> >
> > > The Cost:
> > > We did almost this same mapreduce when we first added the geoboxes
> > > back in July.  In that case we were populating the list for the first
> > > time so we can assume half as many operations were required (no
> > > removing of old values).  Total cost in July was ~$160 for the CPU
> > > time.
> >
> > > When we ran the mapreduce again this week to change the box sizes the
> > > cost was $18 for Frontend Instance Hours, $15 for Datastore Reads
> > > (21mil) and $2,500 for Datastore Writes (2500mil).  This was not a
> > > complete run of the mapreduce.  We aborted it after 5.4mil (38%) of
> > > the entities were updated.  Hence Petey's estimate that the full
> > > update would cost $6,500.
> >
> > > The Operations:
> > > Each entity update removes ~20 existing strings from the geoboxes
> > > StringList and adds 20 more.  The geoboxes property is indexed (and
> > > has to be) and is involved in 3 composite indexes, so as best I
> > > understand it, each string change results in 10 writes (4 + 2 * 3).
> > > So for every entity whose geoboxes we update, we perform 401 write
> > > operations (1 + 10 * 40).
> >
> > > This agrees pretty well with the charges: 2,500,000,000 ops /
> > > 5,424,000 entities = ~460 ops per entity.
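
The write-count arithmetic above can be checked in a few lines (a sketch only; the per-change write counts and billed totals are taken from the figures quoted in this thread):

```python
# Per-string-change writes: 2 built-in index entries (asc/desc) deleted +
# 2 written = 4, plus 2 entries per composite index across 3 composite indexes.
writes_per_string_change = 4 + 2 * 3
string_changes = 20 + 20   # remove 20 old geobox strings, add 20 new ones
writes_per_entity = 1 + writes_per_string_change * string_changes  # +1 for the entity itself
assert writes_per_entity == 401

# Observed billing: ~2.5 billion write ops across 5.424 million entities,
# at $0.10 per 100k operations.
observed_per_entity = 2500000000.0 / 5424000   # ~461, close to the computed 401
full_run_cost = 14000000 * observed_per_entity * (0.10 / 100000)
print(int(round(full_run_cost)))               # ~$6,450, matching the ~$6,500 estimate
```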
> >
> > > That's a lot of writes and likely the core of the surprising cost.
> > > However, I'm not sure how we could avoid that with App Engine (open to
> > > ideas!), and since we could pay for dedicated servers for that amount,
> > > I think the pricing is probably off as well.
> >
> > > Even if we treat the geobox update as a one-time cost, we have other
> > > properties like scores, labels, etc that require occasional tweaking.
> > > Updating even a single indexed property across all these entities
> > > costs us $60-$100 and typically many times that in practice because
> > > these interesting fields tend to be used in composite indexes.
> >
> > > -Corey
> >
> > > Geoboxing Details
> > > Geoboxing is a technique used to search for entities near a point on
> > > the earth in a database that can only perform equality queries (like
> > > App Engine).  In short, you break up the world into boxes and record
> > > which box each entity belongs to as well as any nearby boxes.  Then
> > > you break up the world into larger boxes and repeat until you have a
> > > good range of sizes covered.
> > > There's a good article on the logic of the algorithm here:
> > > http://code.google.com/appengine/articles/geosearch.html
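
As a rough illustration of the scheme, here is a minimal sketch of computing the geobox containing a point, in the north|west|south|east string format used in the examples above. The function name, grid alignment, and integer-snapping are my own assumptions, not code from the geosearch article:

```python
import math

def compute_geobox(lat, lng, slice_deg=0.002, precision=3):
    """Return the geobox string (north|west|south|east) for the box of
    `slice_deg` degrees containing (lat, lng), formatted to `precision`
    decimal places.  Snapping is done in scaled integers to avoid
    floating-point rounding surprises."""
    scale = 10 ** precision
    step = int(round(slice_deg * scale))            # e.g. 2 "milli-degrees"
    south = (int(math.floor(lat * scale)) // step) * step
    west = (int(math.floor(lng * scale)) // step) * step
    north, east = south + step, west + step
    fmt = '%.' + str(precision) + 'f'
    return '|'.join(fmt % (v / float(scale)) for v in (north, west, south, east))

print(compute_geobox(37.3400, -121.8935))  # '37.342|-121.894|37.340|-121.892'
```

To cover a range of search radii, you would repeat this for several slice sizes and store all of the resulting strings in the geoboxes list, which is exactly what makes updates so write-heavy.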
> >
> > > On Jan 5, 11:58 am, "Ikai Lan (Google)" <ika...@google.com> wrote:
> >
> > > > Brian (apologies if that is not your name),
> >
> > > > How much of the costs are instance hours versus datastore writes?
> > > > There's probably something going on here. The largest costs are to
> > > > update indexes, not entities. Assuming $6,500 is the cost of
> > > > datastore writes alone, that breaks down to:
> >
> > > > ~$0.0004 a write
> >
> > > > Pricing is $0.10 per 100k operations, so that means using this
> > > > equation:
> >
> > > > (6500.00 / 14000000) / (0.10 / 100000)
> >
> > > > You're doing about 464 write operations per put, which roughly
> > > > translates to 6.5 billion writes.
> >
> > > > I'm trying to extrapolate what you are doing, and it sounds like you
> > > > are doing full-text indexing or something similar ... and having to
> > > > update all the indexes. When you update a property, it takes a
> > > > certain number of writes. Assuming you are changing String
> > > > properties, each property you update takes this many writes:
> >
> > > > - 2 index entries deleted (ascending and descending)
> > > > - 2 index entries updated (ascending and descending)
> >
> > > > So if you were only updating all the list properties, that means you
> > > > are updating 100 list properties.
> >
> > > > Given that this is a regular thing you need to do, perhaps there is
> > > > an engineering solution for what you are trying to do that will be
> > > > more cost effective. Can you describe why you're running this job?
> > > > What features does this support in your product?
> >
> > > > --
> > > > Ikai Lan
> > > > Developer Programs Engineer, Google App Engine
> > > > plus.ikailan.com | twitter.com/ikai
> >
> > > > On Thu, Jan 5, 2012 at 10:08 AM, Petey <brianpeter...@gmail.com> wrote:
> > > > > In this one case we had to change all of the items in the
> > > > > listproperty. In our most common case we might have to add and
> > > > > delete a couple of items in the list property every once in a
> > > > > while. That would still cost us well over $1,000 each time.
> >
> > > > > Most of the reason for this type of data in our product is to
> > > > > compensate for the fact that there isn't full-text search yet. I
> > > > > know they are beta testing full text, but I'm still worried that
> > > > > it also might be too expensive per write.
> >
> > > > > On Jan 5, 6:54 am, Richard Watson <richard.wat...@gmail.com> wrote:
> > > > > > A couple of thoughts.
> >
> > > > > > Maybe the GAE team should borrow the idea of spot prices from
> > > > > > Amazon. That's a great way to run lower-priority jobs when there
> > > > > > are instances available. We set the price we're willing to pay;
> > > > > > if the spot cost drops below that, we get the resources. It
> > > > > > creates a market where more urgent jobs get done sooner and
> > > > > > Google makes better use of quiet periods.
> >
> > > > > > On your issue:
> > > > > > Do you need to update every entity when you do this? How many
> > > > > > items on the listproperty need to be changed? Could you tell us
> > > > > > a bit more about what the data looks like?
> >
> > > > > > I'm thinking that 14 million entities x 18 items each is the
> > > > > > number of entries you really have, each distributed across at
> > > > > > least 3 servers and then indexed. That seems like a lot of
> > > > > > writes if you're re-writing everything. It's likely a bad idea
> > > > > > to rely on an infrastructure change to fix this (recurring)
> > > > > > issue, but there is hopefully a way to reduce the number of
> > > > > > writes you have to do.
> >
> > > > > > Also, could you maybe run your mapreduce on smaller sets of the
> > > > > > data to spread it out over multiple days and avoid adding too
> > > > > > many instances? Has anyone done anything like this?
> >
> > > > > --
> > > > > You received this message because you are subscribed to the Google
> > > > > Groups "Google App Engine" group.
> > > > > To post to this group, send email to
> > > > > google-appengine@googlegroups.com.
> > > > > To unsubscribe from this group, send email to
> > > > > google-appengine+unsubscr...@googlegroups.com.
> > > > > For more options, visit this group at
> > > > > http://groups.google.com/group/google-appengine?hl=en.

