Donovan,

Your description of what you're doing and why isn't very complete.

Let me re-state your points to see if I understand you correctly:

1.  About 3,000 new documents are processed each day.

2.  When a new document comes in, the task queue code you posted runs
against that new document.  On average, a new document is associated with
3,500 index keys.

What does gen_keys() do exactly?  How does it generate its list of db.Keys?

How big is the average indexes entity?

My guess is that pulling the full document-id arrays for all the associated
index entities out of the datastore, just to append a single id to the end
of each one, is wasting resources.
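To illustrate what I mean (a hypothetical, self-contained sketch, not your actual App Engine code -- `append_doc_id` is a made-up helper): each read-modify-write of an index entity deserializes and re-serializes the whole array just to add one id, so the per-entity cost grows with the number of ids already stored, not with the one id being added.

```python
from array import array

def append_doc_id(serialized, doc_id):
    """Deserialize, append one id, re-serialize -- the read-modify-write
    pattern the task queue performs once per index key."""
    v = array('I')
    v.frombytes(serialized)   # full read of the existing array
    if doc_id not in v:       # O(n) membership test
        v.append(doc_id)
    return v.tobytes()        # full write of the grown array

# An index entry that already holds 10,000 document ids:
existing = array('I', range(10000)).tobytes()
updated = append_doc_id(existing, 10000)
print(len(existing), len(updated))  # whole payload round-trips for one id
```

Multiply that round trip by ~3,500 index keys per document and 3,000 documents per day, and the datastore CPU numbers you're seeing start to make sense.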

But it's hard to tell without seeing more of the actual code, and without a
clearer understanding of why you're doing what you're doing.

Thanks in advance for any additional info.

On Thu, Jan 6, 2011 at 3:22 PM, Donovan Hide <donovanh...@gmail.com> wrote:

> oops, task queue code should be:
>
> keys = gen_keys(document)  # Builds a list of db.Key instances based
>                            # on the document
> indexes = db.get(keys)
> upserts = []
> for i, key in enumerate(indexes):
>     if indexes[i] is None:
>         upserts.append(I(key=keys[i], v=array('I', [document_id])))
>     elif document_id not in indexes[i].v:
>         indexes[i].v.append(document_id)
>         upserts.append(indexes[i])
> db.put(upserts)
>
> On 6 January 2011 19:36, Donovan <donovanh...@gmail.com> wrote:
> > Hi,
> >
> > I'm using a very simple model to store arrays of document ids for an
> > inverted index based on 3 million documents.
> >
> > class I(db.Model):
> >    v=ArrayProperty(typecode="I",required=True)
> >
> > which uses:
> >
> >
> http://appengine-cookbook.appspot.com/recipe/store-arrays-of-numeric-values-efficiently-in-the-datastore/
> >
> > I have a simple task queue that includes the following piece of logic
> > which loops 3,000 times a day, for new incoming documents which
> > generate on average 3,500 keys each, to update the index:
> >
> > keys = gen_keys(document)  # Builds a list of db.Key instances based
> >                            # on the document
> > indexes = db.get(keys)
> > upserts = []
> > for i, key in enumerate(indexes):
> >     if indexes[i] is None:
> >         upserts.append(I(key=keys[i], v=array('I', [document_id])))
> >     elif news_article_id not in indexes[i].v:
> >         indexes[i].v.append(document_id)
> >         upserts.append(indexes[i])
> > db.put(upserts)
> >
> > This loop leads to datastore CPU usage of 48 hours per 1000 documents
> > which means a daily spend of $16.80 just for the datastore updates,
> > which seems quite expensive given how something like Kyoto Cabinet
> > running on conventional hosting could easily deal with this load. Does
> > anyone have any ideas for minimizing the datastore CPU usage? My hunch
> > is that the datastore CPU usage is a bit overpriced :(
> >
> > Cheers,
> > Donovan.
>

-- 
You received this message because you are subscribed to the Google Groups 
"Google App Engine" group.
To post to this group, send email to google-appengine@googlegroups.com.
To unsubscribe from this group, send email to 
google-appengine+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/google-appengine?hl=en.
