I've followed this thread with interest since I have a Zope site with
tens of millions of entries in BTrees. It scales well, but it requires
many tricks to make it work.

Roche Compaan wrote these great pieces on ZODB, Data.fs size and
scalability at 
http://www.upfrontsystems.co.za/Members/roche/where-im-calling-from/catalog-indexes
and 
http://www.upfrontsystems.co.za/Members/roche/where-im-calling-from/fat-doesnt-matter
.

My own in-house product is similar to Google Analytics. I have to use a
cascading BTree structure (a BTree of BTrees of BTrees) to handle the
volume, because BTrees do slow down the more items they contain. This
is not a ZODB limitation or flaw - it is just how they work.
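
Roughly, the cascade looks like this - a minimal sketch only; the class
and key names are illustrative, not my actual code:

    from persistent import Persistent
    from BTrees.OOBTree import OOBTree

    class HitStore(Persistent):
        # Cascade: year -> day -> hour, so no single BTree ever has to
        # hold tens of millions of keys on its own.
        def __init__(self):
            self.years = OOBTree()

        def hour_bucket(self, year, day, hour):
            days = self.years.setdefault(year, OOBTree())
            hours = days.setdefault(day, OOBTree())
            return hours.setdefault(hour, OOBTree())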

My structure allows for fast inserts, but it also allows aggregation
of data. So if my lowest level of BTrees stores hits for a particular
hour, then the containing BTree always knows exactly how many hits
were made in that day. I update all parent BTrees as soon as an item
is inserted, and the cost of this operation is O(1) for every parent.
These are all details, but every single one influenced my design.
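
The parent update itself is the cheap part. Here is a hedged sketch of
that piece using BTrees.Length counters (again, the names are made up
for illustration):

    from persistent import Persistent
    from BTrees.OOBTree import OOBTree
    from BTrees.Length import Length

    class Bucket(Persistent):
        # One level of the cascade: children plus a running hit count.
        def __init__(self):
            self.children = OOBTree()
            self.count = Length()

        def child(self, key):
            try:
                return self.children[key]
            except KeyError:
                node = Bucket()
                self.children[key] = node
                return node

    def record_hit(root, year, day, hour, hit_id, hit):
        # Walk year -> day -> hour; bumping each parent's counter is O(1).
        node = root
        for key in (year, day, hour):
            node.count.change(1)
            node = node.child(key)
        node.count.change(1)           # the hour bucket itself
        node.children[hit_id] = hit

This way the day bucket's count is always the number of hits for that
day without walking its children, and Length counters have the added
benefit that concurrent increments do not conflict the way a plain
integer attribute would.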

What is important is that you cannot just use the ZCatalog to index
tens of millions of items, since every index is a single BTree and
will thus slow down the larger it gets. So you must roll your own to
fit your problem domain.
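
For example, one way to roll your own is to partition an index by some
natural key from your problem domain, so each partition stays small. A
rough sketch, assuming per-day partitions (all the names here are
hypothetical):

    from persistent import Persistent
    from BTrees.OOBTree import OOBTree
    from BTrees.IIBTree import IITreeSet

    class PartitionedIndex(Persistent):
        # One small term -> docids BTree per day instead of one huge one.
        def __init__(self):
            self.partitions = OOBTree()   # day -> OOBTree(term -> IITreeSet)

        def index_doc(self, day, term, docid):
            part = self.partitions.setdefault(day, OOBTree())
            part.setdefault(term, IITreeSet()).insert(docid)

        def search(self, day, term):
            part = self.partitions.get(day)
            if part is None:
                return IITreeSet()
            return part.get(term, IITreeSet())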

Data warehousing is probably a good idea as well.

My problem domain allows me to defer inserts, so I have a queuerunner
that commits larger transactions in batches. This is better than lots
of small writes, though it may of course not fit your model.
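
The queuerunner itself is nothing fancy. A minimal sketch of the
batching, where the queue and the store call are stand-ins for whatever
your application actually does:

    import transaction

    def flush(queue, store, batch_size=1000):
        # Drain pending hits and commit one transaction per batch,
        # rather than one transaction per hit.
        batch = []
        for hit in queue:
            batch.append(hit)
            if len(batch) >= batch_size:
                _write(batch, store)
                batch = []
        if batch:
            _write(batch, store)

    def _write(batch, store):
        for hit in batch:
            store.record(hit)      # however you store a single hit
        transaction.commit()       # one commit covers the whole batch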

Familiarize yourself with TreeSets and the set operations that ship
with BTrees (union, intersection, etc.), since those tools form the
backbone of cataloguing.
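
For instance (a toy example with made-up docids):

    from BTrees.IIBTree import IITreeSet, intersection, union

    monday = IITreeSet([1, 2, 3, 5, 8])
    tuesday = IITreeSet([2, 3, 13])

    either_day = union(monday, tuesday)        # docids seen on either day
    both_days = intersection(monday, tuesday)  # docids seen on both days
    print(list(either_day), list(both_days))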

Hedley