Christophe Combelles wrote:
Hello,

What should I do to have a data structure that scales well memory-wise?

Consider the following large btree:

$ ./debugzope

    >>> from BTrees.OOBTree import OOBTree
    >>> root['btree']=OOBTree()
    >>> for i in xrange(700000):
    ...   root['btree'][i] = tuple(range(i,i+30))
    ...
    >>> import transaction
    >>> transaction.commit()

Quit and restart ./debugzope

Now I just want to know if some value is in the btree:

    >>> 'value' in root['btree'].values()


OK, the story could be called "ZODB is great, but be careful what you do with persistence". There are three solutions to this problem: an ugly one, a workaround, and the correct one. I found the ugly one; thanks to Dennis and Chris for pointing out the workaround and the correct one.

The whole btree is loaded into memory, even when I do a simple loop such as:

    >>> for i in root['btree']:
    ...     pass

(It's the same with items(), iteritems(), values(), and itervalues().)


1) First, the *ugly* one: I abort the transaction every N iterations:

    >>> import transaction
    >>> a=0
    >>> for i in root['btree']:
    ...     a+=1
    ...     if not a % 5000:
    ...         transaction.abort()
    ...

That works, but it's definitely not the right thing to do: I suspect that by aborting the transaction in the middle of the read, someone else might be able to modify the btree before I've finished reading it. (ZODB experts, please confirm.)


2) Now a good *workaround* (which I will eventually use, because it's too late for me to change the data structure of my app, and it happens to be the fastest solution). It's almost the same, except that instead of aborting the transaction, we periodically minimize the cache of the ZODB connection:

    >>> a=0
    >>> for i in root['btree']:
    ...     a+=1
    ...     if not a % 5000:
    ...         root['btree']._p_jar.cacheMinimize()
    ...

This way, the maximum memory used corresponds to 5000 tuples.


3) The *correct* solution is to store real persistent objects in the btree
(i.e. objects that derive from persistent.Persistent).
That works, and eats practically no memory, but it's slower than tuples.
Non-persistent tuples still get persisted because they are part of a persistent object, but they are treated as an integral part of the btree's own records rather than as individual, separately loadable persistent objects.
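
For illustration, here is roughly what that looks like; the Entry class name is just an example, any class deriving from persistent.Persistent will do (and in real code it would live in an importable module of your application, not in the interpreter):

    >>> from persistent import Persistent
    >>> class Entry(Persistent):
    ...     def __init__(self, values):
    ...         self.values = values
    ...
    >>> from BTrees.OOBTree import OOBTree
    >>> root['btree'] = OOBTree()
    >>> for i in xrange(700000):
    ...     root['btree'][i] = Entry(tuple(range(i, i + 30)))
    ...
    >>> import transaction
    >>> transaction.commit()

Each Entry then has its own record in the database: iterating over the btree only loads the small bucket records plus whichever Entry objects you actually touch, and the connection cache can evict them individually.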

That's my understanding; however, it does not really explain why looping over non-persistent objects in a btree should necessarily load everything into memory.

And what about IIBTrees? (integers are not persistent by themselves)


Christophe



or compute the length:

    >>> len(root['btree'])

(I'm already using some separate lazy bookkeeping for the length, but even if len() is time-consuming for a btree, it should at least be possible from a memory point of view.)
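
For reference, one way to do that lazy bookkeeping is with BTrees.Length.Length, a small persistent counter with conflict resolution; the 'btree_length' key below is just an example name:

    >>> from BTrees.Length import Length
    >>> root['btree_length'] = Length(0)         # counter stored next to the btree
    >>> key = 700000
    >>> if key not in root['btree']:
    ...     root['btree'][key] = tuple(range(key, key + 30))
    ...     root['btree_length'].change(1)       # keep the counter in sync
    ...
    >>> root['btree_length']()                   # current count, no iteration needed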

This loads the whole btree into memory (~500MB), and that memory never gets released! If the btree grows (to >2GB), how will I be able to use it at all?

I've tried to scan the btree in slices with root['btree'].itervalues(min, max), doing transaction.abort()/commit()/savepoint()/whatever between the slices. But every slice I scan allocates yet more memory, and once the whole btree has been scanned slice by slice, it's as if the whole btree were in memory.
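
For concreteness, the kind of slice scan I tried looks roughly like this (the chunk size and the assumption that keys are consecutive integers are specific to this example):

    >>> import transaction
    >>> tree = root['btree']
    >>> chunk = 5000
    >>> for start in xrange(0, 700000, chunk):
    ...     for value in tree.itervalues(min=start, max=start + chunk - 1):
    ...         pass                             # process one slice
    ...     transaction.abort()                  # also tried commit()/savepoint() here
    ...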

I've also tried with lists; the result is the same, except the memory gets eaten even more quickly.

What I understand is that the ZODB wakes up everything, and the memory allocator of Python (2.4) never releases the memory. Is there a solution, or something I missed in the API of the ZODB, BTrees, or Python itself?

thanks,
Christophe



