Hi All,

we had a private conversation with Emmanuel on alternatives for implementing a transactional system, and here is a summary of it. Feel free to comment and let us know what you think:
******

We basically discussed two ways to implement transactions with MVCC consistency guarantees for the readers (I guess we all agree on MVCC):

* Keep previous versions of entries and index entries in the partition.

This would require us to make each partition version aware. Note that the versioning is at the logical level. Let's say we have two versions of an entry, where entry1 satisfies reads between versions 5-8 and entry2 satisfies versions 8-infinity, and we have a reader at version 7 doing a master table lookup for this entry. Then we either need to fetch all versions of the entry with a single lookup and find the one that matches our version, or follow a backward chain from the most recent entry to find the version we want. The second option is what Postgres does.

Also note that since the versioning is on a logical basis, cleaning up data by reusing entries or index values is not so easy, so we would probably need a garbage collector. Getting the performance of the garbage collector right could be a problem. In order to avoid a garbage collector and the high cost of finding previous versions of entries and index values, we could keep the previous versions in memory as long as a reader needs them.

My take on this is that doing versioning at the physical page level with hierarchical data structures like a BTree works well: as long as you keep the pointer to the old version of the root, you can find the previous versions of the pages easily, and you can reuse the pages when they are no longer needed (or compact the file after a while). But with versioning at the logical level, you probably need a garbage collector, and finding the version needed is costly if you do not keep previous versions in memory.

* Use a write ahead log for MVCC.

This has the overhead of managing the write ahead log. However, MVCC consistency and transaction recovery are not the only things a write ahead log could be needed for. If we do not have a write ahead log, then a separate logging solution for other problems is needed.
For example, at ApacheDS we have:
- change log
- journal
- consumer log for replication

If we have a write ahead log that is aware of transactions across partitions, then it can be useful for the above problems, and you can also:
- make replication transaction aware. This is especially useful if transactions are not just single LDAP modification requests but stored procedures or triggers.
- roll back the whole server to a consistent point in time using the log.

****

We also talked about having an MVCC backend. If this way is chosen, then we will probably have txns per partition. This is what the LDAP servers that I am aware of do. However, with this, it would be difficult to have cross partition modifications, and we would need to do additional work for the existing JDBM partition and the upcoming HBase partition. Also, we would still need logical logging to implement some of the things mentioned above.

Please feel free to comment.

Regards,
Selcuk
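
P.S. To make the backward chain lookup from the first option a bit more concrete, here is a rough sketch in Java. It is not ApacheDS code: VersionChain, put, read and prune are names made up for illustration. It also assumes that a version is visible from its start version (inclusive) until the next version starts, so a reader at version 7 sees entry1 and a reader at version 8 sees entry2. The prune method hints at the garbage collection work that logical versioning would need.

```java
import java.util.Optional;

/** Hypothetical sketch: a per-entry chain of logical versions, newest first. */
final class VersionChain<T> {

    /** One logical version of an entry, visible from startVersion onward. */
    private static final class Node<T> {
        final long startVersion; // first version at which this value is visible
        final T value;
        Node<T> previous;        // next older version, or null at the end of the chain

        Node(long startVersion, T value, Node<T> previous) {
            this.startVersion = startVersion;
            this.value = value;
            this.previous = previous;
        }
    }

    private Node<T> head; // most recent version, or null if the entry has no versions

    /** Adds a new most recent version; versions must be strictly increasing. */
    synchronized void put(long version, T value) {
        if (head != null && version <= head.startVersion) {
            throw new IllegalArgumentException("version must be greater than " + head.startVersion);
        }
        head = new Node<>(version, value, head);
    }

    /** Walks backward from the newest version to the first one the reader may see. */
    synchronized Optional<T> read(long readerVersion) {
        for (Node<T> n = head; n != null; n = n.previous) {
            if (n.startVersion <= readerVersion) {
                return Optional.of(n.value);
            }
        }
        return Optional.empty(); // the entry did not exist yet at readerVersion
    }

    /**
     * Drops every version that no reader at or after oldestReaderVersion can see.
     * Returns the number of versions removed.
     */
    synchronized int prune(long oldestReaderVersion) {
        Node<T> n = head;
        while (n != null && n.startVersion > oldestReaderVersion) {
            n = n.previous;
        }
        if (n == null) {
            return 0; // the oldest reader sees nothing, so nothing is older than its view
        }
        // n is the version the oldest reader sees; everything behind it is garbage.
        int removed = 0;
        for (Node<T> old = n.previous; old != null; old = old.previous) {
            removed++;
        }
        n.previous = null;
        return removed;
    }
}
```

With put(5, "entry1") and put(8, "entry2"), read(7) has to step over entry2 before it reaches entry1. That cost grows with the length of the chain, which is why keeping the chain short (or in memory) matters.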