On Tue, Apr 1, 2014 at 2:23 PM, Heikki Linnakangas <hlinnakan...@vmware.com>wrote:
> The BIRCH algorithm as described in the paper describes building a tree in > memory. If I understood correctly, you're suggesting to use a pre-built > GiST index instead. Interesting idea! > > There are a couple of signifcant differences between the CF tree described > in the paper and GiST: > > 1. In GiST, a leaf item always represents one heap tuple. In the CF tree, > a leaf item represents a cluster, which consists of one or more tuples. So > the CF tree doesn't store an entry for every input tuple, which makes it > possible to keep it in memory. > > 2. In the CF tree, "all entries in a leaf node must satisfy a threshold > requirement, with respect to a threshold value T: the diameter (or radius) > has to be less than T". GiST imposes no such restrictions. An item can > legally be placed anywhere in the tree; placing it badly will just lead to > degraded search performance, but it's still a legal GiST tree. > > 3. A GiST index, like any other index in PostgreSQL, holds entries also > for deleted tuples, until the index is vacuumed. So you cannot just use > information from a non-leaf node and use it in the result, as the > information summarized at a non-leaf level includes noise from the dead > tuples. > > Can you elaborate how you are planning to use a GiST index to implement > BIRCH? You might also want to take a look at SP-GiST; SP-GiST is more > strict in where in the tree an item can be stored, and lets the operator > class to specify exactly when a node is split etc. > Hmmm, it's likely I've imagined something quite outside of this paper, and even already suggested it to Ivan... :) I need a little time to rethink it. ------ With best regards, Alexander Korotkov.