Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very

Craig A. James Thu, 26 Jan 2006 18:34:45 -0800

Ron <[EMAIL PROTECTED]> writes:

We have two problems here.
The first is that the page splitting code for these indexes currently has O(N^2) performance. The second is that whatever solution we do use for this functionality, we still need good performance during searches that use the index.


No, unfortunately that's not the problem that needs to be solved.

The problem is figuring out WHICH records to put in the "left" and "right" 
trees once you split them.  If you can figure that out, then your suggestion (and perhaps other 
techniques) could be useful.

The problem boils down to this:  You have a whole bunch of essentially random 
bitmaps.  You have two buckets.  You want to put half of the bitmaps in one 
bucket, and half in the other bucket, and when you get through, you want all of 
the bitmaps in each bucket to be maximally similar to each other, and maximally 
dissimilar to the ones in the other bucket.

That way, when you OR all the bitmaps in each bucket together to build the 
bitmap for the left and right child nodes of the tree, you'll get maximum 
separation -- the chances that you'll have to descend BOTH the left and right 
nodes of the tree are minimized.
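
To make the effect of the split concrete, here is a minimal sketch. It assumes bitmaps are stored as plain Python integers and that a search must descend into a child whenever every bit of the query is set in that child's OR'ed bitmap. The names node_signature and children_to_descend are hypothetical, chosen for illustration; this is not the actual index code.

```python
from functools import reduce

def node_signature(bitmaps):
    """OR all bitmaps in a bucket together to form the child node's bitmap."""
    return reduce(lambda acc, b: acc | b, bitmaps, 0)

def children_to_descend(buckets, query):
    """Count the child nodes whose OR'ed bitmap contains every query bit."""
    return sum(1 for bucket in buckets
               if (node_signature(bucket) & query) == query)

# Same four bitmaps, split two different ways.
good_split = [[0b0011, 0b0001], [0b1100, 0b1000]]  # similar bitmaps together
bad_split = [[0b0011, 0b1100], [0b0001, 0b1000]]   # dissimilar bitmaps mixed

query = 0b0001
print(children_to_descend(good_split, query))  # only the left child matches
print(children_to_descend(bad_split, query))   # both children match
```

With the good split the search visits one child; with the bad split the OR'ed bitmaps overlap and the search has to visit both.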

Unfortunately, this problem is very likely in the set of NP-complete problems, i.e. like the famous 
"Traveling Salesman Problem," no algorithm is known that will give the exact answer in a 
reasonable time.  In this case, "reasonable" would be measured in milliseconds to seconds, but in 
fact an actual "perfect" split of a set of bitmaps probably can't be computed in the lifetime of 
the universe for more than a few hundred bitmaps.
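
As a rough illustration of the scale, the sketch below counts how many distinct half-and-half splits a brute-force search would have to examine. It only measures the size of the naive search space (an assumption about how a "perfect" split would be found), and the function name balanced_split_count is hypothetical.

```python
import math

def balanced_split_count(n):
    """Number of ways to divide n items (n even) into two unordered halves."""
    if n % 2 != 0:
        raise ValueError("n must be even for an exact half-and-half split")
    # Choose which n/2 items go left; divide by 2 because swapping
    # the left and right buckets gives the same split.
    return math.comb(n, n // 2) // 2

for n in (4, 20, 200):
    print(n, balanced_split_count(n))
```

For 200 bitmaps the count already exceeds 10^58, so checking every candidate split is out of the question.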

That's the problem that's being discussed: How do you decide which bitmaps go in each of the two buckets? Any solution will necessarily be a pragmatic algorithm that gives an imperfect, but acceptable, answer.
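
One example of such a pragmatic algorithm is sketched below. It is purely illustrative and is not what the existing index code does: it picks two far-apart "seed" bitmaps with two linear scans, then ranks every bitmap by which seed it is closer to and cuts the ranking in half. Hamming distance is an assumed choice of metric, and heuristic_split and farthest_from are hypothetical names.

```python
def hamming(a, b):
    """Number of bit positions in which two bitmaps differ."""
    return bin(a ^ b).count("1")

def farthest_from(bitmaps, ref):
    """Return the bitmap with the greatest Hamming distance from ref."""
    return max(bitmaps, key=lambda b: hamming(b, ref))

def heuristic_split(bitmaps):
    """Split bitmaps into two equal-sized buckets around two distant seeds."""
    # Two linear scans find a pair of seeds that are far apart,
    # avoiding an O(N^2) comparison of all pairs.
    seed_a = farthest_from(bitmaps, bitmaps[0])
    seed_b = farthest_from(bitmaps, seed_a)
    # Negative key: closer to seed_a. Positive key: closer to seed_b.
    ranked = sorted(bitmaps,
                    key=lambda b: hamming(b, seed_a) - hamming(b, seed_b))
    half = len(ranked) // 2
    return ranked[:half], ranked[half:]

left, right = heuristic_split([0b00000011, 0b11000000, 0b00000111, 0b11100000])
print([bin(b) for b in left], [bin(b) for b in right])
```

The result is not guaranteed to be the best split, only a cheap one: the cost is dominated by the sort, O(N log N), rather than O(N^2).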

As I mentioned earlier, chemists make extensive use of bitmaps to categorize 
and group molecules.  They use Tanimoto or Tversky similarity metrics (Tanimoto 
is a special case of Tversky), because it's extremely fast to compare two 
bitmaps, and the score is highly correlated with the number of bits the two 
bitmaps have in common.
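
For reference, here is a minimal sketch of both metrics using their standard definitions, with bitmaps held as Python integers. Returning 1.0 when both bitmaps are empty is an assumed convention, not something specified above.

```python
def popcount(x):
    """Number of set bits in a non-negative integer bitmap."""
    return bin(x).count("1")

def tanimoto(a, b):
    """Bits in common divided by bits set in either bitmap."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 1.0

def tversky(a, b, alpha, beta):
    """Tversky index: alpha weights bits only in a, beta bits only in b."""
    common = popcount(a & b)
    only_a = popcount(a & ~b)
    only_b = popcount(b & ~a)
    denom = common + alpha * only_a + beta * only_b
    return common / denom if denom else 1.0

a, b = 0b1110, 0b0111
print(tanimoto(a, b))       # 2 common bits out of 4 set in either: 0.5
print(tversky(a, b, 1, 1))  # alpha = beta = 1 reduces to Tanimoto: 0.5
```

Each comparison is a handful of AND/OR operations plus bit counts, which is why these metrics are so cheap to evaluate.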

But even with a fast "distance" metric like Tanimoto, there's still no easy way 
to decide which bucket to put each bitmap into.

Craig

