First, I forgot to mention that the benchit.cc code contains a hack
(a 0x400000 flag) to remove the DUP flag when opening the database.
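
For reference, here is a rough sketch of the idea behind that hack, assuming
the Berkeley DB 2.x db_open()/DB_INFO interface; the wrapper name, the private
flag name and the plumbing are my own illustration, not the actual benchit.cc
code:

#include <db.h>
#include <string.h>

#define BENCH_NO_DUP 0x400000   /* assumed private flag, outside the DB_* range */

static int bench_open(const char *file, u_int32_t flags, DB **dbp)
{
    DB_INFO dbinfo;
    memset(&dbinfo, 0, sizeof(dbinfo));
    dbinfo.flags = DB_DUP;              /* normal case: allow duplicate keys */

    if (flags & BENCH_NO_DUP) {         /* hack: caller asked for no duplicates */
        flags &= ~BENCH_NO_DUP;         /* strip the private flag before db_open */
        dbinfo.flags &= ~DB_DUP;
    }
    return db_open(file, DB_BTREE, flags, 0664, NULL, &dbinfo, dbp);
}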

       My final conclusion on prefix compression is that it does not
save space. I can't really figure out why at present. I double-checked
the bench code and see no flaw.

       Here are the results for 594 670 keys:

Without the prefix compression routine:

make[1]: Entering directory `/home/loic/local/ports/htdig3/bench'
Reading from words.all ... pushed 59467 words (10 documents) 
12.85user 0.46system 0:14.42elapsed 92%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (920major+512minor)pagefaults 0swaps
no prefix : 31 bytes per word
Reading from words.all ... pushed 59467 words (10 documents) 
13.52user 0.48system 0:15.32elapsed 91%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (920major+512minor)pagefaults 0swaps
prefix : 41 bytes per word
Reading from words.all ... pushed 59467 words (10 documents) 
13.34user 0.55system 0:15.12elapsed 91%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (920major+512minor)pagefaults 0swaps
suffix : 41 bytes per word
make[1]: Leaving directory `/home/loic/local/ports/htdig3/bench'

With the default prefix compression routine:

make[1]: Entering directory `/home/loic/local/ports/htdig3/bench'
Reading from words.all ... pushed 59467 words (10 documents) 
12.85user 0.47system 0:14.54elapsed 91%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (920major+510minor)pagefaults 0swaps
no prefix : 31 bytes per word
Reading from words.all ... pushed 59467 words (10 documents) 
13.47user 0.53system 0:15.35elapsed 91%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (920major+510minor)pagefaults 0swaps
prefix : 41 bytes per word
Reading from words.all ... pushed 59467 words (10 documents) 
13.44user 0.50system 0:15.63elapsed 89%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (920major+510minor)pagefaults 0swaps
suffix : 41 bytes per word
make[1]: Leaving directory `/home/loic/local/ports/htdig3/bench'

         As far as I can see, the prefix compression routine (__bam_defpfx)
only eats CPU. This is a major drawback, IMHO. I think we should
get feedback from Sleepycat on this subject. Could they give us the
optimal key pattern that would make the prefix compression code save space,
if such a pattern exists? I won't be able to contact them before next week
because the connection here is getting worse every day.
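
For the record, the prefix callback's contract, as I read the DB documentation,
is: given two adjacent keys in sorted order, return the number of bytes of the
second key needed to tell it apart from the first, so that internal pages only
have to store that many bytes. Below is a simplified sketch of the kind of
logic __bam_defpfx implements; details may differ from the real source:

#include <db.h>
#include <stddef.h>

static size_t default_prefix(const DBT *a, const DBT *b)
{
    const u_int8_t *p1 = (const u_int8_t *)a->data;
    const u_int8_t *p2 = (const u_int8_t *)b->data;
    size_t len = a->size < b->size ? a->size : b->size;
    size_t cnt = 1;

    /* Return one byte past the first difference. */
    for (; len-- > 0; ++p1, ++p2, ++cnt)
        if (*p1 != *p2)
            return cnt;

    /* a is a prefix of b (keys are sorted): keep one extra byte of b. */
    return a->size < b->size ? a->size + 1 : a->size;
}

With keys like "word\001docid" this should, in theory, let the internal pages
drop the shared word prefix, which is why I expected some savings.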

         I've also run the test on 5 946 700 keys, to see if prefix
compression only kicks in when a really large number of keys is involved.
That would make sense, since the prefix compression routine is only called
when splitting nodes; it is not called for each inserted key.
         I got the following:

Without the prefix compression routine:

make[1]: Entering directory `/home/loic/local/ports/htdig3/bench'
Reading from words.all ... pushed 59467 words (100 documents) 
194.37user 6.43system 3:38.57elapsed 91%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (1025major+517minor)pagefaults 0swaps
no prefix : 54 bytes per word
Reading from words.all ... pushed 59467 words (100 documents) 
210.92user 9.04system 4:02.06elapsed 90%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (1196major+516minor)pagefaults 0swaps
prefix : 71 bytes per word
Reading from words.all ... pushed 59467 words (100 documents) 
198.07user 9.73system 3:50.63elapsed 90%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (1203major+516minor)pagefaults 0swaps
suffix : 71 bytes per word
make[1]: Leaving directory `/home/loic/local/ports/htdig3/bench'

With the prefix compression routine:

make[1]: Entering directory `/home/loic/local/ports/htdig3/bench'
Reading from words.all ... pushed 59467 words (100 documents) 
193.56user 7.52system 3:37.94elapsed 92%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (1143major+512minor)pagefaults 0swaps
no prefix : 54 bytes per word
Reading from words.all ... pushed 59467 words (100 documents) 
210.88user 9.25system 4:04.83elapsed 89%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (1204major+512minor)pagefaults 0swaps
prefix : 71 bytes per word
Reading from words.all ... pushed 59467 words (100 documents) 
198.76user 9.16system 3:52.98elapsed 89%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (1205major+512minor)pagefaults 0swaps
suffix : 71 bytes per word
make[1]: Leaving directory `/home/loic/local/ports/htdig3/bench'

         As may be seen, the prefix compression has no effect. As a side
effect of this test we also see that the overhead of the internal data
structures used to manage the btree grows as the number of keys grows.
Roughly, multiplying the number of keys by 10 adds 23 bytes of overhead
(31 + 23 = 54) for regular keys and 30 bytes of overhead
(41 + 30 = 71) for keys with a 10-byte prefix or suffix.

         Assuming that the additional overhead remains stable, we will
have 54 + 23 = 77 bytes for each word if we have 50 000 000 word
occurrences. This figure approximately matches the number of occurrences
we can expect for 500 000 documents (100 words per document, a low
estimate). For 500 000 documents, then, the size of the word file
will be 77 * 50 000 000 = 3 850 000 000 bytes.
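
To make that extrapolation explicit, here is a small throwaway sketch that
just restates the arithmetic above (the per-word figures are the measured
ones, the "stable overhead" assumption is mine):

#include <cstdio>

int main()
{
    const double per_word_100_docs  = 54;   // measured bytes/word at 100 docs
    const double growth_per_10x     = 23;   // measured growth from 10 to 100 docs
    const double per_word_1000_docs = per_word_100_docs + growth_per_10x; // assumed stable

    const double documents     = 500000;    // target collection
    const double words_per_doc = 100;       // low estimate
    const double occurrences   = documents * words_per_doc;  // 50 000 000

    std::printf("bytes/word: %.0f, word file: %.0f bytes\n",
                per_word_1000_docs, per_word_1000_docs * occurrences);
    // prints: bytes/word: 77, word file: 3850000000 bytes
    return 0;
}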
         Of course this estimate is not exact, because the data I've
associated with each entry is the word itself. However, the average size
of a word in the 59467-word list I've used is 8 bytes, and I assume this is
a reasonable size for the data associated with each word (I did not check
that, I may be wrong). The document id stored in the key after the \001
is in ASCII in my bench and could be converted to binary if an
appropriate compare function is provided to DB. This has no impact on
the tests since I use document numbers from 1 to 100, hence only using
up to three bytes, which is less than a 4-byte binary encoding.
All in all, I would say that the bench is close to real usage.
Of course I expect you to flame me if that's not the case ;-)
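
To be concrete about what a binary document id could look like, here is a
hypothetical key layout and compare function, assuming the DB 2.x bt_compare
interface (two DBT arguments). This is not what the bench currently does (the
bench stores the id in ASCII); names and layout are illustrative only:

/* Hypothetical key layout: "<word>\001<4-byte big-endian docid>". */
#include <db.h>
#include <string.h>

static int word_docid_compare(const DBT *a, const DBT *b)
{
    size_t len = a->size < b->size ? a->size : b->size;

    /*
     * Because the \001 separator sorts below any letter and the docid is
     * stored big-endian with a fixed width, a plain memcmp orders both the
     * word part and the numeric part correctly.
     */
    int r = memcmp(a->data, b->data, len);
    if (r != 0)
        return r;
    return (int)a->size - (int)b->size;    /* shorter key sorts first */
}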

          In conclusion, we have a problem. A 4 GB word index for a
500 000-document set is too big: with the optimistic Internet-standard
4 KB document size and 100 word occurrences per document, the original
data is 500 000 x 4 KB = 2 GB, so the index is twice the size of the
original data. I strongly think this could be dramatically reduced if
prefix compression worked. If we can't make it work, we're toast.

          I somehow hope I'm completely wrong :-)

-- 
                Loic Dachary

                ECILA
                100 av. du Gal Leclerc
                93500 Pantin - France
                Tel: 33 1 56 96 09 80, Fax: 33 1 56 96 09 61
                e-mail: [EMAIL PROTECTED] URL: http://www.senga.org/

