Hi Robert,
the adapted codec is running, but it seems to be incredibly slow. It will take
some time ;)
Here are some performance results:
Indexing scheme               | Index size                | Avg. query time        | Max. query time
------------------------------+---------------------------+------------------------+----------------
PforDelta2 W Freq W Pos       | 20.6 GB (3.3 GB w/o .pos) | 81.97 ms               | 1295 ms
PforDelta2 W/O Freq W/O Pos   | 1.6 GB                    | 63.33 ms               | 766 ms
Standard 4.0 W Freq W Pos     | 28.1 GB (8.1 GB w/o .prx) | 77.71 ms               | 978 ms
Standard 4.0 W/O Freq W/O Pos | 6.2 GB                    | 59.93 ms               | 718 ms
Standard 3.0 W Freq W Pos     | 28.1 GB (8.1 GB w/o .prx) | 71.41 ms               | 978 ms
Standard 3.0 W/O Freq W/O Pos | 6.2 GB                    | 72.72 ms               | 845 ms
PforDelta W Freq W Pos        | 22 GB (5 GB w/o .pos)     | 67.98 ms               | 783 ms
PforDelta W/O Freq W/O Pos    | 3.1 GB                    | 56.08 ms               | 596 ms
Huffman BL10 W Freq W/O Pos   | 2.6 GB                    | 216.29 ms (Mem 14 ms)  | 1338 ms
I am a bit puzzled by the Lucene 3.0 results, because the larger index
(71.41 ms on average) seems to be faster than the smaller one (72.72 ms)?!?
I have already run the test several times. Are my results realistic at all?
I thought PForDelta and PForDelta2 would outperform the standard index
implementations in query processing.
The last result is my own implementation. I am still trying to make it
smaller, because I think I can improve the compression further. For indexing I
use PForDelta2 in combination with payloads. The payloads are what cause the
higher runtimes; in memory it looks good (14 ms). The gap between my solution
and PForDelta is already 700 MB, which I would call an improvement. :D I will
have another look at it once I have an index built with your adapted
implementation.
I still have another question. The basic idea of my implementation is to
create a "two-level" index structure specialized for versioned
document collections. On the first level I create a posting list entry for a
document whenever a term occurs in one or more of its versions. The second
level holds the corresponding term frequency information. Is it possible to
build such a structure by creating a codec? For query processing it should
evaluate the boolean query on the first level and only fetch information from
the second level when the document is in the intersection of the first-level
posting lists. At the moment I use payloads to "simulate" a two-level structure.
Normally all payloads corresponding to a query get fetched, right?
If this structure is possible, there are several more implementations
with promising results that could be built on it (Two-Level Diff/MSA in this
paper: http://cis.poly.edu/suel/papers/version.pdf).
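To make the two-level idea concrete, here is a minimal in-memory sketch in plain Java. It is not a Lucene codec and uses no Lucene API. The class TwoLevelIndex, its method names, and the summed-frequency "score" are all made up for illustration. It only shows the intended access pattern: intersect on the first level, then touch the second level for the surviving documents only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical in-memory sketch of a two-level index for versioned documents. */
class TwoLevelIndex {
    // Level 1: term -> sorted IDs of documents that contain the term in at least one version.
    private final Map<String, TreeSet<Integer>> firstLevel = new HashMap<>();
    // Level 2: term -> docId -> (version -> term frequency).
    private final Map<String, Map<Integer, TreeMap<Integer, Integer>>> secondLevel = new HashMap<>();

    /** Counts level-2 lookups, to show that non-matching documents are never touched. */
    int secondLevelFetches = 0;

    void add(String term, int docId, int version, int freq) {
        firstLevel.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        secondLevel.computeIfAbsent(term, t -> new HashMap<>())
                   .computeIfAbsent(docId, d -> new TreeMap<>())
                   .put(version, freq);
    }

    /** Level 1 only: documents in which every term occurs in at least one version. */
    List<Integer> intersect(String... terms) {
        List<Integer> result = null;
        for (String term : terms) {
            TreeSet<Integer> docs = firstLevel.get(term);
            if (docs == null) return new ArrayList<>();   // unknown term: empty intersection
            if (result == null) result = new ArrayList<>(docs);
            else result.retainAll(docs);
        }
        return result == null ? new ArrayList<>() : result;
    }

    /** Level 2: per-version frequencies of one term in one document. */
    Map<Integer, Integer> frequencies(String term, int docId) {
        secondLevelFetches++;
        Map<Integer, TreeMap<Integer, Integer>> byDoc = secondLevel.get(term);
        if (byDoc == null || !byDoc.containsKey(docId)) return new TreeMap<>();
        return byDoc.get(docId);
    }

    /** Boolean AND on level 1, then level-2 fetches for the surviving documents only. */
    Map<Integer, Integer> andQueryTotalFreq(String... terms) {
        Map<Integer, Integer> totals = new TreeMap<>();
        for (int docId : intersect(terms)) {
            int total = 0;
            for (String term : terms) {
                for (int f : frequencies(term, docId).values()) total += f;
            }
            totals.put(docId, total);
        }
        return totals;
    }

    public static void main(String[] args) {
        TwoLevelIndex idx = new TwoLevelIndex();
        idx.add("lucene", 1, 0, 2);  // doc 1, version 0
        idx.add("lucene", 1, 1, 3);  // doc 1, version 1
        idx.add("codec", 1, 1, 1);   // doc 1, version 1
        idx.add("lucene", 2, 0, 5);  // doc 2 lacks "codec"
        idx.add("codec", 3, 0, 4);   // doc 3 lacks "lucene"
        System.out.println(idx.andQueryTotalFreq("lucene", "codec")); // prints {1=6}
        System.out.println(idx.secondLevelFetches);                   // prints 2
    }
}
```

In this toy example only document 1 survives the first-level intersection, so the second level is read twice (once per query term) and documents 2 and 3 are never looked up there. With the payload simulation I assume the equivalent data would be read for every posting that is visited.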
Regards Alex
--
Sent from the Lucene - Java Users mailing list archive at Nabble.com.