> it depends upon the type of query.. what queries are you using for
> this benchmarking and how are you benchmarking?
> FYI: for benchmarking standard query types with wikipedia you might be
> interested in http://code.google.com/a/apache-extras.org/p/luceneutil/

I have 10000 queries from an AOL data set in which the followed link led to Wikipedia. I benchmark by warming up the IndexSearcher with 5000 of them and then running the test with the remaining 5000 queries. I only measure the time needed to execute the queries. I use QueryParser.

> wait, you are indexing payloads for your tests with these other codecs
> when it says "W POS" ?

No, only my last implementation uses payloads; the others do not. That is why I use a payload-aware query for Huffman.

> keep in mind that even adding a single payload to your index slows
> down the decompression of the positions tremendously, because payload
> lengths are intertwined with the positions. For block codecs payloads
> really need to be done differently so that blocks of positions are
> really just blocks of positions. This hasn't yet been fixed for the
> sep nor the fixed layouts, so if you add any payloads, and then
> benchmark positional queries then the results are not realistic.

I know that payloads slow down query processing, but I wasn't aware of the block codec problem. I assume that by "not realistic" you mean the results will be slower?

Some numbers for Huffman (index file sizes):

  20 bytes     segments.gen
  234.6 KB     fdt
  1.8 MB       fdx
  20 bytes     fnm
  626.1 MB     pos
  1.7 GB       pyl
  17.8 MB      skp
  39.8 MB      tib
  2028.5 KB    tiv
  268 bytes    segments_2
  214.6 MB     doc

For query processing here I used PayloadQueryParser and adapted the similarity to my payloads.

> No they do not, only if you use a payload based query such as
> PayloadTermQuery. Normal non-positional queries like TermQuery and
> even normal positional queries like PhraseQuery don't fetch payloads
> at all...

Sorry, my question was misleading. I was already assuming a payload-aware query. When I use one, how exactly is the payload information fetched from disk? For example, if a query needs to read two posting lists, are all payloads for both lists fetched directly, or does Lucene first compute the boolean intersection and then retrieve the payloads only for the documents in that intersection?
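
As an illustration only (this is not the code used for the numbers above), the measurement procedure described earlier, warming up with the first 5000 queries and timing the remaining 5000, could be sketched in plain Python as follows. The names `benchmark` and `run_query` are hypothetical; `run_query` stands in for whatever parses and executes one query against the searcher.

```python
import time


def benchmark(queries, run_query, warmup=5000):
    """Run `warmup` queries untimed, then time the rest.

    `run_query` is a hypothetical callable that executes one query.
    Returns (elapsed_seconds, number_of_timed_queries).
    """
    warm, timed = queries[:warmup], queries[warmup:]

    # Warm-up phase: results and timings are discarded.
    for q in warm:
        run_query(q)

    # Timed phase: only query execution is measured.
    start = time.perf_counter()
    for q in timed:
        run_query(q)
    elapsed = time.perf_counter() - start

    return elapsed, len(timed)
```

Dividing the elapsed time by the number of timed queries gives an average per-query latency, which is easier to compare across codecs than the raw total.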

> From the description of what you are doing I don't understand how
> payloads fit in because they are per-position? But, I haven't had the
> time to digest the paper you sent yet.

I will try to summarize the paper and how I adapted it to Lucene. I already mentioned the idea of two levels for versioned document collections.

When I parse Wikipedia, I take the union of all terms across all versions of an article. From this bag of words I extract each distinct term and index it with Lucene as a single document. Frequency information is "lost" at the first level but is stored at the second. This is what I meant by "The first level contains a posting for a document when a term occurs in at least one version". For example, if an article has two versions, version1: "a b b" and version2: "a a a c c", only 'a', 'b' and 'c' are indexed.

For the second level I collect term frequency information during the parsing step. The frequencies are stored as a vector in version order. For the example above, the frequency vector for 'a' would be [1,3]. I store these vectors as payloads, which I regard as the "second level". Each distinct term on the first level gets a single frequency vector, attached to its first position. So I am somewhat abusing payloads.

For query processing I now need to retrieve both the docs and the payloads. Ideally the posting lists would be processed first, ignoring payloads, and the payloads (the frequency information) would then be fetched only for the remaining docs. The term frequency is then used for ranking. At the moment I rank using the highest value in the frequency vector, which corresponds to the best-matching version.
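
To make the two-level idea concrete, here is a minimal sketch in plain Python. It does not use Lucene or its payload API; the dictionaries merely stand in for posting lists and payloads, and the names `build_two_level` and `search` are hypothetical. Summing the per-term maxima for multi-term queries is an assumption made for illustration, since the text above only says that the highest value of the frequency vector is used.

```python
from collections import Counter


def build_two_level(articles):
    """articles: {doc_id: [version_text, ...]} in version order.

    Returns (postings, freq_vectors):
      postings:     term -> list of doc ids (first level, no frequencies)
      freq_vectors: (term, doc_id) -> per-version counts (second level)
    """
    postings, freq_vectors = {}, {}
    for doc_id in sorted(articles):
        counts = [Counter(text.split()) for text in articles[doc_id]]
        # Union of the terms of all versions: each distinct term is
        # indexed once per article.
        for term in sorted(set().union(*counts)):
            postings.setdefault(term, []).append(doc_id)
            freq_vectors[(term, doc_id)] = [c[term] for c in counts]
    return postings, freq_vectors


def search(terms, postings, freq_vectors):
    """Conjunctive query: intersect first, read frequencies afterwards."""
    # Phase 1: boolean AND over the first level only.
    doc_sets = [set(postings.get(t, ())) for t in terms]
    matches = set.intersection(*doc_sets) if doc_sets else set()

    # Phase 2: look up frequency vectors only for surviving docs.
    # Assumption: score = sum over terms of the max per-version count.
    scored = [
        (sum(max(freq_vectors[(t, d)]) for t in terms), d) for d in matches
    ]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


articles = {
    1: ["a b b", "a a a c c"],  # the two versions from the example above
    2: ["a c", "c c c"],        # a second, made-up article
}
postings, freq_vectors = build_two_level(articles)
print(freq_vectors[("a", 1)])   # [1, 3]
print(search(["a", "c"], postings, freq_vectors))
```

Note that taking the maximum per term can combine counts from different versions of the same article; scoring each version separately would need the full vectors of all query terms side by side.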

Regards,
Alex

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

--
View this message in context: http://lucene.472066.n3.nabble.com/New-codecs-keep-Freq-skip-omit-Pos-tp2849776p2856054.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.