Similarity class and searchPayloads
Hello everybody, I am just curious about the following case. Currently I create a boolean AND query which loads payloads. In some cases Lucene loads payloads but does not return any hits. I therefore assume that payloads are loaded directly with each doc ID from the posting list, before the boolean filter is applied. Is that right? Is it possible to filter documents first and only then load the payloads? For example, with three terms I would check in every posting list whether the current doc ID is present, and only then load the payload (see the sketch below). Or can anybody tell me where exactly Lucene loads payloads in the code?

Regards Alex
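For illustration, here is roughly what I have in mind (a sketch against the flex API I use elsewhere; field and term names are made up, and I know the real BooleanQuery machinery works differently):

    import org.apache.lucene.index.DocsAndPositionsEnum;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.MultiFields;
    import org.apache.lucene.util.BytesRef;

    // Leapfrog two posting lists; touch payloads only for docs in the intersection.
    static void intersectThenLoadPayloads(IndexReader reader) throws java.io.IOException {
      DocsAndPositionsEnum a = MultiFields.getTermPositionsEnum(
          reader, MultiFields.getDeletedDocs(reader), "content", new BytesRef("term1"));
      DocsAndPositionsEnum b = MultiFields.getTermPositionsEnum(
          reader, MultiFields.getDeletedDocs(reader), "content", new BytesRef("term2"));
      int docA = a.nextDoc();
      int docB = b.nextDoc();
      while (docA != DocsAndPositionsEnum.NO_MORE_DOCS
          && docB != DocsAndPositionsEnum.NO_MORE_DOCS) {
        if (docA < docB) {
          docA = a.advance(docB);          // skip ahead, no payloads read
        } else if (docB < docA) {
          docB = b.advance(docA);
        } else {                           // doc is in both lists
          a.nextPosition();                // payloads hang off positions
          if (a.hasPayload()) {
            BytesRef payload = a.getPayload(); // loaded only for matching docs
            // ... use payload for scoring ...
          }
          docA = a.nextDoc();
          docB = b.nextDoc();
        }
      }
    }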
Lucene query processing
Hello everybody,

As far as I know, Lucene processes documents DAAT (document-at-a-time). Depending on the query, either the intersection or the union of the posting lists is computed. For the intersection only documents occurring in all posting lists are scored; in the union case every document is scored, which makes it the more expensive operation. Lucene stores its index in several files, and depending on the query different files may be accessed for scoring. For example, a payload query needs to read payloads from .pos. What is not clear to me is how term frequencies and payloads are processed. Assuming I want term frequencies stored, I need to call setOmitTermFreqAndPositions(false).

1) Which queries include term frequencies? I assume all queries, if term frequencies are stored?
2) Why is fetching payloads so much more expensive than getting term frequencies? Both are stored in separate files and therefore demand a disk seek.
3) What value does tf contain if I set setOmitTermFreqAndPositions(true)? Always 1?
4) How are term freqs and payloads read from disk? In bulk for all remaining docs at once, or every time a document gets scored?

Regards Alex
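To make sure we are talking about the same setting, a minimal 3.x-style sketch (the text variable is a placeholder):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    Document doc = new Document();
    Field content = new Field("content", text, Field.Store.NO, Field.Index.ANALYZED);
    content.setOmitTermFreqAndPositions(true); // write doc IDs only
    doc.add(content);                          // question 3): does tf then read back as 1?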
Re: New codecs keep Freq skip/omit Pos
> it depends upon the type of query.. what queries are you using for
> this benchmarking and how are you benchmarking?
> FYI: for benchmarking standard query types with wikipedia you might be
> interested in http://code.google.com/a/apache-extras.org/p/luceneutil/

I have 10,000 queries from an AOL data set where the followed link led to Wikipedia. I benchmark by warming up the IndexSearcher with 5,000 queries and perform the test with the remaining 5,000. I just measure the time needed to execute the queries. I use QueryParser.

> wait, you are indexing payloads for your tests with these other codecs
> when it says "W POS" ?

No, only my last implementation uses payloads; the others do not. Therefore I use a payload-aware query for Huffman.

> keep in mind that even adding a single payload to your index slows
> down the decompression of the positions tremendously, because payload
> lengths are intertwined with the positions. For block codecs payloads
> really need to be done differently so that blocks of positions are
> really just blocks of positions. This hasn't yet been fixed for the
> sep nor the fixed layouts, so if you add any payloads, and then
> benchmark positional queries then the results are not realistic.

I knew that payloads slow down query processing, but I wasn't aware of the block codec problem. I suppose that by "not realistic" you mean they will be slower? Some numbers for Huffman:

    segments.gen    20 bytes
    .fdt         234.6 KB
    .fdx           1.8 MB
    .fnm            20 bytes
    .pos         626.1 MB
    .pyl           1.7 GB
    .skp          17.8 MB
    .tib          39.8 MB
    .tiv        2028.5 KB
    segments_2     268 bytes
    .doc         214.6 MB

For query processing I used my PayloadQueryParser here and adapted the similarity according to my payloads.

> No they do not, only if you use a payload based query such as
> PayloadTermQuery. Normal non-positional queries like TermQuery and
> even normal positional queries like PhraseQuery don't fetch payloads
> at all...

Sorry, my question was misleading. I am already focused on a payload-aware query. When I use one, how exactly is the payload information fetched from disk? For example, if a query needs to read two posting lists: are all payloads for them fetched directly, or does Lucene first compute the boolean intersection and then retrieve the payloads only for the documents within that intersection?

> From the description of what you are doing I don't understand how
> payloads fit in because they are per-position? But, I haven't had the
> time to digest the paper you sent yet.

I will try to summarize it and how I adapted it to Lucene. I already mentioned the idea of two levels for versioned document collections. When I parse Wikipedia, I unite all terms of all versions of an article. From this word bag I extract each distinct term and index it with Lucene into one document. Frequency information is now "lost" on the first level but is stored on the second. This is what I meant by "the first level contains a posting for a document when a term occurs in at least one version". For example, if an article has two versions like version 1: "a b b" and version 2: "a a a c c", only 'a', 'b' and 'c' are indexed. For the second level I collect term frequency information during my parsing step. Those frequencies are stored as a vector in version order; for the above example the frequency vector for 'a' would be [1, 3]. I store these vectors as payloads, which I see as the "second level". Every distinct term on the first level receives a single frequency vector on its first position, so I somehow abuse payloads. For query processing I now need to retrieve the docs and payloads.
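To make the layout concrete, this is roughly how such a frequency vector can be packed into a single payload (a vByte-style sketch; just one possible encoding, decoding is symmetric):

    import java.io.ByteArrayOutputStream;

    // Pack a per-version term-frequency vector, e.g. [1, 3], into a byte[]
    // for one payload; 7 data bits per byte, high bit set means "more bytes".
    static byte[] encodeFreqVector(int[] freqs) {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      for (int f : freqs) {
        while ((f & ~0x7F) != 0) {
          out.write((f & 0x7F) | 0x80);
          f >>>= 7;
        }
        out.write(f);
      }
      return out.toByteArray();
    }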
It would be optimal to process the posting lists first, ignoring payloads, and then fetch the payloads (frequency information) only for the remaining docs. The term frequency is then used for ranking. At the moment I pick for ranking the highest value from the freq vector, which corresponds to the best-matching version.

Regards Alex
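This "highest value" ranking is what I approximate with the stock payload query classes, something along these lines (3.x-style signatures; a custom Similarity.scorePayload(...) decodes the vector and returns its largest entry):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.payloads.MaxPayloadFunction;
    import org.apache.lucene.search.payloads.PayloadTermQuery;

    // MaxPayloadFunction keeps the maximum of the per-occurrence payload scores.
    PayloadTermQuery query = new PayloadTermQuery(
        new Term("content", "car"),   // field/term are placeholders
        new MaxPayloadFunction(),
        false);                       // payload score only, no span score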
Re: New codecs keep Freq skip/omit Pos
Hi Robert, the adapted codec is running, but it seems to be incredibly slow. It will take some time ;) Here are some performance results:

    Indexing scheme                 Index size                   Avg. query time               Max. query time
    PforDelta2   W Freq   W Pos     20.6 GB (3.3 GB w/o .pos)     81.97 ms                     1295 ms
    PforDelta2   W/O Freq W/O Pos    1.6 GB                       63.33 ms                      766 ms
    Standard 4.0 W Freq   W Pos     28.1 GB (8.1 GB w/o .prx)     77.71 ms                      978 ms
    Standard 4.0 W/O Freq W/O Pos    6.2 GB                       59.93 ms                      718 ms
    Standard 3.0 W Freq   W Pos     28.1 GB (8.1 GB w/o .prx)     71.41 ms                      978 ms
    Standard 3.0 W/O Freq W/O Pos    6.2 GB                       72.72 ms                      845 ms
    PforDelta    W Freq   W Pos     22 GB   (5 GB w/o .pos)       67.98 ms                      783 ms
    PforDelta    W/O Freq W/O Pos    3.1 GB                       56.08 ms                      596 ms
    Huffman BL10 W Freq   W/O Pos    2.6 GB                      216.29 ms (14 ms in memory)   1338 ms

I am a little bit curious about the Lucene 3.0 results, because the larger index seems to be faster?!? I have already run the test several times. Are my results realistic at all? I thought PForDelta/2 would outperform the standard index implementations in query processing.

The last row is my own implementation. I am still trying to get it smaller, because I think I can improve the compression further. For indexing I use PForDelta2 in combination with payloads; those are causing the higher runtimes. In memory it looks nice. The gap between my solution and PForDelta is already 700 MB, so I would call it an improvement :D I will have another look at it after I have built an index with your adapted implementation.

I still have another question. The basic idea in my implementation is a "two-level" index structure, specialized for versioned document collections. On the first level I create a posting list entry for a document whenever a term occurs in one or more of its versions. The second level holds the corresponding term frequency information. Is it possible to build such a structure by creating a codec? For query processing it should filter per boolean query on the first level and only fetch information from the second level when the document is in the intersection of the first level. At the moment I use payloads to "simulate" a two-level structure. Normally all payloads corresponding to a query get fetched, right? If this structure is possible, there are several more implementations with promising results (Two-Level Diff/MSA in this paper: http://cis.poly.edu/suel/papers/version.pdf).

Regards Alex
Re: New codecs keep Freq skip/omit Pos
Wow cool, I will give that a try! Thank you!!

Alex
Re: New codecs keep Freq skip/omit Pos
I also indexed once with Lucene 3.0. Are those sizes really completely identical?

    Standard 4.0 W Freq   W Pos    28.1 GB
    Standard 4.0 W/O Freq W/O Pos   6.2 GB
    Standard 3.0 W Freq   W Pos    28.1 GB
    Standard 3.0 W/O Freq W/O Pos   6.2 GB

Regards Alex
Re: New codecs keep Freq skip/omit Pos
Hello Robert, thank you for the answers! :) I actually used PatchedFrameOfRef and PatchedFrameOfRef2, so both implementations are PForDelta; sorry, my mistake. PatchedFrameOfRef2 is the "PforDelta W/O Freq W/O Pos, 1.6 GB" row and PatchedFrameOfRef the "Pfor W/O Freq W/O Pos, 3.1 GB" row. Here are some numbers:

PatchedFrameOfRef2 w/o POS w/o FREQ:

    segments.gen    20 bytes
    _43.fdt        8.1 MB
    _43.fdx       64.4 MB
    _43.fnm         20 bytes
    _43_0.skp    182.6 MB
    _43_0.tib     32.3 MB
    _43_0.tiv      1.0 MB
    segments_2     268 bytes
    _43_0.doc      1.3 GB

PatchedFrameOfRef w/o POS w/o FREQ:

    segments.gen    20 bytes
    _43.fdt        8.1 MB
    _43.fdx       64.4 MB
    _43.fnm         20 bytes
    _43_0.skp    182.6 MB
    _43_0.tib     32.3 MB
    _43_0.tiv      1.1 MB
    segments_2     267 bytes
    _43_0.doc      2.8 GB

During indexing I use StandardAnalyzer (StandardFilter, LowerCaseFilter, StopFilter). Can I get more information on codec creation somewhere, or is there just "grubbing" through the code? My own implementation needs 2.8 GB of space including FREQ but not POS. This is why I am asking: I want to compare the results somehow. Compared to 20 GB it is very nice, and compared to 1.6 GB it is very bad ;)

Regards Alex
New codecs keep Freq skip/omit Pos
Hello everybody, I am currently testing several new Lucene 4.0 codec implementations to compare them with my own solution. The difference is that I only index frequencies and not positions, and I would like to have the same setting for the other codecs. I know there was already a post on this topic: http://lucene.472066.n3.nabble.com/Omit-positions-but-not-TF-td599710.html. I just wanted to ask whether something has changed, especially for the new codecs. I had a look at FixedPostingWriterImpl and PostingsConsumer. Are those the right places for adapting Pos/Freq handling? What would happen if I just skipped writing positions/payloads, would it mess up the index? The written files have different endings like .pyl, .skp, .pos, .doc etc. Does "not counting" the .pos file give me a correct index size estimate for W Freq W/O Pos, or where exactly are term positions written?

Regards Alex

PS: Some results with the current codecs, if someone is interested. I indexed 10% of the English Wikipedia; each version is indexed as a document.

    Docs                       240,179
    Versions                 8,467,927
    Distinct terms           3,501,214
    Total terms          1,520,008,204
    Avg. versions per doc        35.25
    Avg. terms per version      179.50
    Avg. terms per doc        6,328.65

    PforDelta    W Freq   W Pos    20.6 GB
    PforDelta    W/O Freq W/O Pos   1.6 GB
    Standard 4.0 W Freq   W Pos    28.1 GB
    Standard 4.0 W/O Freq W/O Pos   6.2 GB
    Pfor         W Freq   W Pos    22 GB
    Pfor         W/O Freq W/O Pos   3.1 GB

Performance numbers follow ;)
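For later readers: support for exactly this case, frequencies without positions, was added to 4.x builds after this thread via IndexOptions; a sketch against that later API:

    import org.apache.lucene.document.FieldType;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.FieldInfo.IndexOptions;

    // Index doc IDs and term frequencies, but no positions (and hence no payloads).
    FieldType ft = new FieldType(TextField.TYPE_NOT_STORED);
    ft.setIndexOptions(IndexOptions.DOCS_AND_FREQS);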
Lucene 4.0 Payloads
Hello everybody, I am currently experimenting with Lucene 4.0 and would like to add payloads. A payload should only be added once per term, on the first position. My current code looks like this:

    public final boolean incrementToken() throws java.io.IOException {
      if (!input.incrementToken()) {
        return false;
      }
      // read the term only after advancing, otherwise we see the previous token
      String term = characterAttr.toString();
      // hmh contains all terms for one document
      if (hmh.checkKey(term)) { // check if the map contains the term
        Payload payload = new Payload(hmh.getCompressedData(term)); // get payload data
        payloadAttr.setPayload(payload); // attach the payload
        hmh.removeFromIndexingMap(term); // payload only once per term
      }
      return true;
    }

Is this a correct way of adding payloads in Lucene 4.0? When I try to retrieve the payloads I am not getting a payload on the first position. For reading payloads I use this:

    DocsAndPositionsEnum tp = MultiFields.getTermPositionsEnum(ir,
        MultiFields.getDeletedDocs(ir), fieldName, new BytesRef(searchString));
    while (tp.nextDoc() != DocsAndPositionsEnum.NO_MORE_DOCS) {
      tp.nextPosition(); // payloads are per position: advance to the first position
      if (tp.hasPayload() && counter < 10) { // counter limits the debug output
        Document doc = ir.document(tp.docID());
        BytesRef br = tp.getPayload();
        System.out.println("Found payload \"" + br.utf8ToString() + "\" for document "
            + tp.docID() + " and query " + searchString
            + " in country " + doc.get("country"));
      }
    }

As far as I know there are two possibilities to use payloads: 1) during similarity scoring and 2) during search. Is there a better/faster way to retrieve payloads during search? Is it possible to run a normal query and read the payloads from the hits? Is 1) or 2) the faster way to use payloads? Can I find example code for Lucene and loading payloads somewhere?

Regards Alex
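Regarding a "normal query plus payloads": I came across PayloadSpanUtil, which rewrites a query into spans and collects the payloads it touches. Would something like this be the intended use (a sketch with 3.x-style signatures; reader and query are assumed)?

    import java.util.Collection;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.payloads.PayloadSpanUtil;

    // Collect the payloads at all positions the query matches.
    PayloadSpanUtil psu = new PayloadSpanUtil(reader);
    Collection<byte[]> payloads = psu.getPayloadsForQuery(query);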
Early Termination
Hi, is Lucene capable of any early-termination techniques during query processing? On the forum I only found some information about TimeLimitingCollector. Are there more implementations?

Regards Alex
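For context, the collector usage I found looks roughly like this (a hedged 3.0/3.1-style sketch; later versions change the constructor, and searcher/query are assumed):

    import org.apache.lucene.search.TimeLimitingCollector;
    import org.apache.lucene.search.TopScoreDocCollector;

    TopScoreDocCollector tdc = TopScoreDocCollector.create(10, true);
    try {
      // Abort collection once the time budget (ms) is exceeded.
      searcher.search(query, new TimeLimitingCollector(tdc, 1000));
    } catch (TimeLimitingCollector.TimeExceededException e) {
      // the hits collected so far are still available via tdc.topDocs()
    }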
How are stored Fields/Payloads loaded
Hello everybody, I am currently unsure how stored data is written to and loaded from the index. I want to store some binary data for every term of a document, but only once and not for every position! Therefore I am not sure whether payloads or stored fields are the better solution (or the not-yet-implemented column-stride fields feature). As far as I know, all fields of a document are loaded by Lucene during search. With large stored fields this can be time consuming, which is why the possibility exists to load specific fields with a FieldSelector. Maybe I could create a stored field for each term (up to several thousand fields!) and read those fields depending on the query term. Is this a common approach? The other possibility (the one I have implemented at the moment) is to store one payload per term, only on the first term position. Payloads are only loaded if I retrieve them from a hit, right? So my current posting list looks like this:

http://lucene.472066.n3.nabble.com/file/n2598739/Payload.png
(Picture adapted from M. McCandless, "Fun with Flex")

How will the column-stride fields feature (per-document fields) work? It's not clear to me what "per document" means exactly for the posting list entries. I think (hope :P) it works like this:

http://lucene.472066.n3.nabble.com/file/n2598739/CSD.png
(Picture adapted from M. McCandless, "Fun with Flex")

Do I understand column-stride fields correctly? What would give me the best performance (stored field, payload, CSF)? Are there other ways to retrieve payloads during search than SpanQuery (I would like to use a normal query here)?

Regards Alex
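For the FieldSelector route, I would try something like this (a 3.x sketch; reader, docId and the field name are assumptions):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.FieldSelector;
    import org.apache.lucene.document.MapFieldSelector;

    // Load only the named stored fields instead of the whole document.
    FieldSelector sel = new MapFieldSelector(new String[] { "title" });
    Document doc = reader.document(docId, sel); // other stored fields are skipped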
Storing payloads without term-position and frequency
Hello everybody, I am currently using Lucene 3.0.2 with payloads. I store extra information about the term in the payloads, such as frequencies, and therefore I don't need the frequencies and term positions Lucene stores normally. I would like to set f.setOmitTermFreqAndPositions(true), but then I am no longer able to retrieve payloads. Would it be hard to "hack" Lucene for my requirements? Also, I only store one payload per term, if that information makes it easier.

Best regards Alex
RE: Could not find implementing class
Hello Uwe, I recompiled some classes manually in the Lucene sources. Now it's running fine! Something had gone wrong there. Thank you very much!

Best regards Alex
Re: Could not find implementing class
Hello Alexander, isn't it enough to add the classpath through -cp? If I don't use -cp I can't compile my project at all, and I thought that after compiling without errors all sources were correctly added. In Eclipse I added the Lucene sources the same way (which works), and I also tried using the jar file. So I seem to find all the classes, but the error message gives me no clue. It is thrown by the Lucene class DefaultAttributeFactory in org.apache.lucene.util.AttributeSource. I work under Ubuntu and configured Java with:

    sudo update-alternatives --config java
    sudo update-java-alternatives -s java-6-sun

Greetings Alex
Could not find implementing class
Hello everybody, I used a small indexing example from "Lucene in Action" and can compile and run the program under Eclipse. If I want to compile and run it from the console, I get this error:

    java.lang.IllegalArgumentException: Could not find implementing class for org.apache.lucene.analysis.tokenattributes.TermAttribute
        at org.apache.lucene.util.AttributeSource$AttributeFactory$DefaultAttributeFactory.getClassForInterface(AttributeSource.java:87)
        at org.apache.lucene.util.AttributeSource$AttributeFactory$DefaultAttributeFactory.createAttributeInstance(AttributeSource.java:66)
        at org.apache.lucene.util.AttributeSource.addAttribute(AttributeSource.java:245)
        at org.apache.lucene.index.DocInverterPerThread$SingleTokenAttributeSource.<init>(DocInverterPerThread.java:41)
        at org.apache.lucene.index.DocInverterPerThread$SingleTokenAttributeSource.<init>(DocInverterPerThread.java:36)
        at org.apache.lucene.index.DocInverterPerThread.<init>(DocInverterPerThread.java:34)
        at org.apache.lucene.index.DocInverter.addThread(DocInverter.java:95)
        at org.apache.lucene.index.DocFieldProcessorPerThread.<init>(DocFieldProcessorPerThread.java:62)
        at org.apache.lucene.index.DocFieldProcessor.addThread(DocFieldProcessor.java:88)
        at org.apache.lucene.index.DocumentsWriterThreadState.<init>(DocumentsWriterThreadState.java:43)
        at org.apache.lucene.index.DocumentsWriter.getThreadState(DocumentsWriter.java:739)
        at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:814)
        at org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:802)
        at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1998)
        at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1972)
        at Demo.setUp(Demo.java:86)
        at Demo.main(Demo.java:46)

I compile with javac -cp <lucene jars> Demo.java, which finishes without errors, but running the program isn't possible. What am I missing?? Basically I am just creating a directory, getting an IndexWriter with an Analyzer, etc. Line 86 in Demo.java is writer.addDocument(doc);.

Greetings Alex
Indexing large XML dumps
Hello everybody, I am currently indexing Wikipedia dumps and creating an index for versioned document collections. So far everything is working fine, but I never thought that single Wikipedia articles would reach a size of around 2 GB! One article, for example, has 2 versions with an average length of 6 characters each (HUGE in memory!). This means I need a heap of around 4 GB to perform the indexing, and I would like to decrease my memory consumption ;). At the moment I load every Wikipedia article, with all its versions, completely into memory. Then I collect some statistical data about the article to store extra information about term occurrences, which is written into the index as payloads. The statistics are created during a tokenization run of my own, which happens before the document is written to the index. This means I am analyzing my documents twice! :( I know there is a CachingTokenFilter, but I haven't found out how and where to use it exactly (I tried it in my Analyzer, but stream.reset() did not seem to work). Does somebody have a nice example? (I sketch what I am aiming for below.)

1) Can I somehow avoid loading a complete article just to get my statistics?
2) Is it possible to index large files without loading them completely into a field?
3) How can I avoid parsing an article twice?

Best regards Alex
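For reference, here is roughly the two-pass pattern I am aiming for (a 3.x-style sketch, assuming analyzer, text and doc exist; this is my understanding of the API, not tested):

    import java.io.StringReader;
    import org.apache.lucene.analysis.CachingTokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;
    import org.apache.lucene.document.Field;

    // Pass 1: consume the stream once to gather statistics; tokens get cached.
    TokenStream ts = analyzer.tokenStream("content", new StringReader(text));
    CachingTokenFilter cached = new CachingTokenFilter(ts);
    TermAttribute termAtt = cached.addAttribute(TermAttribute.class);
    while (cached.incrementToken()) {
      String term = termAtt.term();
      // ... update per-term statistics here ...
    }
    cached.reset(); // rewind over the cache; the text is not re-tokenized

    // Pass 2: hand the cached stream to the indexer instead of the raw text.
    doc.add(new Field("content", cached));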
Re: Implementing indexing of Versioned Document Collections
Hi again, my payloads are working fine, as I figured out now (I hadn't seen the nextPosition method). I really have problems with adding the bitvectors. Currently I create them during tokenization. Therefore, as already mentioned, they are only completely built once all fields have been tokenized, because I add every new term occurrence to a HashMap and create/update the linked bitvector during this analysis. I read in another post that changing or updating already-set payloads isn't possible. Furthermore, I need to store the payload only ONCE per term and not at every term position. For example, in the wiki article for April I would have around 5000 term occurrences for the term "April"! Storing it once would save a lot of memory.

1) Is it possible to pre-analyze fields? Maybe by analyzing twice: a first pass just to build the bitvectors (without writing them!) and a second pass for the normal index writing, with the bitvectors as payloads. (See the filter sketch below.)
2) Alternatively, I could still add the bitvectors during tokenization if I were able to set the current term in my custom filter (extends TokenFilter). In my HashMap I have (term, bitvector) pairs and I could iterate over all term keys. Is it possible to manually set the current term and the corresponding payload? I tried something like this after all fields and streams had been tokenized (without success):

    for (Map.Entry<String, BitSet> e : map.entrySet()) {
      String key = e.getKey();
      BitSet value = e.getValue();
      termAtt.setTermBuffer(key);
      Payload bitvectorPayload = new Payload(toByteArray(value));
      payloadAttr.setPayload(bitvectorPayload);
    }

3) Can I use payloads without term positions?

If my questions are unclear please tell me! :)

Best regards Alex
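For question 1), this is the filter I have in mind for the second pass (a sketch; it assumes the bitvectors were built in the first pass and are handed in as already-encoded bytes):

    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;
    import org.apache.lucene.index.Payload;

    /** Attaches the pre-computed bitvector payload only to the first
     *  occurrence of each term; later occurrences carry no payload. */
    public final class FirstOccurrencePayloadFilter extends TokenFilter {
      private final Map<String, byte[]> payloadsByTerm; // term -> encoded bitvector
      private final Set<String> seen = new HashSet<String>();
      private final TermAttribute termAtt = addAttribute(TermAttribute.class);
      private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);

      public FirstOccurrencePayloadFilter(TokenStream in, Map<String, byte[]> payloadsByTerm) {
        super(in);
        this.payloadsByTerm = payloadsByTerm;
      }

      @Override
      public boolean incrementToken() throws java.io.IOException {
        if (!input.incrementToken()) return false;
        String term = termAtt.term();
        byte[] data = payloadsByTerm.get(term);
        if (data != null && seen.add(term)) { // first time we see this term
          payloadAtt.setPayload(new Payload(data));
        } else {
          payloadAtt.setPayload(null);        // no payload on later positions
        }
        return true;
      }

      @Override
      public void reset() throws java.io.IOException {
        super.reset();
        seen.clear();
      }
    }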
Re: Implementing indexing of Versioned Document Collections
Hello Pulkit, thank you for your answer and excuse my late reply. I am currently working on the payload part and have implemented my own Analyzer and TokenFilter for adding custom payloads. As far as I understand, I can add a payload for every term occurrence and write it into the posting list. My posting list now looks like this:

    car -> DocID 1 [Payload 1], DocID 2 [Payload 2], ..., DocID N [Payload N]

where each payload is a BitSet over the versions of a document. I must admit that the index is getting really big at the moment, because I add around 8 to 16 bytes with each payload. I have to find a good compression for the bitvectors.

Furthermore, I always get the error org.apache.lucene.index.CorruptIndexException: checksum mismatch in segments file when I use my own Analyzer. After I comment out the checksum test, everything works fine; even Luke isn't giving me an error. Any ideas?

Another problem is the bitvector creation during tokenization. I run through all versions during the tokenizing step to build my bitvectors (stored in a HashMap), so they are only completely built after the last field has been analyzed (I add every Wikipedia version as its own field). Therefore I need to add the payloads after the tokenizing step. Is this possible? What happens if I add a payload for the current term and later add another payload for the same term: is it overwritten or appended?

Greetings Alex
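As a baseline before any real compression, I simply bit-pack the BitSet; a small helper sketch (BitSet.toByteArray() only exists from Java 7 on, hence the manual loop):

    import java.util.BitSet;

    // Pack a version bitvector into bytes for a payload. Note that length()
    // is "highest set bit + 1", so trailing zero versions are dropped and the
    // real version count has to be known from elsewhere.
    static byte[] toByteArray(BitSet bits) {
      byte[] bytes = new byte[(bits.length() + 7) / 8];
      for (int i = 0; i < bits.length(); i++) {
        if (bits.get(i)) {
          bytes[i / 8] |= 1 << (i % 8);
        }
      }
      return bytes;
    }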
Implementing indexing of Versioned Document Collections
Hello everybody, for my diploma thesis I would like to implement the paper "Compact Full-Text Indexing of Versioned Document Collections" [1] by Torsten Suel in Lucene. The basic idea is a two-level index structure. On the first level a document is identified by document ID, with a posting list entry if the term exists in at least one version. For every first-level posting with term t there is a bitvector on the second level. These bitvectors contain as many bits as there are versions of the document, and bit i is set to 1 if version i contains term t, otherwise it remains 0.

http://lucene.472066.n3.nabble.com/file/n1872701/Unbenannt_1.jpg

This little picture is just for demonstration purposes. It shows a posting list for the term "car", composed of 4 document IDs. If a hit is found in document 6, another look-up is needed on the second level to get the corresponding versions (versions 1, 5, 7, 8, 9, 10 out of 10 versions in total).

At the moment I am using Wikipedia (the simplewiki dump) as the source, with a SAX parser, and can resolve each document with all its versions from the XML file (the fields are title, ID, and content, separated per version). My problem is that I am unsure how to connect the second level with the first one, and how to store it. The key points needed are:

- information from posting list creation to create the bitvector (term -> doc -> versions)
- storing the bitvectors
- implementing search on the second level

For the first steps I disabled term frequencies and positions, because the paper does not handle them. I would be happy to get any running version at all. :)

At the moment I can create the bitvectors for the documents. I realized this with a HashMap in TermsHashPerField, where I grab the current term in add() (I hope this is the correct location for retrieving the inverted list's terms). Anyway, I can create the correct bitvectors and write them into a text file. An excerpt of the bitvectors for the article "April":

    april     : 110110111011
    never     : 0010
    ayriway   : 010110111011
    inclusive : 1000

The next step would be storing all bitvectors in the index. At first glance I would like to use an extra field to store the created bitvectors permanently in the index; it seems to be the easiest way for a first implementation without touching the low-level parts of Lucene. Can I add a field after I have already started writing the document through IndexWriter? How would I do this? Or are there other suggestions for storing them? Another idea is to extend Lucene's index format, but that seems a little too difficult for me. Maybe I could write this information into a file of my own. Could anybody point me in the right direction? :) Currently I am focusing on storing, and will try to extend Lucene's search after that step.

THX in advance & best regards Alex

[1] http://cis.poly.edu/suel/
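A minimal sketch of how I build the second level at the moment (tokenize() stands in for my real analysis chain and is not a Lucene API):

    import java.util.BitSet;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // One bitvector per distinct term of an article: bit i is set when
    // version i of the article contains the term.
    static Map<String, BitSet> buildBitvectors(List<String> versions) {
      Map<String, BitSet> vectors = new HashMap<String, BitSet>();
      for (int i = 0; i < versions.size(); i++) {
        for (String term : tokenize(versions.get(i))) {
          BitSet bits = vectors.get(term);
          if (bits == null) {
            bits = new BitSet(versions.size());
            vectors.put(term, bits);
          }
          bits.set(i);
        }
      }
      return vectors;
    }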
Detailed file handling on hard disk
Hello everybody, I read the paper "Performance of Compressed Inverted List Caching in Search Engines" (http://www2008.org/papers/pdf/p387-zhangA.pdf) and now I am unsure how Lucene implements its structures on the hard disk. I am using Windows as the OS, where FSDirectory is based on java.io.RandomAccessFile. How is the skipping in the .tis file realized? Is metadata used at the beginning of each block, as in the paper mentioned above on page 388 (there the metadata stores how many inverted lists are in a block and where they start)?

http://lucene.472066.n3.nabble.com/file/n1413062/Block_assignment.jpg

I read in another article that I can seek to the correct position on the hard drive with a byte address using java.io.RandomAccessFile (which I can read from the .tii file, in "IndexDelta"?). How do I find the correct position/location for my posting list/document? Do I need information/metadata about the blocks from the underlying file system? Or where can I find further information about this? :)

Best regards Alex
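To make the seek step concrete, here is what I picture (the file name and offset are invented; in reality the offset would be derived from the .tii/.tis term dictionary):

    import java.io.RandomAccessFile;

    // A posting-list lookup ultimately reduces to one seek plus sequential reads.
    RandomAccessFile frq = new RandomAccessFile("_1.frq", "r"); // hypothetical segment file
    long postingListOffset = 12345L; // would come from the term dictionary
    frq.seek(postingListOffset);     // jump directly to the list's first byte
    int firstByte = frq.read();      // VInt decoding of the postings starts here
    frq.close();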