[jira] Resolved: (LUCENE-1411) Enable IndexWriter to open an arbitrary commit point
[ https://issues.apache.org/jira/browse/LUCENE-1411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-1411.
----------------------------------------

    Resolution: Fixed

Enable IndexWriter to open an arbitrary commit point
----------------------------------------------------

                Key: LUCENE-1411
                URL: https://issues.apache.org/jira/browse/LUCENE-1411
            Project: Lucene - Java
         Issue Type: Improvement
         Components: Index
   Affects Versions: 2.4
           Reporter: Michael McCandless
           Assignee: Michael McCandless
           Priority: Minor
            Fix For: 2.9
        Attachments: LUCENE-1411.patch

With a 2-phase commit involving multiple resources, each resource first does its prepareCommit and then, if all are successful, they each commit. If an exception or timeout/power loss is hit in any of the resources during prepareCommit or commit, all of the resources must then rollback. But because IndexWriter always opens the most recent commit, getting Lucene to rollback after commit() has been called is not easy, unless you make Lucene the last resource to commit. A simple workaround is to remove the segments_N files of the newer commits, but that's sort of a hassle.

To fix this, we just need to add a ctor to IndexWriter that takes an IndexCommit. We recently added this for IndexReader (LUCENE-1311) as well. This ctor is definitely an expert method, and only makes sense if you have a custom DeletionPolicy that preserves more than just the most recent commit.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
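The 2-phase commit flow described above can be sketched as follows. The `TwoPhaseResource` interface and coordinator are hypothetical illustrations of the protocol, not Lucene APIs; the point is that a failure anywhere forces every resource (including the index) to roll back, which is why IndexWriter needs to be able to open an older commit.

```java
import java.util.List;

// Hypothetical sketch of the 2-phase commit protocol described above;
// none of these types are Lucene APIs.
interface TwoPhaseResource {
    void prepareCommit() throws Exception; // phase 1
    void commit() throws Exception;        // phase 2
    void rollback();                       // undo on any failure
}

class TwoPhaseCoordinator {
    // Returns true if all resources committed, false if all rolled back.
    static boolean commitAll(List<TwoPhaseResource> resources) {
        try {
            for (TwoPhaseResource r : resources) r.prepareCommit();
            for (TwoPhaseResource r : resources) r.commit();
            return true;
        } catch (Exception e) {
            // On any failure, every resource must roll back -- including
            // an index whose commit() may have already succeeded, hence
            // the need to reopen an older commit point.
            for (TwoPhaseResource r : resources) r.rollback();
            return false;
        }
    }
}
```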
[jira] Resolved: (LUCENE-1382) Allow storing user data when IndexWriter.commit() is called
[ https://issues.apache.org/jira/browse/LUCENE-1382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-1382.
----------------------------------------

       Resolution: Fixed
    Fix Version/s: 2.9

Allow storing user data when IndexWriter.commit() is called
-----------------------------------------------------------

                Key: LUCENE-1382
                URL: https://issues.apache.org/jira/browse/LUCENE-1382
            Project: Lucene - Java
         Issue Type: Improvement
         Components: Index
           Reporter: Michael McCandless
           Assignee: Michael McCandless
           Priority: Minor
            Fix For: 2.9
        Attachments: LUCENE-1382.patch

Spinoff from here: http://www.mail-archive.com/[EMAIL PROTECTED]/msg22303.html

The idea is to allow optionally passing an opaque String commitUserData to the IndexWriter.commit method. This String would be stored in the segments_N file, and would be retrievable by an IndexReader. Applications could then use this to assign meaning to each commit. It would be nice to get this done for 2.4, but I don't think we should hold the release for it.
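The proposal above can be modeled in miniature: each commit (each segments_N) carries an opaque, application-defined string that a reader can later retrieve. The `CommitLog`/`CommitPoint` classes below are a toy illustration of that contract, not Lucene's actual classes.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the proposal: every commit (segments_N) carries an opaque
// user-supplied string that readers can later retrieve.  Hypothetical
// classes for illustration, not Lucene's API.
class CommitPoint {
    final long generation;   // which segments_N this commit wrote
    final String userData;   // opaque, application-defined
    CommitPoint(long generation, String userData) {
        this.generation = generation;
        this.userData = userData;
    }
}

class CommitLog {
    private final List<CommitPoint> commits = new ArrayList<>();
    private long nextGen = 1;

    // Analogous to IndexWriter.commit(commitUserData).
    long commit(String userData) {
        commits.add(new CommitPoint(nextGen, userData));
        return nextGen++;
    }

    // Analogous to a reader asking the current commit for its user data,
    // e.g. to find which source-DB transaction the index reflects.
    String latestUserData() {
        return commits.get(commits.size() - 1).userData;
    }
}
```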
[jira] Created: (LUCENE-1426) Next steps towards flexible indexing
Next steps towards flexible indexing
------------------------------------

                Key: LUCENE-1426
                URL: https://issues.apache.org/jira/browse/LUCENE-1426
            Project: Lucene - Java
         Issue Type: Improvement
         Components: Index
           Reporter: Michael McCandless
           Assignee: Michael McCandless
           Priority: Minor
            Fix For: 2.9

In working on LUCENE-1410 (PFOR compression) I tried to prototype switching the postings files to use PFOR instead of vInts for encoding. But it quickly became difficult. EG we currently mux the skip data into the .frq file, which messes up the int blocks. We inline payloads with positions, which would also mess up the int blocks. Skipping offsets and TermInfo offsets hardwire the file pointers of the frq and prox files, yet I need to change these to block + offset, etc.

Separately, this thread also started up, on how to customize how Lucene stores positional information in the index: http://www.gossamer-threads.com/lists/lucene/java-user/66264

So I decided to make a bit more progress towards flexible indexing by first modularizing/isolating the classes that actually write the index format. The idea is to capture the logic of each (terms, freq, positions/payloads) into separate interfaces, and switch both the flushing of a new segment and the writing of the segment during merging to use the same APIs.
[jira] Updated: (LUCENE-1426) Next steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1426:
---------------------------------------

    Attachment: LUCENE-1426.patch

Attached patch. I think it's ready to commit... I'll wait a few days.

This factors the writing of postings into separate Format* classes. The approach I took is similar to what I did for DocumentsWriter, where there is a hierarchical consumer interface (abstract class) for each of fields, terms, docs, and positions writing. Then there's a corresponding set of concrete classes (the codec chain) that write today's index format. There is no change to the index format.

Here are the details:

* This only applies to postings (not stored fields, term vectors, norms, field infos).

* Both SegmentMerger and FreqProxTermsWriter now use the same codec API to write postings. I think this is a big step forward: we now have a single set of classes that ever write the postings.

* You can't yet customize this codec chain; we can add that at some point. It's all package private.

* I don't yet allow the codec to override SegmentInfo.files(); at some point (when I first try to make a codec that uses different files) I will add this.

I ran a quick performance test, indexing Wikipedia, and found negligible performance cost of this.

The next step, which is trickier, is to modularize/genericize the classes that read from the index, and then refactor SegmentTerm{Enum,Docs,Positions} to use that codec API. Then, finally, I want to make a codec that uses PFOR to encode postings.
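The hierarchical consumer chain described in the update above (fields -> terms -> docs -> positions) might look roughly like the following sketch. The class names and the toy logging implementation are illustrative only; the real package-private classes in the patch differ.

```java
// Illustrative shape of a hierarchical codec chain for writing postings:
// one consumer level each for fields, terms, docs/freqs, and positions.
// Names are hypothetical, not the package-private classes in the patch.
abstract class FieldsConsumer {
    abstract TermsConsumer addField(String field);
    abstract void close();
}

abstract class TermsConsumer {
    abstract DocsConsumer startTerm(String text);
    abstract void finishTerm();
}

abstract class DocsConsumer {
    abstract PositionsConsumer addDoc(int docID, int freq);
}

abstract class PositionsConsumer {
    abstract void addPosition(int position, byte[] payload);
}

// A toy concrete chain that "writes" to a log, standing in for the
// classes that write today's on-disk format.  Flushing a new segment
// and merging would both drive this same API.
class LoggingFieldsConsumer extends FieldsConsumer {
    final StringBuilder log = new StringBuilder();

    TermsConsumer addField(final String field) {
        log.append("field:").append(field).append('\n');
        return new TermsConsumer() {
            DocsConsumer startTerm(String text) {
                log.append(" term:").append(text).append('\n');
                return new DocsConsumer() {
                    PositionsConsumer addDoc(int docID, int freq) {
                        log.append("  doc:").append(docID)
                           .append(" freq:").append(freq).append('\n');
                        return new PositionsConsumer() {
                            void addPosition(int position, byte[] payload) {
                                log.append("   pos:").append(position).append('\n');
                            }
                        };
                    }
                };
            }
            void finishTerm() {}
        };
    }

    void close() {}
}
```

Because both the segment flush and the merge drive one API, swapping in a PFOR-based chain later means replacing only the concrete classes, not the callers.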
Re: TokenStream and Token APIs
On Oct 19, 2008, at 7:08 PM, Michael Busch wrote:

Grant Ingersoll wrote:

On Oct 19, 2008, at 12:56 AM, Mark Miller wrote:

Grant Ingersoll wrote:

Bear with me, b/c I'm not sure I'm following, but looking at https://issues.apache.org/jira/browse/LUCENE-1422 , I see at least 5 different implemented Attributes. So, let's say I add 5 more attributes and now have a total of 10 attributes. Are you saying that I then would have, potentially, 10 different variables that all point to the token, as in the code snippet above where the casting takes place? Or would I just create a single Super attribute that folds in all of my new attributes, plus any other existing ones? Or, maybe, what I would do is create the 5 new attributes and then 1 new attribute that extends all 10, thus allowing me to use them individually, but saving me from having to do a whole ton of casting in my Consumer.

Potentially one consumer doing 10 things, but not likely, right? I mean, things will stay logical as they are now, and rather than a super consumer doing everything, we will still have a chain of consumers, each doing its own piece. So more likely, maybe something comes along every so often (another 5, over *much* time, say) and each time we add a Consumer that uses one or two TokenStream types. And then it's just an implementation detail whether you make a composite TokenStream - if you have added 10 new attributes and see it fit to make one consumer use them all, sure, make a composite, super type, but in my mind, the way it's done in the example code is clearer/cleaner for a handful of TokenStream types. And even if you do make the composite, super type, it's likely to just be a sugar wrapper anyway - the implementation for, say, payload and positions should probably be maintained in their own classes anyway.

Well, there are 5 different attributes already, all of which are commonly used. Seems weird to have to cast the same var 5 different ways.
Definitely agree that one would likely deal with this by wrapping, but then you end up either needing to extend your wrapper or add new wrappers...

Well yes, there are 5 attributes, but in neither of the core tokenstreams and -filters that I changed in my patch did I have to use more than two or three of those. Currently the only attributes that are really used are PositionIncrementAttribute and PayloadAttribute. And the OffsetAttribute when TermVectors are turned on. Even in the indexing chain we currently don't have a single consumer that needs all attributes. The FreqProxWriter needs positions and payloads; the TermVectorsWriter needs positions and offsets.

I have an application that uses all the attributes of a Token, or at least almost all of them. There are many uses for Lucene's analysis code that have nothing to do with indexing, Consumers, or even Lucene.

Also, you don't have to cast the same variable multiple times. In the current patch you would call e.g. token.getAttribute(PayloadAttribute.class) and keep a reference to it in the consumer or filter. IMO even calling getAttribute() 5 times or so and storing the references wouldn't be so bad. And if you really don't like it, you could make a wrapper as you said. You also mentioned the disadvantages of the wrapper, e.g. that you would have to extend it to add new attributes. But then, isn't that the same disadvantage the current Token API has?

True. I didn't say the idea was bad - in fact I mostly like it - I was just saying I'd like to explore how it would work in practice, and the main thing that struck me was all the casting or all the references. Since it's likely that you only deal with a Token one at a time, you're right, it's probably not a big deal other than the code looks funny, IMO.

You could even use the new API in exactly the same way as the old one. Just create a subclass of Token that has all the members you need and don't add any attributes.
So I think the new API adds more flexibility, and still offers to be used in the same way as the old one. I however think the recommended best practice should be to use the new attributes, for reusability of consumers that only need certain attributes.

Perhaps it would be useful for Lucene to offer exactly one subclass of Token that we guarantee will always have all known Attributes (i.e. the ones Lucene provides) available to it for casting purposes.

However, please let me know if you have any concrete recommendations about changing the API in LUCENE-1422.

I thought those concerns were pretty concrete... :-) There might be better ones than the APIs I came up with. I think the APIs in the 2nd patch look pretty reasonable.

-Grant
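The "call getAttribute once and keep the reference" pattern discussed in this thread can be sketched with a toy attribute source. These are hypothetical stand-in classes, not the API in the LUCENE-1422 patch; the point is that a class-keyed lookup is done once, after which the consumer or filter holds a typed reference and never casts again.

```java
import java.util.HashMap;
import java.util.Map;

// Toy version of the attribute-lookup pattern discussed above: a consumer
// fetches each attribute once by class and keeps the typed reference,
// instead of casting a Token repeatedly.  Hypothetical classes, not the
// actual patch's API.
interface Attribute {}

class PositionIncrementAttr implements Attribute {
    int positionIncrement = 1;
}

class AttributeSource {
    private final Map<Class<? extends Attribute>, Attribute> attributes =
        new HashMap<>();

    // Lazily create and cache exactly one instance per attribute class,
    // so every caller sees the same state.
    <T extends Attribute> T addAttribute(Class<T> clazz) {
        Attribute att = attributes.get(clazz);
        if (att == null) {
            try {
                att = clazz.getDeclaredConstructor().newInstance();
            } catch (Exception e) {
                throw new IllegalArgumentException(clazz.getName(), e);
            }
            attributes.put(clazz, att);
        }
        return clazz.cast(att);
    }
}
```

A filter would typically do the lookup once in its constructor, e.g. `this.posIncr = source.addAttribute(PositionIncrementAttr.class);`, and then read or write `posIncr` per token with no further casting.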
[jira] Commented: (LUCENE-1406) new Arabic Analyzer (Apache license)
[ https://issues.apache.org/jira/browse/LUCENE-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12641075#action_12641075 ]

Grant Ingersoll commented on LUCENE-1406:
-----------------------------------------

Committed revision 706342. I made some small changes to reuse Tokens, added some comments into the stopwords list, and added to WordListLoader to accommodate this.

Thanks Robert!

new Arabic Analyzer (Apache license)
------------------------------------

                Key: LUCENE-1406
                URL: https://issues.apache.org/jira/browse/LUCENE-1406
            Project: Lucene - Java
         Issue Type: New Feature
         Components: Analysis
           Reporter: Robert Muir
           Assignee: Grant Ingersoll
           Priority: Minor
        Attachments: LUCENE-1406.patch

I've noticed there is no Arabic analyzer for Lucene, most likely because Tim Buckwalter's morphological dictionary is GPL. However, it is not necessary to have a full morphological analysis engine for quality Arabic search. This implementation implements the light-8s algorithm present in the following paper: http://ciir.cs.umass.edu/pubfiles/ir-249.pdf

As you can see from the paper, improvement via this method over searching surface forms (as Lucene currently does) is significant, with almost 100% improvement in average precision. While I personally don't think all the choices were the best, and some easy improvements are still possible, the major motivation for implementing it exactly the way it is presented in the paper is that the algorithm is TREC-tested, so the precision/recall improvements to Lucene are already documented.

For a stopword list, I used a list present at http://members.unine.ch/jacques.savoy/clef/index.html simply because the creator of this list documents the data as BSD-licensed.
This implementation (Analyzer) consists of the above-mentioned stopword list plus two filters:

ArabicNormalizationFilter: performs orthographic normalization (such as hamza seated on alif, alif maksura, teh marbuta, removal of harakat, tatweel, etc.)

ArabicStemFilter: performs Arabic light stemming

Both filters operate directly on the termbuffer for maximum performance. There is no object creation in this Analyzer. There are no external dependencies. I've indexed about half a billion words of Arabic text and tested against that.

If there are any issues with this implementation I am willing to fix them. I use Lucene on a daily basis and would like to give something back. Thanks.
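The orthographic normalization steps named above correspond, roughly, to character mappings like the following. This is a string-level sketch based on my reading of the light stemming literature, not the committed filter (which operates in place on the term buffer): hamza-seated alif variants fold to bare alif, alif maksura to yeh, teh marbuta to heh, and tatweel and harakat (diacritics) are dropped.

```java
// Character-level sketch of the orthographic normalization steps named
// above (my reading of the light stemming approach, not the committed
// ArabicNormalizationFilter, which works in place on the term buffer).
class ArabicNormalizerSketch {
    static String normalize(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '\u0622':            // alif madda
                case '\u0623':            // alif + hamza above
                case '\u0625':            // alif + hamza below
                    out.append('\u0627'); // -> bare alif
                    break;
                case '\u0649':            // alif maksura
                    out.append('\u064A'); // -> yeh
                    break;
                case '\u0629':            // teh marbuta
                    out.append('\u0647'); // -> heh
                    break;
                case '\u0640':            // tatweel: drop entirely
                    break;
                default:
                    if (c >= '\u064B' && c <= '\u0652') break; // harakat: drop
                    out.append(c);
            }
        }
        return out.toString();
    }
}
```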
[jira] Commented: (LUCENE-1426) Next steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12641121#action_12641121 ]

Paul Elschot commented on LUCENE-1426:
--------------------------------------

bq. We inline payloads with positions which would also mess up the int blocks.

Which begs the question whether we should also allow compression of these payloads. I think we should do that, because normally only one or two bytes will be used as payload per position. Thinking about this: position+payload actually looks a lot like docId+freq; could that be used to simplify future index formats for inverted terms?

Btw. allowing a payload to accompany the field norms would allow storing a kind of dictionary for the position payloads. This could help to keep the position payloads small so they would compress nicely.

bq. Both SegmentMerger FreqProxTermsWriter now use the same codec API to write postings.

That is indeed a big step.

bq. It's all package private.

Good for now; making it public might actually reduce flexibility for new index formats.
[jira] Commented: (LUCENE-1426) Next steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12641125#action_12641125 ]

Paul Elschot commented on LUCENE-1426:
--------------------------------------

bq. Skipping offsets and TermInfo offsets hardwire the file pointers of frq prox files yet I need to change these to block + offset, etc.

Does the offset imply that there is also a need for random access into each block? For such blocks, PFOR patching might better be avoided. Even with patching, random access is possible, but it is not available yet at LUCENE-1410.
[jira] Commented: (LUCENE-1426) Next steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12641128#action_12641128 ]

Eks Dev commented on LUCENE-1426:
---------------------------------

Just a few random thoughts on this topic:

- I am sure I read somewhere in these pdfs that were floating around that it would make sense to use VInts for very short postings and PFOR for the rest. I just do not remember the rationale behind it.

- During the omitTf() discussion, we came up with the cool idea to actually inline very short postings into the term dict instead of storing an offset. This way we spare one seek per term in many cases, as well as some space for storing the offset. I do not know if this is a problem, but it sounds reasonable. With a standard Zipfian distribution, a lot of postings should get inlined. Use cases where we have query expansion on many terms (think spell checker, synonyms...) should benefit from that heavily. These postings are small, but there are a lot of them, so it adds up... seek is deadly :)

I am sorry to miss the party here with PFOR, but let us hope this credit crunch gets over soon so that I could dedicate some time to fun things like this :)

cheers, eks
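The "inline very short postings into the term dict" idea above (later discussed as "pulsing") can be sketched as a term dictionary whose entries either carry the doc IDs directly, avoiding a seek, or hold a file pointer when the postings spill to disk. The structure and cutoff below are hypothetical, purely to show the trade-off.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of inlining short postings in the term dictionary: terms whose
// postings fit under a cutoff carry the docIDs in the dictionary entry
// itself (no seek at query time); longer postings keep a file offset.
// Hypothetical structure, not Lucene's on-disk format.
class TermDictEntry {
    int[] inlinedDocs;   // non-null when postings are inlined
    long filePointer;    // used only when inlinedDocs == null
}

class PulsingTermDict {
    static final int CUTOFF = 4;  // max postings to inline (illustrative)
    final Map<String, TermDictEntry> dict = new HashMap<>();

    void addTerm(String term, int[] docs, long filePointerIfSpilled) {
        TermDictEntry e = new TermDictEntry();
        if (docs.length <= CUTOFF) {
            e.inlinedDocs = docs;                 // saves one seek per lookup
        } else {
            e.filePointer = filePointerIfSpilled; // postings live in the .frq file
        }
        dict.put(term, e);
    }
}
```

Under a Zipfian term distribution most distinct terms are rare, so most dictionary entries get inlined, which is exactly why multi-term expansions (spell check, synonyms) benefit so much.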
[jira] Commented: (LUCENE-1426) Next steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12641132#action_12641132 ]

Doug Cutting commented on LUCENE-1426:
--------------------------------------

+1 This sounds like a great way to approach flexible indexing: incrementally.
[jira] Commented: (LUCENE-1426) Next steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12641137#action_12641137 ]

Michael McCandless commented on LUCENE-1426:
--------------------------------------------

bq. During omitTf() discussion, we came up with cool idea to actually inline very short postings into term dict instead of storing offset.

Yes, there's this issue: https://issues.apache.org/jira/browse/LUCENE-1278

And you had found this one: http://www.siam.org/proceedings/alenex/2008/alx08_01transierf.pdf

And then Doug referenced this: http://citeseer.ist.psu.edu/cutting90optimizations.html

I think the idea makes tons of sense (saving a seek), and one of my goals in phase 2 (genericizing the reading of an index) is to make pulsing a drop-in codec as an example litmus test. Terms iteration may suffer, though, unless we put this in a separate file.

I also think, at the opposite end of the spectrum, it would make sense for very common terms to use simple n-bit packing (PFOR minus the exceptions). For massive terms we need the fastest search we can get, since that gates when you have to start sharding.

bq. I am sorry to miss the party here with PFOR, but let us hope this credit crunch gets over soon so I that I could dedicate some time to fun things like this

Well, the stock market seems to think the credit crunch is improving, today... of course who knows what'll happen tomorrow! Good luck :)

Also, I'd like to explore improving the terms dict indexing -- I don't think we need to load a TermInfo instance for every indexed term into RAM. I think we just need the term seek data (into the tis file); then you seek there and skip to the TermInfo you need. This should save a good amount of RAM for large indices with odd terms, since each TermInfo instance requires a pointer to it (4 or 8 bytes), an object header (8 bytes at least), and then 20 bytes for the members.

All these explorations should become simple drop-in codecs, once I can finish phase 2.
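The per-entry RAM figures quoted in the comment above work out to roughly 32 to 36 bytes per in-memory TermInfo. A back-of-envelope helper, using exactly the constants quoted (reference of 4 or 8 bytes, ~8-byte object header, ~20 bytes of members; the totals are estimates, and actual JVM overhead varies):

```java
// Back-of-envelope RAM cost of holding one TermInfo per indexed term,
// using the figures quoted above: a reference (4 or 8 bytes), an object
// header (~8 bytes), and ~20 bytes of members.  Rough estimate only;
// real JVM object overhead varies.
class TermInfoRam {
    static long bytes(long indexedTerms, int pointerSize) {
        int perTerm = pointerSize + 8 + 20;  // 32 or 36 bytes per entry
        return indexedTerms * perTerm;
    }
}
```

So an index whose term dictionary keeps one million TermInfo entries in RAM holds on the order of 32 to 36 MB just for those objects, which is the saving the comment is after.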
[jira] Commented: (LUCENE-1426) Next steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12641139#action_12641139 ]

Michael McCandless commented on LUCENE-1426:
--------------------------------------------

{quote}
Does the offset imply that there is also a need for random access into each block? For such blocks PFOR patching might better be avoided. Even with patching random access is possible, but it is not available yet at LUCENE-1410.
{quote}

Yeah, this is one of the reasons why I'm thinking for frequent terms we may want to fall back to pure n-bit packing (which would make random access simple).

But for starters we could simply implement random access as "load and decode the entire block, then look at the part you want", and then assess the cost. While it will clearly increase the cost of queries that do a lot of skipping (e.g. an AND query of N terms), it may not matter so much since these queries should be fairly fast now. It's the OR of frequent-term queries that we need to improve, since that limits how big an index you can put on one box.
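The "pure n-bit packing" fallback mentioned above makes random access trivial because every value in a block uses exactly b bits, so value i always starts at bit i*b; no sequential decode and no exception patching are needed. A sketch of the idea (my own illustration, not code from LUCENE-1410; values are assumed to fit in b bits):

```java
// Sketch of fixed-width n-bit packing ("PFOR minus the exceptions"):
// every value in a block takes exactly b bits, so value i starts at bit
// i*b and can be read directly without decoding the whole block.
// Assumes each value fits in b bits (b <= 32).
class NBitPacker {
    static long[] pack(int[] values, int b) {
        long[] packed = new long[(values.length * b + 63) / 64];
        for (int i = 0; i < values.length; i++) {
            int bit = i * b;
            int word = bit >>> 6, shift = bit & 63;
            packed[word] |= (long) values[i] << shift;
            if (shift + b > 64) {  // value straddles a 64-bit word boundary
                packed[word + 1] |= (long) values[i] >>> (64 - shift);
            }
        }
        return packed;
    }

    // Random access: read value i directly, no sequential decode.
    static int get(long[] packed, int b, int i) {
        int bit = i * b;
        int word = bit >>> 6, shift = bit & 63;
        long v = packed[word] >>> shift;
        if (shift + b > 64) {
            v |= packed[word + 1] << (64 - shift);
        }
        return (int) (v & ((1L << b) - 1));
    }
}
```

This is the trade-off against patched PFOR: patching compresses outliers better, but any exception can force extra work to reconstruct one value, which is why patching "might better be avoided" for blocks needing random access.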
[jira] Commented: (LUCENE-1426) Next steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12641140#action_12641140 ] Michael McCandless commented on LUCENE-1426: bq. Which begs the question whether we should also allow compression of these payloads. I think that's interesting, but would probably be rather application dependent. {quote} Btw. allowing a payload to accompany the field norms would allow to store a kind of dictionary for the position payloads. This could help to keep the position payloads small so they would compress nicely. {quote} Couldn't stored fields, once they are faster (with column-stride fields, LUCENE-1231) solve this? Next steps towards flexible indexing Key: LUCENE-1426 URL: https://issues.apache.org/jira/browse/LUCENE-1426 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 2.9 Attachments: LUCENE-1426.patch In working on LUCENE-1410 (PFOR compression) I tried to prototype switching the postings files to use PFOR instead of vInts for encoding. But it quickly became difficult. EG we currently mux the skip data into the .frq file, which messes up the int blocks. We inline payloads with positions which would also mess up the int blocks. Skipping offsets and TermInfo offsets hardwire the file pointers of frq prox files yet I need to change these to block + offset, etc. Separately this thread also started up, on how to customize how Lucene stores positional information in the index: http://www.gossamer-threads.com/lists/lucene/java-user/66264 So I decided to make a bit more progress towards flexible indexing by first modularizing/isolating the classes that actually write the index format. The idea is to capture the logic of each (terms, freq, positions/payloads) into separate interfaces and switch the flushing of a new segment as well as writing the segment during merging to use the same APIs. 
--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
- To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
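The modularization LUCENE-1426 describes — capturing the writing of terms, freqs, and positions/payloads behind separate interfaces so segment flushing and merging share one API — could be sketched roughly as below. All names and signatures here are hypothetical illustrations, not the actual patch:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: per-part postings-writing interfaces, so that a codec
// (vInt today, PFOR blocks tomorrow) can be swapped without touching callers.
interface TermsConsumer {
  PostingsConsumer startTerm(String term);
  void finishTerm(String term, int numDocs);
}

interface PostingsConsumer {
  void startDoc(int docID, int termFreq);         // freq (.frq) data
  void addPosition(int position, byte[] payload); // prox/payload (.prx) data
}

// Toy in-memory implementation, standing in for a real on-disk encoder.
class InMemoryTermsConsumer implements TermsConsumer {
  final List<String> log = new ArrayList<String>();

  public PostingsConsumer startTerm(final String term) {
    log.add("term=" + term);
    return new PostingsConsumer() {
      public void startDoc(int docID, int termFreq) {
        log.add(term + " doc=" + docID + " freq=" + termFreq);
      }
      public void addPosition(int position, byte[] payload) {
        log.add(term + " pos=" + position);
      }
    };
  }

  public void finishTerm(String term, int numDocs) {
    log.add("end " + term + " docs=" + numDocs);
  }
}

public class PostingsSketch {
  static List<String> demo() {
    InMemoryTermsConsumer terms = new InMemoryTermsConsumer();
    PostingsConsumer postings = terms.startTerm("lucene");
    postings.startDoc(0, 2);
    postings.addPosition(1, null);
    postings.addPosition(7, null);
    terms.finishTerm("lucene", 1);
    return terms.log;
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

The point of the split is that the same `TermsConsumer` call sequence can drive either the flush of a newly-built segment or a merge of existing segments, which is exactly the duplication the issue says it wants to eliminate.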
[jira] Commented: (LUCENE-1387) Add LocalLucene
[ https://issues.apache.org/jira/browse/LUCENE-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641264#action_12641264 ]

Xibin Zeng commented on LUCENE-1387:

Hey guys! Where is this now? Has it been checked in yet? I am asking because I am currently planning a feature and wanted to know if it is realistic to take advantage of it now. Any update is appreciated!

Add LocalLucene
---------------

Key: LUCENE-1387
URL: https://issues.apache.org/jira/browse/LUCENE-1387
Project: Lucene - Java
Issue Type: New Feature
Components: contrib/*
Reporter: Grant Ingersoll
Priority: Minor
Attachments: spatial.zip

Local Lucene (geo-search) has been donated to the Lucene project, per https://issues.apache.org/jira/browse/INCUBATOR-77. This issue is to handle the Lucene portion of the integration. See http://lucene.markmail.org/message/orzro22sqdj3wows?q=LocalLucene
[jira] Updated: (LUCENE-1422) New TokenStream API
[ https://issues.apache.org/jira/browse/LUCENE-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Busch updated LUCENE-1422:
----------------------------------
Attachment: lucene-1422.take3.patch

I added several things in this new patch:
* hashCode() and equals() now incorporate the attributes
* patch compiles against Java 1.4
* all core tests pass with and without the new API turned on (via TokenStream.setUseNewAPI(true))
* Added a setToken() method to InvertedDocConsumerPerField and TermsHashConsumerPerField and updated the implementing classes. I actually have a question here, because I don't know these classes very well yet: would it be better to add the Token to DocInverter.FieldInvertState?

I think I also have to review LUCENE-1426 to make sure these changes are not in conflict (I think 1426 should be committed first?).

Outstanding:
* dedicated JUnit tests for the new APIs, even though the existing tests already cover a lot when setUseNewAPI(true)
* javadocs
* contrib streams and filters

New TokenStream API
-------------------

Key: LUCENE-1422
URL: https://issues.apache.org/jira/browse/LUCENE-1422
Project: Lucene - Java
Issue Type: New Feature
Components: Analysis
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
Fix For: 2.9
Attachments: lucene-1422.patch, lucene-1422.take2.patch, lucene-1422.take3.patch

This is a very early version of the new TokenStream API that we started to discuss here: http://www.gossamer-threads.com/lists/lucene/java-dev/66227

This implementation is a bit different from what I initially proposed in the thread above. I introduced a new class called AttributedToken, which contains the same termBuffer logic from Token. In addition it has a lazily-initialized map of Class<? extends Attribute> -> Attribute. Attribute is also a new class in a new package, plus several implementations like PositionIncrementAttribute, PayloadAttribute, etc. Similar to my initial proposal is the prototypeToken() method, which the consumer (e.g. DocumentsWriter) needs to call.
The token is created by the tokenizer at the end of the chain and pushed through all filters to the end consumer. The tokenizer and also all filters can add Attributes to the token and can keep references to the actual types of the attributes that they need to read or modify. This way, when boolean nextToken() is called, no casting is necessary.

I added a class called TestNewTokenStreamAPI which is not really a test case yet, but has a static demo() method which demonstrates how to use the new API.

The reason not to merge Token and TokenStream into one class is that we might have caching (or tee/sink) filters in the chain that want to store cloned copies of the tokens in a cache. I added a new class NewCachingTokenStream that shows how such a class could work. I also implemented a deep clone() method in AttributedToken and a copyFrom(AttributedToken) method, which is needed for the caching. Both methods have to iterate over the list of attributes. The Attribute subclasses themselves also have a copyFrom(Attribute) method, which unfortunately has to down-cast to the actual type. I first thought that might be very inefficient, but it's not so bad. Well, if you add all the Attributes to the AttributedToken that our old Token class had (like offsets, payload, posIncr), then the performance of the caching is somewhat slower (~40%). However, if you add fewer attributes, because not all might be needed, then the performance is even slightly faster than with the old API. Also, the new API is flexible enough that someone could implement a custom caching filter that knows all the attributes the token can have; then the caching should be just as fast as with the old API.
This patch is not nearly ready; there are lots of things missing:
- unit tests
- change DocumentsWriter to use the new API (in a backwards-compatible fashion)
- patch is currently Java 1.5; need to change before committing to 2.9
- all TokenStreams and -Filters should be changed to use the new API
- javadocs incorrect or missing
- hashCode() and equals() methods missing in Attributes and AttributedToken

I wanted to submit it already for brave people to give me early feedback before I spend more time working on this.
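The AttributedToken idea described above — a token with a lazily-initialized map from attribute class to attribute instance — could be sketched as follows. This is an illustrative approximation, not the actual patch; note the real patch targets Java 1.4 (no generics), whereas generics are used here for brevity:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of LUCENE-1422's AttributedToken: attributes live in a
// lazily-initialized Class -> Attribute map; filters and consumers fetch a
// typed reference once, so the per-token hot path needs no casting.
abstract class Attribute {
}

class PositionIncrementAttribute extends Attribute {
  private int positionIncrement = 1;
  public int getPositionIncrement() { return positionIncrement; }
  public void setPositionIncrement(int inc) { positionIncrement = inc; }
}

class AttributedToken {
  // Lazily initialized: a token that adds no attributes pays no map cost.
  private Map<Class<? extends Attribute>, Attribute> attributes;

  public <A extends Attribute> A addAttribute(A att) {
    if (attributes == null) {
      attributes = new HashMap<Class<? extends Attribute>, Attribute>();
    }
    attributes.put(att.getClass(), att);
    return att;
  }

  @SuppressWarnings("unchecked")
  public <A extends Attribute> A getAttribute(Class<A> attClass) {
    return attributes == null ? null : (A) attributes.get(attClass);
  }
}

public class AttributedTokenSketch {
  static int demo() {
    AttributedToken token = new AttributedToken();
    // A tokenizer would add the attribute once and keep the typed reference;
    // subsequent per-token reads and writes go through that reference.
    PositionIncrementAttribute posIncr =
        token.addAttribute(new PositionIncrementAttribute());
    posIncr.setPositionIncrement(3);
    return token.getAttribute(PositionIncrementAttribute.class)
                .getPositionIncrement();
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```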
Re: TokenStream and Token APIs
Grant Ingersoll wrote:

On Oct 19, 2008, at 7:08 PM, Michael Busch wrote:

Grant Ingersoll wrote:

On Oct 19, 2008, at 12:56 AM, Mark Miller wrote:

Grant Ingersoll wrote:

Bear with me, b/c I'm not sure I'm following, but looking at https://issues.apache.org/jira/browse/LUCENE-1422, I see at least 5 different implemented Attributes. So, let's say I add 5 more attributes and now have a total of 10 attributes. Are you saying that I would then have, potentially, 10 different variables that all point to the token, as in the code snippet above where the casting takes place? Or would I just create a single super attribute that folds in all of my new attributes, plus any other existing ones? Or, maybe, what I would do is create the 5 new attributes and then 1 new attribute that extends all 10, thus allowing me to use them individually, but saving me from having to do a whole ton of casting in my consumer.

Potentially one consumer doing 10 things, but not likely, right? I mean, things will stay logical as they are now, and rather than a super consumer doing everything, we will still have a chain of consumers, each doing its own piece. So more likely, maybe something comes along every so often (another 5, over *much* time, say) and each time we add a consumer that uses one or two TokenStream types. And then it's just an implementation detail whether you make a composite TokenStream - if you have added 10 new attributes and see fit to make one consumer use them all, sure, make a composite super type, but in my mind, the way it's done in the example code is clearer/cleaner for a handful of TokenStream types. And even if you do make the composite super type, it's likely to just be a sugar wrapper anyway - the implementation for, say, payload and positions should probably be maintained in their own classes anyway.

Well, there are 5 different attributes already, all of which are commonly used. Seems weird to have to cast the same var 5 different ways.
Definitely agree that one would likely deal with this by wrapping, but then you end up either needing to extend your wrapper or add new wrappers...

Well yes, there are 5 attributes, but in neither of the core tokenstreams and -filters that I changed in my patch did I have to use more than two or three of them. Currently the only attributes that are really used are PositionIncrementAttribute and PayloadAttribute, and OffsetAttribute when TermVectors are turned on. Even in the indexing chain we currently don't have a single consumer that needs all attributes. The FreqProxWriter needs positions and payloads; the TermVectorsWriter needs positions and offsets.

I have an application that uses all the attributes of a Token, or at least almost all of them. There are many uses for Lucene's analysis code that have nothing to do with indexing, consumers, or even Lucene.

Also, you don't have to cast the same variable multiple times. In the current patch you would call e.g. token.getAttribute(PayloadAttribute.class) and keep a reference to it in the consumer or filter. IMO even calling getAttribute() 5 times or so and storing the references wouldn't be so bad. And if you really don't like it you could make a wrapper, as you said. You also mentioned the disadvantages of the wrapper, e.g. that you would have to extend it to add new attributes. But then, isn't that the same disadvantage the current Token API has?

True. I didn't say the idea was bad - in fact I mostly like it - I was just saying I'd like to explore how it would work in practice, and the main thing that struck me was all the casting or all the references. Since it's likely that you only deal with one Token at a time, you're right, it's probably not a big deal other than that the code looks funny, IMO.

You could even use the new API in exactly the same way as the old one. Just create a subclass of Token that has all the members you need and don't add any attributes.
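The "call getAttribute() once and store the reference" pattern described above can be sketched like this. All class and method names here are illustrative stand-ins, not the actual patch's API:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the pattern discussed above: a consumer looks up its
// typed attribute references a single time, then reads them on every token
// without any casting in the hot loop.
class PayloadAttribute {
  byte[] payload;
}

class OffsetAttribute {
  int startOffset, endOffset;
}

// Minimal stand-in for the attribute-carrying token.
class DemoToken {
  private final Map<Class<?>, Object> atts = new HashMap<Class<?>, Object>();

  <A> A addAttribute(Class<A> cls, A att) {
    atts.put(cls, att);
    return att;
  }

  @SuppressWarnings("unchecked")
  <A> A getAttribute(Class<A> cls) {
    return (A) atts.get(cls);
  }
}

// Mimics e.g. a TermVectorsWriter-style consumer that needs only offsets,
// and simply never asks for the payload attribute.
class OffsetConsumer {
  private final OffsetAttribute offsets;

  OffsetConsumer(DemoToken token) {
    // One lookup up front, instead of a cast per token.
    offsets = token.getAttribute(OffsetAttribute.class);
  }

  int width() {
    return offsets.endOffset - offsets.startOffset;
  }
}

public class ConsumerSketch {
  static int demo() {
    DemoToken token = new DemoToken();
    OffsetAttribute off =
        token.addAttribute(OffsetAttribute.class, new OffsetAttribute());
    token.addAttribute(PayloadAttribute.class, new PayloadAttribute());
    OffsetConsumer consumer = new OffsetConsumer(token);
    off.startOffset = 10; // the tokenizer updates the shared attribute...
    off.endOffset = 16;
    return consumer.width(); // ...and the consumer sees it, cast-free
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

This is also why a consumer that needs only two or three attributes stays cheap: it holds exactly the references it asked for and ignores the rest.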
So I think the new API adds more flexibility, and can still be used in the same way as the old one. I however think the recommended best practice should be to use the new attributes, for reusability of consumers that only need certain attributes.

Perhaps it would be useful for Lucene to offer exactly one subclass of Token that we guarantee will always have all known Attributes (i.e. the ones Lucene provides) available to it for casting purposes.

Yeah, we could do that. In fact, I did exactly this when I started working on this patch. I created a class called PlainToken, which had all the termBuffer and attributes logic, and changed Token to extend it. Then the new getToken() method would return an instance of PlainToken. My main concern with this approach is that it will make the code in the indexer more complicated, because it always has to check whether we have a Token or a PlainToken; if it's a Token then it has to use the get*() method directly, for a