Synchronizing Lucene indexes across 2 application servers
I have a web application which uses Lucene for search functionality. Lucene search requests are served by web services sitting on 2 application servers (IIS 7). The 2 application servers are load balanced using NetScaler. Both servers have a batch job which updates the search indexes on the respective server nightly. I need to synchronize the search indexes on these 2 servers so that at any point in time both servers have up-to-date indexes. I was wondering what the best architecture/design strategy would be to do so, given that either of the 2 application servers could be serving a search request depending upon its availability. Any inputs please? Thanks for reading!
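One common pattern (a sketch, not from this thread) is to build the index once and replicate the committed files to each searcher, instead of building independently on both servers. Below is a minimal Java sketch assuming a Lucene 2.x-era SnapshotDeletionPolicy; INDEX_DIR, REPLICA_DIR and copyFile are placeholders, and a real deployment would copy over a network share or rsync.

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Iterator;

import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.index.SnapshotDeletionPolicy;

public class IndexReplicator {
    // Placeholder paths; the replica dir would really be the other server.
    static final File INDEX_DIR = new File("/data/index");
    static final File REPLICA_DIR = new File("/mnt/server2/index");

    public static void replicate(SnapshotDeletionPolicy snapshotter) throws IOException {
        // Snapshot pins the current commit so its files are not deleted mid-copy.
        IndexCommit commit = snapshotter.snapshot();
        try {
            for (Iterator it = commit.getFileNames().iterator(); it.hasNext();) {
                String fileName = (String) it.next();
                copyFile(new File(INDEX_DIR, fileName), new File(REPLICA_DIR, fileName));
            }
        } finally {
            snapshotter.release(); // allow the snapshotted files to be deleted again
        }
        // The searcher on the other server can now reopen its IndexReader.
    }

    static void copyFile(File src, File dst) throws IOException {
        FileInputStream in = new FileInputStream(src);
        FileOutputStream out = new FileOutputStream(dst);
        try {
            byte[] buf = new byte[64 * 1024];
            int n;
            while ((n = in.read(buf)) > 0) out.write(buf, 0, n);
        } finally {
            in.close();
            out.close();
        }
    }
}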
[jira] Updated: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1693: Attachment: (was: LUCENE-1693.patch)

AttributeSource/TokenStream API improvements
Key: LUCENE-1693
URL: https://issues.apache.org/jira/browse/LUCENE-1693
Project: Lucene - Java
Issue Type: Improvement
Components: Analysis
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
Fix For: 2.9
Attachments: LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, lucene-1693.patch, TestCompatibility.java, TestCompatibility.java

This patch makes the following improvements to AttributeSource and TokenStream/Filter:
- Removes the set/getUseNewAPI() methods (including the standard ones). Instead, by default incrementToken() throws a subclass of UnsupportedOperationException. The indexer initially calls incrementToken() once to see if the exception is thrown; if so, it falls back to the old API.
- Introduces interfaces for all Attributes. The corresponding implementations have the postfix 'Impl', e.g. TermAttribute and TermAttributeImpl. AttributeSource now has a factory for creating the Attribute instances; the default implementation looks for implementing classes with the postfix 'Impl'. Token now implements all 6 TokenAttribute interfaces.
- Adds a new method to AttributeSource: addAttributeImpl(AttributeImpl). Using reflection, it walks up the class hierarchy of the passed-in object and finds all interfaces that the class or its superclasses implement and that extend the Attribute interface. It then adds the interface-to-instance mappings to the attribute map for each of the found interfaces.
- AttributeImpl now has a default implementation of toString() that uses reflection to print out the values of the attributes in a default format. This makes it a bit easier to implement AttributeImpl, because toString() was declared abstract before.
- Cloning is now done much more efficiently in captureState(). The method figures out which unique AttributeImpl instances are contained as values in the attributes map, because those are the ones that need to be cloned. It creates a singly linked list that supports deep cloning (in the inner class AttributeSource.State). AttributeSource keeps track of when this state changes, i.e. whenever new attributes are added to the AttributeSource. Only in that case will captureState() recompute the state; otherwise it will simply clone the precomputed state and return the clone. restoreState(AttributeSource.State) walks the linked list and uses the copyTo() method of AttributeImpl to copy all values over into the attributes that the target stream (e.g. SinkTokenizer) uses.

Cloning performance can be greatly improved if multiple AttributeImpl instances are not used in one TokenStream. A user can e.g. simply add a Token instance to the stream instead of the individual attributes, or implement a subclass of AttributeImpl that implements exactly the Attribute interfaces needed. I think addAttributeImpl should be considered an expert API, as this manual optimization is only needed if cloning performance is crucial. I ran some quick performance tests using Tee/Sink tokenizers (which do cloning) and the performance was roughly 20% faster with the new API. I'll run some more performance tests and post more numbers then. Note also that when we add serialization to the Attributes, e.g. for supporting storing serialized TokenStreams in the index, serialization should benefit even more from the new API than cloning does. Also, the TokenStream API does not change, except for the removal of the set/getUseNewAPI methods, so the patches in LUCENE-1460 should still work. All core tests pass; however, I need to update all the documentation and also add some unit tests for the new AttributeSource functionality. So this patch is not ready to commit yet, but I wanted to post it already for some feedback.
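For readers following along, here is a minimal sketch of what consuming a TokenStream looks like under the attribute-based API described above (assuming the TermAttribute/OffsetAttribute interfaces from the patch; the casts match the pre-generics API of that era):

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class NewApiDemo {
  public static void main(String[] args) throws IOException {
    TokenStream stream = new WhitespaceAnalyzer()
        .tokenStream("field", new StringReader("the quick brown fox"));
    // Attributes are acquired once, up front; incrementToken() updates them in place.
    TermAttribute term = (TermAttribute) stream.addAttribute(TermAttribute.class);
    OffsetAttribute offset = (OffsetAttribute) stream.addAttribute(OffsetAttribute.class);
    while (stream.incrementToken()) {
      System.out.println(term.term() + " [" + offset.startOffset()
          + "-" + offset.endOffset() + "]");
    }
    stream.close();
  }
}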
[jira] Updated: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1693: Attachment: LUCENE-1693.patch

Sorry, the last patch was invalid (did not compile); I forgot to revert some changes before posting. The attached patch still has problems in TeeTokenStream, SinkTokenizer and CachingTokenFilter (see before), but fixes:
- double cloning of payloads
- the first of your tests works correctly, even if I remove next() from StopFilter and/or LowercaseFilter
[jira] Updated: (LUCENE-1692) Contrib analyzers need tests
[ https://issues.apache.org/jira/browse/LUCENE-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-1692: Attachment: LUCENE-1692.txt

Adds tests for ThaiAnalyzer token offsets and types, both of which have bugs! Tests for the correct behavior are included but commented out.

Contrib analyzers need tests
Key: LUCENE-1692
URL: https://issues.apache.org/jira/browse/LUCENE-1692
Project: Lucene - Java
Issue Type: Test
Components: contrib/analyzers
Reporter: Robert Muir
Assignee: Michael McCandless
Fix For: 2.9
Attachments: LUCENE-1692.txt, LUCENE-1692.txt, LUCENE-1692.txt

The analyzers in contrib need tests, preferably ones that test the behavior of all the Token 'attributes' involved (offsets, type, etc.) and not just what they do with token text. This way, they can be converted to the new API without breakage.
[jira] Commented: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter
[ https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721116#action_12721116 ] Simon Willnauer commented on LUCENE-1696:

I will be around and fix / adjust it if it needs some changes. If I do not react, please send me a ping on this issue. Thanks

Added New Token API impl for ASCIIFoldingFilter
Key: LUCENE-1696
URL: https://issues.apache.org/jira/browse/LUCENE-1696
Project: Lucene - Java
Issue Type: Improvement
Components: Analysis
Affects Versions: 2.9
Reporter: Simon Willnauer
Assignee: Mark Miller
Fix For: 2.9
Attachments: ASCIIFoldingFilter._newTokenAPI.patch, TestGermanCollation.java

I added an implementation of incrementToken to ASCIIFoldingFilter.java and extended the existing test case for it. I will attach the patch shortly. Besides this improvement, I would like to start a small discussion about this filter. ASCIIFoldingFilter is meant to be a replacement for ISOLatin1AccentFilter, which is quite nice as it covers a superset of the latter. I have used this filter quite often, but never on an as-is basis. In most cases this filter does the correct thing (replaces a special char with its ASCII correspondent), but in some cases, like German umlauts, it does not return the expected result. A German umlaut like 'ä' does not translate to 'a' but rather to 'ae'. I would like to change this, but I'm not 100% sure that is expected by all users of that filter. Another way of doing it would be to make it configurable with a flag. This would not affect performance, as we only check whether such an umlaut char is found. Further, it would be really helpful if the filter could inject the original/unmodified token with the same position increment into the token stream on demand (see the sketch below). I think it's a valid use case to index both the modified and the unmodified token. For instance, the German word 'süd' would be folded to 'sud'. In a query q:(süd) the filter would also fold to 'sud' and therefore find 'sud', which has a totally different meaning. Folding works quite well, but for special cases we could add those options to make users' lives easier. The latter could be done in a subclass, while the umlaut problem should be fixed in the base class. simon
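A minimal sketch of the "inject the original token" idea, written against the new attribute API; KeepOriginalFilter is a hypothetical name, and its fold() helper is a toy stand-in for the real folding logic:

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

// Hypothetical filter: emits the folded token, then the unmodified original
// at the same position (position increment 0), so both get indexed.
public final class KeepOriginalFilter extends TokenFilter {
  private final TermAttribute termAtt;
  private final PositionIncrementAttribute posIncrAtt;
  private State savedOriginal; // pending unmodified token, if any

  public KeepOriginalFilter(TokenStream input) {
    super(input);
    termAtt = (TermAttribute) addAttribute(TermAttribute.class);
    posIncrAtt = (PositionIncrementAttribute) addAttribute(PositionIncrementAttribute.class);
  }

  public boolean incrementToken() throws IOException {
    if (savedOriginal != null) {
      restoreState(savedOriginal);        // bring back the unmodified token
      posIncrAtt.setPositionIncrement(0); // same position as the folded one
      savedOriginal = null;
      return true;
    }
    if (!input.incrementToken()) {
      return false;
    }
    String original = termAtt.term();
    String folded = fold(original);       // stand-in for the real folding
    if (!folded.equals(original)) {
      savedOriginal = captureState();     // capture *before* overwriting the term
      termAtt.setTermBuffer(folded);
    }
    return true;
  }

  // Toy folding: only handles the umlauts discussed above.
  private static String fold(String s) {
    return s.replace("ä", "ae").replace("ö", "oe").replace("ü", "ue").replace("ß", "ss");
  }
}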
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721135#action_12721135 ] Michael Busch commented on LUCENE-1693:

{quote} For backwards-compatiblility we should deprecate the current versions of these class [and only let them implement next(Token)]. {quote}

I agree. With my patch the Tee/Sink stuff doesn't work in all situations either, when the new API is used. We need to deprecate tee/sink and write a new class that implements the same functionality with the new API.
Re: Synchronizing Lucene indexes across 2 application servers
Could you re-ask this on java-user instead? Thanks! (java-dev is for discussing how to make changes to Lucene; java-user is for discussing usage of Lucene.) Mike

On Thu, Jun 18, 2009 at 2:13 AM, mitu2009 <musicfrea...@gmail.com> wrote:
> I have a web application which uses Lucene for search functionality. Lucene search requests are served by web services sitting on 2 application servers (IIS 7). The 2 application servers are load balanced using NetScaler. Both servers have a batch job which updates the search indexes on the respective server nightly. I need to synchronize the search indexes on these 2 servers so that at any point in time both servers have up-to-date indexes. I was wondering what the best architecture/design strategy would be to do so, given that either of the 2 application servers could be serving a search request depending upon its availability. Any inputs please? Thanks for reading!
Re: Lucene 2.9 Again
On Wed, Jun 17, 2009 at 4:13 PM, Mark Miller <markrmil...@gmail.com> wrote:

> Michael Busch wrote:
>> Everyone who is unhappy with the release TODOs, go back in your mail archive to the 2.2 release and check how many tedious little changes we made to improve the release quality. And besides the maven stuff, there is not really more to do compared to pre-2.2; it's just documented in a more verbose (= RM-friendly) way.
>
> I didn't mean to imply anything untoward :) I'm grateful for the work you guys have put into making it all more friendly. I know I have seen many of Mike M's wiki updates on this page too, and I've always been sure it's for the better.

Well, I made lots of silly mistakes during my releases :) (if you're not making mistakes, you're not trying hard enough). So every time I made a mistake I went and updated it.

> Even still, when I look at the process, I remember why I clung to Windows for so long :) Now I'm happily on Ubuntu and can still usually avoid such fun work :)

The next step after Ubuntu is OS X, of course ;) 

> I'll happily soldier on though. I just wish it was all in Java :)

I pretty much find any excuse to go and write stuff in Python ;) So, I wrote a Python script that goes and signs/verifies sigs on all the Maven artifacts.

Mike
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721142#action_12721142 ] Uwe Schindler commented on LUCENE-1693:

OK, we can merge our patches then! At the moment I see no real show-stoppers with the current approach. Have you tested thoroughly and measured performance? All tests from core and contrib/analyzers pass; the problems with your last TestCompatibility.java are Tee/Sink problems. The interesting part (if we stay with my not-so-elegant-anymore solution because of the reflection hacks) would be to remove the deprecated next(Token) methods from the core streams, which would be a great code cleanup!
[jira] Updated: (LUCENE-1673) Move TrieRange to core
[ https://issues.apache.org/jira/browse/LUCENE-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1673: Attachment: LUCENE-1673.patch

Final patch version with updated javadocs. I will commit in a day or two :-) When committing, I will also remove TrieRange from contrib/search (not included in the patch). If you want to make javadoc updates, feel free to post an updated patch or do it after I have committed. After that I will do some work on NumericField and NumericSortField, as well as moving the parsers to FieldCache and making the plain-text-number parsers public there, too.

Move TrieRange to core
Key: LUCENE-1673
URL: https://issues.apache.org/jira/browse/LUCENE-1673
Project: Lucene - Java
Issue Type: New Feature
Components: Search
Affects Versions: 2.9
Reporter: Uwe Schindler
Assignee: Uwe Schindler
Fix For: 2.9
Attachments: LUCENE-1673.patch, LUCENE-1673.patch, LUCENE-1673.patch, LUCENE-1673.patch, LUCENE-1673.patch

TrieRange was iterated many times and seems stable now (LUCENE-1470, LUCENE-1582, LUCENE-1602). There is lots of user interest, Solr added it to its default FieldTypes (SOLR-940), and if possible I want to move it to core before the release of 2.9. Before this can be done, there are some things to think about:
# There are now classes called LongTrieRangeQuery, IntTrieRangeQuery; how should they be called in core? I would suggest leaving it as it is. On the other hand, if this remains our only numeric query implementation, we could call it LongRangeQuery, IntRangeQuery or NumericRangeQuery (see below; there are problems here). Same for the TokenStreams and Filters.
# Maybe the pairs of classes for indexing and searching should be moved into one class: NumericTokenStream, NumericRangeQuery, NumericRangeFilter. The problem here: ctors must be able to accept int, long, double, float as range parameters. For the end user, mixing these 4 types in one class is hard to handle. If somebody forgets to add an L to a long, it suddenly instantiates an int version of the range query, hitting no results and so on (see the snippet after this list). Same with the other types. Maybe accept java.lang.Number as parameter (because it is nullable for half-open bounds) and one enum for the type.
# Should TrieUtils move into o.a.l.util? Or o.a.l.document? Or somewhere else?
# Move the TokenStreams into o.a.l.analysis, and ShiftAttribute into o.a.l.analysis.tokenattributes? Somewhere else?
# If we rename the classes, should Solr stay with Trie (because there are different impls)?
# Maybe add a subclass of AbstractField that automatically creates these TokenStreams and omits norms/tf per default, for easier addition to Document instances?
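To make the "forgot the L" concern in item 2 concrete, here is a small self-contained illustration of Java overload resolution; the range() methods are hypothetical stand-ins for int/long ctors merged into one class:

// Hypothetical stand-ins for int/long variants merged into one class.
public class OverloadPitfall {
  static String range(String field, int min, int max) {
    return "int range on " + field;
  }
  static String range(String field, long min, long max) {
    return "long range on " + field;
  }
  public static void main(String[] args) {
    // Without the L suffix, Java silently picks the int overload,
    // which would query the differently encoded int trie variant:
    System.out.println(range("price", 4, 10));   // -> "int range on price"
    System.out.println(range("price", 4L, 10L)); // -> "long range on price"
  }
}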
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721158#action_12721158 ] Uwe Schindler commented on LUCENE-1693:

By the way, I also tested Solr's token streams after updating the Lucene jar file. All tests pass (only some unrelated ones fail because of latest changes in Lucene trunk, and there are some compile failures because of changes in non-released APIs). Solr's TokenStreams are all programmed with the old API, but they get inverted using incrementToken from our patch. The Solr query parser also seems to work.
[jira] Commented: (LUCENE-1630) Mating Collector and Scorer on doc Id orderness
[ https://issues.apache.org/jira/browse/LUCENE-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721196#action_12721196 ] Michael McCandless commented on LUCENE-1630:

* I wonder if we should have a separate TopScorer class that doesn't expose the nextDoc/advance methods? And then a separate QueryWeight.topScorer method instead of a boolean arg to QueryWeight.scorer. (I'm torn...). E.g., if you get a topScorer, you are not supposed to call nextDoc/advance on it, so it really feels like it wants to be a different class than Scorer...
* Update the CHANGES entry based on iterations on the patch (e.g. supportsDocsOutOfOrder -> acceptsDocsOutOfOrder).
* Can we rename QW.scoresOutOfOrder -> QW.scoresDocsOutOfOrder?
* In IndexSearcher ~line 221, shouldn't we pass true for scoresDocsInOrder in {{Scorer scorer = weight.scorer(reader, false, true)}}?
* QyertWeight -> QueryWeight
* I think CustomScoreQuery.scorer should actually always score docs in order? So CustomWeight.scoresOutOfOrder should return false? And CustomWeight.scorer should pass true for scoreDocsInOrder to all sub-weights?

Mating Collector and Scorer on doc Id orderness
Key: LUCENE-1630
URL: https://issues.apache.org/jira/browse/LUCENE-1630
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Reporter: Shai Erera
Assignee: Michael McCandless
Fix For: 2.9
Attachments: LUCENE-1630.patch, LUCENE-1630.patch, LUCENE-1630.patch, LUCENE-1630.patch

This is a spin-off of LUCENE-1593. This issue proposes to expose appropriate API on Scorer and Collector such that one can create an optimized Collector based on a given Scorer's doc-id orderness and vice versa. Copied from LUCENE-1593, here is the list of changes:
# Deprecate Weight and create QueryWeight (abstract class) with a new scorer(reader, scoreDocsInOrder), replacing the current scorer(reader) method. QueryWeight implements Weight, while scorer(reader) calls scorer(reader, false /* out-of-order */) and scorer(reader, scoreDocsInOrder) is defined abstract.
#* Also add QueryWeightWrapper to wrap a given Weight implementation. This one will also be deprecated, as well as package-private.
#* Add to Query variants of createWeight and weight which return QueryWeight. For now, I prefer to add a default impl which wraps the Weight variant instead of overriding in all Query extensions, and in 3.0, when we remove the Weight variants, override in all extending classes.
# Add to Scorer isOutOfOrder with a default of false, and override in BS to true.
# Modify BooleanWeight to extend QueryWeight and implement the new scorer method to return BS2 or BS based on the number of required scorers and setAllowOutOfOrder.
# Add to Collector an abstract _acceptsDocsOutOfOrder_ which returns true/false.
#* Use it in the IndexSearcher.search methods that accept a Collector, in order to create the appropriate Scorer, using the new QueryWeight.
#* Provide a static create method to TFC and TSDC which accepts this as an argument and creates the proper instance.
#* Wherever we create a Collector (TSDC or TFC), always ask for an out-of-order Scorer and check isOutOfOrder() on the resulting Scorer, so that we can create the optimized Collector instance.
# Modify IndexSearcher to use all of the above logic.

The only class I'm worried about, and would like to verify with you, is Searchable. If we want to deprecate all the search methods on IndexSearcher, Searcher and Searchable which accept Weight and add new ones which accept QueryWeight, we must do the following:
* Deprecate Searchable in favor of Searcher.
* Add to Searcher the new QueryWeight variants. Here we have two choices: (1) break back-compat and add them as abstract (like we've done with the new Collector method) or (2) add them with a default impl that calls the Weight versions, documenting that these will become abstract in 3.0.
* Have Searcher extend UnicastRemoteObject and have RemoteSearchable extend Searcher. That's the part I'm a little bit worried about - Searchable implements java.rmi.Remote, which means there could be an implementation out there which implements Searchable and extends something different than UnicastRemoteObject, like Activatable. I think there is a very small chance this has actually happened, but would like to confirm with you guys first.
* Add a deprecated, package-private SearchableWrapper which extends Searcher and delegates all calls to the Searchable member.
* Deprecate all uses of Searchable and add Searcher instead, defaulting the old ones to use SearchableWrapper.
* Make all the necessary changes to IndexSearcher, MultiSearcher etc. regarding overriding these new methods.

One other optimization that was discussed in LUCENE-1593 is to expose a topScorer() API (on Weight) which returns a Scorer whose score(Collector) will be called, and additionally add a start() method to DISI. That will allow Scorers to initialize either on start() or score(Collector). This was proposed mainly because of BS and BS2, which check whether they are initialized on every call to next(), skipTo() and score(). Personally I prefer to see that in a separate issue, following this one (as it might add methods to QueryWeight).
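As a concrete illustration of the acceptsDocsOutOfOrder idea, here is a minimal sketch of a Collector that does not care about doc-id order, written against the Collector API as it shapes up in 2.9; CountingCollector is a made-up name:

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

// Counts hits; doc-id order is irrelevant, so out-of-order scoring is fine
// and the searcher may pick the faster BooleanScorer.
public class CountingCollector extends Collector {
  private int count;

  public void setScorer(Scorer scorer) {
    // scores are not needed for counting
  }

  public void collect(int doc) throws IOException {
    count++;
  }

  public void setNextReader(IndexReader reader, int docBase) {
    // no per-segment state to track
  }

  public boolean acceptsDocsOutOfOrder() {
    return true;
  }

  public int getCount() {
    return count;
  }
}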
[jira] Commented: (LUCENE-1630) Mating Collector and Scorer on doc Id orderness
[ https://issues.apache.org/jira/browse/LUCENE-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721204#action_12721204 ] Shai Erera commented on LUCENE-1630:

bq. QyertWeight -> QueryWeight

I'll fix. Can you please next time give me a hint on where you found it? :)

bq. I wonder if we should have a separate TopScorer class

I remember that at some point I suggested having a score(Searcher, Collector) on QueryWeight, and making Scorer.score(Collector) package-private (of course we'd need to deprecate first and invent a new name). But then I realized that custom weights would still need access to Scorer.score(Collector) if they want to use an existing Scorer or something. Taking Scorer.score(Collector) out of Scorer and into TopScorer is a large refactoring. Are you sure about this? I just think of all the Scorers we have, and out there, that would need to implement a new class, and possibly duplicate a lot of code that is today shared between the top-level scorer and the iterator-type scorer. I understand what you say - "so it really feels like it wants to be a different class than Scorer" - I feel that too. But I don't see a great ROI here.
[jira] Commented: (LUCENE-1630) Mating Collector and Scorer on doc Id orderness
[ https://issues.apache.org/jira/browse/LUCENE-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721208#action_12721208 ] Shai Erera commented on LUCENE-1630:

bq. I think CustomScoreQuery.scorer should actually always score docs in order?

Why? I don't see that it relies on doc id orderness anywhere. What if its subWeight is a BooleanWeight and I use a Collector which accepts docs out-of-order? Will I have a problem if I ask for an out-of-order Scorer?
[jira] Updated: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1693: Attachment: (was: LUCENE-1693.patch)
[jira] Updated: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1693: Attachment: LUCENE-1693.patch

Again an update: Unified the reusable tokens in TokenWrapper.delegate. Now it is always set after each action, so no state changes are left out.
[jira] Commented: (LUCENE-1630) Mating Collector and Scorer on doc Id orderness
[ https://issues.apache.org/jira/browse/LUCENE-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721276#action_12721276 ] Michael McCandless commented on LUCENE-1630: bq. Can you please next time give me a hint on where did you find it? OK :) It's a quick search through the patch file though ;) bq. Taking Scorer.score(Collector) out of Scorer and into TopScorer is a large re-factoring. Are you sure about this? I just think of all the Scorers we have, and out there, that need to impl a new class, and possible duplicate a lot of code that is today shared between the top-level-scorer and iterator-type-scorer. I'm definitely not sure about it... For Scorers that don't have anything special to do when they are top, we'd have a default impl (get a non-top Scorer and iterate over it, like Scorer.score now does. So I think the only weight that'd do something interesting is BooleanQuery's. But I agree this is a big change, so let's hold off for now? With search specialization (LUCENE-1594) the difference between being top and being sub seems to be more important {quote} bq. I think CustomScoreQuery.scorer should actually always score docs in order? Why? I don't see that it relies on doc id orderness anywhere {quote} CustomScorer's nextDoc uses advance on its subScorers. Mating Collector and Scorer on doc Id orderness --- Key: LUCENE-1630 URL: https://issues.apache.org/jira/browse/LUCENE-1630 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Assignee: Michael McCandless Fix For: 2.9 Attachments: LUCENE-1630.patch, LUCENE-1630.patch, LUCENE-1630.patch, LUCENE-1630.patch This is a spin off of LUCENE-1593. This issue proposes to expose appropriate API on Scorer and Collector such that one can create an optimized Collector based on a given Scorer's doc-id orderness and vice versa. Copied from LUCENE-1593, here is the list of changes: # Deprecate Weight and create QueryWeight (abstract class) with a new scorer(reader, scoreDocsInOrder), replacing the current scorer(reader) method. QueryWeight implements Weight, while score(reader) calls score(reader, false /* out-of-order */) and scorer(reader, scoreDocsInOrder) is defined abstract. #* Also add QueryWeightWrapper to wrap a given Weight implementation. This one will also be deprecated, as well as package-private. #* Add to Query variants of createWeight and weight which return QueryWeight. For now, I prefer to add a default impl which wraps the Weight variant instead of overriding in all Query extensions, and in 3.0 when we remove the Weight variants - override in all extending classes. # Add to Scorer isOutOfOrder with a default to false, and override in BS to true. # Modify BooleanWeight to extend QueryWeight and implement the new scorer method to return BS2 or BS based on the number of required scorers and setAllowOutOfOrder. # Add to Collector an abstract _acceptsDocsOutOfOrder_ which returns true/false. #* Use it in IndexSearcher.search methods, that accept a Collector, in order to create the appropriate Scorer, using the new QueryWeight. #* Provide a static create method to TFC and TSDC which accept this as an argument and creates the proper instance. #* Wherever we create a Collector (TSDC or TFC), always ask for out-of-order Scorer and check on the resulting Scorer isOutOfOrder(), so that we can create the optimized Collector instance. # Modify IndexSearcher to use all of the above logic. 
The only class I'm worried about, and would like to verify with you, is Searchable. If we want to deprecate all the search methods on IndexSearcher, Searcher and Searchable which accept Weight, and add new ones which accept QueryWeight, we must do the following: * Deprecate Searchable in favor of Searcher. * Add to Searcher the new QueryWeight variants. Here we have two choices: (1) break back-compat and add them as abstract (like we've done with the new Collector method) or (2) add them with a default impl that calls the Weight versions, documenting that these will become abstract in 3.0. * Have Searcher extend UnicastRemoteObject and have RemoteSearchable extend Searcher. That's the part I'm a little bit worried about - Searchable implements java.rmi.Remote, which means there could be an implementation out there which implements Searchable and extends something other than UnicastRemoteObject, like Activatable. I think there is a very small chance this has actually happened, but I would like to confirm with you guys first. * Add a deprecated, package-private SearchableWrapper which extends Searcher and delegates all calls to the Searchable member. * Deprecate all uses of Searchable and add Searcher instead, defaulting the old ones to use SearchableWrapper. * Make all the necessary changes to IndexSearcher, MultiSearcher etc. regarding overriding these new methods. One other optimization that was discussed in LUCENE-1593 is to expose a topScorer() API (on Weight) which returns a Scorer whose score(Collector) will be called, and additionally add a start() method to DISI. That will allow Scorers to initialize either on start() or score(Collector). This was proposed mainly because of BS and BS2, which check whether they are initialized in every call to next(), skipTo() and score(). Personally I prefer to see that in a separate issue, following this one (as it might add methods to QueryWeight).
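To make the proposed hand-shake concrete, here is a hedged sketch of a Collector that opts into out-of-order scoring (the counting collector itself is illustrative; the method names follow the list above):

{code}
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

// Illustrative Collector that only counts hits. Since it never depends on
// doc id order, it can accept out-of-order scoring (e.g. from BooleanScorer).
public class CountingCollector extends Collector {
  private int count;

  public void setScorer(Scorer scorer) {
    // scores are not needed for counting
  }

  public void collect(int doc) throws IOException {
    count++;
  }

  public void setNextReader(IndexReader reader, int docBase) throws IOException {
    // no per-segment state needed
  }

  // The new hand-shake method proposed in this issue: the searcher can ask
  // the weight for an out-of-order scorer when this returns true.
  public boolean acceptsDocsOutOfOrder() {
    return true;
  }

  public int getCount() {
    return count;
  }
}
{code}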
Re: Fuzzy search change
This would make an awesome addition to Lucene! This is similar to how Lucene's spellchecker identifies candidates, if I understand it right. Would you be able to port it to Java? Mike On Thu, Jun 18, 2009 at 7:12 AM, Varun Dhussa va...@mapmyindia.com wrote: Hi, I wrote on this a long time ago, but haven't followed it up. I just finished a C++ implementation of a spell check module in my software. I borrowed the idea from Xapian: use a trigram index to filter results, and then use edit distance on the filtered set. Would such a solution be acceptable to the Lucene community? The details of my implementation are as follows: 1) QDBM data store hash map 2) Trigram tokenizer on the input string 3) Data store hash(key, value) = (trigram, keyword_id_list<kw1...kwN>) 4) Use the trigram tokenizer and match with the trigram index 5) Get the IDs within the input cutoff 6) Run edit distance on the list and return. In my tests on an Intel Core 2 Duo with 3 GB RAM and Windows XP 32-bit, it runs in 0.5 sec with a keyword record count of about 1,000,000 records. This is at least 3-4 times faster than current search times on Lucene. Since the results can be put in a thread-safe hash table structure, the trigram search can also be distributed over a thread pool. Does this seem like a workable suggestion to the community? Regards -- Varun Dhussa Product Architect CE InfoSystems (P) Ltd http://www.mapmyindia.com
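If it helps to make the proposal concrete, here is a rough Java sketch of the trigram-filter-then-edit-distance idea (a plain HashMap stands in for the QDBM store; all names are illustrative):

{code}
import java.util.*;

// Rough sketch: a trigram index narrows the candidate set, then edit
// distance is run only on the (much smaller) filtered set.
public class TrigramSpellSketch {
  private final Map index = new HashMap(); // trigram -> List of words

  public void add(String word) {
    for (Iterator it = trigrams(word).iterator(); it.hasNext();) {
      String gram = (String) it.next();
      List words = (List) index.get(gram);
      if (words == null) index.put(gram, words = new ArrayList());
      words.add(word);
    }
  }

  // Return candidates sharing at least minShared trigrams with the input.
  public Set candidates(String input, int minShared) {
    Map counts = new HashMap(); // word -> shared trigram count
    for (Iterator it = trigrams(input).iterator(); it.hasNext();) {
      List words = (List) index.get(it.next());
      if (words == null) continue;
      for (Iterator w = words.iterator(); w.hasNext();) {
        Object word = w.next();
        Integer c = (Integer) counts.get(word);
        counts.put(word, new Integer(c == null ? 1 : c.intValue() + 1));
      }
    }
    Set result = new HashSet();
    for (Iterator e = counts.entrySet().iterator(); e.hasNext();) {
      Map.Entry entry = (Map.Entry) e.next();
      if (((Integer) entry.getValue()).intValue() >= minShared) {
        result.add(entry.getKey());
      }
    }
    return result; // run edit distance on this set only
  }

  private static List trigrams(String s) {
    List grams = new ArrayList();
    for (int i = 0; i + 3 <= s.length(); i++) {
      grams.add(s.substring(i, i + 3));
    }
    return grams;
  }
}
{code}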
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721281#action_12721281 ] Grant Ingersoll commented on LUCENE-1693: - {quote} By the way, I tested Solr's token streams also, after updating the Lucene jar file. All tests pass (only some unrelated ones fail because of the latest changes in Lucene trunk, and some compile failures because of changes in non-released APIs). Solr's TokenStreams are all programmed with the old API, but they get inverted using incrementToken from our patch. The Solr query parser also seems to work. {quote} Did you look at the performance on this?
[jira] Commented: (LUCENE-1630) Mating Collector and Scorer on doc Id orderness
[ https://issues.apache.org/jira/browse/LUCENE-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721285#action_12721285 ] Shai Erera commented on LUCENE-1630: bq. CustomScorer's nextDoc uses advance on its subScorers. Yeah, I noticed that, but I thought that out-of-order usually means a top scorer, and then score(Collector) is called. But now I see CustomScorer does not implement score(Collector), which means it uses Scorer's, which calls nextDoc() and advance(). Regarding TopScorer, it'd need to get a Scorer as input, otherwise what would be its default impl for score(Collector)? I thought it should be the current one from Scorer. Will post a patch soon.
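For reference, the default Scorer.score(Collector) implementation being discussed is essentially this loop (simplified sketch):

{code}
// Simplified sketch: a Scorer is itself a doc id iterator, so the default
// score(Collector) just iterates it and feeds each doc to the collector.
public void score(Collector collector) throws IOException {
  collector.setScorer(this);
  int doc;
  while ((doc = nextDoc()) != NO_MORE_DOCS) {
    collector.collect(doc); // docs arrive in increasing doc id order
  }
}
{code}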
[jira] Updated: (LUCENE-1630) Mating Collector and Scorer on doc Id orderness
[ https://issues.apache.org/jira/browse/LUCENE-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-1630: --- Attachment: LUCENE-1630.patch Implemented the latest comments, except for TopScorer.
[jira] Commented: (LUCENE-1595) Split DocMaker into ContentSource and DocMaker
[ https://issues.apache.org/jira/browse/LUCENE-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721311#action_12721311 ] Mark Miller commented on LUCENE-1595: - bq. I added readContentSource.alg just for that purpose and ran it over the Wikipedia dump. All documents were read successfully. I figured you probably had, but they won't end up coming after you, they will come after me :) As expected, no issues hit yet though. I'll commit this later today. Split DocMaker into ContentSource and DocMaker -- Key: LUCENE-1595 URL: https://issues.apache.org/jira/browse/LUCENE-1595 Project: Lucene - Java Issue Type: Improvement Components: contrib/benchmark Reporter: Shai Erera Assignee: Mark Miller Fix For: 2.9 Attachments: LUCENE-1595.patch, LUCENE-1595.patch, LUCENE-1595.patch, LUCENE-1595.patch, LUCENE-1595.patch This issue proposes some refactoring to the benchmark package. Today, DocMaker has two roles: collecting documents from a collection and preparing a Document object. These two should actually be split into ContentSource and DocMaker, where DocMaker uses a ContentSource instance. ContentSource will implement all the collection-reading methods of DocMaker, like getNextDocData, raw-size-in-bytes tracking, etc. This can actually fit well w/ 1591, by having a basic ContentSource that offers input stream services and wraps a file (for example) with bzip or gzip streams, etc. DocMaker will implement the makeDocument methods, reusing DocState etc. (a sketch of the split follows below). The idea is that collecting the Enwiki documents, for example, should be the same whether I create documents using DocState, add payloads or index additional metadata. The same goes for the Trec and Reuters collections, as well as LineDocMaker. In fact, if one inspects EnwikiDocMaker and LineDocMaker closely, they are 99% the same and 99% different. Most of their differences lie in the way they read the data, while most of the similarity lies in the way they create documents (using DocState). That led to a somewhat bizarre extension of LineDocMaker by EnwikiDocMaker (just for the reuse of DocState). Also, other DocMakers do not use that DocState today, something they could have gotten for free with the refactoring proposed here. So by having an EnwikiContentSource, a ReutersContentSource and others (TREC, Line, Simple), I can write several DocMakers, such as DocStateMaker, ConfigurableDocMaker (one which accepts all kinds of config options) and custom DocMakers (payload, facets, sorting), passing them a ContentSource instance, and reuse the same doc-making algorithm with many content sources, as well as the same ContentSource algorithm with many DocMaker implementations. This will also give us the opportunity to perf-test content sources alone (i.e., compare bzip, gzip and regular input streams), w/o the overhead of creating a Document object. I've already done so in my code environment (I extend the benchmark package for my application's purposes) and I like the flexibility I have. I think this can be a nice contribution to the benchmark package, which can result in some code cleanup as well.
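A minimal sketch of the proposed split (class names and signatures here are illustrative, not the committed API; DocData is assumed to be the existing benchmark carrier class):

{code}
import java.io.IOException;
import org.apache.lucene.benchmark.byTask.feeds.DocData;
import org.apache.lucene.document.Document;

// Illustrative shape of the split: the ContentSource only reads raw
// content (file, Wikipedia dump, TREC, Reuters, ...), while the DocMaker
// only turns that content into Lucene Documents.
public abstract class SketchContentSource {
  // Fills and returns the next raw document from the collection.
  public abstract DocData getNextDocData(DocData reuse) throws IOException;
  public abstract long getTotalBytesCount();
}

// Package-private here only so the sketch fits in one file.
class SketchDocMaker {
  private final SketchContentSource source;

  SketchDocMaker(SketchContentSource source) {
    this.source = source;
  }

  // The same document-building logic regardless of the content's origin.
  Document makeDocument(DocData data) {
    Document doc = new Document();
    // ... add body/title/date fields, payloads, etc., reusing DocState ...
    return doc;
  }
}
{code}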
[jira] Commented: (LUCENE-1595) Split DocMaker into ContentSource and DocMaker
[ https://issues.apache.org/jira/browse/LUCENE-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721320#action_12721320 ] Shai Erera commented on LUCENE-1595: bq. they won't end up coming after you, they will come after me :) I promise to cover for you if that happens :) bq. I'll commit this later today. Thanks!
[jira] Commented: (LUCENE-1673) Move TrieRange to core
[ https://issues.apache.org/jira/browse/LUCENE-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721326#action_12721326 ] Michael McCandless commented on LUCENE-1673: Latest patch looks good, Uwe! We can tweak the javadocs separately... Move TrieRange to core -- Key: LUCENE-1673 URL: https://issues.apache.org/jira/browse/LUCENE-1673 Project: Lucene - Java Issue Type: New Feature Components: Search Affects Versions: 2.9 Reporter: Uwe Schindler Assignee: Uwe Schindler Fix For: 2.9 Attachments: LUCENE-1673.patch, LUCENE-1673.patch, LUCENE-1673.patch, LUCENE-1673.patch, LUCENE-1673.patch TrieRange has been iterated many times and seems stable now (LUCENE-1470, LUCENE-1582, LUCENE-1602). There is lots of user interest, Solr added it to its default FieldTypes (SOLR-940), and if possible I want to move it to core before the 2.9 release. Before this can be done, there are some things to think about: # There are now classes called LongTrieRangeQuery and IntTrieRangeQuery; what should they be called in core? I would suggest leaving the names as they are. On the other hand, if this remains our only numeric query implementation, we could call them LongRangeQuery, IntRangeQuery or NumericRangeQuery (see below; there are problems here). The same goes for the TokenStreams and Filters. # Maybe the pairs of classes for indexing and searching should be merged into one class each: NumericTokenStream, NumericRangeQuery, NumericRangeFilter. The problem here: ctors must be able to take int, long, double or float range parameters. For the end user, mixing these 4 types in one class is hard to handle. If somebody forgets to add an L to a long literal, it suddenly instantiates the int version of the range query, hitting no results, and so on. The same goes for the other types. Maybe accept java.lang.Number as the parameter type (because it is nullable, for half-open bounds) plus one enum for the type. # Should TrieUtils move into o.a.l.util, or o.a.l.document, or somewhere else? # Move the TokenStreams into o.a.l.analysis, and ShiftAttribute into o.a.l.analysis.tokenattributes? Somewhere else? # If we rename the classes, should Solr stay with Trie (because there are different impls)? # Maybe add a subclass of AbstractField that automatically creates these TokenStreams and omits norms/tf by default, for easier addition to Document instances?
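To illustrate the overload pitfall and the Number-plus-enum alternative from point 2 above, a purely hypothetical sketch (none of these names are a committed API):

{code}
// With plain int/long overloads,
//   newRange("price", 1, 1000000)    selects the int version
//   newRange("price", 1L, 1000000L)  selects the long version
// so a forgotten 'L' silently picks the int variant and hits no results.
// One explicit type parameter plus nullable bounds avoids that trap:
public final class NumericRangeSketch {
  public static final int INT = 0, LONG = 1, FLOAT = 2, DOUBLE = 3;

  // null min/max means a half-open bound on that side.
  public static String newRange(String field, int type, Number min, Number max) {
    switch (type) {
      case LONG:
        long lo = (min == null) ? Long.MIN_VALUE : min.longValue();
        long hi = (max == null) ? Long.MAX_VALUE : max.longValue();
        // ... a real impl would build the long trie range query here ...
        return field + ":long[" + lo + " TO " + hi + "]";
      // INT, FLOAT and DOUBLE cases would be analogous.
      default:
        throw new IllegalArgumentException("unsupported type: " + type);
    }
  }
}
{code}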
[jira] Updated: (LUCENE-1700) LogMergePolicy.findMergesToExpungeDeletes need to get deletes from the SegmentReader
[ https://issues.apache.org/jira/browse/LUCENE-1700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-1700: --- Attachment: LUCENE-1700.patch Attached patch. I added a test case showing the problem, then took the same approach as LUCENE-1313, and the test passes. I also found that with NRT, because deletions are applied before building the CFS after flushing, we wind up holding open both the non-CFS and CFS files when creating the reader. So I changed deletions to flush after the CFS is built. I plan to commit in a day or two. LogMergePolicy.findMergesToExpungeDeletes need to get deletes from the SegmentReader Key: LUCENE-1700 URL: https://issues.apache.org/jira/browse/LUCENE-1700 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 2.4.1 Reporter: Jason Rutherglen Assignee: Michael McCandless Priority: Trivial Fix For: 2.9 Attachments: LUCENE-1700.patch Original Estimate: 48h Remaining Estimate: 48h With LUCENE-1516, deletes are carried over in the SegmentReaders, which means implementations of MergePolicy.findMergesToExpungeDeletes (such as LogMergePolicy) need to obtain deletion info from the SR (instead of from the SegmentInfo, which won't have that information).
[jira] Commented: (LUCENE-1677) Remove GCJ IndexReader specializations
[ https://issues.apache.org/jira/browse/LUCENE-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721410#action_12721410 ] Hoss Man commented on LUCENE-1677: -- {quote} I did ask: http://www.mail-archive.com/java-u...@lucene.apache.org/msg26726.html And nobody answered. So I think we should remove it, and the org.apache.lucene.SegmentReader.class system property? Can you post a patch? Thanks. {quote} FWIW: Google Code Search pops up a few uses in publicly available code... http://www.google.co.uk/codesearch?hl=en&lr=&q=org.apache.lucene.SegmentReader.class+-package%3Arepos%2Fasf%2Flucene%2Fjava&sbtn=Search What jumps out at me is that apparently older versions of Compass relied on this feature... it looks like Compass 2.0 eliminated the need for this class, but I just wanted to point it out. Remove GCJ IndexReader specializations -- Key: LUCENE-1677 URL: https://issues.apache.org/jira/browse/LUCENE-1677 Project: Lucene - Java Issue Type: Task Reporter: Earwin Burrfoot Assignee: Michael McCandless Fix For: 2.9 These specializations are outdated, unsupported, most probably pointless given the speed of modern JVMs and, I bet, nobody uses them (Mike, you said you were going to ask people on java-user - did anybody reply that they need it?). While giving nothing, they make the SegmentReader instantiation code look really ugly. If nobody objects, I'm going to post a patch that removes these from Lucene.
[jira] Commented: (LUCENE-1692) Contrib analyzers need tests
[ https://issues.apache.org/jira/browse/LUCENE-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721418#action_12721418 ] Robert Muir commented on LUCENE-1692: - michael: I think I'm done here. If you consider any of the bugs important, just let me know - I can try to help get them fixed. Contrib analyzers need tests Key: LUCENE-1692 URL: https://issues.apache.org/jira/browse/LUCENE-1692 Project: Lucene - Java Issue Type: Test Components: contrib/analyzers Reporter: Robert Muir Assignee: Michael McCandless Fix For: 2.9 Attachments: LUCENE-1692.txt, LUCENE-1692.txt, LUCENE-1692.txt, LUCENE-1692.txt The analyzers in contrib need tests, preferably ones that test the behavior of all the Token 'attributes' involved (offsets, type, etc.) and not just what they do with token text. This way, they can be converted to the new API without breakage.
Re: Lucene 2.9 Again
I pretty much find any excuse to go and write stuff in Python There's Scala... On Thu, Jun 18, 2009 at 2:37 AM, Michael McCandless luc...@mikemccandless.com wrote: On Wed, Jun 17, 2009 at 4:13 PM, Mark Miller markrmil...@gmail.com wrote: Michael Busch wrote: Everyone who is unhappy with the release TODOs, go back in your mail archive to the 2.2 release and check how many tedious little changes we made to improve the release quality. And besides the maven stuff, there is not really more to do compared to pre-2.2; it's just documented in a more verbose (= RM-friendly) way. I didn't mean to imply anything untoward :) I'm grateful for the work you guys have put into making it all more friendly. I know I have seen many of Mike M's wiki updates on this page too, and I've always been sure it's for the better. Well, I made lots of silly mistakes during my releases :) (if you're not making mistakes, you're not trying hard enough) So every time I made a mistake I went and updated it. Even still, when I look at the process, I remember why I clung to Windows for so long :) Now I'm happily on Ubuntu and can still usually avoid such fun work :) The next step after Ubuntu is OS X, of course ;) I'll happily soldier on though. I just wish it was all in Java :) I pretty much find any excuse to go and write stuff in Python ;) So I wrote a Python script that signs/verifies sigs on all the Maven artifacts. Mike
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721426#action_12721426 ] Uwe Schindler commented on LUCENE-1693: --- I only tested performance with the Lucene benchmarker on the various standard analyzers. With the patch, tokenizer.alg produces the same results as before in almost the same time (time variations are bigger than the differences). With an unmodified benchmarker this is expected: the benchmarker's tokenizer task still calls the deprecated next(Token), and as all core analyzers still implement this directly, there is no wrapping. I modified the tested tokenstreams and filters in core that were used, removing next(Token) and leaving only incrementToken() available; in this case the speed difference was also not measurable in my configuration (Thinkpad T60, Core Duo, Win32). I also changed some of the filters to implement next(Token) only, and others only incrementToken(), to get a completely mixed old/new API chain, and still got the same results (and the same tokenization results, as seen in the generated Wikipedia indexes). I also changed the benchmarker to use incrementToken(), which was also fine. To get a small speed increase (though I was not able to measure it), I changed all tokenizers to use only incrementToken for the whole chain and changed the benchmarker to also use this method. In this case I was able to call TokenStream.setOnlyUseNewAPI(true), which removed the backwards-compatibility wrapper and the Token instance, so the chain only used the unwrapped simple attributes. In my opinion tokenization was a little bit faster - faster than without the patch using next(Token). When the old API is completely removed, this will be the default behaviour. So I would suggest reviewing this patch, adding some tests for heterogeneous tokenizer chains, removing all next(...) implementations from all streams and filters, and implementing only incrementToken(). Contrib analyzers should then be rewritten directly to the new API, without the old API. The mentioned bugs with Tee/Sink are not related to this issue, but they are more serious now, because the tokenizer chain is no longer fixed to one specific API variant (it supports both mixed together).
[jira] Commented: (LUCENE-1646) QueryParser throws new exceptions even if custom parsing logic threw a better one
[ https://issues.apache.org/jira/browse/LUCENE-1646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721427#action_12721427 ] Hoss Man commented on LUCENE-1646: -- As a general rule, code catching an exception and throwing a new exception with more details should (almost always) call initCause (unless the new exception has a constructor that takes care of that part) to preserve the full stack history. Client code that wants to get at the root exception can do so using getCause(). In QueryParser...

{code}
} catch (ParseException tme) {
  // rethrow to include the original query:
  ParseException e = new ParseException("Cannot parse '" + query + "': " + tme.getMessage());
  e.initCause(tme);
  throw e;
}
{code}

In Trejkaz's code, something like...

{code}
} catch (ParseException pexp) {
  boolean handledCustom = false;
  for (Throwable t = pexp; null != t; t = t.getCause()) {
    if (t instanceof OurCustomException) {
      takeActionOnCustomException((OurCustomException) t);
      handledCustom = true;
      break;
    }
  }
  if (!handledCustom) {
    takeActionOnLuceneQueryParserException(pexp);
  }
}
{code}

QueryParser throws new exceptions even if custom parsing logic threw a better one - Key: LUCENE-1646 URL: https://issues.apache.org/jira/browse/LUCENE-1646 Project: Lucene - Java Issue Type: Improvement Affects Versions: 2.4.1 Reporter: Trejkaz We have subclassed QueryParser and have various custom fields. When these fields contain invalid values, we throw a subclass of ParseException which has a more useful message (and also a localised message). The problem is, Lucene's QueryParser is doing this:

{code}
catch (ParseException tme) {
  // rethrow to include the original query:
  throw new ParseException("Cannot parse '" + query + "': " + tme.getMessage());
}
{code}

Thus, our nice and useful ParseException is thrown away, replaced by one with no information about what's actually wrong with the query (it does append getMessage(), but that isn't localised; and it also throws away the underlying cause of the exception). I am about to patch our copy to simply remove these four lines; the caller knows what the query string was (they have to have a copy of it, because they are passing it in!), so having it in the error message itself is not useful. Furthermore, when the query string is very big, what the user wants to know is not that the whole query was bad, but which part of it was bad.
[jira] Resolved: (LUCENE-1595) Split DocMaker into ContentSource and DocMaker
[ https://issues.apache.org/jira/browse/LUCENE-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller resolved LUCENE-1595. - Resolution: Fixed Lucene Fields: [New, Patch Available] (was: [New]) Thanks Shai, I just committed this.
Re: Lucene 2.9 Again
On Thu, Jun 18, 2009 at 3:07 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: I pretty much find any excuse to go and write stuff in Python There's Scala... I've only read about it so far, but it does look good. Mike
[jira] Commented: (LUCENE-1692) Contrib analyzers need tests
[ https://issues.apache.org/jira/browse/LUCENE-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721451#action_12721451 ] Michael McCandless commented on LUCENE-1692: bq. michael: I think I'm done here. OK, I'll review. Thanks!! bq. if you consider any of the bugs important just let me know, can try to help get them fixed. Likely I won't be able to judge the severity of these bugs... so please chime in if you think they should be fixed...
[jira] Commented: (LUCENE-1692) Contrib analyzers need tests
[ https://issues.apache.org/jira/browse/LUCENE-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721457#action_12721457 ] Robert Muir commented on LUCENE-1692: - Michael, I think it would be nice to fix the Thai offset bug so the highlighter will work; it's a safe one-line fix and an obvious error. The SmartChineseAnalyzer empty-token bug is pretty serious: I think indexing empty tokens for every piece of punctuation could really hurt similarity computation (am I wrong? never tried). The Thai .type() bug is something that could be fixed later; I don't think the token type being ALPHANUM versus NUM is really hurting anyone. The issue where DutchAnalyzer doesn't do what it claims - I think that's not really hurting anyone either, and they can use the snowball version if they want accurate snowball behavior. I do think the huge files in DutchAnalyzer that aren't being used can be removed if you want to save 1MB, but I'm not sure how important that is. Let me know your thoughts.
[jira] Commented: (LUCENE-1692) Contrib analyzers need tests
[ https://issues.apache.org/jira/browse/LUCENE-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721460#action_12721460 ] Michael McCandless commented on LUCENE-1692: I'm seeing this test failure:

{code}
[junit] Testcase: testBuggyPunctuation(org.apache.lucene.analysis.cn.TestSmartChineseAnalyzer): Caused an ERROR
[junit] null
[junit] java.lang.AssertionError
[junit]     at org.apache.lucene.analysis.StopFilter.next(StopFilter.java:240)
[junit]     at org.apache.lucene.analysis.cn.TestSmartChineseAnalyzer.testBuggyPunctuation(TestSmartChineseAnalyzer.java:51)
{code}

It's because null is being passed to ts.next in the final assertTrue line:

{code}
nt = ts.next(nt);
while (nt != null) {
  assertEquals(result[i], nt.term());
  i++;
  nt = ts.next(nt);
}
assertTrue(ts.next(nt) == null);
{code}
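One possible fix for that loop (a sketch, assuming the same surrounding JUnit test context): never hand a null Token back into next(Token).

{code}
// Fixed loop: the final extra ts.next(nt) call, which received null, is
// simply dropped; nt is already null once the while loop exits.
Token nt = new Token();
nt = ts.next(nt);
int i = 0;
while (nt != null) {
  assertEquals(result[i], nt.term());
  i++;
  nt = ts.next(nt);
}
{code}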
Re: Some thoughts around the use of reader.isDeleted and hasDeletions
I've made the changes to SegmentMerger and want to make the following changes to IndexReader.document(): (1) don't call ensureOpen(), and (2) don't check isDeleted. The question is - can I make these changes on the current impls, or do I need to deprecate and come up w/ a new name? A new name is not a big challenge here - we can choose doc() or getDocument(), for example. I don't feel rawDocument flows nicely (what's raw about it?). IMO, even though these are back-compat changes (to runtime behavior), they are not likely to affect anyone. I mean, why would someone deliberately call document() when the reader has already been closed (unless he doesn't know it at the time of calling document())? For easy migration (if you rely on that feature), I can add isClosed()/isOpen() w/ a default impl that calls ensureOpen(). Or why call document(doc) if the doc is deleted - what's the scenario? Anyway, those two changes are necessary because our merging code calls these methods but already checks whether a doc is deleted before doing so. So it's just a question of a new method vs. a runtime change. What do you think? Shai On Wed, Jun 10, 2009 at 6:39 PM, Yonik Seeley yo...@lucidimagination.com wrote: On Wed, Jun 10, 2009 at 11:16 AM, Shai Erera ser...@gmail.com wrote: it makes sense because isDeleted() is essentially the *only* thing being done in the loop, and hence we can eliminate the loop entirely You mean that in case there is a matching segment, we can call matchingVectorsReader.rawDocs(rawDocLengths, rawDocLengths2, 0, maxDoc)? Right... or rather directly calculate numDocs and docNum instead of using the loop. But in case it does not have a matching segment, we'd still need to iterate over the docs and copy the term vectors one by one, right? Right, and that's the case where I think duplicating the code to remove a single branch-predictable boolean flag isn't warranted, as it won't result in a measurable performance increase. -Yonik http://www.lucidimagination.com
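For context, a simplified sketch of the guard being discussed (the merge-path variant in the trailing comment is the proposal, not committed code):

{code}
// Simplified sketch of the current behavior under discussion:
public Document document(int n) throws CorruptIndexException, IOException {
  ensureOpen(); // throws AlreadyClosedException if the reader was closed
  if (isDeleted(n)) {
    throw new IllegalArgumentException("attempt to access a deleted document");
  }
  return document(n, null); // no FieldSelector
}

// The proposed merge-path variant would skip both checks, since
// SegmentMerger has already verified the reader is open and the doc live.
{code}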
[jira] Commented: (LUCENE-1692) Contrib analyzers need tests
[ https://issues.apache.org/jira/browse/LUCENE-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721461#action_12721461 ] Mark Miller commented on LUCENE-1692: - heh - +1 on fixing them all. Including reclaiming that 1 MB of space if we can...
[jira] Commented: (LUCENE-1692) Contrib analyzers need tests
[ https://issues.apache.org/jira/browse/LUCENE-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721462#action_12721462 ] Michael McCandless commented on LUCENE-1692: Me too :) Robert, can you cons up a patch? Which files can be safely removed from the DutchAnalyzer? (stems/words.txt?)
[jira] Commented: (LUCENE-1692) Contrib analyzers need tests
[ https://issues.apache.org/jira/browse/LUCENE-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721463#action_12721463 ] Robert Muir commented on LUCENE-1692: - Michael, I guess JUnit from my Eclipse != JUnit from ant, because it passes in Eclipse... annoying. I will fix the test so it runs correctly from ant.
[jira] Commented: (LUCENE-1692) Contrib analyzers need tests
[ https://issues.apache.org/jira/browse/LUCENE-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721469#action_12721469 ] Michael McCandless commented on LUCENE-1692: Probably Eclipse isn't running with asserts?
Re: Fuzzy search change
what would be the difference/benefit compared to the standard Lucene SpellChecker? If I am not wrong: - Lucene SpellChecker uses a standard Lucene index as the storage for tokens, instead of QDBM... meaning a full inverted index with arbitrary N-gram lengths, with tf/idf/norms... not only a HashMap<trigram, wordList> - SC uses the paradigm "give me the N best candidates" (by similarity), not only "all above a cutoff"... this similarity depends (via the standard Lucene Similarity) on N-gram frequency (one could even use some sexy norms to fine-tune words...). If I've read your proposal correctly and did not miss something important, my suggestion would be to have a look at the Lucene SC (http://lucene.apache.org/java/2_3_2/api/contrib-spellchecker/org/apache/lucene/search/spell/SpellChecker.html) before you start. have fun, eks - Original Message From: Michael McCandless luc...@mikemccandless.com To: java-dev@lucene.apache.org Sent: Thursday, 18 June, 2009 16:29:59 Subject: Re: Fuzzy search change
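For readers who want to compare against the contrib SpellChecker eks describes, usage is roughly as follows (the directory paths and field name are illustrative):

{code}
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.spell.LuceneDictionary;
import org.apache.lucene.search.spell.SpellChecker;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Rough usage sketch: SpellChecker builds its own n-gram index over the
// dictionary terms, then ranks candidate words by similarity.
public class SpellCheckerExample {
  public static void main(String[] args) throws Exception {
    Directory spellDir = FSDirectory.getDirectory("/tmp/spellindex");
    IndexReader reader = IndexReader.open("/tmp/mainindex");

    SpellChecker spell = new SpellChecker(spellDir);
    // Index all terms of the "body" field as the dictionary.
    spell.indexDictionary(new LuceneDictionary(reader, "body"));

    String[] suggestions = spell.suggestSimilar("lucini", 5);
    for (int i = 0; i < suggestions.length; i++) {
      System.out.println(suggestions[i]);
    }
    reader.close();
  }
}
{code}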
[jira] Commented: (LUCENE-1692) Contrib analyzers need tests
[ https://issues.apache.org/jira/browse/LUCENE-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721475#action_12721475 ]

Robert Muir commented on LUCENE-1692:

Probably; fixed it and testing with ant now. I'll upload it at least so you can verify the behavior I've discovered. Do you want me to include a patch with the two bugfixes (Chinese empty token and Thai offsets), or give you something separate for those?

For the other 2 bugs: the Thai token-type bug is really a bug in the StandardTokenizer grammar; I wasn't sure you wanted to change that at this moment, but if you want it fixed let me know! In my opinion, the fix for DutchAnalyzer is to deprecate/remove it from contrib completely: since it claims to do Snowball stemming, why shouldn't someone just use the Dutch Snowball stemmer from the contrib/snowball package?
[jira] Commented: (LUCENE-1692) Contrib analyzers need tests
[ https://issues.apache.org/jira/browse/LUCENE-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721504#action_12721504 ]

Robert Muir commented on LUCENE-1692:

OK, got it: the IDEOGRAPHIC FULL STOP is being converted into a comma token by the tokenizer. If you use the default constructor, SmartChineseAnalyzer(), it won't load the default stopwords list; that's the case in my Luke screenshot. If you instead instantiate it as SmartChineseAnalyzer(true), it loads the default stopwords list, which includes things like the comma, so the token ends up getting removed. Maybe it's not a bug, but this is really non-obvious behavior...!
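A small sketch of the two constructions Robert describes. The constructor signatures are taken from this thread; the package path, sample text, and printing loop are illustrative assumptions, not from any patch here:

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;

public class SmartChineseStopwordDemo {
    public static void main(String[] args) throws Exception {
        // no-arg constructor: default stopwords NOT loaded, so punctuation
        // (e.g. the IDEOGRAPHIC FULL STOP, emitted as a comma token) survives
        print(new SmartChineseAnalyzer());
        // boolean constructor: default stopwords loaded, punctuation removed
        print(new SmartChineseAnalyzer(true));
    }

    static void print(Analyzer a) throws Exception {
        TokenStream ts = a.tokenStream("f", new StringReader("\u6211\u662f\u4e2d\u56fd\u4eba\u3002"));
        TermAttribute term = (TermAttribute) ts.addAttribute(TermAttribute.class);
        while (ts.incrementToken()) System.out.println(term.term());
    }
}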
[jira] Updated: (LUCENE-1692) Contrib analyzers need tests
[ https://issues.apache.org/jira/browse/LUCENE-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-1692:

Attachment: LUCENE-1692.txt

Patch with a new testcase demonstrating the Chinese behavior.
RE: Tests fail to compile on JDK 1.4?
: We had some discussions about it; the easiest is to set the bootclasspath
: in the javac task to an older rt.jar during compilation. Because this
: needs updates for e.g. Hudson (rt.jar missing), we said that whoever
: releases the final version should simply check this beforehand on the
: compilation computer as part of the release process.

There are ways to automate this sanity check in ant; I took a stab at this a while back...

https://issues.apache.org/jira/browse/LUCENE-718

...but I never moved forward with it because most people didn't seem that concerned.

-Hoss
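One way such a sanity check can work (this is an illustration, not the LUCENE-718 patch itself): every .class file records the class-format major version at byte offset 6-7, and major version 48 corresponds to J2SE 1.4, so a small tool can flag any file compiled for a newer JDK:

import java.io.*;

public class CheckClassVersion {
    public static void main(String[] args) throws IOException {
        int maxMajor = 48; // 48 = J2SE 1.4, 49 = Java 5, 50 = Java 6
        for (int i = 0; i < args.length; i++) {
            DataInputStream in = new DataInputStream(new FileInputStream(args[i]));
            try {
                if (in.readInt() != 0xCAFEBABE)
                    throw new IOException(args[i] + " is not a class file");
                in.readUnsignedShort();             // minor version
                int major = in.readUnsignedShort(); // major version
                if (major > maxMajor)
                    System.out.println(args[i] + ": major " + major
                        + " exceeds " + maxMajor + " (not JDK 1.4 compatible)");
            } finally {
                in.close();
            }
        }
    }
}

Wired into the build over the compiled output tree, this catches an accidental JDK 5+ compile even when the source happens to avoid new language features.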
[jira] Commented: (LUCENE-1692) Contrib analyzers need tests
[ https://issues.apache.org/jira/browse/LUCENE-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721512#action_12721512 ]

Robert Muir commented on LUCENE-1692:

Later tonight I can work up a patch to address the Thai offset issue and at least javadoc the Chinese behavior. If you think the additional 2 issues [Thai token type, DutchAnalyzer behavior/huge files] should be fixed or documented in some way, please let me know.
Re: madvise(ptr, len, MADV_SEQUENTIAL)
Hmm... So the list at the bottom of this page looks accurate? http://www.gnu.org/software/hello/manual/gnulib/posix_005ffadvise.html

If it is, then posix_fadvise works on Linux only? Perhaps madvise would be better then (judging by the much smaller unsupported list)? It seems to run on most platforms: http://www.gnu.org/software/hello/manual/gnulib/madvise.html

On Wed, Jun 17, 2009 at 2:19 AM, Alan Bateman alan.bate...@sun.com wrote:

Jason Rutherglen wrote:
Alan, do you think something like FileDescriptor.setAdvise (mirroring posix_fadvise) makes sense?
-J

Something like posix_fadvise would be more appropriate for FileChannel, or maybe as a usage hint when opening the file (the new APIs for opening files are extensible to allow for additional options in the future, or even implementation-specific options). I don't think we've had much interest in doing this, maybe because it would be a no-op on many operating systems.

-Alan.
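Since the JDK exposes neither call, an application that wants the hint today has to go native. A hedged sketch using JNA; the advice constants are the Linux <fcntl.h> values, and other platforms may differ or lack the call entirely, per the gnulib portability lists above:

import com.sun.jna.Library;
import com.sun.jna.Native;

// Bind posix_fadvise from libc via JNA (Linux-only, per the thread).
public interface PosixC extends Library {
    PosixC INSTANCE = (PosixC) Native.loadLibrary("c", PosixC.class);

    int POSIX_FADV_SEQUENTIAL = 2; // expect sequential reads
    int POSIX_FADV_DONTNEED   = 4; // data won't be needed again

    // C signature: int posix_fadvise(int fd, off_t offset, off_t len, int advice);
    int posix_fadvise(int fd, long offset, long len, int advice);
}

Getting the raw integer fd out of a java.io.FileDescriptor requires reflection on its private 'fd' field, which is JDK-implementation specific; that fragility is presumably part of why a proper hook in FileChannel or the file-open API is being discussed.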
[jira] Updated: (LUCENE-1313) Near Realtime Search
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-1313:

Attachment: LUCENE-1313.patch

* TestThreadedOptimize passes. LogMergePolicy now filters the segmentInfos based on the dir, rather than NRTMergePolicy passing in only ramInfos or primaryInfos. LogMergePolicy is careful to select contiguous segments; by passing in only a subset of segmentInfos, the merge-policy selection broke down.

* TestIndexWriter.testAddIndexOnDiskFull and testAddIndexesWithCloseNoWait fail, which I don't think happened before. testAddIndexOnDiskFull fails when autoCommit=true, which I'm not sure will still be a valid test by the time this patch goes in, but it probably needs to be looked into.

The other previous notes are still valid.

Near Realtime Search
Key: LUCENE-1313
URL: https://issues.apache.org/jira/browse/LUCENE-1313
Project: Lucene - Java
Issue Type: New Feature
Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Priority: Minor
Fix For: 3.1
Attachments: LUCENE-1313.jar, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, lucene-1313.patch, lucene-1313.patch, lucene-1313.patch, lucene-1313.patch

Enable near realtime search in Lucene without external dependencies. When RAM NRT is enabled, the implementation adds a RAMDirectory to IndexWriter. Flushes go to the ramdir unless there is no available space. Merges are completed in the ram dir until there is no more available ram. IW.optimize and IW.commit flush the ramdir to the primary directory; all other operations try to keep segments in ram until there is no more space.
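For readers following along, the effect this patch is after can be roughly approximated at the application level with stock APIs: buffer recent documents in a RAMDirectory, search RAM and disk together, and periodically fold the RAM segments into the primary directory. A sketch of that idea (not the patch's internals; class and method names below are the application's own, and the primary directory is assumed to already contain an index):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class PoorMansNrt {
    private final Directory primary; // e.g. an FSDirectory with an existing index
    private RAMDirectory ram = new RAMDirectory();
    private IndexWriter ramWriter;

    public PoorMansNrt(Directory primary) throws Exception {
        this.primary = primary;
        ramWriter = new IndexWriter(ram, new StandardAnalyzer(),
                                    IndexWriter.MaxFieldLength.UNLIMITED);
    }

    // new documents land in RAM only
    public synchronized void add(Document doc) throws Exception {
        ramWriter.addDocument(doc);
    }

    // a searcher that sees both the on-disk index and the RAM buffer
    public synchronized IndexSearcher searcher() throws Exception {
        ramWriter.commit(); // make buffered docs visible to a new reader
        IndexReader[] readers = {
            IndexReader.open(primary), IndexReader.open(ram) };
        return new IndexSearcher(new MultiReader(readers));
    }

    // the analogue of "IW.commit flushes the ramdir to the primary directory"
    public synchronized void flushToPrimary() throws Exception {
        ramWriter.close();
        IndexWriter main = new IndexWriter(primary, new StandardAnalyzer(),
                                           IndexWriter.MaxFieldLength.UNLIMITED);
        main.addIndexesNoOptimize(new Directory[] { ram });
        main.close();
        ram = new RAMDirectory();
        ramWriter = new IndexWriter(ram, new StandardAnalyzer(),
                                    IndexWriter.MaxFieldLength.UNLIMITED);
    }
}

The patch does this inside IndexWriter itself, which avoids the reader-reopen and addIndexes cost this sketch pays on every flush; the sketch only shows why keeping fresh segments in RAM shortens the index-to-searchable latency.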
[jira] Commented: (LUCENE-1539) Improve Benchmark
[ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721586#action_12721586 ]

Jason Rutherglen commented on LUCENE-1539:

I think it would be convenient to allow passing in the data files' absolute path, instead of assuming they're in a relative path.

Improve Benchmark
Key: LUCENE-1539
URL: https://issues.apache.org/jira/browse/LUCENE-1539
Project: Lucene - Java
Issue Type: Improvement
Components: contrib/benchmark
Affects Versions: 2.4
Reporter: Jason Rutherglen
Assignee: Michael McCandless
Priority: Minor
Fix For: 2.9
Attachments: LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, sortBench2.py, sortCollate2.py
Original Estimate: 336h
Remaining Estimate: 336h

Benchmark can be improved by incorporating recent suggestions posted on java-dev. M. McCandless' Python scripts that execute multiple rounds of tests can either be incorporated into the codebase or converted to Java.
[jira] Updated: (LUCENE-1466) CharFilter - normalize characters before tokenizer
[ https://issues.apache.org/jira/browse/LUCENE-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated LUCENE-1466:

Attachment: LUCENE-1466.patch

Updated patch attached.
- sync trunk (smart chinese analyzer (LUCENE-1629), etc.)
- added a useful idiom to get CharStream and made the CharReader constructor private

CharFilter - normalize characters before tokenizer
Key: LUCENE-1466
URL: https://issues.apache.org/jira/browse/LUCENE-1466
Project: Lucene - Java
Issue Type: New Feature
Components: Analysis
Affects Versions: 2.4
Reporter: Koji Sekiguchi
Assignee: Michael McCandless
Priority: Minor
Fix For: 2.9
Attachments: LUCENE-1466.patch, LUCENE-1466.patch, LUCENE-1466.patch

This proposes to import CharFilter that has been introduced in Solr 1.4. Please see for the details:
- SOLR-822
- http://www.nabble.com/Proposal-for-introducing-CharFilter-to20327007.html
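The idiom in question is presumably a static factory that wraps a plain Reader as a CharStream (hence the now-private CharReader constructor), so CharFilters can be chained in front of any Tokenizer. A sketch of how that reads for a caller, with the mapping contents purely illustrative:

import java.io.Reader;
import java.io.StringReader;
import org.apache.lucene.analysis.CharReader;
import org.apache.lucene.analysis.CharStream;
import org.apache.lucene.analysis.MappingCharFilter;
import org.apache.lucene.analysis.NormalizeCharMap;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

public class CharFilterDemo {
    public static void main(String[] args) throws Exception {
        // normalize characters *before* tokenization
        NormalizeCharMap map = new NormalizeCharMap();
        map.add("\u00e9", "e"); // fold e-acute to plain e (illustrative)

        Reader input = new StringReader("r\u00e9sum\u00e9 writing");
        // CharReader.get(...) is the idiom: static factory instead of a constructor
        CharStream cs = new MappingCharFilter(map, CharReader.get(input));
        TokenStream ts = new WhitespaceTokenizer(cs);
        // ... consume ts as usual; the tokenizer now sees "resume writing",
        // while the CharStream keeps offsets corrected back to the original text
    }
}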
[jira] Issue Comment Edited: (LUCENE-1466) CharFilter - normalize characters before tokenizer
[ https://issues.apache.org/jira/browse/LUCENE-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721588#action_12721588 ]

Koji Sekiguchi edited comment on LUCENE-1466 at 6/18/09 7:04 PM:

Updated patch attached.
- sync trunk (smart chinese analyzer (LUCENE-1629), etc.)
- added a useful idiom to get CharStream and made the CharReader constructor private

was (Author: koji):
updated patch attached.
- sync trunk (smart chinese analyzer(LUCENE-1629), etc.)
- added a useful idiom to get ChatStream and make private CharReader constructor
[jira] Updated: (LUCENE-1692) Contrib analyzers need tests
[ https://issues.apache.org/jira/browse/LUCENE-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-1692:

Attachment: LUCENE-1692.txt

Patch with the two one-line fixes:
1. Fix offsets for the Thai analyzer so highlighting, etc. will work.
2. Use the stopwords list by default for SmartChineseAnalyzer so punctuation isn't indexed in a strange way.

I updated the testcases to reflect these.
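On the testing theme of this issue, here is a generic offset sanity check for any analyzer whose tokens are verbatim slices of the input (the helper name and structure are hypothetical, not from the attached patch):

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class OffsetSanity {
    // For analyzers that don't rewrite token text, each token's offsets
    // must point back at the matching slice of the original input, or
    // downstream consumers such as highlighting break.
    public static void assertOffsetsConsistent(Analyzer a, String input)
            throws Exception {
        TokenStream ts = a.tokenStream("f", new StringReader(input));
        TermAttribute term = (TermAttribute) ts.addAttribute(TermAttribute.class);
        OffsetAttribute off = (OffsetAttribute) ts.addAttribute(OffsetAttribute.class);
        while (ts.incrementToken()) {
            String slice = input.substring(off.startOffset(), off.endOffset());
            if (!slice.equals(term.term()))
                throw new AssertionError("token '" + term.term()
                    + "' does not match input slice '" + slice + "'");
        }
    }
}

A check along these lines, run over all contrib analyzers, is exactly the kind of attribute-level test this issue asks for: it exercises offsets rather than just token text, so an offset regression like the Thai one fails loudly instead of only showing up as broken highlighting.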