RE: latest lucene update
Did you also test that the speed went back to normal with the latest fix in trunk (without modifying Solr code)? I ran the Solr tests with an updated lucene-core-2.9.jar here, but I was not able to find out which of the tests had the big slowdown. I only noticed some speedup in a few tests related to search.

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

-----Original Message-----
From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
Sent: Thursday, July 16, 2009 2:57 AM
To: java-dev@lucene.apache.org
Subject: Re: latest lucene update

Thanks guys, I had actually meant this message to go to solr-dev... hence the "but I think we should implement the new methods anyway." I've implemented them, and the performance has returned to normal.

-Yonik
http://www.lucidimagination.com

On Wed, Jul 15, 2009 at 4:00 PM, Yonik Seeley <yo...@lucidimagination.com> wrote:
> Running the Solr unit tests seems a fair bit slower now. I think the root cause may be this:
> http://search.lucidimagination.com/search/document/a8bd12c3b87e98a3/speed_of_booleanqueries_on_2_9
> That may be fixed, but I think we should implement the new methods anyway.
> I'm also surprised that more changes weren't necessary to get the latest Lucene to work... one thing in particular is docs out of order - Solr currently requires them in order to correctly create DocSet instances, and I'm not sure this is the case any more. I'll look into it.
> -Yonik
> http://www.lucidimagination.com
[jira] Updated: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Busch updated LUCENE-1693:
----------------------------------
Attachment: lucene-1693.patch

This is basically your last patch with these changes:

- I removed AttributeSource.setAttributeFactory(factory). Since we now have the constructor that takes the factory as an argument, there should be no need to ever change the factory after a TokenStream has been created. It would also lead to problems regarding e.g. Tee/Sink: a user could add attributes to the Tee, then change the factory, then create the sink. How could we then create the same attribute impls for the sink? So I think the right thing to do is to not allow changing the factory after the stream is instantiated.

- I added an initial (untested) version of TeeSinkTokenFilter to demonstrate how I think it should work now. I'll finish it tomorrow or Friday (add more javadocs and a unit test). I'll also add the CachingAttributeTokenFilter, which is essentially almost the same as the new inner class of TeeSinkTokenFilter. When I have CATF, the inner class can probably just extend it.

AttributeSource/TokenStream API improvements
--------------------------------------------
Key: LUCENE-1693
URL: https://issues.apache.org/jira/browse/LUCENE-1693
Project: Lucene - Java
Issue Type: Improvement
Components: Analysis
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
Fix For: 2.9
Attachments: lucene-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, lucene-1693.patch, TestAPIBackwardsCompatibility.java, TestCompatibility.java, TestCompatibility.java, TestCompatibility.java, TestCompatibility.java

This patch makes the following improvements to AttributeSource and TokenStream/Filter:

- removes the set/getUseNewAPI() methods (including the standard ones). Instead, by default incrementToken() throws a subclass of UnsupportedOperationException. The indexer tries to call incrementToken() once initially to see if the exception is thrown; if so, it falls back to the old API.

- introduces interfaces for all Attributes. The corresponding implementations have the postfix 'Impl', e.g. TermAttribute and TermAttributeImpl. AttributeSource now has a factory for creating the Attribute instances; the default implementation looks for implementing classes with the postfix 'Impl'. Token now implements all 6 TokenAttribute interfaces.

- new method added to AttributeSource: addAttributeImpl(AttributeImpl). Using reflection it walks up the class hierarchy of the passed-in object and finds all interfaces that the class or its superclasses implement and that extend the Attribute interface. It then adds the interface-instance mappings to the attribute map for each of the found interfaces.

- AttributeImpl now has a default implementation of toString() that uses reflection to print out the values of the attributes in a default formatting. This makes it a bit easier to implement AttributeImpl, because toString() was declared abstract before.

- Cloning is now done much more efficiently in captureState. The method figures out which unique AttributeImpl instances are contained as values in the attributes map, because those are the ones that need to be cloned. It creates a single linked list that supports deep cloning (in the inner class AttributeSource.State). AttributeSource keeps track of when this state changes, i.e. whenever new attributes are added to the AttributeSource. Only in that case will captureState recompute the state; otherwise it will simply clone the precomputed state and return the clone. restoreState(AttributeSource.State) walks the linked list and uses the copyTo() method of AttributeImpl to copy all values over into the attribute that the source stream (e.g. SinkTokenizer) uses.

The cloning performance can be greatly improved if multiple AttributeImpl instances are not used in one TokenStream. A user can e.g. simply add a Token instance to the stream instead of the individual attributes, or implement a subclass of AttributeImpl that implements exactly the Attribute interfaces needed. I think addAttributeImpl should be considered an expert API, as this manual optimization is only needed if cloning performance is crucial. I ran some quick performance tests using Tee/Sink tokenizers (which do cloning) and the performance was roughly 20% faster with the new API. I'll run some more performance tests and post more numbers then.

Note also that when we add serialization to the Attributes, e.g. for
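As context for the API changes described above, here is a minimal sketch of what a consumer of the attribute-based TokenStream API looks like against the Lucene 2.9 interfaces. It is illustrative only, not code from the patch:

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;

    class AttributeApiSketch {
      // The consumer registers the attributes it cares about once, then calls
      // incrementToken() instead of next(Token); the attribute instance is
      // updated in place on each call.
      static void consume(TokenStream stream) throws IOException {
        TermAttribute termAtt = (TermAttribute) stream.addAttribute(TermAttribute.class);
        while (stream.incrementToken()) {
          System.out.println(termAtt.term());
        }
      }
    }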
[jira] Updated: (LUCENE-1566) Large Lucene index can hit false OOM due to Sun JRE issue
[ https://issues.apache.org/jira/browse/LUCENE-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-1566:
------------------------------------
Attachment: LUCENE_1566_IndexInput_Changes.patch

* Set chunkSize to Integer.MAX_VALUE on 64-bit JVMs
* Removed the 64-bit JVM condition, as chunkSize is set to the maximum in the 64-bit case
* Added CHANGES.txt to the patch

@Mike: once you commit this change I will close this issue.

Simon

Large Lucene index can hit false OOM due to Sun JRE issue
----------------------------------------------------------
Key: LUCENE-1566
URL: https://issues.apache.org/jira/browse/LUCENE-1566
Project: Lucene - Java
Issue Type: Bug
Components: Index
Affects Versions: 2.4.1
Reporter: Michael McCandless
Assignee: Simon Willnauer
Priority: Minor
Fix For: 2.9
Attachments: LUCENE-1566.patch, LUCENE-1566.patch, LUCENE_1566_IndexInput.patch, LUCENE_1566_IndexInput_Changes.patch

This is not a Lucene issue, but I want to open it so future Google diggers can more easily find it. There's a nasty bug in Sun's JRE: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6478546

The gist seems to be that if you try to read a large number of bytes (e.g. 200 MB) during a single RandomAccessFile.read call, you can incorrectly hit OOM. Lucene does this with norms, since we read one byte per doc per field with norms, as a contiguous array of length maxDoc(). The workaround was a custom patch to do large file reads as several smaller reads. Background here: http://www.nabble.com/problems-with-large-Lucene-index-td22347854.html
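For readers who land on this issue via search, the workaround amounts to splitting one very large read into bounded chunks. A rough sketch of the idea follows; this is not the actual Lucene patch (which lives in the FSDirectory IndexInput code), and the chunk size shown is only illustrative:

    import java.io.EOFException;
    import java.io.IOException;
    import java.io.RandomAccessFile;

    class ChunkedReadSketch {
      // Illustrative value; per the comment above, the patch sets chunkSize to
      // Integer.MAX_VALUE on 64-bit JVMs and a smaller value on 32-bit JVMs.
      private static final int CHUNK_SIZE = 100 * 1024 * 1024;

      // Fill b[offset..offset+len) using several smaller reads instead of one
      // huge RandomAccessFile.read call, which can trigger Sun bug 6478546.
      static void readFully(RandomAccessFile file, byte[] b, int offset, int len) throws IOException {
        int total = 0;
        while (total < len) {
          int toRead = Math.min(CHUNK_SIZE, len - total);
          int got = file.read(b, offset + total, toRead);
          if (got < 0) {
            throw new EOFException("read past EOF");
          }
          total += got;
        }
      }
    }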
[jira] Created: (LUCENE-1747) Contrib/Spatial needs code cleanup before release
Contrib/Spatial needs code cleanup before release
--------------------------------------------------
Key: LUCENE-1747
URL: https://issues.apache.org/jira/browse/LUCENE-1747
Project: Lucene - Java
Issue Type: Improvement
Components: contrib/spatial
Affects Versions: 2.9
Reporter: Simon Willnauer
Assignee: Simon Willnauer
Priority: Minor
Fix For: 2.9

I had a brief look at the spatial sources and found quite a few warnings, main methods, loggers, immutable classes without final members, unused variables, unused methods, etc. Once Mike has committed https://issues.apache.org/jira/browse/LUCENE-1505 I will start cleaning this up a bit. It seems there are not many unit tests in this project either; I might open an issue for 3.0 / 3.1 for that later.
[jira] Updated: (LUCENE-1567) New flexible query parser
[ https://issues.apache.org/jira/browse/LUCENE-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adriano Crestani updated LUCENE-1567:
-------------------------------------
Attachment: lucene_trunk_FlexQueryParser_2009july16_v7.patch

Here are some updates for the new query parser:

- support for setting the minimum fuzzy similarity was added to the configuration handler
- get methods were added to the configuration handler, so users accustomed to the old query parser can easily access the configuration in the old way
- renamed everything referencing "lucene2" to "original"
- removed one author tag
- improved javadoc documentation
- added a constructor to LuceneQueryParserHelper that accepts an Analyzer as an argument; I think Lucene users are used to creating a query parser and also passing the analyzer

That's it :)

I have also noticed that when building with "ant build-contrib" the .properties files are not copied into the jar. The new query parser reads its NLS messages from a property file, and I'm getting some message warnings when running the tests. Is anybody getting the same warnings?

New flexible query parser
--------------------------
Key: LUCENE-1567
URL: https://issues.apache.org/jira/browse/LUCENE-1567
Project: Lucene - Java
Issue Type: New Feature
Components: QueryParser
Environment: N/A
Reporter: Luis Alves
Assignee: Grant Ingersoll
Fix For: 2.9
Attachments: lucene_1567_adriano_crestani_07_13_2009.patch, lucene_trunk_FlexQueryParser_2009July09_v4.patch, lucene_trunk_FlexQueryParser_2009July10_v5.patch, lucene_trunk_FlexQueryParser_2009july15_v6.patch, lucene_trunk_FlexQueryParser_2009july16_v7.patch, lucene_trunk_FlexQueryParser_2009March24.patch, lucene_trunk_FlexQueryParser_2009March26_v3.patch, new_query_parser_src.tar, QueryParser_restructure_meetup_june2009_v2.pdf

From the "New flexible query parser" thread by Michael Busch:

In my team at IBM we have used a different query parser than Lucene's in our products for quite a while. Recently we spent a significant amount of time refactoring the code and designing a very generic architecture, so that this query parser can easily be used for different products with varying query syntaxes. This work was originally driven by Andreas Neumann (who, however, left our team); most of the code was written by Luis Alves, who has been a bit active in Lucene in the past, and Adriano Campos, who joined our team at IBM half a year ago. Adriano is an Apache committer and PMC member on the Tuscany project and is getting familiar with Lucene now too.

We think this code is much more flexible and extensible than the current Lucene query parser, and would therefore like to contribute it to Lucene. I'd like to give a very brief architecture overview here; Adriano and Luis can then answer more detailed questions as they're much more familiar with the code than I am.

The goal was to separate the syntax and semantics of a query. E.g. 'a AND b', '+a +b', 'AND(a,b)' could be different syntaxes for the same query. We distinguish the semantics of the different query components, e.g. whether and how to tokenize/lemmatize/normalize the different terms or which Query objects to create for the terms. We wanted to be able to write a parser with a new syntax, while reusing the underlying semantics, as quickly as possible. In fact, Adriano is currently working on a 100% Lucene-syntax compatible implementation to make it easy for people who are using Lucene's query parser to switch.

The query parser has three layers and its core is what we call the QueryNodeTree. It is a tree that initially represents the syntax of the original query, e.g. for 'a AND b':

      AND
     /   \
    A     B

The three layers are:
1. QueryParser
2. QueryNodeProcessor
3. QueryBuilder

1. The upper layer is the parsing layer, which simply transforms the query text string into a QueryNodeTree. Currently our implementations of this layer use javacc.

2. The query node processors do most of the work. This layer is in fact a configurable chain of processors. Each processor can walk the tree and modify nodes or even the tree's structure. That makes it possible to e.g. do query optimization before the query is executed or to tokenize terms.

3. The third layer is also a configurable chain of builders, which transform the QueryNodeTree into Lucene Query objects.

Furthermore the query parser uses flexible configuration objects, which are based on AttributeSource/Attribute. It also uses message classes that allow attaching resource bundles. This makes it possible to translate messages, which is an important feature of a query parser. This design allows us to develop different query syntaxes very quickly. Adriano wrote the Lucene-compatible syntax in a matter of hours, and the underlying processors and builders in
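To make the layering a bit more concrete, here is a deliberately simplified, hypothetical sketch of a node tree for 'a AND b' and a builder that turns it into a Lucene Query. None of these class names come from the contributed parser (which also has the processor layer in between); only Term, TermQuery, BooleanQuery and BooleanClause are real Lucene classes:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    // Layer 1 (parsing) would turn 'a AND b', '+a +b' or 'AND(a,b)' into the
    // same tree: new AndNode(new TermNode("a"), new TermNode("b")).
    abstract class Node {}
    final class TermNode extends Node { final String text; TermNode(String t) { text = t; } }
    final class AndNode extends Node { final Node left, right; AndNode(Node l, Node r) { left = l; right = r; } }

    // Layer 3 (building) walks the (already processed) tree and creates Query objects.
    final class SimpleQueryBuilder {
      Query build(Node node, String field) {
        if (node instanceof TermNode) {
          return new TermQuery(new Term(field, ((TermNode) node).text));
        }
        AndNode and = (AndNode) node;
        BooleanQuery bq = new BooleanQuery();
        bq.add(build(and.left, field), BooleanClause.Occur.MUST);
        bq.add(build(and.right, field), BooleanClause.Occur.MUST);
        return bq;
      }
    }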
[jira] Commented: (LUCENE-1566) Large Lucene index can hit false OOM due to Sun JRE issue
[ https://issues.apache.org/jira/browse/LUCENE-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731880#action_12731880 ]

Michael McCandless commented on LUCENE-1566:
--------------------------------------------

SimpleFSDirectory is missing from the last patch?
[jira] Commented: (LUCENE-1566) Large Lucene index can hit false OOM due to Sun JRE issue
[ https://issues.apache.org/jira/browse/LUCENE-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731886#action_12731886 ]

Simon Willnauer commented on LUCENE-1566:
-----------------------------------------

bq. SimpleFSDirectory is missing from the last patch?

Oops! :)
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731893#action_12731893 ]

Uwe Schindler commented on LUCENE-1693:
---------------------------------------

OK, looks good. I think you will go to bed now, so our work will not collide. If you start programming again, let me know, and I will post a patch first (which makes merging simpler). TortoiseSVN has a problem with merging added files, so when applying your patch I have to remove them first :-(

Some comments:

- TeeSinkTokenFilter looks good. I think we should also add a test for it (in principle the version of TestTeeTokenFilter from current trunk, not the one reverted to the old API in the current patch).
- I do not completely understand why this WeakReference is needed between Tee and Sink. If it is needed, the code may fail with an NPE when Reference.get() returns null. The idea is that one can create a Sink for the Tee and throw the Sink away, and the Tee would then simply stop passing the attributes to the sink? If that is the case, the check for Reference.get() == null is really missing.
- Should I implement CachingAttributesFilter as a replacement for CachingTokenFilter, or will you do it together with TeeSink?

I will now start to add all the finals to the missing core analyzers.

bq. The only small performance improvement we should probably make is to avoid checking which method in TokenStream is overridden when onlyUseNewAPI==true

I could disable this for next() and next(Token). In the case of incrementToken(), it should really check that it is implemented, because not doing so would fail hard or create endless loops. So that check should be there in all cases. But if onlyUseNewAPI is enabled, I could simply define hasNext and hasReusableNext = false. I will do this.
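Regarding the WeakReference question above, the missing null check would look roughly like this. It is a hypothetical sketch, not code from the patch; only AttributeSource, its State class, and restoreState come from Lucene, and the method and field names are made up:

    import java.lang.ref.WeakReference;
    import java.util.Iterator;
    import java.util.LinkedList;
    import java.util.List;
    import org.apache.lucene.util.AttributeSource;

    // If the Tee holds its sinks only via WeakReference, every use of
    // Reference.get() must tolerate a sink that was garbage collected.
    class TeeSketch {
      private final List<WeakReference<AttributeSource>> sinks =
          new LinkedList<WeakReference<AttributeSource>>();

      void passStateToSinks(AttributeSource.State state) {
        Iterator<WeakReference<AttributeSource>> it = sinks.iterator();
        while (it.hasNext()) {
          AttributeSource sink = it.next().get();
          if (sink == null) {
            it.remove();            // the user threw the sink away; stop feeding it
            continue;
          }
          sink.restoreState(state); // stand-in for however the sink buffers states
        }
      }
    }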
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731896#action_12731896 ]

Grant Ingersoll commented on LUCENE-1693:
-----------------------------------------

Favor to ask: when this is ready to commit, can you give a few days' notice so that the rest of us can look at it before committing? I've been keeping up with the comments, but not the patches.
Re: Search in non-linguistic text
Ack... Clicked on the wrong group. Sorry - I'll move it.
[jira] Created: (LUCENE-1748) getPayloadSpans on org.apache.lucene.search.spans should be abstract
getPayloadSpans on org.apache.lucene.search.spans should be abstract
---------------------------------------------------------------------
Key: LUCENE-1748
URL: https://issues.apache.org/jira/browse/LUCENE-1748
Project: Lucene - Java
Issue Type: Bug
Components: Query/Scoring
Affects Versions: 2.4.1, 2.4
Environment: all
Reporter: Hugh Cayless
Fix For: 2.4.2

I just spent a long time tracking down a bug resulting from upgrading to Lucene 2.4.1 on a project that implements some SpanQuerys of its own and was written against 2.3. Since the project's SpanQuerys didn't implement getPayloadSpans, the call to that method went to SpanQuery.getPayloadSpans, which returned null and caused a NullPointerException in the Lucene code, far away from the actual source of the problem. It would be much better for this kind of thing to show up at compile time, I think. Thanks!
[jira] Updated: (LUCENE-1748) getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract
[ https://issues.apache.org/jira/browse/LUCENE-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hugh Cayless updated LUCENE-1748:
---------------------------------
Summary: getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract (was: getPayloadSpans on org.apache.lucene.search.spans should be abstract)
[jira] Commented: (LUCENE-1748) getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract
[ https://issues.apache.org/jira/browse/LUCENE-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731939#action_12731939 ]

Earwin Burrfoot commented on LUCENE-1748:
-----------------------------------------

bq. Shouldn't it throw a runtime exception (unsupported operation?) or something?

What is the difference between adding an abstract method and adding a method that throws an exception, with regard to jar drop-in back compat? In both cases, when you drop your new jar in you get an exception; it's just that in the latter case the exception is deferred.
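For readers following the thread, here is a schematic contrast of the two options being weighed, assuming Lucene 2.4's Spans/PayloadSpans classes; this is not the real SpanQuery source, just an illustration of the trade-off:

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.spans.PayloadSpans;
    import org.apache.lucene.search.spans.Spans;

    public abstract class SpanQuerySketch extends Query {
      public abstract Spans getSpans(IndexReader reader) throws IOException;

      // Option A (this issue's request): declare it abstract so that a missing
      // implementation fails at compile time.
      // public abstract PayloadSpans getPayloadSpans(IndexReader reader) throws IOException;

      // Option B (the runtime alternative): keep a concrete default, but fail
      // loudly instead of returning null and causing a NullPointerException
      // far away from the real problem.
      public PayloadSpans getPayloadSpans(IndexReader reader) throws IOException {
        throw new UnsupportedOperationException(
            getClass().getName() + " does not implement getPayloadSpans()");
      }
    }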
[jira] Commented: (LUCENE-1748) getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract
[ https://issues.apache.org/jira/browse/LUCENE-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731940#action_12731940 ]

Hugh Cayless commented on LUCENE-1748:
--------------------------------------

Ah. I figured it would be something like that. Yes, if abstract isn't possible, an UnsupportedOperationException would at least get closer to the source of the problem.
[jira] Issue Comment Edited: (LUCENE-1748) getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract
[ https://issues.apache.org/jira/browse/LUCENE-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731940#action_12731940 ]

Hugh Cayless edited comment on LUCENE-1748 at 7/16/09 6:43 AM:
---------------------------------------------------------------

Ah. I figured it would be something like that. Yes, if abstract isn't possible, an UnsupportedOperationException would at least get closer to the source of the problem.

From my perspective at least, backwards compatibility is already broken, since Lucene doesn't work with SpanQuerys that don't implement getPayloadSpans, but I understand y'all will have different requirements in this regard.
Re: [jira] Commented: (LUCENE-1748) getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract
> bq. Shouldn't it throw a runtime exception (unsupported operation?) or something?
>
> What is the difference between adding an abstract method and adding a method that throws an exception, with regard to jar drop-in back compat? In both cases, when you drop your new jar in you get an exception; it's just that in the latter case the exception is deferred.

Yeah, it's dicey - I suppose the idea is that, if you used the code as you used to, it wouldn't try to call getPayloadSpans? And so if you kept using the non-payload-spans functionality you would be set, and if you tried to use payload spans you would get an exception saying the class needed to be updated.

But if you make it abstract, we lose jar drop-in (I know I've read we don't have it for this release anyway) and everyone has to implement the method. At least with the exception, if you are using the class as you used to, you can continue to do so with no work? Not that I've considered it for very long at the moment.

I know, I see your point - this back-compat stuff is always dicey - that's why I throw it out there with a question mark - hopefully others will continue to chime in.

--
- Mark
http://www.lucidimagination.com
Re: latest lucene update
On Thu, Jul 16, 2009 at 2:11 AM, Uwe Schindler <u...@thetaphi.de> wrote:
> Did you also test that the speed went back to normal with the latest fix in trunk (without modifying Solr code)?

I didn't - I was already part way through implementing advance() in Solr. I'm sure the advance() fix in Lucene would have worked too, though.

-Yonik
http://www.lucidimagination.com
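For readers not following the Lucene 2.9 API change: "implementing the new methods" means providing nextDoc()/advance() directly on the DocIdSetIterator rather than relying on the deprecated next()/skipTo() bridge. A hedged sketch of what such an iterator over a sorted int[] (roughly the shape of a sorted DocSet slice) might look like; this is not Solr's actual patch:

    import org.apache.lucene.search.DocIdSetIterator;

    // The thread above is about BooleanQuery conjunctions being slow when these
    // methods are not implemented directly, because conjunction scoring calls
    // advance() heavily.
    class SortedIntDocIdSetIterator extends DocIdSetIterator {
      private final int[] docs;   // sorted, distinct doc ids
      private int idx = -1;
      private int doc = -1;

      SortedIntDocIdSetIterator(int[] docs) { this.docs = docs; }

      public int docID() { return doc; }

      public int nextDoc() {
        idx++;
        return doc = (idx < docs.length) ? docs[idx] : NO_MORE_DOCS;
      }

      public int advance(int target) {
        // a linear skip is enough for a sketch; a real implementation could use
        // binary search over the remaining range
        do {
          nextDoc();
        } while (doc < target);
        return doc;
      }
    }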
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731947#action_12731947 ]

Uwe Schindler commented on LUCENE-1693:
---------------------------------------

I forgot: I also implemented the final next() methods in all non-final classes.
[jira] Updated: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-1693:
----------------------------------
Attachment: LUCENE-1693.patch

New patch with some more work. First the fantastic news: as CachingTokenFilter has no API to access the cached attributes/tokens directly, it does not need to be deprecated; it just switches the internal and hidden implementation to incrementToken() and attributes. I also added an additional test to the BW test case that checks whether the caching also works for your strange POSTokens. And it works! You can even mix the consumers, e.g. first use the new API to cache tokens and then replay using the old API. Really cool.

The reason the POSToken was not preserved in the past was an error in TokenWrapper.copyTo(). This method created a new Token and copied the contents into it using reinit(). Now it simply creates a clone and lets delegate point to it (this is how the caching worked before). In principle Tee/SinkTokenizer could also work like this; the only problem with that class is that it has a public API that exposes the Token instances to the outside. Because of that, there is no way around deprecating it.

Your new TeeSinkTokenFilter looks good, it only had one problem: it used addAttributeImpl to add the attribute of the Tee to the newly created Sink. Because of this, the sink got the same instance the parent added. With useOnlyNewAPI, this does not have an effect for the standard attributes, as the ctor already created a Token instance as implementation and added it to the stream, so addAttributeImpl had no effect. I changed this to use the getAttributeClassesIterator and added a new attribute instance for each attribute to the sink using addAttribute. As the factory is the same, the attributes are generated in the same way. TeeSinkTokenizer would only *not* work correctly if somebody adds a custom instance using addAttributeImpl in the ctor of another filter in the chain. In that case, the factory would create another impl and restoreState throws IAE. In backwards-compatibility mode (the default) the newly created sink and also the tee always have the default TokenWrapper implementation, so state restoring also works. You only have a problem if you change useOnlyNewAPI in between (which would always create corrupt chains). Another idea would be to clone all attribute impls and then add them to the sink - the factory would then not be used?

I started to create a test for the new TeeSinkTokenFilter, but there is one thing missing: the original test created a subclass of SinkTokenizer, overriding add() to filter the tokens added to the sink. This functionality is missing with the new API. The correct workaround would be to plug a filter around the sink and filter the tokens there? The problem then is that the cache always also contains tokens that are not needed (the old impl would not store them in the sink). Maybe we add the filter to the TeeSinkTokenFilter (taking a State, which would not work, as the contents of State are package-private?). Something else? Or leave it as it is and let the user plug the filter on top of the sink (I prefer this)?
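On that last point (letting the user plug a filter on top of the sink), such a filter would just be an ordinary TokenFilter written against the new API. A sketch with made-up class name and filtering rule, not taken from the patch; only TokenFilter, TokenStream and TermAttribute are existing Lucene 2.9 classes:

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;

    // Instead of overriding SinkTokenizer.add(), filter the sink's output with a
    // normal TokenFilter chained on top of it.
    public final class SinkFilterSketch extends TokenFilter {
      private final TermAttribute termAtt;

      public SinkFilterSketch(TokenStream sink) {
        super(sink);
        termAtt = (TermAttribute) addAttribute(TermAttribute.class);
      }

      public boolean incrementToken() throws IOException {
        while (input.incrementToken()) {
          if (!"the".equals(termAtt.term())) {  // keep everything except "the"
            return true;
          }
        }
        return false;
      }
    }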
RE: latest lucene update
OK. At least I saw a speedup during my tests :). I have the logs somewhere. Which tests were negatively affected? Then I can look into the before/after logs.

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

-----Original Message-----
From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
Sent: Thursday, July 16, 2009 3:53 PM
To: java-dev@lucene.apache.org
Subject: Re: latest lucene update

On Thu, Jul 16, 2009 at 2:11 AM, Uwe Schindler <u...@thetaphi.de> wrote:
> Did you also test that the speed went back to normal with the latest fix in trunk (without modifying Solr code)?

I didn't - I was already part way through implementing advance() in Solr. I'm sure the advance() fix in Lucene would have worked too, though.

-Yonik
http://www.lucidimagination.com
[jira] Commented: (LUCENE-1748) getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract
[ https://issues.apache.org/jira/browse/LUCENE-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731972#action_12731972 ]

Earwin Burrfoot commented on LUCENE-1748:
-----------------------------------------

I took a glance at the code; the whole getPayloadSpans deal is a heresy. Each and every implementation looks like:

    public PayloadSpans getPayloadSpans(IndexReader reader) throws IOException {
      return (PayloadSpans) getSpans(reader);
    }

Moving it to the base SpanQuery is equally broken as the current solution, but yields much less strange copy-paste.

I also have a faint feeling that if you expose a method like "ClassA method();" you can then upgrade it to "SubclassOfClassA method();" without breaking drop-in compatibility, which renders the getPayloadSpans vs. getSpans alternative totally useless.
[jira] Commented: (LUCENE-1748) getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract
[ https://issues.apache.org/jira/browse/LUCENE-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731971#action_12731971 ]

Mark Miller commented on LUCENE-1748:
-------------------------------------

bq. From my perspective at least, backwards compatibility is already broken, since Lucene doesn't work with SpanQuerys that don't implement getPayloadSpans

Ah, I see - I hadn't looked at this issue in a long time. It looks like you must implement it to do much of anything, right? We need to address this better - perhaps abstract is the way to go.
[jira] Issue Comment Edited: (LUCENE-1748) getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract
[ https://issues.apache.org/jira/browse/LUCENE-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731972#action_12731972 ]

Earwin Burrfoot edited comment on LUCENE-1748 at 7/16/09 7:54 AM:
------------------------------------------------------------------

I took a glance at the code; the whole getPayloadSpans deal is a heresy. Each and every implementation looks like:

    public PayloadSpans getPayloadSpans(IndexReader reader) throws IOException {
      return (PayloadSpans) getSpans(reader);
    }

Moving it to the base SpanQuery is equally broken as the current solution, but yields much less strange copy-paste.

-I also have a faint feeling that if you expose a method like "ClassA method();" you can then upgrade it to "SubclassOfClassA method();" without breaking drop-in compatibility, which renders the getPayloadSpans vs. getSpans alternative totally useless.-

Ok, I'm wrong.
[jira] Commented: (LUCENE-1748) getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract
[ https://issues.apache.org/jira/browse/LUCENE-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12731984#action_12731984 ] Mark Miller commented on LUCENE-1748: - bq. the whole getPayloadSpans deal is a herecy. heh. don't dig too deep - it also has to load all of the payloads as it matches, whether you ask for them or not (if they exist). The ordered or unordered matcher also has to load them and dump them in certain situations when they are not actually needed. Let's look at what we need to do to fix this - we don't have to worry too much about back compat, because it's already pretty screwed I think. getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract -- Key: LUCENE-1748 URL: https://issues.apache.org/jira/browse/LUCENE-1748 Project: Lucene - Java Issue Type: Bug Components: Query/Scoring Affects Versions: 2.4, 2.4.1 Environment: all Reporter: Hugh Cayless Fix For: 2.4.2 I just spent a long time tracking down a bug resulting from upgrading to Lucene 2.4.1 on a project that implements some SpanQuerys of its own and was written against 2.3. Since the project's SpanQuerys didn't implement getPayloadSpans, the call to that method went to SpanQuery.getPayloadSpans which returned null and caused a NullPointerException in the Lucene code, far away from the actual source of the problem. It would be much better for this kind of thing to show up at compile time, I think. Thanks! -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: DISI semantics
Uwe / Yonik, DISI's class javadoc states this: Implementations of this class are expected to consider {...@link Integer#MAX_VALUE} as an invalid value. Therefore last cannot be set to MAX_VAL in the above example, if it wants to be a DISI at least. Phew ... that was a long issue. I was able to find the conversation on -1 vs. any value before the first there: https://issues.apache.org/jira/browse/LUCENE-1614?focusedCommentId=12714298page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12714298 That link points to my response to Mike w/ why I think it'd be wrong to relax the policy of docId(). You can read 1-2 comments up and down to get the full conversation. In short, if we don't document clearly what is returned by docId() before the iteration started, it will be hard for a code which receives a DISI to determine whether to call nextDoc() or start by collecting what docId() returns. Can be worked around though, but I think the API is clear now and does not leave room for interpretation. Shai On Thu, Jul 16, 2009 at 5:29 PM, Yonik Seeley yo...@lucidimagination.comwrote: On Wed, Jul 15, 2009 at 6:55 PM, Michael McCandlessluc...@mikemccandless.com wrote: I believe we debated allowing the DISI to return any docID less than its first real docID, not only -1, as you've done here, but I think Shai found something wrong with that IIRC... but I can't find this discussion. Shai do you remember / can you find this past discussion / am I just hallucinating? I don't know if it exists in Lucene, but I guess I can see the benefit of only having -1 or NO_MORE_DOCS. Consider a simplified ConjunctionScorer that didn't do anything in the constructor but simply skipped one iterator and then did the logic of doNext() until they all matched. One could get a false hit with my theoretical SliceDocIdSetIterator above. -Yonik http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1748) getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract
[ https://issues.apache.org/jira/browse/LUCENE-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12731979#action_12731979 ] Mark Miller commented on LUCENE-1748: - Okay, so it says: Implementing classes that want access to the payloads will need to implement this. But in reality, if you don't implement it, it looks like you're screwed if you add it to the container SpanQueries, whether you access the payloads or not. getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract -- Key: LUCENE-1748 URL: https://issues.apache.org/jira/browse/LUCENE-1748 Project: Lucene - Java Issue Type: Bug Components: Query/Scoring Affects Versions: 2.4, 2.4.1 Environment: all Reporter: Hugh Cayless Fix For: 2.4.2 I just spent a long time tracking down a bug resulting from upgrading to Lucene 2.4.1 on a project that implements some SpanQuerys of its own and was written against 2.3. Since the project's SpanQuerys didn't implement getPayloadSpans, the call to that method went to SpanQuery.getPayloadSpans which returned null and caused a NullPointerException in the Lucene code, far away from the actual source of the problem. It would be much better for this kind of thing to show up at compile time, I think. Thanks! -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1566) Large Lucene index can hit false OOM due to Sun JRE issue
[ https://issues.apache.org/jira/browse/LUCENE-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-1566: --- Attachment: LUCENE-1566.patch OK I reworked the patch some, tweaking javadocs, changes, etc., and simplifying the loops that read the bytes inside NIOFSDir SimpleFSDir. I think it's ready to commit. Simon can you take a look? Thanks. Large Lucene index can hit false OOM due to Sun JRE issue - Key: LUCENE-1566 URL: https://issues.apache.org/jira/browse/LUCENE-1566 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 2.4.1 Reporter: Michael McCandless Assignee: Simon Willnauer Priority: Minor Fix For: 2.9 Attachments: LUCENE-1566.patch, LUCENE-1566.patch, LUCENE-1566.patch, LUCENE_1566_IndexInput.patch, LUCENE_1566_IndexInput_Changes.patch, LUCENE_1566_IndexInput_Changes.patch This is not a Lucene issue, but I want to open this so future google diggers can more easily find it. There's this nasty bug in Sun's JRE: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6478546 The gist seems to be, if you try to read a large (eg 200 MB) number of bytes during a single RandomAccessFile.read call, you can incorrectly hit OOM. Lucene does this, with norms, since we read in one byte per doc per field with norms, as a contiguous array of length maxDoc(). The workaround was a custom patch to do large file reads as several smaller reads. Background here: http://www.nabble.com/problems-with-large-Lucene-index-td22347854.html -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
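For context on the "several smaller reads" workaround mentioned in the description, an illustrative sketch follows: read a large buffer in bounded chunks instead of one huge RandomAccessFile.read call. The 64MB chunk size is an arbitrary value for illustration; the actual chunk size and structure in the LUCENE-1566 patch may differ.

{code}
import java.io.IOException;
import java.io.RandomAccessFile;

// Never ask the JRE for hundreds of MB in a single read call, which can trigger
// the spurious OutOfMemoryError described in Sun bug 6478546.
public class ChunkedReads {
  private static final int CHUNK_SIZE = 64 * 1024 * 1024; // illustrative, not the patch's value

  public static void readFully(RandomAccessFile file, byte[] b, int offset, int len)
      throws IOException {
    while (len > 0) {
      int toRead = Math.min(len, CHUNK_SIZE);
      int read = file.read(b, offset, toRead);
      if (read == -1) {
        throw new IOException("read past EOF");
      }
      offset += read;
      len -= read;
    }
  }
}
{code}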
[jira] Commented: (LUCENE-1505) Change contrib/spatial to use trie's NumericUtils, and remove NumberUtils
[ https://issues.apache.org/jira/browse/LUCENE-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12731993#action_12731993 ] Michael McCandless commented on LUCENE-1505: bq. For completeness, shoudl we also add them for the ones with the shift value at the end? an char[]? I was reluctant to do this. Let's hold off add these when the need first arises? bq. I wonder if it would make sense to do some cleanup in the code (final vars and args etc.) and if we should remove this logging code Agreed -- looks like you've opened a new issue for this already; thanks! I'll commit shortly. Change contrib/spatial to use trie's NumericUtils, and remove NumberUtils - Key: LUCENE-1505 URL: https://issues.apache.org/jira/browse/LUCENE-1505 Project: Lucene - Java Issue Type: Improvement Components: contrib/spatial Reporter: Ryan McKinley Assignee: Michael McCandless Fix For: 2.9 Attachments: LUCENE-1505.patch Currently spatial contrib includes a copy of NumberUtils from solr (otherwise it would depend on solr) Once LUCENE-1496 is sorted out, this copy should be removed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
RE: DISI semantics
OK, that makes sense: So the example of Yonik should be interpreted like this (I think this is the optimal solution as it does not use an additional if-clause to check if the iteration has already started): class SliceDocIdSetIterator extends DocIdSetIterator { private int doc=-1,act,last; public SliceDocIdSetIterator(int first, int last) { this.act=first-1; this.last=last; } public int docID() { return doc; } public int nextDoc() throws IOException { if (++actlast) act=NO_MORE_DOCS; return doc = act; } public int advance(int target) throws IOException { act=target; if (actlast) act=NO_MORE_DOCS; return doc = act; } } - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de _ From: Shai Erera [mailto:ser...@gmail.com] Sent: Thursday, July 16, 2009 5:04 PM To: java-dev@lucene.apache.org; yo...@lucidimagination.com Subject: Re: DISI semantics Uwe / Yonik, DISI's class javadoc states this: Implementations of this class are expected to consider {...@link Integer#MAX_VALUE} as an invalid value. Therefore last cannot be set to MAX_VAL in the above example, if it wants to be a DISI at least. Phew ... that was a long issue. I was able to find the conversation on -1 vs. any value before the first there: https://issues.apache.org/jira/browse/LUCENE-1614?focusedCommentId=12714298 https://issues.apache.org/jira/browse/LUCENE-1614?focusedCommentId=12714298 page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#act ion_12714298 page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#act ion_12714298 That link points to my response to Mike w/ why I think it'd be wrong to relax the policy of docId(). You can read 1-2 comments up and down to get the full conversation. In short, if we don't document clearly what is returned by docId() before the iteration started, it will be hard for a code which receives a DISI to determine whether to call nextDoc() or start by collecting what docId() returns. Can be worked around though, but I think the API is clear now and does not leave room for interpretation. Shai On Thu, Jul 16, 2009 at 5:29 PM, Yonik Seeley yo...@lucidimagination.com wrote: On Wed, Jul 15, 2009 at 6:55 PM, Michael McCandlessluc...@mikemccandless.com wrote: I believe we debated allowing the DISI to return any docID less than its first real docID, not only -1, as you've done here, but I think Shai found something wrong with that IIRC... but I can't find this discussion. Shai do you remember / can you find this past discussion / am I just hallucinating? I don't know if it exists in Lucene, but I guess I can see the benefit of only having -1 or NO_MORE_DOCS. Consider a simplified ConjunctionScorer that didn't do anything in the constructor but simply skipped one iterator and then did the logic of doNext() until they all matched. One could get a false hit with my theoretical SliceDocIdSetIterator above. -Yonik http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
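Note that the comparison operators in the snippet above appear to have been stripped somewhere in the plain-text mail ("if (++actlast)", "if (actlast)"). A reconstruction, assuming those were '>' comparisons against last and written against the 2.9 DocIdSetIterator API:

{code}
import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;

// Reconstruction of Uwe's example; the '>' operators are an assumption inferred
// from the surrounding discussion (advancing past 'last' means exhaustion).
class SliceDocIdSetIterator extends DocIdSetIterator {
  private int doc = -1, act, last;

  public SliceDocIdSetIterator(int first, int last) {
    this.act = first - 1;
    this.last = last;
  }

  public int docID() {
    return doc;
  }

  public int nextDoc() throws IOException {
    if (++act > last) act = NO_MORE_DOCS;
    return doc = act;
  }

  public int advance(int target) throws IOException {
    act = target;
    if (act > last) act = NO_MORE_DOCS;
    return doc = act;
  }
}
{code}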
[jira] Resolved: (LUCENE-1505) Change contrib/spatial to use trie's NumericUtils, and remove NumberUtils
[ https://issues.apache.org/jira/browse/LUCENE-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-1505. Resolution: Fixed Change contrib/spatial to use trie's NumericUtils, and remove NumberUtils - Key: LUCENE-1505 URL: https://issues.apache.org/jira/browse/LUCENE-1505 Project: Lucene - Java Issue Type: Improvement Components: contrib/spatial Reporter: Ryan McKinley Assignee: Michael McCandless Fix For: 2.9 Attachments: LUCENE-1505.patch Currently spatial contrib includes a copy of NumberUtils from solr (otherwise it would depend on solr) Once LUCENE-1496 is sorted out, this copy should be removed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: DISI semantics
Agreed - that looks like the optimal solution. -Yonik http://www.lucidimagination.com On Thu, Jul 16, 2009 at 11:40 AM, Uwe Schindler u...@thetaphi.de wrote: OK, that makes sense: So the example of Yonik should be interpreted like this (I think this is the optimal solution as it does not use an additional if-clause to check if the iteration has already started): class SliceDocIdSetIterator extends DocIdSetIterator { private int doc=-1,act,last; public SliceDocIdSetIterator(int first, int last) { this.act=first-1; this.last=last; } public int docID() { return doc; } public int nextDoc() throws IOException { if (++act > last) act=NO_MORE_DOCS; return doc = act; } public int advance(int target) throws IOException { act=target; if (act > last) act=NO_MORE_DOCS; return doc = act; } } - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: DISI semantics
Of course - if you don't plan to push this DISI into uncontrolled land, you can use the previous solution as well. I.e., if you never rely on docId to know whether to start the iteration, and don't pass this DISI to Lucene somehow etc., there's no need to use act or adhere completely to the API. Otherwise, I agree, this looks to be the best solution. Maybe ... just maybe ... I'd change the 'if (++act last) act = NO_MORE_DOCS' to 'if (++act last) return doc = NO_MORE_DOCS' to avoid the 'act' assignment .. but since it will only happen once, I don't think it's worth it. On Thu, Jul 16, 2009 at 6:43 PM, Yonik Seeley yo...@lucidimagination.comwrote: Agreed - that looks like the optimal solution. -Yonik http://www.lucidimagination.com On Thu, Jul 16, 2009 at 11:40 AM, Uwe Schindleru...@thetaphi.de wrote: OK, that makes sense: So the example of Yonik should be interpreted like this (I think this is the optimal solution as it does not use an additional if-clause to check if the iteration has already started): class SliceDocIdSetIterator extends DocIdSetIterator { private int doc=-1,act,last; public SliceDocIdSetIterator(int first, int last) { this.act=first-1; this.last=last; } public int docID() { return doc; } public int nextDoc() throws IOException { if (++actlast) act=NO_MORE_DOCS; return doc = act; } public int advance(int target) throws IOException { act=target; if (actlast) act=NO_MORE_DOCS; return doc = act; } } - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1742) Wrap SegmentInfos in public class
[ https://issues.apache.org/jira/browse/LUCENE-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12732042#action_12732042 ] Michael McCandless commented on LUCENE-1742: I don't think we should make IndexWriter's ReaderPool public just yet? Maybe instead we can add API to query for whether a segment has pending unflushed deletes? (And fix core merge policies to use that API when deciding how to expungeDeletes). Wrap SegmentInfos in public class -- Key: LUCENE-1742 URL: https://issues.apache.org/jira/browse/LUCENE-1742 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 2.4.1 Reporter: Jason Rutherglen Priority: Trivial Fix For: 3.0 Attachments: LUCENE-1742.patch, LUCENE-1742.patch Original Estimate: 48h Remaining Estimate: 48h Wrap SegmentInfos in a public class so that subclasses of MergePolicy do not need to be in the org.apache.lucene.index package. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1683) RegexQuery matches terms the input regex doesn't actually match
[ https://issues.apache.org/jira/browse/LUCENE-1683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12732050#action_12732050 ] Michael McCandless commented on LUCENE-1683: Do you have a proposed fix for this...? Or, why is RegexQuery treating the trailing . as a .* instead? RegexQuery matches terms the input regex doesn't actually match --- Key: LUCENE-1683 URL: https://issues.apache.org/jira/browse/LUCENE-1683 Project: Lucene - Java Issue Type: Improvement Components: contrib/* Affects Versions: 2.3.2 Reporter: Trejkaz I was writing some unit tests for our own wrapper around the Lucene regex classes, and got tripped up by something interesting. The regex cat. will match cats but also anything with cat and 1+ following letters (e.g. cathy, catcher, ...) It is as if there is an implicit .* always added to the end of the regex. Here's a unit test for the behaviour I would expect myself: @Test public void testNecessity() throws Exception { File dir = new File(new File(System.getProperty(java.io.tmpdir)), index); IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true); try { Document doc = new Document(); doc.add(new Field(field, cat cats cathy, Field.Store.YES, Field.Index.TOKENIZED)); writer.addDocument(doc); } finally { writer.close(); } IndexReader reader = IndexReader.open(dir); try { TermEnum terms = new RegexQuery(new Term(field, cat.)).getEnum(reader); assertEquals(Wrong term, cats, terms.term()); assertFalse(Should have only been one term, terms.next()); } finally { reader.close(); } } This test fails on the term check with terms.term() equal to cathy. Our workaround is to mangle the query like this: String fixed = String.format((?:%s)$, original); -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1566) Large Lucene index can hit false OOM due to Sun JRE issue
[ https://issues.apache.org/jira/browse/LUCENE-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12732051#action_12732051 ] Michael McCandless commented on LUCENE-1566: OK thanks Simon; I'll commit shortly. Large Lucene index can hit false OOM due to Sun JRE issue - Key: LUCENE-1566 URL: https://issues.apache.org/jira/browse/LUCENE-1566 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 2.4.1 Reporter: Michael McCandless Assignee: Simon Willnauer Priority: Minor Fix For: 2.9 Attachments: LUCENE-1566.patch, LUCENE-1566.patch, LUCENE-1566.patch, LUCENE_1566_IndexInput.patch, LUCENE_1566_IndexInput_Changes.patch, LUCENE_1566_IndexInput_Changes.patch This is not a Lucene issue, but I want to open this so future google diggers can more easily find it. There's this nasty bug in Sun's JRE: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6478546 The gist seems to be, if you try to read a large (eg 200 MB) number of bytes during a single RandomAccessFile.read call, you can incorrectly hit OOM. Lucene does this, with norms, since we read in one byte per doc per field with norms, as a contiguous array of length maxDoc(). The workaround was a custom patch to do large file reads as several smaller reads. Background here: http://www.nabble.com/problems-with-large-Lucene-index-td22347854.html -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Resolved: (LUCENE-1566) Large Lucene index can hit false OOM due to Sun JRE issue
[ https://issues.apache.org/jira/browse/LUCENE-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-1566. Resolution: Fixed Thanks Simon! Large Lucene index can hit false OOM due to Sun JRE issue - Key: LUCENE-1566 URL: https://issues.apache.org/jira/browse/LUCENE-1566 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 2.4.1 Reporter: Michael McCandless Assignee: Simon Willnauer Priority: Minor Fix For: 2.9 Attachments: LUCENE-1566.patch, LUCENE-1566.patch, LUCENE-1566.patch, LUCENE_1566_IndexInput.patch, LUCENE_1566_IndexInput_Changes.patch, LUCENE_1566_IndexInput_Changes.patch This is not a Lucene issue, but I want to open this so future google diggers can more easily find it. There's this nasty bug in Sun's JRE: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6478546 The gist seems to be, if you try to read a large (eg 200 MB) number of bytes during a single RandomAccessFile.read call, you can incorrectly hit OOM. Lucene does this, with norms, since we read in one byte per doc per field with norms, as a contiguous array of length maxDoc(). The workaround was a custom patch to do large file reads as several smaller reads. Background here: http://www.nabble.com/problems-with-large-Lucene-index-td22347854.html -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1683) RegexQuery matches terms the input regex doesn't actually match
[ https://issues.apache.org/jira/browse/LUCENE-1683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12732060#action_12732060 ] Steven Rowe commented on LUCENE-1683: - bq. ... why is RegexQuery treating the trailing . as a .* instead? JavaUtilRegexCapabilities.match() is implemented as j.u.Matcher.lookingAt(), which is equivalent to adding a trailing .*, unless you explicity append a $ to the pattern. By contrast, JakartaRegexpCapabilities.match() is implemented as RE.match(), which does not imply the trailing .*. The difference in the two implementations implies this is a kind of bug, especially since the javadoc contract on RegexCapabilities.match() just says @return true if string matches the pattern last passed to compile. The fix is to switch JavaUtilRegexCapabilities.match to use j.u.Matcher.match() instead of lookingAt(). RegexQuery matches terms the input regex doesn't actually match --- Key: LUCENE-1683 URL: https://issues.apache.org/jira/browse/LUCENE-1683 Project: Lucene - Java Issue Type: Improvement Components: contrib/* Affects Versions: 2.3.2 Reporter: Trejkaz I was writing some unit tests for our own wrapper around the Lucene regex classes, and got tripped up by something interesting. The regex cat. will match cats but also anything with cat and 1+ following letters (e.g. cathy, catcher, ...) It is as if there is an implicit .* always added to the end of the regex. Here's a unit test for the behaviour I would expect myself: @Test public void testNecessity() throws Exception { File dir = new File(new File(System.getProperty(java.io.tmpdir)), index); IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true); try { Document doc = new Document(); doc.add(new Field(field, cat cats cathy, Field.Store.YES, Field.Index.TOKENIZED)); writer.addDocument(doc); } finally { writer.close(); } IndexReader reader = IndexReader.open(dir); try { TermEnum terms = new RegexQuery(new Term(field, cat.)).getEnum(reader); assertEquals(Wrong term, cats, terms.term()); assertFalse(Should have only been one term, terms.next()); } finally { reader.close(); } } This test fails on the term check with terms.term() equal to cathy. Our workaround is to mangle the query like this: String fixed = String.format((?:%s)$, original); -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Issue Comment Edited: (LUCENE-1683) RegexQuery matches terms the input regex doesn't actually match
[ https://issues.apache.org/jira/browse/LUCENE-1683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12732060#action_12732060 ] Steven Rowe edited comment on LUCENE-1683 at 7/16/09 11:12 AM: --- bq. ... why is RegexQuery treating the trailing . as a .* instead? JavaUtilRegexCapabilities.match() is implemented as j.u.regex.Matcher.lookingAt(), which is equivalent to adding a trailing .*, unless you explicity append a $ to the pattern. By contrast, JakartaRegexpCapabilities.match() is implemented as RE.match(), which does not imply the trailing .*. The difference in the two implementations implies this is a kind of bug, especially since the javadoc contract on RegexCapabilities.match() just says @return true if string matches the pattern last passed to compile. The fix is to switch JavaUtilRegexCapabilities.match to use Matcher.match() instead of lookingAt(). was (Author: steve_rowe): bq. ... why is RegexQuery treating the trailing . as a .* instead? JavaUtilRegexCapabilities.match() is implemented as j.u.Matcher.lookingAt(), which is equivalent to adding a trailing .*, unless you explicity append a $ to the pattern. By contrast, JakartaRegexpCapabilities.match() is implemented as RE.match(), which does not imply the trailing .*. The difference in the two implementations implies this is a kind of bug, especially since the javadoc contract on RegexCapabilities.match() just says @return true if string matches the pattern last passed to compile. The fix is to switch JavaUtilRegexCapabilities.match to use j.u.Matcher.match() instead of lookingAt(). RegexQuery matches terms the input regex doesn't actually match --- Key: LUCENE-1683 URL: https://issues.apache.org/jira/browse/LUCENE-1683 Project: Lucene - Java Issue Type: Improvement Components: contrib/* Affects Versions: 2.3.2 Reporter: Trejkaz I was writing some unit tests for our own wrapper around the Lucene regex classes, and got tripped up by something interesting. The regex cat. will match cats but also anything with cat and 1+ following letters (e.g. cathy, catcher, ...) It is as if there is an implicit .* always added to the end of the regex. Here's a unit test for the behaviour I would expect myself: @Test public void testNecessity() throws Exception { File dir = new File(new File(System.getProperty(java.io.tmpdir)), index); IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true); try { Document doc = new Document(); doc.add(new Field(field, cat cats cathy, Field.Store.YES, Field.Index.TOKENIZED)); writer.addDocument(doc); } finally { writer.close(); } IndexReader reader = IndexReader.open(dir); try { TermEnum terms = new RegexQuery(new Term(field, cat.)).getEnum(reader); assertEquals(Wrong term, cats, terms.term()); assertFalse(Should have only been one term, terms.next()); } finally { reader.close(); } } This test fails on the term check with terms.term() equal to cathy. Our workaround is to mangle the query like this: String fixed = String.format((?:%s)$, original); -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
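To make the difference concrete, here is a small demo of the anchoring behaviour Steve describes. Note that java.util.regex.Matcher offers matches() and lookingAt(); "Matcher.match()" above presumably means matches().

{code}
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// lookingAt() only anchors at the start of the input, which behaves like an
// implicit trailing ".*"; matches() requires the entire term to match.
public class RegexAnchoringDemo {
  public static void main(String[] args) {
    Pattern p = Pattern.compile("cat.");
    for (String term : new String[] {"cat", "cats", "cathy", "catcher"}) {
      Matcher m = p.matcher(term);
      boolean prefixMatch = m.lookingAt(); // true for cats, cathy, catcher
      m.reset();
      boolean fullMatch = m.matches();     // true only for cats
      System.out.println(term + ": lookingAt=" + prefixMatch + " matches=" + fullMatch);
    }
  }
}
{code}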
[jira] Updated: (LUCENE-1728) Move SmartChineseAnalyzer resources to own contrib project
[ https://issues.apache.org/jira/browse/LUCENE-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-1728: Attachment: LUCENE-1728.txt Simon, I revised the patch. Here are the new instructions for the analyzers/common and analyzers/smartcn scheme. Sorry for the delay. {code} ## 1. clean svn checkout ## 2. run the following commands to refactor the files. mkdir contrib/analyzers/common mkdir -p contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn contrib/analyzers/smartcn/src/test/org/apache/lucene/analysis/cn contrib/analyzers/smartcn/src/resources/org/apache/lucene/analysis/cn svn add contrib/analyzers/smartcn contrib/analyzers/common svn move contrib/analyzers/src/java/org/apache/lucene/analysis/cn/SmartChineseAnalyzer.java contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn svn move contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/hhmm/* contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn svn move contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/*.java contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn svn delete contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart svn move contrib/analyzers/src/test/org/apache/lucene/analysis/cn/TestSmartChineseAnalyzer.java contrib/analyzers/smartcn/src/test/org/apache/lucene/analysis/cn svn move contrib/analyzers/src/resources/org/apache/lucene/analysis/cn/stopwords.txt contrib/analyzers/smartcn/src/resources/org/apache/lucene/analysis/cn svn move contrib/analyzers/src/resources/org/apache/lucene/analysis/cn/smart/hhmm/* contrib/analyzers/smartcn/src/resources/org/apache/lucene/analysis/cn svn delete contrib/analyzers/src/resources/org/apache/lucene/analysis/cn svn move contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn/WordTokenizer.java contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn/WordTokenFilter.java svn move contrib/analyzers/build.xml contrib/analyzers/common svn move contrib/analyzers/pom.xml.template contrib/analyzers/common svn move contrib/analyzers/src contrib/analyzers/common ## 3. eclipse refresh at project level. ## 4. set text-file encoding at project level to UTF-8 ## 5. manually force text-file encoding as UTF-8 for contrib/analyzers/common/src/java/org/apache/lucene/analysis/cn/package.html ## this is an existing encoding issue that is corrected by this patch. ## 6. apply patch from clipboard (you may now remove the above hack and you will notice this file is now detected properly as UTF-8) {code} Move SmartChineseAnalyzer resources to own contrib project Key: LUCENE-1728 URL: https://issues.apache.org/jira/browse/LUCENE-1728 Project: Lucene - Java Issue Type: Improvement Components: contrib/analyzers Reporter: Simon Willnauer Assignee: Simon Willnauer Priority: Minor Fix For: 2.9 Attachments: LUCENE-1728.txt, LUCENE-1728.txt SmartChineseAnalyzer depends on a large dictionary that causes the analyzer jar to grow up to 3MB. The dictionary is quite big compared to all the other resouces / class files contained in that jar. Having a separate analyzer-cn contrib project enables footprint-sensitive users (e.g. using lucene on a mobile phone) to include analyzer.jar without getting into trouble with disk space. 
Moving SmartChineseAnalyzer to a separate project could also include a small refactoring as Robert mentioned in [LUCENE-1722|https://issues.apache.org/jira/browse/LUCENE-1722] several classes should be package protected, members and classes could be final, commented syserr and logging code should be removed etc. I set this issue target to 2.9 - if we can not make it until then feel free to move it to 3.0 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1728) Move SmartChineseAnalyzer resources to own contrib project
[ https://issues.apache.org/jira/browse/LUCENE-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-1728: Attachment: LUCENE-1728.txt same patch, but this time i clicked ASF license... sorry! Move SmartChineseAnalyzer resources to own contrib project Key: LUCENE-1728 URL: https://issues.apache.org/jira/browse/LUCENE-1728 Project: Lucene - Java Issue Type: Improvement Components: contrib/analyzers Reporter: Simon Willnauer Assignee: Simon Willnauer Priority: Minor Fix For: 2.9 Attachments: LUCENE-1728.txt, LUCENE-1728.txt, LUCENE-1728.txt SmartChineseAnalyzer depends on a large dictionary that causes the analyzer jar to grow up to 3MB. The dictionary is quite big compared to all the other resouces / class files contained in that jar. Having a separate analyzer-cn contrib project enables footprint-sensitive users (e.g. using lucene on a mobile phone) to include analyzer.jar without getting into trouble with disk space. Moving SmartChineseAnalyzer to a separate project could also include a small refactoring as Robert mentioned in [LUCENE-1722|https://issues.apache.org/jira/browse/LUCENE-1722] several classes should be package protected, members and classes could be final, commented syserr and logging code should be removed etc. I set this issue target to 2.9 - if we can not make it until then feel free to move it to 3.0 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Created: (LUCENE-1749) FieldCache introspection API
FieldCache introspection API Key: LUCENE-1749 URL: https://issues.apache.org/jira/browse/LUCENE-1749 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Hoss Man Priority: Minor FieldCache should expose an Expert level API for runtime introspection of the FieldCache to provide info about what is in the FieldCache at any given moment. We should also provide utility methods for sanity checking that the FieldCache doesn't contain anything odd... * entries for the same reader/field with different types/parsers * entries for the same field/type/parser in a reader and it's subreader(s) * etc... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1749) FieldCache introspection API
[ https://issues.apache.org/jira/browse/LUCENE-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12732110#action_12732110 ] Hoss Man commented on LUCENE-1749: -- The motivation for this issue is all of the changes coming in 2.9 in how Lucene internally uses the FieldCache API -- the biggest change being per-segment sorting, but there may be others not immediately obvious. While these changes are backwards compatible from an API and functionality perspective, they could have some pretty serious performance impacts for existing apps that also use the FieldCache directly: after upgrading, those apps may suddenly seem slower to start (because of redundant FieldCache initialization) and require 2X as much RAM as they did before. This could lead people to assume Lucene has suddenly become a major memory hog. SOLR- and SOLR-1247 are some quick examples of the types of problems that apps could encounter. Currently the only way for a user to even notice the problem is to do memory profiling, and the FieldCache data structure isn't the easiest to understand. It would be a lot nicer to have some methods for doing this inspection programmatically, so users could write automated tests for incorrect/redundant usage. FieldCache introspection API Key: LUCENE-1749 URL: https://issues.apache.org/jira/browse/LUCENE-1749 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Hoss Man Priority: Minor FieldCache should expose an Expert level API for runtime introspection of the FieldCache to provide info about what is in the FieldCache at any given moment. We should also provide utility methods for sanity checking that the FieldCache doesn't contain anything odd... * entries for the same reader/field with different types/parsers * entries for the same field/type/parser in a reader and it's subreader(s) * etc... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1749) FieldCache introspection API
[ https://issues.apache.org/jira/browse/LUCENE-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man updated LUCENE-1749: - Attachment: fieldcache-introspection.patch Here's the start of a patch to provide this functionality -- it just provides a new method/datastructure for inspecting the cache; the sanity checking utility methods should be straightforward assuming people think this is a good idea. The new method itself is fairly simple, but quite a bit of refactoring to how the caches are managed was necessary to make it possible to implement the method sanely. These changes to the FieldCache internals seem like they are generally a good idea from a maintenance standpoint even if people don't like the new method. FieldCache introspection API Key: LUCENE-1749 URL: https://issues.apache.org/jira/browse/LUCENE-1749 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Hoss Man Priority: Minor Attachments: fieldcache-introspection.patch FieldCache should expose an Expert level API for runtime introspection of the FieldCache to provide info about what is in the FieldCache at any given moment. We should also provide utility methods for sanity checking that the FieldCache doesn't contain anything odd... * entries for the same reader/field with different types/parsers * entries for the same field/type/parser in a reader and it's subreader(s) * etc... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
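As a rough illustration of the kind of automated check this is meant to enable, a hypothetical usage sketch follows. The getCacheEntries()/CacheEntry names are assumptions modelled on this patch, not a committed API at the time of this thread, and the check shown covers only the first kind of oddity listed in the issue description (the same reader/field cached more than once).

{code}
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.FieldCache.CacheEntry;

// Hypothetical sanity check built on the proposed introspection API: flag cases
// where the same reader/field pair shows up in the cache more than once (e.g.
// cached once via getInts and once via getStringIndex).
public class FieldCacheSanityCheck {
  public static void warnOnDuplicateEntries() {
    CacheEntry[] entries = FieldCache.DEFAULT.getCacheEntries(); // assumed accessor
    for (int i = 0; i < entries.length; i++) {
      for (int j = i + 1; j < entries.length; j++) {
        boolean sameReader = entries[i].getReaderKey() == entries[j].getReaderKey();
        boolean sameField = entries[i].getFieldName().equals(entries[j].getFieldName());
        if (sameReader && sameField) {
          System.err.println("suspect duplicate FieldCache entries for field '"
              + entries[i].getFieldName() + "'");
        }
      }
    }
  }
}
{code}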
[jira] Updated: (LUCENE-1749) FieldCache introspection API
[ https://issues.apache.org/jira/browse/LUCENE-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man updated LUCENE-1749: - Lucene Fields: [New, Patch Available] (was: [New]) Fix Version/s: 2.9 Technically this isn't a bug, so i probably shouldn't add it to the 2.9 blocker list, but i really think it would be a good idea to have something like this in the 2.9 release. At the very least: i'd like to put it on the list until/unless there is consensus that it's not needed. FieldCache introspection API Key: LUCENE-1749 URL: https://issues.apache.org/jira/browse/LUCENE-1749 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Hoss Man Priority: Minor Fix For: 2.9 Attachments: fieldcache-introspection.patch FieldCache should expose an Expert level API for runtime introspection of the FieldCache to provide info about what is in the FieldCache at any given moment. We should also provide utility methods for sanity checking that the FieldCache doesn't contain anything odd... * entries for the same reader/field with different types/parsers * entries for the same field/type/parser in a reader and it's subreader(s) * etc... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1749) FieldCache introspection API
[ https://issues.apache.org/jira/browse/LUCENE-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12732123#action_12732123 ] Mark Miller commented on LUCENE-1749: - nice - would be great if it could estimate ram usage as well. FieldCache introspection API Key: LUCENE-1749 URL: https://issues.apache.org/jira/browse/LUCENE-1749 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Hoss Man Priority: Minor Fix For: 2.9 Attachments: fieldcache-introspection.patch FieldCache should expose an Expert level API for runtime introspection of the FieldCache to provide info about what is in the FieldCache at any given moment. We should also provide utility methods for sanity checking that the FieldCache doesn't contain anything odd... * entries for the same reader/field with different types/parsers * entries for the same field/type/parser in a reader and it's subreader(s) * etc... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1742) Wrap SegmentInfos in public class
[ https://issues.apache.org/jira/browse/LUCENE-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Rutherglen updated LUCENE-1742: - Attachment: LUCENE-1742.patch * Reader pool isn't public anymore * Left methods of reader as public (could roll back?) * I'd rather that readerpool be public, however since it's new I guess we don't want people relying on it? * All tests pass * It would be great to get this into 2.9 Wrap SegmentInfos in public class -- Key: LUCENE-1742 URL: https://issues.apache.org/jira/browse/LUCENE-1742 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 2.4.1 Reporter: Jason Rutherglen Priority: Trivial Fix For: 3.0 Attachments: LUCENE-1742.patch, LUCENE-1742.patch, LUCENE-1742.patch Original Estimate: 48h Remaining Estimate: 48h Wrap SegmentInfos in a public class so that subclasses of MergePolicy do not need to be in the org.apache.lucene.index package. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1749) FieldCache introspection API
[ https://issues.apache.org/jira/browse/LUCENE-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12732157#action_12732157 ] Michael McCandless commented on LUCENE-1749: +1 -- this'd be great to get into 2.9. FieldCache introspection API Key: LUCENE-1749 URL: https://issues.apache.org/jira/browse/LUCENE-1749 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Hoss Man Priority: Minor Fix For: 2.9 Attachments: fieldcache-introspection.patch FieldCache should expose an Expert level API for runtime introspection of the FieldCache to provide info about what is in the FieldCache at any given moment. We should also provide utility methods for sanity checking that the FieldCache doesn't contain anything odd... * entries for the same reader/field with different types/parsers * entries for the same field/type/parser in a reader and it's subreader(s) * etc... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1749) FieldCache introspection API
[ https://issues.apache.org/jira/browse/LUCENE-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12732166#action_12732166 ] Uwe Schindler commented on LUCENE-1749: --- Looks good as a start, one question about a comment: What do you mean with: * :TODO: is the int sort type still needed? ... doesn't seem to be used anywhere, code just tests custom for SortComparator vs Parser. I do not understand, do you want to remove the IntCache? What is different with it in comparison with the other ones? Uwe FieldCache introspection API Key: LUCENE-1749 URL: https://issues.apache.org/jira/browse/LUCENE-1749 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Hoss Man Priority: Minor Fix For: 2.9 Attachments: fieldcache-introspection.patch FieldCache should expose an Expert level API for runtime introspection of the FieldCache to provide info about what is in the FieldCache at any given moment. We should also provide utility methods for sanity checking that the FieldCache doesn't contain anything odd... * entries for the same reader/field with different types/parsers * entries for the same field/type/parser in a reader and it's subreader(s) * etc... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1749) FieldCache introspection API
[ https://issues.apache.org/jira/browse/LUCENE-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12732190#action_12732190 ] Hoss Man commented on LUCENE-1749: -- bq. :TODO: is the int sort type still needed? ... doesn't seem to be used anywhere, code just tests custom for SortComparator vs Parser. sorry ... badly placed quotes ... that was in reference to Entry.type. Until I changed getStrings, getStringIndex, and getAuto to construct Entry objects as part of my refactoring, the type attribute (and the constructor that takes a type argument) didn't seem to be used anywhere (as far as I could tell). My guess: maybe some previous changes refactored the logic that switched on type up into the SortFields, so the FieldCache no longer needs to care about it? FieldCache introspection API Key: LUCENE-1749 URL: https://issues.apache.org/jira/browse/LUCENE-1749 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Hoss Man Priority: Minor Fix For: 2.9 Attachments: fieldcache-introspection.patch FieldCache should expose an Expert level API for runtime introspection of the FieldCache to provide info about what is in the FieldCache at any given moment. We should also provide utility methods for sanity checking that the FieldCache doesn't contain anything odd... * entries for the same reader/field with different types/parsers * entries for the same field/type/parser in a reader and it's subreader(s) * etc... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Created: (LUCENE-1750) LogByteSizeMergePolicy doesn't keep segments under maxMergeMB
LogByteSizeMergePolicy doesn't keep segments under maxMergeMB - Key: LUCENE-1750 URL: https://issues.apache.org/jira/browse/LUCENE-1750 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 2.4.1 Reporter: Jason Rutherglen Priority: Minor Fix For: 2.9 Basically I'm trying to create largish 2-4GB shards using LogByteSizeMergePolicy, however I've found in the attached unit test segments that exceed maxMergeMB. The goal is for segments to be merged up to 2GB, then all merging to that segment stops, and then another 2GB segment is created. This helps when replicating in Solr where if a single optimized 60GB segment is created, the machine stops working due to IO and CPU starvation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
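For reference, the intended setup is roughly the following, written against the 2.4.x API. Whether maxMergeMB actually bounds the size of the merged result, rather than just the merge inputs, is exactly what this issue questions; the sketch only shows the configuration being attempted.

{code}
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.LogByteSizeMergePolicy;

// Sketch of capping shard size at ~2GB so a single huge optimized segment is
// never produced; 2048 MB is the target from the issue description.
public class TwoGigShardConfig {
  public static void configure(IndexWriter writer) {
    LogByteSizeMergePolicy policy = new LogByteSizeMergePolicy();
    policy.setMaxMergeMB(2048.0); // segments over ~2GB should no longer be merged
    writer.setMergePolicy(policy);
  }
}
{code}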
[jira] Updated: (LUCENE-1750) LogByteSizeMergePolicy doesn't keep segments under maxMergeMB
[ https://issues.apache.org/jira/browse/LUCENE-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Rutherglen updated LUCENE-1750: - Attachment: LUCENE-1750.patch Unit test illustrating the issue. LogByteSizeMergePolicy doesn't keep segments under maxMergeMB - Key: LUCENE-1750 URL: https://issues.apache.org/jira/browse/LUCENE-1750 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 2.4.1 Reporter: Jason Rutherglen Priority: Minor Fix For: 2.9 Attachments: LUCENE-1750.patch Original Estimate: 48h Remaining Estimate: 48h Basically I'm trying to create largish 2-4GB shards using LogByteSizeMergePolicy, however I've found in the attached unit test segments that exceed maxMergeMB. The goal is for segments to be merged up to 2GB, then all merging to that segment stops, and then another 2GB segment is created. This helps when replicating in Solr where if a single optimized 60GB segment is created, the machine stops working due to IO and CPU starvation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1749) FieldCache introspection API
[ https://issues.apache.org/jira/browse/LUCENE-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller updated LUCENE-1749: Attachment: LUCENE-1749.patch Here is a start towards guessing the fieldcache ram usage. It probably works fairly well, though it will be limited by stack space on a very heavily nested object graph. I've added the size guess for getValue in the introspection output. Its a start anyway. FieldCache introspection API Key: LUCENE-1749 URL: https://issues.apache.org/jira/browse/LUCENE-1749 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Hoss Man Priority: Minor Fix For: 2.9 Attachments: fieldcache-introspection.patch, LUCENE-1749.patch FieldCache should expose an Expert level API for runtime introspection of the FieldCache to provide info about what is in the FieldCache at any given moment. We should also provide utility methods for sanity checking that the FieldCache doesn't contain anything odd... * entries for the same reader/field with different types/parsers * entries for the same field/type/parser in a reader and it's subreader(s) * etc... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
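As a back-of-the-envelope illustration of what such a RAM guess has to add up: the header sizes below are assumptions for a typical 32-bit JVM, and real overheads vary per JVM, which is presumably why the patch's JavaImpl/MemoryModel class is pluggable. String-valued fields are the hard case, since every entry is a separate object rather than a primitive slot.

{code}
// Rough per-field sizing for the primitive FieldCache arrays (assumed 32-bit JVM).
public class FieldCacheRamEstimate {
  static final int ARRAY_HEADER = 12; // assumed: object header + length field

  static long intArrayBytes(int maxDoc) {
    return ARRAY_HEADER + 4L * maxDoc; // getInts/getFloats: 4 bytes per document
  }

  static long byteArrayBytes(int maxDoc) {
    return ARRAY_HEADER + 1L * maxDoc; // getBytes: 1 byte per document
  }

  public static void main(String[] args) {
    int maxDoc = 10000000; // a 10M-doc index
    System.out.println("int field:  ~" + intArrayBytes(maxDoc) / (1024 * 1024) + " MB");
    System.out.println("byte field: ~" + byteArrayBytes(maxDoc) / (1024 * 1024) + " MB");
  }
}
{code}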
[jira] Commented: (LUCENE-1749) FieldCache introspection API
[ https://issues.apache.org/jira/browse/LUCENE-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12732297#action_12732297 ] Mark Miller commented on LUCENE-1749: - We prob would want to provide an alternate toString that includes the ram guess and the default that skips it - i havn't tested performance, but it might take a while to check a gigantic string array. FieldCache introspection API Key: LUCENE-1749 URL: https://issues.apache.org/jira/browse/LUCENE-1749 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Hoss Man Priority: Minor Fix For: 2.9 Attachments: fieldcache-introspection.patch, LUCENE-1749.patch FieldCache should expose an Expert level API for runtime introspection of the FieldCache to provide info about what is in the FieldCache at any given moment. We should also provide utility methods for sanity checking that the FieldCache doesn't contain anything odd... * entries for the same reader/field with different types/parsers * entries for the same field/type/parser in a reader and it's subreader(s) * etc... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Issue Comment Edited: (LUCENE-1749) FieldCache introspection API
[ https://issues.apache.org/jira/browse/LUCENE-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12732297#action_12732297 ] Mark Miller edited comment on LUCENE-1749 at 7/16/09 6:35 PM: -- We prob would want to provide an alternate toString that includes the ram guess and the default that skips it - i havn't tested performance, but it might take a while to check a gigantic string array. Also, JavaImpl should probably actually be JavaMemoryModel or MemoryModel. was (Author: markrmil...@gmail.com): We prob would want to provide an alternate toString that includes the ram guess and the default that skips it - i havn't tested performance, but it might take a while to check a gigantic string array. FieldCache introspection API Key: LUCENE-1749 URL: https://issues.apache.org/jira/browse/LUCENE-1749 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Hoss Man Priority: Minor Fix For: 2.9 Attachments: fieldcache-introspection.patch, LUCENE-1749.patch FieldCache should expose an Expert level API for runtime introspection of the FieldCache to provide info about what is in the FieldCache at any given moment. We should also provide utility methods for sanity checking that the FieldCache doesn't contain anything odd... * entries for the same reader/field with different types/parsers * entries for the same field/type/parser in a reader and it's subreader(s) * etc... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org