[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732869#action_12732869 ]

Michael McCandless commented on LUCENE-1693:
--------------------------------------------

bq. Given the difficulty of using it, esp since Lucene has been sorting fields before analysis (hence you have to name the fields properly to get one to be indexed before the other), maybe no one is using it.

Can't we fix Tee/Sink so that whichever tee is pulled from first does the caching, and then the second one pulls from the cache? I.e. right now when you create them you are forced to commit to which is primary and which is secondary, but if we relax that then it wouldn't be sensitive to the order in which Lucene indexes its fields.

Of course, someday Lucene may index fields concurrently; then Tee/Sink will get really interesting ;)

AttributeSource/TokenStream API improvements
--------------------------------------------

                Key: LUCENE-1693
                URL: https://issues.apache.org/jira/browse/LUCENE-1693
            Project: Lucene - Java
         Issue Type: Improvement
         Components: Analysis
           Reporter: Michael Busch
           Assignee: Michael Busch
           Priority: Minor
            Fix For: 2.9
        Attachments: LUCENE-1693.patch, lucene-1693.patch, LUCENE-1693.patch, lucene-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, lucene-1693.patch, TestAPIBackwardsCompatibility.java, TestCompatibility.java, TestCompatibility.java, TestCompatibility.java, TestCompatibility.java

This patch makes the following improvements to AttributeSource and TokenStream/Filter:

- removes the set/getUseNewAPI() methods (including the standard ones). Instead, by default incrementToken() throws a subclass of UnsupportedOperationException. The indexer initially calls incrementToken() once to see if the exception is thrown; if so, it falls back to the old API.

- introduces interfaces for all Attributes. The corresponding implementations have the postfix 'Impl', e.g. TermAttribute and TermAttributeImpl. AttributeSource now has a factory for creating the Attribute instances; the default implementation looks for implementing classes with the postfix 'Impl'. Token now implements all 6 TokenAttribute interfaces.

- new method added to AttributeSource: addAttributeImpl(AttributeImpl). Using reflection, it walks up the class hierarchy of the passed-in object and finds all interfaces that the class or its superclasses implement and that extend the Attribute interface. It then adds the interface-instance mappings to the attribute map for each of the found interfaces.

- AttributeImpl now has a default implementation of toString that uses reflection to print out the values of the attributes in a default formatting. This makes it a bit easier to implement AttributeImpl, because toString() was declared abstract before.

- Cloning is now done much more efficiently in captureState. The method figures out which unique AttributeImpl instances are contained as values in the attributes map, because those are the ones that need to be cloned. It creates a single linked list that supports deep cloning (in the inner class AttributeSource.State). AttributeSource keeps track of when this state changes, i.e. whenever new attributes are added to the AttributeSource. Only in that case will captureState recompute the state; otherwise it will simply clone the precomputed state and return the clone. restoreState(AttributeSource.State) walks the linked list and uses the copyTo() method of AttributeImpl to copy all values over into the attribute that the source stream (e.g. SinkTokenizer) uses.

The cloning performance can be greatly improved if a TokenStream does not use multiple AttributeImpl instances. A user can e.g. simply add a Token instance to the stream instead of the individual attributes. Or the user could implement a subclass of AttributeImpl that implements exactly the Attribute interfaces needed. I think this (addAttributeImpl) should be considered an expert API, as this manual optimization is only needed if cloning performance is crucial. I ran some quick performance tests using Tee/Sink tokenizers (which do cloning) and the performance was roughly 20% faster with the new API. I'll run some more performance tests and post more numbers then. Note also that when we add serialization to the Attributes, e.g. for supporting storing serialized TokenStreams in the index, then the serialization should benefit even more significantly from the new API than cloning does.

Also, the TokenStream API does not change, except for the removal of the set/getUseNewAPI() methods.
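To make the captureState()/restoreState() pattern above concrete, here is a minimal, self-contained sketch of a caching filter written against the new API (essentially what a caching filter such as the built-in CachingTokenFilter does); the class name and structure are illustrative only, not part of the patch:

{code}
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.util.AttributeSource;

public class SimpleCachingFilter extends TokenFilter {

  private List<AttributeSource.State> cache = null;
  private Iterator<AttributeSource.State> iterator = null;

  public SimpleCachingFilter(TokenStream input) {
    super(input);  // the filter shares the input's AttributeSource
  }

  public boolean incrementToken() throws IOException {
    if (cache == null) {
      // first call: consume the whole input and capture one State per token
      cache = new ArrayList<AttributeSource.State>();
      while (input.incrementToken()) {
        cache.add(captureState());   // cheap clone of the precomputed State
      }
      iterator = cache.iterator();
    }
    if (!iterator.hasNext()) {
      return false;                  // all cached tokens have been replayed
    }
    restoreState(iterator.next());   // copyTo() the cached values into our attributes
    return true;
  }

  public void reset() throws IOException {
    if (cache != null) {
      iterator = cache.iterator();   // the indexer calls reset() before consuming again
    }
  }
}
{code}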
RE: constant-score rewrite mode for NumericRangeQuery
Hi Mike,

I did some perf tests with the well-known PerfTest.java from the FieldCacheRangeFilter JIRA issue. I compared a 5 million doc index with precStep=4:

With constant score rewrite:
avg number of terms: 68.3
TRIE: best time=6.192687 ms; worst time=463.0907 ms; avg=222.6431290998 ms; sum=31994466

With boolean rewrite:
avg number of terms: 68.3
TRIE: best time=12.674237 ms; worst time=583.702957 ms; avg=257.912947 ms; sum=31994466

Both numbers were taken after some warm-up queries, and the random seed was identical (so exactly the same queries). For this index size it still looks faster than Boolean rewrite. In particular, the warm-up queries take much longer with Boolean rewrite. The problem with my test here is that the whole index seems to be in the OS cache. If it is not in the OS cache, I think the much longer time the first Boolean queries took will become more important.

In my opinion, we should keep constant score enabled. My main problem with Boolean rewrite is the completely useless scoring. A range query should always have constant score. We could maybe fix this some time in the future so that you can disable scoring for Boolean queries (e.g. bq.setDoConstantScore(true)). I think there is a special issue for this in JIRA (I do not know the number offhand).

A second problem with Boolean rewrite: with precStep=4, it is guaranteed that the query will not hit the 1024 max clause problem (see the formula for the theoretical maximum term count) - so no problem at all. The problem starts if you combine two or three numeric queries with BooleanClause.Occur.MUST in a top-level Boolean query (the typical example of a geo query). In this case, the Boolean queries that only consist of MUST clauses may be combined into one big one (correct me if I am wrong), and then the max clause count becomes a problem.

If we change the default, keep in mind to reopen SOLR-940, as it assumes constant-score mode by default and Solr's default precisionStep is 8 - *bang*. Maybe the Solr people should fix this and explicitly set the mode for all range queries anyway.

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

-----Original Message-----
From: Michael McCandless [mailto:luc...@mikemccandless.com]
Sent: Friday, July 17, 2009 8:56 PM
To: java-dev@lucene.apache.org
Subject: constant-score rewrite mode for NumericRangeQuery

Should we really default to constant-score rewrite with NumericRangeQuery? Would BooleanQuery rewrite mode give better performance on a large index, since the number of terms should be smallish w/ the default precisionStep (4), I think?

Mike
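For illustration, the geo case mentioned above looks roughly like this with the 2.9-era API as I understand it (field names and bounds are made up; with the constant-score default, each sub-range contributes a constant score):

{code}
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.NumericRangeQuery;

public class GeoQueryExample {
  // builds the typical "geo bounding box" query described above
  public static BooleanQuery newBoundingBox() {
    // two numeric ranges, precStep=4 as in the test above; bounds are made up
    NumericRangeQuery latQ = NumericRangeQuery.newDoubleRange(
        "lat", 4, Double.valueOf(52.4), Double.valueOf(52.6), true, true);
    NumericRangeQuery lonQ = NumericRangeQuery.newDoubleRange(
        "lon", 4, Double.valueOf(8.5), Double.valueOf(8.9), true, true);

    BooleanQuery bbox = new BooleanQuery();
    bbox.add(latQ, BooleanClause.Occur.MUST);   // both ranges are required
    bbox.add(lonQ, BooleanClause.Occur.MUST);
    return bbox;
  }
}
{code}

With Boolean rewrite, each NumericRangeQuery would first expand into its own BooleanQuery of term clauses, which is where the max clause count could become an issue if such expansions were ever flattened into the top-level query.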
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732878#action_12732878 ]

Uwe Schindler commented on LUCENE-1693:
---------------------------------------

In this case we should rename TeeSink to something like SplitTokenStream (which does not extend TokenStream). One could then get any number of sinks from it:

{code}
SplitTokenStream splitter = new SplitTokenStream(stream); // does not extend TokenStream!!!
TokenStream stream1 = splitter.newSinkTokenStream();
TokenStream stream2 = splitter.newSinkTokenStream();
...
{code}

In this case the caching would be done directly in the splitter, and the sinks are only consumers. The first sink that asks for the attribute states forces the splitter to harvest and cache the input stream (exactly like CachingTokenStream does it). In principle it would be the same as a CachingTokenStream.

But on the other hand: you can always create a CachingTokenStream and reuse the same instance for different fields. Because the indexer always calls reset() before consuming, you could re-read it easily. Any additional filters could then be plugged in front for each field. In this case the order is not important:

{code}
TokenStream stream = new CachingTokenStream(input);
doc.add(new Field("xyz", new DoSomethingTokenFilter(stream)));
doc.add(new Field("abc", new DoSomethingOtherTokenFilter(stream)));
...
{code}

This would not work if the indexer can consume the different fields in parallel. But in the current state it would not even work with Tee/Sink (not multithread compatible).
[jira] Issue Comment Edited: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732878#action_12732878 ]

Uwe Schindler edited comment on LUCENE-1693 at 7/18/09 4:06 AM:
----------------------------------------------------------------

In this case we should rename TeeSink to something like SplitTokenStream (which does not extend TokenStream). One could then get any number of sinks from it:

{code}
SplitTokenStream splitter = new SplitTokenStream(stream); // does not extend TokenStream!!!
TokenStream stream1 = splitter.newSinkTokenStream();
TokenStream stream2 = splitter.newSinkTokenStream();
...
{code}

In this case the caching would be done directly in the splitter, and the sinks are only consumers. The first sink that asks for the attribute states forces the splitter to harvest and cache the input stream (exactly like CachingTokenStream does it). In principle it would be the same as a CachingTokenStream.

But on the other hand: you can always create a CachingTokenFilter and reuse the same instance for different fields. Because the indexer always calls reset() before consuming, you could re-read it easily. Any additional filters could then be plugged in front for each field. In this case the order is not important:

{code}
TokenStream stream = new CachingTokenFilter(input);
doc.add(new Field("xyz", new DoSomethingTokenFilter(stream)));
doc.add(new Field("abc", new DoSomethingOtherTokenFilter(stream)));
...
{code}

This would not work if the indexer can consume the different fields in parallel. But in the current state it would not even work with Tee/Sink (not multithread compatible).

was (Author: thetaphi):

In this case we should rename TeeSink to something like SplitTokenStream (which does not extend TokenStream). One could then get any number of sinks from it:

{code}
SplitTokenStream splitter = new SplitTokenStream(stream); // does not extend TokenStream!!!
TokenStream stream1 = splitter.newSinkTokenStream();
TokenStream stream2 = splitter.newSinkTokenStream();
...
{code}

In this case the caching would be done directly in the splitter, and the sinks are only consumers. The first sink that asks for the attribute states forces the splitter to harvest and cache the input stream (exactly like CachingTokenStream does it). In principle it would be the same as a CachingTokenStream.

But on the other hand: you can always create a CachingTokenStream and reuse the same instance for different fields. Because the indexer always calls reset() before consuming, you could re-read it easily. Any additional filters could then be plugged in front for each field. In this case the order is not important:

{code}
TokenStream stream = new CachingTokenStream(input);
doc.add(new Field("xyz", new DoSomethingTokenFilter(stream)));
doc.add(new Field("abc", new DoSomethingOtherTokenFilter(stream)));
...
{code}

This would not work if the indexer can consume the different fields in parallel. But in the current state it would not even work with Tee/Sink (not multithread compatible).
Throttling merges
It may be useful to allow users to throttle merges. A callback that IndexWriter passes into SegmentMerger would suffice, where the individual SegmentMerger methods make use of the callback. I suppose this could slow down overall merging by adding a potentially useless method call. However, since merging typically consumes IO resources for an extended period of time, this offers a way for the user to tune IO consumption and free up IO for other tasks at preferred times.
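A purely hypothetical sketch of what such a callback could look like (this interface does not exist in Lucene; the names are made up for illustration):

{code}
/**
 * Hypothetical throttle callback: IndexWriter would hand an implementation
 * to SegmentMerger, which calls pause() at convenient points while copying
 * postings, stored fields, term vectors, etc.
 */
public interface MergeThrottle {

  /**
   * Reports how many bytes were copied since the last call.  An
   * implementation may sleep (or block until an application-defined quiet
   * period) to limit the IO that merging consumes.
   */
  void pause(long bytesCopiedSinceLastCall) throws InterruptedException;
}
{code}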
Re: constant-score rewrite mode for NumericRangeQuery
On Sat, Jul 18, 2009 at 6:54 AM, Uwe Schindler <u...@thetaphi.de> wrote:

> I did some perf tests with the well-known PerfTest.java from the FieldCacheRangeFilter JIRA issue. I compared a 5 million doc index with precStep=4:
>
> With constant score rewrite:
> avg number of terms: 68.3
> TRIE: best time=6.192687 ms; worst time=463.0907 ms; avg=222.6431290998 ms; sum=31994466
>
> With boolean rewrite:
> avg number of terms: 68.3
> TRIE: best time=12.674237 ms; worst time=583.702957 ms; avg=257.912947 ms; sum=31994466
>
> Both numbers were taken after some warm-up queries, and the random seed was identical (so exactly the same queries). For this index size it still looks faster than Boolean rewrite.

OK these are good results; thanks for running them!

> In particular, the warm-up queries take much longer with Boolean rewrite. The problem with my test here is that the whole index seems to be in the OS cache. If it is not in the OS cache, I think the much longer time the first Boolean queries took will become more important.

Agreed.

> In my opinion, we should keep constant score enabled.

OK +1

> My main problem with Boolean rewrite is the completely useless scoring. A range query should always have constant score. We could maybe fix this some time in the future so that you can disable scoring for Boolean queries (e.g. bq.setDoConstantScore(true)). I think there is a special issue for this in JIRA (I do not know the number offhand).

I completely agree; we need to make it possible to use the BooleanQuery expansion method with constant scoring (I opened an issue for this already -- LUCENE-1644).

> A second problem with Boolean rewrite: with precStep=4, it is guaranteed that the query will not hit the 1024 max clause problem (see the formula for the theoretical maximum term count) - so no problem at all.

Right.

> The problem starts if you combine two or three numeric queries with BooleanClause.Occur.MUST in a top-level Boolean query (the typical example of a geo query). In this case, the Boolean queries that only consist of MUST clauses may be combined into one big one (correct me if I am wrong), and then the max clause count becomes a problem.

Actually Lucene never does structural optimizations of BooleanQuery, and I think it should (though scores would be different). One exception: if the BooleanQuery has a single clause, it'll rewrite itself to the rewrite of that one sub-query.

> If we change the default, keep in mind to reopen SOLR-940, as it assumes constant-score mode by default and Solr's default precisionStep is 8 - *bang*. Maybe the Solr people should fix this and explicitly set the mode for all range queries anyway.

Let's not change the default :)

Mike
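If Solr (or any other caller) wants to stop relying on the default, it can pin the rewrite mode explicitly. A sketch using the constant names as I recall them from the 2.9 MultiTermQuery API (field name, bounds, and precisionStep 8 -- Solr's default, per the mail above -- are illustrative):

{code}
import org.apache.lucene.search.MultiTermQuery;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.Query;

public class ExplicitRewriteExample {
  public static Query newPriceRange() {
    // made-up field and bounds, precisionStep 8
    NumericRangeQuery q = NumericRangeQuery.newIntRange(
        "price", 8, Integer.valueOf(10), Integer.valueOf(100), true, true);
    // don't rely on the library default: force constant-score rewrite explicitly
    q.setRewriteMethod(MultiTermQuery.CONSTANT_SCORE_FILTER_REWRITE);
    return q;
  }
}
{code}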
[jira] Updated: (LUCENE-1750) Create a MergePolicy that limits the maximum size of its segments
[ https://issues.apache.org/jira/browse/LUCENE-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-1750:
-------------------------------------

    Issue Type: Improvement  (was: Bug)
       Summary: Create a MergePolicy that limits the maximum size of its segments  (was: LogByteSizeMergePolicy doesn't keep segments under maxMergeMB)

Create a MergePolicy that limits the maximum size of its segments
------------------------------------------------------------------

                Key: LUCENE-1750
                URL: https://issues.apache.org/jira/browse/LUCENE-1750
            Project: Lucene - Java
         Issue Type: Improvement
         Components: Index
   Affects Versions: 2.4.1
           Reporter: Jason Rutherglen
           Priority: Minor
            Fix For: 3.1
        Attachments: LUCENE-1750.patch
  Original Estimate: 48h
 Remaining Estimate: 48h

Basically I'm trying to create largish 2-4 GB shards using LogByteSizeMergePolicy; however, I've found in the attached unit test that segments exceed maxMergeMB. The goal is for segments to be merged up to 2 GB, then for all merging to that segment to stop, and then for another 2 GB segment to be created. This helps when replicating in Solr, where if a single optimized 60 GB segment is created, the machine stops working due to IO and CPU starvation.
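For context, the existing knob looks roughly like this (2.4-era API as I recall it). Note that maxMergeMB only keeps input segments above that size out of merges; it does not cap the size of the segment a merge produces, which is what this issue is asking for:

{code}
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.LogByteSizeMergePolicy;

public class MergePolicyConfigExample {
  // configure the existing size knob on an already-open writer whose
  // merge policy is the default LogByteSizeMergePolicy
  public static void configure(IndexWriter writer) {
    LogByteSizeMergePolicy mp = (LogByteSizeMergePolicy) writer.getMergePolicy();
    mp.setMaxMergeMB(2048.0);  // segments already larger than ~2 GB are left out of merges
    mp.setMergeFactor(10);     // but a merge of 10 smaller segments can still exceed 2 GB
  }
}
{code}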
[jira] Updated: (LUCENE-1750) Create a MergePolicy that limits the maximum size of its segments
[ https://issues.apache.org/jira/browse/LUCENE-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-1750:
-------------------------------------

    Fix Version/s: 3.1  (was: 2.9)
[jira] Updated: (LUCENE-1742) Wrap SegmentInfos in public class
[ https://issues.apache.org/jira/browse/LUCENE-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1742:
---------------------------------------

    Attachment: LUCENE-1742.patch

Attached patch with tiny changes: made a few more read-only methods public, fixed javadoc warning, one formatting fix, added CHANGES. I think it's ready to commit. I'll commit soon...

Wrap SegmentInfos in public class
---------------------------------

                Key: LUCENE-1742
                URL: https://issues.apache.org/jira/browse/LUCENE-1742
            Project: Lucene - Java
         Issue Type: Improvement
         Components: Index
   Affects Versions: 2.4.1
           Reporter: Jason Rutherglen
           Priority: Trivial
            Fix For: 2.9
        Attachments: LUCENE-1742.patch, LUCENE-1742.patch, LUCENE-1742.patch, LUCENE-1742.patch, LUCENE-1742.patch
  Original Estimate: 48h
 Remaining Estimate: 48h

Wrap SegmentInfos in a public class so that subclasses of MergePolicy do not need to be in the org.apache.lucene.index package.
Re: Throttling merges
The goal is to be like ionice, right? Meaning, lower the priority of IO caused by merging? I agree that makes sense.

I wonder if we could implement it at the Directory level, so that when openInput/createOutput is called we can optionally specify the context (reader, merging, writer, etc.), and then somehow add throttling in there.

Mike

On Sat, Jul 18, 2009 at 10:37 AM, Jason Rutherglen <jason.rutherg...@gmail.com> wrote:

> It may be useful to allow users to throttle merges. A callback that IndexWriter passes into SegmentMerger would suffice, where the individual SegmentMerger methods make use of the callback. I suppose this could slow down overall merging by adding a potentially useless method call. However, since merging typically consumes IO resources for an extended period of time, this offers a way for the user to tune IO consumption and free up IO for other tasks at preferred times.
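As a rough illustration of the Directory-level idea, a write path opened in a "merge" context could feed its byte counts through something like the following rate limiter (a purely hypothetical helper, not part of Lucene):

{code}
/**
 * Hypothetical helper: caps the average write rate by sleeping when the
 * bytes reported in the current one-second window get ahead of the budget.
 * A Directory/IndexOutput wrapper created for merge outputs could call
 * pause() from its write methods.
 */
public class SimpleRateLimiter {

  private final double maxBytesPerSec;
  private long windowStart = System.currentTimeMillis();
  private long bytesInWindow = 0;

  public SimpleRateLimiter(double maxBytesPerSec) {
    this.maxBytesPerSec = maxBytesPerSec;
  }

  /** Records that 'bytes' were just written and sleeps if we are over budget. */
  public synchronized void pause(long bytes) throws InterruptedException {
    bytesInWindow += bytes;
    long elapsed = System.currentTimeMillis() - windowStart;
    double targetMillis = bytesInWindow / maxBytesPerSec * 1000.0;
    if (targetMillis > elapsed) {
      Thread.sleep((long) (targetMillis - elapsed));
    }
    // start a new accounting window roughly once per second
    if (System.currentTimeMillis() - windowStart >= 1000) {
      windowStart = System.currentTimeMillis();
      bytesInWindow = 0;
    }
  }
}
{code}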
[jira] Commented: (LUCENE-1748) getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract
[ https://issues.apache.org/jira/browse/LUCENE-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732938#action_12732938 ]

Earwin Burrfoot commented on LUCENE-1748:
-----------------------------------------

bq. We should drop PayloadSpans and just add getPayload to Spans. This should be a compile time break.

+1

getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract
-------------------------------------------------------------------------------

                Key: LUCENE-1748
                URL: https://issues.apache.org/jira/browse/LUCENE-1748
            Project: Lucene - Java
         Issue Type: Bug
         Components: Query/Scoring
   Affects Versions: 2.4, 2.4.1
        Environment: all
           Reporter: Hugh Cayless
            Fix For: 2.9, 3.0, 3.1

I just spent a long time tracking down a bug resulting from upgrading to Lucene 2.4.1 on a project that implements some SpanQuerys of its own and was written against 2.3. Since the project's SpanQuerys didn't implement getPayloadSpans, the call to that method went to SpanQuery.getPayloadSpans, which returned null and caused a NullPointerException in the Lucene code, far away from the actual source of the problem. It would be much better for this kind of thing to show up at compile time, I think. Thanks!
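For illustration only, a hypothetical sketch of what folding payload access into Spans could look like (this is not the actual Lucene API; it just shows why the break would surface at compile time for custom Spans implementations):

{code}
import java.io.IOException;
import java.util.Collection;

// Hypothetical merged interface: every Spans implementation is forced to
// decide how it handles payloads, instead of silently inheriting a method
// that returns null at runtime.
public interface Spans {
  boolean next() throws IOException;
  boolean skipTo(int target) throws IOException;
  int doc();
  int start();
  int end();

  // formerly on PayloadSpans: the payloads (byte[] entries) of the current match
  Collection getPayload() throws IOException;
  boolean isPayloadAvailable();
}
{code}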
[jira] Commented: (LUCENE-1750) Create a MergePolicy that limits the maximum size of its segments
[ https://issues.apache.org/jira/browse/LUCENE-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732974#action_12732974 ]

Shai Erera commented on LUCENE-1750:
------------------------------------

What happens after several such large segments are created? Wouldn't you want them to be merged into an even larger segment? Otherwise you'll have many such segments and search performance will degrade.

I guess I never thought this was a problem. If I have enough disk space, and my index size reaches 600 GB (which is a huge index) and is split across 10 different segments of 60 GB each, I guess I'd want them to be merged into one larger 600 GB segment. It will take eons until I accumulate another such 600 GB segment, no?

Maybe we can have two merge factors: (1) for small segments, up to a set size threshold, where we do the merges regularly; and (2) for really large segments, a different merge factor. For example, we can say that up to 1 GB the merge factor is 10, and beyond that the merge factor is 20. That will postpone the large IO merges until enough such segments accumulate.

Also, with the current proposal, how will optimize work? Will it skip the very large segments, or will they be included too?
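A tiny illustrative sketch of the two-tier idea (not an existing Lucene policy; the threshold and factors are taken from the example above):

{code}
public class TieredMergeFactorExample {
  private static final long ONE_GB = 1024L * 1024L * 1024L;

  /** Small segments merge with the normal factor; large ones wait for more siblings. */
  static int mergeFactorFor(long segmentSizeInBytes) {
    return segmentSizeInBytes <= ONE_GB ? 10 : 20;
  }
}
{code}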