[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732869#action_12732869 ]

Michael McCandless commented on LUCENE-1693:
--------------------------------------------

bq. Given the difficulty of using it, esp since Lucene has been sorting fields before analysis (hence you have to name the fields properly to get one to be indexed before the other), maybe no one is using it.

Can't we fix Tee/Sink so that whichever tee is pulled from first does the caching, and then the second one pulls from the cache? I.e. right now when you create them you are forced to commit to which is primary and which is secondary, but if we relax that then it wouldn't be sensitive to the order in which Lucene indexes its fields.

Of course, someday Lucene may index fields concurrently; then Tee/Sink will get really interesting ;)

AttributeSource/TokenStream API improvements
--------------------------------------------

                Key: LUCENE-1693
                URL: https://issues.apache.org/jira/browse/LUCENE-1693
            Project: Lucene - Java
         Issue Type: Improvement
         Components: Analysis
           Reporter: Michael Busch
           Assignee: Michael Busch
           Priority: Minor
            Fix For: 2.9
        Attachments: LUCENE-1693.patch, lucene-1693.patch, LUCENE-1693.patch, lucene-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, lucene-1693.patch, TestAPIBackwardsCompatibility.java, TestCompatibility.java, TestCompatibility.java, TestCompatibility.java, TestCompatibility.java

This patch makes the following improvements to AttributeSource and TokenStream/Filter:

- removes the set/getUseNewAPI() methods (including the standard ones). Instead, by default incrementToken() throws a subclass of UnsupportedOperationException. The indexer initially calls incrementToken() once to see if the exception is thrown; if so, it falls back to the old API.

- introduces interfaces for all Attributes. The corresponding implementations have the postfix 'Impl', e.g. TermAttribute and TermAttributeImpl. AttributeSource now has a factory for creating the Attribute instances; the default implementation looks for implementing classes with the postfix 'Impl'. Token now implements all 6 TokenAttribute interfaces.

- new method added to AttributeSource: addAttributeImpl(AttributeImpl). Using reflection, it walks up the class hierarchy of the passed-in object and finds all interfaces that the class or its superclasses implement and that extend the Attribute interface. It then adds the interface-instance mappings to the attribute map for each of the found interfaces.

- AttributeImpl now has a default implementation of toString that uses reflection to print out the values of the attributes in a default formatting. This makes it a bit easier to implement AttributeImpl, because toString() was declared abstract before.

- Cloning is now done much more efficiently in captureState. The method figures out which unique AttributeImpl instances are contained as values in the attributes map, because those are the ones that need to be cloned. It creates a single linked list that supports deep cloning (in the inner class AttributeSource.State). AttributeSource keeps track of when this state changes, i.e. whenever new attributes are added to the AttributeSource. Only in that case will captureState recompute the state; otherwise it will simply clone the precomputed state and return the clone. restoreState(AttributeSource.State) walks the linked list and uses the copyTo() method of AttributeImpl to copy all values over into the attribute that the source stream (e.g. SinkTokenizer) uses.

The cloning performance can be greatly improved if a TokenStream does not use multiple AttributeImpl instances. A user can e.g. simply add a Token instance to the stream instead of the individual attributes. Or the user could implement a subclass of AttributeImpl that implements exactly the Attribute interfaces needed. I think this (addAttributeImpl) should be considered an expert API, as this manual optimization is only needed if cloning performance is crucial. I ran some quick performance tests using Tee/Sink tokenizers (which do cloning) and the performance was roughly 20% faster with the new API. I'll run some more performance tests and post more numbers then. Note also that when we add serialization to the Attributes, e.g. for supporting storing serialized TokenStreams in the index, then the serialization should benefit even more significantly from the new API than cloning does.

Also, the TokenStream API does not change, except for the removal of the set/getUseNewAPI() methods.
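To make the captureState()/restoreState() pattern above concrete, here is a minimal, self-contained sketch of a caching filter written against the new API (essentially what a caching filter such as the built-in CachingTokenFilter does); the class name and structure are illustrative only, not part of the patch:

{code}
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.util.AttributeSource;

public class SimpleCachingFilter extends TokenFilter {

  private List<AttributeSource.State> cache = null;
  private Iterator<AttributeSource.State> iterator = null;

  public SimpleCachingFilter(TokenStream input) {
    super(input);  // the filter shares the input's AttributeSource
  }

  public boolean incrementToken() throws IOException {
    if (cache == null) {
      // first call: consume the whole input and capture one State per token
      cache = new ArrayList<AttributeSource.State>();
      while (input.incrementToken()) {
        cache.add(captureState());   // cheap clone of the precomputed State
      }
      iterator = cache.iterator();
    }
    if (!iterator.hasNext()) {
      return false;                  // all cached tokens have been replayed
    }
    restoreState(iterator.next());   // copyTo() the cached values into our attributes
    return true;
  }

  public void reset() throws IOException {
    if (cache != null) {
      iterator = cache.iterator();   // the indexer calls reset() before consuming again
    }
  }
}
{code}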
RE: constant-score rewrite mode for NumericRangeQuery
Hi Mike,

I did some perf tests with the well-known PerfTest.java from the FieldCacheRangeFilter JIRA issue. I compared a 5 million doc index with precStep=4:

With constant score rewrite:
avg number of terms: 68.3
TRIE: best time=6.192687 ms; worst time=463.0907 ms; avg=222.6431290998 ms; sum=31994466

With boolean rewrite:
avg number of terms: 68.3
TRIE: best time=12.674237 ms; worst time=583.702957 ms; avg=257.912947 ms; sum=31994466

Both numbers were taken after some warm-up queries, and the random seed was identical (so exactly the same queries). For this index size it still looks faster than Boolean rewrite. In particular, the warm-up queries take much longer with Boolean rewrite. The problem with my test here is that the whole index seems to be in the OS cache. If it is not in the OS cache, I think the much longer time the first Boolean queries took will become more important.

In my opinion, we should keep constant score enabled. My main problem with Boolean rewrite is the completely useless scoring. A range query should always have constant score. We could maybe fix this some time in the future so that you can disable scoring for Boolean queries (e.g. bq.setDoConstantScore(true)). I think there is a special issue for this in JIRA (I do not know the number offhand).

A second problem with Boolean rewrite: with precStep=4, it is guaranteed that the query will not hit the 1024 max clause problem (see the formula for the theoretical maximum term count) - so no problem at all. The problem starts if you combine two or three numeric queries with BooleanClause.Occur.MUST in a top-level Boolean query (the typical example of a geo query). In this case, the Boolean queries that only consist of MUST clauses may be combined into one big one (correct me if I am wrong), and then the max clause count becomes a problem.

If we change the default, keep in mind to reopen SOLR-940, as it assumes constant-score mode by default and Solr's default precisionStep is 8 - *bang*. Maybe the Solr people should fix this and explicitly set the mode for all range queries anyway.

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

-----Original Message-----
From: Michael McCandless [mailto:luc...@mikemccandless.com]
Sent: Friday, July 17, 2009 8:56 PM
To: java-dev@lucene.apache.org
Subject: constant-score rewrite mode for NumericRangeQuery

Should we really default to constant-score rewrite with NumericRangeQuery? Would BooleanQuery rewrite mode give better performance on a large index, since the number of terms should be smallish w/ the default precisionStep (4), I think?

Mike
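For illustration, the geo case mentioned above looks roughly like this with the 2.9-era API as I understand it (field names and bounds are made up; with the constant-score default, each sub-range contributes a constant score):

{code}
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.NumericRangeQuery;

public class GeoQueryExample {
  // builds the typical "geo bounding box" query described above
  public static BooleanQuery newBoundingBox() {
    // two numeric ranges, precStep=4 as in the test above; bounds are made up
    NumericRangeQuery latQ = NumericRangeQuery.newDoubleRange(
        "lat", 4, Double.valueOf(52.4), Double.valueOf(52.6), true, true);
    NumericRangeQuery lonQ = NumericRangeQuery.newDoubleRange(
        "lon", 4, Double.valueOf(8.5), Double.valueOf(8.9), true, true);

    BooleanQuery bbox = new BooleanQuery();
    bbox.add(latQ, BooleanClause.Occur.MUST);   // both ranges are required
    bbox.add(lonQ, BooleanClause.Occur.MUST);
    return bbox;
  }
}
{code}

With Boolean rewrite, each NumericRangeQuery would first expand into its own BooleanQuery of term clauses, which is where the max clause count could become an issue if such expansions were ever flattened into the top-level query.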
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732878#action_12732878 ]

Uwe Schindler commented on LUCENE-1693:
---------------------------------------

In this case we should rename TeeSink to something like SplitTokenStream (which does not extend TokenStream). One could then get any number of sinks from it:

{code}
SplitTokenStream splitter = new SplitTokenStream(stream); // does not extend TokenStream!!!
TokenStream stream1 = splitter.newSinkTokenStream();
TokenStream stream2 = splitter.newSinkTokenStream();
...
{code}

In this case the caching would be done directly in the splitter, and the sinks are only consumers. The first sink that asks for the attribute states forces the splitter to harvest and cache the input stream (exactly like CachingTokenStream does it). In principle it would be the same as a CachingTokenStream.

But on the other hand: you can always create a CachingTokenStream and reuse the same instance for different fields. Because the indexer always calls reset() before consuming, you could re-read it easily. Any additional filters could then be plugged in front for each field. In this case the order is not important:

{code}
TokenStream stream = new CachingTokenStream(input);
doc.add(new Field("xyz", new DoSomethingTokenFilter(stream)));
doc.add(new Field("abc", new DoSomethingOtherTokenFilter(stream)));
...
{code}

This would not work if the indexer can consume the different fields in parallel. But in the current state it would not even work with Tee/Sink (not multithread compatible).
[jira] Issue Comment Edited: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732878#action_12732878 ]

Uwe Schindler edited comment on LUCENE-1693 at 7/18/09 4:06 AM:
----------------------------------------------------------------

In this case we should rename TeeSink to something like SplitTokenStream (which does not extend TokenStream). One could then get any number of sinks from it:

{code}
SplitTokenStream splitter = new SplitTokenStream(stream); // does not extend TokenStream!!!
TokenStream stream1 = splitter.newSinkTokenStream();
TokenStream stream2 = splitter.newSinkTokenStream();
...
{code}

In this case the caching would be done directly in the splitter, and the sinks are only consumers. The first sink that asks for the attribute states forces the splitter to harvest and cache the input stream (exactly like CachingTokenStream does it). In principle it would be the same as a CachingTokenStream.

But on the other hand: you can always create a CachingTokenFilter and reuse the same instance for different fields. Because the indexer always calls reset() before consuming, you could re-read it easily. Any additional filters could then be plugged in front for each field. In this case the order is not important:

{code}
TokenStream stream = new CachingTokenFilter(input);
doc.add(new Field("xyz", new DoSomethingTokenFilter(stream)));
doc.add(new Field("abc", new DoSomethingOtherTokenFilter(stream)));
...
{code}

This would not work if the indexer can consume the different fields in parallel. But in the current state it would not even work with Tee/Sink (not multithread compatible).

was (Author: thetaphi):

In this case we should rename TeeSink to something like SplitTokenStream (which does not extend TokenStream). One could then get any number of sinks from it:

{code}
SplitTokenStream splitter = new SplitTokenStream(stream); // does not extend TokenStream!!!
TokenStream stream1 = splitter.newSinkTokenStream();
TokenStream stream2 = splitter.newSinkTokenStream();
...
{code}

In this case the caching would be done directly in the splitter, and the sinks are only consumers. The first sink that asks for the attribute states forces the splitter to harvest and cache the input stream (exactly like CachingTokenStream does it). In principle it would be the same as a CachingTokenStream.

But on the other hand: you can always create a CachingTokenStream and reuse the same instance for different fields. Because the indexer always calls reset() before consuming, you could re-read it easily. Any additional filters could then be plugged in front for each field. In this case the order is not important:

{code}
TokenStream stream = new CachingTokenStream(input);
doc.add(new Field("xyz", new DoSomethingTokenFilter(stream)));
doc.add(new Field("abc", new DoSomethingOtherTokenFilter(stream)));
...
{code}

This would not work if the indexer can consume the different fields in parallel. But in the current state it would not even work with Tee/Sink (not multithread compatible).
Throttling merges
It may be useful to allow users to throttle merges. A callback that IndexWriter passes into SegmentMerger would suffice, where the individual SegmentMerger methods make use of the callback. I suppose this could slow down overall merging by adding a potentially useless method call. However, since merging typically consumes IO resources for an extended period of time, this offers a way for the user to tune IO consumption and free up IO for other tasks at preferred times.
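A purely hypothetical sketch of what such a callback could look like (this interface does not exist in Lucene; the names are made up for illustration):

{code}
/**
 * Hypothetical throttle callback: IndexWriter would hand an implementation
 * to SegmentMerger, which calls pause() at convenient points while copying
 * postings, stored fields, term vectors, etc.
 */
public interface MergeThrottle {

  /**
   * Reports how many bytes were copied since the last call.  An
   * implementation may sleep (or block until an application-defined quiet
   * period) to limit the IO that merging consumes.
   */
  void pause(long bytesCopiedSinceLastCall) throws InterruptedException;
}
{code}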
Re: constant-score rewrite mode for NumericRangeQuery
On Sat, Jul 18, 2009 at 6:54 AM, Uwe Schindler <u...@thetaphi.de> wrote:

> I did some perf tests with the well-known PerfTest.java from the FieldCacheRangeFilter JIRA issue. I compared a 5 million doc index with precStep=4:
>
> With constant score rewrite:
> avg number of terms: 68.3
> TRIE: best time=6.192687 ms; worst time=463.0907 ms; avg=222.6431290998 ms; sum=31994466
>
> With boolean rewrite:
> avg number of terms: 68.3
> TRIE: best time=12.674237 ms; worst time=583.702957 ms; avg=257.912947 ms; sum=31994466
>
> Both numbers were taken after some warm-up queries, and the random seed was identical (so exactly the same queries). For this index size it still looks faster than Boolean rewrite.

OK these are good results; thanks for running them!

> In particular, the warm-up queries take much longer with Boolean rewrite. The problem with my test here is that the whole index seems to be in the OS cache. If it is not in the OS cache, I think the much longer time the first Boolean queries took will become more important.

Agreed.

> In my opinion, we should keep constant score enabled.

OK +1

> My main problem with Boolean rewrite is the completely useless scoring. A range query should always have constant score. We could maybe fix this some time in the future so that you can disable scoring for Boolean queries (e.g. bq.setDoConstantScore(true)). I think there is a special issue for this in JIRA (I do not know the number offhand).

I completely agree; we need to make it possible to use the BooleanQuery expansion method with constant scoring (I opened an issue for this already -- LUCENE-1644).

> A second problem with Boolean rewrite: with precStep=4, it is guaranteed that the query will not hit the 1024 max clause problem (see the formula for the theoretical maximum term count) - so no problem at all.

Right.

> The problem starts if you combine two or three numeric queries with BooleanClause.Occur.MUST in a top-level Boolean query (the typical example of a geo query). In this case, the Boolean queries that only consist of MUST clauses may be combined into one big one (correct me if I am wrong), and then the max clause count becomes a problem.

Actually Lucene never does structural optimizations of BooleanQuery, and I think it should (though scores would be different). One exception: if the BooleanQuery has a single clause, it'll rewrite itself to the rewrite of that one sub-query.

> If we change the default, keep in mind to reopen SOLR-940, as it assumes constant-score mode by default and Solr's default precisionStep is 8 - *bang*. Maybe the Solr people should fix this and explicitly set the mode for all range queries anyway.

Let's not change the default :)

Mike
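If Solr (or any other caller) wants to stop relying on the default, it can pin the rewrite mode explicitly. A sketch using the constant names as I recall them from the 2.9 MultiTermQuery API (field name, bounds, and precisionStep 8 -- Solr's default, per the mail above -- are illustrative):

{code}
import org.apache.lucene.search.MultiTermQuery;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.Query;

public class ExplicitRewriteExample {
  public static Query newPriceRange() {
    // made-up field and bounds, precisionStep 8
    NumericRangeQuery q = NumericRangeQuery.newIntRange(
        "price", 8, Integer.valueOf(10), Integer.valueOf(100), true, true);
    // don't rely on the library default: force constant-score rewrite explicitly
    q.setRewriteMethod(MultiTermQuery.CONSTANT_SCORE_FILTER_REWRITE);
    return q;
  }
}
{code}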
[jira] Updated: (LUCENE-1750) Create a MergePolicy that limits the maximum size of its segments
[ https://issues.apache.org/jira/browse/LUCENE-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-1750:
-------------------------------------

    Issue Type: Improvement  (was: Bug)
       Summary: Create a MergePolicy that limits the maximum size of its segments  (was: LogByteSizeMergePolicy doesn't keep segments under maxMergeMB)

Create a MergePolicy that limits the maximum size of its segments
------------------------------------------------------------------

                Key: LUCENE-1750
                URL: https://issues.apache.org/jira/browse/LUCENE-1750
            Project: Lucene - Java
         Issue Type: Improvement
         Components: Index
   Affects Versions: 2.4.1
           Reporter: Jason Rutherglen
           Priority: Minor
            Fix For: 3.1
        Attachments: LUCENE-1750.patch
  Original Estimate: 48h
 Remaining Estimate: 48h

Basically I'm trying to create largish 2-4 GB shards using LogByteSizeMergePolicy; however, I've found in the attached unit test that segments exceed maxMergeMB. The goal is for segments to be merged up to 2 GB, then for all merging to that segment to stop, and then for another 2 GB segment to be created. This helps when replicating in Solr, where if a single optimized 60 GB segment is created, the machine stops working due to IO and CPU starvation.
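For context, the existing knob looks roughly like this (2.4-era API as I recall it). Note that maxMergeMB only keeps input segments above that size out of merges; it does not cap the size of the segment a merge produces, which is what this issue is asking for:

{code}
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.LogByteSizeMergePolicy;

public class MergePolicyConfigExample {
  // configure the existing size knob on an already-open writer whose
  // merge policy is the default LogByteSizeMergePolicy
  public static void configure(IndexWriter writer) {
    LogByteSizeMergePolicy mp = (LogByteSizeMergePolicy) writer.getMergePolicy();
    mp.setMaxMergeMB(2048.0);  // segments already larger than ~2 GB are left out of merges
    mp.setMergeFactor(10);     // but a merge of 10 smaller segments can still exceed 2 GB
  }
}
{code}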
[jira] Updated: (LUCENE-1750) Create a MergePolicy that limits the maximum size of its segments
[ https://issues.apache.org/jira/browse/LUCENE-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-1750:
-------------------------------------

    Fix Version/s: 3.1  (was: 2.9)
[jira] Updated: (LUCENE-1742) Wrap SegmentInfos in public class
[ https://issues.apache.org/jira/browse/LUCENE-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1742:
---------------------------------------

    Attachment: LUCENE-1742.patch

Attached patch with tiny changes: made a few more read-only methods public, fixed javadoc warning, one formatting fix, added CHANGES. I think it's ready to commit. I'll commit soon...

Wrap SegmentInfos in public class
---------------------------------

                Key: LUCENE-1742
                URL: https://issues.apache.org/jira/browse/LUCENE-1742
            Project: Lucene - Java
         Issue Type: Improvement
         Components: Index
   Affects Versions: 2.4.1
           Reporter: Jason Rutherglen
           Priority: Trivial
            Fix For: 2.9
        Attachments: LUCENE-1742.patch, LUCENE-1742.patch, LUCENE-1742.patch, LUCENE-1742.patch, LUCENE-1742.patch
  Original Estimate: 48h
 Remaining Estimate: 48h

Wrap SegmentInfos in a public class so that subclasses of MergePolicy do not need to be in the org.apache.lucene.index package.
Re: Throttling merges
The goal is to be like ionice, right? Meaning, lower the priority of IO caused by merging? I agree that makes sense.

I wonder if we could implement it at the Directory level, so that when openInput/createOutput is called we can optionally specify the context (reader, merging, writer, etc.), and then somehow add throttling in there.

Mike

On Sat, Jul 18, 2009 at 10:37 AM, Jason Rutherglen <jason.rutherg...@gmail.com> wrote:

> It may be useful to allow users to throttle merges. A callback that IndexWriter passes into SegmentMerger would suffice, where the individual SegmentMerger methods make use of the callback. I suppose this could slow down overall merging by adding a potentially useless method call. However, since merging typically consumes IO resources for an extended period of time, this offers a way for the user to tune IO consumption and free up IO for other tasks at preferred times.
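As a rough illustration of the Directory-level idea, a write path opened in a "merge" context could feed its byte counts through something like the following rate limiter (a purely hypothetical helper, not part of Lucene):

{code}
/**
 * Hypothetical helper: caps the average write rate by sleeping when the
 * bytes reported in the current one-second window get ahead of the budget.
 * A Directory/IndexOutput wrapper created for merge outputs could call
 * pause() from its write methods.
 */
public class SimpleRateLimiter {

  private final double maxBytesPerSec;
  private long windowStart = System.currentTimeMillis();
  private long bytesInWindow = 0;

  public SimpleRateLimiter(double maxBytesPerSec) {
    this.maxBytesPerSec = maxBytesPerSec;
  }

  /** Records that 'bytes' were just written and sleeps if we are over budget. */
  public synchronized void pause(long bytes) throws InterruptedException {
    bytesInWindow += bytes;
    long elapsed = System.currentTimeMillis() - windowStart;
    double targetMillis = bytesInWindow / maxBytesPerSec * 1000.0;
    if (targetMillis > elapsed) {
      Thread.sleep((long) (targetMillis - elapsed));
    }
    // start a new accounting window roughly once per second
    if (System.currentTimeMillis() - windowStart >= 1000) {
      windowStart = System.currentTimeMillis();
      bytesInWindow = 0;
    }
  }
}
{code}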
[jira] Commented: (LUCENE-1748) getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract
[ https://issues.apache.org/jira/browse/LUCENE-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732938#action_12732938 ]

Earwin Burrfoot commented on LUCENE-1748:
-----------------------------------------

bq. We should drop PayloadSpans and just add getPayload to Spans. This should be a compile time break.

+1

getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract
-------------------------------------------------------------------------------

                Key: LUCENE-1748
                URL: https://issues.apache.org/jira/browse/LUCENE-1748
            Project: Lucene - Java
         Issue Type: Bug
         Components: Query/Scoring
   Affects Versions: 2.4, 2.4.1
        Environment: all
           Reporter: Hugh Cayless
            Fix For: 2.9, 3.0, 3.1

I just spent a long time tracking down a bug resulting from upgrading to Lucene 2.4.1 on a project that implements some SpanQuerys of its own and was written against 2.3. Since the project's SpanQuerys didn't implement getPayloadSpans, the call to that method went to SpanQuery.getPayloadSpans, which returned null and caused a NullPointerException in the Lucene code, far away from the actual source of the problem. It would be much better for this kind of thing to show up at compile time, I think. Thanks!
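For illustration only, a hypothetical sketch of what folding payload access into Spans could look like (this is not the actual Lucene API; it just shows why the break would surface at compile time for custom Spans implementations):

{code}
import java.io.IOException;
import java.util.Collection;

// Hypothetical merged interface: every Spans implementation is forced to
// decide how it handles payloads, instead of silently inheriting a method
// that returns null at runtime.
public interface Spans {
  boolean next() throws IOException;
  boolean skipTo(int target) throws IOException;
  int doc();
  int start();
  int end();

  // formerly on PayloadSpans: the payloads (byte[] entries) of the current match
  Collection getPayload() throws IOException;
  boolean isPayloadAvailable();
}
{code}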
[jira] Commented: (LUCENE-1750) Create a MergePolicy that limits the maximum size of its segments
[ https://issues.apache.org/jira/browse/LUCENE-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732974#action_12732974 ]

Shai Erera commented on LUCENE-1750:
------------------------------------

What happens after several such large segments are created? Wouldn't you want them to be merged into an even larger segment? Otherwise you'll have many such segments and search performance will degrade.

I guess I never thought this was a problem. If I have enough disk space, and my index size reaches 600 GB (which is a huge index) and is split across 10 different segments of 60 GB each, I guess I'd want them to be merged into one larger 600 GB segment. It will take eons until I accumulate another such 600 GB segment, no?

Maybe we can have two merge factors: (1) for small segments, up to a set size threshold, where we do the merges regularly; and (2) for really large segments, a different merge factor. For example, we can say that up to 1 GB the merge factor is 10, and beyond that the merge factor is 20. That will postpone the large IO merges until enough such segments accumulate.

Also, with the current proposal, how will optimize work? Will it skip the very large segments, or will they be included too?
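A tiny illustrative sketch of the two-tier idea (not an existing Lucene policy; the threshold and factors are taken from the example above):

{code}
public class TieredMergeFactorExample {
  private static final long ONE_GB = 1024L * 1024L * 1024L;

  /** Small segments merge with the normal factor; large ones wait for more siblings. */
  static int mergeFactorFor(long segmentSizeInBytes) {
    return segmentSizeInBytes <= ONE_GB ? 10 : 20;
  }
}
{code}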