[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements

2009-07-18 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732869#action_12732869
 ] 

Michael McCandless commented on LUCENE-1693:


bq. Given the difficulty of using it, especially since Lucene has been sorting 
fields before analysis (hence you have to name the fields properly to get one 
to be indexed before the other), maybe no one is using it.

Can't we fix Tee/Sink so that whichever tee is pulled from first does the 
caching, and the second one then pulls from the cache?

I.e., right now when you create them you are forced to commit to which is 
primary and which is secondary, but if we relaxed that, it wouldn't be 
sensitive to the order in which Lucene indexed its fields.

Of course, someday Lucene may index fields concurrently, then Tee/Sink'll get 
really interesting ;)
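
A rough sketch of such an order-insensitive tee (every class and method name below is invented for illustration; this is not the existing Tee/SinkTokenizer API, just the idea expressed via the captureState/restoreState methods from this patch):

{code}
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.util.AttributeSource;

// Hypothetical splitter: whichever sink is consumed first triggers the
// caching, so the order in which Lucene indexes the fields no longer matters.
public final class LazyTeeTokenStream {
  private final TokenStream input;
  private List cache; // of AttributeSource.State; filled on first pull

  public LazyTeeTokenStream(TokenStream input) {
    this.input = input;
  }

  // Harvest and cache the input the first time any sink asks for tokens.
  private synchronized List states() throws IOException {
    if (cache == null) {
      cache = new ArrayList();
      while (input.incrementToken()) {
        cache.add(input.captureState());
      }
    }
    return cache;
  }

  public TokenStream newSink() {
    // Share the input's attributes so restoreState() finds matching targets.
    return new TokenStream(input) {
      private Iterator it;

      public boolean incrementToken() throws IOException {
        if (it == null) {
          it = states().iterator();
        }
        if (!it.hasNext()) {
          return false;
        }
        restoreState((AttributeSource.State) it.next());
        return true;
      }
    };
  }
}
{code}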

 AttributeSource/TokenStream API improvements
 

 Key: LUCENE-1693
 URL: https://issues.apache.org/jira/browse/LUCENE-1693
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1693.patch, lucene-1693.patch, LUCENE-1693.patch, 
 lucene-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, 
 LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, 
 LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, 
 lucene-1693.patch, TestAPIBackwardsCompatibility.java, 
 TestCompatibility.java, TestCompatibility.java, TestCompatibility.java, 
 TestCompatibility.java


 This patch makes the following improvements to AttributeSource and
 TokenStream/Filter:
 - removes the set/getUseNewAPI() methods (including the standard
   ones). Instead by default incrementToken() throws a subclass of
   UnsupportedOperationException. The indexer tries to call
   incrementToken() initially once to see if the exception is thrown;
   if so, it falls back to the old API.
 - introduces interfaces for all Attributes. The corresponding
   implementations have the postfix 'Impl', e.g. TermAttribute and
   TermAttributeImpl. AttributeSource now has a factory for creating
   the Attribute instances; the default implementation looks for
   implementing classes with the postfix 'Impl'. Token now implements
   all 6 TokenAttribute interfaces.
 - new method added to AttributeSource:
   addAttributeImpl(AttributeImpl). Using reflection it walks up in the
   class hierarchy of the passed in object and finds all interfaces
   that the class or superclasses implement and that extend the
   Attribute interface. It then adds the interface-instance mappings
   to the attribute map for each of the found interfaces.
 - AttributeImpl now has a default implementation of toString that uses
   reflection to print out the values of the attributes in a default
   formatting. This makes it a bit easier to implement AttributeImpl,
   because toString() was declared abstract before.
 - Cloning is now done much more efficiently in
   captureState. The method figures out which unique AttributeImpl
   instances are contained as values in the attributes map, because
   those are the ones that need to be cloned. It creates a single
   linked list that supports deep cloning (in the inner class
   AttributeSource.State). AttributeSource keeps track of when this
   state changes, i.e. whenever new attributes are added to the
   AttributeSource. Only in that case will captureState recompute the
   state, otherwise it will simply clone the precomputed state and
   return the clone. restoreState(AttributeSource.State) walks the
   linked list and uses the copyTo() method of AttributeImpl to copy
   all values over into the attribute that the source stream
   (e.g. SinkTokenizer) uses. 
 Cloning performance can be greatly improved if a TokenStream does not use
 multiple AttributeImpl instances. A user can e.g. simply add a Token instance
 to the stream instead of the individual attributes. Or the user could
 implement a subclass of AttributeImpl that implements exactly the Attribute
 interfaces needed. I think this (addAttributeImpl) should be considered an
 expert API, as this manual optimization is only needed if cloning performance
 is crucial. I ran some quick performance tests using Tee/Sink tokenizers
 (which do cloning) and the performance was roughly 20% faster with the new
 API. I'll run some more performance tests and post more numbers then.
 Note also that when we add serialization to the Attributes, e.g. for
 supporting storing serialized TokenStreams in the index, the serialization
 should benefit even more significantly from the new API than cloning does.
 Also, the TokenStream API does not change, except for the removal 
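
The addAttributeImpl() reflection walk described in the list above can be sketched roughly like this (a paraphrase of the described behavior, not the patch's actual code):

{code}
import java.util.Map;

import org.apache.lucene.util.Attribute;
import org.apache.lucene.util.AttributeImpl;

public class AttributeReflectionSketch {
  // Walk up the class hierarchy of impl and map every implemented
  // interface that extends Attribute to this single instance.
  public static void addMappings(AttributeImpl impl, Map attributes) {
    Class clazz = impl.getClass();
    while (clazz != null) {
      Class[] interfaces = clazz.getInterfaces();
      for (int i = 0; i < interfaces.length; i++) {
        Class curInterface = interfaces[i];
        if (curInterface != Attribute.class
            && Attribute.class.isAssignableFrom(curInterface)) {
          attributes.put(curInterface, impl); // interface -> instance mapping
        }
      }
      clazz = clazz.getSuperclass();
    }
  }
}
{code}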
 

RE: constant-score rewrite mode for NumericRangeQuery

2009-07-18 Thread Uwe Schindler
Hi Mike,

I did some perf tests with the well-known PerfTest.java from the
FieldCacheRangeFilter JIRA issue.

I compared a 5 million doc index with precStep=4:

With constant score rewrite:
avg number of terms: 68.3
TRIE: best time=6.192687 ms; worst time=463.0907 ms; avg=222.6431290998 ms; sum=31994466

With Boolean rewrite:
avg number of terms: 68.3
TRIE: best time=12.674237 ms; worst time=583.702957 ms; avg=257.912947 ms; sum=31994466

Both numbers were taken after some warm-up queries, and the random seed was
identical (so exactly the same queries). For this index size, constant score
still looks faster than Boolean rewrite. Especially the warm-up queries take
much longer with Boolean rewrite. The problem with my test here is that the
whole index seems to be in the OS cache. If it is not in the OS cache, I think
the much longer time the first Boolean queries took will get more important.

In my opinion, we should keep constant score enabled. My main problem with
Boolean rewrite is the completely useless scoring. A range query should
always have constant score. We could maybe fix this some time in the future,
so that you can disable scoring for Boolean queries (e.g.
bq.setDoConstantScore(true)). I think there is a special issue for this in
JIRA (I do not know the number offhand).
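
For reference, here is how the rewrite mode can be selected per query with the MultiTermQuery rewrite-method API as committed for 2.9 (a hedged sketch; field name and bounds are invented, and if your checkout predates that API the setter may differ):

{code}
// NumericRangeQuery extends MultiTermQuery, so the rewrite mode can be
// chosen per query instance:
NumericRangeQuery q = NumericRangeQuery.newIntRange(
    "price", 4 /* precisionStep */,
    new Integer(100), new Integer(200), true, true);

// Constant score (the current default under discussion): build a filter,
// every hit gets the same score.
q.setRewriteMethod(MultiTermQuery.CONSTANT_SCORE_FILTER_REWRITE);

// Or expand into a BooleanQuery over all matching terms, with the
// (fairly meaningless) per-term scoring discussed above:
// q.setRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE);
{code}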

A second problem with Boolean rewrite: with precStep=4 it is guaranteed that
the query will not hit the 1024 max clause problem (see the formula for the
theoretical max term count) - so no problem at all. The problem starts if you
combine two or three numeric queries with BooleanClause.Occur.MUST in a
top-level BooleanQuery (the typical example is a geo query). In this case, the
Boolean queries that consist only of MUST clauses may be combined into one big
one (correct me if I am wrong), and then the max clause count becomes a
problem.
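
To make the clause-count concern concrete, a hedged example of the geo case (field names and bounds are invented):

{code}
// Two numeric ranges ANDed together, as in a typical bounding-box query.
// Under Boolean rewrite each range expands into many terms; the concern
// above is whether those expansions end up sharing one clause budget:
BooleanQuery geo = new BooleanQuery();
geo.add(NumericRangeQuery.newDoubleRange(
            "lat", 4, new Double(47.0), new Double(48.0), true, true),
        BooleanClause.Occur.MUST);
geo.add(NumericRangeQuery.newDoubleRange(
            "lng", 4, new Double(11.0), new Double(12.0), true, true),
        BooleanClause.Occur.MUST);
// BooleanQuery.getMaxClauseCount() == 1024 by default; exceeding it at
// rewrite time throws BooleanQuery.TooManyClauses.
{code}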

If we change the default, keep in mind to reopen SOLR-940, as it assumes
constant score mode by default and Solr's default precStep is 8 - *bang*.
Maybe the Solr people should fix this and explicitly set the mode for all
range queries anyway.

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

 -Original Message-
 From: Michael McCandless [mailto:luc...@mikemccandless.com]
 Sent: Friday, July 17, 2009 8:56 PM
 To: java-dev@lucene.apache.org
 Subject: constant-score rewrite mode for NumericRangeQuery
 
 Should we really default to constant-score rewrite with NumericRangeQuery?
 
 Would BooleanQuery rewrite mode give better performance on a large
 index, since the number of terms should be smallish w/ the default
 precisionStep (4), I think?
 
 Mike
 






[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements

2009-07-18 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732878#action_12732878
 ] 

Uwe Schindler commented on LUCENE-1693:
---

In this case we should rename TeeSink to something like SplitTokenStream (which 
does not extend TokenStream). One could then get any number of sinks from it:

{code}
SplitTokenStream splitter = new SplitTokenStream(stream); // does not extend TokenStream!!!
TokenStream stream1 = splitter.newSinkTokenStream();
TokenStream stream2 = splitter.newSinkTokenStream();
...
{code}

In this case the caching would be done directly in the splitter and the sinks 
are only consumers. The first sink that asks for the attribute states forces 
the splitter to harvest and cache the input stream (exactly like 
CachingTokenFilter does). In principle it would be the same as a 
CachingTokenFilter.

But on the other hand: you can always create a CachingTokenFilter and reuse the 
same instance for different fields. Because the indexer always calls reset() 
before consuming, you could re-read it easily. Any additional filters could 
then be plugged in front of it for each field. In this case the order is not 
important:

{code}
TokenStream stream = new CachingTokenFilter(input);
doc.add(new Field("xyz", new DoSomethingTokenFilter(stream)));
doc.add(new Field("abc", new DoSomethingOtherTokenFilter(stream)));
...
{code}

This would not work if the indexer consumed the different fields in parallel. 
But in the current state it would not work with Tee/Sink either (it is not 
multithread compatible).

 AttributeSource/TokenStream API improvements
 

 Key: LUCENE-1693
 URL: https://issues.apache.org/jira/browse/LUCENE-1693
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1693.patch, lucene-1693.patch, LUCENE-1693.patch, 
 lucene-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, 
 LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, 
 LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, 
 lucene-1693.patch, TestAPIBackwardsCompatibility.java, 
 TestCompatibility.java, TestCompatibility.java, TestCompatibility.java, 
 TestCompatibility.java



[jira] Issue Comment Edited: (LUCENE-1693) AttributeSource/TokenStream API improvements

2009-07-18 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732878#action_12732878
 ] 

Uwe Schindler edited comment on LUCENE-1693 at 7/18/09 4:06 AM:


In this case we should rename TeeSink to something like SplitTokenStream (which 
does not extend TokenStream). One could then get any number of sinks from it:

{code}
SplitTokenStream splitter = new SplitTokenStream(stream); // does not extend TokenStream!!!
TokenStream stream1 = splitter.newSinkTokenStream();
TokenStream stream2 = splitter.newSinkTokenStream();
...
{code}

In this case the caching would be done directly in the splitter and the sinks 
are only consumers. The first sink that asks for the attribute states forces 
the splitter to harvest and cache the input stream (exactly like 
CachingTokenFilter does). In principle it would be the same as a 
CachingTokenFilter.

But on the other hand: you can always create a CachingTokenFilter and reuse the 
same instance for different fields. Because the indexer always calls reset() 
before consuming, you could re-read it easily. Any additional filters could 
then be plugged in front of it for each field. In this case the order is not 
important:

{code}
TokenStream stream = new CachingTokenFilter(input);
doc.add(new Field("xyz", new DoSomethingTokenFilter(stream)));
doc.add(new Field("abc", new DoSomethingOtherTokenFilter(stream)));
...
{code}

This would not work if the indexer consumed the different fields in parallel. 
But in the current state it would not work with Tee/Sink either (it is not 
multithread compatible).

 AttributeSource/TokenStream API improvements
 

 Key: LUCENE-1693
 URL: https://issues.apache.org/jira/browse/LUCENE-1693
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1693.patch, lucene-1693.patch, LUCENE-1693.patch, 
 lucene-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, 
 LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, 
 LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, 
 lucene-1693.patch, TestAPIBackwardsCompatibility.java, 
 TestCompatibility.java, TestCompatibility.java, TestCompatibility.java, 
 TestCompatibility.java



Throttling merges

2009-07-18 Thread Jason Rutherglen
It may be useful to allow users to throttle merges. A callback
that IW passes into SegmentMerger would suffice, with the individual
SM methods making use of the callback. I suppose this could slow
down overall merging by adding a potentially useless method
call. However, since merging typically consumes IO resources for an
extended period of time, this offers a way for the user to tune
IO consumption and free up IO for other tasks at preferred times.
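
A minimal sketch of that callback (the interface and its wiring into SegmentMerger are invented here; nothing like this exists today):

{code}
import java.io.IOException;

// Hypothetical hook: SegmentMerger would call this between chunks of
// merge work, and an implementation may block to slow merge IO down.
public interface MergeThrottle {
  void checkPause(long bytesMergedSoFar) throws IOException;
}

// Example user implementation: a crude fixed-rate limiter.
class SimpleRateLimiter implements MergeThrottle {
  private final long bytesPerSec;
  private final long startMillis = System.currentTimeMillis();

  SimpleRateLimiter(long bytesPerSec) {
    this.bytesPerSec = bytesPerSec;
  }

  public void checkPause(long bytesMergedSoFar) throws IOException {
    long expectedMillis = bytesMergedSoFar * 1000L / bytesPerSec;
    long actualMillis = System.currentTimeMillis() - startMillis;
    if (actualMillis < expectedMillis) { // merging faster than the target rate
      try {
        Thread.sleep(expectedMillis - actualMillis);
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
      }
    }
  }
}
{code}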




Re: constant-score rewrite mode for NumericRangeQuery

2009-07-18 Thread Michael McCandless
On Sat, Jul 18, 2009 at 6:54 AM, Uwe Schindler u...@thetaphi.de wrote:

 I did some perf tests with the well-known PerfTest.java from the
 FieldCacheRangeFilter JIRA issue.

 I compared a 5 million doc index with precStep=4:

 With constant score rewrite:
 avg number of terms: 68.3
 TRIE: best time=6.192687 ms; worst time=463.0907 ms; avg=222.6431290998
 ms; sum=31994466

 With boolean rewrite:
 avg number of terms: 68.3
 TRIE: best time=12.674237 ms; worst time=583.702957 ms; avg=257.912947 ms;
 sum=31994466

 Both numbers were taken after some warm-up queries, and the random seed was
 identical (so exactly the same queries). For this index size, constant score
 still looks faster than Boolean rewrite.

OK these are good results; thanks for running them!

 Especially the warm-up queries take much longer
 with Boolean rewrite. The problem with my test here is that the whole index
 seems to be in the OS cache. If it is not in the OS cache, I think the much
 longer time the first Boolean queries took will get more important.

Agreed.

 In my opinion, we should keep constant score enabled.

OK +1

 My main problem with
 Boolean rewrite is the completely useless scoring. A range query should
 always have constant score. We could maybe fix this some time in the future,
 so that you can disable scoring for Boolean queries (e.g.
 bq.setDoConstantScore(true)). I think there is a special issue for this in
 JIRA (I do not know the number offhand).

I completely agree; we need to make it possible to do the BooleanQuery
expansion method with constant scoring (I already opened an issue for
this -- LUCENE-1644).

 A second problem with Boolean rewrite: with precStep=4 it is guaranteed that
 the query will not hit the 1024 max clause problem (see the formula for the
 theoretical max term count) - so no problem at all.

Right.

 The problem starts if you
 combine two or three numeric queries with BooleanClause.Occur.MUST in a
 top-level BooleanQuery (the typical example is a geo query). In this case,
 the Boolean queries that consist only of MUST clauses may be combined into
 one big one (correct me if I am wrong), and then the max clause count
 becomes a problem.

Actually Lucene never does structural optimizations of BooleanQuery,
and I think it should (though scores would be different).

One exception: if the BooleanQuery has a single clause, it'll rewrite
itself to the rewrite of that one sub-query.

 If we change the default, keep in mind to reopen SOLR-940, as it assumes
 constant score mode by default and Solr's default precStep is 8 - *bang*.
 Maybe the Solr people should fix this and explicitly set the mode for all
 range queries anyway.

Let's not change the default :)

Mike




[jira] Updated: (LUCENE-1750) Create a MergePolicy that limits the maximum size of its segments

2009-07-18 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated LUCENE-1750:
-

Issue Type: Improvement  (was: Bug)
   Summary: Create a MergePolicy that limits the maximum size of its 
segments  (was: LogByteSizeMergePolicy doesn't keep segments under maxMergeMB)

 Create a MergePolicy that limits the maximum size of its segments
 --

 Key: LUCENE-1750
 URL: https://issues.apache.org/jira/browse/LUCENE-1750
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 3.1

 Attachments: LUCENE-1750.patch

   Original Estimate: 48h
  Remaining Estimate: 48h

 Basically I'm trying to create largish 2-4GB shards using
 LogByteSizeMergePolicy; however, the attached unit test shows
 segments that exceed maxMergeMB.
 The goal is for segments to be merged up to 2GB, then for all
 merging to that segment to stop, and for another 2GB segment to be
 created. This helps when replicating in Solr, where if a single
 optimized 60GB segment is created, the machine stops working due
 to IO and CPU starvation.
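
A sketch of the configuration being attempted (assuming the 2.4.x API, where LogByteSizeMergePolicy has a no-arg constructor; per this issue the cap is not actually honored yet):

{code}
// Cap merged segments at roughly 2GB; segments above this size should
// not be selected for further merging.
LogByteSizeMergePolicy mp = new LogByteSizeMergePolicy();
mp.setMaxMergeMB(2048.0);
mp.setMergeFactor(10);
writer.setMergePolicy(mp); // writer is an existing IndexWriter
{code}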




[jira] Updated: (LUCENE-1750) Create a MergePolicy that limits the maximum size of its segments

2009-07-18 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated LUCENE-1750:
-

Fix Version/s: (was: 2.9)
   3.1

 Create a MergePolicy that limits the maximum size of its segments
 --

 Key: LUCENE-1750
 URL: https://issues.apache.org/jira/browse/LUCENE-1750
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 3.1

 Attachments: LUCENE-1750.patch

   Original Estimate: 48h
  Remaining Estimate: 48h





[jira] Updated: (LUCENE-1742) Wrap SegmentInfos in public class

2009-07-18 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1742:
---

Attachment: LUCENE-1742.patch

Attached patch with tiny changes: made a few more read-only methods public, 
fixed a javadoc warning, made one formatting fix, and added a CHANGES entry.

I think it's ready to commit.  I'll commit soon...

 Wrap SegmentInfos in public class 
 --

 Key: LUCENE-1742
 URL: https://issues.apache.org/jira/browse/LUCENE-1742
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Priority: Trivial
 Fix For: 2.9

 Attachments: LUCENE-1742.patch, LUCENE-1742.patch, LUCENE-1742.patch, 
 LUCENE-1742.patch, LUCENE-1742.patch

   Original Estimate: 48h
  Remaining Estimate: 48h

 Wrap SegmentInfos in a public class so that subclasses of MergePolicy do not 
 need to be in the org.apache.lucene.index package.  




Re: Throttling merges

2009-07-18 Thread Michael McCandless
The goal is to be like ionice, right?  Meaning, lower the priority of
IO caused by merging?  I agree that makes sense.

I wonder if we could implement it at the Directory level, so that when
openInput/createOutput is called we could optionally specify the
context (reader, merging, writer, etc.), and then somehow add
throttling in there.
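
A rough sketch of that Directory-level hook (the context argument is an assumption; today's Directory.createOutput(String) has no such parameter):

{code}
import java.io.IOException;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IndexOutput;

// Hypothetical: a Directory that knows *why* a file is being created, so
// merge output can be rate-limited while normal writes run at full speed.
public abstract class ContextAwareDirectory extends Directory {

  // Invented context markers; not part of the current Directory API.
  public static final int CONTEXT_READER = 0;
  public static final int CONTEXT_WRITER = 1;
  public static final int CONTEXT_MERGE = 2;

  // Hypothetical overload: IndexWriter/SegmentMerger would pass
  // CONTEXT_MERGE, and implementations could wrap the returned
  // IndexOutput with a throttled one.
  public abstract IndexOutput createOutput(String name, int context)
      throws IOException;
}
{code}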

Mike

On Sat, Jul 18, 2009 at 10:37 AM, Jason Rutherglen
jason.rutherg...@gmail.com wrote:
 It may be useful to allow users to throttle merges. A callback
 that IW passes into SegmentMerger would suffice, with the individual
 SM methods making use of the callback. I suppose this could slow
 down overall merging by adding a potentially useless method
 call. However, since merging typically consumes IO resources for an
 extended period of time, this offers a way for the user to tune
 IO consumption and free up IO for other tasks at preferred times.




[jira] Commented: (LUCENE-1748) getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract

2009-07-18 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732938#action_12732938
 ] 

Earwin Burrfoot commented on LUCENE-1748:
-

bq. We should drop PayloadSpans and just add getPayload to Spans. This should 
be a compile time break.
+1

 getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract
 --

 Key: LUCENE-1748
 URL: https://issues.apache.org/jira/browse/LUCENE-1748
 Project: Lucene - Java
  Issue Type: Bug
  Components: Query/Scoring
Affects Versions: 2.4, 2.4.1
 Environment: all
Reporter: Hugh Cayless
 Fix For: 2.9, 3.0, 3.1


 I just spent a long time tracking down a bug resulting from upgrading to 
 Lucene 2.4.1 on a project that implements some SpanQuerys of its own and was 
 written against 2.3.  Since the project's SpanQuerys didn't implement 
 getPayloadSpans, the call to that method went to SpanQuery.getPayloadSpans 
 which returned null and caused a NullPointerException in the Lucene code, far 
 away from the actual source of the problem.  
 It would be much better for this kind of thing to show up at compile time, I 
 think.
 Thanks!




[jira] Commented: (LUCENE-1750) Create a MergePolicy that limits the maximum size of its segments

2009-07-18 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732974#action_12732974
 ] 

Shai Erera commented on LUCENE-1750:


What happens after several such large segments are created? Wouldn't you want 
them to be merged into an even larger segment? Otherwise you'll have many such 
segments and search performance will degrade.

I guess I never thought this was a problem. If I have enough disk space, and my 
index size reaches 600 GB (which is a huge index), and it is split across 10 
different segments of 60GB each, I guess I'd want them to be merged into one 
larger 600GB segment. It will take eons until I accumulate another such 600 GB 
segment, no?

Maybe we can have two merge factors: 1) for small segments, up to a set size 
threshold, we do the merges regularly; 2) for really large segments the merge 
factor is different. For example, we can say that up to 1GB the merge factor is 
10, and beyond that it is 20. That will postpone the large IO merges until 
enough such segments accumulate.
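
Purely illustrative (no such two-tier policy exists; names and thresholds are invented):

{code}
// The "two merge factors" idea: merge small segments eagerly, but let
// large segments pile up longer before paying the big-IO merge cost.
int mergeFactorFor(long segmentSizeMB) {
  final long thresholdMB = 1024;                 // e.g. the 1GB cutoff
  return segmentSizeMB <= thresholdMB ? 10 : 20; // small: 10, large: 20
}
{code}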

Also, w/ the current proposal, how will optimize work? Will it skip the very 
large segments, or will they be included too?

 Create a MergePolicy that limits the maximum size of its segments
 --

 Key: LUCENE-1750
 URL: https://issues.apache.org/jira/browse/LUCENE-1750
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 3.1

 Attachments: LUCENE-1750.patch

   Original Estimate: 48h
  Remaining Estimate: 48h

