[jira] Commented: (LUCENE-1709) Parallelize Tests
[ https://issues.apache.org/jira/browse/LUCENE-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724815#action_12724815 ] Michael McCandless commented on LUCENE-1709:

Actually I see decent gains from concurrency: when I run tests with 6 threads my tests run a little over 3X faster (12:59 with 1 thread and 4:15 with 6 threads).

I'm using a Python script that launches the threads, each specifying -Dtestpackage to run a certain subset of Lucene's tests.

This is on an OpenSolaris (2009.06) machine, with a Core i7 920 CPU (= 8 cores presented to the OS) and an Intel X25M SSD, 12 GB RAM. The hardware has quite a bit of concurrency.

> Parallelize Tests
> -----------------
>
>                 Key: LUCENE-1709
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1709
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.4.1
>            Reporter: Jason Rutherglen
>             Fix For: 3.0
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> The Lucene tests can be parallelized to make for a faster testing system.
> This task from ANT can be used:
> http://ant.apache.org/manual/CoreTasks/parallel.html
> Previous discussion:
> http://www.gossamer-threads.com/lists/lucene/java-dev/69669
> Notes from Mike M.:
> {quote}
> I'd love to see a clean solution here (the tests are embarrassingly
> parallelizable, and we all have machines with good concurrency these
> days)... I have a rather hacked up solution now, that uses
> "-Dtestpackage=XXX" to split the tests up.
> Ideally I would be able to say "use N threads" and it'd do the right
> thing... like the -j flag to make.
> {quote}

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
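A minimal sketch of the "-j N"-style splitter described above, translated to Java (the package names are illustrative, and the ant command line is only assembled, not executed, so this is a shape of the idea rather than a working runner):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical N-thread test splitter: each worker would invoke
// "ant test -Dtestpackage=<pkg>" for one subset of the test packages.
class ParallelTestRunner {
    static List<String> runAll(List<String> packages, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<String>> futures = new ArrayList<Future<String>>();
        for (final String pkg : packages) {
            futures.add(pool.submit(new Callable<String>() {
                public String call() {
                    // Placeholder: a real runner would Runtime.exec this
                    // command and collect the build output/exit code.
                    return "ant test -Dtestpackage=" + pkg;
                }
            }));
        }
        List<String> commands = new ArrayList<String>();
        for (Future<String> f : futures) {
            commands.add(f.get()); // also propagates any worker failure
        }
        pool.shutdown();
        return commands;
    }
}
```

Since the test packages are independent, the pool gives roughly make's -j behavior: with 6 threads and 8 cores the 3X speedup reported above is plausible, bounded by the slowest single package.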
[jira] Updated: (LUCENE-1630) Mating Collector and Scorer on doc Id orderness
[ https://issues.apache.org/jira/browse/LUCENE-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-1630:
---
    Attachment: LUCENE-1630.patch

Added testcase to TestBooleanQuery

> Mating Collector and Scorer on doc Id orderness
> -----------------------------------------------
>
>                 Key: LUCENE-1630
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1630
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>            Reporter: Shai Erera
>            Assignee: Michael McCandless
>             Fix For: 2.9
>
>         Attachments: LUCENE-1630-2.patch, LUCENE-1630.patch, LUCENE-1630.patch, LUCENE-1630.patch, LUCENE-1630.patch, LUCENE-1630.patch, LUCENE-1630.patch, LUCENE-1630.patch, LUCENE-1630.patch
>
> This is a spin off of LUCENE-1593. This issue proposes to expose appropriate API on Scorer and Collector such that one can create an optimized Collector based on a given Scorer's doc-id orderness and vice versa. Copied from LUCENE-1593, here is the list of changes:
> # Deprecate Weight and create QueryWeight (abstract class) with a new scorer(reader, scoreDocsInOrder), replacing the current scorer(reader) method. QueryWeight implements Weight, while score(reader) calls score(reader, false /* out-of-order */) and scorer(reader, scoreDocsInOrder) is defined abstract.
> #* Also add QueryWeightWrapper to wrap a given Weight implementation. This one will also be deprecated, as well as package-private.
> #* Add to Query variants of createWeight and weight which return QueryWeight. For now, I prefer to add a default impl which wraps the Weight variant instead of overriding in all Query extensions, and in 3.0 when we remove the Weight variants - override in all extending classes.
> # Add to Scorer isOutOfOrder with a default to false, and override in BS to true.
> # Modify BooleanWeight to extend QueryWeight and implement the new scorer method to return BS2 or BS based on the number of required scorers and setAllowOutOfOrder.
> # Add to Collector an abstract _acceptsDocsOutOfOrder_ which returns true/false.
> #* Use it in IndexSearcher.search methods, that accept a Collector, in order to create the appropriate Scorer, using the new QueryWeight.
> #* Provide a static create method to TFC and TSDC which accept this as an argument and creates the proper instance.
> #* Wherever we create a Collector (TSDC or TFC), always ask for out-of-order Scorer and check on the resulting Scorer isOutOfOrder(), so that we can create the optimized Collector instance.
> # Modify IndexSearcher to use all of the above logic.
> The only class I'm worried about, and would like to verify with you, is Searchable. If we want to deprecate all the search methods on IndexSearcher, Searcher and Searchable which accept Weight and add new ones which accept QueryWeight, we must do the following:
> * Deprecate Searchable in favor of Searcher.
> * Add to Searcher the new QueryWeight variants. Here we have two choices: (1) break back-compat and add them as abstract (like we've done with the new Collector method) or (2) add them with a default impl to call the Weight versions, documenting these will become abstract in 3.0.
> * Have Searcher extend UnicastRemoteObject and have RemoteSearchable extend Searcher. That's the part I'm a little bit worried about - Searchable implements java.rmi.Remote, which means there could be an implementation out there which implements Searchable and extends something different than UnicastRemoteObject, like Activeable. I think there is very small chance this has actually happened, but would like to confirm with you guys first.
> * Add a deprecated, package-private, SearchableWrapper which extends Searcher and delegates all calls to the Searchable member.
> * Deprecate all uses of Searchable and add Searcher instead, defaulting the old ones to use SearchableWrapper.
> * Make all the necessary changes to IndexSearcher, MultiSearcher etc. regarding overriding these new methods.
> One other optimization that was discussed in LUCENE-1593 is to expose a topScorer() API (on Weight) which returns a Scorer that its score(Collector) will be called, and additionally add a start() method to DISI. That will allow Scorers to initialize either on start() or score(Collector). This was proposed mainly because of BS and BS2 which check if they are initialized in every call to next(), skipTo() and score(). Personally I prefer to see that in a separate issue, following that one (as it might add methods to QueryWeight).
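The scorer/collector matching described in the list above can be sketched in a few lines. This is only an illustration of the proposed contract, with the real signatures heavily simplified (the actual scorer() method also takes an IndexReader, and the class names follow the proposal, not released Lucene API):

```java
// Simplified sketch of the proposed doc-id-orderness handshake.
abstract class SketchCollector {
    // Proposed new abstract method: may this collector receive docs out of order?
    public abstract boolean acceptsDocsOutOfOrder();
}

abstract class SketchQueryWeight {
    // Proposed replacement for scorer(reader): the caller states whether
    // in-order scoring is required, so BooleanWeight can pick BS vs BS2.
    public abstract SketchScorer scorer(boolean scoreDocsInOrder);
}

abstract class SketchScorer {
    // Proposed: defaults to false; BooleanScorer (BS) would override to true.
    public boolean isOutOfOrder() { return false; }
}

class SearcherSketch {
    // The searcher asks for an out-of-order scorer exactly when the
    // collector tolerates out-of-order delivery.
    SketchScorer scorerFor(SketchQueryWeight weight, SketchCollector collector) {
        return weight.scorer(!collector.acceptsDocsOutOfOrder());
    }
}
```

The inverse direction mentioned in the issue (TFC/TSDC create methods) is the same handshake run the other way: ask for an out-of-order scorer first, then build the collector variant matching the scorer's isOutOfOrder() answer.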
Re: Improving TimeLimitedCollector
I like the overall approach. However it's very local to an IndexReader. I.e., if someone wanted to limit other operations (say indexing), or does not use an IndexReader (for a Scorer impl maybe), one cannot reuse it.

What if we factor out the timeout logic to a Timeout class (I think it can be a static class, with the way you implemented it) and use it in TimeLimitingIndexReader? That class can offer a method check() which will do the internal logic (the 'if' check and throw exception). It will be similar to the current ensureOpen() followed by an operation.

It might be considered more expensive since it won't check a boolean, but instead call a check() method, but it will be more reusable. Also, ensureOpen today is also a method call, so I don't think Timeout.check() is that bad. We can even later create a TimeLimitingIndexWriter and document the Timeout class for other usage by external code.

Aside, how about using a PQ for the threads' times, or a TreeMap? That will save looping over the collection to find the next candidate. Just an implementation detail though.

Shai

On Sat, Jun 27, 2009 at 3:31 AM, Mark Harwood wrote:
> Going back to my post re TimeLimitedIndexReaders - here's an incomplete but functional prototype:
> http://www.inperspective.com/lucene/TimeLimitedIndexReader.java
> http://www.inperspective.com/lucene/TestTimeLimitedIndexReader.java
>
> The principle is that all reader accesses check a volatile variable indicating something may have timed out (no need to check thread locals etc.). If and only if a time out has been noted, threadlocals are checked to see which thread should throw a timeout exception.
>
> All time-limited use of a reader must be wrapped in try...finally calls to indicate the start and stop of a timed set of activities. A background thread maintains the next anticipated timeout deadline and simply waits until this is reached or the list of planned activities changes with new deadlines.
>
> Performance seems reasonable on my Wikipedia index:
>
> //some tests for heavy use of termenum/term docs
> Read term docs for 20 terms in 4755 ms using no timeout limit (warm up)
> Read term docs for 20 terms in 4320 ms using no timeout limit (warm up)
> Read term docs for 20 terms in 4320 ms using no timeout limit
> Read term docs for 20 terms in 4388 ms using reader with time-limited access
>
> //Example query with heavy use of termEnum/termDocs
> +text:f* +text:a* +text:b* no time limit matched 1090041 docs in 2000 ms
> +text:f* +text:a* +text:b* time limited collector matched 1090041 docs in 1963 ms
> +text:f* +text:a* +text:b* time limited reader matched 1090041 docs in 2121 ms
>
> //Example fuzzy match burning CPU reading TermEnum
> text:accomodation~0.5 no time limit matched 192084 docs in 6428 ms
> text:accomodation~0.5 time limited collector matched 192084 docs in 5923 ms
> text:accomodation~0.5 time limited reader matched 192084 docs in 5945 ms
>
> The reader approach to limiting time is slower but has these advantages:
>
> 1) Multiple reader activities can be time-limited rather than just single searches
> 2) No code changes required to scorers/queries/filters etc.
> 3) Tasks that spend plenty of time burning CPU before collection happens can be killed earlier
>
> I'm sure there are some thread-safety issues to work through in my code and not all reader classes are wrapped (e.g. TermPositions), but the basics are there and seem to be functioning.
>
> Thoughts?
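A minimal sketch of the factored-out Timeout idea being discussed (the class and method names follow the proposal in this thread but are otherwise hypothetical, and the flag handling is deliberately simplistic): a volatile flag keeps the common path to a single read, and per-thread deadlines are consulted only once a watchdog has raised the flag.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical Timeout utility in the spirit of the proposal: wrap any
// timed work in start()/stop(), and have shared resources call check()
// at their access points, much like ensureOpen().
class Timeout {
    // Flipped by a watchdog when some deadline may have passed, so the
    // usual cost of check() is one volatile read, not a map lookup.
    private static volatile boolean maybeTimedOut = false;
    private static final Map<Thread, Long> deadlines =
            new ConcurrentHashMap<Thread, Long>();

    static void start(long maxTimeMillis) {
        deadlines.put(Thread.currentThread(),
                Long.valueOf(System.currentTimeMillis() + maxTimeMillis));
    }

    static void stop() {
        deadlines.remove(Thread.currentThread());
    }

    // Called by a background watchdog thread (not shown) that sleeps
    // until the nearest deadline; a PQ/TreeMap would find it cheaply.
    static void signalPossibleTimeout() {
        maybeTimedOut = true;
    }

    // The ensureOpen() analogue: cheap test first, full check only if flagged.
    static void check() {
        if (!maybeTimedOut) {
            return;
        }
        Long deadline = deadlines.get(Thread.currentThread());
        if (deadline != null && System.currentTimeMillis() > deadline.longValue()) {
            throw new RuntimeException("time-limited activity exceeded its deadline");
        }
    }
}
```

Because nothing here is reader-specific, the same class could back a TimeLimitingIndexWriter or any custom code, which is the reusability argument made above.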
[jira] Resolved: (LUCENE-1630) Mating Collector and Scorer on doc Id orderness
[ https://issues.apache.org/jira/browse/LUCENE-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-1630.
---
    Resolution: Fixed

Super -- I just committed this; thanks Shai.

> Mating Collector and Scorer on doc Id orderness
> -----------------------------------------------
>
>                 Key: LUCENE-1630
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1630
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>            Reporter: Shai Erera
>            Assignee: Michael McCandless
>             Fix For: 2.9
Re: Improving TimeLimitedCollector
Thanks for the feedback, Shai.

So I guess you're suggesting breaking this out into a general utility class e.g. something like:

class TimeLimitedThreadActivity
{
    //called by client
    public static void startTimeLimitedActivity(long maxTimePermitted);
    public static void endTimeLimitedActivity();

    //called by resources (readers/writers) that need to be shared fairly by threads
    public static void checkActivityNotElapsed(); //throws some form of runtime exception
}

A downside of breaking it out into static methods like this is that a thread cannot run >1 time-limited activity simultaneously, but I guess that might be a reasonable restriction.

>> Aside, how about using a PQ for the threads' times, or a TreeMap? That will save looping over the collection to find the next candidate. Just an implementation detail though.

Yep, that was one of the rough edges - I just wanted to get raw timings first for all the "is timed out?" checks we're injecting into reader calls.

Cheers
Mark

On 27 Jun 2009, at 11:37, Shai Erera wrote:
> I like the overall approach. However it's very local to an IndexReader. [...]
Re: Improving TimeLimitedCollector
> A downside of breaking it out into static methods like this is that a thread cannot run >1 time-limited activity simultaneously but I guess that might be a reasonable restriction.

I'm not sure I understand that - how can a thread run >1 activity simultaneously anyway, and how does your impl in TimeLimitingIndexReader allow it to do so? You use the thread as a key to the map. Am I missing something?

Anyway, I think we can let go of the static methods and make them instance methods. I think .. if I want to use time-limited activities, I should create a TimeLimitedThreadActivity instance and pass it around, to TimeLimitingIndexReader (and maybe in the future to a similar **IndexWriter) and any other custom code I have which I want to put a time limit on.

A static class has the advantage of not needing to pass it around everywhere, and is accessible from everywhere, so that if we discover that limiting on IndexReader is not enough, and we want some of the scorers to check more frequently if they should stop, we won't need to pass that instance all the way down to them.

I don't mind keeping it static, but I also don't mind if it will be an instance passed around, since currently it's only passed to IndexReader.

Are you going to open an issue for that? Seems like a nice addition to me. Do you think it should belong in core or contrib? If 'core' then, if that's possible, I'd like to see all timeout classes under one package, including TimeLimitingCollector (which until 2.9 we can safely move around as we want). I don't mind working on that w/ you, if you want.

Shai

On Sat, Jun 27, 2009 at 2:24 PM, Mark Harwood wrote:
> Thanks for the feedback, Shai.
> So I guess you're suggesting breaking this out into a general utility class [...]
Re: Improving TimeLimitedCollector
Why don't you use Thread.interrupt(), .isInterrupted()?

On Sat, Jun 27, 2009 at 16:16, Shai Erera wrote:
>> A downside of breaking it out into static methods like this is that a thread cannot run >1 time-limited activity simultaneously but I guess that might be a reasonable restriction.
>
> I'm not sure I understand that - how can a thread run >1 activity simultaneously anyway, and how does your impl in TimeLimitingIndexReader allow it to do so? You use the thread as a key to the map. Am I missing something? [...]
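The interrupt-based alternative floated here might look something like the following sketch (names are hypothetical; a watchdog Timer interrupts the worker when its budget elapses, and the worker polls the interrupt status at its check points instead of consulting a deadline map):

```java
import java.util.Timer;
import java.util.TimerTask;

// Hypothetical Thread.interrupt()-based timeout: the interrupt flag
// replaces the volatile-flag-plus-threadlocals scheme.
class InterruptBasedTimeout {
    // Arms a daemon timer that interrupts 'worker' after maxTimeMillis.
    static Timer arm(final Thread worker, long maxTimeMillis) {
        Timer watchdog = new Timer(true);
        watchdog.schedule(new TimerTask() {
            public void run() {
                worker.interrupt();
            }
        }, maxTimeMillis);
        return watchdog;
    }

    // What check() inside reader calls could look like under this scheme.
    static void check() {
        if (Thread.interrupted()) { // note: also clears the flag
            throw new RuntimeException("activity timed out");
        }
    }
}
```

One caveat with this approach: interrupt() also wakes wait()/sleep() and causes channels implementing InterruptibleChannel to close themselves, so it reaches further than a dedicated timeout flag does; that may matter for code doing NIO-based index access.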
Re: Improving TimeLimitedCollector
Odd. I see you're responding to a message from Shai I didn't get. Some mail being dropped somewhere along the line..

> Why don't you use Thread.interrupt(), .isInterrupted()?

Not sure where exactly you mean for that?

> I'm not sure I understand that - how can a thread run >1 activity simultaneously anyway, and how does your impl in TimeLimitingIndexReader allow it to do so?

I can have a single thread using multiple readers/writers, and each reader/writer has different timed activities e.g.

reader1.startActivity(6)
try
    do reader 1 stuff
    reader2.startActivity(1);
    try
        do reader 2 stuff.
    finally
        reader2.endActivity();
finally
    reader1.endActivity();

Each reader has its own list of threads and their anticipated timeouts in relation to that particular resource. Bit of an edge case perhaps, but anything based on statics precludes this type of code. An instance of a timeout object would give this flexibility but, as you say, you would need to pass a timeoutable-activity context object around, which is a bit more hassle. Flexibility vs simplicity decision to be made here.

> I don't mind working on that w/ you, if you want.

Appreciated. I'll move the timeout management logic out of reader into another class and open a Jira issue so we can move the discussion/development there.

Cheers
Mark
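The instance-based variant being weighed here can be sketched briefly (class name hypothetical): each resource owns its own timeout context, so the nested reader1/reader2 pattern above works with independent deadlines, at the cost of passing the context around.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical per-instance timeout context: one per time-limited
// resource (reader, writer, ...), so a single thread can nest
// activities against different resources.
class ActivityTimeout {
    private final Map<Thread, Long> deadlines =
            new ConcurrentHashMap<Thread, Long>();

    void startActivity(long maxTimeMillis) {
        deadlines.put(Thread.currentThread(),
                Long.valueOf(System.currentTimeMillis() + maxTimeMillis));
    }

    void endActivity() {
        deadlines.remove(Thread.currentThread());
    }

    // Resources call this at their access points (cf. ensureOpen()).
    void check() {
        Long deadline = deadlines.get(Thread.currentThread());
        if (deadline != null && System.currentTimeMillis() > deadline.longValue()) {
            throw new RuntimeException("activity timed out on this resource");
        }
    }
}
```

With statics, the nested example above would need one shared deadline per thread; with instances, reader2's 1-second budget can expire without touching reader1's 6-second one.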
[jira] Created: (LUCENE-1719) Add javadoc notes about ICUCollationKeyFilter's speed advantage over CollationKeyFilter
Add javadoc notes about ICUCollationKeyFilter's speed advantage over CollationKeyFilter
---------------------------------------------------------------------------------------

                 Key: LUCENE-1719
                 URL: https://issues.apache.org/jira/browse/LUCENE-1719
             Project: Lucene - Java
          Issue Type: Improvement
          Components: contrib/*
    Affects Versions: 2.4.1
            Reporter: Steven Rowe
            Priority: Trivial
             Fix For: 2.9

contrib/collation's ICUCollationKeyFilter, which uses ICU4J collation, is faster than CollationKeyFilter, the JVM-provided java.text.Collator implementation in the same package. The javadocs of these classes should be modified to add a note to this effect.

My curiosity was piqued by [Robert Muir's comment|https://issues.apache.org/jira/browse/LUCENE-1581?focusedCommentId=12720300#action_12720300] on LUCENE-1581, in which he states that ICUCollationKeyFilter is up to 30x faster than CollationKeyFilter.

I timed the operation of these two classes, with Sun JVM versions 1.4.2/32-bit, 1.5.0/32- and 64-bit, and 1.6.0/64-bit, using 90k word lists of 4 languages (taken from the corresponding Debian wordlist packages and truncated to the first 90k words after a fixed random shuffling), using Collators at the default strength, on a Windows Vista 64-bit machine. I used an analysis pipeline consisting of WhitespaceTokenizer chained to the collation key filter, so to isolate the time taken by the collation key filters, I also timed WhitespaceTokenizer operating alone for each combination, and then subtracted it from both of the collation key analysis chains' times. The rightmost column represents the performance advantage of the ICU4J implementation (ICU) over the java.text.Collator implementation (JVM), after discounting the WhitespaceTokenizer time (WST): (JVM-WST) / (ICU-WST).
The best times out of 5 runs for each combination, in milliseconds, are as follows:

||Sun JVM||Language||java.text||ICU4J||WhitespaceTokenizer||ICU4J Improvement||
|1.4.2_17 (32 bit)|English|522|212|13|2.6x|
|1.4.2_17 (32 bit)|French|716|243|14|3.1x|
|1.4.2_17 (32 bit)|German|669|264|16|2.6x|
|1.4.2_17 (32 bit)|Ukrainian|931|474|25|2.0x|
|1.5.0_15 (32 bit)|English|604|176|16|3.7x|
|1.5.0_15 (32 bit)|French|817|209|17|4.2x|
|1.5.0_15 (32 bit)|German|799|225|20|3.8x|
|1.5.0_15 (32 bit)|Ukrainian|1029|436|26|2.4x|
|1.5.0_15 (64 bit)|English|431|89|10|5.3x|
|1.5.0_15 (64 bit)|French|562|112|11|5.5x|
|1.5.0_15 (64 bit)|German|567|116|13|5.4x|
|1.5.0_15 (64 bit)|Ukrainian|734|281|21|2.7x|
|1.6.0_13 (64 bit)|English|162|81|9|2.1x|
|1.6.0_13 (64 bit)|French|192|92|10|2.2x|
|1.6.0_13 (64 bit)|German|204|99|14|2.2x|
|1.6.0_13 (64 bit)|Ukrainian|273|202|21|1.4x|

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
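As background on what both filters actually index, here is a minimal, stdlib-only sketch of java.text collation keys (the JVM-provided path benchmarked above; the ICU4J path uses com.ibm.icu.text.Collator with the same shape). This is illustrative only, not the benchmark code:

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Locale;

public class CollationKeyDemo {
    public static void main(String[] args) {
        // Collator at default strength, as in the timings above.
        Collator collator = Collator.getInstance(Locale.FRENCH);

        // Plain UTF-16 string order puts accented letters after 'z'...
        assert "élève".compareTo("zèbre") > 0;
        // ...while the collator applies linguistic order: e sorts before z.
        assert collator.compare("élève", "zèbre") < 0;

        // The filters index the CollationKey's byte form; byte-wise order
        // of the keys matches the collator's comparison order.
        CollationKey k1 = collator.getCollationKey("élève");
        CollationKey k2 = collator.getCollationKey("zèbre");
        assert k1.compareTo(k2) < 0;
    }
}
```

(Run with -ea to enable the assertions.) Generating the key once at index time is what makes collated sorting cheap at search time, for either implementation.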
[jira] Updated: (LUCENE-1719) Add javadoc notes about ICUCollationKeyFilter's speed advantage over CollationKeyFilter
[ https://issues.apache.org/jira/browse/LUCENE-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Rowe updated LUCENE-1719: Attachment: LUCENE-1719.patch Patch containing notes to add to collation key filter/analyzer classes' javadocs. > Add javadoc notes about ICUCollationKeyFilter's speed advantage over > CollationKeyFilter > --- > > Key: LUCENE-1719 > URL: https://issues.apache.org/jira/browse/LUCENE-1719 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/* >Affects Versions: 2.4.1 >Reporter: Steven Rowe >Priority: Trivial > Fix For: 2.9 > > Attachments: LUCENE-1719.patch > > > contrib/collation's ICUCollationKeyFilter, which uses ICU4J collation, is > faster than CollationKeyFilter, the JVM-provided java.text.Collator > implementation in the same package. The javadocs of these classes should be > modified to add a note to this effect. > My curiosity was piqued by [Robert Muir's > comment|https://issues.apache.org/jira/browse/LUCENE-1581?focusedCommentId=12720300#action_12720300] > on LUCENE-1581, in which he states that ICUCollationKeyFilter is up to 30x > faster than CollationKeyFilter. > I timed the operation of these two classes, with Sun JVM versions > 1.4.2/32-bit, 1.5.0/32- and 64-bit, and 1.6.0/64-bit, using 90k word lists of > 4 languages (taken from the corresponding Debian wordlist packages and > truncated to the first 90k words after a fixed random shuffling), using > Collators at the default strength, on a Windows Vista 64-bit machine. I used > an analysis pipeline consisting of WhitespaceTokenizer chained to the > collation key filter, so to isolate the time taken by the collation key > filters, I also timed WhitespaceTokenizer operating alone for each > combination, and then subtracted it from both of the collation key analysis > chains' times. 
[jira] Updated: (LUCENE-1719) Add javadoc notes about ICUCollationKeyFilter's speed advantage over CollationKeyFilter
[ https://issues.apache.org/jira/browse/LUCENE-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Rowe updated LUCENE-1719: Description: contrib/collation's ICUCollationKeyFilter, which uses ICU4J collation, is faster than CollationKeyFilter, the JVM-provided java.text.Collator implementation in the same package. The javadocs of these classes should be modified to add a note to this effect. My curiosity was piqued by [Robert Muir's comment|https://issues.apache.org/jira/browse/LUCENE-1581?focusedCommentId=12720300#action_12720300] on LUCENE-1581, in which he states that ICUCollationKeyFilter is up to 30x faster than CollationKeyFilter. I timed the operation of these two classes, with Sun JVM versions 1.4.2/32-bit, 1.5.0/32- and 64-bit, and 1.6.0/64-bit, using 90k word lists of 4 languages (taken from the corresponding Debian wordlist packages and truncated to the first 90k words after a fixed random shuffling), using Collators at the default strength, on a Windows Vista 64-bit machine. I used an analysis pipeline consisting of WhitespaceTokenizer chained to the collation key filter, so to isolate the time taken by the collation key filters, I also timed WhitespaceTokenizer operating alone for each combination. The rightmost column represents the performance advantage of the ICU4J implementation (ICU) over the java.text.Collator implementation (JVM), after discounting the WhitespaceTokenizer time (WST): (ICU-WST) / (JVM-WST). 
The best times out of 5 runs for each combination, in milliseconds, are as follows: ||Sun JVM||Language||java.text||ICU4J||WhitespaceTokenizer||ICU4J Improvement|| |1.4.2_17 (32 bit)|English|522|212|13|2.6x| |1.4.2_17 (32 bit)|French|716|243|14|3.1x| |1.4.2_17 (32 bit)|German|669|264|16|2.6x| |1.4.2_17 (32 bit)|Ukranian|931|474|25|2.0x| |1.5.0_15 (32 bit)|English|604|176|16|3.7x| |1.5.0_15 (32 bit)|French|817|209|17|4.2x| |1.5.0_15 (32 bit)|German|799|225|20|3.8x| |1.5.0_15 (32 bit)|Ukranian|1029|436|26|2.4x| |1.5.0_15 (64 bit)|English|431|89|10|5.3x| |1.5.0_15 (64 bit)|French|562|112|11|5.5x| |1.5.0_15 (64 bit)|German|567|116|13|5.4x| |1.5.0_15 (64 bit)|Ukranian|734|281|21|2.7x| |1.6.0_13 (64 bit)|English|162|81|9|2.1x| |1.6.0_13 (64 bit)|French|192|92|10|2.2x| |1.6.0_13 (64 bit)|German|204|99|14|2.2x| |1.6.0_13 (64 bit)|Ukranian|273|202|21|1.4x| was: contrib/collation's ICUCollationKeyFilter, which uses ICU4J collation, is faster than CollationKeyFilter, the JVM-provided java.text.Collator implementation in the same package. The javadocs of these classes should be modified to add a note to this effect. My curiosity was piqued by [Robert Muir's comment|https://issues.apache.org/jira/browse/LUCENE-1581?focusedCommentId=12720300#action_12720300] on LUCENE-1581, in which he states that ICUCollationKeyFilter is up to 30x faster than CollationKeyFilter. I timed the operation of these two classes, with Sun JVM versions 1.4.2/32-bit, 1.5.0/32- and 64-bit, and 1.6.0/64-bit, using 90k word lists of 4 languages (taken from the corresponding Debian wordlist packages and truncated to the first 90k words after a fixed random shuffling), using Collators at the default strength, on a Windows Vista 64-bit machine. 
I used an analysis pipeline consisting of WhitespaceTokenizer chained to the collation key filter, so to isolate the time taken by the collation key filters, I also timed WhitespaceTokenizer operating alone for each combination, and then subtracted it from both of the collation key analysis chains' times. The rightmost column represents the performance advantage of the ICU4J implemtation (ICU) over the java.text.Collator implementation (JVM), after discounting the WhitespaceTokenizer time (WST): (ICU-WST) / (JVM-WST). The best times out of 5 runs for each combination, in milliseconds, are as follows: ||Sun JVM||Language||java.text||ICU4J||WhitespaceTokenizer||ICU4J Improvement|| |1.4.2_17 (32 bit)|English|522|212|13|2.6x| |1.4.2_17 (32 bit)|French|716|243|14|3.1x| |1.4.2_17 (32 bit)|German|669|264|16|2.6x| |1.4.2_17 (32 bit)|Ukranian|931|474|25|2.0x| |1.5.0_15 (32 bit)|English|604|176|16|3.7x| |1.5.0_15 (32 bit)|French|817|209|17|4.2x| |1.5.0_15 (32 bit)|German|799|225|20|3.8x| |1.5.0_15 (32 bit)|Ukranian|1029|436|26|2.4x| |1.5.0_15 (64 bit)|English|431|89|10|5.3x| |1.5.0_15 (64 bit)|French|562|112|11|5.5x| |1.5.0_15 (64 bit)|German|567|116|13|5.4x| |1.5.0_15 (64 bit)|Ukranian|734|281|21|2.7x| |1.6.0_13 (64 bit)|English|162|81|9|2.1x| |1.6.0_13 (64 bit)|French|192|92|10|2.2x| |1.6.0_13 (64 bit)|German|204|99|14|2.2x| |1.6.0_13 (64 bit)|Ukranian|273|202|21|1.4x| Lucene Fields: [New, Patch Available] (was: [New]) > Add javadoc notes about ICUCollationKeyFilter's speed advantage over > CollationKeyFilter > --- > > Key: LUCENE-1719 > URL: https:/
[jira] Commented: (LUCENE-1719) Add javadoc notes about ICUCollationKeyFilter's speed advantage over CollationKeyFilter
[ https://issues.apache.org/jira/browse/LUCENE-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724923#action_12724923 ] Steven Rowe commented on LUCENE-1719: - I also tested ICU4J version 4.2 (released 6 weeks ago), and the timings were nearly identical to those from ICU4J version 4.0 (the one that's in contrib/collation/lib/). The timings given in the table above were not produced with the "-server" option to the JVM. I separately tested all combinations using the "-server" option: it made no difference for the 32-bit JVMs, while the 64-bit JVMs were roughly 3-4% faster. I got the impression (I didn't actually calculate it) that although the best times of 5 runs were better for the 64-bit JVMs when using the "-server" option, the average times seemed to be slightly worse. In any case, the performance improvement of the ICU4J implementation over the java.text.Collator implementation was basically unaffected by the use of the "-server" JVM option. > Add javadoc notes about ICUCollationKeyFilter's speed advantage over > CollationKeyFilter > --- > > Key: LUCENE-1719 > URL: https://issues.apache.org/jira/browse/LUCENE-1719 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/* >Affects Versions: 2.4.1 >Reporter: Steven Rowe >Priority: Trivial > Fix For: 2.9 > > Attachments: LUCENE-1719.patch > > > contrib/collation's ICUCollationKeyFilter, which uses ICU4J collation, is > faster than CollationKeyFilter, the JVM-provided java.text.Collator > implementation in the same package. The javadocs of these classes should be > modified to add a note to this effect. > My curiosity was piqued by [Robert Muir's > comment|https://issues.apache.org/jira/browse/LUCENE-1581?focusedCommentId=12720300#action_12720300] > on LUCENE-1581, in which he states that ICUCollationKeyFilter is up to 30x > faster than CollationKeyFilter. 
[jira] Commented: (LUCENE-1581) LowerCaseFilter should be able to be configured to use a specific locale.
[ https://issues.apache.org/jira/browse/LUCENE-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724926#action_12724926 ] Steven Rowe commented on LUCENE-1581: - {quote} you could add the JDK collation key filter to core if you wanted a core fix. but the icu one is up to something like 30x faster than the jdk, so why bother :) {quote} LUCENE-1719 contains some timings I made of the relative speeds of these two implementations. In short, for the platform/language/collator/JVM version combinations I tested, the ICU4J implementation's speed advantage ranges from 1.4x to 5.5x.
> LowerCaseFilter should be able to be configured to use a specific locale.
> -
>
> Key: LUCENE-1581
> URL: https://issues.apache.org/jira/browse/LUCENE-1581
> Project: Lucene - Java
> Issue Type: Improvement
> Reporter: Digy
> Attachments: TestTurkishCollation.java
>
>
> // Since I am a .Net programmer, sample codes will be in C#, but I don't think it would be a problem to understand them.
> //
> Assume an input text like "İ" and an analyzer like below:
> {code}
> public class SomeAnalyzer : Analyzer
> {
>     public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
>     {
>         TokenStream t = new SomeTokenizer(reader);
>         t = new Lucene.Net.Analysis.ASCIIFoldingFilter(t);
>         t = new LowerCaseFilter(t);
>         return t;
>     }
> }
> {code}
>
> ASCIIFoldingFilter will return "I", and afterwards LowerCaseFilter will return "i" (if the locale is "en-US") or "ı" (if the locale is "tr-TR"; that means this token should be input to another instance of ASCIIFoldingFilter).
> So, calling LowerCaseFilter before ASCIIFoldingFilter would be a solution, but a better approach would be adding a new constructor to LowerCaseFilter and forcing it to use a specific locale. 
> {code}
> // Note: the parameter is renamed to "input" because "in" is a reserved
> // keyword in C# and would not compile as an identifier.
> public sealed class LowerCaseFilter : TokenFilter
> {
>     /* +++ */ System.Globalization.CultureInfo CultureInfo = System.Globalization.CultureInfo.CurrentCulture;
>
>     public LowerCaseFilter(TokenStream input) : base(input)
>     {
>     }
>
>     /* +++ */ public LowerCaseFilter(TokenStream input, System.Globalization.CultureInfo CultureInfo) : base(input)
>     /* +++ */ {
>     /* +++ */     this.CultureInfo = CultureInfo;
>     /* +++ */ }
>
>     public override Token Next(Token result)
>     {
>         result = Input.Next(result);
>         if (result != null)
>         {
>             char[] buffer = result.TermBuffer();
>             int length = result.termLength;
>             for (int i = 0; i < length; i++)
>                 /* +++ */ buffer[i] = System.Char.ToLower(buffer[i], CultureInfo);
>             return result;
>         }
>         else
>             return null;
>     }
> }
> {code}
> DIGY
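DIGY's example is C# (Lucene.Net), but the underlying locale issue is identical in Java's String.toLowerCase(Locale), which is what a locale-aware LowerCaseFilter for Lucene-Java would need to use. A minimal stdlib illustration of the Turkish-i behavior motivating this request:

```java
import java.util.Locale;

public class TurkishLowerCaseDemo {
    public static void main(String[] args) {
        Locale turkish = new Locale("tr", "TR");

        // In Turkish, dotted and dotless i are distinct letters:
        // 'I' lowercases to dotless 'ı' (U+0131), not 'i'.
        System.out.println("I".toLowerCase(turkish));        // prints "ı"
        System.out.println("I".toLowerCase(Locale.ENGLISH)); // prints "i"

        // Dotted capital 'İ' (U+0130) lowercases to plain 'i' in Turkish.
        System.out.println("\u0130".toLowerCase(turkish));   // prints "i"
    }
}
```

This is why lowercasing with the default locale mangles Turkish text, and why passing an explicit locale (CultureInfo in .NET, Locale in Java) into the filter matters.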