[jira] Updated: (LUCENE-1425) Add ConstantScore highlighting support to SpanScorer
[ https://issues.apache.org/jira/browse/LUCENE-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-1425: --- Fix Version/s: 2.9 Add ConstantScore highlighting support to SpanScorer Key: LUCENE-1425 URL: https://issues.apache.org/jira/browse/LUCENE-1425 Project: Lucene - Java Issue Type: New Feature Components: contrib/highlighter Reporter: Mark Miller Assignee: Mark Miller Priority: Minor Fix For: 2.9 Attachments: LUCENE-1425.patch, LUCENE-1425.patch Its actually easy enough to support the family of constantscore queries with the new SpanScorer. This will also remove the requirement that you rewrite queries against the main index before highlighting (in fact, if you do, the constantscore queries will not highlight). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1577) Benchmark of different in RAM realtime techniques
[ https://issues.apache.org/jira/browse/LUCENE-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12693414#action_12693414 ] Michael McCandless commented on LUCENE-1577: Are these tests measuring adding a single doc, then searching on it? What are the numbers you measure in the results (eg 25882 for LuceneRealtimeWriter)? I think we need a more realistic test for near real-time search, but I'm not sure exactly what that is. In LUCENE-1516 I've added a benchmark task to periodically open a new near real-time reader from the writer, and then tested it while doing bulk indexing. But that's not a typical test, I think (normally bulk indexing is done up front, and only a trickle of updates to doc are then done for near real-time search). Maybe we just need an updateDocument task, which randomly picks a doc (identified by a primary-key docid field) and replaces it. Then, benchmark already has the ability to rate-limit how frequently docs are updated. Benchmark of different in RAM realtime techniques - Key: LUCENE-1577 URL: https://issues.apache.org/jira/browse/LUCENE-1577 Project: Lucene - Java Issue Type: Improvement Components: contrib/* Affects Versions: 2.4.1 Reporter: Jason Rutherglen Priority: Minor Fix For: 2.9 Attachments: LUCENE-1577.patch Original Estimate: 168h Remaining Estimate: 168h A place to post code that benchmarks the differences in the speed of indexing and searching using different realtime techniques. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
NIO.2
NIO.2 sounds great. Though, it will probably take a pretty long time before we can switch Lucene to Java 1.7 :( We could write a (contrib) module that we don't ship together with the core that has a Directory implementation which uses NIO.2. http://jcp.org/en/jsr/detail?id=203 http://ronsoft.net/files/WhatsNewNIO2.pdf -Michael
[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12693443#action_12693443 ] Shai Erera commented on LUCENE-1575: {quote} This turns deprecated HitCollector into a Collector? Seems like it should be package private? {quote} Initially I wrote it but then deleted. I decided to make the decision as I create the patch. If this will be used only in IndexSearcher, then it should be a private static final class in IndexSearcher, otherwise a package private one. However, if it turns out we'd want to use it for now in other places too where we deprecate the HitCollector methods, then it will be public. Anyway, it will be marked deprecated, and I have the intention to make it as 'invisible' as possible. {quote} This is deprecated, so we shouldn't add topDocs(start, howMany)? I think just switch it back to extending the deprecated TopDocCollector (like it does in 2.4)? {quote} That's a good idea. {quote} H good point. I would love to stop screening for 0 score in the core collectors (like Solr). Maybe we fix the core collectors to not screen by zero score, but we add a new only keep positive scores collector chain/wrapper class that does the filtering and the forwards collection to another collector? This way there's a migration path if somehow users are relying on this. {quote} I can do that. Create a FilterZeroScoresCollector which wraps a Collector and passes forward only documents with score 0. BTW, how can a document get a zero score? I thought to split patches to code and test since I believe the code patch can be ready sooner for review. The test patch will just fix test cases. If that matters so much, I can create a final patch in the end which contains all the changes for easier commit? Refactoring Lucene collectors (HitCollector and extensions) --- Key: LUCENE-1575 URL: https://issues.apache.org/jira/browse/LUCENE-1575 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Fix For: 2.9 This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html]. We have agreed to do the following refactoring: * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations. * Deprecate HitCollector in favor of the new Collector. * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector. ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector. ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector. ** It will remove any instanceof checks that currently exist in IndexSearcher code. * Create a new (abstract) TopDocsCollector, which will: ** Leave collect and setNextReader unimplemented. ** Introduce protected members PriorityQueue and totalHits. ** Introduce a single protected constructor which accepts a PriorityQueue. ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden. ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only. * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final. * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany). * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany) * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead. Additionally, the following proposal was made w.r.t. decoupling score from collect(): * Change collect to accecpt only a doc Id (unbased). * Introduce a setScorer(Scorer) method. * If during collect the implementation needs the score, it can call scorer.score(). If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions: * What if during collect() Scorer is null? (i.e., not set) - is it even possible? * I noticed that many (if not all) of the
Re: NIO.2
On Sat, Mar 28, 2009 at 16:44, Michael Busch busch...@gmail.com wrote: NIO.2 sounds great. Though, it will probably take a pretty long time before we can switch Lucene to Java 1.7 :( We could write a (contrib) module that we don't ship together with the core that has a Directory implementation which uses NIO.2. http://jcp.org/en/jsr/detail?id=203 http://ronsoft.net/files/WhatsNewNIO2.pdf -Michael I was excited for a second, until I noticed they somehow lost Big*Buffers while merging into JDK7. http://download.java.net/jdk7/docs/api/java/nio/channels/package-summary.html - Ctrl+F, MappedBigByteBuffer, that's the only remnant of what was supposed to be there. -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785 - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
AW: NIO.2
From the talk at ApacheCon I remember that NIO.2 will also be available for Java 6 as a addon package (a little bit later). Uwe Mit einem Mobiltelefon von Sony Ericsson gesendet Originalnachricht Von: Michael Busch busch...@gmail.com Gesendet: An: java-dev@lucene.apache.org Betreff: NIO.2 NIO.2 sounds great. Though, it will probably take a pretty long time before we can switch Lucene to Java 1.7 :( We could write a (contrib) module that we don't ship together with the core that has a Directory implementation which uses NIO.2. http://jcp.org/en/jsr/detail?id=203 http://ronsoft.net/files/WhatsNewNIO2.pdf -Michael - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12693448#action_12693448 ] Marvin Humphrey commented on LUCENE-1575: - BTW, how can a document get a zero score? Any number of ways, since Query and Scorer are extensible. How about a RandomScoreQuery that uses floor(rand(1.9))? Or say that you have a bitset of docs which should match and you use that to feed a scorer. What score should you assign? Why not 0? Why not -1? Should it matter? Refactoring Lucene collectors (HitCollector and extensions) --- Key: LUCENE-1575 URL: https://issues.apache.org/jira/browse/LUCENE-1575 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Fix For: 2.9 This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html]. We have agreed to do the following refactoring: * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations. * Deprecate HitCollector in favor of the new Collector. * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector. ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector. ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector. ** It will remove any instanceof checks that currently exist in IndexSearcher code. * Create a new (abstract) TopDocsCollector, which will: ** Leave collect and setNextReader unimplemented. ** Introduce protected members PriorityQueue and totalHits. ** Introduce a single protected constructor which accepts a PriorityQueue. ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden. ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only. * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final. * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany). * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany) * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead. Additionally, the following proposal was made w.r.t. decoupling score from collect(): * Change collect to accecpt only a doc Id (unbased). * Introduce a setScorer(Scorer) method. * If during collect the implementation needs the score, it can call scorer.score(). If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions: * What if during collect() Scorer is null? (i.e., not set) - is it even possible? * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always? Open issues: * The name for Collector * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output. * Decoupling score from collect(). I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method. There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12693462#action_12693462 ] Michael McCandless commented on LUCENE-1575: bq. I thought to split patches to code and test since I believe the code patch can be ready sooner for review. The test patch will just fix test cases. If that matters so much, I can create a final patch in the end which contains all the changes for easier commit? OK that sounds great. The back-compat tests will also assert nothing broke. bq. Anyway, it will be marked deprecated, and I have the intention to make it as 'invisible' as possible. OK. bq. BTW, how can a document get a zero score? I've wondered the same thing. There was this thread recently: http://www.nabble.com/TopDocCollector-td22244245.html Refactoring Lucene collectors (HitCollector and extensions) --- Key: LUCENE-1575 URL: https://issues.apache.org/jira/browse/LUCENE-1575 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Fix For: 2.9 This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html]. We have agreed to do the following refactoring: * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations. * Deprecate HitCollector in favor of the new Collector. * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector. ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector. ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector. ** It will remove any instanceof checks that currently exist in IndexSearcher code. * Create a new (abstract) TopDocsCollector, which will: ** Leave collect and setNextReader unimplemented. ** Introduce protected members PriorityQueue and totalHits. ** Introduce a single protected constructor which accepts a PriorityQueue. ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden. ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only. * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final. * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany). * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany) * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead. Additionally, the following proposal was made w.r.t. decoupling score from collect(): * Change collect to accecpt only a doc Id (unbased). * Introduce a setScorer(Scorer) method. * If during collect the implementation needs the score, it can call scorer.score(). If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions: * What if during collect() Scorer is null? (i.e., not set) - is it even possible? * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always? Open issues: * The name for Collector * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output. * Decoupling score from collect(). I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method. There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (LUCENE-1579) Cloned SegmentReaders fail to share FieldCache entries
Cloned SegmentReaders fail to share FieldCache entries -- Key: LUCENE-1579 URL: https://issues.apache.org/jira/browse/LUCENE-1579 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 2.9 Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 2.9 I just hit this on LUCENE-1516, which returns a cloned readOnly readers from IndexWriter. The problem is, when cloning, we create a new [thin] cloned SegmentReader for each segment. FieldCache keys directly off this object, so if you clone the reader and do a search that requires the FieldCache (eg, sorting) then that first search is always very slow because every single segment is reloading the FieldCache. This is of course a complete showstopper for LUCENE-1516. With LUCENE-831 we'll switch to a new FieldCache API; we should ensure this bug is not present there. We should also fix the bug in the current FieldCache API since for 2.9, users may hit this. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1516) Integrate IndexReader with IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-1516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12693465#action_12693465 ] Michael McCandless commented on LUCENE-1516: Disregard the search time in the above results... we have a sneaky bug (LUCENE-1579) that is causing FieldCache to not be re-used for shared segments in a reopened reader. This makes the search time after reopen far worse than it should be. Integrate IndexReader with IndexWriter --- Key: LUCENE-1516 URL: https://issues.apache.org/jira/browse/LUCENE-1516 Project: Lucene - Java Issue Type: Improvement Affects Versions: 2.4 Reporter: Jason Rutherglen Assignee: Michael McCandless Priority: Minor Fix For: 2.9 Attachments: LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, magnetic.png, ssd.png Original Estimate: 672h Remaining Estimate: 672h The current problem is an IndexReader and IndexWriter cannot be open at the same time and perform updates as they both require a write lock to the index. While methods such as IW.deleteDocuments enables deleting from IW, methods such as IR.deleteDocument(int doc) and norms updating are not available from IW. This limits the capabilities of performing updates to the index dynamically or in realtime without closing the IW and opening an IR, deleting or updating norms, flushing, then opening the IW again, a process which can be detrimental to realtime updates. This patch will expose an IndexWriter.getReader method that returns the currently flushed state of the index as a class that implements IndexReader. The new IR implementation will differ from existing IR implementations such as MultiSegmentReader in that flushing will synchronize updates with IW in part by sharing the write lock. All methods of IR will be usable including reopen and clone. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation
[ https://issues.apache.org/jira/browse/LUCENE-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12693464#action_12693464 ] Michael McCandless commented on LUCENE-831: --- Let's make sure the new API fixes LUCENE-1579. Complete overhaul of FieldCache API/Implementation -- Key: LUCENE-831 URL: https://issues.apache.org/jira/browse/LUCENE-831 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Hoss Man Fix For: 3.0 Attachments: ExtendedDocument.java, fieldcache-overhaul.032208.diff, fieldcache-overhaul.diff, fieldcache-overhaul.diff, LUCENE-831.03.28.2008.diff, LUCENE-831.03.30.2008.diff, LUCENE-831.03.31.2008.diff, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch Motivation: 1) Complete overhaul the API/implementation of FieldCache type things... a) eliminate global static map keyed on IndexReader (thus eliminating synch block between completley independent IndexReaders) b) allow more customization of cache management (ie: use expiration/replacement strategies, disk backed caches, etc) c) allow people to define custom cache data logic (ie: custom parsers, complex datatypes, etc... anything tied to a reader) d) allow people to inspect what's in a cache (list of CacheKeys) for an IndexReader so a new IndexReader can be likewise warmed. e) Lend support for smarter cache management if/when IndexReader.reopen is added (merging of cached data from subReaders). 2) Provide backwards compatibility to support existing FieldCache API with the new implementation, so there is no redundent caching as client code migrades to new API. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: NIO.2
I think having async IO will be great, though I wonder how we would change Lucene to take advantage of it. It ought to gain us concurrency (eg we can score last chunk while we have an io request out to retrieve next chunk, of term docs / positions / etc.). Watch service sounds neat. Maybe we could use that as a way for readers to know when to reopen themselves. Something Lucene would really benefit from is access to madvise, so that we could tell the OS that the massive amounts of data we are reading writing for merging should not be cached. But it doesn't look like NIO2 has exposed this... Another thing would be control over the priority of multiple outstanding IO tasks. EG I'd like to say that the IO in a merge thread is lower priority than indexing thread, which is lower priority than searching threads. This is even further out (I don't think OSs expose this control?). Mike On Sat, Mar 28, 2009 at 9:44 AM, Michael Busch busch...@gmail.com wrote: NIO.2 sounds great. Though, it will probably take a pretty long time before we can switch Lucene to Java 1.7 :( We could write a (contrib) module that we don't ship together with the core that has a Directory implementation which uses NIO.2. http://jcp.org/en/jsr/detail?id=203 http://ronsoft.net/files/WhatsNewNIO2.pdf -Michael - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12693469#action_12693469 ] Shai Erera commented on LUCENE-1575: After I posted the question on how can a document get a 0 score, I realized that it's possible due to extensions of Similarity for example. Thanks Marvin for clearing that up. I guess though that the Lucene core classes will not assign = 0 score to a document? Anyway, whether it's true or not, I think I agree with Mike saying we should remove this screening from the core collectors. If my application extends Lucene in a way that it can assign = 0 scores to documents, and it has the intention of screening those documents, it should use the new FilterZeroScoresCollector (maybe call it OnlyPositiveScoresCollector?) I don't think that assigning = 0 score to a document necessarily means it should be removed from the result set. However, Mike (and others) - isn't there a back-compatibility issue with changing the core collectors to not screen on =0 score documents? I mean, what if my application relies on that and extended Lucene in a way that it sometimes assigns 0 scores to documents? Now when I'll switch to 2.9, those documents won't be filtered. I will be able to use the new FilterZeroScoresCollector, but that'll require me to change my app's code. Maybe just do it for the new collectors (TopScoreDocCollector and TopFieldCollector)? I need to change my app's code anyway if I want to use them, so as long as we document this fact in their javadocs, we should be fine? Refactoring Lucene collectors (HitCollector and extensions) --- Key: LUCENE-1575 URL: https://issues.apache.org/jira/browse/LUCENE-1575 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Fix For: 2.9 This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html]. We have agreed to do the following refactoring: * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations. * Deprecate HitCollector in favor of the new Collector. * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector. ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector. ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector. ** It will remove any instanceof checks that currently exist in IndexSearcher code. * Create a new (abstract) TopDocsCollector, which will: ** Leave collect and setNextReader unimplemented. ** Introduce protected members PriorityQueue and totalHits. ** Introduce a single protected constructor which accepts a PriorityQueue. ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden. ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only. * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final. * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany). * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany) * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead. Additionally, the following proposal was made w.r.t. decoupling score from collect(): * Change collect to accecpt only a doc Id (unbased). * Introduce a setScorer(Scorer) method. * If during collect the implementation needs the score, it can call scorer.score(). If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions: * What if during collect() Scorer is null? (i.e., not set) - is it even possible? * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always? Open issues: * The name for Collector * TopDocsCollector was mentioned on
[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12693470#action_12693470 ] Michael McCandless commented on LUCENE-1575: bq. However, Mike (and others) - isn't there a back-compatibility issue with changing the core collectors to not screen on =0 score documents? Hmm right there is, because the search methods will use the new collectors. bq. I need to change my app's code anyway if I want to use them, so as long as we document this fact in their javadocs, we should be fine? Actually there's no change to your code required (the search methods should use the new collectors). So we do have a back-compat difference. We could make the change (turn off filtering), but put a setter on IndexSearcher to have it insert the PositiveScoresOnlyCollector wrapper? I think the vast majority of users are not relying on = 0 scoring docs to be filtered out. Refactoring Lucene collectors (HitCollector and extensions) --- Key: LUCENE-1575 URL: https://issues.apache.org/jira/browse/LUCENE-1575 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Fix For: 2.9 This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html]. We have agreed to do the following refactoring: * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations. * Deprecate HitCollector in favor of the new Collector. * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector. ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector. ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector. ** It will remove any instanceof checks that currently exist in IndexSearcher code. * Create a new (abstract) TopDocsCollector, which will: ** Leave collect and setNextReader unimplemented. ** Introduce protected members PriorityQueue and totalHits. ** Introduce a single protected constructor which accepts a PriorityQueue. ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden. ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only. * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final. * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany). * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany) * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead. Additionally, the following proposal was made w.r.t. decoupling score from collect(): * Change collect to accecpt only a doc Id (unbased). * Introduce a setScorer(Scorer) method. * If during collect the implementation needs the score, it can call scorer.score(). If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions: * What if during collect() Scorer is null? (i.e., not set) - is it even possible? * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always? Open issues: * The name for Collector * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output. * Decoupling score from collect(). I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method. There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?) -- This message is automatically
Re: NIO.2
I think having async IO will be great, though I wonder how we would change Lucene to take advantage of it. It ought to gain us concurrency (eg we can score last chunk while we have an io request out to retrieve next chunk, of term docs / positions / etc.). A presentation given above references Big*Buffers, including MappedBigByteBuffer, which differ from their not-so-Big counterparts in using long sizes/offsets. That means (woo-hoo!) a way better MMapDirectory. Everything else there is totally irrelevant to lucene. Okay, maybe atomic file moves, but that part of Directory is long time deprecated :) -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785 - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12693474#action_12693474 ] Shai Erera commented on LUCENE-1575: bq. We could make the change (turn off filtering), but put a setter on IndexSearcher to have it insert the PositiveScoresOnlyCollector wrapper? Then why do that at all? If I need to call searcher.setKeepOnlyPositiveScores, then it means a change to my code. I could then just pass in the PositiveScoresOnlyCollector to the search methods instead, right? I guess you are referring to the methods which don't take a collector as a parameter and instantiate a new TopScoreDocCollector internally? I tend to think that if someone uses those, it is just because they are simple, and I find it very hard to imagine that that someone relies on the filtering. So perhaps we can get away with just documenting the change in behavior? bq. I think the vast majority of users are not relying on = 0 scoring docs to be filtered out. I tend to agree. This has been around for quite some time. I checked my custom collectors, and they do the same check. I only now realize I just followed the code practice I saw in Lucene's code, never giving it much thought of whether this can actually happen. I believe that if I'd have extended Lucene in a way such that it returns =0 scores, I'd be aware of that and probably won't use the built-in collectors. I see no reason to filter = 0 scored docs anyway, and if I wanted that, I'd probably write my own filtering collector ... I think that if we don't believe people rely on the = 0 filtering, let's just document it. I'd hate to add a setter method to IndexSearcher, and a unit test, and check where else it should be added (i.e., in extending searcher classes) and introduce a new API which we might need to deprecate some day ... People who'll need that functionality can move to use the methods that accept a Collector, and pass in the PositiveScoresOnlyCollector. That way we also keep the 'fast and easy' search methods really simple, fast and easy. Is that acceptable? Refactoring Lucene collectors (HitCollector and extensions) --- Key: LUCENE-1575 URL: https://issues.apache.org/jira/browse/LUCENE-1575 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Fix For: 2.9 This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html]. We have agreed to do the following refactoring: * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations. * Deprecate HitCollector in favor of the new Collector. * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector. ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector. ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector. ** It will remove any instanceof checks that currently exist in IndexSearcher code. * Create a new (abstract) TopDocsCollector, which will: ** Leave collect and setNextReader unimplemented. ** Introduce protected members PriorityQueue and totalHits. ** Introduce a single protected constructor which accepts a PriorityQueue. ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden. ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only. * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final. * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany). * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany) * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead. Additionally, the following proposal was made w.r.t. decoupling score from collect(): * Change collect to accecpt only a doc Id (unbased). * Introduce a setScorer(Scorer) method. * If during collect the implementation needs the score, it can call scorer.score().
[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12693478#action_12693478 ] Michael McCandless commented on LUCENE-1575: bq. Then why do that at all? If I need to call searcher.setKeepOnlyPositiveScores, then it means a change to my code. I could then just pass in the PositiveScoresOnlyCollector to the search methods instead, right? OK, I agree. Let's add an entry to the top of CHANGES.txt that states this [minor] break in back compatibility, as well as the code fragment showing how to use that filter to get back to the pre-2.9 way? Refactoring Lucene collectors (HitCollector and extensions) --- Key: LUCENE-1575 URL: https://issues.apache.org/jira/browse/LUCENE-1575 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Fix For: 2.9 This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html]. We have agreed to do the following refactoring: * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations. * Deprecate HitCollector in favor of the new Collector. * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector. ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector. ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector. ** It will remove any instanceof checks that currently exist in IndexSearcher code. * Create a new (abstract) TopDocsCollector, which will: ** Leave collect and setNextReader unimplemented. ** Introduce protected members PriorityQueue and totalHits. ** Introduce a single protected constructor which accepts a PriorityQueue. ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden. ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only. * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final. * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany). * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany) * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead. Additionally, the following proposal was made w.r.t. decoupling score from collect(): * Change collect to accecpt only a doc Id (unbased). * Introduce a setScorer(Scorer) method. * If during collect the implementation needs the score, it can call scorer.score(). If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions: * What if during collect() Scorer is null? (i.e., not set) - is it even possible? * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always? Open issues: * The name for Collector * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output. * Decoupling score from collect(). I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method. There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1579) Cloned SegmentReaders fail to share FieldCache entries
[ https://issues.apache.org/jira/browse/LUCENE-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-1579: --- Attachment: LUCENE-1579.patch Attached patch. I plan to commit in a day or two. I added a new deprecated expert public method to IndexReader: getFieldCacheWrapper(). Default impl is to return this, but SegmentReader overrides that and returns a wrapper class that forwards hashCode()/equals() to the underlying freqStream. Cloned SegmentReaders fail to share FieldCache entries -- Key: LUCENE-1579 URL: https://issues.apache.org/jira/browse/LUCENE-1579 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 2.9 Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 2.9 Attachments: LUCENE-1579.patch I just hit this on LUCENE-1516, which returns a cloned readOnly readers from IndexWriter. The problem is, when cloning, we create a new [thin] cloned SegmentReader for each segment. FieldCache keys directly off this object, so if you clone the reader and do a search that requires the FieldCache (eg, sorting) then that first search is always very slow because every single segment is reloading the FieldCache. This is of course a complete showstopper for LUCENE-1516. With LUCENE-831 we'll switch to a new FieldCache API; we should ensure this bug is not present there. We should also fix the bug in the current FieldCache API since for 2.9, users may hit this. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Resolved: (LUCENE-1573) IndexWriter does not do the right thing when a Thread is interrupt()'d
[ https://issues.apache.org/jira/browse/LUCENE-1573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-1573. Resolution: Fixed Thanks Jeremy. IndexWriter does not do the right thing when a Thread is interrupt()'d -- Key: LUCENE-1573 URL: https://issues.apache.org/jira/browse/LUCENE-1573 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.4, 2.4.1 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 2.9 Attachments: LUCENE-1573.patch Spinoff from here: http://www.nabble.com/Deadlock-with-concurrent-merges-and-IndexWriter--Lucene-2.4--to22714290.html When a Thread is interrupt()'d while inside Lucene, there is a risk currently that it will cause a spinloop and starve BG merges from completing. Instead, when possible, we should allow interruption. But unfortunately for back-compat, we will need to wrap the exception in an unchecked version. In 3.0 we can change that to simply throw InterruptedException. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Resolved: (LUCENE-652) Compressed fields should be externalized (from Fields into Document)
[ https://issues.apache.org/jira/browse/LUCENE-652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-652. --- Resolution: Fixed Compressed fields should be externalized (from Fields into Document) -- Key: LUCENE-652 URL: https://issues.apache.org/jira/browse/LUCENE-652 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 1.9, 2.0.0, 2.1 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 2.9 Attachments: LUCENE-652.patch, LUCENE-652.patch, LUCENE-652.patch Right now, as of 2.0 release, Lucene supports compressed stored fields. However, after discussion on java-dev, the suggestion arose, from Robert Engels, that it would be better if this logic were moved into the Document level. This way the indexing level just stores opaque binary fields, and then Document handles compress/uncompressing as needed. This approach would have prevented issues like LUCENE-629 because merging of segments would never need to decompress. See this thread for the recent discussion: http://www.gossamer-threads.com/lists/lucene/java-dev/38836 When we do this we should also work on related issue LUCENE-648. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Created: (LUCENE-1580) ISOLatin1AccentFilter does not handle Turkish (UTF-8) chars correctly.
ISOLatin1AccentFilter does not handle Turkish (UTF-8) chars correctly. -- Key: LUCENE-1580 URL: https://issues.apache.org/jira/browse/LUCENE-1580 Project: Lucene - Java Issue Type: Bug Reporter: Digy Priority: Minor Attachments: ISOLatin1AccentFilter.patch Below mappings are missing Ğ -- G ğ -- g İ -- I ı -- i Ş -- S ş -- s DIGY -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1580) ISOLatin1AccentFilter does not handle Turkish (UTF-8) chars correctly.
[ https://issues.apache.org/jira/browse/LUCENE-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Digy updated LUCENE-1580: - Attachment: ISOLatin1AccentFilter.patch ISOLatin1AccentFilter does not handle Turkish (UTF-8) chars correctly. -- Key: LUCENE-1580 URL: https://issues.apache.org/jira/browse/LUCENE-1580 Project: Lucene - Java Issue Type: Bug Reporter: Digy Priority: Minor Attachments: ISOLatin1AccentFilter.patch Below mappings are missing Ğ -- G ğ -- g İ -- I ı -- i Ş -- S ş -- s DIGY -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Resolved: (LUCENE-1580) ISOLatin1AccentFilter does not handle Turkish (UTF-8) chars correctly.
[ https://issues.apache.org/jira/browse/LUCENE-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andi Vajda resolved LUCENE-1580. Resolution: Duplicate See https://issues.apache.org/jira/browse/LUCENE-1390 ISOLatin1AccentFilter does not handle Turkish (UTF-8) chars correctly. -- Key: LUCENE-1580 URL: https://issues.apache.org/jira/browse/LUCENE-1580 Project: Lucene - Java Issue Type: Bug Reporter: Digy Priority: Minor Attachments: ISOLatin1AccentFilter.patch Below mappings are missing Ğ -- G ğ -- g İ -- I ı -- i Ş -- S ş -- s DIGY -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1579) Cloned SegmentReaders fail to share FieldCache entries
[ https://issues.apache.org/jira/browse/LUCENE-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-1579: --- Attachment: LUCENE-1579.patch New patch. The last one was causing entries in FieldCache to get booted too soon. Cloned SegmentReaders fail to share FieldCache entries -- Key: LUCENE-1579 URL: https://issues.apache.org/jira/browse/LUCENE-1579 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 2.9 Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 2.9 Attachments: LUCENE-1579.patch, LUCENE-1579.patch I just hit this on LUCENE-1516, which returns a cloned readOnly readers from IndexWriter. The problem is, when cloning, we create a new [thin] cloned SegmentReader for each segment. FieldCache keys directly off this object, so if you clone the reader and do a search that requires the FieldCache (eg, sorting) then that first search is always very slow because every single segment is reloading the FieldCache. This is of course a complete showstopper for LUCENE-1516. With LUCENE-831 we'll switch to a new FieldCache API; we should ensure this bug is not present there. We should also fix the bug in the current FieldCache API since for 2.9, users may hit this. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1516) Integrate IndexReader with IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-1516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-1516: --- Attachment: LUCENE-1516.patch New patch: Fixed a few small issues... and made some changes to contrib/benchmark to help in running more realistic near real-time tests: * Fixed LineDocMaker to properly set docid primary key field. * Added UpdateDocTask that calls IndexWriter.updateDocument, randomly picking a docid. * Added NearRealTimeReader task, that creates BG thread that every N seconds opens a reader, runs a static search and prints results. This patch also contains patch from LUCENE-1579. So, using this you can 1) create a large index, 2) create an alg that does doc updates at a fixed rate and then tests the near real-time reader performance. Integrate IndexReader with IndexWriter --- Key: LUCENE-1516 URL: https://issues.apache.org/jira/browse/LUCENE-1516 Project: Lucene - Java Issue Type: Improvement Affects Versions: 2.4 Reporter: Jason Rutherglen Assignee: Michael McCandless Priority: Minor Fix For: 2.9 Attachments: LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, magnetic.png, ssd.png Original Estimate: 672h Remaining Estimate: 672h The current problem is an IndexReader and IndexWriter cannot be open at the same time and perform updates as they both require a write lock to the index. While methods such as IW.deleteDocuments enables deleting from IW, methods such as IR.deleteDocument(int doc) and norms updating are not available from IW. This limits the capabilities of performing updates to the index dynamically or in realtime without closing the IW and opening an IR, deleting or updating norms, flushing, then opening the IW again, a process which can be detrimental to realtime updates. This patch will expose an IndexWriter.getReader method that returns the currently flushed state of the index as a class that implements IndexReader. The new IR implementation will differ from existing IR implementations such as MultiSegmentReader in that flushing will synchronize updates with IW in part by sharing the write lock. All methods of IR will be usable including reopen and clone. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Possible IndexInput optimization
While drooling over MappedBigByteBuffer, which we'll (hopefully) see in JDK7, I revisited my own Directory code and noticed a certain peculiarity, shared by Lucene core classes: Each and every IndexInput implementation only implements readByte() and readBytes(), never trying to override readInt/VInt/Long/etc methods. Currently RAMDirectory uses a list of byte arrays as a backing store, and I got some speedup when switched to custom version that knows each file size beforehand and thus is able to allocate a single byte array (deliberately accepting 2Gb file size limitation) of exactly needed length. Nothing strange here, readByte(s) methods are easily most oft called ones in a Lucene app and they were greatly simplified - readByte became mere: public byte readByte() throws IOException { return buffer[position++]; // I dropped bounds checking, relying on natural ArrayIndexOOBE, we can't easily catch and recover from it anyway } But now, readInt is four readByte calls, readLong is two readInts (ten calls in total), readString - god knows how many. Unless you use a single type of Directory through the lifetime of your application, these readByte calls are never inlined, JIT invokevirtual short-circuit optimization (it skips method lookup if it always finds the same one during this exact invocation) cannot be applied too. There are three cases when we can override readNNN methods and provide implementations with zero or minimum method invocations - RAMDirectory, MMapDirectory and BufferedIndexInput for FSDirectory/CompoundFileReader. Anybody tried this? -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785 - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Created: (LUCENE-1581) LowerCaseFilter should be able to be configured to use a specific locale.
LowerCaseFilter should be able to be configured to use a specific locale. - Key: LUCENE-1581 URL: https://issues.apache.org/jira/browse/LUCENE-1581 Project: Lucene - Java Issue Type: Improvement Reporter: Digy //Since I am a .Net programmer, Sample codes will be in c# but I don't think that it would be a problem to understand them. // Assume an input text like İ and and analyzer like below {code} public class SomeAnalyzer : Analyzer { public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader) { TokenStream t = new SomeTokenizer(reader); t = new Lucene.Net.Analysis.ASCIIFoldingFilter(t); t = new LowerCaseFilter(t); return t; } } {code} ASCIIFoldingFilter will return I and after, LowerCaseFilter will return i (if locale is en-US) or ı' if(locale is tr-TR) (that means,this token should be input to another instance of ASCIIFoldingFilter) So, calling LowerCaseFilter before ASCIIFoldingFilter would be a solution, but a better approach can be adding a new constructor to LowerCaseFilter and forcing it to use a specific locale. {code} public sealed class LowerCaseFilter : TokenFilter { /* +++ */System.Globalization.CultureInfo CultureInfo = System.Globalization.CultureInfo.CurrentCulture; public LowerCaseFilter(TokenStream in) : base(in) { } /* +++ */ public LowerCaseFilter(TokenStream in, System.Globalization.CultureInfo CultureInfo) : base(in) /* +++ */ { /* +++ */ this.CultureInfo = CultureInfo; /* +++ */ } public override Token Next(Token result) { result = Input.Next(result); if (result != null) { char[] buffer = result.TermBuffer(); int length = result.termLength; for (int i = 0; i length; i++) /* +++ */ buffer[i] = System.Char.ToLower(buffer[i],CultureInfo); return result; } else return null; } } {code} DIGY -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1581) LowerCaseFilter should be able to be configured to use a specific locale.
[ https://issues.apache.org/jira/browse/LUCENE-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12693512#action_12693512 ] Shai Erera commented on LUCENE-1581: I guess we were telepathying or something because I reviewed LowerCaseFilter 2 days ago for the same reason :) Thing is, in Java Character.toLowerCase does not accept a Locale, just char. Unlike String which has two variants for toLowerCase and toUpperCase, that accept in addition to the String, a Locale parameter. I believe that Character.toLowerCase in Java works ok, since it's based on the UNICODE spec (at least it writes so) - however I have to admit I haven't tested this character specifically. LowerCaseFilter should be able to be configured to use a specific locale. - Key: LUCENE-1581 URL: https://issues.apache.org/jira/browse/LUCENE-1581 Project: Lucene - Java Issue Type: Improvement Reporter: Digy //Since I am a .Net programmer, Sample codes will be in c# but I don't think that it would be a problem to understand them. // Assume an input text like İ and and analyzer like below {code} public class SomeAnalyzer : Analyzer { public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader) { TokenStream t = new SomeTokenizer(reader); t = new Lucene.Net.Analysis.ASCIIFoldingFilter(t); t = new LowerCaseFilter(t); return t; } } {code} ASCIIFoldingFilter will return I and after, LowerCaseFilter will return i (if locale is en-US) or ı' if(locale is tr-TR) (that means,this token should be input to another instance of ASCIIFoldingFilter) So, calling LowerCaseFilter before ASCIIFoldingFilter would be a solution, but a better approach can be adding a new constructor to LowerCaseFilter and forcing it to use a specific locale. {code} public sealed class LowerCaseFilter : TokenFilter { /* +++ */System.Globalization.CultureInfo CultureInfo = System.Globalization.CultureInfo.CurrentCulture; public LowerCaseFilter(TokenStream in) : base(in) { } /* +++ */ public LowerCaseFilter(TokenStream in, System.Globalization.CultureInfo CultureInfo) : base(in) /* +++ */ { /* +++ */ this.CultureInfo = CultureInfo; /* +++ */ } public override Token Next(Token result) { result = Input.Next(result); if (result != null) { char[] buffer = result.TermBuffer(); int length = result.termLength; for (int i = 0; i length; i++) /* +++ */ buffer[i] = System.Char.ToLower(buffer[i],CultureInfo); return result; } else return null; } } {code} DIGY -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12693513#action_12693513 ] Shai Erera commented on LUCENE-1575: Great ! Refactoring Lucene collectors (HitCollector and extensions) --- Key: LUCENE-1575 URL: https://issues.apache.org/jira/browse/LUCENE-1575 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Fix For: 2.9 This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html]. We have agreed to do the following refactoring: * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations. * Deprecate HitCollector in favor of the new Collector. * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector. ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector. ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector. ** It will remove any instanceof checks that currently exist in IndexSearcher code. * Create a new (abstract) TopDocsCollector, which will: ** Leave collect and setNextReader unimplemented. ** Introduce protected members PriorityQueue and totalHits. ** Introduce a single protected constructor which accepts a PriorityQueue. ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden. ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only. * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final. * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany). * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany) * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead. Additionally, the following proposal was made w.r.t. decoupling score from collect(): * Change collect to accecpt only a doc Id (unbased). * Introduce a setScorer(Scorer) method. * If during collect the implementation needs the score, it can call scorer.score(). If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions: * What if during collect() Scorer is null? (i.e., not set) - is it even possible? * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always? Open issues: * The name for Collector * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output. * Decoupling score from collect(). I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method. There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org