[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12725164#action_12725164 ] Mark Harwood commented on LUCENE-1720: --

Currently the class hinges on a fast-fail mechanism whereby all the many calls checking for a timeout very quickly test a single volatile boolean, anActivityHasTimedOut. 99.99% of calls are expected to fail this test (nothing has timed out) and fail quickly - I was reluctant to add any HashSet lookup etc. in there needed to determine failure. With that as a guiding principle, maybe the solution is to change "volatile boolean anActivityHasTimedOut" into "volatile int numberOfTimedOutThreads", which would cater for more than one error condition at once. The fast-fail check then becomes:

    if (numberOfTimedOutThreads > 0) {
        if (timedOutThreads.contains(Thread.currentThread())) {
            timedOutThreads.remove(Thread.currentThread());
            numberOfTimedOutThreads = timedOutThreads.size();
            throw new RuntimeException(...);
        }
    }

TimeLimitedIndexReader and associated utility class
---

Key: LUCENE-1720
URL: https://issues.apache.org/jira/browse/LUCENE-1720
Project: Lucene - Java
Issue Type: New Feature
Components: Index
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
Attachments: ActivityTimedOutException.java, ActivityTimeMonitor.java, TestTimeLimitedIndexReader.java, TimeLimitedIndexReader.java

An alternative to TimeLimitedCollector that has the following advantages:
1) Any reader activity can be time-limited rather than just single searches, e.g. the document retrieve phase.
2) Times out faster (i.e. runaway queries such as fuzzies are detected quickly, before the last collect stage of query processing).

Uses a new utility timeout class that is independent of IndexReader. The initial contribution includes a performance test class, but I have not had time as yet to work up a formal JUnit test. TimeLimitedIndexReader is coded as JDK 1.5 but can easily be undone.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
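The fast-fail scheme in the comment above can be sketched as self-contained Java. This is a hypothetical TimeoutMonitor, not the attached ActivityTimeMonitor; the synchronized-set choice and the plain RuntimeException are assumptions for illustration:

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Sketch of the proposed fast-fail timeout check: the common
// "nothing has timed out" path costs one volatile int read, and the
// Set is consulted only after a timeout has actually been flagged.
class TimeoutMonitor {
    // Volatile so checking threads see timer-thread updates without locking.
    private volatile int numberOfTimedOutThreads = 0;
    private final Set<Thread> timedOutThreads =
            Collections.synchronizedSet(new HashSet<Thread>());

    // Called by a background timer thread when an activity overruns.
    void markTimedOut(Thread t) {
        timedOutThreads.add(t);
        numberOfTimedOutThreads = timedOutThreads.size();
    }

    // Called on every reader operation; almost always falls straight through.
    void checkTimeout() {
        if (numberOfTimedOutThreads > 0) { // fast path: single volatile read
            if (timedOutThreads.remove(Thread.currentThread())) {
                numberOfTimedOutThreads = timedOutThreads.size();
                throw new RuntimeException("Activity timed out");
            }
        }
    }
}
```

Replacing the boolean with the set's size also answers the "when to clear it" question later in the thread: the flag clears itself once every timed-out thread has been reported, since the counter is recomputed from the set after each removal.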
[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12725168#action_12725168 ] Eks Dev commented on LUCENE-1720: -

It's a bit late for this issue, but maybe worth thinking about: we could change the semantics of this problem completely. Imo, the problem can be reformulated as "Provide the possibility to cancel running queries on a best-effort basis, with or without providing the results collected so far". That would leave timer management to the end users and keep the issue focused on the Lucene core ... Timeout management could then be provided as an example somewhere: "How to implement timeout management using ...".
[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12725172#action_12725172 ] Shai Erera commented on LUCENE-1720:

bq. ... quickly testing a single volatile boolean, anActivityHasTimedOut.

Oh, I did not mean to skip this check. After anActivityHasTimedOut is true, instead of comparing Thread.currentThread() to firstAnticipatedThreadToFail, check if Thread.currentThread() is in the failed HashSet of threads, or something like that. I totally agree this check should be kept and used that way, and it's probably better than numberOfTimedOutThreads since we don't need to inc/dec the latter on every failure - just set a boolean flag and test it.

bq. Imo, the problem can be reformulated as "Provide possibility to cancel running queries on best effort basis, with or without providing so far collected results".

That's where we started from, but Mark here wanted to provide a much more generalized way of stopping any other activity, not just search. With this utility class, someone can implement a TimeLimitedIndexWriter which times out indexing, merging etc. Search is just one operation which will be covered as well. I also think that TimeLimitingCollector already provides a possibility to cancel running queries on a best-effort basis, and therefore if someone is interested in just that, he doesn't need to use TimeLimitedIndexReader. However, this approach seems much simpler if you want to ensure queries are stopped ASAP, w/o passing a Timeout object around or anything. This approach also guarantees (I think) that any custom Scorer which does a lot of work, but uses IndexReader for that, will be stopped, even if the Scorer's developer did not implement a Timeout mechanism. Right?
[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12725176#action_12725176 ] Mark Harwood commented on LUCENE-1720: --

bq. Oh, I did not mean to skip this check.

But the check is on a variable with a yes/no state. We need to cater for > 1 simultaneous timeout error condition in play. With only a boolean it could be hard to know precisely when to clear it, no?

bq. Mark here wanted to provide a much more generalized way of stopping any other activity, not just search

To be fair, I think the use case for IndexWriter is weaker. In a reader you have multiple users all expressing different queries, and you want them all to share nicely with each other. In index writing it's typically a batch system indexing docs, and there's no fairness to mediate? Breaking it out into a utility class seems like a good idea anyway.
[jira] Created: (LUCENE-1722) SmartChineseAnalyzer javadoc improvement
SmartChineseAnalyzer javadoc improvement
-

Key: LUCENE-1722
URL: https://issues.apache.org/jira/browse/LUCENE-1722
Project: Lucene - Java
Issue Type: Improvement
Components: contrib/analyzers
Reporter: Robert Muir
Priority: Minor

Chinese -> English, and corrections to match reality (removes several javadoc warnings)
[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12725182#action_12725182 ] Eks Dev commented on LUCENE-1720: -

Sure. I just wanted to sharpen the definition of what is a Lucene core issue and what we can leave to end users. It is not only about time - rather, it is about canceling search requests (or even better, general activities).
[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12725183#action_12725183 ] Shai Erera commented on LUCENE-1720:

bq. With only a boolean it could be hard to know precisely when to clear it, no?

We can clear it when the timed-out threads Set's size() is 0? I agree that this issue is mostly about IndexReader (hence the name), and that the IndexWriter scenario is weaker. But a utility class, together w/ the TimeLimitedIndexReader example, can help someone write a TimeLimitedIndexWriter very easily, and/or reuse this utility elsewhere.
[jira] Updated: (LUCENE-1722) SmartChineseAnalyzer javadoc improvement
[ https://issues.apache.org/jira/browse/LUCENE-1722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-1722:

Attachment: LUCENE-1722.txt

patch file
[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12725197#action_12725197 ] Mark Harwood commented on LUCENE-1720: --

bq. any custom Scorer which does a lot of work, but uses IndexReader for that, will be stopped, even if the Scorer's developer did not implement a Timeout mechanism. Right?

Correct. I'm not familiar with the proposal to pass around a Timeout object, but I get the idea, and the code here would certainly avoid that overhead.

bq. We can clear it when the timed-out threads Set's size() is 0?

Yes, that would work.
[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12725200#action_12725200 ] Shai Erera commented on LUCENE-1720:

bq. I'm not familiar with the proposal to pass around a Timeout object

On the email thread I offered to add a scorer(IndexSearcher, boolean, boolean, Timeout) method on QueryWeight in order to pass a Timeout object to the Scorer, and also to create a TimeLimitedQuery. But it's no longer needed.
[jira] Reopened: (LUCENE-1705) Add deleteAllDocuments() method to IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-1705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Smith reopened LUCENE-1705: ---

Looks like I found an issue with this. The deleteAll() method isn't resetting the nextDocID on the DocumentsWriter (or some similar behaviour), so the following sequence will result in an error:
* deleteAll()
* updateDocument(5, doc)
* commit()

This results in a delete for doc 5 getting buffered, but with a very high maxDocId. At the same time, the doc is added; however, the following will then occur on commit:
* flush segments to disk
* doc 5 is now in a segment on disk
* run deletes
* doc 5 is now blacklisted from the segment

Will work on fixing this and post a new patch (along with an updated test case). (Was worried I was missing an edge case.)

Add deleteAllDocuments() method to IndexWriter
--

Key: LUCENE-1705
URL: https://issues.apache.org/jira/browse/LUCENE-1705
Project: Lucene - Java
Issue Type: Wish
Components: Index
Affects Versions: 2.4
Reporter: Tim Smith
Assignee: Michael McCandless
Fix For: 2.9
Attachments: IndexWriterDeleteAll.patch, LUCENE-1705.patch

Ideally, there would be a deleteAllDocuments() or clear() method on the IndexWriter. This method should have the same performance and characteristics as:
* currentWriter.close()
* currentWriter = new IndexWriter(..., create=true, ...)

This would greatly optimize a "delete all documents" case. Using deleteDocuments(new MatchAllDocsQuery()) could be expensive given a large existing index.

IndexWriter.deleteAllDocuments() should have the same semantics as a commit(), as far as index visibility goes (a new IndexReader opening would get the empty index).

I see this was previously asked for in LUCENE-932; however, it would be nice to finally see this added such that the IndexWriter would not need to be closed to perform the clear, as this seems to be the general recommendation for working with an IndexWriter now.

The deleteAllDocuments() method should:
* abort any background merges (they are pointless once a deleteAll has been received)
* write a new segments file referencing no segments

This method would remove one of the final reasons I would ever need to close an IndexWriter and reopen a new one.
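The failure mode above can be illustrated with a toy model in plain Java. This is not IndexWriter's real internals - the class, the doc-counter name, and the delete-ceiling bookkeeping are all simplifications - but it shows why a delete buffered by updateDocument() after deleteAll() can wrongly cover the freshly added document when the counter is not reset:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model (not Lucene's DocumentsWriter): a buffered delete records a
// doc-id ceiling, and on commit it removes matching docs whose id is
// strictly below that ceiling. The ceiling comes from a running counter;
// if deleteAll() forgets to reset it, a later updateDocument() buffers a
// delete with a stale high ceiling that blacklists the new document.
class ToyWriter {
    private final Map<String, Integer> bufferedDeletes = new HashMap<>();
    private final List<String[]> docs = new ArrayList<>(); // {term, value}; doc id = index
    private int docCounter = 0;                             // feeds delete ceilings
    private final boolean resetCounterOnDeleteAll;

    ToyWriter(boolean resetCounterOnDeleteAll) {
        this.resetCounterOnDeleteAll = resetCounterOnDeleteAll;
    }

    void add(String term, String value) {
        docs.add(new String[] { term, value });
        docCounter++;
    }

    void deleteAll() {
        docs.clear();
        bufferedDeletes.clear();
        if (resetCounterOnDeleteAll) {
            docCounter = 0; // the step the original patch was missing
        }
    }

    void updateDocument(String term, String value) {
        bufferedDeletes.put(term, docCounter); // delete older docs with this term
        add(term, value);
    }

    List<String> commit() {
        // Apply buffered deletes: drop docs below the recorded ceiling.
        List<String> visible = new ArrayList<>();
        for (int id = 0; id < docs.size(); id++) {
            String[] d = docs.get(id);
            Integer ceiling = bufferedDeletes.get(d[0]);
            if (ceiling == null || id >= ceiling) {
                visible.add(d[1]);
            }
        }
        bufferedDeletes.clear();
        return visible;
    }
}
```

With the reset disabled, the deleteAll/updateDocument/commit sequence leaves the index empty (the new doc is blacklisted by its own buffered delete); with the reset in place the updated document survives the commit.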
[jira] Closed: (LUCENE-1706) Site search powered by Lucene/Solr
[ https://issues.apache.org/jira/browse/LUCENE-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll closed LUCENE-1706. ---

Resolution: Fixed
Lucene Fields: (was: [New])

Site search powered by Lucene/Solr
--

Key: LUCENE-1706
URL: https://issues.apache.org/jira/browse/LUCENE-1706
Project: Lucene - Java
Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
Fix For: 2.9
Attachments: LUCENE-1706.patch, LUCENE-1706.patch

For a number of years now, the Lucene community has been criticized for not eating our own dog food when it comes to search. My company has built and hosts a site search (http://www.lucidimagination.com/search) that is powered by Apache Solr and Lucene, and we'd like to donate its use to the Lucene community. Additionally, it allows one to search all of the Lucene content from a single place, including web, wiki, JIRA and mail archives. See also http://www.lucidimagination.com/search/document/bf22a570bf9385c7/search_on_lucene_apache_org

You can see it live on Mahout, Tika and Solr. Lucid has a fault-tolerant setup with replication and failover, as well as monitoring services in place. We are committed to maintaining and expanding the search capabilities on the site.

The following patch adds a skin to the Forrest site that enables the Lucene site to search Lucene-only content using Lucene/Solr. When a search is submitted, it automatically selects the Lucene facet such that only Lucene content is searched. From there, users can then narrow/broaden their search criteria. I plan on committing in 3 or 4 days.
[jira] Updated: (LUCENE-1705) Add deleteAllDocuments() method to IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-1705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Smith updated LUCENE-1705: --

Attachment: TestIndexWriterDelete.patch

Here's a patch to TestIndexWriterDelete that shows the problem: after the deleteAll(), a document is added and a document is updated; the added document gets indexed, but the updated document does not.
[jira] Updated: (LUCENE-1566) Large Lucene index can hit false OOM due to Sun JRE issue
[ https://issues.apache.org/jira/browse/LUCENE-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer updated LUCENE-1566:

Attachment: LUCENE-1566.patch

I was able to reproduce the bug on my machine using several JVMs. The attached patch is what I have ready by now - I thought I'd get it out there as soon as possible for discussion. Tests pass on my side!

Large Lucene index can hit false OOM due to Sun JRE issue
-

Key: LUCENE-1566
URL: https://issues.apache.org/jira/browse/LUCENE-1566
Project: Lucene - Java
Issue Type: Bug
Components: Index
Affects Versions: 2.4.1
Reporter: Michael McCandless
Assignee: Simon Willnauer
Priority: Minor
Attachments: LUCENE-1566.patch

This is not a Lucene issue, but I want to open this so future Google diggers can more easily find it. There's this nasty bug in Sun's JRE: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6478546

The gist seems to be: if you try to read a large (e.g. 200 MB) number of bytes during a single RandomAccessFile.read call, you can incorrectly hit OOM. Lucene does this with norms, since we read in one byte per doc per field with norms, as a contiguous array of length maxDoc(). The workaround was a custom patch to do large file reads as several smaller reads. Background here: http://www.nabble.com/problems-with-large-Lucene-index-td22347854.html
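The "several smaller reads" workaround described above might look roughly like this. The helper class and the 1 MB chunk size are assumptions for illustration, not Lucene's actual patch:

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical sketch: split one potentially huge RandomAccessFile.read
// call into bounded chunks, so the JRE never has to service a single
// multi-hundred-megabyte read (the trigger for Sun bug 6478546).
final class ChunkedReader {
    private static final int CHUNK_SIZE = 1 << 20; // 1 MB per underlying read

    static void readFully(RandomAccessFile file, byte[] dest) throws IOException {
        int offset = 0;
        while (offset < dest.length) {
            int toRead = Math.min(CHUNK_SIZE, dest.length - offset);
            int read = file.read(dest, offset, toRead);
            if (read < 0) {
                throw new IOException("Unexpected EOF at offset " + offset);
            }
            offset += read; // read() may return fewer bytes than requested
        }
    }
}
```

Note the loop also handles short reads, since RandomAccessFile.read(byte[], int, int) is allowed to return fewer bytes than requested.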
[jira] Updated: (LUCENE-1705) Add deleteAllDocuments() method to IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-1705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Smith updated LUCENE-1705: --

Attachment: DeleteAllFlushDocCountFix.patch

Here's a patch that fixes the deleteAll() + updateDocument() issue - just needed to set the FlushDocCount to 0 after aborting the outstanding documents.
[jira] Updated: (LUCENE-1705) Add deleteAllDocuments() method to IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-1705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Smith updated LUCENE-1705: --

Attachment: (was: TestIndexWriterDelete.patch)
[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12725386#action_12725386 ] Jason Rutherglen commented on LUCENE-1720: --

Maybe we can benchmark this approach to see if it slows down queries due to the Thread.currentThread() and hash lookup? As this would go into 3.0 (?), maybe we can look at how to change the Lucene API such that we pass in an argument to the IndexReader methods where the timeout may be checked for?
[jira] Created: (LUCENE-1723) KeywordTokenizer does not properly set the end offset
KeywordTokenizer does not properly set the end offset
-
Key: LUCENE-1723
URL: https://issues.apache.org/jira/browse/LUCENE-1723
Project: Lucene - Java
Issue Type: Bug
Components: Analysis
Affects Versions: 2.4.1
Reporter: Dima May
Priority: Minor
Attachments: AnalyzerBug.java

KeywordTokenizer sets the Token's term length attribute but appears to omit the end offset. The issue was discovered while using a highlighter with the KeywordAnalyzer. KeywordAnalyzer delegates to KeywordTokenizer, propagating the bug. Below is a JUnit test that exercises various analyzers via a Highlighter instance. Every analyzer but the KeywordAnalyzer successfully wraps the text with the highlight tags, such as <b>thetext</b>. When using KeywordAnalyzer the tags appear before the text, for example: <b></b>thetext. Please note the NewKeywordAnalyzer and NewKeywordTokenizer classes below. When using NewKeywordAnalyzer the tags are properly placed around the text. NewKeywordTokenizer overrides the next method of KeywordTokenizer, setting the end offset for the returned Token. NewKeywordAnalyzer utilizes NewKeywordTokenizer to produce a proper token.
package lucene;

import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.analysis.KeywordTokenizer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.search.highlight.WeightedTerm;
import org.junit.Test;
import static org.junit.Assert.*;

public class AnalyzerBug {

    @Test
    public void testWithHighlighting() throws IOException {
        String text = "thetext";
        WeightedTerm[] terms = { new WeightedTerm(1.0f, text) };
        Highlighter highlighter = new Highlighter(new SimpleHTMLFormatter(
                "<b>", "</b>"), new QueryScorer(terms));
        Analyzer[] analazers = { new StandardAnalyzer(), new SimpleAnalyzer(),
                new StopAnalyzer(), new WhitespaceAnalyzer(),
                new NewKeywordAnalyzer(), new KeywordAnalyzer() };
        // Analyzers pass except KeywordAnalyzer
        for (Analyzer analazer : analazers) {
            String highighted = highlighter.getBestFragment(analazer,
                    "CONTENT", text);
            assertEquals("Failed for " + analazer.getClass().getName(),
                    "<b>" + text + "</b>", highighted);
            System.out.println(analazer.getClass().getName()
                    + " passed, value highlighted: " + highighted);
        }
    }
}

class NewKeywordAnalyzer extends KeywordAnalyzer {
    @Override
    public TokenStream reusableTokenStream(String fieldName, Reader reader)
            throws IOException {
        Tokenizer tokenizer = (Tokenizer) getPreviousTokenStream();
        if (tokenizer == null) {
            tokenizer = new NewKeywordTokenizer(reader);
            setPreviousTokenStream(tokenizer);
        } else
            tokenizer.reset(reader);
        return tokenizer;
    }

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new NewKeywordTokenizer(reader);
    }
}

class NewKeywordTokenizer extends KeywordTokenizer {
    public NewKeywordTokenizer(Reader input) {
        super(input);
    }

    @Override
    public Token next(Token t) throws IOException {
        Token result = super.next(t);
        if (result != null) {
            result.setEndOffset(result.termLength());
        }
        return result;
    }
}
[jira] Updated: (LUCENE-1723) KeywordTokenizer does not properly set the end offset
[ https://issues.apache.org/jira/browse/LUCENE-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dima May updated LUCENE-1723:
--
Attachment: AnalyzerBug.java
[jira] Updated: (LUCENE-1723) KeywordTokenizer does not properly set the end offset
[ https://issues.apache.org/jira/browse/LUCENE-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dima May updated LUCENE-1723:
--
Description: updated (notes that the JUnit test source is also attached)
[jira] Updated: (LUCENE-1723) KeywordTokenizer does not properly set the end offset
[ https://issues.apache.org/jira/browse/LUCENE-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dima May updated LUCENE-1723:
--
Description: updated (adds: "Unless there is an objection I will gladly post a patch in the very near future.")
[jira] Commented: (LUCENE-1653) Change DateTools to not create a Calendar in every call to dateToString or timeToString
[ https://issues.apache.org/jira/browse/LUCENE-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12725447#action_12725447 ] David Smiley commented on LUCENE-1653:
--
I'm looking through DateTools now and can't help but want to clean it up some. One thing I see that is odd is the use of a Calendar in timeToString(long, resolution). The first two lines look like this right now:
{code}
calInstance.setTimeInMillis(round(time, resolution));
Date date = calInstance.getTime();
{code}
Instead, it can simply be:
{code}
Date date = new Date(round(time, resolution));
{code}
Secondly... I think a good deal of logic in the other methods can be cleaned up, replacing a bunch of if-else statements, which are a bad code smell. Most of the logic of three of those methods could be put into Resolution and be made tighter.

Change DateTools to not create a Calendar in every call to dateToString or timeToString
---
Key: LUCENE-1653
URL: https://issues.apache.org/jira/browse/LUCENE-1653
Project: Lucene - Java
Issue Type: Improvement
Components: Other
Reporter: Shai Erera
Assignee: Mark Miller
Priority: Minor
Fix For: 2.9
Attachments: LUCENE-1653.patch, LUCENE-1653.patch

DateTools creates a Calendar instance on every call to dateToString and timeToString. Specifically:
# timeToString calls Calendar.getInstance on every call.
# dateToString calls timeToString(date.getTime()), which then instantiates a new Date(). I think we should change the order of the calls, or not have each call the other.
# round(), which is called from timeToString (after creating a Calendar instance), creates another (!) Calendar instance ...

Seems that if we synchronize the methods and create the Calendar instance once (static), it should solve it.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
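The equivalence behind David's simplification -- that setTimeInMillis followed by getTime is just a roundabout new Date -- can be checked directly. This is a standalone sketch, not DateTools code; the epoch value is arbitrary:

```java
import java.util.Calendar;
import java.util.Date;
import java.util.TimeZone;

// Standalone check that Calendar.setTimeInMillis(t) + getTime() yields the
// same Date as new Date(t), i.e. the Calendar round-trip buys nothing here.
public class DateRoundTripCheck {
    public static void main(String[] args) {
        long time = 1246406400000L; // arbitrary epoch millis
        Calendar calInstance = Calendar.getInstance(TimeZone.getTimeZone("GMT"));
        calInstance.setTimeInMillis(time);
        Date viaCalendar = calInstance.getTime(); // the current two-line path
        Date direct = new Date(time);             // the proposed one-liner
        System.out.println(viaCalendar.equals(direct)); // prints "true"
    }
}
```

This also avoids Calendar's mutable shared state, which matters for the synchronization concerns raised in the issue description.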
[jira] Commented: (LUCENE-1723) KeywordTokenizer does not properly set the end offset
[ https://issues.apache.org/jira/browse/LUCENE-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12725448#action_12725448 ] Robert Muir commented on LUCENE-1723:
--
Dima, have you tried your test against the latest lucene trunk? I got these results:
{noformat}
org.apache.lucene.analysis.standard.StandardAnalyzer passed, value highlighted: <b>thetext</b>
org.apache.lucene.analysis.SimpleAnalyzer passed, value highlighted: <b>thetext</b>
org.apache.lucene.analysis.StopAnalyzer passed, value highlighted: <b>thetext</b>
org.apache.lucene.analysis.WhitespaceAnalyzer passed, value highlighted: <b>thetext</b>
org.apache.lucene.analysis.NewKeywordAnalyzer passed, value highlighted: <b>thetext</b>
org.apache.lucene.analysis.KeywordAnalyzer passed, value highlighted: <b>thetext</b>
{noformat}
maybe you can verify the same?
[jira] Commented: (LUCENE-1653) Change DateTools to not create a Calendar in every call to dateToString or timeToString
[ https://issues.apache.org/jira/browse/LUCENE-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12725456#action_12725456 ] Shai Erera commented on LUCENE-1653:
--
In 3.0, when we move to Java 5, we can make Resolution an enum and then use a switch statement on the passed-in Resolution. But performance-wise I don't think it would make such a big difference, as we're already comparing instances, which should be relatively fast. How will moving the logic of timeToString, stringToDate and round to Resolution make the code tighter? Resolution would still need to check its instance type in order to execute the right code. Unless we subclass Resolution internally and have each subclass implement just the code section of these 3 that it needs?
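The last option Shai describes -- each subclass implementing just the code it needs -- is what Java 5 enums provide directly. A hypothetical sketch (Lucene's actual Resolution class and its Calendar-based rounding differ; plain UTC modulo arithmetic is used here only for the fixed-width resolutions):

```java
// Hypothetical sketch: each Resolution constant carries its own rounding
// width, so round() needs no if-else chain or switch. Variable-width
// resolutions (MONTH, YEAR) would instead override round() per constant
// with Calendar-based logic.
enum ResolutionSketch {
    SECOND(1000L),
    MINUTE(60L * 1000),
    HOUR(60L * 60 * 1000),
    DAY(24L * 60 * 60 * 1000);

    private final long millis;

    ResolutionSketch(long millis) {
        this.millis = millis;
    }

    // Round epoch millis down to this resolution (UTC).
    long round(long time) {
        return time - (time % millis);
    }
}
```

With this shape, timeToString and friends would dispatch through the Resolution instance itself rather than testing its type, which is the "tighter" structure being debated.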
[jira] Commented: (LUCENE-1723) KeywordTokenizer does not properly set the end offset
[ https://issues.apache.org/jira/browse/LUCENE-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12725460#action_12725460 ] Dima May commented on LUCENE-1723:
--
Verified! You are absolutely correct, the bug has been fixed on the latest trunk. The next method in KeywordTokenizer now sets the start and end offsets:
reusableToken.setStartOffset(input.correctOffset(0));
reusableToken.setEndOffset(input.correctOffset(upto));
I will resolve and close the ticket. Sorry for the trouble and thank you for the prompt attention.
[jira] Resolved: (LUCENE-1723) KeywordTokenizer does not properly set the end offset
[ https://issues.apache.org/jira/browse/LUCENE-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dima May resolved LUCENE-1723.
--
Resolution: Fixed
Fix Version/s: 2.9
package lucene;

import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.analysis.KeywordTokenizer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.search.highlight.WeightedTerm;
import org.junit.Test;
import static org.junit.Assert.*;

public class AnalyzerBug {

    @Test
    public void testWithHighlighting() throws IOException {
        String text = "thetext";
        WeightedTerm[] terms = { new WeightedTerm(1.0f, text) };
        Highlighter highlighter = new Highlighter(
                new SimpleHTMLFormatter("<b>", "</b>"), new QueryScorer(terms));
        Analyzer[] analyzers = { new StandardAnalyzer(), new SimpleAnalyzer(),
                new StopAnalyzer(), new WhitespaceAnalyzer(),
                new NewKeywordAnalyzer(), new KeywordAnalyzer() };
        // All analyzers pass except KeywordAnalyzer
        for (Analyzer analyzer : analyzers) {
            String highlighted = highlighter.getBestFragment(analyzer, "CONTENT", text);
            assertEquals("Failed for " + analyzer.getClass().getName(),
                    "<b>" + text + "</b>", highlighted);
            System.out.println(analyzer.getClass().getName()
                    + " passed, value highlighted: " + highlighted);
        }
    }
}

class NewKeywordAnalyzer extends KeywordAnalyzer {
    @Override
    public TokenStream reusableTokenStream(String fieldName, Reader reader)
            throws IOException {
        Tokenizer tokenizer = (Tokenizer) getPreviousTokenStream();
        if (tokenizer == null) {
            tokenizer = new NewKeywordTokenizer(reader);
            setPreviousTokenStream(tokenizer);
        } else {
            tokenizer.reset(reader);
        }
        return tokenizer;
    }

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new NewKeywordTokenizer(reader);
    }
}

class NewKeywordTokenizer extends KeywordTokenizer {
    public NewKeywordTokenizer(Reader input) {
        super(input);
    }

    @Override
    public Token next(Token t) throws IOException {
        Token result = super.next(t);
        if (result != null) {
            // KeywordTokenizer leaves the end offset unset; use the term length
            result.setEndOffset(result.termLength());
        }
        return result;
    }
}

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
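The misplaced tags follow directly from the offsets: a highlighter splices its tags into the original text around the token's character range from start offset to end offset, so a token whose end offset was never set (and thus stays 0) yields an empty highlighted span in front of the text. Below is a minimal, Lucene-free sketch of that splicing; `HighlightOffsetDemo` and `highlight()` are hypothetical names for illustration, not Lucene API.

```java
// Illustrative sketch of how a highlighter places tags using token offsets.
// Hypothetical class/method names; not Lucene code.
public class HighlightOffsetDemo {

    // Wrap the [start, end) slice of text in <b>...</b>, the way a
    // formatter rebuilds a fragment from a token's character offsets.
    static String highlight(String text, int start, int end) {
        return text.substring(0, start)
                + "<b>" + text.substring(start, end) + "</b>"
                + text.substring(end);
    }

    public static void main(String[] args) {
        String text = "thetext";
        // Correct offsets: end offset equals the term length (7)
        System.out.println(highlight(text, 0, text.length())); // <b>thetext</b>
        // The KeywordTokenizer bug: end offset left at 0, empty span up front
        System.out.println(highlight(text, 0, 0));             // <b></b>thetext
    }
}
```

With the fix, NewKeywordTokenizer sets the end offset to termLength(), which corresponds to the first, correct case above.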
[jira] Closed: (LUCENE-1723) KeywordTokenizer does not properly set the end offset
[ https://issues.apache.org/jira/browse/LUCENE-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dima May closed LUCENE-1723.
----------------------------

KeywordTokenizer does not properly set the end offset
-----------------------------------------------------

                 Key: LUCENE-1723
                 URL: https://issues.apache.org/jira/browse/LUCENE-1723
             Project: Lucene - Java
          Issue Type: Bug
          Components: Analysis
    Affects Versions: 2.4.1
            Reporter: Dima May
            Priority: Minor
             Fix For: 2.9
         Attachments: AnalyzerBug.java

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.