[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725500#action_12725500 ]

Mark Harwood commented on LUCENE-1720:
--------------------------------------

bq. Maybe we can benchmark this approach

See http://www.nabble.com/Improving-TimeLimitedCollector-td24174758.html#a24229185
The figures were produced by TestTimeLimitedIndexReader, which is part of this Jira issue, so you can run the benchmarks on your own indexes.

bq. if it slows down queries due to the Thread.currentThread and hash lookup

This lookup only happens when threads start or stop timed activities and when there is a timed-out state. All other method invocations on TimeLimitedIndexReader, e.g. termDocs.next(), simply test a volatile boolean that indicates whether any timeout has occurred. This is fast in my benchmarks.

bq. maybe we can .. change the Lucene API such that we pass in an argument to the IndexReader methods where the timeout may be checked

The current design uses static methods, which removes the need to pass a timeout object as context everywhere. The downside of this approach is that a single client thread cannot time more than one activity at once, which we thought was a reasonable trade-off. See http://www.nabble.com/Re%3A-Improving-TimeLimitedCollector-p24234976.html

> TimeLimitedIndexReader and associated utility class
> ---------------------------------------------------
>
>                 Key: LUCENE-1720
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1720
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Mark Harwood
>            Assignee: Mark Harwood
>            Priority: Minor
>         Attachments: ActivityTimedOutException.java, ActivityTimeMonitor.java, TestTimeLimitedIndexReader.java, TimeLimitedIndexReader.java
>
>
> An alternative to TimeLimitedCollector that has the following advantages:
> 1) Any reader activity can be time-limited rather than just single searches, e.g. the document retrieve phase.
> 2) Times out faster (i.e. runaway queries such as fuzzies are detected quickly, before the last "collect" stage of query processing).
> Uses a new utility timeout class that is independent of IndexReader.
> The initial contribution includes a performance test class, but I have not yet had time to work up a formal JUnit test.
> TimeLimitedIndexReader is coded as JDK 1.5 but can easily be undone.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
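The fast-path/slow-path split Mark describes (every wrapped reader call reads only a volatile boolean; the Thread.currentThread() and hash lookup happen only on start/stop or once a timeout has actually fired) can be sketched as below. This is a hypothetical illustration, not the ActivityTimeMonitor attached to the issue; all names are invented.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the timing scheme: one timed activity per thread,
// tracked by deadline; the per-operation check is a single volatile read.
public class ActivityTimeMonitorSketch {
    // Set by a watchdog thread when any tracked activity overruns its deadline.
    private static volatile boolean anActivityTimedOut = false;
    private static final Map<Thread, Long> deadlines = new ConcurrentHashMap<Thread, Long>();

    public static void start(long timeoutMillis) {
        // One activity per thread, matching the trade-off noted in the comment.
        deadlines.put(Thread.currentThread(), System.currentTimeMillis() + timeoutMillis);
    }

    public static void stop() {
        deadlines.remove(Thread.currentThread());
    }

    // Fast path: called on every wrapped operation, e.g. termDocs.next().
    public static void checkForTimeout() {
        if (anActivityTimedOut) {                                   // cheap volatile read
            Long deadline = deadlines.get(Thread.currentThread());  // slow path only
            if (deadline != null && System.currentTimeMillis() > deadline) {
                deadlines.remove(Thread.currentThread());
                throw new RuntimeException("timed activity exceeded its deadline");
            }
        }
    }

    // A background watchdog thread would call this periodically.
    static void watchdogTick() {
        long now = System.currentTimeMillis();
        boolean overrun = false;
        for (Long deadline : deadlines.values()) {
            if (deadline < now) { overrun = true; break; }
        }
        anActivityTimedOut = overrun;
    }

    public static boolean anyTimeout() {
        return anActivityTimedOut;
    }
}
```

Because only the rare timed-out case pays for the map lookup, hot loops such as TermDocs iteration stay close to the cost of a plain volatile read, which is consistent with the benchmark figures mentioned above.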
[jira] Closed: (LUCENE-1723) KeywordTokenizer does not properly set the end offset
[ https://issues.apache.org/jira/browse/LUCENE-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dima May closed LUCENE-1723.
----------------------------

> KeywordTokenizer does not properly set the end offset
> ------------------------------------------------------
>
>                 Key: LUCENE-1723
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1723
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 2.4.1
>            Reporter: Dima May
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: AnalyzerBug.java
>
>
> KeywordTokenizer sets the Token's term length attribute but appears to omit the end offset. The issue was discovered while using a highlighter with the KeywordAnalyzer. KeywordAnalyzer delegates to KeywordTokenizer, propagating the bug.
> Below is a JUnit test (source is also attached) that exercises various analyzers via a Highlighter instance. Every analyzer but the KeywordAnalyzer successfully wraps the text with the highlight tags, such as "<B>thetext</B>". When using KeywordAnalyzer the tags appear before the text, for example: "<B></B>thetext".
> Please note the NewKeywordAnalyzer and NewKeywordTokenizer classes below. When using NewKeywordAnalyzer the tags are properly placed around the text. NewKeywordTokenizer overrides the next method of KeywordTokenizer, setting the end offset on the returned Token; NewKeywordAnalyzer utilizes NewKeywordTokenizer to produce a proper token.
> Unless there is an objection I will gladly post a patch in the very near future.
> ----------------------------------------------------------------------
> package lucene;
>
> import java.io.IOException;
> import java.io.Reader;
>
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.KeywordAnalyzer;
> import org.apache.lucene.analysis.KeywordTokenizer;
> import org.apache.lucene.analysis.SimpleAnalyzer;
> import org.apache.lucene.analysis.StopAnalyzer;
> import org.apache.lucene.analysis.Token;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.Tokenizer;
> import org.apache.lucene.analysis.WhitespaceAnalyzer;
> import org.apache.lucene.analysis.standard.StandardAnalyzer;
> import org.apache.lucene.search.highlight.Highlighter;
> import org.apache.lucene.search.highlight.QueryScorer;
> import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
> import org.apache.lucene.search.highlight.WeightedTerm;
> import org.junit.Test;
> import static org.junit.Assert.*;
>
> public class AnalyzerBug {
>     @Test
>     public void testWithHighlighting() throws IOException {
>         String text = "thetext";
>         WeightedTerm[] terms = { new WeightedTerm(1.0f, text) };
>         Highlighter highlighter = new Highlighter(
>                 new SimpleHTMLFormatter("<B>", "</B>"), new QueryScorer(terms));
>         Analyzer[] analyzers = { new StandardAnalyzer(), new SimpleAnalyzer(),
>                 new StopAnalyzer(), new WhitespaceAnalyzer(),
>                 new NewKeywordAnalyzer(), new KeywordAnalyzer() };
>         // All analyzers pass except KeywordAnalyzer
>         for (Analyzer analyzer : analyzers) {
>             String highlighted = highlighter.getBestFragment(analyzer, "CONTENT", text);
>             assertEquals("Failed for " + analyzer.getClass().getName(),
>                     "<B>" + text + "</B>", highlighted);
>             System.out.println(analyzer.getClass().getName()
>                     + " passed, value highlighted: " + highlighted);
>         }
>     }
> }
>
> class NewKeywordAnalyzer extends KeywordAnalyzer {
>     @Override
>     public TokenStream reusableTokenStream(String fieldName, Reader reader)
>             throws IOException {
>         Tokenizer tokenizer = (Tokenizer) getPreviousTokenStream();
>         if (tokenizer == null) {
>             tokenizer = new NewKeywordTokenizer(reader);
>             setPreviousTokenStream(tokenizer);
>         } else {
>             tokenizer.reset(reader);
>         }
>         return tokenizer;
>     }
>
>     @Override
>     public TokenStream tokenStream(String fieldName, Reader reader) {
>         return new NewKeywordTokenizer(reader);
>     }
> }
>
> class NewKeywordTokenizer extends KeywordTokenizer {
>     public NewKeywordTokenizer(Reader input) {
>         super(input);
>     }
>
>     @Override
>     public Token next(Token t) throws IOException {
>         Token result = super.next(t);
>         if (result != null) {
>             result.setEndOffset(result.termLength());
>         }
>         return result;
>     }
> }
[jira] Resolved: (LUCENE-1723) KeywordTokenizer does not properly set the end offset
[ https://issues.apache.org/jira/browse/LUCENE-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dima May resolved LUCENE-1723.
------------------------------
       Resolution: Fixed
    Fix Version/s: 2.9
[jira] Commented: (LUCENE-1723) KeywordTokenizer does not properly set the end offset
[ https://issues.apache.org/jira/browse/LUCENE-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725460#action_12725460 ]

Dima May commented on LUCENE-1723:
----------------------------------

Verified! You are absolutely correct: the bug has been fixed on the latest trunk. The next method in KeywordTokenizer now sets both the start and end offsets:

    reusableToken.setStartOffset(input.correctOffset(0));
    reusableToken.setEndOffset(input.correctOffset(upto));

I will resolve and close the ticket. Sorry for the trouble, and thank you for the prompt attention.
[jira] Commented: (LUCENE-1653) Change DateTools to not create a Calendar in every call to dateToString or timeToString
[ https://issues.apache.org/jira/browse/LUCENE-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725456#action_12725456 ]

Shai Erera commented on LUCENE-1653:
------------------------------------

In 3.0, when we move to Java 5, we can make Resolution an enum and then use a switch statement on the passed-in Resolution. But performance-wise I don't think it would make a big difference, as we're already comparing instances, which should be relatively fast.

How will moving the logic of timeToString, stringToDate and round into Resolution make the code tighter? Resolution would still need to check its instance type in order to execute the right code. Unless we subclass Resolution internally and have each subclass implement just the code sections of those three methods that it needs?

> Change DateTools to not create a Calendar in every call to dateToString or timeToString
> ---------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1653
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1653
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Other
>            Reporter: Shai Erera
>            Assignee: Mark Miller
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1653.patch, LUCENE-1653.patch
>
>
> DateTools creates a Calendar instance on every call to dateToString and timeToString. Specifically:
> # timeToString calls Calendar.getInstance on every call.
> # dateToString calls timeToString(date.getTime()), which then instantiates a new Date(). I think we should change the order of the calls, or not have each call the other.
> # round(), which is called from timeToString (after creating a Calendar instance), creates another (!) Calendar instance ...
> Seems that if we synchronize the methods and create the Calendar instance once (static), it should solve it.
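The "subclass Resolution internally" idea Shai raises maps directly onto Java 5 enums with constant-specific class bodies: each resolution carries just the code it needs, so the calling methods dispatch without if-else chains or instance checks. The sketch below is illustrative, not DateTools' actual implementation; it assumes GMT and post-epoch timestamps, where truncating modulo a fixed interval matches calendar rounding.

```java
// Hypothetical per-resolution rounding via enum constant bodies. Each
// constant overrides round(), so callers never branch on the resolution.
public enum ResolutionSketch {
    DAY {
        @Override public long round(long time) { return time - time % 86_400_000L; }
    },
    HOUR {
        @Override public long round(long time) { return time - time % 3_600_000L; }
    },
    MINUTE {
        @Override public long round(long time) { return time - time % 60_000L; }
    },
    SECOND {
        @Override public long round(long time) { return time - time % 1_000L; }
    };

    // Truncate a GMT millisecond timestamp down to this resolution.
    public abstract long round(long time);
}
```

With this shape, timeToString(long, Resolution) could call resolution.round(time) directly; whether that is "tighter" than a switch is the judgment call being debated in the comment.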
[jira] Commented: (LUCENE-1723) KeywordTokenizer does not properly set the end offset
[ https://issues.apache.org/jira/browse/LUCENE-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725448#action_12725448 ]

Robert Muir commented on LUCENE-1723:
-------------------------------------

Dima, have you tried your test against the latest Lucene trunk? I got these results:

{noformat}
org.apache.lucene.analysis.standard.StandardAnalyzer passed, value highlighted: <B>thetext</B>
org.apache.lucene.analysis.SimpleAnalyzer passed, value highlighted: <B>thetext</B>
org.apache.lucene.analysis.StopAnalyzer passed, value highlighted: <B>thetext</B>
org.apache.lucene.analysis.WhitespaceAnalyzer passed, value highlighted: <B>thetext</B>
org.apache.lucene.analysis.NewKeywordAnalyzer passed, value highlighted: <B>thetext</B>
org.apache.lucene.analysis.KeywordAnalyzer passed, value highlighted: <B>thetext</B>
{noformat}

Maybe you can verify the same?
[jira] Commented: (LUCENE-1653) Change DateTools to not create a Calendar in every call to dateToString or timeToString
[ https://issues.apache.org/jira/browse/LUCENE-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725447#action_12725447 ]

David Smiley commented on LUCENE-1653:
--------------------------------------

I'm looking through DateTools now and can't help but want to clean it up some. One odd thing I see is the use of a Calendar in timeToString(long, Resolution). The first two lines currently look like this:

{code}
calInstance.setTimeInMillis(round(time, resolution));
Date date = calInstance.getTime();
{code}

Instead, it can simply be:

{code}
Date date = new Date(round(time, resolution));
{code}

Secondly, I think a good deal of logic in the other methods can be cleaned up: they are a bunch of if-else statements, which is a bad code smell. Most of the logic of three of those methods could be moved into Resolution and made tighter.
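David's first simplification can be checked directly: setting a Calendar's millisecond value and asking it for a Date yields the same Date as constructing one from the millis. A minimal demonstration (the timestamp value is arbitrary, and the class name is invented):

```java
import java.util.Calendar;
import java.util.Date;
import java.util.TimeZone;

// Shows that the Calendar round-trip in timeToString is redundant:
// Calendar.getTime() builds a Date from the very millis we set.
public class DateRoundTrip {
    public static Date viaCalendar(long time) {
        Calendar cal = Calendar.getInstance(TimeZone.getTimeZone("GMT"));
        cal.setTimeInMillis(time);
        return cal.getTime();
    }

    public static Date direct(long time) {
        return new Date(time);  // the one-line equivalent
    }

    public static void main(String[] args) {
        long time = 1246665600000L;  // arbitrary instant
        System.out.println(viaCalendar(time).equals(direct(time))); // prints "true"
    }
}
```

The Calendar version also drags in locale and time-zone machinery that the conversion never uses, which is the performance concern this issue is about.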
[jira] Updated: (LUCENE-1723) KeywordTokenizer does not properly set the end offset
[ https://issues.apache.org/jira/browse/LUCENE-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dima May updated LUCENE-1723:
-----------------------------
    Description: (edited)
[jira] Updated: (LUCENE-1723) KeywordTokenizer does not properly set the end offset
[ https://issues.apache.org/jira/browse/LUCENE-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dima May updated LUCENE-1723:
-----------------------------
    Description: (edited)
[jira] Updated: (LUCENE-1723) KeywordTokenizer does not properly set the end offset
[ https://issues.apache.org/jira/browse/LUCENE-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dima May updated LUCENE-1723:
-
Attachment: AnalyzerBug.java

> KeywordTokenizer does not properly set the end offset
> -
>
> Key: LUCENE-1723
> URL: https://issues.apache.org/jira/browse/LUCENE-1723
> Project: Lucene - Java
> Issue Type: Bug
> Components: Analysis
> Affects Versions: 2.4.1
> Reporter: Dima May
> Priority: Minor
> Attachments: AnalyzerBug.java
>
>
> KeywordTokenizer sets the Token's term length attribute but appears to omit the end offset. The issue was discovered while using a highlighter with the KeywordAnalyzer. KeywordAnalyzer delegates to KeywordTokenizer, propagating the bug.
> Below is a JUnit test that exercises various analyzers via a Highlighter instance. Every analyzer but the KeywordAnalyzer successfully wraps the text with the highlight tags, such as "<B>thetext</B>". When using KeywordAnalyzer the tags appear before the text, for example: "<B></B>thetext".
> Please note the NewKeywordAnalyzer and NewKeywordTokenizer classes below. When using NewKeywordAnalyzer the tags are properly placed around the text. The NewKeywordTokenizer overrides the next method of KeywordTokenizer, setting the end offset for the returned Token. NewKeywordAnalyzer utilizes NewKeywordTokenizer to produce a proper token.
> -
> package lucene;
>
> import java.io.IOException;
> import java.io.Reader;
>
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.KeywordAnalyzer;
> import org.apache.lucene.analysis.KeywordTokenizer;
> import org.apache.lucene.analysis.SimpleAnalyzer;
> import org.apache.lucene.analysis.StopAnalyzer;
> import org.apache.lucene.analysis.Token;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.Tokenizer;
> import org.apache.lucene.analysis.WhitespaceAnalyzer;
> import org.apache.lucene.analysis.standard.StandardAnalyzer;
> import org.apache.lucene.search.highlight.Highlighter;
> import org.apache.lucene.search.highlight.QueryScorer;
> import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
> import org.apache.lucene.search.highlight.WeightedTerm;
> import org.junit.Test;
> import static org.junit.Assert.*;
>
> public class AnalyzerBug {
>     @Test
>     public void testWithHighlighting() throws IOException {
>         String text = "thetext";
>         WeightedTerm[] terms = { new WeightedTerm(1.0f, text) };
>         Highlighter highlighter = new Highlighter(
>                 new SimpleHTMLFormatter("<B>", "</B>"), new QueryScorer(terms));
>         Analyzer[] analyzers = { new StandardAnalyzer(), new SimpleAnalyzer(),
>                 new StopAnalyzer(), new WhitespaceAnalyzer(),
>                 new NewKeywordAnalyzer(), new KeywordAnalyzer() };
>         // All analyzers pass except KeywordAnalyzer
>         for (Analyzer analyzer : analyzers) {
>             String highlighted = highlighter.getBestFragment(analyzer, "CONTENT", text);
>             assertEquals("Failed for " + analyzer.getClass().getName(),
>                     "<B>" + text + "</B>", highlighted);
>             System.out.println(analyzer.getClass().getName()
>                     + " passed, value highlighted: " + highlighted);
>         }
>     }
> }
>
> class NewKeywordAnalyzer extends KeywordAnalyzer {
>     @Override
>     public TokenStream reusableTokenStream(String fieldName, Reader reader)
>             throws IOException {
>         Tokenizer tokenizer = (Tokenizer) getPreviousTokenStream();
>         if (tokenizer == null) {
>             tokenizer = new NewKeywordTokenizer(reader);
>             setPreviousTokenStream(tokenizer);
>         } else {
>             tokenizer.reset(reader);
>         }
>         return tokenizer;
>     }
>
>     @Override
>     public TokenStream tokenStream(String fieldName, Reader reader) {
>         return new NewKeywordTokenizer(reader);
>     }
> }
>
> class NewKeywordTokenizer extends KeywordTokenizer {
>     public NewKeywordTokenizer(Reader input) {
>         super(input);
>     }
>
>     @Override
>     public Token next(Token t) throws IOException {
>         Token result = super.next(t);
>         if (result != null) {
>             result.setEndOffset(result.termLength());
>         }
>         return result;
>     }
> }

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Created: (LUCENE-1723) KeywordTokenizer does not properly set the end offset
KeywordTokenizer does not properly set the end offset
-
Key: LUCENE-1723
URL: https://issues.apache.org/jira/browse/LUCENE-1723
Project: Lucene - Java
Issue Type: Bug
Components: Analysis
Affects Versions: 2.4.1
Reporter: Dima May
Priority: Minor
Attachments: AnalyzerBug.java

KeywordTokenizer sets the Token's term length attribute but appears to omit the end offset. The issue was discovered while using a highlighter with the KeywordAnalyzer. KeywordAnalyzer delegates to KeywordTokenizer, propagating the bug.
Below is a JUnit test that exercises various analyzers via a Highlighter instance. Every analyzer but the KeywordAnalyzer successfully wraps the text with the highlight tags, such as "<B>thetext</B>". When using KeywordAnalyzer the tags appear before the text, for example: "<B></B>thetext".
Please note the NewKeywordAnalyzer and NewKeywordTokenizer classes below. When using NewKeywordAnalyzer the tags are properly placed around the text. The NewKeywordTokenizer overrides the next method of KeywordTokenizer, setting the end offset for the returned Token. NewKeywordAnalyzer utilizes NewKeywordTokenizer to produce a proper token.
-
package lucene;

import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.analysis.KeywordTokenizer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.search.highlight.WeightedTerm;
import org.junit.Test;
import static org.junit.Assert.*;

public class AnalyzerBug {
    @Test
    public void testWithHighlighting() throws IOException {
        String text = "thetext";
        WeightedTerm[] terms = { new WeightedTerm(1.0f, text) };
        Highlighter highlighter = new Highlighter(
                new SimpleHTMLFormatter("<B>", "</B>"), new QueryScorer(terms));
        Analyzer[] analyzers = { new StandardAnalyzer(), new SimpleAnalyzer(),
                new StopAnalyzer(), new WhitespaceAnalyzer(),
                new NewKeywordAnalyzer(), new KeywordAnalyzer() };
        // All analyzers pass except KeywordAnalyzer
        for (Analyzer analyzer : analyzers) {
            String highlighted = highlighter.getBestFragment(analyzer, "CONTENT", text);
            assertEquals("Failed for " + analyzer.getClass().getName(),
                    "<B>" + text + "</B>", highlighted);
            System.out.println(analyzer.getClass().getName()
                    + " passed, value highlighted: " + highlighted);
        }
    }
}

class NewKeywordAnalyzer extends KeywordAnalyzer {
    @Override
    public TokenStream reusableTokenStream(String fieldName, Reader reader)
            throws IOException {
        Tokenizer tokenizer = (Tokenizer) getPreviousTokenStream();
        if (tokenizer == null) {
            tokenizer = new NewKeywordTokenizer(reader);
            setPreviousTokenStream(tokenizer);
        } else {
            tokenizer.reset(reader);
        }
        return tokenizer;
    }

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new NewKeywordTokenizer(reader);
    }
}

class NewKeywordTokenizer extends KeywordTokenizer {
    public NewKeywordTokenizer(Reader input) {
        super(input);
    }

    @Override
    public Token next(Token t) throws IOException {
        Token result = super.next(t);
        if (result != null) {
            result.setEndOffset(result.termLength());
        }
        return result;
    }
}

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725386#action_12725386 ] Jason Rutherglen commented on LUCENE-1720:
--

Maybe we can benchmark this approach to see if it slows down queries due to the Thread.currentThread and hash lookup? As this would go into 3.0 (?), maybe we can look at how to change the Lucene API such that we pass in an argument to the IndexReader methods where the timeout may be checked for?

> TimeLimitedIndexReader and associated utility class
> ---
>
> Key: LUCENE-1720
> URL: https://issues.apache.org/jira/browse/LUCENE-1720
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Index
> Reporter: Mark Harwood
> Assignee: Mark Harwood
> Priority: Minor
> Attachments: ActivityTimedOutException.java, ActivityTimeMonitor.java, TestTimeLimitedIndexReader.java, TimeLimitedIndexReader.java
>
>
> An alternative to TimeLimitedCollector that has the following advantages:
> 1) Any reader activity can be time-limited rather than just single searches, e.g. the document retrieve phase.
> 2) Times out faster (i.e. runaway queries such as fuzzies are detected quickly, before the last "collect" stage of query processing).
> Uses a new utility timeout class that is independent of IndexReader.
> The initial contribution includes a performance test class, but I have not had time as yet to work up a formal JUnit test.
> TimeLimitedIndexReader is coded as JDK 1.5 but can easily be undone.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
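The cost Jason asks about (a Thread.currentThread() call plus a hash lookup per check, versus the single volatile read the patch uses on its fast path) can be isolated in a small micro-benchmark. The sketch below is illustrative only; the class, method, and field names are mine, not from the attached patch, and it is no substitute for the index-level benchmarks discussed in this thread.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class TimeoutCheckBench {
    // Fast path used on every hot call in the patch: a single volatile read.
    static volatile boolean anActivityHasTimedOut = false;
    // Alternative checked on every call: current thread looked up in a shared map.
    static final Map<Thread, Long> deadlines = new ConcurrentHashMap<Thread, Long>();

    static long runVolatileChecks(int n) {
        long hits = 0;
        for (int i = 0; i < n; i++) {
            if (anActivityHasTimedOut) {
                hits++;
            }
        }
        return hits;
    }

    static long runMapChecks(int n) {
        long hits = 0;
        for (int i = 0; i < n; i++) {
            Long deadline = deadlines.get(Thread.currentThread());
            if (deadline != null && System.currentTimeMillis() > deadline) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        int n = 2000000;
        runVolatileChecks(n); // warm-up
        runMapChecks(n);      // warm-up
        long t0 = System.nanoTime();
        long a = runVolatileChecks(n);
        long t1 = System.nanoTime();
        long b = runMapChecks(n);
        long t2 = System.nanoTime();
        if (a != 0 || b != 0) {
            throw new AssertionError("no timeout was registered, so no hits expected");
        }
        System.out.println("volatile check: " + (t1 - t0) / 1000000 + " ms, "
                + "map check: " + (t2 - t1) / 1000000 + " ms for " + n + " calls");
    }
}
```

A micro-benchmark like this only bounds the per-call overhead; the numbers Mark links above, driven by TestTimeLimitedIndexReader against real indexes, remain the relevant measurement.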
[jira] Updated: (LUCENE-1705) Add deleteAllDocuments() method to IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-1705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Smith updated LUCENE-1705:
--
Attachment: (was: TestIndexWriterDelete.patch)

> Add deleteAllDocuments() method to IndexWriter
> --
>
> Key: LUCENE-1705
> URL: https://issues.apache.org/jira/browse/LUCENE-1705
> Project: Lucene - Java
> Issue Type: Wish
> Components: Index
> Affects Versions: 2.4
> Reporter: Tim Smith
> Assignee: Michael McCandless
> Fix For: 2.9
>
> Attachments: DeleteAllFlushDocCountFix.patch, IndexWriterDeleteAll.patch, LUCENE-1705.patch
>
>
> Ideally, there would be a deleteAllDocuments() or clear() method on the IndexWriter.
> This method should have the same performance and characteristics as:
> * currentWriter.close()
> * currentWriter = new IndexWriter(..., create=true,...)
> This would greatly optimize a delete-all-documents case. Using deleteDocuments(new MatchAllDocsQuery()) could be expensive given a large existing index.
> IndexWriter.deleteAllDocuments() should have the same semantics as a commit(), as far as index visibility goes (a new IndexReader opening would get the empty index).
> I see this was previously asked for in LUCENE-932, however it would be nice to finally see this added such that the IndexWriter would not need to be closed to perform the "clear", as this seems to be the general recommendation for working with an IndexWriter now.
> deleteAllDocuments() method should:
> * abort any background merges (they are pointless once a deleteAll has been received)
> * write a new segments file referencing no segments
> This method would remove one of the final reasons I would ever need to close an IndexWriter and reopen a new one.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1705) Add deleteAllDocuments() method to IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-1705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Smith updated LUCENE-1705:
--
Attachment: DeleteAllFlushDocCountFix.patch

Here's a patch that fixes the deleteAll() + updateDocument() issue. It just needed to set the FlushDocCount to 0 after aborting the outstanding documents.

> Add deleteAllDocuments() method to IndexWriter
> --
>
> Key: LUCENE-1705
> URL: https://issues.apache.org/jira/browse/LUCENE-1705
> Project: Lucene - Java
> Issue Type: Wish
> Components: Index
> Affects Versions: 2.4
> Reporter: Tim Smith
> Assignee: Michael McCandless
> Fix For: 2.9
>
> Attachments: DeleteAllFlushDocCountFix.patch, IndexWriterDeleteAll.patch, LUCENE-1705.patch
>
>
> Ideally, there would be a deleteAllDocuments() or clear() method on the IndexWriter.
> This method should have the same performance and characteristics as:
> * currentWriter.close()
> * currentWriter = new IndexWriter(..., create=true,...)
> This would greatly optimize a delete-all-documents case. Using deleteDocuments(new MatchAllDocsQuery()) could be expensive given a large existing index.
> IndexWriter.deleteAllDocuments() should have the same semantics as a commit(), as far as index visibility goes (a new IndexReader opening would get the empty index).
> I see this was previously asked for in LUCENE-932, however it would be nice to finally see this added such that the IndexWriter would not need to be closed to perform the "clear", as this seems to be the general recommendation for working with an IndexWriter now.
> deleteAllDocuments() method should:
> * abort any background merges (they are pointless once a deleteAll has been received)
> * write a new segments file referencing no segments
> This method would remove one of the final reasons I would ever need to close an IndexWriter and reopen a new one.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1566) Large Lucene index can hit false OOM due to Sun JRE issue
[ https://issues.apache.org/jira/browse/LUCENE-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer updated LUCENE-1566:
Attachment: LUCENE-1566.patch

I was able to reproduce the bug on my machine using several JVMs. The attached patch is what I have ready by now; I thought I'd get it out there as soon as possible for discussion. Tests pass on my side!

> Large Lucene index can hit false OOM due to Sun JRE issue
> -
>
> Key: LUCENE-1566
> URL: https://issues.apache.org/jira/browse/LUCENE-1566
> Project: Lucene - Java
> Issue Type: Bug
> Components: Index
> Affects Versions: 2.4.1
> Reporter: Michael McCandless
> Assignee: Simon Willnauer
> Priority: Minor
> Attachments: LUCENE-1566.patch
>
>
> This is not a Lucene issue, but I want to open this so future Google diggers can more easily find it.
> There's this nasty bug in Sun's JRE: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6478546
> The gist seems to be: if you try to read a large (e.g. 200 MB) number of bytes during a single RandomAccessFile.read call, you can incorrectly hit OOM. Lucene does this with norms, since we read in one byte per doc per field with norms, as a contiguous array of length maxDoc().
> The workaround was a custom patch to do large file reads as several smaller reads.
> Background here: http://www.nabble.com/problems-with-large-Lucene-index-td22347854.html

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
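The workaround described in the issue, splitting one large RandomAccessFile.read into several bounded reads, can be sketched roughly as follows. This is not the actual Lucene patch; the 1 MB chunk size and the readFully helper name are assumptions for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Arrays;

public class ChunkedRead {
    // Upper bound per RandomAccessFile.read call; the 1 MB value is an
    // assumption for illustration, not the figure used in Lucene's patch.
    static final int CHUNK_SIZE = 1 << 20;

    // Fill dest completely, issuing several bounded reads instead of one
    // huge read that can trigger the false OOM described in Sun bug 6478546.
    static void readFully(RandomAccessFile in, byte[] dest) throws IOException {
        int offset = 0;
        while (offset < dest.length) {
            int toRead = Math.min(CHUNK_SIZE, dest.length - offset);
            int read = in.read(dest, offset, toRead);
            if (read == -1) {
                throw new IOException("unexpected end of file at offset " + offset);
            }
            offset += read;
        }
    }

    public static void main(String[] args) throws IOException {
        // Round-trip a scratch file a few times larger than CHUNK_SIZE.
        File f = File.createTempFile("chunked", ".bin");
        f.deleteOnExit();
        byte[] data = new byte[3 * CHUNK_SIZE + 123];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) i;
        }
        RandomAccessFile raf = new RandomAccessFile(f, "rw");
        raf.write(data);
        raf.seek(0);
        byte[] back = new byte[data.length];
        readFully(raf, back);
        raf.close();
        if (!Arrays.equals(data, back)) {
            throw new AssertionError("chunked read did not round-trip");
        }
        System.out.println("read " + back.length + " bytes in chunks of " + CHUNK_SIZE);
    }
}
```

The loop also handles short reads (read returning fewer bytes than requested), which a single read call does not guarantee against for arbitrary streams.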
[jira] Updated: (LUCENE-1705) Add deleteAllDocuments() method to IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-1705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Smith updated LUCENE-1705:
--
Attachment: TestIndexWriterDelete.patch

Here's a patch to TestIndexWriterDelete that shows the problem: after the deleteAll(), a document is added and a document is updated; the added document gets indexed, but the updated document does not.

> Add deleteAllDocuments() method to IndexWriter
> --
>
> Key: LUCENE-1705
> URL: https://issues.apache.org/jira/browse/LUCENE-1705
> Project: Lucene - Java
> Issue Type: Wish
> Components: Index
> Affects Versions: 2.4
> Reporter: Tim Smith
> Assignee: Michael McCandless
> Fix For: 2.9
>
> Attachments: IndexWriterDeleteAll.patch, LUCENE-1705.patch, TestIndexWriterDelete.patch
>
>
> Ideally, there would be a deleteAllDocuments() or clear() method on the IndexWriter.
> This method should have the same performance and characteristics as:
> * currentWriter.close()
> * currentWriter = new IndexWriter(..., create=true,...)
> This would greatly optimize a delete-all-documents case. Using deleteDocuments(new MatchAllDocsQuery()) could be expensive given a large existing index.
> IndexWriter.deleteAllDocuments() should have the same semantics as a commit(), as far as index visibility goes (a new IndexReader opening would get the empty index).
> I see this was previously asked for in LUCENE-932, however it would be nice to finally see this added such that the IndexWriter would not need to be closed to perform the "clear", as this seems to be the general recommendation for working with an IndexWriter now.
> deleteAllDocuments() method should:
> * abort any background merges (they are pointless once a deleteAll has been received)
> * write a new segments file referencing no segments
> This method would remove one of the final reasons I would ever need to close an IndexWriter and reopen a new one.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Closed: (LUCENE-1706) Site search powered by Lucene/Solr
[ https://issues.apache.org/jira/browse/LUCENE-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll closed LUCENE-1706.
---
Resolution: Fixed
Lucene Fields: (was: [New])

> Site search powered by Lucene/Solr
> --
>
> Key: LUCENE-1706
> URL: https://issues.apache.org/jira/browse/LUCENE-1706
> Project: Lucene - Java
> Issue Type: New Feature
> Reporter: Grant Ingersoll
> Assignee: Grant Ingersoll
> Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1706.patch, LUCENE-1706.patch
>
>
> For a number of years now, the Lucene community has been criticized for not eating our own "dog food" when it comes to search. My company has built and hosts a site search (http://www.lucidimagination.com/search) that is powered by Apache Solr and Lucene, and we'd like to donate its use to the Lucene community. Additionally, it allows one to search all of the Lucene content from a single place, including web, wiki, JIRA and mail archives. See also http://www.lucidimagination.com/search/document/bf22a570bf9385c7/search_on_lucene_apache_org
> You can see it live on Mahout, Tika and Solr.
> Lucid has a fault-tolerant setup with replication and failover, as well as monitoring services in place. We are committed to maintaining and expanding the search capabilities on the site.
> The following patch adds a skin to the Forrest site that enables the Lucene site to search Lucene-only content using Lucene/Solr. When a search is submitted, it automatically selects the Lucene facet such that only Lucene content is searched. From there, users can then narrow/broaden their search criteria.
> I plan on committing in 3 or 4 days.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Reopened: (LUCENE-1705) Add deleteAllDocuments() method to IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-1705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Smith reopened LUCENE-1705:
---

Looks like I found an issue with this.
The deleteAll() method isn't resetting the nextDocID on the DocumentsWriter (or some similar behaviour), so the following sequence will result in an error:
* deleteAll()
* updateDocument("5", doc)
* commit()
This results in a delete for doc "5" getting buffered, but with a very high "maxDocId". At the same time, the doc is added; however, the following will then occur on commit:
* flush segments to disk
* doc "5" is now in a segment on disk
* run deletes
* doc "5" is now blacklisted from the segment
Will work on fixing this and post a new patch (along with an updated test case). (I was worried I was missing an edge case.)

> Add deleteAllDocuments() method to IndexWriter
> --
>
> Key: LUCENE-1705
> URL: https://issues.apache.org/jira/browse/LUCENE-1705
> Project: Lucene - Java
> Issue Type: Wish
> Components: Index
> Affects Versions: 2.4
> Reporter: Tim Smith
> Assignee: Michael McCandless
> Fix For: 2.9
>
> Attachments: IndexWriterDeleteAll.patch, LUCENE-1705.patch
>
>
> Ideally, there would be a deleteAllDocuments() or clear() method on the IndexWriter.
> This method should have the same performance and characteristics as:
> * currentWriter.close()
> * currentWriter = new IndexWriter(..., create=true,...)
> This would greatly optimize a delete-all-documents case. Using deleteDocuments(new MatchAllDocsQuery()) could be expensive given a large existing index.
> IndexWriter.deleteAllDocuments() should have the same semantics as a commit(), as far as index visibility goes (a new IndexReader opening would get the empty index).
> I see this was previously asked for in LUCENE-932, however it would be nice to finally see this added such that the IndexWriter would not need to be closed to perform the "clear", as this seems to be the general recommendation for working with an IndexWriter now.
> deleteAllDocuments() method should:
> * abort any background merges (they are pointless once a deleteAll has been received)
> * write a new segments file referencing no segments
> This method would remove one of the final reasons I would ever need to close an IndexWriter and reopen a new one.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725200#action_12725200 ] Shai Erera commented on LUCENE-1720: bq. I'm not familiar with the proposal to pass around a Timeout object On the email thread I offered to create on QueryWeight a scorer(IndexSearcher, boolean, boolean, Timeout) in order to pass a Timeout object to Scorer, and also create a TimeLimitedQuery. But it's no longer needed. > TimeLimitedIndexReader and associated utility class > --- > > Key: LUCENE-1720 > URL: https://issues.apache.org/jira/browse/LUCENE-1720 > Project: Lucene - Java > Issue Type: New Feature > Components: Index >Reporter: Mark Harwood >Assignee: Mark Harwood >Priority: Minor > Attachments: ActivityTimedOutException.java, > ActivityTimeMonitor.java, TestTimeLimitedIndexReader.java, > TimeLimitedIndexReader.java > > > An alternative to TimeLimitedCollector that has the following advantages: > 1) Any reader activity can be time-limited rather than just single searches > e.g. the document retrieve phase. > 2) Times out faster (i.e. runaway queries such as fuzzies detected quickly > before last "collect" stage of query processing) > Uses new utility timeout class that is independent of IndexReader. > Initial contribution includes a performance test class but not had time as > yet to work up a formal Junit test. > TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725197#action_12725197 ] Mark Harwood commented on LUCENE-1720:
--

bq. any custom Scorer which does a lot of work, but uses IndexReader for that, will be stopped, even if the Scorer's developer did not implement a Timeout mechanism. Right?
Correct. I'm not familiar with the proposal to pass around a Timeout object, but I get the idea, and the code here would certainly avoid that overhead.
bq. We can clear it when the timed-out threads' Set's size() is 0?
Yes, that would work.

> TimeLimitedIndexReader and associated utility class
> ---
>
> Key: LUCENE-1720
> URL: https://issues.apache.org/jira/browse/LUCENE-1720
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Index
> Reporter: Mark Harwood
> Assignee: Mark Harwood
> Priority: Minor
> Attachments: ActivityTimedOutException.java, ActivityTimeMonitor.java, TestTimeLimitedIndexReader.java, TimeLimitedIndexReader.java
>
>
> An alternative to TimeLimitedCollector that has the following advantages:
> 1) Any reader activity can be time-limited rather than just single searches, e.g. the document retrieve phase.
> 2) Times out faster (i.e. runaway queries such as fuzzies are detected quickly, before the last "collect" stage of query processing).
> Uses a new utility timeout class that is independent of IndexReader.
> The initial contribution includes a performance test class, but I have not had time as yet to work up a formal JUnit test.
> TimeLimitedIndexReader is coded as JDK 1.5 but can easily be undone.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1722) SmartChineseAnalyzer javadoc improvement
[ https://issues.apache.org/jira/browse/LUCENE-1722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-1722: Attachment: LUCENE-1722.txt patch file > SmartChineseAnalyzer javadoc improvement > > > Key: LUCENE-1722 > URL: https://issues.apache.org/jira/browse/LUCENE-1722 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/analyzers >Reporter: Robert Muir >Priority: Minor > Attachments: LUCENE-1722.txt > > > Chinese -> English, and corrections to match reality (removes several javadoc > warnings) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725183#action_12725183 ] Shai Erera commented on LUCENE-1720:

bq. With only a boolean it could be hard to know precisely when to clear it, no?
We can clear it when the timed-out threads' Set's size() is 0?
I agree that this issue is mostly about IndexReader (and hence the name), and that the scenario of IndexWriter is weaker. But a utility class together w/ the TimeLimitedIndexReader example can help someone write a TimeLimitedIndexWriter very easily, and/or reuse this utility elsewhere.

> TimeLimitedIndexReader and associated utility class
> ---
>
> Key: LUCENE-1720
> URL: https://issues.apache.org/jira/browse/LUCENE-1720
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Index
> Reporter: Mark Harwood
> Assignee: Mark Harwood
> Priority: Minor
> Attachments: ActivityTimedOutException.java, ActivityTimeMonitor.java, TestTimeLimitedIndexReader.java, TimeLimitedIndexReader.java
>
>
> An alternative to TimeLimitedCollector that has the following advantages:
> 1) Any reader activity can be time-limited rather than just single searches, e.g. the document retrieve phase.
> 2) Times out faster (i.e. runaway queries such as fuzzies are detected quickly, before the last "collect" stage of query processing).
> Uses a new utility timeout class that is independent of IndexReader.
> The initial contribution includes a performance test class, but I have not had time as yet to work up a formal JUnit test.
> TimeLimitedIndexReader is coded as JDK 1.5 but can easily be undone.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725182#action_12725182 ] Eks Dev commented on LUCENE-1720:
-

Sure, I just wanted to "sharpen the definition" of what is a Lucene core issue and what we can leave to end users. It is not only about the time; rather, it is about canceling search requests (even better, general activities).

> TimeLimitedIndexReader and associated utility class
> ---
>
> Key: LUCENE-1720
> URL: https://issues.apache.org/jira/browse/LUCENE-1720
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Index
> Reporter: Mark Harwood
> Assignee: Mark Harwood
> Priority: Minor
> Attachments: ActivityTimedOutException.java, ActivityTimeMonitor.java, TestTimeLimitedIndexReader.java, TimeLimitedIndexReader.java
>
>
> An alternative to TimeLimitedCollector that has the following advantages:
> 1) Any reader activity can be time-limited rather than just single searches, e.g. the document retrieve phase.
> 2) Times out faster (i.e. runaway queries such as fuzzies are detected quickly, before the last "collect" stage of query processing).
> Uses a new utility timeout class that is independent of IndexReader.
> The initial contribution includes a performance test class, but I have not had time as yet to work up a formal JUnit test.
> TimeLimitedIndexReader is coded as JDK 1.5 but can easily be undone.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Created: (LUCENE-1722) SmartChineseAnalyzer javadoc improvement
SmartChineseAnalyzer javadoc improvement Key: LUCENE-1722 URL: https://issues.apache.org/jira/browse/LUCENE-1722 Project: Lucene - Java Issue Type: Improvement Components: contrib/analyzers Reporter: Robert Muir Priority: Minor Chinese -> English, and corrections to match reality (removes several javadoc warnings) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725176#action_12725176 ] Mark Harwood commented on LUCENE-1720: -- bq. Oh, I did not mean to skip this check. But the check is on a variable with a yes/no state. We need to cater for >1 simultaneous timeout error condition in play. With only a boolean it could be hard to know precisely when to clear it, no? bq. Mark here wanted to provide a much more generalized way of stopping any other activity, not just search To be fair I think the use case for IndexWriter is weaker. In reader you have multiple users all expressing different queries and you want them all to share nicely with each other. In index writing it's typically a batch system indexing docs and there's no "fairness" to mediate? Breaking it out into a utility class seems like a good idea anyway. > TimeLimitedIndexReader and associated utility class > --- > > Key: LUCENE-1720 > URL: https://issues.apache.org/jira/browse/LUCENE-1720 > Project: Lucene - Java > Issue Type: New Feature > Components: Index >Reporter: Mark Harwood >Assignee: Mark Harwood >Priority: Minor > Attachments: ActivityTimedOutException.java, > ActivityTimeMonitor.java, TestTimeLimitedIndexReader.java, > TimeLimitedIndexReader.java > > > An alternative to TimeLimitedCollector that has the following advantages: > 1) Any reader activity can be time-limited rather than just single searches > e.g. the document retrieve phase. > 2) Times out faster (i.e. runaway queries such as fuzzies detected quickly > before last "collect" stage of query processing) > Uses new utility timeout class that is independent of IndexReader. > Initial contribution includes a performance test class but not had time as > yet to work up a formal Junit test. > TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone. -- This message is automatically generated by JIRA. 
- You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725172#action_12725172 ] Shai Erera commented on LUCENE-1720: bq. ... quickly testing a single volatile boolean, "anActivityHasTimedOut". Oh, I did not mean to skip this check. After anActivityHasTimedOut is true, instead of comparing Thread.currentThread() to firstAnticipatedThreadToFail, check if Thread.currentThread() is in the failed HashSet of threads, or something like that. I totally agree this check should be kept and used that way, and it's probably better than numberOfTimedOutThreads, since we don't need to inc/dec the latter on every failure, just set a boolean flag and test it. bq. Imo, the problem can be reformulated as "Provide possibility to cancel running queries on best effort basis, with or without providing so far collected results". That's where we started from, but Mark here wanted to provide a much more generalized way of stopping any other activity, not just search. With this utility class, someone can implement a TimeLimitedIndexWriter which times out indexing, merging etc. Search is just one operation which will be covered as well. I also think that TimeLimitingCollector already provides a possibility to "cancel running queries on a best effort basis", so if someone is interested in just that, he doesn't need to use TimeLimitedIndexReader. However, this approach seems much simpler if you want to ensure queries are stopped ASAP, w/o passing a Timeout object around or anything. This approach also guarantees (I think) that any custom Scorer which does a lot of work, but uses IndexReader for that, will be stopped, even if the Scorer's developer did not implement a Timeout mechanism. Right? 
[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725168#action_12725168 ] Eks Dev commented on LUCENE-1720: - It may be late for this issue, but it is maybe worth thinking about: we could change the semantics of this problem completely. Imo, the problem can be reformulated as "Provide the possibility to cancel running queries on a best-effort basis, with or without providing the results collected so far". That would leave timer management to the end users and keep this issue focused on the Lucene core ... Timeout management could then be provided as an example somewhere: "How to implement Timeout management using ..." 
[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725164#action_12725164 ] Mark Harwood commented on LUCENE-1720: -- Currently the class hinges on a "fast fail" mechanism whereby all the many calls checking for a timeout very quickly test a single volatile boolean, "anActivityHasTimedOut". 99.99% of calls are expected to fail this test (nothing has timed out) and fail quickly - I was reluctant to add any HashSet lookup etc. in there needed to determine failure. With that as a guiding principle, maybe the solution is to change volatile boolean anActivityHasTimedOut into volatile int numberOfTimedOutThreads; which would cater for >1 error condition at once. The fast-fail check then becomes:

    if (numberOfTimedOutThreads > 0) {
        if (timedOutThreads.contains(Thread.currentThread())) {
            timedOutThreads.remove(Thread.currentThread());
            numberOfTimedOutThreads = timedOutThreads.size();
            throw new RuntimeException(...);
        }
    }
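Mark's fast-fail check above can be sketched as a small self-contained class. This is illustrative only - class, field, and method names are my own, not those of the actual ActivityTimeMonitor attachment - but it shows the intended pattern: the common path costs a single volatile int read, and the HashSet lookup happens only once some thread has actually timed out.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch of the fast-fail timeout check (not the real attachment).
class TimedActivityMonitor {
    // Volatile so every worker thread sees updates without locking.
    private static volatile int numberOfTimedOutThreads = 0;
    // Mutated by the timeout watcher thread, consulted by worker threads.
    private static final Set<Thread> timedOutThreads =
        Collections.synchronizedSet(new HashSet<Thread>());

    // Called by the watcher thread when a deadline passes.
    static void markTimedOut(Thread t) {
        timedOutThreads.add(t);
        numberOfTimedOutThreads = timedOutThreads.size();
    }

    // Called on every timed reader operation; almost always just one volatile read.
    static void checkTimeout() {
        if (numberOfTimedOutThreads > 0) { // fast path: skipped when nothing timed out
            // remove() returns true only if this thread was in the timed-out set,
            // combining the contains() test and the removal in one call.
            if (timedOutThreads.remove(Thread.currentThread())) {
                numberOfTimedOutThreads = timedOutThreads.size();
                throw new RuntimeException("activity timed out");
            }
        }
    }
}
```

Note the counter is recomputed from the set size after each removal, so it naturally supports more than one simultaneously timed-out thread, which the single boolean could not.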
Re: customizing lucene formula
See the Payloads functionality along with the BoostingTermQuery. On Jun 28, 2009, at 6:23 PM, B0DYLANG wrote: Thanks for your response. What I want to do is to add a function, like the log, to the well-known Lucene formula; this function will take its argument from the already indexed data. For example, if we add a field like this: new Field("terms","word,100;word,300",); then when the score is returned, the second word will have a higher score than the first one. Grant Ingersoll-6 wrote: The source code is available. I'd start with the Similarity class and see if it can be used. Before that, however, you might describe what it is you are interested in doing. Perhaps there is an alternate way that doesn't involve editing the source. On Jun 26, 2009, at 4:31 AM, B0DYLANG wrote: Dears, I want to add some arguments to the Lucene formula or override it; is there a means of doing so? Thanks for your response. -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search -- View this message in context: http://www.nabble.com/customizing-lucene-formula-tp24216772p24246152.html Sent from the Lucene - Java Developer mailing list archive at Nabble.com. 
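The per-term weighting B0DYLANG describes can be sketched independently of Lucene's APIs: parse the weights out of the "term,weight;term,weight" field value and fold them into a score through a log. Inside Lucene this is roughly what payloads plus a custom Similarity or BoostingTermQuery would provide; the class and method names below are purely illustrative.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of per-term weights stored in a field value like "foo,100;bar,300"
// (distinct terms used here; the original example repeated the same term).
class TermWeightBoost {
    // Parse "term,weight;term,weight;..." into an ordered term -> weight map.
    static Map<String, Integer> parse(String fieldValue) {
        Map<String, Integer> weights = new LinkedHashMap<String, Integer>();
        for (String pair : fieldValue.split(";")) {
            String[] parts = pair.split(",");
            weights.put(parts[0], Integer.parseInt(parts[1]));
        }
        return weights;
    }

    // The boost the poster asked for: fold the stored weight into the base
    // score through a log, so weight 300 outscores weight 100, but not 3x.
    static double boost(double baseScore, int weight) {
        return baseScore * (1.0 + Math.log(weight));
    }
}
```

In real Lucene code the weight would be stored as a per-position payload at index time and read back at scoring time, rather than re-parsed from a stored field on every hit.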
[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725144#action_12725144 ] Shai Erera commented on LUCENE-1720: In stop(), shouldn't the 'else' part be reached only if firstAnticipatedThreadToFail == Thread.currentThread()? Currently, if no thread has timed out, and I'm not the firstAnticipatedThreadToFail, the code will still look for a new candidate, and probably find the same firstAnticipatedThreadToFail. Right? Also, even though that's somewhat mentioned in the class, we don't support multiple timing-out threads, and I'm not sure if that's good. Currently, if two threads time out, and the calling thread to checkTimeOutIsThisThread() is not firstAnticipatedThreadToFail, it will continue processing. That may not be good if the other thread is busy-waiting somewhere and may not call checkTimeOutIsThisThread for a long time. What if we change firstAnticipatedThreadToFail to a HashSet and call contains()? It's slower than '==', but safer, and safety is also an important aspect of this utility. TimeoutThread can add all the timed-out threads to this HashSet when it detects a timeout has occurred (by iterating over all the 'registered' threads and their expected timeout times, and comparing to the current time). What do you think? 
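The TimeoutThread sweep Shai proposes - iterate over the registered threads' deadlines and move every expired one into the timed-out set - can be sketched as follows. All names are illustrative, not taken from the actual attachment; a real watcher would run this in a loop, sleeping until the earliest remaining deadline.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of the deadline-sweeping watcher thread.
class TimeoutWatcher {
    // Registered activities: thread -> absolute deadline in millis.
    static final Map<Thread, Long> deadlines = new ConcurrentHashMap<Thread, Long>();
    // Threads that have exceeded their deadline; consulted by the fast-fail check.
    static final Set<Thread> timedOut = Collections.synchronizedSet(new HashSet<Thread>());

    // One pass of the watcher loop: expire every deadline at or before 'now'.
    // ConcurrentHashMap's iterator tolerates concurrent removal.
    static void sweep(long now) {
        for (Map.Entry<Thread, Long> e : deadlines.entrySet()) {
            if (now >= e.getValue()) {
                timedOut.add(e.getKey());
                deadlines.remove(e.getKey());
            }
        }
    }
}
```

Because the watcher collects *all* expired threads in one pass, no timed-out thread can keep running just because it was not the single "first anticipated" candidate.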
[jira] Commented: (LUCENE-1721) IndexWriter to allow deletion by doc ids
[ https://issues.apache.org/jira/browse/LUCENE-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725141#action_12725141 ] Tim Smith commented on LUCENE-1721: --- I suppose even that approach would cause problems if segments merge between getting the segment number/local doc pair and actually asking for the delete. > IndexWriter to allow deletion by doc ids > > > Key: LUCENE-1721 > URL: https://issues.apache.org/jira/browse/LUCENE-1721 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Shay Banon > > It would be great if IndexWriter would allow for deletion by doc ids as well. > It makes sense for cases where a "query" has been executed beforehand, and > later, that query needs to be applied in order to delete the matched > documents. > More information here: > http://www.nabble.com/Delete-by-docId-in-IndexWriter-td24239930.html
[jira] Commented: (LUCENE-1721) IndexWriter to allow deletion by doc ids
[ https://issues.apache.org/jira/browse/LUCENE-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725140#action_12725140 ] Tim Smith commented on LUCENE-1721: --- How about a delete method on the IndexWriter that takes a segment number and a document id? It would also be required to add methods to the IndexReader to get the segment number and local document id for a docid, but this should then work just fine.
[jira] Commented: (LUCENE-1721) IndexWriter to allow deletion by doc ids
[ https://issues.apache.org/jira/browse/LUCENE-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725109#action_12725109 ] Michael McCandless commented on LUCENE-1721: This is a frequently requested feature, and I agree it'd be useful, but the problem is that a docID is in general not usable in the context of a writer, since docIDs shift when segments that have deletions are committed.
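The shifting Michael describes can be shown with a toy simulation (no Lucene API involved): once a segment with deletions is compacted, every document after the deleted one is renumbered, so a doc ID captured from an earlier search now points at a different document.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of doc ID renumbering: docs are identified purely by position,
// and compacting away a deleted doc shifts every later doc down by one.
class DocIdShift {
    static List<String> compact(List<String> docs, int deletedDocId) {
        List<String> merged = new ArrayList<String>(docs);
        merged.remove(deletedDocId); // removes by index, like a merge purging a deleted doc
        return merged;
    }
}
```

This is why a delete-by-docID API on IndexWriter would need some way to pin or translate the IDs (e.g. the segment-number/local-ID pairing suggested above), and why even that races with concurrent merges.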
[jira] Created: (LUCENE-1721) IndexWriter to allow deletion by doc ids
IndexWriter to allow deletion by doc ids Key: LUCENE-1721 URL: https://issues.apache.org/jira/browse/LUCENE-1721 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Shay Banon It would be great if IndexWriter would allow for deletion by doc ids as well. It makes sense for cases where a "query" has been executed beforehand, and later, that query needs to be applied in order to delete the matched documents. More information here: http://www.nabble.com/Delete-by-docId-in-IndexWriter-td24239930.html