[jira] Commented: (LUCENE-2306) contrib/xml-query-parser: NumericRangeFilter support
[ https://issues.apache.org/jira/browse/LUCENE-2306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12850494#action_12850494 ]

Mark Harwood commented on LUCENE-2306:

bq. Should I commit?

Yes, thanks, Uwe. Missed that test/package.

Cheers
Mark

contrib/xml-query-parser: NumericRangeFilter support
Key: LUCENE-2306
URL: https://issues.apache.org/jira/browse/LUCENE-2306
Project: Lucene - Java
Issue Type: Improvement
Components: contrib/*
Affects Versions: 3.0.1
Reporter: Jingkei Ly
Assignee: Mark Harwood
Fix For: 3.1
Attachments: LUCENE-2306.patch, LUCENE-2306.patch

Create a FilterBuilder for NumericRangeFilter so that it may be used with the XML query parser.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2306) contrib/xml-query-parser: NumericRangeQuery and -Filter support
[ https://issues.apache.org/jira/browse/LUCENE-2306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12850497#action_12850497 ]

Mark Harwood commented on LUCENE-2306:

FYI, re changes to defaults: I try to keep the DTD up to date with all these changes. Having done that, I then have to manually run the dtdocbuild to generate nice HTML docs. This is currently not automated because of uncertainty about dragging dtddoc and its dependencies into Lucene builds. It's a bit of a pain, but HTML docs are useful and I'm hoping to add smart DTD-driven query entry into Luke.
[jira] Resolved: (LUCENE-2306) contrib/xml-query-parser: NumericRangeFilter support
[ https://issues.apache.org/jira/browse/LUCENE-2306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood resolved LUCENE-2306.

Resolution: Fixed
Fix Version/s: 3.1
Assignee: Mark Harwood

Committed in revision 928114
[jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries
[ https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12835961#action_12835961 ]

Mark Harwood commented on LUCENE-1486:

Double Ugh. Applying the patch for the non-default field bug doesn't work any more because the latest ComplexPhraseQueryParser source sitting in contrib is now in a different package from the QueryParser base class. This means that the subclass no longer has the write access it needs to the package-protected "field" variable, which it uses to temporarily set the context of the parser when processing the inner contents of the phrase.

Fixing this would require either changing the package name of ComplexPhraseQueryParser or changing the visibility of "field" in the QueryParser base class to protected. Does anyone have strong feelings about which of these is the more acceptable?

Wildcards, ORs etc inside Phrase queries
Key: LUCENE-1486
URL: https://issues.apache.org/jira/browse/LUCENE-1486
Project: Lucene - Java
Issue Type: Improvement
Components: QueryParser
Affects Versions: 2.4
Reporter: Mark Harwood
Priority: Minor
Fix For: 3.1
Attachments: ComplexPhraseQueryParser.java, junit_complex_phrase_qp_07_21_2009.patch, junit_complex_phrase_qp_07_22_2009.patch, Lucene-1486 non default field.patch, LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, TestComplexPhraseQuery.java

An extension to the default QueryParser that overrides the parsing of PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries. The implementation feels a little hacky - this is arguably better handled in QueryParser itself. This works as a proof of concept for much of the query parser syntax. Examples from the Junit test include:

    checkMatches("\"j* smyth~\"", "1,2"); //wildcards and fuzzies are OK in phrases
    checkMatches("\"(jo* -john) smith\"", "2"); // boolean logic works
    checkMatches("\"jo* smith\"~2", "1,2,3"); // position logic works.
    checkBadQuery("\"jo* id:1 smith\""); //mixing fields in a phrase is bad
    checkBadQuery("\"jo* \"smith\" \""); //phrases inside phrases is bad
    checkBadQuery("\"jo* [sma TO smZ]\" \""); //range queries inside phrases not supported

Code plus Junit test to follow...
[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12834819#action_12834819 ]

Mark Harwood commented on LUCENE-1720:

bq. How do we proceed from here? Is there a committer that's willing to look at the code

I have commit rights but I'd like to find some time to add the benchmarking code first and also trial it in a live environment.

TimeLimitedIndexReader and associated utility class
Key: LUCENE-1720
URL: https://issues.apache.org/jira/browse/LUCENE-1720
Project: Lucene - Java
Issue Type: New Feature
Components: Index
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
Attachments: ActivityTimedOutException.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, LUCENE-1720.patch, LUCENE-1720.patch, Lucene-1720.patch, Lucene-1720.patch, LUCENE-1720.patch, TestTimeLimitedIndexReader.java, TestTimeLimitedIndexReader.java, TimeLimitedIndexReader.java, TimeLimitedIndexReader.java

An alternative to TimeLimitedCollector that has the following advantages:
1) Any reader activity can be time-limited rather than just single searches e.g. the document retrieve phase.
2) Times out faster (i.e. runaway queries such as fuzzies are detected quickly, before the last "collect" stage of query processing).

Uses a new utility timeout class that is independent of IndexReader. The initial contribution includes a performance test class, but I have not yet had time to work up a formal Junit test. TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone.
[jira] Commented: (LUCENE-329) Fuzzy query scoring issues
[ https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833822#action_12833822 ]

Mark Harwood commented on LUCENE-329:

The problem with ignoring IDF completely is that it doesn't help balance partial matches where there is more than 1 fuzzy element in the query. E.g. in a query for John~ Patitucci~ I'm probably more interested in a partial match on the rarer surname than a partial match on the common forename. Obliterating IDF completely as a factor would lose this feature (available in FuzzyLikeThisQuery).

Fuzzy query scoring issues
Key: LUCENE-329
URL: https://issues.apache.org/jira/browse/LUCENE-329
Project: Lucene - Java
Issue Type: Bug
Components: Search
Affects Versions: 1.2rc5
Environment: Operating System: All Platform: All
Reporter: Mark Harwood
Assignee: Lucene Developers
Priority: Minor
Attachments: patch.txt

Queries which automatically produce multiple terms (wildcard, range, prefix, fuzzy etc) currently suffer from two problems:
1) Scores for matching documents are significantly smaller than term queries because of the volume of terms introduced (a match on query Foo~ is 0.1 whereas a match on query Foo is 1).
2) The rarer forms of expanded terms are favoured over those of more common forms because of the IDF. When using Fuzzy queries for example, rare mis-spellings typically appear in results before the more common correct spellings.

I will attach a patch that corrects the issues identified above by
1) Overriding Similarity.coord to counteract the downplaying of scores introduced by expanding terms.
2) Taking the IDF factor of the most common form of expanded terms as the basis of scoring all other expanded terms.
[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833833#action_12833833 ]

Mark Harwood commented on LUCENE-1720:

bq. Anyway, I'm putting that aside for now, and moving no to adding more tests to TestTimeLimitingReader.

OK. I always shudder when I see lists of "if instanceof..." logic. My suggestion of getWrappedReader was intended for broader use - there are other reasons to wrap a reader e.g. security. I was thinking of putting it on IndexReader but maybe the convenience wrapper base class FilterIndexReader would be a better home - most reader-wrappers would use this as a base class?
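
The getWrappedReader idea raised in the comment above could look roughly like the following. This is only a sketch under stated assumptions: SketchReader, FilterSketchReader and the two wrapper classes are hypothetical stand-ins invented for illustration, not Lucene classes, and getWrappedReader is the proposed method, not an existing API.

```java
// Hypothetical stand-in for IndexReader; this is NOT the Lucene class.
interface SketchReader {
    int numDocs();
}

// A concrete "real" reader with nothing wrapped inside it.
final class BaseSketchReader implements SketchReader {
    private final int numDocs;
    BaseSketchReader(int numDocs) { this.numDocs = numDocs; }
    @Override public int numDocs() { return numDocs; }
}

// Hypothetical wrapper base class, playing the role the comment suggests for FilterIndexReader.
class FilterSketchReader implements SketchReader {
    protected final SketchReader in;
    FilterSketchReader(SketchReader in) { this.in = in; }
    @Override public int numDocs() { return in.numDocs(); }
    // The proposed accessor: one standard way to reach the wrapped reader.
    SketchReader getWrappedReader() { return in; }
}

// Two unrelated wrappers (time limiting, security) that share the base class.
final class TimeLimitedSketchReader extends FilterSketchReader {
    TimeLimitedSketchReader(SketchReader in) { super(in); }
}

final class SecuredSketchReader extends FilterSketchReader {
    SecuredSketchReader(SketchReader in) { super(in); }
}

final class WrappedReaderDemo {
    // Peels off any number of wrappers without knowing their concrete types.
    static SketchReader unwrap(SketchReader reader) {
        while (reader instanceof FilterSketchReader) {
            reader = ((FilterSketchReader) reader).getWrappedReader();
        }
        return reader;
    }

    public static void main(String[] args) {
        SketchReader base = new BaseSketchReader(42);
        SketchReader wrapped = new SecuredSketchReader(new TimeLimitedSketchReader(base));
        System.out.println(unwrap(wrapped) == base);
    }
}
```

With the accessor on the shared base class, a single instanceof test against that base class replaces a list of tests against every known wrapper, which is the same benefit Throwable.getCause gives for chained exceptions.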
[jira] Commented: (LUCENE-329) Fuzzy query scoring issues
[ https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833840#action_12833840 ]

Mark Harwood commented on LUCENE-329:

My best-practice suggestion isn't as simple as offering a choice between preserving IDF for all terms or not. Instead, it is a proposal that we should use the *input* term's IDF for scoring all variants of the same root term (or take an average over the variants where the root term does not exist). This, I feel, preserves the benefits of keeping IDF as a factor (as in my John~ Patitucci~ balancing example) while also eliminating the side effect we see where a rare mis-spelling beats exact matches.
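
The proposal in the comment above can be illustrated with a small worked example. It is a sketch, not the patch attached to the issue: the class and method names are hypothetical, the document frequencies are made up, the IDF formula is the classic 1 + ln(numDocs / (docFreq + 1)) form, and multiplying IDF by the variant's similarity is a simplifying assumption about how the two factors combine.

```java
final class FuzzyIdfSketch {
    // Classic IDF: 1 + ln(numDocs / (docFreq + 1)).
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    // Behaviour being criticised: each expanded variant is weighted by its own IDF.
    static double perVariantWeight(int variantDocFreq, int numDocs, double similarity) {
        return idf(variantDocFreq, numDocs) * similarity;
    }

    // Proposal: use the input term's IDF for every variant, or the average of the
    // variants' IDFs when the input term itself does not occur (docFreq of 0).
    static double sharedIdf(int inputDocFreq, int[] variantDocFreqs, int numDocs) {
        if (inputDocFreq > 0) {
            return idf(inputDocFreq, numDocs);
        }
        if (variantDocFreqs.length == 0) {
            return 0.0;
        }
        double sum = 0.0;
        for (int docFreq : variantDocFreqs) {
            sum += idf(docFreq, numDocs);
        }
        return sum / variantDocFreqs.length;
    }

    // Weight of one variant under the proposal (assumed: shared IDF times similarity).
    static double sharedWeight(double sharedIdf, double similarity) {
        return sharedIdf * similarity;
    }

    public static void main(String[] args) {
        int numDocs = 1_000_000;
        int exactDocFreq = 5_000; // made-up frequency of the correctly spelled input term
        int typoDocFreq = 3;      // made-up frequency of a rare mis-spelling
        double typoSimilarity = 0.6;

        // Own IDF per variant: the rare mis-spelling outranks the exact match.
        System.out.println(perVariantWeight(exactDocFreq, numDocs, 1.0));
        System.out.println(perVariantWeight(typoDocFreq, numDocs, typoSimilarity));

        // Shared input-term IDF: ranking follows similarity only.
        double shared = sharedIdf(exactDocFreq, new int[] {exactDocFreq, typoDocFreq}, numDocs);
        System.out.println(sharedWeight(shared, 1.0));
        System.out.println(sharedWeight(shared, typoSimilarity));
    }
}
```

Because the shared IDF is still derived from the input term, a rare surname such as Patitucci~ keeps a higher weight than a common forename such as John~, so the balance across fuzzy clauses is preserved.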
[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833863#action_12833863 ]

Mark Harwood commented on LUCENE-1720:

bq. BTW found and fixed a bug in TimeLimitingIndexReader.reopen which returned the wrapped reopened instance if it wasn't changed, instead of itself

Good catch.

bq. We can get over that by offering a protected getNewInstance(IndexReader) which will be overridden by sub-classes

Would that be abstract? That would effectively help force subclasses to do the right thing when reopening but introduce a back-compatibility issue. If we don't make it abstract what would be the default implementation of this method? Maybe it's all best handled by simply adding a note saying "you really should think about overriding reopen" in FilterIndexReader's javadocs?
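
The trade-off discussed in the comment above can be made concrete with a small sketch. Everything here is hypothetical: SketchIndex, ReopenableReader and the wrapper classes are invented stand-ins, not Lucene classes, and getNewInstance is the method proposed in the quoted text, shown with a non-abstract default so the back-compatibility question is visible.

```java
// Hypothetical stand-in for an index whose version changes on every commit.
final class SketchIndex {
    long version = 1;
}

// Hypothetical stand-in for IndexReader: reopen() returns this when nothing changed.
class ReopenableReader {
    final SketchIndex index;
    final long version;

    ReopenableReader(SketchIndex index) {
        this.index = index;
        this.version = index.version;
    }

    ReopenableReader reopen() {
        return index.version == version ? this : new ReopenableReader(index);
    }
}

// Hypothetical wrapper base class with the proposed factory hook.
class ReopenableFilterReader extends ReopenableReader {
    protected final ReopenableReader in;

    ReopenableFilterReader(ReopenableReader in) {
        super(in.index);
        this.in = in;
    }

    // Non-abstract default: old subclasses keep compiling, but they get a plain wrapper back.
    protected ReopenableFilterReader getNewInstance(ReopenableReader reopened) {
        return new ReopenableFilterReader(reopened);
    }

    @Override
    ReopenableReader reopen() {
        ReopenableReader reopened = in.reopen();
        // Unchanged index: return the wrapper itself, not the wrapped reader.
        return reopened == in ? this : getNewInstance(reopened);
    }
}

// A subclass that does the right thing by overriding the hook.
final class TimeLimitedReopenReader extends ReopenableFilterReader {
    TimeLimitedReopenReader(ReopenableReader in) { super(in); }

    @Override
    protected ReopenableFilterReader getNewInstance(ReopenableReader reopened) {
        return new TimeLimitedReopenReader(reopened);
    }
}

// A subclass that forgets the hook and silently loses its type on reopen.
final class ForgetfulWrapper extends ReopenableFilterReader {
    ForgetfulWrapper(ReopenableReader in) { super(in); }
}

final class ReopenDemo {
    public static void main(String[] args) {
        SketchIndex index = new SketchIndex();
        ReopenableReader limited = new TimeLimitedReopenReader(new ReopenableReader(index));
        System.out.println(limited.reopen() == limited);
        index.version++;
        System.out.println(limited.reopen() instanceof TimeLimitedReopenReader);
    }
}
```

Making the hook abstract would turn the ForgetfulWrapper case into a compile error, at the cost of breaking existing subclasses; the non-abstract default shown here compiles everywhere but drops the wrapper's behaviour after a reopen.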
[jira] Commented: (LUCENE-329) Fuzzy query scoring issues
[ https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833876#action_12833876 ]

Mark Harwood commented on LUCENE-329:

bq. consider simpler case

OK - but we need to remember that it is important to achieve balance _across_ different fuzzy queries as well as terms _within_ the same fuzzy query. Let's stick to the terms within a single fuzzy query for now:

bq. I guess you would like to score the second term higher, meaning Lower frequency

No, a variant's frequency is not a deciding factor - only edit distance. Johana is similarity 0.6 while Joahn is 0.2 so I would favour result one (although this difference seems a little off in this case).

The basic assumption is that the user's input is valid and not a typo (deriving spelling suggestions etc. is a different topic and one we shouldn't try to cover here). Fuzzy matching can drag in all sorts of unqualified variants with massively different frequencies. Because we cannot control these discrepancies we should reward all these alternatives using the known factors we have to hand - the IDF of the user's supposedly valid input and the similarity measure of each variant compared to the input. We could get fancy about the probability of variants given the other input terms in the query but that feels like it's straying into spell checker territory and ngrams etc.
[jira] Issue Comment Edited: (LUCENE-329) Fuzzy query scoring issues
[ https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833876#action_12833876 ]

Mark Harwood edited comment on LUCENE-329 at 2/15/10 5:05 PM:

bq. consider simpler case

OK - but we need to remember that it is important to achieve balance _across_ different fuzzy queries as well as terms _within_ the same fuzzy query. Let's stick to the terms within a single fuzzy query for now:

bq. I guess you would like to score the second term higher, meaning Lower frequency

No, a variant's frequency is not a deciding factor - only edit distance. Johana is similarity 0.6 while Joahn is 0.2 so I would favour result one (although this difference seems a little off in this case).

The basic assumption is that the user's input is valid and not a typo (deriving spelling suggestions etc. is a different topic and one we shouldn't try to cover here). Fuzzy matching can drag in all sorts of unqualified variants with massively different frequencies. Because we cannot control these discrepancies we should reward all these alternatives using the known factors we have to hand - the IDF of the user's supposedly valid input and the similarity measure of each variant compared to the input. We could get fancy about the probability of variants given the other input terms in the query but that feels like it's straying into spell checker territory and ngrams etc.
[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833902#action_12833902 ]

Mark Harwood commented on LUCENE-1720:

bq. Mark, the only thing that remains is to convert TimeLimitingIndexReaderBenchmark to a benchmark algorithm/task. Would you mind taking a stab at this?

Will need to look at existing benchmark tasks for guidance. I may get some time later.
[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12832987#action_12832987 ]

Mark Harwood commented on LUCENE-1720:

bq. I also want to add a TestTimeLimitedIndexReader.

To simplify this I started down the route of making core's TestIndexReader subclassable for testing any IndexReader wrappers such as ours. This involves centralising all the "r = IndexReader.open(..)" calls into a single overridable getReader method. The TestTimeLimitingIndexReader then becomes just this:

{code:title=TestTimeLimitingIndexReader.java|borderStyle=solid}
import java.io.IOException;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.Directory;

// Runs the whole core TestIndexReader suite against the time-limited wrapper.
public class TestTimeLimitingIndexReader extends TestIndexReader {
  public TestTimeLimitingIndexReader(String name) {
    super(name);
  }

  // Every reader the inherited tests open is wrapped here.
  @Override
  public IndexReader getReader(Directory dir, boolean readOnly)
      throws CorruptIndexException, IOException {
    return new TimeLimitedIndexReader(super.getReader(dir, readOnly));
  }
}
{code}

Having done this there were some test failures - notably calls to SegmentReader.getOnlySegmentReader(IndexReader reader), because it has a bunch of instanceof testing code that doesn't expect our wrapper.

This is a general Lucene issue. If we support Reader-wrapping as a concept (FilterIndexReader certainly suggests this) then it might make sense to provide a getWrappedReader method, in the same way that java.lang.Throwable gained a standard getCause method in Java 1.4; prior to that, unwrapping objects required specialised knowledge of each wrapper class. This is perhaps another Jira issue, with related changes to Junit tests.

I'll attach an updated patch with the Junit test that currently fails on these instanceof checks.
[jira] Updated: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-1720:

Attachment: Lucene-1720.patch

Updated patch with TestTimeLimitingIndexReader and changes to core TestIndexReader to support easy testing of IndexReader wrapper classes.
[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833013#action_12833013 ] Mark Harwood commented on LUCENE-1720: -- bq. I think we should add some search timeout tests to it, Yep, I left a TODO in there to cover this. bq. I'll do that while I'm working on the ConcurrentHashMap thing, if you don't mind. Great stuff. I'll leave this with you until further notice. Thanks
[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12832444#action_12832444 ] Mark Harwood commented on LUCENE-1720: -- Thanks for the updates, Shai. Agreed on removing the treemap comment. As you suggest, there may be a low-level timing accuracy issue under heavy load, but for the typically longer timeout settings we may set this is less likely to be an issue. Related: I did think of another feature for ATM - timeouts will typically be set to the maximum bearable value that can be sustained by the hardware without upsetting lots of users/customers who need answers. This setting is therefore a tough business decision to make and is likely to be on the high side to avoid annoying customers (10 seconds? 30?). The current monitoring solution only aborts at the latest possible stage, when the uppermost acceptable limit has been reached and expensive resource has already been burned. Maybe we could add a progress-testing method to ATM which can throw an exception earlier, e.g. public void checkForProjectedActivityTimeout(float percentActivityCompletedSoFar) Clients would need to estimate how far through a task they were and call this method periodically.
[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12832470#action_12832470 ] Mark Harwood commented on LUCENE-1720: -- The change to ATM isn't that big - as you say, just adding start to the data on each thread. Here's an (untested) example:
{code:title=Bar.java|borderStyle=solid}
/**
 * Checks to see if this thread is likely to exceed its pre-determined timeout.
 * This is a heavier-weight call than checkForTimeout and should not be called quite as frequently.
 *
 * Throws {@link ActivityTimedOutException} RuntimeException in the event of any anticipated timeout.
 * @param progress estimated fraction of the activity completed so far
 */
public static final void checkProjectedTimeoutOnThisThread(float progress) {
  Thread currentThread = Thread.currentThread();
  synchronized (timeLimitedThreads) {
    ActivityTime thisTimeOut = timeLimitedThreads.get(currentThread);
    if (thisTimeOut != null) {
      long now = System.currentTimeMillis();
      long maxDuration = thisTimeOut.scheduledTimeout - thisTimeOut.startTime;
      long durationSoFar = now - thisTimeOut.startTime;
      // Fraction of the time budget already used up
      float expectedMinimumProgress = (float) durationSoFar / maxDuration;
      if (progress < expectedMinimumProgress) {
        long expectedOverrun = (long) (((durationSoFar * (1f - progress)) + now) - thisTimeOut.scheduledTimeout);
        throw new ActivityTimedOutException("Thread " + currentThread + " is expected to time out, estimated overrun = " + expectedOverrun + " ms", expectedOverrun);
      }
    }
  }
}

static class ActivityTime {
  public ActivityTime(long startTime, long timeOutTime) {
    this.startTime = startTime;
    this.scheduledTimeout = timeOutTime;
  }
  long startTime;
  long scheduledTimeout;
}
{code}
I agree it will be challenging to work out when to call this from readers etc and how to estimate completeness, but as a general utility class (as you suggest, in o.a.l.util) it seems like a useful addition. My suspicion is that this is currently contrib - but then TimeLimitingCollector is currently in core. Maybe TimeLimitingCollector could be rewritten to use ATM and then we maintain a common generally reusable implementation?
[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12832483#action_12832483 ] Mark Harwood commented on LUCENE-1720: -- Agreed, it might be useful to provide a boolean response to the progress method - a kind of "how am I doing?" check. We can always provide a convenience wrapper method which throws an exception: ATM.blowUpIfNotGoingFastEnough(float progress) Re TimeLimitingCollector - agreed, you really do need to protect ATM start/stop calls in the same try...finally block. Maybe ATM could have a start method variant that takes an additional alreadyRunningSince argument, as opposed to the existing assumption that the activity is starting right now. The first collect could then call this with a timestamp initialised in the constructor. Even then, there is the issue of where to put the stop call - the collector has no close call to signal the end of the activity. Doesn't seem like TimeLimitingCollector can be based on the same ATM code. Shame.
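The boolean progress check and its throwing wrapper discussed in the comment above could look roughly like the following. This is a minimal sketch, not the code from the patch: the class name ProgressCheck, the explicit now parameter (added so the logic can be exercised without a real clock) and the use of IllegalStateException in place of ActivityTimedOutException are all assumptions made for illustration. Constructing ActivityTime with an earlier start time plays the role of the suggested alreadyRunningSince variant.

```java
/** Hypothetical sketch; names and signatures are illustrative, not the patch's API. */
final class ProgressCheck {

    /** Start time and deadline of one timed activity, both in epoch milliseconds. */
    static final class ActivityTime {
        final long startTime;
        final long scheduledTimeout;

        ActivityTime(long startTime, long scheduledTimeout) {
            this.startTime = startTime;
            this.scheduledTimeout = scheduledTimeout;
        }
    }

    /**
     * "How am I doing?" check: returns true when the reported progress is
     * smaller than the fraction of the time budget that has already elapsed.
     */
    static boolean isProjectedToTimeout(ActivityTime t, float progress, long now) {
        long maxDuration = t.scheduledTimeout - t.startTime;
        if (maxDuration <= 0) {
            return true; // no time budget at all
        }
        long durationSoFar = now - t.startTime;
        float expectedMinimumProgress = (float) durationSoFar / maxDuration;
        return progress < expectedMinimumProgress;
    }

    /** Convenience wrapper that throws instead of returning a flag. */
    static void blowUpIfNotGoingFastEnough(ActivityTime t, float progress, long now) {
        if (isProjectedToTimeout(t, progress, now)) {
            throw new IllegalStateException("Activity is projected to exceed its timeout");
        }
    }
}
```

Keeping the boolean method as the primitive leaves the caller free to degrade gracefully (for example, return partial results) rather than always aborting.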
[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12832500#action_12832500 ] Mark Harwood commented on LUCENE-1720: -- I'll pick this up
[jira] Updated: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-1720: - Attachment: Lucene-1720.patch Moved ATM to o.a.l.util package. Added isProjectedToTimeout method to ATM and corresponding Junit test. Removed treemap comments.
[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12832721#action_12832721 ] Mark Harwood commented on LUCENE-1720: -- bq. When's this ready to test with Solr? I think the API is pretty stable - call try..start..finally...stop around time-critical stuff and use a TimeLimitedIndexReader to wrap your IndexReader. Internally the implementation feels reasonably stable too. In my tests it doesn't seem to add too much overhead to calls - I was getting response times of 3400 milliseconds on a heavy Wikipedia query with TimeLimitedIndexReader versus 3300 for the same query on a raw IndexReader without timeout protection. I'm tempted to try putting the timeout check calls directly into a version of IndexReader rather than in a delegating reader wrapper, just to see if the wrapper code is where the bulk of the extra overhead comes in. I'd hate to add any overhead to core IndexReader but I'm keen to see just how low-cost this check can get.
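The try..start..finally..stop calling pattern described in the comment above can be sketched as follows. This is a simplified stand-in and not ActivityTimeMonitor itself: the class SimpleTimeMonitor, its method signatures and the IllegalStateException it throws are assumptions for illustration, and it compares against the clock on every check rather than using whatever mechanism the attached ActivityTimeMonitor.java uses.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical, simplified monitor used only to illustrate the calling pattern. */
final class SimpleTimeMonitor {

    // Deadline (epoch millis) for each thread that is currently being timed.
    private static final Map<Thread, Long> DEADLINES = new ConcurrentHashMap<>();

    /** Starts timing the current thread; it may run for at most maxMillis. */
    static void start(long maxMillis) {
        DEADLINES.put(Thread.currentThread(), System.currentTimeMillis() + maxMillis);
    }

    /** Stops timing the current thread. Always call this from a finally block. */
    static void stop() {
        DEADLINES.remove(Thread.currentThread());
    }

    /** Cheap check that a wrapped reader would call inside its hot methods. */
    static void checkForTimeout() {
        Long deadline = DEADLINES.get(Thread.currentThread());
        if (deadline != null && System.currentTimeMillis() > deadline) {
            throw new IllegalStateException(
                "Activity timed out on thread " + Thread.currentThread().getName());
        }
    }

    static boolean isMonitored() {
        return DEADLINES.containsKey(Thread.currentThread());
    }

    /** The calling pattern: start, do the time-critical work, stop in finally. */
    static int timedWork(long budgetMillis, int steps) {
        start(budgetMillis);
        try {
            int done = 0;
            for (int i = 0; i < steps; i++) {
                checkForTimeout(); // stands in for a call made by the reader wrapper
                done++;
            }
            return done;
        } finally {
            stop(); // runs even if checkForTimeout() threw
        }
    }
}
```

Putting stop() in the finally block is what keeps a timed-out or failed activity from leaving a stale entry behind for its thread.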
[jira] Updated: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-1720: - Attachment: ActivityTimeMonitor.java, TestTimeLimitedIndexReader.java, TimeLimitedIndexReader.java. Updated to work with Lucene 2.9.1 and 3.0.0. Fixed NullPointer when reporting timed-out threads.
[jira] Updated: (LUCENE-725) NovelAnalyzer - wraps your choice of Lucene Analyzer and filters out all boilerplate text
[ https://issues.apache.org/jira/browse/LUCENE-725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-725: Attachment: NovelAnalyzer.java Updated for new 3.0 APIs NovelAnalyzer - wraps your choice of Lucene Analyzer and filters out all boilerplate text --- Key: LUCENE-725 URL: https://issues.apache.org/jira/browse/LUCENE-725 Project: Lucene - Java Issue Type: New Feature Components: Analysis Reporter: Mark Harwood Assignee: Otis Gospodnetic Priority: Minor Attachments: NovelAnalyzer.java, NovelAnalyzer.java, NovelAnalyzer.java This is a class I have found to be useful for analyzing small (in the hundreds) collections of documents and removing any duplicate content such as standard disclaimers or repeated text in an exchange of emails. This has applications in sampling query results to identify key phrases, improving speed-reading of results with similar content (eg email threads/forum messages) or just removing duplicated noise from a search index. To be more generally useful it needs to scale to millions of documents - in which case an alternative implementation is required. See the notes in the Javadocs for this class for more discussion on this -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries
[ https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12782521#action_12782521 ] Mark Harwood commented on LUCENE-1486: -- Ugh. There are probably two separate actions required here then: 1) a bug needs raising on Lucene. 2) guidance is needed from the Solr team about the preferred course of action. Wildcards, ORs etc inside Phrase queries Key: LUCENE-1486 URL: https://issues.apache.org/jira/browse/LUCENE-1486 Project: Lucene - Java Issue Type: Improvement Components: QueryParser Affects Versions: 2.4 Reporter: Mark Harwood Priority: Minor Fix For: 3.1 Attachments: ComplexPhraseQueryParser.java, junit_complex_phrase_qp_07_21_2009.patch, junit_complex_phrase_qp_07_22_2009.patch, Lucene-1486 non default field.patch, LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, TestComplexPhraseQuery.java An extension to the default QueryParser that overrides the parsing of PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries. The implementation feels a little hacky - this is arguably better handled in QueryParser itself. This works as a proof of concept for much of the query parser syntax. Examples from the Junit test include:
checkMatches("\"j* smyth~\"", "1,2"); //wildcards and fuzzies are OK in phrases
checkMatches("\"(jo* -john) smith\"", "2"); // boolean logic works
checkMatches("\"jo* smith\"~2", "1,2,3"); // position logic works.
checkBadQuery("\"jo* id:1 smith\""); //mixing fields in a phrase is bad
checkBadQuery("\"jo* \"smith\" \""); //phrases inside phrases is bad
checkBadQuery("\"jo* [sma TO smZ]\" \""); //range queries inside phrases not supported
Code plus Junit test to follow...
[jira] Commented: (LUCENE-1999) Match spotter for all query types
[ https://issues.apache.org/jira/browse/LUCENE-1999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12768257#action_12768257 ] Mark Harwood commented on LUCENE-1999: -- bq. and 2) you need it for every single doc visited by the query Actually I don't need it for every doc, only the top ones - it just happens to be so cheap to produce that I can afford to run this in-line with the query. (I haven't actually benchmarked it at scale but my gut feel is it would be fast.) I was thinking that this might be orthogonal to the existing free-text based highlighter. The logic for this being roughly that
1) Highlighting of free-text fields is reasonably well catered for, with summarisation etc.
2) The remaining problem areas for highlighting (NumericRangeQuery, Spatial, cached term filters on enums e.g. gender:male/female) are all likely to be non-free-text fields which don't require summarisation and only contain a single value.
I may be wrong in these assumptions about the existing state of play (any thoughts, Mark M?) but it might be useful to think of attacking the problem with these 2 different requirements in mind. Regardless of type (e.g. int, long etc) I tend to think of fields as falling into these broad usage categories:
a) Identifiers (e.g. primary keys)
b) Quantifiers (e.g. numerics, dates, spatial)
c) Free-text
d) Controlled vocabularies (e.g. enums such as gender:m/f)
Type a) is catered for with a straight TermQuery and therefore can be handled with the existing highlighter.
Type b) needs special indexes/queries (spatial/trie) and isn't catered for by the existing term/span-based Highlighter.
Type c) is catered for with the existing highlighter and its summarising features.
Type d) involves many TermDoc.next reads so is usefully cached as filters and therefore not catered for by the existing Highlighter.
So this patch helps cater for types b) and d), where simply knowing the field matched is all that is required to highlight.
Match spotter for all query types - Key: LUCENE-1999 URL: https://issues.apache.org/jira/browse/LUCENE-1999 Project: Lucene - Java Issue Type: New Feature Affects Versions: 2.9 Reporter: Mark Harwood Attachments: matchflagger.patch Related to LUCENE-1929 and the current inability to highlight NumericRangeQuery, spatial, cached term filters and other exotica. This patch provides the ability to wrap *any* Query objects and record match info as flags encoded in the overall document score. Using this approach it would be possible to understand (and therefore highlight) which fields matched clauses in a query. The match encoding approach loses some precision in scores as noted here: http://tinyurl.com/ykt8nx7 Avoiding these precision issues would require a change to Lucene core to record docId, score AND a matchFlag byte in ScoreDoc objects and collector APIs. This may be something we should consider. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
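One way match flags could be carried inside a float score is sketched below. This is an assumption about the general technique, not a description of matchflagger.patch: the class ScoreFlags and the choice of 8 flag bits are hypothetical. The sketch overwrites the low-order bits of the float's 23-bit mantissa, which is also why some score precision is lost.

```java
/** Hypothetical sketch of packing per-clause match flags into a float score. */
final class ScoreFlags {

    static final int FLAG_BITS = 8;                      // up to 8 flagged clauses
    static final int FLAG_MASK = (1 << FLAG_BITS) - 1;   // 0xFF

    /** Overwrites the low 8 mantissa bits of the score with the flags. */
    static float encode(float score, int flags) {
        int bits = Float.floatToIntBits(score);
        return Float.intBitsToFloat((bits & ~FLAG_MASK) | (flags & FLAG_MASK));
    }

    /** Recovers the flags from an encoded score. */
    static int decode(float encodedScore) {
        return Float.floatToIntBits(encodedScore) & FLAG_MASK;
    }
}
```

With this layout each flag bit would correspond to one wrapped clause, and the encoded score differs from the original only in its least significant mantissa bits. Storing a separate matchFlag byte alongside docId and score, as the issue description suggests, would avoid that loss entirely.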
[jira] Created: (LUCENE-1999) Match spotter for all query types
Match spotter for all query types - Key: LUCENE-1999 URL: https://issues.apache.org/jira/browse/LUCENE-1999 Project: Lucene - Java Issue Type: New Feature Affects Versions: 2.9 Reporter: Mark Harwood Attachments: matchflagger.patch Related to LUCENE-1929 and the current inability to highlight NumericRangeQuery, spatial, cached term filters and other exotica. This patch provides the ability to wrap *any* Query objects and record match info as flags encoded in the overall document score. Using this approach it would be possible to understand (and therefore highlight) which fields matched clauses in a query. The match encoding approach loses some precision in scores as noted here: http://tinyurl.com/ykt8nx7 Avoiding these precision issues would require a change to Lucene core to record docId, score AND a matchFlag byte in ScoreDoc objects and collector APIs. This may be something we should consider. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1999) Match spotter for all query types
[ https://issues.apache.org/jira/browse/LUCENE-1999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-1999: - Attachment: matchflagger.patch
[jira] Commented: (LUCENE-1910) Extension to MoreLikeThis to use tag information
[ https://issues.apache.org/jira/browse/LUCENE-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12762290#action_12762290 ] Mark Harwood commented on LUCENE-1910: -- 2 minutes to create a query based on 10,000 documents? Unfortunately, I can't see this being generally useful until the performance is improved dramatically. Extension to MoreLikeThis to use tag information Key: LUCENE-1910 URL: https://issues.apache.org/jira/browse/LUCENE-1910 Project: Lucene - Java Issue Type: New Feature Components: contrib/* Reporter: Thomas D'Silva Priority: Minor Attachments: LUCENE-1910.patch I would like to contribute a class based on the MoreLikeThis class in contrib/queries that generates a query based on the tags associated with a document. The class assumes that documents are tagged with a set of tags (which are stored in the index in a separate Field). The class determines the top document terms associated with a given tag using the information gain metric. While generating a MoreLikeThis query for a document the tags associated with document are used to determine the terms in the query. This class is useful for finding similar documents to a document that does not have many relevant terms but was tagged.
[jira] Commented: (LUCENE-1910) Extension to MoreLikeThis to use tag information
[ https://issues.apache.org/jira/browse/LUCENE-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12757924#action_12757924 ] Mark Harwood commented on LUCENE-1910: -- Hi Thomas, Following your request for feedback, some initial thoughts from a very quick look.
* The Information Gain algo could use a little more explanation, e.g. using variable names other than num1 and num2, and could perhaps be extracted into a utility class.
* Is this scalable? It looks like in initialize it is loading this:
{code:title=MoreLikeThisUsingTags.java|borderStyle=solid}
/**
 * All terms in the index
 */
protected HashSet docTerms=new HashSet();
{code}
..that seems a little scary! It's also doing a separate BooleanQuery for all items in this list (and repeated for 1 tag?). That looks like a lot of searches.
I need to spend a little more time looking at it before I understand it in more detail. Before then - have you tested this on a big (millions of docs/terms) index? Some performance figures would be useful to accompany this. Cheers, Mark
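As an illustration of the kind of self-explanatory utility class suggested in the comment above, the textbook information gain of a term with respect to a tag can be computed from four document counts. This is a sketch under assumptions: the class and parameter names are hypothetical, and the formula is the standard definition (entropy of the tag minus its entropy conditioned on the term), which may differ in detail from what LUCENE-1910.patch computes.

```java
/** Hypothetical utility: information gain (in bits) of a term with respect to a tag. */
final class InformationGain {

    /** Binary entropy in bits of a split into positive and negative counts. */
    static double entropy(double positives, double negatives) {
        double total = positives + negatives;
        if (total == 0) {
            return 0.0;
        }
        return term(positives / total) + term(negatives / total);
    }

    /** Single entropy term -p * log2(p), with 0 * log(0) treated as 0. */
    private static double term(double p) {
        return p == 0.0 ? 0.0 : -p * Math.log(p) / Math.log(2);
    }

    /**
     * Computes H(tag) - H(tag | term) from counts of documents that:
     * contain the term and carry the tag, contain the term without the tag,
     * lack the term but carry the tag, and lack both.
     */
    static double informationGain(long termAndTag, long termNoTag,
                                  long noTermWithTag, long noTermNoTag) {
        double total = termAndTag + termNoTag + noTermWithTag + noTermNoTag;
        if (total == 0) {
            return 0.0;
        }
        double docsWithTerm = termAndTag + termNoTag;
        double docsWithoutTerm = noTermWithTag + noTermNoTag;

        double tagEntropy = entropy(termAndTag + noTermWithTag, termNoTag + noTermNoTag);
        double conditionalEntropy =
            (docsWithTerm / total) * entropy(termAndTag, termNoTag)
          + (docsWithoutTerm / total) * entropy(noTermWithTag, noTermNoTag);
        return tagEntropy - conditionalEntropy;
    }
}
```

A term that appears in exactly the tagged documents scores highest, while a term spread evenly across tagged and untagged documents scores zero, so ranking terms by this value picks out the ones most characteristic of a tag.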
[jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries
[ https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12748046#action_12748046 ] Mark Harwood commented on LUCENE-1486: -- It does not stand on its own, as it is merely a temporary object that exists because of a peculiarity in the way the parsing works. The SpanQuery family would be the legitimate standalone equivalents of this class. ComplexPhraseQuery objects are constructed during the first pass of parsing to capture everything between quotes as an opaque string. The ComplexPhraseQueryParser then calls parsePhraseElements(...) on these objects to complete the process of parsing in a second pass, where in this context any brackets etc. take on a different meaning. There is no merit in making this externally visible. Wildcards, ORs etc inside Phrase queries Key: LUCENE-1486 URL: https://issues.apache.org/jira/browse/LUCENE-1486 Project: Lucene - Java Issue Type: Improvement Components: QueryParser Affects Versions: 2.4 Reporter: Mark Harwood Assignee: Mark Miller Priority: Minor Fix For: 3.0, 3.1 Attachments: ComplexPhraseQueryParser.java, junit_complex_phrase_qp_07_21_2009.patch, junit_complex_phrase_qp_07_22_2009.patch, Lucene-1486 non default field.patch, LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, TestComplexPhraseQuery.java An extension to the default QueryParser that overrides the parsing of PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries. The implementation feels a little hacky - this is arguably better handled in QueryParser itself. This works as a proof of concept for much of the query parser syntax. Examples from the Junit test include: checkMatches("\"j* smyth~\"", "1,2"); //wildcards and fuzzies are OK in phrases checkMatches("\"(jo* -john) smith\"", "2"); // boolean logic works checkMatches("\"jo* smith\"~2", "1,2,3"); // position logic works.
checkBadQuery("\"jo* id:1 smith\""); //mixing fields in a phrase is bad checkBadQuery("\"jo* \"smith\" \""); //phrases inside phrases is bad checkBadQuery("\"jo* [sma TO smZ]\" \""); //range queries inside phrases not supported Code plus Junit test to follow...
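The two-pass approach described in the comment above can be illustrated with a small, self-contained sketch. It does not use Lucene: the names TwoPassSketch, OpaquePhrase, firstPass and secondPass are hypothetical, and the regular expressions stand in for the real QueryParser machinery purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TwoPassSketch {

    /** Temporary holder (hypothetical): raw text between quotes, not yet interpreted. */
    static final class OpaquePhrase {
        final String raw;
        OpaquePhrase(String raw) { this.raw = raw; }
    }

    /** First pass: capture everything between double quotes as an opaque string. */
    static List<OpaquePhrase> firstPass(String query) {
        List<OpaquePhrase> phrases = new ArrayList<>();
        Matcher m = Pattern.compile("\"([^\"]*)\"").matcher(query);
        while (m.find()) {
            phrases.add(new OpaquePhrase(m.group(1)));
        }
        return phrases;
    }

    /**
     * Second pass: split the opaque text into positional phrase elements.
     * In this context a bracketed group is a single element listing alternatives.
     */
    static List<String> secondPass(OpaquePhrase phrase) {
        List<String> elements = new ArrayList<>();
        Matcher m = Pattern.compile("\\([^)]*\\)|\\S+").matcher(phrase.raw);
        while (m.find()) {
            elements.add(m.group());
        }
        return elements;
    }

    public static void main(String[] args) {
        // Only the quoted part is handed to the second pass; order:sell is left alone.
        for (OpaquePhrase p : firstPass("\"(john OR jonathon) smith~\" order:sell")) {
            System.out.println(secondPass(p));
        }
    }
}
```

The holder type only lives between the two passes, which is the reason given above for not making the equivalent class externally visible.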
[jira] Updated: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-1720: - Attachment: ActivityTimeMonitor.java Had another run at ActivityTimeMonitor tonight and rationalised the code based on earlier comments. It should now cater for multiple simultaneous timeouts more cleanly. I'm concentrating on robustness with this currently - there's a TODO comment in the code that captures a small remaining inefficiency in iterating through all threads' data rather than using some form of time-sorted list. There was a suggestion in the earlier Jira comments re TreeMap might be a simple alternative but see my Java code comments as to why this is unlikely to work. Optimising this is likely to require the introduction of yet another data structure but this will add a runtime cost to maintain it - a cost I'm not sure is justified. TimeLimitedIndexReader and associated utility class --- Key: LUCENE-1720 URL: https://issues.apache.org/jira/browse/LUCENE-1720 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Mark Harwood Assignee: Mark Harwood Priority: Minor Attachments: ActivityTimedOutException.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, TestTimeLimitedIndexReader.java, TimeLimitedIndexReader.java An alternative to TimeLimitedCollector that has the following advantages: 1) Any reader activity can be time-limited rather than just single searches e.g. the document retrieve phase. 2) Times out faster (i.e. runaway queries such as fuzzies detected quickly before last collect stage of query processing) Uses new utility timeout class that is independent of IndexReader. Initial contribution includes a performance test class but not had time as yet to work up a formal Junit test. TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. 
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries
[ https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12737270#action_12737270 ] Mark Harwood commented on LUCENE-1486: -- No objections to pulling from core given the impending deprecation of the QueryParser base class. I know of at least 2 folks using it so moving it to contrib would help provide somewhere to maintain fixes while we wait for the new QueryParser to incorporate the complex phrase features. Wildcards, ORs etc inside Phrase queries Key: LUCENE-1486 URL: https://issues.apache.org/jira/browse/LUCENE-1486 Project: Lucene - Java Issue Type: Improvement Components: QueryParser Affects Versions: 2.4 Reporter: Mark Harwood Assignee: Mark Harwood Priority: Minor Fix For: 2.9 Attachments: ComplexPhraseQueryParser.java, junit_complex_phrase_qp_07_21_2009.patch, junit_complex_phrase_qp_07_22_2009.patch, Lucene-1486 non default field.patch, LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, TestComplexPhraseQuery.java An extension to the default QueryParser that overrides the parsing of PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries. The implementation feels a little hacky - this is arguably better handled in QueryParser itself. This works as a proof of concept for much of the query parser syntax. Examples from the Junit test include: checkMatches(\j* smyth~\, 1,2); //wildcards and fuzzies are OK in phrases checkMatches(\(jo* -john) smith\, 2); // boolean logic works checkMatches(\jo* smith\~2, 1,2,3); // position logic works. checkBadQuery(\jo* id:1 smith\); //mixing fields in a phrase is bad checkBadQuery(\jo* \smith\ \); //phrases inside phrases is bad checkBadQuery(\jo* [sma TO smZ]\ \); //range queries inside phrases not supported Code plus Junit test to follow... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. 
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries
[ https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-1486: - Attachment: Lucene-1486 non default field.patch Fix for phrases using QueryParser's non-default field e.g. author:j* smith Wildcards, ORs etc inside Phrase queries Key: LUCENE-1486 URL: https://issues.apache.org/jira/browse/LUCENE-1486 Project: Lucene - Java Issue Type: Improvement Components: QueryParser Affects Versions: 2.4 Reporter: Mark Harwood Assignee: Mark Harwood Priority: Minor Fix For: 2.9 Attachments: ComplexPhraseQueryParser.java, junit_complex_phrase_qp_07_21_2009.patch, junit_complex_phrase_qp_07_22_2009.patch, Lucene-1486 non default field.patch, LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, TestComplexPhraseQuery.java An extension to the default QueryParser that overrides the parsing of PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries. The implementation feels a little hacky - this is arguably better handled in QueryParser itself. This works as a proof of concept for much of the query parser syntax. Examples from the Junit test include: checkMatches(\j* smyth~\, 1,2); //wildcards and fuzzies are OK in phrases checkMatches(\(jo* -john) smith\, 2); // boolean logic works checkMatches(\jo* smith\~2, 1,2,3); // position logic works. checkBadQuery(\jo* id:1 smith\); //mixing fields in a phrase is bad checkBadQuery(\jo* \smith\ \); //phrases inside phrases is bad checkBadQuery(\jo* [sma TO smZ]\ \); //range queries inside phrases not supported Code plus Junit test to follow... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries
[ https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734148#action_12734148 ] Mark Harwood commented on LUCENE-1486: -- I'll try and catch up with some of the issues raised here:
bq. What do you mean on the last check by phrase inside phrase, I don't see any phrase inside a phrase
Correct, the inner phrase example was a term not a phrase. This is perhaps a better example: checkBadQuery("\"jo* \"percival smith\" \""); //phrases inside phrases is bad
bq. I'm trying now to figure out what is supported
The Junit is currently the main form of documentation - unlike the XMLQueryParser (which has a DTD) there is no syntax to formally capture the logic. Here is a basic summary of the syntax supported and how it differs from normal non-phrase use of the same operators:
* Wildcard/fuzzy/range clauses can be used to define a phrase element (as opposed to simply single terms)
* Brackets are used to group/define the acceptable variations for a given phrase element e.g. (john OR jonathon) smith
* AND is irrelevant - there is effectively an implied AND_NEXT_TO binding all phrase elements
To move this forward I would suggest we consider following one of these options:
1) Keep in core and improve error reporting and documentation
2) Move into contrib as experimental
3) Retain in core but simplify it to support only the simplest syntax (as in my Britney~ example)
4) Re-engineer the QueryParser.jj to support a formally defined syntax for acceptable within phrase operators e.g. *, ~, ( )
I think 1) is achievable if we carefully define where the existing parser breaks (e.g. ANDs and nested brackets). 2) is unnecessary if we can achieve 1). 3) would be a shame if we lost useful features for some very convoluted edge cases. 4) is beyond my JavaCC skills.
Wildcards, ORs etc inside Phrase queries Key: LUCENE-1486 URL: https://issues.apache.org/jira/browse/LUCENE-1486 Project: Lucene - Java Issue Type: Improvement Components: QueryParser Affects Versions: 2.4 Reporter: Mark Harwood Assignee: Mark Harwood Priority: Minor Fix For: 2.9 Attachments: ComplexPhraseQueryParser.java, junit_complex_phrase_qp_07_21_2009.patch, junit_complex_phrase_qp_07_22_2009.patch, LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, TestComplexPhraseQuery.java An extension to the default QueryParser that overrides the parsing of PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries. The implementation feels a little hacky - this is arguably better handled in QueryParser itself. This works as a proof of concept for much of the query parser syntax. Examples from the Junit test include: checkMatches(\j* smyth~\, 1,2); //wildcards and fuzzies are OK in phrases checkMatches(\(jo* -john) smith\, 2); // boolean logic works checkMatches(\jo* smith\~2, 1,2,3); // position logic works. checkBadQuery(\jo* id:1 smith\); //mixing fields in a phrase is bad checkBadQuery(\jo* \smith\ \); //phrases inside phrases is bad checkBadQuery(\jo* [sma TO smZ]\ \); //range queries inside phrases not supported Code plus Junit test to follow... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
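The syntax summary in this comment (each phrase element may be a set of acceptable variations, with an implied AND_NEXT_TO between elements) can be made concrete with a toy matcher. This is a sketch under simplifying assumptions: it does not use Lucene, it requires exact adjacency (no slop), and the names PhraseElementSketch and matches are hypothetical. In Lucene itself this role is played by the SpanQuery family.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class PhraseElementSketch {

    /**
     * Returns true if every element matches one of a run of consecutive tokens,
     * i.e. the elements are bound by an implied AND_NEXT_TO.
     */
    static boolean matches(List<Predicate<String>> elements, List<String> tokens) {
        for (int start = 0; start + elements.size() <= tokens.size(); start++) {
            boolean all = true;
            for (int i = 0; i < elements.size() && all; i++) {
                all = elements.get(i).test(tokens.get(start + i));
            }
            if (all) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Models the phrase "(john OR jonathon) sm*":
        // the bracketed element is a choice, the second element is a wildcard.
        Predicate<String> johnOrJonathon = t -> t.equals("john") || t.equals("jonathon");
        Predicate<String> smWildcard = t -> t.startsWith("sm");
        List<Predicate<String>> phrase = Arrays.asList(johnOrJonathon, smWildcard);

        // Adjacent tokens satisfy both elements.
        System.out.println(matches(phrase, Arrays.asList("mr", "jonathon", "smith")));
        // Both elements occur, but not next to each other.
        System.out.println(matches(phrase, Arrays.asList("john", "q", "smith")));
    }
}
```

Treating each element as a predicate keeps wildcards, fuzzies and bracketed choices interchangeable, which mirrors the point that any of them can define a phrase element.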
[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734176#action_12734176 ] Mark Harwood commented on LUCENE-1720: -- bq. Hey Mark. Have you made any progress with that? Apologies, recently the lure of developing apps for the new iPhone has put paid to that :) I'm still happy that the pseudo-code we outlined in the last couple of comments is what is needed to finish this. bq.We can tag team if you want Yep, happy to do that. Let me know if you start work to avoid me duplicating effort and I'll do the same. Cheers Mark TimeLimitedIndexReader and associated utility class --- Key: LUCENE-1720 URL: https://issues.apache.org/jira/browse/LUCENE-1720 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Mark Harwood Assignee: Mark Harwood Priority: Minor Attachments: ActivityTimedOutException.java, ActivityTimeMonitor.java, TestTimeLimitedIndexReader.java, TimeLimitedIndexReader.java An alternative to TimeLimitedCollector that has the following advantages: 1) Any reader activity can be time-limited rather than just single searches e.g. the document retrieve phase. 2) Times out faster (i.e. runaway queries such as fuzzies detected quickly before last collect stage of query processing) Uses new utility timeout class that is independent of IndexReader. Initial contribution includes a performance test class but not had time as yet to work up a formal Junit test. TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries
[ https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734337#action_12734337 ] Mark Harwood commented on LUCENE-1486: -- bq. I think it's not a big deal, but I'm just trying to understand and raise a probable wrong test. Granted, the test fails for a reason other than the one for which I wanted it to fail. We can probably strike the test and leave a note saying phrase-within-a-phrase just does not make sense and is not supported. bq. Is the operator between 'query' and 'parser' the implicit AND_NEXT_TO or the default boolean operator (usually OR)? In brackets it's an OR - the brackets are used to suggest that the current phrase element at position X is composed of some choices that are evaluated as a subclause in the same way that in normal query logic sub-clauses are defined in brackets e.g. +a +(b OR c). There seems to be a reasonable logic to this. Ideally the ComplexPhraseQueryParser should explicitly turn this setting on while evaluating the bracketed innards of phrases just in case the base class has AND as the default. bq. Mark H, can you please elaborate more on the these other operators + - ^ AND || NOT ! : [ ] { }. OK I'll try and deal with them one by one but these are not necessarily definitive answers or guarantees of correctly implemented support OR,||,+, AND, . ignored. The implicit operator is AND_NEXT_TO apart from in bracketed sections where all elements at this level are ORed ^ .boosts are carried through from TermQuerys to SpanTermQuerys NOT, ! Creates SpanNotQueries []{} range queries are supported as are wildcards *, fuzzies ~, ? bq. 
query: '(john OR jonathon) smith~0.3 order*' order:sell stock market I'll post the XML query syntax equivalent of what should be parsed here shortly (just seen your next comment come in) Wildcards, ORs etc inside Phrase queries Key: LUCENE-1486 URL: https://issues.apache.org/jira/browse/LUCENE-1486 Project: Lucene - Java Issue Type: Improvement Components: QueryParser Affects Versions: 2.4 Reporter: Mark Harwood Assignee: Mark Harwood Priority: Minor Fix For: 2.9 Attachments: ComplexPhraseQueryParser.java, junit_complex_phrase_qp_07_21_2009.patch, junit_complex_phrase_qp_07_22_2009.patch, LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, TestComplexPhraseQuery.java An extension to the default QueryParser that overrides the parsing of PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries. The implementation feels a little hacky - this is arguably better handled in QueryParser itself. This works as a proof of concept for much of the query parser syntax. Examples from the Junit test include: checkMatches(\j* smyth~\, 1,2); //wildcards and fuzzies are OK in phrases checkMatches(\(jo* -john) smith\, 2); // boolean logic works checkMatches(\jo* smith\~2, 1,2,3); // position logic works. checkBadQuery(\jo* id:1 smith\); //mixing fields in a phrase is bad checkBadQuery(\jo* \smith\ \); //phrases inside phrases is bad checkBadQuery(\jo* [sma TO smZ]\ \); //range queries inside phrases not supported Code plus Junit test to follow... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries
[ https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734349#action_12734349 ] Mark Harwood commented on LUCENE-1486: -- {quote}for test checkMatches(\(jo* -john) smyth\, 2); would document 5 be returned or just doc 2 should be returned, {quote} I presume you mean smith not smyth here otherwise nothing would match? If so, doc 5 should match and position is relevant (subject to slop factors). {quote} Question 2) Should these 2 queries behave the same when we fix the problem // checkMatches(\john -percival\, 1); // not logic doesn't work // checkMatches(\john (-percival)\, 1); // not logic doesn't work {quote} I suppose there's an open question as to if the second example is legal (the brackets are unnecessary) {quote} Question 3) checkMatches(\jo* smith\~2, 1,2,3,5); // position logic works. doc 6 is also returned, so this feature does not seem to be working. {quote} That looks like a bug related to slop factor? {quote} Question 4) The usage of AND and AND_NEXT_TO is confusing to me the query checkMatches(\(jo* AND mary) smith\, 1,2,5); // boolean logic with {quote} ANDs are ignored and turned into ORs (see earlier comments) but maybe a query parse error should be thrown to emphasise this. Wildcards, ORs etc inside Phrase queries Key: LUCENE-1486 URL: https://issues.apache.org/jira/browse/LUCENE-1486 Project: Lucene - Java Issue Type: Improvement Components: QueryParser Affects Versions: 2.4 Reporter: Mark Harwood Assignee: Mark Harwood Priority: Minor Fix For: 2.9 Attachments: ComplexPhraseQueryParser.java, junit_complex_phrase_qp_07_21_2009.patch, junit_complex_phrase_qp_07_22_2009.patch, LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, TestComplexPhraseQuery.java An extension to the default QueryParser that overrides the parsing of PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries. 
The implementation feels a little hacky - this is arguably better handled in QueryParser itself. This works as a proof of concept for much of the query parser syntax. Examples from the Junit test include: checkMatches(\j* smyth~\, 1,2); //wildcards and fuzzies are OK in phrases checkMatches(\(jo* -john) smith\, 2); // boolean logic works checkMatches(\jo* smith\~2, 1,2,3); // position logic works. checkBadQuery(\jo* id:1 smith\); //mixing fields in a phrase is bad checkBadQuery(\jo* \smith\ \); //phrases inside phrases is bad checkBadQuery(\jo* [sma TO smZ]\ \); //range queries inside phrases not supported Code plus Junit test to follow... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries
[ https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734355#action_12734355 ] Mark Harwood commented on LUCENE-1486: -- {quote} query: '(john OR jonathon) smith~0.3 order*' order:sell stock market {quote} Would be parsed as follows (shown as equivalent XMLQueryParser syntax)
{code:xml}
<BooleanQuery>
  <Clause occurs="should">
    <SpanNear>
      <SpanOr>
        <SpanOrTerms>john jonathon</SpanOrTerms>
      </SpanOr>
      <SpanOr>
        <SpanOrTerms>smith smyth</SpanOrTerms>
      </SpanOr>
      <SpanOr>
        <SpanOrTerms>order orders</SpanOrTerms>
      </SpanOr>
    </SpanNear>
  </Clause>
  <Clause occurs="should">
    <TermQuery fieldName="order">sell</TermQuery>
  </Clause>
  <Clause occurs="should">
    <UserQuery>stock market</UserQuery>
  </Clause>
</BooleanQuery>
{code}
Wildcards, ORs etc inside Phrase queries Key: LUCENE-1486 URL: https://issues.apache.org/jira/browse/LUCENE-1486 Project: Lucene - Java Issue Type: Improvement Components: QueryParser Affects Versions: 2.4 Reporter: Mark Harwood Assignee: Mark Harwood Priority: Minor Fix For: 2.9 Attachments: ComplexPhraseQueryParser.java, junit_complex_phrase_qp_07_21_2009.patch, junit_complex_phrase_qp_07_22_2009.patch, LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, TestComplexPhraseQuery.java An extension to the default QueryParser that overrides the parsing of PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries. The implementation feels a little hacky - this is arguably better handled in QueryParser itself. This works as a proof of concept for much of the query parser syntax. Examples from the Junit test include: checkMatches(\j* smyth~\, 1,2); //wildcards and fuzzies are OK in phrases checkMatches(\(jo* -john) smith\, 2); // boolean logic works checkMatches(\jo* smith\~2, 1,2,3); // position logic works. 
checkBadQuery(\jo* id:1 smith\); //mixing fields in a phrase is bad checkBadQuery(\jo* \smith\ \); //phrases inside phrases is bad checkBadQuery(\jo* [sma TO smZ]\ \); //range queries inside phrases not supported Code plus Junit test to follow... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries
[ https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12727685#action_12727685 ] Mark Harwood commented on LUCENE-1486: -- Hi Mark, Mind if I try committing this patch? I've just switched from PC to Mac and my dev environment is all changed (Subclipse vs TortoiseSvn etc) and I wouldn't mind checking my config and commit rights still work in this new environment. If anyone has any mac/subclipse-related gotchas I should be aware of, do let me know. Cheers Mark Wildcards, ORs etc inside Phrase queries Key: LUCENE-1486 URL: https://issues.apache.org/jira/browse/LUCENE-1486 Project: Lucene - Java Issue Type: Improvement Components: QueryParser Affects Versions: 2.4 Reporter: Mark Harwood Assignee: Mark Miller Priority: Minor Fix For: 2.9 Attachments: ComplexPhraseQueryParser.java, LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, TestComplexPhraseQuery.java An extension to the default QueryParser that overrides the parsing of PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries. The implementation feels a little hacky - this is arguably better handled in QueryParser itself. This works as a proof of concept for much of the query parser syntax. Examples from the Junit test include: checkMatches(\j* smyth~\, 1,2); //wildcards and fuzzies are OK in phrases checkMatches(\(jo* -john) smith\, 2); // boolean logic works checkMatches(\jo* smith\~2, 1,2,3); // position logic works. checkBadQuery(\jo* id:1 smith\); //mixing fields in a phrase is bad checkBadQuery(\jo* \smith\ \); //phrases inside phrases is bad checkBadQuery(\jo* [sma TO smZ]\ \); //range queries inside phrases not supported Code plus Junit test to follow... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. 
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Closed: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries
[ https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood closed LUCENE-1486. Resolution: Fixed Committed in 791579 - http://svn.apache.org/viewvc?rev=791579&view=rev Wildcards, ORs etc inside Phrase queries Key: LUCENE-1486 URL: https://issues.apache.org/jira/browse/LUCENE-1486 Project: Lucene - Java Issue Type: Improvement Components: QueryParser Affects Versions: 2.4 Reporter: Mark Harwood Assignee: Mark Harwood Priority: Minor Fix For: 2.9 Attachments: ComplexPhraseQueryParser.java, LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, TestComplexPhraseQuery.java An extension to the default QueryParser that overrides the parsing of PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries. The implementation feels a little hacky - this is arguably better handled in QueryParser itself. This works as a proof of concept for much of the query parser syntax. Examples from the Junit test include: checkMatches(\j* smyth~\, 1,2); //wildcards and fuzzies are OK in phrases checkMatches(\(jo* -john) smith\, 2); // boolean logic works checkMatches(\jo* smith\~2, 1,2,3); // position logic works. checkBadQuery(\jo* id:1 smith\); //mixing fields in a phrase is bad checkBadQuery(\jo* \smith\ \); //phrases inside phrases is bad checkBadQuery(\jo* [sma TO smZ]\ \); //range queries inside phrases not supported Code plus Junit test to follow...
[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12726064#action_12726064 ] Mark Harwood commented on LUCENE-1720: -- re points 1,2,3 - yep, will change. re the question - yes, TimeoutThread should call the existing resetFirstAnticipatedFailure() method to advance timeout monitoring immediately to the next candidate - it currently requires the first bad Thread to call stop() before monitoring is advanced to spot the next bad thread. I think a useful safety measure is to manage clients that don't call stop() (e.g. forgetting to code a finally...stop) but this is likely to add complexity to ActivityTimeMonitor so I want to get a basic version solid first before thinking too much about this. TimeLimitedIndexReader and associated utility class --- Key: LUCENE-1720 URL: https://issues.apache.org/jira/browse/LUCENE-1720 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Mark Harwood Assignee: Mark Harwood Priority: Minor Attachments: ActivityTimedOutException.java, ActivityTimeMonitor.java, TestTimeLimitedIndexReader.java, TimeLimitedIndexReader.java An alternative to TimeLimitedCollector that has the following advantages: 1) Any reader activity can be time-limited rather than just single searches e.g. the document retrieve phase. 2) Times out faster (i.e. runaway queries such as fuzzies detected quickly before last collect stage of query processing) Uses new utility timeout class that is independent of IndexReader. Initial contribution includes a performance test class but not had time as yet to work up a formal Junit test. TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12726128#action_12726128 ] Mark Harwood commented on LUCENE-1720: -- Maybe we should start by debugging some guiding principles:
1) There is a holding list of active threads that are of indeterminate status
2) There is a list of threads that are known to have timed out
3) The monitoring thread has the job of moving items from 1) to 2); it waits for firstAnticipatedTimeout and is notified if firstAnticipatedTimeout changes
4) Start() adds a thread to 1)
5) Stop() removes a thread from 1) or 2)
6) Check() throws an exception if anActivityHasTimedOut is true (for fast fail) and the current thread is in 2)
7) Any modification to 2) should set the anActivityHasTimedOut boolean flag = 2)'s size is > 0
8) Any modification to 1) should re-assess firstAnticipatedTimeout and notify 3) if changed
TimeLimitedIndexReader and associated utility class --- Key: LUCENE-1720 URL: https://issues.apache.org/jira/browse/LUCENE-1720 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Mark Harwood Assignee: Mark Harwood Priority: Minor Attachments: ActivityTimedOutException.java, ActivityTimeMonitor.java, TestTimeLimitedIndexReader.java, TimeLimitedIndexReader.java An alternative to TimeLimitedCollector that has the following advantages: 1) Any reader activity can be time-limited rather than just single searches e.g. the document retrieve phase. 2) Times out faster (i.e. runaway queries such as fuzzies detected quickly before last collect stage of query processing) Uses new utility timeout class that is independent of IndexReader. Initial contribution includes a performance test class but not had time as yet to work up a formal Junit test. TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone.
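The guiding principles listed in this comment can be sketched as a small monitor class. This is an illustration only, not the attached ActivityTimeMonitor.java: the class name ActivityMonitorSketch, the nested TimedOutException and the use of a single lock object are assumptions, and the monitor simply rescans all active threads instead of keeping a time-sorted structure.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

public class ActivityMonitorSketch {

    /** Hypothetical exception type thrown by check() for a timed-out thread. */
    public static class TimedOutException extends RuntimeException {
        TimedOutException(String msg) { super(msg); }
    }

    private static final Object lock = new Object();
    // 1) active threads of indeterminate status, mapped to their deadline (millis)
    private static final Map<Thread, Long> active = new HashMap<>();
    // 2) threads known to have timed out
    private static final Set<Thread> timedOut = new HashSet<>();
    // 7) fast-fail flag, kept equal to "timedOut is non-empty"
    private static volatile boolean anActivityHasTimedOut = false;

    static {
        Thread monitor = new Thread(ActivityMonitorSketch::monitorLoop, "activity-monitor");
        monitor.setDaemon(true);
        monitor.start();
    }

    // 3) moves overdue threads from 1) to 2), then waits for the first anticipated timeout
    private static void monitorLoop() {
        synchronized (lock) {
            while (true) {
                long now = System.currentTimeMillis();
                long firstAnticipatedTimeout = Long.MAX_VALUE;
                Iterator<Map.Entry<Thread, Long>> it = active.entrySet().iterator();
                while (it.hasNext()) {
                    Map.Entry<Thread, Long> e = it.next();
                    Thread t = e.getKey();
                    long deadline = e.getValue();
                    if (deadline <= now) {
                        it.remove();
                        timedOut.add(t);
                    } else {
                        firstAnticipatedTimeout = Math.min(firstAnticipatedTimeout, deadline);
                    }
                }
                anActivityHasTimedOut = !timedOut.isEmpty();
                try {
                    if (firstAnticipatedTimeout == Long.MAX_VALUE) {
                        lock.wait();                               // nothing to time
                    } else {
                        lock.wait(firstAnticipatedTimeout - now);  // always > 0 here
                    }
                } catch (InterruptedException ie) {
                    return;
                }
            }
        }
    }

    /** 4) adds the current thread to the active list; 8) wakes the monitor. */
    public static void start(long timeoutMillis) {
        synchronized (lock) {
            active.put(Thread.currentThread(), System.currentTimeMillis() + timeoutMillis);
            lock.notifyAll();
        }
    }

    /** 5) removes the current thread from either list. */
    public static void stop() {
        synchronized (lock) {
            Thread t = Thread.currentThread();
            active.remove(t);
            timedOut.remove(t);
            anActivityHasTimedOut = !timedOut.isEmpty();
            lock.notifyAll();
        }
    }

    /** 6) cheap volatile read first; the lookup happens only if something timed out. */
    public static void check() {
        if (!anActivityHasTimedOut) {
            return;
        }
        synchronized (lock) {
            if (timedOut.contains(Thread.currentThread())) {
                throw new TimedOutException("activity timed out");
            }
        }
    }
}
```

A client would call start(...) before the timed work, call check() at convenient points inside it, and call stop() in a finally block so that the thread is always removed from both lists.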
[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12725500#action_12725500 ] Mark Harwood commented on LUCENE-1720: -- bq. Maybe we can benchmark this approach See http://www.nabble.com/Improving-TimeLimitedCollector-td24174758.html#a24229185 The figures were produced by TestTimeLimitedIndexReader that is part of this Jira issue so you can try benchmarks on your own indexes. bq.if it slows down queries due to the Thread.currentThread and hash lookup This lookup only happens when threads start or stop timed activities and when there is a timed out state - all other method invocations on TimeLimitedIndexReader e.g. termDocs.next() are simply testing a volatile boolean which is used to indicate if any timeout has occurred. This seems to be fast in my benchmarks. bq. maybe we can .. change the Lucene API such that we pass in an argument to the IndexReader methods where the timeout may be checked The current design uses static methods, which remove the need to pass a timeout object as context everywhere, but this approach comes with the downside that a single client thread is unable to time more than 1 activity at once, which we thought was a reasonable trade-off. See http://www.nabble.com/Re%3A-Improving-TimeLimitedCollector-p24234976.html TimeLimitedIndexReader and associated utility class --- Key: LUCENE-1720 URL: https://issues.apache.org/jira/browse/LUCENE-1720 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Mark Harwood Assignee: Mark Harwood Priority: Minor Attachments: ActivityTimedOutException.java, ActivityTimeMonitor.java, TestTimeLimitedIndexReader.java, TimeLimitedIndexReader.java An alternative to TimeLimitedCollector that has the following advantages: 1) Any reader activity can be time-limited rather than just single searches e.g. the document retrieve phase. 2) Times out faster (i.e. 
runaway queries such as fuzzies detected quickly before last collect stage of query processing) Uses new utility timeout class that is independent of IndexReader. Initial contribution includes a performance test class but not had time as yet to work up a formal Junit test. TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
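The fast path described in this comment (a volatile boolean tested on every call, with the per-thread lookup reserved for the rare timed-out state) might look roughly like the sketch below. It wraps a plain java.util.Iterator rather than a Lucene reader, and the names FastPathCheckSketch, timedOutThreads and timeLimited, as well as the use of IllegalStateException, are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class FastPathCheckSketch {

    /** Set to true by monitoring code (not shown) when any timed activity has overrun. */
    static volatile boolean anActivityHasTimedOut = false;

    /** Threads whose activity has timed out; consulted only on the slow path. */
    static final Set<Thread> timedOutThreads = ConcurrentHashMap.newKeySet();

    static void check() {
        if (!anActivityHasTimedOut) {
            return; // common case: a single volatile read, no Thread lookup
        }
        if (timedOutThreads.contains(Thread.currentThread())) {
            throw new IllegalStateException("activity timed out");
        }
    }

    /** Wraps an iterator so that every call checks for a timeout first. */
    static <T> Iterator<T> timeLimited(final Iterator<T> in) {
        return new Iterator<T>() {
            @Override public boolean hasNext() { check(); return in.hasNext(); }
            @Override public T next() { check(); return in.next(); }
        };
    }

    public static void main(String[] args) {
        Iterator<String> it = timeLimited(Arrays.asList("a", "b", "c").iterator());
        System.out.println(it.next()); // no timeout yet, so the fast path is taken

        // Simulate the monitor flagging this thread as timed out.
        timedOutThreads.add(Thread.currentThread());
        anActivityHasTimedOut = true;
        try {
            it.next();
        } catch (IllegalStateException e) {
            System.out.println("stopped: " + e.getMessage());
        }
    }
}
```

Because the state is static and keyed by thread, a thread can have only one timed activity at a time, which is the trade-off mentioned above.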
[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12725741#action_12725741 ] Mark Harwood commented on LUCENE-1720: -- bq. Ah, so we're assuming most actions don't timeout Yes, that's it. bq. (I'll volunteer to do the latter). Cool. I'll work on tidying up the classes under test as per comments earlier . TimeLimitedIndexReader and associated utility class --- Key: LUCENE-1720 URL: https://issues.apache.org/jira/browse/LUCENE-1720 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Mark Harwood Assignee: Mark Harwood Priority: Minor Attachments: ActivityTimedOutException.java, ActivityTimeMonitor.java, TestTimeLimitedIndexReader.java, TimeLimitedIndexReader.java An alternative to TimeLimitedCollector that has the following advantages: 1) Any reader activity can be time-limited rather than just single searches e.g. the document retrieve phase. 2) Times out faster (i.e. runaway queries such as fuzzies detected quickly before last collect stage of query processing) Uses new utility timeout class that is independent of IndexReader. Initial contribution includes a performance test class but not had time as yet to work up a formal Junit test. TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-1720: - Attachment: (was: ActivityTimeMonitor.java)
[jira] Updated: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-1720: - Attachment: ActivityTimeMonitor.java Updated to allow more than one simultaneous timeout error to be handled.
[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12725164#action_12725164 ] Mark Harwood commented on LUCENE-1720: -- Currently the class hinges on a fast-fail mechanism whereby all the many calls checking for a timeout are very quickly testing a single volatile boolean, anActivityHasTimedOut. 99.99% of calls are expected to fail this test (nothing has timed out) and fail quickly - I was reluctant to add any hashset lookup etc in there needed to determine failure. With that as a guiding principle maybe the solution is to change volatile boolean anActivityHasTimedOut into volatile int numberOfTimedOutThreads; which would cater for more than one error condition at once. The fast-fail check then becomes:
    if (numberOfTimedOutThreads > 0) {
        if (timedoutThreads.contains(Thread.currentThread())) {
            timedoutThreads.remove(Thread.currentThread());
            numberOfTimedOutThreads = timedoutThreads.size();
            throw new RuntimeException();
        }
    }
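The fast-fail check described in the comment above can be sketched as a small standalone class. This is only an illustration under stated assumptions: the class name TimeoutMonitorSketch and the methods markTimedOut and checkForTimeout are hypothetical and are not taken from the attached ActivityTimeMonitor.java, and the sketch uses ConcurrentHashMap.newKeySet() (Java 8+) rather than the JDK 1.5 level mentioned in the issue.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of the fast-fail timeout check; not the attached ActivityTimeMonitor. */
final class TimeoutMonitorSketch {
    // Threads whose activity has overrun its deadline (assumed to be filled in by a timer thread).
    private static final Set<Thread> TIMED_OUT_THREADS = ConcurrentHashMap.newKeySet();

    // Zero in the common case, so the hot path costs a single volatile read.
    private static volatile int numberOfTimedOutThreads = 0;

    /** Records that the given thread has timed out. */
    static synchronized void markTimedOut(Thread t) {
        TIMED_OUT_THREADS.add(t);
        numberOfTimedOutThreads = TIMED_OUT_THREADS.size();
    }

    /** Called from every time-limited operation; returns quickly when nothing has timed out. */
    static void checkForTimeout() {
        if (numberOfTimedOutThreads > 0) {
            slowCheck();
        }
    }

    // Only reached when at least one thread has timed out, so the set lookup is off the hot path.
    private static synchronized void slowCheck() {
        if (TIMED_OUT_THREADS.remove(Thread.currentThread())) {
            // The counter drops back to zero once the set is empty, which clears the fast-path flag.
            numberOfTimedOutThreads = TIMED_OUT_THREADS.size();
            throw new RuntimeException("activity timed out");
        }
    }

    public static void main(String[] args) {
        checkForTimeout(); // nothing has timed out yet, so this returns normally
        markTimedOut(Thread.currentThread());
        try {
            checkForTimeout();
            System.out.println("no timeout");
        } catch (RuntimeException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

Keeping the set lookup behind the counter test means threads that have not timed out never touch the set, which is the property the comment is trying to preserve.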
[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12725176#action_12725176 ] Mark Harwood commented on LUCENE-1720: -- bq. Oh, I did not mean to skip this check. But the check is on a variable with a yes/no state. We need to cater for more than one simultaneous timeout error condition in play. With only a boolean it could be hard to know precisely when to clear it, no? bq. Mark here wanted to provide a much more generalized way of stopping any other activity, not just search To be fair I think the use case for IndexWriter is weaker. In reader you have multiple users all expressing different queries and you want them all to share nicely with each other. In index writing it's typically a batch system indexing docs and there's no fairness to mediate? Breaking it out into a utility class seems like a good idea anyway.
[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12725197#action_12725197 ] Mark Harwood commented on LUCENE-1720: -- bq. any custom Scorer which does a lot of work, but uses IndexReader for that, will be stopped, even if the Scorer's developer did not implement a Timeout mechanism. Right? Correct. I'm not familiar with the proposal to pass around a Timeout object but I get the idea and the code here would certainly avoid that overhead. bq. We can clear it when the timed-out threads' Set's size() is 0? Yes, that would work.
[jira] Created: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
TimeLimitedIndexReader and associated utility class --- Key: LUCENE-1720 URL: https://issues.apache.org/jira/browse/LUCENE-1720 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Mark Harwood Assignee: Mark Harwood Priority: Minor An alternative to TimeLimitedCollector that has the following advantages: 1) Any reader activity can be time-limited rather than just single searches e.g. the document retrieve phase. 2) Times out faster (i.e. runaway queries such as fuzzies detected quickly before last collect stage of query processing) Uses new utility timeout class that is independent of IndexReader. Initial contribution includes a performance test class but not had time as yet to work up a formal Junit test. TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-1720: - Attachment: ActivityTimedOutException.java
[jira] Updated: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-1720: - Attachment: TimeLimitedIndexReader.java TestTimeLimitedIndexReader.java ActivityTimeMonitor.java
[jira] Updated: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries
[ https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-1486: - Attachment: LUCENE-1486.patch Added fix for ConstantScoreQuery changes Wildcards, ORs etc inside Phrase queries Key: LUCENE-1486 URL: https://issues.apache.org/jira/browse/LUCENE-1486 Project: Lucene - Java Issue Type: Improvement Components: QueryParser Affects Versions: 2.4 Reporter: Mark Harwood Assignee: Mark Miller Priority: Minor Fix For: 2.9 Attachments: ComplexPhraseQueryParser.java, LUCENE-1486.patch, LUCENE-1486.patch, TestComplexPhraseQuery.java An extension to the default QueryParser that overrides the parsing of PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries. The implementation feels a little hacky - this is arguably better handled in QueryParser itself. This works as a proof of concept for much of the query parser syntax. Examples from the Junit test include:
    checkMatches("\"j* smyth~\"", "1,2"); //wildcards and fuzzies are OK in phrases
    checkMatches("\"(jo* -john) smith\"", "2"); // boolean logic works
    checkMatches("\"jo* smith\"~2", "1,2,3"); // position logic works.
    checkBadQuery("\"jo* id:1 smith\""); //mixing fields in a phrase is bad
    checkBadQuery("\"jo* \"smith\" \""); //phrases inside phrases is bad
    checkBadQuery("\"jo* [sma TO smZ]\" \""); //range queries inside phrases not supported
Code plus Junit test to follow...
[jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries
[ https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12723742#action_12723742 ] Mark Harwood commented on LUCENE-1486: -- The fix was relatively straight-forward from what I could see. Just temporarily unset the QueryParser's ConstantScoreRewrite mode when performing the pass that is just evaluating query elements inside phrase queries. These clauses need to resolve to traditional BooleanQuery-full-of-termQueries in order that they can be inspected and rewritten as Span equivalents for complex phrases. Should do the job. Cheers Mark (Been far too busy with other things and missing getting my hands dirty here with Lucene!)
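The "temporarily unset, then restore" step described in the comment above follows a common save-and-restore pattern. The sketch below shows that pattern in plain Java only; RewriteFlagSketch, its constantScoreRewrite field, and the expand and parsePhraseInternals methods are hypothetical stand-ins, not the real QueryParser API or the code in the attached patch.

```java
/** Hypothetical stand-in for a parser with a rewrite-mode flag; not Lucene's QueryParser. */
final class RewriteFlagSketch {
    // Assumed flag: when true, multi-term clauses are turned into constant-score queries.
    static boolean constantScoreRewrite = true;

    /** Toy expansion of a clause whose output shape depends on the current flag. */
    static String expand(String clause) {
        return constantScoreRewrite
                ? "ConstantScore(" + clause + ")"
                : "BooleanQuery(" + clause + ")";
    }

    /** Evaluates the contents of a phrase with the flag switched off, then restores it. */
    static String parsePhraseInternals(String phraseContents) {
        boolean saved = constantScoreRewrite;
        constantScoreRewrite = false;
        try {
            // Clauses resolve to the inspectable BooleanQuery-style form during this pass.
            return expand(phraseContents);
        } finally {
            // Restore the caller's setting even if expansion throws.
            constantScoreRewrite = saved;
        }
    }

    public static void main(String[] args) {
        System.out.println(parsePhraseInternals("jo* smith"));
        System.out.println(expand("jo*"));
    }
}
```

The try/finally block is the design point: the flag is put back whether or not the inner pass fails, so queries outside phrases keep the caller's rewrite mode.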
[jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries
[ https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719115#action_12719115 ] Mark Harwood commented on LUCENE-1486: -- The primary reason (and perhaps not a particularly good one) was I didn't want to wade around in the Javacc syntax of the .jj file that generates the QueryParser and the required extensions could be made in a subclass. Also there is invariably a performance hit for supporting things like wildcards in phrase queries so rather than adding another off by default flag in the main parser and conditional logic to test if wildcards etc in phrases are allowed, the subclass could be seen as a specialised extension that is to be used by those that understand the trade-offs between functionality and performance. I can sympathise with the purist approach of having all parser syntax defined in Javacc though.
[jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries
[ https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12718573#action_12718573 ] Mark Harwood commented on LUCENE-1486: -- Perhaps hacky was too strong a word. I think it's a reasonable approach to handling the complexity involved in this logic. A colleague of mine has this running in production on a big installation with lots of users
[jira] Commented: (LUCENE-1621) deprecate term and getTerm in MultiTermQuery
[ https://issues.apache.org/jira/browse/LUCENE-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703733#action_12703733 ] Mark Harwood commented on LUCENE-1621: -- While we're poking around in this area I'd like to point out the long-standing open issue in LUCENE-329. Matching Smyth over Smith when doing a search for Smith~ is just plain broken but this is what I see all the time with FuzzyQuery and its default approach to IDF. I think we need to take the sort of logic in contrib's FuzzyLikeThisQuery to address this. deprecate term and getTerm in MultiTermQuery Key: LUCENE-1621 URL: https://issues.apache.org/jira/browse/LUCENE-1621 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Mark Miller Assignee: Mark Miller Priority: Minor Fix For: 2.9 Attachments: LUCENE-1621.patch This means moving getTerm and term up to sub classes as appropriate and reimplementing equals, hashcode as appropriate in sub classes.
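The IDF effect mentioned in the comment above can be illustrated with a small calculation. This is a sketch that assumes the classic DefaultSimilarity formula, idf = 1 + ln(numDocs / (docFreq + 1)), as I understand it; the index size and document frequencies are invented for illustration, and the class name FuzzyIdfSketch is hypothetical.

```java
/** Hypothetical worked example of why a rare fuzzy variant can outrank the common spelling. */
final class FuzzyIdfSketch {
    /** Assumed classic idf formula: 1 + ln(numDocs / (docFreq + 1)). */
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    public static void main(String[] args) {
        int numDocs = 1_000_000; // invented index size
        int dfSmith = 10_000;    // invented: the common spelling appears in many documents
        int dfSmyth = 100;       // invented: the rare variant appears in few documents

        // The rarer term has the larger idf, so with per-term idf weighting a
        // "smyth" match can score above a "smith" match for the query smith~.
        System.out.printf("idf(smith) = %.3f%n", idf(dfSmith, numDocs));
        System.out.printf("idf(smyth) = %.3f%n", idf(dfSmyth, numDocs));
    }
}
```

Because idf grows as document frequency shrinks, each expanded term is rewarded for being rare, which is the behaviour the comment calls broken for fuzzy matches.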
[jira] Resolved: (LUCENE-1500) Highlighter throws StringIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/LUCENE-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood resolved LUCENE-1500. -- Resolution: Fixed Assignee: Mark Harwood (was: Mark Harwood) Committed in revision: 758460 Highlighter throws StringIndexOutOfBoundsException -- Key: LUCENE-1500 URL: https://issues.apache.org/jira/browse/LUCENE-1500 Project: Lucene - Java Issue Type: Bug Components: contrib/highlighter Affects Versions: 2.4 Environment: Found this running the example code in Solr (latest version). Reporter: David Bowen Assignee: Mark Harwood Fix For: 2.9 Attachments: Lucene-1500-NewException.patch, Lucene-1500-NewException.patch, LUCENE-1500.patch, patch.txt Using the canonical Solr example (ant run-example) I added this document (using exampledocs/post.sh): <add><doc> <field name="id">Test for Highlighting StringIndexOutOfBoundsExcdption</field> <field name="name">Some Name</field> <field name="manu">Acme, Inc.</field> <field name="features">Description of the features, mentioning various things</field> <field name="features">Features also is multivalued</field> <field name="popularity">6</field> <field name="inStock">true</field> </doc></add> and then the URL http://localhost:8983/solr/select/?q=features&hl=true&hl.fl=features caused the exception. I have a patch. I don't know if it is completely correct, but it avoids this exception.
[jira] Updated: (LUCENE-1500) Highlighter throws StringIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/LUCENE-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-1500: - Attachment: Lucene-1500-NewException.patch With updated Apache license header. I'll commit soon if no objections
[jira] Commented: (LUCENE-1559) Highlighting not working in some instances even though indexsearcher returns result.
[ https://issues.apache.org/jira/browse/LUCENE-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12681287#action_12681287 ] Mark Harwood commented on LUCENE-1559: -- Sorry to be picky but can you submit a self-contained test with no external dependencies other than Lucene+Highlighter+JUnit? I don't want POI versions to be a factor here. Cheers Mark Highlighting not working in some instances even though indexsearcher returns result. Key: LUCENE-1559 URL: https://issues.apache.org/jira/browse/LUCENE-1559 Project: Lucene - Java Issue Type: Bug Affects Versions: 2.4 Environment: Mac OS 1.5 Eclipse 3.4 Reporter: Amin Mohammed-Coleman Attachments: AJiA CH 02.doc, HighLightingSummaryTest.java In some instances highlighting does not return a result. However, when you use a different term for the same document you get results. Please see the attached test case and template file.
[jira] Commented: (LUCENE-1559) Highlighting not working in some instances even though indexsearcher returns result.
[ https://issues.apache.org/jira/browse/LUCENE-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12681323#action_12681323 ] Mark Harwood commented on LUCENE-1559: -- Your code still imports POI and is now importing a .DOC file without parsing, producing garbage. You'll need to supply an example Junit which illustrates this problem with plain text before we can look at it. You should be able to turn the .Doc into text at your end using POI and then supply the file. Are you sure there isn't a problem with POI failing to parse the file correctly?
[jira] Commented: (LUCENE-1559) Highlighting not working in some instances even though indexsearcher returns result.
[ https://issues.apache.org/jira/browse/LUCENE-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12681336#action_12681336 ] Mark Harwood commented on LUCENE-1559: -- Can I close this then as it appears to be an issue with your parser, not Lucene?
[jira] Commented: (LUCENE-1559) Highlighting not working in some instances even though indexsearcher returns result.
[ https://issues.apache.org/jira/browse/LUCENE-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12681344#action_12681344 ] Mark Harwood commented on LUCENE-1559: -- bq. Sorry...I don't know what I should do at this stage Give us a JUnit example of your problem code when working with plain text (not PDF, Word or .doc) that clearly demonstrates where Lucene fails to index, search or highlight this text correctly.
[jira] Commented: (LUCENE-1559) Highlighting not working in some instances even though indexsearcher returns result.
[ https://issues.apache.org/jira/browse/LUCENE-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12681466#action_12681466 ] Mark Harwood commented on LUCENE-1559: -- I ran a quick test and I don't think I could see "document" in the Token.termText() of any tokens in the TokenStream you provide to the Highlighter. It's late and I need to be elsewhere but if you have time to pursue this, check that the above statement is true. If so, check that the body text retrieved from Document.get("body") in the search results is the same as the String you store at index time (just in case the act of storing/retrieving has altered the text somehow). Will look into this more later.
[jira] Commented: (LUCENE-1559) Highlighting not working in some instances even though indexsearcher returns result.
[ https://issues.apache.org/jira/browse/LUCENE-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681507#action_12681507 ] Mark Harwood commented on LUCENE-1559: -- Ah. Try setting this: highlighter.setMaxDocCharsToAnalyze(Integer.MAX_VALUE); Highlighting not working in some instances even though indexsearcher returns result. Key: LUCENE-1559 URL: https://issues.apache.org/jira/browse/LUCENE-1559 Project: Lucene - Java Issue Type: Bug Affects Versions: 2.4 Environment: Mac OS 1.5 Eclipse 3.4 Reporter: Amin Mohammed-Coleman Attachments: AJiA CH 02.doc, fileToSearch.txt, HighLightingSummaryTest(2).java, HighLightingSummaryTest.java, HighLightingSummaryTestV3.java
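Why a cap on analyzed characters can hide a hit that the searcher still finds may be easier to see in a toy model. The sketch below is not the Lucene Highlighter; it assumes a trivial letter-or-digit tokenizer and a hypothetical class name (AnalysisLimitSketch), and it only illustrates the idea that tokens starting at or beyond the limit are never examined, so terms that occur late in a long document come back unhighlighted.

```java
/** Toy model of an analysis cap; NOT the Lucene Highlighter implementation. */
class AnalysisLimitSketch {

    /**
     * Wraps each occurrence of term in <B>..</B>, but stops analyzing as soon as
     * a token starts at or beyond maxCharsToAnalyze. Text past that point is
     * copied through untouched.
     */
    static String highlight(String text, String term, int maxCharsToAnalyze) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (pos < text.length()) {
            // Find the start of the next token (skip separators).
            int start = pos;
            while (start < text.length() && !Character.isLetterOrDigit(text.charAt(start))) {
                start++;
            }
            if (start >= maxCharsToAnalyze) {
                break; // cap reached: the rest of the text is never analyzed
            }
            // Find the end of the token.
            int end = start;
            while (end < text.length() && Character.isLetterOrDigit(text.charAt(end))) {
                end++;
            }
            out.append(text, pos, start); // separators are copied unchanged
            String token = text.substring(start, end);
            if (token.equalsIgnoreCase(term)) {
                out.append("<B>").append(token).append("</B>");
            } else {
                out.append(token);
            }
            pos = end;
        }
        out.append(text.substring(pos)); // unanalyzed remainder
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "alpha beta gamma beta";
        System.out.println(highlight(text, "beta", 8));
        System.out.println(highlight(text, "beta", Integer.MAX_VALUE));
    }
}
```

With a limit of 8 only the first "beta" is marked; with Integer.MAX_VALUE both are. The trade-off is the one the limit exists for: lifting it means analyzing the whole document on every highlight call.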
[jira] Closed: (LUCENE-1559) Highlighting not working in some instances even though indexsearcher returns result.
[ https://issues.apache.org/jira/browse/LUCENE-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood closed LUCENE-1559. Resolution: Invalid Working as designed with feature designed to prevent too-costly analysis Highlighting not working in some instances even though indexsearcher returns result. Key: LUCENE-1559 URL: https://issues.apache.org/jira/browse/LUCENE-1559 Project: Lucene - Java Issue Type: Bug Affects Versions: 2.4 Environment: Mac OS 1.5 Eclipse 3.4 Reporter: Amin Mohammed-Coleman Attachments: AJiA CH 02.doc, fileToSearch.txt, HighLightingSummaryTest(2).java, HighLightingSummaryTest.java, HighLightingSummaryTestV3.java
[jira] Commented: (LUCENE-1522) another highlighter
[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681531#action_12681531 ] Mark Harwood commented on LUCENE-1522: -- I'm guessing that's not an issue given it uses stored TermVectors rather than re-analyzing? At some stage I hope to take a closer look at this contribution. I'd be interested to see if all the Highlighter1 JUnit tests could be adapted to work with Highlighter2 and get some comparative benchmarks. another highlighter --- Key: LUCENE-1522 URL: https://issues.apache.org/jira/browse/LUCENE-1522 Project: Lucene - Java Issue Type: Improvement Components: contrib/highlighter Reporter: Koji Sekiguchi Assignee: Michael McCandless Priority: Minor Fix For: 2.9 Attachments: colored-tag-sample.png, LUCENE-1522.patch, LUCENE-1522.patch
I've written this highlighter for my project to support bi-gram token streams (general token streams, e.g. WhitespaceTokenizer, are also supported; see the test code in the patch). The idea was inherited from my previous project with my colleague and LUCENE-644. This approach needs highlight fields to be TermVector.WITH_POSITIONS_OFFSETS, but is fast and can support N-grams. This depends on LUCENE-1448 to get refined term offsets.
usage:
{code:java}
TopDocs docs = searcher.search( query, 10 );
Highlighter h = new Highlighter();
FieldQuery fq = h.getFieldQuery( query );
for( ScoreDoc scoreDoc : docs.scoreDocs ){
  // fieldName=content, fragCharSize=100, numFragments=3
  String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc, "content", 100, 3 );
  if( fragments != null ){
    for( String fragment : fragments )
      System.out.println( fragment );
  }
}
{code}
features:
- fast for large docs
- supports not only whitespace-based token streams, but also fixed size N-grams (e.g. (2,2), not (1,3)) (can solve LUCENE-1489)
- supports PhraseQuery, phrase-unit highlighting with slops
{noformat}
q="w1 w2"
<b>w1 w2</b>
---
q="w1 w2"~1
<b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
{noformat}
- highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
- easy to apply patch due to independent package (contrib/highlighter2)
- uses Java 1.5
- looks at query boost to score fragments (currently doesn't see idf, but it should be possible)
- pluggable FragListBuilder
- pluggable FragmentsBuilder
to do:
- term positions can be unnecessary when phraseHighlight==false
- collect performance numbers
[jira] Commented: (LUCENE-1500) Highlighter throws StringIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/LUCENE-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12680125#action_12680125 ] Mark Harwood commented on LUCENE-1500: -- Will submit a new patch tonight. Highlighter throws StringIndexOutOfBoundsException -- Key: LUCENE-1500 URL: https://issues.apache.org/jira/browse/LUCENE-1500 Project: Lucene - Java Issue Type: Bug Components: contrib/highlighter Affects Versions: 2.4 Environment: Found this running the example code in Solr (latest version). Reporter: David Bowen Assignee: Mark Harwood Fix For: 2.9 Attachments: Lucene-1500-NewException.patch, LUCENE-1500.patch, patch.txt
Using the canonical Solr example (ant run-example) I added this document (using exampledocs/post.sh):
<add><doc>
<field name="id">Test for Highlighting StringIndexOutOfBoundsExcdption</field>
<field name="name">Some Name</field>
<field name="manu">Acme, Inc.</field>
<field name="features">Description of the features, mentioning various things</field>
<field name="features">Features also is multivalued</field>
<field name="popularity">6</field>
<field name="inStock">true</field>
</doc></add>
and then the URL http://localhost:8983/solr/select/?q=features&hl=true&hl.fl=features caused the exception. I have a patch. I don't know if it is completely correct, but it avoids this exception.
[jira] Updated: (LUCENE-1500) Highlighter throws StringIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/LUCENE-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-1500: - Attachment: (was: Lucene-1500-NewException.patch) Highlighter throws StringIndexOutOfBoundsException -- Key: LUCENE-1500 URL: https://issues.apache.org/jira/browse/LUCENE-1500 Project: Lucene - Java Issue Type: Bug Components: contrib/highlighter Affects Versions: 2.4 Environment: Found this running the example code in Solr (latest version). Reporter: David Bowen Assignee: Mark Harwood Fix For: 2.9 Attachments: LUCENE-1500.patch, patch.txt
[jira] Updated: (LUCENE-1500) Highlighter throws StringIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/LUCENE-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-1500: - Attachment: Lucene-1500-NewException.patch Added support for testing whether either the Token start or end offset > text.length. Added javadoc comments for the new exception. Highlighter throws StringIndexOutOfBoundsException -- Key: LUCENE-1500 URL: https://issues.apache.org/jira/browse/LUCENE-1500 Project: Lucene - Java Issue Type: Bug Components: contrib/highlighter Affects Versions: 2.4 Environment: Found this running the example code in Solr (latest version). Reporter: David Bowen Assignee: Mark Harwood Fix For: 2.9 Attachments: Lucene-1500-NewException.patch, LUCENE-1500.patch, patch.txt
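The kind of check this patch note describes can be sketched roughly as follows. This is not the code from Lucene-1500-NewException.patch: the class, method and exception names (OffsetCheckSketch, tokenText, BadTokenOffsetsException) are made up for illustration, and the only assumption carried over from the thread is that a token whose start or end offset lies beyond the supplied text should be reported through a checked exception rather than surfacing as a StringIndexOutOfBoundsException.

```java
/** Illustrative only; names here are hypothetical, not the patch's API. */
class OffsetCheckSketch {

    /** Hypothetical checked exception standing in for the one the patch adds. */
    static class BadTokenOffsetsException extends Exception {
        BadTokenOffsetsException(String message) {
            super(message);
        }
    }

    /**
     * Returns the slice of text covered by a token, or reports offsets that
     * point past the end of the text with a descriptive checked exception.
     * Tokens whose end precedes their start are NOT checked here; the thread
     * treats that as a separate question.
     */
    static String tokenText(String text, int startOffset, int endOffset)
            throws BadTokenOffsetsException {
        if (startOffset > text.length() || endOffset > text.length()) {
            throw new BadTokenOffsetsException(
                "Token offsets [" + startOffset + "," + endOffset
                + "] exceed length of provided text (" + text.length() + ")");
        }
        return text.substring(startOffset, endOffset);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(tokenText("hello world", 6, 11));
        try {
            tokenText("hello", 3, 9);
        } catch (BadTokenOffsetsException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Because the exception is checked, every caller has to handle or declare it, which is the knock-on effect on client code that the thread weighs against a plain RuntimeException.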
[jira] Commented: (LUCENE-1500) Highlighter throws StringIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/LUCENE-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12677956#action_12677956 ] Mark Harwood commented on LUCENE-1500: -- My thoughts were that this exception solely traps inconsistencies with Tokens in relation to a particular provided chunk of text. I think internal inconsistencies within a Token (e.g. endOffset < startOffset) should ideally be handled by Token (throwing something like an IllegalArgumentException in its constructor). I guess an open question there is: can startOffset=endOffset in a Token? Either way, String.substring simply returns an empty string so I think that's probably OK in highlighter. Highlighter throws StringIndexOutOfBoundsException -- Key: LUCENE-1500 URL: https://issues.apache.org/jira/browse/LUCENE-1500 Project: Lucene - Java Issue Type: Bug Components: contrib/highlighter Affects Versions: 2.4 Environment: Found this running the example code in Solr (latest version). Reporter: David Bowen Assignee: Michael McCandless Fix For: 2.4.1, 2.9 Attachments: Lucene-1500-NewException.patch, LUCENE-1500.patch, patch.txt
[jira] Commented: (LUCENE-1500) Highlighter throws StringIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/LUCENE-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12677968#action_12677968 ] Mark Harwood commented on LUCENE-1500: -- Isn't your example predicated on being given an invalid Token with end<start? What did you think of my suggestion to fix this problem at its source - i.e. Token should never be in a state with end<start in the first place? Achieving this goal is complicated by the fact that offsets are not only set in the constructor - there are independent set methods for start and end offsets which can be called in any order. One solution would be to deprecate Token.setStartOffset and Token.setEndOffset and replace them with a Token.setExtent(int startOffset, int endOffset) with the appropriate checks. Highlighter throws StringIndexOutOfBoundsException -- Key: LUCENE-1500 URL: https://issues.apache.org/jira/browse/LUCENE-1500 Project: Lucene - Java Issue Type: Bug Components: contrib/highlighter Affects Versions: 2.4 Environment: Found this running the example code in Solr (latest version). Reporter: David Bowen Assignee: Michael McCandless Fix For: 2.9 Attachments: Lucene-1500-NewException.patch, LUCENE-1500.patch, patch.txt
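The single-setter idea floated in this comment could look something like the sketch below. It is a stand-alone illustration, not Lucene's Token class: ExtentSketch is a hypothetical name, and the choice to allow startOffset == endOffset is an assumption, since the thread leaves that question open. Setting both offsets in one call means the pair is validated together, so no call order can leave the object with end < start.

```java
/** Stand-alone illustration of a combined offset setter; not Lucene's Token. */
class ExtentSketch {
    private int startOffset;
    private int endOffset;

    /**
     * Sets both offsets atomically. Rejects negative starts and end < start;
     * an empty extent (start == end) is allowed here by assumption.
     */
    void setExtent(int startOffset, int endOffset) {
        if (startOffset < 0 || endOffset < startOffset) {
            throw new IllegalArgumentException(
                "Invalid extent: startOffset=" + startOffset + ", endOffset=" + endOffset);
        }
        // Assign only after validation so a rejected call leaves state untouched.
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }

    int startOffset() {
        return startOffset;
    }

    int endOffset() {
        return endOffset;
    }

    public static void main(String[] args) {
        ExtentSketch span = new ExtentSketch();
        span.setExtent(3, 7);
        System.out.println(span.startOffset() + ".." + span.endOffset());
        try {
            span.setExtent(5, 2);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

With two independent setters the same guarantee is hard to give, because any intermediate state between the two calls may be temporarily inconsistent.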
[jira] Commented: (LUCENE-1500) Highlighter throws StringIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/LUCENE-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12677991#action_12677991 ] Mark Harwood commented on LUCENE-1500: -- I struggle to see why endOffset<startOffset should ever be acceptable, but I also share your concerns about the disruption of changing the Token API to enforce this. So, I'll add code to the patch to check for bad startOffsets too. If we had more points of use for Token offsets outside of highlighting I'd be more concerned, but things being the way they are this seems like the most pragmatic option. Highlighter throws StringIndexOutOfBoundsException -- Key: LUCENE-1500 URL: https://issues.apache.org/jira/browse/LUCENE-1500 Project: Lucene - Java Issue Type: Bug Components: contrib/highlighter Affects Versions: 2.4 Environment: Found this running the example code in Solr (latest version). Reporter: David Bowen Assignee: Michael McCandless Fix For: 2.9 Attachments: Lucene-1500-NewException.patch, LUCENE-1500.patch, patch.txt
[jira] Updated: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries
[ https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-1486: - Attachment: (was: ComplexPhraseQueryParser.java) Wildcards, ORs etc inside Phrase queries Key: LUCENE-1486 URL: https://issues.apache.org/jira/browse/LUCENE-1486 Project: Lucene - Java Issue Type: Improvement Components: QueryParser Affects Versions: 2.4 Reporter: Mark Harwood Priority: Minor Fix For: 2.9 Attachments: ComplexPhraseQueryParser.java
An extension to the default QueryParser that overrides the parsing of PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries. The implementation feels a little hacky - this is arguably better handled in QueryParser itself. This works as a proof of concept for much of the query parser syntax. Examples from the JUnit test include:
checkMatches("\"j* smyth~\"", "1,2"); //wildcards and fuzzies are OK in phrases
checkMatches("\"(jo* -john) smith\"", "2"); // boolean logic works
checkMatches("\"jo* smith\"~2", "1,2,3"); // position logic works.
checkBadQuery("\"jo* id:1 smith\""); //mixing fields in a phrase is bad
checkBadQuery("\"jo* \"smith\" \""); //phrases inside phrases is bad
checkBadQuery("\"jo* [sma TO smZ]\" \""); //range queries inside phrases not supported
Code plus JUnit test to follow...
[jira] Updated: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries
[ https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-1486: - Attachment: ComplexPhraseQueryParser.java Updated to cater for phrase clauses that produce no matches Wildcards, ORs etc inside Phrase queries Key: LUCENE-1486 URL: https://issues.apache.org/jira/browse/LUCENE-1486 Project: Lucene - Java Issue Type: Improvement Components: QueryParser Affects Versions: 2.4 Reporter: Mark Harwood Priority: Minor Fix For: 2.9 Attachments: ComplexPhraseQueryParser.java
[jira] Updated: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries
[ https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-1486: - Attachment: (was: TestComplexPhraseQuery.java) Wildcards, ORs etc inside Phrase queries Key: LUCENE-1486 URL: https://issues.apache.org/jira/browse/LUCENE-1486 Project: Lucene - Java Issue Type: Improvement Components: QueryParser Affects Versions: 2.4 Reporter: Mark Harwood Priority: Minor Fix For: 2.9 Attachments: ComplexPhraseQueryParser.java
[jira] Updated: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries
[ https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-1486: - Attachment: TestComplexPhraseQuery.java Updated JUnit test to test for phrases with clauses that produce no matches Wildcards, ORs etc inside Phrase queries Key: LUCENE-1486 URL: https://issues.apache.org/jira/browse/LUCENE-1486 Project: Lucene - Java Issue Type: Improvement Components: QueryParser Affects Versions: 2.4 Reporter: Mark Harwood Priority: Minor Fix For: 2.9 Attachments: ComplexPhraseQueryParser.java, TestComplexPhraseQuery.java
[jira] Updated: (LUCENE-1500) Highlighter throws StringIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/LUCENE-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-1500: - Attachment: Lucene-1500-NewException.patch Attached a patch with new checked exception. This will have a knock-on effect on all Highlighter client code (Solr?) as it introduces a new checked exception that must be handled. Highlighter throws StringIndexOutOfBoundsException -- Key: LUCENE-1500 URL: https://issues.apache.org/jira/browse/LUCENE-1500 Project: Lucene - Java Issue Type: Bug Components: contrib/highlighter Affects Versions: 2.4 Environment: Found this running the example code in Solr (latest version). Reporter: David Bowen Assignee: Michael McCandless Fix For: 2.4.1, 2.9 Attachments: Lucene-1500-NewException.patch, LUCENE-1500.patch, patch.txt
[jira] Commented: (LUCENE-1500) Highlighter throws StringIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/LUCENE-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12677507#action_12677507 ] Mark Harwood commented on LUCENE-1500: -- OK - choices are: 1) Throw a RuntimeException with a more useful diagnostic message 2) Throw a new checked Exception (introducing possible compile errors in existing client code) 3) Check for the error condition and ignore (as done in the current patch). This feels to me like one of those "there's something seriously wrong with the codebase" problems rather than an invalid bit of data or user input which is external to the system, so my personal preference is to lean towards 1). Highlighter throws StringIndexOutOfBoundsException -- Key: LUCENE-1500 URL: https://issues.apache.org/jira/browse/LUCENE-1500 Project: Lucene - Java Issue Type: Bug Components: contrib/highlighter Affects Versions: 2.4 Environment: Found this running the example code in Solr (latest version). Reporter: David Bowen Assignee: Michael McCandless Fix For: 2.4.1, 2.9 Attachments: LUCENE-1500.patch, patch.txt
[jira] Commented: (LUCENE-1500) Highlighter throws StringIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/LUCENE-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12676633#action_12676633 ] Mark Harwood commented on LUCENE-1500: -- Hmmm. I'm not so sure that this defensive coding patch is the right thing to do here. One could argue that it is obscuring an error condition further upstream (as you suggest, Mike - a dodgy analyzer). Committing this will only make these errors harder to detect, e.g. we'd get forum posts saying "why doesn't my term get highlighted?" Perhaps we can turn this around and ask "under what conditions is it acceptable to provide a TokenStream with tokens whose offsets exceed the length of the text provided?" I'm not sure I see a justifiable case for supporting that as a legitimate scenario and I would prefer the reporting of an error in this case. Highlighter throws StringIndexOutOfBoundsException -- Key: LUCENE-1500 URL: https://issues.apache.org/jira/browse/LUCENE-1500 Project: Lucene - Java Issue Type: Bug Components: contrib/highlighter Affects Versions: 2.4 Environment: Found this running the example code in Solr (latest version). Reporter: David Bowen Assignee: Michael McCandless Fix For: 2.4.1, 2.9 Attachments: LUCENE-1500.patch, patch.txt
[jira] Commented: (LUCENE-1500) Highlighter throws StringIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/LUCENE-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12676745#action_12676745 ] Mark Harwood commented on LUCENE-1500: -- So to be consistent, where else in Lucene might an IncorrectTokenOffsetsException be a possibility - IndexWriter.addDocument(..)? Highlighter throws StringIndexOutOfBoundsException -- Key: LUCENE-1500 URL: https://issues.apache.org/jira/browse/LUCENE-1500 Project: Lucene - Java Issue Type: Bug Components: contrib/highlighter Affects Versions: 2.4 Environment: Found this running the example code in Solr (latest version). Reporter: David Bowen Assignee: Michael McCandless Fix For: 2.4.1, 2.9 Attachments: LUCENE-1500.patch, patch.txt
[jira] Commented: (LUCENE-1489) highlighter problem with n-gram tokens
[ https://issues.apache.org/jira/browse/LUCENE-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12667654#action_12667654 ]

Mark Harwood commented on LUCENE-1489:
--

It looks to me like this could be fixed in the Formatter classes when marking up the output string. Currently, the highlightTerm method of classes such as SimpleHTMLFormatter puts a tag around the whole section of text if it contains a hit, i.e.

{code:title=SimpleHTMLFormatter.java|borderStyle=solid}
public String highlightTerm(String originalText, TokenGroup tokenGroup)
{
    StringBuffer returnBuffer;
    if (tokenGroup.getTotalScore() > 0)
    {
        returnBuffer = new StringBuffer();
        returnBuffer.append(preTag);
        returnBuffer.append(originalText);
        returnBuffer.append(postTag);
        return returnBuffer.toString();
    }
    return originalText;
}
{code}

The TokenGroup object passed to this method contains all of the tokens and their scores, so it should be possible to use this information to deconstruct the originalText parameter and inject markup according to which tokens in the group had a match, rather than putting a tag around the whole block. Some complexity may lie in handling token streams that produce tokens that rewind to earlier offsets. SimpleHTMLFormatter suddenly seems less simple! TokenStreams that produce entirely overlapping streams of tokens will automatically be broken into multiple TokenGroups, because TokenGroup has a maximum number of linked Tokens it will ever hold in a single group.

I haven't got the time to fix this right now, but if someone has a burning need to leap in, the above seems like what may be required.

Cheers
Mark

highlighter problem with n-gram tokens
--
Key: LUCENE-1489 URL: https://issues.apache.org/jira/browse/LUCENE-1489 Project: Lucene - Java Issue Type: Bug Components: contrib/highlighter Reporter: Koji Sekiguchi Priority: Minor

I have a problem when using n-gram and highlighter. I thought it had been solved in LUCENE-627...

Actually, I found this problem when I was using CJKTokenizer on Solr; here is a Lucene program that reproduces it using NGramTokenizer(min=2,max=2) instead of CJKTokenizer:

{code:java}
public class TestNGramHighlighter {

  public static void main(String[] args) throws Exception {
    Analyzer analyzer = new NGramAnalyzer();
    final String TEXT = "Lucene can make index. Then Lucene can search.";
    final String QUERY = "can";
    QueryParser parser = new QueryParser("f", analyzer);
    Query query = parser.parse(QUERY);
    QueryScorer scorer = new QueryScorer(query, "f");
    Highlighter h = new Highlighter(scorer);
    System.out.println(h.getBestFragment(analyzer, "f", TEXT));
  }

  static class NGramAnalyzer extends Analyzer {
    public TokenStream tokenStream(String field, Reader input) {
      return new NGramTokenizer(input, 2, 2);
    }
  }
}
{code}

The expected output is:

Lucene <B>can</B> make index. Then Lucene <B>can</B> search.

but the actual output is:

Lucene <B>can make index. Then Lucene can</B> search.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
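The per-token markup idea from the comment above can be sketched with the JDK alone. This is only an illustration under stated assumptions: ScoredSpan and markUp are hypothetical names, not part of the highlighter API, and the sketch assumes each token exposes start/end offsets into the original text plus a score. Overlapping hits (such as the bigrams "ca" and "an") are merged into one tagged run, so tokens that rewind to earlier offsets do not produce nested or unbalanced tags.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class PerTokenMarkupSketch {

    /** Hypothetical stand-in for a scored token: [start, end) offsets plus a score. */
    static final class ScoredSpan {
        final int start;
        final int end;
        final float score;

        ScoredSpan(int start, int end, float score) {
            this.start = start;
            this.end = end;
            this.score = score;
        }
    }

    /** Wraps only the matching regions of text in preTag/postTag. */
    static String markUp(String text, List<ScoredSpan> spans, String preTag, String postTag) {
        // Keep only tokens that actually matched the query.
        List<ScoredSpan> hits = new ArrayList<>();
        for (ScoredSpan span : spans) {
            if (span.score > 0) {
                hits.add(span);
            }
        }
        hits.sort(Comparator.comparingInt((ScoredSpan span) -> span.start));

        StringBuilder out = new StringBuilder();
        int pos = 0; // next character of text not yet copied to the output
        int i = 0;
        while (i < hits.size()) {
            int start = hits.get(i).start;
            int end = hits.get(i).end;
            i++;
            // Merge hits that overlap or touch the current run.
            while (i < hits.size() && hits.get(i).start <= end) {
                end = Math.max(end, hits.get(i).end);
                i++;
            }
            out.append(text, pos, start);
            out.append(preTag).append(text, start, end).append(postTag);
            pos = end;
        }
        out.append(text.substring(pos));
        return out.toString();
    }
}
```

Feeding it the bigram offsets of both occurrences of "can" in the test sentence would tag each occurrence separately instead of everything in between.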
[jira] Updated: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries
[ https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-1486: - Attachment: TestComplexPhraseQuery.java More tests for Nots Wildcards, ORs etc inside Phrase queries Key: LUCENE-1486 URL: https://issues.apache.org/jira/browse/LUCENE-1486 Project: Lucene - Java Issue Type: Improvement Components: QueryParser Affects Versions: 2.4 Reporter: Mark Harwood Priority: Minor Fix For: 2.4.1, 2.9 Attachments: ComplexPhraseQueryParser.java, TestComplexPhraseQuery.java
An extension to the default QueryParser that overrides the parsing of PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries. The implementation feels a little hacky - this is arguably better handled in QueryParser itself. This works as a proof of concept for much of the query parser syntax. Examples from the Junit test include:
checkMatches("\"j* smyth~\"", "1,2"); //wildcards and fuzzies are OK in phrases
checkMatches("\"(jo* -john) smith\"", "2"); // boolean logic works
checkMatches("\"jo* smith\"~2", "1,2,3"); // position logic works.
checkBadQuery("\"jo* id:1 smith\""); //mixing fields in a phrase is bad
checkBadQuery("\"jo* \"smith\" \""); //phrases inside phrases is bad
checkBadQuery("\"jo* [sma TO smZ]\" \""); //range queries inside phrases not supported
Code plus Junit test to follow... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries
[ https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-1486: - Attachment: ComplexPhraseQueryParser.java Added support for Nots in phrase queries e.g. -not interested Wildcards, ORs etc inside Phrase queries Key: LUCENE-1486 URL: https://issues.apache.org/jira/browse/LUCENE-1486 Project: Lucene - Java Issue Type: Improvement Components: QueryParser Affects Versions: 2.4 Reporter: Mark Harwood Priority: Minor Fix For: 2.4.1, 2.9 Attachments: ComplexPhraseQueryParser.java, TestComplexPhraseQuery.java
[jira] Updated: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries
[ https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-1486: - Attachment: (was: ComplexPhraseQueryParser.java) Wildcards, ORs etc inside Phrase queries Key: LUCENE-1486 URL: https://issues.apache.org/jira/browse/LUCENE-1486 Project: Lucene - Java Issue Type: Improvement Components: QueryParser Affects Versions: 2.4 Reporter: Mark Harwood Priority: Minor Fix For: 2.4.1, 2.9 Attachments: ComplexPhraseQueryParser.java, TestComplexPhraseQuery.java
[jira] Updated: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries
[ https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-1486: - Attachment: (was: TestComplexPhraseQuery.java) Wildcards, ORs etc inside Phrase queries Key: LUCENE-1486 URL: https://issues.apache.org/jira/browse/LUCENE-1486 Project: Lucene - Java Issue Type: Improvement Components: QueryParser Affects Versions: 2.4 Reporter: Mark Harwood Priority: Minor Fix For: 2.4.1, 2.9 Attachments: ComplexPhraseQueryParser.java, TestComplexPhraseQuery.java
[jira] Updated: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries
[ https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-1486: - Attachment: (was: ComplexPhraseQueryParser.java) Wildcards, ORs etc inside Phrase queries Key: LUCENE-1486 URL: https://issues.apache.org/jira/browse/LUCENE-1486 Project: Lucene - Java Issue Type: Improvement Components: QueryParser Affects Versions: 2.4 Reporter: Mark Harwood Priority: Minor Fix For: 2.4.1 Attachments: ComplexPhraseQueryParser.java, TestComplexPhraseQuery.java
[jira] Updated: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries
[ https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-1486: - Attachment: ComplexPhraseQueryParser.java Fixed bug with plain phrase query, added support for range queries Wildcards, ORs etc inside Phrase queries Key: LUCENE-1486 URL: https://issues.apache.org/jira/browse/LUCENE-1486 Project: Lucene - Java Issue Type: Improvement Components: QueryParser Affects Versions: 2.4 Reporter: Mark Harwood Priority: Minor Fix For: 2.4.1 Attachments: ComplexPhraseQueryParser.java, TestComplexPhraseQuery.java
[jira] Updated: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries
[ https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-1486: - Attachment: (was: TestComplexPhraseQuery.java) Wildcards, ORs etc inside Phrase queries Key: LUCENE-1486 URL: https://issues.apache.org/jira/browse/LUCENE-1486 Project: Lucene - Java Issue Type: Improvement Components: QueryParser Affects Versions: 2.4 Reporter: Mark Harwood Priority: Minor Fix For: 2.4.1 Attachments: ComplexPhraseQueryParser.java, TestComplexPhraseQuery.java
[jira] Updated: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries
[ https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-1486: - Attachment: TestComplexPhraseQuery.java Added tests for range queries and plain PhraseQueries Wildcards, ORs etc inside Phrase queries Key: LUCENE-1486 URL: https://issues.apache.org/jira/browse/LUCENE-1486 Project: Lucene - Java Issue Type: Improvement Components: QueryParser Affects Versions: 2.4 Reporter: Mark Harwood Priority: Minor Fix For: 2.4.1 Attachments: ComplexPhraseQueryParser.java, TestComplexPhraseQuery.java
[jira] Created: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries
Wildcards, ORs etc inside Phrase queries Key: LUCENE-1486 URL: https://issues.apache.org/jira/browse/LUCENE-1486 Project: Lucene - Java Issue Type: Improvement Components: QueryParser Affects Versions: 2.4 Reporter: Mark Harwood Priority: Minor Fix For: 2.4.1
An extension to the default QueryParser that overrides the parsing of PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries. The implementation feels a little hacky - this is arguably better handled in QueryParser itself. This works as a proof of concept for much of the query parser syntax. Examples from the Junit test include:
checkMatches("\"j* smyth~\"", "1,2"); //wildcards and fuzzies are OK in phrases
checkMatches("\"(jo* -john) smith\"", "2"); // boolean logic works
checkMatches("\"jo* smith\"~2", "1,2,3"); // position logic works.
checkBadQuery("\"jo* id:1 smith\""); //mixing fields in a phrase is bad
checkBadQuery("\"jo* \"smith\" \""); //phrases inside phrases is bad
checkBadQuery("\"jo* [sma TO smZ]\" \""); //range queries inside phrases not supported
Code plus Junit test to follow... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries
[ https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-1486: - Attachment: ComplexPhraseQueryParser.java QueryParser extension Wildcards, ORs etc inside Phrase queries Key: LUCENE-1486 URL: https://issues.apache.org/jira/browse/LUCENE-1486 Project: Lucene - Java Issue Type: Improvement Components: QueryParser Affects Versions: 2.4 Reporter: Mark Harwood Priority: Minor Fix For: 2.4.1 Attachments: ComplexPhraseQueryParser.java, TestComplexPhraseQuery.java
[jira] Updated: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries
[ https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-1486: - Attachment: TestComplexPhraseQuery.java Junit test Wildcards, ORs etc inside Phrase queries Key: LUCENE-1486 URL: https://issues.apache.org/jira/browse/LUCENE-1486 Project: Lucene - Java Issue Type: Improvement Components: QueryParser Affects Versions: 2.4 Reporter: Mark Harwood Priority: Minor Fix For: 2.4.1 Attachments: ComplexPhraseQueryParser.java, TestComplexPhraseQuery.java
[jira] Commented: (LUCENE-1473) Implement standard Serialization across Lucene versions
[ https://issues.apache.org/jira/browse/LUCENE-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12653057#action_12653057 ] Mark Harwood commented on LUCENE-1473: -- The contrib section of Lucene contains an XML-based query parser which aims to provide full coverage of Lucene queries/filters and extensibility to support 3rd party classes. I use this regularly in distributed deployments, and it allows both non-Java clients and long-term persistence of queries with good stability across Lucene versions. Although I have not conducted formal benchmarks, I have not found XML parsing to be a bottleneck - search execution and/or document retrieval are normally the main bottlenecks. Maintaining XML parsing code is an overhead but ultimately helps decouple requests from the logic that executes requests. In serializing Lucene Query/Filter objects we are dealing with classes which combine both the representation of the request criteria (what needs to be done) and the implementation (how things are done). We are forever finessing the "how" bit of this equation, e.g. moving from RangeQuery to RangeFilter to TrieRangeFilter. The criteria, however, remain relatively static (I just want to search on a range), and so it is dangerous to build clients that refer directly to query implementation classes. The XML parser provides a language-independent abstraction for clients to define what they want to be done without being too tied to how this is implemented.
Cheers Mark Implement standard Serialization across Lucene versions --- Key: LUCENE-1473 URL: https://issues.apache.org/jira/browse/LUCENE-1473 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 2.4 Reporter: Jason Rutherglen Priority: Minor Attachments: LUCENE-1473.patch Original Estimate: 8h Remaining Estimate: 8h To maintain serialization compatibility between Lucene versions, serialVersionUID needs to be added to classes that implement java.io.Serializable. java.io.Externalizable may be implemented in classes for faster performance. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
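For readers unfamiliar with the mechanism named in the issue description, the following is a minimal JDK-only sketch of pinning serialVersionUID on a Serializable class and round-tripping an instance. RangeCriteria and the helper methods are hypothetical illustrations, not Lucene classes, and the sketch says nothing about how the attached patch assigns its IDs.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class SerialVersionSketch {

    /** Hypothetical request criteria ("what to search"); not a Lucene class. */
    static final class RangeCriteria implements Serializable {
        // Declared explicitly so that compatible edits to the class (for
        // example adding a method) do not change the computed default ID
        // and break streams written by an older build.
        private static final long serialVersionUID = 1L;

        final String field;
        final long lower;
        final long upper;

        RangeCriteria(String field, long lower, long upper) {
            this.field = field;
            this.lower = lower;
            this.upper = upper;
        }
    }

    /** Serializes a value with standard Java serialization. */
    static byte[] toBytes(Serializable value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(value);
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Reads back a value written by toBytes. */
    static Object fromBytes(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Without the explicit field, the JVM derives an ID from the class's shape, which is why unrelated edits between releases can make previously serialized objects unreadable.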
[jira] Commented: (LUCENE-1470) Add TrieRangeQuery to contrib
[ https://issues.apache.org/jira/browse/LUCENE-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12651418#action_12651418 ] Mark Harwood commented on LUCENE-1470: -- A note of caution - I noticed when moving from Lucene 2.3 to 2.4 that my similar scheme for encoding information meant that I couldn't encode information using byte arrays using bytes with values 216. The changes (I think in Lucene-510) introduced some code that modified the way the bytes were written/read and corrupted my encoding. Not sure if your proposed approach is prone to this or if anyone can cast further light on these encoding issues. Good to see this making its way into Lucene, Uwe. Add TrieRangeQuery to contrib - Key: LUCENE-1470 URL: https://issues.apache.org/jira/browse/LUCENE-1470 Project: Lucene - Java Issue Type: New Feature Components: contrib/* Affects Versions: 2.4 Reporter: Uwe Schindler Assignee: Michael McCandless Attachments: LUCENE-1470.patch, LUCENE-1470.patch, LUCENE-1470.patch, LUCENE-1470.patch According to the thread in java-dev (http://www.gossamer-threads.com/lists/lucene/java-dev/67807 and http://www.gossamer-threads.com/lists/lucene/java-dev/67839), I want to include my fast numerical range query implementation into lucene contrib-queries. I implemented (based on RangeFilter) another approach for faster RangeQueries, based on longs stored in index in a special format. The idea behind this is to store the longs in different precision in index and partition the query range in such a way, that the outer boundaries are search using terms from the highest precision, but the center of the search Range with lower precision. The implementation stores the longs in 8 different precisions (using a class called TrieUtils). It also has support for Doubles, using the IEEE 754 floating-point double format bit layout with some bit mappings to make them binary sortable. 
The approach is used in rather big indexes, query times are even on low performance desktop computers 100 ms (!) for very big ranges on indexes with 50 docs. I called this RangeQuery variant and format TrieRangeRange query because the idea looks like the well-known Trie structures (but it is not identical to real tries, but algorithms are related to it). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
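The two ideas in the description (making doubles binary sortable, and keeping each value at several precisions) can be sketched as below. This is an illustration under assumptions, not the TrieUtils encoding from the patch: the class and method names are hypothetical, the sortable mapping is the common IEEE 754 sign-flip trick, and the 8 precision levels are modelled simply as right shifts in steps of 8 bits.

```java
class SortableDoubleSketch {

    /**
     * Maps a double to a long whose signed ordering matches the numeric
     * ordering of the doubles. Negative doubles have their non-sign bits
     * flipped, because IEEE 754 stores them in sign-magnitude form.
     */
    static long toSortableLong(double value) {
        long bits = Double.doubleToLongBits(value);
        return bits < 0 ? bits ^ 0x7fffffffffffffffL : bits;
    }

    /**
     * Returns the value at 8 precisions: level 0 is the full 64-bit value,
     * and each following level drops another 8 low-order bits. A range
     * query can match the edges of the range with high-precision terms and
     * the middle with fewer low-precision terms.
     */
    static long[] precisionPrefixes(long sortable) {
        long[] prefixes = new long[8];
        for (int level = 0; level < 8; level++) {
            // Arithmetic shift keeps the sign, so ordering is preserved.
            prefixes[level] = sortable >> (level * 8);
        }
        return prefixes;
    }
}
```

Values that share a prefix at a given level fall into the same bucket at that precision, which is what lets the centre of a wide range be covered by a handful of terms.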
[jira] Commented: (LUCENE-329) Fuzzy query scoring issues
[ https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12647243#action_12647243 ] Mark Harwood commented on LUCENE-329: - This patch goes back a while. Contrib's FuzzyLikeThisQuery contains my current best practice for fuzzy matching, but the logic is mixed in with code that also does LikeThis optimisations, i.e. working out which input terms are the best to search on rather than using all input terms. The fuzzy scoring logic could usefully be lifted out and used elsewhere. It takes the IDF of the input term and uses that as the IDF for scoring all expanded variants. If the input term does not exist then all variants are rewarded with their averaged IDF. Coord is disabled. Using some form of IDF is typically desirable to balance a fuzzy query with other (potentially non-fuzzy) clauses in the overall user query. Within a fuzzy query (or wildcard or other auto-expanding query), however, I see no reason to differentiate between the auto-expanded terms with different IDF values. In my view these auto-expanding queries should generally use the same IDF for all variants and only reward them differently based on edit distance or whatever other distance metric is meaningful to that form of expansion (e.g. an age range query on age 40 +/- 10 years could reward based on closeness to the input term 40).
Cheers Mark Fuzzy query scoring issues -- Key: LUCENE-329 URL: https://issues.apache.org/jira/browse/LUCENE-329 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 1.2rc5 Environment: Operating System: All Platform: All Reporter: Mark Harwood Assignee: Lucene Developers Priority: Minor Attachments: patch.txt Queries which automatically produce multiple terms (wildcard, range, prefix, fuzzy etc) currently suffer from two problems: 1) Scores for matching documents are significantly smaller than term queries because of the volume of terms introduced (A match on query Foo~ is 0.1 whereas a match on query Foo is 1). 2) The rarer forms of expanded terms are favoured over those of more common forms because of the IDF. When using Fuzzy queries for example, rare misspellings typically appear in results before the more common correct spellings. I will attach a patch that corrects the issues identified above by 1) Overriding Similarity.coord to counteract the downplaying of scores introduced by expanding terms. 2) Taking the IDF factor of the most common form of expanded terms as the basis of scoring all other expanded terms. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
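The scoring approach described in the comment (one shared IDF for every expanded variant, with variants differentiated only by edit distance) might look roughly like this. It is a JDK-only sketch under assumptions, not the FuzzyLikeThisQuery code: all names are hypothetical, the IDF formula is the classic 1 + ln(numDocs / (docFreq + 1)), and the distance reward is a simple normalized Levenshtein similarity.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SharedIdfFuzzySketch {

    /** Classic IDF: rarer terms get larger values. */
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    /** Plain Levenshtein distance using two rolling rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** 1.0 for an exact match, falling towards 0.0 as the strings diverge. */
    static double similarity(String input, String variant) {
        int maxLen = Math.max(input.length(), variant.length());
        return maxLen == 0 ? 1.0 : 1.0 - (double) editDistance(input, variant) / maxLen;
    }

    /**
     * Weights every variant with the IDF of the input term, scaled only by
     * closeness to the input. A rare misspelling therefore cannot outrank
     * the common correct spelling just because it is rare.
     */
    static Map<String, Double> variantWeights(String input, int inputDocFreq, int numDocs,
                                              List<String> variants) {
        double sharedIdf = idf(inputDocFreq, numDocs);
        Map<String, Double> weights = new LinkedHashMap<>();
        for (String variant : variants) {
            weights.put(variant, sharedIdf * similarity(input, variant));
        }
        return weights;
    }
}
```

The fallback mentioned in the comment (averaging the variants' IDFs when the input term is absent from the index) is left out to keep the sketch short.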
[jira] Updated: (LUCENE-1449) IndexDeletionPolicy.delete behaves incorrectly when deleting latest generation
[ https://issues.apache.org/jira/browse/LUCENE-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-1449: - Attachment: TestTransactionRollbackCapability.java Junit test IndexDeletionPolicy.delete behaves incorrectly when deleting latest generation --- Key: LUCENE-1449 URL: https://issues.apache.org/jira/browse/LUCENE-1449 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 2.9 Reporter: Mark Harwood Priority: Minor Attachments: TestTransactionRollbackCapability.java I have been looking to provide the ability to rollback committed transactions and encountered some issues. I appreciate IndexDeletionPolicy's main motivation is to handle cleaning away OLD commit points but it does not explicitly state that it can or cannot be used to clean NEW commit points. If this is not supported then the documentation should ideally state this. If the intention is to support this behaviour then read on ... There seem to be 2 issues so far: 1) The first attempt to call IndexCommit.delete on the latest commit point fails to remove any contents. The subsequent call succeeds however 2) Deleting the latest commit point fails to update the segments.gen file to point to segments_N-1. New IndexReaders that are opened are then misdirected to open segments_N which has been deleted Junit test to follow... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-1449) IndexDeletionPolicy.delete behaves incorrectly when deleting latest generation
[ https://issues.apache.org/jira/browse/LUCENE-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-1449: - Attachment: TestTransactionRollbackCapability2.java Thanks for the pointers, Mike. This new test now passes having made a few changes. IndexDeletionPolicy.delete behaves incorrectly when deleting latest generation --- Key: LUCENE-1449 URL: https://issues.apache.org/jira/browse/LUCENE-1449 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 2.9 Reporter: Mark Harwood Assignee: Michael McCandless Priority: Minor Attachments: TestTransactionRollbackCapability.java, TestTransactionRollbackCapability2.java
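The second problem reported in the LUCENE-1449 description (deleting the latest commit point without repointing segments.gen) can be pictured with a toy model. To be clear about the assumptions: this is not Lucene code and uses no Lucene API; the set of generations stands in for the segments_N files, the genPointer field stands in for segments.gen, and every name is hypothetical.

```java
import java.util.TreeSet;

class CommitPointToyModel {

    // Generations that still exist on "disk" (stand-in for segments_N files).
    private final TreeSet<Long> generations = new TreeSet<>();

    // Generation a new reader would try to open (stand-in for segments.gen).
    private long genPointer = -1;

    /** Creates the next commit point and points readers at it. */
    void commit() {
        long next = generations.isEmpty() ? 1 : generations.last() + 1;
        generations.add(next);
        genPointer = next;
    }

    /** Deletes the newest commit but leaves the pointer stale. */
    void deleteLatestNaive() {
        if (!generations.isEmpty()) {
            generations.remove(generations.last());
        }
    }

    /** Deletes the newest commit and repoints readers at generation N-1. */
    void deleteLatestAndRepoint() {
        if (!generations.isEmpty()) {
            generations.remove(generations.last());
        }
        genPointer = generations.isEmpty() ? -1 : generations.last();
    }

    long pointer() {
        return genPointer;
    }

    /** A new reader succeeds only if the pointer names a surviving commit. */
    boolean readerCanOpen() {
        return generations.contains(genPointer);
    }
}
```

In this model a rollback is only safe when the pointer update is part of the same operation as the deletion, which is the behaviour the issue asks either to support or to document as unsupported.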