[jira] Commented: (LUCENE-2306) contrib/xml-query-parser: NumericRangeFilter support

2010-03-27 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12850494#action_12850494
 ] 

Mark Harwood commented on LUCENE-2306:
--

bq. Should I commit?

Yes, thanks, Uwe.  Missed that test/package. 
Cheers
Mark 

 contrib/xml-query-parser: NumericRangeFilter support
 

 Key: LUCENE-2306
 URL: https://issues.apache.org/jira/browse/LUCENE-2306
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Affects Versions: 3.0.1
Reporter: Jingkei Ly
Assignee: Mark Harwood
 Fix For: 3.1

 Attachments: LUCENE-2306.patch, LUCENE-2306.patch


 Create a FilterBuilder for NumericRangeFilter so that it may be used with the 
 XML query parser.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2306) contrib/xml-query-parser: NumericRangeQuery and -Filter support

2010-03-27 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12850497#action_12850497
 ] 

Mark Harwood commented on LUCENE-2306:
--

FYI, re changes to defaults: I try to keep the DTD up to date with all these 
changes. 
Having done that, I then have to manually run the dtdocbuild to generate nice 
HTML docs. This is currently not automated because of uncertainty about 
dragging dtddoc and its dependencies into Lucene builds.
It's a bit of a pain, but the HTML docs are useful and I'm hoping to add smart 
DTD-driven query entry to Luke. 


 contrib/xml-query-parser: NumericRangeQuery and -Filter support
 ---

 Key: LUCENE-2306
 URL: https://issues.apache.org/jira/browse/LUCENE-2306
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Affects Versions: 3.0.1
Reporter: Jingkei Ly
Assignee: Mark Harwood
 Fix For: 3.1

 Attachments: LUCENE-2306.patch, LUCENE-2306.patch


 Create a FilterBuilder for NumericRangeFilter so that it may be used with the 
 XML query parser.




[jira] Resolved: (LUCENE-2306) contrib/xml-query-parser: NumericRangeFilter support

2010-03-26 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood resolved LUCENE-2306.
--

   Resolution: Fixed
Fix Version/s: 3.1
 Assignee: Mark Harwood

Committed in revision 928114

 contrib/xml-query-parser: NumericRangeFilter support
 

 Key: LUCENE-2306
 URL: https://issues.apache.org/jira/browse/LUCENE-2306
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Affects Versions: 3.0.1
Reporter: Jingkei Ly
Assignee: Mark Harwood
 Fix For: 3.1

 Attachments: LUCENE-2306.patch, LUCENE-2306.patch


 Create a FilterBuilder for NumericRangeFilter so that it may be used with the 
 XML query parser.




[jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2010-02-19 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12835961#action_12835961
 ] 

Mark Harwood commented on LUCENE-1486:
--

Double ugh. Applying the patch for the non-default field bug doesn't work any 
more because the latest ComplexPhraseQueryParser source sitting in contrib now 
has a different package from the QueryParser base class. This means that the 
subclass doesn't have the required write access to the package-protected 
"field" variable, which is needed to temporarily set the context of the parser 
when processing the inner contents of the phrase.

Fixing this would require changing the package name of ComplexPhraseQueryParser 
or changing the visibility of "field" in the QueryParser base class to 
protected.
Does anyone have strong feelings about which of these is more acceptable?

 Wildcards, ORs etc inside Phrase queries
 

 Key: LUCENE-1486
 URL: https://issues.apache.org/jira/browse/LUCENE-1486
 Project: Lucene - Java
  Issue Type: Improvement
  Components: QueryParser
Affects Versions: 2.4
Reporter: Mark Harwood
Priority: Minor
 Fix For: 3.1

 Attachments: ComplexPhraseQueryParser.java, 
 junit_complex_phrase_qp_07_21_2009.patch, 
 junit_complex_phrase_qp_07_22_2009.patch, Lucene-1486 non default 
 field.patch, LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, 
 LUCENE-1486.patch, LUCENE-1486.patch, TestComplexPhraseQuery.java


 An extension to the default QueryParser that overrides the parsing of 
 PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.
 The implementation feels a little hacky - this is arguably better handled in 
 QueryParser itself. This works as a proof of concept  for much of the query 
 parser syntax. Examples from the Junit test include:
   checkMatches("\"j*   smyth~\"", "1,2"); //wildcards and fuzzies are OK in phrases
   checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic works
   checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic works.
   
   checkBadQuery("\"jo*  id:1 smith\""); //mixing fields in a phrase is bad
   checkBadQuery("\"jo* \"smith\" \""); //phrases inside phrases is bad
   checkBadQuery("\"jo* [sma TO smZ]\" \""); //range queries inside phrases not supported
 Code plus Junit test to follow...




[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class

2010-02-17 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12834819#action_12834819
 ] 

Mark Harwood commented on LUCENE-1720:
--

bq. How do we proceed from here? Is there a committer that's willing to look at 
the code

I have commit rights but I'd like to find some time to add the benchmarking 
code first and also trial it in a live environment.


 TimeLimitedIndexReader and associated utility class
 ---

 Key: LUCENE-1720
 URL: https://issues.apache.org/jira/browse/LUCENE-1720
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Attachments: ActivityTimedOutException.java, 
 ActivityTimeMonitor.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, 
 LUCENE-1720.patch, LUCENE-1720.patch, Lucene-1720.patch, Lucene-1720.patch, 
 LUCENE-1720.patch, TestTimeLimitedIndexReader.java, 
 TestTimeLimitedIndexReader.java, TimeLimitedIndexReader.java, 
 TimeLimitedIndexReader.java


 An alternative to TimeLimitedCollector that has the following advantages:
 1) Any reader activity can be time-limited rather than just single searches 
 e.g. the document retrieve phase.
 2) Times out faster (i.e. runaway queries such as fuzzies detected quickly 
 before last collect stage of query processing)
 Uses new utility timeout class that is independent of IndexReader.
 Initial contribution includes a performance test class but not had time as 
 yet to work up a formal Junit test.
 TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone.
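
The idea of a timeout utility that is independent of IndexReader could look roughly like the sketch below. It assumes a per-thread deadline that wrapper code checks before each unit of work. The names ActivityDeadlineSketch, start, stop, checkForTimeout and TimedOutException are hypothetical and are not taken from the attached ActivityTimeMonitor.java.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch; not the API of the attached ActivityTimeMonitor.java. */
final class ActivityDeadlineSketch {
    /** Thrown when the current thread's activity overruns its deadline. */
    static final class TimedOutException extends RuntimeException {
        TimedOutException(String message) { super(message); }
    }

    // Deadline for the current thread in System.nanoTime() units; null means no limit.
    private static final ThreadLocal<Long> DEADLINE = new ThreadLocal<>();

    /** Starts timing the current thread's activity. */
    static void start(long timeoutMillis) {
        DEADLINE.set(System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis));
    }

    /** Stops timing; later checks on this thread pass until start is called again. */
    static void stop() {
        DEADLINE.remove();
    }

    /** A reader wrapper would call this before each term, posting or document read. */
    static void checkForTimeout() {
        Long deadline = DEADLINE.get();
        if (deadline != null && System.nanoTime() - deadline >= 0) {
            throw new TimedOutException("activity exceeded its time limit");
        }
    }

    public static void main(String[] args) {
        start(60_000);
        try {
            checkForTimeout(); // well inside the limit, so this does not throw
            System.out.println("still within the time limit");
        } finally {
            stop();
        }
    }
}
```

Because the check is made on every low-level read rather than only at the collect stage, a runaway term enumeration would be interrupted before any document is collected, which is the second advantage listed above.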




[jira] Commented: (LUCENE-329) Fuzzy query scoring issues

2010-02-15 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833822#action_12833822
 ] 

Mark Harwood commented on LUCENE-329:
-

The problem with ignoring IDF completely is that it doesn't help balance 
partial matches where there is more than one fuzzy element in the query, e.g. in 
a query for John~ Patitucci~ I'm probably more interested in a partial match on 
the rarer surname than a partial match on the common forename. Obliterating IDF 
completely as a factor would lose this feature (available in FuzzyLikeThisQuery).


 Fuzzy query scoring issues
 --

 Key: LUCENE-329
 URL: https://issues.apache.org/jira/browse/LUCENE-329
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 1.2rc5
 Environment: Operating System: All
 Platform: All
Reporter: Mark Harwood
Assignee: Lucene Developers
Priority: Minor
 Attachments: patch.txt


 Queries which automatically produce multiple terms (wildcard, range, prefix, 
 fuzzy etc)currently suffer from two problems:
 1) Scores for matching documents are significantly smaller than term queries 
 because of the volume of terms introduced (A match on query Foo~ is 0.1 
 whereas a match on query Foo is 1).
 2) The rarer forms of expanded terms are favoured over those of more common 
 forms because of the IDF. When using Fuzzy queries for example, rare mis-
 spellings typically appear in results before the more common correct 
 spellings.
 I will attach a patch that corrects the issues identified above by 
 1) Overriding Similarity.coord to counteract the downplaying of scores 
 introduced by expanding terms.
 2) Taking the IDF factor of the most common form of expanded terms as the 
 basis of scoring all other expanded terms.




[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class

2010-02-15 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833833#action_12833833
 ] 

Mark Harwood commented on LUCENE-1720:
--

bq. Anyway, I'm putting that aside for now, and moving on to adding more tests 
to TestTimeLimitingReader.

OK.

I always shudder when I see lists of if instanceof... logic.

My suggestion of getWrappedReader was intended for broader use - there are 
other reasons to wrap a reader e.g. security.
I was thinking of putting it on IndexReader but maybe the convenience wrapper 
base class FilterIndexReader would be a better home - most reader-wrappers 
would use this as a base class?


 TimeLimitedIndexReader and associated utility class
 ---

 Key: LUCENE-1720
 URL: https://issues.apache.org/jira/browse/LUCENE-1720
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Attachments: ActivityTimedOutException.java, 
 ActivityTimeMonitor.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, 
 Lucene-1720.patch, Lucene-1720.patch, LUCENE-1720.patch, 
 TestTimeLimitedIndexReader.java, TestTimeLimitedIndexReader.java, 
 TimeLimitedIndexReader.java, TimeLimitedIndexReader.java


 An alternative to TimeLimitedCollector that has the following advantages:
 1) Any reader activity can be time-limited rather than just single searches 
 e.g. the document retrieve phase.
 2) Times out faster (i.e. runaway queries such as fuzzies detected quickly 
 before last collect stage of query processing)
 Uses new utility timeout class that is independent of IndexReader.
 Initial contribution includes a performance test class but not had time as 
 yet to work up a formal Junit test.
 TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone.




[jira] Commented: (LUCENE-329) Fuzzy query scoring issues

2010-02-15 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833840#action_12833840
 ] 

Mark Harwood commented on LUCENE-329:
-

My best-practice suggestion isn't as simple as offering a choice between 
preserving IDF for all terms or not.

Instead, it is a proposal that we should use the *input* term's IDF for scoring 
all variants of the same root term (or taking an average of variants where the 
root term does not exist).

This, I feel, preserves the benefits of keeping IDF as a factor (as in my John~ 
Patitucci~ balancing example) while also eliminating the side effects we see 
where a rare mis-spelling beats exact matches.
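
A minimal sketch of this proposal follows. It assumes the classic IDF shape 1 + ln(numDocs / (docFreq + 1)); the class SharedIdfSketch, its method names and all document counts are made up for illustration and are not code from the attached patch.

```java
import java.util.Arrays;

/** Hypothetical sketch of the proposal; not code from the attached patch. */
final class SharedIdfSketch {
    /** Classic IDF shape, 1 + ln(numDocs / (docFreq + 1)); assumed here for illustration. */
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    /**
     * IDF shared by every variant of one fuzzy term: the input term's own IDF,
     * or the average over the variants when the input term is not in the index.
     */
    static double sharedIdf(int inputDocFreq, int[] variantDocFreqs, int numDocs) {
        if (inputDocFreq > 0) {
            return idf(inputDocFreq, numDocs);
        }
        return Arrays.stream(variantDocFreqs)
                .mapToDouble(df -> idf(df, numDocs))
                .average()
                .orElse(0.0);
    }

    public static void main(String[] args) {
        int numDocs = 1_000_000;              // made-up index size
        int[] johnVariants = {50_000, 3, 40}; // made-up docFreqs for john, jonh, johan
        int[] patitucciVariants = {12, 1};    // made-up docFreqs for patitucci, patitucii
        System.out.println("john~ shared idf      = " + sharedIdf(50_000, johnVariants, numDocs));
        System.out.println("patitucci~ shared idf = " + sharedIdf(12, patitucciVariants, numDocs));
    }
}
```

With numbers like these the rare surname still carries more weight than the common forename, while a rare mis-spelling such as "jonh" can no longer outscore "john" on IDF alone, because both variants are scored with the same value.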


 Fuzzy query scoring issues
 --

 Key: LUCENE-329
 URL: https://issues.apache.org/jira/browse/LUCENE-329
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 1.2rc5
 Environment: Operating System: All
 Platform: All
Reporter: Mark Harwood
Priority: Minor
 Attachments: LUCENE-329.patch, patch.txt


 Queries which automatically produce multiple terms (wildcard, range, prefix, 
 fuzzy etc)currently suffer from two problems:
 1) Scores for matching documents are significantly smaller than term queries 
 because of the volume of terms introduced (A match on query Foo~ is 0.1 
 whereas a match on query Foo is 1).
 2) The rarer forms of expanded terms are favoured over those of more common 
 forms because of the IDF. When using Fuzzy queries for example, rare mis-
 spellings typically appear in results before the more common correct 
 spellings.
 I will attach a patch that corrects the issues identified above by 
 1) Overriding Similarity.coord to counteract the downplaying of scores 
 introduced by expanding terms.
 2) Taking the IDF factor of the most common form of expanded terms as the 
 basis of scoring all other expanded terms.




[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class

2010-02-15 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833863#action_12833863
 ] 

Mark Harwood commented on LUCENE-1720:
--

bq. BTW found and fixed a bug in TimeLimitingIndexReader.reopen which returned 
the wrapped reopened instance if it wasn't changed, instead of itself

Good catch.

bq. We can get over that by offering a protected getNewInstance(IndexReader) 
which will be overridden by sub-classes

Would that be abstract? That would effectively help force subclasses to do the 
right thing when reopening but introduce a back-compatibility issue.
If we don't make it abstract what would be the default implementation of this 
method?
Maybe it's all best handled by simply adding a note saying you really should 
think about overriding reopen in FilterIndexReader's javadocs?
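
To make the trade-off concrete, here is a minimal sketch of the reopen fix and the getNewInstance hook. The nested Reader, FilterReader and TimeLimitedReader classes are hypothetical stand-ins, not Lucene's classes, and reopen takes a version argument only to simulate whether the index changed. The non-abstract default shown is just one possible answer to the question above, and it silently loses the subclass type if a subclass forgets to override it.

```java
/** Hypothetical sketch; the nested classes are stand-ins, not Lucene's readers. */
final class ReopenSketch {
    static class Reader {
        public final int version;

        public Reader(int version) { this.version = version; }

        /** Returns this when the index is unchanged, else a reader on the newer version. */
        public Reader reopen(int latestVersion) {
            return latestVersion == version ? this : new Reader(latestVersion);
        }
    }

    static class FilterReader extends Reader {
        protected final Reader in;

        public FilterReader(Reader in) {
            super(in.version);
            this.in = in;
        }

        /** Hook from the discussion: rebuild the same kind of wrapper around a reopened reader. */
        protected Reader getNewInstance(Reader reopened) {
            return new FilterReader(reopened);
        }

        @Override
        public Reader reopen(int latestVersion) {
            Reader reopened = in.reopen(latestVersion);
            // Unchanged index: hand back the wrapper itself, never the bare inner reader.
            return reopened == in ? this : getNewInstance(reopened);
        }
    }

    static class TimeLimitedReader extends FilterReader {
        public TimeLimitedReader(Reader in) { super(in); }

        @Override
        protected Reader getNewInstance(Reader reopened) {
            return new TimeLimitedReader(reopened);
        }
    }

    public static void main(String[] args) {
        Reader wrapped = new TimeLimitedReader(new Reader(1));
        System.out.println("unchanged reopen returns the wrapper: " + (wrapped.reopen(1) == wrapped));
        System.out.println("changed reopen keeps the wrapper type: "
                + (wrapped.reopen(2) instanceof TimeLimitedReader));
    }
}
```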



 TimeLimitedIndexReader and associated utility class
 ---

 Key: LUCENE-1720
 URL: https://issues.apache.org/jira/browse/LUCENE-1720
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Attachments: ActivityTimedOutException.java, 
 ActivityTimeMonitor.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, 
 Lucene-1720.patch, Lucene-1720.patch, LUCENE-1720.patch, 
 TestTimeLimitedIndexReader.java, TestTimeLimitedIndexReader.java, 
 TimeLimitedIndexReader.java, TimeLimitedIndexReader.java


 An alternative to TimeLimitedCollector that has the following advantages:
 1) Any reader activity can be time-limited rather than just single searches 
 e.g. the document retrieve phase.
 2) Times out faster (i.e. runaway queries such as fuzzies detected quickly 
 before last collect stage of query processing)
 Uses new utility timeout class that is independent of IndexReader.
 Initial contribution includes a performance test class but not had time as 
 yet to work up a formal Junit test.
 TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone.




[jira] Commented: (LUCENE-329) Fuzzy query scoring issues

2010-02-15 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833876#action_12833876
 ] 

Mark Harwood commented on LUCENE-329:
-

bq. consider simpler case

OK - but we need to remember that it is important to achieve balance _across_ 
different fuzzy queries as well as terms _within_ the same fuzzy query.
Let's stick to the terms within a single fuzzy query for now:

bq. I guess you would like to score the second term higher, meaning Lower 
frequency

No, a variant's frequency is not a deciding factor - only edit distance. Johana 
is similarity 0.6 while Joahn is 0.2 so I would favour result one (although 
this difference seems a little off in this case).
The basic assumption is that the user's input is valid and not a typo (deriving 
spelling suggestions etc. is a different topic and one we shouldn't try to cover 
here). 
Fuzzy matching can drag in all sorts of unqualified variants with massively 
different frequencies. Because we cannot control these discrepancies we should 
reward all these alternatives using the known factors we have to hand - the IDF 
of the user's supposedly valid input and the similarity measure of each variant 
compared to the input.
We could get fancy about probability of variants given the other input terms in 
the query but that feels like it's straying into spell checker territory and 
ngrams etc.

 Fuzzy query scoring issues
 --

 Key: LUCENE-329
 URL: https://issues.apache.org/jira/browse/LUCENE-329
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 1.2rc5
 Environment: Operating System: All
 Platform: All
Reporter: Mark Harwood
Priority: Minor
 Attachments: patch.txt


 Queries which automatically produce multiple terms (wildcard, range, prefix, 
 fuzzy etc)currently suffer from two problems:
 1) Scores for matching documents are significantly smaller than term queries 
 because of the volume of terms introduced (A match on query Foo~ is 0.1 
 whereas a match on query Foo is 1).
 2) The rarer forms of expanded terms are favoured over those of more common 
 forms because of the IDF. When using Fuzzy queries for example, rare mis-
 spellings typically appear in results before the more common correct 
 spellings.
 I will attach a patch that corrects the issues identified above by 
 1) Overriding Similarity.coord to counteract the downplaying of scores 
 introduced by expanding terms.
 2) Taking the IDF factor of the most common form of expanded terms as the 
 basis of scoring all other expanded terms.




[jira] Issue Comment Edited: (LUCENE-329) Fuzzy query scoring issues

2010-02-15 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833876#action_12833876
 ] 

Mark Harwood edited comment on LUCENE-329 at 2/15/10 5:05 PM:
--

bq. consider simpler case

OK - but we need to remember that it is important to achieve balance _across_ 
different fuzzy queries as well as terms _within_ the same fuzzy query.
Let's stick to the terms within a single fuzzy query for now:

bq. I guess you would like to score the second term higher, meaning Lower 
frequency

No, a variant's frequency is not a deciding factor - only edit distance. Johana 
is similarity 0.6 while Joahn is 0.2 so I would favour result one (although 
this difference seems a little off in this case).
The basic assumption is that the user's input is valid and not a typo (deriving 
spelling suggestions etc. is a different topic and one we shouldn't try to cover 
here). 
Fuzzy matching can drag in all sorts of unqualified variants with massively 
different frequencies. Because we cannot control these discrepancies we should 
reward all these alternatives using the known factors we have to hand - the IDF 
of the user's supposedly valid input and the similarity measure of each variant 
compared to the input.
We could get fancy about probability of variants given the other input terms in 
the query but that feels like it's straying into spell checker territory and 
ngrams etc.

  was (Author: markh):
bq. consider simpler case

OK - but we need to remember that it is important to achieve balance _across_ 
different fuzzy queries as well as terms _within_ the same fuzzy query.
Let's stick to the terms within a single fuzzy query for now:

bq. I guess you would like to score the second term higher, meaning Lower 
frequency

No, variant's frequency is not a deciding factor - only edit distance. Johana 
is similarity 0.6 while Johana is 0.2 so I would favour result one  (although 
the this difference seems a little off in this case)
The basic assumption is that user's input is valid and not a typo (deriving 
spelling suggestions etc are a different topic and one we shouldnt try cover 
here). 
Fuzzy matching can drag in all sorts of unqualified variants with massively 
different frequencies. Because we cannot control these discrepancies we should 
reward all these alternatives using the known factors we have to hand - the IDF 
of the user's supposedly valid input and the similarity measure of each variant 
compared to the input.
We could get fancy about probability of variants given the other input terms in 
the query but that feels like its straying into spell checker territory and 
ngrams etc.
  
 Fuzzy query scoring issues
 --

 Key: LUCENE-329
 URL: https://issues.apache.org/jira/browse/LUCENE-329
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 1.2rc5
 Environment: Operating System: All
 Platform: All
Reporter: Mark Harwood
Priority: Minor
 Attachments: patch.txt


 Queries which automatically produce multiple terms (wildcard, range, prefix, 
 fuzzy etc)currently suffer from two problems:
 1) Scores for matching documents are significantly smaller than term queries 
 because of the volume of terms introduced (A match on query Foo~ is 0.1 
 whereas a match on query Foo is 1).
 2) The rarer forms of expanded terms are favoured over those of more common 
 forms because of the IDF. When using Fuzzy queries for example, rare mis-
 spellings typically appear in results before the more common correct 
 spellings.
 I will attach a patch that corrects the issues identified above by 
 1) Overriding Similarity.coord to counteract the downplaying of scores 
 introduced by expanding terms.
 2) Taking the IDF factor of the most common form of expanded terms as the 
 basis of scoring all other expanded terms.




[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class

2010-02-15 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833902#action_12833902
 ] 

Mark Harwood commented on LUCENE-1720:
--

bq. Mark, the only thing that remains is to convert 
TimeLimitingIndexReaderBenchmark to a benchmark algorithm/task. Would you mind 
taking a stab at this?

Will need to look at existing benchmark tasks for guidance. I may get some time 
later.

 TimeLimitedIndexReader and associated utility class
 ---

 Key: LUCENE-1720
 URL: https://issues.apache.org/jira/browse/LUCENE-1720
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Attachments: ActivityTimedOutException.java, 
 ActivityTimeMonitor.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, 
 LUCENE-1720.patch, Lucene-1720.patch, Lucene-1720.patch, LUCENE-1720.patch, 
 TestTimeLimitedIndexReader.java, TestTimeLimitedIndexReader.java, 
 TimeLimitedIndexReader.java, TimeLimitedIndexReader.java


 An alternative to TimeLimitedCollector that has the following advantages:
 1) Any reader activity can be time-limited rather than just single searches 
 e.g. the document retrieve phase.
 2) Times out faster (i.e. runaway queries such as fuzzies detected quickly 
 before last collect stage of query processing)
 Uses new utility timeout class that is independent of IndexReader.
 Initial contribution includes a performance test class but not had time as 
 yet to work up a formal Junit test.
 TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone.




[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class

2010-02-12 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12832987#action_12832987
 ] 

Mark Harwood commented on LUCENE-1720:
--

bq. I also want to add a TestTimeLimitedIndexReader.

To simplify this I started down the route of making core's TestIndexReader 
subclassable for testing any IndexReader wrappers such as ours.

This involves centralising all the "r = IndexReader.open(..)" calls into a 
single overridable getReader method. The TestTimeLimitingIndexReader test then 
becomes just this:

{code:title=TestTimeLimitingIndexReader.java|borderStyle=solid}
public class TestTimeLimitingIndexReader extends TestIndexReader {
    public TestTimeLimitingIndexReader(String name) {
        super(name);
    }

    @Override
    public IndexReader getReader(Directory dir, boolean readOnly)
            throws CorruptIndexException, IOException {
        return new TimeLimitedIndexReader(super.getReader(dir, readOnly));
    }
}
{code}

Having done this there were some test failures - notably calls to 
SegmentReader.getOnlySegmentReader(IndexReader reader) because it has a bunch 
of instanceof testing code that doesn't expect our wrapper.

This is a general Lucene issue. If we support Reader-wrapping as a concept 
(FilterIndexReader certainly suggests this) then it might make sense to provide 
a getWrappedReader method, in the same way that java.lang.Throwable 
gained a standard getCause method in Java 1.4, because prior to that 
unwrapping objects required specialised knowledge of each wrapper class. This 
is perhaps another Jira issue and related changes to Junit tests.
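
As a rough illustration of what such a method would buy us, the sketch below unwraps a chain of wrappers without any instanceof checks on concrete wrapper classes. The nested Reader, LeafReader, FilterReader, TimeLimitedReader and SecurityReader classes are hypothetical stand-ins, not Lucene classes, and the null-terminated getWrappedReader contract is an assumption modelled on getCause.

```java
/** Hypothetical sketch; Reader and its subclasses are stand-ins, not Lucene classes. */
final class UnwrapSketch {
    static class Reader {
        /** The reader this one wraps, or null when it wraps nothing. */
        public Reader getWrappedReader() { return null; }
    }

    /** Stands in for a concrete leaf reader such as a segment reader. */
    static class LeafReader extends Reader { }

    /** Generic wrapper base class, the role FilterIndexReader plays in the discussion. */
    static class FilterReader extends Reader {
        private final Reader in;

        public FilterReader(Reader in) { this.in = in; }

        @Override
        public Reader getWrappedReader() { return in; }
    }

    static class TimeLimitedReader extends FilterReader {
        public TimeLimitedReader(Reader in) { super(in); }
    }

    static class SecurityReader extends FilterReader {
        public SecurityReader(Reader in) { super(in); }
    }

    /** Peels off every wrapper without naming any concrete wrapper class. */
    static Reader unwrap(Reader reader) {
        Reader current = reader;
        while (current.getWrappedReader() != null) {
            current = current.getWrappedReader();
        }
        return current;
    }

    public static void main(String[] args) {
        Reader leaf = new LeafReader();
        Reader wrapped = new TimeLimitedReader(new SecurityReader(leaf));
        System.out.println("unwrapped to the leaf reader: " + (unwrap(wrapped) == leaf));
    }
}
```

Code that needs the innermost reader would call unwrap first and only then apply its own type checks, so adding a new wrapper class would not require touching that code.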

I'll attach an updated patch with the Junit test that currently fails on these 
instanceof checks.







 TimeLimitedIndexReader and associated utility class
 ---

 Key: LUCENE-1720
 URL: https://issues.apache.org/jira/browse/LUCENE-1720
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Attachments: ActivityTimedOutException.java, 
 ActivityTimeMonitor.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, 
 Lucene-1720.patch, LUCENE-1720.patch, TestTimeLimitedIndexReader.java, 
 TestTimeLimitedIndexReader.java, TimeLimitedIndexReader.java, 
 TimeLimitedIndexReader.java


 An alternative to TimeLimitedCollector that has the following advantages:
 1) Any reader activity can be time-limited rather than just single searches 
 e.g. the document retrieve phase.
 2) Times out faster (i.e. runaway queries such as fuzzies detected quickly 
 before last collect stage of query processing)
 Uses new utility timeout class that is independent of IndexReader.
 Initial contribution includes a performance test class but not had time as 
 yet to work up a formal Junit test.
 TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone.




[jira] Updated: (LUCENE-1720) TimeLimitedIndexReader and associated utility class

2010-02-12 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-1720:
-

Attachment: Lucene-1720.patch

Updated patch with TestTimeLimitingIndexReader and changes to core 
TestIndexReader to support easy testing of IndexReader wrapper classes

 TimeLimitedIndexReader and associated utility class
 ---

 Key: LUCENE-1720
 URL: https://issues.apache.org/jira/browse/LUCENE-1720
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Attachments: ActivityTimedOutException.java, 
 ActivityTimeMonitor.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, 
 Lucene-1720.patch, Lucene-1720.patch, LUCENE-1720.patch, 
 TestTimeLimitedIndexReader.java, TestTimeLimitedIndexReader.java, 
 TimeLimitedIndexReader.java, TimeLimitedIndexReader.java





[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class

2010-02-12 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833013#action_12833013
 ] 

Mark Harwood commented on LUCENE-1720:
--

bq. I think we should add some search timeout tests to it, 

Yep, I left a TODO in there to cover this. 

bq. I'll do that while I'm working on the ConcurrentHashMap thing, if you don't 
mind.

Great stuff. I'll leave this with you until further notice.

Thanks 


 TimeLimitedIndexReader and associated utility class
 ---

 Key: LUCENE-1720
 URL: https://issues.apache.org/jira/browse/LUCENE-1720
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Attachments: ActivityTimedOutException.java, 
 ActivityTimeMonitor.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, 
 Lucene-1720.patch, Lucene-1720.patch, LUCENE-1720.patch, 
 TestTimeLimitedIndexReader.java, TestTimeLimitedIndexReader.java, 
 TimeLimitedIndexReader.java, TimeLimitedIndexReader.java





[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class

2010-02-11 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12832444#action_12832444
 ] 

Mark Harwood commented on LUCENE-1720:
--

Thanks for the updates, Shai.

Agreed on removing the treemap comment.
As you suggest, there may be a low-level timing accuracy issue under heavy load, 
but for the typically longer timeout settings we may set this is less likely to 
be an issue. 

Related: I did think of another feature for ATM. Timeouts will typically be 
set to the maximum bearable value that can be sustained by the hardware without 
upsetting lots of users/customers who need answers.
This setting is therefore a tough business decision to make and is likely to be 
on the high side to avoid annoying customers (10 seconds? 30?).
The current monitoring solution only aborts at the latest possible stage, when 
the uppermost acceptable limit has been reached and expensive resource has 
already been burned.
Maybe we could add a progress-testing method to ATM which can throw an 
exception earlier, e.g.
public void checkForProjectedActivityTimeout(float percentActivityCompletedSoFar)
Clients would need to estimate how far through a task they were and call this 
method periodically.
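
A rough sketch of how a client might use such a check. It assumes progress is passed as a fraction between 0 and 1 and uses a simulated clock so the behaviour is deterministic; ProgressDrivenLoop and its parameters are illustrative names, not part of any patch.

```java
// Hypothetical client loop: every checkEvery items it compares the fraction
// of work done with the fraction of the time budget already used.
final class ProgressDrivenLoop {
    private ProgressDrivenLoop() {}

    /**
     * Returns the number of items completed when the loop stopped. The loop
     * stops early as soon as progress lags behind the share of the timeout
     * consumed, instead of waiting for the full timeout to expire.
     */
    static int run(int totalItems, int checkEvery, long millisPerItem, long timeoutMillis) {
        long elapsed = 0; // simulated clock; a real client would read the system time
        for (int done = 0; done < totalItems; done++) {
            if (done > 0 && done % checkEvery == 0) {
                float progress = (float) done / totalItems;
                float expectedMinimumProgress = (float) elapsed / timeoutMillis;
                if (progress < expectedMinimumProgress) {
                    return done; // a real client would throw or abort here
                }
            }
            elapsed += millisPerItem; // stands in for the real work on one item
        }
        return totalItems;
    }
}
```

In this simulation a task of 100 items at 50 ms each with a 1000 ms timeout is abandoned after 10 items (500 ms of work), because 10% progress is well below the 50% of the budget already used.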



 TimeLimitedIndexReader and associated utility class
 ---

 Key: LUCENE-1720
 URL: https://issues.apache.org/jira/browse/LUCENE-1720
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Attachments: ActivityTimedOutException.java, 
 ActivityTimeMonitor.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, 
 LUCENE-1720.patch, TestTimeLimitedIndexReader.java, 
 TestTimeLimitedIndexReader.java, TimeLimitedIndexReader.java, 
 TimeLimitedIndexReader.java





[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class

2010-02-11 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12832470#action_12832470
 ] 

Mark Harwood commented on LUCENE-1720:
--

The change to ATM isn't that big - as you say just adding start to the data 
on each thread.
Here's an (untested) example
{code:title=Bar.java|borderStyle=solid}
/**
 * Checks to see if this thread is likely to exceed its pre-determined timeout.
 * This is a heavier-weight call than checkForTimeout and should not
 * be called quite as frequently.
 *
 * Throws {@link ActivityTimedOutException} (a RuntimeException) in the
 * event of any anticipated timeout.
 * @param progress fraction of the activity completed so far (0 to 1)
 */
public static final void checkProjectedTimeoutOnThisThread(float progress)
{
    Thread currentThread=Thread.currentThread();
    synchronized(timeLimitedThreads)
    {
        ActivityTime thisTimeOut = timeLimitedThreads.get(currentThread);
        if(thisTimeOut!=null )
        {
            long now=System.currentTimeMillis();
            long maxDuration=thisTimeOut.scheduledTimeout-thisTimeOut.startTime;
            long durationSoFar=now-thisTimeOut.startTime;
            // Fraction of the time budget already used
            float expectedMinimumProgress=(float)durationSoFar/maxDuration;
            if(progress<expectedMinimumProgress)
            {
                long expectedOverrun=(long)
                    (((durationSoFar*(1f-progress))+now)-thisTimeOut.scheduledTimeout);
                throw new ActivityTimedOutException("Thread "+currentThread
                    +" is expected to time out, estimated overrun ="
                    +expectedOverrun+" ms",expectedOverrun);
            }
        }
    }
}
static class ActivityTime
{
    public ActivityTime(long startTime, long timeOutTime)
    {
        this.startTime=startTime;
        this.scheduledTimeout=timeOutTime;
    }
    long startTime;
    long scheduledTimeout;
}
{code} 

I agree it will be challenging to work out when to call this from readers etc 
and how to estimate completeness but as a general utility class (as you 
suggest, in o.a.l.util ) it seems like a useful addition.

My suspicion is that this is currently contrib - but then 
TimeLimitingCollector is currently in core.
Maybe TimeLimitingCollector could be rewritten to use ATM and then we maintain 
a common generally reusable implementation?






 TimeLimitedIndexReader and associated utility class
 ---

 Key: LUCENE-1720
 URL: https://issues.apache.org/jira/browse/LUCENE-1720
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Attachments: ActivityTimedOutException.java, 
 ActivityTimeMonitor.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, 
 LUCENE-1720.patch, TestTimeLimitedIndexReader.java, 
 TestTimeLimitedIndexReader.java, TimeLimitedIndexReader.java, 
 TimeLimitedIndexReader.java





[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class

2010-02-11 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12832483#action_12832483
 ] 

Mark Harwood commented on LUCENE-1720:
--

Agreed, it might be useful to provide a boolean response to the progress method, 
a kind of "how am I doing?" check.
We can always provide a convenience wrapper method which throws an exception: 
ATM.blowUpIfNotGoingFastEnough(float progress)

Re TimeLimitingCollector - agreed, you really do need to protect ATM/start/stop 
calls in the same try...finally block.
Maybe ATM could have a start method variant that takes an additional 
alreadyRunningSince argument as opposed to the existing assumption that the 
activity is starting right now. The first collect could then call this with a 
timestamp initialised in the constructor.
Even then, there is the issue of where to put the stop call - collector has 
no close call to signal the end of the activity.

Doesn't seem like TimeLimitingCollector can be based on the same ATM code. 
Shame.
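
A minimal sketch of the two ideas above: a boolean progress check with a throwing convenience wrapper, and a start variant that takes an alreadyRunningSince timestamp. ProgressMonitor is a hypothetical stand-in, not the ActivityTimeMonitor in the patch; the current time is passed in explicitly and a plain IllegalStateException is thrown, both purely to keep the example self-contained.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical monitor keyed by thread; each entry holds {startTime, scheduledTimeout}.
final class ProgressMonitor {
    private static final Map<Thread, long[]> ACTIVITIES =
            new ConcurrentHashMap<Thread, long[]>();

    private ProgressMonitor() {}

    /** Starts timing the current thread's activity as if it began at alreadyRunningSince. */
    static void start(long alreadyRunningSince, long maxDurationMillis) {
        ACTIVITIES.put(Thread.currentThread(),
                new long[] { alreadyRunningSince, alreadyRunningSince + maxDurationMillis });
    }

    /** Ends monitoring for the current thread. */
    static void stop() {
        ACTIVITIES.remove(Thread.currentThread());
    }

    /** The "how am I doing?" check: true if progress lags the share of time used. */
    static boolean isProjectedToTimeout(float progress, long now) {
        long[] times = ACTIVITIES.get(Thread.currentThread());
        if (times == null) {
            return false; // this thread is not being monitored
        }
        float expectedMinimumProgress = (float) (now - times[0]) / (times[1] - times[0]);
        return progress < expectedMinimumProgress;
    }

    /** Convenience wrapper that throws instead of returning a flag. */
    static void blowUpIfNotGoingFastEnough(float progress, long now) {
        if (isProjectedToTimeout(progress, now)) {
            throw new IllegalStateException("Activity is projected to time out");
        }
    }
}
```

The sketch does not solve the problem noted above: a collector still has no obvious place from which to call stop.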

 TimeLimitedIndexReader and associated utility class
 ---

 Key: LUCENE-1720
 URL: https://issues.apache.org/jira/browse/LUCENE-1720
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Attachments: ActivityTimedOutException.java, 
 ActivityTimeMonitor.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, 
 LUCENE-1720.patch, TestTimeLimitedIndexReader.java, 
 TestTimeLimitedIndexReader.java, TimeLimitedIndexReader.java, 
 TimeLimitedIndexReader.java





[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class

2010-02-11 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12832500#action_12832500
 ] 

Mark Harwood commented on LUCENE-1720:
--

I'll pick this up

 TimeLimitedIndexReader and associated utility class
 ---

 Key: LUCENE-1720
 URL: https://issues.apache.org/jira/browse/LUCENE-1720
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Attachments: ActivityTimedOutException.java, 
 ActivityTimeMonitor.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, 
 LUCENE-1720.patch, TestTimeLimitedIndexReader.java, 
 TestTimeLimitedIndexReader.java, TimeLimitedIndexReader.java, 
 TimeLimitedIndexReader.java





[jira] Updated: (LUCENE-1720) TimeLimitedIndexReader and associated utility class

2010-02-11 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-1720:
-

Attachment: Lucene-1720.patch

Moved ATM to o.a.l.util package
Added isProjectedToTimeout method to ATM and corresponding Junit test
Removed treemap comments

 TimeLimitedIndexReader and associated utility class
 ---

 Key: LUCENE-1720
 URL: https://issues.apache.org/jira/browse/LUCENE-1720
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Attachments: ActivityTimedOutException.java, 
 ActivityTimeMonitor.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, 
 Lucene-1720.patch, LUCENE-1720.patch, TestTimeLimitedIndexReader.java, 
 TestTimeLimitedIndexReader.java, TimeLimitedIndexReader.java, 
 TimeLimitedIndexReader.java





[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class

2010-02-11 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12832721#action_12832721
 ] 

Mark Harwood commented on LUCENE-1720:
--

bq. When's this ready to test with Solr?

I think the API is pretty stable - call try..start..finally...stop around 
time-critical stuff and use a TimeLimitedIndexReader to wrap your IndexReader.
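
The start/stop pairing can be sketched as below. TimeoutGuardExample is a hypothetical stand-in whose start and stop only count active activities; the real monitor's method names and signatures may differ, and wrapping the IndexReader is left out.

```java
// Hypothetical stand-in for the monitor; only the start/stop pairing matters here.
final class TimeoutGuardExample {
    static int activeCount = 0; // number of activities currently being timed

    private TimeoutGuardExample() {}

    static void start(long maxMillis) { activeCount++; }

    static void stop() { activeCount--; }

    /** Runs the task under timeout protection; stop() is reached even if the task throws. */
    static void runTimeLimited(Runnable task, long maxMillis) {
        start(maxMillis);
        try {
            task.run(); // the time-critical work, e.g. a search on the wrapped reader
        } finally {
            stop();     // always deregister, otherwise the thread stays monitored
        }
    }
}
```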

Internally the implementation feels reasonably stable too.

In my tests it doesn't seem to add too much overhead to calls -  I was getting 
response times of 3400 milliseconds on a heavy wikipedia query with 
TimeLimitedIndexReader versus 3300 for the same query on a raw IndexReader 
without timeout protection.

I'm tempted to try putting the timeout check calls directly into a version of 
IndexReader rather than in a delegating reader wrapper, just to see if the 
wrapper code is where the bulk of the extra overhead comes in. I'd hate to add 
any overhead to core IndexReader but I'm keen to see just how low-cost this 
check can get.

 TimeLimitedIndexReader and associated utility class
 ---

 Key: LUCENE-1720
 URL: https://issues.apache.org/jira/browse/LUCENE-1720
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Attachments: ActivityTimedOutException.java, 
 ActivityTimeMonitor.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, 
 Lucene-1720.patch, LUCENE-1720.patch, TestTimeLimitedIndexReader.java, 
 TestTimeLimitedIndexReader.java, TimeLimitedIndexReader.java, 
 TimeLimitedIndexReader.java





[jira] Updated: (LUCENE-1720) TimeLimitedIndexReader and associated utility class

2010-02-10 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-1720:
-

Attachment: ActivityTimeMonitor.java
TestTimeLimitedIndexReader.java
TimeLimitedIndexReader.java

Updated to work with Lucene 2.9.1 and 3.0.0 
Fixed NullPointer when reporting timedout threads

 TimeLimitedIndexReader and associated utility class
 ---

 Key: LUCENE-1720
 URL: https://issues.apache.org/jira/browse/LUCENE-1720
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Attachments: ActivityTimedOutException.java, 
 ActivityTimeMonitor.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, 
 TestTimeLimitedIndexReader.java, TestTimeLimitedIndexReader.java, 
 TimeLimitedIndexReader.java, TimeLimitedIndexReader.java





[jira] Updated: (LUCENE-725) NovelAnalyzer - wraps your choice of Lucene Analyzer and filters out all boilerplate text

2010-02-08 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-725:


Attachment: NovelAnalyzer.java

Updated for new 3.0 APIs

 NovelAnalyzer - wraps your choice of Lucene Analyzer and filters out all 
 boilerplate text
 ---

 Key: LUCENE-725
 URL: https://issues.apache.org/jira/browse/LUCENE-725
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Analysis
Reporter: Mark Harwood
Assignee: Otis Gospodnetic
Priority: Minor
 Attachments: NovelAnalyzer.java, NovelAnalyzer.java, 
 NovelAnalyzer.java


 This is a class I have found to be useful for analyzing small (in the 
 hundreds) collections of documents and  removing any duplicate content such 
 as standard disclaimers or repeated text in an exchange of  emails.
 This has applications in sampling query results to identify key phrases, 
 improving speed-reading of results with similar content (eg email 
 threads/forum messages) or just removing duplicated noise from a search index.
 To be more generally useful it needs to scale to millions of documents - in 
 which case an alternative implementation is required. See the notes in the 
 Javadocs for this class for more discussion on this




[jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2009-11-25 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12782521#action_12782521
 ] 

Mark Harwood commented on LUCENE-1486:
--

Ugh. There are probably two separate actions required here then:
1) a bug needs raising on Lucene.
2) guidance needed from the Solr team about preferred course of action


 Wildcards, ORs etc inside Phrase queries
 

 Key: LUCENE-1486
 URL: https://issues.apache.org/jira/browse/LUCENE-1486
 Project: Lucene - Java
  Issue Type: Improvement
  Components: QueryParser
Affects Versions: 2.4
Reporter: Mark Harwood
Priority: Minor
 Fix For: 3.1

 Attachments: ComplexPhraseQueryParser.java, 
 junit_complex_phrase_qp_07_21_2009.patch, 
 junit_complex_phrase_qp_07_22_2009.patch, Lucene-1486 non default 
 field.patch, LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, 
 LUCENE-1486.patch, LUCENE-1486.patch, TestComplexPhraseQuery.java


 An extension to the default QueryParser that overrides the parsing of 
 PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.
 The implementation feels a little hacky - this is arguably better handled in 
 QueryParser itself. This works as a proof of concept  for much of the query 
 parser syntax. Examples from the JUnit test include:
    checkMatches("\"j*   smyth~\"", "1,2"); // wildcards and fuzzies are OK in phrases
    checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic works
    checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic works

    checkBadQuery("\"jo*  id:1 smith\""); // mixing fields in a phrase is bad
    checkBadQuery("\"jo* \"smith\" \""); // phrases inside phrases is bad
    checkBadQuery("\"jo* [sma TO smZ]\" \""); // range queries inside phrases not supported
 Code plus Junit test to follow...




[jira] Commented: (LUCENE-1999) Match spotter for all query types

2009-10-21 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12768257#action_12768257
 ] 

Mark Harwood commented on LUCENE-1999:
--

bq. and 2) you need it for every single doc visited by the query

Actually I don't need it for every doc, only the top ones. It just happens to 
be so cheap to produce that I can afford to run this in-line with the query. (I 
haven't actually benchmarked it at scale but my gut feel is it would be fast.)

I was thinking that this might be orthogonal to the existing free-text based 
highlighter. The logic for this being roughly that

1) Highlighting of free-text fields is reasonably well-catered for with 
summarisation etc.
2) The remaining problem areas for highlighting (NumericRangeQuery, Spatial, 
Cached term filters on enums eg gender:male/female) are all likely to be 
non-free-text fields which don't require summarisation and only contain a 
single value.

I may be wrong in these assumptions about the existing state of play (any 
thoughts, Mark M?) but it might be useful to think of attacking the problem 
with these 2 different requirements in mind.

Regardless of type e.g. int, long etc I tend to think of fields as falling into 
these broad usage categories:

a) Identifiers (e.g. primary keys)
b) Quantifiers (e.g numerics, dates, spatial)
c) Free-text 
d) Controlled vocabularies (e.g. enums such as gender:m/f)

Type a) is catered for with a straight TermQuery and therefore can be handled 
with the existing highlighter
Type b) needs special indexes/queries (spatial/trie) and isn't catered for by 
the existing term/span-based Highlighter
Type c) is catered for with the existing highlighter and its summarising 
features
Type d) involves many TermDoc.next reads so is usefully cached as filters and 
therefore not catered for by existing Highlighter

So this patch helps cater for types b) and d) where simply knowing the field 
matched is all that is required to highlight.


 Match spotter for all query types
 -

 Key: LUCENE-1999
 URL: https://issues.apache.org/jira/browse/LUCENE-1999
 Project: Lucene - Java
  Issue Type: New Feature
Affects Versions: 2.9
Reporter: Mark Harwood
 Attachments: matchflagger.patch


 Related to LUCENE-1929 and the current inability to highlight 
 NumericRangeQuery, spatial, cached term filters and other exotica.
 This patch provides the ability to wrap *any* Query objects and record match 
 info as flags encoded in the overall document score.
 Using this approach it would be possible to understand (and therefore 
 highlight) which fields matched clauses in a query.
 The match encoding approach loses some precision in scores as noted here: 
 http://tinyurl.com/ykt8nx7
 Avoiding these precision issues would require a change to Lucene core to 
 record docId, score AND a matchFlag byte in ScoreDoc objects and collector 
 APIs.
 This may be something we should consider.




[jira] Created: (LUCENE-1999) Match spotter for all query types

2009-10-20 Thread Mark Harwood (JIRA)
Match spotter for all query types
-

 Key: LUCENE-1999
 URL: https://issues.apache.org/jira/browse/LUCENE-1999
 Project: Lucene - Java
  Issue Type: New Feature
Affects Versions: 2.9
Reporter: Mark Harwood
 Attachments: matchflagger.patch

Related to LUCENE-1929 and the current inability to highlight 
NumericRangeQuery, spatial, cached term filters and other exotica.

This patch provides the ability to wrap *any* Query objects and record match 
info as flags encoded in the overall document score.
Using this approach it would be possible to understand (and therefore 
highlight) which fields matched clauses in a query.

The match encoding approach loses some precision in scores as noted here: 
http://tinyurl.com/ykt8nx7

Avoiding these precision issues would require a change to Lucene core to record 
docId, score AND a matchFlag byte in ScoreDoc objects and collector APIs.
This may be something we should consider.
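
One way to picture the precision trade-off is a sketch in which the match flags overwrite the lowest bits of the float score. This is an assumption made for illustration and not necessarily how matchflagger.patch encodes its flags; ScoreFlags and the 8-bit flag width are hypothetical.

```java
// Hypothetical encoding: the lowest FLAG_BITS bits of the score's mantissa
// are sacrificed to carry one flag bit per query clause.
final class ScoreFlags {
    private static final int FLAG_BITS = 8;                    // assumed width
    private static final int FLAG_MASK = (1 << FLAG_BITS) - 1; // 0xFF

    private ScoreFlags() {}

    /** Returns a score whose low mantissa bits are replaced by the flags. */
    static float encode(float score, int flags) {
        int bits = Float.floatToIntBits(score);
        return Float.intBitsToFloat((bits & ~FLAG_MASK) | (flags & FLAG_MASK));
    }

    /** Reads the flags back out of an encoded score. */
    static int decode(float encodedScore) {
        return Float.floatToIntBits(encodedScore) & FLAG_MASK;
    }
}
```

The encoded score differs from the original only in its least significant mantissa bits, which is the loss of precision mentioned above. Carrying a separate matchFlag byte in ScoreDoc would avoid it.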





[jira] Updated: (LUCENE-1999) Match spotter for all query types

2009-10-20 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-1999:
-

Attachment: matchflagger.patch

 Match spotter for all query types
 -

 Key: LUCENE-1999
 URL: https://issues.apache.org/jira/browse/LUCENE-1999
 Project: Lucene - Java
  Issue Type: New Feature
Affects Versions: 2.9
Reporter: Mark Harwood
 Attachments: matchflagger.patch





[jira] Commented: (LUCENE-1910) Extension to MoreLikeThis to use tag information

2009-10-05 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12762290#action_12762290
 ] 

Mark Harwood commented on LUCENE-1910:
--

 2 minutes to create a query based on 10,000 documents?

Unfortunately, I can't see this being generally useful until the performance is 
improved dramatically.


 Extension to MoreLikeThis to use tag information
 

 Key: LUCENE-1910
 URL: https://issues.apache.org/jira/browse/LUCENE-1910
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Thomas D'Silva
Priority: Minor
 Attachments: LUCENE-1910.patch


 I would like to contribute a class based on the MoreLikeThis class in
 contrib/queries that generates a query based on the tags associated
 with a document. The class assumes that documents are tagged with a
 set of tags (which are stored in the index in a separate Field). The
 class determines the top document terms associated with a given tag
 using the information gain metric.
 While generating a MoreLikeThis query for a document the tags
 associated with document are used to determine the terms in the query.
 This class is useful for finding similar documents to a document that
 does not have many relevant terms but was tagged.




[jira] Commented: (LUCENE-1910) Extension to MoreLikeThis to use tag information

2009-09-21 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12757924#action_12757924
 ] 

Mark Harwood commented on LUCENE-1910:
--

Hi Thomas,
Following your request for feedback, some initial thoughts from a very quick 
look.

* The Information Gain algorithm could use a little more explanation, e.g. 
variable names more descriptive than num1 and num2, and could perhaps be 
extracted into a utility class

* Is this scalable? It looks like initialize is loading this:
{code:title=MoreLikeThisUsingTags.java|borderStyle=solid}
/**
  * All terms in the index
  */
protected HashSet docTerms=new HashSet();
{code} 
..that seems a little scary!
It's also running a separate BooleanQuery for every item in this list (and 
repeated for >1 tag?). That looks like a lot of searches.
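
To illustrate the first point above, here is a minimal sketch of what an extracted utility with descriptive names might look like. It uses the textbook definition of information gain (entropy of the tag minus the entropy of the tag given the term); the class name, method names and parameters are hypothetical and are not taken from the attached patch, whose formula may differ.

```java
/** Hypothetical utility class; all names are illustrative, not from LUCENE-1910.patch. */
final class InformationGain {
    private InformationGain() {}

    private static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    /** Entropy in bits of a yes/no event that occurs with probability p. */
    static double entropy(double p) {
        if (p <= 0.0 || p >= 1.0) {
            return 0.0; // a certain outcome carries no uncertainty
        }
        return -(p * log2(p) + (1.0 - p) * log2(1.0 - p));
    }

    /**
     * Information gain (in bits) about "document has the tag" obtained by
     * observing "document contains the term", computed from document counts.
     */
    static double compute(long totalDocs, long docsWithTag,
                          long docsWithTerm, long docsWithTermAndTag) {
        if (totalDocs <= 0) {
            return 0.0;
        }
        long docsWithoutTerm = totalDocs - docsWithTerm;
        double pTag = (double) docsWithTag / totalDocs;
        double pTerm = (double) docsWithTerm / totalDocs;
        double pTagGivenTerm = docsWithTerm == 0
                ? 0.0 : (double) docsWithTermAndTag / docsWithTerm;
        double pTagGivenNoTerm = docsWithoutTerm == 0
                ? 0.0 : (double) (docsWithTag - docsWithTermAndTag) / docsWithoutTerm;
        // Expected entropy of the tag once we know whether the term is present.
        double conditionalEntropy = pTerm * entropy(pTagGivenTerm)
                + (1.0 - pTerm) * entropy(pTagGivenNoTerm);
        return entropy(pTag) - conditionalEntropy;
    }
}
```

With 4 documents, 2 tagged, and a term that appears in exactly the 2 tagged documents, compute(4, 2, 2, 2) gives 1 bit; a term spread evenly across tagged and untagged documents, compute(4, 2, 2, 1), gives 0.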

I need to spend a little more time looking at it before I understand it in more 
detail.
Before then - have you tested this on a big (millions of docs/terms) index? 
Some performance figures would be useful to accompany this.

Cheers,
Mark


 Extension to MoreLikeThis to use tag information
 

 Key: LUCENE-1910
 URL: https://issues.apache.org/jira/browse/LUCENE-1910
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Thomas D'Silva
Priority: Minor
 Attachments: LUCENE-1910.patch





[jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2009-08-26 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12748046#action_12748046
 ] 

Mark Harwood commented on LUCENE-1486:
--

It does not stand on its own, as it is merely a temporary object that exists 
because of a peculiarity in the way the parsing works. The SpanQuery family 
would be the legitimate standalone equivalents of this class.

ComplexPhraseQuery objects are constructed during the first pass of parsing 
to capture everything between quotes as an opaque string.
The ComplexPhraseQueryParser then calls parsePhraseElements(...) on these 
objects to complete the process of parsing in a second pass, where in this 
context any brackets etc. take on a different meaning.
There is no merit in making this externally visible.
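
As a rough illustration of the two-pass idea described above, here is a toy sketch that does not use Lucene at all. The class TwoPassPhraseSketch, the nested OpaquePhrase class and the firstPass method are hypothetical names invented for this example; only the method name parsePhraseElements echoes the comment, and its body here is a simplification rather than the real implementation.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy illustration only; not the ComplexPhraseQueryParser code. */
final class TwoPassPhraseSketch {

    /** Temporary holder for a quoted section captured verbatim in the first pass. */
    static final class OpaquePhrase {
        final String rawText;

        OpaquePhrase(String rawText) {
            this.rawText = rawText;
        }

        /** Second pass: only now is the captured text split into phrase elements. */
        List<String> parsePhraseElements() {
            List<String> elements = new ArrayList<>();
            for (String element : rawText.trim().split("\\s+")) {
                if (!element.isEmpty()) {
                    elements.add(element);
                }
            }
            return elements;
        }
    }

    /** First pass: capture everything between each pair of quotes as an opaque string. */
    static List<OpaquePhrase> firstPass(String query) {
        List<OpaquePhrase> phrases = new ArrayList<>();
        int open = query.indexOf('"');
        while (open >= 0) {
            int close = query.indexOf('"', open + 1);
            if (close < 0) {
                break; // unbalanced quote; a real parser would report an error here
            }
            phrases.add(new OpaquePhrase(query.substring(open + 1, close)));
            open = query.indexOf('"', close + 1);
        }
        return phrases;
    }
}
```

The point of the sketch is that the placeholder object has no meaning outside the parser: it only carries raw text from one pass to the next.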





 Wildcards, ORs etc inside Phrase queries
 

 Key: LUCENE-1486
 URL: https://issues.apache.org/jira/browse/LUCENE-1486
 Project: Lucene - Java
  Issue Type: Improvement
  Components: QueryParser
Affects Versions: 2.4
Reporter: Mark Harwood
Assignee: Mark Miller
Priority: Minor
 Fix For: 3.0, 3.1

 Attachments: ComplexPhraseQueryParser.java, 
 junit_complex_phrase_qp_07_21_2009.patch, 
 junit_complex_phrase_qp_07_22_2009.patch, Lucene-1486 non default 
 field.patch, LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, 
 LUCENE-1486.patch, LUCENE-1486.patch, TestComplexPhraseQuery.java


 An extension to the default QueryParser that overrides the parsing of 
 PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.
 The implementation feels a little hacky - this is arguably better handled in 
 QueryParser itself. This works as a proof of concept for much of the query 
 parser syntax. Examples from the Junit test include:
   checkMatches("\"j*   smyth~\"", "1,2"); //wildcards and fuzzies are OK in phrases
   checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic works
   checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic works.

   checkBadQuery("\"jo*  id:1 smith\""); //mixing fields in a phrase is bad
   checkBadQuery("\"jo* \"smith\" \""); //phrases inside phrases is bad
   checkBadQuery("\"jo* [sma TO smZ]\" \""); //range queries inside phrases not supported
 Code plus Junit test to follow...




[jira] Updated: (LUCENE-1720) TimeLimitedIndexReader and associated utility class

2009-08-17 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-1720:
-

Attachment: ActivityTimeMonitor.java

Had another run at ActivityTimeMonitor tonight and rationalised the code based 
on earlier comments. It should now cater for multiple simultaneous timeouts 
more cleanly.

I'm concentrating on robustness with this currently - there's a TODO comment in 
the code that captures a small remaining inefficiency in iterating through all 
threads' data rather than using some form of time-sorted list. There was a 
suggestion in the earlier Jira comments that TreeMap might be a simple 
alternative, but see my Java code comments as to why this is unlikely to work. 
Optimising this is likely to require the introduction of yet another data 
structure, but this will add a runtime cost to maintain it - a cost I'm not sure 
is justified.
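
For readers without the attachment to hand, the inefficiency being discussed amounts to a linear scan along the following lines. This is only a sketch under assumptions: the class name EarliestTimeoutScan, the method name and the Map<Thread, Long> representation of per-thread deadlines are invented here and may not match ActivityTimeMonitor.java.

```java
import java.util.Map;

/** Illustrative only; the data layout is an assumption, not the attached code. */
final class EarliestTimeoutScan {
    private EarliestTimeoutScan() {}

    /**
     * Walks every registered thread's deadline and returns the earliest one,
     * or Long.MAX_VALUE when nothing is registered. Cost is O(n) per call,
     * which is the price of not maintaining a separate time-sorted structure.
     */
    static long firstAnticipatedTimeout(Map<Thread, Long> deadlines) {
        long earliest = Long.MAX_VALUE;
        for (long deadline : deadlines.values()) {
            if (deadline < earliest) {
                earliest = deadline;
            }
        }
        return earliest;
    }
}
```

A sorted structure would make this lookup cheaper, but every start and stop would then have to update it, which is the maintenance cost weighed in the comment above.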



 TimeLimitedIndexReader and associated utility class
 ---

 Key: LUCENE-1720
 URL: https://issues.apache.org/jira/browse/LUCENE-1720
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Attachments: ActivityTimedOutException.java, 
 ActivityTimeMonitor.java, ActivityTimeMonitor.java, 
 TestTimeLimitedIndexReader.java, TimeLimitedIndexReader.java


 An alternative to TimeLimitedCollector that has the following advantages:
 1) Any reader activity can be time-limited rather than just single searches 
 e.g. the document retrieve phase.
 2) Times out faster (i.e. runaway queries such as fuzzies are detected quickly, 
 before the last collect stage of query processing)
 Uses a new utility timeout class that is independent of IndexReader.
 The initial contribution includes a performance test class, but I have not yet 
 had time to work up a formal Junit test.
 TimeLimitedIndexReader is coded for JDK 1.5 but this can easily be undone.




[jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2009-07-30 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12737270#action_12737270
 ] 

Mark Harwood commented on LUCENE-1486:
--

No objections to pulling from core given the impending deprecation of the 
QueryParser base class.

I know of at least 2 folks using it so moving it to contrib would help provide 
somewhere to maintain fixes while we wait for the new QueryParser to 
incorporate the complex phrase features.

 Wildcards, ORs etc inside Phrase queries
 

 Key: LUCENE-1486
 URL: https://issues.apache.org/jira/browse/LUCENE-1486
 Project: Lucene - Java
  Issue Type: Improvement
  Components: QueryParser
Affects Versions: 2.4
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Fix For: 2.9

 Attachments: ComplexPhraseQueryParser.java, 
 junit_complex_phrase_qp_07_21_2009.patch, 
 junit_complex_phrase_qp_07_22_2009.patch, Lucene-1486 non default 
 field.patch, LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, 
 LUCENE-1486.patch, TestComplexPhraseQuery.java





[jira] Updated: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2009-07-24 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-1486:
-

Attachment: Lucene-1486 non default field.patch

Fix for phrases using QueryParser's non-default field e.g. 
 author:"j* smith"


 Wildcards, ORs etc inside Phrase queries
 

 Key: LUCENE-1486
 URL: https://issues.apache.org/jira/browse/LUCENE-1486
 Project: Lucene - Java
  Issue Type: Improvement
  Components: QueryParser
Affects Versions: 2.4
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Fix For: 2.9

 Attachments: ComplexPhraseQueryParser.java, 
 junit_complex_phrase_qp_07_21_2009.patch, 
 junit_complex_phrase_qp_07_22_2009.patch, Lucene-1486 non default 
 field.patch, LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, 
 LUCENE-1486.patch, TestComplexPhraseQuery.java





[jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2009-07-22 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734148#action_12734148
 ] 

Mark Harwood commented on LUCENE-1486:
--

I'll try and catch up with some of the issues raised here:

bq. What do you mean on the last check by phrase inside phrase, I don't see any 
phrase inside a phrase

Correct, the inner phrase example was a term not a phrase. This is perhaps a 
better example:

checkBadQuery("\"jo* \"percival smith\" \""); //phrases inside phrases is bad

bq. I'm trying now to figure out what is supported 

The Junit is currently the main form of documentation - unlike the 
XMLQueryParser (which has a DTD) there is no syntax to formally capture the 
logic. 
Here is a basic summary of the syntax supported and how it differs from normal 
non-phrase use of the same operators:

* Wildcard/fuzzy/range clauses can be used to define a phrase element (as 
opposed to simply single terms)
* Brackets are used to group/define the acceptable variations for a given 
phrase element  e.g. (john OR jonathon) smith 
* AND is irrelevant - there is effectively an implied AND_NEXT_TO binding 
all phrase elements 
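
The summary above can be made concrete with a toy sketch that turns the inside of a phrase into an ordered list of elements, each holding its acceptable variations. This is not the ComplexPhraseQueryParser implementation: the class PhraseElementSketch and its parse method are hypothetical, nested brackets are not handled, and wildcard/fuzzy terms are simply kept as text rather than expanded.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Toy model of phrase elements; names and behaviour are assumptions for illustration. */
final class PhraseElementSketch {
    private PhraseElementSketch() {}

    /**
     * Returns one inner list per phrase position. A bracketed group becomes a
     * single element whose entries are alternatives (ORed); a bare token becomes
     * an element with one entry. Consecutive elements are implicitly "next to"
     * each other, so a literal AND between them adds nothing and is dropped.
     */
    static List<List<String>> parse(String phrase) {
        List<List<String>> elements = new ArrayList<>();
        int i = 0;
        while (i < phrase.length()) {
            char c = phrase.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (c == '(') {
                int close = phrase.indexOf(')', i);
                if (close < 0) {
                    throw new IllegalArgumentException("Unbalanced bracket in: " + phrase);
                }
                List<String> alternatives = new ArrayList<>();
                for (String token : phrase.substring(i + 1, close).trim().split("\\s+")) {
                    if (!token.isEmpty() && !token.equals("OR")) {
                        alternatives.add(token);
                    }
                }
                elements.add(alternatives);
                i = close + 1;
            } else {
                int end = i;
                while (end < phrase.length()
                        && !Character.isWhitespace(phrase.charAt(end))
                        && phrase.charAt(end) != '(') {
                    end++;
                }
                String token = phrase.substring(i, end);
                if (!token.equals("AND")) {
                    elements.add(Collections.singletonList(token));
                }
                i = end;
            }
        }
        return elements;
    }
}
```

For example, parse("(john OR jonathon) smith") yields [[john, jonathon], [smith]]: two positions, the first with two acceptable variations.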

To move this forward I would suggest we consider following one of these options:

1) Keep in core and improve error reporting and documentation
2) Move into contrib as experimental 
3) Retain in core but simplify it to support only the simplest syntax (as in my 
Britney~ example)
4) Re-engineer the QueryParser.jj to support a formally defined syntax for 
the operators acceptable within phrases e.g. *, ~, ( ) 

I think 1) is achievable if we carefully define where the existing parser 
breaks (e.g. ANDs and nested brackets)
2) is unnecessary if we can achieve 1).
3) would be a shame if we lost useful features for some very convoluted edge 
cases
4) is beyond my JavaCC skills.

 Wildcards, ORs etc inside Phrase queries
 

 Key: LUCENE-1486
 URL: https://issues.apache.org/jira/browse/LUCENE-1486
 Project: Lucene - Java
  Issue Type: Improvement
  Components: QueryParser
Affects Versions: 2.4
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Fix For: 2.9

 Attachments: ComplexPhraseQueryParser.java, 
 junit_complex_phrase_qp_07_21_2009.patch, 
 junit_complex_phrase_qp_07_22_2009.patch, LUCENE-1486.patch, 
 LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, 
 TestComplexPhraseQuery.java





[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class

2009-07-22 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734176#action_12734176
 ] 

Mark Harwood commented on LUCENE-1720:
--

bq. Hey Mark. Have you made any progress with that?

Apologies, recently the lure of developing apps for the new iPhone has put paid 
to that :)

I'm still happy that the pseudo-code we outlined in the last couple of comments 
is what is needed to finish this.

bq.We can tag team if you want 

Yep, happy to do that. Let me know if you start work to avoid me duplicating 
effort and I'll do the same.

Cheers
Mark



 TimeLimitedIndexReader and associated utility class
 ---

 Key: LUCENE-1720
 URL: https://issues.apache.org/jira/browse/LUCENE-1720
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Attachments: ActivityTimedOutException.java, 
 ActivityTimeMonitor.java, TestTimeLimitedIndexReader.java, 
 TimeLimitedIndexReader.java





[jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2009-07-22 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734337#action_12734337
 ] 

Mark Harwood commented on LUCENE-1486:
--

bq. I think it's not a big deal, but I'm just trying to understand and raise a 
probable wrong test.

Granted, the test fails for a reason other than the one for which I wanted it 
to fail. 
We can probably strike the test and leave a note saying phrase-within-a-phrase 
just does not make sense and is not supported.

bq.  Is the operator between 'query' and 'parser' the implicit AND_NEXT_TO or 
the default boolean operator (usually OR)?

In brackets it's an OR - the brackets are used to suggest that the current 
phrase element at position X is composed of some choices that are evaluated as 
a subclause in the same way that in normal query logic sub-clauses are defined 
in brackets e.g. +a +(b OR c). There seems to be a reasonable logic to this.

Ideally the ComplexPhraseQueryParser should explicitly turn this setting on 
while evaluating the bracketed innards of phrases just in case the base class 
has AND as the default.

bq. Mark H, can you please elaborate more on these other operators + - 
^ AND && || NOT ! : [ ] { }.

OK, I'll try to deal with them one by one, but these are not necessarily 
definitive answers or guarantees of correctly implemented support:

* OR, ||, +, AND and && are ignored. The implicit operator is AND_NEXT_TO, apart 
from in bracketed sections, where all elements at this level are ORed
* ^ boosts are carried through from TermQuerys to SpanTermQuerys
* NOT and ! create SpanNotQueries
* [ ] and { } range queries are supported, as are wildcards (*, ?) and fuzzies (~)

bq. query: '(john OR jonathon) smith~0.3 order*' order:sell stock market


I'll post the XML query syntax equivalent of what should be parsed here shortly 
(just seen your next comment come in) 





 Wildcards, ORs etc inside Phrase queries
 

 Key: LUCENE-1486
 URL: https://issues.apache.org/jira/browse/LUCENE-1486
 Project: Lucene - Java
  Issue Type: Improvement
  Components: QueryParser
Affects Versions: 2.4
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Fix For: 2.9

 Attachments: ComplexPhraseQueryParser.java, 
 junit_complex_phrase_qp_07_21_2009.patch, 
 junit_complex_phrase_qp_07_22_2009.patch, LUCENE-1486.patch, 
 LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, 
 TestComplexPhraseQuery.java





[jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2009-07-22 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734349#action_12734349
 ] 

Mark Harwood commented on LUCENE-1486:
--

{quote}for test checkMatches("\"(jo* -john) smyth\"", "2"); 
would document 5 be returned or just doc 2 should be returned,
{quote}

I presume you mean smith not smyth here, otherwise nothing would match? If so, 
doc 5 should match and position is relevant (subject to slop factors).

{quote}
Question 2)
Should these 2 queries behave the same when we fix the problem
// checkMatches("\"john -percival\"", "1"); // not logic doesn't work
// checkMatches("\"john (-percival)\"", "1"); // not logic doesn't work
{quote}

I suppose there's an open question as to whether the second example is legal 
(the brackets are unnecessary).

{quote}
Question 3)
checkMatches("\"jo* smith\"~2", "1,2,3,5"); // position logic works.
doc 6 is also returned, so this feature does not seem to be working.
{quote}

That looks like a bug related to slop factor?

{quote}
Question 4)
The usage of AND and AND_NEXT_TO is confusing to me
the query 
checkMatches("\"(jo* AND mary) smith\"", "1,2,5"); // boolean logic with
{quote}
ANDs are ignored and turned into ORs (see earlier comments) but maybe a query 
parse error should be thrown to emphasise this.





 Wildcards, ORs etc inside Phrase queries
 

 Key: LUCENE-1486
 URL: https://issues.apache.org/jira/browse/LUCENE-1486
 Project: Lucene - Java
  Issue Type: Improvement
  Components: QueryParser
Affects Versions: 2.4
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Fix For: 2.9

 Attachments: ComplexPhraseQueryParser.java, 
 junit_complex_phrase_qp_07_21_2009.patch, 
 junit_complex_phrase_qp_07_22_2009.patch, LUCENE-1486.patch, 
 LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, 
 TestComplexPhraseQuery.java





[jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2009-07-22 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734355#action_12734355
 ] 

Mark Harwood commented on LUCENE-1486:
--

{quote}
query: '(john OR jonathon) smith~0.3 order*' order:sell stock market
{quote}
Would be parsed as follows (shown as equivalent XMLQueryParser syntax)
{code:xml} 
<BooleanQuery>
  <Clause occurs="should">
    <SpanNear>
      <SpanOr>
        <SpanOrTerms>john jonathon</SpanOrTerms>
      </SpanOr>
      <SpanOr>
        <SpanOrTerms>smith smyth</SpanOrTerms>
      </SpanOr>
      <SpanOr>
        <SpanOrTerms>order orders</SpanOrTerms>
      </SpanOr>
    </SpanNear>
  </Clause>
  <Clause occurs="should">
    <TermQuery fieldName="order">sell</TermQuery>
  </Clause>
  <Clause occurs="should">
    <UserQuery>stock market</UserQuery>
  </Clause>
</BooleanQuery>
{code}


 Wildcards, ORs etc inside Phrase queries
 

 Key: LUCENE-1486
 URL: https://issues.apache.org/jira/browse/LUCENE-1486
 Project: Lucene - Java
  Issue Type: Improvement
  Components: QueryParser
Affects Versions: 2.4
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Fix For: 2.9

 Attachments: ComplexPhraseQueryParser.java, 
 junit_complex_phrase_qp_07_21_2009.patch, 
 junit_complex_phrase_qp_07_22_2009.patch, LUCENE-1486.patch, 
 LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, 
 TestComplexPhraseQuery.java





[jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2009-07-06 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12727685#action_12727685
 ] 

Mark Harwood commented on LUCENE-1486:
--

Hi Mark,
Mind if I try committing this patch?
I've just switched from PC to Mac and my dev environment is all changed 
(Subclipse vs TortoiseSvn etc) and I wouldn't mind checking my config and 
commit rights still work in this new environment.
If anyone has any  mac/subclipse-related gotchas I should be aware of, do let 
me know. 

Cheers
Mark

 Wildcards, ORs etc inside Phrase queries
 

 Key: LUCENE-1486
 URL: https://issues.apache.org/jira/browse/LUCENE-1486
 Project: Lucene - Java
  Issue Type: Improvement
  Components: QueryParser
Affects Versions: 2.4
Reporter: Mark Harwood
Assignee: Mark Miller
Priority: Minor
 Fix For: 2.9

 Attachments: ComplexPhraseQueryParser.java, LUCENE-1486.patch, 
 LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, 
 TestComplexPhraseQuery.java





[jira] Closed: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2009-07-06 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood closed LUCENE-1486.


Resolution: Fixed

Committed in 791579 -  http://svn.apache.org/viewvc?rev=791579&view=rev

 Wildcards, ORs etc inside Phrase queries
 

 Key: LUCENE-1486
 URL: https://issues.apache.org/jira/browse/LUCENE-1486
 Project: Lucene - Java
  Issue Type: Improvement
  Components: QueryParser
Affects Versions: 2.4
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Fix For: 2.9

 Attachments: ComplexPhraseQueryParser.java, LUCENE-1486.patch, 
 LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, 
 TestComplexPhraseQuery.java





[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class

2009-07-01 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12726064#action_12726064
 ] 

Mark Harwood commented on LUCENE-1720:
--

re points 1,2,3  - yep, will change.

re the question  - yes, TimeoutThread should call the existing 
resetFirstAnticipatedFailure() method to advance timeout monitoring 
immediately to the next candidate - it currently requires the first bad Thread 
to call stop() before monitoring is advanced to spot the next bad thread.

I think a useful safety measure is to manage clients that don't call stop() 
(e.g. forgetting to code a finally...stop) but this is likely to add 
complexity to ActivityTimeMonitor so I want to get a basic version solid first 
before thinking too much about this.

 TimeLimitedIndexReader and associated utility class
 ---

 Key: LUCENE-1720
 URL: https://issues.apache.org/jira/browse/LUCENE-1720
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Attachments: ActivityTimedOutException.java, 
 ActivityTimeMonitor.java, TestTimeLimitedIndexReader.java, 
 TimeLimitedIndexReader.java


 An alternative to TimeLimitedCollector that has the following advantages:
 1) Any reader activity can be time-limited rather than just single searches 
 e.g. the document retrieve phase.
 2) Times out faster (i.e. runaway queries such as fuzzies detected quickly 
 before last collect stage of query processing)
 Uses new utility timeout class that is independent of IndexReader.
 Initial contribution includes a performance test class but not had time as 
 yet to work up a formal Junit test.
 TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone.




[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class

2009-07-01 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12726128#action_12726128
 ] 

Mark Harwood commented on LUCENE-1720:
--

Maybe we should start by debugging some guiding principles:

1) There is a holding list of active threads that are of indeterminate status
2) There is a list of threads that are known to have timed out
3) The monitoring thread has the job of moving items from 1) to 2); it waits 
for firstAnticipatedTimeout and is notified if firstAnticipatedTimeout changes
4) Start() adds a thread to 1)
5) Stop() removes a thread from 1) or 2)
6) Check() throws an exception if anActivityHasTimedOut is true (for fast 
fail) and the current thread is in 2)
7) Any modification to 2) should set the anActivityHasTimedOut boolean flag to 
true if 2)'s size is > 0, and to false otherwise
8) Any modification to 1) should re-assess firstAnticipatedTimeout and notify 3) 
if it has changed
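
The eight principles above could be sketched roughly as follows. This is only an illustration: the class name TimeoutMonitorSketch, the sweep() method and the use of IllegalStateException are assumptions made for the example, not the code attached to this issue. The monitoring job of principle 3 is shown as a single sweep() pass so that it can be driven deterministically; a real monitor thread would presumably loop, wait() until the returned deadline and be woken by the notifyAll() calls.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

/** Illustrative sketch only; not the attached ActivityTimeMonitor. */
final class TimeoutMonitorSketch {
    // 1) Active threads of indeterminate status, mapped to their deadline (ms).
    private final Map<Thread, Long> active = new HashMap<>();
    // 2) Threads known to have timed out.
    private final Set<Thread> timedOut = new HashSet<>();
    // Fast-fail flag; volatile so check() can read it without taking the lock.
    private volatile boolean anActivityHasTimedOut = false;

    // 4) start() adds the calling thread to list 1.
    synchronized void start(long timeoutMillis) {
        active.put(Thread.currentThread(), System.currentTimeMillis() + timeoutMillis);
        notifyAll(); // 8) the first anticipated timeout may have changed
    }

    // 5) stop() removes the calling thread from list 1 or list 2.
    synchronized void stop() {
        Thread t = Thread.currentThread();
        active.remove(t);
        timedOut.remove(t);
        anActivityHasTimedOut = !timedOut.isEmpty(); // 7)
        notifyAll(); // 8)
    }

    // 6) check() throws only if the flag is set AND the caller is in list 2.
    void check() {
        if (!anActivityHasTimedOut) {
            return; // common case: one volatile read, no locking
        }
        synchronized (this) {
            if (timedOut.contains(Thread.currentThread())) {
                throw new IllegalStateException("activity timed out");
            }
        }
    }

    // 3) One pass of the monitoring job: move expired entries from list 1 to
    // list 2. Returns the earliest remaining deadline, or -1 if list 1 is empty.
    synchronized long sweep(long nowMillis) {
        long firstAnticipatedTimeout = -1;
        Iterator<Map.Entry<Thread, Long>> it = active.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Thread, Long> e = it.next();
            long deadline = e.getValue();
            if (deadline <= nowMillis) {
                timedOut.add(e.getKey());
                it.remove();
            } else if (firstAnticipatedTimeout < 0 || deadline < firstAnticipatedTimeout) {
                firstAnticipatedTimeout = deadline;
            }
        }
        anActivityHasTimedOut = !timedOut.isEmpty(); // 7)
        return firstAnticipatedTimeout;
    }

    public static void main(String[] args) {
        TimeoutMonitorSketch monitor = new TimeoutMonitorSketch();
        monitor.start(1000);
        monitor.check(); // nothing has timed out yet
        monitor.sweep(System.currentTimeMillis() + 5000); // pretend 5s have passed
        try {
            monitor.check();
            System.out.println("still running");
        } catch (IllegalStateException e) {
            System.out.println("stopped: " + e.getMessage());
        } finally {
            monitor.stop();
        }
    }
}
```

Passing the current time into sweep() is a choice made for testability in this sketch; it keeps the example free of sleeps and background threads.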



 TimeLimitedIndexReader and associated utility class
 ---

 Key: LUCENE-1720
 URL: https://issues.apache.org/jira/browse/LUCENE-1720
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Attachments: ActivityTimedOutException.java, 
 ActivityTimeMonitor.java, TestTimeLimitedIndexReader.java, 
 TimeLimitedIndexReader.java


 An alternative to TimeLimitedCollector that has the following advantages:
 1) Any reader activity can be time-limited rather than just single searches 
 e.g. the document retrieve phase.
 2) Times out faster (i.e. runaway queries such as fuzzies detected quickly 
 before last collect stage of query processing)
 Uses new utility timeout class that is independent of IndexReader.
 Initial contribution includes a performance test class but not had time as 
 yet to work up a formal Junit test.
 TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone.




[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class

2009-06-30 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12725500#action_12725500
 ] 

Mark Harwood commented on LUCENE-1720:
--

bq. Maybe we can benchmark this approach

See 
http://www.nabble.com/Improving-TimeLimitedCollector-td24174758.html#a24229185
The figures were produced by TestTimeLimitedIndexReader that is part of this 
Jira issue so you can try benchmarks on your own indexes.

bq. if it slows down queries due to the Thread.currentThread and hash lookup

This lookup only happens when threads start or stop timed activities and when 
there is a timed-out state. All other method invocations on 
TimeLimitedIndexReader, e.g. termDocs.next(), simply test a volatile boolean 
that indicates whether any timeout has occurred. This seems to be fast in 
my benchmarks.
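
As a rough illustration of that fast path, the sketch below wraps a plain java.util.Iterator rather than a real TermDocs, so the wrapper class, the static fields and the exception type are all assumptions made for the example. Every delegated call first reads one volatile boolean, and the per-thread set lookup only happens once that flag is set.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative sketch only; not the attached TimeLimitedIndexReader. */
final class FastFailWrapperSketch {
    // Set by a (hypothetical) monitor when any activity has timed out.
    static volatile boolean anActivityHasTimedOut = false;
    // Threads whose activity has timed out.
    static final Set<Thread> timedOutThreads = ConcurrentHashMap.newKeySet();

    static void checkForTimeout() {
        // Short-circuit: the set lookup is skipped while the flag is false.
        if (anActivityHasTimedOut && timedOutThreads.contains(Thread.currentThread())) {
            throw new IllegalStateException("activity timed out");
        }
    }

    /** Delegating wrapper that checks for a timeout before every call. */
    static final class TimeLimitedIterator<T> implements Iterator<T> {
        private final Iterator<T> in;

        TimeLimitedIterator(Iterator<T> in) {
            this.in = in;
        }

        @Override
        public boolean hasNext() {
            checkForTimeout();
            return in.hasNext();
        }

        @Override
        public T next() {
            checkForTimeout();
            return in.next();
        }
    }

    public static void main(String[] args) {
        Iterator<String> it = new TimeLimitedIterator<>(Arrays.asList("a", "b").iterator());
        System.out.println(it.next());
        // Simulate the monitor flagging this thread as timed out.
        timedOutThreads.add(Thread.currentThread());
        anActivityHasTimedOut = true;
        try {
            it.next();
        } catch (IllegalStateException e) {
            System.out.println("stopped: " + e.getMessage());
        }
    }
}
```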

bq. maybe we can .. change the Lucene API such that we pass in an argument to 
the IndexReader methods where the timeout may be checked 

The current design uses static methods, which remove the need to pass a timeout 
object as context everywhere. The downside of this approach is that a single 
client thread is unable to time more than one activity at once, which we 
thought was a reasonable trade-off. See 
http://www.nabble.com/Re%3A-Improving-TimeLimitedCollector-p24234976.html
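
To make the trade-off concrete, here is a minimal sketch of a static, thread-keyed timer. StaticTimerSketch and its method names are hypothetical and do not reflect the attached ActivityTimeMonitor API. Because the state is keyed by Thread.currentThread(), callers need no context object, but a second start() on the same thread simply replaces the first deadline. The search() method shows the try/finally pattern that guarantees stop() is called.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative sketch only; names are hypothetical. */
final class StaticTimerSketch {
    // One deadline per thread: the key is the calling thread itself.
    private static final Map<Thread, Long> DEADLINES = new ConcurrentHashMap<>();

    static void start(long timeoutMillis) {
        // A second start() on the same thread overwrites the first entry.
        DEADLINES.put(Thread.currentThread(), System.currentTimeMillis() + timeoutMillis);
    }

    static void stop() {
        DEADLINES.remove(Thread.currentThread());
    }

    static boolean isTimed() {
        return DEADLINES.containsKey(Thread.currentThread());
    }

    static int trackedActivities() {
        return DEADLINES.size();
    }

    // Typical client usage: stop() sits in a finally block so the thread is
    // always deregistered, even if the timed work throws.
    static int search(String query) {
        start(500);
        try {
            return query.length(); // placeholder for the real timed work
        } finally {
            stop();
        }
    }

    public static void main(String[] args) {
        start(1000);
        start(50); // replaces the 1000 ms deadline rather than adding a second one
        System.out.println("tracked activities: " + trackedActivities());
        stop();
        System.out.println("result: " + search("jo* smith"));
    }
}
```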

 TimeLimitedIndexReader and associated utility class
 ---

 Key: LUCENE-1720
 URL: https://issues.apache.org/jira/browse/LUCENE-1720
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Attachments: ActivityTimedOutException.java, 
 ActivityTimeMonitor.java, TestTimeLimitedIndexReader.java, 
 TimeLimitedIndexReader.java


 An alternative to TimeLimitedCollector that has the following advantages:
 1) Any reader activity can be time-limited rather than just single searches 
 e.g. the document retrieve phase.
 2) Times out faster (i.e. runaway queries such as fuzzies detected quickly 
 before last collect stage of query processing)
 Uses new utility timeout class that is independent of IndexReader.
 Initial contribution includes a performance test class but not had time as 
 yet to work up a formal Junit test.
 TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone.




[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class

2009-06-30 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12725741#action_12725741
 ] 

Mark Harwood commented on LUCENE-1720:
--

bq. Ah, so we're assuming most actions don't timeout

Yes, that's it.

bq. (I'll volunteer to do the latter).

Cool. I'll work on tidying up the classes under test as per the earlier comments.

 TimeLimitedIndexReader and associated utility class
 ---

 Key: LUCENE-1720
 URL: https://issues.apache.org/jira/browse/LUCENE-1720
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Attachments: ActivityTimedOutException.java, 
 ActivityTimeMonitor.java, TestTimeLimitedIndexReader.java, 
 TimeLimitedIndexReader.java


 An alternative to TimeLimitedCollector that has the following advantages:
 1) Any reader activity can be time-limited rather than just single searches 
 e.g. the document retrieve phase.
 2) Times out faster (i.e. runaway queries such as fuzzies detected quickly 
 before last collect stage of query processing)
 Uses new utility timeout class that is independent of IndexReader.
 Initial contribution includes a performance test class but not had time as 
 yet to work up a formal Junit test.
 TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone.




[jira] Updated: (LUCENE-1720) TimeLimitedIndexReader and associated utility class

2009-06-30 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-1720:
-

Attachment: (was: ActivityTimeMonitor.java)

 TimeLimitedIndexReader and associated utility class
 ---

 Key: LUCENE-1720
 URL: https://issues.apache.org/jira/browse/LUCENE-1720
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Attachments: ActivityTimedOutException.java, 
 ActivityTimeMonitor.java, TestTimeLimitedIndexReader.java, 
 TimeLimitedIndexReader.java


 An alternative to TimeLimitedCollector that has the following advantages:
 1) Any reader activity can be time-limited rather than just single searches 
 e.g. the document retrieve phase.
 2) Times out faster (i.e. runaway queries such as fuzzies detected quickly 
 before last collect stage of query processing)
 Uses new utility timeout class that is independent of IndexReader.
 Initial contribution includes a performance test class but not had time as 
 yet to work up a formal Junit test.
 TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone.




[jira] Updated: (LUCENE-1720) TimeLimitedIndexReader and associated utility class

2009-06-30 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-1720:
-

Attachment: ActivityTimeMonitor.java

Updated to allow more than one simultaneous timeout error to be handled.

 TimeLimitedIndexReader and associated utility class
 ---

 Key: LUCENE-1720
 URL: https://issues.apache.org/jira/browse/LUCENE-1720
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Attachments: ActivityTimedOutException.java, 
 ActivityTimeMonitor.java, TestTimeLimitedIndexReader.java, 
 TimeLimitedIndexReader.java


 An alternative to TimeLimitedCollector that has the following advantages:
 1) Any reader activity can be time-limited rather than just single searches 
 e.g. the document retrieve phase.
 2) Times out faster (i.e. runaway queries such as fuzzies detected quickly 
 before last collect stage of query processing)
 Uses new utility timeout class that is independent of IndexReader.
 Initial contribution includes a performance test class but not had time as 
 yet to work up a formal Junit test.
 TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone.




[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class

2009-06-29 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12725164#action_12725164
 ] 

Mark Harwood commented on LUCENE-1720:
--

Currently the class hinges on a fast fail mechanism whereby all the many 
calls checking for a timeout are very quickly testing a single volatile 
boolean, anActivityHasTimedOut.
99.99% of calls are expected to fail this test (nothing has timed out) and fail 
quickly - I was reluctant to add any hashset lookup etc in there needed to 
determine failure.

With that as a guiding principle maybe the solution is to change
volatile boolean anActivityHasTimedOut;
into
volatile int numberOfTimedOutThreads;

which would cater for more than one error condition at once. The fast-fail check then 
becomes:
if (numberOfTimedOutThreads > 0)
{
  // slow path: only reached while at least one thread has timed out
  if (timedoutThreads.contains(Thread.currentThread()))
  {
    timedoutThreads.remove(Thread.currentThread());
    numberOfTimedOutThreads = timedoutThreads.size();
    throw new RuntimeException("activity timed out");
  }
}
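
A self-contained version of the counter-based check might look like the sketch below. The class name, the markTimedOut() helper and the shared lock are assumptions added so that the example runs on its own; in particular, updating the set and the counter under one lock is a choice made for this sketch and is not something the comment above specifies.

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrative sketch only; helper names are hypothetical. */
final class CounterFastFailSketch {
    private static final Object LOCK = new Object();
    private static final Set<Thread> timedoutThreads = new HashSet<Thread>();
    // Read without locking on the fast path.
    private static volatile int numberOfTimedOutThreads = 0;

    // Would be called by the monitoring thread when a deadline passes.
    static void markTimedOut(Thread t) {
        synchronized (LOCK) {
            timedoutThreads.add(t);
            numberOfTimedOutThreads = timedoutThreads.size();
        }
    }

    static int timedOutCount() {
        return numberOfTimedOutThreads;
    }

    static void checkForTimeout() {
        if (numberOfTimedOutThreads > 0) { // fast path: a single volatile read
            synchronized (LOCK) {
                // remove() returns true only if the caller was in the set
                if (timedoutThreads.remove(Thread.currentThread())) {
                    numberOfTimedOutThreads = timedoutThreads.size();
                    throw new RuntimeException("activity timed out");
                }
            }
        }
    }

    public static void main(String[] args) {
        checkForTimeout(); // no timeout recorded, returns immediately
        markTimedOut(Thread.currentThread());
        try {
            checkForTimeout();
        } catch (RuntimeException e) {
            System.out.println("stopped: " + e.getMessage());
        }
        System.out.println("timed-out threads remaining: " + timedOutCount());
    }
}
```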




 TimeLimitedIndexReader and associated utility class
 ---

 Key: LUCENE-1720
 URL: https://issues.apache.org/jira/browse/LUCENE-1720
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Attachments: ActivityTimedOutException.java, 
 ActivityTimeMonitor.java, TestTimeLimitedIndexReader.java, 
 TimeLimitedIndexReader.java


 An alternative to TimeLimitedCollector that has the following advantages:
 1) Any reader activity can be time-limited rather than just single searches 
 e.g. the document retrieve phase.
 2) Times out faster (i.e. runaway queries such as fuzzies detected quickly 
 before last collect stage of query processing)
 Uses new utility timeout class that is independent of IndexReader.
 Initial contribution includes a performance test class but not had time as 
 yet to work up a formal Junit test.
 TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone.




[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class

2009-06-29 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12725176#action_12725176
 ] 

Mark Harwood commented on LUCENE-1720:
--

bq. Oh, I did not mean to skip this check.

But the check is on a variable with a yes/no state. We need to cater for more 
than one simultaneous timeout error condition being in play. With only a 
boolean it could be hard to know precisely when to clear it, no?

bq. Mark here wanted to provide a much more generalized way of stopping any 
other activity, not just search

To be fair I think the use case for IndexWriter is weaker. In reader you have 
multiple users all expressing different queries and you want them all to share 
nicely with each other. In index writing it's typically a batch system indexing 
docs and there's no fairness to mediate? Breaking it out into a utility class 
seems like a good idea anyway.

 TimeLimitedIndexReader and associated utility class
 ---

 Key: LUCENE-1720
 URL: https://issues.apache.org/jira/browse/LUCENE-1720
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Attachments: ActivityTimedOutException.java, 
 ActivityTimeMonitor.java, TestTimeLimitedIndexReader.java, 
 TimeLimitedIndexReader.java


 An alternative to TimeLimitedCollector that has the following advantages:
 1) Any reader activity can be time-limited rather than just single searches 
 e.g. the document retrieve phase.
 2) Times out faster (i.e. runaway queries such as fuzzies detected quickly 
 before last collect stage of query processing)
 Uses new utility timeout class that is independent of IndexReader.
 Initial contribution includes a performance test class but not had time as 
 yet to work up a formal Junit test.
 TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone.




[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class

2009-06-29 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12725197#action_12725197
 ] 

Mark Harwood commented on LUCENE-1720:
--

bq. any custom Scorer which does a lot of work, but uses IndexReader for that, 
will be stopped, even if the Scorer's developer did not implement a Timeout 
mechanism. Right?

Correct. I'm not familiar with the proposal to pass around a Timeout object but 
I get the idea and the code here would certainly avoid that overhead.

bq. We can clear it when the time out threads' Set's size() is 0?

Yes, that would work.


 TimeLimitedIndexReader and associated utility class
 ---

 Key: LUCENE-1720
 URL: https://issues.apache.org/jira/browse/LUCENE-1720
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Attachments: ActivityTimedOutException.java, 
 ActivityTimeMonitor.java, TestTimeLimitedIndexReader.java, 
 TimeLimitedIndexReader.java


 An alternative to TimeLimitedCollector that has the following advantages:
 1) Any reader activity can be time-limited rather than just single searches 
 e.g. the document retrieve phase.
 2) Times out faster (i.e. runaway queries such as fuzzies detected quickly 
 before last collect stage of query processing)
 Uses new utility timeout class that is independent of IndexReader.
 Initial contribution includes a performance test class but not had time as 
 yet to work up a formal Junit test.
 TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone.




[jira] Created: (LUCENE-1720) TimeLimitedIndexReader and associated utility class

2009-06-28 Thread Mark Harwood (JIRA)
TimeLimitedIndexReader and associated utility class
---

 Key: LUCENE-1720
 URL: https://issues.apache.org/jira/browse/LUCENE-1720
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor


An alternative to TimeLimitedCollector that has the following advantages:

1) Any reader activity can be time-limited rather than just single searches 
e.g. the document retrieve phase.
2) Times out faster (i.e. runaway queries such as fuzzies detected quickly 
before last collect stage of query processing)

Uses new utility timeout class that is independent of IndexReader.

Initial contribution includes a performance test class, but I have not yet had 
time to work up a formal JUnit test.
TimeLimitedIndexReader is coded against JDK 1.5, but this can easily be undone.







[jira] Updated: (LUCENE-1720) TimeLimitedIndexReader and associated utility class

2009-06-28 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-1720:
-

Attachment: ActivityTimedOutException.java

 TimeLimitedIndexReader and associated utility class
 ---

 Key: LUCENE-1720
 URL: https://issues.apache.org/jira/browse/LUCENE-1720
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Attachments: ActivityTimedOutException.java, 
 ActivityTimeMonitor.java, TestTimeLimitedIndexReader.java, 
 TimeLimitedIndexReader.java


 An alternative to TimeLimitedCollector that has the following advantages:
 1) Any reader activity can be time-limited rather than just single searches 
 e.g. the document retrieve phase.
 2) Times out faster (i.e. runaway queries such as fuzzies detected quickly 
 before last collect stage of query processing)
 Uses new utility timeout class that is independent of IndexReader.
 Initial contribution includes a performance test class but not had time as 
 yet to work up a formal Junit test.
 TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone.




[jira] Updated: (LUCENE-1720) TimeLimitedIndexReader and associated utility class

2009-06-28 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-1720:
-

Attachment: TimeLimitedIndexReader.java
TestTimeLimitedIndexReader.java
ActivityTimeMonitor.java

 TimeLimitedIndexReader and associated utility class
 ---

 Key: LUCENE-1720
 URL: https://issues.apache.org/jira/browse/LUCENE-1720
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Attachments: ActivityTimedOutException.java, 
 ActivityTimeMonitor.java, TestTimeLimitedIndexReader.java, 
 TimeLimitedIndexReader.java


 An alternative to TimeLimitedCollector that has the following advantages:
 1) Any reader activity can be time-limited rather than just single searches 
 e.g. the document retrieve phase.
 2) Times out faster (i.e. runaway queries such as fuzzies detected quickly 
 before last collect stage of query processing)
 Uses new utility timeout class that is independent of IndexReader.
 Initial contribution includes a performance test class but not had time as 
 yet to work up a formal Junit test.
 TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone.




[jira] Updated: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2009-06-24 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-1486:
-

Attachment: LUCENE-1486.patch

Added fix for ConstantScoreQuery changes

 Wildcards, ORs etc inside Phrase queries
 

 Key: LUCENE-1486
 URL: https://issues.apache.org/jira/browse/LUCENE-1486
 Project: Lucene - Java
  Issue Type: Improvement
  Components: QueryParser
Affects Versions: 2.4
Reporter: Mark Harwood
Assignee: Mark Miller
Priority: Minor
 Fix For: 2.9

 Attachments: ComplexPhraseQueryParser.java, LUCENE-1486.patch, 
 LUCENE-1486.patch, TestComplexPhraseQuery.java


 An extension to the default QueryParser that overrides the parsing of 
 PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.
 The implementation feels a little hacky - this is arguably better handled in 
 QueryParser itself. This works as a proof of concept  for much of the query 
 parser syntax. Examples from the JUnit test include:
   checkMatches("\"j*   smyth~\"", "1,2"); // wildcards and fuzzies are OK in phrases
   checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic works
   checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic works
   
   checkBadQuery("\"jo*  id:1 smith\""); // mixing fields in a phrase is bad
   checkBadQuery("\"jo* \"smith\" \""); // phrases inside phrases are bad
   checkBadQuery("\"jo* [sma TO smZ]\" \""); // range queries inside phrases are not supported
 Code plus JUnit test to follow...




[jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2009-06-24 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12723742#action_12723742
 ] 

Mark Harwood commented on LUCENE-1486:
--

The fix was relatively straightforward from what I could see. Just temporarily 
unset the QueryParser's ConstantScoreRewrite mode when performing the pass that 
is just evaluating query elements inside phrase queries. These clauses need to 
resolve to traditional BooleanQuery-full-of-termQueries in order that they can 
be inspected and rewritten as Span equivalents for complex phrases.
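
The save, unset and restore pattern described here can be sketched as below. ParserStub is a hypothetical stand-in invented for this example and is not Lucene's QueryParser; the strings it returns only mark which rewrite mode was in effect when a clause was parsed.

```java
/** Illustrative sketch only; ParserStub is not Lucene's QueryParser API. */
final class RewriteModeToggleSketch {

    /** Hypothetical parser with a constant-score rewrite switch. */
    static final class ParserStub {
        private boolean constantScoreRewrite = true;

        boolean isConstantScoreRewrite() {
            return constantScoreRewrite;
        }

        void setConstantScoreRewrite(boolean on) {
            constantScoreRewrite = on;
        }

        // Stand-in for parsing: records which mode produced the clause.
        String parseClause(String clause) {
            return (constantScoreRewrite ? "ConstantScore(" : "BooleanOfTerms(") + clause + ")";
        }
    }

    // Parse a clause found inside a phrase with constant-score rewriting
    // switched off, then restore whatever setting the caller had.
    static String parseInsidePhrase(ParserStub parser, String clause) {
        boolean saved = parser.isConstantScoreRewrite();
        parser.setConstantScoreRewrite(false);
        try {
            return parser.parseClause(clause);
        } finally {
            parser.setConstantScoreRewrite(saved); // restored even on failure
        }
    }

    public static void main(String[] args) {
        ParserStub parser = new ParserStub();
        System.out.println(parseInsidePhrase(parser, "jo*"));
        System.out.println(parser.parseClause("jo*"));
    }
}
```

Restoring the flag in a finally block keeps the parser's configuration intact for the rest of the query even if parsing the inner clause throws.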

Should do the job.

Cheers
Mark
(Been far too busy with other things and missing getting my hands dirty here 
with Lucene!)

 Wildcards, ORs etc inside Phrase queries
 

 Key: LUCENE-1486
 URL: https://issues.apache.org/jira/browse/LUCENE-1486
 Project: Lucene - Java
  Issue Type: Improvement
  Components: QueryParser
Affects Versions: 2.4
Reporter: Mark Harwood
Assignee: Mark Miller
Priority: Minor
 Fix For: 2.9

 Attachments: ComplexPhraseQueryParser.java, LUCENE-1486.patch, 
 LUCENE-1486.patch, TestComplexPhraseQuery.java


 An extension to the default QueryParser that overrides the parsing of 
 PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.
 The implementation feels a little hacky - this is arguably better handled in 
 QueryParser itself. This works as a proof of concept  for much of the query 
 parser syntax. Examples from the JUnit test include:
   checkMatches("\"j*   smyth~\"", "1,2"); // wildcards and fuzzies are OK in phrases
   checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic works
   checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic works
   
   checkBadQuery("\"jo*  id:1 smith\""); // mixing fields in a phrase is bad
   checkBadQuery("\"jo* \"smith\" \""); // phrases inside phrases are bad
   checkBadQuery("\"jo* [sma TO smZ]\" \""); // range queries inside phrases are not supported
 Code plus JUnit test to follow...




[jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2009-06-13 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719115#action_12719115
 ] 

Mark Harwood commented on LUCENE-1486:
--

The primary reason (and perhaps not a particularly good one) was I didn't want 
to wade around in the Javacc syntax of the .jj file that generates the 
QueryParser and the required extensions could be made in a subclass.

Also there is invariably a performance hit for supporting things like wildcards 
in phrase queries. Rather than adding another off-by-default flag to the 
main parser, plus conditional logic to test whether wildcards etc. are allowed 
in phrases, the subclass can be seen as a specialised extension to be 
used by those who understand the trade-offs between functionality and 
performance.

I can sympathise with the purist approach of having all parser syntax defined 
in Javacc though.

 Wildcards, ORs etc inside Phrase queries
 

 Key: LUCENE-1486
 URL: https://issues.apache.org/jira/browse/LUCENE-1486
 Project: Lucene - Java
  Issue Type: Improvement
  Components: QueryParser
Affects Versions: 2.4
Reporter: Mark Harwood
Priority: Minor
 Fix For: 2.9

 Attachments: ComplexPhraseQueryParser.java, 
 TestComplexPhraseQuery.java


 An extension to the default QueryParser that overrides the parsing of 
 PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.
 The implementation feels a little hacky - this is arguably better handled in 
 QueryParser itself. This works as a proof of concept  for much of the query 
 parser syntax. Examples from the JUnit test include:
   checkMatches("\"j*   smyth~\"", "1,2"); // wildcards and fuzzies are OK in phrases
   checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic works
   checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic works
   
   checkBadQuery("\"jo*  id:1 smith\""); // mixing fields in a phrase is bad
   checkBadQuery("\"jo* \"smith\" \""); // phrases inside phrases are bad
   checkBadQuery("\"jo* [sma TO smZ]\" \""); // range queries inside phrases are not supported
 Code plus JUnit test to follow...




[jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2009-06-11 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12718573#action_12718573
 ] 

Mark Harwood commented on LUCENE-1486:
--

Perhaps "hacky" was too strong a word. I think it's a reasonable approach to 
handling the complexity involved in this logic. 

A colleague of mine has this running in production on a big installation with 
lots of users.

 Wildcards, ORs etc inside Phrase queries
 

 Key: LUCENE-1486
 URL: https://issues.apache.org/jira/browse/LUCENE-1486
 Project: Lucene - Java
  Issue Type: Improvement
  Components: QueryParser
Affects Versions: 2.4
Reporter: Mark Harwood
Priority: Minor
 Fix For: 2.9

 Attachments: ComplexPhraseQueryParser.java, 
 TestComplexPhraseQuery.java


 An extension to the default QueryParser that overrides the parsing of 
 PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.
 The implementation feels a little hacky - this is arguably better handled in 
 QueryParser itself. This works as a proof of concept  for much of the query 
 parser syntax. Examples from the JUnit test include:
   checkMatches("\"j*   smyth~\"", "1,2"); // wildcards and fuzzies are OK in phrases
   checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic works
   checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic works
   
   checkBadQuery("\"jo*  id:1 smith\""); // mixing fields in a phrase is bad
   checkBadQuery("\"jo* \"smith\" \""); // phrases inside phrases are bad
   checkBadQuery("\"jo* [sma TO smZ]\" \""); // range queries inside phrases are not supported
 Code plus JUnit test to follow...




[jira] Commented: (LUCENE-1621) deprecate term and getTerm in MultiTermQuery

2009-04-28 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703733#action_12703733
 ] 

Mark Harwood commented on LUCENE-1621:
--

While we're poking around in this area I'd like to point out the long-standing 
open issue in LUCENE-329.

Matching "Smyth" over "Smith" when doing a search for "Smith~" is just plain 
broken, but this is what I see all the time with FuzzyQuery and its default 
approach to IDF. I think we need to take the sort of logic in contrib's 
FuzzyLikeThisQuery to address this. 
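
A small worked example shows why a rare variant tends to win. It assumes the classic idf formula, 1 + ln(numDocs / (docFreq + 1)), which is how I understand the default similarity of this era to compute it; the document frequencies are invented for illustration. All else being equal, the rarer spelling gets the larger idf and therefore the larger score contribution.

```java
/** Illustrative sketch only; the document frequencies are made up. */
final class FuzzyIdfSketch {

    // Classic inverse document frequency: rarer terms get larger values.
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    public static void main(String[] args) {
        int numDocs = 1_000_000;
        int smithDocFreq = 10_000; // hypothetical: common spelling
        int smythDocFreq = 100;    // hypothetical: rare spelling

        double smithIdf = idf(smithDocFreq, numDocs);
        double smythIdf = idf(smythDocFreq, numDocs);

        System.out.println("idf(smith) = " + smithIdf);
        System.out.println("idf(smyth) = " + smythIdf);
        System.out.println("rare variant has the higher idf: " + (smythIdf > smithIdf));
    }
}
```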

 deprecate term and getTerm in MultiTermQuery
 

 Key: LUCENE-1621
 URL: https://issues.apache.org/jira/browse/LUCENE-1621
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Mark Miller
Assignee: Mark Miller
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1621.patch


 This means moving getTerm and term up to sub classes as appropriate and 
 reimplementing equals, hashcode as appropriate in sub classes.




[jira] Resolved: (LUCENE-1500) Highlighter throws StringIndexOutOfBoundsException

2009-03-25 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood resolved LUCENE-1500.
--

Resolution: Fixed
  Assignee: Mark Harwood  (was: Mark Harwood)

Committed in revision: 758460 

 Highlighter throws StringIndexOutOfBoundsException
 --

 Key: LUCENE-1500
 URL: https://issues.apache.org/jira/browse/LUCENE-1500
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/highlighter
Affects Versions: 2.4
 Environment: Found this running the example code in Solr (latest 
 version).
Reporter: David Bowen
Assignee: Mark Harwood
 Fix For: 2.9

 Attachments: Lucene-1500-NewException.patch, 
 Lucene-1500-NewException.patch, LUCENE-1500.patch, patch.txt


 Using the canonical Solr example (ant run-example) I added this document 
 (using exampledocs/post.sh):
 <add><doc>
   <field name="id">Test for Highlighting StringIndexOutOfBoundsExcdption</field>
   <field name="name">Some Name</field>
   <field name="manu">Acme, Inc.</field>
   <field name="features">Description of the features, mentioning various things</field>
   <field name="features">Features also is multivalued</field>
   <field name="popularity">6</field>
   <field name="inStock">true</field>
 </doc></add>
 and then the URL 
 http://localhost:8983/solr/select/?q=features&hl=true&hl.fl=features caused 
 the exception.
 I have a patch.  I don't know if it is completely correct, but it avoids this 
 exception.




[jira] Updated: (LUCENE-1500) Highlighter throws StringIndexOutOfBoundsException

2009-03-13 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-1500:
-

Attachment: Lucene-1500-NewException.patch

With updated Apache license header.

I'll commit soon if no objections

 Highlighter throws StringIndexOutOfBoundsException
 --

 Key: LUCENE-1500
 URL: https://issues.apache.org/jira/browse/LUCENE-1500
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/highlighter
Affects Versions: 2.4
 Environment: Found this running the example code in Solr (latest 
 version).
Reporter: David Bowen
Assignee: Mark Harwood
 Fix For: 2.9

 Attachments: Lucene-1500-NewException.patch, 
 Lucene-1500-NewException.patch, LUCENE-1500.patch, patch.txt


 Using the canonical Solr example (ant run-example) I added this document 
 (using exampledocs/post.sh):
 <add><doc>
   <field name="id">Test for Highlighting StringIndexOutOfBoundsExcdption</field>
   <field name="name">Some Name</field>
   <field name="manu">Acme, Inc.</field>
   <field name="features">Description of the features, mentioning various things</field>
   <field name="features">Features also is multivalued</field>
   <field name="popularity">6</field>
   <field name="inStock">true</field>
 </doc></add>
 and then the URL 
 http://localhost:8983/solr/select/?q=features&hl=true&hl.fl=features caused 
 the exception.
 I have a patch.  I don't know if it is completely correct, but it avoids this 
 exception.




[jira] Commented: (LUCENE-1559) Highlighting not working in some instances even though indexsearcher returns result.

2009-03-12 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12681287#action_12681287
 ] 

Mark Harwood commented on LUCENE-1559:
--

Sorry to be picky, but can you submit a self-contained test with no external 
dependencies other than Lucene+Highlighter+JUnit?

I don't want POI versions to be a factor here.

Cheers
Mark

 Highlighting not working in some instances even though indexsearcher returns 
 result.
 

 Key: LUCENE-1559
 URL: https://issues.apache.org/jira/browse/LUCENE-1559
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.4
 Environment: Mac OS 1.5
 Eclipse 3.4
Reporter: Amin Mohammed-Coleman
 Attachments: AJiA CH 02.doc, HighLightingSummaryTest.java


 In some instances highlighting does not return a result.  However, when you 
 use a different term for the same document you get results.
 Please see the attached test case and template file.




[jira] Commented: (LUCENE-1559) Highlighting not working in some instances even though indexsearcher returns result.

2009-03-12 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12681323#action_12681323
 ] 

Mark Harwood commented on LUCENE-1559:
--

Your code still imports POI and is now importing a .DOC file without parsing, 
producing garbage.

You'll need to supply an example JUnit test which illustrates this problem with 
plain text before we can look at it.

You should be able to turn the .doc into text at your end using POI and then 
supply the file.

Are you sure there isn't a problem with POI failing to parse the file 
correctly? 


 Highlighting not working in some instances even though indexsearcher returns 
 result.
 

 Key: LUCENE-1559
 URL: https://issues.apache.org/jira/browse/LUCENE-1559
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.4
 Environment: Mac OS 1.5
 Eclipse 3.4
Reporter: Amin Mohammed-Coleman
 Attachments: AJiA CH 02.doc, HighLightingSummaryTest(2).java, 
 HighLightingSummaryTest.java


 In some instances highlighting does not return a result.  However, when you 
 use a different term for the same document you get results.
 Please see the attached test case and template file.




[jira] Commented: (LUCENE-1559) Highlighting not working in some instances even though indexsearcher returns result.

2009-03-12 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12681336#action_12681336
 ] 

Mark Harwood commented on LUCENE-1559:
--

Can I close this then as it appears to be an issue with your parser, not Lucene?

 Highlighting not working in some instances even though indexsearcher returns 
 result.
 

 Key: LUCENE-1559
 URL: https://issues.apache.org/jira/browse/LUCENE-1559
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.4
 Environment: Mac OS 1.5
 Eclipse 3.4
Reporter: Amin Mohammed-Coleman
 Attachments: AJiA CH 02.doc, HighLightingSummaryTest(2).java, 
 HighLightingSummaryTest.java


 In some instances highlighting does not return a result.  However, when you 
 use a different term for the same document you get results.
 Please see the attached test case and template file.




[jira] Commented: (LUCENE-1559) Highlighting not working in some instances even though indexsearcher returns result.

2009-03-12 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12681344#action_12681344
 ] 

Mark Harwood commented on LUCENE-1559:
--

bq. Sorry...I don't know what I should do at this stage

Give us a JUnit example of your problem code when working with plain text (not 
PDF, Word or .doc) that clearly demonstrates where Lucene fails to index/search 
or highlight this text correctly.

 Highlighting not working in some instances even though indexsearcher returns 
 result.
 

 Key: LUCENE-1559
 URL: https://issues.apache.org/jira/browse/LUCENE-1559
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.4
 Environment: Mac OS 1.5
 Eclipse 3.4
Reporter: Amin Mohammed-Coleman
 Attachments: AJiA CH 02.doc, HighLightingSummaryTest(2).java, 
 HighLightingSummaryTest.java


 In some instances highlighting does not return a result.  However, when you 
 use a different term for the same document you get results.
 Please see the attached test case and template file.




[jira] Commented: (LUCENE-1559) Highlighting not working in some instances even though indexsearcher returns result.

2009-03-12 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12681466#action_12681466
 ] 

Mark Harwood commented on LUCENE-1559:
--

I ran a quick test and I don't think I could see "document" in the 
Token.termText() of any tokens in the TokenStream you provide to the 
Highlighter.

It's late and I need to be elsewhere, but if you have time to pursue this, check 
that the above statement is true.
If so, check that the body text retrieved from Document.get("body") in the search 
results is the same as the String you stored at index time (just in case the 
act of storing/retrieving has altered the text somehow).

Will look into this more later

 Highlighting not working in some instances even though indexsearcher returns 
 result.
 

 Key: LUCENE-1559
 URL: https://issues.apache.org/jira/browse/LUCENE-1559
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.4
 Environment: Mac OS 1.5
 Eclipse 3.4
Reporter: Amin Mohammed-Coleman
 Attachments: AJiA CH 02.doc, fileToSearch.txt, 
 HighLightingSummaryTest(2).java, HighLightingSummaryTest.java, 
 HighLightingSummaryTestV3.java


 In some instances highlighting does not return a result.  However, when you 
 use a different term for the same document you get results.
 Please see the attached test case and template file.




[jira] Commented: (LUCENE-1559) Highlighting not working in some instances even though indexsearcher returns result.

2009-03-12 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12681507#action_12681507
 ] 

Mark Harwood commented on LUCENE-1559:
--

Ah. Try setting this:

highlighter.setMaxDocCharsToAnalyze(Integer.MAX_VALUE);
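
A simplified model of why raising that limit helps, assuming the highlighter stops consuming tokens once a token's start offset reaches the configured maximum. MaxCharsSketch and firstHit are hypothetical names, and the whitespace tokenisation is only a stand-in for a real Analyzer; this is not the Highlighter's actual code.

```java
// Illustration only: a term that first occurs beyond the analysis limit is
// never seen, so nothing is highlighted even though the search matched.
final class MaxCharsSketch {

    /**
     * Returns the start offset of the first space-separated token equal to
     * term, looking only at tokens that start before maxCharsToAnalyze.
     * Returns -1 if no such token is seen.
     */
    static int firstHit(String text, String term, int maxCharsToAnalyze) {
        int pos = 0;
        for (String token : text.split(" ")) {
            if (pos >= maxCharsToAnalyze) {
                break; // analysis limit reached: remaining text is ignored
            }
            if (token.equals(term)) {
                return pos;
            }
            pos += token.length() + 1; // token plus the separating space
        }
        return -1;
    }

    public static void main(String[] args) {
        String text = "filler filler filler document";
        System.out.println(firstHit(text, "document", 10));
        System.out.println(firstHit(text, "document", Integer.MAX_VALUE));
    }
}
```

With a small limit the late occurrence is missed; with Integer.MAX_VALUE the whole text is scanned, at the cost of analysing all of it.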


 Highlighting not working in some instances even though indexsearcher returns 
 result.
 

 Key: LUCENE-1559
 URL: https://issues.apache.org/jira/browse/LUCENE-1559
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.4
 Environment: Mac OS 1.5
 Eclipse 3.4
Reporter: Amin Mohammed-Coleman
 Attachments: AJiA CH 02.doc, fileToSearch.txt, 
 HighLightingSummaryTest(2).java, HighLightingSummaryTest.java, 
 HighLightingSummaryTestV3.java


 In some instances highlighting does not return a result.  However, when you 
 use a different term for the same document you get results.
 Please see the attached test case and template file.




[jira] Closed: (LUCENE-1559) Highlighting not working in some instances even though indexsearcher returns result.

2009-03-12 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood closed LUCENE-1559.


Resolution: Invalid

Working as designed; this is a feature intended to prevent too-costly analysis.

 Highlighting not working in some instances even though indexsearcher returns 
 result.
 

 Key: LUCENE-1559
 URL: https://issues.apache.org/jira/browse/LUCENE-1559
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.4
 Environment: Mac OS 1.5
 Eclipse 3.4
Reporter: Amin Mohammed-Coleman
 Attachments: AJiA CH 02.doc, fileToSearch.txt, 
 HighLightingSummaryTest(2).java, HighLightingSummaryTest.java, 
 HighLightingSummaryTestV3.java


 In some instances highlighting does not return a result.  However, when you 
 use a different term for the same document you get results.
 Please see the attached test case and template file.




[jira] Commented: (LUCENE-1522) another highlighter

2009-03-12 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12681531#action_12681531
 ] 

Mark Harwood commented on LUCENE-1522:
--

I'm guessing that's not an issue given it uses stored TermVectors rather than 
re-analyzing?

At some stage I hope to take a closer look at this contribution.  I'd be 
interested to see if all the Highlighter1 JUnit tests could be adapted to work 
with Highlighter2, and to get some comparative benchmarks.

 another highlighter
 ---

 Key: LUCENE-1522
 URL: https://issues.apache.org/jira/browse/LUCENE-1522
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/highlighter
Reporter: Koji Sekiguchi
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: colored-tag-sample.png, LUCENE-1522.patch, 
 LUCENE-1522.patch


 I've written this highlighter for my project to support bi-gram token stream 
 (general token stream (e.g. WhitespaceTokenizer) also supported. see test 
 code in patch). The idea was inherited from my previous project with my 
 colleague and LUCENE-644. This approach needs highlight fields to be 
 TermVector.WITH_POSITIONS_OFFSETS, but is fast and can support N-grams. This 
 depends on LUCENE-1448 to get refined term offsets.
 usage:
 {code:java}
 TopDocs docs = searcher.search( query, 10 );
 Highlighter h = new Highlighter();
 FieldQuery fq = h.getFieldQuery( query );
 for( ScoreDoc scoreDoc : docs.scoreDocs ){
   // fieldName="content", fragCharSize=100, numFragments=3
   String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc, "content", 100, 3 );
   if( fragments != null ){
     for( String fragment : fragments )
       System.out.println( fragment );
   }
 }
 {code}
 features:
 - fast for large docs
 - supports not only whitespace-based token stream, but also fixed size 
 N-gram (e.g. (2,2), not (1,3)) (can solve LUCENE-1489)
 - supports PhraseQuery, phrase-unit highlighting with slops
 {noformat}
 q="w1 w2"
 <b>w1 w2</b>
 ---
 q="w1 w2"~1
 <b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
 {noformat}
 - highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
 - easy to apply patch due to independent package (contrib/highlighter2)
 - uses Java 1.5
 - takes query boost into account when scoring fragments (currently doesn't 
 use idf, but it should be possible)
 - pluggable FragListBuilder
 - pluggable FragmentsBuilder
 to do:
 - term positions can be unnecessary when phraseHighlight==false
 - collect performance numbers




[jira] Commented: (LUCENE-1500) Highlighter throws StringIndexOutOfBoundsException

2009-03-09 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12680125#action_12680125
 ] 

Mark Harwood commented on LUCENE-1500:
--

Will submit a new patch tonight.

 Highlighter throws StringIndexOutOfBoundsException
 --

 Key: LUCENE-1500
 URL: https://issues.apache.org/jira/browse/LUCENE-1500
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/highlighter
Affects Versions: 2.4
 Environment: Found this running the example code in Solr (latest 
 version).
Reporter: David Bowen
Assignee: Mark Harwood
 Fix For: 2.9

 Attachments: Lucene-1500-NewException.patch, LUCENE-1500.patch, 
 patch.txt


 Using the canonical Solr example (ant run-example) I added this document 
 (using exampledocs/post.sh):
 <add><doc>
   <field name="id">Test for Highlighting StringIndexOutOfBoundsExcdption</field>
   <field name="name">Some Name</field>
   <field name="manu">Acme, Inc.</field>
   <field name="features">Description of the features, mentioning various things</field>
   <field name="features">Features also is multivalued</field>
   <field name="popularity">6</field>
   <field name="inStock">true</field>
 </doc></add>
 and then the URL 
 http://localhost:8983/solr/select/?q=features&hl=true&hl.fl=features caused 
 the exception.
 I have a patch.  I don't know if it is completely correct, but it avoids this 
 exception.




[jira] Updated: (LUCENE-1500) Highlighter throws StringIndexOutOfBoundsException

2009-03-09 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-1500:
-

Attachment: (was: Lucene-1500-NewException.patch)

 Highlighter throws StringIndexOutOfBoundsException
 --

 Key: LUCENE-1500
 URL: https://issues.apache.org/jira/browse/LUCENE-1500
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/highlighter
Affects Versions: 2.4
 Environment: Found this running the example code in Solr (latest 
 version).
Reporter: David Bowen
Assignee: Mark Harwood
 Fix For: 2.9

 Attachments: LUCENE-1500.patch, patch.txt


 Using the canonical Solr example (ant run-example) I added this document 
 (using exampledocs/post.sh):
 <add><doc>
   <field name="id">Test for Highlighting StringIndexOutOfBoundsExcdption</field>
   <field name="name">Some Name</field>
   <field name="manu">Acme, Inc.</field>
   <field name="features">Description of the features, mentioning various things</field>
   <field name="features">Features also is multivalued</field>
   <field name="popularity">6</field>
   <field name="inStock">true</field>
 </doc></add>
 and then the URL 
 http://localhost:8983/solr/select/?q=features&hl=true&hl.fl=features caused 
 the exception.
 I have a patch.  I don't know if it is completely correct, but it avoids this 
 exception.




[jira] Updated: (LUCENE-1500) Highlighter throws StringIndexOutOfBoundsException

2009-03-09 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-1500:
-

Attachment: Lucene-1500-NewException.patch

Added support for testing whether either the Token start or end offset > text.length.

Added javadoc comments for new exception

 Highlighter throws StringIndexOutOfBoundsException
 --

 Key: LUCENE-1500
 URL: https://issues.apache.org/jira/browse/LUCENE-1500
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/highlighter
Affects Versions: 2.4
 Environment: Found this running the example code in Solr (latest 
 version).
Reporter: David Bowen
Assignee: Mark Harwood
 Fix For: 2.9

 Attachments: Lucene-1500-NewException.patch, LUCENE-1500.patch, 
 patch.txt


 Using the canonical Solr example (ant run-example) I added this document 
 (using exampledocs/post.sh):
 <add><doc>
   <field name="id">Test for Highlighting StringIndexOutOfBoundsExcdption</field>
   <field name="name">Some Name</field>
   <field name="manu">Acme, Inc.</field>
   <field name="features">Description of the features, mentioning various things</field>
   <field name="features">Features also is multivalued</field>
   <field name="popularity">6</field>
   <field name="inStock">true</field>
 </doc></add>
 and then the URL 
 http://localhost:8983/solr/select/?q=features&hl=true&hl.fl=features caused 
 the exception.
 I have a patch.  I don't know if it is completely correct, but it avoids this 
 exception.




[jira] Commented: (LUCENE-1500) Highlighter throws StringIndexOutOfBoundsException

2009-03-02 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12677956#action_12677956
 ] 

Mark Harwood commented on LUCENE-1500:
--

My thoughts were that this exception solely traps inconsistencies with Tokens 
in relation to a particular provided chunk of text.

I think internal inconsistencies within a Token (e.g. endOffset < startOffset) 
should ideally be handled by Token (throwing something like an 
IllegalArgumentException in its constructor).
I guess an open question there is: can startOffset = endOffset in a Token? Either 
way, String.substring simply returns an empty string so I think that's probably 
OK in highlighter.
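
A minimal sketch of the division of responsibility described above, assuming the highlighter only checks a token's offsets against the supplied text. OffsetCheckSketch, tokenText and BadTokenOffsetsException are hypothetical names for illustration and are not the classes in the patch.

```java
// Illustration only: validate token offsets against one particular text,
// and show that equal start and end offsets are harmless for substring.
final class OffsetCheckSketch {

    /** Hypothetical stand-in for the new exception discussed in this issue. */
    static final class BadTokenOffsetsException extends Exception {
        BadTokenOffsetsException(String message) {
            super(message);
        }
    }

    /**
     * Returns the text covered by the token, or throws if either offset lies
     * beyond the end of the provided text. An end offset smaller than the
     * start offset is deliberately not checked here; the comment above
     * argues that this belongs in Token itself.
     */
    static String tokenText(String text, int startOffset, int endOffset)
            throws BadTokenOffsetsException {
        if (startOffset > text.length() || endOffset > text.length()) {
            throw new BadTokenOffsetsException("Token offsets " + startOffset
                    + "-" + endOffset + " exceed text length " + text.length());
        }
        return text.substring(startOffset, endOffset);
    }

    public static void main(String[] args) throws Exception {
        // startOffset == endOffset yields an empty string, not an error.
        System.out.println("[" + tokenText("hello", 2, 2) + "]");
    }
}
```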


 Highlighter throws StringIndexOutOfBoundsException
 --

 Key: LUCENE-1500
 URL: https://issues.apache.org/jira/browse/LUCENE-1500
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/highlighter
Affects Versions: 2.4
 Environment: Found this running the example code in Solr (latest 
 version).
Reporter: David Bowen
Assignee: Michael McCandless
 Fix For: 2.4.1, 2.9

 Attachments: Lucene-1500-NewException.patch, LUCENE-1500.patch, 
 patch.txt


 Using the canonical Solr example (ant run-example) I added this document 
 (using exampledocs/post.sh):
 <add><doc>
   <field name="id">Test for Highlighting StringIndexOutOfBoundsExcdption</field>
   <field name="name">Some Name</field>
   <field name="manu">Acme, Inc.</field>
   <field name="features">Description of the features, mentioning various things</field>
   <field name="features">Features also is multivalued</field>
   <field name="popularity">6</field>
   <field name="inStock">true</field>
 </doc></add>
 and then the URL 
 http://localhost:8983/solr/select/?q=features&hl=true&hl.fl=features caused 
 the exception.
 I have a patch.  I don't know if it is completely correct, but it avoids this 
 exception.




[jira] Commented: (LUCENE-1500) Highlighter throws StringIndexOutOfBoundsException

2009-03-02 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12677968#action_12677968
 ] 

Mark Harwood commented on LUCENE-1500:
--

Isn't your example predicated on being given an invalid Token with end<start?

What did you think of my suggestion to fix this problem at its source - i.e. 
Token should never be in a state with end<start in the first place?

Achieving this goal is complicated by the fact that offsets are not only set in 
the constructor - there are independent set methods for start and end offsets 
which can be called in any order.
One solution would be to deprecate Token.setStartOffset and Token.setEndOffset and 
replace them with a Token.setExtent(int startOffset, int endOffset) with the 
appropriate checks.
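
A minimal sketch of the setExtent idea, assuming a simplified stand-in class. SketchToken is hypothetical and is not Lucene's Token; only the setExtent(int startOffset, int endOffset) signature comes from the suggestion above.

```java
// Illustration only: both offsets are set together so the invariant
// 0 <= startOffset <= endOffset can be checked in one place.
final class SketchToken {
    private int startOffset;
    private int endOffset;

    SketchToken(int startOffset, int endOffset) {
        setExtent(startOffset, endOffset);
    }

    /** Sets both offsets atomically, rejecting an end before the start. */
    void setExtent(int startOffset, int endOffset) {
        if (startOffset < 0 || endOffset < startOffset) {
            throw new IllegalArgumentException("Invalid extent: start="
                    + startOffset + ", end=" + endOffset);
        }
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }

    int startOffset() {
        return startOffset;
    }

    int endOffset() {
        return endOffset;
    }
}
```

With separate setters a token can pass through an invalid intermediate state (for example, a new start offset set before the matching end offset), which is why the check only becomes enforceable once the two values are assigned in a single call.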





 Highlighter throws StringIndexOutOfBoundsException
 --

 Key: LUCENE-1500
 URL: https://issues.apache.org/jira/browse/LUCENE-1500
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/highlighter
Affects Versions: 2.4
 Environment: Found this running the example code in Solr (latest 
 version).
Reporter: David Bowen
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: Lucene-1500-NewException.patch, LUCENE-1500.patch, 
 patch.txt


 Using the canonical Solr example (ant run-example) I added this document 
 (using exampledocs/post.sh):
 <add><doc>
   <field name="id">Test for Highlighting StringIndexOutOfBoundsExcdption</field>
   <field name="name">Some Name</field>
   <field name="manu">Acme, Inc.</field>
   <field name="features">Description of the features, mentioning various things</field>
   <field name="features">Features also is multivalued</field>
   <field name="popularity">6</field>
   <field name="inStock">true</field>
 </doc></add>
 and then the URL 
 http://localhost:8983/solr/select/?q=features&hl=true&hl.fl=features caused 
 the exception.
 I have a patch.  I don't know if it is completely correct, but it avoids this 
 exception.




[jira] Commented: (LUCENE-1500) Highlighter throws StringIndexOutOfBoundsException

2009-03-02 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12677991#action_12677991
 ] 

Mark Harwood commented on LUCENE-1500:
--

I struggle to see why endOffset<startOffset should ever be acceptable but also 
share your concerns about the disruption of changing the Token API to enforce 
this.

So, I'll add code to the patch to check for bad startOffsets too. If we had 
more points of use for Token offsets outside of highlighting I'd be more 
concerned, but things being the way they are this seems like the most pragmatic 
option.

 Highlighter throws StringIndexOutOfBoundsException
 --

 Key: LUCENE-1500
 URL: https://issues.apache.org/jira/browse/LUCENE-1500
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/highlighter
Affects Versions: 2.4
 Environment: Found this running the example code in Solr (latest 
 version).
Reporter: David Bowen
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: Lucene-1500-NewException.patch, LUCENE-1500.patch, 
 patch.txt


 Using the canonical Solr example (ant run-example) I added this document 
 (using exampledocs/post.sh):
 <add><doc>
   <field name="id">Test for Highlighting StringIndexOutOfBoundsExcdption</field>
   <field name="name">Some Name</field>
   <field name="manu">Acme, Inc.</field>
   <field name="features">Description of the features, mentioning various things</field>
   <field name="features">Features also is multivalued</field>
   <field name="popularity">6</field>
   <field name="inStock">true</field>
 </doc></add>
 and then the URL 
 http://localhost:8983/solr/select/?q=features&hl=true&hl.fl=features caused 
 the exception.
 I have a patch.  I don't know if it is completely correct, but it avoids this 
 exception.




[jira] Updated: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2009-03-02 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-1486:
-

Attachment: (was: ComplexPhraseQueryParser.java)

 Wildcards, ORs etc inside Phrase queries
 

 Key: LUCENE-1486
 URL: https://issues.apache.org/jira/browse/LUCENE-1486
 Project: Lucene - Java
  Issue Type: Improvement
  Components: QueryParser
Affects Versions: 2.4
Reporter: Mark Harwood
Priority: Minor
 Fix For: 2.9

 Attachments: ComplexPhraseQueryParser.java


 An extension to the default QueryParser that overrides the parsing of 
 PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.
 The implementation feels a little hacky - this is arguably better handled in 
 QueryParser itself. This works as a proof of concept  for much of the query 
 parser syntax. Examples from the Junit test include:
   checkMatches("\"j*   smyth~\"", "1,2"); //wildcards and fuzzies are OK in phrases
   checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic works
   checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic works.
   
   checkBadQuery("\"jo*  id:1 smith\""); //mixing fields in a phrase is bad
   checkBadQuery("\"jo* \"smith\" \""); //phrases inside phrases is bad
   checkBadQuery("\"jo* [sma TO smZ]\" \""); //range queries inside phrases not supported
 Code plus Junit test to follow...




[jira] Updated: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2009-03-02 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-1486:
-

Attachment: ComplexPhraseQueryParser.java

Updated to cater for phrase clauses that produce no matches

 Wildcards, ORs etc inside Phrase queries
 

 Key: LUCENE-1486
 URL: https://issues.apache.org/jira/browse/LUCENE-1486
 Project: Lucene - Java
  Issue Type: Improvement
  Components: QueryParser
Affects Versions: 2.4
Reporter: Mark Harwood
Priority: Minor
 Fix For: 2.9

 Attachments: ComplexPhraseQueryParser.java


 An extension to the default QueryParser that overrides the parsing of 
 PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.
 The implementation feels a little hacky - this is arguably better handled in 
 QueryParser itself. This works as a proof of concept  for much of the query 
 parser syntax. Examples from the Junit test include:
   checkMatches("\"j*   smyth~\"", "1,2"); //wildcards and fuzzies are OK in phrases
   checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic works
   checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic works.
   
   checkBadQuery("\"jo*  id:1 smith\""); //mixing fields in a phrase is bad
   checkBadQuery("\"jo* \"smith\" \""); //phrases inside phrases is bad
   checkBadQuery("\"jo* [sma TO smZ]\" \""); //range queries inside phrases not supported
 Code plus Junit test to follow...




[jira] Updated: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2009-03-02 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-1486:
-

Attachment: (was: TestComplexPhraseQuery.java)

 Wildcards, ORs etc inside Phrase queries
 

 Key: LUCENE-1486
 URL: https://issues.apache.org/jira/browse/LUCENE-1486
 Project: Lucene - Java
  Issue Type: Improvement
  Components: QueryParser
Affects Versions: 2.4
Reporter: Mark Harwood
Priority: Minor
 Fix For: 2.9

 Attachments: ComplexPhraseQueryParser.java


 An extension to the default QueryParser that overrides the parsing of 
 PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.
 The implementation feels a little hacky - this is arguably better handled in 
 QueryParser itself. This works as a proof of concept  for much of the query 
 parser syntax. Examples from the Junit test include:
   checkMatches("\"j*   smyth~\"", "1,2"); // wildcards and fuzzies are OK in phrases
   checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic works
   checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic works.
   
   checkBadQuery("\"jo*  id:1 smith\""); // mixing fields in a phrase is bad
   checkBadQuery("\"jo* \"smith\" \""); // phrases inside phrases is bad
   checkBadQuery("\"jo* [sma TO smZ]\" \""); // range queries inside phrases not supported
 Code plus Junit test to follow...




[jira] Updated: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2009-03-02 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-1486:
-

Attachment: TestComplexPhraseQuery.java

Updated Junit test to test for phrases with clauses that produce no matches

 Wildcards, ORs etc inside Phrase queries
 

 Key: LUCENE-1486
 URL: https://issues.apache.org/jira/browse/LUCENE-1486
 Project: Lucene - Java
  Issue Type: Improvement
  Components: QueryParser
Affects Versions: 2.4
Reporter: Mark Harwood
Priority: Minor
 Fix For: 2.9

 Attachments: ComplexPhraseQueryParser.java, 
 TestComplexPhraseQuery.java


 An extension to the default QueryParser that overrides the parsing of 
 PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.
 The implementation feels a little hacky - this is arguably better handled in 
 QueryParser itself. This works as a proof of concept  for much of the query 
 parser syntax. Examples from the Junit test include:
   checkMatches("\"j*   smyth~\"", "1,2"); // wildcards and fuzzies are OK in phrases
   checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic works
   checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic works.
   
   checkBadQuery("\"jo*  id:1 smith\""); // mixing fields in a phrase is bad
   checkBadQuery("\"jo* \"smith\" \""); // phrases inside phrases is bad
   checkBadQuery("\"jo* [sma TO smZ]\" \""); // range queries inside phrases not supported
 Code plus Junit test to follow...




[jira] Updated: (LUCENE-1500) Highlighter throws StringIndexOutOfBoundsException

2009-03-01 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-1500:
-

Attachment: Lucene-1500-NewException.patch

Attached a patch with new checked exception.
This will have a knock-on effect on all Highlighter client code (Solr?) as it 
introduces a new checked exception that must be handled.

 Highlighter throws StringIndexOutOfBoundsException
 --

 Key: LUCENE-1500
 URL: https://issues.apache.org/jira/browse/LUCENE-1500
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/highlighter
Affects Versions: 2.4
 Environment: Found this running the example code in Solr (latest 
 version).
Reporter: David Bowen
Assignee: Michael McCandless
 Fix For: 2.4.1, 2.9

 Attachments: Lucene-1500-NewException.patch, LUCENE-1500.patch, 
 patch.txt


 Using the canonical Solr example (ant run-example) I added this document 
 (using exampledocs/post.sh):
 <add><doc>
   <field name="id">Test for Highlighting StringIndexOutOfBoundsExcdption</field>
   <field name="name">Some Name</field>
   <field name="manu">Acme, Inc.</field>
   <field name="features">Description of the features, mentioning various things</field>
   <field name="features">Features also is multivalued</field>
   <field name="popularity">6</field>
   <field name="inStock">true</field>
 </doc></add>
 and then the URL 
 http://localhost:8983/solr/select/?q=features&hl=true&hl.fl=features caused 
 the exception.
 I have a patch.  I don't know if it is completely correct, but it avoids this 
 exception.




[jira] Commented: (LUCENE-1500) Highlighter throws StringIndexOutOfBoundsException

2009-02-27 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12677507#action_12677507
 ] 

Mark Harwood commented on LUCENE-1500:
--

OK - choices are:

1) Throw a RuntimeException with a more useful diagnostic message
2) Throw a new checked Exception (introducing possible compile errors in 
existing client code)
3) Check for the error condition and ignore (as done in the current patch)

This feels to me like one of those "there's something seriously wrong with the 
codebase" problems rather than an invalid bit of data or user input that is 
external to the system, so my personal preference is to lean towards 1). 
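
To make the three choices concrete, here is a minimal sketch of how each one looks to a caller. It is not the patch attached to this issue: BadTokenOffsetsException, OffsetErrorOptions and the method names are hypothetical, and a plain substring stands in for the highlighter's use of token offsets.

```java
import java.util.Optional;

/** Hypothetical checked exception for option 2; the name is illustrative only. */
class BadTokenOffsetsException extends Exception {
    BadTokenOffsetsException(String message) {
        super(message);
    }
}

class OffsetErrorOptions {
    /** True if [start, end) does not fit inside a text of the given length. */
    static boolean outOfRange(int start, int end, int textLength) {
        return start < 0 || end < start || end > textLength;
    }

    /** Shared diagnostic so options 1 and 2 report the same detail. */
    static String describe(int start, int end, int textLength) {
        return "Token offsets [" + start + "," + end + ") exceed text length " + textLength;
    }

    /** Option 1: unchecked exception with a useful message; existing callers still compile. */
    static String option1(String text, int start, int end) {
        if (outOfRange(start, end, text.length())) {
            throw new IllegalStateException(describe(start, end, text.length()));
        }
        return text.substring(start, end);
    }

    /** Option 2: checked exception; every caller must now catch or declare it. */
    static String option2(String text, int start, int end) throws BadTokenOffsetsException {
        if (outOfRange(start, end, text.length())) {
            throw new BadTokenOffsetsException(describe(start, end, text.length()));
        }
        return text.substring(start, end);
    }

    /** Option 3: detect and ignore; the caller gets no text and no error. */
    static Optional<String> option3(String text, int start, int end) {
        if (outOfRange(start, end, text.length())) {
            return Optional.empty();
        }
        return Optional.of(text.substring(start, end));
    }

    public static void main(String[] args) throws Exception {
        System.out.println(option1("Lucene", 0, 3));
        System.out.println(option3("Lucene", 4, 10).isPresent());
        try {
            option2("Lucene", 4, 10);
        } catch (BadTokenOffsetsException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Only option 2 changes the method signature, which is where the compile errors in existing client code would come from.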



 Highlighter throws StringIndexOutOfBoundsException
 --

 Key: LUCENE-1500
 URL: https://issues.apache.org/jira/browse/LUCENE-1500
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/highlighter
Affects Versions: 2.4
 Environment: Found this running the example code in Solr (latest 
 version).
Reporter: David Bowen
Assignee: Michael McCandless
 Fix For: 2.4.1, 2.9

 Attachments: LUCENE-1500.patch, patch.txt


 Using the canonical Solr example (ant run-example) I added this document 
 (using exampledocs/post.sh):
 <add><doc>
   <field name="id">Test for Highlighting StringIndexOutOfBoundsExcdption</field>
   <field name="name">Some Name</field>
   <field name="manu">Acme, Inc.</field>
   <field name="features">Description of the features, mentioning various things</field>
   <field name="features">Features also is multivalued</field>
   <field name="popularity">6</field>
   <field name="inStock">true</field>
 </doc></add>
 and then the URL 
 http://localhost:8983/solr/select/?q=features&hl=true&hl.fl=features caused 
 the exception.
 I have a patch.  I don't know if it is completely correct, but it avoids this 
 exception.




[jira] Commented: (LUCENE-1500) Highlighter throws StringIndexOutOfBoundsException

2009-02-25 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12676633#action_12676633
 ] 

Mark Harwood commented on LUCENE-1500:
--

Hmmm. I'm not so sure that this defensive coding patch is the right thing to 
do here. 

One could argue that it is obscuring an error condition further upstream (as 
you suggest, Mike - a dodgy analyzer). Committing this will only make these 
errors harder to detect, e.g. we'd get forum posts saying "why doesn't my term 
get highlighted?"

Perhaps we can turn this around and ask "under what conditions is it acceptable 
to provide a TokenStream with tokens whose offsets exceed the length of the 
text provided?" 
I don't see a justifiable case for supporting that as a legitimate scenario, 
and I would prefer to report an error in this case.
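
As a rough illustration of the condition in question, the sketch below scans a list of token offsets and reports the first one that does not fit inside the supplied text. Span and TokenOffsetCheck are hypothetical stand-ins written for this example, not Lucene classes, and a real check would read the offsets from the TokenStream itself.

```java
import java.util.List;

class TokenOffsetCheck {
    /** Hypothetical stand-in for a token's character offsets; not a Lucene class. */
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Returns the index of the first span that does not fit inside the text, or -1 if all fit. */
    static int firstInvalid(List<Span> spans, String text) {
        for (int i = 0; i < spans.size(); i++) {
            Span s = spans.get(i);
            if (s.start < 0 || s.end < s.start || s.end > text.length()) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String text = "Some Name"; // 9 characters
        List<Span> fits = List.of(new Span(0, 4), new Span(5, 9));
        List<Span> tooLong = List.of(new Span(0, 4), new Span(5, 12));
        System.out.println(firstInvalid(fits, text));
        System.out.println(firstInvalid(tooLong, text));
    }
}
```

Reporting the index of the offending token, rather than skipping it, keeps the upstream analyzer problem visible.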




 Highlighter throws StringIndexOutOfBoundsException
 --

 Key: LUCENE-1500
 URL: https://issues.apache.org/jira/browse/LUCENE-1500
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/highlighter
Affects Versions: 2.4
 Environment: Found this running the example code in Solr (latest 
 version).
Reporter: David Bowen
Assignee: Michael McCandless
 Fix For: 2.4.1, 2.9

 Attachments: LUCENE-1500.patch, patch.txt


 Using the canonical Solr example (ant run-example) I added this document 
 (using exampledocs/post.sh):
 <add><doc>
   <field name="id">Test for Highlighting StringIndexOutOfBoundsExcdption</field>
   <field name="name">Some Name</field>
   <field name="manu">Acme, Inc.</field>
   <field name="features">Description of the features, mentioning various things</field>
   <field name="features">Features also is multivalued</field>
   <field name="popularity">6</field>
   <field name="inStock">true</field>
 </doc></add>
 and then the URL 
 http://localhost:8983/solr/select/?q=features&hl=true&hl.fl=features caused 
 the exception.
 I have a patch.  I don't know if it is completely correct, but it avoids this 
 exception.




[jira] Commented: (LUCENE-1500) Highlighter throws StringIndexOutOfBoundsException

2009-02-25 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12676745#action_12676745
 ] 

Mark Harwood commented on LUCENE-1500:
--

So to be consistent, where else in Lucene might an 
IncorrectTokenOffsetsException be a possibility - IndexWriter.addDocument(..)?

 Highlighter throws StringIndexOutOfBoundsException
 --

 Key: LUCENE-1500
 URL: https://issues.apache.org/jira/browse/LUCENE-1500
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/highlighter
Affects Versions: 2.4
 Environment: Found this running the example code in Solr (latest 
 version).
Reporter: David Bowen
Assignee: Michael McCandless
 Fix For: 2.4.1, 2.9

 Attachments: LUCENE-1500.patch, patch.txt


 Using the canonical Solr example (ant run-example) I added this document 
 (using exampledocs/post.sh):
 <add><doc>
   <field name="id">Test for Highlighting StringIndexOutOfBoundsExcdption</field>
   <field name="name">Some Name</field>
   <field name="manu">Acme, Inc.</field>
   <field name="features">Description of the features, mentioning various things</field>
   <field name="features">Features also is multivalued</field>
   <field name="popularity">6</field>
   <field name="inStock">true</field>
 </doc></add>
 and then the URL 
 http://localhost:8983/solr/select/?q=features&hl=true&hl.fl=features caused 
 the exception.
 I have a patch.  I don't know if it is completely correct, but it avoids this 
 exception.




[jira] Commented: (LUCENE-1489) highlighter problem with n-gram tokens

2009-01-27 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12667654#action_12667654
 ] 

Mark Harwood commented on LUCENE-1489:
--

It looks to me like this could be fixed in the Formatter classes when marking 
up the output string.

Currently classes such as SimpleHTMLFormatter in their highlightTerm method 
put a tag around the whole section of text, if it contains a hit, i.e.

{code:title=SimpleHTMLFormatter.java|borderStyle=solid}
public String highlightTerm(String originalText, TokenGroup tokenGroup)
{
    StringBuffer returnBuffer;
    // Wrap the whole text in tags if any token in the group scored a hit.
    if (tokenGroup.getTotalScore() > 0)
    {
        returnBuffer = new StringBuffer();
        returnBuffer.append(preTag);
        returnBuffer.append(originalText);
        returnBuffer.append(postTag);
        return returnBuffer.toString();
    }
    return originalText;
}
{code}

The TokenGroup object passed to this method contains all of the tokens and 
their scores so it should be possible to use this information to deconstruct 
the originalText parameter and inject markup according to which tokens in the 
group had a match rather than putting a tag around the whole block.  Some 
complexity may lie in handling token streams that produce tokens that rewind 
to earlier offsets.
SimpleHTMLFormatter suddenly seems less simple!

TokenStreams that produce entirely overlapping streams of tokens will 
automatically be broken into multiple TokenGroups because TokenGroup has a 
maximum number of linked Tokens it will ever hold in a single group.
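
A minimal sketch of the per-token markup idea, for anyone picking this up. It does not use the Lucene classes: ScoredToken and PerTokenMarkup are hypothetical stand-ins, offsets are assumed to be relative to the start of the group's text, and out-of-range offsets are simply clamped for brevity. Marking covered characters first, then emitting tags at run boundaries, copes with overlapping tokens and with tokens that rewind to earlier offsets.

```java
import java.util.List;

class PerTokenMarkup {
    /** Hypothetical stand-in for one token in a group: offsets within the group's text, plus its score. */
    static final class ScoredToken {
        final int start;
        final int end;
        final float score;

        ScoredToken(int start, int end, float score) {
            this.start = start;
            this.end = end;
            this.score = score;
        }
    }

    /** Tags only the characters covered by tokens with a positive score. */
    static String highlight(String groupText, List<ScoredToken> tokens, String preTag, String postTag) {
        // Pass 1: mark every character covered by a matching token.
        boolean[] hit = new boolean[groupText.length()];
        for (ScoredToken t : tokens) {
            if (t.score <= 0) {
                continue;
            }
            int from = Math.max(0, t.start);
            int to = Math.min(groupText.length(), t.end);
            for (int i = from; i < to; i++) {
                hit[i] = true;
            }
        }
        // Pass 2: open a tag when a marked run starts and close it when the run ends.
        StringBuilder out = new StringBuilder();
        boolean open = false;
        for (int i = 0; i < groupText.length(); i++) {
            if (hit[i] && !open) {
                out.append(preTag);
                open = true;
            } else if (!hit[i] && open) {
                out.append(postTag);
                open = false;
            }
            out.append(groupText.charAt(i));
        }
        if (open) {
            out.append(postTag);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Bigrams over "can make": "ca" and "an" match the query, "n " does not.
        List<ScoredToken> tokens = List.of(
                new ScoredToken(0, 2, 1f),
                new ScoredToken(1, 3, 1f),
                new ScoredToken(2, 4, 0f));
        System.out.println(highlight("can make", tokens, "<B>", "</B>"));
    }
}
```

With the bigram tokens in main, only "can" ends up inside the tags, which is the behaviour the reporter expected.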

I haven't got the time to fix this right now but if someone has a burning need 
to leap in, the above seems like what may be required.

Cheers
Mark






 highlighter problem with n-gram tokens
 --

 Key: LUCENE-1489
 URL: https://issues.apache.org/jira/browse/LUCENE-1489
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/highlighter
Reporter: Koji Sekiguchi
Priority: Minor

 I have a problem when using n-gram and highlighter. I thought it had been 
 solved in LUCENE-627...
 Actually, I found this problem when I was using CJKTokenizer on Solr, though, 
 here is lucene program to reproduce it using NGramTokenizer(min=2,max=2) 
 instead of CJKTokenizer:
 {code:java}
 public class TestNGramHighlighter {
   public static void main(String[] args) throws Exception {
     Analyzer analyzer = new NGramAnalyzer();
     final String TEXT = "Lucene can make index. Then Lucene can search.";
     final String QUERY = "can";
     QueryParser parser = new QueryParser("f", analyzer);
     Query query = parser.parse(QUERY);
     QueryScorer scorer = new QueryScorer(query, "f");
     Highlighter h = new Highlighter(scorer);
     System.out.println(h.getBestFragment(analyzer, "f", TEXT));
   }
   static class NGramAnalyzer extends Analyzer {
     public TokenStream tokenStream(String field, Reader input) {
       return new NGramTokenizer(input, 2, 2);
     }
   }
 }
 {code}
 expected output is:
 Lucene <B>can</B> make index. Then Lucene <B>can</B> search.
 but the actual output is:
 Lucene <B>can make index. Then Lucene can</B> search.




[jira] Updated: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2008-12-12 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-1486:
-

Attachment: TestComplexPhraseQuery.java

More tests for Nots

 Wildcards, ORs etc inside Phrase queries
 

 Key: LUCENE-1486
 URL: https://issues.apache.org/jira/browse/LUCENE-1486
 Project: Lucene - Java
  Issue Type: Improvement
  Components: QueryParser
Affects Versions: 2.4
Reporter: Mark Harwood
Priority: Minor
 Fix For: 2.4.1, 2.9

 Attachments: ComplexPhraseQueryParser.java, 
 TestComplexPhraseQuery.java


 An extension to the default QueryParser that overrides the parsing of 
 PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.
 The implementation feels a little hacky - this is arguably better handled in 
 QueryParser itself. This works as a proof of concept  for much of the query 
 parser syntax. Examples from the Junit test include:
   checkMatches("\"j*   smyth~\"", "1,2"); // wildcards and fuzzies are OK in phrases
   checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic works
   checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic works.
   
   checkBadQuery("\"jo*  id:1 smith\""); // mixing fields in a phrase is bad
   checkBadQuery("\"jo* \"smith\" \""); // phrases inside phrases is bad
   checkBadQuery("\"jo* [sma TO smZ]\" \""); // range queries inside phrases not supported
 Code plus Junit test to follow...




[jira] Updated: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2008-12-12 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-1486:
-

Attachment: ComplexPhraseQueryParser.java

Added support for Nots in phrase queries e.g. -not interested

 Wildcards, ORs etc inside Phrase queries
 

 Key: LUCENE-1486
 URL: https://issues.apache.org/jira/browse/LUCENE-1486
 Project: Lucene - Java
  Issue Type: Improvement
  Components: QueryParser
Affects Versions: 2.4
Reporter: Mark Harwood
Priority: Minor
 Fix For: 2.4.1, 2.9

 Attachments: ComplexPhraseQueryParser.java, 
 TestComplexPhraseQuery.java


 An extension to the default QueryParser that overrides the parsing of 
 PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.
 The implementation feels a little hacky - this is arguably better handled in 
 QueryParser itself. This works as a proof of concept  for much of the query 
 parser syntax. Examples from the Junit test include:
   checkMatches("\"j*   smyth~\"", "1,2"); // wildcards and fuzzies are OK in phrases
   checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic works
   checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic works.
   
   checkBadQuery("\"jo*  id:1 smith\""); // mixing fields in a phrase is bad
   checkBadQuery("\"jo* \"smith\" \""); // phrases inside phrases is bad
   checkBadQuery("\"jo* [sma TO smZ]\" \""); // range queries inside phrases not supported
 Code plus Junit test to follow...




[jira] Updated: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2008-12-12 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-1486:
-

Attachment: (was: ComplexPhraseQueryParser.java)

 Wildcards, ORs etc inside Phrase queries
 

 Key: LUCENE-1486
 URL: https://issues.apache.org/jira/browse/LUCENE-1486
 Project: Lucene - Java
  Issue Type: Improvement
  Components: QueryParser
Affects Versions: 2.4
Reporter: Mark Harwood
Priority: Minor
 Fix For: 2.4.1, 2.9

 Attachments: ComplexPhraseQueryParser.java, 
 TestComplexPhraseQuery.java


 An extension to the default QueryParser that overrides the parsing of 
 PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.
 The implementation feels a little hacky - this is arguably better handled in 
 QueryParser itself. This works as a proof of concept  for much of the query 
 parser syntax. Examples from the Junit test include:
   checkMatches("\"j*   smyth~\"", "1,2"); // wildcards and fuzzies are OK in phrases
   checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic works
   checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic works.
   
   checkBadQuery("\"jo*  id:1 smith\""); // mixing fields in a phrase is bad
   checkBadQuery("\"jo* \"smith\" \""); // phrases inside phrases is bad
   checkBadQuery("\"jo* [sma TO smZ]\" \""); // range queries inside phrases not supported
 Code plus Junit test to follow...




[jira] Updated: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2008-12-12 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-1486:
-

Attachment: (was: TestComplexPhraseQuery.java)

 Wildcards, ORs etc inside Phrase queries
 

 Key: LUCENE-1486
 URL: https://issues.apache.org/jira/browse/LUCENE-1486
 Project: Lucene - Java
  Issue Type: Improvement
  Components: QueryParser
Affects Versions: 2.4
Reporter: Mark Harwood
Priority: Minor
 Fix For: 2.4.1, 2.9

 Attachments: ComplexPhraseQueryParser.java, 
 TestComplexPhraseQuery.java


 An extension to the default QueryParser that overrides the parsing of 
 PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.
 The implementation feels a little hacky - this is arguably better handled in 
 QueryParser itself. This works as a proof of concept  for much of the query 
 parser syntax. Examples from the Junit test include:
   checkMatches("\"j*   smyth~\"", "1,2"); // wildcards and fuzzies are OK in phrases
   checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic works
   checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic works.
   
   checkBadQuery("\"jo*  id:1 smith\""); // mixing fields in a phrase is bad
   checkBadQuery("\"jo* \"smith\" \""); // phrases inside phrases is bad
   checkBadQuery("\"jo* [sma TO smZ]\" \""); // range queries inside phrases not supported
 Code plus Junit test to follow...




[jira] Updated: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2008-12-11 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-1486:
-

Attachment: (was: ComplexPhraseQueryParser.java)

 Wildcards, ORs etc inside Phrase queries
 

 Key: LUCENE-1486
 URL: https://issues.apache.org/jira/browse/LUCENE-1486
 Project: Lucene - Java
  Issue Type: Improvement
  Components: QueryParser
Affects Versions: 2.4
Reporter: Mark Harwood
Priority: Minor
 Fix For: 2.4.1

 Attachments: ComplexPhraseQueryParser.java, 
 TestComplexPhraseQuery.java


 An extension to the default QueryParser that overrides the parsing of 
 PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.
 The implementation feels a little hacky - this is arguably better handled in 
 QueryParser itself. This works as a proof of concept  for much of the query 
 parser syntax. Examples from the Junit test include:
   checkMatches("\"j*   smyth~\"", "1,2"); // wildcards and fuzzies are OK in phrases
   checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic works
   checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic works.
   
   checkBadQuery("\"jo*  id:1 smith\""); // mixing fields in a phrase is bad
   checkBadQuery("\"jo* \"smith\" \""); // phrases inside phrases is bad
   checkBadQuery("\"jo* [sma TO smZ]\" \""); // range queries inside phrases not supported
 Code plus Junit test to follow...




[jira] Updated: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2008-12-11 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-1486:
-

Attachment: ComplexPhraseQueryParser.java

Fixed bug with plain phrase query, added support for range queries

 Wildcards, ORs etc inside Phrase queries
 

 Key: LUCENE-1486
 URL: https://issues.apache.org/jira/browse/LUCENE-1486
 Project: Lucene - Java
  Issue Type: Improvement
  Components: QueryParser
Affects Versions: 2.4
Reporter: Mark Harwood
Priority: Minor
 Fix For: 2.4.1

 Attachments: ComplexPhraseQueryParser.java, 
 TestComplexPhraseQuery.java


 An extension to the default QueryParser that overrides the parsing of 
 PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.
 The implementation feels a little hacky - this is arguably better handled in 
 QueryParser itself. This works as a proof of concept  for much of the query 
 parser syntax. Examples from the Junit test include:
   checkMatches("\"j*   smyth~\"", "1,2"); // wildcards and fuzzies are OK in phrases
   checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic works
   checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic works.
   
   checkBadQuery("\"jo*  id:1 smith\""); // mixing fields in a phrase is bad
   checkBadQuery("\"jo* \"smith\" \""); // phrases inside phrases is bad
   checkBadQuery("\"jo* [sma TO smZ]\" \""); // range queries inside phrases not supported
 Code plus Junit test to follow...




[jira] Updated: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2008-12-11 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-1486:
-

Attachment: (was: TestComplexPhraseQuery.java)

 Wildcards, ORs etc inside Phrase queries
 

 Key: LUCENE-1486
 URL: https://issues.apache.org/jira/browse/LUCENE-1486
 Project: Lucene - Java
  Issue Type: Improvement
  Components: QueryParser
Affects Versions: 2.4
Reporter: Mark Harwood
Priority: Minor
 Fix For: 2.4.1

 Attachments: ComplexPhraseQueryParser.java, 
 TestComplexPhraseQuery.java


 An extension to the default QueryParser that overrides the parsing of 
 PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.
 The implementation feels a little hacky - this is arguably better handled in 
 QueryParser itself. This works as a proof of concept  for much of the query 
 parser syntax. Examples from the Junit test include:
   checkMatches("\"j*   smyth~\"", "1,2"); // wildcards and fuzzies are OK in phrases
   checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic works
   checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic works.
   
   checkBadQuery("\"jo*  id:1 smith\""); // mixing fields in a phrase is bad
   checkBadQuery("\"jo* \"smith\" \""); // phrases inside phrases is bad
   checkBadQuery("\"jo* [sma TO smZ]\" \""); // range queries inside phrases not supported
 Code plus Junit test to follow...




[jira] Updated: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2008-12-11 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-1486:
-

Attachment: TestComplexPhraseQuery.java

Added tests for range queries and plain PhraseQueries

 Wildcards, ORs etc inside Phrase queries
 

 Key: LUCENE-1486
 URL: https://issues.apache.org/jira/browse/LUCENE-1486
 Project: Lucene - Java
  Issue Type: Improvement
  Components: QueryParser
Affects Versions: 2.4
Reporter: Mark Harwood
Priority: Minor
 Fix For: 2.4.1

 Attachments: ComplexPhraseQueryParser.java, 
 TestComplexPhraseQuery.java


 An extension to the default QueryParser that overrides the parsing of 
 PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.
 The implementation feels a little hacky - this is arguably better handled in 
 QueryParser itself. This works as a proof of concept  for much of the query 
 parser syntax. Examples from the Junit test include:
   checkMatches("\"j*   smyth~\"", "1,2"); // wildcards and fuzzies are OK in phrases
   checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic works
   checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic works.
   
   checkBadQuery("\"jo*  id:1 smith\""); // mixing fields in a phrase is bad
   checkBadQuery("\"jo* \"smith\" \""); // phrases inside phrases is bad
   checkBadQuery("\"jo* [sma TO smZ]\" \""); // range queries inside phrases not supported
 Code plus Junit test to follow...




[jira] Created: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2008-12-10 Thread Mark Harwood (JIRA)
Wildcards, ORs etc inside Phrase queries


 Key: LUCENE-1486
 URL: https://issues.apache.org/jira/browse/LUCENE-1486
 Project: Lucene - Java
  Issue Type: Improvement
  Components: QueryParser
Affects Versions: 2.4
Reporter: Mark Harwood
Priority: Minor
 Fix For: 2.4.1


An extension to the default QueryParser that overrides the parsing of 
PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.

The implementation feels a little hacky - this is arguably better handled in 
QueryParser itself. This works as a proof of concept  for much of the query 
parser syntax. Examples from the Junit test include:

checkMatches("\"j*   smyth~\"", "1,2"); // wildcards and fuzzies are OK in phrases
checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic works
checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic works.

checkBadQuery("\"jo*  id:1 smith\""); // mixing fields in a phrase is bad
checkBadQuery("\"jo* \"smith\" \""); // phrases inside phrases is bad
checkBadQuery("\"jo* [sma TO smZ]\" \""); // range queries inside phrases not supported

Code plus Junit test to follow...






[jira] Updated: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2008-12-10 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-1486:
-

Attachment: ComplexPhraseQueryParser.java

QueryParser extension

 Wildcards, ORs etc inside Phrase queries
 

 Key: LUCENE-1486
 URL: https://issues.apache.org/jira/browse/LUCENE-1486
 Project: Lucene - Java
  Issue Type: Improvement
  Components: QueryParser
Affects Versions: 2.4
Reporter: Mark Harwood
Priority: Minor
 Fix For: 2.4.1

 Attachments: ComplexPhraseQueryParser.java, 
 TestComplexPhraseQuery.java


 An extension to the default QueryParser that overrides the parsing of 
 PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.
 The implementation feels a little hacky - this is arguably better handled in 
 QueryParser itself. This works as a proof of concept  for much of the query 
 parser syntax. Examples from the Junit test include:
   checkMatches("\"j*   smyth~\"", "1,2"); // wildcards and fuzzies are OK in phrases
   checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic works
   checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic works.
   
   checkBadQuery("\"jo*  id:1 smith\""); // mixing fields in a phrase is bad
   checkBadQuery("\"jo* \"smith\" \""); // phrases inside phrases is bad
   checkBadQuery("\"jo* [sma TO smZ]\" \""); // range queries inside phrases not supported
 Code plus Junit test to follow...
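
 To make the "wildcards in phrases" idea concrete, here is a minimal, 
 self-contained sketch in plain Java. It is not the attached 
 ComplexPhraseQueryParser: the class and method names are hypothetical, and it 
 only handles trailing-* prefix terms at adjacent positions (no slop, fuzzies, 
 boolean clauses or field checks).

```java
import java.util.Arrays;
import java.util.List;

final class PhraseWildcardSketch {

    // True if one pattern term matches a token; a trailing '*' means "prefix match".
    static boolean termMatches(String pattern, String token) {
        if (pattern.endsWith("*")) {
            return token.startsWith(pattern.substring(0, pattern.length() - 1));
        }
        return pattern.equals(token);
    }

    // True if the pattern terms match consecutive tokens somewhere in the document.
    static boolean matchesPhrase(List<String> patternTerms, List<String> docTokens) {
        for (int start = 0; start + patternTerms.size() <= docTokens.size(); start++) {
            boolean all = true;
            for (int i = 0; i < patternTerms.size() && all; i++) {
                all = termMatches(patternTerms.get(i), docTokens.get(start + i));
            }
            if (all) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> phrase = Arrays.asList("jo*", "smith");
        System.out.println(matchesPhrase(phrase, Arrays.asList("john", "smith")));
        System.out.println(matchesPhrase(phrase, Arrays.asList("smith", "john")));
    }
}
```

 The position check is what distinguishes this from a plain boolean AND of the 
 expanded terms: both tokens must occur, and in phrase order.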




[jira] Updated: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2008-12-10 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-1486:
-

Attachment: TestComplexPhraseQuery.java

Junit test

 Wildcards, ORs etc inside Phrase queries
 

 Key: LUCENE-1486
 URL: https://issues.apache.org/jira/browse/LUCENE-1486
 Project: Lucene - Java
  Issue Type: Improvement
  Components: QueryParser
Affects Versions: 2.4
Reporter: Mark Harwood
Priority: Minor
 Fix For: 2.4.1

 Attachments: ComplexPhraseQueryParser.java, 
 TestComplexPhraseQuery.java






[jira] Commented: (LUCENE-1473) Implement standard Serialization across Lucene versions

2008-12-03 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12653057#action_12653057
 ] 

Mark Harwood commented on LUCENE-1473:
--

The contrib section of Lucene contains an XML-based query parser which aims to 
provide full coverage of Lucene queries/filters and provide extensibility to 
support 3rd party classes.
I use this regularly in distributed deployments, where it allows both non-Java 
clients and long-term persistence of queries with good stability across Lucene 
versions.
Although I have not conducted formal benchmarks, XML parsing has never stood 
out as a bottleneck - search execution and/or document retrieval are 
normally the main bottlenecks.

Maintaining XML parsing code is an overhead but ultimately helps decouple 
requests from the logic that executes requests. In serializing Lucene 
Query/Filter objects we are dealing with classes which combine both the 
representation of the request criteria (what needs to be done) and the 
implementation (how things are done). We are forever finessing the "how" bit of 
this equation, e.g. moving from RangeQuery to RangeFilter to TrieRangeFilter. 
The criteria, however, remain relatively static ("I just want to search on a 
range"), so it is dangerous to build clients that refer directly to query 
implementation classes.
The XML parser provides a language-independent abstraction for clients to 
define what they want to be done without being too tied to how this is 
implemented.
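
As a rough illustration of that decoupling, the sketch below uses only the JDK's 
XML APIs. It is not the contrib parser's code: the class name, the "Range" 
element and its attributes are made up for the example, and a String stands in 
for the Query/Filter object a real builder would return.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;

final class CriteriaDispatchSketch {

    // Maps an element name (the "what") to a builder that picks an implementation (the "how").
    private final Map<String, Function<Element, String>> builders = new HashMap<>();

    void register(String elementName, Function<Element, String> builder) {
        builders.put(elementName, builder);
    }

    // Parses the request and hands the root element to the builder registered for its name.
    String build(String xml) {
        Element root;
        try {
            root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (Exception ex) {
            throw new IllegalArgumentException("Unparseable request", ex);
        }
        Function<Element, String> builder = builders.get(root.getTagName());
        if (builder == null) {
            throw new IllegalArgumentException("No builder for element: " + root.getTagName());
        }
        return builder.apply(root);
    }

    public static void main(String[] args) {
        CriteriaDispatchSketch parser = new CriteriaDispatchSketch();
        // Swapping this lambda changes the implementation without touching the client's XML.
        parser.register("Range", el -> "TrieRangeFilter(" + el.getAttribute("field") + ", "
                + el.getAttribute("lower") + ", " + el.getAttribute("upper") + ")");
        System.out.println(parser.build("<Range field=\"age\" lower=\"30\" upper=\"50\"/>"));
    }
}
```

A stored request such as the XML string above keeps working when the registered 
builder is replaced, which is the stability property described in this comment.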

Cheers
Mark



 Implement standard Serialization across Lucene versions
 ---

 Key: LUCENE-1473
 URL: https://issues.apache.org/jira/browse/LUCENE-1473
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 2.4
Reporter: Jason Rutherglen
Priority: Minor
 Attachments: LUCENE-1473.patch

   Original Estimate: 8h
  Remaining Estimate: 8h

 To maintain serialization compatibility between Lucene versions, 
 serialVersionUID needs to be added to classes that implement 
 java.io.Serializable.  java.io.Externalizable may be implemented in classes 
 for faster performance.
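
 For readers unfamiliar with the mechanism, a minimal sketch of a class that 
 pins its serialVersionUID and survives a serialize/deserialize round trip is 
 shown below. RangeCriteria and the helper methods are hypothetical stand-ins, 
 not classes from the attached patch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

final class SerialVersionSketch {

    // Hypothetical stand-in for a serializable query class.
    static final class RangeCriteria implements Serializable {
        // Without an explicit value the JVM derives one from the class shape,
        // so compatible edits to the class can still break old streams.
        private static final long serialVersionUID = 1L;

        final String field;
        final long lower;
        final long upper;

        RangeCriteria(String field, long lower, long upper) {
            this.field = field;
            this.lower = lower;
            this.upper = upper;
        }
    }

    // Serializes an object to a byte array using standard Java serialization.
    static byte[] toBytes(Serializable obj) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Reads back an object written by toBytes.
    static Object fromBytes(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```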




[jira] Commented: (LUCENE-1470) Add TrieRangeQuery to contrib

2008-11-27 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12651418#action_12651418
 ] 

Mark Harwood commented on LUCENE-1470:
--

A note of caution - I noticed when moving from Lucene 2.3 to 2.4 that my 
similar scheme for encoding information meant that I couldn't encode 
information using byte arrays using bytes with values  216.
The changes (I think in LUCENE-510) introduced some code that modified the way 
the bytes were written and read, which corrupted my encoding.

Not sure if your proposed approach is prone to this or if anyone can cast 
further light on these encoding issues.

Good to see this making its way into Lucene, Uwe.

 Add TrieRangeQuery to contrib
 -

 Key: LUCENE-1470
 URL: https://issues.apache.org/jira/browse/LUCENE-1470
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Affects Versions: 2.4
Reporter: Uwe Schindler
Assignee: Michael McCandless
 Attachments: LUCENE-1470.patch, LUCENE-1470.patch, LUCENE-1470.patch, 
 LUCENE-1470.patch


 According to the thread in java-dev 
 (http://www.gossamer-threads.com/lists/lucene/java-dev/67807 and 
 http://www.gossamer-threads.com/lists/lucene/java-dev/67839), I want to 
 include my fast numerical range query implementation into lucene 
 contrib-queries.
 I implemented (based on RangeFilter) another approach for faster
 RangeQueries, based on longs stored in the index in a special format.
 The idea behind this is to store the longs in different precisions in the index
 and partition the query range in such a way that the outer boundaries are
 searched using terms from the highest precision, but the center of the search
 range with lower precision. The implementation stores the longs in 8
 different precisions (using a class called TrieUtils). It also has support
 for Doubles, using the IEEE 754 floating-point double format bit layout
 with some bit mappings to make them binary sortable. The approach is used in
 rather big indexes, query times are even on low performance desktop
 computers 100 ms (!) for very big ranges on indexes with 50 docs.
 I called this RangeQuery variant and format TrieRangeRange query because
 the idea looks like the well-known Trie structures (it is not identical
 to real tries, but the algorithms are related).
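
 The two ideas in this description (a binary-sortable mapping for doubles, and 
 covering the edges of a range at full precision while the middle uses coarser 
 terms) can be sketched in plain Java as follows. This is not the TrieUtils code 
 from the patch: the class and method names are hypothetical, and split() 
 assumes non-negative bounds with min <= max.

```java
import java.util.ArrayList;
import java.util.List;

final class TrieRangeSketch {

    // Maps a double to a long whose signed ordering matches the double's numeric ordering.
    static long doubleToSortableLong(double d) {
        long bits = Double.doubleToLongBits(d);
        // Negative doubles sort in reverse by raw bits, so flip every bit except the sign.
        return bits < 0 ? bits ^ 0x7fffffffffffffffL : bits;
    }

    // Splits [min, max] into sub-ranges {lo, hi, shift}. A sub-range with shift s
    // would be searched with terms that drop the low s bits; 'step' is the number
    // of bits dropped per precision level.
    static List<long[]> split(long min, long max, int step) {
        List<long[]> out = new ArrayList<>();
        for (int shift = 0; ; shift += step) {
            long diff = 1L << (shift + step);
            long mask = ((1L << step) - 1L) << shift;
            boolean hasLower = (min & mask) != 0L;   // ragged lower edge at this level
            boolean hasUpper = (max & mask) != mask; // ragged upper edge at this level
            long nextMin = (hasLower ? min + diff : min) & ~mask;
            long nextMax = (hasUpper ? max - diff : max) & ~mask;
            boolean wrapped = nextMin < min || nextMax > max;
            if (shift + step >= 64 || nextMin > nextMax || wrapped) {
                // Nothing left for a coarser level: emit the remainder here and stop.
                add(out, min, max, shift);
                break;
            }
            if (hasLower) {
                add(out, min, min | mask, shift);
            }
            if (hasUpper) {
                add(out, max & ~mask, max, shift);
            }
            min = nextMin;
            max = nextMax;
        }
        return out;
    }

    private static void add(List<long[]> out, long lo, long hi, int shift) {
        // At this precision the low 'shift' bits are dropped, so hi covers its whole block.
        out.add(new long[] {lo, hi | ((1L << shift) - 1L), shift});
    }
}
```

 With step = 4, split(3, 1000, 4) covers the range with five sub-ranges instead 
 of enumerating all 998 values, and only the outermost ones use full precision.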




[jira] Commented: (LUCENE-329) Fuzzy query scoring issues

2008-11-13 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12647243#action_12647243
 ] 

Mark Harwood commented on LUCENE-329:
-

This patch goes back a while.
Contrib's FuzzyLikeThisQuery contains my current best practice for fuzzy 
matching, but the logic is mixed in with code that also does LikeThis 
optimisations, i.e. working out which input terms are the best to search on 
rather than using all input terms. The fuzzy scoring logic could usefully be 
lifted out and used elsewhere.

The fuzzy scoring logic takes the IDF of the input term and uses that as the 
IDF for scoring all expanded variants. If the input term does not exist then 
all variants are rewarded with their averaged IDF. Coord is disabled.

Using some form of IDF is typically desirable to balance a fuzzy query with 
other (potentially non-fuzzy) clauses in the overall user query. Within a fuzzy 
query (or wildcard or other auto-expanding query), however, I see no reason to 
differentiate between the auto-expanded terms with different IDF values. In my 
view these auto-expanding queries should generally use the same IDF for all 
variants and only reward them differently based on edit distance or whatever 
other distance metric is meaningful to that form of expansion (e.g. an age range 
query on age 40 +/- 10 years could reward based on closeness to input term 40).
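
A minimal sketch of that scoring rule in plain Java follows. It is not the 
FuzzyLikeThisQuery source: the names are hypothetical, the IDF formula is 
assumed to be the usual 1 + ln(numDocs / (docFreq + 1)) form, and "similarity" 
is assumed to be a value in [0, 1] derived from edit distance.

```java
final class SharedIdfSketch {

    // Assumed IDF formula: 1 + ln(numDocs / (docFreq + 1)).
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    // One IDF for every expanded variant: the input term's IDF if the term is
    // indexed, otherwise the average IDF of the variants.
    static double sharedIdf(int inputDocFreq, int[] variantDocFreqs, int numDocs) {
        if (inputDocFreq > 0) {
            return idf(inputDocFreq, numDocs);
        }
        if (variantDocFreqs.length == 0) {
            return 0.0; // nothing to score
        }
        double sum = 0.0;
        for (int df : variantDocFreqs) {
            sum += idf(df, numDocs);
        }
        return sum / variantDocFreqs.length;
    }

    // Variants differ only by how close they are to the input term.
    static double variantWeight(double sharedIdf, double similarity) {
        return sharedIdf * similarity;
    }
}
```

Because every variant shares one IDF, a rare misspelling can no longer outrank 
the common correct spelling on rarity alone; only the distance term separates them.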

Cheers
Mark

 Fuzzy query scoring issues
 --

 Key: LUCENE-329
 URL: https://issues.apache.org/jira/browse/LUCENE-329
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 1.2rc5
 Environment: Operating System: All
 Platform: All
Reporter: Mark Harwood
Assignee: Lucene Developers
Priority: Minor
 Attachments: patch.txt


 Queries which automatically produce multiple terms (wildcard, range, prefix, 
 fuzzy etc.) currently suffer from two problems:
 1) Scores for matching documents are significantly smaller than term queries 
 because of the volume of terms introduced (A match on query Foo~ is 0.1 
 whereas a match on query Foo is 1).
 2) The rarer forms of expanded terms are favoured over those of more common 
 forms because of the IDF. When using Fuzzy queries for example, rare 
 misspellings typically appear in results before the more common correct 
 spellings.
 I will attach a patch that corrects the issues identified above by 
 1) Overriding Similarity.coord to counteract the downplaying of scores 
 introduced by expanding terms.
 2) Taking the IDF factor of the most common form of expanded terms as the 
 basis of scoring all other expanded terms.







[jira] Updated: (LUCENE-1449) IndexDeletionPolicy.delete behaves incorrectly when deleting latest generation

2008-11-11 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-1449:
-

Attachment: TestTransactionRollbackCapability.java

Junit test

 IndexDeletionPolicy.delete behaves incorrectly when deleting latest 
 generation 
 ---

 Key: LUCENE-1449
 URL: https://issues.apache.org/jira/browse/LUCENE-1449
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.9
Reporter: Mark Harwood
Priority: Minor
 Attachments: TestTransactionRollbackCapability.java


 I have been looking to provide the ability to roll back committed transactions 
 and encountered some issues.
 I appreciate IndexDeletionPolicy's main motivation is to handle cleaning away 
 OLD commit points but it does not explicitly state that it can or cannot be 
 used to clean NEW commit points.
 If this is not supported then the documentation should ideally state this. If 
 the intention is to support this behaviour then read on ...
 There seem to be 2 issues so far:
 1) The first attempt to call IndexCommit.delete on the latest commit point 
 fails to remove any contents. The subsequent call succeeds, however.
 2) Deleting the latest commit point fails to update the segments.gen file to 
 point to segments_N-1. New IndexReaders that are opened are then misdirected 
 to open segments_N, which has been deleted.
 Junit test to follow...
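
 The behaviour expected in issue 2) can be modelled with a small toy in plain 
 Java. This is not Lucene code and uses none of its APIs: the class, its fields 
 and its methods are hypothetical, with a long standing in for the generation 
 recorded in segments.gen.

```java
import java.util.TreeSet;

final class CommitRollbackSketch {

    // Generations of the commit points that still exist.
    private final TreeSet<Long> commits = new TreeSet<>();
    // Stand-in for segments.gen: the generation new readers are directed to.
    private long currentGen = -1L;

    void commit() {
        long gen = commits.isEmpty() ? 1L : commits.last() + 1L;
        commits.add(gen);
        currentGen = gen;
    }

    // Deletes the newest commit point and repoints readers at the previous one.
    void rollbackLatest() {
        if (commits.size() < 2) {
            throw new IllegalStateException("No earlier commit to roll back to");
        }
        commits.pollLast();
        currentGen = commits.last(); // the update that issue 2) reports as missing
    }

    // Returns the generation a newly opened reader would use.
    long openReaderGeneration() {
        if (!commits.contains(currentGen)) {
            throw new IllegalStateException("segments_" + currentGen + " has been deleted");
        }
        return currentGen;
    }
}
```

 Leaving out the currentGen update in rollbackLatest() reproduces the reported 
 symptom in this model: the next reader is sent to a generation that no longer 
 exists.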




[jira] Updated: (LUCENE-1449) IndexDeletionPolicy.delete behaves incorrectly when deleting latest generation

2008-11-11 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-1449:
-

Attachment: TestTransactionRollbackCapability2.java

Thanks for the pointers, Mike.

This new test now passes having made a few changes.

 IndexDeletionPolicy.delete behaves incorrectly when deleting latest 
 generation 
 ---

 Key: LUCENE-1449
 URL: https://issues.apache.org/jira/browse/LUCENE-1449
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.9
Reporter: Mark Harwood
Assignee: Michael McCandless
Priority: Minor
 Attachments: TestTransactionRollbackCapability.java, 
 TestTransactionRollbackCapability2.java





