Re: QueryParser - proposed change may break existing queries.

2020-09-18 Thread Mark Harwood
>You could avoid (some of?) these problems by supporting /(?i)foo/ instead
of /foo/i

That would avoid our parsing dilemma but brings some other concerns. This
inline syntax can normally be used to selectively turn on case-insensitive
matching for sections of a regex and then turn it off with (?-i).
We could potentially implement this support in the
underlying o.a.l.util.automaton.RegExp class. We changed that class
recently to take a separate global flag alongside the regex string which
can determine case sensitivity. I guess any inline (?i) syntax would
override whatever default option had been passed in the constructor flag.
That might be a hairy change though - the RegExp parser logic is
hand-crafted rather than generated with JavaCC.
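For reference, java.util.regex already implements the inline-toggle semantics described above, which is what (?i)/(?-i) support in o.a.l.util.automaton.RegExp would have to mirror. The sketch below uses java.util.regex, not the Lucene class:

```java
import java.util.regex.Pattern;

public class InlineFlagsDemo {
    // java.util.regex semantics for inline flags: (?i) switches case-insensitive
    // matching on from that point in the pattern, (?-i) switches it off again.
    public static boolean matchesMixed(String input) {
        // "foo" matches case-insensitively; "bar" is case-sensitive again
        return Pattern.matches("(?i)foo(?-i)bar", input);
    }

    public static void main(String[] args) {
        System.out.println(matchesMixed("FOObar"));  // true
        System.out.println(matchesMixed("FOOBAR"));  // false - "BAR" must be lowercase
    }
}
```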


On Fri, Sep 18, 2020 at 7:47 AM Dawid Weiss  wrote:

> > If they try to use any other options than 'i' we throw a ParseException
>
> +1. Complex-syntax parsers should throw (human-palatable) exceptions
> on syntax errors. A lenient, "naive user" query parser should be
> separate and accept a very, very
> rudimentary query syntax (so that there are literally no chances of
> making a syntax error).
>
> D.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>


Re: QueryParser - proposed change may break existing queries.

2020-09-17 Thread Mark Harwood
I think the decision comes down to choosing between silent
(mis)interpretations of ambiguous queries or noisy failures.

On Thu, Sep 17, 2020 at 1:55 PM Uwe Schindler  wrote:

> Hi,
>
>
>
> My idea would have been not to be too strict and instead only detect it
> as a regex if it's separated. So /foo/bar and /foo/iphone would both go
> through, ignoring the regex; only ‘/foo/ bar’ or ‘/foo/i phone’ would
> interpret the first token as a regex.
>
>
>
> That’s just my idea, not sure if it makes sense to have this relaxed
> parsing. I was always very skeptical of adding the regexes, as it breaks
> many queries. Now it breaks even more.
>
>
>
> Uwe
>
>
>
> -
>
> Uwe Schindler
>
> Achterdiek 19, D-28357 Bremen
>
> https://www.thetaphi.de
>
> eMail: u...@thetaphi.de
>
>
>
> *From:* Mark Harwood 
> *Sent:* Wednesday, September 16, 2020 6:45 PM
> *To:* dev@lucene.apache.org
> *Subject:* Re: QueryParser - proposed change may break existing queries.
>
>
>
> The strictness I was thinking of adding was to make all of the following
> error:
>
>  /foo/bar
>
>  /foo//bar/
>
>  /foo/iphone
>
>  /foo/AND x
>
>
>
> These would be allowed:
>
>  /foo/i bar
>
>  (/foo/ OR /bar/)
>
>  (/foo/ OR /bar/i)
>
>  /foo/^2
>
>  /foo/i^2
>
>
>
>
>
>
>
> On 16 Sep 2020, at 12:00, Uwe Schindler  wrote:
>
> 
>
> In my opinion, the proposed syntax change should enforce to have
> whitespace or any other separator char after the regex “i” parameter.
>
>
>
> Uwe
>
>
>
> -
>
> Uwe Schindler
>
> Achterdiek 19, D-28357 Bremen
>
> https://www.thetaphi.de
>
> eMail: u...@thetaphi.de
>
>
>
> *From:* Mark Harwood 
> *Sent:* Wednesday, September 16, 2020 11:04 AM
> *To:* dev@lucene.apache.org
> *Subject:* QueryParser - proposed change may break existing queries.
>
>
>
> In Lucene-9445 we'd like to add a case insensitive option to regex queries
> in the query parser of the form:
>
>/Foo/i
>
>
>
> However, today people can search for :
>
>
>
>/foo.com/index.html
>
>
>
> and not get an error. The searcher may think this is a query for a URL but
> it's actually parsed as a regex "foo.com" ORed with a term query.
>
>
>
> I'd like to draw attention to this proposed change in behaviour because I
> think it could affect many existing systems. Arguably it may be a positive
> in drawing attention to a number of existing silent failures (unescaped
> searches for urls or file paths) but equally could be seen as a negative
> breaking change by some.
>
>
>
> What is our BWC policy for changes to query parser?
>
> Do the benefits of the proposed new regex feature outweigh the costs of
> the breakages in your view?
>
>
>
>
> https://issues.apache.org/jira/browse/LUCENE-9445?focusedCommentId=17196793&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17196793
>
>
>
>
>
>


Re: QueryParser - proposed change may break existing queries.

2020-09-16 Thread Mark Harwood
The strictness I was thinking of adding was to make all of the following error:
 /foo/bar
 /foo//bar/
 /foo/iphone 
 /foo/AND x

These would be allowed:
 /foo/i bar
 (/foo/ OR /bar/)
 (/foo/ OR /bar/i)
 /foo/^2
 /foo/i^2
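A rough sketch of that strictness rule, assuming (hypothetically) the parser validates whatever follows a regex token's closing slash: only an optional 'i' flag and an optional ^boost may appear before a separator (whitespace, ')', or end of input). The class and method names here are made up for illustration:

```java
import java.util.regex.Pattern;

public class RegexTokenCheck {
    // Hypothetical validation of the text that follows the closing '/':
    // optional 'i' flag, optional ^boost, then a separator or end of input.
    private static final Pattern TRAILER =
        Pattern.compile("i?(\\^\\d+(\\.\\d+)?)?([\\s)].*)?");

    public static boolean isValidTrailer(String trailer) {
        return TRAILER.matcher(trailer).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidTrailer("i bar"));   // allowed:  /foo/i bar
        System.out.println(isValidTrailer("i^2"));     // allowed:  /foo/i^2
        System.out.println(isValidTrailer("iphone"));  // rejected: /foo/iphone
        System.out.println(isValidTrailer("bar"));     // rejected: /foo/bar
    }
}
```

Applied to the lists above, the erroring forms all have a non-separator character after the slash (or flag), while the allowed forms end in a flag, boost, whitespace, or closing parenthesis.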

 

> On 16 Sep 2020, at 12:00, Uwe Schindler  wrote:
> 
> 
> In my opinion, the proposed syntax change should enforce to have whitespace 
> or any other separator char after the regex “i” parameter.
>  
> Uwe
>  
> -
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> https://www.thetaphi.de
> eMail: u...@thetaphi.de
>  
> From: Mark Harwood  
> Sent: Wednesday, September 16, 2020 11:04 AM
> To: dev@lucene.apache.org
> Subject: QueryParser - proposed change may break existing queries.
>  
> In Lucene-9445 we'd like to add a case insensitive option to regex queries in 
> the query parser of the form: 
>/Foo/i
>  
> However, today people can search for :
>  
>/foo.com/index.html
>  
> and not get an error. The searcher may think this is a query for a URL but 
> it's actually parsed as a regex "foo.com" ORed with a term query.
>  
> I'd like to draw attention to this proposed change in behaviour because I 
> think it could affect many existing systems. Arguably it may be a positive in 
> drawing attention to a number of existing silent failures (unescaped searches 
> for urls or file paths) but equally could be seen as a negative breaking 
> change by some.
>  
> What is our BWC policy for changes to query parser?
> Do the benefits of the proposed new regex feature outweigh the costs of the 
> breakages in your view?
>  
> https://issues.apache.org/jira/browse/LUCENE-9445?focusedCommentId=17196793&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17196793
>  
>  


QueryParser - proposed change may break existing queries.

2020-09-16 Thread Mark Harwood
In Lucene-9445 we'd like to add a case insensitive option to regex queries
in the query parser of the form:
   /Foo/i

However, today people can search for :

   /foo.com/index.html

and not get an error. The searcher may think this is a query for a URL but
it's actually parsed as a regex "foo.com" ORed with a term query.

I'd like to draw attention to this proposed change in behaviour because I
think it could affect many existing systems. Arguably it may be a positive
in drawing attention to a number of existing silent failures (unescaped
searches for urls or file paths) but equally could be seen as a negative
breaking change by some.

What is our BWC policy for changes to query parser?
Do the benefits of the proposed new regex feature outweigh the costs of the
breakages in your view?

https://issues.apache.org/jira/browse/LUCENE-9445?focusedCommentId=17196793&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17196793


Re: [VOTE] Solr to become a top-level Apache project (TLP)

2020-05-13 Thread Mark Harwood
+1

On 2020/05/12 07:36:57, Dawid Weiss  wrote: 
> Dear Lucene and Solr developers!
> 
> According to an earlier [DISCUSS] thread on the dev list [2], I am
> calling for a vote on the proposal to make Solr a top-level Apache
> project (TLP) and separate Lucene and Solr development into two
> independent entities.
> 
> To quickly recap the reasons and consequences of such a move: it seems
> like the reasons for the initial merge of Lucene and Solr, around 10
> years ago, have been achieved. Both projects are in good shape and
> exhibit signs of independence already (mailing lists, committers,
> patch flow). There are many technical considerations that would make
> development much easier if we move Solr out into its own TLP.
> 
> We discussed this issue [2] and both PMC members and committers had a
> chance to review all the pros and cons and express their views. The
> discussion showed that there are clearly different opinions on the
> matter - some people are in favor, some are neutral, others are
> against or not seeing the point of additional labor. Realistically, I
> don't think reaching 100% level consensus is going to be possible --
> we are a diverse bunch with different opinions and personalities. I
> firmly believe this is the right direction hence the decision to put
> it under the voting process. Should something take a wrong turn in the
> future (as some folks worry it may), all blame is on me.
> 
> Therefore, the proposal is to separate Solr from under Lucene TLP, and
> make it a TLP on its own. The initial structure of the new PMC,
> committer base, git repositories and other managerial aspects can be
> worked out during the process if the decision passes.
> 
> Please indicate one of the following (see [1] for guidelines):
> 
> [ ] +1 - yes, I vote for the proposal
> [ ] -1 - no, I vote against the proposal
> 
> Please note that anyone in the Lucene+Solr community is invited to
> express their opinion, though only Lucene+Solr committers cast binding
> votes (indicate non-binding votes in your reply, please).
> 
> The vote will be active for a week to give everyone a chance to read
> and cast a vote.
> 
> Dawid
> 
> [1] https://www.apache.org/foundation/voting.html
> [2] 
> https://lists.apache.org/thread.html/rfae2440264f6f874e91545b2030c98e7b7e3854ddf090f7747d338df%40%3Cdev.lucene.apache.org%3E
> 
> 
> 




[jira] [Commented] (LUCENE-8876) EnglishMinimalStemmer does not implement s-stemmer paper correctly?

2019-07-01 Thread Mark Harwood (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16876507#comment-16876507
 ] 

Mark Harwood commented on LUCENE-8876:
--

I reached out to the paper author, Donna Harman, a while ago and she just
replied as follows:
{quote}It has been a very long time since I have thought about S-stemmers.   
But looking at your examples of bees and employees, it seems to me that rule 3 
is the correct one because rule 2 would be prevented from firing. 
{quote}
 

Given her assertion that rule 3 should apply to "bees", it looks like this
would make rule 2 entirely redundant.

> EnglishMinimalStemmer does not implement s-stemmer paper correctly?
> ---
>
> Key: LUCENE-8876
> URL: https://issues.apache.org/jira/browse/LUCENE-8876
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Reporter: Mark Harwood
>Priority: Minor
>
> The EnglishMinimalStemmer fails to stem ees suffixes like bees, trees and 
> employees.
> The [original paper|http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.104.9828&rep=rep1&type=pdf]
>  has this table of rules:
> !https://user-images.githubusercontent.com/170925/59616454-5dc7d580-911c-11e9-80b0-c7a59458c5a7.png!
> The notes accompanying the table state :
> {quote}"the first applicable rule encountered is the only one used"
> {quote}
>  
> For the {{ees}} and {{oes}} suffixes I think EnglishMinimalStemmer 
> misinterpreted the rule logic and consequently {{bees != bee}} and {{tomatoes 
> != tomato}}. The {{oes}} and {{ees}} suffixes are left intact.
> "The first applicable rule" for {{ees}} could be interpreted as rule 2 or 3 
> in the table depending on if you take {{applicable}} to mean "the THEN part 
> of the rule has fired" or just that the suffix was referenced in the rule. 
> EnglishMinimalStemmer has assumed the latter and I think it should be the 
> former. We should fall through into rule 3 for {{ees}} and {{oes}} (remove 
> any trailing S). That's certainly the conclusion I came to independently 
> testing on real data.
> There are some additional changes I'd like to see in a plural stemmer but I 
> won't list them here - the focus should be making the code here match the 
> original paper it references.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)




[jira] [Commented] (LUCENE-8876) EnglishMinimalStemmer does not implement s-stemmer paper correctly?

2019-06-24 Thread Mark Harwood (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16871423#comment-16871423
 ] 

Mark Harwood commented on LUCENE-8876:
--

{quote} but then doesn't it mean that exceptions of the 2nd rule are always 
ignored?
{quote}
 

Good point. Rule 1 exceptions are odd too - I have not found a single common 
English word that ends in aies or eies.

> EnglishMinimalStemmer does not implement s-stemmer paper correctly?
> ---
>
> Key: LUCENE-8876
> URL: https://issues.apache.org/jira/browse/LUCENE-8876
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Reporter: Mark Harwood
>Priority: Minor
>
> The EnglishMinimalStemmer fails to stem ees suffixes like bees, trees and 
> employees.
> The [original paper|http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.104.9828&rep=rep1&type=pdf]
>  has this table of rules:
> !https://user-images.githubusercontent.com/170925/59616454-5dc7d580-911c-11e9-80b0-c7a59458c5a7.png!
> The notes accompanying the table state :
> {quote}"the first applicable rule encountered is the only one used"
> {quote}
>  
> For the {{ees}} and {{oes}} suffixes I think EnglishMinimalStemmer 
> misinterpreted the rule logic and consequently {{bees != bee}} and {{tomatoes 
> != tomato}}. The {{oes}} and {{ees}} suffixes are left intact.
> "The first applicable rule" for {{ees}} could be interpreted as rule 2 or 3 
> in the table depending on if you take {{applicable}} to mean "the THEN part 
> of the rule has fired" or just that the suffix was referenced in the rule. 
> EnglishMinimalStemmer has assumed the latter and I think it should be the 
> former. We should fall through into rule 3 for {{ees}} and {{oes}} (remove 
> any trailing S). That's certainly the conclusion I came to independently 
> testing on real data.
> There are some additional changes I'd like to see in a plural stemmer but I 
> won't list them here - the focus should be making the code here match the 
> original paper it references.






[jira] [Created] (LUCENE-8876) EnglishMinimalStemmer does not implement s-stemmer paper correctly?

2019-06-24 Thread Mark Harwood (JIRA)
Mark Harwood created LUCENE-8876:


 Summary: EnglishMinimalStemmer does not implement s-stemmer paper 
correctly?
 Key: LUCENE-8876
 URL: https://issues.apache.org/jira/browse/LUCENE-8876
 Project: Lucene - Core
  Issue Type: Bug
  Components: modules/analysis
Reporter: Mark Harwood


The EnglishMinimalStemmer fails to stem ees suffixes like bees, trees and 
employees.

The [original paper|http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.104.9828&rep=rep1&type=pdf]
 has this table of rules:

!https://user-images.githubusercontent.com/170925/59616454-5dc7d580-911c-11e9-80b0-c7a59458c5a7.png!

The notes accompanying the table state :
{quote}"the first applicable rule encountered is the only one used"
{quote}
 

For the {{ees}} and {{oes}} suffixes I think EnglishMinimalStemmer 
misinterpreted the rule logic and consequently {{bees != bee}} and {{tomatoes 
!= tomato}}. The {{oes}} and {{ees}} suffixes are left intact.

"The first applicable rule" for {{ees}} could be interpreted as rule 2 or 3 in 
the table depending on if you take {{applicable}} to mean "the THEN part of the 
rule has fired" or just that the suffix was referenced in the rule. 
EnglishMinimalStemmer has assumed the latter and I think it should be the 
former. We should fall through into rule 3 for {{ees}} and {{oes}} (remove any 
trailing S). That's certainly the conclusion I came to independently testing on 
real data.

There are some additional changes I'd like to see in a plural stemmer but I 
won't list them here - the focus should be making the code here match the 
original paper it references.
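Under the "first rule whose THEN-part fires" interpretation argued for above, Harman's three rules can be sketched as a fall-through chain where a blocked exception passes control to the next rule. This is a simplified standalone sketch, not the actual EnglishMinimalStemmer code:

```java
public class SStemmer {
    // Harman's s-stemmer rules, read as: try each rule in order, and if a
    // rule's exception list blocks it, fall through to the next rule.
    public static String stem(String w) {
        if (w.endsWith("ies") && !w.endsWith("eies") && !w.endsWith("aies")) {
            return w.substring(0, w.length() - 3) + "y";  // rule 1: ies -> y
        }
        if (w.endsWith("es") && !w.endsWith("aes")
                && !w.endsWith("ees") && !w.endsWith("oes")) {
            return w.substring(0, w.length() - 1);        // rule 2: es -> e
        }
        if (w.endsWith("s") && !w.endsWith("us") && !w.endsWith("ss")) {
            return w.substring(0, w.length() - 1);        // rule 3: drop final s
        }
        return w;
    }

    public static void main(String[] args) {
        // "bees": rule 2 is blocked by the "ees" exception, so rule 3 fires
        System.out.println(stem("bees"));    // bee
        System.out.println(stem("ponies"));  // pony
        System.out.println(stem("dress"));   // dress - "ss" exception
    }
}
```

With this reading, "bees" and "employees" fall through to rule 3 and lose the trailing "s", which matches Harman's reply quoted above.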






[jira] [Commented] (LUCENE-8840) TopTermsBlendedFreqScoringRewrite should use SynonymQuery

2019-06-12 Thread Mark Harwood (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16861960#comment-16861960
 ] 

Mark Harwood commented on LUCENE-8840:
--

{quote}we shouldn't favor documents that contain multiple variations of the 
same fuzzy term.
{quote}
 

For fuzzy I agree that rewarding more variations in a doc is probably 
undesirable - a doc will normally pick one spelling for a word and use it 
consistently so any variations are more likely to be false positives (your 
baz/bad example). Plurals and other forms of suffix would be a notable 
exception but I don't think that's too much of a problem because:
 # we can assume that stemming is taking care of normalizing these tokens.
 # a lot of fuzzy querying is for things like people names that aren't 
expressed as plurals or with other common suffixes

 

I think all forms of automatic expansions (synonym, fuzzy, wildcard) need a 
form of score blending for the expansions they create. Wildcards are perhaps 
unlike fuzzy in that finding multiple variations in a doc _is_ desirable - we 
_are_ looking for multiple forms and a document that contains many is better 
than few.
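The difference between the two rewrites can be shown with toy numbers. The idf-like formula below is for illustration only (not Lucene's real BM25), and the variant doc frequencies are hypothetical:

```java
public class BlendDemo {
    // Toy idf-style weight, just enough to show the shape of the issue.
    static double idf(int docCount, int docFreq) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Disjunction of blended terms: a doc matching every fuzzy variant
    // adds one score contribution per variant.
    static double summedScore(int docCount, int[] variantDocFreqs) {
        double total = 0;
        for (int df : variantDocFreqs) total += idf(docCount, df);
        return total;
    }

    // SynonymQuery-style: variants act as one pseudo-term whose doc
    // frequency is the sum, bounding the contribution to a single term's.
    static double blendedScore(int docCount, int[] variantDocFreqs) {
        int df = 0;
        for (int d : variantDocFreqs) df += d;
        return idf(docCount, df);
    }

    public static void main(String[] args) {
        int[] dfs = {50, 40, 30};  // hypothetical doc freqs for three variants
        System.out.println("summed  = " + summedScore(1000, dfs));
        System.out.println("blended = " + blendedScore(1000, dfs));
    }
}
```

The summed form grows with each extra variant a document happens to contain, while the blended form stays bounded, which is the relevancy argument made in the issue.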

 

> TopTermsBlendedFreqScoringRewrite should use SynonymQuery
> -
>
> Key: LUCENE-8840
> URL: https://issues.apache.org/jira/browse/LUCENE-8840
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8840.patch
>
>
> Today the TopTermsBlendedFreqScoringRewrite, which is the default rewrite 
> method for Fuzzy queries, uses the BlendedTermQuery to score documents that 
> match the fuzzy terms. This query blends the frequencies used for scoring 
> across the terms and creates a disjunction of all the blended terms. This 
> means that each fuzzy term that match in a document will add their BM25 score 
> contribution. We already have a query that can blend the statistics of 
> multiple terms in a single scorer that sums the doc frequencies rather than 
> the entire BM25 score: the SynonymQuery. Since 
> https://issues.apache.org/jira/browse/LUCENE-8652 this query also handles 
> boost between 0 and 1 so it should be easy to change the default rewrite 
> method for Fuzzy queries to use it instead of the BlendedTermQuery. This 
> would bound the contribution of each term to the final score which seems a 
> better alternative in terms of relevancy than the current solution. 






[jira] [Commented] (LUCENE-8352) Make TokenStreamComponents final

2018-06-12 Thread Mark Harwood (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16509635#comment-16509635
 ] 

Mark Harwood commented on LUCENE-8352:
--

My use case was a bit special. I had a custom Reader that [dealt with 
hyperlinked 
text|https://github.com/elastic/elasticsearch/issues/29467#issuecomment-385393246]
 by stripping out the hyperlink markup before feeding the remaining plain text 
into tokenisation. The tricky bit was that the extracted URLs were not thrown 
away but were passed to a special TokenFilter at the end of the chain, to be 
injected at the appropriate positions in the text token stream.

The workaround was a custom AnalyzerWrapper that overrode wrapReader (which is 
still invoked when wrapped) plus some ThreadLocal hackery to connect my 
TokenFilter to the Reader's extracted URLs.

I'm not sure how common this sort of analysis is, but before I reached this 
solution there was quite a detour trying to figure out why a custom 
TokenStreamComponents was not working when wrapped.

 

> Make TokenStreamComponents final
> 
>
> Key: LUCENE-8352
> URL: https://issues.apache.org/jira/browse/LUCENE-8352
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Mark Harwood
>Priority: Minor
>
> The current design is a little trappy. Any specialised subclasses of 
> TokenStreamComponents _(see_ _StandardAnalyzer, ClassicAnalyzer, 
> UAX29URLEmailAnalyzer)_ are discarded by any subsequent Analyzers that wrap 
> them _(see LimitTokenCountAnalyzer, QueryAutoStopWordAnalyzer, 
> ShingleAnalyzerWrapper and other examples in elasticsearch)_. 
> The current design means each AnalyzerWrapper.wrapComponents() implementation 
> discards any custom TokenStreamComponents and replaces it with one of its own 
> choosing (a vanilla TokenStreamComponents class from examples I've seen).
> This is a trap I fell into when writing a custom TokenStreamComponents with a 
> custom setReader() and I wondered why it was not being triggered when wrapped 
> by other analyzers.
> If AnalyzerWrapper is designed to encourage composition it's arguably a 
> mistake to also permit custom TokenStreamComponent subclasses  - the 
> composition process does not preserve the choice of custom classes and any 
> behaviours they might add. For this reason we should not encourage extensions 
> to TokenStreamComponents (or if TSC extensions are required we should somehow 
> mark an Analyzer as "unwrappable" to prevent lossy compositions).
>  
>  






[jira] [Created] (LUCENE-8352) Make TokenStreamComponents final

2018-06-11 Thread Mark Harwood (JIRA)
Mark Harwood created LUCENE-8352:


 Summary: Make TokenStreamComponents final
 Key: LUCENE-8352
 URL: https://issues.apache.org/jira/browse/LUCENE-8352
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/analysis
Reporter: Mark Harwood


The current design is a little trappy. Any specialised subclasses of 
TokenStreamComponents _(see_ _StandardAnalyzer, ClassicAnalyzer, 
UAX29URLEmailAnalyzer)_ are discarded by any subsequent Analyzers that wrap 
them _(see LimitTokenCountAnalyzer, QueryAutoStopWordAnalyzer, 
ShingleAnalyzerWrapper and other examples in elasticsearch)_. 

The current design means each AnalyzerWrapper.wrapComponents() implementation 
discards any custom TokenStreamComponents and replaces it with one of its own 
choosing (a vanilla TokenStreamComponents class from examples I've seen).

This is a trap I fell into when writing a custom TokenStreamComponents with a 
custom setReader() and I wondered why it was not being triggered when wrapped 
by other analyzers.

If AnalyzerWrapper is designed to encourage composition it's arguably a mistake 
to also permit custom TokenStreamComponent subclasses  - the composition 
process does not preserve the choice of custom classes and any behaviours they 
might add. For this reason we should not encourage extensions to 
TokenStreamComponents (or if TSC extensions are required we should somehow mark 
an Analyzer as "unwrappable" to prevent lossy compositions).

 

 






[jira] [Closed] (LUCENE-6747) FingerprintFilter - a TokenFilter for clustering/linking purposes

2015-08-27 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood closed LUCENE-6747.


> FingerprintFilter - a TokenFilter for clustering/linking purposes
> -
>
> Key: LUCENE-6747
> URL: https://issues.apache.org/jira/browse/LUCENE-6747
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>    Reporter: Mark Harwood
>Priority: Minor
> Fix For: Trunk, 5.4
>
> Attachments: fingerprintv1.patch, fingerprintv2.patch, 
> fingerprintv3.patch, fingerprintv4.patch
>
>
> A TokenFilter that emits a single token which is a sorted, de-duplicated set 
> of the input tokens.
> This approach to normalizing text is used in tools like OpenRefine[1] and 
> elsewhere [2] to help in clustering or linking texts.
> The implementation proposed here has an upper limit on the size of the 
> combined token which is output.
> [1] https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth
> [2] 
> https://rajmak.wordpress.com/2013/04/27/clustering-text-map-reduce-in-python/
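The normalization idea itself can be sketched in plain Java. This is only the concept, not the actual TokenFilter implementation; in particular, treating an over-the-cap result as empty output is a simplifying assumption here:

```java
import java.util.TreeSet;

public class FingerprintSketch {
    // Emit a single "token" that is the sorted, de-duplicated set of input
    // tokens, capped at maxLength (sketch of the fingerprinting idea).
    public static String fingerprint(String text, int maxLength) {
        TreeSet<String> unique = new TreeSet<>();  // sorts and de-duplicates
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                unique.add(token);
            }
        }
        String joined = String.join(" ", unique);
        return joined.length() <= maxLength ? joined : "";  // over the cap: emit nothing
    }

    public static void main(String[] args) {
        // Both variants collapse to the same fingerprint, so they cluster together.
        System.out.println(fingerprint("New York, New York", 100));  // new york
        System.out.println(fingerprint("york new", 100));            // new york
    }
}
```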






[jira] [Resolved] (LUCENE-6747) FingerprintFilter - a TokenFilter for clustering/linking purposes

2015-08-27 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood resolved LUCENE-6747.
--
Resolution: Fixed

Committed to trunk and 5.x


> FingerprintFilter - a TokenFilter for clustering/linking purposes
> -
>
> Key: LUCENE-6747
> URL: https://issues.apache.org/jira/browse/LUCENE-6747
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>    Reporter: Mark Harwood
>Priority: Minor
> Fix For: Trunk, 5.4
>
> Attachments: fingerprintv1.patch, fingerprintv2.patch, 
> fingerprintv3.patch, fingerprintv4.patch
>
>
> A TokenFilter that emits a single token which is a sorted, de-duplicated set 
> of the input tokens.
> This approach to normalizing text is used in tools like OpenRefine[1] and 
> elsewhere [2] to help in clustering or linking texts.
> The implementation proposed here has an upper limit on the size of the 
> combined token which is output.
> [1] https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth
> [2] 
> https://rajmak.wordpress.com/2013/04/27/clustering-text-map-reduce-in-python/






[jira] [Updated] (LUCENE-6747) FingerprintFilter - a TokenFilter for clustering/linking purposes

2015-08-27 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-6747:
-
Fix Version/s: 5.3.1
   Trunk

> FingerprintFilter - a TokenFilter for clustering/linking purposes
> -
>
> Key: LUCENE-6747
> URL: https://issues.apache.org/jira/browse/LUCENE-6747
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>    Reporter: Mark Harwood
>Priority: Minor
> Fix For: Trunk, 5.3.1
>
> Attachments: fingerprintv1.patch, fingerprintv2.patch, 
> fingerprintv3.patch, fingerprintv4.patch
>
>
> A TokenFilter that emits a single token which is a sorted, de-duplicated set 
> of the input tokens.
> This approach to normalizing text is used in tools like OpenRefine[1] and 
> elsewhere [2] to help in clustering or linking texts.
> The implementation proposed here has an upper limit on the size of the 
> combined token which is output.
> [1] https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth
> [2] 
> https://rajmak.wordpress.com/2013/04/27/clustering-text-map-reduce-in-python/






[jira] [Updated] (LUCENE-6747) FingerprintFilter - a TokenFilter for clustering/linking purposes

2015-08-27 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-6747:
-
Fix Version/s: (was: 5.3.1)
   5.4

> FingerprintFilter - a TokenFilter for clustering/linking purposes
> -
>
> Key: LUCENE-6747
> URL: https://issues.apache.org/jira/browse/LUCENE-6747
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>    Reporter: Mark Harwood
>Priority: Minor
> Fix For: Trunk, 5.4
>
> Attachments: fingerprintv1.patch, fingerprintv2.patch, 
> fingerprintv3.patch, fingerprintv4.patch
>
>
> A TokenFilter that emits a single token which is a sorted, de-duplicated set 
> of the input tokens.
> This approach to normalizing text is used in tools like OpenRefine[1] and 
> elsewhere [2] to help in clustering or linking texts.
> The implementation proposed here has an upper limit on the size of the 
> combined token which is output.
> [1] https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth
> [2] 
> https://rajmak.wordpress.com/2013/04/27/clustering-text-map-reduce-in-python/






[jira] [Updated] (LUCENE-6747) FingerprintFilter - a TokenFilter for clustering/linking purposes

2015-08-25 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-6747:
-
Attachment: fingerprintv4.patch

Some final tweaks:
1) Fixed a bug where the separator was not appended if the first token's length == 1
2) Randomized testing identified an issue with input.end() not being called when 
IOExceptions occur
3) Added the missing SPI entry for FingerprintFilterFactory and an associated test 
class for FingerprintFilterFactory

> FingerprintFilter - a TokenFilter for clustering/linking purposes
> -
>
> Key: LUCENE-6747
> URL: https://issues.apache.org/jira/browse/LUCENE-6747
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>    Reporter: Mark Harwood
>Priority: Minor
> Attachments: fingerprintv1.patch, fingerprintv2.patch, 
> fingerprintv3.patch, fingerprintv4.patch
>
>
> A TokenFilter that emits a single token which is a sorted, de-duplicated set 
> of the input tokens.
> This approach to normalizing text is used in tools like OpenRefine[1] and 
> elsewhere [2] to help in clustering or linking texts.
> The implementation proposed here has an upper limit on the size of the 
> combined token which is output.
> [1] https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth
> [2] 
> https://rajmak.wordpress.com/2013/04/27/clustering-text-map-reduce-in-python/






[jira] [Updated] (LUCENE-6747) FingerprintFilter - a TokenFilter for clustering/linking purposes

2015-08-21 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-6747:
-
Attachment: fingerprintv3.patch

Updated patch - removed the instanceof check and added an entry to CHANGES.txt.

Will commit to trunk and 5.x in a day or two if there are no objections

> FingerprintFilter - a TokenFilter for clustering/linking purposes
> -
>
> Key: LUCENE-6747
> URL: https://issues.apache.org/jira/browse/LUCENE-6747
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>    Reporter: Mark Harwood
>Priority: Minor
> Attachments: fingerprintv1.patch, fingerprintv2.patch, 
> fingerprintv3.patch
>
>
> A TokenFilter that emits a single token which is a sorted, de-duplicated set 
> of the input tokens.
> This approach to normalizing text is used in tools like OpenRefine[1] and 
> elsewhere [2] to help in clustering or linking texts.
> The implementation proposed here has an upper limit on the size of the 
> combined token which is output.
> [1] https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth
> [2] 
> https://rajmak.wordpress.com/2013/04/27/clustering-text-map-reduce-in-python/






[jira] [Updated] (LUCENE-6747) FingerprintFilter - a TokenFilter for clustering/linking purposes

2015-08-20 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-6747:
-
Attachment: fingerprintv2.patch

Thanks for taking a look, Adrien.
Added a v2 patch with following changes:

1) added call to input.end() to get final offset state
2) final state is retained using captureState()  
3) Added a FingerprintFilterFactory class
 
As for the alternative hashing idea:
For speed it would be attractive, but it reduces the readability of results if 
you want to debug any collisions or otherwise display connections.

For compactness (e.g. storing in doc values) it would always be possible to 
chain a conventional hashing algorithm in a TokenFilter on the end of this 
text-normalizing filter. (Do we already have a conventional hashing 
TokenFilter?)
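That chaining idea might look like the following plain-Java sketch (hypothetical, not a real TokenFilter; MD5 is an arbitrary choice of hash):

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class HashedFingerprint {
    // Hash the already-normalized fingerprint into a compact fixed-width hex
    // token, e.g. for storage in doc values. The trade-off discussed above:
    // collisions and linked records become much harder to eyeball.
    static String hash(String fingerprint) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(fingerprint.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is always present in the JDK
        }
    }

    public static void main(String[] args) {
        System.out.println(hash("brown fox quick the"));
    }
}
```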




> FingerprintFilter - a TokenFilter for clustering/linking purposes
> -
>
> Key: LUCENE-6747
> URL: https://issues.apache.org/jira/browse/LUCENE-6747
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>    Reporter: Mark Harwood
>Priority: Minor
> Attachments: fingerprintv1.patch, fingerprintv2.patch
>
>
> A TokenFilter that emits a single token which is a sorted, de-duplicated set 
> of the input tokens.
> This approach to normalizing text is used in tools like OpenRefine[1] and 
> elsewhere [2] to help in clustering or linking texts.
> The implementation proposed here has an upper limit on the size of the 
> combined token which is output.
> [1] https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth
> [2] 
> https://rajmak.wordpress.com/2013/04/27/clustering-text-map-reduce-in-python/






[jira] [Updated] (LUCENE-6747) FingerprintFilter - a TokenFilter for clustering/linking purposes

2015-08-19 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-6747:
-
Attachment: fingerprintv1.patch

Proposed implementation and test

> FingerprintFilter - a TokenFilter for clustering/linking purposes
> -
>
> Key: LUCENE-6747
> URL: https://issues.apache.org/jira/browse/LUCENE-6747
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>    Reporter: Mark Harwood
>Priority: Minor
> Attachments: fingerprintv1.patch
>
>
> A TokenFilter that emits a single token which is a sorted, de-duplicated set 
> of the input tokens.
> This approach to normalizing text is used in tools like OpenRefine[1] and 
> elsewhere [2] to help in clustering or linking texts.
> The implementation proposed here has an upper limit on the size of the 
> combined token which is output.
> [1] https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth
> [2] 
> https://rajmak.wordpress.com/2013/04/27/clustering-text-map-reduce-in-python/






[jira] [Created] (LUCENE-6747) FingerprintFilter - a TokenFilter for clustering/linking purposes

2015-08-19 Thread Mark Harwood (JIRA)
Mark Harwood created LUCENE-6747:


 Summary: FingerprintFilter - a TokenFilter for clustering/linking 
purposes
 Key: LUCENE-6747
 URL: https://issues.apache.org/jira/browse/LUCENE-6747
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Mark Harwood
Priority: Minor


A TokenFilter that emits a single token which is a sorted, de-duplicated set of 
the input tokens.
This approach to normalizing text is used in tools like OpenRefine[1] and 
elsewhere [2] to help in clustering or linking texts.
The implementation proposed here has an upper limit on the size of the 
combined token which is output.

[1] https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth
[2] 
https://rajmak.wordpress.com/2013/04/27/clustering-text-map-reduce-in-python/






[jira] [Commented] (LUCENE-329) Fuzzy query scoring issues

2015-05-20 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14552265#comment-14552265
 ] 

Mark Harwood commented on LUCENE-329:
-

Committed to 5.x branch and trunk

> Fuzzy query scoring issues
> --
>
> Key: LUCENE-329
> URL: https://issues.apache.org/jira/browse/LUCENE-329
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 1.2
> Environment: Operating System: All
> Platform: All
>Reporter: Mark Harwood
>Assignee: Mark Harwood
>Priority: Minor
> Fix For: 5.x
>
> Attachments: ASF.LICENSE.NOT.GRANTED--patch.txt, LUCENE-329.patch, 
> LUCENE-329.patch, LUCENE-329.patch, LUCENE-329.patch
>
>
> Queries which automatically produce multiple terms (wildcard, range, prefix, 
> fuzzy etc.) currently suffer from two problems:
> 1) Scores for matching documents are significantly smaller than term queries 
> because of the volume of terms introduced (A match on query Foo~ is 0.1 
> whereas a match on query Foo is 1).
> 2) The rarer forms of expanded terms are favoured over those of more common 
> forms because of the IDF. When using Fuzzy queries, for example, rare 
> misspellings typically appear in results before the more common correct 
> spellings.
> I will attach a patch that corrects the issues identified above by 
> 1) Overriding Similarity.coord to counteract the downplaying of scores 
> introduced by expanding terms.
> 2) Taking the IDF factor of the most common form of expanded terms as the 
> basis of scoring all other expanded terms.
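The IDF skew in point 2 of the description is easy to demonstrate with the classic idf formula; the document frequencies below are made up for illustration:

```java
public class IdfSkew {
    // Classic Lucene TF-IDF: idf = 1 + ln(numDocs / (docFreq + 1)).
    static double idf(long docFreq, long numDocs) {
        return 1 + Math.log((double) numDocs / (docFreq + 1));
    }

    public static void main(String[] args) {
        long numDocs = 1_000_000;
        double common = idf(50_000, numDocs); // correct spelling, frequent
        double rare = idf(3, numDocs);        // misspelling, rare
        // The rare misspelling gets a far larger idf, so it floats to the top
        // of fuzzy-expanded results unless all expansions share one IDF basis.
        System.out.println(rare > common);    // true
    }
}
```

Taking the IDF of the most common expanded form as the basis for all expansions, as the patch proposes, removes exactly this inversion.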






[jira] [Commented] (LUCENE-329) Fuzzy query scoring issues

2015-05-19 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550376#comment-14550376
 ] 

Mark Harwood commented on LUCENE-329:
-

Thanks, I'll commit tomorrow if there's no objections.

> Fuzzy query scoring issues
> --
>
> Key: LUCENE-329
> URL: https://issues.apache.org/jira/browse/LUCENE-329
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 1.2
> Environment: Operating System: All
> Platform: All
>Reporter: Mark Harwood
>Assignee: Mark Harwood
>Priority: Minor
> Fix For: 5.x
>
> Attachments: ASF.LICENSE.NOT.GRANTED--patch.txt, LUCENE-329.patch, 
> LUCENE-329.patch, LUCENE-329.patch, LUCENE-329.patch
>
>
> Queries which automatically produce multiple terms (wildcard, range, prefix, 
> fuzzy etc.) currently suffer from two problems:
> 1) Scores for matching documents are significantly smaller than term queries 
> because of the volume of terms introduced (A match on query Foo~ is 0.1 
> whereas a match on query Foo is 1).
> 2) The rarer forms of expanded terms are favoured over those of more common 
> forms because of the IDF. When using Fuzzy queries, for example, rare 
> misspellings typically appear in results before the more common correct 
> spellings.
> I will attach a patch that corrects the issues identified above by 
> 1) Overriding Similarity.coord to counteract the downplaying of scores 
> introduced by expanding terms.
> 2) Taking the IDF factor of the most common form of expanded terms as the 
> basis of scoring all other expanded terms.






[jira] [Updated] (LUCENE-329) Fuzzy query scoring issues

2015-05-19 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-329:

Attachment: LUCENE-329.patch

A cut-and-paste error in the last patch set df=0, and the effects went 
undetected by the unit tests.
Enhanced the unit test to detect the error, then fixed it.

> Fuzzy query scoring issues
> --
>
> Key: LUCENE-329
> URL: https://issues.apache.org/jira/browse/LUCENE-329
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 1.2
> Environment: Operating System: All
> Platform: All
>Reporter: Mark Harwood
>Assignee: Mark Harwood
>Priority: Minor
> Fix For: 5.x
>
> Attachments: ASF.LICENSE.NOT.GRANTED--patch.txt, LUCENE-329.patch, 
> LUCENE-329.patch, LUCENE-329.patch, LUCENE-329.patch
>
>
> Queries which automatically produce multiple terms (wildcard, range, prefix, 
> fuzzy etc.) currently suffer from two problems:
> 1) Scores for matching documents are significantly smaller than term queries 
> because of the volume of terms introduced (A match on query Foo~ is 0.1 
> whereas a match on query Foo is 1).
> 2) The rarer forms of expanded terms are favoured over those of more common 
> forms because of the IDF. When using Fuzzy queries, for example, rare 
> misspellings typically appear in results before the more common correct 
> spellings.
> I will attach a patch that corrects the issues identified above by 
> 1) Overriding Similarity.coord to counteract the downplaying of scores 
> introduced by expanding terms.
> 2) Taking the IDF factor of the most common form of expanded terms as the 
> basis of scoring all other expanded terms.






[jira] [Updated] (LUCENE-329) Fuzzy query scoring issues

2015-05-19 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-329:

Attachment: (was: LUCENE-329.patch)

> Fuzzy query scoring issues
> --
>
> Key: LUCENE-329
> URL: https://issues.apache.org/jira/browse/LUCENE-329
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 1.2
> Environment: Operating System: All
> Platform: All
>Reporter: Mark Harwood
>Assignee: Mark Harwood
>Priority: Minor
> Fix For: 5.x
>
> Attachments: ASF.LICENSE.NOT.GRANTED--patch.txt, LUCENE-329.patch, 
> LUCENE-329.patch, LUCENE-329.patch
>
>
> Queries which automatically produce multiple terms (wildcard, range, prefix, 
> fuzzy etc.) currently suffer from two problems:
> 1) Scores for matching documents are significantly smaller than term queries 
> because of the volume of terms introduced (A match on query Foo~ is 0.1 
> whereas a match on query Foo is 1).
> 2) The rarer forms of expanded terms are favoured over those of more common 
> forms because of the IDF. When using Fuzzy queries, for example, rare 
> misspellings typically appear in results before the more common correct 
> spellings.
> I will attach a patch that corrects the issues identified above by 
> 1) Overriding Similarity.coord to counteract the downplaying of scores 
> introduced by expanding terms.
> 2) Taking the IDF factor of the most common form of expanded terms as the 
> basis of scoring all other expanded terms.






[jira] [Updated] (LUCENE-329) Fuzzy query scoring issues

2015-05-19 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-329:

Attachment: LUCENE-329.patch

Last edits to remove unnecessary Math.max() tests. Added an assertion around 
maxTTf expectations.

> Fuzzy query scoring issues
> --
>
> Key: LUCENE-329
> URL: https://issues.apache.org/jira/browse/LUCENE-329
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 1.2
> Environment: Operating System: All
> Platform: All
>Reporter: Mark Harwood
>Assignee: Mark Harwood
>Priority: Minor
> Fix For: 5.x
>
> Attachments: ASF.LICENSE.NOT.GRANTED--patch.txt, LUCENE-329.patch, 
> LUCENE-329.patch, LUCENE-329.patch, LUCENE-329.patch
>
>
> Queries which automatically produce multiple terms (wildcard, range, prefix, 
> fuzzy etc.) currently suffer from two problems:
> 1) Scores for matching documents are significantly smaller than term queries 
> because of the volume of terms introduced (A match on query Foo~ is 0.1 
> whereas a match on query Foo is 1).
> 2) The rarer forms of expanded terms are favoured over those of more common 
> forms because of the IDF. When using Fuzzy queries, for example, rare 
> misspellings typically appear in results before the more common correct 
> spellings.
> I will attach a patch that corrects the issues identified above by 
> 1) Overriding Similarity.coord to counteract the downplaying of scores 
> introduced by expanding terms.
> 2) Taking the IDF factor of the most common form of expanded terms as the 
> basis of scoring all other expanded terms.






[jira] [Updated] (LUCENE-329) Fuzzy query scoring issues

2015-05-19 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-329:

Attachment: LUCENE-329.patch

Updated following review comments (thanks, Adrien).
All tests passing on trunk.

> Fuzzy query scoring issues
> --
>
> Key: LUCENE-329
> URL: https://issues.apache.org/jira/browse/LUCENE-329
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 1.2
> Environment: Operating System: All
> Platform: All
>Reporter: Mark Harwood
>Assignee: Mark Harwood
>Priority: Minor
> Fix For: 5.x
>
> Attachments: ASF.LICENSE.NOT.GRANTED--patch.txt, LUCENE-329.patch, 
> LUCENE-329.patch, LUCENE-329.patch
>
>
> Queries which automatically produce multiple terms (wildcard, range, prefix, 
> fuzzy etc.) currently suffer from two problems:
> 1) Scores for matching documents are significantly smaller than term queries 
> because of the volume of terms introduced (A match on query Foo~ is 0.1 
> whereas a match on query Foo is 1).
> 2) The rarer forms of expanded terms are favoured over those of more common 
> forms because of the IDF. When using Fuzzy queries, for example, rare 
> misspellings typically appear in results before the more common correct 
> spellings.
> I will attach a patch that corrects the issues identified above by 
> 1) Overriding Similarity.coord to counteract the downplaying of scores 
> introduced by expanding terms.
> 2) Taking the IDF factor of the most common form of expanded terms as the 
> basis of scoring all other expanded terms.






[jira] [Updated] (LUCENE-329) Fuzzy query scoring issues

2015-05-12 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-329:

Attachment: LUCENE-329.patch

Switched to the TermContext.accumulateStatistics() method Adrien suggested for 
tweaking stats.

> Fuzzy query scoring issues
> --
>
> Key: LUCENE-329
> URL: https://issues.apache.org/jira/browse/LUCENE-329
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 1.2
> Environment: Operating System: All
> Platform: All
>Reporter: Mark Harwood
>Assignee: Mark Harwood
>Priority: Minor
> Fix For: 5.x
>
> Attachments: ASF.LICENSE.NOT.GRANTED--patch.txt, LUCENE-329.patch, 
> LUCENE-329.patch
>
>
> Queries which automatically produce multiple terms (wildcard, range, prefix, 
> fuzzy etc.) currently suffer from two problems:
> 1) Scores for matching documents are significantly smaller than term queries 
> because of the volume of terms introduced (A match on query Foo~ is 0.1 
> whereas a match on query Foo is 1).
> 2) The rarer forms of expanded terms are favoured over those of more common 
> forms because of the IDF. When using Fuzzy queries, for example, rare 
> misspellings typically appear in results before the more common correct 
> spellings.
> I will attach a patch that corrects the issues identified above by 
> 1) Overriding Similarity.coord to counteract the downplaying of scores 
> introduced by expanding terms.
> 2) Taking the IDF factor of the most common form of expanded terms as the 
> basis of scoring all other expanded terms.






[jira] [Updated] (LUCENE-329) Fuzzy query scoring issues

2015-05-05 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-329:

Fix Version/s: (was: 3.1)
   (was: 4.0-ALPHA)
   5.x

> Fuzzy query scoring issues
> --
>
> Key: LUCENE-329
> URL: https://issues.apache.org/jira/browse/LUCENE-329
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 1.2
> Environment: Operating System: All
> Platform: All
>Reporter: Mark Harwood
>Assignee: Mark Harwood
>Priority: Minor
> Fix For: 5.x
>
> Attachments: ASF.LICENSE.NOT.GRANTED--patch.txt, LUCENE-329.patch
>
>
> Queries which automatically produce multiple terms (wildcard, range, prefix, 
> fuzzy etc.) currently suffer from two problems:
> 1) Scores for matching documents are significantly smaller than term queries 
> because of the volume of terms introduced (A match on query Foo~ is 0.1 
> whereas a match on query Foo is 1).
> 2) The rarer forms of expanded terms are favoured over those of more common 
> forms because of the IDF. When using Fuzzy queries, for example, rare 
> misspellings typically appear in results before the more common correct 
> spellings.
> I will attach a patch that corrects the issues identified above by 
> 1) Overriding Similarity.coord to counteract the downplaying of scores 
> introduced by expanding terms.
> 2) Taking the IDF factor of the most common form of expanded terms as the 
> basis of scoring all other expanded terms.






[jira] [Updated] (LUCENE-329) Fuzzy query scoring issues

2015-05-05 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-329:

Attachment: LUCENE-329.patch

New patch addressing this long-standing bug.
It addresses today's all-or-nothing choice, where the default is a (poor) use 
of all IDF factors and the only alternative is a sub-optimal rewrite method 
with no IDF.
The patch includes:
1) A new default FuzzyQuery rewrite method that balances IDF better
2) Unit tests for single and multi-query behaviours

Additionally, this document offers more analysis based on quality tests on a 
slightly larger set of data not included here: 
https://docs.google.com/document/d/1KXhbUpD5GFyzNqfk3nocODOo7Upgpd5tmUQp4-OPwiM/edit#heading=h.2e8gdmdqf2m5


> Fuzzy query scoring issues
> --
>
> Key: LUCENE-329
> URL: https://issues.apache.org/jira/browse/LUCENE-329
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 1.2
> Environment: Operating System: All
> Platform: All
>Reporter: Mark Harwood
>Assignee: Mark Harwood
>Priority: Minor
> Fix For: 3.1, 4.0-ALPHA
>
> Attachments: ASF.LICENSE.NOT.GRANTED--patch.txt, LUCENE-329.patch
>
>
> Queries which automatically produce multiple terms (wildcard, range, prefix, 
> fuzzy etc.) currently suffer from two problems:
> 1) Scores for matching documents are significantly smaller than term queries 
> because of the volume of terms introduced (A match on query Foo~ is 0.1 
> whereas a match on query Foo is 1).
> 2) The rarer forms of expanded terms are favoured over those of more common 
> forms because of the IDF. When using Fuzzy queries, for example, rare 
> misspellings typically appear in results before the more common correct 
> spellings.
> I will attach a patch that corrects the issues identified above by 
> 1) Overriding Similarity.coord to counteract the downplaying of scores 
> introduced by expanding terms.
> 2) Taking the IDF factor of the most common form of expanded terms as the 
> basis of scoring all other expanded terms.






[jira] [Closed] (LUCENE-6066) Collector that manages diversity in search results

2015-02-12 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood closed LUCENE-6066.

   Resolution: Fixed
Fix Version/s: (was: 5.0)
   5.1

Committed to trunk and 5x branch. Thanks for reviews Adrien and Mike.

> Collector that manages diversity in search results
> --
>
> Key: LUCENE-6066
> URL: https://issues.apache.org/jira/browse/LUCENE-6066
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/query/scoring
>    Reporter: Mark Harwood
>Priority: Minor
> Fix For: 5.1
>
> Attachments: LUCENE-6066.patch, LUCENE-PQRemoveV8.patch, 
> LUCENE-PQRemoveV9.patch
>
>
> This issue provides a new collector for situations where a client doesn't 
> want more than N matches for any given key (e.g. no more than 5 products from 
> any one retailer in a marketplace). In these circumstances a document that 
> was previously thought of as competitive during collection has to be removed 
> from the final PQ and replaced with another doc (eg a retailer who already 
> has 5 matches in the PQ receives a 6th match which is better than his 
> previous ones). This requires a new remove method on the existing 
> PriorityQueue class.
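The "no more than N matches per key" behaviour can be sketched as standalone Java. This is hypothetical illustration code, not the actual collector, which works incrementally during collection using a PriorityQueue that supports removal; sorting up front keeps the sketch simple:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DiversityDemo {
    // Keep the top-n docs overall, but never more than maxPerKey per key
    // (e.g. no more than 2 products from any one retailer).
    static List<String> topDiverse(Map<String, Double> scoresByDoc,
                                   Map<String, String> keyByDoc,
                                   int n, int maxPerKey) {
        List<String> docs = new ArrayList<>(scoresByDoc.keySet());
        docs.sort(Comparator.comparingDouble(scoresByDoc::get).reversed());
        Map<String, Integer> perKey = new HashMap<>();
        List<String> result = new ArrayList<>();
        for (String doc : docs) {
            String key = keyByDoc.get(doc);
            int seen = perKey.getOrDefault(key, 0);
            if (seen < maxPerKey) {       // skip docs whose key is saturated
                perKey.put(key, seen + 1);
                result.add(doc);
                if (result.size() == n) break;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Double> scores = Map.of("d1", 9.0, "d2", 8.0, "d3", 7.0, "d4", 6.0);
        Map<String, String> keys = Map.of("d1", "acme", "d2", "acme", "d3", "acme", "d4", "other");
        // With maxPerKey=2, d3 (acme's 3rd best) is displaced by d4.
        System.out.println(topDiverse(scores, keys, 3, 2)); // [d1, d2, d4]
    }
}
```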






[jira] [Updated] (LUCENE-6066) Collector that manages diversity in search results

2015-02-09 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-6066:
-
Attachment: LUCENE-PQRemoveV9.patch

Moved DiversifiedTopDocsCollector and the related unit test to "misc".
Added the "experimental" annotation.
Removed the superfluous "if == 0" test in PriorityQueue.

Thanks, Adrien.

> Collector that manages diversity in search results
> --
>
> Key: LUCENE-6066
> URL: https://issues.apache.org/jira/browse/LUCENE-6066
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/query/scoring
>Reporter: Mark Harwood
>Priority: Minor
> Fix For: 5.0
>
> Attachments: LUCENE-PQRemoveV8.patch, LUCENE-PQRemoveV9.patch
>
>
> This issue provides a new collector for situations where a client doesn't 
> want more than N matches for any given key (e.g. no more than 5 products from 
> any one retailer in a marketplace). In these circumstances a document that 
> was previously thought of as competitive during collection has to be removed 
> from the final PQ and replaced with another doc (eg a retailer who already 
> has 5 matches in the PQ receives a 6th match which is better than his 
> previous ones). This requires a new remove method on the existing 
> PriorityQueue class.






[jira] [Commented] (LUCENE-6066) Collector that manages diversity in search results

2015-02-06 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14309365#comment-14309365
 ] 

Mark Harwood commented on LUCENE-6066:
--

bq. maybe we should have this feature in lucene/sandbox or in lucene/misc first 
instead of lucene/core?

It relies on a change to core's PriorityQueue (which was the original focus of 
this issue; the issue then extended to the specialized collector, which is 
possibly the only justification for introducing a "remove" method on PQ).

bq. I think we should also add a lucene.experimental annotation to this 
collector?

That seems fair. 

bq. the `if (size == 0)` condition at the top of PQ.remove looks already 
covered by the below for-loop?

good point, will change.

bq. Should PQ.downHeap and upHeap delegate to their counterpart that takes a 
position?

I wanted to avoid the possibility of introducing any slowdown to the PQ 
implementation, by keeping the existing upHeap/downHeap methods intact and 
duplicating most of their logic in the version that takes a position.
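Removing an arbitrary element from a binary heap, as the proposed PriorityQueue.remove must, can be sketched as follows. This is a hypothetical 1-based array min-heap, not Lucene's implementation; the key point is that the element swapped into the hole may need to sift either up or down:

```java
import java.util.Arrays;

public class HeapRemove {
    int[] heap; // 1-based array min-heap
    int size;   // number of live elements

    HeapRemove(int[] values) {
        heap = new int[values.length + 1];
        for (int v : values) insert(v);
    }

    void insert(int v) {
        heap[++size] = v;
        upHeap(size);
    }

    // Remove the first occurrence of v: overwrite it with the last element,
    // shrink, then restore the invariant by sifting in whichever direction
    // is needed -- the replacement may be smaller OR larger than the hole.
    boolean remove(int v) {
        for (int i = 1; i <= size; i++) {
            if (heap[i] == v) {
                heap[i] = heap[size--];
                if (i <= size) { upHeap(i); downHeap(i); }
                return true;
            }
        }
        return false;
    }

    void upHeap(int i) {
        while (i > 1 && heap[i] < heap[i / 2]) { swap(i, i / 2); i /= 2; }
    }

    void downHeap(int i) {
        while (2 * i <= size) {
            int c = 2 * i;                              // smaller child
            if (c < size && heap[c + 1] < heap[c]) c++;
            if (heap[i] <= heap[c]) break;
            swap(i, c); i = c;
        }
    }

    void swap(int a, int b) { int t = heap[a]; heap[a] = heap[b]; heap[b] = t; }

    int pop() { int top = heap[1]; heap[1] = heap[size--]; downHeap(1); return top; }

    public static void main(String[] args) {
        HeapRemove h = new HeapRemove(new int[]{5, 3, 8, 1, 9});
        h.remove(3);
        int[] out = new int[h.size];
        for (int i = 0; i < out.length; i++) out[i] = h.pop();
        System.out.println(Arrays.toString(out)); // [1, 5, 8, 9]
    }
}
```

The linear scan to locate the element is the cost of supporting removal at all; the sift itself stays O(log n).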


> Collector that manages diversity in search results
> --
>
> Key: LUCENE-6066
> URL: https://issues.apache.org/jira/browse/LUCENE-6066
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/query/scoring
>Reporter: Mark Harwood
>Priority: Minor
> Fix For: 5.0
>
> Attachments: LUCENE-PQRemoveV8.patch
>
>
> This issue provides a new collector for situations where a client doesn't 
> want more than N matches for any given key (e.g. no more than 5 products from 
> any one retailer in a marketplace). In these circumstances a document that 
> was previously thought of as competitive during collection has to be removed 
> from the final PQ and replaced with another doc (eg a retailer who already 
> has 5 matches in the PQ receives a 6th match which is better than his 
> previous ones). This requires a new remove method on the existing 
> PriorityQueue class.






[jira] [Updated] (LUCENE-6066) Collector that manages diversity in search results

2015-02-05 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-6066:
-
Attachment: (was: LUCENE-PQRemoveV7.patch)

> Collector that manages diversity in search results
> --
>
> Key: LUCENE-6066
> URL: https://issues.apache.org/jira/browse/LUCENE-6066
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/query/scoring
>    Reporter: Mark Harwood
>Priority: Minor
> Fix For: 5.0
>
> Attachments: LUCENE-PQRemoveV8.patch
>
>
> This issue provides a new collector for situations where a client doesn't 
> want more than N matches for any given key (e.g. no more than 5 products from 
> any one retailer in a marketplace). In these circumstances a document that 
> was previously thought of as competitive during collection has to be removed 
> from the final PQ and replaced with another doc (eg a retailer who already 
> has 5 matches in the PQ receives a 6th match which is better than his 
> previous ones). This requires a new remove method on the existing 
> PriorityQueue class.






[jira] [Updated] (LUCENE-6066) Collector that manages diversity in search results

2015-02-05 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-6066:
-
Attachment: (was: LUCENE-PQRemoveV6.patch)

> Collector that manages diversity in search results
> --
>
> Key: LUCENE-6066
> URL: https://issues.apache.org/jira/browse/LUCENE-6066
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/query/scoring
>    Reporter: Mark Harwood
>Priority: Minor
> Fix For: 5.0
>
> Attachments: LUCENE-PQRemoveV8.patch
>
>
> This issue provides a new collector for situations where a client doesn't 
> want more than N matches for any given key (e.g. no more than 5 products from 
> any one retailer in a marketplace). In these circumstances a document that 
> was previously thought of as competitive during collection has to be removed 
> from the final PQ and replaced with another doc (eg a retailer who already 
> has 5 matches in the PQ receives a 6th match which is better than his 
> previous ones). This requires a new remove method on the existing 
> PriorityQueue class.






[jira] [Updated] (LUCENE-6066) Collector that manages diversity in search results

2015-02-05 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-6066:
-
Attachment: LUCENE-PQRemoveV8.patch

Tabs removed. Ant precommit now passes. Still no Bee Gees (sorry, Mike).
Will commit to trunk and 5.1 in a day or 2 if no objections. 

> Collector that manages diversity in search results
> --
>
> Key: LUCENE-6066
> URL: https://issues.apache.org/jira/browse/LUCENE-6066
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/query/scoring
>    Reporter: Mark Harwood
>Priority: Minor
> Fix For: 5.0
>
> Attachments: LUCENE-PQRemoveV8.patch
>
>
> This issue provides a new collector for situations where a client doesn't 
> want more than N matches for any given key (e.g. no more than 5 products from 
> any one retailer in a marketplace). In these circumstances a document that 
> was previously thought of as competitive during collection has to be removed 
> from the final PQ and replaced with another doc (eg a retailer who already 
> has 5 matches in the PQ receives a 6th match which is better than his 
> previous ones). This requires a new remove method on the existing 
> PriorityQueue class.






[jira] [Updated] (LUCENE-6066) Collector that manages diversity in search results

2015-01-22 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-6066:
-
Attachment: LUCENE-PQRemoveV7.patch

Fixed the test PQ's impl of lessThan(), which was causing test failures when 
duplicate Integers were placed into the queue.

> Collector that manages diversity in search results
> --
>
> Key: LUCENE-6066
> URL: https://issues.apache.org/jira/browse/LUCENE-6066
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/query/scoring
>    Reporter: Mark Harwood
>Priority: Minor
> Fix For: 5.0
>
> Attachments: LUCENE-PQRemoveV6.patch, LUCENE-PQRemoveV7.patch
>
>
> This issue provides a new collector for situations where a client doesn't 
> want more than N matches for any given key (e.g. no more than 5 products from 
> any one retailer in a marketplace). In these circumstances a document that 
> was previously thought of as competitive during collection has to be removed 
> from the final PQ and replaced with another doc (eg a retailer who already 
> has 5 matches in the PQ receives a 6th match which is better than his 
> previous ones). This requires a new remove method on the existing 
> PriorityQueue class.






[jira] [Updated] (LUCENE-6066) Collector that manages diversity in search results

2015-01-19 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-6066:
-
Attachment: (was: LUCENE-PQRemoveV5.patch)

> Collector that manages diversity in search results
> --
>
> Key: LUCENE-6066
> URL: https://issues.apache.org/jira/browse/LUCENE-6066
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/query/scoring
>    Reporter: Mark Harwood
>Priority: Minor
> Fix For: 5.0
>
> Attachments: LUCENE-PQRemoveV6.patch
>
>
> This issue provides a new collector for situations where a client doesn't 
> want more than N matches for any given key (e.g. no more than 5 products from 
> any one retailer in a marketplace). In these circumstances a document that 
> was previously thought of as competitive during collection has to be removed 
> from the final PQ and replaced with another doc (eg a retailer who already 
> has 5 matches in the PQ receives a 6th match which is better than his 
> previous ones). This requires a new remove method on the existing 
> PriorityQueue class.






[jira] [Updated] (LUCENE-6066) Collector that manages diversity in search results

2015-01-19 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-6066:
-
Attachment: LUCENE-PQRemoveV6.patch

Removed outdated acceptDocsInOrder() method.

> Collector that manages diversity in search results
> --
>
> Key: LUCENE-6066
> URL: https://issues.apache.org/jira/browse/LUCENE-6066
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/query/scoring
>    Reporter: Mark Harwood
>Priority: Minor
> Fix For: 5.0
>
> Attachments: LUCENE-PQRemoveV6.patch
>
>
> This issue provides a new collector for situations where a client doesn't 
> want more than N matches for any given key (e.g. no more than 5 products from 
> any one retailer in a marketplace). In these circumstances a document that 
> was previously thought of as competitive during collection has to be removed 
> from the final PQ and replaced with another doc (eg a retailer who already 
> has 5 matches in the PQ receives a 6th match which is better than his 
> previous ones). This requires a new remove method on the existing 
> PriorityQueue class.






[jira] [Commented] (LUCENE-6066) Collector that manages diversity in search results

2015-01-14 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277279#comment-14277279
 ] 

Mark Harwood commented on LUCENE-6066:
--

What feels awkward in the example JUnit test is that diversified collections are not 
compatible with existing Sort functionality - I had to use a custom Similarity 
class to sort by the popularity of songs in my test data. 
Combining the diversified collector with any other form of existing collector 
(e.g. TopFieldCollector to achieve field-based sorting) via wrapping is 
problematic because the other collectors all work with an assumption that 
previously collected elements are never recalled. The diversifying collector 
needs the ability to recall previously collected elements when new elements 
with the same key need to be substituted.

> Collector that manages diversity in search results
> --
>
> Key: LUCENE-6066
> URL: https://issues.apache.org/jira/browse/LUCENE-6066
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/query/scoring
>    Reporter: Mark Harwood
>Priority: Minor
> Fix For: 5.0
>
> Attachments: LUCENE-PQRemoveV5.patch
>
>
> This issue provides a new collector for situations where a client doesn't 
> want more than N matches for any given key (e.g. no more than 5 products from 
> any one retailer in a marketplace). In these circumstances a document that 
> was previously thought of as competitive during collection has to be removed 
> from the final PQ and replaced with another doc (eg a retailer who already 
> has 5 matches in the PQ receives a 6th match which is better than his 
> previous ones). This requires a new remove method on the existing 
> PriorityQueue class.






[jira] [Updated] (LUCENE-6066) Collector that manages diversity in search results

2015-01-14 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-6066:
-
Attachment: LUCENE-PQRemoveV5.patch

Added a JUnit test showing use with String-based dedup keys via two lookup impls: 
slow but accurate global ords, and fast but potentially inaccurate hashing of 
BinaryDocValues.

> Collector that manages diversity in search results
> --
>
> Key: LUCENE-6066
> URL: https://issues.apache.org/jira/browse/LUCENE-6066
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/query/scoring
>    Reporter: Mark Harwood
>Priority: Minor
> Fix For: 5.0
>
> Attachments: LUCENE-PQRemoveV5.patch
>
>
> This issue provides a new collector for situations where a client doesn't 
> want more than N matches for any given key (e.g. no more than 5 products from 
> any one retailer in a marketplace). In these circumstances a document that 
> was previously thought of as competitive during collection has to be removed 
> from the final PQ and replaced with another doc (eg a retailer who already 
> has 5 matches in the PQ receives a 6th match which is better than his 
> previous ones). This requires a new remove method on the existing 
> PriorityQueue class.






[jira] [Updated] (LUCENE-6066) Collector that manages diversity in search results

2015-01-14 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-6066:
-
Attachment: (was: LUCENE-PQRemoveV3.patch)

> Collector that manages diversity in search results
> --
>
> Key: LUCENE-6066
> URL: https://issues.apache.org/jira/browse/LUCENE-6066
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/query/scoring
>    Reporter: Mark Harwood
>Priority: Minor
> Fix For: 5.0
>
> Attachments: LUCENE-PQRemoveV5.patch
>
>
> This issue provides a new collector for situations where a client doesn't 
> want more than N matches for any given key (e.g. no more than 5 products from 
> any one retailer in a marketplace). In these circumstances a document that 
> was previously thought of as competitive during collection has to be removed 
> from the final PQ and replaced with another doc (eg a retailer who already 
> has 5 matches in the PQ receives a 6th match which is better than his 
> previous ones). This requires a new remove method on the existing 
> PriorityQueue class.






[jira] [Updated] (LUCENE-6066) Collector that manages diversity in search results

2015-01-05 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-6066:
-
Description: This issue provides a new collector for situations where a 
client doesn't want more than N matches for any given key (e.g. no more than 5 
products from any one retailer in a marketplace). In these circumstances a 
document that was previously thought of as competitive during collection has to 
be removed from the final PQ and replaced with another doc (eg a retailer who 
already has 5 matches in the PQ receives a 6th match which is better than his 
previous ones). This requires a new remove method on the existing PriorityQueue 
class.  (was: It would be useful to be able to remove existing elements from a 
PriorityQueue. 
The proposal is that a linear scan is performed to find the element being 
removed and then the end element in heap[size] is swapped into this position to 
perform the delete. The method downHeap() is then called to shuffle the 
replacement element back down the array but the existing downHeap method must 
be modified to allow picking up an entry from any point in the array rather 
than always assuming the first element (which is its only current mode of 
operation).

A working javascript model of the proposal with animation is available here: 
http://jsfiddle.net/grcmquf2/22/ 

In tests the modified version of "downHeap" produces the same results as the 
existing impl but adds the ability to push down from any point.

An example use case that requires remove is where a client doesn't want more 
than N matches for any given key (e.g. no more than 5 products from any one 
retailer in a marketplace). In these circumstances a document that was 
previously thought of as competitive has to be removed from the final PQ and 
replaced with another doc (eg a retailer who already has 5 matches in the PQ 
receives a 6th match which is better than his previous ones). This particular 
process is managed by a special "DiversifyingPriorityQueue" which wraps the 
main PriorityQueue and could be contributed as part of another issue if there 
is interest in that. )
Summary: Collector that manages diversity in search results  (was: New 
"remove" method in PriorityQueue)

> Collector that manages diversity in search results
> --
>
> Key: LUCENE-6066
> URL: https://issues.apache.org/jira/browse/LUCENE-6066
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/query/scoring
>Reporter: Mark Harwood
>Priority: Minor
> Fix For: 5.0
>
> Attachments: LUCENE-PQRemoveV3.patch
>
>
> This issue provides a new collector for situations where a client doesn't 
> want more than N matches for any given key (e.g. no more than 5 products from 
> any one retailer in a marketplace). In these circumstances a document that 
> was previously thought of as competitive during collection has to be removed 
> from the final PQ and replaced with another doc (eg a retailer who already 
> has 5 matches in the PQ receives a 6th match which is better than his 
> previous ones). This requires a new remove method on the existing 
> PriorityQueue class.






[jira] [Commented] (LUCENE-6066) New "remove" method in PriorityQueue

2014-12-09 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14239328#comment-14239328
 ] 

Mark Harwood commented on LUCENE-6066:
--

Thanks for the review, Mike. I'm working through changes.

bq. Why couldn't you just pass your custom queue instead of null to super() in 
DiversifiedTopDocsCollector ctor? 

Oops. That was a cut/paste error transferring code from Elasticsearch, which relied on a 
forked PriorityQueue that is obviously incompatible with the Lucene 
TopDocsCollector base class.

bq. the abstract method returns NumericDocValues, which is confusing: how does 
"beatles" become a number? Why not e.g. SortedDVs

I originally had a getKey(docId) method that returned an object - anything 
which implements hashCode and Equals. When I talked through with Adrien he 
suggested the use of NumericDocValues as a better abstraction which could be 
backed by any system based on ordinals. We need to decide on what this 
abstraction should be. One of the things I've been grappling with is if the 
collector should implement support for multi-keyed docs e.g. a field containing 
hashes for near-duplicate detection to avoid too-similar texts. This would 
require extra code in the collector to determine if any one key had exceeded 
limits (and ideally some memory-safeguard for docs with too many keys).

>I saw a test about paging; how does/should paging work with such a collector?

In regular collections, TopScoreDocCollector provides all of the smarts for 
in-order/out-of-order collection and for starting from the ScoreDoc at the bottom 
of the previous page. I expect I would have to reimplement all of its logic for a new 
DiversifiedTopScoreKeyedDocCollector because it makes some assumptions about 
using updateTop() that don't apply when we have a two-tier system for scoring 
(globally competitive and within-key competitive).  
My vague assumption was that the logic for paging would have to be that any 
per-key constraints would apply across multiple pages e.g. having had 5 Beatles 
hits on pages 1 and 2 you wouldn't expect to find any more the deeper you go 
into the results because it had exhausted the "max 5 per key" limit. This logic 
would probably preclude any use of the deep-paging optimisation where you can 
pass just the ScoreDoc of the last entry on the previous page to minimise the 
size of the PQ created for subsequent pages.





> New "remove" method in PriorityQueue
> 
>
> Key: LUCENE-6066
> URL: https://issues.apache.org/jira/browse/LUCENE-6066
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/query/scoring
>Reporter: Mark Harwood
>Priority: Minor
> Fix For: 5.0
>
> Attachments: LUCENE-PQRemoveV3.patch
>
>
> It would be useful to be able to remove existing elements from a 
> PriorityQueue. 
> The proposal is that a linear scan is performed to find the element being 
> removed and then the end element in heap[size] is swapped into this position 
> to perform the delete. The method downHeap() is then called to shuffle the 
> replacement element back down the array but the existing downHeap method must 
> be modified to allow picking up an entry from any point in the array rather 
> than always assuming the first element (which is its only current mode of 
> operation).
> A working javascript model of the proposal with animation is available here: 
> http://jsfiddle.net/grcmquf2/22/ 
> In tests the modified version of "downHeap" produces the same results as the 
> existing impl but adds the ability to push down from any point.
> An example use case that requires remove is where a client doesn't want more 
> than N matches for any given key (e.g. no more than 5 products from any one 
> retailer in a marketplace). In these circumstances a document that was 
> previously thought of as competitive has to be removed from the final PQ and 
> replaced with another doc (eg a retailer who already has 5 matches in the PQ 
> receives a 6th match which is better than his previous ones). This particular 
> process is managed by a special "DiversifyingPriorityQueue" which wraps the 
> main PriorityQueue and could be contributed as part of another issue if there 
> is interest in that. 






[jira] [Updated] (LUCENE-6066) New "remove" method in PriorityQueue

2014-12-04 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-6066:
-
Attachment: (was: LUCENE-PQRemoveV2.patch)

> New "remove" method in PriorityQueue
> 
>
> Key: LUCENE-6066
> URL: https://issues.apache.org/jira/browse/LUCENE-6066
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/query/scoring
>Reporter: Mark Harwood
>Priority: Minor
> Fix For: 5.0
>
> Attachments: LUCENE-PQRemoveV3.patch
>
>
> It would be useful to be able to remove existing elements from a 
> PriorityQueue. 
> The proposal is that a linear scan is performed to find the element being 
> removed and then the end element in heap[size] is swapped into this position 
> to perform the delete. The method downHeap() is then called to shuffle the 
> replacement element back down the array but the existing downHeap method must 
> be modified to allow picking up an entry from any point in the array rather 
> than always assuming the first element (which is its only current mode of 
> operation).
> A working javascript model of the proposal with animation is available here: 
> http://jsfiddle.net/grcmquf2/22/ 
> In tests the modified version of "downHeap" produces the same results as the 
> existing impl but adds the ability to push down from any point.
> An example use case that requires remove is where a client doesn't want more 
> than N matches for any given key (e.g. no more than 5 products from any one 
> retailer in a marketplace). In these circumstances a document that was 
> previously thought of as competitive has to be removed from the final PQ and 
> replaced with another doc (eg a retailer who already has 5 matches in the PQ 
> receives a 6th match which is better than his previous ones). This particular 
> process is managed by a special "DiversifyingPriorityQueue" which wraps the 
> main PriorityQueue and could be contributed as part of another issue if there 
> is interest in that. 






[jira] [Updated] (LUCENE-6066) New "remove" method in PriorityQueue

2014-12-04 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-6066:
-
Attachment: LUCENE-PQRemoveV3.patch

Updated patch.
Added DiversifiedTopDocsCollector and an associated test. This class represents 
the primary use-case for wanting a new remove() method in PriorityQueue.
The PriorityQueue keeps the original upHeap/downHeap methods unchanged (to avoid 
any performance regression) and adds specialised upHeap/downHeap variants that 
take a position, to support the new remove function.
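
For illustration, the scan-swap-reheapify remove described in this issue can be 
sketched on a toy array-backed heap. This is a standalone sketch, not the attached 
patch: the class name and the positional upHeap(pos)/downHeap(pos) signatures only 
mirror the proposal.

```java
// Toy int min-heap with the proposed remove(): linear scan for the element,
// swap the last element into its slot, then re-heapify from that position.
class RemovableMinHeap {
    private final int[] heap = new int[64]; // 1-based storage; heap[1] is the min
    private int size;

    void add(int v) { heap[++size] = v; upHeap(size); }

    int top() { return heap[1]; }

    /** Linear scan for v; on a hit, move heap[size] into its slot and re-heapify. */
    boolean remove(int v) {
        for (int i = 1; i <= size; i++) {
            if (heap[i] == v) {
                heap[i] = heap[size--];
                // The swapped-in element may need to move in either direction,
                // hence both the upHeap call (added in V2) and the downHeap call.
                if (!upHeap(i)) downHeap(i);
                return true;
            }
        }
        return false;
    }

    /** Bubble heap[i] up; returns true if it moved (so downHeap can be skipped). */
    private boolean upHeap(int i) {
        int node = heap[i];
        int start = i;
        while (i > 1 && node < heap[i >>> 1]) {
            heap[i] = heap[i >>> 1]; // pull parent down
            i >>>= 1;
        }
        heap[i] = node;
        return i != start;
    }

    /** Push heap[i] down from an arbitrary position, not just the root. */
    private void downHeap(int i) {
        int node = heap[i];
        int child = i << 1;
        while (child <= size) {
            if (child < size && heap[child + 1] < heap[child]) child++; // smaller child
            if (node <= heap[child]) break;
            heap[i] = heap[child];
            i = child;
            child = i << 1;
        }
        heap[i] = node;
    }
}
```

The key difference from the stock implementation is only that upHeap/downHeap take 
a starting position instead of always assuming the root or the last slot.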

> New "remove" method in PriorityQueue
> 
>
> Key: LUCENE-6066
> URL: https://issues.apache.org/jira/browse/LUCENE-6066
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/query/scoring
>Reporter: Mark Harwood
>Priority: Minor
> Fix For: 5.0
>
> Attachments: LUCENE-PQRemoveV2.patch, LUCENE-PQRemoveV3.patch
>
>
> It would be useful to be able to remove existing elements from a 
> PriorityQueue. 
> The proposal is that a linear scan is performed to find the element being 
> removed and then the end element in heap[size] is swapped into this position 
> to perform the delete. The method downHeap() is then called to shuffle the 
> replacement element back down the array but the existing downHeap method must 
> be modified to allow picking up an entry from any point in the array rather 
> than always assuming the first element (which is its only current mode of 
> operation).
> A working javascript model of the proposal with animation is available here: 
> http://jsfiddle.net/grcmquf2/22/ 
> In tests the modified version of "downHeap" produces the same results as the 
> existing impl but adds the ability to push down from any point.
> An example use case that requires remove is where a client doesn't want more 
> than N matches for any given key (e.g. no more than 5 products from any one 
> retailer in a marketplace). In these circumstances a document that was 
> previously thought of as competitive has to be removed from the final PQ and 
> replaced with another doc (eg a retailer who already has 5 matches in the PQ 
> receives a 6th match which is better than his previous ones). This particular 
> process is managed by a special "DiversifyingPriorityQueue" which wraps the 
> main PriorityQueue and could be contributed as part of another issue if there 
> is interest in that. 






[jira] [Commented] (LUCENE-6066) New "remove" method in PriorityQueue

2014-11-24 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223307#comment-14223307
 ] 

Mark Harwood commented on LUCENE-6066:
--

Thanks for your comments, Stefan.

I believe the remove method is implemented correctly now.

bq. it still seems that specialized versions can outperform generic ones

Yes, the DiversifyingPriorityQueue that I imagined would need access to a new 
remove method in the existing PriorityQueue looks like it is better implemented 
as a fork of the existing PriorityQueue. I'll attach this fork here in a future 
addition.
Maybe with these differing implementations there is a need to have a common 
interface that provides an abstraction for things like TopDocsCollector to add 
and pop results.


> New "remove" method in PriorityQueue
> 
>
> Key: LUCENE-6066
> URL: https://issues.apache.org/jira/browse/LUCENE-6066
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/query/scoring
>Reporter: Mark Harwood
>Priority: Minor
> Fix For: 5.0
>
> Attachments: LUCENE-PQRemoveV2.patch
>
>
> It would be useful to be able to remove existing elements from a 
> PriorityQueue. 
> The proposal is that a linear scan is performed to find the element being 
> removed and then the end element in heap[size] is swapped into this position 
> to perform the delete. The method downHeap() is then called to shuffle the 
> replacement element back down the array but the existing downHeap method must 
> be modified to allow picking up an entry from any point in the array rather 
> than always assuming the first element (which is its only current mode of 
> operation).
> A working javascript model of the proposal with animation is available here: 
> http://jsfiddle.net/grcmquf2/22/ 
> In tests the modified version of "downHeap" produces the same results as the 
> existing impl but adds the ability to push down from any point.
> An example use case that requires remove is where a client doesn't want more 
> than N matches for any given key (e.g. no more than 5 products from any one 
> retailer in a marketplace). In these circumstances a document that was 
> previously thought of as competitive has to be removed from the final PQ and 
> replaced with another doc (eg a retailer who already has 5 matches in the PQ 
> receives a 6th match which is better than his previous ones). This particular 
> process is managed by a special "DiversifyingPriorityQueue" which wraps the 
> main PriorityQueue and could be contributed as part of another issue if there 
> is interest in that. 






[jira] [Updated] (LUCENE-6066) New "remove" method in PriorityQueue

2014-11-24 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-6066:
-
Attachment: LUCENE-PQRemoveV2.patch

Added the missing upHeap call to the remove method.
Added extra randomized tests and a method to check the validity of PQ elements as 
mutations are made.

> New "remove" method in PriorityQueue
> 
>
> Key: LUCENE-6066
> URL: https://issues.apache.org/jira/browse/LUCENE-6066
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/query/scoring
>Reporter: Mark Harwood
>Priority: Minor
> Fix For: 5.0
>
> Attachments: LUCENE-PQRemoveV2.patch
>
>
> It would be useful to be able to remove existing elements from a 
> PriorityQueue. 
> The proposal is that a linear scan is performed to find the element being 
> removed and then the end element in heap[size] is swapped into this position 
> to perform the delete. The method downHeap() is then called to shuffle the 
> replacement element back down the array but the existing downHeap method must 
> be modified to allow picking up an entry from any point in the array rather 
> than always assuming the first element (which is its only current mode of 
> operation).
> A working javascript model of the proposal with animation is available here: 
> http://jsfiddle.net/grcmquf2/22/ 
> In tests the modified version of "downHeap" produces the same results as the 
> existing impl but adds the ability to push down from any point.
> An example use case that requires remove is where a client doesn't want more 
> than N matches for any given key (e.g. no more than 5 products from any one 
> retailer in a marketplace). In these circumstances a document that was 
> previously thought of as competitive has to be removed from the final PQ and 
> replaced with another doc (eg a retailer who already has 5 matches in the PQ 
> receives a 6th match which is better than his previous ones). This particular 
> process is managed by a special "DiversifyingPriorityQueue" which wraps the 
> main PriorityQueue and could be contributed as part of another issue if there 
> is interest in that. 






[jira] [Updated] (LUCENE-6066) New "remove" method in PriorityQueue

2014-11-24 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-6066:
-
Attachment: (was: LUCENE-PQRemoveV1.patch)

> New "remove" method in PriorityQueue
> 
>
> Key: LUCENE-6066
> URL: https://issues.apache.org/jira/browse/LUCENE-6066
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/query/scoring
>Reporter: Mark Harwood
>Priority: Minor
> Fix For: 5.0
>
> Attachments: LUCENE-PQRemoveV2.patch
>
>
> It would be useful to be able to remove existing elements from a 
> PriorityQueue. 
> The proposal is that a linear scan is performed to find the element being 
> removed and then the end element in heap[size] is swapped into this position 
> to perform the delete. The method downHeap() is then called to shuffle the 
> replacement element back down the array but the existing downHeap method must 
> be modified to allow picking up an entry from any point in the array rather 
> than always assuming the first element (which is its only current mode of 
> operation).
> A working javascript model of the proposal with animation is available here: 
> http://jsfiddle.net/grcmquf2/22/ 
> In tests the modified version of "downHeap" produces the same results as the 
> existing impl but adds the ability to push down from any point.
> An example use case that requires remove is where a client doesn't want more 
> than N matches for any given key (e.g. no more than 5 products from any one 
> retailer in a marketplace). In these circumstances a document that was 
> previously thought of as competitive has to be removed from the final PQ and 
> replaced with another doc (eg a retailer who already has 5 matches in the PQ 
> receives a 6th match which is better than his previous ones). This particular 
> process is managed by a special "DiversifyingPriorityQueue" which wraps the 
> main PriorityQueue and could be contributed as part of another issue if there 
> is interest in that. 






[jira] [Commented] (LUCENE-6066) New "remove" method in PriorityQueue

2014-11-20 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14220089#comment-14220089
 ] 

Mark Harwood commented on LUCENE-6066:
--

bq. But how will you track the min element for each key in the PQ (to know 
which element to remove, when a more competitive hit with that key arrives)?

I was thinking of this as a foundation: (pseudo code) 

{code:title=DiversifyingPriorityQueue.java|borderStyle=solid}
    abstract class KeyedElement {
        int pqPos;
        abstract Object getKey();
    }

    class DiversifyingPriorityQueue extends PriorityQueue<KeyedElement> {
        FastRemovablePriorityQueue<KeyedElement> mainPQ;
        Map<Object, FastRemovablePriorityQueue<KeyedElement>> perKeyQueues;
    }
{code}

You can probably guess at the logic but it is based around: 
* making sure each key has a max of n entries using an entry in perKeyQueues.
* Evictions from the mainPQ will require removal from the related perKeyQueue
* Emptied perKeyQueues can be recycled for use with other keys
* Evictions from the perKeyQueue will require removal from the mainPQ

bq. This seems promising, maybe as a separate dedicated (forked) PQ impl?

Yes, introducing a linear-cost remove by marking elements with a position is an 
added cost that not all PQs will require so forking seems necessary. In this 
case a common abstraction for these different PQs would be useful for the 
places where results are consumed e.g. TopDocsCollector
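
Those eviction rules could be roughed out with java.util collections standing in 
for the Lucene classes. Everything here (DiversifiedTopN, Hit, the scores) is 
hypothetical, and java.util.PriorityQueue's O(n) remove(Object) stands in for the 
proposed remove method:

```java
import java.util.*;

/** Keeps the best n hits overall, but at most maxPerKey hits for any one key. */
class DiversifiedTopN {
    static final class Hit {
        final String key; final double score;
        Hit(String key, double score) { this.key = key; this.score = score; }
    }

    private final int n, maxPerKey;
    // Worst-first ordering, so peek()/poll() give the cheapest element to evict.
    private final Comparator<Hit> worstFirst = Comparator.comparingDouble((Hit h) -> h.score);
    private final PriorityQueue<Hit> main = new PriorityQueue<>(worstFirst);
    private final Map<String, PriorityQueue<Hit>> perKey = new HashMap<>();

    DiversifiedTopN(int n, int maxPerKey) { this.n = n; this.maxPerKey = maxPerKey; }

    void collect(String key, double score) {
        Hit hit = new Hit(key, score);
        PriorityQueue<Hit> kq = perKey.computeIfAbsent(key, k -> new PriorityQueue<>(worstFirst));
        if (kq.size() == maxPerKey) {
            // Key is full: the new hit must beat that key's worst entry.
            if (score <= kq.peek().score) return;
            // Eviction from the per-key queue requires removal from the main PQ.
            main.remove(kq.poll());
        }
        kq.add(hit);
        main.add(hit);
        if (main.size() > n) {
            // Eviction from the main PQ requires removal from the related per-key queue.
            Hit evicted = main.poll();
            perKey.get(evicted.key).remove(evicted);
        }
    }

    List<Double> topScores() {
        List<Double> out = new ArrayList<>();
        for (Hit h : main) out.add(h.score);
        out.sort(Collections.reverseOrder());
        return out;
    }
}
```

The two cross-removals in collect() are exactly where the proposed fast remove 
would pay off; with plain binary heaps each one is a linear scan.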


> New "remove" method in PriorityQueue
> 
>
> Key: LUCENE-6066
> URL: https://issues.apache.org/jira/browse/LUCENE-6066
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/query/scoring
>Reporter: Mark Harwood
>Priority: Minor
> Fix For: 5.0
>
> Attachments: LUCENE-PQRemoveV1.patch
>
>
> It would be useful to be able to remove existing elements from a 
> PriorityQueue. 
> The proposal is that a linear scan is performed to find the element being 
> removed and then the end element in heap[size] is swapped into this position 
> to perform the delete. The method downHeap() is then called to shuffle the 
> replacement element back down the array but the existing downHeap method must 
> be modified to allow picking up an entry from any point in the array rather 
> than always assuming the first element (which is its only current mode of 
> operation).
> A working javascript model of the proposal with animation is available here: 
> http://jsfiddle.net/grcmquf2/22/ 
> In tests the modified version of "downHeap" produces the same results as the 
> existing impl but adds the ability to push down from any point.
> An example use case that requires remove is where a client doesn't want more 
> than N matches for any given key (e.g. no more than 5 products from any one 
> retailer in a marketplace). In these circumstances a document that was 
> previously thought of as competitive has to be removed from the final PQ and 
> replaced with another doc (eg a retailer who already has 5 matches in the PQ 
> receives a 6th match which is better than his previous ones). This particular 
> process is managed by a special "DiversifyingPriorityQueue" which wraps the 
> main PriorityQueue and could be contributed as part of another issue if there 
> is interest in that. 






[jira] [Commented] (LUCENE-6066) New "remove" method in PriorityQueue

2014-11-20 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14219901#comment-14219901
 ] 

Mark Harwood commented on LUCENE-6066:
--

An analogy might be making a compilation album of 1967's top hit records:

1) A vanilla Lucene query's results might look like a "Best of the Beatles" 
album - no diversity
2) A grouping query would produce "The 10 top-selling artists of 1967 - some 
killer and quite a lot of filler"
3) A "diversified" query would be the top 20 hit records of that year - with a 
max of 3 Beatles hits to maintain diversity







[jira] [Commented] (LUCENE-6066) New "remove" method in PriorityQueue

2014-11-20 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14219822#comment-14219822
 ] 

Mark Harwood commented on LUCENE-6066:
--

I guess it's different from grouping in that: 
1) it only involves one pass over the data
2) the client doesn't have to guess up-front how many groups they are going to need
3) We don't get any "filler" docs in each group's results i.e. a bunch of 
irrelevant docs for an author with one good hit.







[jira] [Commented] (LUCENE-6066) New "remove" method in PriorityQueue

2014-11-20 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14219651#comment-14219651
 ] 

Mark Harwood commented on LUCENE-6066:
--

If the PQ set the current array position as a property of each element every 
time it moved them around, I could pass the array index to remove() rather than 
an object that has to be scanned for.
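A sketch of that idea (illustrative, not the patch - the pqPos field name follows the pseudo code elsewhere in this thread): the heap writes each element's current slot back into the element on every move, so remove() becomes an O(log n) operation with no scan.

```java
import java.util.ArrayList;
import java.util.List;

// A min-heap (by score) that maintains element.pqPos on every move.
class PositionTrackingHeap<T extends PositionTrackingHeap.Entry> {
    static class Entry {
        int pqPos;        // maintained by the heap on every swap
        final double score;

        Entry(double score) {
            this.score = score;
        }
    }

    private final List<T> heap = new ArrayList<>();

    void add(T e) {
        heap.add(e);
        e.pqPos = heap.size() - 1;
        upHeap(e.pqPos);
    }

    void remove(T e) {
        int i = e.pqPos;                          // O(1) lookup instead of a linear scan
        T last = heap.remove(heap.size() - 1);
        if (i < heap.size()) {                    // e was not the end element
            place(last, i);
            downHeap(i);
            upHeap(i);
        }
    }

    T top() { return heap.get(0); }

    int size() { return heap.size(); }

    private void place(T e, int i) {
        heap.set(i, e);
        e.pqPos = i;
    }

    private void swap(int a, int b) {
        T ta = heap.get(a), tb = heap.get(b);
        place(tb, a);
        place(ta, b);
    }

    private void upHeap(int i) {
        while (i > 0 && heap.get(i).score < heap.get((i - 1) / 2).score) {
            swap(i, (i - 1) / 2);
            i = (i - 1) / 2;
        }
    }

    private void downHeap(int i) {
        while (true) {
            int l = 2 * i + 1, r = l + 1, m = i;
            if (l < heap.size() && heap.get(l).score < heap.get(m).score) m = l;
            if (r < heap.size() && heap.get(r).score < heap.get(m).score) m = r;
            if (m == i) return;
            swap(i, m);
            i = m;
        }
    }
}
```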







[jira] [Updated] (LUCENE-6066) New "remove" method in PriorityQueue

2014-11-20 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-6066:
-
Attachment: LUCENE-PQRemoveV1.patch

New remove(element) method in PriorityQueue and related test







[jira] [Created] (LUCENE-6066) New "remove" method in PriorityQueue

2014-11-20 Thread Mark Harwood (JIRA)
Mark Harwood created LUCENE-6066:


 Summary: New "remove" method in PriorityQueue
 Key: LUCENE-6066
 URL: https://issues.apache.org/jira/browse/LUCENE-6066
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/query/scoring
Reporter: Mark Harwood
Priority: Minor
 Fix For: 5.0


It would be useful to be able to remove existing elements from a PriorityQueue. 
The proposal is that a linear scan is performed to find the element being 
removed and then the end element in heap[size] is swapped into this position to 
perform the delete. The method downHeap() is then called to shuffle the 
replacement element back down the array but the existing downHeap method must 
be modified to allow picking up an entry from any point in the array rather 
than always assuming the first element (which is its only current mode of 
operation).

A working javascript model of the proposal with animation is available here: 
http://jsfiddle.net/grcmquf2/22/ 

In tests the modified version of "downHeap" produces the same results as the 
existing impl but adds the ability to push down from any point.
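A minimal, self-contained model of the proposal (illustrative, not the patch): remove() performs a linear scan, swaps the end element into the vacated slot, and re-heapifies from that point. One subtlety the sketch guards against: in the general case the swapped-in element may need to move up as well as down, so order is restored in both directions.

```java
import java.util.Arrays;

// A min-heap over ints with the proposed remove-by-element operation.
class MinHeap {
    private int[] heap = new int[16];
    private int size;

    void add(int v) {
        if (size == heap.length) {
            heap = Arrays.copyOf(heap, size * 2);
        }
        heap[size++] = v;
        upHeap(size - 1);
    }

    int top() { return heap[0]; }

    int size() { return size; }

    boolean remove(int v) {
        for (int i = 0; i < size; i++) {       // linear scan for the element
            if (heap[i] == v) {
                heap[i] = heap[--size];        // swap the end element into the hole
                if (i < size) {
                    downHeap(i);               // push the replacement down from this point...
                    upHeap(i);                 // ...or up, if it now beats its parent
                }
                return true;
            }
        }
        return false;
    }

    private void upHeap(int i) {
        while (i > 0 && heap[i] < heap[(i - 1) / 2]) {
            swap(i, (i - 1) / 2);
            i = (i - 1) / 2;
        }
    }

    private void downHeap(int i) {
        while (true) {
            int l = 2 * i + 1, r = l + 1, smallest = i;
            if (l < size && heap[l] < heap[smallest]) smallest = l;
            if (r < size && heap[r] < heap[smallest]) smallest = r;
            if (smallest == i) return;
            swap(i, smallest);
            i = smallest;
        }
    }

    private void swap(int a, int b) {
        int t = heap[a];
        heap[a] = heap[b];
        heap[b] = t;
    }
}
```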

An example use case that requires remove is where a client doesn't want more 
than N matches for any given key (e.g. no more than 5 products from any one 
retailer in a marketplace). In these circumstances a document that was 
previously thought of as competitive has to be removed from the final PQ and 
replaced with another doc (eg a retailer who already has 5 matches in the PQ 
receives a 6th match which is better than his previous ones). This particular 
process is managed by a special "DiversifyingPriorityQueue" which wraps the 
main PriorityQueue and could be contributed as part of another issue if there 
is interest in that. 






[jira] [Updated] (LUCENE-725) NovelAnalyzer - wraps your choice of Lucene Analyzer and filters out all "boilerplate" text

2013-07-24 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-725:


Attachment: NovelAnalyzer.java

Updated to work with Lucene 4 APIs. 

> NovelAnalyzer - wraps your choice of Lucene Analyzer and filters out all 
> "boilerplate" text
> ---
>
> Key: LUCENE-725
> URL: https://issues.apache.org/jira/browse/LUCENE-725
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: Mark Harwood
>Assignee: Otis Gospodnetic
>Priority: Minor
> Attachments: NovelAnalyzer.java, NovelAnalyzer.java, 
> NovelAnalyzer.java, NovelAnalyzer.java
>
>
> This is a class I have found to be useful for analyzing small (in the 
> hundreds) collections of documents and  removing any duplicate content such 
> as standard disclaimers or repeated text in an exchange of  emails.
> This has applications in sampling query results to identify key phrases, 
> improving speed-reading of results with similar content (eg email 
> threads/forum messages) or just removing duplicated noise from a search index.
> To be more generally useful it needs to scale to millions of documents - in 
> which case an alternative implementation is required. See the notes in the 
> Javadocs for this class for more discussion on this

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (LUCENE-4866) Lucene corruption

2013-03-21 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13608826#comment-13608826
 ] 

Mark Harwood commented on LUCENE-4866:
--

The fact that the missing file looks to be held on a shared drive might also be 
significant if there is >1 Lucene process configured to access the same 
directory ...

> Lucene corruption
> -
>
> Key: LUCENE-4866
> URL: https://issues.apache.org/jira/browse/LUCENE-4866
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Affects Versions: 3.5
> Environment: Amazone tomcat cluster with NTFS. 
>Reporter: sachin
>Priority: Blocker
>
> Hi all,
> We know that lucene index gets corrupted. in our case they are corrupting 
> again and again due to this production is incosistent. followiing errors are 
> observed. Any help will be helpful.
> org.hibernate.search.SearchException: Unable to reopen IndexReader
> at 
> org.hibernate.search.indexes.impl.SharingBufferReaderProvider$PerDirectoryLatestReader.refreshAndGet(SharingBufferReaderProvider.java:230)
> at 
> org.hibernate.search.indexes.impl.SharingBufferReaderProvider.openIndexReader(SharingBufferReaderProvider.java:73)
> at 
> org.hibernate.search.reader.impl.MultiReaderFactory.openReader(MultiReaderFactory.java:49)
> at 
> org.hibernate.search.query.engine.impl.HSQueryImpl.buildSearcher(HSQueryImpl.java:596)
> at 
> org.hibernate.search.query.engine.impl.HSQueryImpl.buildSearcher(HSQueryImpl.java:495)
> at 
> org.hibernate.search.query.engine.impl.HSQueryImpl.queryEntityInfos(HSQueryImpl.java:239)
> at 
> org.hibernate.search.query.hibernate.impl.FullTextQueryImpl.list(FullTextQueryImpl.java:209)
> at 
> com.lifetech.ngs.dataaccess.spring.util.SearchUtil.returnProjectionData(SearchUtil.java:646)
> at 
> com.lifetech.ngs.dataaccess.spring.util.SearchUtil.getSinglePropertyOnlyUsingSearch(SearchUtil.java:556)
> at 
> com.lifetech.ngs.dataaccess.spring.util.SearchUtil$$FastClassByCGLIB$$568d5972.invoke()
> at net.sf.cglib.proxy.MethodProxy.invoke(MethodProxy.java:191)
> at 
> org.springframework.aop.framework.Cglib2AopProxy$CglibMethodInvocation.invokeJoinpoint(Cglib2AopProxy.java:689)
> at 
> org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:150)
> at 
> org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:110)
> at 
> org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
> at 
> org.springframework.aop.framework.Cglib2AopProxy$DynamicAdvisedInterceptor.intercept(Cglib2AopProxy.java:622)
> at 
> com.lifetech.ngs.dataaccess.spring.util.SearchUtil$$EnhancerByCGLIB$$47fb00d0.getSinglePropertyOnlyUsingSearch()
> at 
> com.lifetech.ngs.server.impl.SampleManagerImpl.getNameSearchResult(SampleManagerImpl.java:2436)
> at 
> com.lifetech.ngs.server.impl.SampleManagerImpl$$FastClassByCGLIB$$17af181d.invoke()
> at net.sf.cglib.proxy.MethodProxy.invoke(MethodProxy.java:191)
> at 
> org.springframework.aop.framework.Cglib2AopProxy$CglibMethodInvocation.invokeJoinpoint(Cglib2AopProxy.java:689)
> at 
> org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:150)
> at 
> org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:110)
> at 
> org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
> at 
> org.springframework.aop.framework.Cglib2AopProxy$DynamicAdvisedInterceptor.intercept(Cglib2AopProxy.java:622)
> at 
> com.lifetech.ngs.server.impl.SampleManagerImpl$$EnhancerByCGLIB$$75b745f9.getNameSearchResult()
> at 
> com.lifetech.ngs.webui.mgc.widgets.sample.SearchSamplesView.populateData(SearchSamplesView.java:635)
> at 
> com.lifetech.ngs.webui.customcomponents.IRAutoComplete.changeVariables(IRAutoComplete.java:39)
> at 
> com.vaadin.terminal.gwt.server.AbstractCommunicationManager.changeVariables(AbstractCommunicationManager.java:1445)
> at 
> com.vaadin.terminal.gwt.server.AbstractCommunicationManager.handleVariableBurst(AbstractCommunicationManager.java:1393)
> at 
> com.lifetech.ngs.webui.main.SpringVaadinServlet$1.handleVariableBurst(SpringVaadinServlet.java:57)
> at 
> com.vaadin.termi

Re: New Lucene features and Solr indexes

2013-02-13 Thread mark harwood
>>should be a stupid simple postings format like any other postings format with 
>>a default configuration

It does have a default config. It just needs a PF delegate in the constructor 
just like Pulsing
Like Rob said:
>>In other words, it should work just like pulsing.


So far so good.

Now where people are getting upset (for no particularly good reason in my view) 
around per-field stuff:  if you really, really want to you can supply a 
subclass of BloomFilterFactory to your BloomPF constructor which allows 
customised control over choice of hashing algo, bitset sizing and saturation 
policies if the DefaultBloomFilterFactory fails to make the right choices.  
99.9% of people will not do this. The reason it is a factory object and not 
some dumb settings is that it is called on a per-segment basis with state info 
that is useful context in making sizing choices.  Now, (horror of horrors), the 
factory's API is passed a FieldInfo object in the method designed to produce a 
bitset. It is conceivable that some rogue agents could choose to implement some 
per-field decisions here if the same BloomPF instance was registered to handle 
>1 field. In addition, BloomPF has some common-sense defensive coding that 
checks if the factory returns null for the bitset - in which case it delegates all calls un-bloomed directly to 
the delegate codec. 

None of this prevents the use of BloomPF with the prescribed PerFieldPF manner 
for handling field-specific choices.

I happen to use a custom BloomFilterFactory to implement a more efficient 
indexing pipeline than the prescribed PerFieldPF route of implementing all 
per-field policies "up high" in the stack -  but none of that is at the cost of 
a clean BloomPF API or with any unnecessary duplication of PerFieldPF logic. 

If anything needs changing here there may be a case for providing a convenience 
class that weds BloomPF and a default choice of Lucene40 codec so it can help 
with whatever Solr and other config-driven engines may need ie  zero arg 
constructors if that's how their registry of codecs works.
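The kind of per-segment sizing decision such a factory can make might be sketched like this (the heuristic, thresholds and names are assumptions for illustration - this is not what the default factory actually does):

```java
// Illustrative heuristic only. Given the expected number of unique keys in
// a segment, pick the smallest power-of-two bitset whose expected
// saturation (keys/bits, a crude proxy) stays under the target; return -1
// to signal "don't bloom this segment at all".
final class BloomSizing {
    static int chooseBitsetSize(int expectedKeys, double maxSaturation, int maxAcceptableSize) {
        int size = 1024;
        while (size < maxAcceptableSize && (double) expectedKeys / size > maxSaturation) {
            size <<= 1; // double until the saturation target is met or the cap is reached
        }
        return (double) expectedKeys / size > maxSaturation ? -1 : size;
    }
}
```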

Cheers
Mark













 From: Uwe Schindler 
To: dev@lucene.apache.org 
Sent: Wednesday, 13 February 2013, 16:47
Subject: RE: New Lucene features and Solr indexes
 
Hi Shawn,

I was arguing also at the time when this was committed. I fully agree with 
Robert, the current API is not in a good shape!
I have the same feeling: Bloom Postings should be a stupid simple postings 
format like any other postings format with a default configuration. If you 
really want to change its configuration, you can subclass it as a separate 
postings format.

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


> -Original Message-
> From: Shawn Heisey [mailto:s...@elyograg.org]
> Sent: Wednesday, February 13, 2013 3:59 PM
> To: dev@lucene.apache.org
> Subject: Re: New Lucene features and Solr indexes
> 
> >> BloomFilterPostingsFormat is a little special compared to other
> >> postings formats because it can wrap any postings format. So maybe it
> >> should require special support, like an additional attribute in the
> >> field type definition?
> >
> > -1
> >
> > Instead of making other APIs to accomodate BloomFilter's current
> > brokenness: remove its custom per-field logic so it works with
> > PerFieldPostingsFormat, like every other PF.
> >
> > In other words, it should work just like pulsing.
> >
> > I brought this up before it was committed, and i was ignored. Thats
> > fine, but I'll be damned if i let its incorrect design complicate
> > other parts of the codebase too. I'd rather it continue to stay
> > difficult to integrate and continue walking its current path to an
> > open source death instead.
> 
> Robert,
> 
> I have to send you a general thank you for your dedication to the quality of
> this project, and for your amazing ability to seemingly keep the entire design
> for Lucene in your head at all times.
> 
> I'm not sure what exactly you want to die here, or what you think would be
> the best option for me, the Solr end-user.  Is BloomFilter something that's
> not worth pursuing, or would you just like it to be integrated in a different
> way?
> 
> Thanks,
> Shawn
> 
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional
> commands, e-mail: dev-h...@lucene.apache.org



Re: New Lucene features and Solr indexes

2013-02-13 Thread mark harwood
>>Instead of making other APIs to accommodate BloomFilter's current
>>brokenness: remove its custom per-field logic so it works with
>>PerFieldPostingsFormat, like every other PF.



Not looked at it in a while but I'm pretty certain, like every other PF, you 
can go ahead and use PerFieldPF with Bloom filter just fine.

What was broken was (is?) that in this configuration PFPF isn't smart enough to 
avoid creating twice as many files as is required - see LUCENE-4093.
Until that is resolved (and I have noted my pessimism about that being fixed 
easily) BloomPF contains an optimisation for those that want to avoid this 
inefficiency.
The use of that optimisation is entirely optional for users.
Internally to BloomPF, the implementation of that optimisation is trivial  - if 
a null bloom set is returned for a given field it ignores the usual bloom 
filtering logic and delegates directly to the wrapped codec. 
You can choose to implement a BloomFilterFactory that adds this field-choice 
optimisation or, more simply run the default PerFieldPF-managed configuration 
and live with the increased numbers of files.

Arguably, the inefficiencies of the PerFieldPF framework are the real issue to 
be addressed here.

>>I brought this up before it was committed, and i was ignored

You stopped engaging in the debate when I outlined the 3 proposed options for 
moving BloomPF forward :  http://goo.gl/mxtP9
Those options were:
1) ignore the inefficiencies in PFPF
2) sort out the issues in PFPF (4093 but probably a more complex solution)
3) work around existing PFPF issues with a simple but entirely optional 
optimisation to BloomPF

I opted for 3) and gave notice that I'd take it out if anyone objected. 
I don't think there's been any movement on 2) so I guess you're still happy 
with option 1)? I recall you didn't think the business of extra files was that 
much of a concern: http://goo.gl/eJWo3


(Incidentally, probably best following up on the relevant Jiras rather than 
here)

Cheers
Mark




 From: Robert Muir 
To: dev@lucene.apache.org 
Sent: Wednesday, 13 February 2013, 13:01
Subject: Re: New Lucene features and Solr indexes
 
On Wed, Feb 13, 2013 at 4:42 AM, Adrien Grand  wrote:
> Hi Shawn,
>
> On Tue, Feb 12, 2013 at 8:58 PM, Shawn Heisey  wrote:
>> Some of these, like compressed stored fields and compressed termvectors, are
>> being turned on by default, which is awesome.  I'm already running a 4.2
>> snapshot, so I've got those in place.
>
> Excellent!
>
>> One thing that I know I would like to do is use the new BloomFilter for a
>> couple of my fields that contain only unique values.  Last time I checked
>> (which was before the 4.1 release), if you added the lucene-codecs jar, Solr
>> had a BloomFilter postings format, but didn't have any way to specify the
>> underlying format.  See SOLR-3950 and LUCENE-4394.
>
> BloomFilterPostingsFormat is a little special compared to other
> postings formats because it can wrap any postings format. So maybe it
> should require special support, like an additional attribute in the
> field type definition?

-1

Instead of making other APIs to accommodate BloomFilter's current
brokenness: remove its custom per-field logic so it works with
PerFieldPostingsFormat, like every other PF.

In other words, it should work just like pulsing.

I brought this up before it was committed, and i was ignored. Thats
fine, but I'll be damned if i let its incorrect design complicate
other parts of the codebase too. I'd rather it continue to stay
difficult to integrate and continue walking its current path to an
open source death instead.


[jira] [Commented] (LUCENE-4768) Child Traversable To Parent Block Join Query

2013-02-11 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13575864#comment-13575864
 ] 

Mark Harwood commented on LUCENE-4768:
--

OK - this problem seems to be about an ill-defined user query ("Saturn sky blue 
Sedan" with no explicit fields) being executed against a well-defined schema 
(cars with manufacturers, model names and bodyStyles that also have trims with 
colours).

If that's the case you have a heap of problems here which aren't necessarily 
related to the "block join" implementation. One example - IDF ranking being 
what it is, if a manufacturer like Ford creates a model called the "Blue" or you 
have bad data entry that has an example of this value stored in the wrong field 
then Lucene will naturally rank model:blue higher than color:blue because of 
the scarcity of the token "blue" in that field context. That's almost the 
inverse of what you want.

A couple of suggestions for "field-less" queries like your example of "Saturn 
sky blue sedan"
1) Target the query on an unstructured "onebox" field that holds indexed 
content from all fields to achieve a more balanced IDF score.
2) Tokenize each item in the query string and find a "most likely" field for 
each search term by examining doc frequencies e.g. color:blue vs modelName:blue 
etc. Augment the "onebox" query in 1) with the most-likely-field interpretation 
for each word in the query string if it has sufficient doc frequency.
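Suggestion 2) can be sketched as follows (illustrative: the doc frequencies come from a plain map here, whereas in Lucene they would come from IndexReader.docFreq(new Term(field, word))):

```java
import java.util.Map;

// For each word of a field-less query, pick the field where the term has
// the highest document frequency, i.e. where it is most plausibly at home.
final class LikelyFieldGuesser {
    private final Map<String, Map<String, Integer>> docFreqs; // field -> term -> doc frequency

    LikelyFieldGuesser(Map<String, Map<String, Integer>> docFreqs) {
        this.docFreqs = docFreqs;
    }

    /** Returns the field with the highest doc frequency for this word, or null if unseen. */
    String mostLikelyField(String word) {
        String best = null;
        int bestDf = 0;
        for (Map.Entry<String, Map<String, Integer>> field : docFreqs.entrySet()) {
            int df = field.getValue().getOrDefault(word, 0);
            if (df > bestDf) {
                bestDf = df;
                best = field.getKey();
            }
        }
        return best;
    }
}
```

Because "blue" is common in the color field but rare in modelName, the guesser maps it to color - the opposite of what raw IDF scoring would favour.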






> Child Traversable To Parent Block Join Query
> 
>
> Key: LUCENE-4768
> URL: https://issues.apache.org/jira/browse/LUCENE-4768
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/query/scoring
> Environment: trunk
> git rev-parse HEAD
> 5cc88eaa41eb66236a0d4203cc81f1eed97c9a41
>Reporter: Vadim Kirilchuk
> Attachments: LUCENE-4768-draft.patch
>
>
> Hi everyone!
> Let me describe what I am trying to do:
> I have hierarchical documents ('car model' as parent, 'trim' as child) and 
> use block join queries to retrieve them. However, I am not happy with the current 
> behavior of ToParentBlockJoinQuery, which goes through all of a parent's children 
> during the nextDoc call (accumulating scores and freqs).
> Consider the following example: you have a query with a custom post-condition 
> on top of such a BJQ, and during the post-condition you traverse the scorer tree 
> (doc-at-a-time) and want to manually advance the child scorers of the BJQ one by one until 
> the condition passes or the current parent has no more children.
> I am attaching a patch (with some tests) for a query similar to 
> ToParentBlockJoin but with the ability to traverse children. (I have to do a weird 
> instanceof check and cast inside my code.) This is a draft only, and I will be 
> glad to hear whether anyone needs it, or how we can improve it. 
> P.S. I believe the proposed query is more generic (lower level) than 
> ToParentBJQ; ToParentBJQ could extend it and call nextChild() 
> internally during nextDoc().
> Also, I think the problem of traversing hierarchical documents is more 
> complex, as Lucene has only a nextDoc API. What do you think about making the API 
> more hierarchy-aware? A one-level document is a special case of a multi-level 
> document, but not vice versa. WDYT?
> Thanks in advance.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4768) Child Traversable To Parent Block Join Query

2013-02-11 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13575825#comment-13575825
 ] 

Mark Harwood commented on LUCENE-4768:
--

Still not sure what problem you are trying to solve. 
bq. I need to know field and text for each matched leaf scorer 

Why? For scoring purposes? ToParentBJQ has a configurable ScoreMode to control 
if you want the max, avg or sum of the child matches rolled into the combined 
parent score. Is that insufficient control for your needs?
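The roll-up of child scores into the parent score can be sketched as follows. This is a Python illustration of the max/avg/sum idea behind the join module's ScoreMode, not Lucene's actual implementation:

```python
def parent_score(child_scores, mode):
    """Roll matching-child scores into one parent score.

    `mode` mirrors the intent of the Max/Avg/Total options the comment
    describes; the function itself is an illustrative sketch.
    """
    if not child_scores:
        return 0.0
    if mode == "max":
        return max(child_scores)
    if mode == "avg":
        return sum(child_scores) / len(child_scores)
    if mode == "total":
        return sum(child_scores)
    raise ValueError(f"unknown mode: {mode}")

assert parent_score([1.0, 3.0, 2.0], "max") == 3.0
assert parent_score([1.0, 3.0, 2.0], "avg") == 2.0
```

The point of the question above is that if max/avg/sum covers your scoring needs, you do not need per-child traversal at all.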





[jira] [Commented] (LUCENE-4768) Child Traversable To Parent Block Join Query

2013-02-11 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13575740#comment-13575740
 ] 

Mark Harwood commented on LUCENE-4768:
--

As with any discussion about nested queries, you need to be very clear about the 
required logic. When you talk about matching f1:A or f1:B - are we talking 
about matches on the same child doc, or possibly matches on different child docs 
of the same parent? The examples don't make this clear.
If we assume your child-based criteria are focused on examining the contents of 
a single child (as opposed to combining f1:A on one child doc with f1:B on a 
different child doc), then a BooleanQuery that combines these child query 
elements is already sufficient for skipping through children.

Not really sure what you are trying to optimize with skipping anyway - 
parent-child combos are limited to what fits into a single segment, which is in 
turn limited by RAM. You don't generally get parents with "many many" children 
because of these constraints. The "nextDoc" calls you are trying to skip relate 
to a compressed block of child doc IDs (gap-encoded varints) that is 
read off disk in 1K chunks (if I recall the default Directory settings correctly). 
The chances are high that the limited number of child doc IDs belonging to 
each parent are already in RAM as part of normal disk access patterns, so there 
is no real saving in disk IO. Are you sure this is a performance bottleneck?
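The "gap encoded varints" mentioned above can be sketched as follows. This is an illustrative encoder, not Lucene's actual on-disk postings format: sorted doc IDs are delta-encoded, and each small gap then fits in a single base-128 varint byte, which is why a parent's children tend to sit together in one compressed block.

```python
def write_varint(out, value):
    """Append a non-negative int to `out` as a little-endian base-128 varint."""
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        value >>= 7
    out.append(value)

def encode_doc_ids(doc_ids):
    """Gap-encode a sorted list of doc IDs, then varint-encode each gap."""
    out = bytearray()
    prev = 0
    for doc in doc_ids:
        write_varint(out, doc - prev)
        prev = doc
    return bytes(out)

# Four consecutive child doc IDs compress to four bytes: gap 100, then 1, 1, 1.
assert encode_doc_ids([100, 101, 102, 103]) == bytes([100, 1, 1, 1])
```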







[jira] [Commented] (SOLR-3950) Attempting postings="BloomFilter" results in UnsupportedOperationException

2012-10-16 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477036#comment-13477036
 ] 

Mark Harwood commented on SOLR-3950:


bq. If there is some schema config that will tell Solr to do the right thing, 
please let me know.

Right now BloomPF is like an abstract class - you need to fill in the blanks as 
to which delegate it will use before you can use it at write time.
I think we have 3 options:

1) Solr (or you) provide a new PF impl that weds BloomPF with a choice of PF, 
e.g. Lucene40PF, so you would have a zero-arg-constructor class named something 
like BloomLucene40PF, or...
2) Solr extends the config file format to provide a generic means of assembling 
"wrapper" PFs like Bloom in their config, e.g.:
   postingsFormat="BloomFilter" delegatePostingsFormat="FooPF" 
   and Solr then does reflection magic to call constructors appropriately, or...
3) Core Lucene is changed so that BloomPF is wedded to a default PF (e.g. 
Lucene40PF) if users, e.g. Solr, fail to nominate a choice of delegate for 
BloomPF.

Of these, 1) feels like "the right thing".

Cheers
Mark
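Option 1, a zero-arg class that weds the wrapper to a concrete delegate so that name-based frameworks can instantiate it, can be sketched as follows. The class and method names below are a Python illustration of the pattern, not Lucene's actual PostingsFormat API:

```python
class PostingsFormat:
    """Stand-in for a postings format identified by name."""
    def __init__(self, name):
        self.name = name

class BloomFilteringPostingsFormat(PostingsFormat):
    """Wrapper PF: adds a Bloom filter layer on top of a delegate PF."""
    def __init__(self, delegate=None):
        super().__init__("BloomFilter")
        self.delegate = delegate

    def fields_consumer(self):
        # Without a delegate there is nothing to write through to,
        # mirroring the UnsupportedOperationException in the bug report.
        if self.delegate is None:
            raise NotImplementedError(
                "constructed without a choice of PostingsFormat")
        return f"bloom({self.delegate.name})"

class BloomLucene40PostingsFormat(BloomFilteringPostingsFormat):
    """Option 1: zero-arg constructor hard-wires a concrete delegate,
    so config-by-name frameworks (e.g. Solr's schema) can instantiate it."""
    def __init__(self):
        super().__init__(delegate=PostingsFormat("Lucene40"))

assert BloomLucene40PostingsFormat().fields_consumer() == "bloom(Lucene40)"
```

The bare zero-arg wrapper stays reserved for read-time SPI loading; only the wedded subclass is safe to name in write-time configuration.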

> Attempting postings="BloomFilter" results in UnsupportedOperationException
> --
>
> Key: SOLR-3950
> URL: https://issues.apache.org/jira/browse/SOLR-3950
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 4.1
> Environment: Linux bigindy5 2.6.32-279.9.1.el6.centos.plus.x86_64 #1 
> SMP Wed Sep 26 03:52:55 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
> [root@bigindy5 ~]# java -version
> java version "1.7.0_07"
> Java(TM) SE Runtime Environment (build 1.7.0_07-b10)
> Java HotSpot(TM) 64-Bit Server VM (build 23.3-b01, mixed mode)
>Reporter: Shawn Heisey
> Fix For: 4.1
>
>
> Tested on branch_4x, checked out after BlockPostingsFormat was made the 
> default by LUCENE-4446.
> I used 'ant generate-maven-artifacts' to create the lucene-codecs jar, and 
> copied it into my sharedLib directory.  When I subsequently tried 
> postings="BloomFilter" I got the following exception in the log:
> {code}
> Oct 15, 2012 11:14:02 AM org.apache.solr.common.SolrException log
> SEVERE: java.lang.UnsupportedOperationException: Error - 
> org.apache.lucene.codecs.bloom.BloomFilteringPostingsFormat has been 
> constructed without a choice of PostingsFormat
> {code}




[jira] [Commented] (SOLR-3950) Attempting postings="BloomFilter" results in UnsupportedOperationException

2012-10-16 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13476854#comment-13476854
 ] 

Mark Harwood commented on SOLR-3950:


BloomFilterPostingsFormat is designed to wrap another choice of PostingsFormat, 
adding ".blm" files to the files created by the chosen delegate.

However, your code has instantiated a BloomFilterPostingsFormat without passing 
a choice of delegate - presumably using the zero-arg constructor. 
The comments in the code for this zero-arg constructor state:

  // Used only by core Lucene at read-time via Service Provider instantiation -
  // do not use at Write-time in application code.








[jira] [Commented] (LUCENE-3772) Highlighter needs the whole text in memory to work

2012-10-15 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13476044#comment-13476044
 ] 

Mark Harwood commented on LUCENE-3772:
--

For bigger-than-memory docs, is it not possible to use nested documents to 
represent subsections (e.g. a child doc for each of the chapters in a book) and 
then use BlockJoinQuery to select the best child docs?
Highlighting can then be run on a more manageable subset of the original 
content, and Lucene's ranking algorithms are used to select the best "fragment" 
rather than the highlighter's own attempt to reproduce this logic.

Obviously this depends on the shape of your content/queries, but books-and-chapters 
is probably a good fit for this approach.
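The books-and-chapters layout can be sketched as an indexing-time document block: child docs (one per chapter) followed by the parent, which is the contiguous layout block join queries rely on. The field names below are hypothetical, purely for illustration:

```python
def to_doc_block(book_id, chapters):
    """Build a block-join-style document block for one book.

    Child docs (one per chapter) come first, followed by the parent doc,
    matching the child-then-parent ordering block joins expect.
    Field names ('type', 'book', 'body', 'id') are illustrative only.
    """
    block = [{"type": "chapter", "book": book_id, "body": ch}
             for ch in chapters]
    block.append({"type": "book", "id": book_id})
    return block

block = to_doc_block("moby-dick", ["Call me Ishmael...", "The Carpet-Bag..."])
assert block[-1]["type"] == "book"   # parent is last in the block
assert len(block) == 3               # 2 chapters + 1 parent
```

A query then selects the best-ranked chapter docs, and the highlighter only ever sees one chapter's worth of text at a time.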

> Highlighter needs the whole text in memory to work
> --
>
> Key: LUCENE-3772
> URL: https://issues.apache.org/jira/browse/LUCENE-3772
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/highlighter
>Affects Versions: 3.5
> Environment: Windows 7 Enterprise x64, JRE 1.6.0_25
>Reporter: Luis Filipe Nassif
>  Labels: highlighter, improvement, memory
>
> Highlighter methods getBestFragment(s) and getBestTextFragments only accept a 
> String object representing the whole text to highlight. When dealing with 
> very large docs simultaneously, it can lead to heap consumption problems. It 
> would be better if the API could accept a Reader objetct additionally, like 
> Lucene Document Fields do.




[jira] [Commented] (LUCENE-4369) StringFields name is unintuitive and not helpful

2012-09-11 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13452914#comment-13452914
 ] 

Mark Harwood commented on LUCENE-4369:
--

Agreed on the need for a change - names are important.

I have a problem with using "match" on its own because the word is often 
associated with partial matching e.g. "best match" or "fuzzy match".
A quick Google search suggests "match" has more connotations of fuzziness than 
exactness - there are 162m results for "best match" vs only 45m results for 
"exact match".

So how about "ExactMatchField"?




> StringFields name is unintuitive and not helpful
> 
>
> Key: LUCENE-4369
> URL: https://issues.apache.org/jira/browse/LUCENE-4369
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Robert Muir
> Attachments: LUCENE-4369.patch
>
>
> There's a huge difference between TextField and StringField, StringField 
> screws up scoring and bypasses your Analyzer.
> (see java-user thread "Custom Analyzer Not Called When Indexing" as an 
> example.)
> The name we use here is vital, otherwise people will get bad results.
> I think we should rename StringField to MatchOnlyField.




[jira] [Commented] (LUCENE-4369) StringFields name is unintuitive and not helpful

2012-09-11 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13452900#comment-13452900
 ] 

Mark Harwood commented on LUCENE-4369:
--

SingleTermField ?

Not sure "matching vs searching" is a commonly understood differentiation.




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters

2012-08-13 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13433045#comment-13433045
 ] 

Mark Harwood commented on LUCENE-4069:
--

bq. Removing misleading 2X perf gain: it seems to depend heavily on the exact 
use case.

Fair enough - the original patch targeted Lucene 3.6, which benefited more 
from this technique. The issue then morphed into a 4.x patch, where 
performance gains were harder to find. 
I think the sweet spot is primary-key searches on indexes with ongoing heavy 
changes (more segment fragmentation, less OS-level caching?). This is the use 
case I am currently targeting, and my final tests using our primary-key-counting 
test rig saw a 10 to 15% improvement over Pulsing.

bq. I'm asking because I need this feature but I'm stuck with 3.x for a while. 

I have a client in a similar situation who are contemplating using the 3.6 
patch.

bq. Is there bugs which should be fixed in initial 3.6 patch? 

It has been a while since I looked at it - a quick run of "ant test" on my copy 
here showed no errors. I will give it a closer review if my client decides 
to go down this route and can post any fixes here.
I expect that if you use the patch and get into trouble, you can use an unpatched 
version of 3.6 to read the same index files (it should just ignore the extra 
".blm" files created by the patched version).


> Segment-level Bloom filters
> ---
>
> Key: LUCENE-4069
> URL: https://issues.apache.org/jira/browse/LUCENE-4069
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>    Affects Versions: 3.6, 4.0-ALPHA
>    Reporter: Mark Harwood
>Assignee: Mark Harwood
>Priority: Minor
> Fix For: 4.0-BETA, 5.0
>
> Attachments: 4069Failure.zip, BloomFilterPostingsBranch4x.patch, 
> LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, 
> MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, 
> PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, 
> PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java
>
>
> An addition to each segment which stores a Bloom filter for selected fields 
> in order to give fast-fail to term searches, helping avoid wasted disk access.
> Best suited for low-frequency fields e.g. primary keys on big indexes with 
> many segments but also speeds up general searching in my tests.
> Overview slideshow here: 
> http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
> Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
> Patch based on 3.6 codebase attached.
> There are no 3.6 API changes currently - to play just add a field with "_blm" 
> on the end of the name to invoke special indexing/querying capability. 
> Clearly a new Field or schema declaration(!) would need adding to APIs to 
> configure the service properly.
> Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat
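The fast-fail idea behind the ".blm" files can be illustrated with a minimal Bloom filter. This is a Python sketch of the technique only; Lucene's actual implementation (bit sizing, hashing) differs:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter illustrating fast-fail term lookup: a definite
    'no' avoids touching the terms dictionary on disk; a 'maybe' falls
    through to the real lookup."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bitset stored as one big int

    def _positions(self, term):
        # Derive num_hashes bit positions from salted digests of the term.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{term}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, term):
        for p in self._positions(term):
            self.bits |= 1 << p

    def might_contain(self, term):
        # False => term definitely absent (fast fail); True => check for real.
        return all((self.bits >> p) & 1 for p in self._positions(term))

blm = BloomFilter()
for pk in ("doc-001", "doc-002"):
    blm.add(pk)
assert blm.might_contain("doc-001")  # indexed terms always pass
# absent terms return False with high probability, skipping the disk seek
```

This is why the structure suits low-frequency fields like primary keys: most lookups on a given segment miss, and each miss is answered from the in-memory filter.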

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira






[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-08-02 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Fix Version/s: 5.0

Applied to trunk in revision 1368567

> Segment-level Bloom filters for a 2 x speed up on rare term searches
> 
>
> Key: LUCENE-4069
> URL: https://issues.apache.org/jira/browse/LUCENE-4069
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 3.6, 4.0-ALPHA
>Reporter: Mark Harwood
>    Assignee: Mark Harwood
>Priority: Minor
> Fix For: 4.0, 5.0
>
> Attachments: 4069Failure.zip, BloomFilterPostingsBranch4x.patch, 
> LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, 
> MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, 
> PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, 
> PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java
>
>



[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-08-02 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13427322#comment-13427322
 ] 

Mark Harwood commented on LUCENE-4069:
--

Will do.

> Segment-level Bloom filters for a 2 x speed up on rare term searches
> 
>
> Key: LUCENE-4069
> URL: https://issues.apache.org/jira/browse/LUCENE-4069
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 3.6, 4.0-ALPHA
>Reporter: Mark Harwood
>Assignee: Mark Harwood
>Priority: Minor
> Fix For: 4.0
>
> Attachments: 4069Failure.zip, BloomFilterPostingsBranch4x.patch, 
> LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, 
> MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, 
> PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, 
> PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java
>
>



[jira] [Resolved] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-08-02 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood resolved LUCENE-4069.
--

Resolution: Fixed
  Assignee: Mark Harwood

Committed to 4.0 branch, revision 1368442

> Segment-level Bloom filters for a 2 x speed up on rare term searches
> 
>
> Key: LUCENE-4069
> URL: https://issues.apache.org/jira/browse/LUCENE-4069
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 3.6, 4.0-ALPHA
>Reporter: Mark Harwood
>    Assignee: Mark Harwood
>Priority: Minor
> Fix For: 4.0
>
> Attachments: 4069Failure.zip, BloomFilterPostingsBranch4x.patch, 
> LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, 
> MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, 
> PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, 
> PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java
>
>



[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-08-01 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: BloomFilterPostingsBranch4x.patch

Updated patch to bring it in line with the latest core API changes.
All tests now pass cleanly, so I will commit soon.

> Segment-level Bloom filters for a 2 x speed up on rare term searches
> 
>
> Key: LUCENE-4069
> URL: https://issues.apache.org/jira/browse/LUCENE-4069
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 3.6, 4.0-ALPHA
>Reporter: Mark Harwood
>Priority: Minor
> Fix For: 4.0
>
> Attachments: 4069Failure.zip, BloomFilterPostingsBranch4x.patch, 
> LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, 
> MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, 
> PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, 
> PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java
>
>



[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-08-01 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: (was: BloomFilterPostingsBranch4x.patch)

> Segment-level Bloom filters for a 2 x speed up on rare term searches
> 
>
> Key: LUCENE-4069
> URL: https://issues.apache.org/jira/browse/LUCENE-4069
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 3.6, 4.0-ALPHA
>Reporter: Mark Harwood
>Priority: Minor
> Fix For: 4.0
>
> Attachments: 4069Failure.zip, LUCENE-4069-tryDeleteDocument.patch, 
> LUCENE-4203.patch, MHBloomFilterOn3.6Branch.patch, 
> PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, 
> PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, 
> PrimaryKeyPerfTest40.java
>
>



[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-08-01 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: BloomFilterPostingsBranch4x.patch

Updated with a fix for the issue explored in LUCENE-4275.




[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-08-01 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: (was: BloomFilterPostingsBranch4x.patch)




[jira] [Closed] (LUCENE-4275) Threaded tests with MockDirectoryWrapper delete active PostingFormat files

2012-08-01 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood closed LUCENE-4275.


Resolution: Not A Problem

> Threaded tests with MockDirectoryWrapper delete active PostingFormat files
> --
>
> Key: LUCENE-4275
> URL: https://issues.apache.org/jira/browse/LUCENE-4275
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/codecs, general/test
>Affects Versions: 4.0-ALPHA
> Environment: Win XP 64bit Sun JDK 1.6
>    Reporter: Mark Harwood
> Fix For: 4.0
>
> Attachments: Lucene-4275-TestClass.patch
>
>
> As part of testing Lucene-4069 I have encountered sporadic issues with files 
> going missing. I believe this is a bug in the test framework (multi-threading 
> issues in MockDirectoryWrapper?) so have raised a separate issue with 
> simplified test PostingFormat class here.
> Using this test PF will fail due to a missing file roughly one in four times 
> of executing this test:
> ant test-core  -Dtestcase=TestIndexWriterCommit 
> -Dtests.method=testCommitThreadSafety -Dtests.seed=EA320250471B75AE 
> -Dtests.slow=true -Dtests.postingsformat=TestNonCoreDummyPostingsFormat 
> -Dtests.locale=no -Dtests.timezone=Europe/Belfast -Dtests.file.encoding=UTF-8 




[jira] [Commented] (LUCENE-4275) Threaded tests with MockDirectoryWrapper delete active PostingFormat files

2012-08-01 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13426481#comment-13426481
 ] 

Mark Harwood commented on LUCENE-4275:
--


Nailed it, Mike. Yet another beer I owe you.
I removed the IllegalStateException and it looks like the retry logic is now 
kicking in and all tests pass 

This reliance on throwing a particular exception type feels like an important 
contract to document. Currently the comments in PostingsFormat.fieldsProducer() 
read as follows:

bq.   Reads a segment.  NOTE: by the time this call returns, it must hold open 
any files it will need to use; else, those files may be deleted. 

I propose adding:

bq. Additionally, required files may be deleted during the execution of this 
call before there is a chance to open them. Under these circumstances an 
IOException should be thrown by the implementation. IOExceptions are expected 
and will automatically cause a retry of the segment-opening logic with the 
newly revised segments.

I'll roll that documentation addition into my LUCENE-4069 patch.
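The contract proposed above can be illustrated with a stripped-down, stdlib-only sketch. All names here are hypothetical (the real machinery lives in Lucene's segment-opening code): the producer throws IOException when a required file has vanished, and the caller treats IOException as "the commit point moved, re-list and retry" rather than a hard failure:

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Set;

// Hypothetical sketch of the proposed contract, not Lucene's actual classes.
class SegmentOpenRetry {
    interface FilesView { Set<String> listCommittedFiles(); }
    interface FieldsProducer { /* opened segment-level data */ }

    // A producer must throw IOException (not IllegalStateException) if a
    // required file vanished between listing the segment and opening it.
    static FieldsProducer openProducer(FilesView dir, String blmFile) throws IOException {
        if (!dir.listCommittedFiles().contains(blmFile)) {
            throw new FileNotFoundException("Missing file: " + blmFile);
        }
        return new FieldsProducer() {};
    }

    // The caller interprets IOException as "segments changed under me" and
    // retries against the freshly listed files instead of failing hard.
    static FieldsProducer openWithRetry(FilesView dir, String blmFile, int maxRetries)
            throws IOException {
        IOException last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return openProducer(dir, blmFile);
            } catch (IOException e) {
                last = e;  // re-list the (revised) segments and try again
            }
        }
        throw last;  // retries exhausted: surface the failure
    }
}
```

This is why throwing IllegalStateException defeated the retry logic: only IOException signals the expected "file deleted mid-open" condition.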





[jira] [Commented] (LUCENE-4275) Threaded tests with MockDirectoryWrapper delete active PostingFormat files

2012-07-31 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13425895#comment-13425895
 ] 

Mark Harwood commented on LUCENE-4275:
--

Thanks, Rob. This test requires a call to "ant clean" between runs before it 
will work consistently. However, I don't consider that a fix and assume we are 
still looking for a bug, as there's an index consistency issue lurking 
somewhere. I've tried adding the setting -Dtests.directory=RAMDirectory but 
the test still appears to have some "memory" between runs.

I added some logging of creates and deletes as you suggested. On a second, 
un-cleansed run my PF is being called to open a high-numbered segment which I 
suspect was created by an earlier run, as the logging shows no sign of the PF 
being asked to create content for this (or any other) segment during the 
current run. At this point it fails because the directory no longer lists a 
copy of the "foobar" file.
I have also noticed, in the logs from previous runs, that MDW is asked by 
IndexWriter to delete the segment's "foobar" file as part of compaction into a 
compound CFS.

Hope this sheds some light; I'm finding this a complex one to debug.





[jira] [Updated] (LUCENE-4275) Threaded tests with MockDirectoryWrapper delete active PostingFormat files

2012-07-31 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4275:
-

Attachment: Lucene-4275-TestClass.patch

Attached a simple PostingsFormat used to illustrate cases of files going 
missing in PF tests.




[jira] [Created] (LUCENE-4275) Threaded tests with MockDirectoryWrapper delete active PostingFormat files

2012-07-31 Thread Mark Harwood (JIRA)
Mark Harwood created LUCENE-4275:


 Summary: Threaded tests with MockDirectoryWrapper delete active 
PostingFormat files
 Key: LUCENE-4275
 URL: https://issues.apache.org/jira/browse/LUCENE-4275
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/codecs, general/test
Affects Versions: 4.0-ALPHA
 Environment: Win XP 64bit Sun JDK 1.6
Reporter: Mark Harwood
 Fix For: 4.0


As part of testing LUCENE-4069 I have encountered sporadic issues with files 
going missing. I believe this is a bug in the test framework (multi-threading 
issues in MockDirectoryWrapper?), so I have raised a separate issue here with 
a simplified test PostingsFormat class.
Using this test PF, roughly one in four executions of the following test will 
fail due to a missing file:
ant test-core  -Dtestcase=TestIndexWriterCommit 
-Dtests.method=testCommitThreadSafety -Dtests.seed=EA320250471B75AE 
-Dtests.slow=true -Dtests.postingsformat=TestNonCoreDummyPostingsFormat 
-Dtests.locale=no -Dtests.timezone=Europe/Belfast -Dtests.file.encoding=UTF-8 




[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-20 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: 4069Failure.zip

Attached a log of thread activity showing how 
TestIndexWriterCommit.testCommitThreadSafety() is failing.
At this stage I can't tell whether this is a failure in MockDirectoryWrapper, 
the test, or the BloomPF class, but it is related to files being removed 
unexpectedly.




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-19 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13418411#comment-13418411
 ] 

Mark Harwood commented on LUCENE-4069:
--

bq. I wonder if it has to do w/ only opening the file in the close() method (

Just tried opening the file earlier (in BloomFilteredConsumer constructor) and 
that didn't fix it.
I previously also added an extra Directory.fileExists() sanity check 
immediately after closing the IndexOutput, and all was well, so I think it's 
something happening after that point. Will need to dig deeper.
I'm running on WinXP 64bit, if that is of any significance to MDW's behaviour.





[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-19 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13418314#comment-13418314
 ] 

Mark Harwood commented on LUCENE-4069:
--

One remaining issue before I commit, which has appeared sporadically and is 
consistently reproduced by this test:
ant test  -Dtestcase=TestIndexWriterCommit 
-Dtests.method=testCommitThreadSafety -Dtests.seed=EA320250471B75AE 
-Dtests.slow=true -Dtests.postingsformat=TestBloomFilteredLucene40Postings 
-Dtests.locale=no -Dtests.timezone=Europe/Belfast 
-Dtests.file.encoding=ISO-8859-1

The error it produces is this: 
[junit4:junit4]> Caused by: java.lang.IllegalStateException: Missing 
file:_9_TestBloomFilteredLucene40Postings_0.blm
[junit4:junit4]>at 
org.apache.lucene.codecs.bloom.BloomFilteringPostingsFormat$BloomFilteredFieldsProducer.(BloomFilteringPostingsFormat.java:175)
[junit4:junit4]>at 
org.apache.lucene.codecs.bloom.BloomFilteringPostingsFormat.fieldsProducer(BloomFilteringPostingsFormat.java:156)


MockDirectoryWrapper looks to be randomly deleting files (probably my ".blm" 
file shown above) to simulate the effects of crashes.
Presumably I am doing the "right thing" in always throwing an exception if the 
.blm file is missing? The alternative, silently ignoring the missing file, 
seems undesirable.
If MDW is intended to delete only uncommitted files, I'm not sure how we end 
up in a scenario where BloomPF is being asked to open the uncommitted segment?











[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-17 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13416383#comment-13416383
 ] 

Mark Harwood commented on LUCENE-4069:
--

A quick benchmark suggests the new right-sized bitset, as opposed to the old 
worst-case-scenario-sized one, is buying us a small performance improvement.

bq. I also don't think this PF should be per-field

There was a lengthy discussion earlier on this topic, and the approach 
presented here seems reasonable.
For the average user, the DefaultBloomFilterFactory now has reasonable sizing 
for all fields passed its way (using a heuristic that assumes the number of 
keys to anticipate equals numDocs). Expert users can provide a 
BloomFilterFactory with a custom per-field sizing heuristic and can simply 
return null for non-bloomed fields.

Having a single, carefully configured BloomPF wrapper is preferable because 
you can channel appropriately configured bloom settings to a common PF 
delegate, avoiding the creation of multiple .tii, .tis files etc. PerFieldPF 
isn't smart enough to figure out that differing Bloom choices do not require 
different physical files for the delegated structures.

You don't *have* to use the Per-field stuff in BloomPF but there are benefits 
to be had in doing so which can't otherwise be achieved.
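The per-field factory idea described above can be sketched in plain Java. The names and numbers here are hypothetical illustrations, not the patch's actual BloomFilterFactory API: a policy sizes the filter per field (the default assuming numKeys = numDocs) or opts a field out entirely so no extra per-field files are needed:

```java
// Hypothetical sketch of the per-field sizing idea, not the patch's API.
class PerFieldBloomSizing {
    /** Returns the number of bits to allocate for a field's filter,
     *  or -1 to skip bloom filtering for that field entirely. */
    interface SizingPolicy { long bitsForField(String field, int numDocs); }

    // Default heuristic: assume one unique key per document (numKeys = numDocs)
    // and allocate ~10 bits per key, which gives roughly a 1% false-positive
    // rate with an optimal number of hash functions.
    static final SizingPolicy DEFAULT = (field, numDocs) -> 10L * numDocs;

    // Expert policy: bloom only the primary-key field; everything else
    // returns -1 so no filter (and no extra sizing work) is done for it.
    static SizingPolicy primaryKeyOnly(String pkField) {
        return (field, numDocs) -> field.equals(pkField) ? 10L * numDocs : -1L;
    }
}
```

The design point is that the sizing decision varies per field while the delegate postings format stays shared, so opting fields in or out never multiplies the underlying term-dictionary files.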


bq. Can you add @lucene.experimental to all the new APIs?

Done.




[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-17 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: BloomFilterPostingsBranch4x.patch

New patch using SegmentWriteState to right-size the choice of bitset for the 
volume of content.




[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-17 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: (was: BloomFilterPostingsBranch4x.patch)




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-17 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13416084#comment-13416084
 ] 

Mark Harwood commented on LUCENE-4069:
--

bq. MessageDigest.getInstance(name) should be the way to go

I'm less keen now - a quick scan of the docs around MessageDigest throws up 
some issues:
1) SPI registration of MessageDigest providers looks to get into permissions 
hell as it is closely related to security - see 
http://docs.oracle.com/javase/1.4.2/docs/guide/security/CryptoSpec.html#ProviderInstalling
 which talks about the steps required to approve a trusted "provider".
2) MessageDigest as an interface is designed to stream content to the hashing 
algo across potentially many method calls. MurmurHash2.java is not currently 
written to process content this way; it suits our needs of hashing small blocks 
of content in one hit. 

For these 2 reasons it looks like MessageDigest may be a pain to adopt and the 
existing approach proposed in this patch may be preferable.
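
To make the streaming-vs-one-shot mismatch concrete, here is a minimal Java
sketch (class and method names are mine, not the patch's; the one-shot mixing
is a placeholder, not the real MurmurHash2 algorithm):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustration only: MessageDigest is a streaming API -- content is fed in
// across multiple update() calls before the digest is read -- whereas the
// patch hashes a small block in a single call returning an int.
public class DigestShapes {

  // Streaming style: MessageDigest.getInstance() resolves the algo by name
  // through Java's security-provider (SPI) machinery.
  static byte[] streamingHash(byte[] a, byte[] b) {
    try {
      MessageDigest md = MessageDigest.getInstance("MD5");
      md.update(a); // state accumulates across calls...
      md.update(b);
      return md.digest(); // ...and is only finalised here
    } catch (NoSuchAlgorithmException e) {
      throw new RuntimeException(e);
    }
  }

  // One-shot style: whole block in, int out (placeholder mixing function).
  static int oneShotHash(byte[] data) {
    int h = 0x9747b28c;
    for (byte b : data) {
      h = 31 * h + b;
    }
    return h;
  }

  public static void main(String[] args) {
    // An MD5 digest is always 16 bytes.
    System.out.println(streamingHash("foo".getBytes(), "bar".getBytes()).length);
    System.out.println(Integer.toHexString(oneShotHash("foobar".getBytes())));
  }
}
```

Adapting the second shape to the first is possible but clumsy, which is the
second concern above.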

[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-17 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13416037#comment-13416037
 ] 

Mark Harwood commented on LUCENE-4069:
--

bq.  If a special decoder for foobar is needed, it must be loadable by SPI. 

I think we are in agreement on the broad principles. The fundamental question 
here though is do you want to treat an index's choice of Hash algo as something 
that would require a new SPI-registered PostingsFormat to decode or can that be 
handled as I have done here with a general purpose SPI framework for hashing 
algos? 

Actually, re-thinking this, I suspect rather than creating our own, I can use 
Java's existing SPI framework for hashing in the form of MessageDigest. I'll 
take a closer look into that...



[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-17 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13416007#comment-13416007
 ] 

Mark Harwood commented on LUCENE-4069:
--

bq. At a minimum I think before committing we should make the SegmentWriteState 
accessible.

OK. Will that be the subject of a new Jira?

bq. Hmm why is anonymity at search time important?

It would seem to be an established design principle - see 
https://issues.apache.org/jira/browse/LUCENE-4069#comment-13285726

It would be a pain if user config settings require a custom SPI-registered 
class around just to decode the index contents. There's the resource/classpath 
hell, the chance for misconfiguration and running Luke suddenly gets more 
complex.
The line to be drawn is between what are just config settings (field names, 
memory limits) and what are fundamentally different file formats (e.g. codec 
choices).
The design principle that looks to be adopted is that the former ought to be 
accommodated without the need for custom SPI-registered classes and the latter 
would need to locate an implementation via SPI to decode stored content. Seems 
reasonable.
The choice of hash algo does not fundamentally alter the on-disk format (they 
all produce an int) so I would suggest we treat this as a config setting rather 
than a fundamentally different choice of file format.
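
A minimal sketch of "hash choice as a config setting" (all names here are
hypothetical, not the patch's API): the segment records only a name string,
and at read time the name is resolved against a registry, so no custom
SPI-registered class is needed just to decode the index.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToIntFunction;

// Hypothetical registry mapping a stored hash-algo name back to an impl.
public class HashNameRegistry {
  private static final Map<String, ToIntFunction<byte[]>> IMPLS = new HashMap<>();
  static {
    // Placeholder mixing function; the real code would register MurmurHash2 etc.
    IMPLS.put("MurmurHash2", data -> {
      int h = 0x9747b28c;
      for (byte b : data) h = 31 * h + b;
      return h;
    });
  }

  // What a reader would do after pulling the name string out of the segment:
  // since every impl produces an int, the on-disk format is unchanged.
  static ToIntFunction<byte[]> forName(String name) {
    ToIntFunction<byte[]> impl = IMPLS.get(name);
    if (impl == null) {
      throw new IllegalArgumentException("Unknown hash function: " + name);
    }
    return impl;
  }

  public static void main(String[] args) {
    System.out.println(HashNameRegistry.forName("MurmurHash2").applyAsInt("pk-42".getBytes()));
  }
}
```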







[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-16 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13415362#comment-13415362
 ] 

Mark Harwood commented on LUCENE-4069:
--

bq. It's the unique term count (for this one segment) that you need right? 
Yes, I need it before I start processing the stream of terms being flushed.
 
bq. Seems like LUCENE-4198 needs to solve this same problem.

Another possibly related point on more access to "merge context" - custom 
codecs have a great opportunity at merge time to piggy-back some analysis on 
the data being streamed e.g. to spot "trending" terms whose term frequencies 
differ drastically between the merging source segments. This would require 
access to "source segment" as term postings are streamed to observe the change 
in counts. 

bq. Also, why do we need to use SPI to find the HashFunction? Seems like 
overkill... we don't (yet) have a bunch of hash functions that are vying here 
right?

There's already a MurmurHash3 algo - we're currently using v2 and so could 
anticipate an upgrade at some stage. This patch provides that future proofing.

bq. can't the postings format impl pass in an instance of HashFunction when 
making the FuzzySet

I don't think that is going to work. Currently all PostingFormat impls that 
extend BloomFilterPostingsFormat can be anonymous (i.e. unregistered via SPI). 
All their settings (fields, hash algo, thresholds) etc are recorded at write 
time by the base class in the segment. At read-time it is the 
BloomFilterPostingsFormat base class that is instantiated, not the write-time 
subclass and so we need to store the hash algo choice. We can't rely on the 
original subclass being around and configured appropriately with the original 
write-time choice of hashing function.

I think the current way feels safer over all and also allows other Lucene 
functions to safely record hashes along with a hashname string that can be used 
to reconstitute results. 

bq. Can you move the imports under the copyright header?

Will do



[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-16 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: BloomFilterPostingsBranch4x.patch

Added bloom package.html and changes.txt. I plan to commit in a day or two if 
there are no objections.

[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-16 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: (was: BloomFilterPostingsBranch4x.patch)

[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-10 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13410145#comment-13410145
 ] 

Mark Harwood commented on LUCENE-4069:
--

bq. So now we are close to 1M lookups/sec for a single thread!

Cool!

bq. I wonder if somehow we can do a better job picking the right sized bit 
vector up front? 
bq. You basically need to know up front how many unique terms will be in the 
given field for this segment right?

Yes - the job of anticipating the number of unique keys probably has 2 
different contexts:
1) Net new segments e.g. guessing up front how many docs/keys a user is likely 
to generate in a new segment before the flush settings kick in.
2) Merged segments e.g. guessing how many unique keys survive a merge operation

Estimating key volumes in context 1 is probably hard without some additional 
hints from the end user. Arguably the BloomFilterFactory.getSetForField() 
method already represents where this setting can be controlled.
In context 2 where potentially large merges occur we could look at adding an 
extra method to BloomFilterFactory to handle this different context e.g. 
something like
   FuzzySet getSetForMergeOpOnField(FieldInfo fi, OneMerge mergeContext)
Based on the size of the segments being merged and volumes of deletes a more 
appropriate size of Bloom bitset could be allocated based on a worst-case 
estimate.
Not sure how we get the OneMerge instance fed through the call stack - could 
that be held somewhere on a ThreadLocal as generally useful context?
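
The worst-case estimate for the merge context could look something like this
sketch (hypothetical names, not the proposed getSetForMergeOpOnField API):
every live doc in the merging segments could carry a distinct key, so sum the
live doc counts and size the bitset from that instead of the new-segment
default.

```java
// Hypothetical sizing helpers for a merged segment's Bloom bitset.
public class MergeBloomSizing {

  // maxDocs[i] and delCounts[i] describe the i-th segment being merged.
  static int worstCaseKeyCount(int[] maxDocs, int[] delCounts) {
    int total = 0;
    for (int i = 0; i < maxDocs.length; i++) {
      total += maxDocs[i] - delCounts[i]; // live docs = upper bound on surviving keys
    }
    return total;
  }

  // Round up to a power of two at a target bits-per-key load factor.
  static int bitsetSizeInBits(int keys, int bitsPerKey) {
    int wanted = Math.max(1, keys * bitsPerKey);
    return Integer.highestOneBit(wanted - 1) << 1;
  }

  public static void main(String[] args) {
    int keys = worstCaseKeyCount(new int[] {1000, 500}, new int[] {100, 50});
    System.out.println(keys);                       // 1350 surviving keys at worst
    System.out.println(bitsetSizeInBits(keys, 10)); // 16384-bit filter
  }
}
```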





[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-06 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13408097#comment-13408097
 ] 

Mark Harwood commented on LUCENE-4069:
--

Thanks for the extra tests, Mike. That's tightened performance but that looks 
like a scary amount of code for the optimal solution to this basic incrementing 
operation :)

I've done some more benchmarks with the updated test and the performance 
characteristics are becoming clearer as shown in these results: 
http://goo.gl/dtWSb
Bloom performance is better than Pulsing but the gap narrows with the volume 
of deletes lying around in old segments, caused by updates. In these cases the 
BloomFilter gives a false positive and falls back to the equivalent operations 
of Pulsing. I added a 100MB start size for the BloomFilter for the large-scale 
tests because without it the filter gets saturated and there were occasional 
big spikes in batch times.
So overall there still looks to be a benefit and especially in low-frequency 
update scenarios.
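
The fast-fail path being benchmarked can be sketched as follows (hypothetical
types; the real code lives in the BloomFilterPostingsFormat patch). A Bloom
miss proves the term is absent without touching the terms dictionary, while a
hit -- possibly a false positive from deletes left by updates -- falls back to
the full lookup:

```java
import java.util.BitSet;
import java.util.Set;

// Illustrative fast-fail term lookup backed by a single-hash Bloom filter.
public class FastFailLookup {
  static final int NUM_BUCKETS = 1 << 16;

  final BitSet bloom = new BitSet(NUM_BUCKETS);
  final Set<String> termsDict; // stands in for the on-disk terms dictionary

  FastFailLookup(Set<String> terms) {
    this.termsDict = terms;
    for (String t : terms) {
      bloom.set(bucket(t));
    }
  }

  static int bucket(String term) {
    return (term.hashCode() & 0x7fffffff) % NUM_BUCKETS;
  }

  boolean contains(String term) {
    if (!bloom.get(bucket(term))) {
      return false; // definite miss: no disk access needed
    }
    return termsDict.contains(term); // maybe-hit: full (slow) lookup
  }
}
```

The result is correct either way; a false positive only costs the fall-through
lookup, which is why the gap over Pulsing narrows as deletes accumulate.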

I'll wait for the dust to settle on Lucene-4190 (given this Codec introduces a 
new file) before thinking about committing.

Cheers
Mark



[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-06 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: PKLookupUpdatePerfTest.java

Updated performance test with option to alter the ratio of inserts vs updates 
via keyspace size.

[jira] [Commented] (LUCENE-4190) IndexWriter deletes non-Lucene files

2012-07-05 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13407099#comment-13407099
 ] 

Mark Harwood commented on LUCENE-4190:
--

-1 for merrily wiping contents of whatever directory a user happens to pick for 
an index location
+0 on requiring all codecs to declare filenames because I take on board Rob's 
points re complexity
+1 for the "_*" name-spacing proposal as a sensible compromise
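
The name-spacing compromise amounts to a guard like this sketch (the pattern
is illustrative only; the issue text suggests something like _(_X).Y): only
delete files whose names match the Lucene segment-file shape, so foreign files
in a carelessly chosen directory survive.

```java
import java.util.regex.Pattern;

// Hypothetical filename guard for IndexWriter's directory cleanup.
public class LuceneFileGuard {
  // Underscore-prefixed segment name, optional suffix, dotted extension.
  static final Pattern LUCENE_FILE =
      Pattern.compile("_[a-z0-9]+(_[A-Za-z0-9]+)?\\.[A-Za-z0-9]+");

  static boolean safeToDelete(String fileName) {
    return LUCENE_FILE.matcher(fileName).matches();
  }

  public static void main(String[] args) {
    System.out.println(safeToDelete("_0.cfs"));        // true
    System.out.println(safeToDelete("my-thesis.doc")); // false
  }
}
```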





> IndexWriter deletes non-Lucene files
> 
>
> Key: LUCENE-4190
> URL: https://issues.apache.org/jira/browse/LUCENE-4190
> Project: Lucene - Java
>  Issue Type: Bug
>Reporter: Michael McCandless
>Assignee: Robert Muir
> Fix For: 4.0, 5.0
>
> Attachments: LUCENE-4190.patch, LUCENE-4190.patch
>
>
> Carl Austin raised a good issue in a comment on my Lucene 4.0.0 alpha blog 
> post: 
> http://blog.mikemccandless.com/2012/07/lucene-400-alpha-at-long-last.html
> IndexWriter will now (as of 4.0) delete all foreign files from the index 
> directory.  We made this change because Codecs are free to write to any files 
> now, so the space of filenames is hard to "bound".
> But if the user accidentally uses the wrong directory (eg c:/) then we will 
> in fact delete important stuff.
> I think we can at least use some simple criteria (must start with _, maybe 
> must fit certain pattern eg _(_X).Y), so we are much less likely to 
> delete a non-Lucene file

[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-25 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: BloomFilterPostingsBranch4x.patch

Added a customizable saturation threshold after which Bloom filters are retired 
and no longer maintained (due to merges creating very large segments).
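
The retirement rule can be sketched like this (names and threshold are
hypothetical, not the patch's API): once too large a fraction of the bitset is
lit, the false-positive rate makes the fast-fail check worthless, so very
large merged segments simply stop maintaining the filter.

```java
import java.util.BitSet;

// Illustrative saturation check for retiring a segment's Bloom filter.
public class BloomSaturation {

  // Fraction of bits set, 0.0 .. 1.0.
  static double saturation(BitSet bits, int sizeInBits) {
    return (double) bits.cardinality() / sizeInBits;
  }

  static boolean shouldRetire(BitSet bits, int sizeInBits, double maxSaturation) {
    return saturation(bits, sizeInBits) > maxSaturation;
  }

  public static void main(String[] args) {
    BitSet bits = new BitSet(1024);
    bits.set(0, 950); // 950 of 1024 bits lit
    System.out.println(shouldRetire(bits, 1024, 0.9)); // true: retire the filter
  }
}
```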



[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-25 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: (was: BloomFilterPostingsBranch4x.patch)

> Segment-level Bloom filters for a 2 x speed up on rare term searches
> 
>
> Key: LUCENE-4069
> URL: https://issues.apache.org/jira/browse/LUCENE-4069
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 3.6, 4.0
>Reporter: Mark Harwood
>Priority: Minor
> Fix For: 4.0, 3.6.1
>
> Attachments: MHBloomFilterOn3.6Branch.patch, 
> PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java
>
>
> An addition to each segment which stores a Bloom filter for selected fields 
> in order to give fast-fail to term searches, helping avoid wasted disk access.
> Best suited for low-frequency fields e.g. primary keys on big indexes with 
> many segments but also speeds up general searching in my tests.
> Overview slideshow here: 
> http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
> Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
> Patch based on 3.6 codebase attached.
> There are no 3.6 API changes currently - to play just add a field with "_blm" 
> on the end of the name to invoke special indexing/querying capability. 
> Clearly a new Field or schema declaration(!) would need adding to APIs to 
> configure the service properly.
> Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat
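The fast-fail idea in the description above can be sketched in a few lines of plain Java. This is an illustrative toy, not the attached PostingsFormat: the class name, bit-set size, and hash scheme are all hypothetical, and a real implementation would sit in front of the segment's terms dictionary.

```java
import java.util.BitSet;

/**
 * Toy segment-level Bloom filter: hash each indexed term into a bit set,
 * and at query time skip the expensive terms-dictionary lookup whenever
 * the filter reports the term as definitely absent. Hypothetical names;
 * not the API of the attached patch.
 */
public class TermBloomFilter {
    private final BitSet bits;
    private final int numBits;

    public TermBloomFilter(int numBits) {
        this.numBits = numBits;
        this.bits = new BitSet(numBits);
    }

    // Two cheap hash positions per term; floorMod keeps indices non-negative.
    private int hash1(String term) {
        return Math.floorMod(term.hashCode(), numBits);
    }

    private int hash2(String term) {
        return Math.floorMod(term.hashCode() * 31 + 7, numBits);
    }

    public void add(String term) {
        bits.set(hash1(term));
        bits.set(hash2(term));
    }

    /**
     * false => term is definitely not in this segment (fast fail, no disk access);
     * true  => term *may* be present, fall through to the real lookup.
     */
    public boolean mayContain(String term) {
        return bits.get(hash1(term)) && bits.get(hash2(term));
    }
}
```

Because a Bloom filter never produces false negatives, the fast-fail path is safe: a `false` answer lets the search skip the segment entirely, which is where the speed-up on rare (e.g. primary-key) terms comes from.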




[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-22 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: PKLookupUpdatePerfTest.java

Attached a performance test (adapted from Mike's PKLookupPerfTest) that 
demonstrates the worst-case scenario in which the Bloom filter offers the 2x 
speed-up not previously revealed by Mike's other tests.

This test case mixes reads and writes on a growing index and is representative 
of the real-world scenario I am seeking to optimize. See the javadoc for test 
details.
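The access pattern the test exercises can be sketched as an interleaved loop over a growing index. This is a toy model with hypothetical names and no Lucene dependencies, meant only to show the shape of the workload, not the attached test itself:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

/**
 * Toy model of the benchmark's access pattern: a growing "index" that
 * interleaves primary-key updates (writes) with lookups (reads).
 * Illustrative only; the attached PKLookupUpdatePerfTest uses the real
 * Lucene IndexWriter/IndexReader APIs.
 */
public class MixedReadWriteLoad {
    public static int run(int iterations, long seed) {
        Map<String, Integer> index = new HashMap<>();
        Random rnd = new Random(seed);
        int hits = 0;
        for (int i = 0; i < iterations; i++) {
            index.put("pk-" + i, i);               // write: the index grows
            String probe = "pk-" + rnd.nextInt(i + 1);
            if (index.containsKey(probe)) {        // read: primary-key lookup
                hits++;
            }
        }
        return hits;
    }
}
```

In the real test the reads hit many small, freshly flushed segments, which is exactly where a per-segment fast-fail check pays off.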

> Segment-level Bloom filters for a 2 x speed up on rare term searches
> 
>
> Key: LUCENE-4069
> URL: https://issues.apache.org/jira/browse/LUCENE-4069
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 3.6, 4.0
>Reporter: Mark Harwood
>Priority: Minor
> Fix For: 4.0, 3.6.1
>
> Attachments: BloomFilterPostingsBranch4x.patch, 
> MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, 
> PrimaryKeyPerfTest40.java
>
>
> An addition to each segment which stores a Bloom filter for selected fields 
> in order to give fast-fail to term searches, helping avoid wasted disk access.
> Best suited for low-frequency fields e.g. primary keys on big indexes with 
> many segments but also speeds up general searching in my tests.
> Overview slideshow here: 
> http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
> Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
> Patch based on 3.6 codebase attached.
> There are no 3.6 API changes currently - to play just add a field with "_blm" 
> on the end of the name to invoke special indexing/querying capability. 
> Clearly a new Field or schema declaration(!) would need adding to APIs to 
> configure the service properly.
> Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat




Re: Welcome Greg Bowyer

2012-06-21 Thread mark harwood
Good to have you aboard, Greg!


- Original Message -
From: Erick Erickson 
To: dev@lucene.apache.org
Cc: 
Sent: Thursday, 21 June 2012, 11:56
Subject: Welcome Greg Bowyer

I'm pleased to announce that Greg Bowyer has been added as a
Lucene/Solr committer.

Greg:
It's a tradition that you reply with a brief bio.

Your SVN access should be set up and ready to go.

Congratulations!

Erick Erickson




[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-20 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: PrimaryKeyPerfTest40.java

Updated performance test code based on the new IndexReader changes for 
accessing subreaders.

> Segment-level Bloom filters for a 2 x speed up on rare term searches
> 
>
> Key: LUCENE-4069
> URL: https://issues.apache.org/jira/browse/LUCENE-4069
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 3.6, 4.0
>Reporter: Mark Harwood
>Priority: Minor
> Fix For: 4.0, 3.6.1
>
> Attachments: BloomFilterPostingsBranch4x.patch, 
> MHBloomFilterOn3.6Branch.patch, PrimaryKeyPerfTest40.java
>
>
> An addition to each segment which stores a Bloom filter for selected fields 
> in order to give fast-fail to term searches, helping avoid wasted disk access.
> Best suited for low-frequency fields e.g. primary keys on big indexes with 
> many segments but also speeds up general searching in my tests.
> Overview slideshow here: 
> http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
> Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
> Patch based on 3.6 codebase attached.
> There are no 3.6 API changes currently - to play just add a field with "_blm" 
> on the end of the name to invoke special indexing/querying capability. 
> Clearly a new Field or schema declaration(!) would need adding to APIs to 
> configure the service properly.
> Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat



