[jira] [Commented] (LUCENE-8876) EnglishMinimalStemmer does not implement s-stemmer paper correctly?

2019-07-01 Thread Mark Harwood (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16876507#comment-16876507
 ] 

Mark Harwood commented on LUCENE-8876:
--

I reached out to the paper author, Donna Harman, a while ago and she just 
replied as follows:
{quote}It has been a very long time since I have thought about S-stemmers.   
But looking at your examples of bees and employees, it seems to me that rule 3 
is the correct one because rule 2 would be prevented from firing. 
{quote}
 

Given her assertion that rule 3 should apply to "bees", it looks like this 
would make rule 2 entirely redundant.
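Her reading corresponds to a fall-through interpretation: a rule blocked by its exception list does not consume the word, and later rules may still fire. A minimal sketch of the paper's three rules under that interpretation (plain Java; the class and method names here are illustrative, not Lucene's actual EnglishMinimalStemmer):

```java
// Illustrative sketch of Harman's three S-stemmer rules with the
// fall-through interpretation argued for in this issue: a rule whose
// exception list blocks it does NOT consume the word, so later rules
// may still apply. Not Lucene's actual EnglishMinimalStemmer code.
public class SStemmerSketch {
    public static String stem(String w) {
        if (w.length() < 4 || !w.endsWith("s")) {
            return w; // too short, or no plural-style suffix
        }
        // Rule 1: "ies" -> "y", except "eies" and "aies"
        if (w.endsWith("ies") && !w.endsWith("eies") && !w.endsWith("aies")) {
            return w.substring(0, w.length() - 3) + "y";
        }
        // Rule 2: "es" -> "e", except "aes", "ees" and "oes".
        // When the exception blocks it, fall THROUGH to rule 3
        // rather than leaving the word untouched.
        if (w.endsWith("es") && !w.endsWith("aes")
                && !w.endsWith("ees") && !w.endsWith("oes")) {
            return w.substring(0, w.length() - 1);
        }
        // Rule 3: drop a final "s", except after "us" and "ss"
        if (!w.endsWith("us") && !w.endsWith("ss")) {
            return w.substring(0, w.length() - 1);
        }
        return w;
    }
}
```

Under this reading, rule 2's {{ees}}/{{oes}} exceptions simply divert those words to rule 3, so {{bees}} and {{employees}} lose their trailing S.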

> EnglishMinimalStemmer does not implement s-stemmer paper correctly?
> ---
>
> Key: LUCENE-8876
> URL: https://issues.apache.org/jira/browse/LUCENE-8876
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Reporter: Mark Harwood
>Priority: Minor
>
> The EnglishMinimalStemmer fails to stem words with {{ees}} suffixes, like 
> bees, trees and employees.
> The [original paper|http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.104.9828&rep=rep1&type=pdf] has this table of rules:
> !https://user-images.githubusercontent.com/170925/59616454-5dc7d580-911c-11e9-80b0-c7a59458c5a7.png!
> The notes accompanying the table state:
> {quote}"the first applicable rule encountered is the only one used"
> {quote}
>  
> For the {{ees}} and {{oes}} suffixes I think EnglishMinimalStemmer 
> misinterpreted the rule logic and consequently {{bees != bee}} and {{tomatoes 
> != tomato}}. The {{oes}} and {{ees}} suffixes are left intact.
> "The first applicable rule" for {{ees}} could be interpreted as rule 2 or 3 
> in the table depending on if you take {{applicable}} to mean "the THEN part 
> of the rule has fired" or just that the suffix was referenced in the rule. 
> EnglishMinimalStemmer has assumed the latter and I think it should be the 
> former. We should fall through into rule 3 for {{ees}} and {{oes}} (remove 
> any trailing S). That's certainly the conclusion I came to independently 
> testing on real data.
> There are some additional changes I'd like to see in a plural stemmer but I 
> won't list them here - the focus should be making the code here match the 
> original paper it references.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8876) EnglishMinimalStemmer does not implement s-stemmer paper correctly?

2019-06-24 Thread Mark Harwood (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16871423#comment-16871423
 ] 

Mark Harwood commented on LUCENE-8876:
--

{quote} but then doesn't it mean that exceptions of the 2nd rule are always 
ignored?
{quote}
 

Good point. Rule 1 exceptions are odd too - I have not found a single common 
English word that ends in aies or eies.







[jira] [Created] (LUCENE-8876) EnglishMinimalStemmer does not implement s-stemmer paper correctly?

2019-06-24 Thread Mark Harwood (JIRA)
Mark Harwood created LUCENE-8876:


 Summary: EnglishMinimalStemmer does not implement s-stemmer paper 
correctly?
 Key: LUCENE-8876
 URL: https://issues.apache.org/jira/browse/LUCENE-8876
 Project: Lucene - Core
  Issue Type: Bug
  Components: modules/analysis
Reporter: Mark Harwood








[jira] [Commented] (LUCENE-8840) TopTermsBlendedFreqScoringRewrite should use SynonymQuery

2019-06-12 Thread Mark Harwood (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16861960#comment-16861960
 ] 

Mark Harwood commented on LUCENE-8840:
--

{quote}we shouldn't favor documents that contain multiple variations of the 
same fuzzy term.
{quote}
 

For fuzzy I agree that rewarding more variations in a doc is probably 
undesirable - a doc will normally pick one spelling for a word and use it 
consistently so any variations are more likely to be false positives (your 
baz/bad example). Plurals and other forms of suffix would be a notable 
exception but I don't think that's too much of a problem because:
 # we can assume that stemming is taking care of normalizing these tokens.
 # a lot of fuzzy querying is for things like people names that aren't 
expressed as plurals or with other common suffixes

 

I think all forms of automatic expansions (synonym, fuzzy, wildcard) need a 
form of score blending for the expansions they create. Wildcards are perhaps 
unlike fuzzy in that finding multiple variations in a doc _is_ desirable - we 
_are_ looking for multiple forms, and a document that contains many is better 
than one that contains few.
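To make the blending point concrete, here is a toy sketch (hypothetical constants and a simplified BM25 formula, not Lucene's actual scoring code) contrasting a disjunction that sums per-variant BM25 scores with blending the frequencies into a single bounded score:

```java
public class BlendVsSum {
    // Simplified single-document BM25 term score (length norm omitted);
    // k1 is the usual saturation parameter. A toy formula for illustration.
    public static double bm25(double tf, double idf) {
        double k1 = 1.2;
        return idf * (tf * (k1 + 1)) / (tf + k1);
    }

    public static void main(String[] args) {
        double idf = 2.0;          // one blended idf shared by all variants
        double[] tfs = {3, 2, 1};  // term freqs of three fuzzy variants in a doc
        double sumOfScores = 0, sumOfTf = 0;
        for (double tf : tfs) {
            sumOfScores += bm25(tf, idf); // disjunction: per-term scores add up
            sumOfTf += tf;                // synonym-style: blend frequencies first
        }
        // One saturated score, bounded by idf * (k1 + 1) no matter how
        // many variants matched.
        double blended = bm25(sumOfTf, idf);
        System.out.println(sumOfScores + " vs " + blended);
    }
}
```

The summed form grows with each extra matching variant, while the blended form saturates, which is the bounded-contribution behaviour described above.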

 

> TopTermsBlendedFreqScoringRewrite should use SynonymQuery
> -
>
> Key: LUCENE-8840
> URL: https://issues.apache.org/jira/browse/LUCENE-8840
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8840.patch
>
>
> Today the TopTermsBlendedFreqScoringRewrite, which is the default rewrite 
> method for Fuzzy queries, uses the BlendedTermQuery to score documents that 
> match the fuzzy terms. This query blends the frequencies used for scoring 
> across the terms and creates a disjunction of all the blended terms. This 
> means that each fuzzy term that matches in a document will add its BM25 score 
> contribution. We already have a query that can blend the statistics of 
> multiple terms in a single scorer that sums the doc frequencies rather than 
> the entire BM25 score: the SynonymQuery. Since 
> https://issues.apache.org/jira/browse/LUCENE-8652 this query also handles 
> boost between 0 and 1 so it should be easy to change the default rewrite 
> method for Fuzzy queries to use it instead of the BlendedTermQuery. This 
> would bound the contribution of each term to the final score which seems a 
> better alternative in terms of relevancy than the current solution. 






[jira] [Commented] (LUCENE-8352) Make TokenStreamComponents final

2018-06-12 Thread Mark Harwood (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16509635#comment-16509635
 ] 

Mark Harwood commented on LUCENE-8352:
--

My use case was a bit special. I had a custom Reader that [dealt with 
hyperlinked 
text|https://github.com/elastic/elasticsearch/issues/29467#issuecomment-385393246]
 by stripping out the hyperlink markup before feeding the remaining plain text 
into tokenisation. The tricky bit was that the extracted URLs were not thrown 
away but passed to a special TokenFilter at the end of the chain, which 
injected them at the appropriate positions in the token stream.

The workaround was a custom AnalyzerWrapper that overrode wrapReader (which is 
still invoked when wrapped), plus some ThreadLocal hackery to connect my 
TokenFilter to the Reader's extracted URLs.

I'm not sure how common this sort of analysis is but before I reached this 
solution there was quite a detour trying to figure out why a custom 
TokenStreamComponents was not working when wrapped.
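The trap generalises; a minimal plain-Java sketch (illustrative stand-in classes, no Lucene dependency) of how a wrapper that rebuilds a vanilla components object silently drops subclass behaviour:

```java
// Stand-ins for TokenStreamComponents / AnalyzerWrapper, named for
// illustration only - these are not the real Lucene classes.
class Components {
    boolean customSetReaderCalled = false;
    void setReader(String reader) { /* default: nothing special */ }
}

class CustomComponents extends Components {
    @Override
    void setReader(String reader) { customSetReaderCalled = true; }
}

class Wrapper {
    // Mirrors the pattern seen in AnalyzerWrapper.wrapComponents()
    // implementations: a fresh vanilla Components is returned, so any
    // custom subclass passed in is discarded along with its behaviour.
    static Components wrap(Components inner) {
        return new Components();
    }
}
```

Calling setReader on the wrapped result never reaches the custom override, which is exactly the silent loss described in the issue.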

 

> Make TokenStreamComponents final
> 
>
> Key: LUCENE-8352
> URL: https://issues.apache.org/jira/browse/LUCENE-8352
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Mark Harwood
>Priority: Minor
>
> The current design is a little trappy. Any specialised subclasses of 
> TokenStreamComponents _(see_ _StandardAnalyzer, ClassicAnalyzer, 
> UAX29URLEmailAnalyzer)_ are discarded by any subsequent Analyzers that wrap 
> them _(see LimitTokenCountAnalyzer, QueryAutoStopWordAnalyzer, 
> ShingleAnalyzerWrapper and other examples in elasticsearch)_. 
> The current design means each AnalyzerWrapper.wrapComponents() implementation 
> discards any custom TokenStreamComponents and replaces it with one of its own 
> choosing (a vanilla TokenStreamComponents class from examples I've seen).
> This is a trap I fell into when writing a custom TokenStreamComponents with a 
> custom setReader() and I wondered why it was not being triggered when wrapped 
> by other analyzers.
> If AnalyzerWrapper is designed to encourage composition it's arguably a 
> mistake to also permit custom TokenStreamComponent subclasses  - the 
> composition process does not preserve the choice of custom classes and any 
> behaviours they might add. For this reason we should not encourage extensions 
> to TokenStreamComponents (or if TSC extensions are required we should somehow 
> mark an Analyzer as "unwrappable" to prevent lossy compositions).
>  
>  






[jira] [Created] (LUCENE-8352) Make TokenStreamComponents final

2018-06-11 Thread Mark Harwood (JIRA)
Mark Harwood created LUCENE-8352:


 Summary: Make TokenStreamComponents final
 Key: LUCENE-8352
 URL: https://issues.apache.org/jira/browse/LUCENE-8352
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/analysis
Reporter: Mark Harwood



 

 






[jira] [Closed] (LUCENE-6747) FingerprintFilter - a TokenFilter for clustering/linking purposes

2015-08-27 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood closed LUCENE-6747.


 FingerprintFilter - a TokenFilter for clustering/linking purposes
 -

 Key: LUCENE-6747
 URL: https://issues.apache.org/jira/browse/LUCENE-6747
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Mark Harwood
Priority: Minor
 Fix For: Trunk, 5.4

 Attachments: fingerprintv1.patch, fingerprintv2.patch, 
 fingerprintv3.patch, fingerprintv4.patch


 A TokenFilter that emits a single token which is a sorted, de-duplicated set 
 of the input tokens.
 This approach to normalizing text is used in tools like OpenRefine[1] and 
 elsewhere [2] to help in clustering or linking texts.
 The implementation proposed here has an upper limit on the size of the 
 combined token which is output.
 [1] https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth
 [2] 
 https://rajmak.wordpress.com/2013/04/27/clustering-text-map-reduce-in-python/
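The normalization described can be sketched as follows (plain Java, not the committed FingerprintFilter; it assumes whitespace tokens, and the behaviour at the size limit - emitting nothing - is an assumption here):

```java
import java.util.Arrays;
import java.util.TreeSet;

public class FingerprintSketch {
    // Sort and de-duplicate the input tokens, then join them into one
    // combined "fingerprint" token. maxLen mirrors the upper limit on
    // the combined token's size; emitting nothing when the limit is
    // exceeded is an assumption of this sketch.
    public static String fingerprint(String text, int maxLen) {
        TreeSet<String> tokens = new TreeSet<>(
                Arrays.asList(text.toLowerCase().split("\\s+")));
        String combined = String.join(" ", tokens);
        return combined.length() <= maxLen ? combined : "";
    }
}
```

Texts that differ only in word order, case or repetition collapse to the same fingerprint, which is what makes it useful for clustering and linking.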






[jira] [Resolved] (LUCENE-6747) FingerprintFilter - a TokenFilter for clustering/linking purposes

2015-08-27 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood resolved LUCENE-6747.
--
Resolution: Fixed

Committed to trunk and 5.x


 FingerprintFilter - a TokenFilter for clustering/linking purposes
 -

 Key: LUCENE-6747
 URL: https://issues.apache.org/jira/browse/LUCENE-6747
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Mark Harwood
Priority: Minor
 Fix For: Trunk, 5.4

 Attachments: fingerprintv1.patch, fingerprintv2.patch, 
 fingerprintv3.patch, fingerprintv4.patch








[jira] [Updated] (LUCENE-6747) FingerprintFilter - a TokenFilter for clustering/linking purposes

2015-08-27 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-6747:
-
Fix Version/s: (was: 5.3.1)
   5.4

 FingerprintFilter - a TokenFilter for clustering/linking purposes
 -

 Key: LUCENE-6747
 URL: https://issues.apache.org/jira/browse/LUCENE-6747
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Mark Harwood
Priority: Minor
 Fix For: Trunk, 5.4

 Attachments: fingerprintv1.patch, fingerprintv2.patch, 
 fingerprintv3.patch, fingerprintv4.patch








[jira] [Updated] (LUCENE-6747) FingerprintFilter - a TokenFilter for clustering/linking purposes

2015-08-27 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-6747:
-
Fix Version/s: 5.3.1
   Trunk

 FingerprintFilter - a TokenFilter for clustering/linking purposes
 -

 Key: LUCENE-6747
 URL: https://issues.apache.org/jira/browse/LUCENE-6747
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Mark Harwood
Priority: Minor
 Fix For: Trunk, 5.3.1

 Attachments: fingerprintv1.patch, fingerprintv2.patch, 
 fingerprintv3.patch, fingerprintv4.patch








[jira] [Updated] (LUCENE-6747) FingerprintFilter - a TokenFilter for clustering/linking purposes

2015-08-25 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-6747:
-
Attachment: fingerprintv4.patch

Some final tweaks:
1) Fixed a bug where the separator was not appended if the first token has 
length == 1
2) Randomized testing identified an issue with input.end() not being called 
when IOExceptions occur
3) Added the missing SPI entry for FingerprintFilterFactory and an associated 
test class

 FingerprintFilter - a TokenFilter for clustering/linking purposes
 -

 Key: LUCENE-6747
 URL: https://issues.apache.org/jira/browse/LUCENE-6747
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Mark Harwood
Priority: Minor
 Attachments: fingerprintv1.patch, fingerprintv2.patch, 
 fingerprintv3.patch, fingerprintv4.patch








[jira] [Updated] (LUCENE-6747) FingerprintFilter - a TokenFilter for clustering/linking purposes

2015-08-21 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-6747:
-
Attachment: fingerprintv3.patch

Updated patch - removed the instanceof check and added an entry to CHANGES.txt.

Will commit to trunk and 5.x in a day or two if there are no objections

 FingerprintFilter - a TokenFilter for clustering/linking purposes
 -

 Key: LUCENE-6747
 URL: https://issues.apache.org/jira/browse/LUCENE-6747
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Mark Harwood
Priority: Minor
 Attachments: fingerprintv1.patch, fingerprintv2.patch, 
 fingerprintv3.patch








[jira] [Updated] (LUCENE-6747) FingerprintFilter - a TokenFilter for clustering/linking purposes

2015-08-20 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-6747:
-
Attachment: fingerprintv2.patch

Thanks for taking a look, Adrien.
Added a v2 patch with the following changes:

1) Added a call to input.end() to get the final offset state
2) The final state is retained using captureState()
3) Added a FingerprintFilterFactory class
 
As for the alternative hashing idea:
For speed this would be a nice idea, but it reduces the readability of 
results if you want to debug any collisions or otherwise display connections.

For compactness (storing in doc values etc.) it would always be possible 
to chain a conventional hashing algorithm in a TokenFilter onto the end of 
this text-normalizing filter. (Do we already have a conventional hashing 
TokenFilter?)




 FingerprintFilter - a TokenFilter for clustering/linking purposes
 -

 Key: LUCENE-6747
 URL: https://issues.apache.org/jira/browse/LUCENE-6747
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Mark Harwood
Priority: Minor
 Attachments: fingerprintv1.patch, fingerprintv2.patch








[jira] [Created] (LUCENE-6747) FingerprintFilter - a TokenFilter for clustering/linking purposes

2015-08-19 Thread Mark Harwood (JIRA)
Mark Harwood created LUCENE-6747:


 Summary: FingerprintFilter - a TokenFilter for clustering/linking 
purposes
 Key: LUCENE-6747
 URL: https://issues.apache.org/jira/browse/LUCENE-6747
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Mark Harwood
Priority: Minor








[jira] [Updated] (LUCENE-6747) FingerprintFilter - a TokenFilter for clustering/linking purposes

2015-08-19 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-6747:
-
Attachment: fingerprintv1.patch

Proposed implementation and test

 FingerprintFilter - a TokenFilter for clustering/linking purposes
 -

 Key: LUCENE-6747
 URL: https://issues.apache.org/jira/browse/LUCENE-6747
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Mark Harwood
Priority: Minor
 Attachments: fingerprintv1.patch








[jira] [Commented] (LUCENE-329) Fuzzy query scoring issues

2015-05-20 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14552265#comment-14552265
 ] 

Mark Harwood commented on LUCENE-329:
-

Committed to 5.x branch and trunk

 Fuzzy query scoring issues
 --

 Key: LUCENE-329
 URL: https://issues.apache.org/jira/browse/LUCENE-329
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/search
Affects Versions: 1.2
 Environment: Operating System: All
 Platform: All
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Fix For: 5.x

 Attachments: ASF.LICENSE.NOT.GRANTED--patch.txt, LUCENE-329.patch, 
 LUCENE-329.patch, LUCENE-329.patch, LUCENE-329.patch


 Queries which automatically produce multiple terms (wildcard, range, prefix, 
 fuzzy etc.) currently suffer from two problems:
 1) Scores for matching documents are significantly smaller than for term 
 queries because of the volume of terms introduced (a match on query Foo~ is 
 0.1 whereas a match on query Foo is 1).
 2) The rarer forms of expanded terms are favoured over the more common forms 
 because of the IDF. When using fuzzy queries, for example, rare misspellings 
 typically appear in results before the more common correct spellings.
 I will attach a patch that corrects the issues identified above by 
 1) Overriding Similarity.coord to counteract the downplaying of scores 
 introduced by expanding terms.
 2) Taking the IDF factor of the most common form of expanded terms as the 
 basis of scoring all other expanded terms.
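Point 2 can be sketched as follows (a toy illustration of the idea, not the attached patch; the BM25-style idf formula is just one plausible choice):

```java
public class SharedIdfSketch {
    // BM25-style inverse document frequency for a term.
    public static double idf(long docFreq, long docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Score every expanded term with the idf of the most common
    // (highest-docFreq) form, so rare misspellings cannot outrank the
    // common correct spelling purely through their rarity.
    public static double sharedIdf(long[] docFreqs, long docCount) {
        long maxDf = 0;
        for (long df : docFreqs) {
            maxDf = Math.max(maxDf, df);
        }
        return idf(maxDf, docCount);
    }
}
```

With per-term idf, a misspelling with docFreq 3 would score far above the correct form with docFreq 1000; sharing the most common form's idf removes that inversion.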






[jira] [Updated] (LUCENE-329) Fuzzy query scoring issues

2015-05-19 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-329:

Attachment: LUCENE-329.patch

Last edits to remove unnecessary Math.max() tests. Added an assertion around 
maxTTf expectations

 Fuzzy query scoring issues
 --

 Key: LUCENE-329
 URL: https://issues.apache.org/jira/browse/LUCENE-329
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/search
Affects Versions: 1.2
 Environment: Operating System: All
 Platform: All
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Fix For: 5.x

 Attachments: ASF.LICENSE.NOT.GRANTED--patch.txt, LUCENE-329.patch, 
 LUCENE-329.patch, LUCENE-329.patch, LUCENE-329.patch








[jira] [Updated] (LUCENE-329) Fuzzy query scoring issues

2015-05-19 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-329:

Attachment: LUCENE-329.patch

Updated following review comments (thanks, Adrien).
All tests passing on trunk.

 Fuzzy query scoring issues
 --

 Key: LUCENE-329
 URL: https://issues.apache.org/jira/browse/LUCENE-329
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/search
Affects Versions: 1.2
 Environment: Operating System: All
 Platform: All
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Fix For: 5.x

 Attachments: ASF.LICENSE.NOT.GRANTED--patch.txt, LUCENE-329.patch, 
 LUCENE-329.patch, LUCENE-329.patch





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-329) Fuzzy query scoring issues

2015-05-19 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-329:

Attachment: (was: LUCENE-329.patch)

 Fuzzy query scoring issues
 --

 Key: LUCENE-329
 URL: https://issues.apache.org/jira/browse/LUCENE-329
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/search
Affects Versions: 1.2
 Environment: Operating System: All
 Platform: All
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Fix For: 5.x

 Attachments: ASF.LICENSE.NOT.GRANTED--patch.txt, LUCENE-329.patch, 
 LUCENE-329.patch, LUCENE-329.patch





[jira] [Updated] (LUCENE-329) Fuzzy query scoring issues

2015-05-19 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-329:

Attachment: LUCENE-329.patch

A cut-and-paste error in the last patch set df=0, and the effects went 
undetected by the unit tests.
Enhanced the unit test to detect the error, then fixed it.

 Fuzzy query scoring issues
 --

 Key: LUCENE-329
 URL: https://issues.apache.org/jira/browse/LUCENE-329
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/search
Affects Versions: 1.2
 Environment: Operating System: All
 Platform: All
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Fix For: 5.x

 Attachments: ASF.LICENSE.NOT.GRANTED--patch.txt, LUCENE-329.patch, 
 LUCENE-329.patch, LUCENE-329.patch, LUCENE-329.patch





[jira] [Commented] (LUCENE-329) Fuzzy query scoring issues

2015-05-19 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14550376#comment-14550376
 ] 

Mark Harwood commented on LUCENE-329:
-

Thanks, I'll commit tomorrow if there are no objections.

 Fuzzy query scoring issues
 --

 Key: LUCENE-329
 URL: https://issues.apache.org/jira/browse/LUCENE-329
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/search
Affects Versions: 1.2
 Environment: Operating System: All
 Platform: All
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Fix For: 5.x

 Attachments: ASF.LICENSE.NOT.GRANTED--patch.txt, LUCENE-329.patch, 
 LUCENE-329.patch, LUCENE-329.patch, LUCENE-329.patch





[jira] [Updated] (LUCENE-329) Fuzzy query scoring issues

2015-05-12 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-329:

Attachment: LUCENE-329.patch

Switched to the TermContext.accumulateStatistics() method Adrien suggested for 
tweaking stats.

 Fuzzy query scoring issues
 --

 Key: LUCENE-329
 URL: https://issues.apache.org/jira/browse/LUCENE-329
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/search
Affects Versions: 1.2
 Environment: Operating System: All
 Platform: All
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Fix For: 5.x

 Attachments: ASF.LICENSE.NOT.GRANTED--patch.txt, LUCENE-329.patch, 
 LUCENE-329.patch





[jira] [Updated] (LUCENE-329) Fuzzy query scoring issues

2015-05-05 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-329:

Attachment: LUCENE-329.patch

New patch addressing this long-standing bug.
It addresses today's all-or-nothing choice, where the default is a (poor) use 
of all IDF factors and the only alternative is a sub-optimal rewrite method 
that uses no IDF at all.
The patch includes:
1) A new default FuzzyQuery rewrite method that balances IDF better
2) Unit tests for single and multi-query behaviours

Additionally, this document offers more analysis based on quality tests on a 
slightly larger set of data not included here: 
https://docs.google.com/document/d/1KXhbUpD5GFyzNqfk3nocODOo7Upgpd5tmUQp4-OPwiM/edit#heading=h.2e8gdmdqf2m5


 Fuzzy query scoring issues
 --

 Key: LUCENE-329
 URL: https://issues.apache.org/jira/browse/LUCENE-329
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/search
Affects Versions: 1.2
 Environment: Operating System: All
 Platform: All
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Fix For: 3.1, 4.0-ALPHA

 Attachments: ASF.LICENSE.NOT.GRANTED--patch.txt, LUCENE-329.patch





[jira] [Updated] (LUCENE-329) Fuzzy query scoring issues

2015-05-05 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-329:

Fix Version/s: (was: 3.1)
   (was: 4.0-ALPHA)
   5.x

 Fuzzy query scoring issues
 --

 Key: LUCENE-329
 URL: https://issues.apache.org/jira/browse/LUCENE-329
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/search
Affects Versions: 1.2
 Environment: Operating System: All
 Platform: All
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Fix For: 5.x

 Attachments: ASF.LICENSE.NOT.GRANTED--patch.txt, LUCENE-329.patch





[jira] [Closed] (LUCENE-6066) Collector that manages diversity in search results

2015-02-12 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood closed LUCENE-6066.

   Resolution: Fixed
Fix Version/s: (was: 5.0)
   5.1

Committed to trunk and the 5.x branch. Thanks for the reviews, Adrien and Mike.

 Collector that manages diversity in search results
 --

 Key: LUCENE-6066
 URL: https://issues.apache.org/jira/browse/LUCENE-6066
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/query/scoring
Reporter: Mark Harwood
Priority: Minor
 Fix For: 5.1

 Attachments: LUCENE-6066.patch, LUCENE-PQRemoveV8.patch, 
 LUCENE-PQRemoveV9.patch


 This issue provides a new collector for situations where a client doesn't 
 want more than N matches for any given key (e.g. no more than 5 products from 
 any one retailer in a marketplace). In these circumstances a document that 
 was previously thought of as competitive during collection has to be removed 
 from the final PQ and replaced with another doc (e.g. a retailer who already 
 has 5 matches in the PQ receives a 6th match which is better than his 
 previous ones). This requires a new remove method on the existing 
 PriorityQueue class.






[jira] [Updated] (LUCENE-6066) Collector that manages diversity in search results

2015-02-09 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-6066:
-
Attachment: LUCENE-PQRemoveV9.patch

Moved DiversifiedTopDocsCollector and the related unit test to misc.
Added the experimental annotation.
Removed the superfluous if (size == 0) test in PriorityQueue.

Thanks, Adrien.

 Collector that manages diversity in search results
 --

 Key: LUCENE-6066
 URL: https://issues.apache.org/jira/browse/LUCENE-6066
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/query/scoring
Reporter: Mark Harwood
Priority: Minor
 Fix For: 5.0

 Attachments: LUCENE-PQRemoveV8.patch, LUCENE-PQRemoveV9.patch





[jira] [Commented] (LUCENE-6066) Collector that manages diversity in search results

2015-02-06 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14309365#comment-14309365
 ] 

Mark Harwood commented on LUCENE-6066:
--

bq. maybe we should have this feature in lucene/sandbox or in lucene/misc first 
instead of lucene/core?

It relies on a change to core's PriorityQueue (which was the original focus of 
this issue, but the issue then extended into the specialized collector, which 
is possibly the only justification for introducing a remove method on PQ).

bq. I think we should also add a lucene.experimental annotation to this 
collector?

That seems fair. 

bq. the `if (size == 0)` condition at the top of PQ.remove looks already 
covered by the below for-loop?

good point, will change.

bq. Should PQ.downHeap and upHead delegate to their counterpart that takes a 
position?

I wanted to avoid introducing any slow-down to the PQ impl, so I kept the 
existing upHeap/downHeap methods intact and duplicated most of their logic in 
the version that takes a position.


 Collector that manages diversity in search results
 --

 Key: LUCENE-6066
 URL: https://issues.apache.org/jira/browse/LUCENE-6066
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/query/scoring
Reporter: Mark Harwood
Priority: Minor
 Fix For: 5.0

 Attachments: LUCENE-PQRemoveV8.patch





[jira] [Updated] (LUCENE-6066) Collector that manages diversity in search results

2015-02-05 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-6066:
-
Attachment: (was: LUCENE-PQRemoveV7.patch)

 Collector that manages diversity in search results
 --

 Key: LUCENE-6066
 URL: https://issues.apache.org/jira/browse/LUCENE-6066
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/query/scoring
Reporter: Mark Harwood
Priority: Minor
 Fix For: 5.0

 Attachments: LUCENE-PQRemoveV8.patch





[jira] [Updated] (LUCENE-6066) Collector that manages diversity in search results

2015-02-05 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-6066:
-
Attachment: (was: LUCENE-PQRemoveV6.patch)

 Collector that manages diversity in search results
 --

 Key: LUCENE-6066
 URL: https://issues.apache.org/jira/browse/LUCENE-6066
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/query/scoring
Reporter: Mark Harwood
Priority: Minor
 Fix For: 5.0

 Attachments: LUCENE-PQRemoveV8.patch





[jira] [Updated] (LUCENE-6066) Collector that manages diversity in search results

2015-02-05 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-6066:
-
Attachment: LUCENE-PQRemoveV8.patch

Tabs removed. Ant precommit now passes. Still no Bee Gees (sorry, Mike).
Will commit to trunk and 5.1 in a day or two if there are no objections.

 Collector that manages diversity in search results
 --

 Key: LUCENE-6066
 URL: https://issues.apache.org/jira/browse/LUCENE-6066
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/query/scoring
Reporter: Mark Harwood
Priority: Minor
 Fix For: 5.0

 Attachments: LUCENE-PQRemoveV8.patch





[jira] [Updated] (LUCENE-6066) Collector that manages diversity in search results

2015-01-22 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-6066:
-
Attachment: LUCENE-PQRemoveV7.patch

Fixed the test PQ's impl of lessThan(), which was causing test failures when 
duplicate Integers were placed into the queue.

 Collector that manages diversity in search results
 --

 Key: LUCENE-6066
 URL: https://issues.apache.org/jira/browse/LUCENE-6066
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/query/scoring
Reporter: Mark Harwood
Priority: Minor
 Fix For: 5.0

 Attachments: LUCENE-PQRemoveV6.patch, LUCENE-PQRemoveV7.patch





[jira] [Updated] (LUCENE-6066) Collector that manages diversity in search results

2015-01-19 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-6066:
-
Attachment: (was: LUCENE-PQRemoveV5.patch)

 Collector that manages diversity in search results
 --

 Key: LUCENE-6066
 URL: https://issues.apache.org/jira/browse/LUCENE-6066
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/query/scoring
Reporter: Mark Harwood
Priority: Minor
 Fix For: 5.0

 Attachments: LUCENE-PQRemoveV6.patch





[jira] [Updated] (LUCENE-6066) Collector that manages diversity in search results

2015-01-19 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-6066:
-
Attachment: LUCENE-PQRemoveV6.patch

Removed outdated acceptDocsInOrder() method.

 Collector that manages diversity in search results
 --

 Key: LUCENE-6066
 URL: https://issues.apache.org/jira/browse/LUCENE-6066
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/query/scoring
Reporter: Mark Harwood
Priority: Minor
 Fix For: 5.0

 Attachments: LUCENE-PQRemoveV6.patch





[jira] [Updated] (LUCENE-6066) Collector that manages diversity in search results

2015-01-14 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-6066:
-
Attachment: (was: LUCENE-PQRemoveV3.patch)

 Collector that manages diversity in search results
 --

 Key: LUCENE-6066
 URL: https://issues.apache.org/jira/browse/LUCENE-6066
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/query/scoring
Reporter: Mark Harwood
Priority: Minor
 Fix For: 5.0

 Attachments: LUCENE-PQRemoveV5.patch





[jira] [Updated] (LUCENE-6066) Collector that manages diversity in search results

2015-01-14 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-6066:
-
Attachment: LUCENE-PQRemoveV5.patch

Added a JUnit test showing use with String-based dedup keys via two lookup 
impls: slow-but-accurate global ordinals, and fast-but-potentially-inaccurate 
hashing of BinaryDocValues.

 Collector that manages diversity in search results
 --

 Key: LUCENE-6066
 URL: https://issues.apache.org/jira/browse/LUCENE-6066
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/query/scoring
Reporter: Mark Harwood
Priority: Minor
 Fix For: 5.0

 Attachments: LUCENE-PQRemoveV5.patch





[jira] [Commented] (LUCENE-6066) Collector that manages diversity in search results

2015-01-14 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14277279#comment-14277279
 ] 

Mark Harwood commented on LUCENE-6066:
--

What feels awkward in the example JUnit test is that diversified collections 
are not compatible with the existing Sort functionality - I had to use a 
custom Similarity class to sort by the popularity of songs in my test data.
Combining the diversifying collector with any other existing collector 
(e.g. TopFieldCollector, to achieve field-based sorting) via wrapping is 
problematic because the other collectors all assume that previously collected 
elements are never recalled. The diversifying collector needs the ability to 
recall previously collected elements when new elements with the same key must 
be substituted.
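The collection policy being described can be sketched as follows. This is an illustrative model only, not the committed DiversifiedTopDocsCollector: the class name `DiversifiedTopN` and its method are assumptions, and it takes the hits up front rather than streaming them, which sidesteps the eviction ("recall") problem a streaming collector must solve.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch (hypothetical names): keep the best `topN` hits overall while
// allowing at most `maxHitsPerKey` hits for any single key (e.g. no more
// than 5 products from any one retailer).
public class DiversifiedTopN {

    // Returns the indices of the selected hits, best score first.
    public static List<Integer> collect(String[] keys, double[] scores,
                                        int topN, int maxHitsPerKey) {
        Integer[] order = new Integer[scores.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        // Visit hits from best to worst score.
        Arrays.sort(order, (a, b) -> Double.compare(scores[b], scores[a]));
        Map<String, Integer> perKey = new HashMap<>();
        List<Integer> selected = new ArrayList<>();
        for (int idx : order) {
            if (selected.size() == topN) break;
            int seen = perKey.getOrDefault(keys[idx], 0);
            if (seen < maxHitsPerKey) {  // key not yet over-represented
                perKey.put(keys[idx], seen + 1);
                selected.add(idx);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        String[] retailers = {"acme", "acme", "acme", "bolt", "bolt"};
        double[] scores    = { 0.9,    0.8,    0.7,    0.6,    0.5 };
        // At most 2 hits per retailer in a top-3 result set: acme's third
        // match is skipped in favour of bolt's best match.
        System.out.println(collect(retailers, scores, 3, 2)); // prints [0, 1, 3]
    }
}
```

A streaming collector cannot simply sort up front like this, which is why it needs the PriorityQueue remove method: a late-arriving better hit for a full key must evict that key's worst hit already in the queue.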

 Collector that manages diversity in search results
 --

 Key: LUCENE-6066
 URL: https://issues.apache.org/jira/browse/LUCENE-6066
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/query/scoring
Reporter: Mark Harwood
Priority: Minor
 Fix For: 5.0

 Attachments: LUCENE-PQRemoveV5.patch





[jira] [Updated] (LUCENE-6066) Collector that manages diversity in search results

2015-01-05 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-6066:
-
Description: This issue provides a new collector for situations where a 
client doesn't want more than N matches for any given key (e.g. no more than 5 
products from any one retailer in a marketplace). In these circumstances a 
document that was previously thought of as competitive during collection has to 
be removed from the final PQ and replaced with another doc (e.g. a retailer who 
already has 5 matches in the PQ receives a 6th match which is better than his 
previous ones). This requires a new remove method on the existing PriorityQueue 
class.  (was: It would be useful to be able to remove existing elements from a 
PriorityQueue. 
The proposal is that a linear scan is performed to find the element being 
removed, and then the end element in heap[size] is swapped into this position 
to perform the delete. The method downHeap() is then called to shuffle the 
replacement element back down the array, but the existing downHeap method must 
be modified to allow picking up an entry from any point in the array rather 
than always assuming the first element (which is its only current mode of 
operation).

A working javascript model of the proposal with animation is available here: 
http://jsfiddle.net/grcmquf2/22/ 

In tests the modified version of downHeap produces the same results as the 
existing impl but adds the ability to push down from any point.

An example use case that requires remove is where a client doesn't want more 
than N matches for any given key (e.g. no more than 5 products from any one 
retailer in a marketplace). In these circumstances a document that was 
previously thought of as competitive has to be removed from the final PQ and 
replaced with another doc (e.g. a retailer who already has 5 matches in the PQ 
receives a 6th match which is better than his previous ones). This particular 
process is managed by a special DiversifyingPriorityQueue which wraps the 
main PriorityQueue and could be contributed as part of another issue if there 
is interest in that. )
Summary: Collector that manages diversity in search results  (was: New 
remove method in PriorityQueue)
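
The behaviour described above (capping matches per key, and evicting a key's weakest doc when a better one for that key arrives) can be sketched with plain java.util collections. This is a hypothetical illustration, not the attached patch; names like DiversifiedTopN are invented, and java.util.PriorityQueue stands in for Lucene's (its remove(Object) is the O(n) mid-queue removal this issue proposes adding):

```java
import java.util.*;

// Hypothetical sketch, not the attached patch: collect the best `size` hits
// overall while allowing at most `maxPerKey` hits per key. When a key is full,
// a better hit for that key must first evict that key's weakest hit from the
// main queue - exactly the remove() capability this issue proposes.
class DiversifiedTopN {
    record Hit(String key, float score) {}

    private final int size, maxPerKey;
    // Weakest-first ordering so peek() is always the current eviction candidate.
    private final PriorityQueue<Hit> main =
            new PriorityQueue<>(Comparator.comparingDouble(Hit::score));
    private final Map<String, PriorityQueue<Hit>> perKey = new HashMap<>();

    DiversifiedTopN(int size, int maxPerKey) {
        this.size = size;
        this.maxPerKey = maxPerKey;
    }

    void collect(Hit hit) {
        PriorityQueue<Hit> keyQ = perKey.computeIfAbsent(hit.key(),
                k -> new PriorityQueue<>(Comparator.comparingDouble(Hit::score)));
        if (keyQ.size() == maxPerKey) {
            Hit weakestForKey = keyQ.peek();
            if (weakestForKey.score() >= hit.score()) {
                return; // not competitive within its own key: never enters the PQ
            }
            keyQ.poll();
            main.remove(weakestForKey); // the mid-queue remove this issue needs
        }
        keyQ.add(hit);
        main.add(hit);
        if (main.size() > size) {       // over global capacity: drop weakest overall
            Hit evicted = main.poll();
            perKey.get(evicted.key()).remove(evicted);
        }
    }

    /** Collected hits, best first. */
    List<Hit> results() {
        List<Hit> out = new ArrayList<>(main);
        out.sort(Comparator.comparingDouble(Hit::score).reversed());
        return out;
    }
}
```

With size=3 and maxPerKey=2, a stream dominated by one key collapses to at most two entries for that key plus the next best keys.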




[jira] [Commented] (LUCENE-6066) New remove method in PriorityQueue

2014-12-09 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14239328#comment-14239328
 ] 

Mark Harwood commented on LUCENE-6066:
--

Thanks for the review, Mike. I'm working through changes.

bq. Why couldn't you just pass your custom queue instead of null to super() in 
DiversifiedTopDocsCollector ctor? 

Oops. That was a cut/paste error transferring code from Elasticsearch, which 
relied on a forked PriorityQueue that is obviously incompatible with the Lucene 
TopDocsCollector base class.

bq. the abstract method returns NumericDocValues, which is confusing: how does 
beatles become a number? Why not e.g. SortedDVs

I originally had a getKey(docId) method that returned an object - anything 
which implements hashCode and equals. When I talked it through with Adrien he 
suggested the use of NumericDocValues as a better abstraction, which could be 
backed by any system based on ordinals. We need to decide on what this 
abstraction should be. One of the things I've been grappling with is whether 
the collector should support multi-keyed docs, e.g. a field containing hashes 
for near-duplicate detection to avoid too-similar texts. This would require 
extra code in the collector to determine if any one key had exceeded limits 
(and ideally some memory safeguard for docs with too many keys).

bq. I saw a test about paging; how does/should paging work with such a collector?

In regular collections, TopScoreDocCollector provides all of the smarts for 
in-order/out-of-order collection and for starting from the ScoreDoc at the 
bottom of the previous page. I expect I would have to reimplement all of its 
logic in a new DiversifiedTopScoreKeyedDocCollector, because it makes some 
assumptions about using updateTop() that don't apply when we have a two-tier 
system for scoring (globally competitive and within-key competitive).
My assumption is that any per-key constraints would apply across multiple 
pages, e.g. having had 5 Beatles hits on pages 1 and 2 you wouldn't expect to 
find any more the deeper you go into the results, because the max-5-per-key 
limit has been exhausted. This would probably preclude the deep-paging 
optimisation where you pass just the ScoreDoc of the last entry on the previous 
page to minimise the size of the PQ created for subsequent pages.








[jira] [Updated] (LUCENE-6066) New remove method in PriorityQueue

2014-12-04 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-6066:
-
Attachment: LUCENE-PQRemoveV3.patch

Updated patch.
Added DiversifiedTopDocsCollector and an associated test. This class represents 
the primary use-case for wanting a new remove() method in PriorityQueue.
The PriorityQueue keeps the original upHeap/downHeap methods unchanged, to 
avoid any performance impact, and adds new specialised upHeap/downHeap variants 
that take a position to support the new remove function.




[jira] [Updated] (LUCENE-6066) New remove method in PriorityQueue

2014-12-04 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-6066:
-
Attachment: (was: LUCENE-PQRemoveV2.patch)




[jira] [Updated] (LUCENE-6066) New remove method in PriorityQueue

2014-11-24 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-6066:
-
Attachment: (was: LUCENE-PQRemoveV1.patch)




[jira] [Updated] (LUCENE-6066) New remove method in PriorityQueue

2014-11-24 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-6066:
-
Attachment: LUCENE-PQRemoveV2.patch

Added the missing upHeap call to the remove method.
Added extra randomized tests and a method to check the validity of PQ elements 
as mutations are made.




[jira] [Commented] (LUCENE-6066) New remove method in PriorityQueue

2014-11-24 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223307#comment-14223307
 ] 

Mark Harwood commented on LUCENE-6066:
--

Thanks for your comments, Stefan.

I believe the remove method is now implemented correctly.

bq. it still seems that specialized versions can outperform generic ones

Yes - the DiversifyingPriorityQueue I imagined, which would need a new remove 
method on the existing PriorityQueue, looks like it is better implemented as a 
fork of the existing PriorityQueue. I'll attach that fork here in a future 
update.
With these differing implementations there may be a need for a common interface 
that provides an abstraction for consumers like TopDocsCollector to add and pop 
results.





[jira] [Created] (LUCENE-6066) New remove method in PriorityQueue

2014-11-20 Thread Mark Harwood (JIRA)
Mark Harwood created LUCENE-6066:


 Summary: New remove method in PriorityQueue
 Key: LUCENE-6066
 URL: https://issues.apache.org/jira/browse/LUCENE-6066
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/query/scoring
Reporter: Mark Harwood
Priority: Minor
 Fix For: 5.0


It would be useful to be able to remove existing elements from a PriorityQueue. 
The proposal is that a linear scan is performed to find the element being 
removed and then the end element in heap[size] is swapped into this position to 
perform the delete. The method downHeap() is then called to shuffle the 
replacement element back down the array, but the existing downHeap method must 
be modified to allow picking up an entry from any point in the array rather 
than always assuming the first element (which is its only current mode of 
operation).

A working JavaScript model of the proposal, with animation, is available here: 
http://jsfiddle.net/grcmquf2/22/ 

In tests the modified version of downHeap produces the same results as the 
existing impl but adds the ability to push down from any point.

An example use case that requires remove is where a client doesn't want more 
than N matches for any given key (e.g. no more than 5 products from any one 
retailer in a marketplace). In these circumstances a document that was 
previously thought of as competitive has to be removed from the final PQ and 
replaced with another doc (e.g. a retailer who already has 5 matches in the PQ 
receives a 6th match which is better than his previous ones). This particular 
process is managed by a special DiversifyingPriorityQueue which wraps the 
main PriorityQueue and could be contributed as part of another issue if there 
is interest in that. 
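
A minimal sketch of the proposal, using a hand-rolled int min-heap with Lucene-style 1-based array indexing (this is an illustration, not the attached patch): remove() scans for the element, swaps heap[size] into the vacated slot, then restores the heap invariant from that position. Note that both downHeap and upHeap are needed after the swap, since the end element may be either larger than the slot's children or smaller than its parent; the upHeap call was the fix added in the V2 patch:

```java
// Hand-rolled int min-heap, 1-based array as in Lucene's PriorityQueue.
// Sketch of the proposed remove(): linear scan, swap-with-end, then re-heapify
// from the vacated position. At most one of downHeap/upHeap actually moves the
// swapped-in element, so calling both in sequence is safe.
class IntHeap {
    private final int[] heap;
    private int size;

    IntHeap(int maxSize) {
        heap = new int[maxSize + 1]; // slot 0 unused, as in Lucene's PriorityQueue
    }

    void add(int v) {
        heap[++size] = v;
        upHeap(size);
    }

    int top() {
        return heap[1];
    }

    int pop() {
        int result = heap[1];
        heap[1] = heap[size--];
        downHeap(1);
        return result;
    }

    /** Removes one occurrence of v; returns false if it wasn't present. */
    boolean remove(int v) {
        for (int i = 1; i <= size; i++) { // linear scan to locate the element
            if (heap[i] == v) {
                heap[i] = heap[size--]; // swap the end element into the hole
                downHeap(i);            // push it down if it is too large...
                upHeap(i);              // ...or up if it is too small (V2 fix)
                return true;
            }
        }
        return false;
    }

    private void upHeap(int i) {
        int node = heap[i];
        while (i > 1 && node < heap[i / 2]) {
            heap[i] = heap[i / 2];
            i /= 2;
        }
        heap[i] = node;
    }

    // downHeap generalised to start from any position, not just the root.
    private void downHeap(int i) {
        int node = heap[i];
        while (true) {
            int child = 2 * i;
            if (child > size) break;
            if (child < size && heap[child + 1] < heap[child]) child++;
            if (heap[child] >= node) break;
            heap[i] = heap[child];
            i = child;
        }
        heap[i] = node;
    }
}
```

The generalised downHeap(i) is the only change to the existing sift logic; starting it from i=1 reproduces the current behaviour exactly.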






[jira] [Updated] (LUCENE-6066) New remove method in PriorityQueue

2014-11-20 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-6066:
-
Attachment: LUCENE-PQRemoveV1.patch

New remove(element) method in PriorityQueue and related test




[jira] [Commented] (LUCENE-6066) New remove method in PriorityQueue

2014-11-20 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14219651#comment-14219651
 ] 

Mark Harwood commented on LUCENE-6066:
--

If the PQ set the current array position as a property of each element every 
time it moved one, I could pass the array index to remove() rather than an 
object that has to be found with a linear scan.
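
A sketch of that idea (hypothetical names, not from any attached patch): the queue writes each element's current slot into a pqPos field on every move, so remove() can jump straight to the slot in O(log n) instead of scanning:

```java
// Hypothetical sketch: a min-heap (1-based array) whose elements carry their
// own slot index, maintained by the queue on every move, so removal needs no
// linear scan to find the element.
class Entry {
    int pqPos;        // current slot in the heap, maintained by the queue
    final float score;
    Entry(float score) { this.score = score; }
}

class PositionTrackingHeap {
    private final Entry[] heap;
    private int size;

    PositionTrackingHeap(int maxSize) {
        heap = new Entry[maxSize + 1]; // slot 0 unused
    }

    // Every placement goes through set() so pqPos always tracks the slot.
    private void set(int i, Entry e) {
        heap[i] = e;
        e.pqPos = i;
    }

    void add(Entry e) {
        set(++size, e);
        upHeap(size);
    }

    Entry top() {
        return heap[1];
    }

    /** O(log n) remove: e.pqPos locates the slot directly, no scan needed. */
    void remove(Entry e) {
        int i = e.pqPos;
        set(i, heap[size--]); // swap the end element into the hole
        downHeap(i);          // at most one of these actually moves it
        upHeap(i);
    }

    private void upHeap(int i) {
        Entry node = heap[i];
        while (i > 1 && node.score < heap[i / 2].score) {
            set(i, heap[i / 2]);
            i /= 2;
        }
        set(i, node);
    }

    private void downHeap(int i) {
        Entry node = heap[i];
        while (true) {
            int child = 2 * i;
            if (child > size) break;
            if (child < size && heap[child + 1].score < heap[child].score) child++;
            if (heap[child].score >= node.score) break;
            set(i, heap[child]);
            i = child;
        }
        set(i, node);
    }
}
```

The trade-off is the extra pqPos bookkeeping on every sift, which not all PQ users would want to pay for.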




[jira] [Commented] (LUCENE-6066) New remove method in PriorityQueue

2014-11-20 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14219822#comment-14219822
 ] 

Mark Harwood commented on LUCENE-6066:
--

I guess it's different from grouping in that: 
1) It only involves one pass over the data.
2) The client doesn't have to guess up-front how many groups they will need.
3) We don't get any filler docs in each group's results, i.e. a bunch of 
irrelevant docs for an author with one good hit.




[jira] [Commented] (LUCENE-6066) New remove method in PriorityQueue

2014-11-20 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14219901#comment-14219901
 ] 

Mark Harwood commented on LUCENE-6066:
--

An analogy might be making a compilation album of 1967's top hit records:

1) A vanilla Lucene query's results might look like a Best of the Beatles 
album - no diversity.
2) A grouping query would produce The 10 Top-Selling Artists of 1967 - some 
killer and quite a lot of filler.
3) A diversified query would be the top 20 hit records of that year - with a 
max of 3 Beatles hits to maintain diversity.




[jira] [Commented] (LUCENE-6066) New remove method in PriorityQueue

2014-11-20 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14220089#comment-14220089
 ] 

Mark Harwood commented on LUCENE-6066:
--

bq. But how will you track the min element for each key in the PQ (to know 
which element to remove, when a more competitive hit with that key arrives)?

I was thinking of this as a foundation (pseudocode): 

{code:title=DiversifyingPriorityQueue.java|borderStyle=solid}
abstract class KeyedElement {
   int pqPos;
   abstract Object getKey();
}

class DiversifyingPriorityQueue<T extends KeyedElement> extends PriorityQueue<T> {
   FastRemovablePriorityQueue<T> mainPQ;
   Map<Object, PriorityQueue<T>> perKeyQueues;
}
{code}

You can probably guess at the logic but it is based around: 
* Making sure each key has a max of n entries using an entry in perKeyQueues
* Evictions from the mainPQ will require removal from the related perKeyQueue
* Emptied perKeyQueues can be recycled for use with other keys
* Evictions from the perKeyQueue will require removal from the mainPQ
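The per-key half of that bookkeeping (ignoring the coupling back into the main PQ, and using java.util.PriorityQueue rather than Lucene's) could be sketched as follows; class and method names here are illustrative, not from any patch:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// Sketch of the per-key cap only: each key keeps at most maxPerKey scores,
// evicting its worst (lowest) score when a better one arrives. A full
// implementation would also mirror every eviction into the main PQ.
final class PerKeyTopScores {
    private final int maxPerKey;
    private final Map<String, PriorityQueue<Double>> perKeyQueues = new HashMap<>();

    PerKeyTopScores(int maxPerKey) { this.maxPerKey = maxPerKey; }

    /** @return true if the score was retained for this key. */
    boolean offer(String key, double score) {
        PriorityQueue<Double> q =
            perKeyQueues.computeIfAbsent(key, k -> new PriorityQueue<>());
        if (q.size() < maxPerKey) {
            q.add(score);
            return true;
        }
        if (score > q.peek()) {   // better than this key's current worst
            q.poll();             // evict; would also be removed from the main PQ
            q.add(score);
            return true;
        }
        return false;
    }

    int count(String key) {
        PriorityQueue<Double> q = perKeyQueues.get(key);
        return q == null ? 0 : q.size();
    }
}
```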

bq. This seems promising, maybe as a separate dedicated (forked) PQ impl?

Yes, introducing a linear-cost remove by marking elements with a position is an 
added cost that not all PQs will require, so forking seems necessary. In this 
case a common abstraction for these different PQs would be useful for the 
places where results are consumed, e.g. TopDocsCollector.


 New remove method in PriorityQueue
 

 Key: LUCENE-6066
 URL: https://issues.apache.org/jira/browse/LUCENE-6066
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/query/scoring
Reporter: Mark Harwood
Priority: Minor
 Fix For: 5.0

 Attachments: LUCENE-PQRemoveV1.patch


 It would be useful to be able to remove existing elements from a 
 PriorityQueue. 
 The proposal is that a linear scan is performed to find the element being 
 removed and then the end element in heap[size] is swapped into this position 
 to perform the delete. The method downHeap() is then called to shuffle the 
 replacement element back down the array but the existing downHeap method must 
 be modified to allow picking up an entry from any point in the array rather 
 than always assuming the first element (which is its only current mode of 
 operation).
 A working javascript model of the proposal with animation is available here: 
 http://jsfiddle.net/grcmquf2/22/ 
 In tests the modified version of downHeap produces the same results as the 
 existing impl but adds the ability to push down from any point.
 An example use case that requires remove is where a client doesn't want more 
 than N matches for any given key (e.g. no more than 5 products from any one 
 retailer in a marketplace). In these circumstances a document that was 
 previously thought of as competitive has to be removed from the final PQ and 
 replaced with another doc (e.g. a retailer who already has 5 matches in the PQ 
 receives a 6th match which is better than their previous ones). This particular 
 process is managed by a special DiversifyingPriorityQueue which wraps the 
 main PriorityQueue and could be contributed as part of another issue if there 
 is interest in that. 






[jira] [Updated] (LUCENE-725) NovelAnalyzer - wraps your choice of Lucene Analyzer and filters out all boilerplate text

2013-07-24 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-725:


Attachment: NovelAnalyzer.java

Updated to work with Lucene 4 APIs. 

 NovelAnalyzer - wraps your choice of Lucene Analyzer and filters out all 
 boilerplate text
 ---

 Key: LUCENE-725
 URL: https://issues.apache.org/jira/browse/LUCENE-725
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Mark Harwood
Assignee: Otis Gospodnetic
Priority: Minor
 Attachments: NovelAnalyzer.java, NovelAnalyzer.java, 
 NovelAnalyzer.java, NovelAnalyzer.java


 This is a class I have found to be useful for analyzing small (in the 
 hundreds) collections of documents and  removing any duplicate content such 
 as standard disclaimers or repeated text in an exchange of  emails.
 This has applications in sampling query results to identify key phrases, 
 improving speed-reading of results with similar content (eg email 
 threads/forum messages) or just removing duplicated noise from a search index.
 To be more generally useful it needs to scale to millions of documents - in 
 which case an alternative implementation is required. See the notes in the 
 Javadocs for this class for more discussion on this

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4866) Lucene corruption

2013-03-21 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13608826#comment-13608826
 ] 

Mark Harwood commented on LUCENE-4866:
--

The fact that the missing file looks to be held on a shared drive might also be 
significant if there is more than one Lucene process configured to access the 
same directory ...

 Lucene corruption
 -

 Key: LUCENE-4866
 URL: https://issues.apache.org/jira/browse/LUCENE-4866
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/index
Affects Versions: 3.5
 Environment: Amazone tomcat cluster with NTFS. 
Reporter: sachin
Priority: Blocker

 Hi all,
 We know that the Lucene index gets corrupted. In our case it is corrupting 
 again and again, and due to this production is inconsistent. The following 
 errors are observed. Any help will be helpful.
 org.hibernate.search.SearchException: Unable to reopen IndexReader
 at 
 org.hibernate.search.indexes.impl.SharingBufferReaderProvider$PerDirectoryLatestReader.refreshAndGet(SharingBufferReaderProvider.java:230)
 at 
 org.hibernate.search.indexes.impl.SharingBufferReaderProvider.openIndexReader(SharingBufferReaderProvider.java:73)
 at 
 org.hibernate.search.reader.impl.MultiReaderFactory.openReader(MultiReaderFactory.java:49)
 at 
 org.hibernate.search.query.engine.impl.HSQueryImpl.buildSearcher(HSQueryImpl.java:596)
 at 
 org.hibernate.search.query.engine.impl.HSQueryImpl.buildSearcher(HSQueryImpl.java:495)
 at 
 org.hibernate.search.query.engine.impl.HSQueryImpl.queryEntityInfos(HSQueryImpl.java:239)
 at 
 org.hibernate.search.query.hibernate.impl.FullTextQueryImpl.list(FullTextQueryImpl.java:209)
 at 
 com.lifetech.ngs.dataaccess.spring.util.SearchUtil.returnProjectionData(SearchUtil.java:646)
 at 
 com.lifetech.ngs.dataaccess.spring.util.SearchUtil.getSinglePropertyOnlyUsingSearch(SearchUtil.java:556)
 at 
 com.lifetech.ngs.dataaccess.spring.util.SearchUtil$$FastClassByCGLIB$$568d5972.invoke(generated)
 at net.sf.cglib.proxy.MethodProxy.invoke(MethodProxy.java:191)
 at 
 org.springframework.aop.framework.Cglib2AopProxy$CglibMethodInvocation.invokeJoinpoint(Cglib2AopProxy.java:689)
 at 
 org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:150)
 at 
 org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:110)
 at 
 org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
 at 
 org.springframework.aop.framework.Cglib2AopProxy$DynamicAdvisedInterceptor.intercept(Cglib2AopProxy.java:622)
 at 
 com.lifetech.ngs.dataaccess.spring.util.SearchUtil$$EnhancerByCGLIB$$47fb00d0.getSinglePropertyOnlyUsingSearch(generated)
 at 
 com.lifetech.ngs.server.impl.SampleManagerImpl.getNameSearchResult(SampleManagerImpl.java:2436)
 at 
 com.lifetech.ngs.server.impl.SampleManagerImpl$$FastClassByCGLIB$$17af181d.invoke(generated)
 at net.sf.cglib.proxy.MethodProxy.invoke(MethodProxy.java:191)
 at 
 org.springframework.aop.framework.Cglib2AopProxy$CglibMethodInvocation.invokeJoinpoint(Cglib2AopProxy.java:689)
 at 
 org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:150)
 at 
 org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:110)
 at 
 org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
 at 
 org.springframework.aop.framework.Cglib2AopProxy$DynamicAdvisedInterceptor.intercept(Cglib2AopProxy.java:622)
 at 
 com.lifetech.ngs.server.impl.SampleManagerImpl$$EnhancerByCGLIB$$75b745f9.getNameSearchResult(generated)
 at 
 com.lifetech.ngs.webui.mgc.widgets.sample.SearchSamplesView.populateData(SearchSamplesView.java:635)
 at 
 com.lifetech.ngs.webui.customcomponents.IRAutoComplete.changeVariables(IRAutoComplete.java:39)
 at 
 com.vaadin.terminal.gwt.server.AbstractCommunicationManager.changeVariables(AbstractCommunicationManager.java:1445)
 at 
 com.vaadin.terminal.gwt.server.AbstractCommunicationManager.handleVariableBurst(AbstractCommunicationManager.java:1393)
 at 
 com.lifetech.ngs.webui.main.SpringVaadinServlet$1.handleVariableBurst(SpringVaadinServlet.java:57)
 at 
 com.vaadin.terminal.gwt.server.AbstractCommunicationManager.handleVariables(AbstractCommunicationManager.java:1312)
 at 
 com.vaadin.terminal.gwt.server.AbstractCommunicationManager.doHandleUidlRequest(AbstractCommunicationManager.java:763)
 at 
 

[jira] [Commented] (LUCENE-4768) Child Traversable To Parent Block Join Query

2013-02-11 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13575740#comment-13575740
 ] 

Mark Harwood commented on LUCENE-4768:
--

As with any discussion about nested queries you need to be very clear about the 
required logic. When you talk about matching f1:A or f1:B - are we talking 
about matches on the same child doc or possibly matches on different child docs 
of the same parent? The examples don't make this clear.
If we assume your child-based criteria is focused on examining the contents of 
single children (as opposed to combining f1:A on one child doc with f1:B on a 
different child doc) then a BooleanQuery that combines these child query 
elements will already be sufficient for skipping through children.

Not really sure what you are trying to optimize anyway with skipping - 
parent-child combos are limited to what fits into a single segment which is in 
turn limited by RAM. You don't generally get parents with many many children 
because of these constraints. The nextDoc calls you are trying to skip are 
related to a compressed block of child doc IDs (gap encoded varints) that are 
read off disk in 1K chunks (if I recall default Directory settings correctly). 
The chances are high that the limited number of child docIDs that belong to 
each parent are already in RAM as part of normal disk access patterns so there 
is no real saving in disk IO. Are you sure this is a performance bottleneck?




 Child Traversable To Parent Block Join Query
 

 Key: LUCENE-4768
 URL: https://issues.apache.org/jira/browse/LUCENE-4768
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/query/scoring
 Environment: trunk
 git rev-parse HEAD
 5cc88eaa41eb66236a0d4203cc81f1eed97c9a41
Reporter: Vadim Kirilchuk
 Attachments: LUCENE-4768-draft.patch


 Hi everyone!
 Let me describe what I am trying to do:
 I have hierarchical documents ('car model' as parent, 'trim' as child) and 
 use block join queries to retrieve them. However, I am not happy with the 
 current behavior of ToParentBlockJoinQuery, which goes through all of a 
 parent's children during the nextDoc call (accumulating scores and freqs).
 Consider the following example: you have a query with a custom post condition 
 on top of such a BJQ, and during the post condition you traverse the scorers 
 tree (doc-at-a-time) and want to manually push the child scorers of the BJQ 
 one by one until the condition passes or the current parent has no more 
 children.
 I am attaching a patch (with some tests) for a query similar to 
 ToParentBlockJoin but with the ability to traverse children. (I have to do a 
 weird instanceof check and cast inside my code.) This is a draft only and I 
 will be glad to hear if someone needs it, or how we can improve it. 
 P.S. I believe that the proposed query is more generic (lower level) than 
 ToParentBJQ, and ToParentBJQ could be extended from it, calling nextChild() 
 internally during nextDoc().
 Also, I think that the problem of traversing hierarchical documents is more 
 complex, as Lucene has only a nextDoc API. What do you think about making the 
 API more hierarchy-aware? A one-level document is a special case of a 
 multi-level document, but not vice versa. WDYT?
 Thanks in advance.




[jira] [Commented] (LUCENE-4768) Child Traversable To Parent Block Join Query

2013-02-11 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13575825#comment-13575825
 ] 

Mark Harwood commented on LUCENE-4768:
--

Still not sure what problem you are trying to solve. 
bq. i need to know field and text for each matched leaf scorer 

Why? For scoring purposes? ToParentBJQ has a configurable ScoreMode to control 
if you want the max, avg or sum of the child matches rolled into the combined 
parent score. Is that insufficient control for your needs?
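As a toy illustration of those score modes (pure arithmetic only, none of Lucene's actual Scorer plumbing), the rollup of child-hit scores into a parent score amounts to:

```java
// Toy rollup mirroring the Max / Avg / Total semantics of
// ToParentBlockJoinQuery's ScoreMode. Enum and class names are illustrative.
enum Mode { MAX, AVG, TOTAL }

final class ChildScoreRollup {
    static double parentScore(Mode mode, double[] childScores) {
        double total = 0, max = Double.NEGATIVE_INFINITY;
        for (double s : childScores) {
            total += s;
            max = Math.max(max, s);
        }
        switch (mode) {
            case MAX: return max;                          // best child wins
            case AVG: return total / childScores.length;   // mean of children
            default:  return total;                        // sum of children
        }
    }
}
```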






[jira] [Commented] (LUCENE-4768) Child Traversable To Parent Block Join Query

2013-02-11 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13575864#comment-13575864
 ] 

Mark Harwood commented on LUCENE-4768:
--

OK - this problem seems to be about an ill-defined user query ("Saturn sky blue 
Sedan", with no explicit fields) being executed against a well-defined schema 
(cars with manufacturers, model names and bodyStyles that also have trims with 
colours).

If that's the case you have a heap of problems here which aren't necessarily 
related to the block join implementation. One example: IDF ranking being 
what it is, if a manufacturer like Ford creates a model called the "Blue", or you 
have bad data entry that stores this value in the wrong field, 
then Lucene will naturally rank model:blue higher than color:blue because of 
the scarcity of the token "blue" in that field context. That's almost the 
inverse of what you want.

A couple of suggestions for field-less queries like your example of "Saturn 
sky blue sedan":
1) Target the query on an unstructured "onebox" field that holds indexed 
content from all fields to achieve a more balanced IDF score.
2) Tokenize each item in the query string and find a most likely field for 
each search term by examining doc frequencies, e.g. color:blue vs modelName:blue 
etc. Augment the onebox query in 1) with the most-likely-field interpretation 
for each word in the query string if it has sufficient doc frequency.
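Suggestion 2) could be sketched roughly as follows. The nested map here stands in for calls like IndexReader.docFreq(new Term(field, token)); the class, field names, and sample counts are all illustrative assumptions:

```java
import java.util.HashMap;
import java.util.Map;

// For each query token, pick the field in which it has the highest document
// frequency. docFreqByField maps field name -> (token -> doc frequency),
// standing in for real IndexReader.docFreq() lookups.
final class LikelyFieldGuesser {
    private final Map<String, Map<String, Integer>> docFreqByField;

    LikelyFieldGuesser(Map<String, Map<String, Integer>> docFreqByField) {
        this.docFreqByField = docFreqByField;
    }

    /** @return the field with the highest df for this token, or null if unseen. */
    String likelyField(String token) {
        String best = null;
        int bestDf = 0;
        for (Map.Entry<String, Map<String, Integer>> field : docFreqByField.entrySet()) {
            int df = field.getValue().getOrDefault(token, 0);
            if (df > bestDf) {
                bestDf = df;
                best = field.getKey();
            }
        }
        return best;
    }
}
```

With made-up frequencies, "blue" would resolve to the color field because it is far more common there than as a model name.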










[jira] [Commented] (SOLR-3950) Attempting postings=BloomFilter results in UnsupportedOperationException

2012-10-16 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13476854#comment-13476854
 ] 

Mark Harwood commented on SOLR-3950:


BloomFilterPostingsFormat is designed to wrap another choice of PostingsFormat 
and adds .blm files to the other files created by the choice of delegate.

However your code has instantiated a BloomFilterPostingsFormat without passing 
a choice of delegate - presumably using the zero-arg constructor. 
The comments in the code for this zero-arg constructor state:

  // Used only by core Lucene at read-time via Service Provider instantiation -
  // do not use at Write-time in application code.





 Attempting postings=BloomFilter results in UnsupportedOperationException
 --

 Key: SOLR-3950
 URL: https://issues.apache.org/jira/browse/SOLR-3950
 Project: Solr
  Issue Type: Bug
Affects Versions: 4.1
 Environment: Linux bigindy5 2.6.32-279.9.1.el6.centos.plus.x86_64 #1 
 SMP Wed Sep 26 03:52:55 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
 [root@bigindy5 ~]# java -version
 java version 1.7.0_07
 Java(TM) SE Runtime Environment (build 1.7.0_07-b10)
 Java HotSpot(TM) 64-Bit Server VM (build 23.3-b01, mixed mode)
Reporter: Shawn Heisey
 Fix For: 4.1


 Tested on branch_4x, checked out after BlockPostingsFormat was made the 
 default by LUCENE-4446.
 I used 'ant generate-maven-artifacts' to create the lucene-codecs jar, and 
 copied it into my sharedLib directory.  When I subsequently tried 
 postings="BloomFilter" I got the following exception in the log:
 {code}
 Oct 15, 2012 11:14:02 AM org.apache.solr.common.SolrException log
 SEVERE: java.lang.UnsupportedOperationException: Error - 
 org.apache.lucene.codecs.bloom.BloomFilteringPostingsFormat has been 
 constructed without a choice of PostingsFormat
 {code}




[jira] [Commented] (SOLR-3950) Attempting postings=BloomFilter results in UnsupportedOperationException

2012-10-16 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13477036#comment-13477036
 ] 

Mark Harwood commented on SOLR-3950:


bq. If there is some schema config that will tell Solr to do the right thing, 
please let me know.

Right now BloomPF is like an abstract class - you need to fill-in-the-blanks as 
to what delegate it will use before you can use it at write-time.
I think we have 3 options:

1) Solr (or you) provide a new PF impl that weds BloomPF with a choice of PF, 
e.g. Lucene40PF, so you would have a zero-arg-constructor class named something 
like BloomLucene40PF, or...
2) Solr extends its config file format to provide a generic means of assembling 
wrapper PFs like Bloom in their config, e.g.:
   postingsFormat="BloomFilter" delegatePostingsFormat="FooPF" 
   and Solr then does reflection magic to call constructors appropriately, or...
3) Core Lucene is changed so that BloomPF is wedded to a default PF (e.g. 
Lucene40PF) if users (e.g. Solr) fail to nominate a choice of delegate for 
BloomPF.

Of these 1) feels like the right thing.
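The shape of option 1) can be shown with stand-in types (these are NOT Lucene's real classes; the real equivalent would extend org.apache.lucene.codecs.PostingsFormat and register via SPI, which is why the zero-arg constructor matters):

```java
// Stand-in for a postings format abstraction; illustrative only.
interface PostingsFormat {
    String name();
}

final class Lucene40PF implements PostingsFormat {
    public String name() { return "Lucene40"; }
}

// The wrapper cannot function without a delegate to wrap...
class BloomPF implements PostingsFormat {
    private final PostingsFormat delegate;
    BloomPF(PostingsFormat delegate) { this.delegate = delegate; }
    public String name() { return "Bloom(" + delegate.name() + ")"; }
}

// ...so a zero-arg subclass weds it to a default delegate, making it
// loadable by name through Service Provider instantiation.
final class BloomLucene40PF extends BloomPF {
    BloomLucene40PF() { super(new Lucene40PF()); }
}
```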

Cheers
Mark





[jira] [Commented] (LUCENE-3772) Highlighter needs the whole text in memory to work

2012-10-15 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13476044#comment-13476044
 ] 

Mark Harwood commented on LUCENE-3772:
--

For bigger-than-memory docs is it not possible to use nested documents to 
represent subsections (e.g. a child doc for each of the chapters in a book) and 
then use BlockJoinQuery to select the best child docs?
Highlighting can then be applied to a more manageable subset of the original 
content, and Lucene's ranking algos are used to select the best fragment 
rather than the highlighter's own attempts to reproduce this logic.

Obviously this depends on the shape of your content/queries, but 
books-and-chapters is probably a good fit for this approach.
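The idea can be sketched without Lucene at all: split the text into chapters, score each chapter against the query terms, and hand the highlighter only the winning chapter. Here naive term counting stands in for Lucene's ranking, and the class name is illustrative; a real implementation would index chapters as child documents and score them with a block join query.

```java
// Toy "best chapter" selector: scores each chapter by how many query terms
// it contains, so the highlighter only ever sees one chapter's worth of text.
final class BestChapterPicker {
    static int bestChapter(String[] chapters, String[] queryTerms) {
        int best = 0, bestScore = -1;
        for (int i = 0; i < chapters.length; i++) {
            String lower = chapters[i].toLowerCase();
            int score = 0;
            for (String term : queryTerms) {
                if (lower.contains(term.toLowerCase())) score++;
            }
            if (score > bestScore) {
                bestScore = score;
                best = i;
            }
        }
        return best;  // index of the chapter to pass to the highlighter
    }
}
```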

 Highlighter needs the whole text in memory to work
 --

 Key: LUCENE-3772
 URL: https://issues.apache.org/jira/browse/LUCENE-3772
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 3.5
 Environment: Windows 7 Enterprise x64, JRE 1.6.0_25
Reporter: Luis Filipe Nassif
  Labels: highlighter, improvement, memory

 Highlighter methods getBestFragment(s) and getBestTextFragments only accept a 
 String object representing the whole text to highlight. When dealing with 
 very large docs simultaneously, it can lead to heap consumption problems. It 
 would be better if the API could additionally accept a Reader object, like 
 Lucene Document Fields do.




[jira] [Commented] (LUCENE-4369) StringFields name is unintuitive and not helpful

2012-09-11 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13452900#comment-13452900
 ] 

Mark Harwood commented on LUCENE-4369:
--

SingleTermField ?

Not sure "matching" vs "searching" is a commonly understood differentiation.

 StringFields name is unintuitive and not helpful
 

 Key: LUCENE-4369
 URL: https://issues.apache.org/jira/browse/LUCENE-4369
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Robert Muir
 Attachments: LUCENE-4369.patch


 There's a huge difference between TextField and StringField: StringField 
 screws up scoring and bypasses your Analyzer.
 (See the java-user thread "Custom Analyzer Not Called When Indexing" as an 
 example.)
 The name we use here is vital, otherwise people will get bad results.
 I think we should rename StringField to MatchOnlyField.




[jira] [Commented] (LUCENE-4369) StringFields name is unintuitive and not helpful

2012-09-11 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13452914#comment-13452914
 ] 

Mark Harwood commented on LUCENE-4369:
--

Agreed on the need for a change - names are important.

I have a problem with using "match" on its own because the word is often 
associated with partial matching, e.g. "best match" or "fuzzy match".
A quick google suggests "match" has more connotations with fuzziness than 
exactness - there are 162m results for "best match" vs only 45m results for 
"exact match".

So how about ExactMatchField?








[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters

2012-08-13 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13433045#comment-13433045
 ] 

Mark Harwood commented on LUCENE-4069:
--

bq. Removing misleading 2X perf gain: it seems to depend heavily on the exact 
use case.

Fair enough - the original patch targeted Lucene 3.6 which benefited more 
heavily from this technique. The issue then morphed into a 4.x patch where 
performance gains were harder to find. 
I think the sweet spot is in primary key searches on indexes with ongoing heavy 
changes (more segment fragmentation, less OS-level caching?). This is the use 
case I am targeting currently and my final tests using our primary-key-counting 
test rig saw a 10 to 15% improvement over Pulsing.

bq. I'm asking because I need this feature but I'm stuck with 3.x for a while. 

I have a client in a similar situation who are contemplating using the 3.6 
patch.

bq. Are there bugs which should be fixed in the initial 3.6 patch? 

It has been a while since I looked at it - a quick run of ant test on my copy 
here showed no errors. I will give it a closer review if my client decides 
to go down this route, and can post any fixes here.
I expect that if you use the patch and get into trouble you can use an 
un-patched version of 3.6 to read the same index files (it should just ignore 
the extra .blm files created by the patched version).


 Segment-level Bloom filters
 ---

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6, 4.0-ALPHA
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Fix For: 4.0-BETA, 5.0

 Attachments: 4069Failure.zip, BloomFilterPostingsBranch4x.patch, 
 LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, 
 MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, 
 PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, 
 PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java


 An addition to each segment which stores a Bloom filter for selected fields 
 in order to give fast-fail to term searches, helping avoid wasted disk access.
 Best suited for low-frequency fields e.g. primary keys on big indexes with 
 many segments but also speeds up general searching in my tests.
 Overview slideshow here: 
 http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
 Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
 Patch based on 3.6 codebase attached.
 There are no 3.6 API changes currently - to play just add a field with _blm 
 on the end of the name to invoke special indexing/querying capability. 
 Clearly a new Field or schema declaration(!) would need adding to APIs to 
 configure the service properly.
 Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-08-02 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood resolved LUCENE-4069.
--

Resolution: Fixed
  Assignee: Mark Harwood

Committed to the 4.0 branch, revision 1368442.



[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-08-02 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13427322#comment-13427322
 ] 

Mark Harwood commented on LUCENE-4069:
--

Will do.




[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-08-02 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Fix Version/s: 5.0

Applied to trunk in revision 1368567.




[jira] [Commented] (LUCENE-4275) Threaded tests with MockDirectoryWrapper delete active PostingFormat files

2012-08-01 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13426481#comment-13426481
 ] 

Mark Harwood commented on LUCENE-4275:
--


Nailed it, Mike. Yet another beer I owe you.
I removed the IllegalStateException and it looks like the retry logic is now 
kicking in and all tests pass.

This reliance on throwing a particular exception type feels like an important 
contract to document. Currently the comments in PostingsFormat.fieldsProducer() 
read as follows:

bq.   Reads a segment.  NOTE: by the time this call returns, it must hold open 
any files it will need to use; else, those files may be deleted. 

I propose adding:

bq. Additionally, required files may be deleted during the execution of this 
call before there is a chance to open them. Under these circumstances an 
IOException should be thrown by the implementation. IOExceptions are expected 
and will automatically cause a retry of the segment-opening logic with the 
newly revised segments.

I'll roll that documentation addition into my LUCENE-4069 patch.
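The proposed contract can be illustrated with a small sketch. The names below (openSegment, openWithRetry) are hypothetical stand-ins, not Lucene's actual API; the point is only that a concurrently deleted file surfaces as an IOException, which the caller treats as a cue to retry against a newer file listing.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the proposed contract, not Lucene's actual code.
public class SegmentOpenRetry {

    // Stand-in for PostingsFormat.fieldsProducer: throws an IOException
    // (never IllegalStateException) when a required file has gone missing.
    static String openSegment(Set<String> liveFiles, String blmFile) throws IOException {
        if (!liveFiles.contains(blmFile)) {
            throw new FileNotFoundException("Missing file: " + blmFile);
        }
        return "opened " + blmFile;
    }

    // Caller side: an IOException is expected and triggers a retry with the
    // newly revised file listing rather than aborting the whole open.
    static String openWithRetry(List<Set<String>> snapshots, String blmFile) {
        IOException last = null;
        for (Set<String> snapshot : snapshots) {
            try {
                return openSegment(snapshot, blmFile);
            } catch (IOException e) {
                last = e; // file vanished under us; retry with the next snapshot
            }
        }
        throw new UncheckedIOException("segment never became readable", last);
    }

    public static void main(String[] args) {
        // First snapshot is stale (file already deleted); the retry succeeds.
        List<Set<String>> snapshots = List.of(Set.of(), Set.of("_9.blm"));
        System.out.println(openWithRetry(snapshots, "_9.blm")); // prints "opened _9.blm"
    }
}
```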


 Threaded tests with MockDirectoryWrapper delete active PostingFormat files
 --

 Key: LUCENE-4275
 URL: https://issues.apache.org/jira/browse/LUCENE-4275
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/codecs, general/test
Affects Versions: 4.0-ALPHA
 Environment: Win XP 64bit Sun JDK 1.6
Reporter: Mark Harwood
 Fix For: 4.0

 Attachments: Lucene-4275-TestClass.patch


 As part of testing Lucene-4069 I have encountered sporadic issues with files 
 going missing. I believe this is a bug in the test framework (multi-threading 
 issues in MockDirectoryWrapper?) so have raised a separate issue with 
 simplified test PostingFormat class here.
 Using this test PF will fail due to a missing file roughly one in four times 
 of executing this test:
 ant test-core  -Dtestcase=TestIndexWriterCommit 
 -Dtests.method=testCommitThreadSafety -Dtests.seed=EA320250471B75AE 
 -Dtests.slow=true -Dtests.postingsformat=TestNonCoreDummyPostingsFormat 
 -Dtests.locale=no -Dtests.timezone=Europe/Belfast -Dtests.file.encoding=UTF-8 




[jira] [Closed] (LUCENE-4275) Threaded tests with MockDirectoryWrapper delete active PostingFormat files

2012-08-01 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood closed LUCENE-4275.


Resolution: Not A Problem




[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-08-01 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: (was: BloomFilterPostingsBranch4x.patch)




[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-08-01 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: BloomFilterPostingsBranch4x.patch

Updated with a fix for the issue explored in LUCENE-4275.




[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-08-01 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: (was: BloomFilterPostingsBranch4x.patch)




[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-08-01 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: BloomFilterPostingsBranch4x.patch

Updated the patch to bring it in line with the latest core API changes.
All tests now pass cleanly so I will commit soon.




[jira] [Created] (LUCENE-4275) Threaded tests with MockDirectoryWrapper delete active PostingFormat files

2012-07-31 Thread Mark Harwood (JIRA)
Mark Harwood created LUCENE-4275:


 Summary: Threaded tests with MockDirectoryWrapper delete active 
PostingFormat files
 Key: LUCENE-4275
 URL: https://issues.apache.org/jira/browse/LUCENE-4275
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/codecs, general/test
Affects Versions: 4.0-ALPHA
 Environment: Win XP 64bit Sun JDK 1.6
Reporter: Mark Harwood
 Fix For: 4.0





[jira] [Updated] (LUCENE-4275) Threaded tests with MockDirectoryWrapper delete active PostingFormat files

2012-07-31 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4275:
-

Attachment: Lucene-4275-TestClass.patch

Attached a simple PostingsFormat used to illustrate cases of files going 
missing in PF tests.




[jira] [Commented] (LUCENE-4275) Threaded tests with MockDirectoryWrapper delete active PostingFormat files

2012-07-31 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13425895#comment-13425895
 ] 

Mark Harwood commented on LUCENE-4275:
--

Thanks, Rob. This test requires a call to ant clean between runs before it 
will consistently work. However, I don't consider that a fix and assume that 
we are still looking for a bug, as there's an index-consistency issue 
lurking somewhere. I've tried adding the setting 
-Dtests.directory=RAMDirectory but the test still looks to have some memory 
between runs.

I added some logging of creates and deletes as you suggest, and it looks like on 
a second, un-cleaned run my PF is being called to open a high-numbered 
segment which I suspect was created by an earlier run, as the logging doesn't 
show signs of the PF being asked to create content for this (or any other) 
segment as part of the current run. At this point it fails, as there is no 
longer a copy of the foobar file listed by the directory.
I have noticed in the logs from previous runs that MDW is asked to delete the 
segment's foobar file by IndexWriter as part of compaction into a compound 
CFS.

Hope this sheds some light, as I'm finding this a complex one to debug.





[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-20 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: 4069Failure.zip

Attached a log of thread activity showing how 
TestIndexWriterCommit.testCommitThreadSafety() is failing.
At this stage I can't tell if this is a failure in MockDirectoryWrapper, the 
test, or the BloomPF class, but it is related to files being removed unexpectedly.




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-19 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13418314#comment-13418314
 ] 

Mark Harwood commented on LUCENE-4069:
--

One remaining issue before I commit: it has appeared sporadically and looks 
to be consistently reproduced by this test:
ant test  -Dtestcase=TestIndexWriterCommit 
-Dtests.method=testCommitThreadSafety -Dtests.seed=EA320250471B75AE 
-Dtests.slow=true -Dtests.postingsformat=TestBloomFilteredLucene40Postings 
-Dtests.locale=no -Dtests.timezone=Europe/Belfast 
-Dtests.file.encoding=ISO-8859-1

The error it produces is this: 
[junit4:junit4] Caused by: java.lang.IllegalStateException: Missing 
file:_9_TestBloomFilteredLucene40Postings_0.blm
[junit4:junit4]   at 
org.apache.lucene.codecs.bloom.BloomFilteringPostingsFormat$BloomFilteredFieldsProducer.<init>(BloomFilteringPostingsFormat.java:175)
[junit4:junit4]   at 
org.apache.lucene.codecs.bloom.BloomFilteringPostingsFormat.fieldsProducer(BloomFilteringPostingsFormat.java:156)


MockDirectoryWrapper looks to be randomly deleting files (probably my blm 
file shown above) to simulate the effects of crashes.
Presumably I am doing the right thing in always throwing an exception if the 
.blm file is missing? The alternative would be to silently ignore the missing 
file, which seems undesirable.
If MDW is intended to only delete uncommitted files, I'm not sure how we end up 
in a scenario where BloomPF is being asked to open the uncommitted segment?











[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-19 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13418411#comment-13418411
 ] 

Mark Harwood commented on LUCENE-4069:
--

bq. I wonder if it has to do w/ only opening the file in the close() method (

Just tried opening the file earlier (in the BloomFilteredConsumer constructor) and 
that didn't fix it.
I previously also added an extra Directory.fileExists() sanity check 
immediately after closing the IndexOutput and all was well, so I think it's 
something happening after that. Will need to dig deeper.
I'm running on WinXP 64-bit, if that is of any significance to MDW's behaviour.





[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-17 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13416007#comment-13416007
 ] 

Mark Harwood commented on LUCENE-4069:
--

bq. At a minimum I think before committing we should make the SegmentWriteState 
accessible.

OK. Will that be the subject of a new Jira?

bq. Hmm why is anonymity at search time important?

It would seem to be an established design principle - see 
https://issues.apache.org/jira/browse/LUCENE-4069#comment-13285726

It would be a pain if user config settings required a custom SPI-registered 
class around just to decode the index contents. There's the resource/classpath 
hell, the chance of misconfiguration, and running Luke suddenly gets more 
complex.
The line to be drawn is between what are just config settings (field names, 
memory limits) and what are fundamentally different file formats (e.g. codec 
choices).
The design principle that looks to have been adopted is that the former ought 
to be accommodated without the need for custom SPI-registered classes, while 
the latter would need to locate an implementation via SPI to decode stored 
content. Seems reasonable.
The choice of hash algo does not fundamentally alter the on-disk format (they 
all produce an int), so I would suggest we treat this as a config setting rather 
than a fundamentally different choice of file format.










[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-17 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13416037#comment-13416037
 ] 

Mark Harwood commented on LUCENE-4069:
--

bq.  If a special decoder for foobar is needed, it must be loadable by SPI. 

I think we are in agreement on the broad principles. The fundamental question 
here, though, is: do you want to treat an index's choice of hash algo as 
something that would require a new SPI-registered PostingsFormat to decode, or 
can that be handled as I have done here, with a general-purpose SPI framework 
for hashing algos? 

Actually, re-thinking this, I suspect that rather than creating our own, I can 
use Java's existing SPI framework for hashing in the form of MessageDigest. 
I'll take a closer look into that...






[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-17 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13416084#comment-13416084
 ] 

Mark Harwood commented on LUCENE-4069:
--

bq. MessageDigest.getInstance(name) should be the way to go

I'm less keen now - a quick scan of the docs around MessageDigest throws up 
some issues:
1) SPI registration of MessageDigest providers looks to get into permissions 
hell, as it is closely related to security - see 
http://docs.oracle.com/javase/1.4.2/docs/guide/security/CryptoSpec.html#ProviderInstalling
which describes the steps required to approve a trusted provider.
2) MessageDigest as an interface is designed to stream content past the hashing 
algo in potentially many method calls. MurmurHash2.java is not currently 
written to process content this way, and it suits our needs in hashing small 
blocks of content in one hit. 

For these two reasons it looks like MessageDigest may be a pain to adopt, and 
the existing approach proposed in this patch may be preferable.
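The streaming contract in point 2 can be seen directly in the MessageDigest API: content is pushed through update() in as many calls as needed before digest() finalises the result, whereas a one-hit hasher consumes a whole block in a single call. A small sketch (MD5 is used purely as a stand-in digest algorithm):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DigestStyles {
    // Streaming style: MessageDigest accumulates input across many calls.
    public static byte[] streamed(byte[] a, byte[] b) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            md.update(a);          // content arrives in several calls...
            md.update(b);
            return md.digest();    // ...and is only finalised here
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e);
        }
    }

    // One-hit style, which is what a hasher like MurmurHash2 suits:
    // one block in, one value out.
    public static byte[] oneShot(byte[] whole) {
        try {
            return MessageDigest.getInstance("MD5").digest(whole);
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e);
        }
    }
}
```

Both paths produce the same digest for the same bytes; the difference is purely in the calling convention the implementation must support.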




[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-17 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: (was: BloomFilterPostingsBranch4x.patch)




[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-17 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: BloomFilterPostingsBranch4x.patch

New patch with use of SegmentWriteState to right-size the choice of bitset for 
the volume of content.




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-17 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13416383#comment-13416383
 ] 

Mark Harwood commented on LUCENE-4069:
--

A quick benchmark suggests the new right-sized bitset, as opposed to the old 
worst-case-scenario-sized bitset, is buying us a small performance improvement.

bq. I also don't think this PF should be per-field

There was a lengthy discussion earlier on this topic. The approach presented 
here seems reasonable.
For the average user there is the DefaultBloomFilterFactory, which now has 
reasonable sizing for all fields passed its way (assuming a heuristic of 
numDocs == numKeys to anticipate). Expert users can provide a 
BloomFilterFactory with a custom per-field sizing heuristic, and can also 
simply return null for non-bloomed fields.

Having a single, carefully configured BloomPF wrapper is preferable because you 
can channel appropriately configured bloom settings to a common PF delegate and 
avoid creating multiple .tii, .tis files etc., because PerFieldPF isn't smart 
enough to figure out that these bloom-ing choices do not require different 
physical files for all the delegated .tii etc. structures.

You don't *have* to use the per-field stuff in BloomPF, but there are benefits 
to be had in doing so which can't otherwise be achieved.
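As a sketch of the expert-user hook described above (the names and signatures here are simplified stand-ins for illustration, not the patch's actual BloomFilterFactory/FuzzySet API):

```java
// Hypothetical sketch of a per-field bloom choice: return a right-sized
// set for the primary-key field and null for everything else. The real
// factory in the patch takes Lucene types rather than these stand-ins.
public class PerFieldBloomChoice {
    public interface FuzzySet { }   // stand-in for the patch's bitset type

    static FuzzySet sizedSet(int expectedKeys) {
        // Real code would allocate a bitset sized for expectedKeys here.
        return new FuzzySet() { };
    }

    public static FuzzySet getSetForField(String fieldName, int numDocs) {
        if ("id".equals(fieldName)) {
            // Heuristic: for a primary-key field, numDocs == numKeys.
            return sizedSet(numDocs);
        }
        return null;   // null => no bloom filter for this field
    }
}
```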


bq. Can you add @lucene.experimental to all the new APIs?

Done.




[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-16 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: (was: BloomFilterPostingsBranch4x.patch)




[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-16 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: BloomFilterPostingsBranch4x.patch

Added bloom package.html and changes.txt. I plan to commit in a day or two if 
there are no objections.




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-16 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13415362#comment-13415362
 ] 

Mark Harwood commented on LUCENE-4069:
--

bq. It's the unique term count (for this one segment) that you need right? 
Yes, I need it before I start processing the stream of terms being flushed.
 
bq. Seems like LUCENE-4198 needs to solve this same problem.

Another possibly related point on more access to the merge context - custom 
codecs have a great opportunity at merge time to piggy-back some analysis on 
the data being streamed, e.g. to spot trending terms whose term frequencies 
differ drastically between the merging source segments. This would require 
access to the source segment as term postings are streamed, to observe the 
change in counts. 

bq. Also, why do we need to use SPI to find the HashFunction? Seems like 
overkill... we don't (yet) have a bunch of hash functions that are vying here 
right?

There's already a MurmurHash3 algo - we're currently using v2, so we could 
anticipate an upgrade at some stage. This patch provides that future-proofing.
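The SPI-based lookup being discussed could look roughly like the following. The HashFunction interface and all names here are invented for illustration and are not the patch's actual API; the sketch only shows the pattern of resolving a stored hash name via ServiceLoader, with a built-in fallback when no provider is registered:

```java
import java.util.ServiceLoader;

// Illustrative sketch only: resolve a hash algo by the name recorded in
// the segment, so a future MurmurHash3 provider can be plugged in via
// META-INF/services without changing this code.
public class HashLookup {
    public interface HashFunction {
        String name();
        int hash(byte[] data);
    }

    // Built-in default used when SPI finds no matching provider.
    static final HashFunction DEFAULT = new HashFunction() {
        public String name() { return "default"; }
        public int hash(byte[] data) {
            int h = 0;
            for (byte b : data) h = 31 * h + b;  // simple stand-in, not MurmurHash
            return h;
        }
    };

    public static HashFunction forName(String name) {
        for (HashFunction hf : ServiceLoader.load(HashFunction.class)) {
            if (hf.name().equals(name)) {
                return hf;   // a registered provider matched the stored name
            }
        }
        return DEFAULT;
    }
}
```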

bq. can't the postings format impl pass in an instance of HashFunction when 
making the FuzzySet

I don't think that is going to work. Currently all PostingsFormat impls that 
extend BloomFilterPostingsFormat can be anonymous (i.e. unregistered via SPI). 
All their settings (fields, hash algo, thresholds, etc.) are recorded at write 
time by the base class in the segment. At read time it is the 
BloomFilterPostingsFormat base class that is instantiated, not the write-time 
subclass, so we need to store the hash algo choice. We can't rely on the 
original subclass being around and configured appropriately with the original 
write-time choice of hashing function.

I think the current way feels safer overall, and it also allows other Lucene 
functions to safely record hashes along with a hash-name string that can be 
used to reconstitute results. 

bq. Can you move the imports under the copyright header?

Will do






[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-10 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13410145#comment-13410145
 ] 

Mark Harwood commented on LUCENE-4069:
--

bq. So now we are close to 1M lookups/sec for a single thread!

Cool!

bq. I wonder if somehow we can do a better job picking the right sized bit 
vector up front? 
bq. You basically need to know up front how many unique terms will be in the 
given field for this segment right?

Yes - the job of anticipating the number of unique keys probably has two 
different contexts:
1) Net-new segments, e.g. guessing up front how many docs/keys a user is likely 
to generate in a new segment before the flush settings kick in.
2) Merged segments, e.g. guessing how many unique keys survive a merge operation.

Estimating key volumes in context 1 is probably hard without some additional 
hints from the end user. Arguably the BloomFilterFactory.getSetForField() 
method already represents where this setting can be controlled.
In context 2, where potentially large merges occur, we could look at adding an 
extra method to BloomFilterFactory to handle this different context, e.g. 
something like
   FuzzySet getSetForMergeOpOnField(FieldInfo fi, OneMerge mergeContext)
Based on the size of the segments being merged and the volume of deletes, a 
more appropriately sized bloom bitset could be allocated from a worst-case 
estimate.
Not sure how we get the OneMerge instance fed through the call stack - could 
that be held somewhere on a ThreadLocal as generally useful context?
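For reference, the worst-case sizing arithmetic such a method might apply is the textbook bloom-filter formula (the standard result, not necessarily what FuzzySet actually implements): for n expected keys and a target false-positive rate p, the optimal bit count is m = -n*ln(p)/(ln 2)^2 and the optimal number of hashes is k = (m/n)*ln 2.

```java
// Textbook bloom-filter sizing, offered as a sketch of the worst-case
// estimate discussed above (not the patch's actual FuzzySet sizing code).
public class BloomSizing {
    // m = -n * ln(p) / (ln 2)^2, rounded up to whole bits.
    public static long optimalBitCount(long expectedKeys, double falsePositiveRate) {
        double ln2 = Math.log(2);
        return (long) Math.ceil(-expectedKeys * Math.log(falsePositiveRate) / (ln2 * ln2));
    }

    // k = (m / n) * ln 2, rounded to the nearest whole hash evaluation.
    public static int optimalHashCount(long bits, long expectedKeys) {
        return Math.max(1, (int) Math.round((double) bits / expectedKeys * Math.log(2)));
    }
}
```

For example, a merged segment expected to retain one million unique keys at a 1% target false-positive rate comes out at roughly 9.6M bits (about 1.2 MB) and 7 hash evaluations.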








[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-06 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: PKLookupUpdatePerfTest.java

Updated performance test with option to alter the ratio of inserts vs updates 
via keyspace size.




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-06 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13408097#comment-13408097
 ] 

Mark Harwood commented on LUCENE-4069:
--

Thanks for the extra tests, Mike. That's tightened performance, but that looks 
like a scary amount of code for the optimal solution to this basic incrementing 
operation :)

I've done some more benchmarks with the updated test and the performance 
characteristics are becoming clearer as shown in these results: 
http://goo.gl/dtWSb
Bloom performance is better than Pulsing, but the gap narrows as the volume of 
deletes lying around in old segments (caused by updates) grows. In these cases the 
BloomFilter gives a false positive and falls back to the equivalent operations 
of Pulsing. I added a 100MB start size for the BloomFilter for the large-scale 
tests because without this it gets saturated and there were occasional big 
spikes in batch times.
So overall there still looks to be a benefit and especially in low-frequency 
update scenarios.

I'll wait for the dust to settle on Lucene-4190 (given this Codec introduces a 
new file) before thinking about committing.

Cheers
Mark






[jira] [Commented] (LUCENE-4190) IndexWriter deletes non-Lucene files

2012-07-05 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13407099#comment-13407099
 ] 

Mark Harwood commented on LUCENE-4190:
--

-1 for merrily wiping contents of whatever directory a user happens to pick for 
an index location
+0 on requiring all codecs to declare filenames because I take on board Rob's 
points re complexity
+1 for the _* name-spacing proposal as a sensible compromise





 IndexWriter deletes non-Lucene files
 

 Key: LUCENE-4190
 URL: https://issues.apache.org/jira/browse/LUCENE-4190
 Project: Lucene - Java
  Issue Type: Bug
Reporter: Michael McCandless
Assignee: Robert Muir
 Fix For: 4.0, 5.0

 Attachments: LUCENE-4190.patch, LUCENE-4190.patch


 Carl Austin raised a good issue in a comment on my Lucene 4.0.0 alpha blog 
 post: 
 http://blog.mikemccandless.com/2012/07/lucene-400-alpha-at-long-last.html
 IndexWriter will now (as of 4.0) delete all foreign files from the index 
 directory.  We made this change because Codecs are free to write to any files 
 now, so the space of filenames is hard to bound.
 But if the user accidentally uses the wrong directory (eg c:/) then we will 
 in fact delete important stuff.
 I think we can at least use some simple criteria (must start with _, maybe 
 must fit certain pattern eg _base36(_X).Y), so we are much less likely to 
 delete a non-Lucene file




[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-25 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: (was: BloomFilterPostingsBranch4x.patch)




[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-25 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: BloomFilterPostingsBranch4x.patch

Added a customizable saturation threshold after which Bloom filters are retired 
and no longer maintained (due to merges creating very large segments).




[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-22 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: PKLookupUpdatePerfTest.java

Attached a performance test (adapted from Mike's PKLookupPerfTest) that 
demonstrates the worst-case scenario where BloomFilter offers the 2x speed up 
not previously revealed in Mike's other tests.

This test case mixes reads and writes on a growing index and is representative 
of the real-world scenario I am seeking to optimize. See the javadoc for test 
details.




[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-20 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: (was: BloomFilterPostingsBranch4x.patch)




[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-20 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: BloomFilterPostingsBranch4x.patch

Fix for the non-downsizing bug, plus a subsequent issue which that fix 
revealed. The second issue was that, on saturation, the downsize method would 
actually upsize into a bigger bitset. This caused false negatives on searches - 
it's safe to downsize the indexing bitset but not to upsize it, because 
downsizing already involves some information loss.
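For the record, the down-size-only rule can be sketched with plain bitsets. This assumes hashes are reduced with hash % size and that sizes are powers of two, which may not match the patch's exact scheme: folding the halves of a bitset together preserves every membership answer, whereas growing it cannot, because we no longer know which of the aliased positions originally held each key.

```java
import java.util.BitSet;

// Sketch of why downsizing a saturated Bloom bitset is safe but upsizing is not.
public class BloomFold {

    /** Fold a bitset down to half its (power-of-two) size by OR-ing the two halves. */
    public static BitSet downsize(BitSet bits, int oldSize) {
        int newSize = oldSize / 2;
        BitSet folded = new BitSet(newSize);
        for (int i = bits.nextSetBit(0); i >= 0; i = bits.nextSetBit(i + 1)) {
            // A key that hashed to position i now tests position i % newSize,
            // which this fold guarantees is still set - no false negatives.
            folded.set(i % newSize);
        }
        return folded;
    }

    public static void main(String[] args) {
        BitSet bits = new BitSet(16);
        bits.set(3);  // key A hashed to 3
        bits.set(13); // key B hashed to 13
        BitSet small = downsize(bits, 16);
        // Both keys still test positive against the 8-bit filter (13 % 8 == 5):
        System.out.println(small.get(3) && small.get(5)); // true
    }
}
```

Running the fold in reverse is impossible: from the folded bit at position 5 there is no way to tell whether 5 or 13 was originally set, which is exactly the information loss the fix guards against.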




[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-20 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: (was: PrimaryKeyPerfTest40.java)




[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-20 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: PrimaryKeyPerfTest40.java

Updated Performance test code based on new IndexReader changes for accessing 
subreaders




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-19 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13396934#comment-13396934
 ] 

Mark Harwood commented on LUCENE-4069:
--

Mike, currently having various issues getting this benchmark framework up and 
running on my Windows platform here - is it easy for you to kick off another 
run with the latest patch on your setup? The latest change to the patch 
shouldn't require an index rebuild from your last run.

No worries if this is too much hassle for you - I'll probably just try switching 
to testing on OSX at home.

Cheers,
Mark




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-19 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13397054#comment-13397054
 ] 

Mark Harwood commented on LUCENE-4069:
--

bq. problem: I'll run perf test again. It's easy...

Great, thanks.

bq. Alas it's not easy ... please report back on how to make it easier to set 
up!

My Windows-based woes were:
1) Had to install python (used 2.7)
2) Figure out python proxy settings for Wikipedia download
3) PySVN missing - downloaded install exe but it claimed Python 2.7 wasn't 
installed/available so gave up and did svn checkout manually
4) Ran first python test and it aborted with complaint about GnuPlot missing

I imagine most of what is needed here comes out of the box on typical OSX/Linux 
setup.






[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-18 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13395773#comment-13395773
 ] 

Mark Harwood commented on LUCENE-4069:
--

Interesting results, Mike - thanks for taking the time to run them.

bq.  BloomFilteredFieldsProducer should just pass through intersect to the 
delegate?

I have tried to make the BloomFilteredFieldsProducer get out of the way of the 
client app and the delegate PostingsFormat as soon as it is safe to do so, i.e. 
when the user is safely focused on a non-filtered field. While there is a 
chance the client may end up calling TermsEnum.seekExact(..) on a filtered 
field, I need to keep a wrapper object in place which is in a position to 
intercept this call. In all other method invocations I just end up delegating, 
so I wonder if all these extra method calls are the cause of the slowdown you 
see, e.g. when Fuzzy is enumerating over many terms. 
The only alternatives to endlessly wrapping in this way are:
a) API change - e.g. allow TermsEnum.seekExact to have a pluggable call-out for 
just this one method.
b) Mess around with byte-code manipulation techniques to weave in Bloom 
filtering (the sort of thing I recall Hibernate resorts to).

Neither of these seems particularly appealing, so I think we may have to live 
with fuzzy+bloom not being as fast as straight fuzzy.
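As a rough illustration of that wrapping trade-off, here is a self-contained toy. The TermSeeker interface is invented for the sketch and stands in for the real TermsEnum API; only seekExact gets the Bloom fast-fail, everything else would pass straight through to the delegate.

```java
import java.util.BitSet;
import java.util.Set;

// Illustrative stand-in for the wrapper described above, NOT real Lucene API.
public class BloomFilteredSeeker {

    /** Toy contract standing in for TermsEnum.seekExact(..). */
    public interface TermSeeker {
        boolean seekExact(String term); // true if the term exists in the segment
    }

    private final TermSeeker delegate;
    private final BitSet bloomBits;
    private final int mask; // bloom size is a power of two, so hashes are masked

    public BloomFilteredSeeker(TermSeeker delegate, BitSet bloomBits, int size) {
        this.delegate = delegate;
        this.bloomBits = bloomBits;
        this.mask = size - 1;
    }

    /** Build a 1-hash Bloom filter over the segment's terms, delegating to an exact lookup. */
    public static BloomFilteredSeeker build(Set<String> terms, int size) {
        BitSet bits = new BitSet(size);
        for (String t : terms) {
            bits.set(t.hashCode() & (size - 1));
        }
        return new BloomFilteredSeeker(terms::contains, bits, size);
    }

    public boolean seekExact(String term) {
        // Fast fail: a clear bit means the term definitely isn't in this segment,
        // so the delegate's disk-backed lookup is skipped entirely.
        if (!bloomBits.get(term.hashCode() & mask)) {
            return false;
        }
        return delegate.seekExact(term); // possible false positive: confirm for real
    }

    public static void main(String[] args) {
        BloomFilteredSeeker seeker = build(Set.of("apple", "bee"), 64);
        System.out.println(seeker.seekExact("apple")); // true
        System.out.println(seeker.seekExact("zebra")); // false
    }
}
```

The cost being discussed is visible here in miniature: every call on a filtered field pays an extra virtual dispatch through the wrapper even when the Bloom check cannot help, which is harmless for point lookups but adds up when a multi-term query enumerates thousands of terms.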

For completeness' sake - I don't have access to your benchmarking code, but I 
would hope that PostingsFormat.fieldsProducer() isn't called more than once for 
the same segment, as that's where the Bloom filters get loaded from disk, so 
there's inherent cost there too. I can't imagine this is the case.

BTW I've just finished a long-running set of tests which mixes up reads and 
writes here: http://goo.gl/KJmGv
This benchmark represents how graph databases such as Neo4j use Lucene for an 
index when loading (I typically use the Wikipedia links as a test set). I see a 
3.5x speed up on Lucene 4, and on Lucene 3.6 nearly a 9x speedup over the 
comparatively slower unpatched 3.6 codebase.





[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-18 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13395857#comment-13395857
 ] 

Mark Harwood commented on LUCENE-4069:
--

bq. I think the fix is simple: you are not overriding Terms.intersect now, in 
BloomFilteredTerms

Good catch - a quick test indeed shows a speed up on fuzzy queries. 
I'll prepare a new patch.

I'm not sure why 3.6+Bloom is faster than 4+Bloom in my tests. I'll take a 
closer look at your benchmark.
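The bug being discussed, a wrapper silently falling back to a slow inherited default instead of forwarding to its delegate, can be shown with a simplified model. This sketch is not the real Lucene API (the actual signature is `Terms.intersect(CompiledAutomaton, BytesRef)` and returns a `TermsEnum`); the interface and class names here are hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical simplified model of the Terms wrapper, for illustration.
interface TermSource {
    List<String> allTerms();

    // Default intersect linearly scans every term: correct but slow,
    // analogous to the generic fallback fuzzy queries were hitting.
    default List<String> intersect(Predicate<String> matcher) {
        List<String> out = new ArrayList<>();
        for (String t : allTerms()) {
            if (matcher.test(t)) {
                out.add(t);
            }
        }
        return out;
    }
}

// A filtering wrapper around another TermSource. Without the intersect
// override below, automaton-style queries would use the slow default
// scan above instead of the delegate's (possibly optimized) traversal.
class FilteredTermSource implements TermSource {
    private final TermSource delegate;

    FilteredTermSource(TermSource delegate) {
        this.delegate = delegate;
    }

    @Override
    public List<String> allTerms() {
        return delegate.allTerms();
    }

    // Forward intersect so the wrapped implementation's path is used.
    @Override
    public List<String> intersect(Predicate<String> matcher) {
        return delegate.intersect(matcher);
    }
}
```

The Bloom filter only helps exact single-term lookups, so for multi-term automaton queries (fuzzy, wildcard) the right behaviour is simply to pass the call straight through to the wrapped terms.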



[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-18 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: (was: BloomFilterPostingsBranch4x.patch)



