[jira] [Commented] (LUCENE-8876) EnglishMinimalStemmer does not implement s-stemmer paper correctly?
[ https://issues.apache.org/jira/browse/LUCENE-8876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16876507#comment-16876507 ]

Mark Harwood commented on LUCENE-8876:
--------------------------------------

I reached out to the paper author, Donna Harman, a while ago and she has just replied as follows:
{quote}It has been a very long time since I have thought about S-stemmers. But looking at your examples of bees and employees, it seems to me that rule 3 is the correct one because rule 2 would be prevented from firing.
{quote}
Given her assertion that rule 3 should apply to "bees", it looks like this would make rule 2 entirely redundant.

> EnglishMinimalStemmer does not implement s-stemmer paper correctly?
> -------------------------------------------------------------------
>
>                 Key: LUCENE-8876
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8876
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>            Reporter: Mark Harwood
>            Priority: Minor
>
> The EnglishMinimalStemmer fails to stem ees suffixes like bees, trees and employees.
> The [original paper|http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.104.9828&rep=rep1&type=pdf] has this table of rules:
> !https://user-images.githubusercontent.com/170925/59616454-5dc7d580-911c-11e9-80b0-c7a59458c5a7.png!
> The notes accompanying the table state:
> {quote}"the first applicable rule encountered is the only one used"
> {quote}
> For the {{ees}} and {{oes}} suffixes I think EnglishMinimalStemmer misinterpreted the rule logic, and consequently {{bees != bee}} and {{tomatoes != tomato}}: the {{oes}} and {{ees}} suffixes are left intact.
> "The first applicable rule" for {{ees}} could be interpreted as rule 2 or 3 in the table, depending on whether you take {{applicable}} to mean "the THEN part of the rule has fired" or just that the suffix was referenced in the rule. EnglishMinimalStemmer has assumed the latter and I think it should be the former: we should fall through into rule 3 for {{ees}} and {{oes}} (remove any trailing S). That's certainly the conclusion I came to independently when testing on real data.
> There are some additional changes I'd like to see in a plural stemmer but I won't list them here - the focus should be on making the code here match the original paper it references.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
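The fall-through interpretation argued for above can be sketched in a few lines. This is a minimal, hedged Python sketch of the three S-stemmer rules under the "a rule applies only if its THEN part fires" reading, not the actual EnglishMinimalStemmer code; the function name is illustrative.

```python
def s_stem(word: str) -> str:
    # Rule 1: "ies" -> "y", unless the word ends in "eies" or "aies"
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"
    # Rule 2: "es" -> "e", unless the word ends in "aes", "ees" or "oes"
    if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
        return word[:-1]
    # Rule 3: drop a trailing "s", unless the word ends in "us" or "ss".
    # This is the rule that "bees" and "tomatoes" fall through to.
    if word.endswith("s") and not word.endswith(("us", "ss")):
        return word[:-1]
    return word

print(s_stem("ponies"))  # pony  (rule 1)
print(s_stem("tunes"))   # tune  (rule 2)
print(s_stem("bees"))    # bee   (rule 2 exception, falls through to rule 3)
print(s_stem("glass"))   # glass (rule 3 exception, left intact)
```

Under this reading rule 2's {{ees}}/{{oes}} exceptions simply defer to rule 3, which is consistent with Harman's reply that rule 3 is the one that fires for "bees".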
[jira] [Commented] (LUCENE-8876) EnglishMinimalStemmer does not implement s-stemmer paper correctly?
[ https://issues.apache.org/jira/browse/LUCENE-8876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16871423#comment-16871423 ]

Mark Harwood commented on LUCENE-8876:
--------------------------------------

{quote}but then doesn't it mean that exceptions of the 2nd rule are always ignored?
{quote}
Good point. Rule 1's exceptions are odd too - I have not found a single common English word that ends in aies or eies.
[jira] [Created] (LUCENE-8876) EnglishMinimalStemmer does not implement s-stemmer paper correctly?
Mark Harwood created LUCENE-8876:
------------------------------------

             Summary: EnglishMinimalStemmer does not implement s-stemmer paper correctly?
                 Key: LUCENE-8876
                 URL: https://issues.apache.org/jira/browse/LUCENE-8876
             Project: Lucene - Core
          Issue Type: Bug
          Components: modules/analysis
            Reporter: Mark Harwood
[jira] [Commented] (LUCENE-8840) TopTermsBlendedFreqScoringRewrite should use SynonymQuery
[ https://issues.apache.org/jira/browse/LUCENE-8840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16861960#comment-16861960 ]

Mark Harwood commented on LUCENE-8840:
--------------------------------------

{quote}we shouldn't favor documents that contain multiple variations of the same fuzzy term.
{quote}
For fuzzy I agree that rewarding more variations in a doc is probably undesirable - a doc will normally pick one spelling for a word and use it consistently, so any variations are more likely to be false positives (your baz/bad example). Plurals and other forms of suffix would be a notable exception, but I don't think that's too much of a problem because:
# we can assume that stemming is taking care of normalizing these tokens.
# a lot of fuzzy querying is for things like people's names, which aren't expressed as plurals or with other common suffixes.

I think all forms of automatic expansion (synonym, fuzzy, wildcard) need a form of score blending for the expansions they create. Wildcards are perhaps unlike fuzzy in that finding multiple variations in a doc _is_ desirable - we _are_ looking for multiple forms, and a document that contains many is better than one that contains few.

> TopTermsBlendedFreqScoringRewrite should use SynonymQuery
> ---------------------------------------------------------
>
>                 Key: LUCENE-8840
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8840
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Jim Ferenczi
>            Priority: Major
>         Attachments: LUCENE-8840.patch
>
> Today the TopTermsBlendedFreqScoringRewrite, which is the default rewrite method for Fuzzy queries, uses the BlendedTermQuery to score documents that match the fuzzy terms. This query blends the frequencies used for scoring across the terms and creates a disjunction of all the blended terms. This means that each fuzzy term that matches in a document will add its BM25 score contribution.
> We already have a query that can blend the statistics of multiple terms in a single scorer that sums the doc frequencies rather than the entire BM25 score: the SynonymQuery. Since https://issues.apache.org/jira/browse/LUCENE-8652 this query also handles boosts between 0 and 1, so it should be easy to change the default rewrite method for Fuzzy queries to use it instead of the BlendedTermQuery. This would bound the contribution of each term to the final score, which seems a better alternative in terms of relevancy than the current solution.
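The scoring difference between the two rewrite strategies can be illustrated with a toy BM25-style IDF calculation. This is a hypothetical Python sketch, not Lucene code: the disjunction adds one score contribution per matching variant, while synonym-style blending pools the doc frequencies into a single bounded contribution (as the issue description proposes).

```python
import math

def idf(doc_freq: int, num_docs: int) -> float:
    # BM25-style inverse document frequency
    return math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))

num_docs = 1000
variant_dfs = {"foo": 200, "fooo": 3}  # common spelling and a rare misspelling

# Disjunction of blended terms: a doc matching both variants adds an IDF
# contribution per variant, and the rare misspelling's large IDF dominates.
disjunction_score = sum(idf(df, num_docs) for df in variant_dfs.values())

# Synonym-style blending: one pseudo-term whose df pools the variants, so
# the contribution is bounded no matter how many variants a doc contains.
blended_score = idf(sum(variant_dfs.values()), num_docs)

print(disjunction_score > blended_score)  # True
```

The exact numbers are made up; the point is only that pooling the statistics removes the reward for matching many variants of the same fuzzy term.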
[jira] [Commented] (LUCENE-8352) Make TokenStreamComponents final
[ https://issues.apache.org/jira/browse/LUCENE-8352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16509635#comment-16509635 ]

Mark Harwood commented on LUCENE-8352:
--------------------------------------

My use case was a bit special. I had a custom reader that [dealt with hyperlinked text|https://github.com/elastic/elasticsearch/issues/29467#issuecomment-385393246] and stripped out the hyperlink markup using a custom Reader before feeding the remaining plain text into tokenisation. The tricky bit was that the extracted URLs were not thrown away but passed to a special TokenFilter at the end of the chain, to be injected at the appropriate positions in the text token stream. The workaround was a custom AnalyzerWrapper that overrode wrapReader (which is still invoked when wrapped) and then some ThreadLocal hackery to get my TokenFilter connected to the Reader's extracted URLs. I'm not sure how common this sort of analysis is, but before I reached this solution there was quite a detour trying to figure out why a custom TokenStreamComponents was not working when wrapped.

> Make TokenStreamComponents final
> --------------------------------
>
>                 Key: LUCENE-8352
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8352
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Mark Harwood
>            Priority: Minor
>
> The current design is a little trappy. Any specialised subclasses of TokenStreamComponents _(see StandardAnalyzer, ClassicAnalyzer, UAX29URLEmailAnalyzer)_ are discarded by any subsequent Analyzers that wrap them _(see LimitTokenCountAnalyzer, QueryAutoStopWordAnalyzer, ShingleAnalyzerWrapper and other examples in elasticsearch)_. Each AnalyzerWrapper.wrapComponents() implementation discards any custom TokenStreamComponents and replaces it with one of its own choosing (a vanilla TokenStreamComponents class, from the examples I've seen).
> This is a trap I fell into when writing a custom TokenStreamComponents with a custom setReader(), and I wondered why it was not being triggered when wrapped by other analyzers.
> If AnalyzerWrapper is designed to encourage composition, it's arguably a mistake to also permit custom TokenStreamComponents subclasses - the composition process does not preserve the choice of custom classes and any behaviours they might add. For this reason we should not encourage extensions to TokenStreamComponents (or, if TSC extensions are required, we should somehow mark an Analyzer as "unwrappable" to prevent lossy compositions).
[jira] [Created] (LUCENE-8352) Make TokenStreamComponents final
Mark Harwood created LUCENE-8352:
------------------------------------

             Summary: Make TokenStreamComponents final
                 Key: LUCENE-8352
                 URL: https://issues.apache.org/jira/browse/LUCENE-8352
             Project: Lucene - Core
          Issue Type: Improvement
          Components: modules/analysis
            Reporter: Mark Harwood
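The trap described above can be modelled with a small, language-neutral toy. This is a hypothetical Python sketch, not the actual Lucene classes: a wrapper that rebuilds components using the plain base class silently discards any overridden setReader behaviour in a subclass.

```python
class TokenStreamComponents:
    """Toy stand-in for Lucene's TokenStreamComponents."""
    def __init__(self, source, sink):
        self.source = source      # stand-in for the Tokenizer
        self.sink = sink          # stand-in for the final TokenStream
    def set_reader(self, text):
        self.source.append(text)  # stand-in for Tokenizer.setReader()

class MarkupStrippingComponents(TokenStreamComponents):
    """Custom subclass that pre-processes the input, e.g. stripping markup."""
    def set_reader(self, text):
        super().set_reader(text.replace("<b>", "").replace("</b>", ""))

def wrap_components(components):
    # What a typical AnalyzerWrapper.wrapComponents() does: rebuild with the
    # vanilla base class, so the subclass's set_reader is never invoked again.
    return TokenStreamComponents(components.source, components.sink)

custom = MarkupStrippingComponents([], sink=None)
custom.set_reader("<b>hello</b>")   # override fires: markup stripped
wrapped = wrap_components(custom)
wrapped.set_reader("<b>world</b>")  # override silently lost after wrapping
print(custom.source)                # ['hello', '<b>world</b>']
```

The second token keeps its markup because the wrapper replaced the subclass with the base class, which is exactly the lossy composition the issue complains about.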
[jira] [Closed] (LUCENE-6747) FingerprintFilter - a TokenFilter for clustering/linking purposes
[ https://issues.apache.org/jira/browse/LUCENE-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood closed LUCENE-6747.
--------------------------------

FingerprintFilter - a TokenFilter for clustering/linking purposes
-----------------------------------------------------------------

                 Key: LUCENE-6747
                 URL: https://issues.apache.org/jira/browse/LUCENE-6747
             Project: Lucene - Core
          Issue Type: New Feature
          Components: modules/analysis
            Reporter: Mark Harwood
            Priority: Minor
             Fix For: Trunk, 5.4
         Attachments: fingerprintv1.patch, fingerprintv2.patch, fingerprintv3.patch, fingerprintv4.patch

A TokenFilter that emits a single token which is a sorted, de-duplicated set of the input tokens. This approach to normalizing text is used in tools like OpenRefine [1] and elsewhere [2] to help in clustering or linking texts. The implementation proposed here has an upper limit on the size of the combined token which is output.

[1] https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth
[2] https://rajmak.wordpress.com/2013/04/27/clustering-text-map-reduce-in-python/
[jira] [Resolved] (LUCENE-6747) FingerprintFilter - a TokenFilter for clustering/linking purposes
[ https://issues.apache.org/jira/browse/LUCENE-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood resolved LUCENE-6747.
----------------------------------
    Resolution: Fixed

Committed to trunk and 5.x.
[jira] [Updated] (LUCENE-6747) FingerprintFilter - a TokenFilter for clustering/linking purposes
[ https://issues.apache.org/jira/browse/LUCENE-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-6747:
---------------------------------
    Fix Version/s: (was: 5.3.1)
                   5.4
[jira] [Updated] (LUCENE-6747) FingerprintFilter - a TokenFilter for clustering/linking purposes
[ https://issues.apache.org/jira/browse/LUCENE-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-6747:
---------------------------------
    Fix Version/s: 5.3.1
                   Trunk
[jira] [Updated] (LUCENE-6747) FingerprintFilter - a TokenFilter for clustering/linking purposes
[ https://issues.apache.org/jira/browse/LUCENE-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-6747:
---------------------------------
    Attachment: fingerprintv4.patch

Some final tweaks:
1) Fixed a bug where the separator was not appended if the first token has length == 1
2) Randomized testing identified an issue with input.end() not being called when IOExceptions occur
3) Added the missing SPI entry for FingerprintFilterFactory and an associated test class
[jira] [Updated] (LUCENE-6747) FingerprintFilter - a TokenFilter for clustering/linking purposes
[ https://issues.apache.org/jira/browse/LUCENE-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-6747:
---------------------------------
    Attachment: fingerprintv3.patch

Updated patch - removed the instanceof check and added an entry to CHANGES.txt. Will commit to trunk and 5.x in a day or two if there are no objections.
[jira] [Updated] (LUCENE-6747) FingerprintFilter - a TokenFilter for clustering/linking purposes
[ https://issues.apache.org/jira/browse/LUCENE-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-6747:
---------------------------------
    Attachment: fingerprintv2.patch

Thanks for taking a look, Adrien. Added a v2 patch with the following changes:
1) Added a call to input.end() to get the final offset state
2) The final state is retained using captureState()
3) Added a FingerprintFilterFactory class

As for the alternative hashing idea: for speed reasons this would be nice, but it reduces the readability of results if you want to debug any collisions or otherwise display connections. For compactness reasons (storing in doc values etc.) it would always be possible to chain a conventional hashing algorithm in a TokenFilter on the end of this text-normalizing filter. (Do we already have a conventional hashing TokenFilter?)
[jira] [Created] (LUCENE-6747) FingerprintFilter - a TokenFilter for clustering/linking purposes
Mark Harwood created LUCENE-6747:
------------------------------------

             Summary: FingerprintFilter - a TokenFilter for clustering/linking purposes
                 Key: LUCENE-6747
                 URL: https://issues.apache.org/jira/browse/LUCENE-6747
             Project: Lucene - Core
          Issue Type: New Feature
          Components: modules/analysis
            Reporter: Mark Harwood
            Priority: Minor
[jira] [Updated] (LUCENE-6747) FingerprintFilter - a TokenFilter for clustering/linking purposes
[ https://issues.apache.org/jira/browse/LUCENE-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-6747:
---------------------------------
    Attachment: fingerprintv1.patch

Proposed implementation and test.
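The core of the proposed filter can be sketched in a few lines. This is a hedged Python sketch of the sorted, de-duplicated fingerprinting idea described in the issue, not the patch itself; the parameter names and the exact overflow behaviour are assumptions.

```python
def fingerprint(tokens, separator=" ", max_output_len=1024):
    """Emit a single token: the sorted, de-duplicated set of input tokens,
    joined by a separator (the OpenRefine-style key-collision method)."""
    combined = separator.join(sorted(set(tokens)))
    # The issue proposes an upper limit on the size of the combined output
    # token; returning None here stands in for emitting nothing at all.
    if len(combined) > max_output_len:
        return None
    return combined

print(fingerprint(["the", "quick", "brown", "fox", "the"]))
# brown fox quick the
```

Two texts that differ only in word order or repetition produce the same fingerprint, which is what makes this useful as a clustering/linking key.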
[jira] [Commented] (LUCENE-329) Fuzzy query scoring issues
[ https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14552265#comment-14552265 ]

Mark Harwood commented on LUCENE-329:
-------------------------------------

Committed to 5.x branch and trunk.

Fuzzy query scoring issues
--------------------------

                 Key: LUCENE-329
                 URL: https://issues.apache.org/jira/browse/LUCENE-329
             Project: Lucene - Core
          Issue Type: Bug
          Components: core/search
    Affects Versions: 1.2
         Environment: Operating System: All
                      Platform: All
            Reporter: Mark Harwood
            Assignee: Mark Harwood
            Priority: Minor
             Fix For: 5.x
         Attachments: ASF.LICENSE.NOT.GRANTED--patch.txt, LUCENE-329.patch, LUCENE-329.patch, LUCENE-329.patch, LUCENE-329.patch

Queries which automatically produce multiple terms (wildcard, range, prefix, fuzzy etc.) currently suffer from two problems:
1) Scores for matching documents are significantly smaller than for term queries because of the volume of terms introduced (a match on query Foo~ is 0.1 whereas a match on query Foo is 1).
2) The rarer forms of expanded terms are favoured over the more common forms because of the IDF. When using Fuzzy queries, for example, rare misspellings typically appear in results before the more common correct spellings.
I will attach a patch that corrects the issues identified above by:
1) Overriding Similarity.coord to counteract the downplaying of scores introduced by expanding terms.
2) Taking the IDF factor of the most common form of expanded terms as the basis of scoring all other expanded terms.
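The second part of the fix can be sketched as follows. This is a hypothetical Python sketch of the idea, not the actual patch: score every expanded term with the IDF derived from the most common variant's document frequency, so rare misspellings no longer outrank the correct spelling.

```python
import math

def blended_idfs(term_dfs, num_docs):
    """One IDF per expanded term, all based on the df of the most common form."""
    max_df = max(term_dfs.values())
    # Classic Lucene-style IDF: 1 + log(numDocs / (docFreq + 1))
    shared_idf = 1.0 + math.log(num_docs / (max_df + 1))
    return {term: shared_idf for term in term_dfs}

# Expansions of the fuzzy query "foo~": the rare misspelling "fooo" no longer
# gets a huge IDF boost over the common correct spelling "foo".
idfs = blended_idfs({"foo": 200, "fooo": 3}, num_docs=1000)
print(idfs["foo"] == idfs["fooo"])  # True
```

Rare variants can still match, but they no longer dominate the ranking purely because of their rarity.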
[jira] [Updated] (LUCENE-329) Fuzzy query scoring issues
[ https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-329:
--------------------------------
    Attachment: LUCENE-329.patch

Last edits to remove unnecessary Math.max() tests. Added assertion around maxTTf expectations.
[jira] [Updated] (LUCENE-329) Fuzzy query scoring issues
[ https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-329:
--------------------------------
    Attachment: LUCENE-329.patch

Updated following review comments (thanks, Adrien). All tests passing on trunk.
[jira] [Updated] (LUCENE-329) Fuzzy query scoring issues
[ https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-329:
--------------------------------
    Attachment: (was: LUCENE-329.patch)
[jira] [Updated] (LUCENE-329) Fuzzy query scoring issues
[ https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-329:
Attachment: LUCENE-329.patch

A cut-and-paste error in the last patch set df=0, and the effects went undetected by the unit tests. Enhanced the unit test to detect the error, then fixed it.
[jira] [Commented] (LUCENE-329) Fuzzy query scoring issues
[ https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14550376#comment-14550376 ]

Mark Harwood commented on LUCENE-329:
Thanks, I'll commit tomorrow if there are no objections.
[jira] [Updated] (LUCENE-329) Fuzzy query scoring issues
[ https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-329:
Attachment: LUCENE-329.patch

Switched to the TermContext.accumulateStatistics() method Adrien suggested for tweaking stats.
[jira] [Updated] (LUCENE-329) Fuzzy query scoring issues
[ https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-329:
Attachment: LUCENE-329.patch

New patch addressing this long-standing bug. It addresses today's all-or-nothing choice, where the default is a (poor) use of all IDF factors and the sub-optimal alternative is a rewrite method that uses no IDF at all. The patch includes:
1) A new default FuzzyQuery rewrite method that balances IDF better
2) Unit tests for single- and multi-query behaviours

Additionally, this document offers more analysis based on quality tests on a slightly larger set of data not included here: https://docs.google.com/document/d/1KXhbUpD5GFyzNqfk3nocODOo7Upgpd5tmUQp4-OPwiM/edit#heading=h.2e8gdmdqf2m5
[jira] [Updated] (LUCENE-329) Fuzzy query scoring issues
[ https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-329:
Fix Version/s: (was: 3.1) (was: 4.0-ALPHA) 5.x
[jira] [Closed] (LUCENE-6066) Collector that manages diversity in search results
[ https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood closed LUCENE-6066.
Resolution: Fixed
Fix Version/s: (was: 5.0) 5.1

Committed to trunk and the 5.x branch. Thanks for the reviews, Adrien and Mike.

Collector that manages diversity in search results
--
Key: LUCENE-6066
URL: https://issues.apache.org/jira/browse/LUCENE-6066
Project: Lucene - Core
Issue Type: Improvement
Components: core/query/scoring
Reporter: Mark Harwood
Priority: Minor
Fix For: 5.1
Attachments: LUCENE-6066.patch, LUCENE-PQRemoveV8.patch, LUCENE-PQRemoveV9.patch

This issue provides a new collector for situations where a client doesn't want more than N matches for any given key (e.g. no more than 5 products from any one retailer in a marketplace). In these circumstances a document that was previously thought of as competitive during collection has to be removed from the final PQ and replaced with another doc (e.g. a retailer who already has 5 matches in the PQ receives a 6th match which is better than his previous ones). This requires a new remove method on the existing PriorityQueue class.
[jira] [Updated] (LUCENE-6066) Collector that manages diversity in search results
[ https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-6066:
Attachment: LUCENE-PQRemoveV9.patch

Moved DiversifiedTopDocsCollector and the related unit test to misc. Added the experimental annotation. Removed the superfluous `if (size == 0)` test in PriorityQueue. Thanks, Adrien.
[jira] [Commented] (LUCENE-6066) Collector that manages diversity in search results
[ https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14309365#comment-14309365 ]

Mark Harwood commented on LUCENE-6066:

bq. maybe we should have this feature in lucene/sandbox or in lucene/misc first instead of lucene/core?

It relies on a change to core's PriorityQueue (which was the original focus of this issue, before it extended into the specialized collector that is possibly the only justification for introducing a remove method on PQ).

bq. I think we should also add a lucene.experimental annotation to this collector?

That seems fair.

bq. the `if (size == 0)` condition at the top of PQ.remove looks already covered by the below for-loop?

Good point, will change.

bq. Should PQ.downHeap and upHeap delegate to their counterpart that takes a position?

I wanted to avoid any possibility of slowing down the PQ impl, so I kept the existing upHeap/downHeap methods intact and duplicated most of their logic in the versions that take a position.
[jira] [Updated] (LUCENE-6066) Collector that manages diversity in search results
[ https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-6066:
Attachment: (was: LUCENE-PQRemoveV7.patch)
[jira] [Updated] (LUCENE-6066) Collector that manages diversity in search results
[ https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-6066:
Attachment: (was: LUCENE-PQRemoveV6.patch)
[jira] [Updated] (LUCENE-6066) Collector that manages diversity in search results
[ https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-6066:
Attachment: LUCENE-PQRemoveV8.patch

Tabs removed. Ant precommit now passes. Still no Bee Gees (sorry, Mike). Will commit to trunk and 5.1 in a day or two if there are no objections.
[jira] [Updated] (LUCENE-6066) Collector that manages diversity in search results
[ https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-6066:
Attachment: LUCENE-PQRemoveV7.patch

Fixed the test PQ's impl of lessThan(), which was causing test failures when duplicate Integers were placed into the queue.
[jira] [Updated] (LUCENE-6066) Collector that manages diversity in search results
[ https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-6066:
Attachment: (was: LUCENE-PQRemoveV5.patch)
[jira] [Updated] (LUCENE-6066) Collector that manages diversity in search results
[ https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-6066:
Attachment: LUCENE-PQRemoveV6.patch

Removed outdated acceptDocsInOrder() method.
[jira] [Updated] (LUCENE-6066) Collector that manages diversity in search results
[ https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-6066:
Attachment: (was: LUCENE-PQRemoveV3.patch)
[jira] [Updated] (LUCENE-6066) Collector that manages diversity in search results
[ https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-6066:
Attachment: LUCENE-PQRemoveV5.patch

Added a JUnit test showing use with String-based dedup keys via two lookup impls: slow but accurate global ords, and fast but potentially inaccurate hashing of BinaryDocValues.
[jira] [Commented] (LUCENE-6066) Collector that manages diversity in search results
[ https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14277279#comment-14277279 ]

Mark Harwood commented on LUCENE-6066:

What feels awkward in the example JUnit test is that diversified collections are not compatible with the existing Sort functionality - I had to use a custom Similarity class to sort by the popularity of songs in my test data. Combining the diversified collector with any other existing collector (e.g. TopFieldCollector, to achieve field-based sorting) via wrapping is problematic, because the other collectors all assume that previously collected elements are never recalled. The diversifying collector needs the ability to recall previously collected elements when new elements with the same key must be substituted.
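The substitution behaviour discussed in that comment can be sketched with a global queue plus per-key queues. This is a minimal illustrative model of the idea, not the patch's API; all names are assumptions, and it uses java.util.PriorityQueue rather than Lucene's.

```java
import java.util.*;

// Hedged sketch: keep the global top `size` hits while allowing at most
// `maxPerKey` hits for any one key. When a key is already full, a better hit
// for that key substitutes that key's weakest hit rather than the global one.
class DiversifiedTopNSketch {
    record Hit(String key, double score) {}

    final int size, maxPerKey;
    // Ordered worst-first so peek() is the weakest competitive hit.
    final PriorityQueue<Hit> global =
            new PriorityQueue<>(Comparator.comparingDouble(Hit::score));
    final Map<String, PriorityQueue<Hit>> perKey = new HashMap<>();

    DiversifiedTopNSketch(int size, int maxPerKey) {
        this.size = size;
        this.maxPerKey = maxPerKey;
    }

    void collect(String key, double score) {
        Hit hit = new Hit(key, score);
        PriorityQueue<Hit> mine = perKey.computeIfAbsent(key,
                k -> new PriorityQueue<>(Comparator.comparingDouble(Hit::score)));
        if (mine.size() >= maxPerKey) {
            // Key is full: only substitute this key's weakest hit if we beat it.
            if (score <= mine.peek().score()) return;
            Hit evicted = mine.poll();
            global.remove(evicted);  // the "remove from the final PQ" step
        } else if (global.size() >= size) {
            // Queue is full: must beat the globally weakest hit.
            if (score <= global.peek().score()) return;
            Hit evicted = global.poll();
            perKey.get(evicted.key()).remove(evicted);
        }
        mine.add(hit);
        global.add(hit);
    }

    List<Hit> topDocs() {
        List<Hit> out = new ArrayList<>(global);
        out.sort(Comparator.comparingDouble(Hit::score).reversed());
        return out;
    }
}
```

The key point the comment makes is visible in the first branch: a previously collected hit is recalled and removed, which ordinary wrapped collectors cannot accommodate.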
[jira] [Updated] (LUCENE-6066) Collector that manages diversity in search results
[ https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-6066:

Description: This issue provides a new collector for situations where a client doesn't want more than N matches for any given key (e.g. no more than 5 products from any one retailer in a marketplace). In these circumstances a document that was previously thought of as competitive during collection has to be removed from the final PQ and replaced with another doc (e.g. a retailer who already has 5 matches in the PQ receives a 6th match which is better than his previous ones). This requires a new remove method on the existing PriorityQueue class.

(was: It would be useful to be able to remove existing elements from a PriorityQueue. The proposal is that a linear scan is performed to find the element being removed, and then the end element in heap[size] is swapped into this position to perform the delete. The method downHeap() is then called to shuffle the replacement element back down the array, but the existing downHeap method must be modified to allow picking up an entry from any point in the array rather than always assuming the first element (which is its only current mode of operation). A working javascript model of the proposal with animation is available here: http://jsfiddle.net/grcmquf2/22/ In tests the modified version of downHeap produces the same results as the existing impl but adds the ability to push down from any point. An example use case that requires remove is where a client doesn't want more than N matches for any given key (e.g. no more than 5 products from any one retailer in a marketplace). In these circumstances a document that was previously thought of as competitive has to be removed from the final PQ and replaced with another doc (e.g. a retailer who already has 5 matches in the PQ receives a 6th match which is better than his previous ones). This particular process is managed by a special DiversifyingPriorityQueue which wraps the main PriorityQueue, and could be contributed as part of another issue if there is interest.)

Summary: Collector that manages diversity in search results (was: New remove method in PriorityQueue)
[jira] [Commented] (LUCENE-6066) New remove method in PriorityQueue
[ https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14239328#comment-14239328 ] Mark Harwood commented on LUCENE-6066: -- Thanks for the review, Mike. I'm working through changes. bq. Why couldn't you just pass your custom queue instead of null to super() in DiversifiedTopDocsCollector ctor? Oops. That was a cut/paste error transferring code from es that relied on a forked PriorityQueue which is obviously incompatible with the Lucene TopDocsCollector base class. bq. the abstract method returns NumericDocValues, which is confusing: how does beatles become a number? Why not e.g. SortedDVs I originally had a getKey(docId) method that returned an object - anything which implements hashCode and Equals. When I talked through with Adrien he suggested the use of NumericDocValues as a better abstraction which could be backed by any system based on ordinals. We need to decide on what this abstraction should be. One of the things I've been grappling with is if the collector should implement support for multi-keyed docs e.g. a field containing hashes for near-duplicate detection to avoid too-similar texts. This would require extra code in the collector to determine if any one key had exceeded limits (and ideally some memory-safeguard for docs with too many keys). I saw a test about paging; how does/should paging work with such a collector? In regular collections, TopScoreDocCollector provides all of the smarts for in-order/out-of-order and starting from the ScoreDoc at the bottom of the previous page. I expect I would have to reimplement all of it's logic for a new DiversifiedTopScoreKeyedDocCollector because it makes some assumptions about using updateTop() that don't apply when we have a two-tier system for scoring (globally competitive and within-key competitive). 
My vague assumption was that the paging logic would have to be that any per-key constraints apply across multiple pages, e.g. having had 5 Beatles hits on pages 1 and 2 you wouldn't expect to find any more the deeper you go into the results, because the max-5-per-key limit had been exhausted. This logic would probably preclude any use of the deep-paging optimisation where you can pass just the ScoreDoc of the last entry on the previous page to minimise the size of the PQ created for subsequent pages.

> New remove method in PriorityQueue
> --
> Key: LUCENE-6066
> URL: https://issues.apache.org/jira/browse/LUCENE-6066
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/query/scoring
> Reporter: Mark Harwood
> Priority: Minor
> Fix For: 5.0
> Attachments: LUCENE-PQRemoveV3.patch
>
> It would be useful to be able to remove existing elements from a PriorityQueue. The proposal is that a linear scan is performed to find the element being removed and then the end element in heap[size] is swapped into this position to perform the delete. The method downHeap() is then called to shuffle the replacement element back down the array, but the existing downHeap method must be modified to allow picking up an entry from any point in the array rather than always assuming the first element (which is its only current mode of operation).
> A working JavaScript model of the proposal with animation is available here: http://jsfiddle.net/grcmquf2/22/
> In tests the modified version of downHeap produces the same results as the existing impl but adds the ability to push down from any point.
> An example use case that requires remove is where a client doesn't want more than N matches for any given key (e.g. no more than 5 products from any one retailer in a marketplace). In these circumstances a document that was previously thought of as competitive has to be removed from the final PQ and replaced with another doc (e.g. a retailer who already has 5 matches in the PQ receives a 6th match which is better than his previous ones). This particular process is managed by a special DiversifyingPriorityQueue which wraps the main PriorityQueue and could be contributed as part of another issue if there is interest in that.
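The proposed remove can be illustrated with a small standalone sketch, assuming an int min-heap with the same 1-based heap[1..size] layout Lucene's PriorityQueue uses. This is not Lucene's actual class; it just shows the linear scan, the swap of heap[size] into the vacated slot, and re-heapifying from an arbitrary position. Note the swapped-in element may need to move up as well as down, which is the upHeap call the later V2 patch added.

```java
// Standalone sketch of the proposed remove() on an int min-heap.
// NOT Lucene's PriorityQueue -- illustrative only.
class MinHeapWithRemove {
    private final int[] heap;  // 1-based: elements live in heap[1..size]
    private int size;

    MinHeapWithRemove(int maxSize) { heap = new int[maxSize + 1]; }

    void add(int v) { heap[++size] = v; upHeap(size); }

    int top() { return heap[1]; }

    // Linear scan for the value, swap heap[size] into its slot, then
    // restore the heap property from that slot in BOTH directions.
    boolean remove(int v) {
        for (int i = 1; i <= size; i++) {
            if (heap[i] == v) {
                heap[i] = heap[size--];
                if (i <= size && !upHeap(i)) downHeap(i);
                return true;
            }
        }
        return false;
    }

    // upHeap modified to start from any position; returns true if it moved.
    private boolean upHeap(int i) {
        int node = heap[i];
        int start = i;
        while (i > 1 && node < heap[i >>> 1]) {
            heap[i] = heap[i >>> 1];
            i >>>= 1;
        }
        heap[i] = node;
        return i != start;
    }

    // downHeap modified to start from any position, not just the root.
    private void downHeap(int i) {
        int node = heap[i];
        int j = i << 1;
        while (j <= size) {
            if (j + 1 <= size && heap[j + 1] < heap[j]) j++;
            if (node <= heap[j]) break;
            heap[i] = heap[j];
            i = j;
            j = i << 1;
        }
        heap[i] = node;
    }

    public static void main(String[] args) {
        MinHeapWithRemove pq = new MinHeapWithRemove(8);
        pq.add(5); pq.add(3); pq.add(7); pq.add(1);
        pq.remove(3);                   // interior removal
        System.out.println(pq.top());   // prints 1, the smallest remaining
    }
}
```

The `i <= size` guard handles the corner case of removing the element that happens to occupy the last slot, where no re-heapify is needed.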
[jira] [Updated] (LUCENE-6066) New remove method in PriorityQueue
[ https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-6066:
-
Attachment: LUCENE-PQRemoveV3.patch

Updated patch. Added DiversifiedTopDocsCollector and associated test. This class represents the primary use case for wanting to add a new remove() method to PriorityQueue. The PriorityQueue keeps the original upHeap/downHeap methods unchanged (in case of any performance change) and gains new specialised upHeap/downHeap variants that take a position, to support the new remove function.
[jira] [Updated] (LUCENE-6066) New remove method in PriorityQueue
[ https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-6066:
-
Attachment: (was: LUCENE-PQRemoveV2.patch)
[jira] [Updated] (LUCENE-6066) New remove method in PriorityQueue
[ https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-6066:
-
Attachment: (was: LUCENE-PQRemoveV1.patch)
[jira] [Updated] (LUCENE-6066) New remove method in PriorityQueue
[ https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-6066:
-
Attachment: LUCENE-PQRemoveV2.patch

Added the missing upHeap call to the remove method. Added extra randomized tests and a method to check the validity of PQ elements as mutations are made.
[jira] [Commented] (LUCENE-6066) New remove method in PriorityQueue
[ https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223307#comment-14223307 ]

Mark Harwood commented on LUCENE-6066:
--

Thanks for your comments, Stefan. I believe the remove method is implemented correctly now.

bq. it still seems that specialized versions can outperform generic ones

Yes, the DiversifyingPriorityQueue that I imagined would need access to a new remove method in the existing PriorityQueue looks like it is better implemented as a fork of the existing PriorityQueue. I'll attach this fork here in a future addition. Maybe with these differing implementations there is a need for a common interface that provides an abstraction for things like TopDocsCollector to add and pop results.
[jira] [Created] (LUCENE-6066) New remove method in PriorityQueue
Mark Harwood created LUCENE-6066:

Summary: New remove method in PriorityQueue
Key: LUCENE-6066
URL: https://issues.apache.org/jira/browse/LUCENE-6066
Project: Lucene - Core
Issue Type: Improvement
Components: core/query/scoring
Reporter: Mark Harwood
Priority: Minor
Fix For: 5.0
[jira] [Updated] (LUCENE-6066) New remove method in PriorityQueue
[ https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-6066:
-
Attachment: LUCENE-PQRemoveV1.patch

New remove(element) method in PriorityQueue and related test.
[jira] [Commented] (LUCENE-6066) New remove method in PriorityQueue
[ https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14219651#comment-14219651 ]

Mark Harwood commented on LUCENE-6066:
--

If the PQ set the current array position as a property of each element every time it moved them around, I could pass the array index to remove() rather than an object that has to be scanned for.
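The position-tracking idea above can be sketched as follows, assuming hypothetical names (this is illustrative code, not any patch attached here): the heap writes each element's current slot into the element on every move, so remove() jumps straight to the slot in O(log n) instead of scanning in O(n).

```java
// Hypothetical sketch: elements carry their own queue position, kept
// up to date by the heap, so removal needs no linear scan.
class TrackedHeap {
    static final class Entry {
        final int value;
        int pqPos;                    // maintained by the heap on every move
        Entry(int v) { value = v; }
    }

    private final Entry[] heap = new Entry[64];  // 1-based: heap[1..size]
    private int size;

    // Every placement records the slot back into the element.
    private void place(Entry e, int i) { heap[i] = e; e.pqPos = i; }

    void add(Entry e) { place(e, ++size); upHeap(size); }

    Entry top() { return heap[1]; }

    void remove(Entry e) {
        int i = e.pqPos;              // O(1) lookup, no scan
        Entry last = heap[size];
        heap[size--] = null;
        if (i <= size) {              // unless we removed the tail slot itself
            place(last, i);
            upHeap(i);
            downHeap(last.pqPos);     // at most one of these actually moves it
        }
    }

    private void upHeap(int i) {
        Entry e = heap[i];
        while (i > 1 && e.value < heap[i >>> 1].value) {
            place(heap[i >>> 1], i);
            i >>>= 1;
        }
        place(e, i);
    }

    private void downHeap(int i) {
        Entry e = heap[i];
        int j = i << 1;
        while (j <= size) {
            if (j + 1 <= size && heap[j + 1].value < heap[j].value) j++;
            if (e.value <= heap[j].value) break;
            place(heap[j], i);
            i = j;
            j = i << 1;
        }
        place(e, i);
    }
}
```

The trade-off, as discussed later in the thread, is that every swap now costs an extra field write, which is why this belongs in a dedicated (forked) PQ rather than the generic one.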
[jira] [Commented] (LUCENE-6066) New remove method in PriorityQueue
[ https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14219822#comment-14219822 ]

Mark Harwood commented on LUCENE-6066:
--

I guess it's different from grouping in that:
1) It only involves one pass over the data.
2) The client doesn't have to guess up-front the number of groups he is going to need.
3) We don't get any filler docs in each group's results, i.e. a bunch of irrelevant docs for an author with one good hit.
[jira] [Commented] (LUCENE-6066) New remove method in PriorityQueue
[ https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14219901#comment-14219901 ]

Mark Harwood commented on LUCENE-6066:
--

An analogy might be making a compilation album of 1967's top hit records:
1) A vanilla Lucene query's results might look like a Best of the Beatles album - no diversity.
2) A grouping query would produce The 10 top-selling artists of 1967 - some killer and quite a lot of filler.
3) A diversified query would be the top 20 hit records of that year - with a max of 3 Beatles hits to maintain diversity.
[jira] [Commented] (LUCENE-6066) New remove method in PriorityQueue
[ https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14220089#comment-14220089 ]

Mark Harwood commented on LUCENE-6066:
--

bq. But how will you track the min element for each key in the PQ (to know which element to remove, when a more competitive hit with that key arrives)?

I was thinking of this as a foundation: (pseudo code)

{code:title=DiversifyingPriorityQueue.java|borderStyle=solid}
abstract class KeyedElement {
    int pqPos;
    abstract Object getKey();
}

class DiversifyingPriorityQueue<T extends KeyedElement> extends PriorityQueue<T> {
    FastRemovablePriorityQueue<T> mainPQ;
    Map<Object, PriorityQueue> perKeyQueues;
}
{code}

You can probably guess at the logic but it is based around:
* making sure each key has a max of n entries using an entry in perKeyQueues
* evictions from the mainPQ will require removal from the related perKeyQueue
* emptied perKeyQueues can be recycled for use with other keys
* evictions from the perKeyQueue will require removal from the mainPQ

bq. This seems promising, maybe as a separate dedicated (forked) PQ impl?

Yes, introducing a linear-cost remove by marking elements with a position is an added cost that not all PQs will require, so forking seems necessary. In this case a common abstraction for these different PQs would be useful in the places where results are consumed, e.g. TopDocsCollector.
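The two-tier eviction logic discussed in this thread (a global top-N queue plus a per-key queue capping each key) can be sketched end to end, assuming hypothetical class and method names that are not the eventual Lucene API. It leans on the same scan-based `remove` this issue proposes, here stood in for by `java.util.PriorityQueue.remove(Object)`:

```java
import java.util.*;

// Hypothetical sketch of diversity-limited top-N collection: keep a
// global top-N while allowing at most maxPerKey hits per key.
// Illustrative names only, not the patch's actual classes.
class DiversifiedTopN {
    static final class Hit {
        final String key; final double score;
        Hit(String key, double score) { this.key = key; this.score = score; }
        @Override public String toString() { return key + ":" + score; }
    }

    static List<Hit> collect(Iterable<Hit> hits, int topN, int maxPerKey) {
        Comparator<Hit> byScore = Comparator.comparingDouble(h -> h.score);
        PriorityQueue<Hit> global = new PriorityQueue<>(byScore);   // weakest global hit at head
        Map<String, PriorityQueue<Hit>> perKey = new HashMap<>();   // weakest per-key hit at head

        for (Hit h : hits) {
            PriorityQueue<Hit> kq =
                perKey.computeIfAbsent(h.key, k -> new PriorityQueue<>(byScore));
            if (kq.size() == maxPerKey) {
                if (h.score <= kq.peek().score) continue;  // not competitive within its key
                // Key is full: evict the key's weakest hit from BOTH queues.
                // global.remove() is the scan-based removal this issue proposes.
                global.remove(kq.poll());
            }
            kq.add(h);
            global.add(h);
            if (global.size() > topN) {
                Hit evicted = global.poll();               // a global eviction must also
                perKey.get(evicted.key).remove(evicted);   // leave the per-key queue
            }
        }
        List<Hit> out = new ArrayList<>(global);
        out.sort(byScore.reversed());
        return out;
    }

    public static void main(String[] args) {
        List<Hit> hits = List.of(
            new Hit("Beatles", 9), new Hit("Beatles", 8), new Hit("Beatles", 7),
            new Hit("Beatles", 6), new Hit("Kinks", 5), new Hit("Who", 4));
        // Top 4 with at most 2 Beatles hits: lower-scoring Kinks and Who
        // hits displace the third- and fourth-best Beatles hits.
        System.out.println(collect(hits, 4, 2));
    }
}
```

The invariant both bullet lists above describe is visible here: every eviction from one queue is mirrored in the other, so the two structures always hold exactly the same set of hits.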
[jira] [Updated] (LUCENE-725) NovelAnalyzer - wraps your choice of Lucene Analyzer and filters out all boilerplate text
[ https://issues.apache.org/jira/browse/LUCENE-725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-725:
-
Attachment: NovelAnalyzer.java

Updated to work with Lucene 4 APIs.

> NovelAnalyzer - wraps your choice of Lucene Analyzer and filters out all boilerplate text
> ---
> Key: LUCENE-725
> URL: https://issues.apache.org/jira/browse/LUCENE-725
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/analysis
> Reporter: Mark Harwood
> Assignee: Otis Gospodnetic
> Priority: Minor
> Attachments: NovelAnalyzer.java, NovelAnalyzer.java, NovelAnalyzer.java, NovelAnalyzer.java
>
> This is a class I have found to be useful for analyzing small (in the hundreds) collections of documents and removing any duplicate content such as standard disclaimers or repeated text in an exchange of emails. This has applications in sampling query results to identify key phrases, improving speed-reading of results with similar content (e.g. email threads/forum messages) or just removing duplicated noise from a search index. To be more generally useful it needs to scale to millions of documents - in which case an alternative implementation is required. See the notes in the Javadocs for this class for more discussion on this.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4866) Lucene corruption
[ https://issues.apache.org/jira/browse/LUCENE-4866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13608826#comment-13608826 ] Mark Harwood commented on LUCENE-4866: -- The fact that the missing file looks to be held on a shared drive might also be significant if there is more than one Lucene process configured to access the same directory ... Lucene corruption - Key: LUCENE-4866 URL: https://issues.apache.org/jira/browse/LUCENE-4866 Project: Lucene - Core Issue Type: Bug Components: core/index Affects Versions: 3.5 Environment: Amazon Tomcat cluster with NTFS. Reporter: sachin Priority: Blocker Hi all, We know that the Lucene index gets corrupted. In our case it is corrupting again and again, and because of this production is inconsistent. The following errors are observed. Any help will be appreciated.
{code}
org.hibernate.search.SearchException: Unable to reopen IndexReader
 at org.hibernate.search.indexes.impl.SharingBufferReaderProvider$PerDirectoryLatestReader.refreshAndGet(SharingBufferReaderProvider.java:230)
 at org.hibernate.search.indexes.impl.SharingBufferReaderProvider.openIndexReader(SharingBufferReaderProvider.java:73)
 at org.hibernate.search.reader.impl.MultiReaderFactory.openReader(MultiReaderFactory.java:49)
 at org.hibernate.search.query.engine.impl.HSQueryImpl.buildSearcher(HSQueryImpl.java:596)
 at org.hibernate.search.query.engine.impl.HSQueryImpl.buildSearcher(HSQueryImpl.java:495)
 at org.hibernate.search.query.engine.impl.HSQueryImpl.queryEntityInfos(HSQueryImpl.java:239)
 at org.hibernate.search.query.hibernate.impl.FullTextQueryImpl.list(FullTextQueryImpl.java:209)
 at com.lifetech.ngs.dataaccess.spring.util.SearchUtil.returnProjectionData(SearchUtil.java:646)
 at com.lifetech.ngs.dataaccess.spring.util.SearchUtil.getSinglePropertyOnlyUsingSearch(SearchUtil.java:556)
 at com.lifetech.ngs.dataaccess.spring.util.SearchUtil$$FastClassByCGLIB$$568d5972.invoke(generated)
 at net.sf.cglib.proxy.MethodProxy.invoke(MethodProxy.java:191)
 at org.springframework.aop.framework.Cglib2AopProxy$CglibMethodInvocation.invokeJoinpoint(Cglib2AopProxy.java:689)
 at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:150)
 at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:110)
 at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
 at org.springframework.aop.framework.Cglib2AopProxy$DynamicAdvisedInterceptor.intercept(Cglib2AopProxy.java:622)
 at com.lifetech.ngs.dataaccess.spring.util.SearchUtil$$EnhancerByCGLIB$$47fb00d0.getSinglePropertyOnlyUsingSearch(generated)
 at com.lifetech.ngs.server.impl.SampleManagerImpl.getNameSearchResult(SampleManagerImpl.java:2436)
 at com.lifetech.ngs.server.impl.SampleManagerImpl$$FastClassByCGLIB$$17af181d.invoke(generated)
 at net.sf.cglib.proxy.MethodProxy.invoke(MethodProxy.java:191)
 at org.springframework.aop.framework.Cglib2AopProxy$CglibMethodInvocation.invokeJoinpoint(Cglib2AopProxy.java:689)
 at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:150)
 at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:110)
 at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
 at org.springframework.aop.framework.Cglib2AopProxy$DynamicAdvisedInterceptor.intercept(Cglib2AopProxy.java:622)
 at com.lifetech.ngs.server.impl.SampleManagerImpl$$EnhancerByCGLIB$$75b745f9.getNameSearchResult(generated)
 at com.lifetech.ngs.webui.mgc.widgets.sample.SearchSamplesView.populateData(SearchSamplesView.java:635)
 at com.lifetech.ngs.webui.customcomponents.IRAutoComplete.changeVariables(IRAutoComplete.java:39)
 at com.vaadin.terminal.gwt.server.AbstractCommunicationManager.changeVariables(AbstractCommunicationManager.java:1445)
 at com.vaadin.terminal.gwt.server.AbstractCommunicationManager.handleVariableBurst(AbstractCommunicationManager.java:1393)
 at com.lifetech.ngs.webui.main.SpringVaadinServlet$1.handleVariableBurst(SpringVaadinServlet.java:57)
 at com.vaadin.terminal.gwt.server.AbstractCommunicationManager.handleVariables(AbstractCommunicationManager.java:1312)
 at com.vaadin.terminal.gwt.server.AbstractCommunicationManager.doHandleUidlRequest(AbstractCommunicationManager.java:763)
{code}
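The shared-drive observation points at a classic failure mode: more than one writer on the same directory. A minimal stdlib sketch of guarding a shared index directory with an OS-level file lock, similar in spirit to Lucene's write.lock (this is an illustration, not Lucene's actual LockFactory API; the class and file names are invented):

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class IndexLockDemo {
    // Try to become the single writer for a shared index directory.
    // Returns null if another process already holds the lock file.
    static FileLock acquireWriteLock(Path indexDir) throws IOException {
        Files.createDirectories(indexDir);
        FileChannel channel = FileChannel.open(indexDir.resolve("write.lock"),
                StandardOpenOption.CREATE, StandardOpenOption.WRITE);
        FileLock lock = channel.tryLock(); // non-blocking: null if held by another process
        if (lock == null) {
            channel.close(); // give up; another process owns the index
        }
        return lock;
    }
}
```

A process that cannot obtain the lock should refuse to open an IndexWriter on that directory rather than risk the kind of corruption reported here.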
[jira] [Commented] (LUCENE-4768) Child Traversable To Parent Block Join Query
[ https://issues.apache.org/jira/browse/LUCENE-4768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13575740#comment-13575740 ] Mark Harwood commented on LUCENE-4768: -- As with any discussion about nested queries you need to be very clear about the required logic. When you talk about matching f1:A or f1:B - are we talking about matches on the same child doc or possibly matches on different child docs of the same parent? The examples don't make this clear. If we assume your child-based criteria are focused on examining the contents of single children (as opposed to combining f1:A on one child doc with f1:B on a different child doc) then a BooleanQuery that combines these child query elements will already be sufficient for skipping through children. Not really sure what you are trying to optimize anyway with skipping - parent-child combos are limited to what fits into a single segment, which is in turn limited by RAM. You don't generally get parents with many, many children because of these constraints. The nextDoc calls you are trying to skip are related to a compressed block of child doc IDs (gap-encoded varints) that are read off disk in 1K chunks (if I recall default Directory settings correctly). The chances are high that the limited number of child docIDs that belong to each parent are already in RAM as part of normal disk access patterns, so there is no real saving in disk IO. Are you sure this is a performance bottleneck? Child Traversable To Parent Block Join Query Key: LUCENE-4768 URL: https://issues.apache.org/jira/browse/LUCENE-4768 Project: Lucene - Core Issue Type: Improvement Components: core/query/scoring Environment: trunk git rev-parse HEAD 5cc88eaa41eb66236a0d4203cc81f1eed97c9a41 Reporter: Vadim Kirilchuk Attachments: LUCENE-4768-draft.patch Hi everyone! Let me describe what I am trying to do: I have hierarchical documents ('car model' as parent, 'trim' as child) and use block join queries to retrieve them.
However, I am not happy with the current behavior of ToParentBlockJoinQuery, which goes through all of a parent's children during the nextDoc call (accumulating scores and freqs). Consider the following example: you have a query with a custom post condition on top of such a BJQ, and during the post condition you traverse the scorers tree (doc-at-a-time) and want to manually push the child scorers of the BJQ one by one until the condition passes or the current parent has no more children. I am attaching a patch with a query (and some tests) similar to ToParentBlockJoin but with the ability to traverse children. (I have to do a weird instanceof check and cast inside my code.) This is a draft only and I will be glad to hear if someone needs it, or to hear how we can improve it. P.S. I believe that the proposed query is more generic (lower level) than ToParentBJQ, and ToParentBJQ could be extended from it, calling nextChild() internally during nextDoc(). Also, I think that the problem of traversing hierarchical documents is more complex, as Lucene has only the nextDoc API. What do you think about making the API more hierarchy-aware? A one-level document is a special case of a multi-level document, but not vice versa. WDYT? Thanks in advance. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
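For readers unfamiliar with the block layout being discussed: children are indexed contiguously and immediately before their parent, so joining child matches up to parents amounts to a bitset walk. A self-contained sketch of that core step (illustrative only; not the actual ToParentBlockJoinQuery code, and the class name is invented):

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class BlockJoinSketch {
    // parentBits marks which docIDs are parent docs; a parent's children
    // occupy the docIDs immediately before it in the same block.
    // Given sorted matching child docIDs, return the matching parent docIDs.
    static List<Integer> joinToParents(BitSet parentBits, int[] childMatches) {
        List<Integer> parents = new ArrayList<>();
        int lastParent = -1;
        for (int child : childMatches) {
            int parent = parentBits.nextSetBit(child); // first parent at or after this child
            if (parent != lastParent) {                // report each parent once
                parents.add(parent);
                lastParent = parent;
            }
        }
        return parents;
    }
}
```

For example, with children 0-2 belonging to parent 3 and children 4-5 to parent 6, child matches {0, 2, 5} resolve to parents [3, 6].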
[jira] [Commented] (LUCENE-4768) Child Traversable To Parent Block Join Query
[ https://issues.apache.org/jira/browse/LUCENE-4768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13575825#comment-13575825 ] Mark Harwood commented on LUCENE-4768: -- Still not sure what problem you are trying to solve. bq. i need to know field and text for each matched leaf scorer Why? For scoring purposes? ToParentBJQ has a configurable ScoreMode to control if you want the max, avg or sum of the child matches rolled into the combined parent score. Is that insufficient control for your needs?
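The ScoreMode referred to above rolls child scores into a single parent score. A toy sketch of the three aggregations (the names mirror the idea; this is not the exact Lucene enum or signature):

```java
public class ScoreModeSketch {
    enum ScoreMode { MAX, AVG, TOTAL }

    // Combine the scores of a parent's matching children into one parent score.
    static float combine(ScoreMode mode, float[] childScores) {
        float sum = 0f, max = Float.NEGATIVE_INFINITY;
        for (float s : childScores) {
            sum += s;
            max = Math.max(max, s);
        }
        switch (mode) {
            case MAX: return max;            // best single child wins
            case AVG: return sum / childScores.length;
            default:  return sum;            // TOTAL: children add up
        }
    }
}
```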
[jira] [Commented] (LUCENE-4768) Child Traversable To Parent Block Join Query
[ https://issues.apache.org/jira/browse/LUCENE-4768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13575864#comment-13575864 ] Mark Harwood commented on LUCENE-4768: -- OK - this problem seems to be about an ill-defined user query (Saturn sky blue Sedan with no explicit fields) being executed against a well-defined schema (cars with manufacturers, model names and bodyStyles that also have trims with colours). If that's the case you have a heap of problems here which aren't necessarily related to the block join implementation. One example - IDF ranking being what it is, if a manufacturer like Ford creates a model called the Blue, or you have bad data entry that stores this value in the wrong field, then Lucene will naturally rank model:blue higher than color:blue because of the scarcity of the token blue in that field context. That's almost the inverse of what you want. A couple of suggestions for field-less queries like your example of Saturn sky blue sedan:
1) Target the query on an unstructured onebox field that holds indexed content from all fields, to achieve a more balanced IDF score.
2) Tokenize each item in the query string and find a most likely field for each search term by examining doc frequencies e.g. color:blue vs modelName:blue etc. Augment the onebox query in 1) with the most-likely-field interpretation for each word in the query string if it has sufficient doc frequency.
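Suggestion 2) above - picking a most likely field per query token from document frequencies - can be sketched with plain maps (the field names, terms, and threshold here are made up for illustration; a real implementation would read doc frequencies from the index):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FieldGuesser {
    // docFreqs: field -> (term -> document frequency in that field).
    // For each query token, pick the field where it occurs most often,
    // subject to a minimum doc-frequency threshold; tokens below the
    // threshold in every field get no field assignment.
    static Map<String, String> likelyFields(Map<String, Map<String, Integer>> docFreqs,
                                            String[] tokens, int minDf) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String token : tokens) {
            String bestField = null;
            int bestDf = minDf - 1;
            for (Map.Entry<String, Map<String, Integer>> field : docFreqs.entrySet()) {
                int df = field.getValue().getOrDefault(token, 0);
                if (df > bestDf) {
                    bestDf = df;
                    bestField = field.getKey();
                }
            }
            if (bestField != null) {
                result.put(token, bestField);
            }
        }
        return result;
    }
}
```

With toy frequencies where "blue" is common in a color field and rare in modelName, "blue" maps to color:blue - counteracting the IDF inversion described in the comment.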
[jira] [Commented] (SOLR-3950) Attempting postings=BloomFilter results in UnsupportedOperationException
[ https://issues.apache.org/jira/browse/SOLR-3950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13476854#comment-13476854 ] Mark Harwood commented on SOLR-3950: BloomFilterPostingsFormat is designed to wrap another choice of PostingsFormat and adds .blm files to the other files created by the choice of delegate. However, your code has instantiated a BloomFilterPostingsFormat without passing a choice of delegate - presumably using the zero-arg constructor. The comments in the code for this zero-arg constructor state: // Used only by core Lucene at read-time via Service Provider instantiation - // do not use at Write-time in application code. Attempting postings=BloomFilter results in UnsupportedOperationException -- Key: SOLR-3950 URL: https://issues.apache.org/jira/browse/SOLR-3950 Project: Solr Issue Type: Bug Affects Versions: 4.1 Environment: Linux bigindy5 2.6.32-279.9.1.el6.centos.plus.x86_64 #1 SMP Wed Sep 26 03:52:55 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux [root@bigindy5 ~]# java -version java version 1.7.0_07 Java(TM) SE Runtime Environment (build 1.7.0_07-b10) Java HotSpot(TM) 64-Bit Server VM (build 23.3-b01, mixed mode) Reporter: Shawn Heisey Fix For: 4.1 Tested on branch_4x, checked out after BlockPostingsFormat was made the default by LUCENE-4446. I used 'ant generate-maven-artifacts' to create the lucene-codecs jar, and copied it into my sharedLib directory. When I subsequently tried postings=BloomFilter I got the following exception in the log: {code} Oct 15, 2012 11:14:02 AM org.apache.solr.common.SolrException log SEVERE: java.lang.UnsupportedOperationException: Error - org.apache.lucene.codecs.bloom.BloomFilteringPostingsFormat has been constructed without a choice of PostingsFormat {code}
[jira] [Commented] (SOLR-3950) Attempting postings=BloomFilter results in UnsupportedOperationException
[ https://issues.apache.org/jira/browse/SOLR-3950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477036#comment-13477036 ] Mark Harwood commented on SOLR-3950: bq. If there is some schema config that will tell Solr to do the right thing, please let me know. Right now BloomPF is like an abstract class - you need to fill in the blanks as to what delegate it will use before you can use it at write-time. I think we have 3 options:
1) Solr (or you) provide a new PF impl that weds BloomPF with a choice of PF e.g. Lucene40PF, so you would have a zero-arg-constructor class named something like BloomLucene40PF, or...
2) Solr extends the config file format to provide a generic means of assembling wrapper PFs like Bloom in their config e.g: postingsFormat=BloomFilter delegatePostingsFormat=FooPF, and Solr then does reflection magic to call constructors appropriately, or...
3) Core Lucene is changed so that BloomPF is wedded to a default PF (e.g. Lucene40PF) if users e.g. Solr fail to nominate a choice of delegate for BloomPF.
Of these, 1) feels like the right thing. Cheers Mark
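Option 1) is essentially the wrapper pattern with a concrete zero-arg subclass that bakes in a delegate. A toy sketch of why the bare zero-arg wrapper must fail at write time (the interface and class names here are invented for illustration; they are not the Solr or Lucene APIs):

```java
// A stand-in for a postings format that can write index files.
interface PostingsFormatLike {
    String write();
}

class BloomWrapper implements PostingsFormatLike {
    private final PostingsFormatLike delegate; // null => read-time only

    BloomWrapper() { this(null); }              // SPI/read-time constructor
    BloomWrapper(PostingsFormatLike delegate) { this.delegate = delegate; }

    public String write() {
        if (delegate == null) {
            // Mirrors the error in the report: no delegate was nominated.
            throw new UnsupportedOperationException(
                    "constructed without a choice of delegate PostingsFormat");
        }
        return "bloom(" + delegate.write() + ")";
    }
}

// Option 1): a zero-arg class that weds the wrapper to a fixed delegate,
// so config only needs to name this class.
class BloomDefaultWrapper extends BloomWrapper {
    BloomDefaultWrapper() { super(() -> "default"); }
}
```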
[jira] [Commented] (LUCENE-3772) Highlighter needs the whole text in memory to work
[ https://issues.apache.org/jira/browse/LUCENE-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13476044#comment-13476044 ] Mark Harwood commented on LUCENE-3772: -- For bigger-than-memory docs is it not possible to use nested documents to represent subsections (e.g. a child doc for each of the chapters in a book) and then use BlockJoinQuery to select the best child docs? Highlighting can then be used on a more manageable subset of the original content, and Lucene's ranking algos are used to select the best fragment rather than the highlighter's own attempts to reproduce this logic. Obviously this depends on the shape of your content/queries, but books-and-chapters is probably a good fit for this approach. Highlighter needs the whole text in memory to work -- Key: LUCENE-3772 URL: https://issues.apache.org/jira/browse/LUCENE-3772 Project: Lucene - Core Issue Type: Improvement Components: modules/highlighter Affects Versions: 3.5 Environment: Windows 7 Enterprise x64, JRE 1.6.0_25 Reporter: Luis Filipe Nassif Labels: highlighter, improvement, memory Highlighter methods getBestFragment(s) and getBestTextFragments only accept a String object representing the whole text to highlight. When dealing with very large docs simultaneously, it can lead to heap consumption problems. It would be better if the API could additionally accept a Reader object, like Lucene Document Fields do.
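The chapter-per-child-doc idea requires splitting the large source text before indexing. A trivial fixed-size splitter sketch (real chapter boundaries would come from the document's own structure; the class name and size limit are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class ChapterSplitter {
    // Split a large document into bounded sections so each can be indexed
    // as its own child doc and highlighted independently, keeping only one
    // section's worth of text in memory at highlight time.
    static List<String> split(String text, int maxChars) {
        List<String> sections = new ArrayList<>();
        for (int start = 0; start < text.length(); start += maxChars) {
            sections.add(text.substring(start, Math.min(text.length(), start + maxChars)));
        }
        return sections;
    }
}
```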
[jira] [Commented] (LUCENE-4369) StringFields name is unintuitive and not helpful
[ https://issues.apache.org/jira/browse/LUCENE-4369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13452900#comment-13452900 ] Mark Harwood commented on LUCENE-4369: -- SingleTermField ? Not sure matching vs searching is a commonly understood differentiation. StringFields name is unintuitive and not helpful Key: LUCENE-4369 URL: https://issues.apache.org/jira/browse/LUCENE-4369 Project: Lucene - Core Issue Type: Bug Reporter: Robert Muir Attachments: LUCENE-4369.patch There's a huge difference between TextField and StringField, StringField screws up scoring and bypasses your Analyzer. (see java-user thread Custom Analyzer Not Called When Indexing as an example.) The name we use here is vital, otherwise people will get bad results. I think we should rename StringField to MatchOnlyField.
[jira] [Commented] (LUCENE-4369) StringFields name is unintuitive and not helpful
[ https://issues.apache.org/jira/browse/LUCENE-4369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13452914#comment-13452914 ] Mark Harwood commented on LUCENE-4369: -- Agreed on the need for a change - names are important. I have a problem with using match on its own because the word is often associated with partial matching e.g. best match or fuzzy match. A quick google suggests match has more connotations with fuzziness than exactness - there are 162m results for best match vs only 45m results for exact match. So how about ExactMatchField?
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13433045#comment-13433045 ] Mark Harwood commented on LUCENE-4069: -- bq. Removing misleading 2X perf gain: it seems to depend heavily on the exact use case. Fair enough - the original patch targeted Lucene 3.6 which benefited more heavily from this technique. The issue then morphed into a 4.x patch where performance gains were harder to find. I think the sweet spot is in primary key searches on indexes with ongoing heavy changes (more segment fragmentation, less OS-level caching?). This is the use case I am targeting currently, and my final tests using our primary-key-counting test rig saw a 10 to 15% improvement over Pulsing. bq. I'm asking because I need this feature but I'm stuck with 3.x for a while. I have a client in a similar situation who is contemplating using the 3.6 patch. bq. Are there bugs which should be fixed in the initial 3.6 patch? It has been a while since I looked at it - a quick run of ant test on my copy here showed no errors. I will be giving it a closer review if my client decides to go down this route and can post any fixes here. I expect if you use the patch and get into trouble you can use an un-patched version of 3.6 to read the same index files (it should just ignore the extra blm files created by the patched version).
Segment-level Bloom filters --- Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Core Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0-ALPHA Reporter: Mark Harwood Assignee: Mark Harwood Priority: Minor Fix For: 4.0-BETA, 5.0 Attachments: 4069Failure.zip, BloomFilterPostingsBranch4x.patch, LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat
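The fast-fail idea behind this issue is easy to see in a toy Bloom filter: a per-segment bitset that can say "definitely not here" without touching disk. This stdlib sketch is illustrative only - it is not the .blm implementation, and the hash scheme and sizing are made up:

```java
import java.util.BitSet;

public class SegmentBloom {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    SegmentBloom(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    // One cheap seeded hash per probe; real implementations use better mixing.
    private int hash(String term, int seed) {
        int h = seed * 31 + term.hashCode();
        h ^= (h >>> 16);
        return Math.floorMod(h, size);
    }

    // Called at write time for every term in the filtered field.
    void add(String term) {
        for (int i = 0; i < hashes; i++) bits.set(hash(term, i));
    }

    // false => term is definitely absent: skip the term dictionary lookup
    // for this segment. true => possibly present (false positives allowed).
    boolean mayContain(String term) {
        for (int i = 0; i < hashes; i++) {
            if (!bits.get(hash(term, i))) return false;
        }
        return true;
    }
}
```

With many segments and a low-frequency field such as a primary key, most segments answer "definitely absent" from RAM, which is where the reported speed-up on rare-term searches comes from.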
[jira] [Resolved] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood resolved LUCENE-4069. -- Resolution: Fixed Assignee: Mark Harwood Committed to 4.0 branch, revision 1368442
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13427322#comment-13427322 ] Mark Harwood commented on LUCENE-4069: -- Will do.
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-4069: - Fix Version/s: 5.0 Applied to trunk in revision 1368567
[jira] [Commented] (LUCENE-4275) Threaded tests with MockDirectoryWrapper delete active PostingFormat files
[ https://issues.apache.org/jira/browse/LUCENE-4275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13426481#comment-13426481 ] Mark Harwood commented on LUCENE-4275: -- Nailed it, Mike. Yet another beer I owe you. I removed the IllegalStateException and it looks like the retry logic is now kicking in and all tests pass This reliance on throwing a particular exception type feels like an important contract to document. Currently the comments in PostingsFormat.fieldsProducer() read as follows: bq. Reads a segment. NOTE: by the time this call returns, it must hold open any files it will need to use; else, those files may be deleted. I propose adding: bq. Additionally, required files may be deleted during the execution of this call before there is a chance to open them. Under these circumstances an IOException should be thrown by the implementation. IOExceptions are expected and will automatically cause a retry of the segment opening logic with the newly revised segments I'll roll that documentation addition into my Lucene-4069 patch Threaded tests with MockDirectoryWrapper delete active PostingFormat files -- Key: LUCENE-4275 URL: https://issues.apache.org/jira/browse/LUCENE-4275 Project: Lucene - Core Issue Type: Bug Components: core/codecs, general/test Affects Versions: 4.0-ALPHA Environment: Win XP 64bit Sun JDK 1.6 Reporter: Mark Harwood Fix For: 4.0 Attachments: Lucene-4275-TestClass.patch As part of testing Lucene-4069 I have encountered sporadic issues with files going missing. I believe this is a bug in the test framework (multi-threading issues in MockDirectoryWrapper?) so have raised a separate issue with simplified test PostingFormat class here. 
Using this test PF will fail due to a missing file in roughly one in four runs of this test: ant test-core -Dtestcase=TestIndexWriterCommit -Dtests.method=testCommitThreadSafety -Dtests.seed=EA320250471B75AE -Dtests.slow=true -Dtests.postingsformat=TestNonCoreDummyPostingsFormat -Dtests.locale=no -Dtests.timezone=Europe/Belfast -Dtests.file.encoding=UTF-8
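The documentation contract proposed above - required files may vanish before the producer can open them, so implementations should throw IOException and let the caller retry against the revised segment list - can be sketched in plain Java. All names below are hypothetical illustrations, not Lucene's actual API:

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class RetryContractSketch {
    // Stand-in for a directory whose files may be deleted concurrently
    // (e.g. by a merge committing while we open the segment).
    static final Set<String> files = new HashSet<>();

    // Analogous to fieldsProducer(): every required file must be opened
    // before this returns. If a file has already been deleted, throw a
    // checked IOException rather than an unchecked exception, so the
    // caller's retry logic can recognise the situation.
    static String openProducer(String requiredFile) throws IOException {
        if (!files.contains(requiredFile)) {
            throw new IOException("Missing file: " + requiredFile);
        }
        return "producer(" + requiredFile + ")";
    }

    // Caller-side retry: on IOException, re-read the (revised) segment
    // state and try again instead of failing the whole open.
    static String openWithRetry(String file, int maxRetries) {
        IOException last = null;
        for (int i = 0; i < maxRetries; i++) {
            try {
                return openProducer(file);
            } catch (IOException e) {
                last = e;            // segments were revised under us;
                files.add(file);     // simulate the new commit appearing
            }
        }
        throw new RuntimeException("Retries exhausted", last);
    }
}
```

The key point matching the comment above: an IllegalStateException would escape the retry loop, while an IOException is treated as an expected, recoverable condition.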
[jira] [Closed] (LUCENE-4275) Threaded tests with MockDirectoryWrapper delete active PostingFormat files
[ https://issues.apache.org/jira/browse/LUCENE-4275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood closed LUCENE-4275. Resolution: Not A Problem
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-4069: - Attachment: (was: BloomFilterPostingsBranch4x.patch) Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Core Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0-ALPHA Reporter: Mark Harwood Priority: Minor Fix For: 4.0 Attachments: 4069Failure.zip, LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail behaviour to term searches, helping avoid wasted disk access. Best suited for low-frequency fields, e.g. primary keys on big indexes with many segments, but it also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on the 3.6 codebase attached. There are no 3.6 API changes currently - to play, just add a field with _blm on the end of the name to invoke the special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to the APIs to configure the service properly. Also attached: a patch for the Lucene 4.0 codebase introducing a new PostingsFormat.
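The fast-fail idea described in the issue summary can be illustrated with a minimal, self-contained Bloom filter sketch (not the patch's actual code; the class and method names here are made up). A per-segment bitset is consulted before the expensive terms-dictionary lookup, and a negative answer lets the search skip the segment's disk access entirely:

```java
import java.util.BitSet;

public class BloomSketch {
    private final BitSet bits;
    private final int size;

    public BloomSketch(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    // Two cheap hash positions per term; real implementations (and the
    // patch, which uses MurmurHash2) use stronger hashes, but the
    // principle is the same.
    private int h1(String term) {
        return Math.floorMod(term.hashCode(), size);
    }

    private int h2(String term) {
        return Math.floorMod(term.hashCode() * 31 + term.length(), size);
    }

    // Called at index time for every term in the segment's field.
    public void add(String term) {
        bits.set(h1(term));
        bits.set(h2(term));
    }

    // false => term is definitely NOT in this segment: fast-fail, no disk
    //          access needed.
    // true  => term MAY be present: fall through to the real terms
    //          dictionary (false positives are possible).
    public boolean mightContain(String term) {
        return bits.get(h1(term)) && bits.get(h2(term));
    }
}
```

A Bloom filter never produces false negatives, which is why a `false` answer can safely skip the segment; a `true` answer still requires the normal lookup, which is why the structure suits low-frequency fields such as primary keys, where most segments genuinely lack the term.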
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-4069: - Attachment: BloomFilterPostingsBranch4x.patch Updated with fix to issue explored in Lucene-4275
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-4069: - Attachment: (was: BloomFilterPostingsBranch4x.patch)
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-4069: - Attachment: BloomFilterPostingsBranch4x.patch Updated patch to bring it in line with the latest core API changes. All tests now pass cleanly, so I will commit soon.
[jira] [Created] (LUCENE-4275) Threaded tests with MockDirectoryWrapper delete active PostingFormat files
Mark Harwood created LUCENE-4275: Summary: Threaded tests with MockDirectoryWrapper delete active PostingFormat files Key: LUCENE-4275 URL: https://issues.apache.org/jira/browse/LUCENE-4275 Project: Lucene - Core Issue Type: Bug Components: core/codecs, general/test Affects Versions: 4.0-ALPHA Environment: Win XP 64bit Sun JDK 1.6 Reporter: Mark Harwood Fix For: 4.0
[jira] [Updated] (LUCENE-4275) Threaded tests with MockDirectoryWrapper delete active PostingFormat files
[ https://issues.apache.org/jira/browse/LUCENE-4275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-4275: - Attachment: Lucene-4275-TestClass.patch Attached the simple PostingsFormat used to illustrate cases of files going missing in PF tests.
[jira] [Commented] (LUCENE-4275) Threaded tests with MockDirectoryWrapper delete active PostingFormat files
[ https://issues.apache.org/jira/browse/LUCENE-4275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13425895#comment-13425895 ] Mark Harwood commented on LUCENE-4275: -- Thanks, Rob. This test requires a call to ant clean between each run before it will consistently work. However, I don't consider that a fix and assume we are still looking for a bug, as there's an index consistency issue lurking somewhere. I've tried adding the setting -Dtests.directory=RAMDirectory but the test still looks to have some memory between runs. I added some logging of creates and deletes as you suggest, and it looks like on a second, un-cleaned run my PF is being called to open a high-numbered segment which I suspect was created by an earlier run, as the logging doesn't show signs of the PF being asked to create content for this (or any other) segment as part of the current run. At this point it fails, as there is no longer a copy of the foobar file listed by the directory. I have noticed in the logs from previous runs that MDW is asked by IndexWriter to delete the segment's foobar file as part of compaction into a compound CFS. Hope this sheds some light, as I'm finding this a complex one to debug.
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-4069: - Attachment: 4069Failure.zip Attached a log of thread activity showing how TestIndexWriterCommit.testCommitThreadSafety() is failing. At this stage I can't tell if this is a failure in MockDirectoryWrapper, the test, or the BloomPF class, but it is related to files being removed unexpectedly.
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13418314#comment-13418314 ] Mark Harwood commented on LUCENE-4069: -- One remaining issue before I commit: it has appeared sporadically and looks to be consistently reproduced by this test: ant test -Dtestcase=TestIndexWriterCommit -Dtests.method=testCommitThreadSafety -Dtests.seed=EA320250471B75AE -Dtests.slow=true -Dtests.postingsformat=TestBloomFilteredLucene40Postings -Dtests.locale=no -Dtests.timezone=Europe/Belfast -Dtests.file.encoding=ISO-8859-1 The error it produces is this: [junit4:junit4] Caused by: java.lang.IllegalStateException: Missing file: _9_TestBloomFilteredLucene40Postings_0.blm [junit4:junit4] at org.apache.lucene.codecs.bloom.BloomFilteringPostingsFormat$BloomFilteredFieldsProducer.init(BloomFilteringPostingsFormat.java:175) [junit4:junit4] at org.apache.lucene.codecs.bloom.BloomFilteringPostingsFormat.fieldsProducer(BloomFilteringPostingsFormat.java:156) MockDirectoryWrapper looks to be randomly deleting files (probably my blm file shown above) to simulate the effects of crashes. Presumably I am doing the right thing in always throwing an exception if the .blm file is missing? The alternative would be to silently ignore the missing file, which seems undesirable. If MDW is intended to delete only uncommitted files, I'm not sure how we end up in a scenario where BloomPF is being asked to open the uncommitted segment?
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13418411#comment-13418411 ] Mark Harwood commented on LUCENE-4069: -- bq. I wonder if it has to do w/ only opening the file in the close() method I just tried opening the file earlier (in the BloomFilteredConsumer constructor) and that didn't fix it. I previously also added an extra Directory.fileExists() sanity check immediately after closing the IndexOutput and all was well, so I think it's something happening after that. Will need to dig deeper. I'm running on WinXP 64bit, if that is of any significance to MDW's behaviour.
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13416007#comment-13416007 ] Mark Harwood commented on LUCENE-4069: -- bq. At a minimum I think before committing we should make the SegmentWriteState accessible. OK. Will that be the subject of a new Jira? bq. Hmm why is anonymity at search time important? It seems to be an established design principle - see https://issues.apache.org/jira/browse/LUCENE-4069#comment-13285726 It would be a pain if user config settings required a custom SPI-registered class just to decode the index contents. There's the resource/classpath hell, the chance for misconfiguration, and running Luke suddenly gets more complex. The line to be drawn is between what are just config settings (field names, memory limits) and what are fundamentally different file formats (e.g. codec choices). The design principle that appears to have been adopted is that the former ought to be accommodated without custom SPI-registered classes, while the latter needs to locate an implementation via SPI to decode stored content. Seems reasonable. The choice of hash algo does not fundamentally alter the on-disk format (they all produce an int), so I would suggest we treat this as a config setting rather than a fundamentally different choice of file format.
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13416037#comment-13416037 ] Mark Harwood commented on LUCENE-4069: -- bq. If a special decoder for foobar is needed, it must be loadable by SPI. I think we are in agreement on the broad principles. The fundamental question here, though, is whether you want to treat an index's choice of hash algo as something that requires a new SPI-registered PostingsFormat to decode, or whether it can be handled as I have done here with a general-purpose SPI framework for hashing algos. Actually, re-thinking this, I suspect that rather than creating our own, I can use Java's existing SPI framework for hashing in the form of MessageDigest. I'll take a closer look into that...
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13416084#comment-13416084 ] Mark Harwood commented on LUCENE-4069: -- bq. MessageDigest.getInstance(name) should be the way to go I'm less keen now - a quick scan of the docs around MessageDigest throws up some issues: 1) SPI registration of MessageDigest providers looks to get into permissions hell as it is closely related to security - see http://docs.oracle.com/javase/1.4.2/docs/guide/security/CryptoSpec.html#ProviderInstalling which talks about the steps required to approve a trusted provider. 2) MessageDigest as an interface is designed to stream content in potentially many method calls past the hashing algo. MurmurHash2.java is not currently written to process content this way and suits our needs in hashing small blocks of content in one hit. For these 2 reasons it looks like MessageDigest may be a pain to adopt and the existing approach proposed in this patch may be preferable. Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0-ALPHA Reporter: Mark Harwood Priority: Minor Fix For: 4.0 Attachments: BloomFilterPostingsBranch4x.patch, LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. 
Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
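The streaming-vs-one-shot contrast in the comment above can be illustrated with the stdlib MessageDigest API (MD5 is used here purely because it ships with the JDK; the patch itself hashes with MurmurHash2, which this sketch does not use):

```java
import java.security.MessageDigest;
import java.util.Arrays;

public class DigestStyles {
    public static void main(String[] args) throws Exception {
        byte[] block = "a small block of term bytes".getBytes("UTF-8");

        // One-shot: hash the whole block in a single call - the style the
        // patch's MurmurHash2 implementation is written for.
        byte[] oneShot = MessageDigest.getInstance("MD5").digest(block);

        // Streaming: the MessageDigest contract lets callers push content
        // through many update() calls before asking for the digest.
        MessageDigest md = MessageDigest.getInstance("MD5");
        md.update(block, 0, 5);
        md.update(block, 5, block.length - 5);
        byte[] streamed = md.digest();

        // Both styles produce the same digest; the difference is purely the
        // API shape a hash implementation must support.
        System.out.println(Arrays.equals(oneShot, streamed)); // prints true
    }
}
```

Supporting the streaming shape is what makes MessageDigest awkward to retrofit onto a one-shot hasher like MurmurHash2.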
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-4069: - Attachment: (was: BloomFilterPostingsBranch4x.patch)
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-4069: - Attachment: BloomFilterPostingsBranch4x.patch New patch with use of SegmentWriteState to right-size the choice of bitset for the volume of content.
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13416383#comment-13416383 ] Mark Harwood commented on LUCENE-4069: -- A quick benchmark suggests the new right-sized bitset, as opposed to the old worst-case-scenario-sized bitset, is buying us a small performance improvement. bq. I also don't think this PF should be per-field There was a lengthy discussion earlier on this topic. The approach presented here seems reasonable. For the average user you have the DefaultBloomFilterFactory default, which now has reasonable sizing for all fields passed its way (assuming a heuristic based on numDocs=numKeys to anticipate). For expert users you can provide a BloomFilterFactory with a custom choice of sizing heuristic per field and can also simply return null for non-bloomed fields. Having a single, carefully configured BloomPF wrapper is preferable because you can channel appropriately configured bloom settings to a common PF delegate and avoid creating multiple .tii, .tis files, etc., because the PerFieldPF isn't smart enough to figure out that these Bloom-ing choices do not require different physical files for all the delegated .tii etc. structures. You don't *have* to use the per-field stuff in BloomPF but there are benefits to be had in doing so which can't otherwise be achieved. bq. Can you add @lucene.experimental to all the new APIs? Done.
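The per-field hook described above can be sketched in a few lines. This is a self-contained illustration only: the names BloomFilterFactory, getSetForField, and FuzzySet follow the discussion, but the types here are local stand-ins rather than the actual classes in the patch, and the sizing heuristic is invented for the example:

```java
import java.util.Map;

public class PerFieldBloomSketch {

    // Stand-in for the patch's bitset-backed probabilistic set.
    static class FuzzySet {
        final int numBits;
        FuzzySet(int numBits) { this.numBits = numBits; }
    }

    interface BloomFilterFactory {
        // Returning null opts a field out of bloom filtering entirely.
        FuzzySet getSetForField(String fieldName);
    }

    public static void main(String[] args) {
        // An expert user supplies a per-field sizing heuristic; here just a map
        // of anticipated unique-key counts.
        Map<String, Integer> expectedKeys = Map.of("id", 1 << 20);

        BloomFilterFactory factory = fieldName -> {
            Integer n = expectedKeys.get(fieldName);
            if (n == null) return null;   // non-bloomed field
            return new FuzzySet(n * 10);  // crude heuristic: ~10 bits per key
        };

        System.out.println(factory.getSetForField("id").numBits);
        System.out.println(factory.getSetForField("body")); // prints null
    }
}
```

The single-wrapper design means all fields share one delegate PostingsFormat, so the bloom choice never forces extra physical .tii/.tis files per field.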
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-4069: - Attachment: (was: BloomFilterPostingsBranch4x.patch)
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-4069: - Attachment: BloomFilterPostingsBranch4x.patch Added bloom package.html and changes.txt. I plan to commit in a day or two if there are no objections.
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13415362#comment-13415362 ] Mark Harwood commented on LUCENE-4069: -- bq. It's the unique term count (for this one segment) that you need right? Yes, I need it before I start processing the stream of terms being flushed. bq. Seems like LUCENE-4198 needs to solve this same problem. Another possibly related point on more access to merge context - custom codecs have a great opportunity at merge time to piggy-back some analysis on the data being streamed e.g. to spot trending terms whose term frequencies differ drastically between the merging source segments. This would require access to the source segment as term postings are streamed to observe the change in counts. bq. Also, why do we need to use SPI to find the HashFunction? Seems like overkill... we don't (yet) have a bunch of hash functions that are vying here right? There's already a MurmurHash3 algo - we're currently using v2 and so could anticipate an upgrade at some stage. This patch provides that future-proofing. bq. can't the postings format impl pass in an instance of HashFunction when making the FuzzySet I don't think that is going to work. Currently all PostingsFormat impls that extend BloomFilterPostingsFormat can be anonymous (i.e. unregistered via SPI). All their settings (fields, hash algo, thresholds, etc.) are recorded at write time by the base class in the segment. At read time it is the BloomFilterPostingsFormat base class that is instantiated, not the write-time subclass, and so we need to store the hash algo choice. We can't rely on the original subclass being around and configured appropriately with the original write-time choice of hashing function. I think the current way feels safer overall and also allows other Lucene functions to safely record hashes along with a hashname string that can be used to reconstitute results. bq.
Can you move the imports under the copyright header? Will do
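The write-time/read-time split described above boils down to storing a hash *name* with the segment and reconstituting the function from that name later. A minimal self-contained sketch of the idea follows; the registry, the DemoHash name, and the ToIntFunction type are illustrative stand-ins, not the patch's actual HashFunction API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToIntFunction;

public class HashByName {

    // Name-to-implementation registry; the patch does this lookup via SPI.
    static final Map<String, ToIntFunction<byte[]>> REGISTRY = new HashMap<>();
    static {
        // A stand-in hash function registered under a stable name.
        REGISTRY.put("DemoHash", bytes -> {
            int h = 17;
            for (byte b : bytes) h = h * 31 + b;
            return h;
        });
    }

    public static void main(String[] args) {
        // Write time: the (possibly anonymous) subclass records only the name.
        String storedName = "DemoHash";

        // Read time: the base class has no access to the write-time subclass,
        // only the stored name, so it reconstitutes the function from that.
        ToIntFunction<byte[]> fn = REGISTRY.get(storedName);
        System.out.println(fn.applyAsInt("term".getBytes()));
    }
}
```

This is why passing a HashFunction instance into the FuzzySet constructor isn't enough: the instance isn't available when the segment is reopened.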
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13410145#comment-13410145 ] Mark Harwood commented on LUCENE-4069: -- bq. So now we are close to 1M lookups/sec for a single thread! Cool! bq. I wonder if somehow we can do a better job picking the right sized bit vector up front? bq. You basically need to know up front how many unique terms will be in the given field for this segment right? Yes - the job of anticipating the number of unique keys probably has 2 different contexts: 1) Net new segments e.g. guessing up front how many docs/keys a user is likely to generate in a new segment before the flush settings kick in. 2) Merged segments e.g. guessing how many unique keys survive a merge operation Estimating key volumes in context 1 is probably hard without some additional hints from the end user. Arguably the BloomFilterFactory.getSetForField() method already represents where this setting can be controlled. In context 2 where potentially large merges occur we could look at adding an extra method to BloomFilterFactory to handle this different context e.g. something like FuzzySet getSetForMergeOpOnField(FieldInfo fi, OneMerge mergeContext) Based on the size of the segments being merged and volumes of deletes a more appropriate size of Bloom bitset could be allocated based on a worst-case estimate. Not sure how we get the OneMerge instance fed through the call stack - could that be held somewhere on a ThreadLocal as generally useful context? 
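Once a worst-case key estimate is available (from flush hints or merge context as discussed above), sizing the bitset is mechanical. This sketch uses the standard Bloom filter sizing formula m = -n*ln(p) / (ln 2)^2; the patch's own heuristics and method names may differ, and the numbers here are invented:

```java
public class BloomSizing {

    // Bits needed for n expected keys at target false-positive rate p,
    // per the standard Bloom filter sizing formula.
    static long bitsFor(long expectedKeys, double falsePositiveRate) {
        double ln2 = Math.log(2);
        return (long) Math.ceil(
            -expectedKeys * Math.log(falsePositiveRate) / (ln2 * ln2));
    }

    public static void main(String[] args) {
        // Merge context: sum the source segments' doc counts minus known
        // deletes as a worst-case estimate of surviving unique keys.
        long[] segmentDocCounts = {500_000, 300_000};
        long deletes = 100_000;
        long estimate = segmentDocCounts[0] + segmentDocCounts[1] - deletes;

        // ~9.6 bits per key at a 1% false-positive target.
        System.out.println(bitsFor(estimate, 0.01));
    }
}
```

A right-sized bitset from this kind of estimate is what replaced the earlier worst-case-scenario allocation.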
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-4069: - Attachment: PKLookupUpdatePerfTest.java Updated performance test with option to alter the ratio of inserts vs updates via keyspace size.
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13408097#comment-13408097 ] Mark Harwood commented on LUCENE-4069: -- Thanks for the extra tests, Mike. That's tightened performance but that looks like a scary amount of code for the optimal solution of this basic incrementing operation :) I've done some more benchmarks with the updated test and the performance characteristics are becoming clearer as shown in these results: http://goo.gl/dtWSb Bloom performance is better than Pulsing but the gap narrows with the volumes of deletes lying around in old segments, caused by updates. In these cases the BloomFilter gives a false positive and falls back to the equivalent operations of Pulsing. I added a 100mb start size for the BloomFilter for large-scale tests because without this it gets saturated and there were occasional big spikes in batch times. So overall there still looks to be a benefit, especially in low-frequency update scenarios. I'll wait for the dust to settle on LUCENE-4190 (given this Codec introduces a new file) before thinking about committing. Cheers Mark
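The fast-fail and fallback behaviour behind those benchmark numbers can be sketched in a few lines: a Bloom "no" is definitive and skips the expensive lookup, while a Bloom "maybe" (including the false positives caused by deletes left behind by updates) must fall through to the real term dictionary. This is a hypothetical stand-in with a single hash, not the patch's FuzzySet:

```java
import java.util.BitSet;
import java.util.Set;

public class FastFailSketch {

    static final int NUM_BITS = 1 << 16;
    static final BitSet bloom = new BitSet(NUM_BITS);
    // Stand-in for the on-disk term dictionary (the slow path).
    static final Set<String> termDictionary = Set.of("key1", "key2");

    static int slot(String term) {
        return (term.hashCode() & 0x7fffffff) % NUM_BITS;
    }

    static boolean contains(String term) {
        if (!bloom.get(slot(term))) {
            return false;                      // fast-fail: definitely absent, no disk access
        }
        return termDictionary.contains(term);  // maybe present: full lookup required
    }

    public static void main(String[] args) {
        for (String t : termDictionary) bloom.set(slot(t));
        System.out.println(contains("key1"));        // prints true
        System.out.println(contains("missing-key")); // almost always fast-fails
    }
}
```

As the filter saturates (or deletes accumulate), more lookups take the slow branch, which is exactly the narrowing gap versus Pulsing seen in the benchmarks.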
[jira] [Commented] (LUCENE-4190) IndexWriter deletes non-Lucene files
[ https://issues.apache.org/jira/browse/LUCENE-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13407099#comment-13407099 ] Mark Harwood commented on LUCENE-4190: -- -1 for merrily wiping contents of whatever directory a user happens to pick for an index location +0 on requiring all codecs to declare filenames because I take on board Rob's points re complexity +1 for the _* name-spacing proposal as a sensible compromise IndexWriter deletes non-Lucene files Key: LUCENE-4190 URL: https://issues.apache.org/jira/browse/LUCENE-4190 Project: Lucene - Java Issue Type: Bug Reporter: Michael McCandless Assignee: Robert Muir Fix For: 4.0, 5.0 Attachments: LUCENE-4190.patch, LUCENE-4190.patch Carl Austin raised a good issue in a comment on my Lucene 4.0.0 alpha blog post: http://blog.mikemccandless.com/2012/07/lucene-400-alpha-at-long-last.html IndexWriter will now (as of 4.0) delete all foreign files from the index directory. We made this change because Codecs are free to write to any files now, so the space of filenames is hard to bound. But if the user accidentally uses the wrong directory (eg c:/) then we will in fact delete important stuff. I think we can at least use some simple criteria (must start with _, maybe must fit certain pattern eg _base36(_X).Y), so we are much less likely to delete a non-Lucene file
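One possible reading of the "_base36(_X).Y" name-spacing proposal above is a simple pattern check before deletion. The regex here is purely illustrative (the committed fix may use a different pattern or mechanism):

```java
import java.util.regex.Pattern;

public class IndexFileNameCheck {

    // _<base36 segment id>[ _<suffix> ] . <extension>
    static final Pattern LUCENE_LIKE =
        Pattern.compile("_[0-9a-z]+(_[a-z0-9]+)?\\.[a-z0-9]+");

    static boolean safeToDelete(String fileName) {
        return LUCENE_LIKE.matcher(fileName).matches();
    }

    public static void main(String[] args) {
        System.out.println(safeToDelete("_3f.tis"));     // looks Lucene-generated
        System.out.println(safeToDelete("thesis.docx")); // foreign file: left alone
    }
}
```

A check like this is what makes it "much less likely" that a user's own files are wiped when they point IndexWriter at the wrong directory.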
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-4069: - Attachment: (was: BloomFilterPostingsBranch4x.patch)
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-4069: - Attachment: BloomFilterPostingsBranch4x.patch Added customizable saturation threshold after which Bloom filters are retired and no longer maintained (due to merges creating very large segments)
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-4069: - Attachment: PKLookupUpdatePerfTest.java Attached a performance test (adapted from Mike's PKLookupPerfTest) that demonstrates the worst-case scenario where BloomFilter offers the 2x speed up not previously revealed in Mike's other tests. This test case mixes reads and writes on a growing index and is representative of the real-world scenario I am seeking to optimize. See the javadoc for test details.
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-4069:
---------------------------------
    Attachment: (was: BloomFilterPostingsBranch4x.patch)
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-4069:
---------------------------------
    Attachment: BloomFilterPostingsBranch4x.patch

Fix for the "not downsizing" bug, plus a subsequent issue which that fix revealed. The second issue was that on saturation the downsize method would actually upsize into a bigger bitset. This causes false negatives on searches - it's safe to downsize the indexing bitset but not to upsize, as some information loss is already involved.
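To make the safety argument above concrete, here is a hypothetical sketch (not code from the patch) of downsizing by folding: every set bit is OR-ed into position `i % half`, so a membership probe that hashes modulo the new, smaller size can never miss a real entry - only the false-positive rate worsens. Growing the bitset instead would move probe positions beyond the old length, which is exactly how false negatives arise.

```java
import java.util.BitSet;

// Hypothetical sketch of why downsizing a Bloom filter bitset is safe.
class BitSetFolding {
    // Halve a bitset that was sized `size` (assumed even).
    static BitSet fold(BitSet bits, int size) {
        int half = size / 2;
        BitSet folded = new BitSet(half);
        for (int i = bits.nextSetBit(0); i >= 0; i = bits.nextSetBit(i + 1)) {
            folded.set(i % half); // every original bit survives, aliased
        }
        return folded;
    }

    // Demonstrates the no-false-negative property for a single bit.
    static boolean foldedBitSurvives(int bitIndex, int size) {
        BitSet b = new BitSet(size);
        b.set(bitIndex);
        return fold(b, size).get(bitIndex % (size / 2));
    }
}
```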
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-4069:
---------------------------------
    Attachment: (was: PrimaryKeyPerfTest40.java)
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-4069:
---------------------------------
    Attachment: PrimaryKeyPerfTest40.java

Updated performance test code based on the new IndexReader changes for accessing sub-readers.
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13396934#comment-13396934 ]

Mark Harwood commented on LUCENE-4069:
--------------------------------------

Mike, I'm currently having various issues getting this benchmark framework up and running on my Windows platform - is it easy for you to kick off another run with the latest patch on your setup? The latest change to the patch shouldn't require an index rebuild from your last run. No worries if this is too much hassle for you - I'll probably just switch to testing on OSX at home.

Cheers,
Mark
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13397054#comment-13397054 ]

Mark Harwood commented on LUCENE-4069:
--------------------------------------

bq. problem: I'll run perf test again. It's easy...

Great, thanks.

bq. Alas it's not easy ... please report back on how to make it easier to set up!

My Windows-based woes were:
1) Had to install Python (used 2.7)
2) Had to figure out Python proxy settings for the Wikipedia download
3) PySVN was missing - I downloaded the install exe but it claimed Python 2.7 wasn't installed/available, so I gave up and did the svn checkout manually
4) Ran the first Python test and it aborted with a complaint about gnuplot missing

I imagine most of what is needed here comes out of the box on a typical OSX/Linux setup.
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13395773#comment-13395773 ]

Mark Harwood commented on LUCENE-4069:
--------------------------------------

Interesting results, Mike - thanks for taking the time to run them.

bq. BloomFilteredFieldsProducer should just pass through intersect to the delegate?

I have tried to make the BloomFilteredFieldsProducer get out of the way of the client app and the delegate PostingsFormat as soon as it is safe to do so, i.e. when the user is safely focused on a non-filtered field. While there is a chance the client may end up making a call to TermsEnum.seekExact(..) on a filtered field, I need to have a wrapper object in place which is in a position to intercept that call. In all other method invocations I just end up delegating, so I wonder if all these extra method calls are the cause of the slowdown you see, e.g. when Fuzzy is enumerating over many terms. The only alternatives to endlessly wrapping in this way are:
a) An API change - e.g. allow TermsEnum.seekExact to have a pluggable call-out for just this one method.
b) Messing around with byte-code manipulation techniques to weave in Bloom filtering (the sort of thing I recall Hibernate resorts to).
Neither of these seems a particularly appealing option, so I think we may have to live with fuzzy+bloom not being as fast as straight fuzzy.

For completeness' sake - I don't have access to your benchmarking code, but I would hope that PostingsFormat.fieldsProducer() isn't called more than once for the same segment, as that's where the Bloom filters get loaded from disk, so there's inherent cost there too. I can't imagine this is the case.

BTW I've just finished a long-running set of tests which mixes up reads and writes here: http://goo.gl/KJmGv
This benchmark represents how graph databases such as Neo4j use Lucene for an index when loading (I typically use the Wikipedia links as a test set). I get a 3.5x speed-up in Lucene 4, and nearly a 9x speed-up in Lucene 3.6 over the comparatively slower unpatched 3.6 codebase.
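The wrapping overhead discussed above can be pictured with a hypothetical mini-interface (standing in for Lucene's TermsEnum; this is not the patch's actual code): the wrapper intercepts only seekExact() for the Bloom check and forwards everything else, so every non-intercepted call pays one extra method hop.

```java
import java.util.function.Predicate;

// Hypothetical stand-in for Lucene's TermsEnum, for illustration only.
interface SimpleTermsEnum {
    boolean seekExact(String term);
    String next();
}

// Delegates everything, intercepting only seekExact() for the Bloom check.
class BloomFilteredTermsEnum implements SimpleTermsEnum {
    private final SimpleTermsEnum delegate;
    private final Predicate<String> mightContain;

    BloomFilteredTermsEnum(SimpleTermsEnum delegate, Predicate<String> mightContain) {
        this.delegate = delegate;
        this.mightContain = mightContain;
    }

    @Override
    public boolean seekExact(String term) {
        // A negative Bloom answer is definitive: skip the delegate's (disk-touching) seek.
        if (!mightContain.test(term)) {
            return false;
        }
        return delegate.seekExact(term);
    }

    @Override
    public String next() {
        return delegate.next(); // pure pass-through: the extra method-hop cost
    }
}
```

When a term-heavy query such as fuzzy drives next() in a tight loop, that pass-through hop is pure overhead, which is consistent with the slowdown speculated about above.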
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13395857#comment-13395857 ]

Mark Harwood commented on LUCENE-4069:
--------------------------------------

bq. I think the fix is simple: you are not overriding Terms.intersect now, in BloomFilteredTerms

Good catch - a quick test indeed shows a speed-up on fuzzy queries. I'll prepare a new patch. I'm not sure why 3.6+Bloom is faster than 4+Bloom in my tests - I'll take a closer look at your benchmark.
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-4069:
---------------------------------
    Attachment: (was: BloomFilterPostingsBranch4x.patch)