Re: QueryParser - proposed change may break existing queries.
>You could avoid (some of?) these problems by supporting /(?i)foo/ instead of /foo/i. That would avoid our parsing dilemma but brings some other concerns. This inline syntax can normally be used to selectively turn on case-insensitive matching for sections of a regex and then turn it off with (?-i). We could potentially implement this support in the underlying o.a.l.util.automaton.RegExp class. We changed that class recently to take a separate global flag alongside the regex string which can determine case sensitivity. I guess any inline (?i) syntax would override whatever default option had been passed in the constructor flag. That might be a hairy change though - the RegExp parser logic is hand-crafted rather than JavaCC. On Fri, Sep 18, 2020 at 7:47 AM Dawid Weiss wrote: > > If they try to use any other options than 'i' we throw a ParseException > > +1. Complex-syntax parsers should throw (human-palatable) exceptions > on syntax errors. A lenient, "naive user" query parser should be > separate and accept a very, very > rudimentary query syntax (so that there are literally no chances of > making a syntax error). > > D. > > - > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: dev-h...@lucene.apache.org > >
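The inline-flag behaviour discussed above can be illustrated with Python's `re` engine, which supports the same `(?i)` and scoped `(?i:...)` syntax. This is an analogy only, not Lucene's `o.a.l.util.automaton.RegExp`:

```python
import re

# (?i) at the start turns on case-insensitive matching for the pattern
assert re.fullmatch(r"(?i)foo", "FOO") is not None

# The scoped form (?i:...) limits the effect to one group - the kind of
# selective on/off behaviour described for (?i)/(?-i) above.
assert re.fullmatch(r"(?i:foo)bar", "FOObar") is not None
assert re.fullmatch(r"(?i:foo)bar", "FOOBAR") is None
```

Any Lucene support would have to define how such inline flags interact with the constructor-level case-sensitivity flag; the override behaviour sketched in the mail above is one plausible choice.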
Re: QueryParser - proposed change may break existing queries.
I think the decision comes down to choosing between silent (mis)interpretations of ambiguous queries or noisy failures. On Thu, Sep 17, 2020 at 1:55 PM Uwe Schindler wrote: > Hi, > > > > My idea would have been not to be too strict and instead only detect it > as a regex if it's separated. So /foo/bar and /foo/iphone would both go > through, ignoring the regex; only ‘/foo/ bar’ or ‘/foo/I phone’ would > interpret the first token as regex. > > > > That’s just my idea, not sure if it makes sense to have this relaxed > parsing. I was always very skeptical of adding the regexes, as it breaks > many queries. Now it’s even more. > > > > Uwe > > > > - > > Uwe Schindler > > Achterdiek 19, D-28357 Bremen > > https://www.thetaphi.de > > eMail: u...@thetaphi.de > > > > *From:* Mark Harwood > *Sent:* Wednesday, September 16, 2020 6:45 PM > *To:* dev@lucene.apache.org > *Subject:* Re: QueryParser - proposed change may break existing queries. > > > > The strictness I was thinking of adding was to make all of the following > error: > > /foo/bar > > /foo//bar/ > > /foo/iphone > > /foo/AND x > > > > These would be allowed: > > /foo/i bar > > (/foo/ OR /bar/) > > (/foo/ OR /bar/i) > > /foo/^2 > > /foo/i^2 > > > > > > > > On 16 Sep 2020, at 12:00, Uwe Schindler wrote: > > > > In my opinion, the proposed syntax change should enforce to have > whitespace or any other separator char after the regex “i” parameter. > > > > Uwe > > > > - > > Uwe Schindler > > Achterdiek 19, D-28357 Bremen > > https://www.thetaphi.de > > eMail: u...@thetaphi.de > > > > *From:* Mark Harwood > *Sent:* Wednesday, September 16, 2020 11:04 AM > *To:* dev@lucene.apache.org > *Subject:* QueryParser - proposed change may break existing queries. > > > > In LUCENE-9445 we'd like to add a case insensitive option to regex queries > in the query parser of the form: > >/Foo/i > > > > However, today people can search for : > > > >/foo.com/index.html > > > > and not get an error. 
The searcher may think this is a query for a URL but > it's actually parsed as a regex "foo.com" ORed with a term query. > > > > I'd like to draw attention to this proposed change in behaviour because I > think it could affect many existing systems. Arguably it may be a positive > in drawing attention to a number of existing silent failures (unescaped > searches for urls or file paths) but equally could be seen as a negative > breaking change by some. > > > > What is our BWC policy for changes to query parser? > > Do the benefits of the proposed new regex feature outweigh the costs of > the breakages in your view? > > > > > https://issues.apache.org/jira/browse/LUCENE-9445?focusedCommentId=17196793&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17196793 > > > > > >
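A minimal sketch of the separator rule under discussion (hypothetical code, not the actual JavaCC grammar): treat a trailing `i` as a flag only when it is followed by whitespace or end of input, and reject anything else glued to the closing slash. Boost syntax like `/foo/^2` is deliberately omitted here.

```python
import re

# Hypothetical sketch: a regex clause is /body/ with an optional "i" flag,
# valid only when followed by whitespace or end of input.
REGEX_CLAUSE = re.compile(r"/((?:[^/\\]|\\.)*)/(i?)(?=\s|$)")

def parse_regex_clause(text):
    m = REGEX_CLAUSE.match(text)
    if m is None:
        raise ValueError(f"not a valid regex clause: {text!r}")
    # return the regex body, the case-insensitive flag, and the unconsumed
    # remainder (a real parser would keep tokenizing from there)
    return m.group(1), m.group(2) == "i", text[m.end():]

print(parse_regex_clause("/foo/i"))      # ('foo', True, '')
print(parse_regex_clause("/foo/ bar"))   # ('foo', False, ' bar')
```

With this rule, `/foo/iphone`, `/foo/bar` and `/foo//bar/` all raise a noisy error instead of being silently (mis)parsed, matching the strict behaviour proposed earlier in the thread.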
Re: QueryParser - proposed change may break existing queries.
The strictness I was thinking of adding was to make all of the following error: /foo/bar /foo//bar/ /foo/iphone /foo/AND x These would be allowed: /foo/i bar (/foo/ OR /bar/) (/foo/ OR /bar/i) /foo/^2 /foo/i^2 > On 16 Sep 2020, at 12:00, Uwe Schindler wrote: > > > In my opinion, the proposed syntax change should enforce to have whitespace > or any other separator char after the regex “i” parameter. > > Uwe > > - > Uwe Schindler > Achterdiek 19, D-28357 Bremen > https://www.thetaphi.de > eMail: u...@thetaphi.de > > From: Mark Harwood > Sent: Wednesday, September 16, 2020 11:04 AM > To: dev@lucene.apache.org > Subject: QueryParser - proposed change may break existing queries. > > In LUCENE-9445 we'd like to add a case insensitive option to regex queries in > the query parser of the form: >/Foo/i > > However, today people can search for : > >/foo.com/index.html > > and not get an error. The searcher may think this is a query for a URL but > it's actually parsed as a regex "foo.com" ORed with a term query. > > I'd like to draw attention to this proposed change in behaviour because I > think it could affect many existing systems. Arguably it may be a positive in > drawing attention to a number of existing silent failures (unescaped searches > for urls or file paths) but equally could be seen as a negative breaking > change by some. > > What is our BWC policy for changes to query parser? > Do the benefits of the proposed new regex feature outweigh the costs of the > breakages in your view? > > https://issues.apache.org/jira/browse/LUCENE-9445?focusedCommentId=17196793&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17196793 > >
QueryParser - proposed change may break existing queries.
In Lucene-9445 we'd like to add a case insensitive option to regex queries in the query parser of the form: /Foo/i However, today people can search for : /foo.com/index.html and not get an error. The searcher may think this is a query for a URL but it's actually parsed as a regex "foo.com" ORed with a term query. I'd like to draw attention to this proposed change in behaviour because I think it could affect many existing systems. Arguably it may be a positive in drawing attention to a number of existing silent failures (unescaped searches for urls or file paths) but equally could be seen as a negative breaking change by some. What is our BWC policy for changes to query parser? Do the benefits of the proposed new regex feature outweigh the costs of the breakages in your view? https://issues.apache.org/jira/browse/LUCENE-9445?focusedCommentId=17196793&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17196793
Re: [VOTE] Solr to become a top-level Apache project (TLP)
+1 On 2020/05/12 07:36:57, Dawid Weiss wrote: > Dear Lucene and Solr developers! > > According to an earlier [DISCUSS] thread on the dev list [2], I am > calling for a vote on the proposal to make Solr a top-level Apache > project (TLP) and separate Lucene and Solr development into two > independent entities. > > To quickly recap the reasons and consequences of such a move: it seems > like the reasons for the initial merge of Lucene and Solr, around 10 > years ago, have been achieved. Both projects are in good shape and > exhibit signs of independence already (mailing lists, committers, > patch flow). There are many technical considerations that would make > development much easier if we move Solr out into its own TLP. > > We discussed this issue [2] and both PMC members and committers had a > chance to review all the pros and cons and express their views. The > discussion showed that there are clearly different opinions on the > matter - some people are in favor, some are neutral, others are > against or not seeing the point of additional labor. Realistically, I > don't think reaching 100% level consensus is going to be possible -- > we are a diverse bunch with different opinions and personalities. I > firmly believe this is the right direction hence the decision to put > it under the voting process. Should something take a wrong turn in the > future (as some folks worry it may), all blame is on me. > > Therefore, the proposal is to separate Solr from under Lucene TLP, and > make it a TLP on its own. The initial structure of the new PMC, > committer base, git repositories and other managerial aspects can be > worked out during the process if the decision passes. 
> > Please indicate one of the following (see [1] for guidelines): > > [ ] +1 - yes, I vote for the proposal > [ ] -1 - no, I vote against the proposal > > Please note that anyone in the Lucene+Solr community is invited to > express their opinion, though only Lucene+Solr committers cast binding > votes (indicate non-binding votes in your reply, please). > > The vote will be active for a week to give everyone a chance to read > and cast a vote. > > Dawid > > [1] https://www.apache.org/foundation/voting.html > [2] > https://lists.apache.org/thread.html/rfae2440264f6f874e91545b2030c98e7b7e3854ddf090f7747d338df%40%3Cdev.lucene.apache.org%3E > > - > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: dev-h...@lucene.apache.org > > - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8876) EnglishMinimalStemmer does not implement s-stemmer paper correctly?
[ https://issues.apache.org/jira/browse/LUCENE-8876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16876507#comment-16876507 ] Mark Harwood commented on LUCENE-8876: -- I reached out to the paper author, Donna Harman, a while ago and she just replied as follows: {quote}It has been a very long time since I have thought about S-stemmers. But looking at your examples of bees and employees, it seems to me that rule 3 is the correct one because rule 2 would be prevented from firing. {quote} Given her assertion that rule 3 should apply to "bees", it looks like this would make rule 2 entirely redundant. > EnglishMinimalStemmer does not implement s-stemmer paper correctly? > --- > > Key: LUCENE-8876 > URL: https://issues.apache.org/jira/browse/LUCENE-8876 > Project: Lucene - Core > Issue Type: Bug > Components: modules/analysis >Reporter: Mark Harwood >Priority: Minor > > The EnglishMinimalStemmer fails to stem ees suffixes like bees, trees and > employees. > The [original > paper|http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.104.9828&rep=rep1&type=pdf] > has this table of rules: > !https://user-images.githubusercontent.com/170925/59616454-5dc7d580-911c-11e9-80b0-c7a59458c5a7.png! > The notes accompanying the table state : > {quote}"the first applicable rule encountered is the only one used" > {quote} > > For the {{ees}} and {{oes}} suffixes I think EnglishMinimalStemmer > misinterpreted the rule logic and consequently {{bees != bee}} and {{tomatoes > != tomato}}. The {{oes}} and {{ees}} suffixes are left intact. > "The first applicable rule" for {{ees}} could be interpreted as rule 2 or 3 > in the table depending on if you take {{applicable}} to mean "the THEN part > of the rule has fired" or just that the suffix was referenced in the rule. > EnglishMinimalStemmer has assumed the latter and I think it should be the > former. We should fall through into rule 3 for {{ees}} and {{oes}} (remove > any trailing S). 
That's certainly the conclusion I came to independently > testing on real data. > There are some additional changes I'd like to see in a plural stemmer but I > won't list them here - the focus should be making the code here match the > original paper it references. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
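Under the interpretation argued here (a rule is "applicable" only when its THEN part fires, so `ees` and `oes` words fall through to rule 3), the three-rule table sketches out as follows. This is illustrative Python, not the Lucene `EnglishMinimalStemmer`:

```python
def s_stem(word):
    # Rule 1: "ies" -> "y", except after "a" or "e"
    if word.endswith("ies") and not word.endswith(("aies", "eies")):
        return word[:-3] + "y"
    # Rule 2: "es" -> "e" (i.e. drop the "s"), except after "a", "e" or "o"
    if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
        return word[:-1]
    # Rule 3: drop a trailing "s", except after "u" or "s" -- under the
    # fall-through reading this is the rule that catches "bees" etc.
    if word.endswith("s") and not word.endswith(("us", "ss")):
        return word[:-1]
    return word

assert s_stem("bees") == "bee"
assert s_stem("ponies") == "pony"
assert s_stem("glass") == "glass"
```

Note the rules are plain `if`s rather than `elif`s on the suffix alone: an exception in rule 2 lets the word fall through to rule 3, which is exactly why rule 2's exceptions make it look redundant.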
[jira] [Commented] (LUCENE-8876) EnglishMinimalStemmer does not implement s-stemmer paper correctly?
[ https://issues.apache.org/jira/browse/LUCENE-8876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16871423#comment-16871423 ] Mark Harwood commented on LUCENE-8876: -- {quote} but then doesn't it mean that exceptions of the 2nd rule are always ignored? {quote} Good point. Rule 1 exceptions are odd too - I have not found a single common English word that ends in aies or eies. > EnglishMinimalStemmer does not implement s-stemmer paper correctly? > --- > > Key: LUCENE-8876 > URL: https://issues.apache.org/jira/browse/LUCENE-8876 > Project: Lucene - Core > Issue Type: Bug > Components: modules/analysis >Reporter: Mark Harwood >Priority: Minor > > The EnglishMinimalStemmer fails to stem ees suffixes like bees, trees and > employees. > The [original > paper|[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.104.9828&rep=rep1&type=pdf]] > has this table of rules: > !https://user-images.githubusercontent.com/170925/59616454-5dc7d580-911c-11e9-80b0-c7a59458c5a7.png! > The notes accompanying the table state : > {quote}"the first applicable rule encountered is the only one used" > {quote} > > For the {{ees}} and {{oes}} suffixes I think EnglishMinimalStemmer > misinterpreted the rule logic and consequently {{bees != bee}} and {{tomatoes > != tomato}}. The {{oes}} and {{ees}} suffixes are left intact. > "The first applicable rule" for {{ees}} could be interpreted as rule 2 or 3 > in the table depending on if you take {{applicable}} to mean "the THEN part > of the rule has fired" or just that the suffix was referenced in the rule. > EnglishMinimalStemmer has assumed the latter and I think it should be the > former. We should fall through into rule 3 for {{ees}} and {{oes}} (remove > any trailing S). That's certainly the conclusion I came to independently > testing on real data. 
> There are some additional changes I'd like to see in a plural stemmer but I > won't list them here - the focus should be making the code here match the > original paper it references. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-8876) EnglishMinimalStemmer does not implement s-stemmer paper correctly?
Mark Harwood created LUCENE-8876: Summary: EnglishMinimalStemmer does not implement s-stemmer paper correctly? Key: LUCENE-8876 URL: https://issues.apache.org/jira/browse/LUCENE-8876 Project: Lucene - Core Issue Type: Bug Components: modules/analysis Reporter: Mark Harwood The EnglishMinimalStemmer fails to stem ees suffixes like bees, trees and employees. The [original paper|http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.104.9828&rep=rep1&type=pdf] has this table of rules: !https://user-images.githubusercontent.com/170925/59616454-5dc7d580-911c-11e9-80b0-c7a59458c5a7.png! The notes accompanying the table state: {quote}"the first applicable rule encountered is the only one used" {quote} For the {{ees}} and {{oes}} suffixes I think EnglishMinimalStemmer misinterpreted the rule logic and consequently {{bees != bee}} and {{tomatoes != tomato}}. The {{oes}} and {{ees}} suffixes are left intact. "The first applicable rule" for {{ees}} could be interpreted as rule 2 or 3 in the table depending on if you take {{applicable}} to mean "the THEN part of the rule has fired" or just that the suffix was referenced in the rule. EnglishMinimalStemmer has assumed the latter and I think it should be the former. We should fall through into rule 3 for {{ees}} and {{oes}} (remove any trailing S). That's certainly the conclusion I came to independently testing on real data. There are some additional changes I'd like to see in a plural stemmer but I won't list them here - the focus should be making the code here match the original paper it references. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8840) TopTermsBlendedFreqScoringRewrite should use SynonymQuery
[ https://issues.apache.org/jira/browse/LUCENE-8840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16861960#comment-16861960 ] Mark Harwood commented on LUCENE-8840: -- {quote}we shouldn't favor documents that contain multiple variations of the same fuzzy term. {quote} For fuzzy I agree that rewarding more variations in a doc is probably undesirable - a doc will normally pick one spelling for a word and use it consistently so any variations are more likely to be false positives (your baz/bad example). Plurals and other forms of suffix would be a notable exception but I don't think that's too much of a problem because: # we can assume that stemming is taking care of normalizing these tokens. # a lot of fuzzy querying is for things like people names that aren't expressed as plurals or with other common suffixes I think all forms of automatic expansions (synonym, fuzzy, wildcard) need a form of score blending for the expansions they create. Wildcards are perhaps unlike fuzzy in that finding multiple variations in a doc _is_ desirable - we _are_ looking for multiple forms and a document that contains many is better than few. > TopTermsBlendedFreqScoringRewrite should use SynonymQuery > - > > Key: LUCENE-8840 > URL: https://issues.apache.org/jira/browse/LUCENE-8840 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Jim Ferenczi >Priority: Major > Attachments: LUCENE-8840.patch > > > Today the TopTermsBlendedFreqScoringRewrite, which is the default rewrite > method for Fuzzy queries, uses the BlendedTermQuery to score documents that > match the fuzzy terms. This query blends the frequencies used for scoring > across the terms and creates a disjunction of all the blended terms. This > means that each fuzzy term that match in a document will add their BM25 score > contribution. 
We already have a query that can blend the statistics of > multiple terms in a single scorer that sums the doc frequencies rather than > the entire BM25 score: the SynonymQuery. Since > https://issues.apache.org/jira/browse/LUCENE-8652 this query also handles > boost between 0 and 1 so it should be easy to change the default rewrite > method for Fuzzy queries to use it instead of the BlendedTermQuery. This > would bound the contribution of each term to the final score which seems a > better alternative in terms of relevancy than the current solution. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
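The scoring difference being proposed can be shown with a toy calculation (illustrative Python using a BM25-style idf, not Lucene code): a disjunction sums each expanded term's idf-weighted contribution, while a SynonymQuery-style blend sums the document frequencies first and scores the expansions as one pseudo-term.

```python
import math

N = 1000  # documents in a toy index

def idf(df):
    # BM25-style inverse document frequency
    return math.log(1 + (N - df + 0.5) / (df + 0.5))

variant_dfs = [5, 7, 300]  # doc freqs of three hypothetical fuzzy expansions

# Disjunction (BlendedTermQuery-style): each matching variant adds its score.
disjunction_score = sum(idf(df) for df in variant_dfs)

# SynonymQuery-style: one pseudo-term whose df is the sum of the variants'.
blended_score = idf(sum(variant_dfs))

assert blended_score < disjunction_score
```

The blend bounds the contribution of the expansion set as a whole, so a document containing several rare variants (likely false positives, per the discussion above) no longer stacks up multiple large idf terms.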
[jira] [Commented] (LUCENE-8352) Make TokenStreamComponents final
[ https://issues.apache.org/jira/browse/LUCENE-8352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16509635#comment-16509635 ] Mark Harwood commented on LUCENE-8352: -- My use case was a bit special. I had a custom reader that [dealt with hyperlinked text|https://github.com/elastic/elasticsearch/issues/29467#issuecomment-385393246] and stripped out the hyperlink markup using a custom Reader before feeding the remaining plain-text into tokenisation. The tricky bit was the extracted URLs would not be thrown away but passed to a special TokenFilter at the end of the chain to inject at the appropriate positions in the text token stream. The workaround was a custom AnalyzerWrapper that overrode wrapReader (which is still invoked when wrapped) and then some ThreadLocal hackery to get my TokenFilter connected to the Reader's extracted urls. I'm not sure how common this sort of analysis is but before I reached this solution there was quite a detour trying to figure out why a custom TokenStreamComponents was not working when wrapped. > Make TokenStreamComponents final > > > Key: LUCENE-8352 > URL: https://issues.apache.org/jira/browse/LUCENE-8352 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/analysis >Reporter: Mark Harwood >Priority: Minor > > The current design is a little trappy. Any specialised subclasses of > TokenStreamComponents _(see_ _StandardAnalyzer, ClassicAnalyzer, > UAX29URLEmailAnalyzer)_ are discarded by any subsequent Analyzers that wrap > them _(see LimitTokenCountAnalyzer, QueryAutoStopWordAnalyzer, > ShingleAnalyzerWrapper and other examples in elasticsearch)_. > The current design means each AnalyzerWrapper.wrapComponents() implementation > discards any custom TokenStreamComponents and replaces it with one of its own > choosing (a vanilla TokenStreamComponents class from examples I've seen). 
> This is a trap I fell into when writing a custom TokenStreamComponents with a > custom setReader() and I wondered why it was not being triggered when wrapped > by other analyzers. > If AnalyzerWrapper is designed to encourage composition it's arguably a > mistake to also permit custom TokenStreamComponent subclasses - the > composition process does not preserve the choice of custom classes and any > behaviours they might add. For this reason we should not encourage extensions > to TokenStreamComponents (or if TSC extensions are required we should somehow > mark an Analyzer as "unwrappable" to prevent lossy compositions). > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
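The wrapping trap described in this issue can be mimicked in a few lines of Python (a hypothetical analogy, not Lucene's actual API): the wrapper rebuilds a vanilla components object from the wrapped pair, so a custom subclass, and any behaviour it overrides, is silently dropped.

```python
class TokenStreamComponents:
    def __init__(self, tokenizer, stream):
        self.tokenizer, self.stream = tokenizer, stream

    def set_reader(self, reader):
        self.reader = reader  # default behaviour

class UrlAwareComponents(TokenStreamComponents):
    def set_reader(self, reader):
        # hypothetical custom preprocessing, e.g. stripping link markup
        super().set_reader(reader.replace("<a>", ""))

def wrap_components(inner):
    # mirrors the pattern described above: the wrapper builds a vanilla
    # TokenStreamComponents, discarding the custom subclass
    return TokenStreamComponents(inner.tokenizer, inner.stream)

wrapped = wrap_components(UrlAwareComponents("tok", "stream"))
wrapped.set_reader("<a>text")
assert not isinstance(wrapped, UrlAwareComponents)
assert wrapped.reader == "<a>text"  # the custom set_reader never ran
```

Making the components class final, as the issue title proposes, would turn this silent loss of behaviour into a compile-time impossibility.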
[jira] [Created] (LUCENE-8352) Make TokenStreamComponents final
Mark Harwood created LUCENE-8352: Summary: Make TokenStreamComponents final Key: LUCENE-8352 URL: https://issues.apache.org/jira/browse/LUCENE-8352 Project: Lucene - Core Issue Type: Improvement Components: modules/analysis Reporter: Mark Harwood The current design is a little trappy. Any specialised subclasses of TokenStreamComponents _(see_ _StandardAnalyzer, ClassicAnalyzer, UAX29URLEmailAnalyzer)_ are discarded by any subsequent Analyzers that wrap them _(see LimitTokenCountAnalyzer, QueryAutoStopWordAnalyzer, ShingleAnalyzerWrapper and other examples in elasticsearch)_. The current design means each AnalyzerWrapper.wrapComponents() implementation discards any custom TokenStreamComponents and replaces it with one of its own choosing (a vanilla TokenStreamComponents class from examples I've seen). This is a trap I fell into when writing a custom TokenStreamComponents with a custom setReader() and I wondered why it was not being triggered when wrapped by other analyzers. If AnalyzerWrapper is designed to encourage composition it's arguably a mistake to also permit custom TokenStreamComponent subclasses - the composition process does not preserve the choice of custom classes and any behaviours they might add. For this reason we should not encourage extensions to TokenStreamComponents (or if TSC extensions are required we should somehow mark an Analyzer as "unwrappable" to prevent lossy compositions). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Closed] (LUCENE-6747) FingerprintFilter - a TokenFilter for clustering/linking purposes
[ https://issues.apache.org/jira/browse/LUCENE-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood closed LUCENE-6747. > FingerprintFilter - a TokenFilter for clustering/linking purposes > - > > Key: LUCENE-6747 > URL: https://issues.apache.org/jira/browse/LUCENE-6747 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis > Reporter: Mark Harwood >Priority: Minor > Fix For: Trunk, 5.4 > > Attachments: fingerprintv1.patch, fingerprintv2.patch, > fingerprintv3.patch, fingerprintv4.patch > > > A TokenFilter that emits a single token which is a sorted, de-duplicated set > of the input tokens. > This approach to normalizing text is used in tools like OpenRefine[1] and > elsewhere [2] to help in clustering or linking texts. > The implementation proposed here has a an upper limit on the size of the > combined token which is output. > [1] https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth > [2] > https://rajmak.wordpress.com/2013/04/27/clustering-text-map-reduce-in-python/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-6747) FingerprintFilter - a TokenFilter for clustering/linking purposes
[ https://issues.apache.org/jira/browse/LUCENE-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood resolved LUCENE-6747. -- Resolution: Fixed Committed to trunk and 5.x > FingerprintFilter - a TokenFilter for clustering/linking purposes > - > > Key: LUCENE-6747 > URL: https://issues.apache.org/jira/browse/LUCENE-6747 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis > Reporter: Mark Harwood >Priority: Minor > Fix For: Trunk, 5.4 > > Attachments: fingerprintv1.patch, fingerprintv2.patch, > fingerprintv3.patch, fingerprintv4.patch > > > A TokenFilter that emits a single token which is a sorted, de-duplicated set > of the input tokens. > This approach to normalizing text is used in tools like OpenRefine[1] and > elsewhere [2] to help in clustering or linking texts. > The implementation proposed here has an upper limit on the size of the > combined token which is output. > [1] https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth > [2] > https://rajmak.wordpress.com/2013/04/27/clustering-text-map-reduce-in-python/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-6747) FingerprintFilter - a TokenFilter for clustering/linking purposes
[ https://issues.apache.org/jira/browse/LUCENE-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-6747: - Fix Version/s: 5.3.1 Trunk > FingerprintFilter - a TokenFilter for clustering/linking purposes > - > > Key: LUCENE-6747 > URL: https://issues.apache.org/jira/browse/LUCENE-6747 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis > Reporter: Mark Harwood >Priority: Minor > Fix For: Trunk, 5.3.1 > > Attachments: fingerprintv1.patch, fingerprintv2.patch, > fingerprintv3.patch, fingerprintv4.patch > > > A TokenFilter that emits a single token which is a sorted, de-duplicated set > of the input tokens. > This approach to normalizing text is used in tools like OpenRefine[1] and > elsewhere [2] to help in clustering or linking texts. > The implementation proposed here has a an upper limit on the size of the > combined token which is output. > [1] https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth > [2] > https://rajmak.wordpress.com/2013/04/27/clustering-text-map-reduce-in-python/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-6747) FingerprintFilter - a TokenFilter for clustering/linking purposes
[ https://issues.apache.org/jira/browse/LUCENE-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-6747: - Fix Version/s: (was: 5.3.1) 5.4 > FingerprintFilter - a TokenFilter for clustering/linking purposes > - > > Key: LUCENE-6747 > URL: https://issues.apache.org/jira/browse/LUCENE-6747 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis > Reporter: Mark Harwood >Priority: Minor > Fix For: Trunk, 5.4 > > Attachments: fingerprintv1.patch, fingerprintv2.patch, > fingerprintv3.patch, fingerprintv4.patch > > > A TokenFilter that emits a single token which is a sorted, de-duplicated set > of the input tokens. > This approach to normalizing text is used in tools like OpenRefine[1] and > elsewhere [2] to help in clustering or linking texts. > The implementation proposed here has a an upper limit on the size of the > combined token which is output. > [1] https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth > [2] > https://rajmak.wordpress.com/2013/04/27/clustering-text-map-reduce-in-python/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-6747) FingerprintFilter - a TokenFilter for clustering/linking purposes
[ https://issues.apache.org/jira/browse/LUCENE-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-6747: - Attachment: fingerprintv4.patch Some final tweaks: 1) Found a bug where separator not appended if first token is length ==1 2) Randomized testing identified issue with input.end() not being called when IOExceptions occur 3) Added missing SPI entry for FingerprintFilterFactory and associated test class for FingerprintFilterFactory > FingerprintFilter - a TokenFilter for clustering/linking purposes > - > > Key: LUCENE-6747 > URL: https://issues.apache.org/jira/browse/LUCENE-6747 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis > Reporter: Mark Harwood >Priority: Minor > Attachments: fingerprintv1.patch, fingerprintv2.patch, > fingerprintv3.patch, fingerprintv4.patch > > > A TokenFilter that emits a single token which is a sorted, de-duplicated set > of the input tokens. > This approach to normalizing text is used in tools like OpenRefine[1] and > elsewhere [2] to help in clustering or linking texts. > The implementation proposed here has a an upper limit on the size of the > combined token which is output. > [1] https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth > [2] > https://rajmak.wordpress.com/2013/04/27/clustering-text-map-reduce-in-python/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-6747) FingerprintFilter - a TokenFilter for clustering/linking purposes
[ https://issues.apache.org/jira/browse/LUCENE-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-6747: - Attachment: fingerprintv3.patch Updated patch - removed instanceof check and added entry to Changes.txt. Will commit to trunk and 5.x in a day or two if there's no objections > FingerprintFilter - a TokenFilter for clustering/linking purposes > - > > Key: LUCENE-6747 > URL: https://issues.apache.org/jira/browse/LUCENE-6747 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis > Reporter: Mark Harwood >Priority: Minor > Attachments: fingerprintv1.patch, fingerprintv2.patch, > fingerprintv3.patch > > > A TokenFilter that emits a single token which is a sorted, de-duplicated set > of the input tokens. > This approach to normalizing text is used in tools like OpenRefine[1] and > elsewhere [2] to help in clustering or linking texts. > The implementation proposed here has a an upper limit on the size of the > combined token which is output. > [1] https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth > [2] > https://rajmak.wordpress.com/2013/04/27/clustering-text-map-reduce-in-python/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-6747) FingerprintFilter - a TokenFilter for clustering/linking purposes
[ https://issues.apache.org/jira/browse/LUCENE-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-6747: - Attachment: fingerprintv2.patch Thanks for taking a look, Adrien. Added a v2 patch with the following changes: 1) added a call to input.end() to get the final offset state 2) the final state is retained using captureState() 3) added a FingerprintFilterFactory class As for the alternative hashing idea: for speed reasons this would be nice, but it reduces the readability of results if you want to debug any collisions or otherwise display connections. For compactness reasons (storing in doc values etc.) it would always be possible to chain a conventional hashing algorithm in a TokenFilter on the end of this text-normalizing filter. (Do we already have a conventional hashing TokenFilter?) > FingerprintFilter - a TokenFilter for clustering/linking purposes > - > > Key: LUCENE-6747 > URL: https://issues.apache.org/jira/browse/LUCENE-6747 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis > Reporter: Mark Harwood >Priority: Minor > Attachments: fingerprintv1.patch, fingerprintv2.patch > > > A TokenFilter that emits a single token which is a sorted, de-duplicated set > of the input tokens. > This approach to normalizing text is used in tools like OpenRefine[1] and > elsewhere [2] to help in clustering or linking texts. > The implementation proposed here has an upper limit on the size of the > combined token which is output. > [1] https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth > [2] > https://rajmak.wordpress.com/2013/04/27/clustering-text-map-reduce-in-python/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
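[Editor's illustration] The "chain a conventional hashing algorithm on the end" idea above can be sketched with the JDK's MessageDigest. This is a conceptual sketch only, not an existing Lucene TokenFilter: the class name and the choice of MD5 are assumptions for illustration, and the input stands in for the single fingerprint token the filter would emit.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Sketch: compress a (potentially long) fingerprint token into a fixed-width
// hex digest for compact storage, e.g. in doc values. The trade-off noted in
// the comment above applies: the hash is compact but no longer human-readable
// when debugging collisions or displaying connections.
public class HashedFingerprint {
    public static String hash(String fingerprint) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5"); // any stable hash would do
        byte[] digest = md.digest(fingerprint.getBytes(StandardCharsets.UTF_8));
        return String.format("%032x", new BigInteger(1, digest)); // 32 hex chars
    }

    public static void main(String[] args) throws Exception {
        // Equal fingerprints always yield equal hashes, so linking still works.
        System.out.println(hash("new york"));
    }
}
```

In a real analysis chain this step would sit after the fingerprinting filter, consuming its single output token.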
[jira] [Updated] (LUCENE-6747) FingerprintFilter - a TokenFilter for clustering/linking purposes
[ https://issues.apache.org/jira/browse/LUCENE-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-6747: - Attachment: fingerprintv1.patch Proposed implementation and test > FingerprintFilter - a TokenFilter for clustering/linking purposes > - > > Key: LUCENE-6747 > URL: https://issues.apache.org/jira/browse/LUCENE-6747 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis > Reporter: Mark Harwood >Priority: Minor > Attachments: fingerprintv1.patch > > > A TokenFilter that emits a single token which is a sorted, de-duplicated set > of the input tokens. > This approach to normalizing text is used in tools like OpenRefine[1] and > elsewhere [2] to help in clustering or linking texts. > The implementation proposed here has a an upper limit on the size of the > combined token which is output. > [1] https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth > [2] > https://rajmak.wordpress.com/2013/04/27/clustering-text-map-reduce-in-python/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-6747) FingerprintFilter - a TokenFilter for clustering/linking purposes
Mark Harwood created LUCENE-6747: Summary: FingerprintFilter - a TokenFilter for clustering/linking purposes Key: LUCENE-6747 URL: https://issues.apache.org/jira/browse/LUCENE-6747 Project: Lucene - Core Issue Type: New Feature Components: modules/analysis Reporter: Mark Harwood Priority: Minor A TokenFilter that emits a single token which is a sorted, de-duplicated set of the input tokens. This approach to normalizing text is used in tools like OpenRefine[1] and elsewhere [2] to help in clustering or linking texts. The implementation proposed here has an upper limit on the size of the combined token which is output. [1] https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth [2] https://rajmak.wordpress.com/2013/04/27/clustering-text-map-reduce-in-python/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
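[Editor's illustration] The normalization described above (lower-case, sort, de-duplicate, join, cap the output size) can be sketched outside Lucene's TokenStream machinery as plain Java. This is a conceptual sketch only; the class name, separator, and size cap below are assumptions, not the patch's actual API:

```java
import java.util.TreeSet;

// Conceptual sketch of fingerprinting: lower-case, sort and de-duplicate the
// input tokens, then join them into a single token, emitting nothing if the
// result would exceed a configured size cap.
public class FingerprintSketch {
    static final char SEPARATOR = ' ';        // assumed separator
    static final int MAX_OUTPUT_LEN = 1024;   // assumed upper limit

    // Returns the fingerprint, or null if it would exceed the cap.
    public static String fingerprint(String[] tokens) {
        TreeSet<String> sorted = new TreeSet<>();   // sorts and de-dupes
        for (String t : tokens) {
            sorted.add(t.toLowerCase());
        }
        StringBuilder sb = new StringBuilder();
        for (String t : sorted) {
            if (sb.length() > 0) {
                sb.append(SEPARATOR);
            }
            sb.append(t);
            if (sb.length() > MAX_OUTPUT_LEN) {
                return null;  // over the size limit: emit no token
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "New York" and "york new" normalize to the same fingerprint,
        // which is what makes the output useful for clustering/linking.
        System.out.println(fingerprint(new String[] {"New", "York"}));
        System.out.println(fingerprint(new String[] {"york", "new"}));
    }
}
```

The real filter operates on a TokenStream's attributes rather than a String array, but the clustering behaviour is the same: texts that differ only in token order, case, or duplication collapse to one key.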
[jira] [Commented] (LUCENE-329) Fuzzy query scoring issues
[ https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14552265#comment-14552265 ] Mark Harwood commented on LUCENE-329: - Committed to 5.x branch and trunk > Fuzzy query scoring issues > -- > > Key: LUCENE-329 > URL: https://issues.apache.org/jira/browse/LUCENE-329 > Project: Lucene - Core > Issue Type: Bug > Components: core/search >Affects Versions: 1.2 > Environment: Operating System: All > Platform: All >Reporter: Mark Harwood >Assignee: Mark Harwood >Priority: Minor > Fix For: 5.x > > Attachments: ASF.LICENSE.NOT.GRANTED--patch.txt, LUCENE-329.patch, > LUCENE-329.patch, LUCENE-329.patch, LUCENE-329.patch > > > Queries which automatically produce multiple terms (wildcard, range, prefix, > fuzzy etc)currently suffer from two problems: > 1) Scores for matching documents are significantly smaller than term queries > because of the volume of terms introduced (A match on query Foo~ is 0.1 > whereas a match on query Foo is 1). > 2) The rarer forms of expanded terms are favoured over those of more common > forms because of the IDF. When using Fuzzy queries for example, rare mis- > spellings typically appear in results before the more common correct > spellings. > I will attach a patch that corrects the issues identified above by > 1) Overriding Similarity.coord to counteract the downplaying of scores > introduced by expanding terms. > 2) Taking the IDF factor of the most common form of expanded terms as the > basis of scoring all other expanded terms. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-329) Fuzzy query scoring issues
[ https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550376#comment-14550376 ] Mark Harwood commented on LUCENE-329: - Thanks, I'll commit tomorrow if there's no objections. > Fuzzy query scoring issues > -- > > Key: LUCENE-329 > URL: https://issues.apache.org/jira/browse/LUCENE-329 > Project: Lucene - Core > Issue Type: Bug > Components: core/search >Affects Versions: 1.2 > Environment: Operating System: All > Platform: All >Reporter: Mark Harwood >Assignee: Mark Harwood >Priority: Minor > Fix For: 5.x > > Attachments: ASF.LICENSE.NOT.GRANTED--patch.txt, LUCENE-329.patch, > LUCENE-329.patch, LUCENE-329.patch, LUCENE-329.patch > > > Queries which automatically produce multiple terms (wildcard, range, prefix, > fuzzy etc)currently suffer from two problems: > 1) Scores for matching documents are significantly smaller than term queries > because of the volume of terms introduced (A match on query Foo~ is 0.1 > whereas a match on query Foo is 1). > 2) The rarer forms of expanded terms are favoured over those of more common > forms because of the IDF. When using Fuzzy queries for example, rare mis- > spellings typically appear in results before the more common correct > spellings. > I will attach a patch that corrects the issues identified above by > 1) Overriding Similarity.coord to counteract the downplaying of scores > introduced by expanding terms. > 2) Taking the IDF factor of the most common form of expanded terms as the > basis of scoring all other expanded terms. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-329) Fuzzy query scoring issues
[ https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-329: Attachment: LUCENE-329.patch A cut-and-paste error in the last patch set df=0, and the effects went undetected by the unit tests. Enhanced the unit test to detect the error, then fixed it. > Fuzzy query scoring issues > -- > > Key: LUCENE-329 > URL: https://issues.apache.org/jira/browse/LUCENE-329 > Project: Lucene - Core > Issue Type: Bug > Components: core/search >Affects Versions: 1.2 > Environment: Operating System: All > Platform: All >Reporter: Mark Harwood >Assignee: Mark Harwood >Priority: Minor > Fix For: 5.x > > Attachments: ASF.LICENSE.NOT.GRANTED--patch.txt, LUCENE-329.patch, > LUCENE-329.patch, LUCENE-329.patch, LUCENE-329.patch > > > Queries which automatically produce multiple terms (wildcard, range, prefix, > fuzzy etc)currently suffer from two problems: > 1) Scores for matching documents are significantly smaller than term queries > because of the volume of terms introduced (A match on query Foo~ is 0.1 > whereas a match on query Foo is 1). > 2) The rarer forms of expanded terms are favoured over those of more common > forms because of the IDF. When using Fuzzy queries for example, rare mis- > spellings typically appear in results before the more common correct > spellings. > I will attach a patch that corrects the issues identified above by > 1) Overriding Similarity.coord to counteract the downplaying of scores > introduced by expanding terms. > 2) Taking the IDF factor of the most common form of expanded terms as the > basis of scoring all other expanded terms. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-329) Fuzzy query scoring issues
[ https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-329: Attachment: (was: LUCENE-329.patch) > Fuzzy query scoring issues > -- > > Key: LUCENE-329 > URL: https://issues.apache.org/jira/browse/LUCENE-329 > Project: Lucene - Core > Issue Type: Bug > Components: core/search >Affects Versions: 1.2 > Environment: Operating System: All > Platform: All >Reporter: Mark Harwood >Assignee: Mark Harwood >Priority: Minor > Fix For: 5.x > > Attachments: ASF.LICENSE.NOT.GRANTED--patch.txt, LUCENE-329.patch, > LUCENE-329.patch, LUCENE-329.patch > > > Queries which automatically produce multiple terms (wildcard, range, prefix, > fuzzy etc)currently suffer from two problems: > 1) Scores for matching documents are significantly smaller than term queries > because of the volume of terms introduced (A match on query Foo~ is 0.1 > whereas a match on query Foo is 1). > 2) The rarer forms of expanded terms are favoured over those of more common > forms because of the IDF. When using Fuzzy queries for example, rare mis- > spellings typically appear in results before the more common correct > spellings. > I will attach a patch that corrects the issues identified above by > 1) Overriding Similarity.coord to counteract the downplaying of scores > introduced by expanding terms. > 2) Taking the IDF factor of the most common form of expanded terms as the > basis of scoring all other expanded terms. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-329) Fuzzy query scoring issues
[ https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-329: Attachment: LUCENE-329.patch Last edits to remove unnecessary Math.max() tests. Added assertion around maxTTf expectations > Fuzzy query scoring issues > -- > > Key: LUCENE-329 > URL: https://issues.apache.org/jira/browse/LUCENE-329 > Project: Lucene - Core > Issue Type: Bug > Components: core/search >Affects Versions: 1.2 > Environment: Operating System: All > Platform: All >Reporter: Mark Harwood >Assignee: Mark Harwood >Priority: Minor > Fix For: 5.x > > Attachments: ASF.LICENSE.NOT.GRANTED--patch.txt, LUCENE-329.patch, > LUCENE-329.patch, LUCENE-329.patch, LUCENE-329.patch > > > Queries which automatically produce multiple terms (wildcard, range, prefix, > fuzzy etc)currently suffer from two problems: > 1) Scores for matching documents are significantly smaller than term queries > because of the volume of terms introduced (A match on query Foo~ is 0.1 > whereas a match on query Foo is 1). > 2) The rarer forms of expanded terms are favoured over those of more common > forms because of the IDF. When using Fuzzy queries for example, rare mis- > spellings typically appear in results before the more common correct > spellings. > I will attach a patch that corrects the issues identified above by > 1) Overriding Similarity.coord to counteract the downplaying of scores > introduced by expanding terms. > 2) Taking the IDF factor of the most common form of expanded terms as the > basis of scoring all other expanded terms. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-329) Fuzzy query scoring issues
[ https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-329: Attachment: LUCENE-329.patch Updated following review comments (thanks, Adrien). All tests passing on trunk. > Fuzzy query scoring issues > -- > > Key: LUCENE-329 > URL: https://issues.apache.org/jira/browse/LUCENE-329 > Project: Lucene - Core > Issue Type: Bug > Components: core/search >Affects Versions: 1.2 > Environment: Operating System: All > Platform: All >Reporter: Mark Harwood >Assignee: Mark Harwood >Priority: Minor > Fix For: 5.x > > Attachments: ASF.LICENSE.NOT.GRANTED--patch.txt, LUCENE-329.patch, > LUCENE-329.patch, LUCENE-329.patch > > > Queries which automatically produce multiple terms (wildcard, range, prefix, > fuzzy etc)currently suffer from two problems: > 1) Scores for matching documents are significantly smaller than term queries > because of the volume of terms introduced (A match on query Foo~ is 0.1 > whereas a match on query Foo is 1). > 2) The rarer forms of expanded terms are favoured over those of more common > forms because of the IDF. When using Fuzzy queries for example, rare mis- > spellings typically appear in results before the more common correct > spellings. > I will attach a patch that corrects the issues identified above by > 1) Overriding Similarity.coord to counteract the downplaying of scores > introduced by expanding terms. > 2) Taking the IDF factor of the most common form of expanded terms as the > basis of scoring all other expanded terms. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-329) Fuzzy query scoring issues
[ https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-329: Attachment: LUCENE-329.patch Switched to the TermContext.accumulateStatistics() method Adrien suggested for tweaking stats. > Fuzzy query scoring issues > -- > > Key: LUCENE-329 > URL: https://issues.apache.org/jira/browse/LUCENE-329 > Project: Lucene - Core > Issue Type: Bug > Components: core/search >Affects Versions: 1.2 > Environment: Operating System: All > Platform: All >Reporter: Mark Harwood >Assignee: Mark Harwood >Priority: Minor > Fix For: 5.x > > Attachments: ASF.LICENSE.NOT.GRANTED--patch.txt, LUCENE-329.patch, > LUCENE-329.patch > > > Queries which automatically produce multiple terms (wildcard, range, prefix, > fuzzy etc)currently suffer from two problems: > 1) Scores for matching documents are significantly smaller than term queries > because of the volume of terms introduced (A match on query Foo~ is 0.1 > whereas a match on query Foo is 1). > 2) The rarer forms of expanded terms are favoured over those of more common > forms because of the IDF. When using Fuzzy queries for example, rare mis- > spellings typically appear in results before the more common correct > spellings. > I will attach a patch that corrects the issues identified above by > 1) Overriding Similarity.coord to counteract the downplaying of scores > introduced by expanding terms. > 2) Taking the IDF factor of the most common form of expanded terms as the > basis of scoring all other expanded terms. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-329) Fuzzy query scoring issues
[ https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-329: Fix Version/s: (was: 3.1) (was: 4.0-ALPHA) 5.x > Fuzzy query scoring issues > -- > > Key: LUCENE-329 > URL: https://issues.apache.org/jira/browse/LUCENE-329 > Project: Lucene - Core > Issue Type: Bug > Components: core/search >Affects Versions: 1.2 > Environment: Operating System: All > Platform: All >Reporter: Mark Harwood >Assignee: Mark Harwood >Priority: Minor > Fix For: 5.x > > Attachments: ASF.LICENSE.NOT.GRANTED--patch.txt, LUCENE-329.patch > > > Queries which automatically produce multiple terms (wildcard, range, prefix, > fuzzy etc)currently suffer from two problems: > 1) Scores for matching documents are significantly smaller than term queries > because of the volume of terms introduced (A match on query Foo~ is 0.1 > whereas a match on query Foo is 1). > 2) The rarer forms of expanded terms are favoured over those of more common > forms because of the IDF. When using Fuzzy queries for example, rare mis- > spellings typically appear in results before the more common correct > spellings. > I will attach a patch that corrects the issues identified above by > 1) Overriding Similarity.coord to counteract the downplaying of scores > introduced by expanding terms. > 2) Taking the IDF factor of the most common form of expanded terms as the > basis of scoring all other expanded terms. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-329) Fuzzy query scoring issues
[ https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-329: Attachment: LUCENE-329.patch New patch addressing this long-standing bug. It addresses today's all-or-nothing choice, where the default makes (poor) use of each expanded term's own IDF and the only alternative is a rewrite method that uses no IDF at all. The patch includes: 1) A new default FuzzyQuery rewrite method that balances IDF better 2) Unit tests for single-query and multi-query behaviours Additionally, this document offers more analysis, based on quality tests on a slightly larger data set not included here: https://docs.google.com/document/d/1KXhbUpD5GFyzNqfk3nocODOo7Upgpd5tmUQp4-OPwiM/edit#heading=h.2e8gdmdqf2m5 > Fuzzy query scoring issues > -- > > Key: LUCENE-329 > URL: https://issues.apache.org/jira/browse/LUCENE-329 > Project: Lucene - Core > Issue Type: Bug > Components: core/search >Affects Versions: 1.2 > Environment: Operating System: All > Platform: All >Reporter: Mark Harwood >Assignee: Mark Harwood >Priority: Minor > Fix For: 3.1, 4.0-ALPHA > > Attachments: ASF.LICENSE.NOT.GRANTED--patch.txt, LUCENE-329.patch > > > Queries which automatically produce multiple terms (wildcard, range, prefix, > fuzzy etc)currently suffer from two problems: > 1) Scores for matching documents are significantly smaller than term queries > because of the volume of terms introduced (A match on query Foo~ is 0.1 > whereas a match on query Foo is 1). > 2) The rarer forms of expanded terms are favoured over those of more common > forms because of the IDF. When using Fuzzy queries for example, rare mis- > spellings typically appear in results before the more common correct > spellings. > I will attach a patch that corrects the issues identified above by > 1) Overriding Similarity.coord to counteract the downplaying of scores > introduced by expanding terms. 
> 2) Taking the IDF factor of the most common form of expanded terms as the > basis of scoring all other expanded terms. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
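[Editor's illustration] The core of fix (2) in the issue description — scoring every expanded term using the document frequency of the most common variant — can be sketched as follows. This is a simplification for illustration only: the actual patch works through Lucene's rewrite machinery and term statistics, and the IDF formula here is just the classic log-based form, not necessarily the one the patch uses.

```java
import java.util.Map;

// Sketch of the IDF-blending idea: all terms produced by expanding a fuzzy
// query are scored with the df of the most common variant, so a rare
// misspelling is no longer boosted above the common correct spelling.
public class BlendedIdfSketch {
    // Classic log-based IDF: log(numDocs / (df + 1)) + 1
    public static double idf(long df, long numDocs) {
        return Math.log((double) numDocs / (df + 1)) + 1.0;
    }

    // Instead of per-term IDF, take the max df across all expanded terms.
    public static double blendedIdf(Map<String, Long> expandedTermDfs, long numDocs) {
        long maxDf = 0;
        for (long df : expandedTermDfs.values()) {
            maxDf = Math.max(maxDf, df);
        }
        return idf(maxDf, numDocs);
    }

    public static void main(String[] args) {
        long numDocs = 1_000_000;
        // Hypothetical dfs: "foo" is the common spelling, "fou"/"fob" are rare.
        Map<String, Long> dfs = Map.of("foo", 50_000L, "fou", 3L, "fob", 12L);
        // Per-term IDF would score the rare "fou" far above "foo"...
        System.out.println(idf(3, numDocs) > idf(50_000, numDocs));
        // ...while the blended IDF scores every variant alike.
        System.out.println(blendedIdf(dfs, numDocs) == idf(50_000, numDocs));
    }
}
```

The effect is exactly the bug report's complaint inverted: edit-distance closeness (not term rarity) decides which expansion wins.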
[jira] [Closed] (LUCENE-6066) Collector that manages diversity in search results
[ https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood closed LUCENE-6066. Resolution: Fixed Fix Version/s: (was: 5.0) 5.1 Committed to trunk and 5x branch. Thanks for reviews Adrien and Mike. > Collector that manages diversity in search results > -- > > Key: LUCENE-6066 > URL: https://issues.apache.org/jira/browse/LUCENE-6066 > Project: Lucene - Core > Issue Type: Improvement > Components: core/query/scoring > Reporter: Mark Harwood >Priority: Minor > Fix For: 5.1 > > Attachments: LUCENE-6066.patch, LUCENE-PQRemoveV8.patch, > LUCENE-PQRemoveV9.patch > > > This issue provides a new collector for situations where a client doesn't > want more than N matches for any given key (e.g. no more than 5 products from > any one retailer in a marketplace). In these circumstances a document that > was previously thought of as competitive during collection has to be removed > from the final PQ and replaced with another doc (eg a retailer who already > has 5 matches in the PQ receives a 6th match which is better than his > previous ones). This requires a new remove method on the existing > PriorityQueue class. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-6066) Collector that manages diversity in search results
[ https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-6066: - Attachment: LUCENE-PQRemoveV9.patch Move DiversifiedTopDocsCollector and related unit test to "misc". Added "experimental" annotation. Removed superfluous "if ==0 " test in PriorityQueue. Thanks, Adrien. > Collector that manages diversity in search results > -- > > Key: LUCENE-6066 > URL: https://issues.apache.org/jira/browse/LUCENE-6066 > Project: Lucene - Core > Issue Type: Improvement > Components: core/query/scoring >Reporter: Mark Harwood >Priority: Minor > Fix For: 5.0 > > Attachments: LUCENE-PQRemoveV8.patch, LUCENE-PQRemoveV9.patch > > > This issue provides a new collector for situations where a client doesn't > want more than N matches for any given key (e.g. no more than 5 products from > any one retailer in a marketplace). In these circumstances a document that > was previously thought of as competitive during collection has to be removed > from the final PQ and replaced with another doc (eg a retailer who already > has 5 matches in the PQ receives a 6th match which is better than his > previous ones). This requires a new remove method on the existing > PriorityQueue class. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6066) Collector that manages diversity in search results
[ https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14309365#comment-14309365 ] Mark Harwood commented on LUCENE-6066: -- bq. maybe we should have this feature in lucene/sandbox or in lucene/misc first instead of lucene/core? It relies on a change to core's PriorityQueue (which was the original focus of this issue, but the issue then extended to the specialized collector that is possibly the only justification for introducing a "remove" method on PQ). bq. I think we should also add a lucene.experimental annotation to this collector? That seems fair. bq. the `if (size == 0)` condition at the top of PQ.remove looks already covered by the for-loop below? Good point, will change. bq. Should PQ.downHeap and upHeap delegate to their counterpart that takes a position? I wanted to avoid the possibility of introducing any slowdown to the PQ impl by keeping the existing upHeap/downHeap methods intact and duplicating most of their logic in the version that takes a position. > Collector that manages diversity in search results > -- > > Key: LUCENE-6066 > URL: https://issues.apache.org/jira/browse/LUCENE-6066 > Project: Lucene - Core > Issue Type: Improvement > Components: core/query/scoring >Reporter: Mark Harwood >Priority: Minor > Fix For: 5.0 > > Attachments: LUCENE-PQRemoveV8.patch > > > This issue provides a new collector for situations where a client doesn't > want more than N matches for any given key (e.g. no more than 5 products from > any one retailer in a marketplace). In these circumstances a document that > was previously thought of as competitive during collection has to be removed > from the final PQ and replaced with another doc (eg a retailer who already > has 5 matches in the PQ receives a 6th match which is better than his > previous ones). This requires a new remove method on the existing > PriorityQueue class. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-6066) Collector that manages diversity in search results
[ https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-6066: - Attachment: (was: LUCENE-PQRemoveV7.patch) > Collector that manages diversity in search results > -- > > Key: LUCENE-6066 > URL: https://issues.apache.org/jira/browse/LUCENE-6066 > Project: Lucene - Core > Issue Type: Improvement > Components: core/query/scoring > Reporter: Mark Harwood >Priority: Minor > Fix For: 5.0 > > Attachments: LUCENE-PQRemoveV8.patch > > > This issue provides a new collector for situations where a client doesn't > want more than N matches for any given key (e.g. no more than 5 products from > any one retailer in a marketplace). In these circumstances a document that > was previously thought of as competitive during collection has to be removed > from the final PQ and replaced with another doc (eg a retailer who already > has 5 matches in the PQ receives a 6th match which is better than his > previous ones). This requires a new remove method on the existing > PriorityQueue class. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-6066) Collector that manages diversity in search results
[ https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-6066: - Attachment: (was: LUCENE-PQRemoveV6.patch) > Collector that manages diversity in search results > -- > > Key: LUCENE-6066 > URL: https://issues.apache.org/jira/browse/LUCENE-6066 > Project: Lucene - Core > Issue Type: Improvement > Components: core/query/scoring > Reporter: Mark Harwood >Priority: Minor > Fix For: 5.0 > > Attachments: LUCENE-PQRemoveV8.patch > > > This issue provides a new collector for situations where a client doesn't > want more than N matches for any given key (e.g. no more than 5 products from > any one retailer in a marketplace). In these circumstances a document that > was previously thought of as competitive during collection has to be removed > from the final PQ and replaced with another doc (eg a retailer who already > has 5 matches in the PQ receives a 6th match which is better than his > previous ones). This requires a new remove method on the existing > PriorityQueue class. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-6066) Collector that manages diversity in search results
[ https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-6066: - Attachment: LUCENE-PQRemoveV8.patch Tabs removed. Ant precommit now passes. Still no Bee Gees (sorry, Mike). Will commit to trunk and 5.1 in a day or 2 if no objections. > Collector that manages diversity in search results > -- > > Key: LUCENE-6066 > URL: https://issues.apache.org/jira/browse/LUCENE-6066 > Project: Lucene - Core > Issue Type: Improvement > Components: core/query/scoring > Reporter: Mark Harwood >Priority: Minor > Fix For: 5.0 > > Attachments: LUCENE-PQRemoveV8.patch > > > This issue provides a new collector for situations where a client doesn't > want more than N matches for any given key (e.g. no more than 5 products from > any one retailer in a marketplace). In these circumstances a document that > was previously thought of as competitive during collection has to be removed > from the final PQ and replaced with another doc (eg a retailer who already > has 5 matches in the PQ receives a 6th match which is better than his > previous ones). This requires a new remove method on the existing > PriorityQueue class. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-6066) Collector that manages diversity in search results
[ https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-6066: - Attachment: LUCENE-PQRemoveV7.patch Fixed the test PQ's impl of lessThan() which was causing test failures on duplicate Integers placed into queue. > Collector that manages diversity in search results > -- > > Key: LUCENE-6066 > URL: https://issues.apache.org/jira/browse/LUCENE-6066 > Project: Lucene - Core > Issue Type: Improvement > Components: core/query/scoring > Reporter: Mark Harwood >Priority: Minor > Fix For: 5.0 > > Attachments: LUCENE-PQRemoveV6.patch, LUCENE-PQRemoveV7.patch > > > This issue provides a new collector for situations where a client doesn't > want more than N matches for any given key (e.g. no more than 5 products from > any one retailer in a marketplace). In these circumstances a document that > was previously thought of as competitive during collection has to be removed > from the final PQ and replaced with another doc (eg a retailer who already > has 5 matches in the PQ receives a 6th match which is better than his > previous ones). This requires a new remove method on the existing > PriorityQueue class. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-6066) Collector that manages diversity in search results
[ https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-6066: - Attachment: (was: LUCENE-PQRemoveV5.patch) > Collector that manages diversity in search results > -- > > Key: LUCENE-6066 > URL: https://issues.apache.org/jira/browse/LUCENE-6066 > Project: Lucene - Core > Issue Type: Improvement > Components: core/query/scoring > Reporter: Mark Harwood >Priority: Minor > Fix For: 5.0 > > Attachments: LUCENE-PQRemoveV6.patch > > > This issue provides a new collector for situations where a client doesn't > want more than N matches for any given key (e.g. no more than 5 products from > any one retailer in a marketplace). In these circumstances a document that > was previously thought of as competitive during collection has to be removed > from the final PQ and replaced with another doc (eg a retailer who already > has 5 matches in the PQ receives a 6th match which is better than his > previous ones). This requires a new remove method on the existing > PriorityQueue class. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-6066) Collector that manages diversity in search results
[ https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-6066:
---------------------------------
    Attachment: LUCENE-PQRemoveV6.patch

Removed outdated acceptDocsInOrder() method.
[jira] [Commented] (LUCENE-6066) Collector that manages diversity in search results
[ https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277279#comment-14277279 ]

Mark Harwood commented on LUCENE-6066:
--------------------------------------
What feels awkward in the example Junit is that diversified collections are not compatible with the existing Sort functionality - I had to use a custom Similarity class to sort by the popularity of songs in my test data. Combining the diversified collector with any other existing collector (e.g. TopFieldCollector, to achieve field-based sorting) via wrapping is problematic, because those collectors all assume that previously collected elements are never recalled. The diversifying collector needs the ability to recall previously collected elements when new elements with the same key must be substituted.
[jira] [Updated] (LUCENE-6066) Collector that manages diversity in search results
[ https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-6066:
---------------------------------
    Attachment: LUCENE-PQRemoveV5.patch

Added a Junit test showing use with String-based dedup keys via two lookup impls - slow but accurate global ords, and fast but potentially inaccurate hashing of BinaryDocValues.
[jira] [Updated] (LUCENE-6066) Collector that manages diversity in search results
[ https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-6066:
---------------------------------
    Attachment: (was: LUCENE-PQRemoveV3.patch)
[jira] [Updated] (LUCENE-6066) Collector that manages diversity in search results
[ https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-6066:
---------------------------------
    Description:
This issue provides a new collector for situations where a client doesn't want more than N matches for any given key (e.g. no more than 5 products from any one retailer in a marketplace). In these circumstances a document that was previously thought of as competitive during collection has to be removed from the final PQ and replaced with another doc (e.g. a retailer who already has 5 matches in the PQ receives a 6th match which is better than his previous ones). This requires a new remove method on the existing PriorityQueue class.

  (was:
It would be useful to be able to remove existing elements from a PriorityQueue.

The proposal is that a linear scan is performed to find the element being removed, and then the end element in heap[size] is swapped into this position to perform the delete. The method downHeap() is then called to shuffle the replacement element back down the array, but the existing downHeap method must be modified to allow picking up an entry from any point in the array rather than always assuming the first element (which is its only current mode of operation).

A working javascript model of the proposal with animation is available here: http://jsfiddle.net/grcmquf2/22/

In tests the modified version of "downHeap" produces the same results as the existing impl but adds the ability to push down from any point.

An example use case that requires remove is where a client doesn't want more than N matches for any given key (e.g. no more than 5 products from any one retailer in a marketplace). In these circumstances a document that was previously thought of as competitive has to be removed from the final PQ and replaced with another doc (e.g. a retailer who already has 5 matches in the PQ receives a 6th match which is better than his previous ones).

This particular process is managed by a special "DiversifyingPriorityQueue" which wraps the main PriorityQueue and could be contributed as part of another issue if there is interest in that.)

    Summary: Collector that manages diversity in search results  (was: New "remove" method in PriorityQueue)
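The removal scheme described in the original issue text above (linear scan for the element, swap the last leaf into its slot, then re-heapify from that position) can be sketched roughly as follows. This is a minimal, hypothetical illustration in plain Java, not the attached patch; all names (MinHeap, upHeap, downHeap) are my own, loosely modelled on o.a.l.util.PriorityQueue. Note that, as a later comment in this thread observes, the swapped-in leaf may need to move up as well as down, so remove() must try both directions:

```java
import java.util.Arrays;

// Hypothetical int-valued min-heap with the proposed remove() operation.
class MinHeap {
    private int[] heap = new int[16]; // 1-based: heap[1] is the least element
    private int size = 0;

    boolean lessThan(int a, int b) { return a < b; }

    void add(int v) {
        if (size + 1 >= heap.length) heap = Arrays.copyOf(heap, heap.length * 2);
        heap[++size] = v;
        upHeap(size);
    }

    int pop() {
        int top = heap[1];
        heap[1] = heap[size--];
        downHeap(1);
        return top;
    }

    // Proposed remove: linear scan for the element, swap in the last leaf,
    // then restore the heap invariant from that position (up or down).
    boolean remove(int v) {
        for (int i = 1; i <= size; i++) {
            if (heap[i] == v) {
                heap[i] = heap[size--];
                if (i <= size && !upHeap(i)) downHeap(i);
                return true;
            }
        }
        return false;
    }

    // Bubble heap[i] up; returns true if the element moved.
    private boolean upHeap(int i) {
        int node = heap[i];
        int start = i;
        while (i > 1 && lessThan(node, heap[i / 2])) {
            heap[i] = heap[i / 2];
            i /= 2;
        }
        heap[i] = node;
        return i != start;
    }

    // Push heap[i] down - the variant that starts from any position, not just the root.
    private void downHeap(int i) {
        int node = heap[i];
        int child = 2 * i;
        while (child <= size) {
            if (child < size && lessThan(heap[child + 1], heap[child])) child++;
            if (!lessThan(heap[child], node)) break;
            heap[i] = heap[child];
            i = child;
            child = 2 * i;
        }
        heap[i] = node;
    }

    int size() { return size; }
}
```

The scan makes remove() O(n); the jsfiddle model linked in the description animates the same swap-then-reheapify idea.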
[jira] [Commented] (LUCENE-6066) New "remove" method in PriorityQueue
[ https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14239328#comment-14239328 ]

Mark Harwood commented on LUCENE-6066:
--------------------------------------
Thanks for the review, Mike. I'm working through changes.

bq. Why couldn't you just pass your custom queue instead of null to super() in DiversifiedTopDocsCollector ctor?

Oops. That was a cut/paste error transferring code from Elasticsearch, which relied on a forked PriorityQueue that is obviously incompatible with the Lucene TopDocsCollector base class.

bq. the abstract method returns NumericDocValues, which is confusing: how does "beatles" become a number? Why not e.g. SortedDVs

I originally had a getKey(docId) method that returned an object - anything which implements hashCode and equals. When I talked it through with Adrien, he suggested NumericDocValues as a better abstraction, which could be backed by any system based on ordinals. We need to decide on what this abstraction should be. One of the things I've been grappling with is whether the collector should support multi-keyed docs, e.g. a field containing hashes for near-duplicate detection to avoid too-similar texts. This would require extra code in the collector to determine whether any one key had exceeded its limit (and ideally some memory safeguard for docs with too many keys).

bq. I saw a test about paging; how does/should paging work with such a collector?

In regular collections, TopScoreDocCollector provides all of the smarts for in-order/out-of-order collection and for starting from the ScoreDoc at the bottom of the previous page. I expect I would have to reimplement all of its logic for a new DiversifiedTopScoreKeyedDocCollector, because it makes some assumptions about using updateTop() that don't apply when we have a two-tier system for scoring (globally competitive and within-key competitive).

My vague assumption was that any per-key constraints would have to apply across multiple pages, e.g. having had 5 Beatles hits on pages 1 and 2, you wouldn't expect to find any more the deeper you go into the results, because the "max 5 per key" limit had been exhausted. This would probably preclude any use of the deep-paging optimisation where you pass just the ScoreDoc of the last entry on the previous page to minimise the size of the PQ created for subsequent pages.
[jira] [Updated] (LUCENE-6066) New "remove" method in PriorityQueue
[ https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-6066:
---------------------------------
    Attachment: (was: LUCENE-PQRemoveV2.patch)
[jira] [Updated] (LUCENE-6066) New "remove" method in PriorityQueue
[ https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-6066:
---------------------------------
    Attachment: LUCENE-PQRemoveV3.patch

Updated patch. Added DiversifiedTopDocsCollector and an associated test. This class represents the primary use-case for wanting a new remove() method in PriorityQueue. The PriorityQueue keeps the original upHeap/downHeap methods unchanged (to guard against any performance change) and adds new specialised upHeap/downHeap variants that take a position, to support the new remove function.
[jira] [Commented] (LUCENE-6066) New "remove" method in PriorityQueue
[ https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223307#comment-14223307 ]

Mark Harwood commented on LUCENE-6066:
--------------------------------------
Thanks for your comments, Stefan. I believe the remove method is implemented correctly now.

bq. it still seems that specialized versions can outperform generic ones

Yes - the DiversifyingPriorityQueue that I imagined would need a new remove method on the existing PriorityQueue looks like it is better implemented as a fork of the existing PriorityQueue. I'll attach this fork here in a future update. With these differing implementations there may be a need for a common interface that provides an abstraction for things like TopDocsCollector to add and pop results.
[jira] [Updated] (LUCENE-6066) New "remove" method in PriorityQueue
[ https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-6066:
---------------------------------
    Attachment: LUCENE-PQRemoveV2.patch

Added missing upHeap call to the remove method. Added extra randomized tests and a method to check the validity of PQ elements as mutations are made.
[jira] [Updated] (LUCENE-6066) New "remove" method in PriorityQueue
[ https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-6066:
---------------------------------
    Attachment: (was: LUCENE-PQRemoveV1.patch)
[jira] [Commented] (LUCENE-6066) New "remove" method in PriorityQueue
[ https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14220089#comment-14220089 ]

Mark Harwood commented on LUCENE-6066:
--------------------------------------
bq. But how will you track the min element for each key in the PQ (to know which element to remove, when a more competitive hit with that key arrives)?

I was thinking of this as a foundation (pseudo code):

{code:title=DiversifyingPriorityQueue.java|borderStyle=solid}
abstract class KeyedElement {
    int pqPos;
    abstract Object getKey();
}

class DiversifyingPriorityQueue extends PriorityQueue {
    FastRemovablePriorityQueue mainPQ;
    Map perKeyQueues;
}
{code}

You can probably guess at the logic, but it is based around:
* making sure each key has a max of n entries, using an entry in perKeyQueues
* evictions from the mainPQ will require removal from the related perKeyQueue
* emptied perKeyQueues can be recycled for use with other keys
* evictions from a perKeyQueue will require removal from the mainPQ

bq. This seems promising, maybe as a separate dedicated (forked) PQ impl?

Yes - introducing a linear-cost remove by marking elements with a position is an added cost that not all PQs will require, so forking seems necessary. In that case a common abstraction for these different PQs would be useful in the places where results are consumed, e.g. TopDocsCollector.
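The per-key bookkeeping described above can be illustrated end-to-end with a much-simplified, non-incremental sketch: sort all hits by score, then take them greedily while enforcing a per-key cap. Everything here (the DiversifiedTopN and Hit names) is a hypothetical illustration, not code from any attached patch - the real collector would need the removable-PQ machinery discussed in this thread to do the same thing in one streaming pass:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical reference implementation of "top N overall, max M per key".
class DiversifiedTopN {

    static class Hit {
        final String key;   // the diversity key, e.g. the artist or retailer
        final float score;
        Hit(String key, float score) { this.key = key; this.score = score; }
    }

    // Keep the best topN hits overall, but never more than maxPerKey per key.
    static List<Hit> collect(List<Hit> hits, int topN, int maxPerKey) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort((a, b) -> Float.compare(b.score, a.score)); // best first
        Map<String, Integer> perKeyCounts = new HashMap<>();
        List<Hit> out = new ArrayList<>();
        for (Hit h : sorted) {
            int seen = perKeyCounts.getOrDefault(h.key, 0);
            if (seen < maxPerKey) {          // key still under its cap
                perKeyCounts.put(h.key, seen + 1);
                out.add(h);
                if (out.size() == topN) break;
            }
            // else: hit skipped - in the streaming collector, this is the
            // point where a less competitive hit with the same key would be
            // removed from the PQ and replaced, hence the need for remove().
        }
        return out;
    }
}
```

With the Beatles analogy from later in this thread: given three Beatles hits scoring above everything else and maxPerKey=2, only the two best Beatles hits survive and the next-best other artist fills the remaining slot.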
[jira] [Commented] (LUCENE-6066) New "remove" method in PriorityQueue
[ https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14219901#comment-14219901 ]

Mark Harwood commented on LUCENE-6066:
--------------------------------------
An analogy might be making a compilation album of 1967's top hit records:
1) A vanilla Lucene query's results might look like a "Best of the Beatles" album - no diversity.
2) A grouping query would produce "The 10 top-selling artists of 1967" - some killer and quite a lot of filler.
3) A "diversified" query would be the top 20 hit records of that year - with a max of 3 Beatles hits to maintain diversity.
[jira] [Commented] (LUCENE-6066) New "remove" method in PriorityQueue
[ https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14219822#comment-14219822 ]

Mark Harwood commented on LUCENE-6066:
--------------------------------------

I guess it's different from grouping in that:

1) it only involves one pass over the data
2) the client doesn't have to guess up-front the number of groups he is going to need
3) we don't get any "filler" docs in each group's results, i.e. a bunch of irrelevant docs for an author with one good hit.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6066) New "remove" method in PriorityQueue
[ https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14219651#comment-14219651 ]

Mark Harwood commented on LUCENE-6066:
--------------------------------------

If the PQ set the current array position as a property of each element every time it moved them around, I could pass the array index to remove() rather than an object that has to be scanned for.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
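The position-tracking idea in the comment above can be sketched as follows (again a self-contained model, not a Lucene API): the heap records each entry's current slot whenever it moves it, so removal takes a handle and skips the linear scan entirely, making remove O(log n).

```java
/** Sketch of the position-tracking idea: each entry records its current
 *  slot, so removal needs no linear scan. Names are illustrative. */
class TrackedHeap {
    static final class Entry {
        final int value;
        int pos;                       // maintained by the heap on every move
        Entry(int value) { this.value = value; }
    }

    private final Entry[] heap;
    private int size;

    TrackedHeap(int capacity) { heap = new Entry[capacity]; }

    Entry add(int v) {
        Entry e = new Entry(v);
        place(e, size);
        size++;
        upHeap(e.pos);
        return e;                      // caller keeps the handle for removal
    }

    int top() { return heap[0].value; }

    /** Remove by handle: jump straight to e.pos instead of scanning. */
    void remove(Entry e) {
        int i = e.pos;
        size--;
        place(heap[size], i);          // move last entry into the hole
        heap[size] = null;
        if (i < size) { downHeap(i); upHeap(i); }
    }

    private void place(Entry e, int i) { heap[i] = e; e.pos = i; }

    private void downHeap(int i) {
        while (true) {
            int child = 2 * i + 1;
            if (child >= size) break;
            if (child + 1 < size && heap[child + 1].value < heap[child].value) child++;
            if (heap[child].value >= heap[i].value) break;
            swap(i, child);
            i = child;
        }
    }

    private void upHeap(int i) {
        while (i > 0 && heap[(i - 1) / 2].value > heap[i].value) {
            swap(i, (i - 1) / 2);
            i = (i - 1) / 2;
        }
    }

    private void swap(int a, int b) {
        Entry t = heap[a];
        place(heap[b], a);
        place(t, b);
    }

    public static void main(String[] args) {
        TrackedHeap h = new TrackedHeap(8);
        Entry five = h.add(5);
        h.add(1);
        h.add(3);
        h.remove(five);                // no scan needed
        System.out.println(h.top());   // prints 1
    }
}
```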
[jira] [Updated] (LUCENE-6066) New "remove" method in PriorityQueue
[ https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-6066:
---------------------------------

    Attachment: LUCENE-PQRemoveV1.patch

New remove(element) method in PriorityQueue and related test.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-6066) New "remove" method in PriorityQueue
Mark Harwood created LUCENE-6066:
---------------------------------

             Summary: New "remove" method in PriorityQueue
                 Key: LUCENE-6066
                 URL: https://issues.apache.org/jira/browse/LUCENE-6066
             Project: Lucene - Core
          Issue Type: Improvement
          Components: core/query/scoring
            Reporter: Mark Harwood
            Priority: Minor
             Fix For: 5.0

It would be useful to be able to remove existing elements from a PriorityQueue.

The proposal is that a linear scan is performed to find the element being removed and then the end element in heap[size] is swapped into this position to perform the delete. The method downHeap() is then called to shuffle the replacement element back down the array, but the existing downHeap method must be modified to allow picking up an entry from any point in the array rather than always assuming the first element (which is its only current mode of operation).

A working javascript model of the proposal with animation is available here: http://jsfiddle.net/grcmquf2/22/

In tests the modified version of "downHeap" produces the same results as the existing impl but adds the ability to push down from any point.

An example use case that requires remove is where a client doesn't want more than N matches for any given key (e.g. no more than 5 products from any one retailer in a marketplace). In these circumstances a document that was previously thought of as competitive has to be removed from the final PQ and replaced with another doc (e.g. a retailer who already has 5 matches in the PQ receives a 6th match which is better than his previous ones). This particular process is managed by a special "DiversifyingPriorityQueue" which wraps the main PriorityQueue and could be contributed as part of another issue if there is interest in that.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-725) NovelAnalyzer - wraps your choice of Lucene Analyzer and filters out all "boilerplate" text
[ https://issues.apache.org/jira/browse/LUCENE-725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-725: Attachment: NovelAnalyzer.java Updated to work with Lucene 4 APIs. > NovelAnalyzer - wraps your choice of Lucene Analyzer and filters out all > "boilerplate" text > --- > > Key: LUCENE-725 > URL: https://issues.apache.org/jira/browse/LUCENE-725 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Reporter: Mark Harwood >Assignee: Otis Gospodnetic >Priority: Minor > Attachments: NovelAnalyzer.java, NovelAnalyzer.java, > NovelAnalyzer.java, NovelAnalyzer.java > > > This is a class I have found to be useful for analyzing small (in the > hundreds) collections of documents and removing any duplicate content such > as standard disclaimers or repeated text in an exchange of emails. > This has applications in sampling query results to identify key phrases, > improving speed-reading of results with similar content (eg email > threads/forum messages) or just removing duplicated noise from a search index. > To be more generally useful it needs to scale to millions of documents - in > which case an alternative implementation is required. See the notes in the > Javadocs for this class for more discussion on this -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4866) Lucene corruption
[ https://issues.apache.org/jira/browse/LUCENE-4866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13608826#comment-13608826 ] Mark Harwood commented on LUCENE-4866: -- The fact that the missing file looks to be held on a shared drive might also be significant if there is >1 Lucene process configured to access the same directory ... > Lucene corruption > - > > Key: LUCENE-4866 > URL: https://issues.apache.org/jira/browse/LUCENE-4866 > Project: Lucene - Core > Issue Type: Bug > Components: core/index >Affects Versions: 3.5 > Environment: Amazone tomcat cluster with NTFS. >Reporter: sachin >Priority: Blocker > > Hi all, > We know that lucene index gets corrupted. in our case they are corrupting > again and again due to this production is incosistent. followiing errors are > observed. Any help will be helpful. > org.hibernate.search.SearchException: Unable to reopen IndexReader > at > org.hibernate.search.indexes.impl.SharingBufferReaderProvider$PerDirectoryLatestReader.refreshAndGet(SharingBufferReaderProvider.java:230) > at > org.hibernate.search.indexes.impl.SharingBufferReaderProvider.openIndexReader(SharingBufferReaderProvider.java:73) > at > org.hibernate.search.reader.impl.MultiReaderFactory.openReader(MultiReaderFactory.java:49) > at > org.hibernate.search.query.engine.impl.HSQueryImpl.buildSearcher(HSQueryImpl.java:596) > at > org.hibernate.search.query.engine.impl.HSQueryImpl.buildSearcher(HSQueryImpl.java:495) > at > org.hibernate.search.query.engine.impl.HSQueryImpl.queryEntityInfos(HSQueryImpl.java:239) > at > org.hibernate.search.query.hibernate.impl.FullTextQueryImpl.list(FullTextQueryImpl.java:209) > at > com.lifetech.ngs.dataaccess.spring.util.SearchUtil.returnProjectionData(SearchUtil.java:646) > at > com.lifetech.ngs.dataaccess.spring.util.SearchUtil.getSinglePropertyOnlyUsingSearch(SearchUtil.java:556) > at > com.lifetech.ngs.dataaccess.spring.util.SearchUtil$$FastClassByCGLIB$$568d5972.invoke() > at 
net.sf.cglib.proxy.MethodProxy.invoke(MethodProxy.java:191) > at > org.springframework.aop.framework.Cglib2AopProxy$CglibMethodInvocation.invokeJoinpoint(Cglib2AopProxy.java:689) > at > org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:150) > at > org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:110) > at > org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172) > at > org.springframework.aop.framework.Cglib2AopProxy$DynamicAdvisedInterceptor.intercept(Cglib2AopProxy.java:622) > at > com.lifetech.ngs.dataaccess.spring.util.SearchUtil$$EnhancerByCGLIB$$47fb00d0.getSinglePropertyOnlyUsingSearch() > at > com.lifetech.ngs.server.impl.SampleManagerImpl.getNameSearchResult(SampleManagerImpl.java:2436) > at > com.lifetech.ngs.server.impl.SampleManagerImpl$$FastClassByCGLIB$$17af181d.invoke() > at net.sf.cglib.proxy.MethodProxy.invoke(MethodProxy.java:191) > at > org.springframework.aop.framework.Cglib2AopProxy$CglibMethodInvocation.invokeJoinpoint(Cglib2AopProxy.java:689) > at > org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:150) > at > org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:110) > at > org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172) > at > org.springframework.aop.framework.Cglib2AopProxy$DynamicAdvisedInterceptor.intercept(Cglib2AopProxy.java:622) > at > com.lifetech.ngs.server.impl.SampleManagerImpl$$EnhancerByCGLIB$$75b745f9.getNameSearchResult() > at > com.lifetech.ngs.webui.mgc.widgets.sample.SearchSamplesView.populateData(SearchSamplesView.java:635) > at > com.lifetech.ngs.webui.customcomponents.IRAutoComplete.changeVariables(IRAutoComplete.java:39) > at > 
com.vaadin.terminal.gwt.server.AbstractCommunicationManager.changeVariables(AbstractCommunicationManager.java:1445) > at > com.vaadin.terminal.gwt.server.AbstractCommunicationManager.handleVariableBurst(AbstractCommunicationManager.java:1393) > at > com.lifetech.ngs.webui.main.SpringVaadinServlet$1.handleVariableBurst(SpringVaadinServlet.java:57) > at > com.vaadin.termi
Re: New Lucene features and Solr indexes
>> should be a stupid simple postings format like any other postings format with
>> a default configuration

It does have a default config. It just needs a PF delegate in the constructor, just like Pulsing. Like Rob said:

>> In other words, it should work just like pulsing.

So far so good. Now to where people are getting upset (for no particularly good reason, in my view) around per-field stuff: if you really, really want to, you can supply a subclass of BloomFilterFactory to your BloomPF constructor, which allows customised control over choice of hashing algo, bitset sizing and saturation policies if the DefaultBloomFilterFactory fails to make the right choices. 99.9% of people will not do this. The reason it is a factory object and not some dumb settings is that it is called on a per-segment basis with state info that is useful context in making sizing choices.

Now, (horror of horrors), the factory's API is passed a FieldInfo object in the method designed to produce a bitset. It is conceivable that some rogue agents could choose to implement some per-field decisions here if the same BloomPF instance was registered to handle >1 field. In addition, BloomPF has some common-sense defensive coding that checks if the factory returns null for the bitset - in which case it delegates all calls un-bloomed directly to the delegate codec.

None of this prevents the use of BloomPF in the prescribed PerFieldPF manner for handling field-specific choices. I happen to use a custom BloomFilterFactory to implement a more efficient indexing pipeline than the prescribed PerFieldPF route of implementing all per-field policies "up high" in the stack - but none of that is at the cost of a clean BloomPF API or with any unnecessary duplication of PerFieldPF logic.
If anything needs changing here there may be a case for providing a convenience class that weds BloomPF and a default choice of Lucene40 codec so it can help with whatever Solr and other config-driven engines may need ie zero arg constructors if that's how their registry of codecs works. Cheers Mark From: Uwe Schindler To: dev@lucene.apache.org Sent: Wednesday, 13 February 2013, 16:47 Subject: RE: New Lucene features and Solr indexes Hi Shawn, I was arguing also at the time when this was committed. I fully agree with Robert, the current API is not in a good shape! I have the same feeling: Bloom Postings should be a stupid simple postings format like any other postings format with a default configuration. If you really want to change its configuration, you can subclass it as a separate postings format. Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -Original Message- > From: Shawn Heisey [mailto:s...@elyograg.org] > Sent: Wednesday, February 13, 2013 3:59 PM > To: dev@lucene.apache.org > Subject: Re: New Lucene features and Solr indexes > > >> BloomFilterPostingsFormat is a little special compared to other > >> postings formats because it can wrap any postings format. So maybe it > >> should require special support, like an additional attribute in the > >> field type definition? > > > > -1 > > > > Instead of making other APIs to accomodate BloomFilter's current > > brokenness: remove its custom per-field logic so it works with > > PerFieldPostingsFormat, like every other PF. > > > > In other words, it should work just like pulsing. > > > > I brought this up before it was committed, and i was ignored. Thats > > fine, but I'll be damned if i let its incorrect design complicate > > other parts of the codebase too. I'd rather it continue to stay > > difficult to integrate and continue walking its current path to an > > open source death instead. 
> > Robert, > > I have to send you a general thank you for your dedication to the quality of > this project, and for your amazing ability to seemingly keep the entire design > for Lucene in your head at all times. > > I'm not sure what exactly you want to die here, or what you think would be > the best option for me, the Solr end-user. Is BloomFilter something that's > not worth pursuing, or would you just like it to be integrated in a different > way? > > Thanks, > Shawn > > > - > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional > commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
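For readers following the thread, the service a Bloom postings format provides can be illustrated with a toy Bloom filter in self-contained Java (this is not Lucene's implementation; the hash mixing and sizing here are deliberately simplistic): a compact bitset that can answer "definitely not present" without touching the terms dictionary, at the cost of occasional false positives.

```java
import java.util.BitSet;

/** Toy Bloom filter: answers "maybe present" or "definitely absent".
 *  Illustrative only - not Lucene's FuzzySet. */
class ToyBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    ToyBloomFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    void add(String term) {
        for (int i = 0; i < numHashes; i++) bits.set(index(term, i));
    }

    /** false means the term is certainly not in the set; true means "maybe",
     *  so the caller must still consult the real terms dictionary. */
    boolean mightContain(String term) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(term, i))) return false;
        }
        return true;
    }

    /** Simplistic derived hash: k independent-ish indexes from one hashCode. */
    private int index(String term, int seed) {
        int h = term.hashCode() * 31 + seed * 0x9E3779B9;
        return Math.floorMod(h, numBits);
    }

    public static void main(String[] args) {
        ToyBloomFilter f = new ToyBloomFilter(1024, 3);
        f.add("lucene");
        System.out.println(f.mightContain("lucene")); // prints true
    }
}
```

The sizing and saturation decisions discussed above (how many bits, how many hashes, when the filter is too full to be useful) are exactly what a per-segment factory is well placed to make, since the optimal numbers depend on how many unique terms the segment holds.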
Re: New Lucene features and Solr indexes
>> Instead of making other APIs to accomodate BloomFilter's current
>> brokenness: remove its custom per-field logic so it works with
>> PerFieldPostingsFormat, like every other PF.

Not looked at it in a while, but I'm pretty certain that, like every other PF, you can go ahead and use PerFieldPF with Bloom filter just fine. What was broken was (is?) that in this configuration PFPF isn't smart enough to avoid creating twice as many files as is required - see LUCENE-4093. Until that is resolved (and I have noted my pessimism about that being fixed easily), BloomPF contains an optimisation for those that want to avoid this inefficiency. The use of that optimisation is entirely optional for users.

Internally to BloomPF, the implementation of that optimisation is trivial - if a null bloom set is returned for a given field, it ignores the usual bloom filtering logic and delegates directly to the wrapped codec. You can choose to implement a BloomFilterFactory that adds this field-choice optimisation or, more simply, run the default PerFieldPF-managed configuration and live with the increased number of files. Arguably, the inefficiencies of the PerFieldPF framework are the real issue to be addressed here.

>> I brought this up before it was committed, and i was ignored

You stopped engaging in the debate when I outlined the 3 proposed options for moving BloomPF forward: http://goo.gl/mxtP9 Those options were:

1) ignore the inefficiencies in PFPF
2) sort out the issues in PFPF (LUCENE-4093, but probably a more complex solution)
3) work around existing PFPF issues with a simple but entirely optional optimisation to BloomPF

I opted for 3) and gave notice that I'd take it out if anyone objected. I don't think there's been any movement on 2), so I guess you're still happy with option 1)?
I recall you didn't think the business of extra files was that much of a concern: http://goo.gl/eJWo3 (Incidentally, probably best following up on the relevant Jiras rather than here) Cheers Mark From: Robert Muir To: dev@lucene.apache.org Sent: Wednesday, 13 February 2013, 13:01 Subject: Re: New Lucene features and Solr indexes On Wed, Feb 13, 2013 at 4:42 AM, Adrien Grand wrote: > Hi Shawn, > > On Tue, Feb 12, 2013 at 8:58 PM, Shawn Heisey wrote: >> Some of these, like compressed stored fields and compressed termvectors, are >> being turned on by default, which is awesome. I'm already running a 4.2 >> snapshot, so I've got those in place. > > Excellent! > >> One thing that I know I would like to do is use the new BloomFilter for a >> couple of my fields that contain only unique values. Last time I checked >> (which was before the 4.1 release), if you added the lucene-codecs jar, Solr >> had a BloomFilter postings format, but didn't have any way to specify the >> underlying format. See SOLR-3950 and LUCENE-4394. > > BloomFilterPostingsFormat is a little special compared to other > postings formats because it can wrap any postings format. So maybe it > should require special support, like an additional attribute in the > field type definition? -1 Instead of making other APIs to accomodate BloomFilter's current brokenness: remove its custom per-field logic so it works with PerFieldPostingsFormat, like every other PF. In other words, it should work just like pulsing. I brought this up before it was committed, and i was ignored. Thats fine, but I'll be damned if i let its incorrect design complicate other parts of the codebase too. I'd rather it continue to stay difficult to integrate and continue walking its current path to an open source death instead. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
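The null-means-pass-through behaviour described above can be sketched with stand-in types (these are not Lucene's codec interfaces; everything here is illustrative): the factory is consulted once per field, and a null result disables blooming for that field while lookups still flow through the same wrapper.

```java
import java.util.Set;
import java.util.function.Function;
import java.util.function.Predicate;

/** Sketch of the per-field opt-out: if the factory returns null for a
 *  field, the wrapper bypasses blooming and consults the delegate
 *  directly. Stand-in types, not Lucene's codec API. */
class FieldTermsLookup {
    private final Predicate<String> delegate;   // the "real" terms dictionary
    private final Predicate<String> bloom;      // null = pass-through

    FieldTermsLookup(String field,
                     Predicate<String> delegate,
                     Function<String, Predicate<String>> bloomFactory) {
        this.delegate = delegate;
        this.bloom = bloomFactory.apply(field); // may legitimately be null
    }

    boolean seekExact(String term) {
        if (bloom != null && !bloom.test(term)) {
            return false;                       // definitely absent: skip the seek
        }
        return delegate.test(term);             // un-bloomed path when factory said null
    }

    public static void main(String[] args) {
        Set<String> dict = Set.of("foo", "bar");
        // Field "id" gets a bloom; field "body" opts out via null.
        FieldTermsLookup bloomed = new FieldTermsLookup("id", dict::contains, field -> dict::contains);
        FieldTermsLookup plain = new FieldTermsLookup("body", dict::contains, field -> null);
        System.out.println(bloomed.seekExact("foo")); // prints true
        System.out.println(plain.seekExact("baz"));   // prints false
    }
}
```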
[jira] [Commented] (LUCENE-4768) Child Traversable To Parent Block Join Query
[ https://issues.apache.org/jira/browse/LUCENE-4768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13575864#comment-13575864 ] Mark Harwood commented on LUCENE-4768: -- OK - this problem seems to be about an ill-defined user query ("Saturn sky blue Sedan" with no explicit fields) being executed against a well-defined schema (cars with manufacturers, model names and bodyStyles that also have trims with colours). If that's the case you have a heap of problems here which aren't necessarily related to the "block join" implementation. One example - IDF ranking being what it is, if a manufacturer like Ford create a model called the "Blue" or you have bad data entry that has an example of this value stored in the wrong field then Lucene will naturally rank model:blue higher than color:blue because of the scarcity of the token "blue" in that field context. That's almost the inverse of what you want. A couple of suggestions for "field-less" queries like your example of "Saturn sky blue sedan" 1) Target the query on an unstructured "onebox" field that holds indexed content from all fields to achieve a more balanced IDF score. 2) Tokenize each item in the query string and find a "most likely" field for each search term by examining doc frequencies e.g. color:blue vs modelName:blue etc. Augment the "onebox" query in 1) with the most-likely-field interpretation for each word in the query string if it has sufficient doc frequency. > Child Traversable To Parent Block Join Query > > > Key: LUCENE-4768 > URL: https://issues.apache.org/jira/browse/LUCENE-4768 > Project: Lucene - Core > Issue Type: Improvement > Components: core/query/scoring > Environment: trunk > git rev-parse HEAD > 5cc88eaa41eb66236a0d4203cc81f1eed97c9a41 >Reporter: Vadim Kirilchuk > Attachments: LUCENE-4768-draft.patch > > > Hi everyone! 
> Let me describe what i am trying to do: > I have hierarchical documents ('car model' as parent, 'trim' as child) and > use block join queries to retrieve them. However, i am not happy with current > behavior of ToParentBlockJoinQuery which goes through all parent childs > during nextDoc call (accumulating scores and freqs). > Consider the following example, you have a query with a custom post condition > on top of such bjq: and during post condition you traverse scorers tree > (doc-at-time) and want to manually push child scorers of bjq one by one until > condition passes or current parent have no more childs. > I am attaching the patch with query(and some tests) similar to > ToParentBlockJoin but with an ability to traverse childs. (i have to do weird > instance of check and cast inside my code) This is a draft only and i will be > glad to hear if someone need it or to hear how we can improve it. > P.s i believe that proposed query is more generic (low level) than > ToParentBJQ and ToParentBJQ can be extended from it and call nextChild() > internally during nextDoc(). > Also, i think that the problem of traversing hierarchical documents is more > complex as lucene have only nextDoc API. What do you think about making api > more hierarchy aware? One level document is a special case of multi level > document but not vice versa. WDYT? > Thanks in advance. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
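Suggestion 2 above (find a "most likely" field per search term by comparing doc frequencies) can be sketched as follows. The doc-frequency lookup is stubbed with a map here; in a real system it would come from the index (e.g. something like IndexReader.docFreq per field/term), and the threshold parameter is illustrative.

```java
import java.util.Map;

/** Sketch of suggestion 2: for each word of a field-less query, pick the
 *  field where it is most frequent, or give up if it is rare everywhere.
 *  docFreq is stubbed; a real impl would consult the index. */
class LikelyFieldGuesser {
    private final Map<String, Map<String, Integer>> docFreq; // field -> term -> df

    LikelyFieldGuesser(Map<String, Map<String, Integer>> docFreq) {
        this.docFreq = docFreq;
    }

    /** Returns "field:term" for the most likely field, or null if the term
     *  is below minDocFreq everywhere (too rare to commit to a guess). */
    String mostLikelyField(String term, int minDocFreq) {
        String best = null;
        int bestDf = minDocFreq - 1;
        for (Map.Entry<String, Map<String, Integer>> field : docFreq.entrySet()) {
            int df = field.getValue().getOrDefault(term, 0);
            if (df > bestDf) {
                bestDf = df;
                best = field.getKey() + ":" + term;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> df = Map.of(
            "color", Map.of("blue", 5000, "sky", 40),
            "modelName", Map.of("blue", 3, "sky", 900));
        LikelyFieldGuesser guesser = new LikelyFieldGuesser(df);
        System.out.println(guesser.mostLikelyField("blue", 10)); // prints color:blue
    }
}
```

Note how this avoids the IDF inversion described in the comment: "blue" is interpreted as color:blue because that is where it is common, even though a lone mis-entered modelName:blue would score higher under IDF.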
[jira] [Commented] (LUCENE-4768) Child Traversable To Parent Block Join Query
[ https://issues.apache.org/jira/browse/LUCENE-4768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13575825#comment-13575825 ]

Mark Harwood commented on LUCENE-4768:
--------------------------------------

Still not sure what problem you are trying to solve.

bq. i need to know field and text for each matched leaf scorer

Why? For scoring purposes? ToParentBJQ has a configurable ScoreMode to control whether you want the max, avg or sum of the child matches rolled into the combined parent score. Is that insufficient control for your needs?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4768) Child Traversable To Parent Block Join Query
[ https://issues.apache.org/jira/browse/LUCENE-4768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13575740#comment-13575740 ]

Mark Harwood commented on LUCENE-4768:
--------------------------------------

As with any discussion about nested queries you need to be very clear about the required logic. When you talk about matching f1:A or f1:B - are we talking about matches on the same child doc, or possibly matches on different child docs of the same parent? The examples don't make this clear.

If we assume your child-based criteria are focused on examining the contents of single children (as opposed to combining f1:A on one child doc with f1:B on a different child doc), then a BooleanQuery that combines these child query elements will already be sufficient for skipping through children.

Not really sure what you are trying to optimize anyway with skipping - parent-child combos are limited to what fits into a single segment, which is in turn limited by RAM. You don't generally get parents with "many many" children because of these constraints. The "nextDoc" calls you are trying to skip are related to a compressed block of child doc IDs (gap-encoded varints) that are read off disk in 1K chunks (if I recall default Directory settings correctly). The chances are high that the limited number of child docIDs that belong to each parent are already in RAM as part of normal disk access patterns, so there is no real saving in disk IO. Are you sure this is a performance bottleneck?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-3950) Attempting postings="BloomFilter" results in UnsupportedOperationException
[ https://issues.apache.org/jira/browse/SOLR-3950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477036#comment-13477036 ] Mark Harwood commented on SOLR-3950: bq. If there is some schema config that will tell Solr to do the right thing, please let me know. Right now BloomPF is like an abstract class - you need to fill-in-the-blanks as to what delegate it will use before you can use it at write-time. I think we have 3 options: 1) Solr (or you) provide a new PF impl that weds BloomPF with a choice of PF e.g. Lucene40PF so you would have a zero-arg-constructor class named something like BloomLucene40PF or... 2) Solr extends config file format to provide a generic means of assembling "wrapper" PFs like Bloom in their config e.g: postingsFormat="BloomFilter" delegatePostingsFormat="FooPF" and Solr then does reflection magic to call constructors appropriately or.. 3) Core Lucene is changed so that BloomPF is wedded to a default PF (e.g. Lucene40PF) if users e.g. Solr fail to nominate a choice of delegate for BloomPF. Of these 1) feels like "the right thing". Cheers Mark > Attempting postings="BloomFilter" results in UnsupportedOperationException > -- > > Key: SOLR-3950 > URL: https://issues.apache.org/jira/browse/SOLR-3950 > Project: Solr > Issue Type: Bug >Affects Versions: 4.1 > Environment: Linux bigindy5 2.6.32-279.9.1.el6.centos.plus.x86_64 #1 > SMP Wed Sep 26 03:52:55 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux > [root@bigindy5 ~]# java -version > java version "1.7.0_07" > Java(TM) SE Runtime Environment (build 1.7.0_07-b10) > Java HotSpot(TM) 64-Bit Server VM (build 23.3-b01, mixed mode) >Reporter: Shawn Heisey > Fix For: 4.1 > > > Tested on branch_4x, checked out after BlockPostingsFormat was made the > default by LUCENE-4446. > I used 'ant generate-maven-artifacts' to create the lucene-codecs jar, and > copied it into my sharedLib directory. 
When I subsequently tried > postings="BloomFilter" I got the following exception in the log: > {code} > Oct 15, 2012 11:14:02 AM org.apache.solr.common.SolrException log > SEVERE: java.lang.UnsupportedOperationException: Error - > org.apache.lucene.codecs.bloom.BloomFilteringPostingsFormat has been > constructed without a choice of PostingsFormat > {code} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
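Option 1 above — a zero-arg class that hard-wires the delegate — is the decorator pattern with the dependency baked into a subclass constructor, so config that can only name a class still works. A minimal plain-Java sketch of the shape (all class names here, including BloomLucene40PF, are hypothetical stand-ins, not Lucene classes):

```java
// Stand-in for the PostingsFormat hierarchy; illustrative only.
abstract class PF {
    abstract String writeFormat();
}

// Stand-in for a concrete delegate such as Lucene40PostingsFormat.
class Lucene40PF extends PF {
    String writeFormat() { return "Lucene40"; }
}

// Wrapper PF: like BloomFilteringPostingsFormat, it needs a delegate at
// write time and fails if constructed without one.
class BloomPF extends PF {
    private final PF delegate;
    BloomPF(PF delegate) { this.delegate = delegate; }
    String writeFormat() {
        if (delegate == null) {
            throw new UnsupportedOperationException("constructed without a choice of delegate");
        }
        return "Bloom(" + delegate.writeFormat() + ")";
    }
}

// Option 1: a zero-arg subclass that bakes in the delegate choice, usable
// from config that can only supply a class name (e.g. postingsFormat="...").
class BloomLucene40PF extends BloomPF {
    BloomLucene40PF() { super(new Lucene40PF()); }
}
```

Option 2 would instead have the config layer reflectively construct `new BloomPF(new FooPF())` from two class names — more flexible, but it pushes wiring logic into every consumer.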
[jira] [Commented] (SOLR-3950) Attempting postings="BloomFilter" results in UnsupportedOperationException
[ https://issues.apache.org/jira/browse/SOLR-3950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13476854#comment-13476854 ] Mark Harwood commented on SOLR-3950: BloomFilteringPostingsFormat is designed to wrap another choice of PostingsFormat and adds ".blm" files to the other files created by the choice of delegate. However, your code has instantiated a BloomFilteringPostingsFormat without passing a choice of delegate - presumably using the zero-arg constructor. The comments in the code for this zero-arg constructor state: // Used only by core Lucene at read-time via Service Provider instantiation - // do not use at Write-time in application code. > Attempting postings="BloomFilter" results in UnsupportedOperationException > -- > > Key: SOLR-3950 > URL: https://issues.apache.org/jira/browse/SOLR-3950 > Project: Solr > Issue Type: Bug >Affects Versions: 4.1 > Environment: Linux bigindy5 2.6.32-279.9.1.el6.centos.plus.x86_64 #1 > SMP Wed Sep 26 03:52:55 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux > [root@bigindy5 ~]# java -version > java version "1.7.0_07" > Java(TM) SE Runtime Environment (build 1.7.0_07-b10) > Java HotSpot(TM) 64-Bit Server VM (build 23.3-b01, mixed mode) >Reporter: Shawn Heisey > Fix For: 4.1 > > > Tested on branch_4x, checked out after BlockPostingsFormat was made the > default by LUCENE-4446. > I used 'ant generate-maven-artifacts' to create the lucene-codecs jar, and > copied it into my sharedLib directory. When I subsequently tried > postings="BloomFilter" I got the following exception in the log: > {code} > Oct 15, 2012 11:14:02 AM org.apache.solr.common.SolrException log > SEVERE: java.lang.UnsupportedOperationException: Error - > org.apache.lucene.codecs.bloom.BloomFilteringPostingsFormat has been > constructed without a choice of PostingsFormat > {code} -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3772) Highlighter needs the whole text in memory to work
[ https://issues.apache.org/jira/browse/LUCENE-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13476044#comment-13476044 ] Mark Harwood commented on LUCENE-3772: -- For bigger-than-memory docs, is it not possible to use nested documents to represent subsections (e.g. a child doc for each of the chapters in a book) and then use BlockJoinQuery to select the best child docs? Highlighting can then be used on a more manageable subset of the original content, and Lucene's ranking algos are being used to select the best "fragment" rather than the highlighter's own attempts to reproduce this logic. Obviously this depends on the shape of your content/queries, but books-and-chapters is probably a good fit for this approach. > Highlighter needs the whole text in memory to work > -- > > Key: LUCENE-3772 > URL: https://issues.apache.org/jira/browse/LUCENE-3772 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/highlighter >Affects Versions: 3.5 > Environment: Windows 7 Enterprise x64, JRE 1.6.0_25 >Reporter: Luis Filipe Nassif > Labels: highlighter, improvement, memory > > Highlighter methods getBestFragment(s) and getBestTextFragments only accept a > String object representing the whole text to highlight. When dealing with > very large docs simultaneously, it can lead to heap consumption problems. It > would be better if the API could accept a Reader object additionally, like > Lucene Document Fields do. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
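The workaround Mark sketches — rank chapter-sized subsections and only highlight the winner, so the whole book never needs to be in memory — can be illustrated outside Lucene. A toy plain-Java sketch (naive term-count scoring and string-replace highlighting, purely illustrative; the real approach would index chapters as child docs and let BlockJoinQuery do the ranking):

```java
import java.util.List;

// Toy version of "select the best fragment with ranking, then highlight it":
// score each chapter-sized section by query-term occurrences, then mark up
// only the best-scoring section instead of the whole text.
public class ChapterHighlighter {

    // Naive relevance score: total occurrences of the query terms.
    static int score(String section, List<String> terms) {
        int s = 0;
        String lower = section.toLowerCase();
        for (String t : terms) {
            String needle = t.toLowerCase();
            int from = 0, i;
            while ((i = lower.indexOf(needle, from)) >= 0) {
                s++;
                from = i + needle.length();
            }
        }
        return s;
    }

    // Pick the best section, then highlight the terms in just that section.
    static String highlightBest(List<String> sections, List<String> terms) {
        String best = "";
        int bestScore = -1;
        for (String section : sections) { // sections could be streamed one at a time
            int s = score(section, terms);
            if (s > bestScore) { bestScore = s; best = section; }
        }
        String out = best;
        for (String t : terms) {
            out = out.replace(t, "<b>" + t + "</b>"); // naive case-sensitive markup
        }
        return out;
    }
}
```

The heap saving comes from the loop shape: only one section is held and marked up at a time, rather than a single String of the entire document.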
[jira] [Commented] (LUCENE-4369) StringFields name is unintuitive and not helpful
[ https://issues.apache.org/jira/browse/LUCENE-4369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13452914#comment-13452914 ] Mark Harwood commented on LUCENE-4369: -- Agreed on the need for a change - names are important. I have a problem with using "match" on its own because the word is often associated with partial matching e.g. "best match" or "fuzzy match". A quick google suggests "match" has more connotations with fuzziness than exactness - there are 162m results for "best match" vs only 45m results for "exact match". So how about "ExactMatchField"? > StringFields name is unintuitive and not helpful > > > Key: LUCENE-4369 > URL: https://issues.apache.org/jira/browse/LUCENE-4369 > Project: Lucene - Core > Issue Type: Bug >Reporter: Robert Muir > Attachments: LUCENE-4369.patch > > > There's a huge difference between TextField and StringField, StringField > screws up scoring and bypasses your Analyzer. > (see java-user thread "Custom Analyzer Not Called When Indexing" as an > example.) > The name we use here is vital, otherwise people will get bad results. > I think we should rename StringField to MatchOnlyField. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4369) StringFields name is unintuitive and not helpful
[ https://issues.apache.org/jira/browse/LUCENE-4369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13452900#comment-13452900 ] Mark Harwood commented on LUCENE-4369: -- SingleTermField ? Not sure "matching vs searching" is a commonly understood differentiation. > StringFields name is unintuitive and not helpful > > > Key: LUCENE-4369 > URL: https://issues.apache.org/jira/browse/LUCENE-4369 > Project: Lucene - Core > Issue Type: Bug >Reporter: Robert Muir > Attachments: LUCENE-4369.patch > > > There's a huge difference between TextField and StringField, StringField > screws up scoring and bypasses your Analyzer. > (see java-user thread "Custom Analyzer Not Called When Indexing" as an > example.) > The name we use here is vital, otherwise people will get bad results. > I think we should rename StringField to MatchOnlyField. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13433045#comment-13433045 ] Mark Harwood commented on LUCENE-4069: -- bq. Removing misleading 2X perf gain: it seems to depend heavily on the exact use case. Fair enough - the original patch targeted Lucene 3.6 which benefited more heavily from this technique. The issue then morphed into a 4.x patch where performance gains were harder to find. I think the sweet spot is in primary key searches on indexes with ongoing heavy changes (more segment fragmentation, less OS-level caching?). This is the use case I am targeting currently and my final tests using our primary-key-counting test rig saw a 10 to 15% improvement over Pulsing. bq. I'm asking because I need this feature but I'm stuck with 3.x for a while. I have a client in a similar situation who are contemplating using the 3.6 patch. bq. Are there bugs which should be fixed in the initial 3.6 patch? It has been a while since I looked at it - a quick run of "ant test" on my copy here showed no errors. I will be giving it a closer review if my client decides to go down this route and can post any fixes here. I expect if you use the patch and get into trouble you can use an un-patched version of 3.6 to read the same index files (it should just ignore the extra "blm" files created by the patched version). 
> Segment-level Bloom filters > --- > > Key: LUCENE-4069 > URL: https://issues.apache.org/jira/browse/LUCENE-4069 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index > Affects Versions: 3.6, 4.0-ALPHA > Reporter: Mark Harwood >Assignee: Mark Harwood >Priority: Minor > Fix For: 4.0-BETA, 5.0 > > Attachments: 4069Failure.zip, BloomFilterPostingsBranch4x.patch, > LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, > MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, > PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, > PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java > > > An addition to each segment which stores a Bloom filter for selected fields > in order to give fast-fail to term searches, helping avoid wasted disk access. > Best suited for low-frequency fields e.g. primary keys on big indexes with > many segments but also speeds up general searching in my tests. > Overview slideshow here: > http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments > Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU > Patch based on 3.6 codebase attached. > There are no 3.6 API changes currently - to play just add a field with "_blm" > on the end of the name to invoke special indexing/querying capability. > Clearly a new Field or schema declaration(!) would need adding to APIs to > configure the service properly. > Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
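The fast-fail idea behind the patch — consult a cheap in-memory bit set before touching the disk-backed terms dictionary — can be sketched without Lucene. The class below is a minimal stand-in (fixed size, two hash functions), not the patch's implementation:

```java
import java.util.BitSet;

// Minimal Bloom filter: it may answer "maybe present" for a key that was
// never added (a false positive), but never "absent" for a key that was
// added. So a "false" answer can safely veto the disk seek for a primary
// key that is not in this segment.
public class BloomFilter {
    private final BitSet bits;
    private final int size;

    BloomFilter(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    // Two cheap, independent-ish hash positions; illustrative only.
    private int h1(String key) { return Math.floorMod(key.hashCode(), size); }
    private int h2(String key) { return Math.floorMod(key.hashCode() * 31 + key.length(), size); }

    void add(String key) {
        bits.set(h1(key));
        bits.set(h2(key));
    }

    // false => definitely not indexed: skip the terms-dictionary lookup.
    boolean mightContain(String key) {
        return bits.get(h1(key)) && bits.get(h2(key));
    }
}
```

A per-segment lookup path would call mightContain before seeking the terms dictionary; for low-frequency fields like primary keys across many segments, most negative lookups then never touch disk, which is where the reported 10-15% gain for PK workloads comes from.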
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-4069: - Fix Version/s: 5.0 Applied to trunk in revision 1368567
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13427322#comment-13427322 ] Mark Harwood commented on LUCENE-4069: -- Will do.
[jira] [Resolved] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood resolved LUCENE-4069. -- Resolution: Fixed Assignee: Mark Harwood Committed to 4.0 branch, revision 1368442
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-4069: - Attachment: BloomFilterPostingsBranch4x.patch Updated patch to bring in line with latest core API changes. All tests now pass clean so will commit soon.
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-4069: - Attachment: (was: BloomFilterPostingsBranch4x.patch)
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-4069: - Attachment: BloomFilterPostingsBranch4x.patch Updated with fix to issue explored in LUCENE-4275
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-4069: - Attachment: (was: BloomFilterPostingsBranch4x.patch)
[jira] [Closed] (LUCENE-4275) Threaded tests with MockDirectoryWrapper delete active PostingFormat files
[ https://issues.apache.org/jira/browse/LUCENE-4275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood closed LUCENE-4275. Resolution: Not A Problem > Threaded tests with MockDirectoryWrapper delete active PostingFormat files > -- > > Key: LUCENE-4275 > URL: https://issues.apache.org/jira/browse/LUCENE-4275 > Project: Lucene - Core > Issue Type: Bug > Components: core/codecs, general/test >Affects Versions: 4.0-ALPHA > Environment: Win XP 64bit Sun JDK 1.6 > Reporter: Mark Harwood > Fix For: 4.0 > > Attachments: Lucene-4275-TestClass.patch > > > As part of testing Lucene-4069 I have encountered sporadic issues with files > going missing. I believe this is a bug in the test framework (multi-threading > issues in MockDirectoryWrapper?) so have raised a separate issue with > simplified test PostingFormat class here. > Using this test PF will fail due to a missing file roughly one in four times > of executing this test: > ant test-core -Dtestcase=TestIndexWriterCommit > -Dtests.method=testCommitThreadSafety -Dtests.seed=EA320250471B75AE > -Dtests.slow=true -Dtests.postingsformat=TestNonCoreDummyPostingsFormat > -Dtests.locale=no -Dtests.timezone=Europe/Belfast -Dtests.file.encoding=UTF-8 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4275) Threaded tests with MockDirectoryWrapper delete active PostingFormat files
[ https://issues.apache.org/jira/browse/LUCENE-4275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13426481#comment-13426481 ] Mark Harwood commented on LUCENE-4275: -- Nailed it, Mike. Yet another beer I owe you. I removed the IllegalStateException and it looks like the retry logic is now kicking in and all tests pass. This reliance on throwing a particular exception type feels like an important contract to document. Currently the comments in PostingsFormat.fieldsProducer() read as follows: bq. Reads a segment. NOTE: by the time this call returns, it must hold open any files it will need to use; else, those files may be deleted. I propose adding: bq. Additionally, required files may be deleted during the execution of this call before there is a chance to open them. Under these circumstances an IOException should be thrown by the implementation. IOExceptions are expected and will automatically cause a retry of the segment opening logic with the newly revised segments. I'll roll that documentation addition into my LUCENE-4069 patch > Threaded tests with MockDirectoryWrapper delete active PostingFormat files > -- > > Key: LUCENE-4275 > URL: https://issues.apache.org/jira/browse/LUCENE-4275 > Project: Lucene - Core > Issue Type: Bug > Components: core/codecs, general/test >Affects Versions: 4.0-ALPHA > Environment: Win XP 64bit Sun JDK 1.6 >Reporter: Mark Harwood > Fix For: 4.0 > > Attachments: Lucene-4275-TestClass.patch > > > As part of testing Lucene-4069 I have encountered sporadic issues with files > going missing. I believe this is a bug in the test framework (multi-threading > issues in MockDirectoryWrapper?) so have raised a separate issue with > simplified test PostingFormat class here. 
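The contract described above (throw an IOException when a required file vanishes before it can be opened, and let the caller retry against a revised view of the segments) can be sketched as follows. This is an illustrative pattern only, not Lucene's actual SegmentReader/PostingsFormat code; `SegmentSource`, `openSegment`, and `openWithRetry` are simplified stand-in names.

```java
import java.io.IOException;
import java.util.List;

// Illustrative sketch of the exception contract discussed above: a
// fieldsProducer-style open signals "file deleted under me" with an
// IOException (never IllegalStateException), and the caller retries with a
// refreshed view. All names here are hypothetical stand-ins.
public class RetrySketch {

    interface SegmentSource {
        List<String> listFiles(); // current (possibly changed) directory view
    }

    static String openSegment(SegmentSource dir, String file) throws IOException {
        if (!dir.listFiles().contains(file)) {
            // A required file deleted before we could open it is an *expected*
            // condition, so signal it with IOException to trigger the retry.
            throw new IOException("Missing file: " + file);
        }
        return "opened:" + file;
    }

    static String openWithRetry(SegmentSource dir, String file, int maxRetries)
            throws IOException {
        IOException last = null;
        for (int i = 0; i <= maxRetries; i++) {
            try {
                return openSegment(dir, file); // re-reads the directory each pass
            } catch (IOException e) {
                last = e; // retry against the newly revised segment listing
            }
        }
        throw last;
    }
}
```

The key design point is that the exception type is the retry signal: an IllegalStateException (as originally thrown) escapes the retry loop, while an IOException is caught and absorbed by it.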
[jira] [Commented] (LUCENE-4275) Threaded tests with MockDirectoryWrapper delete active PostingFormat files
[ https://issues.apache.org/jira/browse/LUCENE-4275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13425895#comment-13425895 ]

Mark Harwood commented on LUCENE-4275:
--------------------------------------

Thanks, Rob. This test requires a call to "ant clean" between each run before it will consistently work. However, I don't consider that a fix and assume we are still looking for a bug, as there's an index-consistency issue lurking somewhere here. I've tried adding the setting -Dtests.directory=RAMDirectory but the test still looks to have some "memory" between runs.

I added some logging of creates and deletes as you suggest, and it looks like on a second, un-cleansed run my PF is being called to open a high-numbered segment which I suspect was created by an earlier run, as the logging doesn't show signs of the PF being asked to create content for this (or any other) segment as part of the current run. At this point it fails, as there is no longer a copy of the "foobar" file listed by the directory. I have noticed in the logs from previous runs that MDW is asked to delete the segment's "foobar" file by IndexWriter as part of compaction into a compound CFS.

Hope this sheds some light, as I'm finding this a complex one to debug.
[jira] [Updated] (LUCENE-4275) Threaded tests with MockDirectoryWrapper delete active PostingFormat files
[ https://issues.apache.org/jira/browse/LUCENE-4275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-4275:
---------------------------------
    Attachment: Lucene-4275-TestClass.patch

Attached a simple PostingsFormat used to illustrate cases of files going missing in PF tests.
[jira] [Created] (LUCENE-4275) Threaded tests with MockDirectoryWrapper delete active PostingFormat files
Mark Harwood created LUCENE-4275:
------------------------------------

             Summary: Threaded tests with MockDirectoryWrapper delete active PostingFormat files
                 Key: LUCENE-4275
                 URL: https://issues.apache.org/jira/browse/LUCENE-4275
             Project: Lucene - Core
          Issue Type: Bug
          Components: core/codecs, general/test
    Affects Versions: 4.0-ALPHA
         Environment: Win XP 64bit Sun JDK 1.6
            Reporter: Mark Harwood
             Fix For: 4.0

As part of testing Lucene-4069 I have encountered sporadic issues with files going missing. I believe this is a bug in the test framework (multi-threading issues in MockDirectoryWrapper?) so have raised a separate issue with a simplified test PostingFormat class here.

Using this test PF will fail due to a missing file roughly one in four times of executing this test:

ant test-core -Dtestcase=TestIndexWriterCommit -Dtests.method=testCommitThreadSafety -Dtests.seed=EA320250471B75AE -Dtests.slow=true -Dtests.postingsformat=TestNonCoreDummyPostingsFormat -Dtests.locale=no -Dtests.timezone=Europe/Belfast -Dtests.file.encoding=UTF-8
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-4069:
---------------------------------
    Attachment: 4069Failure.zip

Attached a log of thread activity showing how TestIndexWriterCommit.testCommitThreadSafety() is failing. At this stage I can't tell if this is a failing in MockDirectoryWrapper, the test, or the BloomPF class, but it is related to files being removed unexpectedly.

> Segment-level Bloom filters for a 2 x speed up on rare term searches
> --------------------------------------------------------------------
>
>                 Key: LUCENE-4069
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4069
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index
>    Affects Versions: 3.6, 4.0-ALPHA
>            Reporter: Mark Harwood
>            Priority: Minor
>             Fix For: 4.0
>
>         Attachments: 4069Failure.zip, BloomFilterPostingsBranch4x.patch,
> LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch,
> MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java,
> PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java,
> PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java
>
> An addition to each segment which stores a Bloom filter for selected fields
> in order to give fast-fail to term searches, helping avoid wasted disk access.
> Best suited for low-frequency fields e.g. primary keys on big indexes with
> many segments, but also speeds up general searching in my tests.
> Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
> Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
> Patch based on 3.6 codebase attached.
> There are no 3.6 API changes currently - to play, just add a field with "_blm"
> on the end of the name to invoke special indexing/querying capability.
> Clearly a new Field or schema declaration(!) would need adding to APIs to
> configure the service properly.
> Also, a patch for the Lucene 4.0 codebase introducing a new PostingsFormat.
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13418411#comment-13418411 ]

Mark Harwood commented on LUCENE-4069:
--------------------------------------

bq. I wonder if it has to do w/ only opening the file in the close() method

Just tried opening the file earlier (in the BloomFilteredConsumer constructor) and that didn't fix it. I previously also added an extra Directory.fileExists() sanity check immediately after closing the IndexOutput and all was well, so I think it's something happening after that. Will need to dig deeper.

I'm running on WinXP 64bit, if that is of any significance to MDW's behaviour.
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13418314#comment-13418314 ]

Mark Harwood commented on LUCENE-4069:
--------------------------------------

One more remaining issue before I commit, which has appeared sporadically and looks to be consistently raised by this test:

ant test -Dtestcase=TestIndexWriterCommit -Dtests.method=testCommitThreadSafety -Dtests.seed=EA320250471B75AE -Dtests.slow=true -Dtests.postingsformat=TestBloomFilteredLucene40Postings -Dtests.locale=no -Dtests.timezone=Europe/Belfast -Dtests.file.encoding=ISO-8859-1

The error it produces is this:

[junit4:junit4]> Caused by: java.lang.IllegalStateException: Missing file:_9_TestBloomFilteredLucene40Postings_0.blm
[junit4:junit4]>    at org.apache.lucene.codecs.bloom.BloomFilteringPostingsFormat$BloomFilteredFieldsProducer.<init>(BloomFilteringPostingsFormat.java:175)
[junit4:junit4]>    at org.apache.lucene.codecs.bloom.BloomFilteringPostingsFormat.fieldsProducer(BloomFilteringPostingsFormat.java:156)

MockDirectoryWrapper looks to be randomly deleting files (probably my "blm" file shown above) to simulate the effects of crashes. Presumably I am doing the "right thing" in always throwing an exception if the .blm file is missing? The alternative would be to silently ignore the missing file, which seems undesirable.

If MDW is intended to only delete uncommitted files, I'm not sure how we end up in a scenario where BloomPF is being asked to open the uncommitted segment?
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13416383#comment-13416383 ]

Mark Harwood commented on LUCENE-4069:
--------------------------------------

A quick benchmark suggests the new right-sized bitset, as opposed to the old worst-case-scenario-sized bitset, is buying us a small performance improvement.

bq. I also don't think this PF should be per-field

There was a lengthy discussion earlier on this topic. The approach presented here seems reasonable. For the average user there is the DefaultBloomFilterFactory, which now has reasonable sizing for all fields passed its way (assuming a heuristic based on numDocs=numKeys to anticipate). Expert users can provide a BloomFilterFactory with a custom choice of sizing heuristic per field, and can also simply return "null" for non-bloomed fields.

Having a single, carefully configured BloomPF wrapper is preferable because you can channel appropriately configured bloom settings to a common PF delegate and avoid creating multiple .tii, .tis files etc., because the PerFieldPF isn't smart enough to figure out that these Bloom-ing choices do not require different physical files for all the delegated tii etc. structures. You don't *have* to use the per-field stuff in BloomPF, but there are benefits to be had in doing so which can't otherwise be achieved.

bq. Can you add @lucene.experimental to all the new APIs?

Done.
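The per-field factory pattern described above (one Bloom-filtering wrapper, a factory consulted per field, null meaning "don't bloom this field") can be sketched as follows. The interface and class names, the field name "id", and the use of a plain BitSet in place of the patch's FuzzySet are all illustrative stand-ins, not Lucene's exact API.

```java
import java.util.BitSet;

// Sketch of the per-field factory pattern discussed above: a single
// Bloom-filtering wrapper asks a factory for a filter per field, and the
// factory may return null to disable bloom-filtering for that field.
// All names are hypothetical stand-ins for the patch's real classes.
public class BloomFactorySketch {

    interface BloomFilterFactory {
        // Return a bitset to use as this field's Bloom filter, or null to skip it.
        BitSet getSetForField(String fieldName, int numDocs);
    }

    // Default behaviour: assume roughly one unique key per doc, bloom every field.
    static class DefaultBloomFilterFactory implements BloomFilterFactory {
        public BitSet getSetForField(String fieldName, int numDocs) {
            return new BitSet(10 * numDocs); // ~10 bits per key: a common rule of thumb
        }
    }

    // Expert behaviour: only bloom the primary-key field ("id" is hypothetical).
    static class PrimaryKeyOnlyFactory implements BloomFilterFactory {
        public BitSet getSetForField(String fieldName, int numDocs) {
            return "id".equals(fieldName) ? new BitSet(10 * numDocs) : null;
        }
    }
}
```

The design benefit argued for in the comment is visible in the shape of this sketch: because the bloom choice is a wrapper-level decision rather than a per-field PostingsFormat, every field can still share one delegate's physical term-dictionary files.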
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-4069:
---------------------------------
    Attachment: BloomFilterPostingsBranch4x.patch

New patch which uses SegmentWriteState to right-size the choice of bitset for the volume of content.
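"Right-sizing" a Bloom filter's bitset from the segment's expected key count follows standard Bloom-filter math; the sketch below shows the textbook formulas, which are not necessarily the exact heuristic the patch uses.

```java
// Standard Bloom-filter sizing math, illustrating what right-sizing the
// bitset from a segment's expected key count means. Textbook formulas only;
// the patch's own heuristic may differ.
public class BloomSizing {
    // Optimal number of bits: m = -n * ln(p) / (ln 2)^2
    // for n expected keys and target false-positive rate p.
    static long optimalBits(long numKeys, double falsePositiveRate) {
        return (long) Math.ceil(-numKeys * Math.log(falsePositiveRate)
                                / (Math.log(2) * Math.log(2)));
    }

    // Optimal number of hash functions: k = (m / n) * ln 2
    static int optimalHashes(long bits, long numKeys) {
        return Math.max(1, (int) Math.round((double) bits / numKeys * Math.log(2)));
    }
}
```

For example, one million keys at a 1% false-positive target needs roughly 9.6 million bits (about 1.2 MB) and 7 hash functions, whereas a worst-case fixed-size bitset would pay that cost even for tiny segments.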
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-4069:
---------------------------------
    Attachment:     (was: BloomFilterPostingsBranch4x.patch)
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13416084#comment-13416084 ]

Mark Harwood commented on LUCENE-4069:
--------------------------------------

bq. MessageDigest.getInstance(name) should be the way to go

I'm less keen now - a quick scan of the docs around MessageDigest throws up some issues:

1) SPI registration of MessageDigest providers looks to get into permissions hell, as it is closely related to security - see http://docs.oracle.com/javase/1.4.2/docs/guide/security/CryptoSpec.html#ProviderInstalling which describes the steps required to approve a trusted "provider".

2) MessageDigest as an interface is designed to stream content past the hashing algo in potentially many method calls. MurmurHash2.java is not currently written to process content this way, and suits our needs in hashing small blocks of content in one hit.

For these two reasons it looks like MessageDigest may be a pain to adopt, and the existing approach proposed in this patch may be preferable.
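Point (2) above - MessageDigest's streaming design versus a one-shot hash call - is visible in the shape of the JDK API itself. A small demonstration (using SHA-256, which every JDK ships; the API shape, not the algorithm, is the point):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Demonstrates the MessageDigest API shape discussed above: content is
// streamed past the hasher in as many update() calls as the caller likes,
// and digest() finishes the computation. A one-shot hash like MurmurHash2
// would have to be rewritten to honour that incremental contract.
public class DigestShape {
    static byte[] oneShot(byte[] data) throws NoSuchAlgorithmException {
        // digest(byte[]) is just a convenience: one update, then finish.
        return MessageDigest.getInstance("SHA-256").digest(data);
    }

    static byte[] streamed(byte[] data, int chunk) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        for (int i = 0; i < data.length; i += chunk) {
            md.update(data, i, Math.min(chunk, data.length - i)); // many small calls
        }
        return md.digest();
    }
}
```

Both paths yield the same digest, which is exactly the burden the comment identifies: a MessageDigest implementation must carry internal state across arbitrary update() boundaries, whereas a block-at-a-time hash does not.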
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13416037#comment-13416037 ]

Mark Harwood commented on LUCENE-4069:
--------------------------------------

bq. If a special decoder for foobar is needed, it must be loadable by SPI.

I think we are in agreement on the broad principles. The fundamental question here, though, is whether an index's choice of hash algo should be treated as something that requires a new SPI-registered PostingsFormat to decode, or whether it can be handled as I have done here, with a general-purpose SPI framework for hashing algos.

Actually, re-thinking this, I suspect that rather than creating our own, I can use Java's existing SPI framework for hashing in the form of MessageDigest. I'll take a closer look into that...
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13416007#comment-13416007 ]

Mark Harwood commented on LUCENE-4069:
--------------------------------------

bq. At a minimum I think before committing we should make the SegmentWriteState accessible.

OK. Will that be the subject of a new Jira?

bq. Hmm why is anonymity at search time important?

It would seem to be an established design principle - see https://issues.apache.org/jira/browse/LUCENE-4069#comment-13285726

It would be a pain if user config settings required a custom SPI-registered class just to decode the index contents. There's the resource/classpath hell, the chance for misconfiguration, and running Luke suddenly gets more complex.

The line to be drawn is between what are just config settings (field names, memory limits) and what are fundamentally different file formats (e.g. codec choices). The design principle that looks to have been adopted is that the former ought to be accommodated without the need for custom SPI-registered classes, while the latter would need to locate an implementation via SPI to decode stored content. Seems reasonable.

The choice of hash algo does not fundamentally alter the on-disk format (they all produce an int), so I would suggest we treat this as a config setting rather than a fundamentally different choice of file format.
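The config-setting treatment argued for above - record the hash's *name* alongside the data at write time, resolve it by name when reading, so no custom SPI-registered class is needed to decode the index - can be sketched as follows. The class names and the plain registry map are illustrative; the real mechanism under discussion resolves the name through an SPI-style lookup.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Map;
import java.util.function.ToIntFunction;

// Sketch of the write-time/read-time contract discussed above: the writer
// records which hash it used by name, and a generic reader resolves that
// name to an implementation. A plain map stands in for the SPI lookup, and
// the "hash functions" are placeholders, not real Murmur implementations.
public class HashNameSketch {
    static final Map<String, ToIntFunction<byte[]>> REGISTRY =
        Map.<String, ToIntFunction<byte[]>>of(
            "MurmurHash2", bytes -> bytes.length * 31,  // placeholder stand-in
            "MurmurHash3", bytes -> bytes.length * 37); // placeholder stand-in

    static byte[] writeHeader(String hashName) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new DataOutputStream(bos).writeUTF(hashName); // the config travels with the data
        return bos.toByteArray();
    }

    static ToIntFunction<byte[]> readHasher(byte[] header) throws IOException {
        String name = new DataInputStream(new ByteArrayInputStream(header)).readUTF();
        ToIntFunction<byte[]> h = REGISTRY.get(name);
        if (h == null) throw new IOException("Unknown hash: " + name);
        return h;
    }
}
```

Because only a short name string is stored, the on-disk format stays the same whichever hash is configured - the property the comment uses to classify the hash choice as config rather than a new file format.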
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13415362#comment-13415362 ] Mark Harwood commented on LUCENE-4069: -- bq. It's the unique term count (for this one segment) that you need right? Yes, I need it before I start processing the stream of terms being flushed. bq. Seems like LUCENE-4198 needs to solve this same problem. Another possibly related point on more access to "merge context" - custom codecs have a great opportunity at merge time to piggy-back some analysis on the data being streamed e.g. to spot "trending" terms whose term frequencies differ drastically between the merging source segments. This would require access to "source segment" as term postings are streamed to observe the change in counts. bq. Also, why do we need to use SPI to find the HashFunction? Seems like overkill... we don't (yet) have a bunch of hash functions that are vying here right? There's already a MurmurHash3 algo - we're currently using v2 and so could anticipate an upgrade at some stage. This patch provides that future proofing. bq. can't the postings format impl pass in an instance of HashFunction when making the FuzzySet I don't think that is going to work. Currently all PostingFormat impls that extend BloomFilterPostingsFormat can be anonymous (i.e. unregistered via SPI). All their settings (fields, hash algo, thresholds) etc are recorded at write time by the base class in the segment. At read-time it is the BloomFilterPostingsFormat base class that is instantiated, not the write-time subclass and so we need to store the hash algo choice. We can't rely on the original subclass being around and configured appropriately with the original write-time choice of hashing function. I think the current way feels safer over all and also allows other Lucene functions to safely record hashes along with a hashname string that can be used to reconstitute results. bq. 
Can you move the imports under the copyright header?

Will do.
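The write-time/read-time decoupling argued for above can be sketched as follows. This is a hypothetical illustration (names like `HashByName` and `recordHashName` are invented here, and the map stands in for an SPI-style registry, not Lucene's actual code): the writer persists the hash implementation's name in the segment, and the reader reconstitutes the function from that recorded string alone, so the original (possibly anonymous, unregistered) subclass need not be on the classpath.

```java
import java.util.Map;
import java.util.function.ToIntFunction;

public class HashByName {
    // Stand-in for a registry of known hash implementations; the lambda body
    // is a placeholder, not a real MurmurHash.
    static final Map<String, ToIntFunction<byte[]>> REGISTRY = Map.of(
            "MurmurHash2", bytes -> {
                int h = 0;
                for (byte b : bytes) h = 31 * h + b;
                return h;
            });

    // Write time: validate and persist the chosen implementation's name
    // alongside the filter data in the segment.
    static String recordHashName(String name) {
        if (!REGISTRY.containsKey(name))
            throw new IllegalArgumentException("unknown hash: " + name);
        return name;
    }

    // Read time: only the recorded string is needed to rebuild the function;
    // the write-time subclass is never consulted.
    static ToIntFunction<byte[]> reconstitute(String recordedName) {
        return REGISTRY.get(recordedName);
    }

    public static void main(String[] args) {
        String recorded = recordHashName("MurmurHash2");
        byte[] key = "doc-42".getBytes(java.nio.charset.StandardCharsets.UTF_8);
        System.out.println(reconstitute(recorded).applyAsInt(key));
    }
}
```

The point being made in the comment is exactly this last step: the read path depends only on the recorded name, which is why the hash choice must be stored in the segment rather than re-supplied by configuration.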
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-4069: - Attachment: BloomFilterPostingsBranch4x.patch

Added bloom package.html and changes.txt. I plan to commit in a day or two if there are no objections.
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-4069: - Attachment: (was: BloomFilterPostingsBranch4x.patch)
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13410145#comment-13410145 ] Mark Harwood commented on LUCENE-4069: -- bq. So now we are close to 1M lookups/sec for a single thread! Cool! bq. I wonder if somehow we can do a better job picking the right sized bit vector up front? bq. You basically need to know up front how many unique terms will be in the given field for this segment right? Yes - the job of anticipating the number of unique keys probably has 2 different contexts: 1) Net new segments e.g. guessing up front how many docs/keys a user is likely to generate in a new segment before the flush settings kick in. 2) Merged segments e.g. guessing how many unique keys survive a merge operation Estimating key volumes in context 1 is probably hard without some additional hints from the end user. Arguably the BloomFilterFactory.getSetForField() method already represents where this setting can be controlled. In context 2 where potentially large merges occur we could look at adding an extra method to BloomFilterFactory to handle this different context e.g. something like FuzzySet getSetForMergeOpOnField(FieldInfo fi, OneMerge mergeContext) Based on the size of the segments being merged and volumes of deletes a more appropriate size of Bloom bitset could be allocated based on a worst-case estimate. Not sure how we get the OneMerge instance fed through the call stack - could that be held somewhere on a ThreadLocal as generally useful context? 
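The sizing question raised above ("picking the right sized bit vector up front") comes down to standard Bloom filter arithmetic once an estimate of the unique-term count is available. A minimal sketch of that math, illustrative only and not the actual FuzzySet implementation (class and method names here are invented), assuming an estimated term count n and a target false-positive rate p:

```java
public class BloomSizing {
    // Optimal number of bits: m = -n * ln(p) / (ln 2)^2
    static long optimalBitCount(long expectedTerms, double falsePositiveRate) {
        return (long) Math.ceil(-expectedTerms * Math.log(falsePositiveRate)
                / (Math.log(2) * Math.log(2)));
    }

    // Optimal number of hash evaluations: k = (m / n) * ln 2
    static int optimalHashCount(long bitCount, long expectedTerms) {
        return Math.max(1,
                (int) Math.round((double) bitCount / expectedTerms * Math.log(2)));
    }

    public static void main(String[] args) {
        // e.g. a merged segment expected to hold ~1M surviving unique keys
        long m = optimalBitCount(1_000_000, 0.01);
        int k = optimalHashCount(m, 1_000_000);
        System.out.println(m + " bits, " + k + " hashes");
    }
}
```

This is why the merge context matters: for a merge, a worst-case n (sum of source-segment term counts minus a delete estimate) can be plugged straight into the first formula, whereas for a net-new segment n has to come from a user hint such as the factory method mentioned above.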
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13408097#comment-13408097 ] Mark Harwood commented on LUCENE-4069: -- Thanks for the extra tests, Mike. That's tightened performance, but that looks like a scary amount of code for the optimal solution of this basic incrementing operation :) I've done some more benchmarks with the updated test and the performance characteristics are becoming clearer, as shown in these results: http://goo.gl/dtWSb Bloom performance is better than Pulsing, but the gap narrows with the volume of deletes lying around in old segments, caused by updates. In these cases the BloomFilter gives a false positive and falls back to the equivalent operations of Pulsing. I added a 100mb start size for the BloomFilter for large-scale tests because without this it gets saturated and there were occasional big spikes in batch times. So overall there still looks to be a benefit, especially in low-frequency update scenarios. I'll wait for the dust to settle on LUCENE-4190 (given this codec introduces a new file) before thinking about committing. Cheers Mark
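The saturation effect described above can be checked with back-of-envelope arithmetic: deleted-but-unpurged keys still occupy bits in the filter, so the effective load n grows past the design point and the false-positive rate climbs, which is when the filter degrades to Pulsing-equivalent work. A sketch using the standard formula p = (1 - e^(-k*n/m))^k; this is illustrative arithmetic, not Lucene's FuzzySet code.

```java
public class BloomSaturation {
    // Expected false-positive rate of a Bloom filter with k hashes,
    // n inserted keys, and m bits: p = (1 - e^(-k*n/m))^k
    static double falsePositiveRate(int k, long n, long m) {
        return Math.pow(1 - Math.exp(-(double) k * n / m), k);
    }

    public static void main(String[] args) {
        long m = 9_585_059L; // a filter sized for ~1M keys at ~1% FP, k = 7
        System.out.printf("at design load: %.4f%n",
                falsePositiveRate(7, 1_000_000, m));
        // twice the design load, e.g. half the keys are stale deletes
        System.out.printf("2x saturated:   %.4f%n",
                falsePositiveRate(7, 2_000_000, m));
    }
}
```

At double the design load the false-positive rate is more than an order of magnitude worse, which matches the observed narrowing of the gap over Pulsing as deletes accumulate and motivates both the larger start size and the retirement threshold added in the later patch.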
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-4069: - Attachment: PKLookupUpdatePerfTest.java

Updated performance test with option to alter the ratio of inserts vs updates via keyspace size.
[jira] [Commented] (LUCENE-4190) IndexWriter deletes non-Lucene files
[ https://issues.apache.org/jira/browse/LUCENE-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13407099#comment-13407099 ] Mark Harwood commented on LUCENE-4190: --

-1 for merrily wiping contents of whatever directory a user happens to pick for an index location

+0 on requiring all codecs to declare filenames because I take on board Rob's points re complexity

+1 for the "_*" name-spacing proposal as a sensible compromise

> IndexWriter deletes non-Lucene files > > > Key: LUCENE-4190 > URL: https://issues.apache.org/jira/browse/LUCENE-4190 > Project: Lucene - Java > Issue Type: Bug >Reporter: Michael McCandless >Assignee: Robert Muir > Fix For: 4.0, 5.0 > > Attachments: LUCENE-4190.patch, LUCENE-4190.patch > > > Carl Austin raised a good issue in a comment on my Lucene 4.0.0 alpha blog > post: > http://blog.mikemccandless.com/2012/07/lucene-400-alpha-at-long-last.html > IndexWriter will now (as of 4.0) delete all foreign files from the index > directory. We made this change because Codecs are free to write to any files > now, so the space of filenames is hard to "bound". > But if the user accidentally uses the wrong directory (eg c:/) then we will > in fact delete important stuff. > I think we can at least use some simple criteria (must start with _, maybe > must fit certain pattern eg _(_X).Y), so we are much less likely to > delete a non-Lucene file
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-4069: - Attachment: BloomFilterPostingsBranch4x.patch

Added customizable saturation threshold after which Bloom filters are retired and no longer maintained (due to merges creating v large segments)
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-4069: - Attachment: (was: BloomFilterPostingsBranch4x.patch)
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-4069: - Attachment: PKLookupUpdatePerfTest.java

Attached a performance test (adapted from Mike's PKLookupPerfTest) that demonstrates the worst-case scenario where BloomFilter offers the 2x speed up not previously revealed in Mike's other tests. This test case mixes reads and writes on a growing index and is representative of the real-world scenario I am seeking to optimize. See the javadoc for test details.
Re: Welcome Greg Bowyer
Good to have you aboard, Greg!

- Original Message - From: Erick Erickson To: dev@lucene.apache.org Sent: Thursday, 21 June 2012, 11:56 Subject: Welcome Greg Bowyer

I'm pleased to announce that Greg Bowyer has been added as a Lucene/Solr committer. Greg: It's a tradition that you reply with a brief bio. Your SVN access should be set up and ready to go. Congratulations! Erick Erickson
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-4069: - Attachment: PrimaryKeyPerfTest40.java

Updated Performance test code based on new IndexReader changes for accessing subreaders