[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Da Huang updated LUCENE-4396: ----------------------------- Attachment: merge.png merge.perf LUCENE-4396.patch This is a patch based on git mirror commit a5a2e716ebcba1a201c4934f336ae9c0fcb551bf . In this patch, I have fixed a bug of wrong coord counting. Besides, I have come up with an awesome idea on choosing scorer and implemented this idea in this patch. The following is the story of this idea. After I have completed performance tests for each scorer, I plotted figures for the results, so that I can have an intuitive view on natures of each scorers. The figures are showed as follows. !perf.png! Then, I discovered that the performance of each scorer can probably be fitted by a straight line. It may be confusing that there're several points which look distinctive, such as (8, -20) on BAS, in LowAndNLowOr case. However, when I retested BAS again, its performance went to 10 with N = 8. Therefore, I just consider these 'distinctive' points as noisy points. So, the following things to do is to get those performance curves' expressions. Firstly, just have a look at the perf. figures again. We can find that BAS can get the best performance on average, so we just discuss BAS here. The expressions of each performance curve' fitting line are showed as follows. (Suppose that the horizontal axis is 'x', while the vertical axis is 'y') || LowAndNLowOr || LowAndNHighOr || HighAndNLowOr || HighAndNHighOr || | y = 5.33x - 31 | y = 3.83x - 8.5 | y = 1.67x - 55 | y = 7.5x - 32.5 | || LowAndNLowNot || LowAndNHighNot || HighAndNLowNot || HighAndNHighNot || | y = 4.5x - 18.5 | y = 3x - 7 | y = -0.83x - 22.5 | y = 7x - 31 | I got these expressions just by visual estimation. You can also get similiar expressions by drawing a straight line between the first and last point on each firgure. Now suppose that the general performance expression is y = A \* x + B . In lucene/BooleanQuery, the only information we have is requiredCost and optionalCost(or prohibitedCost). For convenience, Let's just symbolize these two values as 'a' and 'b' respectively. If we can find two functions, f and g, which have A = f(a, b), B = g(a, b), we can get the performance curve in the program. For convenience, we just discuss A = f(a, b) here, and the case of B is just similiar to A. The same, we just discuss \*Or cases here, and \*Not cases are just similiar ones. Here, we know the values of a and b for each case. || LowAndNLowOr || LowAndNHighOr || HighAndNLowOr || HighAndNHighOr || | a = L, b = L | a = L, b = H | a = H, b = L | a = H, b = H | Among, L represents a low cost, while H represents a high cost. We can evaluate these two value by doing a statistics on wikimedium.10M.nostopwords.tasks in luceneutil. Here, their evaluated values are: {code} H = 747310, L = 34750 {code} As, you can see, the values of H and L are too high. Here, we get their log value; that is {code} h = log(H), l = log(L) {code} Suppose that f is formatted as {code} f(a,b) = k1 * u1(a, b) + k2 * u2(a, b) + k3 * u3(a, b) + k4 * u4(a, b) {code} Thus, we have {code} k1 * u1(l, l) + k2 * u2(l, l) + k3 * u3(l, l) + k4 * u4(l, l) = 5.33 k1 * u1(l, h) + k2 * u2(l, h) + k3 * u3(l, h) + k4 * u4(l, h) = 3.84 k1 * u1(h, l) + k2 * u2(h, l) + k3 * u3(h, l) + k4 * u4(h, l) = 1.67 k1 * u1(h, h) + k2 * u2(h, h) + k3 * u3(h, h) + k4 * u4(h, h) = 7.5 {code} The following question is how to choose ui(a, b). Actually, I have tried many formulations, and I found the following is the best. ||u1(a,b)||u2(a,b)||u3(a,b)||u4(a,b)|| | a | b | a\*b | a\*b/(a+b)| I think such setup has its physical meanings. u1 and u2 are the influence factors of a and b respectively. u3 represents the higher dimentional factor. u4 is half of the harmonic mean. Thus, we have {code} [ l l l*l l ] [ k1 ] [5.33] [ l h l*h l*h/(l+h) ] * [ k2 ] = [3.84] [ h l h*l l*h/(l+h) ] [ k3 ] [1.67] [ h h h*h h ] [ k4 ] [7.5 ] that is [ 10.4559 10.4559 109.3266 5.2280 ] [ 10.4559 13.5242 141.4085 5.8969 ] * [k] = [A] [ 13.5242 10.4559 141.4085 5.8969 ] [ 13.5242 13.5242 182.9049 6.7621 ] or symbolized [U] * [k] = [A] {code} Luckily, \[U\] is a good matrix, which means that its inverse matrix is 'calculable'. {code} [ -1.4365 1.1106 1.4365 -1.1106 ] inv([U]) = [ -1.4365 1.4365 1.1106 -1.1106 ] [ -0.0312 0.0000 0.0000 0.0241 ] [ 6.5893 -5.0943 -5.0943 3.9386 ] [k] = inv([U]) * [A] = [-9.3396 -8.6334 0.0145 36.6630]' {code} Now, in the program, we can get A by, {code} A = f(a,b) = k1 * a + k2 * b + k3 * a * b + k4 * a * b / (a + b) {code} and get B in a similiar way. Finally, we get the evaluated fitting straight line of BAS in a specific case. y = A \* x + B x is the number of optional(or prohibited) scorers. Thus, we can use BAS, when A * N + B > THERHOLD. Here, I set the THERHOLD as 5. I have tested this idea with medium cost cases included. And it turns out that all cases performs well. See the original performance file [merge.perf] and the following figures. !merge.png! > BooleanScorer should sometimes be used for MUST clauses > ------------------------------------------------------- > > Key: LUCENE-4396 > URL: https://issues.apache.org/jira/browse/LUCENE-4396 > Project: Lucene - Core > Issue Type: Improvement > Reporter: Michael McCandless > Attachments: And.tasks, And.tasks, AndOr.tasks, AndOr.tasks, > LUCENE-4396.patch, LUCENE-4396.patch, LUCENE-4396.patch, LUCENE-4396.patch, > LUCENE-4396.patch, LUCENE-4396.patch, LUCENE-4396.patch, LUCENE-4396.patch, > LUCENE-4396.patch, LUCENE-4396.patch, LUCENE-4396.patch, LUCENE-4396.patch, > LUCENE-4396.patch, LUCENE-4396.patch, SIZE.perf, all.perf, > luceneutil-score-equal.patch, luceneutil-score-equal.patch, merge.perf, > merge.png, perf.png, stat.cpp, stat.cpp, tasks.cpp > > > Today we only use BooleanScorer if the query consists of SHOULD and MUST_NOT. > If there is one or more MUST clauses we always use BooleanScorer2. > But I suspect that unless the MUST clauses have very low hit count compared > to the other clauses, that BooleanScorer would perform better than > BooleanScorer2. BooleanScorer still has some vestiges from when it used to > handle MUST so it shouldn't be hard to bring back this capability ... I think > the challenging part might be the heuristics on when to use which (likely we > would have to use firstDocID as proxy for total hit count). > Likely we should also have BooleanScorer sometimes use .advance() on the subs > in this case, eg if suddenly the MUST clause skips 1000000 docs then you want > to .advance() all the SHOULD clauses. > I won't have near term time to work on this so feel free to take it if you > are inspired! -- This message was sent by Atlassian JIRA (v6.2#6252) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org