Re: [GoSC] I'm interested in LUCENE-3333

2014-03-10 Thread Michael McCandless
On Sun, Mar 9, 2014 at 10:46 AM, Da Huang dhuang...@gmail.com wrote:
 Thanks a lot. That's very helpful.

 I think you get exactly what I mean about the LUCENE-4396.
 By grouping up the MUST clauses, the conjunctive query can be
 done specifiedly with easy way. Then, the original query would have
 no more than 1 MUST clause. I think in this situation, it's much more
 easier to judge whether to use BooleanScorer or BooleanScorer2. :)

Well, because we now have the DISI.cost() method, we can use this to
find the least-cost MUST clause (e.g. the one matching the fewest
documents) and then make a call up front on whether BS or BS2 is
appropriate.

But these would all be fun ideas to explore under a GSoC project, if
we can scope it appropriately.

Mike McCandless

http://blog.mikemccandless.com

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [GoSC] I'm interested in LUCENE-3333

2014-03-09 Thread Michael McCandless
Hi, Da,

On Sun, Mar 9, 2014 at 1:30 AM, Da Huang dhuang...@gmail.com wrote:

 I have spent some time considering your suggestions in last mail. I find
 that I'm interested in the suggestion  Filter and Query should be more
 'combined' .

OK, cool, and ambitious; it might be safer to choose a less
ambitious/controversial change for a GSoC project.

Maybe, have a look at LUCENE-1518?  There was lots of discussion there.

 In my opinion, to implement this suggestion, a new class FilterQuery,
 which is a subclass of Query,  should be created. If FilterQuery is
 implemented, then it can be the query element of BooleanClause, and the
 BooleanQuery can naturally add a Filter as a BooleanClause. I think
 one of the most important things is to deal with the scores, as Filter does
 not contribute anything to score.

I feel like it should be the opposite?  Like, a Filter has less
functionality that a Query, because it does only matching?  So I would
think a Quey would subclass Filter and then add scoring onto it?  But
there was lots of discussion on the above issue that I don't
remember...

Mike McCandless

http://blog.mikemccandless.com

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [GoSC] I'm interested in LUCENE-3333

2014-03-09 Thread Da Huang
Hi, Mike.

You're right. After having a look at the comments on LUCENE-1518, I find
that my idea about that has many bugs. Sorry for that.

Thus, I have checked some other suggestions you gave me to see whether
relevant comments can be found in jira.

I think I have some idea on LUCENE-4396: BooleanScorer should sometimes be
used for MUST clauses.
Can we adjust the query to make the problem easier? For the query +a b c
+e +f as an example, maybe we can
turn it into (+a +e +f) b c which has only one MUST clause. Then, it
would be easier to judge which scorer to use?

Besides, I seems that the suggestion we should pass a needsScorers boolean
up-front to Weight.scorer
is not on jira. But it sounds that it can be done by adjusting some class
methods' arguments and return value
to pass the needsScorers? not sure.

At last, recently I find something strange in the code about heap. I find
heap has been implemented duplicately
for many times in the trunk, and a PriorityQueue is also implemented in the
package org.apache.lucene.util.
I remember java has already implemented the PriorityQueue. Why not use that?


Thanks,
Da Huang


-- 
黄达(Da Huang)
Team of Search Engine  Web Mining
School of Electronic Engineering  Computer Science
Peking University, Beijing, 100871, P.R.China


Re: [GoSC] I'm interested in LUCENE-3333

2014-03-09 Thread Michael McCandless
On Sun, Mar 9, 2014 at 9:55 AM, Da Huang dhuang...@gmail.com wrote:
 Hi, Mike.

 You're right. After having a look at the comments on LUCENE-1518, I find
 that my idea about that has many bugs. Sorry for that.

It's fine, it's a VERY hard fix :)  This is why it hasn't been done yet!

 Thus, I have checked some other suggestions you gave me to see whether
 relevant comments can be found in jira.

 I think I have some idea on LUCENE-4396: BooleanScorer should sometimes be
 used for MUST clauses.
 Can we adjust the query to make the problem easier? For the query +a b c +e
 +f as an example, maybe we can
 turn it into (+a +e +f) b c which has only one MUST clause. Then, it would
 be easier to judge which scorer to use?

You mean create nesting when there wasn't before, by grouping all MUST
clauses together?  We could explore that ...

Or we could pass all the clauses (still flat) to BooleanScorer.  I
think this would only be faster when the MUST clauses are high cost
relative to all other clauses.  E.g. a super-rare MUST'd clause would
probably be faster with BooleanScorer2.

I think this could make a good GSoC project.

 Besides, I seems that the suggestion we should pass a needsScorers boolean
 up-front to Weight.scorer
 is not on jira. But it sounds that it can be done by adjusting some class
 methods' arguments and return value
 to pass the needsScorers? not sure.

I think it's this Jira: https://issues.apache.org/jira/browse/LUCENE-3331

(I just searched for needs scores on
http://jirasearch.mikemccandless.com and it was one of the
suggestions).

All that should be needed here is to add a boolean needsScores (or
something) to the Weight.scorer method, and fix the numerous places
where this method is invoked to pass the right value.  E.g.
ConstantScoreQuery would pass false, and this would mean e.g. if it
wraps a TermQuery, we could avoid decoding freq blocks from the
postings.

 At last, recently I find something strange in the code about heap. I find
 heap has been implemented duplicately
 for many times in the trunk, and a PriorityQueue is also implemented in the
 package org.apache.lucene.util.
 I remember java has already implemented the PriorityQueue. Why not use that?

Good question!  There is a fair amount of duplicated code, and we
should fix that over time.  Lucene has had its own PQ class forever,
and we do strange things like pre-filling the queue with a sentinel
value to avoid if (queueIsNotFullYet) checks in collect(int doc),
and we can replace the top value and re-heap ... but maybe these do
not in fact matter in practice and if so we should stop duplicating
code :)

Mike McCandless

http://blog.mikemccandless.com

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [GoSC] I'm interested in LUCENE-3333

2014-03-09 Thread Da Huang
Thanks a lot. That's very helpful.

I think you get exactly what I mean about the LUCENE-4396.
By grouping up the MUST clauses, the conjunctive query can be
done specifiedly with easy way. Then, the original query would have
no more than 1 MUST clause. I think in this situation, it's much more
easier to judge whether to use BooleanScorer or BooleanScorer2. :)


Thanks,
Da Huang


-- 
黄达(Da Huang)
Team of Search Engine  Web Mining
School of Electronic Engineering  Computer Science
Peking University, Beijing, 100871, P.R.China


Re: [GoSC] I'm interested in LUCENE-3333

2014-03-08 Thread Michael McCandless
On Fri, Mar 7, 2014 at 9:01 PM, Da Huang dhuang...@gmail.com wrote:
 Hello, everyone,

 My name is Da Huang. I'm studying for my master degree of Computer Science
 in Peking University. I have been using lucene for about half a year. It's
 so elegent that I hope to have a chance to contribute some code for it.

Welcome!

 Therefore, I have been scaned the jira GoSC 2014 Ideas page about lucene for
 several days. I find LUCENE-: Specialize DisjunctionScorer if all
 clauses are TermQueries more suitable for me to do. I have spent some time
 to scan the revelant code, and the Issue LUCENE-3328 which spinoff
 LUCENE-. I find the following questions confusing me.

 1) I have checkout the code from
 http://svn.apache.org/repos/asf/lucene/dev/trunk lucene_trunk, but I
 couldn't find the relevant code of the fixed Issue LUCENE-3328. It seems
 that the patch attached on the page is not on the trunk. Why?

Well, some time after LUCENE-3328, we made further changes and
discovered that this code specialized scorer was not in fact [that
much?] faster.  I forget which issue removed it... but you could
probably find it with some svn archaeology.

Net/net the trend in Lucene has been against adding source code
specialization, since this is really code duplication to try to make
hotspot's life easier.  Unfortunately, it does sometimes work; e.g see
http://blog.mikemccandless.com/2013/06/screaming-fast-lucene-searches-using-c.html
though that's not a fair comparison since it was also a different
programming language!  So, while it's a nice tradeoff for performance,
it's a poor tradeoff for ongoing code management.  See all the
specialized collectors we have in TopFieldCollector!

So I'm not sure at this point if we should even pursue LUCENE-.
There are however tons of other things to fix on the search side;
maybe we could craft a good GSoC project from something else; e.g.:

  - we should pass a needsScorers boolean up-front to Weight.scorer
  - disjunctions now score during matching

  - BooleanScorer should sometimes be used for MUST clauses

  - We sort of duplicate code across BooleanQuery, FilteredQuery,
BooleanFilter, TermsFilter

  - Somehow, Filter and Query should be more combined; e.g. you
should be able to add a Filter as a clause onto a BooleanQuery

  - Post filtering is too hard to use today

  - ...

 2)  My intuitive idea of solving this issue is to make a class
 DisjunctionTermScorer to do the all TermQueries clauses; then, judging
 whether to use DisjunctionTermScorer in the method 'scorer' in class
 BooleanQuery. Is this idea right?

Yes this would be the right idea.

 Above are my questions about LUCENE-. Besides, I would like to propose
 the following issue which is about the QueryParser.

 When we use QueryParser to parse a querystring like science AND
 (engineering AND technology). The generated query would be +science
 (+engineering +technology). I think it would be more efficient for
 searching if the final query is +science +engineering +technology. My idea
 is to make the cascaded AND and cascaded OR flat. Do you agree? I hope I
 have made my idea clear.

I think this would make tons of sense; the only challenge is that
this will change how scores are computed, when coord is enabled.  I'm
not sure how much that'd matter in practice; if it is important to
preserve that, then maybe we could still make a single 3-clause
BooleanQuery, but somehow remember the original structure for the sake
of coord scoring ... not sure.

Mike McCandless

http://blog.mikemccandless.com

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [GoSC] I'm interested in LUCENE-3333

2014-03-08 Thread Da Huang
Hello, Mike.

I have spent some time considering your suggestions in last mail. I find
that I'm interested in the suggestion  Filter and Query should be more
'combined' .

In my opinion, to implement this suggestion, a new class FilterQuery,
which is a subclass of Query,  should be created. If FilterQuery is
implemented, then it can be the query element of BooleanClause, and the
BooleanQuery can naturally add a Filter as a BooleanClause. I think
one of the most important things is to deal with the scores, as Filter does
not contribute anything to score.

Above is my intuitive idea about this suggestion. Do you think it makes
sense? I hope I have made my idea clear.


Thanks,
Da Huang


-- 
黄达(Da Huang)
Team of Search Engine  Web Mining
School of Electronic Engineering  Computer Science
Peking University, Beijing, 100871, P.R.China