[jira] Commented: (LUCENE-1518) Merge Query and Filter classes

2009-04-30 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704499#action_12704499
 ] 

Shai Erera commented on LUCENE-1518:


I would like to ask why we need to make Filter and Query the same type. 
After all, they do different things, even though they look similar. 
Attempting to merge them yields these peculiarities:
# If Filter extends Query, it now has to implement all sorts of methods like 
weight, toString, rewrite, getTerms and scoresDocInOrder (an addition from 
LUCENE-1593).
# If Query extends Filter, it has to implement getDocIdSet.
# Introduce instanceof checks in places just to check if a given Query is 
actually a Filter or not.

Both (1) and (2) are completely redundant for Filter and Query respectively, 
i.e. why should Filter implement toString(term) or scoresDocInOrder when it 
does not score docs? Why should Query implement getDocIdSet when it already 
implements weight().scorer(), which returns a DocIdSetIterator?

I read the different posts on this issue and I don't understand why we think 
that the API is not clear enough today, or is not convenient:

* If I want to just filter the entire index, I have two ways: (1) execute a 
search with a MatchAllDocsQuery and a Filter, or (2) wrap the filter with 
ConstantScoreQuery. I don't see much difference between the two, and I don't 
think it forces any major/difficult decision on the user.
* If I want to have a BooleanQuery with several clauses and I want a clause to 
be a complex one with a Filter, I can wrap the Filter with CSQ.
* If I want to filter a Query, there is already API today on Searcher which 
accepts both Query and Filter.

At least as I understand it, Queries are supposed to score documents, while 
Filters just filter. If there is an API which requires Queries only, then I 
can wrap my Filter with CSQ, but I'd prefer to check if we can change that API 
first (for example, allowing BooleanClause to accept a Filter, and implement a 
weight(IndexReader) rather than just getQuery()).

So if Filters just filter and Queries just score, the API on both is very 
clear: Filter returns a DISI and Query returns a Scorer (which is also a DISI). 
I don't see the advantage of having the code unaware of the fact that a certain 
Query is actually a Filter - I prefer it to be upfront. That way, we can do all 
sorts of optimizations, like asking the Filter for next() first, if we know 
it's supposed to filter out most of the documents.
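The split being argued for here can be modeled in a few lines (a toy sketch with hypothetical simplified types, not the real Lucene classes): a Filter yields a DocIdSetIterator, a Scorer is itself a DocIdSetIterator that adds score(), and a CSQ-style adapter bridges a Filter into any API that wants scoring.

```java
// Toy model of the proposed split (hypothetical simplification, not the
// real Lucene API): Filters only filter, Queries only score, and a
// CSQ-style wrapper bridges the two.
import java.util.Iterator;
import java.util.List;

interface DocIdSetIterator {
    int NO_MORE_DOCS = Integer.MAX_VALUE;
    int nextDoc(); // next matching docID, or NO_MORE_DOCS when exhausted
}

// A Filter just produces an iterator over the docs it accepts.
interface Filter {
    DocIdSetIterator getDocIdSet();
}

// A Scorer is a DocIdSetIterator that can also score the current doc.
abstract class Scorer implements DocIdSetIterator {
    public abstract float score();
}

// CSQ-style adapter: turns any Filter into a Scorer with a constant score.
class ConstantScoreScorer extends Scorer {
    private final DocIdSetIterator disi;
    private final float constantScore;

    ConstantScoreScorer(Filter filter, float constantScore) {
        this.disi = filter.getDocIdSet();
        this.constantScore = constantScore;
    }

    public int nextDoc() { return disi.nextDoc(); }
    public float score() { return constantScore; }
}

// Simple list-backed filter for demonstration.
class ListFilter implements Filter {
    private final List<Integer> docs; // sorted docIDs
    ListFilter(List<Integer> docs) { this.docs = docs; }

    public DocIdSetIterator getDocIdSet() {
        final Iterator<Integer> it = docs.iterator();
        return () -> it.hasNext() ? it.next() : DocIdSetIterator.NO_MORE_DOCS;
    }
}
```

With this shape the wrapper is the only place the two hierarchies meet, which is exactly the "use CSQ when an API wants a Query" workflow described above.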

At the end of the day, both Filter and Query iterate on documents. The 
difference lies in the purpose of iteration. In my code there are several Query 
implementations today that just filter documents, and I plan to change all of 
them to implement Filter instead (they were Queries originally because Filter 
had only bits(); now, with the iterator() version, it is efficient enough, at 
least for me). I want to do this for a couple of reasons, clarity being one of 
the most important. If Filter just filters, I don't see why it should inherit 
all the methods from Query (or vice versa BTW), especially when I have this CSQ 
wrapper.
To me, as a Lucene user, I make far more complicated decisions every day than 
deciding whether I want to use a Filter as a Query or not. If I pass it 
directly to IndexSearcher, I use it as a filter. If I use a different API which 
accepts just Query, I wrap it with CSQ. As simple as that.

But that's just my two cents.

 Merge Query and Filter classes
 --

 Key: LUCENE-1518
 URL: https://issues.apache.org/jira/browse/LUCENE-1518
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.4
Reporter: Uwe Schindler
 Fix For: 2.9

 Attachments: LUCENE-1518.patch


 This issue presents a patch that merges Queries and Filters in such a way 
 that the new Filter class extends Query. This would make it possible to use 
 every filter as a query.
 The new abstract Filter class would contain all methods of 
 ConstantScoreQuery, deprecating ConstantScoreQuery. If somebody implements 
 the Filter's getDocIdSet()/bits() methods, he has nothing more to do; he 
 could just use the filter as a normal query.
 I do not want to completely convert Filters to ConstantScoreQueries. The idea 
 is to combine Queries and Filters in such a way that every Filter can 
 automatically be used in all places where a Query can be used (e.g. also 
 alone as a search query without any other constraint). For that, the abstract 
 Query methods must be implemented and return a default Weight for Filters, 
 which is the current ConstantScoreQuery logic. If the filter is used as a 
 real filter (where the API wants a Filter), the getDocIdSet part could be 
 used directly, and the weight is useless (as it is currently, too). The 
 constant score default implementation is 
[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean

2009-04-30 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704522#action_12704522
 ] 

Shai Erera commented on LUCENE-1614:


I think I understand what you mean, but please correct me if I'm wrong. You 
propose check() so that if a DISI can save any extra operations it does in 
next() (such as reading a payload, for example), it will do so. Therefore, in 
the example you gave above with CS, next()'s contract forces it to advance 
all the sub-scorers, but with check() it could stop in the middle.

This warrants explicit documentation and careful implementation by current 
DISIs. I don't think a DISI today, called with next(10) and then next(10) 
again, would stay on 10 - the second call would move to 11. But calling 
check(10) and then next(10) MUST NOT advance the DISI beyond 10. If the 
default impl in DISI just uses nextDoc() and returns true if the return value 
is the requested doc, we should be safe back-compat-wise, but this is still 
dangerous and we need clear documentation.

BTW, perhaps a testAndSet-like version can save check(10) followed by a 
next(10), and will fit nicer?
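A sketch of how such a back-compat-safe default might look (toy code; the DISI shape and the check() method itself are assumptions here, not an existing API):

```java
// Hypothetical sketch of the proposed check(): the default implementation
// just advances, so calling check(d) and then check(d) again must not
// skip past d. Forward-only: targets must not decrease.
abstract class DISI {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;
    protected int doc = -1; // current docID

    // Advance to the first doc >= target and return it.
    abstract int advance(int target);

    // Proposed: report whether 'target' matches, possibly without the full
    // work of advance(). Default impl falls back to advance().
    boolean check(int target) {
        if (doc < target) {
            doc = advance(target);
        }
        return doc == target;
    }
}

// Toy implementation over a sorted docID array.
class SortedDocsDISI extends DISI {
    private final int[] docs;
    private int idx = -1;

    SortedDocsDISI(int[] sortedDocs) { this.docs = sortedDocs; }

    int advance(int target) {
        while (++idx < docs.length) {
            if (docs[idx] >= target) return doc = docs[idx];
        }
        return doc = NO_MORE_DOCS;
    }
}
```

Note how the `doc < target` guard is what enforces the contract discussed above: a check() on a doc the iterator already sits on never moves it further.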

 Add next() and skipTo() variants to DocIdSetIterator that return the current 
 doc, instead of boolean
 

 Key: LUCENE-1614
 URL: https://issues.apache.org/jira/browse/LUCENE-1614
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
 Fix For: 2.9


 See 
 http://www.nabble.com/Another-possible-optimization---now-in-DocIdSetIterator-p23223319.html
  for the full discussion. The basic idea is to add variants to those two 
 methods that return the current doc they are at, to save successive calls to 
 doc(). If there are no more docs, return -1. A summary of what was discussed 
 so far:
 # Deprecate those two methods.
 # Add nextDoc() and skipToDoc(int) that return doc, with default impl in DISI 
 (calls next() and skipTo() respectively, and will be changed to abstract in 
 3.0).
 #* I actually would like to propose an alternative to the names: advance() 
 and advance(int) - the first advances by one, the second advances to target.
 # Wherever these are used, do something like '(doc = advance()) >= 0' instead 
 of comparing to -1 for improved performance.
 I will post a patch shortly

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean

2009-04-30 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704556#action_12704556
 ] 

Michael McCandless commented on LUCENE-1614:


bq. Just to clarify for myself, in the example I gave above, suppose that the 
scorer is on 3 and you call check(8).

On check(8), TermScorer would go to 10, stop there, and return false (it 
would not rewind to 3). Check can only be called with increasing arguments, so 
it's not truly random access - it's forward-only random access.

bq. You propose check() so that if a DISI can save any extra operations it 
does in next() (such as reading a payload, for example), it will do so. 
Therefore, in the example you gave above with CS, next()'s contract forces it 
to advance all the sub-scorers, but with check() it could stop in the middle.

Precisely.

This is important when you have a super-cheap iterator (say a somewhat sparse 
(<= 10%?) in-memory filter that's represented as a list of docIDs).  It's very 
fast for such a filter to iterate over its docIDs.  But when that iterator is 
AND'd with a Scorer, as is done today by IndexSearcher, they effectively play 
leap frog, where first it's the filter's turn to next(), then it's the 
Scorer's turn, etc.  But for the Scorer, next() can be extremely costly, only 
to find the filter doesn't accept it.  So for such situations it's better to 
let the filter drive the search, calling Scorer.check() on the docs.
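The leap-frog pattern described here can be sketched over two sorted docID streams (toy code standing in for the filter's iterator and the scorer, not the actual IndexSearcher loop):

```java
// Toy sketch of the leap-frog intersection IndexSearcher effectively
// performs between a filter's iterator and a scorer. Both sides are
// just sorted docID arrays here, not real Lucene objects.
import java.util.ArrayList;
import java.util.List;

class Leapfrog {
    static List<Integer> intersect(int[] filterDocs, int[] scorerDocs) {
        List<Integer> matches = new ArrayList<>();
        int i = 0, j = 0;
        while (i < filterDocs.length && j < scorerDocs.length) {
            if (filterDocs[i] == scorerDocs[j]) { // both agree: a match
                matches.add(filterDocs[i]);
                i++; j++;
            } else if (filterDocs[i] < scorerDocs[j]) {
                i++;  // filter's turn to catch up (cheap)
            } else {
                j++;  // scorer's turn to catch up (can be very costly)
            }
        }
        return matches;
    }
}
```

Each catch-up step on the scorer side corresponds to a full next() on the scorer; when the filter then rejects the doc, that work is wasted, which is the motivation for letting the sparse filter drive and merely check() the scorer instead.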

But... once we switch to filter-as-BooleanClause, it's less clear whether 
check() is worthwhile, because I think the filter's constraint is more 
efficiently taken into account.

For filters that support random access (if they are less sparse, say >= 25% or 
so), we should push them all the way down to the TermScorers and factor them in 
just like deletedDocs.

bq. If the default impl in DISI just uses nextDoc() and returns true if the 
return value is the requested doc, we should be safe back-compat-wise, but this 
is still dangerous and we need clear documentation.

Yes it does have a good default impl, I think.

bq. BTW, perhaps a testAndSet-like version can save check(10) followed by a 
next(10), and will fit nicer?

Not sure what you mean by testAndSet-like version?

 Add next() and skipTo() variants to DocIdSetIterator that return the current 
 doc, instead of boolean
 

 Key: LUCENE-1614
 URL: https://issues.apache.org/jira/browse/LUCENE-1614
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
 Fix For: 2.9


 See 
 http://www.nabble.com/Another-possible-optimization---now-in-DocIdSetIterator-p23223319.html
  for the full discussion. The basic idea is to add variants to those two 
 methods that return the current doc they are at, to save successive calls to 
 doc(). If there are no more docs, return -1. A summary of what was discussed 
 so far:
 # Deprecate those two methods.
 # Add nextDoc() and skipToDoc(int) that return doc, with default impl in DISI 
 (calls next() and skipTo() respectively, and will be changed to abstract in 
 3.0).
 #* I actually would like to propose an alternative to the names: advance() 
 and advance(int) - the first advances by one, the second advances to target.
 # Wherever these are used, do something like '(doc = advance()) >= 0' instead 
 of comparing to -1 for improved performance.
 I will post a patch shortly




What are we allowed to do in 3.0?

2009-04-30 Thread Shai Erera
Hi

Recently I was involved in several issues that required some runtime changes
to be made in 3.0, and it was not so clear what we're actually allowed to do
in 3.0. So I'd like to open it for discussion, unless everybody already
agrees on it.

So far I can tell that 3.0 allows us to:
1) Get rid of all the deprecated methods
2) Move to Java 1.5

But what about changes to runtime behavior, refactoring a whole set of
classes, etc.? I'd like to address them one by one, so that we can comment
on each.

Removing deprecated methods
-
As I was told, 2.9 is the last release in which we are allowed to mark methods
as deprecated and then remove them in 3.0. I.e., after 2.9 is out, if we feel
there is a method that should be renamed, have its signature changed, or be
removed altogether, we can't just do it - we'd have to deprecate it and remove
it in 4.0 (?). I personally thought that 2.9 allows us to make these changes
without letting anyone know about them in advance, which I'm ok with since
upgrading to 3.0 is not going to be as 'smooth', but I also understand why
'letting people know in advance' (which means a release prior to the one where
they are removed) gives better service to our users. I also thought that jar
drop-in-ability is not supposed to be supported from 2.9 to 3.0 (but I was
corrected previously on that).
This means that upon releasing 2.9, the very first issue opened and committed
should be removing the current deprecated methods; otherwise we could open an
issue that deprecates a method and accidentally remove it later, when we
handle the massive deprecation removal. We should also create a 2.9 tag.

Changes to runtime behavior
-
What exactly is the policy here? If we document in 2.9 that certain features'
runtime behavior will change in 3.0 - is it ok to make those changes? And if
we don't document them and just do it (in the transition from 2.9 to 3.0),
then it's not? Why? After all, I expect anyone who upgrades to 3.0 to run all
of his unit tests to assert that everything still works (I expect that for
every release, but for 3.0 in particular). Obviously the runtime behaviors
that were changed and documented in 2.9 are ones he might have already taken
care of, but why can't he do the same by reading the CHANGES of 3.0?
I just feel that this policy forces us to think really hard and foresee the
changes in runtime behavior that we'd like to make in 3.0 so that we can get
them into 2.9, but at the end of the day we're not improving anything. Upon
upgrading to 2.9 I cannot handle the changes in runtime behavior, as they
haven't been made yet; I can only do that after I upgrade to 3.0. So what's
the difference, for me, between fixing those that were documented in 2.9 and
the new ones that were just released?

Going forward, I don't think this community changes runtime behavior every
other Monday, and so I'd like to have the ability to make those changes
without such a strict policy. Those changes are usually meant to help our
users (and we are amongst them) achieve better performance, so why should we
fear making them - or, if fear is too strong a word, why should we refrain
from making them, while documenting the changes? If we don't want to do it for
every 'dot' release, we can do them in major releases, and I'd also vote for
doing them in a mid-major release, like 3.5.

Refactoring

Today we are quite limited with refactoring. We cannot add methods to
interfaces or abstract methods to abstract classes, or even make classes
abstract. I'm perfectly fine with that as I don't want to face the need to
suddenly refactor my application just because Lucene decided to add a method
to an interface.

But recently, while working on LUCENE-1593, Mike and I spotted a need to add
some methods to Weight, but since it is an interface we can't. So I said
something like 'let's do it in 3.0', but then we were not sure if this can be
done in 3.0. The alternative was 'let's deprecate Weight, create an
AbstractWeight class and do it there', but we weren't sure if that's even
something we can push for in 3.0, unless we do all of it in 2.9. This also
messes up the code, introducing new classes with bad names (AbstractWeight,
AbstractSearchable) where we could have avoided it if we just changed Weight
to an abstract class in 3.0.

---

I think it all boils down to whether we MUST support jar drop-in-ability
when upgrading from 2.9 to 3.0. I think we shouldn't, as the whole notion of
3.0 (or any future major version) is a major revision to code, index
structure, JDK, etc. If we're always expected to support it, then 2.9
effectively becomes 3.0 in terms of our ability to make API changes between
2.9 and 3.0. I'm afraid that if that's the case, we might choose to hold back
2.9 as long as we can so we can push as many changes as we foresee into it,
so that they can be finalized in 3.0. I'm not 

[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector

2009-04-30 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704559#action_12704559
 ] 

Michael McCandless commented on LUCENE-1593:


Another thing we should improve about the Scorer API:

Enrich the Scorer API to optionally provide more details on the positions
that caused a match to occur.

This would improve highlighting (LUCENE-1522) since we'd know exactly
why a match occurred (single source) rather than trying to
reverse-engineer the match.

It'd also address a number of requests over time by users for 'how can
I get details on why this doc matched?'.

I *think* if we did this, the *SpanQuery classes would be able to share much
more with their normal counterparts; this was discussed at
http://www.nabble.com/Re%3A-Make-TermScorer-non-final-p22577575.html.
I.e., we would have a single TermQuery, just as efficient as the one
today, but it would expose a getMatches() (say) that enumerates all
positions that matched.

Then, if one wanted these details for every hit in the topN, we could
make an IndexReader impl that wraps TermVectors for the docs in the topN
(since TermVectors are basically a single-doc inverted index), run the
query on it, and request the match details per doc.


 Optimizations to TopScoreDocCollector and TopFieldCollector
 ---

 Key: LUCENE-1593
 URL: https://issues.apache.org/jira/browse/LUCENE-1593
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
 Fix For: 2.9

 Attachments: LUCENE-1593.patch, PerfTest.java


 This is a spin-off of LUCENE-1575 and proposes to optimize TSDC and TFC code 
 to remove unnecessary checks. The plan is:
 # Ensure that IndexSearcher returns segments in increasing doc Id order, 
 instead of numDocs().
 # Change TSDC and TFC's code to not use the doc id as a tie breaker. New docs 
 will always have larger ids and therefore cannot compete.
 # Pre-populate HitQueue with sentinel values in TSDC (score = Float.NEG_INF) 
 and remove the check if reusableSD == null.
 # Also move to use a 'changing top' and then call adjustTop(), in case we 
 update the queue.
 # some methods in Sort explicitly add SortField.FIELD_DOC as a tie breaker 
 for the last SortField. But, doing so should not be necessary (since we 
 already break ties by docID), and is in fact less efficient (once the above 
 optimization is in).
 # Investigate PQ - can we deprecate insert() and have only 
 insertWithOverflow()? Add a addDummyObjects method which will populate the 
 queue without arranging it, just store the objects in the array (this can 
 be used to pre-populate sentinel values)?
 I will post a patch as well as some perf measurements as soon as I have them.
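The sentinel idea in step 3 of the plan above can be sketched with a toy queue (the real HitQueue and TSDC code differ; this only illustrates why the null/size checks disappear):

```java
// Toy sketch of pre-populating a hit queue with sentinel entries so the
// collect loop never needs a null/size check: the queue always holds
// exactly numHits entries, and real hits simply displace sentinels.
import java.util.PriorityQueue;

class SentinelQueue {
    // Each entry is a {score, doc} pair; sentinels score -infinity.
    static final float SENTINEL = Float.NEGATIVE_INFINITY;
    final PriorityQueue<float[]> pq; // min-heap ordered by score

    SentinelQueue(int numHits) {
        pq = new PriorityQueue<>(numHits, (x, y) -> Float.compare(x[0], y[0]));
        for (int i = 0; i < numHits; i++) {
            pq.add(new float[] { SENTINEL, -1 }); // pre-populate
        }
    }

    // No null/empty checks needed: just compare against the worst entry.
    void collect(int doc, float score) {
        if (score > pq.peek()[0]) {
            pq.poll();
            pq.add(new float[] { score, doc });
        }
    }
}
```

Because the queue is always full, the per-hit branch reduces to a single comparison against the current worst score, which is the saving the quoted plan is after.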




Re: Build failed in Hudson: Lucene-trunk #812

2009-04-30 Thread Michael McCandless
This was a false failure.

There's a timed test in contrib/benchmark that apparently can fail if
the machine happens to be swamped at the time.

I'll work out a more robust test.

Mike

On Wed, Apr 29, 2009 at 10:41 PM, Apache Hudson Server
hud...@hudson.zones.apache.org wrote:
 See http://hudson.zones.apache.org/hudson/job/Lucene-trunk/812/changes

 Changes:

 [pjaol] Fixed bug caused by multiSegmentIndexReader

 --
 [...truncated 11293 lines...]
 init:

 test:
     [echo] Building swing...

 javacc-uptodate-check:

 javacc-notice:

 jflex-uptodate-check:

 jflex-notice:

 common.init:

 build-lucene:

 build-lucene-tests:

 init:

 compile-test:
     [echo] Building swing...

 javacc-uptodate-check:

 javacc-notice:

 jflex-uptodate-check:

 jflex-notice:

 common.init:

 build-lucene:

 build-lucene-tests:

 init:

 clover.setup:

 clover.info:

 clover:

 compile-core:

 common.compile-test:

 common.test:
    [mkdir] Created dir: 
 http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/swing/test
    [junit] Testsuite: org.apache.lucene.swing.models.TestBasicList
    [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.555 sec
    [junit]
    [junit] Testsuite: org.apache.lucene.swing.models.TestBasicTable
    [junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 0.584 sec
    [junit]
    [junit] Testsuite: org.apache.lucene.swing.models.TestSearchingList
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.627 sec
    [junit]
    [junit] Testsuite: org.apache.lucene.swing.models.TestSearchingTable
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.637 sec
    [junit]
    [junit] Testsuite: org.apache.lucene.swing.models.TestUpdatingList
    [junit] Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 0.73 sec
    [junit]
    [junit] Testsuite: org.apache.lucene.swing.models.TestUpdatingTable
    [junit] Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 1.007 sec
    [junit]
   [delete] Deleting: 
 http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/swing/test/junitfailed.flag
     [echo] Building wikipedia...

 javacc-uptodate-check:

 javacc-notice:

 jflex-uptodate-check:

 jflex-notice:

 common.init:

 build-lucene:

 build-lucene-tests:

 init:

 test:
     [echo] Building wikipedia...

 javacc-uptodate-check:

 javacc-notice:

 jflex-uptodate-check:

 jflex-notice:

 common.init:

 build-lucene:

 build-lucene-tests:

 init:

 compile-test:
     [echo] Building wikipedia...

 javacc-uptodate-check:

 javacc-notice:

 jflex-uptodate-check:

 jflex-notice:

 common.init:

 build-lucene:

 build-lucene-tests:

 init:

 clover.setup:

 clover.info:

 clover:

 compile-core:

 common.compile-test:

 common.test:
    [mkdir] Created dir: 
 http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/wikipedia/test
    [junit] Testsuite: 
 org.apache.lucene.wikipedia.analysis.WikipediaTokenizerTest
    [junit] Tests run: 5, Failures: 0, Errors: 0, Time elapsed: 0.356 sec
    [junit]
   [delete] Deleting: 
 http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/wikipedia/test/junitfailed.flag
     [echo] Building wordnet...

 javacc-uptodate-check:

 javacc-notice:

 jflex-uptodate-check:

 jflex-notice:

 common.init:

 build-lucene:

 build-lucene-tests:

 init:

 test:
     [echo] Building xml-query-parser...

 javacc-uptodate-check:

 javacc-notice:

 jflex-uptodate-check:

 jflex-notice:

 common.init:

 build-lucene:

 build-lucene-tests:

 init:

 test:
     [echo] Building xml-query-parser...

 javacc-uptodate-check:

 javacc-notice:

 jflex-uptodate-check:

 jflex-notice:

 common.init:

 build-lucene:

 build-lucene-tests:

 init:

 compile-test:
     [echo] Building xml-query-parser...

 build-queries:

 javacc-uptodate-check:

 javacc-notice:

 jflex-uptodate-check:

 jflex-notice:

 common.init:

 build-lucene:

 build-lucene-tests:

 init:

 clover.setup:

 clover.info:

 clover:

 common.compile-core:

 compile-core:

 common.compile-test:

 common.test:
    [mkdir] Created dir: 
 http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/xml-query-parser/test
    [junit] Testsuite: org.apache.lucene.xmlparser.TestParser
    [junit] Tests run: 18, Failures: 0, Errors: 0, Time elapsed: 2.215 sec
    [junit]
    [junit] Testsuite: org.apache.lucene.xmlparser.TestQueryTemplateManager
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 1.463 sec
    [junit]
   [delete] Deleting: 
 http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/xml-query-parser/test/junitfailed.flag

 BUILD FAILED
 http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build.xml:649: Contrib tests failed!

 Total time: 20 minutes 45 seconds
 Publishing Javadoc
 Recording test results
 Publishing Clover coverage report...


 

[jira] Commented: (LUCENE-1518) Merge Query and Filter classes

2009-04-30 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704561#action_12704561
 ] 

Eks Dev commented on LUCENE-1518:
-

imo, it is really not all that important to make Filter and Query the same 
(that is just one alternative for achieving the goal). 

The basic problem we are trying to solve is adding a Filter directly to a 
BooleanQuery, and making optimizations after that easier. Wrapping with CSQ 
just adds another layer between Lucene's search machinery and the Filter, 
making these optimizations harder.

On the other hand, I must accept that conceptually Filter and Query are the 
same, together supporting the following options:
1. Pure boolean model: you do not care about scores (today we can do it only 
via CSQ, as Filter does not enter BooleanQuery)
2. Mixed boolean and ranked: you have to define the Filter's contribution to 
the documents (CSQ)
3. Pure ranked: no filters, everything gets scored (the same as 2.)

Ideally, as a user, I define only a Query (Filter-based or not) and for each 
clause in my Query specify 
Query.setScored(true/false) or useConstantScore(double score); 

Also, I should be able to say: 'Dear Lucene, please materialize this 
Query/Filter for me, as I would like to have it cached, and please store only 
docIds' (a Filter today). Maybe also open the possibility of caching the 
scores of the documents as well. 

One thing is the concept, and another is optimization. From an optimization 
point of view, we have a couple of decisions to make:

- Does the DocIdSet support random access, yes or no (my Materialized Query)?
- Should a clause be scored, not scored, or given a constant score?

So, for each Query we need to decide/support:

- scoring {yes, no, constant}, and 
- the option to materialize a Query (that is how we create Filters today)
- these Materialized Queries (aka Filters) should be able to tell us whether 
they support random access, and whether they cache only doc ids or scores as 
well


Nothing useful in this email, just thinking aloud - sometimes it helps :)
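The per-clause choice floated in this comment could hypothetically look like the following (purely illustrative sketch; none of these names exist in Lucene):

```java
// Purely hypothetical sketch of the per-clause scoring choice floated
// above: each clause is either fully scored, a pure boolean filter, or
// a constant-score contributor. Nothing like this exists in Lucene.
class ClauseScoring {
    enum Mode { SCORED, NOT_SCORED, CONSTANT }

    final Mode mode;
    final float constantScore; // only meaningful for CONSTANT

    private ClauseScoring(Mode mode, float constantScore) {
        this.mode = mode;
        this.constantScore = constantScore;
    }

    static ClauseScoring scored()          { return new ClauseScoring(Mode.SCORED, 0f); }
    static ClauseScoring notScored()       { return new ClauseScoring(Mode.NOT_SCORED, 0f); }
    static ClauseScoring constant(float s) { return new ClauseScoring(Mode.CONSTANT, s); }

    // How such a clause would contribute to a document's score.
    float contribution(float rawScore) {
        switch (mode) {
            case SCORED:   return rawScore;      // pure ranked
            case CONSTANT: return constantScore; // mixed boolean and ranked
            default:       return 0f;            // pure boolean: filter only
        }
    }
}
```

The three modes line up with the three options listed in the comment: pure boolean, mixed, and pure ranked.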






 Merge Query and Filter classes
 --

 Key: LUCENE-1518
 URL: https://issues.apache.org/jira/browse/LUCENE-1518
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.4
Reporter: Uwe Schindler
 Fix For: 2.9

 Attachments: LUCENE-1518.patch


 This issue presents a patch that merges Queries and Filters in such a way 
 that the new Filter class extends Query. This would make it possible to use 
 every filter as a query.
 The new abstract Filter class would contain all methods of 
 ConstantScoreQuery, deprecating ConstantScoreQuery. If somebody implements 
 the Filter's getDocIdSet()/bits() methods, he has nothing more to do; he 
 could just use the filter as a normal query.
 I do not want to completely convert Filters to ConstantScoreQueries. The idea 
 is to combine Queries and Filters in such a way that every Filter can 
 automatically be used in all places where a Query can be used (e.g. also 
 alone as a search query without any other constraint). For that, the abstract 
 Query methods must be implemented and return a default Weight for Filters, 
 which is the current ConstantScoreQuery logic. If the filter is used as a 
 real filter (where the API wants a Filter), the getDocIdSet part could be 
 used directly, and the weight is useless (as it is currently, too). The 
 constant score default implementation is only used when the Filter is used 
 as a Query (e.g. as a direct parameter to Searcher.search()). For the 
 special case of BooleanQueries combining Filters and Queries, the idea is to 
 optimize the BooleanQuery logic in such a way that it detects if a 
 BooleanClause is a Filter (using instanceof) and then directly uses the 
 Filter API rather than taking on the burden of the ConstantScoreQuery (see 
 LUCENE-1345).
 Here are some ideas how to implement Searcher.search() with Query and Filter:
 - The user runs Searcher.search() using a Filter as the only parameter. As 
 every Filter is also a ConstantScoreQuery, the query can be executed and 
 returns score 1.0 for all matching documents.
 - The user runs Searcher.search() using a Query as the only parameter: no 
 change, all is the same as before.
 - The user runs Searcher.search() using a BooleanQuery as parameter: if the 
 BooleanQuery does not contain a Query that is a subclass of Filter (the new 
 Filter), everything is as usual. If the BooleanQuery contains exactly one 
 Filter and nothing else, the Filter is used as a constant score query. If 
 the BooleanQuery contains clauses with Queries and Filters, the new 
 algorithm could be used: the queries are executed and the results filtered 
 with the filters.
 For the user this has the main advantage that he can construct his query 
 using a simplified API without thinking about Filters or Queries; you can 
 just combine clauses 

[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean

2009-04-30 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704566#action_12704566
 ] 

Shai Erera commented on LUCENE-1614:


bq. Not sure what you mean by testAndSet-like version?

I mean, instead of having the code call check(8), get true and then call 
advance(8), just call checkAndAdvance(8), which returns true if 8 matches and 
false otherwise, AND moves to 8. I don't propose replacing check() with it, as 
sometimes you might want to check a couple of DISIs before deciding which doc 
to advance to, but it could save calling advance() when check() returns true.

bq. Yes it does have a good default impl, I think.

It _will_ have a good default impl, I can guarantee to try :). What I meant is 
that we should have clear documentation about check() and nextDoc(), and about 
the possibility that check() will be called for doc Id 'X' and later nextDoc() 
or advance() will be called with 'X'; in that case the impl must ensure 'X' is 
not skipped, as TermScorer does today, for example.

So should I add this check()?
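A toy sketch of what checkAndAdvance() might look like (a hypothetical method from this comment, not an existing API):

```java
// Hypothetical sketch of checkAndAdvance(): like check(), but it also
// leaves the iterator positioned on the first doc >= target, saving the
// follow-up advance() call when the target matches.
class SimpleDISI {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;
    private final int[] docs; // sorted docIDs
    private int idx = -1;
    int doc = -1; // current docID

    SimpleDISI(int[] sortedDocs) { this.docs = sortedDocs; }

    int advance(int target) {
        while (++idx < docs.length) {
            if (docs[idx] >= target) return doc = docs[idx];
        }
        return doc = NO_MORE_DOCS;
    }

    // Returns true iff 'target' matches; either way the iterator ends up
    // on the first doc >= target, so no second call is needed.
    boolean checkAndAdvance(int target) {
        if (doc < target) advance(target);
        return doc == target;
    }
}
```

The caller that previously did `if (disi.check(8)) disi.advance(8);` would collapse to a single `disi.checkAndAdvance(8)` call, which is the saving described above.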

 Add next() and skipTo() variants to DocIdSetIterator that return the current 
 doc, instead of boolean
 

 Key: LUCENE-1614
 URL: https://issues.apache.org/jira/browse/LUCENE-1614
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
 Fix For: 2.9


 See 
 http://www.nabble.com/Another-possible-optimization---now-in-DocIdSetIterator-p23223319.html
  for the full discussion. The basic idea is to add variants to those two 
 methods that return the current doc they are at, to save successive calls to 
 doc(). If there are no more docs, return -1. A summary of what was discussed 
 so far:
 # Deprecate those two methods.
 # Add nextDoc() and skipToDoc(int) that return doc, with default impl in DISI 
 (calls next() and skipTo() respectively, and will be changed to abstract in 
 3.0).
 #* I actually would like to propose an alternative to the names: advance() 
 and advance(int) - the first advances by one, the second advances to target.
 # Wherever these are used, do something like '(doc = advance()) >= 0' instead 
 of comparing to -1 for improved performance.
 I will post a patch shortly




R-trees in Lucene for spatio-textual search

2009-04-30 Thread Mukherjee, Prasenjit
Hi, 
   Has anybody recently used Lucene to implement R-trees for
range-queries? I came across the GeoLucene project, but I am not sure how
stable/efficient it is for production use. Any pointers in this
direction would be great. 
 
Thanks
-P


[jira] Updated: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector

2009-04-30 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-1593:
---

Attachment: LUCENE-1593.patch

Patch includes:
# Added a new scoresDocsInOrder method to Query
#* Defaults to false
#* Overridden in extensions to return true, except in BQ, which still returns 
false until we resolve how BQ is used explicitly (top-score vs. not). In 
queries that delegate the work, I used the delegate's value, or returned true 
only if all sub-queries return true.
# Changed TopFieldCollector and TopScoreDocCollector to take a 
docsScoredInOrder parameter and create the appropriate instance (breaking ties 
by doc Id or not).
# Added TestTopScoreDocCollector and a test case to TestSort which test 
out-of-order collection (they trigger the use of BooleanScorer, though whether 
document collection happens truly out of order I cannot tell).
# Updates to CHANGES

All tests pass, including test-tag. BTW, the patch also includes the fix to 
TestSort on the tag, but without the fix for MultiSearcher and 
ParallelMultiSearcher on the tag, as I'm not sure if we should back-port that 
fix as well.

 Optimizations to TopScoreDocCollector and TopFieldCollector
 ---

 Key: LUCENE-1593
 URL: https://issues.apache.org/jira/browse/LUCENE-1593
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
 Fix For: 2.9

 Attachments: LUCENE-1593.patch, LUCENE-1593.patch, PerfTest.java


 This is a spin-off of LUCENE-1575 and proposes to optimize TSDC and TFC code 
 to remove unnecessary checks. The plan is:
 # Ensure that IndexSearcher returns segments in increasing doc Id order, 
 instead of numDocs().
 # Change TSDC and TFC's code to not use the doc id as a tie breaker. New docs 
 will always have larger ids and therefore cannot compete.
 # Pre-populate HitQueue with sentinel values in TSDC (score = 
 Float.NEGATIVE_INFINITY) and remove the check if reusableSD == null.
 # Also move to the "change the top and then call adjustTop()" approach, in 
 case we update the queue.
 # some methods in Sort explicitly add SortField.FIELD_DOC as a tie breaker 
 for the last SortField. But, doing so should not be necessary (since we 
 already break ties by docID), and is in fact less efficient (once the above 
 optimization is in).
 # Investigate PQ - can we deprecate insert() and have only 
 insertWithOverflow()? Add an addDummyObjects method which will populate the 
 queue without arranging it, just store the objects in the array (this can 
 be used to pre-populate sentinel values)?
 I will post a patch as well as some perf measurements as soon as I have them.
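The sentinel idea in step 3 above can be illustrated with java.util.PriorityQueue standing in for Lucene's HitQueue (the real HitQueue is a custom heap; this is only a sketch of the pre-population trick):

```java
import java.util.PriorityQueue;

// Sentinel pre-population sketch: seed a min-heap of size numHits with
// Float.NEGATIVE_INFINITY so the hot collect() path needs no null or
// size checks - every slot always holds a comparable score.
public class TopScores {
    private final PriorityQueue<Float> pq = new PriorityQueue<>();

    public TopScores(int numHits) {
        for (int i = 0; i < numHits; i++) {
            pq.add(Float.NEGATIVE_INFINITY); // sentinel: loses to any real score
        }
    }

    // Hot path: one comparison against the current bottom of the queue.
    public void collect(float score) {
        if (score > pq.peek()) {
            pq.poll();
            pq.add(score);
        }
    }

    public float bottom() { return pq.peek(); }
}
```

Because the queue is always full, the collector never branches on whether a slot is empty, which is exactly the check the patch removes.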




[jira] Commented: (LUCENE-1536) if a filter can support random access API, we should use it

2009-04-30 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704590#action_12704590
 ] 

Yonik Seeley commented on LUCENE-1536:
--

Interesting stuff!
Has anyone tested whether this results in a performance degradation on 
SegmentTermDocs?
This is very inner-loop stuff, and it's replacing a non-virtual 
BitVector.get(), which can be easily inlined, with two dispatches through base 
classes.  Hopefully hotspot can handle it, but it's tough to figure out, 
especially in a real system where sometimes a user RAF will be used and 
sometimes not.
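The two call shapes under discussion can be sketched as follows. The types here (Bits, BitVector) are simplified stand-ins, and actual inlining behavior depends entirely on HotSpot and on how many concrete types the JIT observes at the call site; this only illustrates the structural difference:

```java
// Monomorphic vs. virtual dispatch shapes for an inner-loop filter check.
public class DispatchShapes {
    interface Bits {                 // stand-in for a random-access filter API
        boolean get(int index);
    }

    // Stand-in for the concrete BitVector used for deleted docs today.
    static final class BitVector implements Bits {
        private final boolean[] bits;
        BitVector(boolean[] bits) { this.bits = bits; }
        public boolean get(int index) { return bits[index]; }
    }

    // Concrete type known at the call site: the call is trivially inlinable.
    static int countDirect(BitVector v, int n) {
        int c = 0;
        for (int i = 0; i < n; i++) if (v.get(i)) c++;
        return c;
    }

    // Dispatched through the interface: may remain a real virtual call if the
    // site is megamorphic (e.g. sometimes a user filter, sometimes not).
    static int countVirtual(Bits v, int n) {
        int c = 0;
        for (int i = 0; i < n; i++) if (v.get(i)) c++;
        return c;
    }
}
```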

 if a filter can support random access API, we should use it
 ---

 Key: LUCENE-1536
 URL: https://issues.apache.org/jira/browse/LUCENE-1536
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch


 I ran some performance tests, comparing applying a filter via
 random-access API instead of current trunk's iterator API.
 This was inspired by LUCENE-1476, where we realized deletions should
 really be implemented just like a filter, but then in testing found
 that switching deletions to iterator was a very sizable performance
 hit.
 Some notes on the test:
   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
   * I test across multiple queries.  "1-X" means an OR query, e.g. "1-4"
 means "1 OR 2 OR 3 OR 4", whereas "+1-4" is an AND query, i.e. "1 AND 2
 AND 3 AND 4".  "u s" means "united states" (phrase search).
   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
 95, 98, 99, 99.9 (filter is non-null but all bits are set),
 100 (filter=null, control)).
   * Method "high" means I use the random-access filter API in
 IndexSearcher's main loop.  Method "low" means I use the random-access
 filter API down in SegmentTermDocs (just like deleted docs
 today).
   * Baseline (QPS) is current trunk, where the filter is applied as an 
 iterator up high (i.e. in IndexSearcher's search loop).




[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector

2009-04-30 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704594#action_12704594
 ] 

Michael McCandless commented on LUCENE-1593:


bq. Not in this issue though, right?

Right: I'm back into the mode of throwing out all future improvements I know 
of, to help guide us in picking the right next step.  These would all be done 
in separate issues, and many of them would not be done today but still we 
should try not to preclude them for tomorrow.

{quote}
I like the idea of having Scorer be able to tell why a doc was matched. But I 
think we should make sure that if a user is not interested in this information, 
then he should not incur any overhead by it, such as aggregating information 
in-memory or doing any extra computations. Something like we've done for 
TopFieldCollector with tracking document scores and maxScore.
{quote}

Exactly, and I think/hope this'd be achievable.

 Optimizations to TopScoreDocCollector and TopFieldCollector
 ---

 Key: LUCENE-1593
 URL: https://issues.apache.org/jira/browse/LUCENE-1593
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
 Fix For: 2.9

 Attachments: LUCENE-1593.patch, LUCENE-1593.patch, PerfTest.java






RE: R-trees in Lucene for spatio-textual search

2009-04-30 Thread Uwe Schindler
R-trees are for spatial queries (two dimensions). If you want the same for
one-dimensional range queries, use TrieRangeQuery (see 2.9's contrib queries
package; it may move to core as a NumericRangeQuery etc.), which is very
stable and has been tested for years, and is now being included in Lucene.

 

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


From: Mukherjee, Prasenjit [mailto:p.mukher...@corp.aol.com] 
Sent: Thursday, April 30, 2009 12:43 PM
To: java-dev@lucene.apache.org
Subject: R-trees in Lucene for spatio-textual search

 

Hi, 

   Has anybody recently used lucene to implement R-trees for range-queries.
I came across the GeoLucene project but not sure how stable/efficient it is
to use it for production. Any pointers in this direction will be great. 

 

Thanks

-P



[jira] Commented: (LUCENE-1518) Merge Query and Filter classes

2009-04-30 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704597#action_12704597
 ] 

Shai Erera commented on LUCENE-1518:


bq. Wrapping with CSQ is just adding another layer between Lucene search 
machinery and Filter, making these optimizations harder.

Right. But making Filter subclass Query and checking in BQ 'if (query 
instanceof Filter) { Filter f = (Filter) query; }' is not going to improve 
anything. It adds instanceof checks and casting, and I'd think those are more 
expensive than wrapping a Filter with CSQ and returning an appropriate Scorer, 
which will use the Filter in its next() and skipTo() calls.

bq. On the other hand, I must accept, conceptually Filter and Query are the 
same, supporting together the following options

I think that if we allow BooleanClause to implement a weight(IndexReader) 
method (just like Query), we'll be one step closer to that goal. BQ uses this 
method to construct BooleanWeight; today it calls 
clause.getQuery().createWeight(). Instead it could call clause.getWeight(), 
and if the BooleanClause holds a Filter it will return a FilterWeight, 
otherwise delegate the call to the contained Query.

Regarding pure ranked, CSQ is really what we need, no?

So how about the following:
# Add add(Filter, Occur) to BooleanClause.
# Add weight(Searcher) to BooleanClause.
# Create a FilterWeight which wraps a Filter and provides a Scorer 
implementation with a constant score. (This does not handle the no-scoring 
mode, unless no scoring can be achieved with score=0.0f while the constant is 
any other value, defaulting to 1.0f.)
# Add isRandomAccess to Filter.
# Create a RandomAccessFilter which extends Filter and defines an additional 
seek(target) method.
# Add asRandomAccessFilter() to Filter, which will materialize that Filter 
into memory, or into another random-access data structure (e.g. keeping it on 
disk but still providing random access to it, even if not very efficiently), 
and return a RandomAccessFilter, which will implement seek(target) and possibly 
override next() and skipTo(), but still use whatever other methods this Filter 
declares.
#* I think we should default it to throw UOE, provided that we document that 
isRandomAccess should be called first.

I'm thinking out loud just like you, so I hope my stuff makes sense :).
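A minimal sketch of the clause-level weight idea, using simplified hypothetical stand-ins for Weight, Query, Filter, and BooleanClause (none of these signatures are the real Lucene APIs; FilterWeight and clause.weight() are names taken from the proposal above):

```java
// A clause that holds either a query or a filter and hands back the right
// weight itself, so the BooleanQuery machinery never needs instanceof checks.
public class ClauseSketch {
    interface Weight { float score(int doc); }

    interface Query { Weight createWeight(); }

    interface Filter { boolean accepts(int doc); }

    // FilterWeight from the proposal: a constant score for accepted docs.
    static final class FilterWeight implements Weight {
        private final Filter filter;
        private final float constant;
        FilterWeight(Filter filter, float constant) {
            this.filter = filter;
            this.constant = constant;
        }
        public float score(int doc) { return filter.accepts(doc) ? constant : 0f; }
    }

    static final class Clause {
        private final Query query;   // exactly one of these is non-null
        private final Filter filter;
        Clause(Query q) { this.query = q; this.filter = null; }
        Clause(Filter f) { this.query = null; this.filter = f; }

        // The clause decides: FilterWeight for filters, delegate for queries.
        Weight weight() {
            return filter != null ? new FilterWeight(filter, 1.0f)
                                  : query.createWeight();
        }
    }
}
```

BooleanWeight would then iterate clauses calling weight() uniformly, instead of calling clause.getQuery().createWeight() and special-casing filters.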

 Merge Query and Filter classes
 --

 Key: LUCENE-1518
 URL: https://issues.apache.org/jira/browse/LUCENE-1518
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.4
Reporter: Uwe Schindler
 Fix For: 2.9

 Attachments: LUCENE-1518.patch


 This issue presents a patch, that merges Queries and Filters in a way, that 
 the new Filter class extends Query. This would make it possible, to use every 
 filter as a query.
 The new abstract filter class would contain all methods of 
 ConstantScoreQuery, deprecate ConstantScoreQuery. If somebody implements the 
 Filter's getDocIdSet()/bits() methods he has nothing more to do, he could 
 just use the filter as a normal query.
 I do not want to completely convert Filters to ConstantScoreQueries. The idea 
 is to combine Queries and Filters in such a way, that every Filter can 
 automatically be used at all places where a Query can be used (e.g. also 
 alone a search query without any other constraint). For that, the abstract 
 Query methods must be implemented and return a default weight for Filters 
 which is the current ConstantScore Logic. If the filter is used as a real 
 filter (where the API wants a Filter), the getDocIdSet part could be directly 
 used, the weight is useless (as it is currently, too). The constant score 
 default implementation is only used when the Filter is used as a Query (e.g. 
 as direct parameter to Searcher.search()). For the special case of 
 BooleanQueries combining Filters and Queries the idea is, to optimize the 
 BooleanQuery logic in such a way, that it detects if a BooleanClause is a 
 Filter (using instanceof) and then directly uses the Filter API and not take 
 the burden of the ConstantScoreQuery (see LUCENE-1345).
 Here some ideas how to implement Searcher.search() with Query and Filter:
 - User runs Searcher.search() using a Filter as the only parameter. As every 
 Filter is also a ConstantScoreQuery, the query can be executed and returns 
 score 1.0 for all matching documents.
 - User runs Searcher.search() using a Query as the only parameter: No change, 
 all is the same as before
 - User runs Searcher.search() using a BooleanQuery as parameter: If the 
 BooleanQuery does not contain a Query that is subclass of Filter (the new 
 Filter) everything as usual. If the BooleanQuery only contains exactly one 
 Filter and nothing else the Filter is used as a constant score query. If 
 BooleanQuery contains 

[jira] Commented: (LUCENE-1518) Merge Query and Filter classes

2009-04-30 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704605#action_12704605
 ] 

Paul Elschot commented on LUCENE-1518:
--

How about materializing the DocIds _and_ the score values?

 Merge Query and Filter classes
 --

 Key: LUCENE-1518
 URL: https://issues.apache.org/jira/browse/LUCENE-1518
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.4
Reporter: Uwe Schindler
 Fix For: 2.9

 Attachments: LUCENE-1518.patch


 This issue presents a patch that merges Queries and Filters in such a way 
 that the new Filter class extends Query. This would make it possible to use 
 every filter as a query.
 The new abstract Filter class would contain all methods of 
 ConstantScoreQuery and deprecate ConstantScoreQuery. If somebody implements 
 the Filter's getDocIdSet()/bits() methods, he has nothing more to do and can 
 just use the filter as a normal query.
 I do not want to completely convert Filters to ConstantScoreQueries. The idea 
 is to combine Queries and Filters in such a way, that every Filter can 
 automatically be used at all places where a Query can be used (e.g. also 
 alone a search query without any other constraint). For that, the abstract 
 Query methods must be implemented and return a default weight for Filters 
 which is the current ConstantScore Logic. If the filter is used as a real 
 filter (where the API wants a Filter), the getDocIdSet part could be directly 
 used, the weight is useless (as it is currently, too). The constant score 
 default implementation is only used when the Filter is used as a Query (e.g. 
 as direct parameter to Searcher.search()). For the special case of 
 BooleanQueries combining Filters and Queries the idea is, to optimize the 
 BooleanQuery logic in such a way, that it detects if a BooleanClause is a 
 Filter (using instanceof) and then directly uses the Filter API and not take 
 the burden of the ConstantScoreQuery (see LUCENE-1345).
 Here some ideas how to implement Searcher.search() with Query and Filter:
 - User runs Searcher.search() using a Filter as the only parameter. As every 
 Filter is also a ConstantScoreQuery, the query can be executed and returns 
 score 1.0 for all matching documents.
 - User runs Searcher.search() using a Query as the only parameter: No change, 
 all is the same as before
 - User runs Searcher.search() using a BooleanQuery as parameter: If the 
 BooleanQuery does not contain a Query that is subclass of Filter (the new 
 Filter) everything as usual. If the BooleanQuery only contains exactly one 
 Filter and nothing else the Filter is used as a constant score query. If 
 BooleanQuery contains clauses with Queries and Filters the new algorithm 
 could be used: The queries are executed and the results filtered with the 
 filters.
 For the user this has the main advantage that he can construct his query 
 using a simplified API without thinking about Filters or Queries; you can 
 just combine clauses together. The scorer/weight logic then identifies the 
 cases to use the filter or the query weight API, just like the query 
 optimizer of an RDB.




[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector

2009-04-30 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704608#action_12704608
 ] 

Shai Erera commented on LUCENE-1593:


bq. BTW, I wonder if instead of Query.scoresDocsInOrder we should allow one 
to ask the Query for either/or? 

I'm afraid this might mean a larger change. What will TermQuery do? Today it 
returns true, and does not have any implementation that can return docs 
out-of-order. So what should TQ do when outOfOrderScorer is called - just 
return what inOrderScorer returns, or throw an exception?

That there might be a Collector out there that requires docs in order is not 
something I think we should handle. The reason is, there wasn't any guarantee 
until today that docs are returned in order, so how could someone have written 
a Collector with a hard assumption on that? Maybe only if he used a Query 
which he knows always scores in order, such as TQ - but then I don't think 
that user will have a problem, since TQ returns true.

And if that someone needs docs in order, but the query at hand returns docs 
out of order, then I'd say tough luck :). I mean, maybe with BQ we can ensure 
in/out of order on request, but if there is a query which returns docs in 
random order, or based on other criteria which cause it to return out of 
order, what good will forcing it to return docs in order do? I'd say you 
should just use a different query in that case.

bq. But I'm not sure in practice when one would want to use an out-of-order 
non-top iterator.

I agree. I think iteration on Scorer is dictated to be in order because it 
extends DISI with next() and skipTo() methods, which don't imply in any way 
that they can return something out of order (besides next() maybe, but it 
would be hard to use such a next() with skipTo()).

 Optimizations to TopScoreDocCollector and TopFieldCollector
 ---

 Key: LUCENE-1593
 URL: https://issues.apache.org/jira/browse/LUCENE-1593
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
 Fix For: 2.9

 Attachments: LUCENE-1593.patch, LUCENE-1593.patch, PerfTest.java






[jira] Issue Comment Edited: (LUCENE-1518) Merge Query and Filter classes

2009-04-30 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704605#action_12704605
 ] 

Paul Elschot edited comment on LUCENE-1518 at 4/30/09 5:19 AM:
---

bq. opening option to materialize Query (that is how we create Filters 
today)

How about materializing the DocIds _and_ the score values?


  was (Author: paul.elsc...@xs4all.nl):
How about materializing the DocIds _and_ the score values?
  
 Merge Query and Filter classes
 --

 Key: LUCENE-1518
 URL: https://issues.apache.org/jira/browse/LUCENE-1518
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.4
Reporter: Uwe Schindler
 Fix For: 2.9

 Attachments: LUCENE-1518.patch






[jira] Commented: (LUCENE-1518) Merge Query and Filter classes

2009-04-30 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704600#action_12704600
 ] 

Michael McCandless commented on LUCENE-1518:


Let's not forget we also have "provides scores but NOT filtering" type things 
as well, e.g. function queries, MatchAllDocsQuery, the "I want to boost 
documents by recency" use case (which is sort of a Scorer filter, in that it 
takes another Scorer and modifies its output, per doc), etc.

It's just that very often the scoring part is in fact very much intertwined 
with the filtering part.  E.g. a TermQuery iterates a SegmentTermDocs, and 
reads & holds freq/doc pairs.
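That intertwining can be sketched as follows: the matching step and the scoring step share the same iteration state. The parallel arrays stand in for on-disk postings, and sqrt(freq) stands in for the real tf() similarity component; none of this is the actual TermScorer implementation:

```java
// Sketch of why scoring and matching intertwine: a term scorer walks doc and
// freq values together (as SegmentTermDocs does), so the matching step already
// holds the data the score needs.
public class TermScorerSketch {
    private final int[] docs;    // matching doc ids, ascending
    private final int[] freqs;   // term frequency per doc, parallel array
    private int pos = -1;

    public TermScorerSketch(int[] docs, int[] freqs) {
        this.docs = docs;
        this.freqs = freqs;
    }

    // Matching: step to the next doc; doc and freq are read as a pair.
    public int nextDoc() {
        return ++pos < docs.length ? docs[pos] : -1;
    }

    // Scoring: uses the freq fetched by the same iteration step, so
    // separating "filter" from "score" would mean reading the data twice.
    public float score() {
        return (float) Math.sqrt(freqs[pos]);
    }
}
```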


 Merge Query and Filter classes
 --

 Key: LUCENE-1518
 URL: https://issues.apache.org/jira/browse/LUCENE-1518
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.4
Reporter: Uwe Schindler
 Fix For: 2.9

 Attachments: LUCENE-1518.patch






[jira] Commented: (LUCENE-1518) Merge Query and Filter classes

2009-04-30 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704609#action_12704609
 ] 

Paul Elschot commented on LUCENE-1518:
--

bq. Create a FilterWeight which wraps a Filter and provides a Scorer 
implementation with a constant score. (This does not handle the no-scoring 
mode, unless no scoring can be achieved with score=0.0f while the constant is 
any other value, defaulting to 1.0f.)

The current patch at LUCENE-1345 does not need such a FilterWeight; the no 
scoring case is handled by not asking for score values.

 Merge Query and Filter classes
 --

 Key: LUCENE-1518
 URL: https://issues.apache.org/jira/browse/LUCENE-1518
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.4
Reporter: Uwe Schindler
 Fix For: 2.9

 Attachments: LUCENE-1518.patch


 This issue presents a patch that merges Queries and Filters so that the 
 new Filter class extends Query. This would make it possible to use every 
 filter as a query.
 The new abstract Filter class would contain all methods of 
 ConstantScoreQuery and would deprecate ConstantScoreQuery. If somebody 
 implements the Filter's getDocIdSet()/bits() methods, there is nothing 
 more to do: the filter can be used as a normal query.
 I do not want to completely convert Filters to ConstantScoreQueries. The 
 idea is to combine Queries and Filters in such a way that every Filter 
 can automatically be used everywhere a Query can be used (e.g. alone as 
 a search query without any other constraint). For that, the abstract 
 Query methods must be implemented and return a default weight for 
 Filters, which is the current ConstantScore logic. If the filter is used 
 as a real filter (where the API wants a Filter), the getDocIdSet part 
 can be used directly and the weight is useless (as it is currently, 
 too). The constant-score default implementation is only used when the 
 Filter is used as a Query (e.g. as a direct parameter to 
 Searcher.search()). For the special case of BooleanQueries combining 
 Filters and Queries, the idea is to optimize the BooleanQuery logic so 
 that it detects whether a BooleanClause is a Filter (using instanceof) 
 and then uses the Filter API directly, avoiding the overhead of 
 ConstantScoreQuery (see LUCENE-1345).
 Here are some ideas on how to implement Searcher.search() with Query and 
 Filter:
 - User runs Searcher.search() using a Filter as the only parameter. As 
 every Filter is also a ConstantScoreQuery, the query can be executed and 
 returns score 1.0 for all matching documents.
 - User runs Searcher.search() using a Query as the only parameter: no 
 change, everything is the same as before.
 - User runs Searcher.search() using a BooleanQuery as parameter: if the 
 BooleanQuery does not contain a Query that is a subclass of Filter (the 
 new Filter), everything works as usual. If the BooleanQuery contains 
 exactly one Filter and nothing else, the Filter is used as a constant 
 score query. If the BooleanQuery contains clauses with both Queries and 
 Filters, the new algorithm can be used: the queries are executed and the 
 results are filtered with the filters.
 The main advantage for the user: queries can be constructed with a 
 simplified API without thinking about Filters or Queries; clauses can 
 just be combined together. The scorer/weight logic then identifies the 
 cases in which to use the filter or the query weight API, just like the 
 query optimizer of an RDBMS.
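The proposed hierarchy can be sketched in miniature. All names here (QuerySketch, FilterSketch, EvenDocsFilter) are invented for illustration and do not reflect the real Lucene 2.9 classes: the point is only that a Filter subclass of Query needs to implement a single method and inherits constant-score Query behavior for free.

```java
// Hypothetical sketch of the proposed hierarchy (illustrative names, not
// the real Lucene API): Filter extends Query and supplies a default
// constant-score execution path, so a Filter can be passed anywhere a
// Query is expected.
abstract class QuerySketch {
    /** Simplified stand-in for weight().scorer(): docs plus per-doc scores. */
    abstract int[] matchingDocs();
    abstract float scoreOf(int doc);
}

abstract class FilterSketch extends QuerySketch {
    /** The only method a concrete filter must implement. */
    abstract int[] getDocIdSet();

    // Default Query behavior: constant score 1.0 for every filtered doc.
    int[] matchingDocs() { return getDocIdSet(); }
    float scoreOf(int doc) { return 1.0f; }
}

// Concrete example: a filter matching even doc ids in a tiny 10-doc index.
class EvenDocsFilter extends FilterSketch {
    int[] getDocIdSet() { return new int[]{0, 2, 4, 6, 8}; }
}
```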

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector

2009-04-30 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704603#action_12704603
 ] 

Michael McCandless commented on LUCENE-1593:


BTW, I wonder if instead of Query.scoresDocsInOrder we should allow one to 
ask the Query for either/or?

Ie, a BooleanQuery can produce a scorer that scores docs in order; it's just 
lower performance.

Sure, our top doc collectors can accept in-order or out-of-order 
collection, but perhaps someone has a collector out there that must get the 
docs in order, so shouldn't we be able to ask the Query whether it will 
always give us docs in order, or whether order is not guaranteed?

Also: I wonder if we would ever want to allow for non-top-scorer usage that 
does not return docs in order?  Ie, next() would be allowed to yield docs out 
of order.  Obviously this is not allowed today... but we are now mixing top vs 
not-top with out-of-order vs in-order, where maybe they should be 
independent?  But I'm not sure in practice when one would want to use an 
out-of-order non-top iterator.
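A tiny example of why a collector might require in-order delivery. This is an invented sketch (not a Lucene Collector): it breaks score ties by preferring the earlier docID, implemented as "only replace on a strictly better score", which is correct only when docs arrive in increasing docID order.

```java
// Illustrative sketch: a collector whose tie-break rule silently depends
// on docs arriving in increasing docID order. Names are hypothetical.
class TieBreakCollector {
    int bestDoc = -1;
    float bestScore = Float.NEGATIVE_INFINITY;

    void collect(int doc, float score) {
        // Correct only for in-order delivery: on a score tie, the earlier
        // doc was seen first and is kept (strict '>' never replaces it).
        if (score > bestScore) {
            bestScore = score;
            bestDoc = doc;
        }
    }

    /** Each hit is {doc, score}; returns the winning doc. */
    static int bestOf(int[][] hits) {
        TieBreakCollector c = new TieBreakCollector();
        for (int[] h : hits) c.collect(h[0], h[1]);
        return c.bestDoc;
    }
}
```

With the same two tied hits, in-order delivery keeps doc 1, while out-of-order delivery keeps doc 3, so the tie-break contract is violated.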

 Optimizations to TopScoreDocCollector and TopFieldCollector
 ---

 Key: LUCENE-1593
 URL: https://issues.apache.org/jira/browse/LUCENE-1593
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
 Fix For: 2.9

 Attachments: LUCENE-1593.patch, LUCENE-1593.patch, PerfTest.java


 This is a spin-off of LUCENE-1575 and proposes to optimize TSDC and TFC code 
 to remove unnecessary checks. The plan is:
 # Ensure that IndexSearcher returns segments in increasing doc Id order, 
 instead of ordering them by numDocs().
 # Change TSDC and TFC's code to not use the doc id as a tie breaker. New docs 
 will always have larger ids and therefore cannot compete.
 # Pre-populate HitQueue with sentinel values in TSDC (score = 
 Float.NEGATIVE_INFINITY) and remove the check for reusableSD == null.
 # Also move to changing the top element and then calling adjustTop(), in 
 case we update the queue.
 # some methods in Sort explicitly add SortField.FIELD_DOC as a tie breaker 
 for the last SortField. But, doing so should not be necessary (since we 
 already break ties by docID), and is in fact less efficient (once the above 
 optimization is in).
 # Investigate PQ - can we deprecate insert() and have only 
 insertWithOverflow()? Add an addDummyObjects method which populates the 
 queue without arranging it, just storing the objects in the array (this 
 can be used to pre-populate sentinel values)?
 I will post a patch as well as some perf measurements as soon as I have them.
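The sentinel idea in step 3 can be sketched as follows. This is a deliberately simplified stand-in (a linear scan instead of a real heap, and bare floats instead of ScoreDoc objects; the class name is invented): pre-filling with Float.NEGATIVE_INFINITY means insertWithOverflow() never needs a null/emptiness check, since real hits always displace sentinels.

```java
// Sketch of the sentinel pre-population idea (hypothetical, simplified:
// a flat array scan stands in for the real HitQueue heap).
class SentinelQueueSketch {
    final float[] scores;

    SentinelQueueSketch(int size) {
        scores = new float[size];
        // Sentinels: any real score beats them, so no null checks later.
        java.util.Arrays.fill(scores, Float.NEGATIVE_INFINITY);
    }

    /** Replace the worst entry if the new score beats it. */
    float insertWithOverflow(float score) {
        int worst = 0;
        for (int i = 1; i < scores.length; i++)
            if (scores[i] < scores[worst]) worst = i;
        if (score <= scores[worst]) return score; // rejected
        float overflowed = scores[worst];
        scores[worst] = score;
        return overflowed;
    }
}
```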




[jira] Issue Comment Edited: (LUCENE-1518) Merge Query and Filter classes

2009-04-30 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704609#action_12704609
 ] 

Paul Elschot edited comment on LUCENE-1518 at 4/30/09 5:27 AM:
---

bq. Create a FilterWeight which wraps a Filter and provide a Scorer 
implementation with a constant score. (This does not handle the no scoring 
mode, unless no scoring can be achieved with score=0.0f, while constant is 
any other value, defaulting to 1.0f).

The current patch at LUCENE-1345 does not need such a FilterWeight; the no 
scoring case is handled by not asking for score values.
Using score=0.0f for no scoring might not work for BooleanQuery because it also 
has a coordination factor that depends on the number of matching sub-queries. 
The patch at LUCENE-1345 does not change that coordination factor, for backward 
compatibility, even though the coordination factor might also depend on the 
number of matching filter clauses.

  was (Author: paul.elsc...@xs4all.nl):
bq. Create a FilterWeight which wraps a Filter and provide a Scorer 
implementation with a constant score. (This does not handle the no scoring 
mode, unless no scoring can be achieved with score=0.0f, while constant is 
any other value, defaulting to 1.0f).

The current patch at LUCENE-1345 does not need such a FilterWeight; the no 
scoring case is handled by not asking for score values.
  



[jira] Issue Comment Edited: (LUCENE-1518) Merge Query and Filter classes

2009-04-30 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704609#action_12704609
 ] 

Paul Elschot edited comment on LUCENE-1518 at 4/30/09 5:29 AM:
---

bq. Create a FilterWeight which wraps a Filter and provide a Scorer 
implementation with a constant score. (This does not handle the no scoring 
mode, unless no scoring can be achieved with score=0.0f, while constant is 
any other value, defaulting to 1.0f).

The current patch at LUCENE-1345 does not need such a FilterWeight; the no 
scoring case is handled by not asking for score values.
Using score=0.0f for no scoring might not work for BooleanQuery because it also 
has a coordination factor that depends on the number of matching query 
clauses. The patch at LUCENE-1345 does not change that coordination factor, for 
backward compatibility, even though the coordination factor might also depend 
on the number of matching filter clauses.

  was (Author: paul.elsc...@xs4all.nl):
bq. Create a FilterWeight which wraps a Filter and provide a Scorer 
implementation with a constant score. (This does not handle the no scoring 
mode, unless no scoring can be achieved with score=0.0f, while constant is 
any other value, defaulting to 1.0f).

The current patch at LUCENE-1345 does not need such a FilterWeight; the no 
scoring case is handled by not asking for score values.
Using score=0.0f for no scoring might not work for BooleanQuery because it also 
has a coordination factor that depends on the number of matching sub-queries. 
The patch at LUCENE-1345 does not change that coordination factor, for backward 
compatibility, even though the coordination factor might also depend on the 
number of matching filter clauses.
  

[jira] Commented: (LUCENE-1518) Merge Query and Filter classes

2009-04-30 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704613#action_12704613
 ] 

Eks Dev commented on LUCENE-1518:
-

Shai, 
bq. Regarding pure ranked, CSQ is really what we need, no?

Yep, it would work for Filters, but why not make it possible to have a normal 
Query with a constant score? For these cases, I am just not sure whether this 
approach gets maximum performance (I did not look at this code for quite a 
while).

Imagine you have a Query and you are not interested in scoring at all; this 
can be accomplished with DocID iterator arithmetic only, ignoring score() 
totally. But that is only an optimization (maybe already there?)

Paul, 
bq. How about materializing the DocIds _and_ the score values?

Exactly, that would open up the full caching possibility (the original purpose 
of Filters). Think search-result caching ... that is practically another name 
for the search() method. It is easy to create this, but using it again would 
require some bigger changes :) 

Filter_on_Steroids materialize(boolean without_score); 
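The "Filter_on_Steroids" one-liner above might look like this in miniature. Everything here is invented for illustration (MaterializedResult and its materialize method are not Lucene API): the point is that a search result frozen into plain arrays, with scores optional, is a cacheable object.

```java
// Hypothetical sketch of the "materialize" idea: freeze a scorer's
// stream into cacheable arrays, optionally dropping the scores. The
// class and method names are invented for illustration.
class MaterializedResult {
    final int[] docs;
    final float[] scores; // null when materialized without scores

    private MaterializedResult(int[] docs, float[] scores) {
        this.docs = docs;
        this.scores = scores;
    }

    /** A cached search result built from parallel doc/score data. */
    static MaterializedResult materialize(int[] docs, float[] scores,
                                          boolean withoutScore) {
        return new MaterializedResult(docs.clone(),
                                      withoutScore ? null : scores.clone());
    }
}
```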






[jira] Commented: (LUCENE-1518) Merge Query and Filter classes

2009-04-30 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704618#action_12704618
 ] 

Eks Dev commented on LUCENE-1518:
-

Paul: ...The current patch at LUCENE-1345 does not need such a FilterWeight; 
the no scoring case is handled by not asking for score values...

Me: ...Imagine you have a Query and you are not interested in scoring at all; 
this can be accomplished with DocID iterator arithmetic only, ignoring score() 
totally. But that is only an optimization (maybe already there?)...

I knew Paul would kick in at this place; he said exactly the same thing I did, 
but, as opposed to me, he made a formulation that executes :) 
Pfff, I feel bad :)








[jira] Commented: (LUCENE-1313) Realtime Search

2009-04-30 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704628#action_12704628
 ] 

Michael McCandless commented on LUCENE-1313:


{quote}
Perhaps the best way to make this clean is to keep the
ram merge policy and real dir merge policies different? That way
merge policy implementations don't need to worry about
ram and non-ram dir cases.
{quote}

OK tentatively this feels like a good approach.  Would you re-use
MergePolicy, or make a new RAMMergePolicy?

Would we use the same MergeScheduler to then execute the selected
merges?

How would we handle the "it's time to flush some RAM to disk" case?
Would RAMMergePolicy make that decision?

bq. Perhaps an IW.updatePendingRamMerges method should be added that handles 
this separately?

Yes?

bq. Does the ram dir ever need to worry about things like 
maxNumSegmentsOptimize and optimize?

No?

{quote}
I think having the ram merge policy should cover the reasons I
had for having a separate ram writer. Although the IW.addWriter
method I implemented would not have blocked, but I don't think
it's necessary now if we have a separate ram merge policy.
{quote}

OK good.


 Realtime Search
 ---

 Key: LUCENE-1313
 URL: https://issues.apache.org/jira/browse/LUCENE-1313
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1313.jar, LUCENE-1313.patch, LUCENE-1313.patch, 
 LUCENE-1313.patch, LUCENE-1313.patch, lucene-1313.patch, lucene-1313.patch, 
 lucene-1313.patch, lucene-1313.patch


 Realtime search with transactional semantics.  
 Possible future directions:
   * Optimistic concurrency
   * Replication
 Encoding each transaction into a set of bytes by writing to a RAMDirectory 
 enables replication.  It is difficult to replicate using other methods 
 because while the document may easily be serialized, the analyzer cannot.
 I think this issue can hold realtime benchmarks which include indexing and 
 searching concurrently.




[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean

2009-04-30 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704632#action_12704632
 ] 

Michael McCandless commented on LUCENE-1614:


bq. So should I add this check()?

Though, in order to run perf tests, we'd need the AND/OR scorers to efficiently 
implement check().

 Add next() and skipTo() variants to DocIdSetIterator that return the current 
 doc, instead of boolean
 

 Key: LUCENE-1614
 URL: https://issues.apache.org/jira/browse/LUCENE-1614
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
 Fix For: 2.9


 See 
 http://www.nabble.com/Another-possible-optimization---now-in-DocIdSetIterator-p23223319.html
  for the full discussion. The basic idea is to add variants to those two 
 methods that return the current doc they are at, to save successive calls to 
 doc(). If there are no more docs, return -1. A summary of what was discussed 
 so far:
 # Deprecate those two methods.
 # Add nextDoc() and skipToDoc(int) that return doc, with default impl in DISI 
 (calls next() and skipTo() respectively, and will be changed to abstract in 
 3.0).
 #* I actually would like to propose an alternative to the names: advance() 
 and advance(int) - the first advances by one, the second advances to target.
 # Wherever these are used, do something like '(doc = advance()) >= 0' instead 
 of comparing to -1 for improved performance.
 I will post a patch shortly
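The proposed idiom can be sketched with a minimal stand-in for DocIdSetIterator (DocIterSketch is an invented name; per the summary above, exhaustion is signaled by returning -1): nextDoc()/advance(int) return the doc they land on, so callers make one call per step instead of next() followed by doc().

```java
// Sketch of the proposed iteration idiom (hypothetical stand-in for
// DocIdSetIterator): nextDoc()/advance(int) return the doc they land
// on, or -1 when exhausted.
class DocIterSketch {
    private final int[] docs; // sorted matching doc ids
    private int pos = -1;

    DocIterSketch(int... docs) { this.docs = docs; }

    /** Advance by one; returns the new doc, or -1 if exhausted. */
    int nextDoc() {
        return ++pos < docs.length ? docs[pos] : -1;
    }

    /** Advance to the first doc >= target; returns it, or -1 if exhausted. */
    int advance(int target) {
        int doc;
        while ((doc = nextDoc()) != -1 && doc < target) { }
        return doc;
    }

    /** The '(doc = advance()) >= 0' loop shape from the summary. */
    static int count(DocIterSketch it) {
        int n = 0;
        for (int doc = it.nextDoc(); doc >= 0; doc = it.nextDoc()) n++;
        return n;
    }
}
```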




[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean

2009-04-30 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704631#action_12704631
 ] 

Michael McCandless commented on LUCENE-1614:


bq. I mean, instead of having the code call check(8), get true and then 
advance(8), just call checkAndAdvance(8) which returns true if 8 is supported 
and false otherwise, AND moves to 8.

Oh, sorry: that's in fact what I intended check() to do.  But by "moves to 8" 
what it really means is that you now cannot call check() on anything < 8 
(maybe < 9).

I think after check(N) is called, one cannot call doc() -- the results are not 
defined.  So check(N) logically puts the iterator at N, and you may at that 
point call next() if you want, or call another check(M), but you cannot call 
doc() right after check().

bq. So should I add this check()?

I think so?  We can then do perf tests of that vs filter-as-BooleanClause?




[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean

2009-04-30 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704633#action_12704633
 ] 

Shai Erera commented on LUCENE-1614:


bq. I think after check(N) is called, one cannot call doc()

I think one cannot even call next(). If check(8) returns true, then you know 
that doc() will return 8 (otherwise it's a bug?). But if it returns false, the 
iterator might be at 10 already, so calling next() will move it to 11 or 
something. So to be on the safe side, we should document that doc()'s result 
is unspecified if check() returns false, and that next() is not recommended 
in that case, but rather skipTo() or check(M).

bq. Though, in order to run perf tests, we'd need the AND/OR scorers to 
efficiently implement check().

I plan to, as much as I can, efficiently implement nextDoc() and advance() in 
all Scorers/DISIs. So I can include check() in the list as well. Or .. maybe 
you know something I don't and you think this should deserve its own issue?
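The contract being discussed can be pinned down with a small invented sketch (CheckableIterator is not Lucene code): check(target) may advance the iterator past target and only reports whether target matched, which is exactly why the position afterwards is unspecified on a false return.

```java
// Hedged sketch of the check() contract under discussion (invented
// implementation): check(target) may leave the iterator beyond target,
// so the position after a false return is unspecified for callers.
class CheckableIterator {
    private final int[] docs; // sorted matching doc ids
    private int pos = -1;

    CheckableIterator(int... docs) { this.docs = docs; }

    private int current() {
        return pos >= 0 && pos < docs.length ? docs[pos] : -1;
    }

    /** Advance to the first doc >= target; returns it, or -1 if exhausted. */
    int skipTo(int target) {
        while (++pos < docs.length) {
            if (docs[pos] >= target) return docs[pos];
        }
        return -1;
    }

    /** True iff target matches; may advance the iterator past target. */
    boolean check(int target) {
        int doc = current();
        if (doc < target) doc = skipTo(target);
        return doc == target;
    }
}
```

For docs {3, 8, 10}: check(8) is true; check(9) is false and silently leaves the iterator at 10, so a subsequent check(10) still works but doc() in between would be meaningless.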




[jira] Updated: (LUCENE-1607) String.intern() faster alternative

2009-04-30 Thread Yonik Seeley (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley updated LUCENE-1607:
-

Attachment: LUCENE-1607.patch

bq. Yonik, the string is being interned twice in your latest patch 

Thanks - I had actually fixed that... but it didn't make it into the patch 
apparently :-)

 String.intern() faster alternative
 --

 Key: LUCENE-1607
 URL: https://issues.apache.org/jira/browse/LUCENE-1607
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Earwin Burrfoot
 Fix For: 2.9

 Attachments: intern.patch, LUCENE-1607.patch, LUCENE-1607.patch, 
 LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, 
 LUCENE-1607.patch


 By using our own interned string pool on top of the default one, 
 String.intern() can be greatly optimized.
 On my setup (Java 6) this alternative runs ~15.8x faster for already interned 
 strings, and ~2.2x faster for 'new String(interned)'.
 For Java 5 and 4 the speedup is lower, but still considerable.




[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector

2009-04-30 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704640#action_12704640
 ] 

Yonik Seeley commented on LUCENE-1593:
--

docsInOrder() would be an implementation detail (and could actually vary per 
reader or per segment) and should be on the Scorer/DocIdSetIterator rather than 
the Query or Weight, right?

 Optimizations to TopScoreDocCollector and TopFieldCollector
 ---

 Key: LUCENE-1593
 URL: https://issues.apache.org/jira/browse/LUCENE-1593
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
 Fix For: 2.9

 Attachments: LUCENE-1593.patch, LUCENE-1593.patch, PerfTest.java


 This is a spin-off of LUCENE-1575 and proposes to optimize TSDC and TFC code 
 to remove unnecessary checks. The plan is:
 # Ensure that IndexSearcher returns segments in increasing doc Id order, 
 instead of numDocs().
 # Change TSDC and TFC's code to not use the doc id as a tie breaker. New docs 
 will always have larger ids and therefore cannot compete.
 # Pre-populate HitQueue with sentinel values in TSDC (score = Float.NEG_INF) 
 and remove the check if reusableSD == null.
 # Also move to use "changing top" and then call adjustTop(), in case we 
 update the queue.
 # some methods in Sort explicitly add SortField.FIELD_DOC as a tie breaker 
 for the last SortField. But, doing so should not be necessary (since we 
 already break ties by docID), and is in fact less efficient (once the above 
 optimization is in).
 # Investigate PQ - can we deprecate insert() and have only 
 insertWithOverflow()? Add an addDummyObjects method which will populate the 
 queue without arranging it, just store the objects in the array (this can 
 be used to pre-populate sentinel values)?
 I will post a patch as well as some perf measurements as soon as I have them.




Re: What are we allowed to do in 3.0?

2009-04-30 Thread Michael McCandless
I think 3.0 should be a fast turnaround after 2.9?  Ie, no new
development should take place?  We should remove deprecated APIs,
change defaults, etc., but that's about it.  (I think this is how past
major releases worked?).  It's a fast switch.  Which then means we
need to do all the hard work in 2.9...

So, I think any API changes we want to make must be present in 2.9 as
deprecations.  We shouldn't up and remove / rename something in 3.0
with no forewarning in 2.9.  Is there a case where this is too
painful?

Likewise I think we should give notification of expected changes in
runtime behavior with 2.9 (and not suddenly do them in 3.0).

 Which means upon releasing 2.9, the very first issue that's opened and 
 committed should be removing the current deprecated methods, or otherwise we 
 could open an issue that deprecates a method and accidentally remove it 
 later, when we handle the massive deprecation removal.

I think we should not target JAR drop-inability, and we should allow
changes to runtime behavior, as well as certain minor API changes in
3.0.  EG here are some of the changes already slated for 3.0:

  * IndexReader.open returns readOnly reader by default

  * IndexReader.norms returns null on fields that don't have norms

  * InterruptedException is thrown by many APIs

  * IndexWriter.autoCommit is hardwired to false

  * Things that now return the deprecated IndexCommitPoint (interface)
will be changed to return IndexCommit (abstract base class)

  * Directory.list will be removed; Directory.listAll will become an
abstract method

  * Stop tracking scores by default when sorting by field

 But recently, while working on LUCENE-1593, Mike and I spotted a need to add 
 some methods to Weight, but since it is an interface we can't. So I said 
 something like "let's do it in 3.0", but then we were not sure whether this 
 can be done in 3.0.

I think for this we should probably introduce an abstract base class
(implementing Weight) in 2.9, stating that the Weight interface will be
removed in 3.0?  (EG, this is what was done for
IndexCommitPoint/IndexCommit.)  Simply changing Weight to an abstract
class in 3.0 is spooky because Java is single inheritance; for existing
classes that implement Weight but subclass something else, it would be
nicer to give a heads up in 2.9 that they'll need to refactor.
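The pattern described above can be sketched as follows. The interface body and
class names here are simplified, illustrative stand-ins, not the actual Lucene
API:

```java
// Hypothetical sketch: ship an abstract base class implementing the old
// interface in 2.9, so new methods get a default implementation and
// implementors get one release of warning before the interface goes away.
// "Weight" here is a simplified stand-in, not the real Lucene interface.
interface Weight {
    float getValue();
}

// Introduced in 2.9; its Javadoc would state that the Weight interface
// will be removed in 3.0.
abstract class AbstractWeight implements Weight {
    // A method that could not be added to the interface without breaking
    // existing implementors; the base class supplies a safe default.
    public boolean scoresDocsOutOfOrder() {
        return false;
    }
}

// Existing implementors refactor to extend the base class during 2.9.
class TermWeight extends AbstractWeight {
    public float getValue() {
        return 1.0f;
    }
    // inherits the scoresDocsOutOfOrder() default
}
```

Because Java is single inheritance, a class already extending something else
cannot simply switch to the base class; publishing it in 2.9 gives such
classes a full release in which to restructure.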

Mike

On Thu, Apr 30, 2009 at 6:25 AM, Shai Erera ser...@gmail.com wrote:
 Hi

 Recently I was involved in several issues that required some runtime changes
 to be done in 3.0, and it was not so clear what it is that we're actually
 allowed to do in 3.0. So I'd like to open it for discussion, unless
 everybody agrees on it already.

 So far I can tell that 3.0 allows us to:
 1) Get rid of all the deprecated methods
 2) Move to Java 1.5

 But what about changes to runtime behavior, re-factoring a whole set of
 classes etc? I'd like to relate to them one by one, so that we can comment
 on each.

 Removing deprecated methods
 -
 As I was told, 2.9 is the last release in which we are allowed to mark
 methods as deprecated and then remove them in 3.0. I.e., after 2.9 is out, if
 we feel a method should be renamed, have its signature changed, or be removed
 altogether, we can't just do it; we'd have to deprecate it and remove it in
 4.0 (?). I personally thought that 2.9 allows us to make these changes
 without letting anyone know about them in advance, which I'm ok with since
 upgrading to 3.0 is not going to be as 'smooth', but I also understand why
 'letting people know in advance' (which means a release prior to the one in
 which they are removed) gives a better service to our users. I also thought
 that jar drop-in-ability is not supposed to be supported from 2.9 to 3.0
 (but I was corrected previously on that).
 Which means upon releasing 2.9, the very first issue that's opened and
 committed should be removing the current deprecated methods, or otherwise we
 could open an issue that deprecates a method and accidentally remove it
 later, when we handle the massive deprecation removal. We should also create
 a 2.9 tag.

 Changes to runtime behavior
 -
 What exactly is the policy here? If we document in 2.9 that certain
 features' runtime behavior will change in 3.0 - is it ok to make those
 changes? And if we don't document them but do it anyway (in the transition
 from 2.9 to 3.0) then it's not? Why? After all, I expect anyone who upgrades
 to 3.0 to run all of his unit tests to assert that everything still works (I
 expect that to happen for every release, but for 3.0 in particular).
 Obviously the runtime behaviors that were changed and documented in 2.9 are
 ones that he might have already taken care of, but why can't he do the same
 reading the CHANGES of 3.0?
 I just feel that this policy forces us to think really hard and foresee
 those changes in runtime behavior that we'd like to do in 3.0 so that we can
 get them 

[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean

2009-04-30 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704665#action_12704665
 ] 

Michael McCandless commented on LUCENE-1614:


bq. I think one cannot even call next().

Hmm, yeah I think you're right.  We could perhaps make this an entirely 
different interface (abstract class).  Ie, one should not mix and match 
'checking' with 'next/advancing'.  In the case I can think of, at least, it's 
an up-front decision as to which scorer does 'next' vs 'check'.

bq.  So I can include check() in the list as well.

I think including it in this issue is fine.




[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector

2009-04-30 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704674#action_12704674
 ] 

Yonik Seeley commented on LUCENE-1593:
--

Query objects are relatively abstract.  Weights are created only with respect 
to a Searcher, and Scorers are created only from within that context with 
respect to an IndexReader.  It really seems like we should maintain this 
separation and avoid putting implementation details into the Query object (or 
the Weight object for that matter).

bq. A user might want to know what Collector implementation to create before 
calling search(Query, Collector)

Having to create a certain type of collector sounds error prone.
Why not reverse the flow of information and tell the Weight.scorer() method 
whether an out-of-order scorer is acceptable, via some flags or a context 
object?  This is also not backward compatible because Weight is an interface, 
so perhaps this optimization will just have to wait.




[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector

2009-04-30 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704692#action_12704692
 ] 

Michael McCandless commented on LUCENE-1593:


bq. What will TermQuery do?

Oh: it's fine to return an in-order scorer, always. It's just that if
a Query wants to use an out-of-order scorer, it should also implement
an in-order one.  Ie, there'd be a 'mating' process to match the
scorer to the collector.

That there might be a Collector out there that requires docs in order is 
not something I think we should handle. Reason is, there wasn't any guarantee 
until today that docs are returned in order. So how can someone write a 
Collector which has a hard assumption on that? Maybe only if he used a Query 
which he knows always scores in order, such as TQ, but then I don't think this 
guy will have a problem since TQ returns true.

bq. And if that someone needs docs in order, but the query at hand returns docs 
out of order, then I'd say tough luck ? I mean, maybe with BQ we can ensure 
in/out of order on request, but if there will be a query which returns docs in 
random, or based on other criteria which causes it to return out of order, what 
good will forcing it to return docs in order do? I'd say that you should just 
use a different query in that case?

Well... we have to be careful.  EG say we had some great optimization
for iterating over matches to PhraseQuery, but it returned docs out of
order.  In that case, I think we'd preserve the in-order Scorer as
well?

bq. But I'm not sure in practice when one would want to use an out-of-order 
non-top iterator.

One case might be a random access filter AND'd w/ a BooleanQuery.  In
that case I could ask for a BooleanScorer to return a DISI whose next
is allowed to return docs out of order, because 1) my filter doesn't
care and 2) my collector doesn't care.

Though, we are thinking about pushing random access filters all the
way down to the TermScorer, so this example isn't realistic in that
future... but it still feels like 'out of order iteration' and 'am I
the top scorer or not' are orthogonal concepts.





[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector

2009-04-30 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704693#action_12704693
 ] 

Michael McCandless commented on LUCENE-1593:



One further optimization can be enabled if we can properly mate
out-of-orderness between Scorer & Collector: BooleanScorer could be
automatically used when appropriate.

Today, one must call BooleanQuery.setAllowDocsOutOfOrder which is
rather silly (it's very much an under the hood detail of how the
Scorer interacts w/ the Collector).  The vast majority of time it's
Lucene that creates the collector, and so now that we can create
Collectors that either do or do not care if docs arrive out of order,
we should allow BooleanScorer when we can.

Though that means we have two ways to score a BooleanQuery:

  * Use BooleanScorer2 w/ a Collector that doesn't fall back to docID
to break ties

  * Use BooleanScorer w/ a Collector that does fall back

We'd need to test which is most performant (I'm guessing the 2nd
one).

So maybe we should in fact add an acceptsDocsOutOfOrder() method to
Collector.
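The idea can be sketched as follows; all class names here are illustrative
stand-ins rather than the actual Lucene API, and the scorer choice is
represented by a string for simplicity:

```java
// Sketch of the "mating" idea: the collector declares whether it
// tolerates out-of-order docs, and the search path picks the scorer
// accordingly, replacing the global setAllowDocsOutOfOrder() switch.
abstract class SketchCollector {
    public abstract boolean acceptsDocsOutOfOrder();
}

// Relies on docID order to break ties, so it needs in-order docs.
class TieByDocIdCollector extends SketchCollector {
    public boolean acceptsDocsOutOfOrder() { return false; }
}

// Pre-populated with sentinel values; breaks ties without docID,
// so out-of-order delivery is fine.
class SentinelCollector extends SketchCollector {
    public boolean acceptsDocsOutOfOrder() { return true; }
}

class ScorerFactory {
    // Internally, Lucene could ask the collector and choose the
    // faster out-of-order BooleanScorer whenever it is acceptable.
    static String scorerFor(SketchCollector c) {
        return c.acceptsDocsOutOfOrder() ? "BooleanScorer"
                                         : "BooleanScorer2";
    }
}
```

With this in place the decision becomes a per-search detail of how the Scorer
interacts with the Collector, instead of a global flag on BooleanQuery.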





[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector

2009-04-30 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704699#action_12704699
 ] 

Michael McCandless commented on LUCENE-1593:


bq. This is also not backward compatible because Weight is an interface, so 
perhaps this optimization will just have to wait.

Yonik, would you suggest we migrate Weight to be an abstract class
instead?  (This is also being discussed in a separate thread on
java-dev, if you want to respond there...)





[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector

2009-04-30 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704703#action_12704703
 ] 

Michael McCandless commented on LUCENE-1593:


Yonik, does Solr have any Scorers that iterate on docs out of order?  Or is 
BooleanScorer the only one we all know about?




[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector

2009-04-30 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704702#action_12704702
 ] 

Michael McCandless commented on LUCENE-1593:


{quote}
IndexSearcher creates the Collector before it obtains a Scorer. Therefore all 
it has at hand is the Weight. Since Weight is an interface, we can't change it, 
so I added it to Query with a default of false.
{quote}

In early iterations on LUCENE-1483, we allowed Collector.setNextReader
to return a new Collector, on the possibility that a new segment might
require a different collector.

We could consider going back to that... and allowing the builtin
collectors to receive a Scorer on creation, which they could interact
with to figure out in/out of order types of issues.  We could then
also enrich setNextReader a bit to also receive a Scorer, so that if
somehow the Scorer for the next segment switched to be in-order vs
out-of-order, the Collector could properly respond.

Or we could require homogeneity for Scorer across all segments
(which'd be quite a bit simpler).

{quote}
Why not reverse the flow of information and tell the Weight.scorer() method if 
an out-of-order scorer is acceptable via some flags or a context object. This 
is also not backward compatible because Weight is an interface, so perhaps this 
optimization will just have to wait.
{quote}

I tentatively like this approach, ie add an API to Collector for it to
declare if it can handle out-of-order collection, and then ask for the
right Scorer.

But internal creation of Collectors could still go both ways, and so
we should retain the freedom to optimize (the BooleanScorer example
above).





[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector

2009-04-30 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704705#action_12704705
 ] 

Michael McCandless commented on LUCENE-1593:


bq.  BooleanScorer could be automatically used when appropriate

If we do this (and I think we should -- good perf gains, though I haven't 
tested just how good, recently), then we should deprecate 
setAllowDocsOutOfOrder in favor of Weight.scorer(boolean allowDocsOutOfOrder).  
And make it clear that internally Lucene may ask for either scorer, depending 
on the collector.




[jira] Commented: (LUCENE-1252) Avoid using positions when not all required terms are present

2009-04-30 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704711#action_12704711
 ] 

Michael McCandless commented on LUCENE-1252:


Here's a simple example that might drive this issue forward:

   +h1n1 flu +united states

Ideally, to score this query, you'd want to first AND all 4 terms
together, and only for docs matching that, consult the positions of
each pair of terms.

But we fail to do this today.

It's as if Weight.scorer() needs to be able to return both a
"cheap" and an "expensive" scorer (which must be AND'd).  I think
PhraseQuery would somehow return cheap/expensive scorers that under
the hood share the same SegmentTermDocs/Positions iterators, such that
after cheap.next() has run, the expensive scorer only needs to check the
current doc.  So in fact maybe the expensive scorer should not be a
Scorer but some other simple "passes or doesn't" API.

Or maybe it returns, say, a TwoStageScorer, which adds a
reallyPasses() (needs a better name) method to the otherwise normal
Scorer (DISI) API.

Or something else
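A rough sketch of the cheap/expensive split described above, with reallyPasses() standing in for the expensive positions check. All names here are hypothetical; nothing below is an existing Lucene API:

```java
// Hypothetical sketch of the "two-stage" idea: the cheap phase is a plain
// conjunction over doc IDs; the expensive phase (e.g. a phrase's positions
// check) is consulted only for docs that already match all required terms.
import java.util.function.IntPredicate;

public class TwoStageScorer {
    private final int[] cheapMatches;        // docs where all terms co-occur
    private final IntPredicate reallyPasses; // expensive per-doc check

    public TwoStageScorer(int[] cheapMatches, IntPredicate reallyPasses) {
        this.cheapMatches = cheapMatches;
        this.reallyPasses = reallyPasses;
    }

    public java.util.List<Integer> collect() {
        java.util.List<Integer> hits = new java.util.ArrayList<>();
        for (int doc : cheapMatches) {       // cheap.next()
            if (reallyPasses.test(doc)) {    // expensive check, current doc only
                hits.add(doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Docs 2, 5 and 9 contain all four terms; only 5 and 9 pass the
        // (simulated) phrase-position check.
        TwoStageScorer s = new TwoStageScorer(
            new int[] {2, 5, 9}, doc -> doc != 2);
        System.out.println(s.collect()); // [5, 9]
    }
}
```

In a real implementation both phases would share the same underlying term iterators, so the expensive check does no extra I/O beyond reading positions for the current doc.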

 Avoid using positions when not all required terms are present
 -

 Key: LUCENE-1252
 URL: https://issues.apache.org/jira/browse/LUCENE-1252
 Project: Lucene - Java
  Issue Type: Wish
  Components: Search
Reporter: Paul Elschot
Priority: Minor

 In the Scorers of queries with (lots of) Phrases and/or (nested) Spans, 
 currently next() and skipTo() will use position information even when other 
 parts of the query cannot match because some required terms are not present.
 This could be avoided by adding some methods to Scorer that relax the 
 postcondition of next() and skipTo() to something like all required terms 
 are present, but no position info was checked yet, and implementing these 
 methods for Scorers that do conjunctions: BooleanScorer, PhraseScorer, and 
 SpanScorer/NearSpans.




[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector

2009-04-30 Thread Marvin Humphrey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704720#action_12704720
 ] 

Marvin Humphrey commented on LUCENE-1593:
-

I made Weight a subclass of Query and all of a sudden Searcher method 
signatures got easier to manage.

PS: Is this a good place to discuss why [having rambling conversations in the 
bug tracker is a bad idea|http://producingoss.com/en/bug-tracker-usage.html], 
or should I open a new issue?

 Optimizations to TopScoreDocCollector and TopFieldCollector
 ---

 Key: LUCENE-1593
 URL: https://issues.apache.org/jira/browse/LUCENE-1593
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
 Fix For: 2.9

 Attachments: LUCENE-1593.patch, LUCENE-1593.patch, PerfTest.java






[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector

2009-04-30 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704721#action_12704721
 ] 

Yonik Seeley commented on LUCENE-1593:
--

bq. Yonik: does Solr have any Scorers that iterate on docs out of order? Or is 
BooleanScorer the only one we all know about?

Nope.  BooleanScorer is the only one I know about.  And it's sort of special 
too... it's not like BooleanScorer can accept out-of-order scorers as 
sub-scorers itself - the ids need to be delivered in the range of the current 
bucket.  IMO custom out-of-order scorers aren't supported in Lucene.

 Optimizations to TopScoreDocCollector and TopFieldCollector
 ---

 Key: LUCENE-1593
 URL: https://issues.apache.org/jira/browse/LUCENE-1593
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
 Fix For: 2.9

 Attachments: LUCENE-1593.patch, LUCENE-1593.patch, PerfTest.java






dbsight

2009-04-30 Thread Michael Masters
Hello Everyone,

I just started using Lucene recently. Great project, BTW. I was
wondering if anyone has suggested making an open source version of
dbsight (www.dbsight.net/). I've just started using it and I think it
would be awesome if it were open source. Does anyone know of a similar
project that is open source?

If not, then how can I propose a project that does a similar thing?

Thanks,
Mike




Re: dbsight

2009-04-30 Thread Jason Rutherglen
Hi Mike,

You may want to ask your question on java-u...@lucene.apache.org

-J

On Thu, Apr 30, 2009 at 11:59 AM, Michael Masters mmast...@gmail.com wrote:

 Hello Everyone,

 I just started to use lucene recently. Great project BTW. I was
 wondering if anyone has suggested making an open source version of
 dbsight (www.dbsight.net/). I've just started using it and I think it
 would be awesome if it was open source. Does anyone know of a project
 that's like this that is OS?

 If not, then how can I propose a project that does a similar thing?

 Thanks,
 Mike





[jira] Resolved: (LUCENE-1611) Do not launch new merges if IndexWriter has hit OOME

2009-04-30 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1611.


   Resolution: Fixed
Fix Version/s: 2.4.2

Thanks Christiaan!

 Do not launch new merges if IndexWriter has hit OOME
 

 Key: LUCENE-1611
 URL: https://issues.apache.org/jira/browse/LUCENE-1611
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.4.1
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 2.4.2, 2.9

 Attachments: LUCENE-1611-241.patch, LUCENE-1611.patch, 
 LUCENE-1611.patch


 if IndexWriter has hit OOME, it defends itself by refusing to commit changes 
 to the index, including merges.  But this can lead to infinite merge attempts 
 because we fail to prevent starting a merge.
 Spinoff from 
 http://www.nabble.com/semi-infinite-loop-during-merging-td23036156.html.




Re: What are we allowed to do in 3.0?

2009-04-30 Thread Shai Erera
So I understand from your responses that you think that 2.9 should include
as much as possible, so that a user will have ~0 work upgrading from 2.9 to
3.0, assuming he upgraded fully to 2.9 (moved to the non-deprecated APIs,
etc.).
If 3.0 is supposed to be released quickly after 2.9 then this makes sense,
and leaves very little room for sudden major changes anyway.

Also, I read that you do support introducing minor changes to the API as well
as to runtime behavior, but still prefer that we do them in 2.9. So
refactoring like we discussed - changing all interfaces to abstract classes
- should not happen in 2.9-3.0, which makes sense. I think this type of
refactoring should happen one class at a time anyway, and not as a complete
overhaul of the code.

Well ... I guess that if that's the case (releasing 3.0 soon after we
release 2.9, where introducing a few very minor changes to API and runtime
behavior that were not documented in 2.9 may be acceptable), then we should
be fine and this very short discussion (I somehow expected many more
responses) can end.

Thanks a lot for clarifying that to me.

Shai

On Thu, Apr 30, 2009 at 6:20 PM, Michael McCandless 
luc...@mikemccandless.com wrote:

 I think 3.0 should be a fast turnaround after 2.9?  Ie, no new
 development should take place?  We should remove deprecated APIs,
 change defaults, etc., but that's about it.  (I think this is how past
 major releases worked?).  It's a fast switch.  Which then means we
 need to do all the hard work in 2.9...

 So, I think any API changes we want to make must be present in 2.9 as
 deprecations.  We shouldn't up and remove / rename something in 3.0
 with no fore-warning in 2.9.  Is there a case where this is too
 painful?

 Likewise I think we should give notification of expected changes in
 runtime behavior with 2.9 (and not suddenly do them in 3.0).

  Which means upon releasing 2.9, the very first issue that's opened and
 committed should be removing the current deprecated methods, or otherwise we
 could open an issue that deprecates a method and accidentally remove it
 later, when we handle the massive deprecation removal.

 I think we should not target JAR drop-inability, and we should allow
 changes to runtime behavior, as well as certain minor API changes in
 3.0.  EG here are some of the changes already slated for 3.0:

  * IndexReader.open returns readOnly reader by default

  * IndexReader.norms returns null on fields that don't have norms

  * InterruptedException is thrown by many APIs

  * IndexWriter.autoCommit is hardwired to false

  * Things that now return the deprecated IndexCommitPoint (interface)
will be changed to return IndexCommit (abstract base class)

  * Directory.list will be removed; Directory.listAll will become an
abstract method

  * Stop tracking scores by default when sorting by field

  But recently, while working on LUCENE-1593, Mike and I spotted a need to
 add some methods to Weight, but since it is an interface we can't. So I said
 something like let's do it in 3.0, but then we were not sure if this can be
 done in 3.0.

 I think for this we should probably introduce an abstract base class
 (implementing Weight) in 2.9, stating that Weight interface will be
 removed in 3.0?  (EG, this is what was done for
 IndexCommitPoint/IndexCommit).  Simply changing Weight to be an
 abstract class in 3.0 is spooky because Java is single inheritance, ie
 for existing classes that implements Weight but subclass something
 else it would be nicer to give a heads up with 2.9 that they'll need
 to refactor?

 Mike

 On Thu, Apr 30, 2009 at 6:25 AM, Shai Erera ser...@gmail.com wrote:
  Hi
 
  Recently I was involved in several issues that required some runtime
 changes
  to be done in 3.0 and it was not so clear what is it that we're actually
  allowed to do in 3.0. So I'd like to open it for discussion, unless
  everybody agree on it already.
 
  So far I can tell that 3.0 allows us to:
  1) Get rid of all the deprecated methods
  2) Move to Java 1.5
 
  But what about changes to runtime behavior, re-factoring a whole set of
  classes etc? I'd like to relate to them one by one, so that we can
 comment
  on each.
 
  Removing deprecated methods
  -
  As I was told, 2.9 is the last release we are allowed to mark methods as
  deprecated, and remove them in 3.0. I.e., after 2.9 is out, if we feel
 there
  is a method that should be renamed, its signature should change or be
  removed altogether, we can't just do it and we'd have to deprecate it and
  remove it in 4.0 (?). I personally thought that 2.9 allows us to make
 these
  changes without letting anyone know about them in advance, which I'm ok
 with
  since upgrading to 3.0 is not going to be as 'smooth', but I also
 understand
  why 'letting people know in advance' (which means a release prior to the
 one
  when they are removed) gives a better service to our users. I also
 thought
  that jar drop-in-ability is not 

[jira] Updated: (LUCENE-1313) Realtime Search

2009-04-30 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated LUCENE-1313:
-

Attachment: LUCENE-1313.patch

{quote} Would you re-use MergePolicy, or make a new
RAMMergePolicy? {quote}

MergePolicy is used as-is, with a special IW method that handles
merging ram segments to the real directory (which hits an issue
around merging only contiguous segments - can that restriction be
relaxed in this case? I don't understand why it exists).

The patch is not committable, however I am posting it to show a
path that seems to work. It includes test cases for merging in
ram and merging to the real directory.

* IW.getFlushDirectory is used by internal calls to obtain the
directory to flush segments to. This is used in DocumentsWriter
related calls.

* DocumentsWriter.directory is removed so that methods requiring
the directory call IW.getFlushDirectory instead.

* IW.setRAMDirectory sets the ram directory to be used.

* IW.setRAMMergePolicy sets the merge policy to be used for
merging segments on the ram dir.

* In IW.updatePendingMerges totalRamUsed is the size of the ram
segments + the ram buffer used. If totalRamUsed exceeds the max
ram buffer size then IW. updatePendingRamMergesToRealDir is
called.

* IW. updatePendingRamMergesToRealDir registers a merge of the
ram segments to the real directory (currently causes a
non-contiguous segments exception)

* MergePolicy.OneMerge has a directory attribute used when
building the merge.info in _mergeInit.

* Test case includes testMergeInRam, testMergeToDisk,
testMergeRamExceeded

There is one error that occurs regularly in testMergeRamExceeded
{code} MergePolicy selected non-contiguous segments to merge
(_bo:cx83 _bm:cx4 _bn:cx2 _bl:cx1-_bj _bp:cx1-_bp _bq:cx1-_bp
_c2:cx1-_c2 _c3:cx1-_c2 _c4:cx1-_c2 vs _5x:c120 _6a:c8
_6t:c11 _bo:cx83** _bm:cx4** _bn:cx2** _bl:cx1-_bj**
_bp:cx1-_bp** _bq:cx1-_bp** _c1:c10 _c2:cx1-_c2**
_c3:cx1-_c2** _c4:cx1-_c2**), which IndexWriter (currently)
cannot handle {code} 
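The totalRamUsed trigger described above amounts to something like the following sketch. The class and method names are hypothetical, not the patch's actual code:

```java
// Illustrative sketch of the RAM-threshold trigger described in the
// comment above: once the ram segments plus the in-memory indexing buffer
// exceed the configured limit, a merge of the ram segments to the real
// directory is registered. Names are invented, not the patch's real API.
public class RamMergeTrigger {
    private final long maxRamBufferBytes;

    public RamMergeTrigger(long maxRamBufferBytes) {
        this.maxRamBufferBytes = maxRamBufferBytes;
    }

    // Called from the updatePendingMerges path.
    public boolean shouldMergeToRealDir(long ramSegmentsBytes, long ramBufferBytes) {
        long totalRamUsed = ramSegmentsBytes + ramBufferBytes;
        return totalRamUsed > maxRamBufferBytes;
    }

    public static void main(String[] args) {
        RamMergeTrigger t = new RamMergeTrigger(16L << 20); // 16 MB limit
        System.out.println(t.shouldMergeToRealDir(10L << 20, 4L << 20)); // false
        System.out.println(t.shouldMergeToRealDir(14L << 20, 4L << 20)); // true
    }
}
```
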

 Realtime Search
 ---

 Key: LUCENE-1313
 URL: https://issues.apache.org/jira/browse/LUCENE-1313
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1313.jar, LUCENE-1313.patch, LUCENE-1313.patch, 
 LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, lucene-1313.patch, 
 lucene-1313.patch, lucene-1313.patch, lucene-1313.patch


 Realtime search with transactional semantics.  
 Possible future directions:
   * Optimistic concurrency
   * Replication
 Encoding each transaction into a set of bytes by writing to a RAMDirectory 
 enables replication.  It is difficult to replicate using other methods 
 because while the document may easily be serialized, the analyzer cannot.
 I think this issue can hold realtime benchmarks which include indexing and 
 searching concurrently.




[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector

2009-04-30 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704785#action_12704785
 ] 

Shai Erera commented on LUCENE-1593:


bq. add an API to Collector for it to declare if it can handle out-of-order 
collection, and then ask for the right Scorer.

Maybe instead add docsOrderSupportedMode() which returns IN_ORDER, 
OUT_OF_ORDER, DONT_CARE? I.e., instead of a boolean allow a Collector to say I 
don't really care (like Mike has pointed out, I think, somewhere above) and 
let the Scorer creation code decide which one to create in case it knows any 
better. I.e., if we know that BS performs better than BS2, and we get a 
Collector saying DONT_CARE, we can always return BS.
Unless we assume that OUT_OF_ORDER covers DONT_CARE as well, in which case we 
can leave it returning a boolean and document that if a Collector can support 
OUT_OF_ORDER, it should always say so, giving the Scorer creation code a chance 
to decide which Scorer is best to return.

In IndexSearcher we can then:
# Where a Collector is given as an argument, ask it about its orderness and 
create the appropriate Scorer.
# Where we create our own Collector (i.e. TFC and TSDC), decide on our own what 
is better. Maybe always ask for out-of-order? That way a Query which only 
supports in-order, without any optimization for out-of-order, can return its 
in-order scorer. I didn't think of it initially, but Mike is right - every 
in-order scorer is also an out-of-order scorer, so this should be fine.
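A sketch of how the enum-based negotiation might look. OrderNegotiation, OrderMode, and the string stand-ins for the two scorer types are all hypothetical:

```java
// Hypothetical sketch of the docsOrderSupportedMode() idea -- not a real
// Lucene API. The scorer-creation code picks the scorer implementation
// based on what the Collector declares it can handle.
public class OrderNegotiation {
    enum OrderMode { IN_ORDER, OUT_OF_ORDER, DONT_CARE }

    // Stand-ins for BooleanScorer2 (in-order) and BooleanScorer
    // (out-of-order, assumed faster when order doesn't matter).
    static String createScorer(OrderMode collectorMode) {
        switch (collectorMode) {
            case IN_ORDER:
                return "BooleanScorer2";  // must deliver docs in order
            case OUT_OF_ORDER:
            case DONT_CARE:
            default:
                return "BooleanScorer";   // free to pick the faster one
        }
    }

    public static void main(String[] args) {
        System.out.println(createScorer(OrderMode.IN_ORDER));   // BooleanScorer2
        System.out.println(createScorer(OrderMode.DONT_CARE));  // BooleanScorer
    }
}
```
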

I like the approach of deprecating Weight and creating an abstract class, 
though that requires deprecating Searchable and creating an AbstractSearchable 
as well. Weight can be wrapped with an AbstractWeightWrapper and passed to the 
AbstractWeight methods (much like we do with AbstractHitCollector from 
LUCENE-1575), defaulting its scorer(inOrder) method to call scorer()?

I guess this should be done in the scope of that issue; or I revert the changes 
done to Query (adding scoresDocsInOrder()), but keep those done to TFC and 
TSDC, and make that optimization in a different issue, which will handle 
Weight/Searchable and the rest of the changes proposed here?

 Optimizations to TopScoreDocCollector and TopFieldCollector
 ---

 Key: LUCENE-1593
 URL: https://issues.apache.org/jira/browse/LUCENE-1593
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
 Fix For: 2.9

 Attachments: LUCENE-1593.patch, LUCENE-1593.patch, PerfTest.java






Score calculation with new by-segment collection

2009-04-30 Thread Earwin Burrfoot
Did I miss something, or when trunk switched to collecting on
SegmentReaders we've lost proper scores?
I mean, before score depended on TF calculated across all the index,
and now it depends on TF for a given segment (yup, unless I missed
something).
Per-segment TF can vary wildly, especially in case of smaller segments.

-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785




Re: Score calculation with new by-segment collection

2009-04-30 Thread Yonik Seeley
On Thu, Apr 30, 2009 at 4:44 PM, Earwin Burrfoot ear...@gmail.com wrote:
 Did I miss something, or when trunk switched to collecting on
 SegmentReaders we've lost proper scores?
 I mean, before score depended on TF calculated across all the index,
 and now it depends on TF for a given segment (yup, unless I missed
 something).
 Per-segment TF can vary wildly, especially in case of smaller segments.

tf is per-document, not per index.
idf is per index, and is calculated in the creation of Weight at the
top-level index reader.
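To illustrate, here is a small sketch of why gathering the stats once at the top-level reader keeps scores stable across segments. The idf formula follows Lucene's classic DefaultSimilarity; the segment layout is made up:

```java
// Sketch of the point above: idf comes from index-wide stats gathered once
// (at Weight creation, over the top-level reader), so per-segment
// collection does not change scores. The idf formula is Lucene's classic
// DefaultSimilarity idf: log(numDocs / (docFreq + 1)) + 1.
public class IdfExample {
    public static void main(String[] args) {
        // Two segments with very different local stats for the same term.
        int[] segmentNumDocs = {1000, 10};
        int[] segmentDocFreq = {50,   9};

        // Index-wide stats, summed over all segments at Weight creation.
        int numDocs = 0, docFreq = 0;
        for (int i = 0; i < segmentNumDocs.length; i++) {
            numDocs += segmentNumDocs[i];
            docFreq += segmentDocFreq[i];
        }
        double idf = Math.log(numDocs / (double) (docFreq + 1)) + 1.0;

        // Per-segment idf would have varied wildly; the global one is stable.
        for (int i = 0; i < segmentNumDocs.length; i++) {
            double localIdf =
                Math.log(segmentNumDocs[i] / (double) (segmentDocFreq[i] + 1)) + 1.0;
            System.out.printf("segment %d local idf = %.3f%n", i, localIdf);
        }
        System.out.printf("global idf used for all segments = %.3f%n", idf);
    }
}
```

Only tf (which is per-document) is read segment by segment; the idf factor baked into the Weight is the same no matter which segment a hit comes from.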

-Yonik
http://www.lucidimagination.com




Re: Score calculation with new by-segment collection

2009-04-30 Thread Earwin Burrfoot
On Fri, May 1, 2009 at 00:47, Yonik Seeley yo...@lucidimagination.com wrote:
 On Thu, Apr 30, 2009 at 4:44 PM, Earwin Burrfoot ear...@gmail.com wrote:
 Did I miss something, or when trunk switched to collecting on
 SegmentReaders we've lost proper scores?
 I mean, before score depended on TF calculated across all the index,
 and now it depends on TF for a given segment (yup, unless I missed
 something).
 Per-segment TF can vary wildly, especially in case of smaller segments.
 tf is per-document, not per index. idf is per index,
Yup, my bad.

 and is calculated in the creation of Weight at the top-level index reader.
Aha, thanks a lot.


-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785




Re: dbsight

2009-04-30 Thread Michael Masters
Sorry. My mistake.

-Mike

On Thu, Apr 30, 2009 at 1:22 PM, Jason Rutherglen
jason.rutherg...@gmail.com wrote:
 Hi Mike,

 You may want to ask your question on java-u...@lucene.apache.org

 -J

 On Thu, Apr 30, 2009 at 11:59 AM, Michael Masters mmast...@gmail.com
 wrote:

 Hello Everyone,

 I just started to use lucene recently. Great project BTW. I was
 wondering if anyone has suggested making an open source version of
 dbsight (www.dbsight.net/). I've just started using it and I think it
 would be awesome if it was open source. Does anyone know of a project
 that's like this that is OS?

 If not, then how can I propose a project that does a similar thing?

 Thanks,
 Mike








[jira] Updated: (LUCENE-1313) Realtime Search

2009-04-30 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated LUCENE-1313:
-

Attachment: LUCENE-1313.patch

Fixed and cleaned up more.

All tests pass

Added entry in CHANGES.txt

I'm going to integrate LUCENE-1618 and test that out as a part of the next 
patch.

 Realtime Search
 ---

 Key: LUCENE-1313
 URL: https://issues.apache.org/jira/browse/LUCENE-1313
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1313.jar, LUCENE-1313.patch, LUCENE-1313.patch, 
 LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, 
 LUCENE-1313.patch, lucene-1313.patch, lucene-1313.patch, lucene-1313.patch, 
 lucene-1313.patch






[jira] Updated: (LUCENE-1494) Additional features for searching for value across multiple fields (many-to-one style)

2009-04-30 Thread Hoss Man (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man updated LUCENE-1494:
-

Attachment: LUCENE-1494-masking.patch

Some things looked like they wouldn't work with the masking patch, so I wrote 
some test cases to convince myself they were broken (and because new code 
should always have test cases).  In particular I was worried about the lack of 
equals/hashCode methods, and the broken rewrite method.

One interesting thing I discovered was that the code worked in many cases even 
though rewrite was constantly just returning the masked inner query -- digging 
into it I realized the reason was that none of the other SpanQuery classes 
pay any attention to what their nested clauses return when they recursively 
rewrite. So a SpanNearQuery, whose constructor freaks out if the fields of all 
the clauses don't match, happily generates spans if one of those clauses 
returns a completely different SpanQuery on rewrite.

I also removed the proxying of getBoost and setBoost ... it was causing 
problems with some stock testing framework code that expects 
q1.equals(q1.clone().setBoost(newBoost)) to be false (this was evaluating to 
true because it's a shallow clone, and setBoost was proxying and modifying the 
original inner query's boost value) ... this means that FieldMaskingSpanQuery 
is consistent with how other SpanQueries deal with boosts (they ignore the 
boosts of their nested clauses).

New patch (with tests) attached ... I'd like to have some more tests before 
committing (spans is deep voodoo, and we're doing funky stuff with spans - all 
the more reason to test thoroughly).

 Additional features for searching for value across multiple fields 
 (many-to-one style)
 --

 Key: LUCENE-1494
 URL: https://issues.apache.org/jira/browse/LUCENE-1494
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Search
Affects Versions: 2.4
Reporter: Paul Cowan
Priority: Minor
 Attachments: LUCENE-1494-masking.patch, LUCENE-1494-masking.patch, 
 LUCENE-1494-multifield.patch, LUCENE-1494-positionincrement.patch


 This issue is to cover the changes required to do a search across multiple 
 fields with the same name in a fashion similar to a many-to-one database. 
 Below is my post on java-dev on the topic, which details the changes we need:
 ---
 We have an interesting situation where we are effectively indexing two 
 'entities' in our system, which share a one-to-many relationship (imagine 
 'User' and 'Delivery Address' for demonstration purposes). At the moment, we 
 index one Lucene Document per 'many' end, duplicating the 'one' end data, 
 like so:
 userid: 1
 userfirstname: fred
 addresscountry: au
 addressphone: 1234
 userid: 1
 userfirstname: fred
 addresscountry: nz
 addressphone: 5678
 userid: 2
 userfirstname: mary
 addresscountry: au
 addressphone: 5678
 (note: 2 Documents indexed for user 1). This is somewhat annoying for us, 
 because when we search in Lucene the results we want back (conceptually) are 
 at the 'user' level, so we have to collapse the results by distinct user id, 
 etc. etc (let alone that it blows out the size of our index enormously). So 
 why do we do it? It would make more sense to use multiple fields:
 userid: 1
 userfirstname: fred
 addresscountry: au
 addressphone: 1234
 addresscountry: nz
 addressphone: 5678
 userid: 2
 userfirstname: mary
 addresscountry: au
 addressphone: 5678
 But imagine the search +addresscountry:au +addressphone:5678. We'd like 
 this to match ONLY Mary, but of course it matches Fred also because he 
 matches both those terms (just for different addresses).
 There are two aspects to the approach we've (more or less) got working but 
 I'd like to run them past the group and see if they're worth trying to get 
 them into Lucene proper (if so, I'll create a JIRA issue for them)
 1) Use a modified SpanNearQuery. If we assume that country + phone will 
 always be one token, we can rely on the fact that the positions of 'au' and 
 '5678' in Fred's document will be different.
    SpanQuery q1 = new SpanTermQuery(new Term("addresscountry", "au"));
    SpanQuery q2 = new SpanTermQuery(new Term("addressphone", "5678"));
    SpanQuery snq = new SpanNearQuery(new SpanQuery[]{q1, q2}, 0, false);
 the slop of 0 means that we'll only return those where the two terms are in 
 the same position in their respective fields. This works brilliantly, BUT 
 requires a change to SpanNearQuery's constructor (which checks that all the 
 clauses are against the same field). Are people amenable to perhaps adding 
 another constructor to SNQ which doesn't do the check, or subclassing it to 
 do the same (give it a 

Hudson build is back to normal: Lucene-trunk #813

2009-04-30 Thread Apache Hudson Server
See http://hudson.zones.apache.org/hudson/job/Lucene-trunk/813/changes


