[jira] Commented: (LUCENE-1518) Merge Query and Filter classes
[ https://issues.apache.org/jira/browse/LUCENE-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704499#action_12704499 ] Shai Erera commented on LUCENE-1518: I would like to ask why we need to make Filter and Query the same type. After all, they do different things, even though they look similar. Attempting to merge them yields these peculiarities: # If Filter extends Query, it now has to implement all sorts of methods like weight, toString, rewrite, getTerms and scoresDocInOrder (an addition from LUCENE-1593). # If Query extends Filter, it has to implement getDocIdSet. # We introduce instanceof checks in places just to check whether a given Query is actually a Filter or not. Both (1) and (2) are completely redundant for both Query and Filter, i.e. why should Filter implement toString(term) or scoresDocInOrder when it does not score docs? Why should Query implement getDocIdSet when it already implements weight().scorer(), which returns a DocIdSetIterator? I read the different posts on this issue and I don't understand why we think that the API today is not clear or convenient enough: * If I want to just filter the entire index, I have two ways: (1) execute a search with MatchAllDocsQuery and a Filter, or (2) wrap the filter with ConstantScoreQuery. I don't see the difference between the two, and I don't think it forces any major/difficult decision on the user. * If I want to have a BooleanQuery with several clauses and I want one clause to be a complex one with a Filter, I can wrap the Filter with CSQ. * If I want to filter a Query, there is already API today on Searcher which accepts both a Query and a Filter. At least as I understand it, Queries are supposed to score documents, while Filters just filter. 
If there is an API which requires Queries only, then I can wrap my Filter with CSQ, but I'd prefer to check whether we can change that API first (for example, allowing BooleanClause to accept a Filter, and implement a weight(IndexReader) rather than just getQuery()). So if Filters just filter and Queries just score, the API on both is very clear: Filter returns a DISI and Query returns a Scorer (which is also a DISI). I don't see the advantage of having the code unaware of the fact that a certain Query is actually a Filter - I prefer it to be upfront. That way, we can do all sorts of optimizations, like asking the Filter for next() first, if we know it's supposed to filter out most of the documents. At the end of the day, both Filter and Query iterate on documents; the difference lies in the purpose of the iteration. In my code there are several Query implementations today that just filter documents, and I plan to change all of them to implement Filter instead (they were originally Queries because Filter had just bits(), and now it's more efficient with the iterator() version, at least to me). I want to do this for a couple of reasons, clarity being one of the most important. If Filter just filters, I don't see why it should inherit all the methods from Query (or vice versa, BTW), especially when I have this CSQ wrapper. To me, as a Lucene user, I make far more complicated decisions every day than deciding whether I want to use a Filter as a Query or not. If I pass it directly to IndexSearcher, I use it as a filter. If I use a different API which accepts just a Query, I wrap it with CSQ. As simple as that. But that's just my two cents. 
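The split described above (a Filter yields a plain doc-id iterator, while a Query's Scorer is a doc-id iterator that also scores, and CSQ bridges the two) can be sketched with toy types. This is a minimal model, not the real Lucene API: the interfaces below only mimic the shape of DocIdSetIterator, Scorer, and ConstantScoreQuery.

```java
// Toy model of the contract discussed above: a Filter yields a plain
// doc-id iterator, a Query's Scorer is a doc-id iterator that also scores,
// and wrapping a filter as a constant-scoring query is trivial (what CSQ does).
public class FilterVsQuery {
    interface DocIdSetIterator {
        int nextDoc(); // next doc id, or -1 when exhausted
    }

    interface Scorer extends DocIdSetIterator {
        float score(); // only scorers carry scores
    }

    // A "filter": iterates a fixed, sorted list of accepted doc ids.
    static DocIdSetIterator listFilter(int[] docs) {
        return new DocIdSetIterator() {
            int i = -1;
            public int nextDoc() { return ++i < docs.length ? docs[i] : -1; }
        };
    }

    // Wrapping a filter as a constant-scoring query.
    static Scorer constantScore(DocIdSetIterator filter, float score) {
        return new Scorer() {
            public int nextDoc() { return filter.nextDoc(); }
            public float score() { return score; }
        };
    }

    public static void main(String[] args) {
        Scorer s = constantScore(listFilter(new int[]{2, 5, 9}), 1.0f);
        int doc;
        while ((doc = s.nextDoc()) != -1) {
            System.out.println(doc + " score=" + s.score());
        }
    }
}
```

The wrapper is a one-liner either way, which is Shai's point: keeping the two types separate costs the user almost nothing.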
Merge Query and Filter classes -- Key: LUCENE-1518 URL: https://issues.apache.org/jira/browse/LUCENE-1518 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 2.4 Reporter: Uwe Schindler Fix For: 2.9 Attachments: LUCENE-1518.patch This issue presents a patch that merges Queries and Filters in such a way that the new Filter class extends Query. This would make it possible to use every filter as a query. The new abstract Filter class would contain all methods of ConstantScoreQuery, and deprecate ConstantScoreQuery. If somebody implements the Filter's getDocIdSet()/bits() methods, he has nothing more to do; he can just use the filter as a normal query. I do not want to completely convert Filters to ConstantScoreQueries. The idea is to combine Queries and Filters in such a way that every Filter can automatically be used in all places where a Query can be used (e.g. also alone as a search query without any other constraint). For that, the abstract Query methods must be implemented and return a default weight for Filters, which is the current ConstantScore logic. If the filter is used as a real filter (where the API wants a Filter), the getDocIdSet part can be directly used; the weight is useless (as it is currently, too). The constant score default implementation is
[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean
[ https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704522#action_12704522 ] Shai Erera commented on LUCENE-1614: I think I understand what you mean, but please correct me if I'm wrong. You propose this check() so that in case a DISI can save any extra operations it does in next() (such as reading a payload for example) it will do so. Therefore, in the example you give above with CS, next()'s contract forces it to advance all the sub-scorers, but with check() it could stop in the middle. This warrants explicit documentation and implementation by the current DISIs ... I don't think that if you call a DISI today with next(10) and then next(10) again it will not move to 11 in the second call. But calling check(10) and then next(10) MUST not advance the DISI further than 10. If the default impl in DISI just uses nextDoc() and returns true if the return value is the requested, we should be safe back-compat-wise, but this is still dangerous and we need clear documentation. BTW, perhaps a testAndSet-like version can save check(10) followed by a next(10), and will fit nicer? Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean Key: LUCENE-1614 URL: https://issues.apache.org/jira/browse/LUCENE-1614 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Fix For: 2.9 See http://www.nabble.com/Another-possible-optimization---now-in-DocIdSetIterator-p23223319.html for the full discussion. The basic idea is to add variants of those two methods that return the current doc they are at, to save successive calls to doc(). If there are no more docs, return -1. A summary of what was discussed so far: # Deprecate those two methods. # Add nextDoc() and skipToDoc(int) that return the doc, with default impls in DISI (calling next() and skipTo() respectively; they will be changed to abstract in 3.0). 
#* I actually would like to propose an alternative to the names: advance() and advance(int) - the first advances by one, the second advances to target. # Wherever these are used, do something like '(doc = advance()) >= 0' instead of comparing to -1, for improved performance. I will post a patch shortly -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
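The proposed return-the-doc API and the loop idiom from the summary above can be sketched over a plain sorted array. The class name and internals here are toys; only the nextDoc()/advance(int) signatures and the -1 end-of-stream convention follow the proposal.

```java
// Toy DocIdSetIterator over a sorted doc-id array, using the proposed
// return-the-doc API instead of boolean next()/skipTo(int) plus doc().
public class AdvanceLoop {
    static class IntArrayDISI {
        private final int[] docs;
        private int i = -1;

        IntArrayDISI(int[] docs) { this.docs = docs; }

        // Proposed nextDoc(): advance by one, return the doc, or -1 at the end.
        int nextDoc() { return ++i < docs.length ? docs[i] : -1; }

        // Proposed advance(int): move to the first doc >= target, or -1.
        int advance(int target) {
            while (++i < docs.length) if (docs[i] >= target) return docs[i];
            return -1;
        }
    }

    // Count all docs from `target` onward, using the '(doc = advance()) >= 0'
    // idiom from the issue: assign and test in one expression, no doc() call.
    static int countFrom(IntArrayDISI it, int target) {
        int count = 0, doc;
        if ((doc = it.advance(target)) >= 0) {
            count = 1;
            while ((doc = it.nextDoc()) >= 0) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countFrom(new IntArrayDISI(new int[]{1, 4, 7, 9}), 5)); // docs 7 and 9
    }
}
```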
[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean
[ https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704556#action_12704556 ] Michael McCandless commented on LUCENE-1614: bq. Just to clarify for myself, in the example I gave above, suppose that the scorer is on 3 and you call check(8). On check(8), TermScorer would go to 10, stop there, and return false. (It would not rewind to 3). check() can only be called on increasing arguments, so it's not truly random access. It's forward-only random access. bq. You propose this check() so that in case a DISI can save any extra operations it does in next() (such as reading a payload for example) it will do so. Therefore in the example you give above with CS, next()'s contract forces it to advance all the sub-scorers, but with check() it could stop in the middle. Precisely. This is important when you have a super-cheap iterator (say a somewhat sparse (<= 10%?) in-memory filter that's represented as a list of docIDs). It's very fast for such a filter to iterate over its docIDs. But when that iterator is AND'd with a Scorer, as is done today by IndexSearcher, they effectively play leap frog, where first it's the filter's turn to next(), then it's the Scorer's turn, etc. But for the Scorer, next() can be extremely costly, only to find the filter doesn't accept it. So for such situations it's better to let the filter drive the search, calling Scorer.check() on the docs. But... once we switch to filter-as-BooleanClause, it's less clear whether check() is worthwhile, because I think the filter's constraint is more efficiently taken into account. For filters that support random access (if they are less sparse, say >= 25% or so), we should push them all the way down to the TermScorers and factor them in just like deletedDocs. bq. 
If the default impl in DISI just uses nextDoc() and returns true if the return value is the requested, we should be safe back-compat-wise, but this is still dangerous and we need clear documentation. Yes it does have a good default impl, I think. bq. BTW, perhaps a testAndSet-like version can save check(10) followed by a next(10), and will fit nicer? Not sure what you mean by testAndSet-like version?
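Mike's "let the sparse filter drive the search" scenario can be modeled with a toy scorer whose check() is forward-only and never rewinds, as he describes for TermScorer. Everything here is a sketch: check() is the name from this thread, not a shipped Lucene method, and the advance counter only stands in for the real cost of Scorer.next().

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of filter-driven conjunction: the cheap, sparse filter iterates
// its own doc ids and merely asks the scorer to confirm each one via the
// proposed forward-only check(), instead of leap-frogging next() calls.
public class FilterDriven {
    static class CountingScorer {
        final int[] docs;       // sorted "postings"
        int i = -1;
        int advances = 0;       // stands in for the real cost of next()

        CountingScorer(int[] docs) { this.docs = docs; }

        // check(target): forward-only; moves to the first doc >= target and
        // reports whether target itself matched. It never rewinds, so callers
        // must pass increasing targets (the contract from the thread).
        boolean check(int target) {
            while (i < 0 || (i < docs.length && docs[i] < target)) { i++; advances++; }
            return i < docs.length && docs[i] == target;
        }
    }

    static List<Integer> intersect(int[] filterDocs, CountingScorer scorer) {
        List<Integer> hits = new ArrayList<>();
        for (int doc : filterDocs) if (scorer.check(doc)) hits.add(doc);
        return hits;
    }

    public static void main(String[] args) {
        CountingScorer scorer = new CountingScorer(new int[]{1, 2, 3, 5, 8, 13});
        // Only three check() calls for a three-doc filter, however dense the scorer.
        System.out.println(intersect(new int[]{2, 8, 40}, scorer));
    }
}
```

The payoff Mike describes is that the expensive side only does cheap forward positioning, and can skip any per-doc extras (payloads, sub-scorer advances) for docs the filter never proposes.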
What are we allowed to do in 3.0?
Hi, Recently I was involved in several issues that required some runtime changes to be done in 3.0, and it was not so clear what it is that we're actually allowed to do in 3.0. So I'd like to open it for discussion, unless everybody agrees on it already. So far I can tell that 3.0 allows us to: 1) Get rid of all the deprecated methods 2) Move to Java 1.5 But what about changes to runtime behavior, refactoring a whole set of classes etc.? I'd like to relate to them one by one, so that we can comment on each.

Removing deprecated methods - As I was told, 2.9 is the last release in which we are allowed to mark methods as deprecated and remove them in 3.0. I.e., after 2.9 is out, if we feel there is a method that should be renamed, have its signature changed, or be removed altogether, we can't just do it; we'd have to deprecate it and remove it in 4.0 (?). I personally thought that 2.9 allows us to make these changes without letting anyone know about them in advance, which I'm ok with since upgrading to 3.0 is not going to be as 'smooth', but I also understand why 'letting people know in advance' (which means a release prior to the one in which they are removed) gives a better service to our users. I also thought that jar drop-in-ability is not supposed to be supported from 2.9 to 3.0 (but I was corrected previously on that). This means that upon releasing 2.9, the very first issue that's opened and committed should be removing the current deprecated methods, or otherwise we could open an issue that deprecates a method and accidentally remove it later, when we handle the massive deprecation removal. We should also create a 2.9 tag.

Changes to runtime behavior - What exactly is the policy here? If we document in 2.9 that certain features' runtime behavior will change in 3.0 - is it ok to make those changes? And if we don't document them and do it (in the transition from 2.9 to 3.0) then it's not? Why? 
After all, I expect anyone who upgrades to 3.0 to run all of his unit tests to assert that everything still works (I expect that for every release, but for 3.0 in particular). Obviously the runtime behaviors that were changed and documented in 2.9 are ones that he might have already taken care of, but why can't he do the same by reading the CHANGES of 3.0? I just feel that this policy forces us to think really hard and foresee those changes in runtime behavior that we'd like to do in 3.0 so that we can get them into 2.9, but at the end of the day we're not improving anything. Upon upgrading to 2.9 I cannot handle the changes in runtime behavior, as they haven't been made yet. I can only do that after I upgrade to 3.0. So what's the difference, for me, between fixing those that were documented in 2.9 and the new ones that were just released? Going forward, I don't think this community changes runtime behavior every other Monday, and so I'd like to have the ability to make those changes without such a strict policy. Those changes are meant to help our users (and we are amongst them) achieve better performance, usually, so why should we fear making them - or, if fear is too strong a word, why should we refrain from making them, while documenting the changes? If we don't want to do it for every 'dot' release, we can do them in major releases, and I'd also vote for doing them in a mid-major release, like 3.5.

Refactoring - Today we are quite limited with refactoring. We cannot add methods to interfaces or abstract methods to abstract classes, or even make classes abstract. I'm perfectly fine with that, as I don't want to face the need to suddenly refactor my application just because Lucene decided to add a method to an interface. But recently, while working on LUCENE-1593, Mike and I spotted a need to add some methods to Weight, but since it is an interface we can't. So I said something like "let's do it in 3.0", but then we were not sure if this can be done in 3.0. 
So the alternative was: let's deprecate Weight, create an AbstractWeight class and do it there - but we weren't sure if that's even something we can push for in 3.0, unless we do all of it in 2.9. This also messes up the code, introducing new classes with bad names (AbstractWeight, AbstractSearchable) where we could have avoided it if we just changed Weight to an abstract class in 3.0. --- I think it all boils down to whether we MUST support jar drop-in-ability when upgrading from 2.9 to 3.0. I think that we shouldn't, as the whole notion of 3.0 (or any future major version) is a major revision to code, index structure, JDK etc. If we're always expected to support it, then 2.9 really becomes 3.0 in terms of our ability to make changes to the API between 2.9 and 3.0. I'm afraid that if that's the case, we might choose to hold off on 2.9 as long as we can so we can push as many changes as we foresee into it, so that they can be finalized in 3.0. I'm not
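The Weight-as-interface problem mentioned above comes down to a basic API-evolution fact (at least pre-Java-8, the era of this thread): an interface cannot grow a method without breaking every external implementation, while an abstract class can add one with a default body. A minimal illustration, with hypothetical names echoing the mail (AbstractWeight, and a scoresDocsInOrder-style addition):

```java
// Why "change Weight to an abstract class" matters for back-compat:
// an abstract class can gain a method with a default body, and existing
// subclasses keep compiling; an interface (before Java 8 default methods)
// cannot.
public class ApiEvolution {
    abstract static class AbstractWeight {
        abstract float getValue();

        // Method added in a later release. Because a default implementation
        // is supplied, subclasses written against the old API still compile.
        boolean scoresDocsInOrder() { return false; }
    }

    // A pre-existing "user" subclass that knows nothing of the new method.
    static class LegacyWeight extends AbstractWeight {
        float getValue() { return 1.0f; }
    }

    public static void main(String[] args) {
        AbstractWeight w = new LegacyWeight();
        System.out.println(w.getValue() + " " + w.scoresDocsInOrder());
    }
}
```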
[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector
[ https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704559#action_12704559 ] Michael McCandless commented on LUCENE-1593: Another thing we should improve about the Scorer API: enrich the Scorer API to optionally provide more details on the positions that caused a match to occur. This would improve highlighting (LUCENE-1522) since we'd know exactly why a match occurred (single source) rather than trying to reverse-engineer the match. It'd also address a number of requests over time by users on "how can I get details on why this doc matched?". I *think* if we did this, the *SpanQuery would be able to share much more w/ their normal counterparts; this was discussed @ http://www.nabble.com/Re%3A-Make-TermScorer-non-final-p22577575.html. I.e. we would have a single TermQuery, just as efficient as the one today, but it would expose a getMatches() (say) that enumerates all positions that matched. Then, if one wanted these details for every hit in the topN, we could make an IndexReader impl that wraps TermVectors for the docs in the topN (since TermVectors are basically a single-doc inverted index), run the query on it, and request the match details per doc. Optimizations to TopScoreDocCollector and TopFieldCollector --- Key: LUCENE-1593 URL: https://issues.apache.org/jira/browse/LUCENE-1593 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Fix For: 2.9 Attachments: LUCENE-1593.patch, PerfTest.java This is a spin-off of LUCENE-1575 and proposes to optimize TSDC and TFC code to remove unnecessary checks. The plan is: # Ensure that IndexSearcher returns segments in increasing doc Id order, instead of numDocs() order. # Change TSDC and TFC's code to not use the doc id as a tie breaker. New docs will always have larger ids and therefore cannot compete. 
# Pre-populate HitQueue with sentinel values in TSDC (score = Float.NEG_INF) and remove the check for reusableSD == null. # Also move to changing top and then calling adjustTop(), in case we update the queue. # Some methods in Sort explicitly add SortField.FIELD_DOC as a tie breaker for the last SortField. But doing so should not be necessary (since we already break ties by docID), and it is in fact less efficient (once the above optimization is in). # Investigate PQ - can we deprecate insert() and have only insertWithOverflow()? Add an addDummyObjects method which will populate the queue without arranging it, just storing the objects in the array (this can be used to pre-populate sentinel values)? I will post a patch as well as some perf measurements as soon as I have them.
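The sentinel idea from item 3 of the plan can be shown with java.util.PriorityQueue standing in for Lucene's HitQueue (a sketch only; the real patch works on HitQueue's internal array). Pre-filling the queue with -Infinity sentinels means the hot loop never needs a "queue not full yet" or null check: a single comparison against the current worst entry decides everything.

```java
import java.util.PriorityQueue;

// Top-K collection with the queue pre-filled with sentinel scores, so the
// per-hit loop is a single compare against the current worst entry.
public class SentinelTopK {
    static float[] topK(float[] scores, int k) {
        PriorityQueue<Float> pq = new PriorityQueue<>(k); // min-heap
        for (int i = 0; i < k; i++) pq.add(Float.NEGATIVE_INFINITY); // sentinels

        for (float s : scores) {
            // Sentinels guarantee peek() is never null and the queue is
            // always "full", so no capacity check is needed here.
            if (s > pq.peek()) {
                pq.poll();
                pq.add(s);
            }
        }
        float[] top = new float[k];
        for (int i = k - 1; i >= 0; i--) top[i] = pq.poll(); // heap pops ascending
        return top; // descending order
    }

    public static void main(String[] args) {
        float[] top = topK(new float[]{0.3f, 2.5f, 1.1f, 0.9f, 4.0f}, 3);
        System.out.println(java.util.Arrays.toString(top));
    }
}
```

Any real hit scores only beat the sentinels, so leftover sentinels (when fewer than k docs match) must be stripped by the caller; Lucene's real collector tracks totalHits for exactly this reason.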
Re: Build failed in Hudson: Lucene-trunk #812
This was a false failure. There's a timed test in contrib/benchmark that apparently can fail if the machine happens to be swamped at the time. I'll work out a more robust test. Mike

On Wed, Apr 29, 2009 at 10:41 PM, Apache Hudson Server hud...@hudson.zones.apache.org wrote:

See http://hudson.zones.apache.org/hudson/job/Lucene-trunk/812/changes
Changes: [pjaol] Fixed bug caused by multiSegmentIndexReader
-- [...truncated 11293 lines...]
[echo] Building swing...
[junit] Testsuite: org.apache.lucene.swing.models.TestBasicList - Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.555 sec
[junit] Testsuite: org.apache.lucene.swing.models.TestBasicTable - Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 0.584 sec
[junit] Testsuite: org.apache.lucene.swing.models.TestSearchingList - Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.627 sec
[junit] Testsuite: org.apache.lucene.swing.models.TestSearchingTable - Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.637 sec
[junit] Testsuite: org.apache.lucene.swing.models.TestUpdatingList - Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 0.73 sec
[junit] Testsuite: org.apache.lucene.swing.models.TestUpdatingTable - Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 1.007 sec
[echo] Building wikipedia...
[junit] Testsuite: org.apache.lucene.wikipedia.analysis.WikipediaTokenizerTest - Tests run: 5, Failures: 0, Errors: 0, Time elapsed: 0.356 sec
[echo] Building wordnet...
[echo] Building xml-query-parser...
[junit] Testsuite: org.apache.lucene.xmlparser.TestParser - Tests run: 18, Failures: 0, Errors: 0, Time elapsed: 2.215 sec
[junit] Testsuite: org.apache.lucene.xmlparser.TestQueryTemplateManager - Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 1.463 sec
BUILD FAILED
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build.xml:649: Contrib tests failed!
Total time: 20 minutes 45 seconds
Publishing Javadoc
Recording test results
Publishing Clover coverage report...
[jira] Commented: (LUCENE-1518) Merge Query and Filter classes
[ https://issues.apache.org/jira/browse/LUCENE-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704561#action_12704561 ] Eks Dev commented on LUCENE-1518: - imo, it is really not all that important to make Filter and Query the same (that is just one alternative for achieving the goal). The basic problem we are trying to solve is adding a Filter directly to BooleanQuery, and making optimizations easier after that. Wrapping with CSQ just adds another layer between the Lucene search machinery and the Filter, making these optimizations harder. On the other hand, I must accept that, conceptually, Filter and Query are the same, together supporting the following options: 1. Pure boolean model: you do not care about scores (today we can do it only via CSQ, as Filter does not enter BooleanQuery) 2. Mixed boolean and ranked: you have to define the Filter's contribution to the documents (CSQ) 3. Pure ranked: no filters, all gets scored (the same as 2.) Ideally, as a user, I define only a Query (Filter-based or not) and for each clause in my Query call Query.setScored(true/false) or useConstantScore(double score); also I should be able to say: dear Lucene, please materialize this Query_Filter for me, as I would like to have it cached, and please store only DocIds (a Filter today). Maybe also open the possibility to cache the scores of the documents as well. One thing is concept, and another is optimization. 
From an optimization point of view, we have a couple of decisions to make: - Does the DocIdSet support random access, yes or no (my Materialized Query)? - Decide whether a clause should / should not be scored, or should have a constant score. So, for each Query we need to decide/support: - scoring {yes, no, constant}, and - the option to materialize a Query (that is how we create Filters today) - these Materialized Queries (aka Filters) should be able to tell us whether they support random access, and whether they cache only doc ids or scores as well. Nothing useful in this email, just thinking aloud - sometimes it helps :) Merge Query and Filter classes -- Key: LUCENE-1518 URL: https://issues.apache.org/jira/browse/LUCENE-1518 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 2.4 Reporter: Uwe Schindler Fix For: 2.9 Attachments: LUCENE-1518.patch This issue presents a patch that merges Queries and Filters in such a way that the new Filter class extends Query. This would make it possible to use every filter as a query. The new abstract Filter class would contain all methods of ConstantScoreQuery, and deprecate ConstantScoreQuery. If somebody implements the Filter's getDocIdSet()/bits() methods, he has nothing more to do; he can just use the filter as a normal query. I do not want to completely convert Filters to ConstantScoreQueries. The idea is to combine Queries and Filters in such a way that every Filter can automatically be used in all places where a Query can be used (e.g. also alone as a search query without any other constraint). For that, the abstract Query methods must be implemented and return a default weight for Filters, which is the current ConstantScore logic. If the filter is used as a real filter (where the API wants a Filter), the getDocIdSet part can be directly used; the weight is useless (as it is currently, too). The constant score default implementation is only used when the Filter is used as a Query (e.g. as a direct parameter to Searcher.search()). 
For the special case of BooleanQueries combining Filters and Queries, the idea is to optimize the BooleanQuery logic in such a way that it detects whether a BooleanClause is a Filter (using instanceof) and then directly uses the Filter API, rather than taking on the burden of the ConstantScoreQuery (see LUCENE-1345). Here are some ideas on how to implement Searcher.search() with Query and Filter: - The user runs Searcher.search() using a Filter as the only parameter. As every Filter is also a ConstantScoreQuery, the query can be executed and returns a score of 1.0 for all matching documents. - The user runs Searcher.search() using a Query as the only parameter: no change, all is the same as before. - The user runs Searcher.search() using a BooleanQuery as the parameter: if the BooleanQuery does not contain a Query that is a subclass of Filter (the new Filter), everything works as usual. If the BooleanQuery contains exactly one Filter and nothing else, the Filter is used as a constant score query. If the BooleanQuery contains clauses with both Queries and Filters, the new algorithm could be used: the queries are executed and the results filtered with the filters. For the user this has the main advantage that he can construct his query using a simplified API without thinking about Filters or Queries; you can just combine clauses
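Uwe's instanceof dispatch can be sketched with toy clause types: filter clauses are separated out up front and applied as cheap accept/reject checks, while only the remaining clauses go through scoring. All class names below are illustrative stand-ins, not the real BooleanQuery machinery.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Toy sketch of BooleanQuery detecting filter clauses via instanceof and
// applying them as accept/reject checks instead of constant-score sub-queries.
public class FilterClauseSketch {
    interface Clause {}

    // A scoring clause: matches these docs, each with this score.
    static class QueryClause implements Clause {
        final int[] docs; final float score;
        QueryClause(int[] docs, float score) { this.docs = docs; this.score = score; }
    }

    // A pure filter clause: accepts or rejects docs, contributes no score.
    static class FilterClause implements Clause {
        final BitSet accepted;
        FilterClause(BitSet accepted) { this.accepted = accepted; }
    }

    // AND-semantics search: filters are split out once, then checked first
    // per doc, so rejected docs never reach the (expensive) query clauses.
    static List<Integer> search(int maxDoc, List<Clause> clauses) {
        List<FilterClause> filters = new ArrayList<>();
        List<QueryClause> queries = new ArrayList<>();
        for (Clause c : clauses) {
            if (c instanceof FilterClause) filters.add((FilterClause) c);
            else queries.add((QueryClause) c);
        }
        List<Integer> hits = new ArrayList<>();
        for (int doc = 0; doc < maxDoc; doc++) {
            boolean ok = true;
            for (FilterClause f : filters) if (!f.accepted.get(doc)) { ok = false; break; }
            if (!ok) continue;
            for (QueryClause q : queries) {
                ok = false;
                for (int d : q.docs) if (d == doc) { ok = true; break; }
                if (!ok) break;
            }
            if (ok) hits.add(doc);
        }
        return hits;
    }

    public static void main(String[] args) {
        BitSet even = new BitSet();
        for (int i = 0; i < 10; i += 2) even.set(i);
        List<Clause> clauses = new ArrayList<>();
        clauses.add(new FilterClause(even));
        clauses.add(new QueryClause(new int[]{1, 2, 3, 4}, 1.0f));
        System.out.println(search(10, clauses));
    }
}
```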
[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean
[ https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704566#action_12704566 ] Shai Erera commented on LUCENE-1614: bq. Not sure what you mean by testAndSet-like version? I mean, instead of having the code call check(8), get true and then advance(8), just call checkAndAdvance(8), which returns true if 8 matches and false otherwise, AND moves to 8. I don't propose to replace check() with it, as sometimes you might want to check a couple of DISIs before making a decision about which doc to advance to, but it could save calling advance() in case check() returns true. bq. Yes it does have a good default impl, I think. It _will_ have a good default impl, I can guarantee to try :). What I meant is that we should have clear documentation about check() and nextDoc() and the possibility that check() will be called for doc Id 'X' and later nextDoc() or advance() will be called with 'X'; in that case the impl must ensure 'X' is not skipped, as is done today by TermScorer for example. So should I add this check()?
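The checkAndAdvance() Shai proposes (a hypothetical name from this thread, not a shipped method) combines check() with positioning: forward-only, never rewinds, and leaves the iterator on the first doc >= target. A toy version reproduces Mike's TermScorer example exactly: positioned on 3, checkAndAdvance(8) lands on 10 and returns false.

```java
// Toy model of the proposed checkAndAdvance(): forward-only check that also
// positions the iterator, so a true result needs no follow-up advance().
public class CheckAndAdvance {
    static class DISI {
        private final int[] docs; // sorted "postings", e.g. {3, 10, 15}
        private int i = -1;

        DISI(int[] docs) { this.docs = docs; }

        int doc() { return i >= 0 && i < docs.length ? docs[i] : -1; }

        int nextDoc() { return ++i < docs.length ? docs[i] : -1; }

        // Move to the first doc >= target; report whether target itself
        // matched. Never rewinds, so callers must pass increasing targets,
        // and a later advance(target) cannot skip a doc check() confirmed.
        boolean checkAndAdvance(int target) {
            while (i < 0 || (i < docs.length && docs[i] < target)) i++;
            return i < docs.length && docs[i] == target;
        }
    }

    public static void main(String[] args) {
        DISI s = new DISI(new int[]{3, 10, 15});
        s.nextDoc();                          // positioned on 3
        System.out.println(s.checkAndAdvance(8) + " now at " + s.doc()); // moved to 10, no rewind
        System.out.println(s.checkAndAdvance(10) + " now at " + s.doc());
    }
}
```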
R-trees in Lucene for spatio-textual search
Hi, Has anybody recently used Lucene to implement R-trees for range queries? I came across the GeoLucene project, but I'm not sure how stable/efficient it is for production use. Any pointers in this direction would be great. Thanks -P
[jira] Updated: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector
[ https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-1593: --- Attachment: LUCENE-1593.patch Patch includes: # New scoresDocsInOrder to Query #* Default to false #* Override in extensions to return true, except in BQ which still returns false until we resolve how BQ is used explicitly (top-score vs. not). In some queries that delegate the work, I used the delegatee or return true if all sub-queries return true. # Changed TopFieldCollector and TopScoreDocCollector to take a docsScoredInOrder parameter and create the appropriate instance (breaking ties by doc Id or not). # Added TestTopScoreDocCollector and a test case to TestSort which test out-of-order collection (they trigger the use of BooleanScorer, though whether document collection happens truly out of order I cannot tell). # Updates to CHANGES All tests pass, including test-tag. BTW, the patch also includes the fix to TestSort in tag, but without the fix for MultiSearcher and ParallelMultiSearcher on tag as I'm not sure if we should back-port the fix as well. Optimizations to TopScoreDocCollector and TopFieldCollector --- Key: LUCENE-1593 URL: https://issues.apache.org/jira/browse/LUCENE-1593 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Fix For: 2.9 Attachments: LUCENE-1593.patch, LUCENE-1593.patch, PerfTest.java This is a spin-off of LUCENE-1575 and proposes to optimize TSDC and TFC code to remove unnecessary checks. The plan is: # Ensure that IndexSearcher returns segements in increasing doc Id order, instead of numDocs(). # Change TSDC and TFC's code to not use the doc id as a tie breaker. New docs will always have larger ids and therefore cannot compete. # Pre-populate HitQueue with sentinel values in TSDC (score = Float.NEG_INF) and remove the check if reusableSD == null. # Also move to use changing top and then call adjustTop(), in case we update the queue. 
# Some methods in Sort explicitly add SortField.FIELD_DOC as a tie breaker for the last SortField. But doing so should not be necessary (since we already break ties by docID), and is in fact less efficient (once the above optimization is in). # Investigate PQ - can we deprecate insert() and have only insertWithOverflow()? Add an addDummyObjects method which will populate the queue without arranging it, just storing the objects in the array (this can be used to pre-populate the sentinel values)? I will post a patch as well as some perf measurements as soon as I have them. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
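The sentinel idea in steps 3-4 can be sketched outside Lucene. This is a simplified model (a small sorted array stands in for the real HitQueue binary heap, and the names are hypothetical), not the patch's code:

```java
import java.util.Arrays;

// Sketch of pre-populating a top-K queue with sentinel entries
// (score = -Infinity) so the collect loop never checks for null or
// queue size: any real score beats a sentinel, so the queue behaves
// as if it were always full.
public class SentinelQueueSketch {
    public static float[] topScores(float[] scores, int k) {
        // Pre-fill with sentinels instead of starting empty.
        float[] top = new float[k];
        Arrays.fill(top, Float.NEGATIVE_INFINITY);
        for (float score : scores) {
            // top[0] plays the role of pq.top(): the current worst entry.
            if (score > top[0]) {
                top[0] = score;   // overwrite worst in place...
                Arrays.sort(top); // ...then restore order, like adjustTop()
            }
            // No null check, no size check: the sentinels absorb both.
        }
        return top; // ascending; sentinels remain if fewer than k real hits
    }

    public static void main(String[] args) {
        float[] top = topScores(new float[] {0.3f, 0.9f, 0.1f, 0.7f}, 2);
        System.out.println(Arrays.toString(top)); // worst-to-best of the top 2
    }
}
```

The point of the trick is that the branch `if (reusableSD == null)` disappears from the hot loop; only the single score comparison remains.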
[jira] Commented: (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704590#action_12704590 ] Yonik Seeley commented on LUCENE-1536: -- Interesting stuff! Has anyone tested whether this results in a performance degradation on SegmentTermDocs? This is very inner-loop stuff, and it's replacing a non-virtual BitVector.get(), which can be easily inlined, with two dispatches through base classes. Hopefully hotspot could handle it, but it's tough to figure out, especially in a real system where sometimes a user RAF will be used and sometimes not. if a filter can support random access API, we should use it --- Key: LUCENE-1536 URL: https://issues.apache.org/jira/browse/LUCENE-1536 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 2.4 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 2.9 Attachments: LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch I ran some performance tests, comparing applying a filter via the random-access API instead of current trunk's iterator API. This was inspired by LUCENE-1476, where we realized deletions should really be implemented just like a filter, but then in testing found that switching deletions to an iterator was a very sizable performance hit. Some notes on the test: * Index is first 2M docs of Wikipedia. Test machine is Mac OS X 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153. * I test across multiple queries. 1-X means an OR query, eg 1-4 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2 AND 3 AND 4. u s means united states (phrase search). * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90, 95, 98, 99, 99.9 (filter is non-null but all bits are set), 100 (filter=null, control)). * Method high means I use the random-access filter API in IndexSearcher's main loop. Method low means I use the random-access filter API down in SegmentTermDocs (just like deleted docs today). 
* Baseline (QPS) is current trunk, where the filter is applied as an iterator up high (ie in IndexSearcher's search loop).
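The high vs. low distinction being benchmarked can be illustrated with a self-contained sketch; a java.util.BitSet stands in for the random-access DocIdSet, and the class and method names are illustrative, not the patch's:

```java
import java.util.BitSet;

// Two ways to apply a filter to a query's matching docs (assumed to be in
// ascending order): a random-access get(doc) probe per candidate, like the
// deleted-docs BitVector.get() in SegmentTermDocs, vs. advancing a separate
// filter iterator in tandem with the query (trunk's iterator approach).
public class FilterAccessSketch {
    // "Low"/random access: one direct probe per candidate doc.
    public static int countLow(int[] matchingDocs, BitSet filter) {
        int hits = 0;
        for (int doc : matchingDocs) {
            if (filter.get(doc)) hits++;
        }
        return hits;
    }

    // "High"/iterator style: leapfrog the filter's set bits and the
    // query's docs, like DocIdSetIterator.skipTo().
    public static int countHigh(int[] matchingDocs, BitSet filter) {
        int hits = 0, filterDoc = filter.nextSetBit(0);
        for (int doc : matchingDocs) {
            while (filterDoc >= 0 && filterDoc < doc) {
                filterDoc = filter.nextSetBit(filterDoc + 1); // skipTo(doc)
            }
            if (filterDoc == doc) hits++;
        }
        return hits;
    }

    static BitSet bits(int... setBits) {
        BitSet b = new BitSet();
        for (int i : setBits) b.set(i);
        return b;
    }

    public static void main(String[] args) {
        int[] docs = {1, 2, 3, 5, 8};
        BitSet filter = bits(2, 5, 7);
        System.out.println(countLow(docs, filter) + " " + countHigh(docs, filter));
    }
}
```

Both produce the same hits; the benchmark question is which is cheaper at a given filter density, since the probe costs one dispatch per candidate while the iterator costs skip bookkeeping per filter bit.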
[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector
[ https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704594#action_12704594 ] Michael McCandless commented on LUCENE-1593: bq. Not in this issue though, right? Right: I'm back into the mode of throwing out all future improvements I know of, to help guide us in picking the right next step. These would all be done in separate issues, and many of them would not be done today, but still we should try not to preclude them for tomorrow. {quote} I like the idea of having Scorer be able to tell why a doc was matched. But I think we should make sure that if a user is not interested in this information, then he should not incur any overhead by it, such as aggregating information in-memory or doing any extra computations. Something like we've done for TopFieldCollector with tracking document scores and maxScore. {quote} Exactly, and I think/hope this'd be achievable.
RE: R-trees in Lucene for spatio-textual search
R-trees are for spatial queries (two dimensions). If you want the same for one-dimensional range queries, use TrieRangeQuery (see 2.9's contrib queries package; this may move to core as a NumericRangeQuery etc.), which is very stable and has been tested for years, and is now being included in Lucene. Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de _ From: Mukherjee, Prasenjit [mailto:p.mukher...@corp.aol.com] Sent: Thursday, April 30, 2009 12:43 PM To: java-dev@lucene.apache.org Subject: R-trees in Lucene for spatio-textual search Hi, Has anybody recently used Lucene to implement R-trees for range queries? I came across the GeoLucene project but I'm not sure how stable/efficient it is for production use. Any pointers in this direction would be great. Thanks -P
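The core idea behind trie-based range queries can be sketched generically. This is not the contrib TrieRangeQuery API; the term encoding below is illustrative only:

```java
import java.util.ArrayList;
import java.util.List;

// Generic sketch of the trie-encoding idea: index each numeric value at
// several precisions by shifting away low-order bits, so a numeric range
// can be matched by a handful of coarse prefix terms plus a few fine terms
// at the range's edges, instead of one term per distinct value.
public class TriePrefixSketch {
    // Precision-stepped terms for one value, encoded as "shift:prefix".
    public static List<String> prefixTerms(long value, int precisionStep) {
        List<String> terms = new ArrayList<>();
        for (int shift = 0; shift < 64; shift += precisionStep) {
            terms.add(shift + ":" + (value >>> shift));
        }
        return terms;
    }

    public static void main(String[] args) {
        // 0x1234 and 0x1237 differ only in the lowest 4 bits, so they share
        // every term above the finest precision; a range covering both can
        // match on the shared coarse prefixes.
        System.out.println(prefixTerms(0x1234L, 4));
        System.out.println(prefixTerms(0x1237L, 4));
    }
}
```

At query time, a range is decomposed into the smallest set of these prefix terms that exactly covers it, which keeps the number of terms logarithmic in the range size.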
[jira] Commented: (LUCENE-1518) Merge Query and Filter classes
[ https://issues.apache.org/jira/browse/LUCENE-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704597#action_12704597 ] Shai Erera commented on LUCENE-1518: bq. Wrapping with CSQ is just adding another layer between Lucene's search machinery and Filter, making these optimizations harder. Right. But making Filter subclass Query and checking in BQ 'if (query instanceof Filter) { Filter f = (Filter) query; }' is not going to improve anything. It adds an instanceof check and a cast, and I'd think those are more expensive than wrapping a Filter with CSQ and returning an appropriate Scorer, which will use the Filter in its next() and skipTo() calls. bq. On the other hand, I must accept, conceptually Filter and Query are the same, supporting together the following options I think that if we allow BooleanClause to implement a weight(IndexReader) method (just like Query), we'll be one step closer to that goal. BQ uses this method to construct BooleanWeight, only today it calls clause.getQuery().createWeight(). Instead it could call clause.getWeight(), and if the BooleanClause holds a Filter it will return a FilterWeight, otherwise delegate that call to the contained Query. Regarding pure ranked, CSQ is really what we need, no? So how about the following: # Add add(Filter, Occur) to BooleanClause. # Add weight(Searcher) to BooleanClause. # Create a FilterWeight which wraps a Filter and provides a Scorer implementation with a constant score. (This does not handle the no-scoring mode, unless no scoring can be achieved with score=0.0f, while constant is any other value, defaulting to 1.0f.) # Add isRandomAccess to Filter. # Create a RandomAccessFilter which extends Filter and defines an additional seek(target) method. # Add asRandomAccessFilter() to Filter, which will materialize that Filter into memory, or into another random-access data structure (e.g. keeping it on disk but still providing random access to it, even if not very efficiently) and return a RandomAccessFilter type, which will implement seek(target) and possibly override next() and skipTo(), but still use whatever other methods this Filter declares. #* I think we should default it to throw UOE, provided that we document that isRandomAccess should be called first. I'm thinking out loud just like you, so I hope my stuff makes sense :). Merge Query and Filter classes -- Key: LUCENE-1518 URL: https://issues.apache.org/jira/browse/LUCENE-1518 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 2.4 Reporter: Uwe Schindler Fix For: 2.9 Attachments: LUCENE-1518.patch This issue presents a patch that merges Queries and Filters in such a way that the new Filter class extends Query. This would make it possible to use every filter as a query. The new abstract Filter class would contain all methods of ConstantScoreQuery, deprecating ConstantScoreQuery. If somebody implements the Filter's getDocIdSet()/bits() methods, he has nothing more to do; he can just use the filter as a normal query. I do not want to completely convert Filters to ConstantScoreQueries. The idea is to combine Queries and Filters in such a way that every Filter can automatically be used everywhere a Query can be used (e.g. also alone as a search query without any other constraint). For that, the abstract Query methods must be implemented and return a default weight for Filters, which is the current ConstantScore logic. If the filter is used as a real filter (where the API wants a Filter), the getDocIdSet part can be used directly; the weight is useless (as it is currently, too). The constant-score default implementation is only used when the Filter is used as a Query (e.g. as a direct parameter to Searcher.search()). For the special case of BooleanQueries combining Filters and Queries, the idea is to optimize the BooleanQuery logic in such a way that it detects whether a BooleanClause is a Filter (using instanceof) and then directly uses the Filter API rather than taking on the burden of ConstantScoreQuery (see LUCENE-1345). Here are some ideas on how to implement Searcher.search() with Query and Filter: - User runs Searcher.search() using a Filter as the only parameter: as every Filter is also a ConstantScoreQuery, the query can be executed and returns score 1.0 for all matching documents. - User runs Searcher.search() using a Query as the only parameter: no change, all is the same as before. - User runs Searcher.search() using a BooleanQuery as parameter: if the BooleanQuery does not contain a Query that is a subclass of Filter (the new Filter), everything is as usual. If the BooleanQuery contains exactly one Filter and nothing else, the Filter is used as a constant-score query. If the BooleanQuery contains clauses with Queries and Filters, the new algorithm could be used: the queries are executed and the results filtered with the filters. For the user this has the main advantage that he can construct his query using a simplified API without thinking about Filters or Queries; you can just combine clauses together. The scorer/weight logic then identifies the cases in which to use the filter or the query weight API, just like the query optimizer of an RDB.
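The FilterWeight proposed in the comment above is hypothetical. A rough standalone sketch of the constant-score wrapping it describes (simplified stand-ins, not Lucene's actual Weight/Scorer classes) might look like:

```java
// Rough shape of the proposed FilterWeight idea: wrap a filter's doc-id
// iterator in a scorer that reports the same constant score for every
// matching doc. All names here are illustrative.
public class FilterWeightSketch {
    interface DocIdSetIterator { int nextDoc(); } // -1 when exhausted

    static class ConstantScorer implements DocIdSetIterator {
        private final DocIdSetIterator filterDocs;
        private final float constantScore;
        ConstantScorer(DocIdSetIterator filterDocs, float constantScore) {
            this.filterDocs = filterDocs;
            this.constantScore = constantScore;
        }
        public int nextDoc() { return filterDocs.nextDoc(); } // delegate matching
        public float score() { return constantScore; }        // same for every doc
    }

    public static void main(String[] args) {
        int[] docs = {3, 7, 12}; // docs the filter accepts, in order
        DocIdSetIterator filter = new DocIdSetIterator() {
            private int i = 0;
            public int nextDoc() { return i < docs.length ? docs[i++] : -1; }
        };
        ConstantScorer scorer = new ConstantScorer(filter, 1.0f); // default 1.0f
        for (int doc = scorer.nextDoc(); doc != -1; doc = scorer.nextDoc()) {
            System.out.println(doc + " scored " + scorer.score());
        }
    }
}
```

This makes the trade-off in the thread concrete: the wrapper adds one delegation per doc, whereas an instanceof check in BQ would instead special-case the clause up front.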
[jira] Commented: (LUCENE-1518) Merge Query and Filter classes
[ https://issues.apache.org/jira/browse/LUCENE-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704605#action_12704605 ] Paul Elschot commented on LUCENE-1518: -- How about materializing the DocIds _and_ the score values?
[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector
[ https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704608#action_12704608 ] Shai Erera commented on LUCENE-1593: bq. BTW, I wonder if instead of Query.scoresDocsInOrder we should allow one to ask the Query for either/or? I'm afraid this might mean a larger change. What will TermQuery do? Today it returns true, and does not have any implementation that can return docs out of order. So what should TQ do when outOfOrderScorer is called? Just return what inOrderScorer returns, or throw an exception? That there might be a Collector out there that requires docs in order is not something I think we should handle. The reason is, there wasn't any guarantee until today that docs are returned in order. So how can someone write a Collector which has a hard assumption on that? Maybe only if he used a Query which he knows always scores in order, such as TQ, but then I don't think he will have a problem, since TQ returns true. And if that someone needs docs in order, but the query at hand returns docs out of order, then I'd say tough luck :)? I mean, maybe with BQ we can ensure in/out of order on request, but if there is a query which returns docs in random order, or based on other criteria which cause it to return out of order, what good will forcing it to return docs in order do? I'd say that you should just use a different query in that case. bq. But I'm not sure in practice when one would want to use an out-of-order non-top iterator. I agree. I think that iteration on Scorer is dictated to be in order because it extends DISI with next() and skipTo() methods, which don't imply in any way that they can return something out of order (besides next() maybe, but it would be hard to use such a next() with skipTo()). 
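The collector-side half of the contract being discussed can be modeled minimally. The names below are illustrative stand-ins, not the real Collector API:

```java
// Minimal model of letting a collector declare whether out-of-order doc
// delivery is acceptable, so the searcher can pick a faster scorer (e.g. a
// BooleanScorer-style one) only when both sides allow it.
public class OrderContractSketch {
    interface Collector {
        void collect(int doc);
        boolean acceptsDocsOutOfOrder();
    }

    // The searcher consults both the query's capability and the
    // collector's requirement before choosing an iteration strategy.
    static String chooseScorer(Collector c, boolean queryCanScoreOutOfOrder) {
        if (queryCanScoreOutOfOrder && c.acceptsDocsOutOfOrder()) {
            return "out-of-order scorer";
        }
        return "in-order scorer";
    }

    public static void main(String[] args) {
        Collector strict = new Collector() {
            public void collect(int doc) { /* e.g. relies on increasing ids */ }
            public boolean acceptsDocsOutOfOrder() { return false; }
        };
        // Even though the query could score out of order, the collector's
        // contract forces the in-order scorer.
        System.out.println(chooseScorer(strict, true));
    }
}
```

This matches Shai's point: a collector with a hard in-order assumption must say so explicitly, since in-order delivery was never a guarantee.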
[jira] Issue Comment Edited: (LUCENE-1518) Merge Query and Filter classes
[ https://issues.apache.org/jira/browse/LUCENE-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704605#action_12704605 ] Paul Elschot edited comment on LUCENE-1518 at 4/30/09 5:19 AM: --- bq. opening option to materialize Query (that is how we create Filters today) How about materializing the DocIds _and_ the score values?
[jira] Commented: (LUCENE-1518) Merge Query and Filter classes
[ https://issues.apache.org/jira/browse/LUCENE-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704600#action_12704600 ] Michael McCandless commented on LUCENE-1518: Let's not forget we also have "provides scores but NOT filtering" type things as well, eg function queries, MatchAllDocsQuery, the "I want to boost documents by recency" use case (which is sort of a Scorer filter, in that it takes another Scorer and modifies its output per doc), etc. It's just that very often the scoring part is in fact very much intertwined with the filtering part. Eg a TermQuery iterates a SegmentTermDocs, which reads and holds doc/freq pairs.
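The "Scorer that wraps another Scorer and modifies its output per doc" aside, e.g. boosting by recency, could be sketched like this. The classes, the per-doc timestamp source, and the half-life decay are all hypothetical, not Lucene API:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a wrapper Scorer that leaves matching untouched but rewrites
// the score per doc, here by decaying with document age.
public class BoostingScorerSketch {
    interface Scorer { int nextDoc(); float score(); } // nextDoc: -1 = done

    static class RecencyBoostScorer implements Scorer {
        private final Scorer inner;
        private final long[] docTimestamps; // assumed per-doc field, by doc id
        private final long now;
        private int currentDoc = -1;
        RecencyBoostScorer(Scorer inner, long[] docTimestamps, long now) {
            this.inner = inner;
            this.docTimestamps = docTimestamps;
            this.now = now;
        }
        public int nextDoc() { return currentDoc = inner.nextDoc(); }
        public float score() {
            // Halve the score for every day of age; purely illustrative.
            double ageDays = (now - docTimestamps[currentDoc]) / 86_400_000.0;
            return (float) (inner.score() * Math.pow(0.5, ageDays));
        }
    }

    static Scorer fixedScorer(int[] docs, float score) {
        return new Scorer() {
            private int i = 0;
            public int nextDoc() { return i < docs.length ? docs[i++] : -1; }
            public float score() { return score; }
        };
    }

    static float[] scoreAll(Scorer inner, long[] timestamps, long now) {
        RecencyBoostScorer boosted = new RecencyBoostScorer(inner, timestamps, now);
        List<Float> scores = new ArrayList<>();
        for (int doc = boosted.nextDoc(); doc != -1; doc = boosted.nextDoc()) {
            scores.add(boosted.score());
        }
        float[] out = new float[scores.size()];
        for (int i = 0; i < out.length; i++) out[i] = scores.get(i);
        return out;
    }

    public static void main(String[] args) {
        long now = 10L * 86_400_000L;                   // "today"
        long[] timestamps = { now, now - 86_400_000L }; // doc 0 fresh, doc 1 a day old
        float[] scores = scoreAll(fixedScorer(new int[]{0, 1}, 2.0f), timestamps, now);
        System.out.println(scores[0] + " " + scores[1]); // fresh doc keeps full score
    }
}
```

Note the wrapper neither filters nor adds docs, which is exactly why it fits neither the Filter nor the plain Query mold in this discussion.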
[jira] Commented: (LUCENE-1518) Merge Query and Filter classes
[ https://issues.apache.org/jira/browse/LUCENE-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704609#action_12704609 ] Paul Elschot commented on LUCENE-1518: -- bq. Create a FilterWeight which wraps a Filter and provide a Scorer implementation with a constant score. (This does not handle the no scoring mode, unless no scoring can be achieved with score=0.0f, while constant is any other value, defaulting to 1.0f). The current patch at LUCENE-1345 does not need such a FilterWeight; the no scoring case is handled by not asking for score values.
[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector
[ https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704603#action_12704603 ] Michael McCandless commented on LUCENE-1593: BTW, I wonder if instead of Query.scoresDocsInOrder we should allow one to ask the Query for either/or? Ie, a BooleanQuery can produce a scorer that scores docs in order; it's just lower performance. Sure, our top doc collectors can accept in-order or out-of-order collection, but perhaps one has a collector out there that must get the docs in order, so shouldn't we be able to ask the Query to give us docs always in order, or say that they don't have to be? Also: I wonder if we would ever want to allow for non-top-scorer usage that does not return docs in order? Ie, next() would be allowed to yield docs out of order. Obviously this is not allowed today... but we are now mixing top vs. not-top with out-of-order vs. in-order, where maybe they should be independent? But I'm not sure in practice when one would want to use an out-of-order non-top iterator.
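The either/or idea could be sketched as a pair of scorer factories with a fallback default, which also answers Shai's "what should TQ do?" question without resorting to an exception. Hypothetical API and class names, not Lucene's:

```java
// Sketch of the "ask the Query for either/or" alternative to a single
// scoresDocsInOrder() flag: the caller requests the order it needs, and a
// query that can only produce docs in order simply answers the
// out-of-order request with its in-order scorer instead of throwing UOE.
public class ScorerRequestSketch {
    interface Scorer { int nextDoc(); } // -1 when exhausted

    static abstract class Query {
        // In-order scoring is the baseline every query must support.
        abstract Scorer inOrderScorer();
        // Out-of-order is an optional optimization; the default just falls
        // back, so TermQuery-like queries need no extra code.
        Scorer outOfOrderScorer() { return inOrderScorer(); }
    }

    // A TermQuery-like query: only one way to iterate, always in order.
    static class SingleListQuery extends Query {
        private final Scorer scorer = () -> -1; // empty posting list here
        Scorer inOrderScorer() { return scorer; }
    }

    static boolean sameScorer(Query q) {
        return q.outOfOrderScorer() == q.inOrderScorer();
    }

    public static void main(String[] args) {
        // The fallback kicks in: the out-of-order request yields the
        // in-order scorer, which is always a valid answer.
        System.out.println(sameScorer(new SingleListQuery()));
    }
}
```

A BooleanQuery-like class would override outOfOrderScorer() to return its faster BooleanScorer-style iterator, keeping the two concerns (top vs. not-top, in-order vs. out-of-order) independent as the comment suggests.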
[jira] Issue Comment Edited: (LUCENE-1518) Merge Query and Filter classes
[ https://issues.apache.org/jira/browse/LUCENE-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704609#action_12704609 ] Paul Elschot edited comment on LUCENE-1518 at 4/30/09 5:27 AM: --- bq. Create a FilterWeight which wraps a Filter and provides a Scorer implementation with a constant score. (This does not handle the no scoring mode, unless no scoring can be achieved with score=0.0f, while constant is any other value, defaulting to 1.0f). The current patch at LUCENE-1345 does not need such a FilterWeight; the no scoring case is handled by not asking for score values. Using score=0.0f for no scoring might not work for BooleanQuery because it also has a coordination factor that depends on the number of matching sub-queries. The patch at 1345 does not change that coordination factor for backward compatibility, even though the coordination factor might also depend on the number of matching filter clauses. was (Author: paul.elsc...@xs4all.nl): bq. Create a FilterWeight which wraps a Filter and provides a Scorer implementation with a constant score. (This does not handle the no scoring mode, unless no scoring can be achieved with score=0.0f, while constant is any other value, defaulting to 1.0f). The current patch at LUCENE-1345 does not need such a FilterWeight; the no scoring case is handled by not asking for score values. Merge Query and Filter classes -- Key: LUCENE-1518 URL: https://issues.apache.org/jira/browse/LUCENE-1518 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 2.4 Reporter: Uwe Schindler Fix For: 2.9 Attachments: LUCENE-1518.patch This issue presents a patch that merges Queries and Filters so that the new Filter class extends Query. This would make it possible to use every filter as a query. The new abstract filter class would contain all methods of ConstantScoreQuery and deprecate ConstantScoreQuery.
If somebody implements the Filter's getDocIdSet()/bits() methods, there is nothing more to do; the filter can be used as a normal query. I do not want to completely convert Filters to ConstantScoreQueries. The idea is to combine Queries and Filters in such a way that every Filter can automatically be used at all places where a Query can be used (e.g. standing alone as a search query without any other constraint). For that, the abstract Query methods must be implemented and return a default weight for Filters, which is the current ConstantScore logic. If the filter is used as a real filter (where the API wants a Filter), the getDocIdSet part can be used directly; the weight is useless (as it is currently, too). The constant score default implementation is only used when the Filter is used as a Query (e.g. as a direct parameter to Searcher.search()). For the special case of BooleanQueries combining Filters and Queries, the idea is to optimize the BooleanQuery logic in such a way that it detects whether a BooleanClause is a Filter (using instanceof) and then directly uses the Filter API, avoiding the burden of the ConstantScoreQuery (see LUCENE-1345). Here are some ideas on how to implement Searcher.search() with Query and Filter: - User runs Searcher.search() using a Filter as the only parameter: as every Filter is also a ConstantScoreQuery, the query can be executed and returns score 1.0 for all matching documents. - User runs Searcher.search() using a Query as the only parameter: no change, all is the same as before. - User runs Searcher.search() using a BooleanQuery as parameter: if the BooleanQuery does not contain a Query that is a subclass of Filter (the new Filter), everything is as usual. If the BooleanQuery contains exactly one Filter and nothing else, the Filter is used as a constant score query. If the BooleanQuery contains clauses with Queries and Filters, the new algorithm could be used: the queries are executed and the results filtered with the filters.
For the user this has the main advantage that he can construct his query using a simplified API without thinking about Filters or Queries; he can just combine clauses together. The scorer/weight logic then identifies the cases to use the filter or the query weight API. Just like the query optimizer of an RDBMS.
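As a rough illustration of the proposal, here is a toy model of a Filter that extends Query and inherits a constant-score default. This is deliberately not the real Lucene Query/Weight/Scorer API; all names and signatures here are simplified stand-ins.

```java
public class FilterAsQueryDemo {
    // Toy stand-in for Query: real queries rank their matching documents.
    static abstract class Query {
        abstract float score(int doc);
    }

    // The proposal: Filter extends Query and inherits a default
    // constant-score implementation (the current ConstantScoreQuery logic),
    // so any Filter can stand alone as a query.
    static abstract class Filter extends Query {
        abstract boolean accept(int doc);   // the only method a Filter must implement

        @Override
        float score(int doc) {              // inherited constant-score default
            return accept(doc) ? 1.0f : 0.0f;
        }
    }

    // A concrete filter: accepts even doc ids (purely illustrative).
    static Filter evenDocs() {
        return new Filter() {
            @Override
            boolean accept(int doc) { return doc % 2 == 0; }
        };
    }

    public static void main(String[] args) {
        Query q = evenDocs();               // a Filter used as a plain Query
        System.out.println(q.score(4) + " " + q.score(5));
    }
}
```

When the API wants a real Filter, only accept() (standing in for getDocIdSet()) is called and the scoring default is never exercised, which mirrors the "weight is useless" point in the description above.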
[jira] Issue Comment Edited: (LUCENE-1518) Merge Query and Filter classes
[ https://issues.apache.org/jira/browse/LUCENE-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704609#action_12704609 ] Paul Elschot edited comment on LUCENE-1518 at 4/30/09 5:29 AM: --- bq. Create a FilterWeight which wraps a Filter and provides a Scorer implementation with a constant score. (This does not handle the no scoring mode, unless no scoring can be achieved with score=0.0f, while constant is any other value, defaulting to 1.0f). The current patch at LUCENE-1345 does not need such a FilterWeight; the no scoring case is handled by not asking for score values. Using score=0.0f for no scoring might not work for BooleanQuery because it also has a coordination factor that depends on the number of matching query clauses. The patch at 1345 does not change that coordination factor for backward compatibility, even though the coordination factor might also depend on the number of matching filter clauses. was (Author: paul.elsc...@xs4all.nl): bq. Create a FilterWeight which wraps a Filter and provides a Scorer implementation with a constant score. (This does not handle the no scoring mode, unless no scoring can be achieved with score=0.0f, while constant is any other value, defaulting to 1.0f). The current patch at LUCENE-1345 does not need such a FilterWeight; the no scoring case is handled by not asking for score values. Using score=0.0f for no scoring might not work for BooleanQuery because it also has a coordination factor that depends on the number of matching sub-queries. The patch at 1345 does not change that coordination factor for backward compatibility, even though the coordination factor might also depend on the number of matching filter clauses.
[jira] Commented: (LUCENE-1518) Merge Query and Filter classes
[ https://issues.apache.org/jira/browse/LUCENE-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704613#action_12704613 ] Eks Dev commented on LUCENE-1518: - Shai, Regarding pure ranked, CSQ is really what we need, no? --- Yep, it would work for Filters, but why not make it possible for a normal Query to have a constant score? For these cases, I am just not sure if this approach gets max performance (did not look at this code for quite a while). Imagine you have a Query and you are not interested in Scoring at all; this can be accomplished with only DocID iterator arithmetic, ignoring score() totally. But that is only an optimization (maybe already there?) Paul, How about materializing the DocIds _and_ the score values? exactly, that would open up the full caching possibility (original purpose of Filters). Think Search Results caching ... that is practically another name for the search() method. It is easy to create this, but using it again would require some bigger changes :) Filter_on_Steroids materialize(boolean without_score);
[jira] Commented: (LUCENE-1518) Merge Query and Filter classes
[ https://issues.apache.org/jira/browse/LUCENE-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704618#action_12704618 ] Eks Dev commented on LUCENE-1518: - Paul: ...The current patch at LUCENE-1345 does not need such a FilterWeight; the no scoring case is handled by not asking for score values... Me: ...Imagine you have a Query and you are not interested in Scoring at all; this can be accomplished with only DocID iterator arithmetic, ignoring score() totally. But that is only an optimization (maybe already there?)... I knew Paul would kick in at this point; he said exactly the same thing I did, but, as opposed to me, he made a formulation that executes :) Pfff, I feel bad :)
[jira] Commented: (LUCENE-1313) Realtime Search
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704628#action_12704628 ] Michael McCandless commented on LUCENE-1313: {quote} Perhaps the best way to make this clean is to keep the ram merge policy and real dir merge policies different? That way merge policy implementations don't need to worry about ram and non-ram dir cases. {quote} OK tentatively this feels like a good approach. Would you re-use MergePolicy, or make a new RAMMergePolicy? Would we use the same MergeScheduler to then execute the selected merges? How would we handle the 'it's time to flush some RAM to disk' case? Would RAMMergePolicy make that decision? bq. Perhaps an IW.updatePendingRamMerges method should be added that handles this separately? Yes? bq. Does the ram dir ever need to worry about things like maxNumSegmentsOptimize and optimize? No? {quote} I think having the ram merge policy should cover the reasons I had for having a separate ram writer. Although the IW.addWriter method I implemented would not have blocked, I don't think it's necessary now if we have a separate ram merge policy. {quote} OK good. Realtime Search --- Key: LUCENE-1313 URL: https://issues.apache.org/jira/browse/LUCENE-1313 Project: Lucene - Java Issue Type: New Feature Components: Index Affects Versions: 2.4.1 Reporter: Jason Rutherglen Priority: Minor Fix For: 2.9 Attachments: LUCENE-1313.jar, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, lucene-1313.patch, lucene-1313.patch, lucene-1313.patch, lucene-1313.patch Realtime search with transactional semantics. Possible future directions: * Optimistic concurrency * Replication Encoding each transaction into a set of bytes by writing to a RAMDirectory enables replication. It is difficult to replicate using other methods because while the document may easily be serialized, the analyzer cannot.
I think this issue can hold realtime benchmarks which include indexing and searching concurrently.
[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean
[ https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704632#action_12704632 ] Michael McCandless commented on LUCENE-1614: bq. So should I add this check()? Though, in order to run perf tests, we'd need the AND/OR scorers to efficiently implement check(). Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean Key: LUCENE-1614 URL: https://issues.apache.org/jira/browse/LUCENE-1614 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Fix For: 2.9 See http://www.nabble.com/Another-possible-optimization---now-in-DocIdSetIterator-p23223319.html for the full discussion. The basic idea is to add variants to those two methods that return the current doc they are at, to save successive calls to doc(). If there are no more docs, return -1. A summary of what was discussed so far: # Deprecate those two methods. # Add nextDoc() and skipToDoc(int) that return doc, with default impl in DISI (calls next() and skipTo() respectively, and will be changed to abstract in 3.0). #* I actually would like to propose an alternative to the names: advance() and advance(int) - the first advances by one, the second advances to target. # Wherever these are used, do something like '(doc = advance()) >= 0' instead of comparing to -1 for improved performance. I will post a patch shortly
[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean
[ https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704631#action_12704631 ] Michael McCandless commented on LUCENE-1614: bq. I mean, instead of having the code call check(8), get true and then advance(8), just call checkAndAdvance(8) which returns true if 8 is supported and false otherwise, AND moves to 8. Oh, sorry: that's in fact what I intended check to do. But by moves to 8 what it really means is you now cannot call check on anything < 8 (maybe < 9). I think after check(N) is called, one cannot call doc() -- the results are not defined. So check(N) logically puts the iterator at N, and you may at that point call next if you want, or call another check(M), but you cannot call doc() right after check. bq. So should I add this check()? I think so? We can then do perf tests of that vs filter-as-BooleanClause?
[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean
[ https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704633#action_12704633 ] Shai Erera commented on LUCENE-1614: bq. I think after check(N) is called, one cannot call doc() I think one cannot even call next(). If check(8) returns true, then you know that doc() will return 8 (otherwise it's a bug?). But if it returns false, it might be at 10 already, so calling next() will move it to 11 or something. So to be on the safe side, we should document that doc()'s result is unspecified if check() returns false, and that next() is not recommended in that case, but rather skipTo() or check(M). bq. Though, in order to run perf tests, we'd need the AND/OR scorers to efficiently implement check(). I plan to, as much as I can, efficiently implement nextDoc() and advance() in all Scorers/DISIs. So I can include check() in the list as well. Or... maybe you know something I don't, and you think this deserves its own issue?
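The iterator contract being proposed in this thread can be sketched as follows. This is a toy illustration with made-up class names, not Lucene's DocIdSetIterator: nextDoc()/advance(target) return the doc they land on, or -1 when exhausted, so callers can write the combined '(doc = it.nextDoc()) >= 0' idiom instead of a boolean next() followed by a separate doc() call.

```java
public class AdvanceDemo {
    // Minimal iterator over an ascending list of doc ids (illustrative only).
    static class SortedIntIterator {
        private final int[] docs;
        private int idx = -1;

        SortedIntIterator(int... docs) { this.docs = docs; }

        // Returns the next doc id, or -1 when there are no more docs.
        int nextDoc() {
            return ++idx < docs.length ? docs[idx] : -1;
        }

        // Skips to the first doc >= target; returns it, or -1 if exhausted.
        int advance(int target) {
            int doc;
            do { doc = nextDoc(); } while (doc >= 0 && doc < target);
            return doc;
        }
    }

    // Count the docs at or beyond target, using the combined-return idiom.
    static int countFrom(SortedIntIterator it, int target) {
        int count = 0;
        for (int doc = it.advance(target); doc >= 0; doc = it.nextDoc()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Docs 7 and 9 are at or beyond target 5.
        System.out.println(countFrom(new SortedIntIterator(1, 4, 7, 9), 5));
    }
}
```

The loop condition does the assignment and the end test in one expression, which is the "instead of comparing to -1" saving discussed in the summary above.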
[jira] Updated: (LUCENE-1607) String.intern() faster alternative
[ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yonik Seeley updated LUCENE-1607: - Attachment: LUCENE-1607.patch bq. Yonik, the string is being interned twice in your latest patch Thanks - I had actually fixed that... but it didn't make it into the patch apparently :-) String.intern() faster alternative -- Key: LUCENE-1607 URL: https://issues.apache.org/jira/browse/LUCENE-1607 Project: Lucene - Java Issue Type: Improvement Reporter: Earwin Burrfoot Fix For: 2.9 Attachments: intern.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch By using our own interned string pool on top of the default one, String.intern() can be greatly optimized. On my setup (Java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'. For Java 5 and 4 the speedup is lower, but still considerable.
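The general idea behind the issue, a custom pool sitting in front of String.intern(), can be sketched roughly as below. This is an assumption-laden simplification, not the attached patch (which, being tuned for speed, is structured differently); the class name CachingInterner is made up for illustration.

```java
import java.util.concurrent.ConcurrentHashMap;

public class CachingInterner {
    // Our own pool of canonical strings, layered over the JVM's intern table.
    private final ConcurrentHashMap<String, String> pool = new ConcurrentHashMap<>();

    public String intern(String s) {
        String cached = pool.get(s);    // fast path: plain hash lookup, no native call
        if (cached != null) return cached;
        String interned = s.intern();   // slow path: canonicalize once via the JVM pool
        pool.put(interned, interned);
        return interned;
    }

    public static void main(String[] args) {
        CachingInterner interner = new CachingInterner();
        String a = interner.intern(new String("field"));
        String b = interner.intern(new String("field"));
        System.out.println(a == b);     // both resolve to the same canonical instance
    }
}
```

The speedup for already-interned strings comes from replacing the relatively expensive native String.intern() call with an ordinary hash lookup on the hot path.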
[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector
[ https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704640#action_12704640 ] Yonik Seeley commented on LUCENE-1593: -- docsInOrder() would be an implementation detail (and could actually vary per reader or per segment) and should be on the Scorer/DocIdSetIterator rather than the Query or Weight, right?
Re: What are we allowed to do in 3.0?
I think 3.0 should be a fast turnaround after 2.9? Ie, no new development should take place? We should remove deprecated APIs, change defaults, etc., but that's about it. (I think this is how past major releases worked?) It's a fast switch. Which then means we need to do all the hard work in 2.9... So, I think any API changes we want to make must be present in 2.9 as deprecations. We shouldn't up and remove / rename something in 3.0 with no fore-warning in 2.9. Is there a case where this is too painful? Likewise I think we should give notification of expected changes in runtime behavior with 2.9 (and not suddenly do them in 3.0). Which means upon releasing 2.9, the very first issue that's opened and committed should be removing the current deprecated methods; otherwise we could open an issue that deprecates a method and accidentally remove it later, when we handle the massive deprecation removal. I think we should not target JAR drop-inability, and we should allow changes to runtime behavior, as well as certain minor API changes, in 3.0. EG here are some of the changes already slated for 3.0: * IndexReader.open returns a readOnly reader by default * IndexReader.norms returns null on fields that don't have norms * InterruptedException is thrown by many APIs * IndexWriter.autoCommit is hardwired to false * Things that now return the deprecated IndexCommitPoint (interface) will be changed to return IndexCommit (abstract base class) * Directory.list will be removed; Directory.listAll will become an abstract method * Stop tracking scores by default when sorting by field But recently, while working on LUCENE-1593, Mike and I spotted a need to add some methods to Weight, but since it is an interface we can't. So I said something like 'let's do it in 3.0', but then we were not sure if this can be done in 3.0. I think for this we should probably introduce an abstract base class (implementing Weight) in 2.9, stating that the Weight interface will be removed in 3.0?
(EG, this is what was done for IndexCommitPoint/IndexCommit). Simply changing Weight to be an abstract class in 3.0 is spooky because Java is single inheritance; ie, for existing classes that implement Weight but subclass something else, it would be nicer to give a heads up with 2.9 that they'll need to refactor? Mike On Thu, Apr 30, 2009 at 6:25 AM, Shai Erera ser...@gmail.com wrote: Hi Recently I was involved in several issues that required some runtime changes to be done in 3.0, and it was not so clear what it is that we're actually allowed to do in 3.0. So I'd like to open it for discussion, unless everybody agrees on it already. So far I can tell that 3.0 allows us to: 1) Get rid of all the deprecated methods 2) Move to Java 1.5 But what about changes to runtime behavior, re-factoring a whole set of classes, etc.? I'd like to relate to them one by one, so that we can comment on each. Removing deprecated methods - As I was told, 2.9 is the last release in which we are allowed to mark methods as deprecated and remove them in 3.0. I.e., after 2.9 is out, if we feel there is a method that should be renamed, have its signature changed, or be removed altogether, we can't just do it; we'd have to deprecate it and remove it in 4.0 (?). I personally thought that 2.9 allows us to make these changes without letting anyone know about them in advance, which I'm ok with since upgrading to 3.0 is not going to be as 'smooth', but I also understand why 'letting people know in advance' (which means a release prior to the one when they are removed) gives a better service to our users. I also thought that jar drop-in-ability is not supposed to be supported from 2.9 to 3.0 (but I was corrected previously on that). Which means upon releasing 2.9, the very first issue that's opened and committed should be removing the current deprecated methods; otherwise we could open an issue that deprecates a method and accidentally remove it later, when we handle the massive deprecation removal.
We should also create a 2.9 tag. Changes to runtime behavior - What exactly is the policy here? If we document in 2.9 that certain features' runtime behavior will change in 3.0 - is it ok to make those changes? And if we don't document them and do it (in the transition from 2.9 to 3.0) then it's not? Why? After all, I expect anyone who upgrades to 3.0 to run all of his unit tests to assert that everything still works (I expect that to happen for every release, but for 3.0 in particular). Obviously the runtime behaviors that were changed and documented in 2.9 are ones that he might have already taken care of, but why can't he do the same reading the CHANGES of 3.0? I just feel that this policy forces us to think really hard and foresee those changes in runtime behavior that we'd like to do in 3.0 so that we can get them
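The abstract-base-class migration path raised earlier in this thread (for the Weight interface) follows a general Java pattern. A minimal sketch with hypothetical names — this is not the actual Lucene 2.9 API:

```java
// Existing public interface; it cannot gain methods without breaking implementors.
interface LegacyWeight {          // stands in for the deprecated Weight interface
    float getValue();
}

// Abstract base class introduced alongside the interface in the transition
// release. New methods land here with sensible defaults, so subclasses keep
// compiling; the interface itself can then be removed in the next major release.
abstract class WeightBase implements LegacyWeight {
    // A method added after the interface was frozen:
    public boolean scoresDocsOutOfOrder() {
        return false;             // conservative default
    }
}

// User code migrates from "implements LegacyWeight" to "extends WeightBase".
class TermWeightSketch extends WeightBase {
    public float getValue() { return 1.0f; }
}
```

The single-inheritance concern above is exactly why the 2.9 heads-up matters: a class that already extends something else cannot simply switch to the base class and needs time to refactor.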
[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean
[ https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704665#action_12704665 ] Michael McCandless commented on LUCENE-1614: bq. I think one cannot even call next(). Hmm, yeah I think you're right. We could perhaps make this an entirely different interface (abstract class). Ie, one should not mix and match check()ing with next()/advancing. In the case I can think of, at least, it's an up-front decision as to which scorer does next vs check. bq. So I can include check() in the list as well. I think including it in this issue is fine. Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean Key: LUCENE-1614 URL: https://issues.apache.org/jira/browse/LUCENE-1614 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Fix For: 2.9 See http://www.nabble.com/Another-possible-optimization---now-in-DocIdSetIterator-p23223319.html for the full discussion. The basic idea is to add variants to those two methods that return the current doc they are at, to save successive calls to doc(). If there are no more docs, return -1. A summary of what was discussed so far: # Deprecate those two methods. # Add nextDoc() and skipToDoc(int) that return the doc, with default impls in DISI (calling next() and skipTo() respectively; they will be changed to abstract in 3.0). #* I actually would like to propose an alternative to the names: advance() and advance(int) - the first advances by one, the second advances to target. # Wherever these are used, do something like '(doc = advance()) >= 0' instead of comparing to -1, for improved performance. I will post a patch shortly -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
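The proposed calling pattern in the summary above can be sketched with hypothetical standalone types (not the actual DocIdSetIterator API): advance() returns the doc it lands on, or -1 when exhausted, so callers avoid a separate doc() call per step.

```java
// Sketch of the LUCENE-1614 proposal: iteration methods return the current
// doc directly, with -1 as the exhaustion sentinel.
abstract class DocIterSketch {
    static final int NO_MORE_DOCS = -1;
    abstract int advance();            // advance by one doc, return it
    abstract int advance(int target);  // advance to first doc >= target
}

class ArrayDocIter extends DocIterSketch {
    private final int[] docs;
    private int i = -1;
    ArrayDocIter(int[] docs) { this.docs = docs; }
    int advance() { return ++i < docs.length ? docs[i] : NO_MORE_DOCS; }
    int advance(int target) {
        while (++i < docs.length) {
            if (docs[i] >= target) return docs[i];
        }
        return NO_MORE_DOCS;
    }
    // Caller loop per the issue: compare >= 0 instead of checking a boolean
    // and then calling doc().
    static int count(DocIterSketch it) {
        int n = 0, doc;
        while ((doc = it.advance()) >= 0) n++;
        return n;
    }
}
```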
[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector
[ https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704674#action_12704674 ] Yonik Seeley commented on LUCENE-1593: -- Query objects are relatively abstract. Weights are created only with respect to a Searcher, and Scorers are created only from within that context with respect to an IndexReader. It really seems like we should maintain this separation and avoid putting implementation details into the Query object (or the Weight object, for that matter). bq. A user might want to know what Collector implementation to create before calling search(Query, Collector) Having to create a certain type of collector sounds error-prone. Why not reverse the flow of information and tell the Weight.scorer() method if an out-of-order scorer is acceptable, via some flags or a context object. This is also not backward compatible because Weight is an interface, so perhaps this optimization will just have to wait. Optimizations to TopScoreDocCollector and TopFieldCollector --- Key: LUCENE-1593 URL: https://issues.apache.org/jira/browse/LUCENE-1593 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Fix For: 2.9 Attachments: LUCENE-1593.patch, LUCENE-1593.patch, PerfTest.java This is a spin-off of LUCENE-1575 and proposes to optimize TSDC and TFC code to remove unnecessary checks. The plan is: # Ensure that IndexSearcher returns segments in increasing doc Id order, instead of numDocs(). # Change TSDC and TFC's code to not use the doc id as a tie breaker. New docs will always have larger ids and therefore cannot compete. # Pre-populate HitQueue with sentinel values in TSDC (score = Float.NEG_INF) and remove the check if reusableSD == null. # Also move to use changing top and then call adjustTop(), in case we update the queue. # some methods in Sort explicitly add SortField.FIELD_DOC as a tie breaker for the last SortField. 
But, doing so should not be necessary (since we already break ties by docID), and is in fact less efficient (once the above optimization is in). # Investigate PQ - can we deprecate insert() and have only insertWithOverflow()? Add an addDummyObjects method which will populate the queue without arranging it, just store the objects in the array (this can be used to pre-populate sentinel values)? I will post a patch as well as some perf measurements as soon as I have them.
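Point #3 of the plan above (sentinel pre-population) removes per-hit null/size checks because any real score beats the sentinel. A toy illustration of the idea — this is not Lucene's actual HitQueue, and the naive re-sort stands in for the real adjustTop():

```java
import java.util.Arrays;

// Toy top-N score keeper: the queue is born full of Float.NEGATIVE_INFINITY,
// so collect() never checks for null or for "queue not yet full" -- a real
// hit always displaces a sentinel.
class SentinelTopN {
    private final float[] q;   // q[0] holds the current worst (smallest) score

    SentinelTopN(int n) {
        q = new float[n];
        Arrays.fill(q, Float.NEGATIVE_INFINITY);  // the sentinels
    }

    void collect(float score) {
        if (score <= q[0]) return;  // doesn't beat the current worst: one cheap test
        q[0] = score;
        Arrays.sort(q);             // naive stand-in for adjustTop()
    }

    float worstScore() { return q[0]; }
}
```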
[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector
[ https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704692#action_12704692 ] Michael McCandless commented on LUCENE-1593: bq. What will TermQuery do? Oh: it's fine to return an in-order scorer, always. It's just that if a Query wants to use an out-of-order scorer, it should also implement an in-order one. Ie, there'd be a mating process to match the scorer to the collector. That there might be a Collector out there that requires docs in order is not something I think we should handle. Reason is, there wasn't any guarantee until today that docs are returned in order. So how can someone write a Collector which has a hard assumption on that? Maybe only if he used a Query which he knows always scores in order, such as TQ, but then I don't think this guy will have a problem since TQ returns true. bq. And if that someone needs docs in order, but the query at hand returns docs out of order, then I'd say tough luck? I mean, maybe with BQ we can ensure in/out of order on request, but if there will be a query which returns docs in random order, or based on other criteria which causes it to return out of order, what good will forcing it to return docs in order do? I'd say that you should just use a different query in that case? Well... we have to be careful. EG say we had some great optimization for iterating over matches to PhraseQuery, but it returned docs out of order. In that case, I think we'd preserve the in-order Scorer as well? bq. But I'm not sure in practice when one would want to use an out-of-order non-top iterator. One case might be a random access filter AND'd w/ a BooleanQuery. In that case I could ask for a BooleanScorer to return a DISI whose next is allowed to return docs out of order, because 1) my filter doesn't care and 2) my collector doesn't care. 
Though, we are thinking about pushing random access filters all the way down to the TermScorer, so this example isn't realistic in that future... but it still feels like out-of-order iteration and "am I a top scorer or not" are orthogonal concepts.
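The random-access-filter case above works precisely because order doesn't matter when probing a filter: each delivered doc is checked independently. A minimal sketch with hypothetical types (a boolean[] stands in for a random-access DocIdSet):

```java
// If the filter supports random access and the collector is order-insensitive,
// an out-of-order scorer is acceptable: every doc it delivers, in whatever
// order, gets an O(1) filter probe.
class RandomAccessFilterSketch {
    static int countFiltered(int[] docsInAnyOrder, boolean[] acceptDocs) {
        int hits = 0;
        for (int doc : docsInAnyOrder) {
            if (acceptDocs[doc]) hits++;  // order of delivery is irrelevant
        }
        return hits;
    }
}
```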
[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector
[ https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704693#action_12704693 ] Michael McCandless commented on LUCENE-1593: One further optimization can be enabled if we can properly mate out-of-orderness between Scorer and Collector: BooleanScorer could be automatically used when appropriate. Today, one must call BooleanQuery.setAllowDocsOutOfOrder, which is rather silly (it's very much an under-the-hood detail of how the Scorer interacts w/ the Collector). The vast majority of the time it's Lucene that creates the collector, and so now that we can create Collectors that either do or do not care if docs arrive out of order, we should allow BooleanScorer when we can. Though that means we have two ways to score a BooleanQuery: * Use BooleanScorer2 w/ a Collector that doesn't fall back to docID to break ties * Use BooleanScorer w/ a Collector that does fall back We'd need to test which is most performant (I'm guessing the 2nd one). So maybe we should in fact add an acceptsDocsOutOfOrder method to Collector.
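The mating idea can be sketched concretely (assumed names, not Lucene's actual API): the Collector declares whether it tolerates out-of-order docs, and the search side picks the scorer kind accordingly.

```java
// Hypothetical sketch of scorer/collector mating. The Collector advertises
// whether out-of-order delivery is acceptable; boolean-query scoring then
// picks the faster out-of-order BooleanScorer only when it may.
abstract class CollectorSketch {
    abstract boolean acceptsDocsOutOfOrder();
}

class OutOfOrderOkCollector extends CollectorSketch {
    boolean acceptsDocsOutOfOrder() { return true; }   // e.g. no docID tie-break
}

class InOrderOnlyCollector extends CollectorSketch {
    boolean acceptsDocsOutOfOrder() { return false; }  // relies on increasing docIDs
}

class ScorerChooser {
    // Which scorer implementation would be handed out for a boolean query.
    static String choose(CollectorSketch c) {
        return c.acceptsDocsOutOfOrder() ? "BooleanScorer" : "BooleanScorer2";
    }
}
```

The design point is that the decision moves from a user-visible query setting (setAllowDocsOutOfOrder) to an internal negotiation between collector and scorer.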
[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector
[ https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704699#action_12704699 ] Michael McCandless commented on LUCENE-1593: bq. This is also not backward compatible because Weight is an interface, so perhaps this optimization will just have to wait. Yonik would you suggest we migrate Weight to be an abstract class instead? (This is also being discussed in a separate thread on java-dev, if you want to respond there...)
[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector
[ https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704703#action_12704703 ] Michael McCandless commented on LUCENE-1593: Yonik does Solr have any Scorers that iterate on docs out of order? Or is BooleanScorer the only one we all know about?
[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector
[ https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704702#action_12704702 ] Michael McCandless commented on LUCENE-1593: {quote} IndexSearcher creates the Collector before it obtains a Scorer. Therefore all it has at hand is the Weight. Since Weight is an interface, we can't change it, so I added it to Query with a default of false. {quote} In early iterations on LUCENE-1483, we allowed Collector.setNextReader to return a new Collector, on the possibility that a new segment might require a different collector. We could consider going back to that... and allowing the builtin collectors to receive a Scorer on creation, which they could interact with to figure out in/out-of-order types of issues. We could then also enrich setNextReader a bit to also receive a Scorer, so that if somehow the Scorer for the next segment switched to be in-order vs out-of-order, the Collector could properly respond. Or we could require homogeneity for Scorer across all segments (which'd be quite a bit simpler). {quote} Why not reverse the flow of information and tell the Weight.scorer() method if an out-of-order scorer is acceptable via some flags or a context object. This is also not backward compatible because Weight is an interface, so perhaps this optimization will just have to wait. {quote} I tentatively like this approach, ie add an API to Collector for it to declare if it can handle out-of-order collection, and then ask for the right Scorer. But still, internal creation of Collectors could go both ways, and so we should retain the freedom to optimize (the BooleanScorer example above). 
[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector
[ https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704705#action_12704705 ] Michael McCandless commented on LUCENE-1593: bq. BooleanScorer could be automatically used when appropriate If we do this (and I think we should -- good perf gains, though I haven't tested just how good, recently), then we should deprecate setAllowDocsOutOfOrder in favor of Weight.scorer(boolean allowDocsOutOfOrder). And make it clear that internally Lucene may ask for either scorer, depending on the collector.
[jira] Commented: (LUCENE-1252) Avoid using positions when not all required terms are present
[ https://issues.apache.org/jira/browse/LUCENE-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704711#action_12704711 ] Michael McCandless commented on LUCENE-1252: Here's a simple example that might drive this issue forward: +h1n1 flu +united states Ideally, to score this query, you'd want to first AND all 4 terms together, and only for docs matching that, consult the positions of each pair of terms. But we fail to do this today. It's like somehow Weight.scorer() needs to be able to return a cheap and an expensive scorer (which must be AND'd). I think PhraseQuery would somehow return cheap/expensive scorers that under the hood share the same SegmentTermDocs/Positions iterators, such that after cheap.next() has run, the expensive scorer only needs to check the current doc. So in fact maybe the expensive scorer should not be a Scorer but some other simple 'passes or doesn't' API. Or maybe it returns say a TwoStageScorer, which adds a reallyPasses() (needs a better name) method to the otherwise normal Scorer (DISI) API. Or something else. Avoid using positions when not all required terms are present - Key: LUCENE-1252 URL: https://issues.apache.org/jira/browse/LUCENE-1252 Project: Lucene - Java Issue Type: Wish Components: Search Reporter: Paul Elschot Priority: Minor In the Scorers of queries with (lots of) Phrases and/or (nested) Spans, currently next() and skipTo() will use position information even when other parts of the query cannot match because some required terms are not present. This could be avoided by adding some methods to Scorer that relax the postcondition of next() and skipTo() to something like all required terms are present, but no position info was checked yet, and implementing these methods for Scorers that do conjunctions: BooleanScorer, PhraseScorer, and SpanScorer/NearSpans. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. 
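The two-stage idea above — a cheap conjunction pass first, positions consulted only for survivors — can be modeled minimally. Hypothetical types; a real PhraseScorer would share the same term/position iterators between the two stages:

```java
import java.util.Map;
import java.util.Set;

// Toy two-stage matcher: stage 1 is a cheap "all required terms present"
// check; stage 2 consults positions (here: exact adjacency for a two-word
// phrase) only for docs that survived stage 1.
class TwoStageSketch {
    // Stage 1: cheap conjunction over required terms.
    static boolean allPresent(Set<String> docTerms, String... required) {
        for (String t : required) {
            if (!docTerms.contains(t)) return false;
        }
        return true;
    }

    // Stage 2: the expensive positional check (the "reallyPasses" of the
    // comment above, under its hypothetical name).
    static boolean adjacent(Map<String, Integer> pos, String a, String b) {
        return pos.get(b) == pos.get(a) + 1;
    }

    static boolean phraseMatches(Map<String, Integer> pos, String a, String b) {
        if (!allPresent(pos.keySet(), a, b)) return false;  // skip positions entirely
        return adjacent(pos, a, b);
    }
}
```

The payoff is in phraseMatches(): for docs missing any required term, the position data is never touched.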
[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector
[ https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704720#action_12704720 ] Marvin Humphrey commented on LUCENE-1593: - I made Weight a subclass of Query and all of a sudden Searcher method signatures got easier to manage. PS: Is this a good place to discuss why [having rambling conversations in the bug tracker is a bad idea|http://producingoss.com/en/bug-tracker-usage.html], or should I open a new issue?
[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector
[ https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704721#action_12704721 ] Yonik Seeley commented on LUCENE-1593: -- bq. Yonik does Solr have any Scorers that iterate on docs out of order? Or is BooleanScorer the only one we all know about? Nope. BooleanScorer is the only one I know about. And it's sort of special too... it's not like BooleanScorer can accept out-of-order scorers as sub-scorers itself - the ids need to be delivered in the range of the current bucket. IMO custom out-of-order scorers aren't supported in Lucene.
dbsight
Hello Everyone, I just started to use lucene recently. Great project BTW. I was wondering if anyone has suggested making an open source version of dbsight (www.dbsight.net/). I've just started using it and I think it would be awesome if it was open source. Does anyone know of a project that's like this that is OS? If not, then how can I propose a project that does a similar thing? Thanks, Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: dbsight
Hi Mike, You may want to ask your question on java-u...@lucene.apache.org -J On Thu, Apr 30, 2009 at 11:59 AM, Michael Masters mmast...@gmail.comwrote: Hello Everyone, I just started to use lucene recently. Great project BTW. I was wondering if anyone has suggested making an open source version of dbsight (www.dbsight.net/). I've just started using it and I think it would be awesome if it was open source. Does anyone know of a project that's like this that is OS? If not, then how can I propose a project that does a similar thing? Thanks, Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Resolved: (LUCENE-1611) Do not launch new merges if IndexWriter has hit OOME
[ https://issues.apache.org/jira/browse/LUCENE-1611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-1611. Resolution: Fixed Fix Version/s: 2.4.2 Thanks Christiaan! Do not launch new merges if IndexWriter has hit OOME Key: LUCENE-1611 URL: https://issues.apache.org/jira/browse/LUCENE-1611 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 2.4.1 Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 2.4.2, 2.9 Attachments: LUCENE-1611-241.patch, LUCENE-1611.patch, LUCENE-1611.patch if IndexWriter has hit OOME, it defends itself by refusing to commit changes to the index, including merges. But this can lead to infinite merge attempts because we fail to prevent starting a merge. Spinoff from http://www.nabble.com/semi-infinite-loop-during-merging-td23036156.html. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: What are we allowed to do in 3.0?
So I understand from your responses that you think that 2.9 should include as much as possible, so that a user will have ~0 work upgrading from 2.9 to 3.0, assuming he upgraded fully to 2.9 (moved to use the non-deprecated APIs etc.). If 3.0 is supposed to be released quickly after 2.9 then this makes sense, and leaves very little room for sudden major changes anyway. Also I read that you do support introducing minor changes to the API as well as runtime behavior, but still prefer that we do them in 2.9. So refactoring like we discussed - changing all interfaces to abstract classes - should not happen in 2.9-3.0, which makes sense. I think this type of refactoring should happen one at a time anyway, and not as a complete overhaul of the code. Well ... I guess that if that's the case (releasing 3.0 soon after we release 2.9, and accepting only a few very minor changes to API and runtime behavior that were not documented in 2.9), then we should be fine and this very short discussion (I somehow expected many more responses) can end. Thanks a lot for clarifying that to me. Shai On Thu, Apr 30, 2009 at 6:20 PM, Michael McCandless luc...@mikemccandless.com wrote: I think 3.0 should be a fast turnaround after 2.9? Ie, no new development should take place? We should remove deprecated APIs, change defaults, etc., but that's about it. (I think this is how past major releases worked?). It's a fast switch. Which then means we need to do all the hard work in 2.9... So, I think any API changes we want to make must be present in 2.9 as deprecations. We shouldn't up and remove / rename something in 3.0 with no forewarning in 2.9. Is there a case where this is too painful? Likewise I think we should give notification of expected changes in runtime behavior with 2.9 (and not suddenly do them in 3.0).
Which means upon releasing 2.9, the very first issue that's opened and committed should be removing the current deprecated methods; otherwise we could open an issue that deprecates a method and accidentally remove it later, when we handle the massive deprecation removal. I think we should not target JAR drop-in-ability, and we should allow changes to runtime behavior, as well as certain minor API changes in 3.0. EG here are some of the changes already slated for 3.0: * IndexReader.open returns readOnly reader by default * IndexReader.norms returns null on fields that don't have norms * InterruptedException is thrown by many APIs * IndexWriter.autoCommit is hardwired to false * Things that now return the deprecated IndexCommitPoint (interface) will be changed to return IndexCommit (abstract base class) * Directory.list will be removed; Directory.listAll will become an abstract method * Stop tracking scores by default when sorting by field But recently, while working on LUCENE-1593, Mike and I spotted a need to add some methods to Weight, but since it is an interface we can't. So I said something like 'let's do it in 3.0', but then we were not sure if this can be done in 3.0. I think for this we should probably introduce an abstract base class (implementing Weight) in 2.9, stating that the Weight interface will be removed in 3.0? (EG, this is what was done for IndexCommitPoint/IndexCommit). Simply changing Weight to be an abstract class in 3.0 is spooky because Java is single inheritance, ie, for existing classes that implement Weight but subclass something else, it would be nicer to give a heads up with 2.9 that they'll need to refactor? Mike On Thu, Apr 30, 2009 at 6:25 AM, Shai Erera ser...@gmail.com wrote: Hi Recently I was involved in several issues that required some runtime changes to be done in 3.0 and it was not so clear what it is that we're actually allowed to do in 3.0. So I'd like to open it for discussion, unless everybody agrees on it already.
So far I can tell that 3.0 allows us to: 1) Get rid of all the deprecated methods 2) Move to Java 1.5 But what about changes to runtime behavior, re-factoring a whole set of classes etc? I'd like to relate to them one by one, so that we can comment on each. Removing deprecated methods - As I was told, 2.9 is the last release we are allowed to mark methods as deprecated, and remove them in 3.0. I.e., after 2.9 is out, if we feel there is a method that should be renamed, its signature should change or be removed altogether, we can't just do it and we'd have to deprecate it and remove it in 4.0 (?). I personally thought that 2.9 allows us to make these changes without letting anyone know about them in advance, which I'm ok with since upgrading to 3.0 is not going to be as 'smooth', but I also understand why 'letting people know in advance' (which means a release prior to the one when they are removed) gives a better service to our users. I also thought that jar drop-in-ability is not
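The interface-to-abstract-class migration path discussed in this thread (the route taken for IndexCommitPoint/IndexCommit) can be sketched with illustrative names; this shows the general pattern only, not the actual Weight code:

```java
// Release 2.9: the old interface stays but is deprecated in favor of an
// abstract base class that implements it, giving users one release of warning.
/** @deprecated extend {@link AbstractThing} instead; removed in 3.0. */
@Deprecated
interface Thing {
    float value();
}

// In 3.0, "implements Thing" is deleted and new methods can be added here
// with default implementations, which an interface (pre-Java 8) cannot offer.
abstract class AbstractThing implements Thing {
    // A method added after the interface froze gets a default body,
    // so existing subclasses keep compiling.
    public boolean scoresInOrder() { return true; }
}

// Existing user code migrates by switching "implements" to "extends".
class MyThing extends AbstractThing {
    public float value() { return 1f; }
}
```

The single-inheritance concern raised above is exactly why the heads-up release matters: a class that already extends something else and implements Weight needs a release cycle to restructure.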
[jira] Updated: (LUCENE-1313) Realtime Search
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Rutherglen updated LUCENE-1313: - Attachment: LUCENE-1313.patch {quote} Would you re-use MergePolicy, or make a new RAMMergePolicy? {quote} MergePolicy is used as is, with a special IW method that handles merging ram segments for the real directory (which has an issue around merging contiguous segments; can that be relaxed in this case, as I don't understand why this is?). The patch is not committable; however, I am posting it to show a path that seems to work. It includes test cases for merging in ram and merging to the real directory. * IW.getFlushDirectory is used by internal calls to obtain the directory to flush segments to. This is used in DocumentsWriter related calls. * DocumentsWriter.directory is removed so that methods requiring the directory call IW.getFlushDirectory instead. * IW.setRAMDirectory sets the ram directory to be used. * IW.setRAMMergePolicy sets the merge policy to be used for merging segments on the ram dir. * In IW.updatePendingMerges, totalRamUsed is the size of the ram segments + the ram buffer used. If totalRamUsed exceeds the max ram buffer size then IW.updatePendingRamMergesToRealDir is called. * IW.updatePendingRamMergesToRealDir registers a merge of the ram segments to the real directory (currently causes a non-contiguous segments exception) * MergePolicy.OneMerge has a directory attribute used when building the merge.info in _mergeInit.
* Test case includes testMergeInRam, testMergeToDisk, testMergeRamExceeded There is one error that occurs regularly in testMergeRamExceeded {code} MergePolicy selected non-contiguous segments to merge (_bo:cx83 _bm:cx4 _bn:cx2 _bl:cx1-_bj _bp:cx1-_bp _bq:cx1-_bp _c2:cx1-_c2 _c3:cx1-_c2 _c4:cx1-_c2 vs _5x:c120 _6a:c8 _6t:c11 _bo:cx83** _bm:cx4** _bn:cx2** _bl:cx1-_bj** _bp:cx1-_bp** _bq:cx1-_bp** _c1:c10 _c2:cx1-_c2** _c3:cx1-_c2** _c4:cx1-_c2**), which IndexWriter (currently) cannot handle {code} Realtime Search --- Key: LUCENE-1313 URL: https://issues.apache.org/jira/browse/LUCENE-1313 Project: Lucene - Java Issue Type: New Feature Components: Index Affects Versions: 2.4.1 Reporter: Jason Rutherglen Priority: Minor Fix For: 2.9 Attachments: LUCENE-1313.jar, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, lucene-1313.patch, lucene-1313.patch, lucene-1313.patch, lucene-1313.patch Realtime search with transactional semantics. Possible future directions: * Optimistic concurrency * Replication Encoding each transaction into a set of bytes by writing to a RAMDirectory enables replication. It is difficult to replicate using other methods because while the document may easily be serialized, the analyzer cannot. I think this issue can hold realtime benchmarks which include indexing and searching concurrently.
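The exception quoted above comes from a contiguity requirement: the segments chosen for a merge must form one unbroken run in the segment list. A minimal, illustrative version of that check (not the real IndexWriter/SegmentInfos code):

```java
import java.util.List;

// Sketch of the contiguity rule that the patch trips over; names are illustrative.
class ContiguityCheck {
    // Returns true if 'chosen' appears as one unbroken run inside 'all',
    // which is what merge registration insisted on at the time.
    static boolean isContiguous(List<String> all, List<String> chosen) {
        if (chosen.isEmpty()) return true;
        int start = all.indexOf(chosen.get(0));
        if (start < 0 || start + chosen.size() > all.size()) return false;
        return all.subList(start, start + chosen.size()).equals(chosen);
    }
}
```

A merge of ram segments interleaved with on-disk segments fails this test, which is why the ram-to-real-directory merge raises the non-contiguous exception.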
[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector
[ https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704785#action_12704785 ] Shai Erera commented on LUCENE-1593: bq. add an API to Collector for it to declare if it can handle out-of-order collection, and then ask for the right Scorer. Maybe instead add docsOrderSupportedMode() which returns IN_ORDER, OUT_OF_ORDER, DONT_CARE? I.e., instead of a boolean, allow a Collector to say I don't really care (like Mike has pointed out, I think, somewhere above) and let the Scorer creation code decide which one to create in case it knows any better. I.e., if we know that BS performs better than BS2, and we get a Collector saying DONT_CARE, we can always return BS. Unless we assume that OUT_OF_ORDER covers DONT_CARE as well, in which case we can leave it as returning boolean and document that if a Collector can support OUT_OF_ORDER, it should always say so, giving the Scorer creator code a chance to decide what is the best Scorer to return. In IndexSearcher we can then: # Where Collector is given as argument, ask it about its orderness and create the appropriate Scorer. # Where we create our own Collector (i.e. TFC and TSDC), decide on our own what is better. Maybe always ask for out-of-order? That way a Query which only supports in-order, without any optimization for out-of-order, can return that in-order Scorer. I didn't think of it initially, but Mike is right - every in-order scorer is also an out-of-order scorer, so this should be fine. I like the approach of deprecating Weight and creating an abstract class, though that requires deprecating Searchable and creating an AbstractSearchable as well. Weight can be wrapped with an AbstractWeightWrapper and passed to the AbstractWeight methods (much like we do with AbstractHitCollector from LUCENE-1575), defaulting its scorer(inOrder) method to call scorer()?
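The boolean variant of the idea above - a Collector declares whether it tolerates out-of-order docIDs, and the caller picks the cheaper scorer only when allowed - can be sketched as follows (illustrative classes, not the actual Collector/IndexSearcher API):

```java
// Sketch of scorer selection driven by the collector's ordering contract.
abstract class SimpleCollector {
    // true = docIDs may arrive out of order (BooleanScorer-style collection).
    abstract boolean acceptsDocsOutOfOrder();
}

class ScorerPicker {
    // If the collector tolerates out-of-order hits, we may hand it the faster
    // out-of-order scorer (BS); otherwise we must use the in-order one (BS2).
    // Every in-order scorer is trivially also a valid out-of-order scorer,
    // which is why saying "out-of-order is OK" is always safe for a collector.
    static String pickScorer(SimpleCollector c, boolean outOfOrderIsFaster) {
        if (c.acceptsDocsOutOfOrder() && outOfOrderIsFaster) {
            return "out-of-order";
        }
        return "in-order";
    }
}
```

This is also why the tri-state DONT_CARE can collapse into the boolean: OUT_OF_ORDER already gives the scorer-creation code full freedom to choose.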
This I guess should be done in the scope of that issue, or I revert the changes done to Query (adding scoresDocsInOrder()), but keep those done to TFC and TSDC, and make that optimization in a different issue, which will handle Weight/Searchable and the rest of the changes proposed here? Optimizations to TopScoreDocCollector and TopFieldCollector --- Key: LUCENE-1593 URL: https://issues.apache.org/jira/browse/LUCENE-1593 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Fix For: 2.9 Attachments: LUCENE-1593.patch, LUCENE-1593.patch, PerfTest.java This is a spin-off of LUCENE-1575 and proposes to optimize TSDC and TFC code to remove unnecessary checks. The plan is: # Ensure that IndexSearcher returns segments in increasing doc Id order, instead of numDocs(). # Change TSDC and TFC's code to not use the doc id as a tie breaker. New docs will always have larger ids and therefore cannot compete. # Pre-populate HitQueue with sentinel values in TSDC (score = Float.NEG_INF) and remove the check if reusableSD == null. # Also move to use changing top and then call adjustTop(), in case we update the queue. # Some methods in Sort explicitly add SortField.FIELD_DOC as a tie breaker for the last SortField. But doing so should not be necessary (since we already break ties by docID), and is in fact less efficient (once the above optimization is in). # Investigate PQ - can we deprecate insert() and have only insertWithOverflow()? Add an addDummyObjects method which will populate the queue without arranging it, just store the objects in the array (this can be used to pre-populate sentinel values)? I will post a patch as well as some perf measurements as soon as I have them.
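Item 3 of the plan - pre-populating the hit queue with sentinel values so the hot collect loop needs no null or size checks - in miniature (a plain-Java min-heap of scores, not Lucene's HitQueue):

```java
// Miniature of the sentinel trick: a fixed-size min-heap that is born "full"
// of -Infinity sentinels, so every insertion is a single compare against the
// current minimum, with no null checks or size bookkeeping on the hot path.
class SentinelHitQueue {
    private final float[] heap;

    SentinelHitQueue(int size) {
        heap = new float[size];
        java.util.Arrays.fill(heap, Float.NEGATIVE_INFINITY); // sentinels
    }

    // Competitive scores displace the current minimum; sentinels lose first.
    void insertWithOverflow(float score) {
        if (score <= heap[0]) return;   // not competitive, common fast path
        heap[0] = score;                // replace top, then restore heap order
        siftDown();
    }

    float top() { return heap[0]; }

    private void siftDown() {
        int i = 0;
        while (true) {
            int l = 2 * i + 1, r = l + 1, min = i;
            if (l < heap.length && heap[l] < heap[min]) min = l;
            if (r < heap.length && heap[r] < heap[min]) min = r;
            if (min == i) return;
            float t = heap[i]; heap[i] = heap[min]; heap[min] = t;
            i = min;
        }
    }
}
```

Real hits always beat -Infinity, so the queue behaves as empty until it genuinely fills, exactly the effect of the reusableSD == null check it replaces.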
Score calculation with new by-segment collection
Did I miss something, or when trunk switched to collecting on SegmentReaders did we lose proper scores? I mean, before, score depended on TF calculated across all the index, and now it depends on TF for a given segment (yup, unless I missed something). Per-segment TF can vary wildly, especially in case of smaller segments. -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785
Re: Score calculation with new by-segment collection
On Thu, Apr 30, 2009 at 4:44 PM, Earwin Burrfoot ear...@gmail.com wrote: Did I miss something, or when trunk switched to collecting on SegmentReaders we've lost proper scores? I mean, before score depended on TF calculated across all the index, and now it depends on TF for a given segment (yup, unless I missed something). Per-segment TF can vary wildly, especially in case of smaller segments. tf is per-document, not per index. idf is per index, and is calculated in the creation of Weight at the top-level index reader. -Yonik http://www.lucidimagination.com
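A worked example of Yonik's point: per-segment docFreq varies, but the Weight aggregates it over the whole index once, so idf is index-wide and per-segment collection does not change scores. This sketch uses the classic idf formula as an assumption; it is not Lucene's actual Similarity code path:

```java
// Sketch: idf depends only on index-wide statistics, summed over segments.
class IdfSketch {
    // Classic idf form: 1 + ln(numDocs / (docFreq + 1)).
    static double idf(int docFreq, int numDocs) {
        return 1 + Math.log((double) numDocs / (docFreq + 1));
    }

    // The Weight is created against the top-level reader, so docFreq and
    // numDocs are totals across all segments, however unevenly they split.
    static double indexWideIdf(int[] segmentDocFreqs, int[] segmentNumDocs) {
        int df = 0, n = 0;
        for (int x : segmentDocFreqs) df += x;
        for (int x : segmentNumDocs) n += x;
        return idf(df, n);
    }
}
```

tf, by contrast, is counted inside one document, so it never depended on segment boundaries in the first place.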
Re: Score calculation with new by-segment collection
On Fri, May 1, 2009 at 00:47, Yonik Seeley yo...@lucidimagination.com wrote: On Thu, Apr 30, 2009 at 4:44 PM, Earwin Burrfoot ear...@gmail.com wrote: Did I miss something, or when trunk switched to collecting on SegmentReaders we've lost proper scores? I mean, before score depended on TF calculated across all the index, and now it depends on TF for a given segment (yup, unless I missed something). Per-segment TF can vary wildly, especially in case of smaller segments. tf is per-document, not per index. idf is per index, Yup, my bad. and is calculated in the creation of Weight at the top-level index reader. Aha, thanks a lot.
Re: dbsight
Sorry. My mistake. -Mike
[jira] Updated: (LUCENE-1313) Realtime Search
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Rutherglen updated LUCENE-1313: - Attachment: LUCENE-1313.patch Fixed and cleaned up more. All tests pass. Added entry in CHANGES.txt. I'm going to integrate LUCENE-1618 and test that out as a part of the next patch.
[jira] Updated: (LUCENE-1494) Additional features for searching for value across multiple fields (many-to-one style)
[ https://issues.apache.org/jira/browse/LUCENE-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man updated LUCENE-1494: - Attachment: LUCENE-1494-masking.patch Some things looked like they wouldn't work with the masking patch, so I wrote some test cases to convince myself they were broken (and because new code should always have test cases). In particular I was worried about the lack of equals/hashCode methods, and the broken rewrite method. One interesting thing I discovered was that the code worked in many cases even though rewrite was constantly just returning the masked inner query -- digging into it I realized the reason was that none of the other SpanQuery classes pay any attention to what their nested clauses return when they recursively rewrite, so a SpanNearQuery whose constructor freaks out if the fields of all the clauses don't match happily generates spans if one of those clauses returns a completely different SpanQuery on rewrite. I also removed the proxying of getBoost and setBoost ... it was causing problems with some stock testing framework code that expects q1.equals(q1.clone().setBoost(newBoost)) to be false (this was evaluating to true because it's a shallow clone and setBoost was proxying and modifying the original inner query's boost value) ... this means that FieldMaskingSpanQuery is consistent with how other SpanQueries deal with boost (they ignore the boosts of their nested clauses). New patch (with tests) attached ...
I'd like to have some more tests before committing (spans is deep voodoo, we're doing funky stuff with spans, all the more reason to test thoroughly) Additional features for searching for value across multiple fields (many-to-one style) -- Key: LUCENE-1494 URL: https://issues.apache.org/jira/browse/LUCENE-1494 Project: Lucene - Java Issue Type: New Feature Components: Search Affects Versions: 2.4 Reporter: Paul Cowan Priority: Minor Attachments: LUCENE-1494-masking.patch, LUCENE-1494-masking.patch, LUCENE-1494-multifield.patch, LUCENE-1494-positionincrement.patch This issue is to cover the changes required to do a search across multiple fields with the same name in a fashion similar to a many-to-one database. Below is my post on java-dev on the topic, which details the changes we need: --- We have an interesting situation where we are effectively indexing two 'entities' in our system, which share a one-to-many relationship (imagine 'User' and 'Delivery Address' for demonstration purposes). At the moment, we index one Lucene Document per 'many' end, duplicating the 'one' end data, like so: userid: 1 userfirstname: fred addresscountry: au addressphone: 1234 userid: 1 userfirstname: fred addresscountry: nz addressphone: 5678 userid: 2 userfirstname: mary addresscountry: au addressphone: 5678 (note: 2 Documents indexed for user 1). This is somewhat annoying for us, because when we search in Lucene the results we want back (conceptually) are at the 'user' level, so we have to collapse the results by distinct user id, etc. (let alone that it blows out the size of our index enormously). So why do we do it? It would make more sense to use multiple fields: userid: 1 userfirstname: fred addresscountry: au addressphone: 1234 addresscountry: nz addressphone: 5678 userid: 2 userfirstname: mary addresscountry: au addressphone: 5678 But imagine the search +addresscountry:au +addressphone:5678.
We'd like this to match ONLY Mary, but of course it matches Fred also because he matches both those terms (just for different addresses). There are two aspects to the approach we've (more or less) got working, but I'd like to run them past the group and see if they're worth trying to get into Lucene proper (if so, I'll create a JIRA issue for them). 1) Use a modified SpanNearQuery. If we assume that country + phone will always be one token, we can rely on the fact that the positions of 'au' and '5678' in Fred's document will be different.
{code}
SpanQuery q1 = new SpanTermQuery(new Term("addresscountry", "au"));
SpanQuery q2 = new SpanTermQuery(new Term("addressphone", "5678"));
SpanQuery snq = new SpanNearQuery(new SpanQuery[]{q1, q2}, 0, false);
{code}
The slop of 0 means that we'll only return those where the two terms are in the same position in their respective fields. This works brilliantly, BUT requires a change to SpanNearQuery's constructor (which checks that all the clauses are against the same field). Are people amenable to perhaps adding another constructor to SNQ which doesn't do the check, or subclassing it to do the same (give it a
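Conceptually, the slop-0 trick matches a "row" across parallel fields: the query hits a document only if some single token position satisfies both constraints at once. In miniature, independent of the Span classes (plain Java over parallel arrays):

```java
// Sketch of the cross-field same-position match. Each address occupies the
// same token position in both multi-valued fields, so requiring one shared
// position is equivalent to requiring one address "row".
class ParallelFieldMatch {
    static boolean sameRowMatch(String[] countries, String[] phones,
                                String country, String phone) {
        for (int pos = 0; pos < countries.length; pos++) {
            if (countries[pos].equals(country) && phones[pos].equals(phone)) {
                return true;    // both constraints hold at one position
            }
        }
        return false;           // constraints only hold at different rows
    }
}
```

Fred (au/1234 and nz/5678) fails the au+5678 query under this rule, while Mary (au/5678) matches, which is exactly the desired many-to-one behavior.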
Hudson build is back to normal: Lucene-trunk #813
See http://hudson.zones.apache.org/hudson/job/Lucene-trunk/813/changes