Re: Lucene 1483 and Auto resolution

2009-04-24 Thread Michael McCandless
Urgh, right.  Can't we simply restore the AUTO resolution into it?
Existing (direct) usage of it must be passing in the top-level
IndexReader.  (Lucene doesn't use FSHQ internally).

Mike

On Thu, Apr 23, 2009 at 6:48 PM, Mark Miller markrmil...@gmail.com wrote:
 Just got off the train and NY to CT has a brilliant bar car, so lest I
 forget:

 1483 moved auto resolution from fshq to indexsearcher - which is a back
 compat break if you were using a fshq without indexsearcher (Solr does it -
 anyone could). Annoying. If I remember right, I did it to resolve auto on
 the multireader rather than each individual segment reader. So the change is
 needed and not allowed. Perhaps it could just re-resolve like before though
 - if indexsearcher has already resolved, fine, otherwise it will be done
 again at the fshq level. I'll issue it up later.

 --
 - Mark

 http://www.lucidimagination.com




 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org






I am unable to create index of an object having composite key

2009-04-24 Thread gopalbisht

Hi all,

I am using Hibernate Search with Lucene. I need to index the DomainTag
object, which has a composite key. I am unsure how to define the
annotations for the composite key in the DomainTag (POJO) class.
If anyone can help, please do. Thanks in advance.

My DomainTag.hbm.xml file is as follows:
--
<?xml version="1.0"?>
<!DOCTYPE hibernate-mapping PUBLIC
    "-//Hibernate/Hibernate Mapping DTD 3.0//EN"
    "http://hibernate.sourceforge.net/hibernate-mapping-3.0.dtd">

<hibernate-mapping>
<!--
    Created by the Middlegen Hibernate plugin 2.2

    http://boss.bekk.no/boss/middlegen/
    http://www.hibernate.org/
-->

<class
    name="com.test.manager.DomainTag"
    table="domaintag"
>

    <composite-id>
        <!-- Associations -->
        <!-- derived association(s) for compound key -->

        <!-- bi-directional many-to-one association to DomainTest -->
        <key-many-to-one
            name="topic"
            class="com.test.manager.DomainTest">
            <column name="domain_id"/>
        </key-many-to-one>

        <!-- bi-directional many-to-one association to Tag -->
        <key-many-to-one
            name="tag"
            class="com.test.manager.Tag">
            <column name="tag_id"/>
        </key-many-to-one>

    </composite-id>

</class>
</hibernate-mapping>
--
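For what it's worth, a common Hibernate Search approach for a composite key is a custom two-way bridge on the document id. The sketch below assumes Hibernate Search 3.x annotations on the POJO behind the mapping above; `DomainTagId` and `DomainTagIdBridge` are hypothetical names, and the bridge body is elided:

```java
// Sketch only: assumes Hibernate Search 3.x. DomainTagId is a hypothetical
// class holding the two key parts; DomainTagIdBridge is a hypothetical
// TwoWayFieldBridge that flattens them into a single indexed token and back.
@Indexed
public class DomainTag {

    @DocumentId
    @FieldBridge(impl = DomainTagIdBridge.class)
    private DomainTagId id;

    // remaining properties, annotated with @Field as usual
}
```

The Hibernate Search forum is the better place to confirm the details.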
-- 
View this message in context: 
http://www.nabble.com/I-am-unable-to-create-index-of-an-object-having-composite-key-tp23211575p23211575.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.





Re: Lucene 1483 and Auto resolution

2009-04-24 Thread Mark Miller
Ah, right - that's basically what I was suggesting, though I was thinking 
that we might need to resolve twice, but of course you're right and we 
don't have to. Lucene does still use fshq for back compat (for custom 
field source I think?), but I can just do the auto resolution in 
IndexSearcher only in non-legacy mode and restore auto resolution to 
fshq. That will avoid the harmless but wasteful double resolve in legacy 
mode. Had no code to look at when I sent that out.
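A toy model of the arrangement being described (hypothetical names, not Lucene's real API): AUTO resolution is idempotent, so running it again at the queue level after IndexSearcher has already resolved is a harmless no-op:

```java
import java.util.Arrays;

// Sketch only: hypothetical names (SortKind, resolveAuto), not Lucene's API.
// AUTO fields are resolved to a concrete type once; resolving an already
// resolved array changes nothing, so both IndexSearcher and the queue can
// safely call it.
public class AutoResolve {
    enum SortKind { AUTO, INT, STRING }

    static SortKind[] resolveAuto(SortKind[] fields) {
        SortKind[] out = fields.clone();
        for (int i = 0; i < out.length; i++) {
            if (out[i] == SortKind.AUTO) {
                out[i] = SortKind.INT; // pretend term inspection said "int"
            }
        }
        return out;
    }

    public static void main(String[] args) {
        SortKind[] once = resolveAuto(new SortKind[] { SortKind.AUTO, SortKind.STRING });
        SortKind[] twice = resolveAuto(once); // second resolve is a no-op
        System.out.println(Arrays.equals(once, twice)); // true
    }
}
```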


- Mark








Re: Lucene 1483 and Auto resolution

2009-04-24 Thread Mark Miller
Eh - we have to spin through all the fields to check for legacy first 
anyway. Just doing it twice in legacy mode is probably best? I think 
there is no way to avoid spinning through the fields twice in any case 
(you have to spin through the sort fields to know you don't want to spin 
through the sort fields), so I guess I'll just go with that.


- Mark


--
- Mark

http://www.lucidimagination.com







Re: Lucene 1483 and Auto resolution

2009-04-24 Thread Mark Miller
I'm just putting the auto resolution back in fshq and there is no double 
check - foolish me. As I first said, we will either resolve first in 
back-compat mode and it will be skipped in fshq, or it will be resolved 
in fshq for anyone using it externally (and not resolving first on their 
own). I'll commit the change shortly.







--
- Mark

http://www.lucidimagination.com







Re: I am unable to create index of an object having composite key

2009-04-24 Thread Erik Hatcher
You'll do best to direct this question to the Hibernate group.  java-dev 
is for Lucene development, so it's not an appropriate place to ask.  
java-user would be better, but your question is really Hibernate-specific.


Erik








[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector

2009-04-24 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702343#action_12702343
 ] 

Shai Erera commented on LUCENE-1593:


A few updates (it's been a while since I posted on this issue):
* I tried to pre-populate the queue in TFC, but that proved to be 
impossible (unless I missed something). The only reason to do it was to get rid 
of the 'if (queueFull)' check in every collect. However, it turned out that 
pre-populating the queue for TFC with sentinel values is unreliable. The reason is, 
if I want to get rid of that 'if', I don't want to add any 'if' to 
FieldComparator.compare, so the sentinel values would have to be something like 
MIN_VALUE or MAX_VALUE (depending on the value of 'reverse'). However, someone 
can set a field value to either of these values (and there are tests that do), 
so I need to check in FC whether the current value is a sentinel, which adds that 
'if' back, only worse - it's now executed in every compare() call. Unless I 
missed something, I don't think that's possible, or at least worth the effort 
(to get rid of one 'if').
** BTW, in TSDC I use Float.NEG_INF as a sentinel value. This might be broken 
if a Scorer decided to return that value, in which case pre-populating the 
queue would not work either. I think it is still safe in TSDC, but I want to get 
your opinion.

* I changed Sort.setSort() to not add SortField.FIELD_DOC, as suggested by 
Mike. But then TestSort.testMultiSort failed. So I debugged it and either the 
test works fine but there's a bug in MultiSearcher, or the test does not work 
fine (or should be adjusted) but then we'll be changing runtime behavior of 
Sort (e.g., everyone who used setSort might get undesired behavior, only in 
MultiSearcher).
** MultiSearcher's search(Weight, Filter, int, Sort) method executes a search 
against all its Searchables, sequentially. The test adds documents to two 
indexes, odd and even, so that the odd index is added two documents (A and E) 
with the value 5 in docs 0 and 2, and the even index is added one doc (B) 
with value 5 in index 0. When I use the 2 SortField version (2nd sorts by 
doc), the output is ABE, since the compare by doc uses the searcher's doc Ids 
(0, 0, 2) and B is always less than E and so preferred, even though its 'fixed' 
doc Id is 7 (it appears in the 2nd Searcher). When I use the 1 SortField 
version, the result is AEB, since B's fixed doc Id is 7, and now the code 
breaks a tie on their *true* doc Id.

I hope you were able to follow my description. That's why I don't know which 
one is the true output, ABE or AEB. I tend to say AEB, since those are the true 
doc Ids the application will get in the end. A few more comments:
# TestSort.runMultiSort contains 3 versions of this test:
#* One that sorts by the INT value only, relying on doc Id tie-breaking. 
You can see the output was not expected to be exact, so the pattern 
used is [ABE]{3}, as if the test's output cannot be predicted. I think that's 
wrong in the first place; we should always have predictable test output, since 
we're not involving any randomness in that code.
#* The second explicitly sets INT followed by DOC Sorts, and expects [AB]{2}E.
#* The third relies on setSort's adding DOC, so it expects the same [AB]{2}E.
# The problem in MultiSearcher is that it uses FieldDocSortedHitQueue, but 
doesn't update the DOC field's value the same way it does the scoreDoc.doc 
value (adding the current Searcher's start).

Again, whatever the right output is, changing Sort to not include 
SortField.FIELD_DOC might result in someone experiencing a change in behavior 
(if they used MultiSearcher), even if that behavior is buggy.

What do you think?
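For what it's worth, the TSDC-style sentinel idea can be sketched with a plain fixed-size min-heap (hypothetical code, not Lucene's HitQueue): pre-filling with -Infinity means every hit simply competes with top(), with no "is the queue full yet?" branch.

```java
import java.util.Arrays;

// Sketch only (hypothetical, not Lucene's HitQueue): pre-populating a
// fixed-size min-heap with -Infinity sentinels lets collect() skip the
// "is the queue full yet?" branch - every hit just competes with top().
public class SentinelQueue {
    private final float[] heap;

    SentinelQueue(int size) {
        heap = new float[size];
        Arrays.fill(heap, Float.NEGATIVE_INFINITY); // sentinels lose to any real score
    }

    float top() { return heap[0]; }

    void collect(float score) {
        if (score <= top()) return;   // the only branch per hit
        heap[0] = score;
        downHeap(0);
    }

    private void downHeap(int i) {
        for (;;) {
            int l = 2 * i + 1, r = l + 1, s = i;
            if (l < heap.length && heap[l] < heap[s]) s = l;
            if (r < heap.length && heap[r] < heap[s]) s = r;
            if (s == i) return;
            float t = heap[i]; heap[i] = heap[s]; heap[s] = t;
            i = s;
        }
    }

    public static void main(String[] args) {
        SentinelQueue q = new SentinelQueue(3);
        for (float f : new float[] { 1f, 5f, 2f, 9f, 3f }) q.collect(f);
        float[] top = q.heap.clone();
        Arrays.sort(top);
        System.out.println(Arrays.toString(top)); // top-3 scores: [3.0, 5.0, 9.0]
    }
}
```

The caveat above applies: a scorer that legitimately produced -Infinity could never displace a sentinel, so the trick relies on real scores staying above that value.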

 Optimizations to TopScoreDocCollector and TopFieldCollector
 ---

 Key: LUCENE-1593
 URL: https://issues.apache.org/jira/browse/LUCENE-1593
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
 Fix For: 2.9


 This is a spin-off of LUCENE-1575 and proposes to optimize TSDC and TFC code 
 to remove unnecessary checks. The plan is:
 # Ensure that IndexSearcher returns segments in increasing doc Id order, 
 instead of numDocs().
 # Change TSDC and TFC's code to not use the doc id as a tie breaker. New docs 
 will always have larger ids and therefore cannot compete.
 # Pre-populate HitQueue with sentinel values in TSDC (score = Float.NEG_INF) 
 and remove the check if reusableSD == null.
 # Also move to use changing top and then call adjustTop(), in case we 
 update the queue.
 # some methods in Sort explicitly add SortField.FIELD_DOC as a tie breaker 
 for the last SortField. But, doing so should not be necessary (since we 
 already break ties by docID), and is in fact less efficient (once the above 
 optimization is in).
 # Investigate PQ - can we deprecate insert() and have only 
 insertWithOverflow()? Add an addDummyObjects method which will populate the 
 queue without arranging it, just store the objects in the array (this can 
 be used to pre-populate sentinel values)?
 I will post a patch as well as some perf measurements as soon as I have them.

[jira] Commented: (LUCENE-1609) Eliminate synchronization contention on initial index reading in TermInfosReader ensureIndexIsRead

2009-04-24 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702379#action_12702379
 ] 

Yonik Seeley commented on LUCENE-1609:
--

Re: why the lazy load
http://www.lucidimagination.com/search/document/97e73361748808b/terminfosreader_lazy_term_index_reading#2a73aaca25d516ec

 Eliminate synchronization contention on initial index reading in 
 TermInfosReader ensureIndexIsRead 
 ---

 Key: LUCENE-1609
 URL: https://issues.apache.org/jira/browse/LUCENE-1609
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.9
 Environment: Solr 
 Tomcat 5.5
 Ubuntu 2.6.20-17-generic
 Intel(R) Pentium(R) 4 CPU 2.80GHz, 2Gb RAM
Reporter: Dan Rosher
 Attachments: LUCENE-1609.patch


 The synchronized method ensureIndexIsRead in TermInfosReader causes contention 
 under heavy load.
 Simple to reproduce: e.g. under Solr, with all caches turned off, do a simple 
 range search, e.g. id:[0 TO 99], on even a small index (in my case 28K 
 docs) under a load/stress test application; later, examining the 
 thread dump (kill -3), many threads are blocked on 'waiting for monitor 
 entry' to this method.
 Rather than using double-checked locking, which is known to have issues, this 
 implementation uses a state pattern, where only one thread can move the 
 object from the IndexNotRead state to IndexRead, and in doing so alters the 
 object's behavior, i.e. once the index is loaded, the index no longer needs a 
 synchronized method. 
 In my particular test, this increased throughput at least 30 times.
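The state-pattern idea in the description can be sketched independently of TermInfosReader; the names below are hypothetical and the attached LUCENE-1609.patch is the authoritative version. The initial state loads under a lock and swaps itself out; the loaded state answers with no synchronization at all.

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch only - hypothetical names; the attached patch is the real change.
// Instead of a synchronized ensureIndexIsRead() on every call, behavior lives
// in a State reference: the IndexNotRead state loads under a lock and swaps
// itself out; the IndexRead state answers lock-free.
public class LazyTermIndex {
    interface State { long[] index(); }

    private final AtomicReference<State> state = new AtomicReference<>();

    public LazyTermIndex() {
        state.set(new State() {                       // IndexNotRead
            public synchronized long[] index() {
                State current = state.get();
                if (current != this) return current.index(); // lost the race: already swapped
                long[] loaded = { 10L, 20L, 30L };    // stand-in for reading the term index
                state.set(() -> loaded);              // swap to lock-free IndexRead
                return loaded;
            }
        });
    }

    public long[] index() { return state.get().index(); }

    public static void main(String[] args) {
        LazyTermIndex t = new LazyTermIndex();
        System.out.println(t.index().length); // 3
    }
}
```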

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Commented: (LUCENE-1609) Eliminate synchronization contention on initial index reading in TermInfosReader ensureIndexIsRead

2009-04-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702404#action_12702404
 ] 

Michael McCandless commented on LUCENE-1609:


If it's only for segment merging, couldn't we up front conditionalize the 
loading of the index?




Re: Lucene 1483 and Auto resolution

2009-04-24 Thread Michael McCandless
OK that looks right.

Though, maybe you should add javadocs to FieldValueHitQueue saying it
does *not* do AUTO resolution?  (Since we've deprecated FSHQ in favor
of FVHQ, I think we should call out this difference?).  And state the
workaround (calling FVHQ.detectFieldType yourself)?

Mike




[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector

2009-04-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702417#action_12702417
 ] 

Michael McCandless commented on LUCENE-1593:



bq. I tried to move to pre-populate the queue in TFC, but that proved to be 
impossible (unless I missed something)

I think it should work fine, for most types, because we'd set docID
(the tie breaker) to Integer.MAX_VALUE.  No special additional if is
then required, since that entry would always compare at the bottom?

For String we should be able to use U+.

bq. So I debugged it and either the test works fine but there's a bug in 
MultiSearcher, or the test does not work fine (or should be adjusted) but then 
we'll be changing runtime behavior of Sort (e.g., everyone who used setSort 
might get undesired behavior, only in MultiSearcher).

Hmm -- good catch!  I think this is in fact a bug in MultiSearcher
(AEB is the right answer): it's failing to break ties (by docID)
properly.  Ie, it will not match what
IndexSearcher(MultiSegmentReader(...)) will do.

I think we could fix this by allowing one to pass in a docBase when
searching?  Though that's a more involved change... could you open a
new issue for this one?

bq.  I think that's wrong in the first place, we should always have predicted 
tests output, since we're not involving any randomness in that code.

I agree -- let's fix it to be a deterministic test?
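The docID sentinel described above can be sketched as a plain comparator (hypothetical code, not TFC's): slots pre-filled with doc = Integer.MAX_VALUE always sort to the bottom, even when a real hit ties on the field value, so no sentinel-specific branch is needed.

```java
import java.util.Arrays;
import java.util.Comparator;

// Sketch only - not TFC's real code. Each slot is {fieldValue, docId}; the
// sentinel slots use docId = Integer.MAX_VALUE so they compare worst even
// against a real hit whose field value equals the sentinel's.
public class SentinelCompare {
    public static void main(String[] args) {
        int[][] slots = {
            { 5, Integer.MAX_VALUE },  // pre-populated sentinel
            { 5, 7 },                  // real hit with the same field value
            { 3, 2 },                  // real hit, smaller value sorts first
        };
        // ascending by value, then by docId - no sentinel-specific 'if'
        Arrays.sort(slots, Comparator.<int[]>comparingInt(s -> s[0]).thenComparingInt(s -> s[1]));
        System.out.println(Arrays.deepToString(slots));
        // [[3, 2], [5, 7], [5, 2147483647]] - the sentinel lands last
    }
}
```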





Re: Lucene 1483 and Auto resolution

2009-04-24 Thread Mark Miller

Will do.



--
- Mark

http://www.lucidimagination.com







Re: Fuzzy search optimization

2009-04-24 Thread Michael McCandless
Please do!

Mike

On Thu, Apr 23, 2009 at 7:13 AM, Varun Dhussa va...@mapmyindia.com wrote:
 Hi,

 I was going through the Levenshtein distance code in
 org.apache.lucene.search.FuzzyTermEnum.java of the 2.4.1 build. I
 noticed that there can be a small, but effective optimization to the
 distance calculation code (initialization). I have the code ready with
 me. I can post it if anyone is interested.

 Thanks and regards
 Varun Dhussa
 Product Architect
 CE InfoSystems (P) Ltd.
 http://maps.mapmyindia.com
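For reference, the classic two-row formulation of Levenshtein distance, which is the kind of initialization/space optimization such patches usually target (a generic sketch, not the actual change to FuzzyTermEnum): only the previous row is kept, and the rows are reused in place instead of allocating a full matrix.

```java
// Sketch only: generic two-row Levenshtein. Only the previous row is kept;
// the two row buffers are swapped and reused instead of allocating an
// (n+1) x (m+1) matrix.
public class Levenshtein {
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j; // cost of inserting j chars
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] t = prev; prev = curr; curr = t;         // reuse rows, no new allocation
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting")); // 3
    }
}
```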





[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector

2009-04-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702418#action_12702418
 ] 

Michael McCandless commented on LUCENE-1593:



bq. Actually, I think BooleanScorer need not process docs out of order? The 
only out-of-orderness seems to come from how it appends each new Bucket to 
the head of the linked list; if instead it appended to the tail, the collector 
would see docs arrive in order. I think?

I was wrong about this -- you also get out-of-orderness due to
2nd, 3rd, etc. clauses in the boolean query adding in new docs.
Re-ordering those will be costly.

But: in the initial collection of each chunk, we do know that any
docID in the queue will be less than those being visited, so this
could be a good future optimization for immediately ruling out docs
during TermScorer.  That's a larger change... e.g. it'd require being
able to say "I don't need an exact total hit count for this search."

 Optimizations to TopScoreDocCollector and TopFieldCollector
 ---

 Key: LUCENE-1593
 URL: https://issues.apache.org/jira/browse/LUCENE-1593
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
 Fix For: 2.9


 This is a spin-off of LUCENE-1575 and proposes to optimize TSDC and TFC code 
 to remove unnecessary checks. The plan is:
 # Ensure that IndexSearcher returns segments in increasing doc Id order, 
 instead of numDocs().
 # Change TSDC and TFC's code to not use the doc id as a tie breaker. New docs 
 will always have larger ids and therefore cannot compete.
 # Pre-populate HitQueue with sentinel values in TSDC (score = Float.NEG_INF) 
 and remove the check if reusableSD == null.
 # Also move to use changing top and then call adjustTop(), in case we 
 update the queue.
 # some methods in Sort explicitly add SortField.FIELD_DOC as a tie breaker 
 for the last SortField. But, doing so should not be necessary (since we 
 already break ties by docID), and is in fact less efficient (once the above 
 optimization is in).
 # Investigate PQ - can we deprecate insert() and have only 
 insertWithOverflow()? Add a addDummyObjects method which will populate the 
 queue without arranging it, just store the objects in the array (this can 
 be used to pre-populate sentinel values)?
 I will post a patch as well as some perf measurements as soon as I have them.
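The sentinel idea in item 3 of the plan can be sketched independently of Lucene's HitQueue: pre-filling the structure with worst-possible entries means the hot collect() path only ever compares against the current worst and overwrites it, with no null or size checks. A minimal sketch under those assumptions (a plain array stands in for the real heap):

```java
import java.util.Arrays;

// Minimal sketch of the "pre-populate with sentinels" idea: fill a fixed-size
// bottom-tracking structure with NEGATIVE_INFINITY scores up front, so
// collect() never needs a null or size check -- it only compares against the
// current worst entry. Illustrative only; Lucene's HitQueue is a real heap,
// this uses a linear scan over a plain array for brevity.
public final class SentinelTopScores {
    private final float[] scores;

    SentinelTopScores(int size) {
        scores = new float[size];
        Arrays.fill(scores, Float.NEGATIVE_INFINITY); // sentinel values
    }

    void collect(float score) {
        int worst = 0;
        for (int i = 1; i < scores.length; i++)       // find current bottom
            if (scores[i] < scores[worst]) worst = i;
        if (score > scores[worst]) scores[worst] = score; // no null check needed
    }

    float[] top() {
        float[] copy = scores.clone();
        Arrays.sort(copy);
        return copy; // ascending; sentinels remain if fewer hits than size
    }

    public static void main(String[] args) {
        SentinelTopScores q = new SentinelTopScores(3);
        for (float s : new float[] {0.2f, 0.9f, 0.1f, 0.7f, 0.5f}) q.collect(s);
        System.out.println(Arrays.toString(q.top())); // [0.5, 0.7, 0.9]
    }
}
```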

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Commented: (LUCENE-1539) Improve Benchmark

2009-04-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702436#action_12702436
 ] 

Michael McCandless commented on LUCENE-1539:



{quote}
Yeah? Ok. So the deleteDocsByPercent method needs to somehow
take into account whether it's deleted before by adjusting the
doc nums it's deleting?
{quote}

How about randomly choosing docs to delete instead of every N?  Then
you don't need to keep track?
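Random deletion as suggested is straightforward; a hypothetical helper (not contrib/benchmark code) that picks a random sample of doc IDs each pass, so nothing about previous passes needs to be remembered:

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Sketch of the suggestion above: instead of deleting every Nth doc (which
// forces the task to remember where previous passes left off), draw a random
// sample of doc IDs each time. Hypothetical helper, not contrib/benchmark code.
public final class RandomDeletes {
    static Set<Integer> pickDocsToDelete(int maxDoc, double percent, Random rnd) {
        int target = (int) (maxDoc * percent / 100.0);
        Set<Integer> picked = new HashSet<>();
        while (picked.size() < target)
            picked.add(rnd.nextInt(maxDoc));  // duplicates are simply re-drawn
        return picked;
    }

    public static void main(String[] args) {
        Set<Integer> del = pickDocsToDelete(1000, 10.0, new Random(42));
        System.out.println(del.size()); // 100
    }
}
```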

{quote}
 I don't think we can relax that. This (single transaction
 (writer) open at once) is a core assumption in Lucene.

True, however doesn't mean we have to stick with it, especially
internally. Hopefully we can move to a more componentized model where
someone could change this if they wanted. Perhaps in the
flexible indexing revamp
{quote}

We'd need to figure out how to get multiple writers to properly
cooperate.  Actually Marvin is working on something like this (for
KS/Lucy), where one lightweight writer can do adds/deletes/small
merges, and a separate heavyweight writer does large merges.


 Improve Benchmark
 -

 Key: LUCENE-1539
 URL: https://issues.apache.org/jira/browse/LUCENE-1539
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark
Affects Versions: 2.4
Reporter: Jason Rutherglen
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, 
 LUCENE-1539.patch, sortBench2.py, sortCollate2.py

   Original Estimate: 336h
  Remaining Estimate: 336h

 Benchmark can be improved by incorporating recent suggestions posted
 on java-dev. M. McCandless' Python scripts that execute multiple
 rounds of tests can either be incorporated into the codebase or
 converted to Java.






[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector

2009-04-24 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702448#action_12702448
 ] 

Shai Erera commented on LUCENE-1593:


bq. I think it should work fine, for most types, because we'd set docID (the 
tie breaker) to Integer.MAX_VALUE. No special additional if is then required, 
since that entry would always compare at the bottom?

That's not so simple. Let's say I initialize the sentinel of IntComp to 
Integer.MAX_VALUE. That should have guaranteed that any 'int' < MAX_VAL would 
compare better. But the code in TFC compares the current doc against the 
'bottom'. For all Sentinels, it means MAX_VAL. If the input doc's val < 
MAX_VAL, it compares better. Otherwise, it is rejected, because:
# If it is bigger than the bottom, it should be rejected.
# If it equals, it's also rejected, since now that we move to returning docs in 
order, it is assumed that this doc's doc Id is greater than whatever is in the 
queue, and so it's rejected.
Actually, the tie is broken only after it's in the queue, when the latter calls 
compare(). This assumption removed the 'if' that checked for doc Id value, so 
if I reinstate it, we don't really gain anything, right (replacing 'if 
(queueFull)' with 'if (bottom.doc > doc + docBase)')?

bq. For String we should be able to use U+.

If we resolve the core issues with sentinel values, then this will be the value 
I'd use for Strings, right.

bq. I think we could fix this by allowing one to pass in a docbase when 
searching?

I actually would like to propose the following: MultiSearcher already fixes the 
FD.doc before inserting it to the queue. I can do the same for the 
FieldDoc.fields() value, in case one of the fields is FIELD_DOC. This can 
happen just after searcher.search() returns, and before MS adds the results to 
its own FieldDocSortedHitQueue. I already did it, and all the testMultiSearch 
cases fail, but that's because they just assert that the bug exists :).
If you think a separate issue is still required, I can do it, but that would 
mean that the tests will fail until we fix it, or I don't change Sort in this 
issue and do it as part of the other one.

bq. let's fix it to be a deterministic test?

Will do, but it depends - if a new issue is required, I'll do it there.

bq. I was wrong about this

I must say I still didn't fully understand what you mean here. I intended to 
keep that to after everything else will work in that issue's scope, and note if 
there are any tests that fail, or BQ actually behaves properly. So I'll simply 
count on what you say is true :), since I'm not familiar with that code.
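The doc-base fix-up described above amounts to shifting each sub-searcher's local doc IDs by that sub-index's starting offset before merging. A standalone sketch (hypothetical code, not Lucene's MultiSearcher):

```java
import java.util.Arrays;

// Sketch of the doc-base fix-up MultiSearcher-style aggregation needs: each
// sub-searcher numbers docs from 0, so its results must be shifted by the
// sub-index's starting offset before they can be merged into a single queue.
// Hypothetical standalone code, not Lucene's actual MultiSearcher.
public final class DocBaseDemo {
    static int[] toGlobal(int[] localDocs, int docBase) {
        int[] global = new int[localDocs.length];
        for (int i = 0; i < localDocs.length; i++)
            global[i] = localDocs[i] + docBase;  // local id -> global id
        return global;
    }

    public static void main(String[] args) {
        // sub-index 0 has 100 docs, so sub-index 1 starts at docBase = 100
        System.out.println(Arrays.toString(toGlobal(new int[] {0, 5, 42}, 100)));
        // [100, 105, 142]
    }
}
```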

 Optimizations to TopScoreDocCollector and TopFieldCollector
 ---

 Key: LUCENE-1593
 URL: https://issues.apache.org/jira/browse/LUCENE-1593
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
 Fix For: 2.9








[jira] Commented: (LUCENE-1516) Integrate IndexReader with IndexWriter

2009-04-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702461#action_12702461
 ] 

Michael McCandless commented on LUCENE-1516:


bq. We need an IndexWriter.getMergedSegmentWarmer method?

Yes, I just committed; thanks!

 Integrate IndexReader with IndexWriter 
 ---

 Key: LUCENE-1516
 URL: https://issues.apache.org/jira/browse/LUCENE-1516
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 2.4
Reporter: Jason Rutherglen
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, 
 LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, 
 LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, 
 LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, 
 LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, 
 LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, 
 LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, 
 LUCENE-1516.patch, magnetic.png, ssd.png, ssd2.png

   Original Estimate: 672h
  Remaining Estimate: 672h

 The current problem is an IndexReader and IndexWriter cannot be open
 at the same time and perform updates as they both require a write
 lock to the index. While methods such as IW.deleteDocuments enables
 deleting from IW, methods such as IR.deleteDocument(int doc) and
 norms updating are not available from IW. This limits the
 capabilities of performing updates to the index dynamically or in
 realtime without closing the IW and opening an IR, deleting or
 updating norms, flushing, then opening the IW again, a process which
 can be detrimental to realtime updates. 
 This patch will expose an IndexWriter.getReader method that returns
 the currently flushed state of the index as a class that implements
 IndexReader. The new IR implementation will differ from existing IR
 implementations such as MultiSegmentReader in that flushing will
 synchronize updates with IW in part by sharing the write lock. All
 methods of IR will be usable including reopen and clone. 






[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-04-24 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702471#action_12702471
 ] 

Uwe Schindler commented on LUCENE-1575:
---

Hi,
Shalin found a backwards-incompatible change in the Searcher abstract class; I 
noticed this from his SVN comment for SOLR-940 (where he updated to Lucene 
trunk):
{code}
abstract public void search(Weight weight, Filter filter, Collector results) 
throws IOException;
{code}
This should not be abstract for backwards compatibility, but should instead 
throw an UnsupportedOperationException or have a default implementation that 
somehow wraps the Collector using an old HitCollector (not very nice; I do not 
know how to fix this in any other way). Before 3.0, where this change would be 
OK, the Javadocs should note that the deprecated HitCollector API will be 
removed and the Collector part will be made abstract.
If this method stays abstract, you cannot compile old code or replace Lucene 
jars (this is rare, as almost nobody creates private implementations of 
Searcher, but Solr does...).
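The two options being weighed can be sketched generically (hypothetical Base/LegacySearcher names, not Lucene's actual Searcher API): a new abstract method breaks compilation of existing subclasses, while a concrete default throwing UnsupportedOperationException compiles but fails at runtime.

```java
// Sketch of the two back-compat options discussed above for an evolving
// abstract base class. Hypothetical names, not Lucene's actual Searcher API.
abstract class Base {
    // Option A (source-incompatible): force subclasses to implement it.
    // abstract void search(String newStyle);

    // Option B (source-compatible, fails at runtime): concrete default.
    void search(String newStyle) {
        throw new UnsupportedOperationException(
            "subclasses must override search(String)");
    }
}

class LegacySearcher extends Base {
    // Compiles against Option B without changes -- but any call to the new
    // method now throws at runtime instead of failing at compile time,
    // which is exactly the trade-off weighed in the comments above.
}

public final class CompatDemo {
    public static void main(String[] args) {
        try {
            new LegacySearcher().search("query");
        } catch (UnsupportedOperationException e) {
            System.out.println("UOE at runtime, not compile time");
        }
    }
}
```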

 Refactoring Lucene collectors (HitCollector and extensions)
 ---

 Key: LUCENE-1575
 URL: https://issues.apache.org/jira/browse/LUCENE-1575
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, 
 LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch, 
 LUCENE-1575.6.patch, LUCENE-1575.7.patch, LUCENE-1575.8.patch, 
 LUCENE-1575.9.patch, LUCENE-1575.9.patch, LUCENE-1575.patch, 
 LUCENE-1575.patch, LUCENE-1575.patch, LUCENE-1575.patch, LUCENE-1575.patch, 
 PerfTest.java, sortBench5.py, sortCollate5.py


 This issue is a result of a recent discussion we've had on the mailing list. 
 You can read the thread 
 [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
 We have agreed to do the following refactoring:
 * Rename MultiReaderHitCollector to Collector, with the purpose that it will 
 be the base class for all Collector implementations.
 * Deprecate HitCollector in favor of the new Collector.
 * Introduce new methods in IndexSearcher that accept Collector, and deprecate 
 those that accept HitCollector.
 ** Create a final class HitCollectorWrapper, and use it in the deprecated 
 methods in IndexSearcher, wrapping the given HitCollector.
 ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, 
 when we remove HitCollector.
 ** It will remove any instanceof checks that currently exist in IndexSearcher 
 code.
 * Create a new (abstract) TopDocsCollector, which will:
 ** Leave collect and setNextReader unimplemented.
 ** Introduce protected members PriorityQueue and totalHits.
 ** Introduce a single protected constructor which accepts a PriorityQueue.
 ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. 
 These can be used as-are by extending classes, as well as be overridden.
 ** Introduce a new topDocs(start, howMany) method which will be used as a 
 convenience method when implementing a search application which allows paging 
 through search results. It will also attempt to improve the memory 
 allocation, by allocating a ScoreDoc[] of the requested size only.
 * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() 
 and getTotalHits() implementations as they are from TopDocsCollector. The 
 class will also be made final.
 * Change TopFieldCollector to extend TopDocsCollector, and make the class 
 final. Also implement topDocs(start, howMany).
 * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, 
 instead of TopScoreDocCollector. Implement topDocs(start, howMany)
 * Review other places where HitCollector is used, such as in Scorer, 
 deprecate those places and use Collector instead.
 Additionally, the following proposal was made w.r.t. decoupling score from 
 collect():
 * Change collect to accept only a doc Id (unbased).
 * Introduce a setScorer(Scorer) method.
 * If during collect the implementation needs the score, it can call 
 scorer.score().
 If we do this, then we need to review all places in the code where 
 collect(doc, score) is called, and assert whether Scorer can be passed. Also 
 this raises few questions:
 * What if during collect() Scorer is null? (i.e., not set) - is it even 
 possible?
 * I noticed that many (if not all) of the collect() implementations discard 
 the document if its score is not greater than 0. Doesn't it mean that score 
 is needed in collect() always?
 Open issues:
 * The name for Collector
 * TopDocsCollector was mentioned on the thread as TopResultsCollector, but 
 that was when we thought to call Collector ResultsCollector. Since we decided 
 (so far) on Collector, I think TopDocsCollector makes sense, because of its 
 TopDocs output.
 * Decoupling score from collect().
 I will post a patch a bit later, as this is expected to be a very large 
 patch. I will split it into 2: (1) code patch (2) test cases (moving to use 
 Collector instead of HitCollector, as well as testing the new topDocs(start, 
 howMany) method.
 There might even be a 3rd patch which handles the setScorer thing in 
 Collector (maybe even a different issue?)
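The topDocs(start, howMany) convenience described in the plan amounts to returning only the requested window of the already-ranked results, allocating an array of exactly the window size. A standalone sketch (hypothetical code, not the actual TopDocsCollector; a pre-sorted score array stands in for draining the priority queue):

```java
import java.util.Arrays;

// Standalone sketch of the topDocs(start, howMany) paging idea: given the
// ranked results (here a pre-sorted score array in place of popping a
// priority queue), return only the requested window, allocating an array of
// exactly the window size. Hypothetical code, not TopDocsCollector itself.
public final class Paging {
    static float[] topDocs(float[] rankedScores, int start, int howMany) {
        if (start < 0 || start >= rankedScores.length || howMany <= 0)
            return new float[0];
        int end = Math.min(start + howMany, rankedScores.length);
        return Arrays.copyOfRange(rankedScores, start, end); // window-sized alloc
    }

    public static void main(String[] args) {
        float[] scores = {9f, 8f, 7f, 6f, 5f};
        System.out.println(Arrays.toString(topDocs(scores, 2, 2))); // [7.0, 6.0]
        System.out.println(Arrays.toString(topDocs(scores, 4, 3))); // [5.0]
    }
}
```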

[jira] Reopened: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-04-24 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler reopened LUCENE-1575:
---


 Refactoring Lucene collectors (HitCollector and extensions)
 ---

 Key: LUCENE-1575
 URL: https://issues.apache.org/jira/browse/LUCENE-1575
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, 
 LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch, 
 LUCENE-1575.6.patch, LUCENE-1575.7.patch, LUCENE-1575.8.patch, 
 LUCENE-1575.9.patch, LUCENE-1575.9.patch, LUCENE-1575.patch, 
 LUCENE-1575.patch, LUCENE-1575.patch, LUCENE-1575.patch, LUCENE-1575.patch, 
 PerfTest.java, sortBench5.py, sortCollate5.py


 This issue is a result of a recent discussion we've had on the mailing list. 
 You can read the thread 
 [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
 We have agreed to do the following refactoring:
 * Rename MultiReaderHitCollector to Collector, with the purpose that it will 
 be the base class for all Collector implementations.
 * Deprecate HitCollector in favor of the new Collector.
 * Introduce new methods in IndexSearcher that accept Collector, and deprecate 
 those that accept HitCollector.
 ** Create a final class HitCollectorWrapper, and use it in the deprecated 
 methods in IndexSearcher, wrapping the given HitCollector.
 ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, 
 when we remove HitCollector.
 ** It will remove any instanceof checks that currently exist in IndexSearcher 
 code.
 * Create a new (abstract) TopDocsCollector, which will:
 ** Leave collect and setNextReader unimplemented.
 ** Introduce protected members PriorityQueue and totalHits.
 ** Introduce a single protected constructor which accepts a PriorityQueue.
 ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. 
 These can be used as-are by extending classes, as well as be overridden.
 ** Introduce a new topDocs(start, howMany) method which will be used as a 
 convenience method when implementing a search application which allows paging 
 through search results. It will also attempt to improve the memory 
 allocation, by allocating a ScoreDoc[] of the requested size only.
 * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() 
 and getTotalHits() implementations as they are from TopDocsCollector. The 
 class will also be made final.
 * Change TopFieldCollector to extend TopDocsCollector, and make the class 
 final. Also implement topDocs(start, howMany).
 * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, 
 instead of TopScoreDocCollector. Implement topDocs(start, howMany)
 * Review other places where HitCollector is used, such as in Scorer, 
 deprecate those places and use Collector instead.
 Additionally, the following proposal was made w.r.t. decoupling score from 
 collect():
 * Change collect to accept only a doc Id (unbased).
 * Introduce a setScorer(Scorer) method.
 * If during collect the implementation needs the score, it can call 
 scorer.score().
 If we do this, then we need to review all places in the code where 
 collect(doc, score) is called, and assert whether Scorer can be passed. Also 
 this raises few questions:
 * What if during collect() Scorer is null? (i.e., not set) - is it even 
 possible?
 * I noticed that many (if not all) of the collect() implementations discard 
 the document if its score is not greater than 0. Doesn't it mean that score 
 is needed in collect() always?
 Open issues:
 * The name for Collector
 * TopDocsCollector was mentioned on the thread as TopResultsCollector, but 
 that was when we thought to call Collector ResultsCollector. Since we decided 
 (so far) on Collector, I think TopDocsCollector makes sense, because of its 
 TopDocs output.
 * Decoupling score from collect().
 I will post a patch a bit later, as this is expected to be a very large 
 patch. I will split it into 2: (1) code patch (2) test cases (moving to use 
 Collector instead of HitCollector, as well as testing the new topDocs(start, 
 howMany) method.
 There might be even a 3rd patch which handles the setScorer thing in 
 Collector (maybe even a different issue?)






[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-04-24 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702488#action_12702488
 ] 

Shai Erera commented on LUCENE-1575:


If you read somewhere above, you'll see that we've discussed this change. It 
seems that whatever we do, anyone who upgrades to 2.9 will need to touch his 
code. If you extend Searcher, you'll need to override that new method, 
regardless of what we choose to do. That's because if it's abstract, you need 
to implement it, and if it's concrete (throwing UOE), you'll need to override 
it since all the Searcher methods call this one at the end.

I'm not sure wrapping a Collector with HitCollector will work, because all of 
the other classes now expect Collector, and their HitCollector variants call 
the Collector one with a HitCollectorWrapper (which is also deprecated).

We agreed that making it abstract is the lesser of all evils, since you'll spot 
it at compile time, rather than at runtime, when you'll hit a UOE.

 Refactoring Lucene collectors (HitCollector and extensions)
 ---

 Key: LUCENE-1575
 URL: https://issues.apache.org/jira/browse/LUCENE-1575
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, 
 LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch, 
 LUCENE-1575.6.patch, LUCENE-1575.7.patch, LUCENE-1575.8.patch, 
 LUCENE-1575.9.patch, LUCENE-1575.9.patch, LUCENE-1575.patch, 
 LUCENE-1575.patch, LUCENE-1575.patch, LUCENE-1575.patch, LUCENE-1575.patch, 
 PerfTest.java, sortBench5.py, sortCollate5.py



Another possible optimization - now in DocIdSetIterator

2009-04-24 Thread Shai Erera
Hi

I think we can make some optimization to DocIdSetIterator. Today, it defines
next() and skipTo(int) which return a boolean. I've checked the code and it
looks like almost always when these two are called, they are followed by a
call to doc().

I was thinking that if those two returned the doc Id they are at, instead of
boolean, that will save the call to doc(). Those that use these can:
* Compare doc to a NO_MORE_DOCS constant (set to -1), to understand there
are no more docs in this iterator.
* If skipTo() is called, compare the 'target' to the returned Id, and if
they are not the same, save it so that the next skipTo is requested, they
don't perform it if the returned Id is greater than the target. If it's not
possible to save it, they can call doc() to get that information.

The way I see it, the impls that will still need to call doc() will lose
nothing. All we'll do is change the 'if' from comparing a boolean to
comparing ints (even though that's a bit less efficient than comparing
booleans). The impls that call doc() just because all they have in hand is a
boolean, will gain.

Obviously we can't change those methods' signature, so we can deprecate them
and introduce nextDoc() and skipToDoc(int target). We should still keep
doc() around though.

What do you think? If you agree to this change, I can add them to 1593, or
create a new issue (I prefer the latter so that 1593 will be focused on the
changes to Collectors).

Shai
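As an illustration of the proposed contract — a sketch of the idea only, not Lucene's eventual API; the names nextDoc()/skipToDoc()/NO_MORE_DOCS and the -1 sentinel are taken from the proposal above:

```java
// Sketch of the proposed iterator contract: nextDoc() returns the doc ID
// itself (or NO_MORE_DOCS when exhausted), so callers no longer need a
// separate doc() call after every next()/skipTo(). Names follow the proposal
// in the mail above, not any released Lucene API.
public final class IntArrayDocIdIterator {
    public static final int NO_MORE_DOCS = -1; // sentinel from the proposal

    private final int[] docs;
    private int pos = -1;

    IntArrayDocIdIterator(int... sortedDocs) { docs = sortedDocs; }

    /** Advance and return the next doc ID, or NO_MORE_DOCS. */
    int nextDoc() {
        return ++pos < docs.length ? docs[pos] : NO_MORE_DOCS;
    }

    /** Advance to the first doc >= target and return it, or NO_MORE_DOCS. */
    int skipToDoc(int target) {
        int doc;
        while ((doc = nextDoc()) != NO_MORE_DOCS && doc < target) { /* skip */ }
        return doc;
    }

    public static void main(String[] args) {
        IntArrayDocIdIterator it = new IntArrayDocIdIterator(2, 5, 9, 12);
        int doc;
        StringBuilder sb = new StringBuilder();
        while ((doc = it.nextDoc()) != NO_MORE_DOCS) sb.append(doc).append(' ');
        System.out.println(sb.toString().trim()); // 2 5 9 12

        IntArrayDocIdIterator it2 = new IntArrayDocIdIterator(2, 5, 9, 12);
        System.out.println(it2.skipToDoc(6)); // 9
    }
}
```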


[jira] Commented: (LUCENE-1313) Realtime Search

2009-04-24 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702496#action_12702496
 ] 

Jason Rutherglen commented on LUCENE-1313:
--

For this patch I'm debating whether to add a package protected
IndexWriter.addIndexWriter method. The problem is, the RAMIndex
blocks on the write to disk during IW.addIndexesNoOptimize, which,
if we're using ConcurrentMergeScheduler, shouldn't happen?
Meaning in this proposed solution, if segments keep on piling up
in RAMIndex, we simply move them over to the disk IW which will
in the background take care of merging them away and to disk.

I don't think it's necessary to immediately write ram segments
to disk (like the current patch does), instead it's possible to
simply copy segments over from the incoming IW, leave them in
RAM and they can be merged to disk as necessary? Then on
IW.flush any segmentinfo(s) that are not from the current
directory can be flushed to disk? 

Just thinking out loud about this.

 Realtime Search
 ---

 Key: LUCENE-1313
 URL: https://issues.apache.org/jira/browse/LUCENE-1313
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1313.jar, LUCENE-1313.patch, LUCENE-1313.patch, 
 LUCENE-1313.patch, LUCENE-1313.patch, lucene-1313.patch, lucene-1313.patch, 
 lucene-1313.patch, lucene-1313.patch


 Realtime search with transactional semantics.  
 Possible future directions:
   * Optimistic concurrency
   * Replication
 Encoding each transaction into a set of bytes by writing to a RAMDirectory 
 enables replication.  It is difficult to replicate using other methods 
 because while the document may easily be serialized, the analyzer cannot.
 I think this issue can hold realtime benchmarks which include indexing and 
 searching concurrently.






Re: Another possible optimization - now in DocIdSetIterator

2009-04-24 Thread Mark Miller

Shai Erera wrote:

Hi

I think we can make some optimization to DocIdSetIterator. Today, it 
defines next() and skipTo(int) which return a boolean. I've checked 
the code and it looks like almost always when these two are called, 
they are followed by a call to doc().


I was thinking that if those two returned the doc Id they are at, 
instead of boolean, that will save the call to doc(). Those that use 
these can:
* Compare doc to a NO_MORE_DOCS constant (set to -1), to understand 
there are no more docs in this iterator.
* If skipTo() is called, compare the 'target' to the returned Id, and 
if they are not the same, save it so that the next skipTo is 
requested, they don't perform it if the returned Id is greater than 
the target. If it's not possible to save it, they can call doc() to 
get that information.


The way I see it, the impls that will still need to call doc() will 
lose nothing. All we'll do is change the 'if' from comparing a boolean 
to comparing ints (even though that's a bit less efficient than 
comparing booleans). The impls that call doc() just because all they 
have in hand is a boolean, will gain.


Obviously we can't change those methods' signature, so we can 
deprecate them and introduce nextDoc() and skipToDoc(int target). We 
should still keep doc() around though.


What do you think? If you agree to this change, I can add them to 
1593, or create a new issue (I prefer the latter so that 1593 will be 
focused on the changes to Collectors).


Shai

any micro benchmarks or anything? If it's a net real-world win, +1.

--
- Mark

http://www.lucidimagination.com







Re: Another possible optimization - now in DocIdSetIterator

2009-04-24 Thread Mark Miller

Mark Miller wrote:

Shai Erera wrote:

Hi

I think we can make some optimization to DocIdSetIterator. Today, it 
defines next() and skipTo(int) which return a boolean. I've checked 
the code and it looks like almost always when these two are called, 
they are followed by a call to doc().


I was thinking that if those two returned the doc Id they are at, 
instead of boolean, that will save the call to doc(). Those that use 
these can:
* Compare doc to a NO_MORE_DOCS constant (set to -1), to understand 
there are no more docs in this iterator.
* If skipTo() is called, compare the 'target' to the returned Id, and 
if they are not the same, save it so that when the next skipTo is 
requested, it isn't performed if the returned Id is already greater than 
the target. If it's not possible to save it, they can call doc() to 
get that information.


The way I see it, the impls that will still need to call doc() will 
lose nothing. All we'll do is change the 'if' from comparing a 
boolean to comparing ints (even though that's a bit less efficient 
than comparing booleans). The impls that call doc() just because all 
they have in hand is a boolean, will gain.


Obviously we can't change those methods' signature, so we can 
deprecate them and introduce nextDoc() and skipToDoc(int target). We 
should still keep doc() around though.


What do you think? If you agree to this change, I can add them to 
1593, or create a new issue (I prefer the latter so that 1593 will be 
focused on the changes to Collectors).


Shai

any micro benchmarks or anything? If its a net real world win, +1.


P.S. I'd make a new issue myself.

--
- Mark

http://www.lucidimagination.com







[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector

2009-04-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702503#action_12702503
 ] 

Michael McCandless commented on LUCENE-1593:


bq. If it equals, it's also rejected, since now that we move to returning docs 
in order, it is assumed that this doc's doc Id is greater than whatever is in 
the queue, and so it's rejected.

Argh, you are right... so the approach will fail if any of the first topN (e.g. 
10) hits have a field value equal to the sentinel value.  I guess we could do 
two separate passes: a startup pass (while queue is filling up) and a the 
rest pass that knows the queue is full.  But that's getting rather ugly; 
probably we should leave this optimization to source code specialization.

bq.  I can do the same for the FieldDoc.fields() value, in case one of the 
fields is FIELD_DOC.

Excellent!  Let's just do this as part of this issue.

bq. I must say I still didn't fully understand what do you mean here.

I also did not understand how BooleanScorer works until Doug explained it (see 
the comment at the top).

Right now it gathers hits of each clause in a reversed linked list, which it 
then makes a 2nd pass to collect.  So the Collector will see docIDs in reverse 
order for that clause.  I thought we could simply fix the linking to be forward 
and we'd have docIDs in order.  But that isn't quite right because any new 
docIDs hit by the 2nd clause will be inserted at the end of the linked list, 
out of order.

 Optimizations to TopScoreDocCollector and TopFieldCollector
 ---

 Key: LUCENE-1593
 URL: https://issues.apache.org/jira/browse/LUCENE-1593
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
 Fix For: 2.9


 This is a spin-off of LUCENE-1575 and proposes to optimize TSDC and TFC code 
 to remove unnecessary checks. The plan is:
 # Ensure that IndexSearcher returns segments in increasing doc Id order, 
 instead of numDocs().
 # Change TSDC and TFC's code to not use the doc id as a tie breaker. New docs 
 will always have larger ids and therefore cannot compete.
 # Pre-populate HitQueue with sentinel values in TSDC (score = Float.NEG_INF) 
 and remove the check if reusableSD == null.
 # Also move to use changing top and then call adjustTop(), in case we 
 update the queue.
 # some methods in Sort explicitly add SortField.FIELD_DOC as a tie breaker 
 for the last SortField. But, doing so should not be necessary (since we 
 already break ties by docID), and is in fact less efficient (once the above 
 optimization is in).
 # Investigate PQ - can we deprecate insert() and have only 
 insertWithOverflow()? Add a addDummyObjects method which will populate the 
 queue without arranging it, just store the objects in the array (this can 
 be used to pre-populate sentinel values)?
 I will post a patch as well as some perf measurements as soon as I have them.
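Point 3 of the plan (pre-populating HitQueue with sentinel values) can be illustrated with a toy collector built on java.util.PriorityQueue; ScoreDoc and topScores here are simplified stand-ins for Lucene's HitQueue/TopScoreDocCollector machinery, and the sketch deliberately ignores the tie-breaking caveat raised in the comments:

```java
import java.util.PriorityQueue;

// Toy version of the sentinel idea; not Lucene's actual classes.
public class SentinelQueueSketch {

    static class ScoreDoc {
        int doc;
        float score;
        ScoreDoc(int doc, float score) { this.doc = doc; this.score = score; }
    }

    /** Returns the top-n scores in descending order.  The queue is pre-filled
     *  with sentinel entries (score = -Infinity), so the per-hit path is a
     *  single comparison against the queue's top -- no null/size checks. */
    public static float[] topScores(float[] scores, int n) {
        PriorityQueue<ScoreDoc> pq =
            new PriorityQueue<>(n, (a, b) -> Float.compare(a.score, b.score));
        for (int i = 0; i < n; i++) {
            pq.add(new ScoreDoc(-1, Float.NEGATIVE_INFINITY)); // sentinels
        }
        for (int doc = 0; doc < scores.length; doc++) {
            if (scores[doc] > pq.peek().score) {
                ScoreDoc reusable = pq.poll(); // recycle the ejected entry
                reusable.doc = doc;
                reusable.score = scores[doc];
                pq.add(reusable);
            }
        }
        float[] top = new float[n];
        for (int i = n - 1; i >= 0; i--) {
            top[i] = pq.poll().score; // min comes out of the heap first
        }
        return top;
    }
}
```

Because every slot holds a real ScoreDoc from the start, the "reusableSD == null" branch disappears from the hot loop, at the cost of returning sentinel entries when there are fewer than n hits.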

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





Re: Another possible optimization - now in DocIdSetIterator

2009-04-24 Thread Michael McCandless
I think this is a good idea!  I think a new issue is best.

Mike

On Fri, Apr 24, 2009 at 3:26 PM, Shai Erera ser...@gmail.com wrote:
 Hi

 I think we can make some optimization to DocIdSetIterator. Today, it defines
 next() and skipTo(int) which return a boolean. I've checked the code and it
 looks like almost always when these two are called, they are followed by a
 call to doc().

 I was thinking that if those two returned the doc Id they are at, instead of
 boolean, that will save the call to doc(). Those that use these can:
 * Compare doc to a NO_MORE_DOCS constant (set to -1), to understand there
 are no more docs in this iterator.
 * If skipTo() is called, compare the 'target' to the returned Id, and if
 they are not the same, save it so that when the next skipTo is requested, it
 isn't performed if the returned Id is already greater than the target. If it's not
 possible to save it, they can call doc() to get that information.

 The way I see it, the impls that will still need to call doc() will lose
 nothing. All we'll do is change the 'if' from comparing a boolean to
 comparing ints (even though that's a bit less efficient than comparing
 booleans). The impls that call doc() just because all they have in hand is a
 boolean, will gain.

 Obviously we can't change those methods' signature, so we can deprecate them
 and introduce nextDoc() and skipToDoc(int target). We should still keep
 doc() around though.

 What do you think? If you agree to this change, I can add them to 1593, or
 create a new issue (I prefer the latter so that 1593 will be focused on the
 changes to Collectors).

 Shai





Re: Lucene 2.9 status (to port to Lucene.Net)

2009-04-24 Thread Michael McCandless
George, did you mean LUCENE-1516 below?  (LUCENE-1313 is a further
improvement to near real-time search that's still being iterated on).

In general I would say 2.9 seems to be in rather active development still ;)

I too would love to hear about production/beta use of 2.9.  George
maybe you should re-ask on java-user?

Mike

On Sat, Apr 18, 2009 at 7:12 PM, George Aroush geo...@aroush.net wrote:
 Thanks all for your input on this subject.

 So, if I decide to grab the current code off the trunk, is it:

 1) Usable for production use?
 2) Is LUCENE-1313 (Realtime search), in the current trunk, stable and ready
 for use?

 Put another way, is anyone using the current trunk code in production, or
 even as beta?

 -- George

 
 From: Jason Rutherglen [mailto:jason.rutherg...@gmail.com]
 Sent: Thursday, April 16, 2009 5:13 PM
 To: java-dev@lucene.apache.org
 Subject: Re: Lucene 2.9 status (to port to Lucene.Net)

 LUCENE-1313 relies on LUCENE-1516 which is in trunk.  If you have other
 questions George, feel free to ask.

 On Thu, Apr 16, 2009 at 8:04 AM, George Aroush geo...@aroush.net wrote:

 Thanks Mike.

 A quick follow up question.  What's the status of
 http://issues.apache.org/jira/browse/LUCENE-1313?  Can this work be
 applied
 to Lucene 2.4.1 and still get it's benefit or are there other dependency /
 issues with it that prevents us from doing so?

 If anyone else knows, I welcome your input.

 -- George

  -Original Message-
  From: Michael McCandless [mailto:luc...@mikemccandless.com]
  Sent: Thursday, April 16, 2009 8:36 AM
  To: java-dev@lucene.apache.org
  Subject: Re: Lucene 2.9 status (to port to Lucene.Net)
 
  Hi George,
 
  There's been a sudden burst of activity lately on 2.9 development...
 
  I know there are some biggish remaining features we may want
  to get into 2.9:
 
    * The new field cache (LUCENE-831; still being iterated/mulled),
 
    * Possible major rework of Field / Document & index-time vs
      search-time Document
 
    * Applying filters via random-access API when possible & performant
      (LUCENE-1536)
 
    * Possible further optimizations to how collection works
     (LUCENE-1593)
 
    * Maybe breaking core + contrib into a more uniform set of modules
      (and figuring out how Trie(Numeric)RangeQuery/Filter fits in here)
      -- the Modularization uber-thread.
 
    * Further improvements to near-realtime search (using RAMDir for
      small recently flushed segments)
 
    * Many other small things and probably some big ones that I'm
      forgetting now :)
 
  So things are still in flux, and I'm really not sure on a
  release date at this point.  Late last year, I was hoping for
  early this year, but it's no longer early this year ;)
 
  Mike
 
  On Wed, Apr 15, 2009 at 9:17 PM, George Aroush
  geo...@aroush.net wrote:
   Hi Folks,
  
   This is George Aroush, I'm one of the committers on Lucene.Net - a
   port of Java Lucene to C# Lucene.
  
   I'm looking at the current trunk code of yet to be released
  Lucene 2.9
   and I would like to port it to Lucene.Net.  If I do this
  now, we get
   the benefit of keeping our code base and release dates much
  closer to Java Lucene.
   However, this comes with a cost of carrying over unfinished work,
   known defects, and I have to keep an eye on new code that get
   committed into Java Lucene which must be ported over in a
  timely fashion.
  
   To help me determine when is a good time to start the port
  -- keep in
   mind, I will be taking the latest code off SVN -- I like to
  hear from
   the Java Lucene committers (and users who are playing or
  using Lucene
   2.9 off SVN) about those questions:
  
   1) how stable the current code in the trunk is,
   2) do you still have feature work to deliver or just bug fixes, and
   3) what's your target date to release Java Lucene 2.9
  
   #1 is important, such that is anyone using it in production?
  
   Yes, I did look at the current open issues in JIRA, but
  that doesn't
   help me answer the above questions.
  
   Regards,
  
   -- George
  
  
  
  -
   To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
   For additional commands, e-mail: java-dev-h...@lucene.apache.org
  
  
 
  -
  To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-dev-h...@lucene.apache.org
 


 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org







[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-04-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702509#action_12702509
 ] 

Michael McCandless commented on LUCENE-1575:


bq. Shalin found a backwards-incompatible change in the Searcher abstract class

We could go either way on this... the evils were strong with either choice, and 
we struggled and eventually went with adding an abstract method today, for the 
reasons Shai enumerated.


 Refactoring Lucene collectors (HitCollector and extensions)
 ---

 Key: LUCENE-1575
 URL: https://issues.apache.org/jira/browse/LUCENE-1575
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, 
 LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch, 
 LUCENE-1575.6.patch, LUCENE-1575.7.patch, LUCENE-1575.8.patch, 
 LUCENE-1575.9.patch, LUCENE-1575.9.patch, LUCENE-1575.patch, 
 LUCENE-1575.patch, LUCENE-1575.patch, LUCENE-1575.patch, LUCENE-1575.patch, 
 PerfTest.java, sortBench5.py, sortCollate5.py


 This issue is a result of a recent discussion we've had on the mailing list. 
 You can read the thread 
 [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
 We have agreed to do the following refactoring:
 * Rename MultiReaderHitCollector to Collector, with the purpose that it will 
 be the base class for all Collector implementations.
 * Deprecate HitCollector in favor of the new Collector.
 * Introduce new methods in IndexSearcher that accept Collector, and deprecate 
 those that accept HitCollector.
 ** Create a final class HitCollectorWrapper, and use it in the deprecated 
 methods in IndexSearcher, wrapping the given HitCollector.
 ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, 
 when we remove HitCollector.
 ** It will remove any instanceof checks that currently exist in IndexSearcher 
 code.
 * Create a new (abstract) TopDocsCollector, which will:
 ** Leave collect and setNextReader unimplemented.
 ** Introduce protected members PriorityQueue and totalHits.
 ** Introduce a single protected constructor which accepts a PriorityQueue.
 ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. 
 These can be used as-are by extending classes, as well as be overridden.
 ** Introduce a new topDocs(start, howMany) method which will be used as a 
 convenience method when implementing a search application which allows paging 
 through search results. It will also attempt to improve the memory 
 allocation, by allocating a ScoreDoc[] of the requested size only.
 * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() 
 and getTotalHits() implementations as they are from TopDocsCollector. The 
 class will also be made final.
 * Change TopFieldCollector to extend TopDocsCollector, and make the class 
 final. Also implement topDocs(start, howMany).
 * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, 
 instead of TopScoreDocCollector. Implement topDocs(start, howMany)
 * Review other places where HitCollector is used, such as in Scorer, 
 deprecate those places and use Collector instead.
 Additionally, the following proposal was made w.r.t. decoupling score from 
 collect():
 * Change collect to accept only a doc Id (unbased).
 * Introduce a setScorer(Scorer) method.
 * If during collect the implementation needs the score, it can call 
 scorer.score().
 If we do this, then we need to review all places in the code where 
 collect(doc, score) is called, and assert whether Scorer can be passed. Also 
 this raises few questions:
 * What if during collect() Scorer is null? (i.e., not set) - is it even 
 possible?
 * I noticed that many (if not all) of the collect() implementations discard 
 the document if its score is not greater than 0. Doesn't it mean that score 
 is needed in collect() always?
 Open issues:
 * The name for Collector
 * TopDocsCollector was mentioned on the thread as TopResultsCollector, but 
 that was when we thought to call Collector ResultsCollector. Since we decided 
 (so far) on Collector, I think TopDocsCollector makes sense, because of its 
 TopDocs output.
 * Decoupling score from collect().
 I will post a patch a bit later, as this is expected to be a very large 
 patch. I will split it into 2: (1) code patch (2) test cases (moving to use 
 Collector instead of HitCollector, as well as testing the new topDocs(start, 
 howMany) method.
 There might be even a 3rd patch which handles the setScorer thing in 
 Collector (maybe even a different issue?)
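The proposed Collector shape can be sketched as a self-contained toy; Scorer here is a stand-in interface, and setNextReader takes only a docBase instead of an IndexReader, purely to keep the example compilable on its own:

```java
// Illustrative shape of the proposed API, not the real Lucene classes.
public class CollectorSketch {

    public interface Scorer {
        float score();
    }

    /** Proposed base class: segment-relative doc ids plus a pull-based score. */
    public abstract static class Collector {
        public abstract void setScorer(Scorer scorer);   // set before collecting
        public abstract void setNextReader(int docBase); // real API passes a reader
        public abstract void collect(int doc);           // doc is segment-relative
    }

    /** A collector that only counts hits never has to pull the score at all --
     *  this is the payoff of decoupling score from collect(). */
    public static class CountingCollector extends Collector {
        public int totalHits;
        public void setScorer(Scorer scorer) { /* score not needed */ }
        public void setNextReader(int docBase) { /* nothing to rebase */ }
        public void collect(int doc) { totalHits++; }
    }
}
```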


[jira] Commented: (LUCENE-1313) Realtime Search

2009-04-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702515#action_12702515
 ] 

Michael McCandless commented on LUCENE-1313:



bq. I don't think it's necessary to immediately write ram segments to disk

I agree: it should be fine from IndexWriter's standpoint if some
segments live in a private RAMDir and others live in the real dir.

In fact, early versions of LUCENE-843 did exactly this: IW's RAM
buffer is not as efficient as a written segment, and so you can gain
some RAM efficiency by flushing first to RAM and then merging to disk.

I think we could adopt a simple criteria: you flush the new segment to
the RAM Dir if net RAM used is < maxRamBufferSizeMB.  This way no
further configuration is needed.  On auto-flush triggering you then
must take into account the RAM usage by this RAM Dir.  On commit,
these RAM segments must be migrated to the real dir (preferably by
forcing a merge, somehow).

A near realtime reader would also happily mix real Dir and RAMDir
SegmentReaders.

This should work well I think, and should not require a separate
RAMIndex class, and won't block things when the RAM segments are
migrated to disk by CMS.


 Realtime Search
 ---

 Key: LUCENE-1313
 URL: https://issues.apache.org/jira/browse/LUCENE-1313
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1313.jar, LUCENE-1313.patch, LUCENE-1313.patch, 
 LUCENE-1313.patch, LUCENE-1313.patch, lucene-1313.patch, lucene-1313.patch, 
 lucene-1313.patch, lucene-1313.patch


 Realtime search with transactional semantics.  
 Possible future directions:
   * Optimistic concurrency
   * Replication
 Encoding each transaction into a set of bytes by writing to a RAMDirectory 
 enables replication.  It is difficult to replicate using other methods 
 because while the document may easily be serialized, the analyzer cannot.
 I think this issue can hold realtime benchmarks which include indexing and 
 searching concurrently.






[jira] Resolved: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-04-24 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler resolved LUCENE-1575.
---

Resolution: Fixed

OK, I resolve the issue again. I was just wondering and wanted to be sure. I 
also missed the first entry in CHANGES.txt, which explains it.
It is the same as with the Fieldable interface in the past: it is seldom 
implemented/overridden, so the normal user will not be affected. And 
those who implement Fieldable or extend Searcher must implement it.

 Refactoring Lucene collectors (HitCollector and extensions)
 ---

 Key: LUCENE-1575
 URL: https://issues.apache.org/jira/browse/LUCENE-1575
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, 
 LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch, 
 LUCENE-1575.6.patch, LUCENE-1575.7.patch, LUCENE-1575.8.patch, 
 LUCENE-1575.9.patch, LUCENE-1575.9.patch, LUCENE-1575.patch, 
 LUCENE-1575.patch, LUCENE-1575.patch, LUCENE-1575.patch, LUCENE-1575.patch, 
 PerfTest.java, sortBench5.py, sortCollate5.py


 This issue is a result of a recent discussion we've had on the mailing list. 
 You can read the thread 
 [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
 We have agreed to do the following refactoring:
 * Rename MultiReaderHitCollector to Collector, with the purpose that it will 
 be the base class for all Collector implementations.
 * Deprecate HitCollector in favor of the new Collector.
 * Introduce new methods in IndexSearcher that accept Collector, and deprecate 
 those that accept HitCollector.
 ** Create a final class HitCollectorWrapper, and use it in the deprecated 
 methods in IndexSearcher, wrapping the given HitCollector.
 ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, 
 when we remove HitCollector.
 ** It will remove any instanceof checks that currently exist in IndexSearcher 
 code.
 * Create a new (abstract) TopDocsCollector, which will:
 ** Leave collect and setNextReader unimplemented.
 ** Introduce protected members PriorityQueue and totalHits.
 ** Introduce a single protected constructor which accepts a PriorityQueue.
 ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. 
 These can be used as-are by extending classes, as well as be overridden.
 ** Introduce a new topDocs(start, howMany) method which will be used as a 
 convenience method when implementing a search application which allows paging 
 through search results. It will also attempt to improve the memory 
 allocation, by allocating a ScoreDoc[] of the requested size only.
 * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() 
 and getTotalHits() implementations as they are from TopDocsCollector. The 
 class will also be made final.
 * Change TopFieldCollector to extend TopDocsCollector, and make the class 
 final. Also implement topDocs(start, howMany).
 * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, 
 instead of TopScoreDocCollector. Implement topDocs(start, howMany)
 * Review other places where HitCollector is used, such as in Scorer, 
 deprecate those places and use Collector instead.
 Additionally, the following proposal was made w.r.t. decoupling score from 
 collect():
 * Change collect to accept only a doc Id (unbased).
 * Introduce a setScorer(Scorer) method.
 * If during collect the implementation needs the score, it can call 
 scorer.score().
 If we do this, then we need to review all places in the code where 
 collect(doc, score) is called, and assert whether Scorer can be passed. Also 
 this raises few questions:
 * What if during collect() Scorer is null? (i.e., not set) - is it even 
 possible?
 * I noticed that many (if not all) of the collect() implementations discard 
 the document if its score is not greater than 0. Doesn't it mean that score 
 is needed in collect() always?
 Open issues:
 * The name for Collector
 * TopDocsCollector was mentioned on the thread as TopResultsCollector, but 
 that was when we thought to call Collector ResultsCollector. Since we decided 
 (so far) on Collector, I think TopDocsCollector makes sense, because of its 
 TopDocs output.
 * Decoupling score from collect().
 I will post a patch a bit later, as this is expected to be a very large 
 patch. I will split it into 2: (1) code patch (2) test cases (moving to use 
 Collector instead of HitCollector, as well as testing the new topDocs(start, 
 howMany) method.
 There might be even a 3rd patch which handles the setScorer thing in 
 Collector (maybe even a different issue?)


RE: Another possible optimization - now in DocIdSetIterator

2009-04-24 Thread Uwe Schindler
Maybe combine this with the isRandomAccess change in DocIdSet?

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

 -Original Message-
 From: Michael McCandless [mailto:luc...@mikemccandless.com]
 Sent: Friday, April 24, 2009 9:46 PM
 To: java-dev@lucene.apache.org
 Subject: Re: Another possible optimization - now in DocIdSetIterator
 
 I think this is a good idea!  I think a new issue is best.
 
 Mike
 
 On Fri, Apr 24, 2009 at 3:26 PM, Shai Erera ser...@gmail.com wrote:
  Hi
 
  I think we can make some optimization to DocIdSetIterator. Today, it
 defines
  next() and skipTo(int) which return a boolean. I've checked the code and
 it
  looks like almost always when these two are called, they are followed by
 a
  call to doc().
 
  I was thinking that if those two returned the doc Id they are at,
 instead of
  boolean, that will save the call to doc(). Those that use these can:
  * Compare doc to a NO_MORE_DOCS constant (set to -1), to understand
 there
  are no more docs in this iterator.
  * If skipTo() is called, compare the 'target' to the returned Id, and if
  they are not the same, save it so that the next skipTo is requested,
 they
  don't perform it if the returned Id is greater than the target. If it's
 not
  possible to save it, they can call doc() to get that information.
 
  The way I see it, the impls that will still need to call doc() will lose
  nothing. All we'll do is change the 'if' from comparing a boolean to
  comparing ints (even though that's a bit less efficient than comparing
  booleans). The impls that call doc() just because all they have in hand
 is a
  boolean, will gain.
 
  Obviously we can't change those methods' signature, so we can deprecate
 them
  and introduce nextDoc() and skipToDoc(int target). We should still keep
  doc() around though.
 
  What do you think? If you agree to this change, I can add them to 1593,
 or
  create a new issue (I prefer the latter so that 1593 will be focused on
 the
  changes to Collectors).
 
  Shai
 
 
 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org






[jira] Updated: (LUCENE-1604) Stop creating huge arrays to represent the absence of field norms

2009-04-24 Thread Shon Vella (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shon Vella updated LUCENE-1604:
---

Attachment: LUCENE-1604.patch

Updated patch that preserves the disableNorms flag across clone() and reopen() and 
applies the flag transitively to MultiSegmentReader.

 Stop creating huge arrays to represent the absence of field norms
 -

 Key: LUCENE-1604
 URL: https://issues.apache.org/jira/browse/LUCENE-1604
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.9
Reporter: Shon Vella
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1604.patch, LUCENE-1604.patch, LUCENE-1604.patch


 Creating and keeping around huge arrays that hold a constant value is very 
 inefficient both from a heap usage standpoint and from a locality of 
 reference standpoint. It would be much more efficient to use null to 
 represent a missing norms table.
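A minimal sketch of the idea: a null norms array stands for "no norms for this field" and decodes to a default of 1.0f, instead of allocating a maxDoc-sized array of the same constant (decodeNorm below is a placeholder, not Similarity's real byte-to-float table):

```java
// Illustrative only -- the decoding function is not Lucene's real norm table.
public class NullNormsSketch {

    public static float decodeNorm(byte b) {
        // placeholder decoding: scale the unsigned byte into [0, 1]
        return (b & 0xFF) / 255f;
    }

    /** null norms array => constant default, no giant allocation needed. */
    public static float norm(byte[] norms, int doc) {
        return norms == null ? 1.0f : decodeNorm(norms[doc]);
    }
}
```

The null check is a branch per lookup, but it replaces both the heap cost and the cache pressure of a huge constant array.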






[jira] Commented: (LUCENE-1252) Avoid using positions when not all required terms are present

2009-04-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702520#action_12702520
 ] 

Michael McCandless commented on LUCENE-1252:


Right, I think this is more about determining whether a doc is a hit or not, 
than about how to compute its score.

I think somehow the scorer needs to return 2 scorers that share the underlying 
iterators.  The first scorer simply checks AND-ness with all other required 
terms, and only if the doc passes those are the 
positional/payloads/anything-else-expensive consulted.

 Avoid using positions when not all required terms are present
 -

 Key: LUCENE-1252
 URL: https://issues.apache.org/jira/browse/LUCENE-1252
 Project: Lucene - Java
  Issue Type: Wish
  Components: Search
Reporter: Paul Elschot
Priority: Minor

 In the Scorers of queries with (lots of) Phrases and/or (nested) Spans, 
 currently next() and skipTo() will use position information even when other 
 parts of the query cannot match because some required terms are not present.
 This could be avoided by adding some methods to Scorer that relax the 
 postcondition of next() and skipTo() to something like "all required terms 
 are present, but no position info was checked yet", and implementing these 
 methods for Scorers that do conjunctions: BooleanScorer, PhraseScorer, and 
 SpanScorer/NearSpans.
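The two-phase idea can be modeled as a toy conjunction-then-confirm loop (countMatches and its inputs are illustrative; real scorers would share the underlying iterators and skip, rather than scan every doc id):

```java
import java.util.List;
import java.util.Set;
import java.util.function.IntPredicate;

// Toy model: a cheap approximation phase that checks all required terms
// co-occur in a doc, and an expensive confirmation phase (standing in for
// position/payload checks) that runs only on docs that pass the first phase.
public class TwoPhaseSketch {
    public static int countMatches(List<Set<Integer>> termDocs,
                                   IntPredicate positionsMatch,
                                   int maxDoc) {
        int count = 0;
        for (int doc = 0; doc < maxDoc; doc++) {
            boolean allTermsPresent = true;
            for (Set<Integer> docs : termDocs) {  // cheap AND over required terms
                if (!docs.contains(doc)) {
                    allTermsPresent = false;
                    break;
                }
            }
            // expensive position check only on docs that survive the AND
            if (allTermsPresent && positionsMatch.test(doc)) {
                count++;
            }
        }
        return count;
    }
}
```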






Re: Another possible optimization - now in DocIdSetIterator

2009-04-24 Thread Michael McCandless
On the it touches the same code criteria, I would agree.

On the it's the same core problem criteria, I would disagree.

Also I would think this change is simpler than the isRandomAccess
addition and so probably would land before isRandomAccess... so I
think I'd lean towards keeping them as separate issues.

Mike

On Fri, Apr 24, 2009 at 4:06 PM, Uwe Schindler u...@thetaphi.de wrote:
 Maybe combine this with the isRandomAccess change in DocIdSet?

 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de

 -Original Message-
 From: Michael McCandless [mailto:luc...@mikemccandless.com]
 Sent: Friday, April 24, 2009 9:46 PM
 To: java-dev@lucene.apache.org
 Subject: Re: Another possible optimization - now in DocIdSetIterator

 I think this is a good idea!  I think a new issue is best.

 Mike

 On Fri, Apr 24, 2009 at 3:26 PM, Shai Erera ser...@gmail.com wrote:
  Hi
 
  I think we can make some optimization to DocIdSetIterator. Today, it
 defines
  next() and skipTo(int) which return a boolean. I've checked the code and
 it
  looks like almost always when these two are called, they are followed by
 a
  call to doc().
 
  I was thinking that if those two returned the doc Id they are at,
 instead of
  boolean, that will save the call to doc(). Those that use these can:
  * Compare doc to a NO_MORE_DOCS constant (set to -1), to understand
 there
  are no more docs in this iterator.
  * If skipTo() is called, compare the 'target' to the returned Id, and if
  they are not the same, save it so that the next skipTo is requested,
 they
  don't perform it if the returned Id is greater than the target. If it's
 not
  possible to save it, they can call doc() to get that information.
 
  The way I see it, the impls that will still need to call doc() will lose
  nothing. All we'll do is change the 'if' from comparing a boolean to
  comparing ints (even though that's a bit less efficient than comparing
  booleans). The impls that call doc() just because all they have in hand
 is a
  boolean, will gain.
 
  Obviously we can't change those methods' signature, so we can deprecate
 them
   and introduce nextDoc() and skipToDoc(int target). We should still keep
  doc() around though.
 
  What do you think? If you agree to this change, I can add them to 1593,
 or
  create a new issue (I prefer the latter so that 1593 will be focused on
 the
  changes to Collectors).
 
  Shai
 

 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org









RE: Lucene 2.9 status (to port to Lucene.Net)

2009-04-24 Thread Uwe Schindler
 George, did you mean LUCENE-1516 below?  (LUCENE-1313 is a further
 improvement to near real-time search that's still being iterated on).
 
 In general I would say 2.9 seems to be in rather active development still
 ;)
 
 I too would love to hear about production/beta use of 2.9.  George
 maybe you should re-ask on java-user?

Here! I updated www.pangaea.de to Lucene trunk today (because of the
incomplete hashCode in TrieRangeQuery)... It works perfectly, but I do not use
the realtime parts. And ten days before, the same: no problems :-)

Currently I am rewriting parts of my code to use Collector, to move away from
HitCollector (without score, which enables optimizations)! reopen() and sorting
are fine; almost no time is consumed for sorted searches after reopening
indexes every 20 minutes with just some new, small segments of changed
documents. No extra warming is needed.
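A score-free collector of the kind described here can be sketched roughly as follows. SimpleCollector and CountingCollector are stand-in names, not the real Lucene 2.9 classes; the real Collector also receives the segment's IndexReader in setNextReader and a Scorer via setScorer, omitted here to keep the sketch self-contained:

```java
// Stand-in for the 2.9 Collector contract (assumed shape, not Lucene source).
abstract class SimpleCollector {
    /** Called once per segment; docBase maps segment-local ids to global ids. */
    abstract void setNextReader(int docBase);
    /** Called once per matching doc (segment-local id). */
    abstract void collect(int doc);
    /** Returning true lets the searcher feed docs out of order. */
    abstract boolean acceptsDocsOutOfOrder();
}

/** Counts hits without ever asking for a score -- the optimization Uwe means. */
class CountingCollector extends SimpleCollector {
    int docBase;
    int count;

    @Override void setNextReader(int docBase) { this.docBase = docBase; }
    @Override void collect(int doc) { count++; } // no score computed
    @Override boolean acceptsDocsOutOfOrder() { return true; }
}
```

Because the collector never touches a Scorer, the searcher can skip score computation entirely for such searches.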

Another change to be done here is to remove Field.Store.COMPRESS and replace
it with manually compressed binary stored fields, but this is only to get rid
of the deprecation warnings. This cannot be done without complete reindexing,
though.

Uwe





Re: Another possible optimization - now in DocIdSetIterator

2009-04-24 Thread Marvin Humphrey
On Fri, Apr 24, 2009 at 10:26:21PM +0300, Shai Erera wrote:

 I think we can make some optimization to DocIdSetIterator. Today, it defines
 next() and skipTo(int) which return a boolean. I've checked the code and it
 looks like almost always when these two are called, they are followed by a
 call to doc().
 
 I was thinking that if those two returned the doc Id they are at, instead of
 boolean, that will save the call to doc(). 

 What do you think? 

It'll work.

Nathan Kurz proposed exactly this change for KinoSearch last July. 

http://rectangular.com/pipermail/kinosearch/2007-July/004149.html

I think there is a small gain by having Scorer_Advance return a doc number
directly rather than a boolean, obviating the need for a follow-up call to
Scorer_Doc.

I finished the implementation last October.

One additional wrinkle, though: doc nums start at 1 rather than 0, so the
return values for Next() and Advance() can double as booleans.

Marvin Humphrey





[jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation

2009-04-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702526#action_12702526
 ] 

Michael McCandless commented on LUCENE-831:
---

{quote}
Grandma! But yeah we need to somehow support probably plain Java
objects rather than every primitive derivative?
{quote}

You mean big arrays (one per doc) of plain-java-objects?  Is Bobo doing that 
today?  Or do you mean a single Java object that, internally, deals with lookup 
by docID?

{quote}
(In reference to Mark's second-to-last post) Bobo efficiently
calculates facets for multiple values per doc, which is
the same thing as multi-value faceting?
{quote}

Neat.  How do you compactly represent (in RAM) multiple values per doc?
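One common way to pack multiple values per doc into flat arrays is an offsets-plus-values (CSR-style) layout. This is a guess at the general technique, not a claim about Bobo's actual internals:

```java
// Compact multi-valued per-doc ints: one flat values array plus a
// per-doc offsets array (offsets has numDocs+1 entries).
class MultiValuedInts {
    final int[] offsets; // offsets[d]..offsets[d+1] delimit doc d's values
    final int[] values;  // all docs' values, concatenated in doc order

    MultiValuedInts(int[] offsets, int[] values) {
        this.offsets = offsets;
        this.values = values;
    }

    int countForDoc(int doc) { return offsets[doc + 1] - offsets[doc]; }

    int value(int doc, int idx) { return values[offsets[doc] + idx]; }
}
```

RAM use is one int per value plus one int per doc, with no per-doc object overhead, which is what makes big-array representations attractive here.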

{quote}
Are norms and deletes implemented? These would just be byte
arrays in the current approach? If not how would they be
represented? It seems like for deleted docs we'd want the
BitVector returned from a ValueSource.get type of method?
{quote}

The current patch doesn't do this -- but we should think about how this change 
could absorb norms/deleted docs, in the future.  We would add a bit variant 
of getXXX (eg that returns BitVector, BitSet, something).

{quote}
Hmm... Does this mean we'd replace the current IndexReader
method of performing updates on norms and deletes with this more
generic update mechanism?
{quote}

Probably we'd still leave the sugar APIs in place, but under the hood their 
impls would be switched to this.

bq. It would be cool to get CSF going?

Most definitely!!

 Complete overhaul of FieldCache API/Implementation
 --

 Key: LUCENE-831
 URL: https://issues.apache.org/jira/browse/LUCENE-831
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Hoss Man
Assignee: Mark Miller
 Fix For: 3.0

 Attachments: ExtendedDocument.java, fieldcache-overhaul.032208.diff, 
 fieldcache-overhaul.diff, fieldcache-overhaul.diff, 
 LUCENE-831-trieimpl.patch, LUCENE-831.03.28.2008.diff, 
 LUCENE-831.03.30.2008.diff, LUCENE-831.03.31.2008.diff, LUCENE-831.patch, 
 LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, 
 LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, 
 LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, 
 LUCENE-831.patch


 Motivation:
 1) Complete overhaul the API/implementation of FieldCache type things...
 a) eliminate global static map keyed on IndexReader (thus
  eliminating synch block between completely independent IndexReaders)
 b) allow more customization of cache management (ie: use 
 expiration/replacement strategies, disk backed caches, etc)
 c) allow people to define custom cache data logic (ie: custom
 parsers, complex datatypes, etc... anything tied to a reader)
 d) allow people to inspect what's in a cache (list of CacheKeys) for
 an IndexReader so a new IndexReader can be likewise warmed. 
 e) Lend support for smarter cache management if/when
 IndexReader.reopen is added (merging of cached data from subReaders).
 2) Provide backwards compatibility to support existing FieldCache API with
  the new implementation, so there is no redundant caching as client code
  migrates to the new API.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





Re: CHANGES.txt

2009-04-24 Thread Michael McCandless
On Fri, Apr 17, 2009 at 1:46 PM, Steven A Rowe sar...@syr.edu wrote:
 A few random observations about CHANGES.txt and the generated CHANGES.html:

 - The ü in Christian Kohlschütter's name is not proper UTF-8 (maybe it's 
 double-encoded or something) in the two LUCENE-1186 mentions in the Trunk 
 section, though it looks okay in the LUCENE-1186 mention in the 2.4.1 release 
 section.  (Déjà-vu all over again, non?)

OK I committed this fix.  Yes, definitely déjà-vu!

 - Five issues (LUCENE-1186, 1452, 1453, 1465 and 1544) are mentioned in both 
 the 2.4.1 section and in the Trunk section.  AFAICT, it has not been standard 
 practice to mention bug fixes on a major or minor release (which Trunk will 
 become) if they are mentioned on a prior patch release.

Hmm -- I thought it'd be good to be clear on which bugs were fixed,
where, even if it causes some redundancy?

 - The perl script that generates Changes.html (changes2html.pl) finds list 
 items using a regex like /\s*\d+\.\s+/, but there is one list item in 
 CHANGES.txt (#4 under Bug fixes in the Trunk section, for LUCENE-1453) that 
 doesn't match this regex, since it's missing the trailing period ("4 " 
 instead of "4. "), and so it's interpreted as just another paragraph in the 
 previous list item.  To fix this, either the regex should be changed, or 
 "4 " should be changed to "4. ".  (I prefer the latter, since this is the 
 only occurrence, and it has never been part of a release.)

I committed this fix.

 - The Trunk section sports use of a new feature: <code> sections, for the two 
 mentions of LUCENE-1575.  This looks fine in the text rendering, but looks 
 crappy in the HTML version, since changes2html.pl escapes HTML metacharacters 
 to appear as-is in the HTML rendering, but the newlines in the code are 
 converted to a single space.  I think this should be fixed by modifying 
 changes2html.pl to convert <code> and </code> into (unescaped) <code><pre> 
 and </pre></code>, respectively, since just passing through <code> and 
 </code>, without <pre>, while changing the font to monospaced (nice), still 
 collapses whitespace (not nice).  (There is a related question: should all 
 HTML tags in CHANGES.txt be passed through without being escaped?  I don't 
 think so; better to handle them on a case-by-case basis, as the need arises.)
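The intended transformation can be sketched as below. changes2html.pl itself is Perl and not shown in this thread, so this is an illustrative Python rendering of the same escape-then-unescape logic, not the script's actual code:

```python
import html

def convert_code_sections(text: str) -> str:
    """Escape HTML metacharacters, then turn the escaped <code> markers
    back into <code><pre> ... </pre></code> so whitespace survives in
    the generated HTML (everything else stays escaped)."""
    escaped = html.escape(text)
    escaped = escaped.replace("&lt;code&gt;", "<code><pre>")
    escaped = escaped.replace("&lt;/code&gt;", "</pre></code>")
    return escaped
```

Wrapping the monospaced text in <pre> is what stops the browser from collapsing the newlines into a single space.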

Can you make a patch for <code><pre>...</pre></code>?  (I like that
approach).  I agree let's not make it generic to all HTML tags for
now...

Mike




[jira] Created: (LUCENE-1610) Preserve whitespace in code sections in the Changes.html generated from CHANGES.txt by changes2html.pl

2009-04-24 Thread Steven Rowe (JIRA)
Preserve whitespace in code sections in the Changes.html generated from 
CHANGES.txt by changes2html.pl


 Key: LUCENE-1610
 URL: https://issues.apache.org/jira/browse/LUCENE-1610
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Website
Affects Versions: 2.9
Reporter: Steven Rowe
Priority: Trivial
 Fix For: 2.9


The Trunk section of CHANGES.txt sports use of a new feature: <code> sections, 
for the two mentions of LUCENE-1575.

This looks fine in the text rendering, but looks crappy in the HTML version, 
since changes2html.pl escapes HTML metacharacters to appear as-is in the HTML 
rendering, but the newlines in the code are converted to a single space. 

I think this should be fixed by modifying changes2html.pl to convert <code> 
and </code> into (unescaped) <code><pre> and </pre></code>, respectively, 
since just passing through <code> and </code>, without <pre>, while changing 
the font to monospaced (nice), still collapses whitespace (not nice).

See the java-dev thread that spawned this issue here: 
http://www.nabble.com/CHANGES.txt-td23102627.html






Re: Another possible optimization - now in DocIdSetIterator

2009-04-24 Thread Michael McCandless
On Fri, Apr 24, 2009 at 4:20 PM, Marvin Humphrey mar...@rectangular.com wrote:

 One additional wrinkle, though: doc nums start at 1 rather than 0, so the
 return values for Next() and Advance() can double as booleans.

Meaning they return 0 to indicate no more docs?

Mike




[jira] Updated: (LUCENE-1610) Preserve whitespace in code sections in the Changes.html generated from CHANGES.txt by changes2html.pl

2009-04-24 Thread Steven Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rowe updated LUCENE-1610:


Attachment: LUCENE-1610.patch

Implements the suggested fix: <code> is converted to <code><pre> (instead of 
to &lt;code&gt;) and </code> is converted to </pre></code> (instead of to 
&lt;/code&gt;)






Re: Another possible optimization - now in DocIdSetIterator

2009-04-24 Thread Shai Erera
But I thought doc Ids start with 0? That's why I wrote 'set to -1' ...

On Sat, Apr 25, 2009 at 12:18 AM, Michael McCandless 
luc...@mikemccandless.com wrote:

 On Fri, Apr 24, 2009 at 4:20 PM, Marvin Humphrey mar...@rectangular.com
 wrote:

  One additional wrinkle, though: doc nums start at 1 rather than 0, so the
  return values for Next() and Advance() can double as booleans.

 Meaning they return 0 to indicate no more docs?

 Mike





Re: Another possible optimization - now in DocIdSetIterator

2009-04-24 Thread Marvin Humphrey
On Fri, Apr 24, 2009 at 05:18:30PM -0400, Michael McCandless wrote:

  One additional wrinkle, though: doc nums start at 1 rather than 0, so the
  return values for Next() and Advance() can double as booleans.
 
 Meaning they return 0 to indicate no more docs?

Yes.  0 is our sentinel.

Marvin Humphrey




Re: Another possible optimization - now in DocIdSetIterator

2009-04-24 Thread Marvin Humphrey
On Sat, Apr 25, 2009 at 12:24:59AM +0300, Shai Erera wrote:
 But I thought doc Ids start with 0? That's why I wrote 'set to -1' ...

This was in the context of KinoSearch.  In KS, doc numbers start at 1.

Marvin Humphrey




[jira] Created: (LUCENE-1611) Do not launch new merges if IndexWriter has hit OOME

2009-04-24 Thread Michael McCandless (JIRA)
Do not launch new merges if IndexWriter has hit OOME


 Key: LUCENE-1611
 URL: https://issues.apache.org/jira/browse/LUCENE-1611
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.4.1
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 2.9


if IndexWriter has hit OOME, it defends itself by refusing to commit changes to 
the index, including merges.  But this can lead to infinite merge attempts 
because we fail to prevent starting a merge.

Spinoff from 
http://www.nabble.com/semi-infinite-loop-during-merging-td23036156.html.






Re: Another possible optimization - now in DocIdSetIterator

2009-04-24 Thread Michael McCandless
It's a nice approach, but I think it relies on C interpreting integer
0 as false, which we can't do in Java.  (And, we lack unsigned int
in Java, so we have immense freedom to pick any negative number as our
sentinel ;) ).

Not to mention it'd be a scary change to make at this point!  So I
think we should just stick with -1 as our sentinel.

Mike

On Fri, Apr 24, 2009 at 5:33 PM, Marvin Humphrey mar...@rectangular.com wrote:
 On Fri, Apr 24, 2009 at 05:18:30PM -0400, Michael McCandless wrote:

  One additional wrinkle, though: doc nums start at 1 rather than 0, so the
  return values for Next() and Advance() can double as booleans.

 Meaning they return 0 to indicate no more docs?

 Yes.  0 is our sentinel.

 Marvin Humphrey







RE: CHANGES.txt

2009-04-24 Thread Steven A Rowe
Hi Mike,

On 4/24/2009 at 4:45 PM, Michael McCandless wrote:
 On Fri, Apr 17, 2009 at 1:46 PM, Steven A Rowe sar...@syr.edu wrote:
  - Five issues (LUCENE-1186, 1452, 1453, 1465 and 1544) are mentioned
  in both the 2.4.1 section and in the Trunk section.  AFAICT, it has
  not been standard practice to mention bug fixes on a major or minor
  release (which Trunk will become) if they are mentioned on a prior
  patch release.
 
 Hmm -- I thought it'd be good to be clear on which bugs were fixed,
 where, even if it causes some redundancy?

Right: SUM(+1 clarity, -0.5 redundancy) = +0.5 :)

So the policy you're suggesting is: When backporting bug fixes from trunk to a 
patch version, make note of the change in both the trunk and patch version 
sections of CHANGES.txt, right?

Makes sense (though as I noted, this policy has never before been used), but 
why then did you include only 5 out of the 15 bug fixes listed under 2.4.1 in 
the Trunk section?

  - The Trunk section sports use of a new feature: <code> sections,
  for the two mentions of LUCENE-1575.  This looks fine in the text
  rendering, but looks crappy in the HTML version, since
  changes2html.pl escapes HTML metacharacters to appear as-is in
  the HTML rendering, but the newlines in the code are converted to
  a single space.  I think this should be fixed by modifying
  changes2html.pl to convert <code> and </code> into (unescaped)
  <code><pre> and </pre></code>, respectively, since just passing
  through <code> and </code>, without <pre>, while changing the
  font to monospaced (nice), still collapses whitespace (not nice).
  (There is a related question: should all HTML tags in CHANGES.txt
  be passed through without being escaped?  I don't think so;
  better to handle them on a case-by-case basis, as the need
  arises.)
 
  Can you make a patch for <code><pre>...</pre></code>?  (I like that
  approach).  I agree let's not make it generic to all HTML tags for
  now...

Done: https://issues.apache.org/jira/browse/LUCENE-1610

Steve





[jira] Commented: (LUCENE-1313) Realtime Search

2009-04-24 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702573#action_12702573
 ] 

Jason Rutherglen commented on LUCENE-1313:
--

{quote} I think we could adopt a simple criterion: you flush the
new segment to the RAM Dir if net RAM used is <
maxRamBufferSizeMB. This way no further configuration is needed.
On auto-flush triggering you then must take into account the RAM
usage by this RAM Dir. {quote}

So we're ok with the blocking that occurs when the ram buffer is
flushed to the ramdir? 

{quote}On commit, these RAM segments must be migrated to the
real dir (preferably by forcing a merge, somehow). {quote}

This is pretty much like resolveExternalSegments which would be
called in prepareCommit? This could make calls to commit much
more time consuming. It may be confusing to the user why
IW.flush doesn't copy the ram segments to disk.

{quote}A near realtime reader would also happily mix real Dir
and RAMDir SegmentReaders.{quote}

Agreed, however the IW.getReader MultiSegmentReader removes
readers from another directory so we'd need to add a new
attribute to segmentinfo that marks it as ok for inclusion in
the MSR?



 Realtime Search
 ---

 Key: LUCENE-1313
 URL: https://issues.apache.org/jira/browse/LUCENE-1313
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1313.jar, LUCENE-1313.patch, LUCENE-1313.patch, 
 LUCENE-1313.patch, LUCENE-1313.patch, lucene-1313.patch, lucene-1313.patch, 
 lucene-1313.patch, lucene-1313.patch


 Realtime search with transactional semantics.  
 Possible future directions:
   * Optimistic concurrency
   * Replication
 Encoding each transaction into a set of bytes by writing to a RAMDirectory 
 enables replication.  It is difficult to replicate using other methods 
 because while the document may easily be serialized, the analyzer cannot.
 I think this issue can hold realtime benchmarks which include indexing and 
 searching concurrently.






Re: Another possible optimization - now in DocIdSetIterator

2009-04-24 Thread eks dev
Hi Shai, 
absolutely! 
we have been there, and there are already some micro benchmarks done in 
LUCENE-1345
just do not forget to use -1 < doc instead of -1 != doc; trust me, Yonik 
convinced me :)

as a side effect, this change would have some positive effects on iterator 
semantics: it prevents very hard-to-find one-off bugs caused by calling 
doc() before calling next(). We had quite a few of those.

-1 is good, as it supports moving to the first doc in next() without an 
if(initialized) check, just by incrementing.
cheers, 
eks



From: Shai Erera ser...@gmail.com
To: java-dev@lucene.apache.org
Sent: Friday, 24 April, 2009 21:26:21
Subject: Another possible optimization - now in DocIdSetIterator


Hi

I think we can make some optimization to DocIdSetIterator. Today, it defines 
next() and skipTo(int) which return a boolean. I've checked the code and it 
looks like almost always when these two are called, they are followed by a call 
to doc().

I was thinking that if those two returned the doc Id they are at, instead of 
boolean, that will save the call to doc(). Those that use these can:
* Compare doc to a NO_MORE_DOCS constant (set to -1), to understand there are 
no more docs in this iterator.
* If skipTo() is called, compare the 'target' to the returned Id; if they 
differ, callers can cache the returned Id so that when the next skipTo is 
requested, they can skip it when the previously returned Id is already greater 
than the new target. If it's not possible to cache it, they can call doc() to 
get that information.

The way I see it, the impls that will still need to call doc() will lose 
nothing. All we'll do is change the 'if' from comparing a boolean to comparing 
ints (even though that's a bit less efficient than comparing booleans). The 
impls that call doc() just because all they have in hand is a boolean, will 
gain.

Obviously we can't change those methods' signature, so we can deprecate them 
and introduce nextDoc() and skipToDoc(int target). We should still keep doc() 
around though.

What do you think? If you agree to this change, I can add them to 1593, or 
create a new issue (I prefer the latter so that 1593 will be focused on the 
changes to Collectors).

Shai



  

[jira] Updated: (LUCENE-1611) Do not launch new merges if IndexWriter has hit OOME

2009-04-24 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1611:
---

Attachment: LUCENE-1611.patch

Attached patch to prevent starting new merges after OOME, and to throw 
IllegalStateException in optimize, expungeDeletes if OOME is hit.  I plan to 
commit in a day or two.
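The guard described here can be sketched roughly as follows. The names are illustrative stand-ins, not IndexWriter's actual fields or methods:

```java
// Hedged sketch of the OOME guard: once an OutOfMemoryError has been hit,
// refuse to launch new merges, and fail fast in optimize-style calls
// instead of looping on merge attempts forever.
class OomeGuard {
    private volatile boolean hitOOM;

    void noteOOM() { hitOOM = true; }

    /** New merges must not start once an OOME has been hit. */
    boolean canStartMerge() { return !hitOOM; }

    /** optimize()/expungeDeletes() should throw rather than retry. */
    void checkCanOptimize() {
        if (hitOOM) {
            throw new IllegalStateException(
                "this writer hit an OutOfMemoryError; cannot optimize");
        }
    }
}
```

The merge scheduler would consult canStartMerge() before registering a merge, which breaks the infinite merge-attempt loop described in the issue.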






[jira] Resolved: (LUCENE-1610) Preserve whitespace in code sections in the Changes.html generated from CHANGES.txt by changes2html.pl

2009-04-24 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1610.


Resolution: Fixed

Thanks Steve!






[jira] Commented: (LUCENE-1313) Realtime Search

2009-04-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702579#action_12702579
 ] 

Michael McCandless commented on LUCENE-1313:


{quote}
So we're ok with the blocking that occurs when the ram buffer is
flushed to the ramdir?
{quote}

Well... we don't have a choice (unless/until we implement an IndexReader impl 
to directly search the RAM buffer).  Still, this should be a good improvement 
over the blocking when flushing to a real dir.

{quote}
This is pretty much like resolveExternalSegments which would be
called in prepareCommit? This could make calls to commit much
more time consuming. It may be confusing to the user why
IW.flush doesn't copy the ram segments to disk.
{quote}

Similar... the difference is I'd prefer to do a merge of the RAM segments vs 
the straight one-for-one copy that resolveExternalSegments does.

commit would only become more time consuming in the NRT case?  IE we'd only 
flush-to-RAMdir if it's getReader that's forcing the flush?  In which case, I 
think it's fine that commit gets more costly.  Also, I wouldn't expect it to be 
much more costly: we are doing an in-memory merge of N segments, writing one 
segment to the real directory.  Vs writing each tiny segment as a real one.  
In fact, commit could get cheaper (when compared to not making this change) 
since there are fewer new files to fsync.

{quote}
Agreed, however the IW.getReader MultiSegmentReader removes
readers from another directory so we'd need to add a new
attribute to segmentinfo that marks it as ok for inclusion in
the MSR?
{quote}

Or, fix that filtering to also accept IndexWriter's RAMDir.






[jira] Commented: (LUCENE-1313) Realtime Search

2009-04-24 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702596#action_12702596
 ] 

Jason Rutherglen commented on LUCENE-1313:
--

I'm confused as to how we make DocumentsWriter switch between
writing to disk and writing to the ramdir? It seems like a fairly major
change to the system? One that's hard to switch later on, after
IW is instantiated? Perhaps the IW.addWriter method is easier in
this regard?

{quote} the difference is I'd prefer to do a merge of the RAM
segments vs the straight one-for-one copy that
resolveExternalSegments does.{quote}

Yeah I implemented it this way in the IW.addWriter code. I agree
it's better for IW.commit to copy all the ramdir segments to one
disk segment. 

I started working on the IW.addWriter(IndexWriter, boolean
removeFrom) where removeFrom removes the segments that have been
copied to the destination writer from the source writer. This
method gets around the issue of blocking because potentially
several writers could concurrently be copied to the destination
writer. The only issue at this point is how the destination
writer obtains segmentreaders from source readers when they're
in the other writers' pool? Maybe the SegmentInfo can have a
reference to the writer it originated in? That way we can easily
access the right reader pool when we need it?





-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



RE: CHANGES.txt

2009-04-24 Thread Steven A Rowe
On 4/24/2009 at 6:24 PM, Michael McCandless wrote:
 On Fri, Apr 24, 2009 at 5:44 PM, Steven A Rowe sar...@syr.edu wrote:
  On 4/24/2009 at 4:45 PM, Michael McCandless wrote:
   On Fri, Apr 17, 2009 at 1:46 PM, Steven A Rowe sar...@syr.edu
   wrote:
- Five issues (LUCENE-1186, 1452, 1453, 1465 and 1544) are
mentioned in both the 2.4.1 section and in the Trunk section.
AFAICT, it has not been standard practice to mention bug
fixes on a major or minor release (which Trunk will become)
if they are mentioned on a prior patch release.
  
   Hmm -- I thought it'd be good to be clear on which bugs were
   fixed, where, even if it causes some redundancy?
[...]
  So the policy you're suggesting is: When backporting bug fixes
  from trunk to a patch version, make note of the change in both
  the trunk and patch version sections of CHANGES.txt, right?
 
  Makes sense (though as I noted, this policy has never before been
  used)
 
 Hmmm.
 
  , but why then did you include only 5 out of the 15 bug fixes
  listed under 2.4.1 in the Trunk section?
 
 Yeah good point... let me better describe what I've been doing, and
 then we can separately decide if it's good or not!
 
 For tiny bug fixes, eg say LUCENE-1429 or LUCENE-1474, I often don't
 include a CHANGES entry in trunk, because I want to keep the signal
 to noise ratio higher at that point for eventual users upgrading to
 the next major release.
 
 But then when I backport anything to a point release, I try very hard
 to include an entry in CHANGES for every little change, on the
 thinking that people considering a point release upgrade really want
 to know every single change (to properly assess risk/benefit).
 
 When I release a point release, I then carry forward its entry back to
 the trunk's CHANGES, and so then we see some issues listed only in
 2.4.1, which is bad since it could make people think they were in fact
 not fixed on trunk.
 
 So what to do?
 
 Maybe even tiny bug fixes should always be called out on trunk's
 CHANGES.  Or, maybe a tiny bug fix that also gets backported to a
point release must then be called out in both places?  I think I
 prefer the 2nd.

The difference between these two options is that in the 2nd, tiny bug fixes are 
mentioned in trunk's CHANGES only if they are backported to a point release, 
right?

For the record, the previous policy (the zeroth option :) appears to be that 
backported bug fixes, regardless of size, are mentioned only once, in the 
CHANGES for the (chronologically) first release in which they appeared.  You 
appear to oppose this policy, because (paraphrasing): people would wonder 
whether point release fixes were also fixed on following major/minor releases.  
IMNSHO, however, people (sometimes erroneously) view product releases as 
genetically linear: naming a release A.(B)[.x] implies inclusion of all 
changes to any release A.B[.y].  I.e., my sense is quite the opposite of yours: 
I would be *shocked* if bug fixes included in version 2.4.1 were not included 
(or explicitly called out as not included) in version 2.9.0.

If more than one point release branch is active at any one time, then things 
get more complicated (genetic linearity can no longer be assumed), and your new 
policy seems like a reasonable attempt at managing the complexity.  But will 
Lucene ever have more than one active bugfix branch?  It never has before.

But maybe I'm not understanding your intent: are you distinguishing between 
released CHANGES and unreleased CHANGES?  That is, do you intend to apply this 
new policy only to the unreleased trunk CHANGES, but then remove the redundant 
bugfix notices once a release is performed?

Steve





Re: Another possible optimization - now in DocIdSetIterator

2009-04-24 Thread Yonik Seeley
On Fri, Apr 24, 2009 at 6:20 PM, eks dev eks...@yahoo.co.uk wrote:
 just do not forget to use -1 < doc instead of -1 != doc

Perhaps doc >= 0 instead of doc != -1?
The crux of it is that status flags (result positive, negative, or
zero) are set by many operations - hence a compare/test operation can
often be eliminated.  For this same reason, counting down to zero in a
loop instead of counting up to a limit can be slightly faster.  It's a
single cycle though, so normally not much to worry about :-)
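The counting-down idiom can be illustrated in plain Java; treat the speed claim with the same caveats as above (JIT and hardware details vary), and note both loops compute the same result:

```java
// Micro-illustration of the counting-down idiom: the decrement itself sets
// the processor's status flags, so the >= 0 test needs no separate compare
// against a stored limit. Semantically the two loops are identical.
class LoopSketch {
    static int sumUp(int[] a) {
        int s = 0;
        for (int i = 0; i < a.length; i++) s += a[i];   // compare against limit
        return s;
    }

    static int sumDown(int[] a) {
        int s = 0;
        for (int i = a.length; --i >= 0; ) s += a[i];   // test falls out of decrement
        return s;
    }
}
```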

Of course, now we have processors like the i7 with macro-ops fusion
that can take a TEST/CMP followed by a branch and fuse them into a
single operation, which may level the field again.

-Yonik
http://www.lucidimagination.com




[jira] Created: (LUCENE-1612) expose lastDocId in the posting from the TermEnum API

2009-04-24 Thread John Wang (JIRA)
expose lastDocId in the posting from the TermEnum API
-

 Key: LUCENE-1612
 URL: https://issues.apache.org/jira/browse/LUCENE-1612
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.4
Reporter: John Wang


We currently have docFreq() on the TermEnum API, which gives the number of docs in 
the posting list.
It would be good to also have the max doc id in the posting list. That information is 
useful when constructing a custom DocIdSet, e.g. to determine the sparseness of the 
doc list and decide whether or not to use a BitSet.
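The kind of heuristic described could look like the following plain-Java sketch; the 1/32 threshold is an arbitrary illustration, not taken from the patch:

```java
// Hypothetical sparseness heuristic: with both docFreq and the last doc id
// available from the term dictionary, estimate posting density and choose a
// dense bit set only when the postings cover enough of the occupied id range.
class DocIdSetChooser {
    static boolean useBitSet(int docFreq, int lastDocId) {
        if (docFreq <= 0 || lastDocId < 0) return false;
        // density: docs present per id in the occupied range [0, lastDocId]
        return (double) docFreq / (lastDocId + 1) >= 1.0 / 32;
    }
}
```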

I have written a patch to do this. The problem is that TermInfosWriter 
encodes values as VInt/VLong, so there is very little flexibility to add 
lastDocId while keeping the index backward compatible. (If a plain int were used 
for, say, docFreq, a bit could be used to flag reading of a new piece of 
information.)

output.writeVInt(ti.docFreq);   // write doc freq
output.writeVLong(ti.freqPointer - lastTi.freqPointer); // write pointers
output.writeVLong(ti.proxPointer - lastTi.proxPointer);

Anyway, a patch is attached, with TestSegmentTermEnum modified to test this. 
TestBackwardsCompatibility fails for the reasons described above.






[jira] Updated: (LUCENE-1612) expose lastDocId in the posting from the TermEnum API

2009-04-24 Thread John Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Wang updated LUCENE-1612:
--

Attachment: lucene-1612-patch.txt

Patch attach with test. Index is not backwards compatible.

 expose lastDocId in the posting from the TermEnum API
 -

 Key: LUCENE-1612
 URL: https://issues.apache.org/jira/browse/LUCENE-1612
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.4
Reporter: John Wang
 Attachments: lucene-1612-patch.txt


 We currently have docFreq() on the TermEnum API, which gives the number of docs 
 in the posting list.
 It would be good to also have the max doc id in the posting list. That information 
 is useful when constructing a custom DocIdSet, e.g. to determine the sparseness of 
 the doc list and decide whether or not to use a BitSet.
 I have written a patch to do this. The problem is that TermInfosWriter 
 encodes values as VInt/VLong, so there is very little flexibility to add 
 lastDocId while keeping the index backward compatible. (If a plain int were used 
 for, say, docFreq, a bit could be used to flag reading of a new piece of 
 information.)
 output.writeVInt(ti.docFreq);   // write doc freq
 output.writeVLong(ti.freqPointer - lastTi.freqPointer); // write pointers
 output.writeVLong(ti.proxPointer - lastTi.proxPointer);
 Anyway, a patch is attached, with TestSegmentTermEnum modified to test this. 
 TestBackwardsCompatibility fails for the reasons described above.






[jira] Created: (LUCENE-1613) TermEnum.docFreq() is not updated when there are deletes

2009-04-24 Thread John Wang (JIRA)
TermEnum.docFreq() is not updated when there are deletes


 Key: LUCENE-1613
 URL: https://issues.apache.org/jira/browse/LUCENE-1613
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 2.4
Reporter: John Wang


TermEnum.docFreq is used in many places, especially scoring. However, when there 
are deletes in the index that have not yet been merged away, this value is not updated.

Attached is a test case.
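The mismatch can be illustrated in plain Java without Lucene types (a hypothetical sketch, not the attached test case): the stored docFreq is a snapshot from index time, while deletions are tracked separately until a merge rewrites the segment.

```java
import java.util.BitSet;

// Plain-Java illustration of the report: docFreq is written once at index
// time and never updated, while deletions live in a separate structure.
class PostingSketch {
    final int[] docs;            // doc ids in the posting list
    final int storedDocFreq;     // written at index time, never updated
    final BitSet deleted = new BitSet();

    PostingSketch(int[] docs) {
        this.docs = docs;
        this.storedDocFreq = docs.length;
    }

    int liveDocFreq() {          // what a "corrected" docFreq would return
        int n = 0;
        for (int d : docs) if (!deleted.get(d)) n++;
        return n;
    }
}
```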






[jira] Updated: (LUCENE-1613) TermEnum.docFreq() is not updated when there are deletes

2009-04-24 Thread John Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Wang updated LUCENE-1613:
--

Attachment: TestDeleteAndDocFreq.java

Test showing docFreq not updated when there are deletes.

 TermEnum.docFreq() is not updated when there are deletes
 

 Key: LUCENE-1613
 URL: https://issues.apache.org/jira/browse/LUCENE-1613
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 2.4
Reporter: John Wang
 Attachments: TestDeleteAndDocFreq.java


 TermEnum.docFreq is used in many places, especially scoring. However, when 
 there are deletes in the index that have not yet been merged away, this value 
 is not updated.
 Attached is a test case.






[jira] Commented: (LUCENE-1613) TermEnum.docFreq() is not updated when there are deletes

2009-04-24 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702649#action_12702649
 ] 

John Wang commented on LUCENE-1613:
---

I understand this is a rather difficult problem to fix. I thought keeping a 
jira ticket would still be good for tracking purposes. I'll let the committers 
decide on the urgency of this issue.

 TermEnum.docFreq() is not updated when there are deletes
 

 Key: LUCENE-1613
 URL: https://issues.apache.org/jira/browse/LUCENE-1613
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 2.4
Reporter: John Wang
 Attachments: TestDeleteAndDocFreq.java


 TermEnum.docFreq is used in many places, especially scoring. However, when 
 there are deletes in the index that have not yet been merged away, this value 
 is not updated.
 Attached is a test case.






perf enhancement and lucene-1345

2009-04-24 Thread John Wang
Hi Guys:
 A while ago I posted to LUCENE-1345 some enhancements to the disjunction and
conjunction DocIdSetIterators that showed performance improvements. I
think they got mixed up with another discussion on that issue. I was wondering
what happened with them and what the plans are.

Thanks

-John


[jira] Created: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean

2009-04-24 Thread Shai Erera (JIRA)
Add next() and skipTo() variants to DocIdSetIterator that return the current 
doc, instead of boolean


 Key: LUCENE-1614
 URL: https://issues.apache.org/jira/browse/LUCENE-1614
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
 Fix For: 2.9


See 
http://www.nabble.com/Another-possible-optimization---now-in-DocIdSetIterator-p23223319.html
 for the full discussion. The basic idea is to add variants to those two 
methods that return the current doc they are at, to save successive calls to 
doc(). If there are no more docs, return -1. A summary of what was discussed so 
far:
# Deprecate those two methods.
# Add nextDoc() and skipToDoc(int) that return doc, with default impl in DISI 
(calls next() and skipTo() respectively, and will be changed to abstract in 
3.0).
#* I actually would like to propose an alternative to the names: advance() and 
advance(int) - the first advances by one, the second advances to target.
# Wherever these are used, do something like '(doc = advance()) >= 0' instead 
of comparing to -1 for improved performance.

I will post a patch shortly.
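The proposed iteration idiom can be sketched in plain Java; the names follow the proposal in this issue and are not final API:

```java
// Sketch of the proposal: the iterator returns the current doc directly from
// nextDoc()/advance(int), with -1 when exhausted, so callers can write
// (doc = it.nextDoc()) >= 0 instead of a next()/doc() call pair.
class IntArrayDISI {
    private final int[] docs;    // ascending doc ids
    private int pos = -1;

    IntArrayDISI(int[] docs) { this.docs = docs; }

    int nextDoc() {
        return ++pos < docs.length ? docs[pos] : -1;
    }

    int advance(int target) {    // first doc >= target, or -1 if none
        int d;
        while ((d = nextDoc()) >= 0 && d < target) { }
        return d;
    }
}
```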






[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector

2009-04-24 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702651#action_12702651
 ] 

Shai Erera commented on LUCENE-1593:


bq. Same for BS2.score() and score(HC) - initCountingSumScorer?

I proposed this change, but it is problematic. initCountingSumScorer declares 
that it throws IOE, but none of the ctors nor add() declares it. So I cannot 
add a call to initCountingSumScorer without breaking back-compat, unless:
* I wrap IOE in RE, but I don't like that very much.
* We close our eyes, saying it's not very likely that BS2 is used on its own, 
outside BooleanQuery (or BooleanWeight, for that matter). I did a short test 
and added IOE; nothing breaks (at least on the 'java' side, haven't checked 
contrib), and I'm pretty confident other code will not break either, since the 
rest of the methods do throw IOE (like score(), next(), skipTo()), and why 
would you want to initialize BS2 if not to call these?

What do you think about the 2nd option?
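The two options can be sketched in plain Java; the class and method names below are illustrative stand-ins for the BS2 situation, not the real code:

```java
import java.io.IOException;

// Sketch of the back-compat problem: the init method declares IOException,
// but the constructor's signature cannot gain a checked exception.
class LazyInitScorer {
    private boolean initialized;

    LazyInitScorer(boolean eager) {
        if (eager) {
            try {
                init();                        // option 1: call from ctor,
            } catch (IOException e) {          // wrapping IOE in RE to keep
                throw new RuntimeException(e); // the ctor signature unchanged
            }
        }
    }

    int score() throws IOException {           // option 2: init lazily from
        if (!initialized) init();              // methods that already throw IOE
        return 42;
    }

    private void init() throws IOException { initialized = true; }
}
```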

 Optimizations to TopScoreDocCollector and TopFieldCollector
 ---

 Key: LUCENE-1593
 URL: https://issues.apache.org/jira/browse/LUCENE-1593
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
 Fix For: 2.9


 This is a spin-off of LUCENE-1575 and proposes to optimize TSDC and TFC code 
 to remove unnecessary checks. The plan is:
 # Ensure that IndexSearcher returns segments in increasing doc Id order, 
 instead of numDocs().
 # Change TSDC and TFC's code to not use the doc id as a tie breaker. New docs 
 will always have larger ids and therefore cannot compete.
 # Pre-populate HitQueue with sentinel values in TSDC (score = Float.NEG_INF) 
 and remove the check if reusableSD == null.
 # Also move to use changing top and then call adjustTop(), in case we 
 update the queue.
 # some methods in Sort explicitly add SortField.FIELD_DOC as a tie breaker 
 for the last SortField. But, doing so should not be necessary (since we 
 already break ties by docID), and is in fact less efficient (once the above 
 optimization is in).
 # Investigate PQ - can we deprecate insert() and have only 
 insertWithOverflow()? Add a addDummyObjects method which will populate the 
 queue without arranging it, just store the objects in the array (this can 
 be used to pre-populate sentinel values)?
 I will post a patch as well as some perf measurements as soon as I have them.
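The sentinel idea in point 3 of the plan quoted above can be sketched in plain Java (a hypothetical stand-in, not the actual HitQueue/TSDC code): pre-filling a fixed-size min-heap with -Infinity means a candidate score competes against the top with a single comparison and no null check.

```java
import java.util.Arrays;

// Sketch of sentinel pre-population: heap[0] always holds the smallest
// collected score, initially -Infinity, so collect() never needs a size
// or null check before comparing.
class SentinelQueue {
    final float[] heap;          // binary min-heap; heap[0] is the "top"

    SentinelQueue(int size) {
        heap = new float[size];
        Arrays.fill(heap, Float.NEGATIVE_INFINITY);   // sentinel values
    }

    void collect(float score) {
        if (score <= heap[0]) return;   // cannot compete: one comparison
        heap[0] = score;
        siftDown();
    }

    private void siftDown() {
        int i = 0;
        while (true) {
            int l = 2 * i + 1, r = l + 1, s = i;
            if (l < heap.length && heap[l] < heap[s]) s = l;
            if (r < heap.length && heap[r] < heap[s]) s = r;
            if (s == i) break;
            float t = heap[i]; heap[i] = heap[s]; heap[s] = t;
            i = s;
        }
    }
}
```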


