[jira] Updated: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shai Erera updated LUCENE-1575:
---
Attachment: LUCENE-1575.5.patch

Fixed TestFieldNormModifier and TestLengthNormModifier. All tests pass now (including contrib).

Refactoring Lucene collectors (HitCollector and extensions)
---
Key: LUCENE-1575
URL: https://issues.apache.org/jira/browse/LUCENE-1575
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Reporter: Shai Erera
Assignee: Michael McCandless
Fix For: 2.9
Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch, LUCENE-1575.patch

This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html]. We have agreed to do the following refactoring:
* Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
* Deprecate HitCollector in favor of the new Collector.
* Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
** This will remove any instanceof checks that currently exist in IndexSearcher code.
* Create a new (abstract) TopDocsCollector, which will:
** Leave collect and setNextReader unimplemented.
** Introduce protected members PriorityQueue and totalHits.
** Introduce a single protected constructor which accepts a PriorityQueue.
** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-is by extending classes, as well as be overridden.
** Introduce a new topDocs(start, howMany) method which will be used as a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve memory allocation, by allocating a ScoreDoc[] of the requested size only.
* Change TopScoreDocCollector to extend TopDocsCollector, using the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
* Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
* Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany).
* Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.

Additionally, the following proposal was made w.r.t. decoupling score from collect():
* Change collect to accept only a doc Id (unbased).
* Introduce a setScorer(Scorer) method.
* If during collect the implementation needs the score, it can call scorer.score().

If we do this, then we need to review all places in the code where collect(doc, score) is called, and assess whether Scorer can be passed. Also this raises a few questions:
* What if during collect() Scorer is null (i.e., not set)? Is that even possible?
* I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't that mean the score is always needed in collect()?

Open issues:
* The name for Collector. TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Collector ResultsCollector. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
* Decoupling score from collect().

I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch, (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method). There might even be a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?)

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
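For readers following the thread: the decoupled API proposed above (collect taking only a doc id, plus setScorer) can be sketched with stub types. These are illustrative stand-ins written for this note, not the real org.apache.lucene.search classes, and the final 2.9 signatures may differ:

```java
// Stub stand-ins for the proposed API; not the real Lucene types.
interface Scorer {
    float score();
}

abstract class Collector {
    // Called once per segment before collection begins; docBase is the
    // offset to add to segment-relative doc ids.
    public abstract void setNextReader(int docBase);
    // Gives the collector on-demand access to scores.
    public abstract void setScorer(Scorer scorer);
    // Receives only the doc id; no score parameter anymore.
    public abstract void collect(int doc);
}

// A collector that only counts hits never has to compute a score at all,
// which is the point of the decoupling.
class CountingCollector extends Collector {
    int totalHits;
    int docBase;

    public void setNextReader(int docBase) { this.docBase = docBase; }
    public void setScorer(Scorer scorer) { /* score not needed */ }
    public void collect(int doc) { totalHits++; }
}
```

With this shape, a scoring collector calls scorer.score() inside collect() only when it actually needs the value.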
Re: Problem using Lucene RangeQuery
Lucene stores and searches STRINGS, so the range [0..2] may return 0, 1, 101, ... 109, 11, 110, ... 119, 12, ..., 2. Prefix and normalize your numbers, like: 001, 002, ... 011, 012, 013, etc. If you'll have bigger numbers, put more 0's. All of this and much more is documented on the wiki, javadocs and so on; please read them first.

On Thu, Apr 2, 2009 at 05:40, mitu2009 musicfrea...@gmail.com wrote:

I'm using RangeQuery to get all the documents which have amount between, say, 0 to 2. When I execute the query, Lucene gives me documents which have amount greater than 2 also... What am I missing here? Here is my code:

Term lowerTerm = new Term("amount", minAmount);
Term upperTerm = new Term("amount", maxAmount);
RangeQuery amountQuery = new RangeQuery(lowerTerm, upperTerm, true);
finalQuery.Add(amountQuery, BooleanClause.Occur.MUST);

and here is what goes into my index:

doc.Add(new Field("amount", amount.ToString(), Field.Store.YES, Field.Index.UN_TOKENIZED, Field.TermVector.YES));

Thanks.
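The zero-padding advice above can be shown in a few lines. The width (5 digits) is an arbitrary choice for this sketch; pick one wide enough for your largest amount, and apply it both at index time and when building the query terms:

```java
// Zero-pad numbers to a fixed width so lexicographic (string) term order
// matches numeric order, which is what RangeQuery relies on.
class PadExample {
    static String pad(long amount) {
        return String.format("%05d", amount); // 42 -> "00042"
    }
}
```

Unpadded, "101" sorts before "2" as a string; once both the indexed values and the query bounds go through pad(), the range [00000..00002] matches only the intended documents.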
Re: Future projects
On Wed, Apr 1, 2009 at 7:05 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote:

Now that LUCENE-1516 is close to being committed perhaps we can figure out the priority of other issues:

1. Searchable IndexWriter RAM buffer

I think first priority is to get a good assessment of the performance of the current implementation (from LUCENE-1516). My initial tests are very promising: with a writer updating (replacing random docs) at 50 docs/second on a full (3.2 M docs) Wikipedia index, I was able to reopen the reader once per second and do a large (500K results) search that sorts by date. The reopen time was typically ~40 msec, and search time typically ~35 msec (though there were random spikes up to ~340 msec). Though, these results were on an SSD (Intel X25M 160 GB).

We need more datapoints on the current approach, but this looks likely to be good enough for starters. And since we can get it into 2.9, hopefully it'll get some early usage and people will report back to help us assess whether further performance improvements are necessary. If they do turn out to be necessary, I think before your step 1, we should write small segments into a RAMDirectory instead of the real directory. That's simpler than truly searching IndexWriter's in-memory postings data.

2. Finish up benchmarking and perhaps implement passing filters to the SegmentReader level

What is "passing filters to the SegmentReader level"? EG as of LUCENE-1483, we now ask a Filter for its DocIdSet once per SegmentReader.

3. Deleting by doc id using IndexWriter

We need a clean approach for the "docIDs suddenly shift when a merge is committed" problem for this... Thinking more on this... I think one possible solution may be to somehow expose IndexWriter's internal docID remapping code. IndexWriter does delete by docID internally, and whenever a merge is committed we stop-the-world (sync on IW) and go remap those docIDs.

If we somehow allowed the user to register a callback that we could call when this remapping occurs, then the user's code could carry the docIDs without them becoming stale. Or maybe we could make a class PendingDocIDs, which you'd ask the reader to give you, that holds docIDs and remaps them after each merge. The problem is, IW internally always logically switches to the current reader for any further docID deletion, but the user's code may continue to use an old reader. So simply exposing this remapping won't fix it... we'd need to somehow track the genealogy (quite a bit more complex).

With 1) I'm interested in how we will lock a section of the bytes for use by a given reader?

We would not actually lock them, but we need to set aside the bytes such that, for example, if the postings grow, TermDocs iteration does not progress beyond its limits.

Are there any modifications that are needed of the RAM buffer format? How would the term table be stored? We would not be using the current hash method?

I think the realtime reader'd just store the maxDocID it's allowed to search, and we would likely keep using the RAM format now used.

Mike
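The PendingDocIDs idea floated above could look something like the sketch below. The class name comes from the mail, but everything else here - the int[]-map shape of the remap callback, -1 meaning "deleted by the merge" - is a guess for illustration, not actual Lucene API:

```java
import java.util.Arrays;

// Hypothetical sketch only: holds doc ids on behalf of a user and remaps
// them when a merge commits. oldToNew[old] gives the doc's new id, or -1
// if the merge dropped it (the doc had been deleted).
class PendingDocIDs {
    private int[] docIDs;

    PendingDocIDs(int[] docIDs) {
        this.docIDs = docIDs.clone();
    }

    int[] current() {
        return docIDs.clone();
    }

    // Would be invoked by a (hypothetical) merge-commit callback inside
    // IndexWriter, while the world is stopped.
    void remap(int[] oldToNew) {
        docIDs = Arrays.stream(docIDs)
                       .map(d -> oldToNew[d])
                       .filter(d -> d >= 0)
                       .toArray();
    }
}
```

As the mail points out, this only helps while the user searches the current reader; ids held against an old reader would additionally need the genealogy tracking described above.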
[jira] Commented: (LUCENE-1313) Realtime Search
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12694917#action_12694917 ]

Michael McCandless commented on LUCENE-1313:

Jason, your last patch looks like it's taking the "flush first to RAM Dir" approach I just described as the next step (on the java-dev thread), right? Which is great! So this has no external dependencies, right? And it simply layers on top of LUCENE-1516. I'd be very interested to compare (benchmark) this approach vs solely LUCENE-1516.

Could we change this class so that instead of taking a Transaction object, holding adds and deletes, it simply mirrors IndexWriter's API? Ie, I'd like to decouple the performance optimization of "let's flush small segments through a RAMDir first" from the transactional semantics of "I process a transaction atomically, and lock out other threads' transactions". Ie, the transactional restriction could/should layer on top of this performance optimization for near-realtime search?

Realtime Search
---
Key: LUCENE-1313
URL: https://issues.apache.org/jira/browse/LUCENE-1313
Project: Lucene - Java
Issue Type: New Feature
Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Priority: Minor
Fix For: 2.9
Attachments: LUCENE-1313.patch, LUCENE-1313.patch, lucene-1313.patch, lucene-1313.patch, lucene-1313.patch, lucene-1313.patch

Realtime search with transactional semantics. Possible future directions:
* Optimistic concurrency
* Replication

Encoding each transaction into a set of bytes by writing to a RAMDirectory enables replication. It is difficult to replicate using other methods because while the document may easily be serialized, the analyzer cannot. I think this issue can hold realtime benchmarks which include indexing and searching concurrently.
[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12694927#action_12694927 ]

Shai Erera commented on LUCENE-1575:

I thought that ant test runs all tests. Thanks for the education.

The reason is that TimeLimitedCollector now extends Collector, which does not extend HitCollector. Therefore the method attempts to return an invalid type. I'm not sure how to fix it, because I cannot change the 2.4 test code, since Collector is not there. So the only reasonable solution I see here is to:
* Change TimeLimitedCollector to extend HitCollector, document that in 3.0 it will change to extend Collector, and that in the meantime you can use HitCollectorWrapper if you want.
* Comment out all the Collector-related methods, including the new ctor, with a TODO to reinstate them in 3.0.
* Fix TestTimeLimitedCollector to wrap it with a HCW, as well as using only HitCollector as the wrapped collector.

Other solutions which I don't like are:
* Deprecate TLC and create a new NewTimeLimitedCollector - I hate the name :)
* Have Collector extend HitCollector - I hate to even consider that.

What do you think?
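The HitCollectorWrapper mentioned in the first bullet is essentially an adapter. A self-contained sketch with stub types (these are stand-ins written for this note, not the real Lucene classes; setNextReader is omitted for brevity):

```java
// Stub stand-ins for the real org.apache.lucene.search types.
abstract class HitCollector {
    public abstract void collect(int doc, float score);
}

interface Scorer {
    float score();
}

abstract class Collector {
    public abstract void setScorer(Scorer scorer);
    public abstract void collect(int doc);
}

// Adapts a deprecated HitCollector to the new Collector API by fetching
// the score from the Scorer for every collected doc.
final class HitCollectorWrapper extends Collector {
    private final HitCollector delegate;
    private Scorer scorer;

    HitCollectorWrapper(HitCollector delegate) {
        this.delegate = delegate;
    }

    public void setScorer(Scorer scorer) {
        this.scorer = scorer;
    }

    public void collect(int doc) {
        delegate.collect(doc, scorer.score());
    }
}
```

Note the cost this implies: the wrapper always calls scorer.score(), even if the wrapped collector ignores the score, which is one reason native Collector implementations are preferred.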
[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12694938#action_12694938 ]

Michael McCandless commented on LUCENE-1575:

bq. I thought that ant test runs all tests. Thanks for the education.

Probably, it should. I'll raise this on java-dev.

bq. Change TimeLimitedCollector to extend HitCollector, document that in 3.0 it will change to extend Collector and that in the meantime use HitCollectorWrapper if you want.

I think I like this solution best (though this is very much a "lesser of all evils" situation).

<lament> Ahh, the contortions we must go through because of Lucene's success. Marvin over on Lucy can happily make major changes without batting an eye. The sad reality is that the ongoing growth rate of a thing is inversely proportional to its popularity. </lament>
ant test should include test-tag
I think back-compat tests (ant test-tag) should run when you run ant test. Any objections? If not I'll commit soon...

Mike
[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12694919#action_12694919 ]

Michael McCandless commented on LUCENE-1575:

Could you also run ant test-tag (which tests JAR-drop-in back-compatibility)? EG I'm getting this compilation error:

{code}
[javac] /lucene/src/lucene.collection/tags/lucene_2_4_back_compat_tests_20090320/src/test/org/apache/lucene/search/TestTimeLimitedCollector.java:136: incompatible types
[javac] found   : org.apache.lucene.search.TimeLimitedCollector
[javac] required: org.apache.lucene.search.HitCollector
[javac]         return res;
[javac]                ^
{code}
[jira] Updated: (LUCENE-1516) Integrate IndexReader with IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-1516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1516:
---
Attachment: LUCENE-1516.patch

Added another test case to TestIndexWriterReader, stress testing adding/deleting docs while constantly opening a near real-time reader.

Integrate IndexReader with IndexWriter
---
Key: LUCENE-1516
URL: https://issues.apache.org/jira/browse/LUCENE-1516
Project: Lucene - Java
Issue Type: Improvement
Affects Versions: 2.4
Reporter: Jason Rutherglen
Assignee: Michael McCandless
Priority: Minor
Fix For: 2.9
Attachments: LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, magnetic.png, ssd.png, ssd2.png
Original Estimate: 672h
Remaining Estimate: 672h

The current problem is that an IndexReader and IndexWriter cannot be open at the same time and perform updates, as they both require a write lock to the index. While methods such as IW.deleteDocuments enable deleting from IW, methods such as IR.deleteDocument(int doc) and norms updating are not available from IW. This limits the capabilities of performing updates to the index dynamically or in realtime without closing the IW and opening an IR, deleting or updating norms, flushing, then opening the IW again, a process which can be detrimental to realtime updates.

This patch will expose an IndexWriter.getReader method that returns the currently flushed state of the index as a class that implements IndexReader. The new IR implementation will differ from existing IR implementations such as MultiSegmentReader in that flushing will synchronize updates with IW in part by sharing the write lock. All methods of IR will be usable including reopen and clone.
Re: ant test should include test-tag
I definitely agree. It would have saved me another patch submission in 1575 :)

On Thu, Apr 2, 2009 at 12:44 PM, Michael McCandless luc...@mikemccandless.com wrote:

I think back-compat tests (ant test-tag) should run when you run ant test. Any objections? If not I'll commit soon... Mike
Re: ant test should include test-tag
Shai Erera wrote:

I definitely agree. It would have saved me another patch submission in 1575 :)

On Thu, Apr 2, 2009 at 12:44 PM, Michael McCandless luc...@mikemccandless.com wrote:

I think back-compat tests (ant test-tag) should run when you run ant test. Any objections? If not I'll commit soon... Mike

As long as I still have a target that will test without back compat tests.

--
- Mark
http://www.lucidimagination.com
Re: ant test should include test-tag
OK I'll add a test-core-contrib target.

Mike

On Thu, Apr 2, 2009 at 6:45 AM, Mark Miller markrmil...@gmail.com wrote:

Shai Erera wrote:

I definitely agree. It would have saved me another patch submission in 1575 :)

On Thu, Apr 2, 2009 at 12:44 PM, Michael McCandless luc...@mikemccandless.com wrote:

I think back-compat tests (ant test-tag) should run when you run ant test. Any objections? If not I'll commit soon... Mike

As long as I still have a target that will test without back compat tests.

--
- Mark
http://www.lucidimagination.com
Re: ant test should include test-tag
Wouldn't hurt I suppose - but test-core and test-contrib are probably sufficient. I wasn't very clear with that comment. I was just saying, as long as I can still run the tests a bit quicker than running through everything twice - which is already available. I should have just said +1. On the other hand, test-core-contrib doesn't hurt anything.

Michael McCandless wrote:

OK I'll add a test-core-contrib target.

Mike

On Thu, Apr 2, 2009 at 6:45 AM, Mark Miller markrmil...@gmail.com wrote:

Shai Erera wrote:

I definitely agree. It would have saved me another patch submission in 1575 :)

On Thu, Apr 2, 2009 at 12:44 PM, Michael McCandless luc...@mikemccandless.com wrote:

I think back-compat tests (ant test-tag) should run when you run ant test. Any objections? If not I'll commit soon... Mike

As long as I still have a target that will test without back compat tests.

--
- Mark
http://www.lucidimagination.com
[jira] Created: (LUCENE-1582) Make TrieRange completely independent from Document/Field with TokenStream of prefix encoded values
Make TrieRange completely independent from Document/Field with TokenStream of prefix encoded values --- Key: LUCENE-1582 URL: https://issues.apache.org/jira/browse/LUCENE-1582 Project: Lucene - Java Issue Type: Improvement Components: contrib/* Affects Versions: 2.9 Reporter: Uwe Schindler Assignee: Uwe Schindler Fix For: 2.9

TrieRange currently has the following problems:
- To add a field that uses trie encoding, you can manually add each term to the index or use a helper method from TrieUtils. The helper method has the problem that it uses a fixed field configuration.
- TrieUtils currently creates, by default, a helper field containing the lower-precision terms to enable sorting (limitation of one term/document for sorting).
- trieCodeLong/Int() unnecessarily creates String[] and char[] arrays, which is heavy on GC if you index a lot of numeric values. A lot of char[]-to-String copying is also involved.

This issue should improve this:
- trieCodeLong/Int() returns a TokenStream. During encoding, all char[] arrays are reused via the Token API; additional String[] arrays for the encoded result are not created - instead the TokenStream enumerates the trie values.
- Trie fields can be added to Documents during indexing using the standard API: new Field(name, TokenStream, ...), so no extra util method is needed. By using token filters, one could also add payloads and so on, and customize everything.

The drawback is: sorting would not work anymore. To enable sorting, a (sub-)issue can extend the FieldCache to stop iterating the terms as soon as a lower-precision one is enumerated by TermEnum. I will create a hack patch for TrieUtils-use only, that uses an unchecked Exception in the Parser to stop iteration. With LUCENE-831, a more generic API for this can be used (custom parser/iterator implementation for FieldCache).
I will attach the field cache patch (with the temporary solution, until FieldCache is reimplemented) as a separate patch file, or maybe open another issue for it.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
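The precision-splitting that trieCodeLong/Int() performs can be sketched in plain Java. This is a toy illustration with invented names (TriePrefixSketch is not the actual TrieUtils API): each precision level shifts away low-order bits, producing one term per level - exactly the sequence a trie-encoding TokenStream would enumerate.

```java
import java.util.ArrayList;
import java.util.List;

public class TriePrefixSketch {
    // Enumerate prefix-encoded forms of a value, dropping `step` low bits
    // per precision level, the way trie encoding produces one term per level.
    public static List<String> prefixTerms(long value, int step) {
        List<String> terms = new ArrayList<>();
        for (int shift = 0; shift < 64; shift += step) {
            // Tag each term with its shift so lower-precision terms
            // occupy their own range of the term dictionary.
            terms.add(shift + ":" + Long.toHexString(value >>> shift));
        }
        return terms;
    }

    public static void main(String[] args) {
        for (String t : prefixTerms(0x1234L, 8)) {
            System.out.println(t);
        }
    }
}
```

A range query can then match long runs of values with a handful of low-precision terms, which is the point of the encoding.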
[jira] Updated: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-1575: --- Attachment: LUCENE-1575.6.patch

Changes:
# TimeLimitedCollector, TestTimeLimitedCollector and CHANGES.
# I also fixed a bug in TestTermScorer, which was discovered by the test-tag task, and has existed since LUCENE-1483 and propagated into HitCollectorWrapper as well: docBase was set to -1 by default, relying on setNextReader to be called. However, if it's not called (as in TestTermScorer, or if someone called Scorer.score(Collector)), all document Ids are shifted backwards by 1. The test had a bug which asserted on the unshifted doc Id, and after I fixed the Ids to shift, it failed. Anyway, the test now works correctly, as well as HCW.
# I checked all other Collector implementations and changed the default base to 0, except in some test cases where -1 had a meaning.

All tests (contrib, core and tags) pass.

Refactoring Lucene collectors (HitCollector and extensions) --- Key: LUCENE-1575 URL: https://issues.apache.org/jira/browse/LUCENE-1575 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Assignee: Michael McCandless Fix For: 2.9 Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch, LUCENE-1575.6.patch, LUCENE-1575.patch

This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html]. We have agreed to do the following refactoring:
* Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
* Deprecate HitCollector in favor of the new Collector.
* Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
** It will remove any instanceof checks that currently exist in IndexSearcher code.
* Create a new (abstract) TopDocsCollector, which will:
** Leave collect and setNextReader unimplemented.
** Introduce protected members PriorityQueue and totalHits.
** Introduce a single protected constructor which accepts a PriorityQueue.
** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-is by extending classes, as well as be overridden.
** Introduce a new topDocs(start, howMany) method which will be used as a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only.
* Change TopScoreDocCollector to extend TopDocsCollector, and use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
* Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
* Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany).
* Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.

Additionally, the following proposal was made w.r.t. decoupling score from collect():
* Change collect to accept only a doc Id (unbased).
* Introduce a setScorer(Scorer) method.
* If during collect the implementation needs the score, it can call scorer.score().

If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also, this raises a few questions:
* What if during collect() Scorer is null? (i.e., not set) - is it even possible?
* I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always?

Open issues:
* The name for Collector.
* TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Collector ResultsCollector. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
* Decoupling score from collect().

I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of
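The proposed decoupling of score from collect() can be illustrated with a toy sketch. The class names below are invented for illustration and are not Lucene's actual classes: collect() receives only the doc id, and a collector that needs the score pulls it from a Scorer set up front via setScorer().

```java
// Toy illustration of the proposed Collector API: collect(int doc)
// without a score, plus setScorer() for implementations that need one.
public class CollectorSketch {
    interface Scorer { float score(); }

    static abstract class Collector {
        abstract void setScorer(Scorer scorer);
        abstract void collect(int doc);
        abstract void setNextReader(int docBase);
    }

    // A collector that counts hits and tracks the best score; it calls
    // scorer.score() only because it actually needs the value.
    static class CountingCollector extends Collector {
        Scorer scorer;
        int docBase = 0;   // default base of 0, not -1
        int totalHits = 0;
        float maxScore = Float.NEGATIVE_INFINITY;

        void setScorer(Scorer scorer) { this.scorer = scorer; }
        void setNextReader(int docBase) { this.docBase = docBase; }
        void collect(int doc) {
            totalHits++;
            maxScore = Math.max(maxScore, scorer.score());
        }
    }

    public static void main(String[] args) {
        CountingCollector c = new CountingCollector();
        c.setScorer(() -> 0.5f);
        c.collect(3);
        c.collect(7);
        System.out.println(c.totalHits + " hits, max=" + c.maxScore);
    }
}
```

A collector that only counts hits would simply never touch the Scorer, which is the payoff of the decoupling.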
[jira] Updated: (LUCENE-1582) Make TrieRange completely independent from Document/Field with TokenStream of prefix encoded values
[ https://issues.apache.org/jira/browse/LUCENE-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1582: -- Description:

TrieRange currently has the following problems:
- To add a field that uses trie encoding, you can manually add each term to the index or use a helper method from TrieUtils. The helper method has the problem that it uses a fixed field configuration.
- TrieUtils currently creates, by default, a helper field containing the lower-precision terms to enable sorting (limitation of one term/document for sorting).
- trieCodeLong/Int() unnecessarily creates String[] and char[] arrays, which is heavy on GC if you index a lot of numeric values. A lot of char[]-to-String copying is also involved.

This issue should improve this:
- trieCodeLong/Int() returns a TokenStream. During encoding, all char[] arrays are reused via the Token API; additional String[] arrays for the encoded result are not created - instead the TokenStream enumerates the trie values.
- Trie fields can be added to Documents during indexing using the standard API: new Field(name, TokenStream, ...), so no extra util method is needed. By using token filters, one could also add payloads and so on, and customize everything.

The drawback is: sorting would not work anymore. To enable sorting, a (sub-)issue can extend the FieldCache to stop iterating the terms as soon as a lower-precision one is enumerated by TermEnum. I will create a hack patch for TrieUtils-use only, that uses an unchecked Exception in the Parser to stop iteration. With LUCENE-831, a more generic API for this can be used (custom parser/iterator implementation for FieldCache). I will attach the field cache patch (with the temporary solution, until FieldCache is reimplemented) as a separate patch file, or maybe open another issue for it.
was: TrieRange has currently the following problem: - To add a field, that uses a trie encoding, you can manually add each term to the index or use a helper method from TrieUtils. The helper method has the problem, that it uses a fixed field configuration - TrieUtils currently creates per default a helper field containing the lower precision terms to enable sorting (limitation of one term/document for sorting) - trieCodeLong/Int() creates unnecessarily arrays of String and char[] arrays that is heavy for GC, if you index lot of numeric values. Also a lot of char[] to String copying is involved. This issue should improve this: - trieCodeLong/Int() returns a TokenStream. During encoding, all char[] arrays are reused by Token API, additional STRing[] arrays for the encoded result are not created, instead the TokenStream enumerates the trie values. - Documents can be added to Documents during indexing using the standard API: new Field(name,TokenStream,...), so no extra util method needed. By using token filters, one could also add payload and so and customize everything. The drawback is: Sorting would not work anymore. To enable sorting, a (sub-)issue can extend the FieldCache to stop iterating the terms, as soon as a lower precision one is enumerated by TermEnum. I will create a hack patch for TrieUtils-use only, that uses a non-checked Exceptionin the Parser to stop iteration. With LUCENE-831, a more generic API for this type can be used (custom parser/iterator implementation for FieldCache). I will attach the field cache patch (with the temporary solution, util FieldCache is reimplemented) as a separate patch file, or maybe open another issue for it. 
Make TrieRange completely independent from Document/Field with TokenStream of prefix encoded values --- Key: LUCENE-1582 URL: https://issues.apache.org/jira/browse/LUCENE-1582 Project: Lucene - Java Issue Type: Improvement Components: contrib/* Affects Versions: 2.9 Reporter: Uwe Schindler Assignee: Uwe Schindler Fix For: 2.9
Atomic optimize() + commit()
Hi,

I've run into a problem in my code when I upgraded to 2.4. I am not sure if it is a real problem, but I thought I'd let you know anyway. The following is the background of how I ran into the issue, but I think the discussion does not necessarily involve my use of Lucene.

I have a class which wraps all Lucene-related operations, i.e., addDocument, deleteDocument, search and optimize (those are the important ones for this email). It keeps an IndexWriter open, through which it does the add/delete/optimize operations, and periodically opens an IndexReader for the search operations using the reopen() API. The application performs index operations (add, delete, update) from multiple threads, and there's a manager which, after the last operation has been processed, calls commit, which does writer.commit(). I also check from time to time if the index needs to be optimized, and optimize if needed (the criteria for when to do it are irrelevant now).

I also have a unit test which does several add/update/delete operations, calls optimize and checks the number of deleted documents. It expects to find 0, since optimize has been called, and after I upgraded to 2.4 it failed. Now ... with the move to 2.4, I discovered that optimize() does not commit automatically and I have to call commit. It's a good place to say that when I was on 2.3 I used the default autoCommit=true; with the move to 2.4 that default changed, and being a good citizen, I also changed my code to call commit when I want, and not use any deprecated ctors or rely on internal Lucene logic. I can only guess that that's why at the end of the test I still see numDeletedDocs != 0 (since optimize does not commit by default).

So I went ahead and fixed my optimize() method to do: (1) writer.optimize() (2) writer.commit(). But then I thought - is this fix correct? Is it the right approach?
Suppose that at the same time optimize was running, or just between (1) and (2) there was a context switch, and a thread added documents to the index. Upon calling commit(), the newly added documents are also committed, without the caller intending to do so. In my scenario this will probably not be too catastrophic, but I can imagine scenarios in which someone, in addition to indexing, updates a DB and has a virtual atomic commit, which commits the changes to the index as well as the DB, all the while locking any update operations. Suddenly that someone's code breaks.

There are a couple of ways I can solve it, like for example synchronizing the optimize + commit on a lock which all indexing threads will also synchronize on (allowing all of them to index concurrently, but blocking all of them while optimize is running), but that will hold up all my indexing threads. Or, I can just not call commit at the end, relying on the workers manager to commit at the next batch indexing work. However, during that time the readers will search on an unoptimized index, with deletes, while they could search on a freshly optimized index with no deletes (and fewer segments).

The problem with those solutions is that they are not intuitive. To start with, the Lucene documentation itself is wrong - IndexWriter.commit()'s javadoc says: "Commits all pending updates (added & deleted documents)" - optimize is not mentioned (shouldn't this be fixed anyway?). Also, notice that the problem stems from the fact that the optimize operation may be called by another thread, not knowing there are update operations running. Lucene documents that you can call addDocument while optimize() is running, so there's no need to protect against that. Suddenly, we're requiring every search application developer to disregard the documentation and think to himself "do I want to allow optimize() to run concurrently with add/deletes?". I'm not saying that it's wrong, but if we're OK with it, we should document it.
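The lock-based workaround described above can be sketched with a ReadWriteLock. This is a generic concurrency pattern, not a Lucene API; the writer calls are represented by stand-in fields. Indexing threads share the read lock and run concurrently, while optimize+commit takes the write lock and excludes them, so commit() cannot pick up half-finished adds.

```java
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of the "synchronize optimize + commit" workaround: many indexing
// threads proceed concurrently (read lock); optimize+commit excludes them
// all (write lock) for the duration of both calls.
public class AtomicOptimizeSketch {
    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private int docs = 0;          // stand-in for the wrapped IndexWriter
    private boolean optimized = false;

    public void addDocument() {
        lock.readLock().lock();
        try {
            docs++;                // writer.addDocument(doc) would go here
        } finally {
            lock.readLock().unlock();
        }
    }

    public void optimizeAndCommit() {
        lock.writeLock().lock();
        try {
            optimized = true;      // writer.optimize(); writer.commit();
        } finally {
            lock.writeLock().unlock();
        }
    }

    public int docCount() { return docs; }
    public boolean isOptimized() { return optimized; }

    public static void main(String[] args) {
        AtomicOptimizeSketch s = new AtomicOptimizeSketch();
        s.addDocument();
        s.optimizeAndCommit();
        System.out.println(s.docCount() + " docs, optimized=" + s.isOptimized());
    }
}
```

The cost, as noted above, is that all indexing threads stall while optimize runs, which is exactly why this workaround is unsatisfying.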
I wonder though if there isn't room to introduce an atomic optimize() + commit() in Lucene. The incentive is that optimize is not the same as add/delete. Add/delete are operations I may want to hide from my users, because they change the state of the index (i.e., how many searchable documents there are). Optimize just reorganizes the index, and is supposed to improve performance. When I call optimize, don't I want it to be committed? Will I ever want to hold that commit off (taking out edge cases)? I assume that 99.9% of the time that's what we expect from it.

Now, just adding a call to commit() at the end of optimize() will not solve it, because that's the same as calling commit outside optimize(). We need optimize's commit to commit only its changes, and if there are updates pending commit - not touch them.

BTW, I've scanned through the documentation and haven't found any mention of such a thing; however, I may still have missed it. So if there is already a solution to this, or such an atomic optimize+commit, I apologize in advance for forcing you to read such a long email (for those of you who made it this far) and
[jira] Created: (LUCENE-1583) SpanOrQuery skipTo() doesn't always move forwards
SpanOrQuery skipTo() doesn't always move forwards - Key: LUCENE-1583 URL: https://issues.apache.org/jira/browse/LUCENE-1583 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 2.4.1, 2.4, 2.3.2, 2.3.1, 2.3, 2.2, 2.1, 2.0.0, 1.9 Reporter: Moti Nisenson Priority: Minor

In SpanOrQuery the skipTo() method is improperly implemented if the target doc is less than or equal to the current doc, since skipTo() may not be called for any of the clauses' spans:

public boolean skipTo(int target) throws IOException {
  if (queue == null) {
    return initSpanQueue(target);
  }
  while (queue.size() != 0 && top().doc() < target) {
    if (top().skipTo(target)) {
      queue.adjustTop();
    } else {
      queue.pop();
    }
  }
  return queue.size() != 0;
}

This violates the correct behavior (as described in the Spans interface documentation) that skipTo() should always move forwards; in other words, the correct implementation would be:

public boolean skipTo(int target) throws IOException {
  if (queue == null) {
    return initSpanQueue(target);
  }
  boolean skipCalled = false;
  while (queue.size() != 0 && top().doc() < target) {
    if (top().skipTo(target)) {
      queue.adjustTop();
    } else {
      queue.pop();
    }
    skipCalled = true;
  }
  if (skipCalled) {
    return queue.size() != 0;
  }
  return next();
}
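The contract and the fix can be exercised outside Lucene with a toy disjunction iterator (the classes below are invented for illustration, not Lucene's Spans API): when no sub-iterator needed to skip because the target is at or before the current doc, the fixed skipTo() falls through to next(), so the merged iterator still moves forward.

```java
import java.util.PriorityQueue;

// Toy model of the fixed SpanOrQuery.skipTo(): iterators over sorted doc id
// arrays, merged through a priority queue ordered by current doc.
public class SkipToSketch {
    static class Docs {
        final int[] ids; int pos = -1;
        Docs(int... ids) { this.ids = ids; }
        int doc() { return ids[pos]; }
        boolean next() { return ++pos < ids.length; }
        boolean skipTo(int target) {  // always advances at least once
            do { if (!next()) return false; } while (doc() < target);
            return true;
        }
    }

    static class OrDocs {
        final PriorityQueue<Docs> queue =
            new PriorityQueue<>((a, b) -> Integer.compare(a.doc(), b.doc()));
        OrDocs(Docs... subs) {
            for (Docs d : subs) if (d.next()) queue.add(d);
        }
        int doc() { return queue.peek().doc(); }
        boolean next() {
            Docs top = queue.poll();
            if (top == null) return false;
            if (top.next()) queue.add(top);
            return !queue.isEmpty();
        }
        boolean skipTo(int target) {
            boolean skipCalled = false;
            while (!queue.isEmpty() && queue.peek().doc() < target) {
                Docs top = queue.poll();
                if (top.skipTo(target)) queue.add(top);
                skipCalled = true;
            }
            if (skipCalled) return !queue.isEmpty();
            return next();   // the fix: skipTo always moves forwards
        }
    }

    public static void main(String[] args) {
        OrDocs or = new OrDocs(new Docs(1, 5), new Docs(2, 5));
        System.out.println(or.skipTo(5) + " doc=" + or.doc());
    }
}
```

Without the `return next()` fall-through, a second skipTo(5) would return without advancing, and a caller looping on skipTo could spin forever.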
[jira] Updated: (LUCENE-1583) SpanOrQuery skipTo() doesn't always move forwards
[ https://issues.apache.org/jira/browse/LUCENE-1583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-1583: --- Fix Version/s: 2.9

LUCENE-1327 was a similar issue.

SpanOrQuery skipTo() doesn't always move forwards - Key: LUCENE-1583 URL: https://issues.apache.org/jira/browse/LUCENE-1583 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.4, 2.4.1 Reporter: Moti Nisenson Priority: Minor Fix For: 2.9
Re: ant test should include test-tag
OK, I just left that new one off. So you have to run ant test-core test-contrib.

Mike

On Thu, Apr 2, 2009 at 7:21 AM, Mark Miller markrmil...@gmail.com wrote: Wouldn't hurt I suppose - but test-core and test-contrib are probably sufficient. I wasn't very clear with that comment. I was just saying, as long as I can still run the tests a bit quicker than running through everything twice - which is already available. I should have just said +1. On the other hand, test-core-contrib doesn't hurt anything.
[jira] Commented: (LUCENE-1582) Make TrieRange completely independent from Document/Field with TokenStream of prefix encoded values
[ https://issues.apache.org/jira/browse/LUCENE-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695016#action_12695016 ] Michael McCandless commented on LUCENE-1582: This sounds like a great improvement!

Make TrieRange completely independent from Document/Field with TokenStream of prefix encoded values --- Key: LUCENE-1582 URL: https://issues.apache.org/jira/browse/LUCENE-1582 Project: Lucene - Java Issue Type: Improvement Components: contrib/* Affects Versions: 2.9 Reporter: Uwe Schindler Assignee: Uwe Schindler Fix For: 2.9
Re: Atomic optimize() + commit()
With ConcurrentMergeScheduler, IndexWriter has gained a lot of concurrency, such that an optimize (or normal BG merge) could be running at the same time as deletes/adds. I think this is a good thing and we should keep improving it (there are still places that block, eg while a flush is running a merge cannot commit).

But, there are clearly cases where you want to explicitly prevent concurrent operations (like your class that wraps IndexWriter/Reader). The current patch on LUCENE-1313 has something similar, except in that case the atomic operation is do adds, do deletes, open new near-realtime reader. Grant also proposed generalizing IndexAccessor (in LUCENE-1516).

However: I think all such logic should live above IndexWriter/IndexReader. IndexWriter should try to be as concurrent as possible, and if apps need further atomicity of certain groups of operations, it should be done outside of Lucene's core. Of course, if IndexWriter doesn't expose enough APIs to enable such atomicity, we should fix that.

I definitely agree we should fix commit's javadocs to include other changes, like optimize() calls, addIndexes, etc. -- I'll do that.

Mike

On Thu, Apr 2, 2009 at 8:22 AM, Shai Erera ser...@gmail.com wrote:
> [...]
Re: Future projects
Michael: I love your suggestion on 3)! This really opens doors for flexible indexing. -John

On Thu, Apr 2, 2009 at 1:40 AM, Michael McCandless luc...@mikemccandless.com wrote: On Wed, Apr 1, 2009 at 7:05 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: Now that LUCENE-1516 is close to being committed perhaps we can figure out the priority of other issues:

1. Searchable IndexWriter RAM buffer

I think first priority is to get a good assessment of the performance of the current implementation (from LUCENE-1516). My initial tests are very promising: with a writer updating (replacing random docs) at 50 docs/second on a full (3.2 M) Wikipedia index, I was able to reopen the reader once per second and do a large (> 500K results) search that sorts by date. The reopen time was typically ~40 msec, and search time typically ~35 msec (though there were random spikes up to ~340 msec). Though, these results were on an SSD (Intel X25M 160 GB).

We need more datapoints on the current approach, but this looks likely to be good enough for starters. And since we can get it into 2.9, hopefully it'll get some early usage and people will report back to help us assess whether further performance improvements are necessary. If they do turn out to be necessary, I think before your step 1, we should write small segments into a RAMDirectory instead of the real directory. That's simpler than truly searching IndexWriter's in-memory postings data.

2. Finish up benchmarking and perhaps implement passing filters to the SegmentReader level

What is passing filters to the SegmentReader level? EG as of LUCENE-1483, we now ask a Filter for its DocIdSet once per SegmentReader.

3. Deleting by doc id using IndexWriter

We need a clean approach for the "docIDs suddenly shift when a merge is committed" problem for this... Thinking more on this... I think one possible solution may be to somehow expose IndexWriter's internal docID remapping code.
IndexWriter does delete by docID internally, and whenever a merge is committed we stop-the-world (sync on IW) and go remap those docIDs. If we somehow allowed the user to register a callback that we could call when this remapping occurs, then the user's code could carry the docIDs without them becoming stale. Or maybe we could make a class PendingDocIDs, which you'd ask the reader to give you, that holds docIDs and remaps them after each merge. The problem is, IW internally always logically switches to the current reader for any further docID deletion, but the user's code may continue to use an old reader. So simply exposing this remapping won't fix it... we'd need to somehow track the genealogy (quite a bit more complex).

With 1) I'm interested in how we will lock a section of the bytes for use by a given reader?

We would not actually lock them, but we need to set aside the bytes such that, for example, if the postings grow, TermDocs iteration does not progress beyond its limits.

Are there any modifications that are needed of the RAM buffer format? How would the term table be stored? We would not be using the current hash method?

I think the realtime reader'd just store the maxDocID it's allowed to search, and we would likely keep using the RAM format now used.

Mike
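The PendingDocIDs idea floated above could look roughly like this - purely hypothetical, since no such class or callback exists in Lucene: a holder whose ids the writer would remap after each committed merge, so the user's ids never go stale.

```java
import java.util.function.IntUnaryOperator;

// Hypothetical sketch of the proposed PendingDocIDs: user code holds doc
// ids through this object, and the writer (in this idea) calls remap()
// with the old-docID -> new-docID mapping whenever a committed merge
// shuffles doc ids.
public class PendingDocIDs {
    private final int[] docIDs;

    public PendingDocIDs(int... docIDs) {
        this.docIDs = docIDs.clone();
    }

    // Would be invoked by the writer after each merge commit.
    public void remap(IntUnaryOperator oldToNew) {
        for (int i = 0; i < docIDs.length; i++) {
            docIDs[i] = oldToNew.applyAsInt(docIDs[i]);
        }
    }

    public int[] current() { return docIDs.clone(); }

    public static void main(String[] args) {
        PendingDocIDs p = new PendingDocIDs(3, 7);
        p.remap(d -> d - 1);   // e.g. one doc before both was merged away
        System.out.println(p.current()[0] + "," + p.current()[1]);
    }
}
```

As the message notes, this alone doesn't solve the problem: the ids are remapped relative to the writer's current reader, while the user may still be searching an old one.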
Re: Future projects
4) An additional possible contrib module is caching the results of TermQueries. Looking at the TermQuery code, would we need to cache the entire docs and freqs as arrays, which would be a memory hog?
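The memory-hog concern above is easy to quantify: caching a term's postings as parallel `int[]` docs / `int[]` freqs arrays costs 8 bytes per posting. A back-of-the-envelope sketch (the 3.2M figure is the Wikipedia index size mentioned earlier in the thread; the helper is illustrative, not a proposed API):

```java
// Back-of-the-envelope cost of caching a TermQuery's postings as parallel
// int[] docs / int[] freqs arrays: 8 bytes per posting. For a common term
// matching most of a large index, this adds up fast.
public class PostingsCacheCost {
    static long cachedBytes(long numPostings) {
        return numPostings * (4 /* doc int */ + 4 /* freq int */);
    }

    public static void main(String[] args) {
        long postings = 3_200_000;  // a term matching every doc of a 3.2M-doc index
        long bytes = cachedBytes(postings);
        System.out.println(bytes / (1024 * 1024) + " MB"); // 24 MB for ONE term
    }
}
```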
Re: Future projects
I'm interested in merging cached bitsets and field caches. While this may be something related to LUCENE-831, in Bobo there are custom field caches which we want to merge in RAM (rather than reload from the reader using TermEnum + TermDocs). This could somehow lead to delete by doc id.

Tracking the genealogy of segments is something we can provide as a callback from IndexWriter? Or could we add a method to IndexCommit or SegmentReader that returns the segments it originated from?
Re: Future projects
What is passing filters to the SegmentReader level? EG as of LUCENE-1483, we now ask a Filter for its DocIdSet once per SegmentReader.

The patch I was thinking of is LUCENE-1536. I wasn't sure what the next steps are for it, i.e. the JumpScorer, Scorer.skipToButNotNext, or simply implementing a committable version of LUCENE-1536?
[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695098#action_12695098 ] Michael McCandless commented on LUCENE-1575: Super, all tests pass for me too...

Refactoring Lucene collectors (HitCollector and extensions) --- Key: LUCENE-1575 URL: https://issues.apache.org/jira/browse/LUCENE-1575 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Assignee: Michael McCandless Fix For: 2.9 Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch, LUCENE-1575.6.patch, LUCENE-1575.patch

This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html]. We have agreed to do the following refactoring:
* Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
* Deprecate HitCollector in favor of the new Collector.
* Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
** It will remove any instanceof checks that currently exist in IndexSearcher code.
* Create a new (abstract) TopDocsCollector, which will:
** Leave collect and setNextReader unimplemented.
** Introduce protected members PriorityQueue and totalHits.
** Introduce a single protected constructor which accepts a PriorityQueue.
** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-is by extending classes, as well as be overridden.
** Introduce a new topDocs(start, howMany) method which will be used as a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only.
* Change TopScoreDocCollector to extend TopDocsCollector, and use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
* Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
* Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany).
* Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.

Additionally, the following proposal was made w.r.t. decoupling score from collect():
* Change collect to accept only a doc Id (unbased).
* Introduce a setScorer(Scorer) method.
* If during collect the implementation needs the score, it can call scorer.score().

If we do this, then we need to review all places in the code where collect(doc, score) is called, and assess whether a Scorer can be passed. Also this raises a few questions:
* What if during collect() Scorer is null? (i.e., not set) - is it even possible?
* I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't that mean the score is always needed in collect()?

Open issues:
* The name for Collector.
* TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Collector ResultsCollector. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
* Decoupling score from collect().

I will post a patch a bit later, as this is expected to be a very large patch.
I will split it into 2: (1) a code patch and (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method). There might even be a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?).

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
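The proposed decoupling can be sketched in a few dozen lines. This is a minimal, self-contained illustration of the *shape* of the proposal, not the eventual Lucene code: `collect(int doc)` takes a docID relative to the current reader, the score comes from a separately-set Scorer (pulled only if the collector wants it), and `topDocs(start, howMany)` pages through the hits. The `Scorer` here is a one-method stub standing in for `org.apache.lucene.search.Scorer`.

```java
// Self-contained sketch of the proposed Collector API: score decoupled from
// collect() via setScorer(), plus the paging topDocs(start, howMany).
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class CollectorSketch {
    interface Scorer { float score(); }  // stub for the real Scorer

    abstract static class Collector {
        abstract void setScorer(Scorer scorer);
        abstract void collect(int doc);       // doc is relative to current reader
        abstract void setNextReader(int docBase);
    }

    static class ScoreDoc {
        final int doc; final float score;
        ScoreDoc(int doc, float score) { this.doc = doc; this.score = score; }
    }

    static class TopDocsCollector extends Collector {
        private final PriorityQueue<ScoreDoc> pq =
            new PriorityQueue<>(Comparator.comparingDouble((ScoreDoc sd) -> sd.score));
        private final int numHits;
        private int totalHits, docBase;
        private Scorer scorer;

        TopDocsCollector(int numHits) { this.numHits = numHits; }

        void setScorer(Scorer scorer) { this.scorer = scorer; }
        void setNextReader(int docBase) { this.docBase = docBase; }

        void collect(int doc) {
            totalHits++;
            float score = scorer.score();     // pulled only because we need it
            pq.add(new ScoreDoc(docBase + doc, score));
            if (pq.size() > numHits) pq.poll();  // drop the current lowest score
        }

        int getTotalHits() { return totalHits; }

        // Paging convenience: allocate only the requested window.
        ScoreDoc[] topDocs(int start, int howMany) {
            List<ScoreDoc> all = new ArrayList<>(pq);
            all.sort((a, b) -> Float.compare(b.score, a.score)); // best first
            int end = Math.min(start + howMany, all.size());
            if (start >= end) return new ScoreDoc[0];
            return all.subList(start, end).toArray(new ScoreDoc[0]);
        }
    }

    public static void main(String[] args) {
        TopDocsCollector c = new TopDocsCollector(3);
        float[] scores = {0.2f, 0.9f, 0.5f, 0.7f};
        final float[] current = new float[1];
        c.setScorer(() -> current[0]);
        c.setNextReader(0);
        for (int doc = 0; doc < scores.length; doc++) {
            current[0] = scores[doc];
            c.collect(doc);
        }
        ScoreDoc[] top = c.topDocs(0, 2);
        System.out.println(c.getTotalHits() + " hits; best doc=" + top[0].doc);
    }
}
```

A collector that never calls `scorer.score()` pays nothing for scoring, which is the point of the decoupling.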
Re: Future projects
On Thu, Apr 2, 2009 at 2:07 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote:

I'm interested in merging cached bitsets and field caches. While this may be something related to LUCENE-831, in Bobo there are custom field caches which we want to merge in RAM (rather than reload from the reader using TermEnum + TermDocs). This could somehow lead to delete by doc id.

What does Bobo use the cached bitsets for?

Merging FieldCache in RAM is also interesting for near-realtime search, once we have column stride fields. Ie, they should behave like deleted docs: there's no reason to go through disk when merging them -- just carry them straight to the merged reader. Only on commit do they need to go to disk. Hmm, in fact we could do this today, too, eg with norms, as a future optimization if needed. And that optimization applies to flushing as well (ie, when flushing a new segment, since we know we will open a reader, we could NOT flush the norms, and instead put them into the reader, and only on eventual commit, flush to disk).

Tracking the genealogy of segments is something we can provide as a callback from IndexWriter? Or could we add a method to IndexCommit or SegmentReader that returns the segments it originated from?

Well, the problem with my idea (callback from IW when docs shift) is that internally IW always uses the latest reader to get any new docIDs. Ie we only have to renumber from gen X to X+1, then from X+1 to X+2 (where each generation is a renumbering event). But if you have a reader, perhaps oldish by now, we'd need to give you a way to map across N generations of docID shifts (which'd require the genealogy tracking). Alas, I think it will quickly get hairy.

Mike
Re: Future projects
I'm not sure how big a win this'd be, since the OS will cache those in RAM, and the CPU cost there (to pull from the OS's cache and reprocess) is maybe not high.

Optimizing search is interesting, because it's the wicked slow queries that you need to make faster, even when it's at the expense of wicked fast queries. If you make a wicked fast query 3X slower (eg 1 ms -> 3 ms), it's almost harmless in nearly all apps. So this makes things like PFOR (and LUCENE-1458, to enable pluggable codecs for postings) important, since it addresses the very large queries. In fact, for very large postings we should do PFOR minus the exceptions, ie, do a simple N-bit encode, even if it wastes some bits.

Mike
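"PFOR minus the exceptions" means: pick the bit width that fits the largest value in the block and pack every value with a plain N-bit encode, wasting some bits on the small values but keeping decode branch-free. A minimal self-contained sketch of that packing (illustrative only; not Lucene's actual codec, and it assumes all values fit in the chosen width):

```java
// Simple N-bit packing of small ints into long words: the "PFOR minus the
// exceptions" idea. Values are assumed to fit in `bits` bits.
public class NBitPack {
    static long[] pack(int[] values, int bits) {
        long[] out = new long[((values.length * bits) + 63) / 64];
        for (int i = 0; i < values.length; i++) {
            long bitPos = (long) i * bits;
            int word = (int) (bitPos >>> 6), shift = (int) (bitPos & 63);
            out[word] |= ((long) values[i]) << shift;
            if (shift + bits > 64)                 // value straddles a word boundary
                out[word + 1] |= ((long) values[i]) >>> (64 - shift);
        }
        return out;
    }

    static int unpack(long[] packed, int bits, int i) {
        long bitPos = (long) i * bits;
        int word = (int) (bitPos >>> 6), shift = (int) (bitPos & 63);
        long v = packed[word] >>> shift;
        if (shift + bits > 64) v |= packed[word + 1] << (64 - shift);
        return (int) (v & ((1L << bits) - 1));
    }

    public static void main(String[] args) {
        int[] gaps = {1, 7, 3, 120, 5, 64};   // e.g. docID deltas
        int bits = 7;                          // max value 120 fits in 7 bits
        long[] packed = pack(gaps, bits);
        for (int i = 0; i < gaps.length; i++)
            assert unpack(packed, bits, i) == gaps[i];
        System.out.println("packed " + gaps.length + " ints into " + packed.length + " longs");
    }
}
```

Real PFOR would instead pick a width covering, say, 90% of the values and patch the outliers as exceptions; skipping the exceptions trades space for a simpler, faster decode loop.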
Re: Future projects
On Thu, Apr 2, 2009 at 2:29 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote:

What is passing filters to the SegmentReader level? EG as of LUCENE-1483, we now ask a Filter for its DocIdSet once per SegmentReader.

The patch I was thinking of is LUCENE-1536. I wasn't sure what the next steps are for it, i.e. the JumpScorer, Scorer.skipToButNotNext, or simply implementing a committable version of LUCENE-1536?

Ahh OK. We should pursue this one -- many filters are cached, or would otherwise be able to expose a random-access API. For such filters, it'd also make sense to pre-multiply the deleted docs, to save doing that multiply for every query that uses the filter. We'd need some sort of caching / segment wrapper class to manage that, maybe? But we should first do the Filter/Query unification, and Filter as a clause on BooleanQuery, and then re-assess the performance difference.

Mike
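"Pre-multiply the deleted docs" amounts to AND-NOT-ing the segment's deleted docs into the cached filter bits once, so every query reusing the cached filter skips that per-hit check. A tiny sketch, with `java.util.BitSet` standing in for Lucene's bit sets (the helper name is illustrative):

```java
// Pre-multiplying a cached, random-access filter by a segment's deleted docs:
// done once at cache time instead of per query. java.util.BitSet stands in
// for Lucene's OpenBitSet/BitVector.
import java.util.BitSet;

public class PremultipliedFilter {
    static BitSet premultiply(BitSet cachedFilterBits, BitSet deletedDocs) {
        BitSet result = (BitSet) cachedFilterBits.clone();
        result.andNot(deletedDocs);   // clear any doc that is deleted
        return result;
    }

    public static void main(String[] args) {
        BitSet filter = new BitSet();
        filter.set(0); filter.set(3); filter.set(7);
        BitSet deleted = new BitSet();
        deleted.set(3);
        System.out.println(premultiply(filter, deleted)); // {0, 7}
    }
}
```

The wrapper class Mike mentions would own this pre-multiplied copy per segment and invalidate it when the segment's deletes change.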
Re: Future projects
What does Bobo use the cached bitsets for?

Bobo is a faceting engine that uses custom field caches and sometimes cached bitsets, rather than relying exclusively on bitsets to calculate facets. It is useful where many facets (50+) need to be calculated and bitset caching, loading and intersection would be too costly. Instead it iterates over in-memory custom field caches while hit collecting. Because we're also doing realtime search, making the loading more efficient via the in-memory field cache merging is interesting.

True, we do the in-memory merging with deleted docs; norms would be good as well. As a first step, how should we expose the segments a segment has originated from? I would like to get this implemented for 2.9 as a building block that perhaps we can write other things on. Column stride fields still requires some encoding, and merging field caches in RAM would presumably be faster?

Ie we only have to renumber from gen X to X+1, then from X+1 to X+2 (where each generation is a renumbering event).

Couldn't each SegmentReader keep a docmap and the names of the segments it originated from? However the name is not enough of a unique key, as there are the deleted docs that change. It seems like we need a unique id for each segment reader, where the id is assigned to cloned readers (which normally have the same segment name as the original SR). The ID could be a stamp (perhaps only given to read-only readers?). That way the SegmentReader.getMergedFrom method does not need to return the actual readers, but a docmap and the parent readers' IDs? It would be assumed the user would be holding the readers somewhere? Perhaps all this can be achieved with a callback in IW, and all this logic could be kept somewhat internal to Lucene?
[jira] Commented: (LUCENE-1574) PooledSegmentReader, pools SegmentReader underlying byte arrays
[ https://issues.apache.org/jira/browse/LUCENE-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695130#action_12695130 ] Jason Rutherglen commented on LUCENE-1574: --

True, the pool would hold onto spares, but they would expire. It's mostly useful for the large on-disk segments, as those byte arrays (for BitVectors) are large, and because there are more docs in them they would get hit with deletes more often, and so they'd be reused fairly often.

I'm not knowledgeable enough to say whether the transactional data structure will be fast enough. We had been using http://fastutil.dsi.unimi.it/docs/it/unimi/dsi/fastutil/ints/IntRBTreeSet.html in Zoie for deleted docs and it's way slow. Binary search of an int array is faster, albeit not fast enough. The multi-dimensional array thing isn't fast enough (for searching) as we implemented this in Bobo. It's implemented in Bobo because we have a multi-value field cache (which is quite large, because for each doc we're storing potentially 64 or more values in an in-place bitset) and a single massive array kills the GC. In some cases this is faster than a single large array because of the way Java (or the OS?) transfers memory around through the CPU cache.

PooledSegmentReader, pools SegmentReader underlying byte arrays --- Key: LUCENE-1574 URL: https://issues.apache.org/jira/browse/LUCENE-1574 Project: Lucene - Java Issue Type: Improvement Components: contrib/* Affects Versions: 2.4.1 Reporter: Jason Rutherglen Priority: Minor Fix For: 2.9 Original Estimate: 168h Remaining Estimate: 168h

PooledSegmentReader pools the underlying byte arrays of deleted docs and norms for realtime search. It is designed for use with IndexReader.clone, which can create many copies of byte arrays, which are of the same length for a given segment. When pooled they can be reused, which could save on memory. Do we want to benchmark the memory usage comparison of PooledSegmentReader vs GC?
Many times GC is enough for these smaller objects.
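The comment above compares deleted-docs representations: a sorted `int[]` checked by binary search (O(log n) per lookup) versus a bit-per-doc vector (O(1) per lookup, which is why Lucene's BitVector wins on hot paths). A self-contained sketch of the two membership tests, using `java.util.BitSet` in place of BitVector:

```java
// Two "is this doc deleted?" representations from the comment: sorted int[]
// with binary search vs a bit-per-doc vector. Both give the same answer; the
// bit vector is O(1) per lookup.
import java.util.Arrays;
import java.util.BitSet;

public class DeletedDocsLookup {
    // Sorted-int[] representation: O(log n) membership test.
    static boolean deletedViaSearch(int[] sortedDeletes, int doc) {
        return Arrays.binarySearch(sortedDeletes, doc) >= 0;
    }

    // Bit-per-doc representation: O(1) membership test.
    static boolean deletedViaBits(BitSet deletedBits, int doc) {
        return deletedBits.get(doc);
    }

    public static void main(String[] args) {
        int[] sortedDeletes = {3, 17, 17_000, 250_000};
        BitSet bits = new BitSet();
        for (int d : sortedDeletes) bits.set(d);

        for (int doc : new int[] {3, 4, 250_000}) {
            boolean del = deletedViaSearch(sortedDeletes, doc);
            assert del == deletedViaBits(bits, doc);
            System.out.println("doc " + doc + " deleted=" + del);
        }
    }
}
```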
Re: Future projects
On Thu, Apr 2, 2009 at 4:43 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote:

What does Bobo use the cached bitsets for?

Bobo is a faceting engine that uses custom field caches and sometimes cached bitsets rather than relying exclusively on bitsets to calculate facets. It is useful where many facets (50+) need to be calculated and bitset caching, loading and intersection would be too costly. Instead it iterates over in-memory custom field caches while hit collecting. Because we're also doing realtime search, making the loading more efficient via the in-memory field cache merging is interesting.

OK. Does it operate at the segment level? Seems like that'd give you good enough realtime performance (though merging in RAM will definitely be faster).

True, we do the in-memory merging with deleted docs; norms would be good as well.

Yes, and eventually column stride fields.

As a first step, how should we expose the segments a segment has originated from?

I'm not sure; it's quite messy. Each segment must track what other segment it got merged to, and must hold a copy of its deletes as of the time it was merged. And each segment must know what other segments it got merged with.

Is this really a serious problem in your realtime search? Eg, from John's numbers in using payloads to read in the docID -> UID mapping, it seems like you could make a Query that, when given a reader, would go and do Approach 2 to perform the deletes (if indeed you are needing to delete thousands of docs with each update). What sort of docs/sec rates are you needing to handle?

I would like to get this implemented for 2.9 as a building block that perhaps we can write other things on.

I think that's optimistic. It's still at the hairy, can't-see-a-clean-way-to-do-it phase. Plus I'd like to understand that all other options have been exhausted first.
Especially once we have column stride fields and they are merged in RAM, you'll be handed a reader pre-warmed and you can then jump through those arrays to find docs to delete.

Column stride fields still requires some encoding, and merging field caches in RAM would presumably be faster?

Yes, potentially much faster. There's no sense in writing through to disk until commit is called.

Ie we only have to renumber from gen X to X+1, then from X+1 to X+2 (where each generation is a renumbering event).

Couldn't each SegmentReader keep a docmap and the names of the segments it originated from? However the name is not enough of a unique key, as there's the deleted docs that change? It seems like we need a unique id for each segment reader, where the id is assigned to cloned readers (which normally have the same segment name as the original SR). The ID could be a stamp (perhaps only given to read-only readers?). That way the SegmentReader.getMergedFrom method does not need to return the actual readers, but a docmap and the parent readers' IDs? It would be assumed the user would be holding the readers somewhere? Perhaps all this can be achieved with a callback in IW, and all this logic could be kept somewhat internal to Lucene?

The docMap is a costly way to store it, since it consumes 32 bits per doc (vs storing a copy of the deleted docs). But, then, docMap gives you random-access on the map. What if, prior to merging or committing merged deletes, there were a callback to force the app to materialize any privately buffered deletes? And then the app is not allowed to use those readers for further deletes? Still kinda messy. I think I need to understand better why delete by Query isn't viable in your situation...

Mike
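The 32-bits-per-doc trade-off above, in concrete numbers (the 3.2M figure is the Wikipedia index from earlier in the thread; the helpers are illustrative): a random-access docMap costs an `int` per doc, while a snapshot of the deleted docs costs one bit per doc, at the price of counting preceding deletes to recover the mapping.

```java
// Memory cost of the two ways to remember a merge's renumbering: a full
// random-access docMap (int per doc) vs a copy of the deleted docs (bit per doc).
public class DocMapCost {
    static long docMapBytes(long maxDoc)      { return maxDoc * 4; }       // 32 bits/doc
    static long deletedCopyBytes(long maxDoc) { return (maxDoc + 7) / 8; } // 1 bit/doc

    public static void main(String[] args) {
        long maxDoc = 3_200_000;
        System.out.println("docMap: " + docMapBytes(maxDoc) / (1024 * 1024) + " MB, "
            + "deleted-docs copy: " + deletedCopyBytes(maxDoc) / 1024 + " KB");
    }
}
```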
Re: Future projects
I think I need to understand better why delete by Query isn't viable in your situation...

The delete by query is a separate problem which I haven't fully explored yet. Tracking the segment genealogy is really an interim step for merging field caches before column stride fields gets implemented. Actually CSF cannot be used with Bobo's field caches anyways, which means we'd need a way to find out about the segment parents.

Does it operate at the segment level? Seems like that'd give you good enough realtime performance (though merging in RAM will definitely be faster).

We need to see how Bobo integrates with LUCENE-1483. It seems like we've been talking about CSF for 2 years and there isn't a patch for it? If I had more time I'd take a look. What is the status of it? I'll write a patch that implements a callback for the segment merging such that the user can decide what information they want to record about the merged SRs (I'm pretty sure there isn't a way to do this with MergePolicy?)
Lucene filter
How do you create a Lucene Filter to check if a field has a value? It is part of a ChainedFilter that I am creating. -- View this message in context: http://www.nabble.com/Lucene-filter-tp22858220p22858220.html Sent from the Lucene - Java Developer mailing list archive at Nabble.com. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
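One Lucene 2.4-era answer is to extend Filter and override getDocIdSet(IndexReader), walking every term of the target field via TermEnum and setting a bit for each doc returned by TermDocs. The sketch below models that logic over a toy postings map rather than a real IndexReader, so it is self-contained; all names here are illustrative assumptions, not Lucene API:

```java
// Self-contained model (no Lucene dependency) of the logic a custom Filter's
// getDocIdSet(IndexReader) would run: walk every term of the target field and
// mark each doc that has at least one value. All names are illustrative.
import java.util.BitSet;
import java.util.List;
import java.util.Map;

public class FieldHasValueDemo {

    // postings: field -> term -> docIDs containing that term,
    // standing in for TermEnum over one field plus TermDocs per term.
    static BitSet docsWithField(Map<String, Map<String, List<Integer>>> postings,
                                String field, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        Map<String, List<Integer>> terms = postings.get(field);
        if (terms != null) {
            for (List<Integer> docs : terms.values()) {
                for (int doc : docs) {
                    bits.set(doc);
                }
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        Map<String, Map<String, List<Integer>>> postings = Map.of(
            "color", Map.of("red", List.of(0, 2), "blue", List.of(3)));
        BitSet bits = docsWithField(postings, "color", 5);
        // docs 0, 2, 3 have a value for "color"; docs 1 and 4 do not
        assert bits.get(0) && bits.get(2) && bits.get(3);
        assert !bits.get(1) && !bits.get(4);
    }
}
```

In real Lucene 2.4 code the resulting bits would typically be an OpenBitSet, which can be returned directly as the DocIdSet.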
Re: Future projects
Just to clarify, Approach 1 and Approach 2 are both currently performing OK for us. -John

On Thu, Apr 2, 2009 at 2:41 PM, Michael McCandless luc...@mikemccandless.com wrote:

On Thu, Apr 2, 2009 at 4:43 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote:

What does Bobo use the cached bitsets for?

Bobo is a faceting engine that uses custom field caches and sometimes cached bitsets, rather than relying exclusively on bitsets, to calculate facets. It is useful where many facets (50+) need to be calculated and bitset caching, loading, and intersection would be too costly. Instead it iterates over in-memory custom field caches while hit collecting. Because we're also doing realtime search, making the loading more efficient via in-memory field cache merging is interesting.

OK. Does it operate at the segment level? Seems like that'd give you good enough realtime performance (though merging in RAM will definitely be faster).

True, we do the in-memory merging with deleted docs; norms would be good as well.

Yes, and eventually column stride fields.

As a first step, how should we expose the segments a segment has originated from?

I'm not sure; it's quite messy. Each segment must track what other segment it got merged to, and must hold a copy of its deletes as of the time it was merged. And each segment must know what other segments it got merged with. Is this really a serious problem in your realtime search? Eg, from John's numbers in using payloads to read in the docID -> UID mapping, it seems like you could make a Query that, when given a reader, would go and do Approach 2 to perform the deletes (if indeed you are needing to delete thousands of docs with each update). What sort of docs/sec rates are you needing to handle?

I would like to get this implemented for 2.9 as a building block that perhaps we can write other things on.

I think that's optimistic. It's still at the hairy-can't-see-a-clean-way-to-do-it phase. Plus, I'd like to understand that all other options have been exhausted first. Especially once we have column stride fields and they are merged in RAM, you'll be handed a reader pre-warmed and you can then jump through those arrays to find docs to delete.

Column stride fields still require some encoding, and merging field caches in RAM would presumably be faster?

Yes, potentially much faster. There's no sense in writing through to disk until commit is called. Ie we only have to renumber from gen X to X+1, then from X+1 to X+2 (where each generation is a renumbering event).

Couldn't each SegmentReader keep a docMap and the names of the segments it originated from? However, the name is not enough of a unique key, as the deleted docs change. It seems like we need a unique ID for each SegmentReader, where the ID is assigned to cloned readers (which normally have the same segment name as the original SR). The ID could be a stamp (perhaps only given to read-only readers?). That way the SegmentReader.getMergedFrom method does not need to return the actual readers, but a docMap and the parent readers' IDs? It would be assumed the user would be holding the readers somewhere? Perhaps all this can be achieved with a callback in IW, and all this logic could be kept somewhat internal to Lucene?

The docMap is a costly way to store it, since it consumes 32 bits per doc (vs storing a copy of the deleted docs). But then docMap gives you random access on the map. What if, prior to merging or committing merged deletes, there were a callback to force the app to materialize any privately buffered deletes? And then the app is not allowed to use those readers for further deletes? Still kinda messy. I think I need to understand better why delete-by-Query isn't viable in your situation... Mike
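The docMap idea debated above can be sketched concretely. The following is a hypothetical, self-contained model (not Lucene code): each merged-away segment keeps an int[] mapping its old docIDs to their slots in the merged segment (-1 for docs deleted before the merge), and successive renumbering generations (gen X -> X+1 -> X+2) compose. It also makes the cost tradeoff visible: the map costs 32 bits per doc, versus one bit per doc for a copy of the deleted docs, but gives random access.

```java
// Hypothetical sketch (not Lucene API) of a per-segment docMap: for each old
// docID, the new docID in the merged segment, or -1 if it was deleted.
// Costs 32 bits/doc (vs 1 bit/doc for a deleted-docs copy), but is O(1) to query.
import java.util.BitSet;

public class DocMapDemo {

    // Build a docMap for one source segment. 'base' is the number of docs
    // contributed by segments that come before this one in the merge.
    static int[] buildDocMap(int maxDoc, BitSet deleted, int base) {
        int[] docMap = new int[maxDoc];
        int newDoc = base;
        for (int oldDoc = 0; oldDoc < maxDoc; oldDoc++) {
            docMap[oldDoc] = deleted.get(oldDoc) ? -1 : newDoc++;
        }
        return docMap;
    }

    // Compose two generations of renumbering: gen X -> X+1 -> X+2.
    static int[] compose(int[] first, int[] second) {
        int[] out = new int[first.length];
        for (int d = 0; d < first.length; d++) {
            out[d] = first[d] == -1 ? -1 : second[first[d]];
        }
        return out;
    }

    public static void main(String[] args) {
        // A 5-doc segment with doc 2 deleted, merged after a 10-doc segment.
        BitSet deleted = new BitSet();
        deleted.set(2);
        int[] gen1 = buildDocMap(5, deleted, 10);
        assert gen1[2] == -1;   // deleted doc has no new slot
        assert gen1[3] == 12;   // 10 (base) + 3 - 1 deleted doc before it

        // A second, deletion-free renumbering leaves the mapping unchanged.
        int[] gen2 = buildDocMap(14, new BitSet(), 0);
        assert compose(gen1, gen2)[3] == 12;
    }
}
```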
[jira] Created: (LUCENE-1584) Callback for intercepting merging segments in IndexWriter
Callback for intercepting merging segments in IndexWriter - Key: LUCENE-1584 URL: https://issues.apache.org/jira/browse/LUCENE-1584 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 2.4.1 Reporter: Jason Rutherglen Priority: Minor Fix For: 2.9 For things like merging field caches or bitsets, it's useful to know which segments were merged to create a new segment. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (LUCENE-1516) Integrate IndexReader with IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-1516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695185#action_12695185 ] Jason Rutherglen commented on LUCENE-1516: -- In ReaderPool.get(SegmentInfo info, boolean doOpenStores, int readBufferSize) the readBufferSize needs to be passed into SegmentReader.get Integrate IndexReader with IndexWriter --- Key: LUCENE-1516 URL: https://issues.apache.org/jira/browse/LUCENE-1516 Project: Lucene - Java Issue Type: Improvement Affects Versions: 2.4 Reporter: Jason Rutherglen Assignee: Michael McCandless Priority: Minor Fix For: 2.9 Attachments: LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, magnetic.png, ssd.png, ssd2.png Original Estimate: 672h Remaining Estimate: 672h The current problem is an IndexReader and IndexWriter cannot be open at the same time and perform updates as they both require a write lock to the index. While methods such as IW.deleteDocuments enables deleting from IW, methods such as IR.deleteDocument(int doc) and norms updating are not available from IW. This limits the capabilities of performing updates to the index dynamically or in realtime without closing the IW and opening an IR, deleting or updating norms, flushing, then opening the IW again, a process which can be detrimental to realtime updates. This patch will expose an IndexWriter.getReader method that returns the currently flushed state of the index as a class that implements IndexReader. 
The new IR implementation will differ from existing IR implementations such as MultiSegmentReader in that flushing will synchronize updates with IW, in part by sharing the write lock. All methods of IR will be usable, including reopen and clone.
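The getReader semantics described above, point-in-time readers over the writer's flushed state, can be modeled with a toy example. This is an illustrative sketch, not the actual patch; all class and method names below are assumptions, with a list of strings standing in for the index:

```java
// Toy model (not Lucene code) of the IndexWriter.getReader idea: the reader
// returned reflects the writer's flushed state, and a fresh getReader call is
// needed to see later updates. All names are illustrative.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class NearRealtimeDemo {

    static class ToyWriter {
        private final List<String> buffered = new ArrayList<>();
        private final List<String> flushed = new ArrayList<>();

        void addDocument(String doc) {
            buffered.add(doc);
        }

        // Flushes buffered docs and returns a point-in-time snapshot,
        // standing in for IndexWriter.getReader.
        List<String> getReader() {
            flushed.addAll(buffered);
            buffered.clear();
            return Collections.unmodifiableList(new ArrayList<>(flushed));
        }
    }

    public static void main(String[] args) {
        ToyWriter w = new ToyWriter();
        w.addDocument("doc1");
        List<String> r1 = w.getReader();
        w.addDocument("doc2");
        assert r1.size() == 1;              // r1 is a point-in-time view
        assert w.getReader().size() == 2;   // a reopened reader sees doc2
    }
}
```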
[jira] Updated: (LUCENE-1584) Callback for intercepting merging segments in IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Rutherglen updated LUCENE-1584: - Attachment: LUCENE-1584.patch Patch is combined with LUCENE-1516. IndexWriter has a setSegmentMergerCallback method that is called in IW.mergeMiddle, where the readers being merged and the newly merged reader are passed to the SMC.mergedSegments method. I think we need to expose the SegmentReader segment name somehow, either via IndexReader.getSegmentName or an interface on top of SegmentReader? Callback for intercepting merging segments in IndexWriter - Key: LUCENE-1584 URL: https://issues.apache.org/jira/browse/LUCENE-1584 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 2.4.1 Reporter: Jason Rutherglen Priority: Minor Fix For: 2.9 Attachments: LUCENE-1584.patch Original Estimate: 96h Remaining Estimate: 96h For things like merging field caches or bitsets, it's useful to know which segments were merged to create a new segment.
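A rough shape of the callback this patch describes, as a self-contained sketch. The interface and names below are modeled on the setSegmentMergerCallback / SMC.mergedSegments description and are assumptions, with segment names standing in for the readers the real patch passes:

```java
// Minimal sketch of a segment-merge callback, modeled on the description
// above (setSegmentMergerCallback / mergedSegments). Names are assumptions;
// segment-name strings stand in for the IndexReaders the patch passes.
import java.util.ArrayList;
import java.util.List;

public class MergeCallbackDemo {

    // The callback the application registers: told which source segments were
    // merged into which new segment, so it can merge per-segment field caches
    // or bitsets instead of rebuilding them from scratch.
    interface SegmentMergeCallback {
        void mergedSegments(List<String> sources, String merged);
    }

    // Toy "writer" that performs a merge and notifies the callback,
    // standing in for the hook in IndexWriter.mergeMiddle.
    static class ToyWriter {
        private SegmentMergeCallback callback;

        void setSegmentMergerCallback(SegmentMergeCallback cb) {
            this.callback = cb;
        }

        String merge(List<String> sources) {
            String merged = "_" + sources.size() + "m"; // fake merged name
            if (callback != null) {
                callback.mergedSegments(sources, merged);
            }
            return merged;
        }
    }

    public static void main(String[] args) {
        List<String> seen = new ArrayList<>();
        ToyWriter w = new ToyWriter();
        w.setSegmentMergerCallback((sources, merged) ->
            seen.add(sources + " -> " + merged));
        w.merge(List.of("_0", "_1"));
        assert seen.size() == 1;  // app was told about the merge
    }
}
```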
IndexWriter.addIndexesNoOptimize(IndexReader[] readers)
This seems like something that's tenable? It would be useful for merging RAM indexes to disk, where if a Directory is passed instead, the directory may be changed.