[jira] Resolved: (LUCENE-702) Disk full during addIndexes(Directory[]) can corrupt index

2006-12-18 Thread Michael McCandless (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-702?page=all ]

Michael McCandless resolved LUCENE-702.
---

Fix Version/s: 2.1
   Resolution: Fixed

 Disk full during addIndexes(Directory[]) can corrupt index
 --

 Key: LUCENE-702
 URL: http://issues.apache.org/jira/browse/LUCENE-702
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.1
Reporter: Michael McCandless
 Assigned To: Michael McCandless
 Fix For: 2.1

 Attachments: LUCENE-702.patch, LUCENE-702.take2.patch, 
 LUCENE-702.take3.patch


 This is a spinoff of LUCENE-555
 If the disk fills up during this call then the committed segments file can 
 reference segments that were not written.  Then the whole index becomes 
 unusable.
 Does anyone know of any other cases where disk full could corrupt the index?
 I think disk full should at worst lose the documents that were in flight at 
 the time.  It shouldn't corrupt the index.
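The rule behind the fix can be illustrated with a small sketch (this is an illustration of the invariant, not Lucene's actual code): new segment files are fully written before the commit point is atomically replaced, so a disk-full failure while writing can never leave the segments file referencing missing data.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class AtomicCommit {
    // Write the new segment's data first; only after that succeeds is the
    // commit point (the "segments" file) atomically replaced.  If the disk
    // fills up while writing the segment, this throws before the commit point
    // is touched, so the old commit stays valid -- only the in-flight
    // documents are lost.
    static void commit(Path indexDir, String segmentName, byte[] segmentData)
            throws IOException {
        Files.write(indexDir.resolve(segmentName), segmentData);
        Path tmp = indexDir.resolve("segments.new");
        Files.write(tmp, segmentName.getBytes());
        Files.move(tmp, indexDir.resolve("segments"),
                StandardCopyOption.ATOMIC_MOVE,
                StandardCopyOption.REPLACE_EXISTING);
    }
}
```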

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: ThreadLocal leak (was Re: Leaking org.apache.lucene.index.* objects)

2006-12-18 Thread Bernhard Messer

Otis,

I ran into a similar problem when running a heavily loaded search 
application in a servlet container. The reason for using ThreadLocals was 
to get rid of synchronized method calls, e.g. in TermVectorsReader, which 
would otherwise drag down overall search performance. Currently I do not see 
an easy solution that fixes both the synchronization and the ThreadLocal 
problem.


Bernhard

Otis Gospodnetic wrote:

Moving to java-dev, I think this belongs here.
I've been looking at this problem some more today and reading about 
ThreadLocals.  It's easy to misuse them and end up with memory leaks, 
apparently... and I think we may have this problem here.

The problem here is that ThreadLocals are tied to Threads, and I think the 
assumption in TermInfosReader and SegmentReader is that (search) Threads are 
short-lived: they come in, scan the index, do the search, return and die.  In 
this scenario, their ThreadLocals go to heaven with them, too, and memory is 
freed up.

But when Threads are long-lived, as they are in thread pools (e.g. those in 
servlet containers), those ThreadLocals stay alive even after a single search 
request is done.  Moreover, the Thread is reused, and the new TermInfosReader 
and SegmentReader put some new values in that ThreadLocal on top of the old 
values (I think) from the previous search request.  Because the Thread still 
has references to ThreadLocals and the values in them, the values never get 
GCed.

I tried making ThreadLocals in TIR and SR static, I tried wrapping values saved 
in TLs in WeakReference, I've tried using WeakHashMap like in Robert Engels' 
FixedThreadLocal class from LUCENE-436, but nothing helped.  I thought about 
adding a public static method to TIR and SR, so one could call it at the end of 
a search request (think servlet filter) and clear the TL for the current 
thread, but that would require making TIR and SR public and I'm not 100% sure 
if it would work, plus that exposes the implementation details too much.
I don't have a solution yet.
But do we *really* need ThreadLocal in TIR and SR?  The only thing that TL is 
doing there is acting as a per-thread storage of some cloned value (in TIR we 
clone SegmentTermEnum and in SR we clone TermVectorsReader).  Why can't we just 
store those cloned values in instance variables?  Isn't whoever is calling TIR 
and SR going to be calling the same instance of TIR and SR anyway, and thus get 
access to those cloned values?

I'm really amazed that we haven't heard any reports about this before.  I am 
not sure why my application started showing this leak only about 3 weeks ago.  
It is getting more pounded on than before, so maybe that made the leak more 
obvious.  My guess is that more common Lucene usage is with a single index or a 
small number of them, and with short-lived threads, where this problem isn't 
easily visible.  In my case I deal with a few tens of thousands of indices and 
several parallel search threads that live forever in the thread pool.
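The retention pattern Otis describes can be reproduced with a small, Lucene-free sketch: a value set via a ThreadLocal on a pooled thread stays reachable for later tasks on that thread until something calls remove(). The class and method names below are illustrative only.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ThreadLocalLeakDemo {
    static final ThreadLocal<byte[]> CACHE = new ThreadLocal<>();

    // Returns whether a value set by one task is still visible to a later
    // task running on the same pooled thread; optionally clears it in between
    // (as a servlet filter could do at the end of a request).
    static boolean valueSurvives(boolean callRemove) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        try {
            pool.submit(() -> CACHE.set(new byte[1024])).get();
            if (callRemove) {
                pool.submit(CACHE::remove).get();
            }
            return pool.submit(() -> CACHE.get() != null).get();
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("survives without remove(): " + valueSurvives(false));
        System.out.println("survives with remove():    " + valueSurvives(true));
    }
}
```

Because the pooled thread never dies, the Thread object keeps its per-thread map (and hence the value) strongly reachable, which is why making the ThreadLocal static or wrapping values in WeakReference does not by itself free the memory.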

Any thoughts about this or possible suggestions for a fix?
Thanks,
Otis



- Original Message 
From: Otis Gospodnetic [EMAIL PROTECTED]
To: java-user@lucene.apache.org
Sent: Friday, December 15, 2006 12:28:29 PM
Subject: Leaking org.apache.lucene.index.* objects

Hi,

About 2-3 weeks ago I emailed about a memory leak in my application.  I then found some 
problems in my code (I wasn't closing IndexSearchers explicitly) and took care of those.  
Now I see my app is still leaking memory - jconsole clearly shows the Tenured 
Gen memory pool getting filled up until I hit the OOM, but I can't seem to 
pin-point the source.

I found that a bunch of o.a.l.index.* objects are not getting GCed, even though 
they should.  For example:

$ jmap -histo:live 7825 | grep apache.lucene.index | head -20 | sort -k2 -nr
num       #instances    #bytes  class name
------------------------------------------
  4:         1764840  98831040  org.apache.lucene.index.CompoundFileReader$CSIndexInput
  5:         2119215  67814880  org.apache.lucene.index.TermInfo
  7:         1112459  35598688  org.apache.lucene.index.SegmentReader$Norm
  9:         2132311  34116976  org.apache.lucene.index.Term
 12:         1117897  26829528  org.apache.lucene.index.FieldInfo
 13:          225340  18027200  org.apache.lucene.index.SegmentTermEnum
 15:          589727  14153448  org.apache.lucene.index.TermBuffer
 21:           86033   8718504  [Lorg.apache.lucene.index.TermInfo;
 20:           86033   8718504  [Lorg.apache.lucene.index.Term;
 23:           86120   7578560  org.apache.lucene.index.SegmentReader
 26:           90501   5068056  org.apache.lucene.store.FSIndexInput
 27:           86120   4822720  org.apache.lucene.index.TermInfosReader
 33:           86130   3445200  org.apache.lucene.index.SegmentInfo
 36:           87355   2795360  org.apache.lucene.store.FSIndexInput$Descriptor
 38:           86120   2755840  org.apache.lucene.index.FieldsReader
 39:           86050   2753600  org.apache.lucene.index.CompoundFileReader
 42:           46903   2251344  

[jira] Commented: (LUCENE-748) Exception during IndexWriter.close() prevents release of the write.lock

2006-12-18 Thread Michael McCandless (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-748?page=comments#action_12459419 ] 

Michael McCandless commented on LUCENE-748:
---


I think this (not releasing write lock on hitting an exception) is
actually by design.  It's because the writer still has pending changes
to commit to disk.

And, with the fix for LUCENE-702 (just committed), if we hit an
exception during IndexWriter.close(), the IndexWriter is left in a
consistent state (this is not quite the case pre-2.1).

Meaning, if you caught that exception, fixed the root cause (say freed
up disk space), and called close again (successfully), you would not
have lost any documents, and the write lock will be released.

I can also see that if we did release the write lock on exception,
this could dangerously / easily mask the fact that there was an
exception.  Ie, if the IOException is caught and ignored (or writes a
message but nobody sees it), and the write lock was released, then you
could go for quite a while before discovering eg that new docs weren't
visible in the index.  Whereas, keeping the write lock held on
exception will cause much faster discovery of the problem (eg when the
next writer tries to instantiate).

I think this is the right exception semantics to aim for?  Ie if the
close did not succeed we should not release the write lock (because we
still have pending changes).

Then, if you want to force releasing of the write lock, you can still
do something like this:

  try {
writer.close();
  } finally {
if (IndexReader.isLocked(directory)) {
  IndexReader.unlock(directory);
}
  }



 Exception during IndexWriter.close() prevents release of the write.lock
 ---

 Key: LUCENE-748
 URL: http://issues.apache.org/jira/browse/LUCENE-748
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 1.9
 Environment: Lucene 1.4 through 2.1 HEAD (as of 2006-12-14)
Reporter: Jed Wesley-Smith

 After encountering a case of index corruption - see 
 http://issues.apache.org/jira/browse/LUCENE-140 - when the close() method 
 encounters an exception in the flushRamSegments() method, the index 
 write.lock is not released (i.e. it is not really closed).
 The write lock is only released when the IndexWriter is GC'd and finalize() 
 is called.




Stephen Hussey is out of the office.

2006-12-18 Thread Stephen Hussey

I will be out of the office starting  12/18/2006 and will not return until
12/20/2006.

I will respond to your message when I return.

[jira] Closed: (LUCENE-658) upload major releases to ibiblio

2006-12-18 Thread Michael McCandless (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-658?page=all ]

Michael McCandless closed LUCENE-658.
-

Resolution: Duplicate

Dup of LUCENE-551

 upload major releases to ibiblio
 

 Key: LUCENE-658
 URL: http://issues.apache.org/jira/browse/LUCENE-658
 Project: Lucene - Java
  Issue Type: Task
  Components: Other
Affects Versions: 1.9, 2.0.0
Reporter: Ryan Sonnek

 i'm a current user of maven and the latest 1.9 and 2.0 releases are not 
 available on ibiblio.
 http://www.ibiblio.org/maven2/lucene/lucene/
 Could someone upload the latest versions so that we maven-heads can access 
 the new features?




[jira] Resolved: (LUCENE-734) Upload Lucene 2.0 artifacts in the Maven 1 repository

2006-12-18 Thread Michael McCandless (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-734?page=all ]

Michael McCandless resolved LUCENE-734.
---

Resolution: Fixed

From the last comment, it looks like the 2.0 Lucene core JAR is in maven 1.

 Upload Lucene 2.0 artifacts in the Maven 1 repository
 -

 Key: LUCENE-734
 URL: http://issues.apache.org/jira/browse/LUCENE-734
 Project: Lucene - Java
  Issue Type: Task
  Components: Other
Reporter: Jukka Zitting
Priority: Minor

 The Lucene 2.0 artifacts can be found in the Maven 2 repository, but not in 
 the Maven 1 repository. There are still projects using Maven 1 who might be 
 interested in upgrading to Lucene 2, so having the artifacts also in the 
 Maven 1 repository would be very helpful.




Re: ThreadLocal leak (was Re: Leaking org.apache.lucene.index.* objects)

2006-12-18 Thread robert engels
There is no inherent problem with ThreadLocal. It is a viable  
solution to synchronization issues in most cases.


Re: potential indexing performance improvement for compound index - cut IO - have more files though

2006-12-18 Thread robert engels

A word of caution here...

Using a shared FileChannel.pread actually performs a synchronization  
under Windows.


See JDK bug http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6265734

I submitted this, and it was verified using the supplied test case.
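For reference, the positional-read API under discussion is FileChannel.read(ByteBuffer, long), which reads at an absolute offset without moving the channel's own position, so multiple threads can in principle share one descriptor. A minimal sketch (the helper name is made up here); Robert's point is that on Windows the JDK implements this call with an internal lock anyway (bug 6265734):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class PositionalRead {
    // Reads len bytes at the given absolute offset.  The channel's own
    // position is untouched, which is what makes this safe to share across
    // threads (modulo the Windows-lock caveat discussed above).
    static byte[] pread(FileChannel ch, long offset, int len) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(len);
        while (buf.hasRemaining()) {
            int n = ch.read(buf, offset + buf.position());
            if (n < 0) break;  // hit EOF before len bytes
        }
        buf.flip();
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("pread", ".bin");
        Files.write(tmp, "hello world".getBytes());
        try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.READ)) {
            System.out.println(new String(pread(ch, 6, 5)));
        }
        Files.delete(tmp);
    }
}
```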


On Dec 17, 2006, at 1:31 PM, Doug Cutting wrote:


Doron Cohen wrote:
Also, if nio proves to be faster in this scenario, it might make sense to 
keep the current FSDirectory, and just add an FSDirectoryNio implementation.


If nio isn't considerably slower for single-threaded applications,  
I'd vote to simply switch FSDirectory to use nio, simplifying the  
public API by reducing choices.  But if classic io is faster for  
single-threaded apps, and nio faster for multi-threaded, that would  
suggest adding a new, public, nio-based Directory implementation.


Doug






Re: potential indexing performance improvement for compound index - cut IO - have more files though

2006-12-18 Thread Doug Cutting

robert engels wrote:
Using a shared FileChannel.pread actually performs a synchronization 
under Windows.


Sigh.  Still, it'd be no worse than current FSDirectory on Windows.

Doug




Re: potential indexing performance improvement for compound index - cut IO - have more files though

2006-12-18 Thread robert engels
I think the important issues are index size, stability and number of  
concurrent readers.


We achieved the best performance by using a pool of file descriptors  
to a segment so we could avoid the synchronization block, but this  
only worked for large, relatively unchanging segments.
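A minimal sketch of such a descriptor pool (illustrative only, not the code Robert describes): each reader borrows a private RandomAccessFile, so seek-then-read needs no shared lock, at the cost of holding several open descriptors per segment.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class DescriptorPool {
    private final BlockingQueue<RandomAccessFile> pool;

    // Open N descriptors up front; works best for large, rarely-changing
    // segment files, since the pool must be rebuilt when the file changes.
    DescriptorPool(File f, int size) throws IOException {
        pool = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            pool.add(new RandomAccessFile(f, "r"));
        }
    }

    byte[] readAt(long offset, int len) throws IOException, InterruptedException {
        RandomAccessFile raf = pool.take();  // borrow a descriptor
        try {
            raf.seek(offset);                // private position: no lock needed
            byte[] buf = new byte[len];
            raf.readFully(buf);
            return buf;
        } finally {
            pool.put(raf);                   // return it to the pool
        }
    }
}
```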






[jira] Closed: (LUCENE-603) index optimize problem

2006-12-18 Thread Michael McCandless (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-603?page=all ]

Michael McCandless closed LUCENE-603.
-

Resolution: Duplicate

This looks like a dup of LUCENE-140

 index optimize problem
 --

 Key: LUCENE-603
 URL: http://issues.apache.org/jira/browse/LUCENE-603
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 1.9
 Environment: CentOS 4.0 , Lucene 1.9, Eclipse 3.1
Reporter: Dedian Guo

 I have a function which loops, indexing batches of documents; after each 
 batch, IndexWriter.optimize() is applied.  After several iterations (not 
 sure how many, but it should be many), the following exception was thrown:
 Exception in thread "Thread-0" java.lang.IllegalStateException: docs out of 
 order
   at 
 org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:335)
   at 
 org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:298)
   at 
 org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:272)
   at 
 org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:236)
   at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:89)
   at 
 org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:681)
   at 
 org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:658)
   at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:517)




[jira] Commented: (LUCENE-140) docs out of order

2006-12-18 Thread Michael McCandless (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-140?page=comments#action_12459457 ] 

Michael McCandless commented on LUCENE-140:
---

I just resolved LUCENE-603 as a dup of this issue.

It would be awesome if we could get a test case that shows this happening.  
Enough people seem to hit it that it seems likely something is lurking out 
there so I'd like to get it fixed!!

 docs out of order
 -

 Key: LUCENE-140
 URL: http://issues.apache.org/jira/browse/LUCENE-140
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: unspecified
 Environment: Operating System: Linux
 Platform: PC
Reporter: legez
 Assigned To: Lucene Developers
 Attachments: bug23650.txt, corrupted.part1.rar, corrupted.part2.rar


 Hello,
   I can not find out, why (and what) it is happening all the time. I got an
 exception:
 java.lang.IllegalStateException: docs out of order
 at
 org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:219)
 at
 org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:191)
 at
 org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:172)
 at 
 org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:135)
 at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:88)
 at 
 org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:341)
 at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:250)
 at Optimize.main(Optimize.java:29)
 It happens in both 1.2 and 1.3rc1 (anyway, what happened to that version? I 
 cannot find it in this form either in the downloads or in the version list). 
 Everything seems OK.  I can search the index, but I cannot optimize it.  
 Even worse, after this exception, every time I add new documents and close 
 the IndexWriter a new segment is created!  I think it contains all the 
 documents added before, judging by its size.
 My index is quite big: 500,000 docs, about 5 GB of index directory.
 It is _repeatable_.  I drop the index and reindex everything.  Afterwards I 
 add a few docs, try to optimize, and receive the above exception.
 My documents' structure is:
   static Document indexIt(String id_strony, Reader reader, String data_wydania,
       String id_wydania, String id_gazety, String data_wstawienia)
   {
       Document doc = new Document();
       doc.add(Field.Keyword("id", id_strony));
       doc.add(Field.Keyword("data_wydania", data_wydania));
       doc.add(Field.Keyword("id_wydania", id_wydania));
       doc.add(Field.Text("id_gazety", id_gazety));
       doc.add(Field.Keyword("data_wstawienia", data_wstawienia));
       doc.add(Field.Text("tresc", reader));
       return doc;
   }
 Sincerely,
 legez





[jira] Assigned: (LUCENE-129) Finalizers are non-canonical

2006-12-18 Thread Michael McCandless (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-129?page=all ]

Michael McCandless reassigned LUCENE-129:
-

Assignee: Michael McCandless  (was: Lucene Developers)

 Finalizers are non-canonical
 

 Key: LUCENE-129
 URL: http://issues.apache.org/jira/browse/LUCENE-129
 Project: Lucene - Java
  Issue Type: Bug
  Components: Other
Affects Versions: unspecified
 Environment: Operating System: other
 Platform: All
Reporter: Esmond Pitt
 Assigned To: Michael McCandless
Priority: Minor

 The canonical form of a Java finalizer is:
 protected void finalize() throws Throwable
 {
  try
  {
// ... local code to finalize this class
  }
  catch (Throwable t)
  {
  }
  super.finalize(); // finalize base class.
 }
 The finalizers in IndexReader, IndexWriter, and FSDirectory don't conform. 
 This
 is probably minor or null in effect, but the principle is important.
 As a matter of fact FSDirectory.finalize() is entirely redundant and could be
 removed, as it doesn't do anything that RandomAccessFile.finalize() wouldn't
 do automatically.





[jira] Commented: (LUCENE-301) Index Writer constructor flags unclear - and annoying in certain cases

2006-12-18 Thread Michael McCandless (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-301?page=comments#action_12459479 ] 

Michael McCandless commented on LUCENE-301:
---

I like Doug's solution: add a new constructor, IndexWriter(Directory, Analyzer), 
which creates the index if none exists -- I think this makes sense.  I will 
commit this.

 Index Writer constructor flags unclear - and annoying in certain cases
 --

 Key: LUCENE-301
 URL: http://issues.apache.org/jira/browse/LUCENE-301
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 1.4
 Environment: Operating System: other
 Platform: Other
Reporter: Dan Armbrust
 Assigned To: Lucene Developers
Priority: Minor

 Wouldn't it make more sense if the constructor for the IndexWriter always
 created an index if one doesn't exist - and the boolean parameter were named
 "clear" (instead of "create")?
 So instead of this (from javadoc):
 IndexWriter
 public IndexWriter(Directory d,
Analyzer a,
boolean create)
 throws IOException
 Constructs an IndexWriter for the index in d. Text will be analyzed with 
 a.
 If create is true, then a new, empty index will be created in d, replacing the
 index already there, if any.
 Parameters:
 d - the index directory
 a - the analyzer to use
 create - true to create the index or overwrite the existing one; false to
 append to the existing index 
 Throws:
 IOException - if the directory cannot be read/written to, or if it does 
 not
 exist, and create is false
 We would have this:
 IndexWriter
 public IndexWriter(Directory d,
Analyzer a,
boolean clear)
 throws IOException
 Constructs an IndexWriter for the index in d. Text will be analyzed with 
 a.
 If clear is true, and an index exists at location d, then it will be erased, 
 and
 a new, empty index will be created in d.
 Parameters:
 d - the index directory
 a - the analyzer to use
 clear - true to overwrite the existing one; false to append to the 
 existing
 index 
 Throws:
 IOException - if the directory cannot be read/written to, or if it does 
 not
 exist.
 Its current behavior is kind of annoying, because I have an app that should
 never clear an existing index; it should always append.  So I want create set 
 to
 false.  But when I am starting a brand new index, I have to change the
 create flag to keep it from throwing an exception...  I guess for now I will
 have to write code to check whether an index actually has content yet, and if 
 it
 doesn't, change the flag on the fly.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-301) Index Writer constructor flags unclear - and annoying in certain cases

2006-12-18 Thread Michael McCandless (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-301?page=comments#action_12459479 ] 

Michael McCandless commented on LUCENE-301:
---

I like Doug's solution: add a new constructor, IndexWriter(Directory, Analyzer) 
which, if no such index exists, creates one -- I think this makes sense.  I 
will commit this.

 Index Writer constructor flags unclear - and annoying in certain cases
 --

 Key: LUCENE-301
 URL: http://issues.apache.org/jira/browse/LUCENE-301
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 1.4
 Environment: Operating System: other
 Platform: Other
Reporter: Dan Armbrust
 Assigned To: Lucene Developers
Priority: Minor

 Wouldn't it make more sense if the constructor for the IndexWriter always
 created an index if it doesn't exist - and the boolean parameter should be
 clear (instead of create)
 So instead of this (from javadoc):
 IndexWriter
 public IndexWriter(Directory d,
Analyzer a,
boolean create)
 throws IOException
 Constructs an IndexWriter for the index in d. Text will be analyzed with 
 a.
 If create is true, then a new, empty index will be created in d, replacing the
 index already there, if any.
 Parameters:
 d - the index directory
 a - the analyzer to use
 create - true to create the index or overwrite the existing one; false to
 append to the existing index 
 Throws:
 IOException - if the directory cannot be read/written to, or if it does 
 not
 exist, and create is false
 We would have this:
 IndexWriter
 public IndexWriter(Directory d,
Analyzer a,
boolean clear)
 throws IOException
 Constructs an IndexWriter for the index in d. Text will be analyzed with 
 a.
 If clear is true, and a index exists at location d, then it will be erased, 
 and
 a new, empty index will be created in d.
 Parameters:
 d - the index directory
 a - the analyzer to use
 clear - true to overwrite the existing one; false to append to the 
 existing
 index 
 Throws:
 IOException - if the directory cannot be read/written to, or if it does 
 not
 exist.
 Its current behavior is kind of annoying, because I have an app that should
 never clear an existing index, it should always append.  So I want create set 
 to
 false.  But when I am starting a brand new index, then I have to change the
 create flag to keep it from throwing an exception...  I guess for now I will
 have to write code to check if a index actually has content yet, and if it
 doesn't, change the flag on the fly.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
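The check Dan describes writing by hand, and the convenience Doug proposed, can be sketched in a few lines. This is a self-contained toy model in plain Java, not the Lucene API: a marker file stands in for Lucene's real index-existence test, `open` stands in for the IndexWriter constructor, and all names here are illustrative.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class OpenOrCreateSketch {
    // Stand-in for "an index already exists here": a marker file.
    static boolean indexExists(Path dir) {
        return Files.exists(dir.resolve("segments"));
    }

    // Models the current constructor: create==true wipes and creates,
    // create==false fails unless an index is already present.
    static void open(Path dir, boolean create) throws IOException {
        if (create) {
            Files.createDirectories(dir);
            Files.write(dir.resolve("segments"), new byte[0]); // fresh empty "index"
        } else if (!indexExists(dir)) {
            throw new IOException("index does not exist and create is false");
        }
    }

    // Models the proposed convenience constructor: create only if missing.
    static void openOrCreate(Path dir) throws IOException {
        open(dir, !indexExists(dir));
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("idx-sketch");
        Files.delete(dir);                 // start from "no index at all"
        openOrCreate(dir);                 // nothing there yet -> creates
        openOrCreate(dir);                 // index present -> appends, no wipe
        if (!indexExists(dir)) throw new AssertionError("index should exist");
        System.out.println("ok");
    }
}
```

With the real pre-2.1 API, the usual workaround was along these lines: probe the directory first (IndexReader.indexExists was the typical test) and pass the result as the create flag.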



[jira] Assigned: (LUCENE-301) Index Writer constructor flags unclear - and annoying in certain cases

2006-12-18 Thread Michael McCandless (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-301?page=all ]

Michael McCandless reassigned LUCENE-301:
-

Assignee: Michael McCandless  (was: Lucene Developers)

 Index Writer constructor flags unclear - and annoying in certain cases
 --

 Key: LUCENE-301
 URL: http://issues.apache.org/jira/browse/LUCENE-301
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 1.4
 Environment: Operating System: other
 Platform: Other
Reporter: Dan Armbrust
 Assigned To: Michael McCandless
Priority: Minor








[jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-12-18 Thread Paul Elschot (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-565?page=comments#action_12459482 ] 

Paul Elschot commented on LUCENE-565:
-

I'd like to give this a try over the upcoming holidays.
Would it be possible to post a single patch?
A single patch can be made by locally svn add'ing all new files
and then doing an svn diff on all files involved from the top directory.

Regards,
Paul Elschot


 Supporting deleteDocuments in IndexWriter (Code and Performance Results 
 Provided)
 -

 Key: LUCENE-565
 URL: http://issues.apache.org/jira/browse/LUCENE-565
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Reporter: Ning Li
 Attachments: IndexWriter.java, IndexWriter.July09.patch, 
 IndexWriter.patch, KeepDocCount0Segment.Sept15.patch, 
 NewIndexModifier.July09.patch, NewIndexModifier.Sept21.patch, 
 NewIndexWriter.Aug23.patch, NewIndexWriter.July18.patch, 
 newMergePolicy.Sept08.patch, perf-test-res.JPG, perf-test-res2.JPG, 
 perfres.log, TestBufferedDeletesPerf.java, TestWriterDelete.java


 Today, applications have to open/close an IndexWriter and open/close an
 IndexReader directly or indirectly (via IndexModifier) in order to handle a
 mix of inserts and deletes. This performs well when inserts and deletes
 come in fairly large batches. However, the performance can degrade
 dramatically when inserts and deletes are interleaved in small batches.
 This is because the ramDirectory is flushed to disk whenever an IndexWriter
 is closed, causing a lot of small segments to be created on disk, which
 eventually need to be merged.
 We would like to propose a small API change to eliminate this problem. We
 are aware that this kind of change has come up in discussions before. See
 http://www.gossamer-threads.com/lists/lucene/java-dev/23049?search_string=indexwriter%20delete;#23049
 . The difference this time is that we have implemented the change and
 tested its performance, as described below.
 API Changes
 ---
 We propose adding a deleteDocuments(Term term) method to IndexWriter.
 Using this method, inserts and deletes can be interleaved using the same
 IndexWriter.
 Note that, with this change it would be very easy to add another method to
 IndexWriter for updating documents, allowing applications to avoid a
 separate delete and insert to update a document.
 Also note that this change can co-exist with the existing APIs for deleting
 documents using an IndexReader. But if our proposal is accepted, we think
 those APIs should probably be deprecated.
 Coding Changes
 --
 Coding changes are localized to IndexWriter. Internally, the new
 deleteDocuments() method works by buffering the terms to be deleted.
 Deletes are deferred until the ramDirectory is flushed to disk, either
 because it becomes full or because the IndexWriter is closed. Using Java
 synchronization, care is taken to ensure that an interleaved sequence of
 inserts and deletes for the same document are properly serialized.
 We have attached a modified version of IndexWriter in Release 1.9.1 with
 these changes. Only a few hundred lines of coding changes are needed. All
 changes are commented by CHANGE. We have also attached a modified version
 of an example from Chapter 2.2 of Lucene in Action.
 Performance Results
 ---
 To test the performance of our proposed changes, we ran some experiments using
 the TREC WT 10G dataset. The experiments were run on a dual 2.4 Ghz Intel
 Xeon server running Linux. The disk storage was configured as RAID0 array
 with 5 drives. Before indexes were built, the input documents were parsed
 to remove the HTML from them (i.e., only the text was indexed). This was
 done to minimize the impact of parsing on performance. A simple
 WhitespaceAnalyzer was used during index build.
 We experimented with three workloads:
   - Insert only. 1.6M documents were inserted and the final
 index size was 2.3GB.
   - Insert/delete (big batches). The same documents were
 inserted, but 25% were deleted. 1000 documents were
 deleted for every 4000 inserted.
   - Insert/delete (small batches). In this case, 5 documents
 were deleted for every 20 inserted.
                                 current      current        new
 Workload                        IndexWriter  IndexModifier  IndexWriter
 -----------------------------------------------------------------------
 Insert only                     116 min      119 min        116 min
 Insert/delete (big batches)     --           135 min        125 min
 Insert/delete (small batches)   --           338 min        134 min
 As the experiments show, with the proposed changes, the performance
 improved by 60% when inserts and deletes were interleaved in small batches.
 Regards,
 Ning
 Ning Li
 Search Technologies
 IBM Almaden Research Center
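The buffering scheme in the proposal can be modeled without any Lucene code at all. Below is a toy sketch (all names illustrative, not Lucene's: a HashMap stands in for flushed on-disk segments, flush() for flushing the ramDirectory) showing the key property being claimed: deletes are recorded as terms and applied only at flush time, and an interleaved delete+insert on the same term still serializes correctly.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BufferedDeleteSketch {
    // "term" -> doc texts, standing in for segments already flushed to disk.
    private final Map<String, List<String>> flushed = new HashMap<>();
    // In-RAM buffer of pending inserts and pending delete terms.
    private final Map<String, List<String>> ramBuffer = new HashMap<>();
    private final List<String> bufferedDeleteTerms = new ArrayList<>();

    public synchronized void addDocument(String term, String text) {
        ramBuffer.computeIfAbsent(term, k -> new ArrayList<>()).add(text);
    }

    // The proposed API: record the term now, defer the real work until flush.
    public synchronized void deleteDocuments(String term) {
        ramBuffer.remove(term);          // drop not-yet-flushed docs immediately
        bufferedDeleteTerms.add(term);   // remember to purge flushed docs later
    }

    // Models flushing the ramDirectory: apply deletes first, then buffered adds,
    // so delete-then-insert on one term keeps only the newer document.
    public synchronized void flush() {
        for (String term : bufferedDeleteTerms) flushed.remove(term);
        bufferedDeleteTerms.clear();
        for (Map.Entry<String, List<String>> e : ramBuffer.entrySet())
            flushed.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                   .addAll(e.getValue());
        ramBuffer.clear();
    }

    public synchronized int docCount(String term) {
        return flushed.getOrDefault(term, Collections.emptyList()).size()
             + ramBuffer.getOrDefault(term, Collections.emptyList()).size();
    }

    public static void main(String[] args) {
        BufferedDeleteSketch w = new BufferedDeleteSketch();
        w.addDocument("id:1", "v1");
        w.flush();
        w.deleteDocuments("id:1");   // buffered; nothing touches "disk" yet
        w.addDocument("id:1", "v2"); // update = delete + insert, same writer
        w.flush();
        System.out.println(w.docCount("id:1")); // 1 (only v2 survives)
    }
}
```

The point of the design, as the proposal explains, is that the delete costs nothing until a flush that was going to happen anyway, which is why the small-batch workload improves so dramatically.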
 

[jira] Commented: (LUCENE-748) Exception during IndexWriter.close() prevents release of the write.lock

2006-12-18 Thread Hoss Man (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-748?page=comments#action_12459483 ] 

Hoss Man commented on LUCENE-748:
-

given the changes made in LUCENE-702, i concur with your assessment Michael: 
keeping the lock open so that the caller can attempt to deal with the problem 
and then retry makes sense.

even if we decided that the consistent state of the IndexWriter isn't an 
invariant that the user can rely on, asking users to forcibly unlock in the 
event of an exception on close seems like a more reasonable expectation than 
forcibly unlocking for them automatically.

 Exception during IndexWriter.close() prevents release of the write.lock
 ---

 Key: LUCENE-748
 URL: http://issues.apache.org/jira/browse/LUCENE-748
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 1.9
 Environment: Lucene 1.4 through 2.1 HEAD (as of 2006-12-14)
Reporter: Jed Wesley-Smith

 After encountering a case of index corruption - see 
 http://issues.apache.org/jira/browse/LUCENE-140 - when the close() method 
 encounters an exception in the flushRamSegments() method, the index 
 write.lock is not released (i.e. it is not really closed).
 The write lock is only released when the IndexWriter is GC'd and finalize() is 
 called.
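The caller-side contract being discussed - close() keeps the lock on failure so the problem can be fixed and the close retried, with a forced unlock only as a last resort - can be sketched as follows. This is a toy model, not Lucene code: an AtomicBoolean stands in for the write.lock file, and the simulated flushRamSegments() fails once the way a full disk would.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;

public class CloseRetrySketch {
    // Stand-in for the write.lock file.
    static final AtomicBoolean writeLock = new AtomicBoolean(false);

    static class Writer {
        private int failuresLeft;

        Writer(int failures) {
            this.failuresLeft = failures;
            if (!writeLock.compareAndSet(false, true))
                throw new IllegalStateException("index locked");
        }

        // Simulated flushRamSegments(): fails while the "disk is full".
        private void flushRamSegments() throws IOException {
            if (failuresLeft-- > 0) throw new IOException("disk full");
        }

        // On failure the lock is intentionally NOT released, so the writer
        // stays usable for a retry after the caller frees some space.
        void close() throws IOException {
            flushRamSegments();
            writeLock.set(false); // released only on a successful close
        }
    }

    static void forceUnlock() { writeLock.set(false); } // caller's last resort

    public static void main(String[] args) {
        Writer w = new Writer(1); // first close attempt will fail
        try {
            w.close();
        } catch (IOException e) {
            // lock is still held here: free disk space, then retry
            try { w.close(); } catch (IOException e2) { forceUnlock(); }
        }
        System.out.println("locked=" + writeLock.get()); // locked=false
    }
}
```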




[jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-12-18 Thread Ning Li (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-565?page=all ]

Ning Li updated LUCENE-565:
---

Attachment: (was: IndexWriter.java)

 Supporting deleteDocuments in IndexWriter (Code and Performance Results 
 Provided)
 -

 Key: LUCENE-565
 URL: http://issues.apache.org/jira/browse/LUCENE-565
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Reporter: Ning Li
 Attachments: IndexWriter.patch, KeepDocCount0Segment.Sept15.patch, 
 NewIndexModifier.July09.patch, NewIndexModifier.Sept21.patch, 
 NewIndexWriter.Aug23.patch, NewIndexWriter.July18.patch, 
 newMergePolicy.Sept08.patch, perf-test-res.JPG, perf-test-res2.JPG, 
 perfres.log, TestBufferedDeletesPerf.java, TestWriterDelete.java







[jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-12-18 Thread Ning Li (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-565?page=all ]

Ning Li updated LUCENE-565:
---

Attachment: (was: IndexWriter.July09.patch)

 Supporting deleteDocuments in IndexWriter (Code and Performance Results 
 Provided)
 -

 Key: LUCENE-565
 URL: http://issues.apache.org/jira/browse/LUCENE-565
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Reporter: Ning Li
 Attachments: IndexWriter.patch, KeepDocCount0Segment.Sept15.patch, 
 NewIndexModifier.July09.patch, NewIndexModifier.Sept21.patch, 
 NewIndexWriter.Aug23.patch, NewIndexWriter.July18.patch, 
 newMergePolicy.Sept08.patch, perf-test-res.JPG, perf-test-res2.JPG, 
 perfres.log, TestBufferedDeletesPerf.java, TestWriterDelete.java







[jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-12-18 Thread Ning Li (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-565?page=all ]

Ning Li updated LUCENE-565:
---

Attachment: (was: IndexWriter.patch)

 Supporting deleteDocuments in IndexWriter (Code and Performance Results 
 Provided)
 -

 Key: LUCENE-565
 URL: http://issues.apache.org/jira/browse/LUCENE-565
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Reporter: Ning Li
 Attachments: KeepDocCount0Segment.Sept15.patch, 
 NewIndexModifier.July09.patch, NewIndexModifier.Sept21.patch, 
 NewIndexWriter.Aug23.patch, NewIndexWriter.July18.patch, 
 newMergePolicy.Sept08.patch, perf-test-res.JPG, perf-test-res2.JPG, 
 perfres.log, TestBufferedDeletesPerf.java, TestWriterDelete.java







[jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-12-18 Thread Ning Li (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-565?page=all ]

Ning Li updated LUCENE-565:
---

Attachment: (was: NewIndexModifier.July09.patch)

 Supporting deleteDocuments in IndexWriter (Code and Performance Results 
 Provided)
 -

 Key: LUCENE-565
 URL: http://issues.apache.org/jira/browse/LUCENE-565
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Reporter: Ning Li
 Attachments: KeepDocCount0Segment.Sept15.patch, 
 NewIndexModifier.Sept21.patch, newMergePolicy.Sept08.patch, 
 perf-test-res.JPG, perf-test-res2.JPG, perfres.log, 
 TestBufferedDeletesPerf.java


 Today, applications have to open/close an IndexWriter and open/close an
 IndexReader directly or indirectly (via IndexModifier) in order to handle a
 mix of inserts and deletes. This performs well when inserts and deletes
 come in fairly large batches. However, the performance can degrade
 dramatically when inserts and deletes are interleaved in small batches.
 This is because the ramDirectory is flushed to disk whenever an IndexWriter
 is closed, causing a lot of small segments to be created on disk, which
 eventually need to be merged.
 We would like to propose a small API change to eliminate this problem. We
 are aware that this kind change has come up in discusions before. See
 http://www.gossamer-threads.com/lists/lucene/java-dev/23049?search_string=indexwriter%20delete;#23049
 . The difference this time is that we have implemented the change and
 tested its performance, as described below.
 API Changes
 ---
 We propose adding a deleteDocuments(Term term) method to IndexWriter.
 Using this method, inserts and deletes can be interleaved using the same
 IndexWriter.
 Note that, with this change it would be very easy to add another method to
 IndexWriter for updating documents, allowing applications to avoid a
 separate delete and insert to update a document.
 Also note that this change can co-exist with the existing APIs for deleting
 documents using an IndexReader. But if our proposal is accepted, we think
 those APIs should probably be deprecated.
 Coding Changes
 --
 Coding changes are localized to IndexWriter. Internally, the new
 deleteDocuments() method works by buffering the terms to be deleted.
 Deletes are deferred until the ramDirectory is flushed to disk, either
 because it becomes full or because the IndexWriter is closed. Using Java
 synchronization, care is taken to ensure that an interleaved sequence of
 inserts and deletes for the same document are properly serialized.
 We have attached a modified version of IndexWriter from release 1.9.1 with
 these changes. Only a few hundred lines of code changes are needed. All
 changes are marked with a CHANGE comment. We have also attached a modified
 version of an example from Chapter 2.2 of Lucene in Action.
 Performance Results
 ---
 To test the performance of our proposed changes, we ran some experiments using
 the TREC WT 10G dataset. The experiments were run on a dual 2.4 GHz Intel
 Xeon server running Linux. The disk storage was configured as a RAID0 array
 with 5 drives. Before indexes were built, the input documents were parsed
 to remove the HTML from them (i.e., only the text was indexed). This was
 done to minimize the impact of parsing on performance. A simple
 WhitespaceAnalyzer was used during index build.
 We experimented with three workloads:
   - Insert only. 1.6M documents were inserted and the final
 index size was 2.3GB.
   - Insert/delete (big batches). The same documents were
 inserted, but 25% were deleted. 1000 documents were
 deleted for every 4000 inserted.
   - Insert/delete (small batches). In this case, 5 documents
 were deleted for every 20 inserted.
  Workload                       current      current        new
                                 IndexWriter  IndexModifier  IndexWriter
  ----------------------------------------------------------------------
  Insert only                    116 min      119 min        116 min
  Insert/delete (big batches)    --           135 min        125 min
  Insert/delete (small batches)  --           338 min        134 min
 As the experiments show, with the proposed changes, the performance
 improved by 60% when inserts and deletes were interleaved in small batches.
 Regards,
 Ning
 Ning Li
 Search Technologies
 IBM Almaden Research Center
 650 Harry Road
 San Jose, CA 95120
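The buffered-delete scheme described in the proposal (buffer the terms, apply them when the RAM buffer is flushed, serialize interleaved inserts and deletes with Java synchronization) can be sketched without Lucene internals as follows. This is an illustrative, library-free sketch only; the class and method names here (BufferedDeleter, flushDeletes, pendingDeleteCount) are invented for the example and are not the actual patch code.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the buffering idea: deleteDocuments() only
// records the term cheaply; the buffered terms are applied in order
// when the in-memory segment is flushed or the writer is closed.
public class BufferedDeleter {
    // Terms whose matching documents should be removed at the next flush.
    private final List<String> bufferedDeleteTerms = new ArrayList<String>();

    // Mirrors the proposed IndexWriter.deleteDocuments(Term): just buffers.
    public synchronized void deleteDocuments(String term) {
        bufferedDeleteTerms.add(term);
    }

    // Called when the RAM buffer is flushed to disk or the writer is
    // closed: drain the buffer and return the pending deletes to apply.
    public synchronized List<String> flushDeletes() {
        List<String> toApply = new ArrayList<String>(bufferedDeleteTerms);
        bufferedDeleteTerms.clear();
        return toApply;
    }

    public synchronized int pendingDeleteCount() {
        return bufferedDeleteTerms.size();
    }

    public static void main(String[] args) {
        BufferedDeleter d = new BufferedDeleter();
        d.deleteDocuments("id:1");   // interleaved with inserts in practice
        d.deleteDocuments("id:2");
        System.out.println(d.pendingDeleteCount());   // pending before flush
        System.out.println(d.flushDeletes().size());  // applied at flush
        System.out.println(d.pendingDeleteCount());   // buffer now empty
    }
}
```

Because both paths are synchronized on the same object, an interleaved sequence of inserts and deletes for the same document observes a consistent order, which is the property the proposal relies on.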

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

[jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-12-18 Thread Ning Li (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-565?page=all ]

Ning Li updated LUCENE-565:
---

Attachment: (was: NewIndexWriter.Aug23.patch)


[jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-12-18 Thread Ning Li (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-565?page=all ]

Ning Li updated LUCENE-565:
---

Attachment: (was: TestWriterDelete.java)


[jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-12-18 Thread Ning Li (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-565?page=all ]

Ning Li updated LUCENE-565:
---

Attachment: (was: KeepDocCount0Segment.Sept15.patch)




[jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-12-18 Thread Ning Li (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-565?page=all ]

Ning Li updated LUCENE-565:
---

Attachment: (was: newMergePolicy.Sept08.patch)




[jira] Commented: (LUCENE-748) Exception during IndexWriter.close() prevents release of the write.lock

2006-12-18 Thread Jed Wesley-Smith (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-748?page=comments#action_12459489 ] 

Jed Wesley-Smith commented on LUCENE-748:
-

I guess, particularly in light of LUCENE-702, that this behavior is OK - and the 
IndexReader.unlock(dir) is a good suggestion. My real problem was that the 
finalize() method only eventually removes the write lock. 

For me, then, the suggestion would be to document the exceptional behavior of the 
close() method (i.e. it means that changes haven't been written and the write 
lock is still held) and link to the IndexReader.unlock(Directory) method.

 Exception during IndexWriter.close() prevents release of the write.lock
 ---

 Key: LUCENE-748
 URL: http://issues.apache.org/jira/browse/LUCENE-748
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 1.9
 Environment: Lucene 1.4 through 2.1 HEAD (as of 2006-12-14)
Reporter: Jed Wesley-Smith

 After encountering a case of index corruption - see 
 http://issues.apache.org/jira/browse/LUCENE-140 - when the close() method 
 encounters an exception in the flushRamSegments() method, the index 
 write.lock is not released (i.e. it is not really closed).
 The write lock is only released when the IndexWriter is GC'd and finalize() is 
 called.




[jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-12-18 Thread Ning Li (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-565?page=comments#action_12459490 ] 

Ning Li commented on LUCENE-565:


Many versions of the patch were submitted as new code was committed to 
IndexWriter.java. For each version, all changes made were included in a single 
patch file.

I removed all but the latest version of the patch. Even this one is outdated by 
the commit of LUCENE-701 (lock-less commits). I was waiting for the commit of 
LUCENE-702 before submitting another patch. LUCENE-702 was committed this 
morning. So I'll submit an up-to-date patch over the holidays.

On 12/18/06, Paul Elschot (JIRA) [EMAIL PROTECTED] wrote:
 I'd like to give this a try over the upcoming holidays. 

That's great! We can discuss/compare the designs then. Or, we can 
discuss/compare the designs before submitting new patches.


[jira] Commented: (LUCENE-748) Exception during IndexWriter.close() prevents release of the write.lock

2006-12-18 Thread Michael McCandless (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-748?page=comments#action_12459491 ] 

Michael McCandless commented on LUCENE-748:
---

OK I will update the javadoc for IndexWriter.close to make this clear.  Thanks!

 Exception during IndexWriter.close() prevents release of the write.lock
 ---

 Key: LUCENE-748
 URL: http://issues.apache.org/jira/browse/LUCENE-748
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 1.9
 Environment: Lucene 1.4 through 2.1 HEAD (as of 2006-12-14)
Reporter: Jed Wesley-Smith

 After encountering a case of index corruption - see 
 http://issues.apache.org/jira/browse/LUCENE-140 - when the close() method 
 encounters an exception in the flushRamSegments() method, the index 
 write.lock is not released (i.e., the writer is not really closed).
 The write lock is only released when the IndexWriter is GC'd and finalize() is 
 called.
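The failure mode is easy to reproduce in miniature. Below is a hedged, self-contained Java sketch; ToyWriter, its static WRITE_LOCK, and flushRamSegments() are illustrative stand-ins, not Lucene's real classes. It contrasts the buggy close() described in the report with a defensive variant that releases the lock in a finally block:

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;

// Toy model of the LUCENE-748 failure mode: a writer that acquires a
// lock on construction and is supposed to release it in close().
class ToyWriter {
    static final AtomicBoolean WRITE_LOCK = new AtomicBoolean(false);

    ToyWriter() {
        if (!WRITE_LOCK.compareAndSet(false, true))
            throw new IllegalStateException("index locked");
    }

    void flushRamSegments() throws IOException {
        throw new IOException("disk full"); // simulate the failing flush
    }

    // Buggy close, as described in the report: when flushRamSegments()
    // throws, the lock release below it is never reached.
    void close() throws IOException {
        flushRamSegments();
        WRITE_LOCK.set(false);
    }

    // Defensive variant: the lock is released even when the flush fails.
    void closeSafely() throws IOException {
        try {
            flushRamSegments();
        } finally {
            WRITE_LOCK.set(false);
        }
    }
}
```

The fix for the real IndexWriter follows the same shape: move the lock release (and directory close) into a finally clause so an exception during flushing cannot strand the write.lock.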

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Assigned: (LUCENE-748) Exception during IndexWriter.close() prevents release of the write.lock

2006-12-18 Thread Michael McCandless (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-748?page=all ]

Michael McCandless reassigned LUCENE-748:
-

Assignee: Michael McCandless

 Exception during IndexWriter.close() prevents release of the write.lock
 ---

 Key: LUCENE-748
 URL: http://issues.apache.org/jira/browse/LUCENE-748
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 1.9
 Environment: Lucene 1.4 through 2.1 HEAD (as of 2006-12-14)
Reporter: Jed Wesley-Smith
 Assigned To: Michael McCandless

 After encountering a case of index corruption - see 
 http://issues.apache.org/jira/browse/LUCENE-140 - when the close() method 
 encounters an exception in the flushRamSegments() method, the index 
 write.lock is not released (i.e., the writer is not really closed).
 The write lock is only released when the IndexWriter is GC'd and finalize() is 
 called.




[jira] Commented: (LUCENE-748) Exception during IndexWriter.close() prevents release of the write.lock

2006-12-18 Thread Jed Wesley-Smith (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-748?page=comments#action_12459502 ] 

Jed Wesley-Smith commented on LUCENE-748:
-

Awesome, thanks!

 Exception during IndexWriter.close() prevents release of the write.lock
 ---

 Key: LUCENE-748
 URL: http://issues.apache.org/jira/browse/LUCENE-748
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 1.9
 Environment: Lucene 1.4 through 2.1 HEAD (as of 2006-12-14)
Reporter: Jed Wesley-Smith
 Assigned To: Michael McCandless

 After encountering a case of index corruption - see 
 http://issues.apache.org/jira/browse/LUCENE-140 - when the close() method 
 encounters an exception in the flushRamSegments() method, the index 
 write.lock is not released (i.e., the writer is not really closed).
 The write lock is only released when the IndexWriter is GC'd and finalize() is 
 called.




[jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-12-18 Thread Ning Li (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-565?page=comments#action_12459506 ] 

Ning Li commented on LUCENE-565:


Here is the design overview. Minor changes were made because of lock-less 
commits.

In the current IndexWriter, newly added documents are buffered in ram in the 
form of one-doc segments.
When a flush is triggered, all ram documents are merged into a single segment 
and written to disk.
Further merges of disk segments may be triggered.

NewIndexModifier extends IndexWriter and supports document deletion in addition 
to document addition.
NewIndexModifier not only buffers newly added documents in ram, but also 
buffers deletes in ram.
The following describes what happens when a flush is triggered:

  1 merge ram documents into one segment and write it to disk
do not commit - segmentInfos is updated in memory, but not written to disk

  2 for each disk segment to which a delete may apply
  open reader
  delete docs*, write new .delN file (* Care is taken to ensure that an 
interleaved sequence of
inserts and deletes for the same document is properly serialized.)
  close reader, but do not commit - segmentInfos is updated in memory, but 
not written to disk

  3 commit - write new segments_N to disk

Further merges for disk segments work the same as before.


As an option, we can cache readers to minimize the number of reader 
opens/closes. In other words,
we can trade memory for better performance. The design would be modified as 
follows:

  1 same as above

  2 for each disk segment to which a delete may apply
  open reader and cache it if not already opened/cached
  delete docs*, write new .delN file

  3 commit - write new segments_N to disk

The logic for disk segment merge changes accordingly: open reader if not 
already opened/cached;
after a merge is complete, close readers for the segments that have been merged.
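For intuition, here is a hedged sketch of the buffering scheme above using plain Java collections; ToyModifier, Doc, and the "disk" list are illustrative stand-ins for Lucene's ram/disk segments, not the actual patch. Each buffered delete records the size of the RAM buffer at delete time, so at flush it applies only to documents added before it, which serializes interleaved inserts and deletes of the same term:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of buffered deletes: adds go to a RAM buffer, deletes are
// buffered as terms, and both are applied together at flush time.
class ToyModifier {
    static class Doc {
        final String term;
        Doc(String t) { term = t; }
    }

    private final List<Doc> ramDocs = new ArrayList<>();
    // term -> number of RAM docs present when the delete was buffered
    private final Map<String, Integer> bufferedDeletes = new HashMap<>();
    private final List<Doc> disk = new ArrayList<>(); // "on-disk" segment

    void addDocument(String term) { ramDocs.add(new Doc(term)); }

    void deleteDocuments(String term) {
        // remember how far into the RAM buffer this delete applies
        bufferedDeletes.put(term, ramDocs.size());
    }

    void flush() {
        // 1. apply buffered deletes to the existing disk segment
        //    (every disk doc predates every buffered delete)
        disk.removeIf(d -> bufferedDeletes.containsKey(d.term));
        // 2. apply each delete only to RAM docs added before it,
        //    so a doc added after the delete survives
        List<Doc> survivors = new ArrayList<>();
        for (int i = 0; i < ramDocs.size(); i++) {
            Doc d = ramDocs.get(i);
            Integer upTo = bufferedDeletes.get(d.term);
            if (upTo == null || i >= upTo) survivors.add(d);
        }
        // 3. "commit": merge surviving RAM docs into the disk segment
        disk.addAll(survivors);
        ramDocs.clear();
        bufferedDeletes.clear();
    }

    int numDocs(String term) {
        int n = 0;
        for (Doc d : disk) if (d.term.equals(term)) n++;
        return n;
    }
}
```

The real patch does the per-segment work with SegmentReaders and .delN files as described in steps 1-3 above; the point of the sketch is only the ordering rule that makes interleaved adds and deletes safe within one flush.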


 Supporting deleteDocuments in IndexWriter (Code and Performance Results 
 Provided)
 -

 Key: LUCENE-565
 URL: http://issues.apache.org/jira/browse/LUCENE-565
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Reporter: Ning Li
 Attachments: NewIndexModifier.Sept21.patch, perf-test-res.JPG, 
 perf-test-res2.JPG, perfres.log, TestBufferedDeletesPerf.java


 Today, applications have to open/close an IndexWriter and open/close an
 IndexReader directly or indirectly (via IndexModifier) in order to handle a
 mix of inserts and deletes. This performs well when inserts and deletes
 come in fairly large batches. However, the performance can degrade
 dramatically when inserts and deletes are interleaved in small batches.
 This is because the ramDirectory is flushed to disk whenever an IndexWriter
 is closed, causing a lot of small segments to be created on disk, which
 eventually need to be merged.
 We would like to propose a small API change to eliminate this problem. We
 are aware that this kind of change has come up in discussions before. See
 http://www.gossamer-threads.com/lists/lucene/java-dev/23049?search_string=indexwriter%20delete;#23049
 . The difference this time is that we have implemented the change and
 tested its performance, as described below.
 API Changes
 ---
 We propose adding a deleteDocuments(Term term) method to IndexWriter.
 Using this method, inserts and deletes can be interleaved using the same
 IndexWriter.
 Note that, with this change it would be very easy to add another method to
 IndexWriter for updating documents, allowing applications to avoid a
 separate delete and insert to update a document.
 Also note that this change can co-exist with the existing APIs for deleting
 documents using an IndexReader. But if our proposal is accepted, we think
 those APIs should probably be deprecated.
 Coding Changes
 --
 Coding changes are localized to IndexWriter. Internally, the new
 deleteDocuments() method works by buffering the terms to be deleted.
 Deletes are deferred until the ramDirectory is flushed to disk, either
 because it becomes full or because the IndexWriter is closed. Using Java
 synchronization, care is taken to ensure that an interleaved sequence of
 inserts and deletes for the same document is properly serialized.
 We have attached a modified version of IndexWriter in Release 1.9.1 with
 these changes. Only a few hundred lines of coding changes are needed. All
 changes are commented by CHANGE. We have also attached a modified version
 of an example from Chapter 2.2 of Lucene in Action.
 Performance Results
 ---
 To test the performance of our proposed changes, we ran some experiments using
 the TREC WT10g dataset. The experiments were run on a dual 2.4 GHz Intel
 Xeon server running Linux. The disk storage was configured as a RAID0 array
 with 5 drives. Before 

access policy for Java Open Review Project

2006-12-18 Thread Brian Chess
Hi all, I've been busy creating JOR accounts this weekend, and it was cool
to see so many names from Lucene.  Lucene, Solr, and Nutch have the lowest
defect rates among the projects we've looked at, and I'm beginning to see
why.

One of the things JOR is doing is inviting people to come and help review
issues we find with static analysis.  We've had a fair number of signups
since the project was on slashdot.

My question is, would you like to allow outsiders to go through results and
help sort the real bugs from the chaff?  The upside is that volunteers may
perform useful work and that it may be another avenue to get people involved
with the code.  The downside is that things like XSS in admin pages may
lead them to make more ruckus than is really appropriate.

The situation may change if we can establish a mechanism for efficiently
moving issues into Jira, but for now, I could imagine a number of different
policies, including:
  - Allow anyone access who asks for it.
  - Allow access on a case-by-case basis.
  - Don't allow access to outsiders.

Here are the outsiders who've requested access so far, along with a few
words to summarize what they've told me about themselves.

Lucene
--
Varun Nair [EMAIL PROTECTED]: budding code auditor at TCS
Martin Englund [EMAIL PROTECTED]: Experienced auditor at Sun
[EMAIL PROTECTED]: Looks like he's just testing the waters

Lucene, Nutch, Solr
--
Thierry De Leeuw [EMAIL PROTECTED]: experienced vulnerability hunter
Michael Bunzel [EMAIL PROTECTED]: experienced auditor, but new to
auditing Java

Thoughts?
Brian

