Hello,

These sound very interesting!

I think some of them would go under contrib (as utility tools?) and
others maybe into the core.  I've added more detailed comments below.

Stanislav Jordanov wrote:
Hi guys,

For the purpose of our product we've devised a bunch of small tool classes which handle various utility tasks like:

1. IndexRecoverer - assuming the "segments" file is missing or corrupted, this tool rebuilds it based on the *.cfs (and other) files found in the index dir (excludes files listed in deletable).

Excellent.  I know that various cases of "recovering an index" have
come up on the lists over time.  It would be great to have a single
tool that can try to correct the different problems that users hit,
e.g. removing a single unusable segments file, regenerating the
segments file, etc.
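
For illustration only, here is a rough sketch of the first step such a
recovery tool might take: working out which segment names are still
live from a directory listing.  This is not Lucene code; the class and
method names are made up, and it only looks at *.cfs files (it ignores
the "and other" files you mention).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Lucene.
final class SegmentNameScanner {

    /**
     * Returns the sorted names (without the ".cfs" extension) of all
     * compound segment files that are not listed as deletable.
     */
    static List<String> liveSegmentNames(List<String> fileNames, Set<String> deletable) {
        TreeSet<String> names = new TreeSet<>();
        for (String file : fileNames) {
            if (file.endsWith(".cfs") && !deletable.contains(file)) {
                names.add(file.substring(0, file.length() - ".cfs".length()));
            }
        }
        return new ArrayList<>(names);
    }
}
```

Actually writing a new segments file from those names is the hard
part, and is deliberately left out of the sketch.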

2. IndexSplitter - splits an existing index into 2, 3 or more relatively equally sized indices. It simply distributes the segment files into distinct directories and then uses the IndexRecoverer to rebuild each new index's segments file.

Seems like a good tool for contrib?
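
I don't know how your tool picks the groups, but one simple way to get
"relatively equally sized" pieces is a greedy assignment: take the
segments from largest to smallest and always put the next one into the
group that is currently smallest.  A sketch, with hypothetical names
and with segment sizes passed in as a plain map:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene.
final class SegmentPartitioner {

    /** Greedily assigns segments (name -> size in bytes) to the given number of groups. */
    static List<List<String>> partition(Map<String, Long> sizes, int parts) {
        if (parts < 1) {
            throw new IllegalArgumentException("parts must be >= 1");
        }
        // Largest segments first; ties broken by name so the result is deterministic.
        List<String> names = new ArrayList<>(sizes.keySet());
        names.sort((a, b) -> {
            int bySize = Long.compare(sizes.get(b), sizes.get(a));
            return bySize != 0 ? bySize : a.compareTo(b);
        });

        List<List<String>> groups = new ArrayList<>();
        for (int i = 0; i < parts; i++) {
            groups.add(new ArrayList<>());
        }
        long[] totals = new long[parts];

        for (String name : names) {
            // Find the group with the smallest total size so far.
            int smallest = 0;
            for (int i = 1; i < parts; i++) {
                if (totals[i] < totals[smallest]) {
                    smallest = i;
                }
            }
            groups.get(smallest).add(name);
            totals[smallest] += sizes.get(name);
        }
        return groups;
    }
}
```

This balances bytes on disk only; balancing by document count would
need a different size measure.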

3. IndexMerger - the reverse of IndexSplitter: merges several indices into a single index. It uses a modified version of IndexWriter.addIndexes that does not optimize() at the beginning and at the end. This way the resulting index is not a single huge cfs file, which is desirable in some cases.

You should have a look at the current Lucene trunk: a new method
(called addIndexesNoOptimize) has been added that I think addresses
this same need.

4. IndexOptimizer - optimizes an existing index by merging the 'small' segments and compacting the large segments (compacting means 'removing the deleted docs within them'). It also converts any old-style "spilled" segments to compound file format.

Ooh -- this sounds like a lighter weight version of the current
"optimize"?  Compacting single segments would be particularly useful
for very large indices that receive many updates to each doc.  It
seems like this could be a new method on IndexWriter?

Though I think this could break the index segment invariants (which
the new merge policy in IndexWriter in the trunk relies on) when there
are many deletes against the large, older segments (a fairly typical
use case, I think).
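
Just to pin down what I mean by "compacting", here is a toy model
(not Lucene code, names are hypothetical): a segment is a list of
docs plus a bit set marking the deleted ones, and compaction keeps
only the live docs.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Toy model of a segment; hypothetical, not part of Lucene.
final class SegmentCompactor {

    /** Returns the docs whose position is not marked in the deleted bit set, in order. */
    static <T> List<T> compact(List<T> docs, BitSet deleted) {
        List<T> live = new ArrayList<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            if (!deleted.get(docId)) {
                live.add(docs.get(docId));
            }
        }
        return live;
    }
}
```

Note that the surviving docs end up at new positions, and the
compacted segment is smaller than the one it replaces, which is
exactly why the size-based bookkeeping above is a concern.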

All of the above-mentioned tools are classes within the org.apache.lucene.index package, as they use some package-scope methods and properties (and they feel like they belong there).

Now the design change suggestion - it is about the 'deletable' related code. According to the source comments, the delayed deletion of files through 'deletable' is required on Windows only, as this OS prevents files opened for reading from being deleted. Working on the IndexOptimizer tool I found myself in a situation where I needed to 'safe delete' a bunch of obsolete segments while having only an (FS)Directory and a segment file name. And the 'safe delete' feature is in IndexWriter. Then after reviewing the code I came to the conclusion that the 'safe delete' feature logically belongs to the (FS)Directory class, not to IndexWriter. I was able to move the corresponding code from IndexWriter to (FS)Directory. IMO this way is better.

You should also look at the trunk for this one.  The deletion logic
has moved into a separate class (IndexFileDeleter), which figures out
which files 1) look to be Lucene index files, but 2) are not in fact
referenced by the current segments file, and then safely deletes them
(with retries).
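
The general idea, as a standalone sketch rather than the actual
IndexFileDeleter code: the class name, the method names and the
"looks like an index file" test are all assumptions for illustration,
and the retry here is a simple bounded loop.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical helper, not the real IndexFileDeleter.
final class UnreferencedFileCleaner {

    /** Files that look like index files but are not referenced by the current segments file. */
    static List<String> unreferenced(Collection<String> filesInDir,
                                     Set<String> referenced,
                                     Predicate<String> looksLikeIndexFile) {
        List<String> result = new ArrayList<>();
        for (String name : filesInDir) {
            if (looksLikeIndexFile.test(name) && !referenced.contains(name)) {
                result.add(name);
            }
        }
        return result;
    }

    /** Tries to delete the file, retrying on IOException; returns true once the file is gone. */
    static boolean deleteWithRetry(Path file, int maxAttempts, long pauseMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                Files.deleteIfExists(file);
                return true;
            } catch (IOException e) {
                // E.g. the file is still open for reading on Windows.
                if (attempt == maxAttempts) {
                    break;
                }
                try {
                    Thread.sleep(pauseMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }
}
```

A caller would pass something like name -> name.startsWith("_") as the
predicate; that rule is only a stand-in for whatever check the real
code uses.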

Mike
