Hello,
These sound very interesting!
I think some of them would go under contrib (as utility tools?) and
others maybe into the core. I've added more detailed comments below.
Stanislav Jordanov wrote:
Hi guys,
For the purpose of our product we've devised a bunch of small tool
classes which handle various utility tasks like:
1. IndexRecoverer - assuming the "segments" file is missing or
corrupted, this tool rebuilds it based on the *.cfs (and other) files
found in the index dir (excludes files listed in deletable)
Excellent. I know that various cases of "recovering an index" have
come up on the lists over time. It would be great to have a single
tool that can try to correct the different problems that users hit, e.g.
removing a single unusable segments file, regenerating the segments
file, etc.
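To make the first step concrete, here is a rough sketch of how such a
tool might pick the segments to list in a rebuilt segments file. It is
plain Java, not Lucene code: the class name SegmentScan, the method
candidateSegments, and passing the contents of "deletable" as a set of
file names are all my assumptions, and it only looks at *.cfs files.
Actually writing the segments file is left out.

```java
import java.util.Collection;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical helper, not part of Lucene.
final class SegmentScan {

    /**
     * Returns the names of segments that have a compound (.cfs) file in the
     * index directory listing and whose file is not listed in "deletable".
     *
     * @param fileNames names of all files found in the index directory
     * @param deletable file names already read from the "deletable" file
     */
    static SortedSet<String> candidateSegments(Collection<String> fileNames,
                                               Set<String> deletable) {
        SortedSet<String> segments = new TreeSet<>();
        for (String name : fileNames) {
            // Skip anything that is not a compound file, or is pending deletion.
            if (!name.endsWith(".cfs") || deletable.contains(name)) {
                continue;
            }
            // "_1.cfs" -> segment name "_1"
            segments.add(name.substring(0, name.length() - ".cfs".length()));
        }
        return segments;
    }
}
```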
2. IndexSplitter - splits an existing index into 2, 3 or more relatively
equally sized indices. It simply splits the segment files into distinct
directories and then uses the IndexRecoverer to rebuild each new index's
segments file
Seems like a good tool for contrib?
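I don't know how your tool decides which segment goes where, but one
simple way to get "relatively equally sized" groups is a greedy
assignment: take the segments largest first and always put the next one
into the group that is currently smallest. A minimal sketch in plain
Java (SegmentPartitioner and the size map are made up for illustration,
nothing here touches Lucene):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene.
final class SegmentPartitioner {

    /**
     * Splits segment names into `parts` groups of roughly equal total size.
     *
     * @param segmentSizes segment name -> size in bytes (assumed known)
     * @param parts        number of target indices
     */
    static List<List<String>> partition(Map<String, Long> segmentSizes, int parts) {
        if (parts < 1) {
            throw new IllegalArgumentException("parts must be >= 1");
        }
        List<List<String>> groups = new ArrayList<>();
        for (int i = 0; i < parts; i++) {
            groups.add(new ArrayList<>());
        }
        long[] totals = new long[parts];

        // Largest segments first; ties broken by name so the result is stable.
        List<Map.Entry<String, Long>> bySize = new ArrayList<>(segmentSizes.entrySet());
        bySize.sort((a, b) -> {
            int c = Long.compare(b.getValue(), a.getValue());
            return c != 0 ? c : a.getKey().compareTo(b.getKey());
        });

        for (Map.Entry<String, Long> segment : bySize) {
            // Find the group with the smallest running total.
            int smallest = 0;
            for (int i = 1; i < parts; i++) {
                if (totals[i] < totals[smallest]) {
                    smallest = i;
                }
            }
            groups.get(smallest).add(segment.getKey());
            totals[smallest] += segment.getValue();
        }
        return groups;
    }
}
```

This is only a heuristic: it does not guarantee the best possible
balance, and a single very large segment will still dominate its group.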
3. IndexMerger - the reverse of IndexSplitter: merges several indices
into a single index. Uses a modified version of IndexWriter.addIndexes
that does not optimize() at the beginning and at the end. This way the
resulting index is not a single huge cfs file, which is desirable in
some cases.
You should have a look at the current Lucene trunk: a new method
(called addIndexesNoOptimize) has been added that I think addresses
this same need.
4. IndexOptimizer - optimizes an existing index by merging the 'small'
segments and compacting the large segments (compacting means 'removing
the deleted docs within them'). It also converts any old-style
"spilled" segments to the compound file format.
Ooh -- this sounds like a lighter weight version of the current
"optimize"? Compacting single segments would be particularly useful
for very large indices that receive many updates to each doc. It
seems like this could be a new method on IndexWriter?
Though I think this could break the index segment invariants (see the
new merge policy in IndexWriter on the trunk) when there are many
deletes against the large older segments (which I think is actually a
fairly typical use case).
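For the compacting part, I could imagine only touching segments whose
fraction of deleted docs is above some threshold. A rough sketch in
plain Java; SegmentStats, CompactionCandidates and the threshold are
all made up for illustration, this is not IndexWriter code, and it does
nothing about the invariants concern above:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene.
final class CompactionCandidates {

    /** Hypothetical per-segment counters; not a Lucene class. */
    static final class SegmentStats {
        final String name;
        final int docCount;      // total docs, including deleted ones
        final int deletedCount;  // docs marked deleted

        SegmentStats(String name, int docCount, int deletedCount) {
            this.name = name;
            this.docCount = docCount;
            this.deletedCount = deletedCount;
        }
    }

    /**
     * Returns the names of segments whose deleted fraction is at least
     * minDeletedFraction (e.g. 0.2 for 20%), in input order.
     */
    static List<String> select(List<SegmentStats> segments, double minDeletedFraction) {
        List<String> picked = new ArrayList<>();
        for (SegmentStats s : segments) {
            if (s.docCount == 0) {
                continue; // nothing to compact in an empty segment
            }
            double deletedFraction = (double) s.deletedCount / s.docCount;
            if (deletedFraction >= minDeletedFraction) {
                picked.add(s.name);
            }
        }
        return picked;
    }
}
```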
All of the above mentioned tools are classes within the
org.apache.lucene.index package, as they use some package-scope methods
and properties (and they feel like they belong there).
Now the design change suggestion - it is about the 'deletable' related
code;
according to the source comments, the delayed deletion of files
through 'deletable' is required on Windows only, as this OS prevents
files opened for reading from being deleted.
Working on the IndexOptimizer tool I found myself in a situation where I
needed to 'safe delete' a bunch of obsolete segments while having only
an (FS)Directory and a segment file name. And the 'safe delete' feature
is in IndexWriter. Then after reviewing the code I came to the
conclusion that the 'safe delete' feature logically belongs to the
(FS)Directory class, not to IndexWriter. I was able to move the
corresponding code from IndexWriter to (FS)Directory; IMO this way is
better.
You should also look at the trunk for this one. The deletion logic
has moved into a separate class (IndexFileDeleter), which figures out
which files 1) look to be Lucene index files, but 2) are not in fact
referenced by the current segments file, and then safely deletes them
(retrying deletes that fail).
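To illustrate the idea (this is NOT the IndexFileDeleter code, just a
plain Java sketch of the same two checks plus the retry): the class
name UnreferencedFileSweep, the sweep method, the tryDelete callback
and the extension list (a partial, illustrative subset) are all my
assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical helper, not part of Lucene.
final class UnreferencedFileSweep {

    // Assumption: an illustrative subset of index file extensions.
    private static final Set<String> INDEX_EXTENSIONS = new HashSet<>(
            Arrays.asList("cfs", "fnm", "fdt", "fdx", "tis", "tii", "frq", "prx", "del"));

    // Files we wanted to delete but could not (e.g. still open on Windows).
    private final Set<String> pending = new LinkedHashSet<>();

    static boolean looksLikeIndexFile(String name) {
        int dot = name.lastIndexOf('.');
        return dot > 0 && INDEX_EXTENSIONS.contains(name.substring(dot + 1));
    }

    /**
     * Queues every index-looking file that the current segments file does not
     * reference, then tries to delete everything queued so far. Files whose
     * delete fails stay queued and are retried on the next call.
     *
     * @param tryDelete returns true if the named file was actually deleted
     * @return names deleted during this call
     */
    List<String> sweep(Collection<String> allFiles, Set<String> referenced,
                       Predicate<String> tryDelete) {
        for (String name : allFiles) {
            if (looksLikeIndexFile(name) && !referenced.contains(name)) {
                pending.add(name);
            }
        }
        List<String> deleted = new ArrayList<>();
        for (Iterator<String> it = pending.iterator(); it.hasNext();) {
            String name = it.next();
            if (tryDelete.test(name)) {
                deleted.add(name);
                it.remove();
            }
        }
        return deleted;
    }

    Set<String> pending() {
        return Collections.unmodifiableSet(pending);
    }
}
```

With a real directory, tryDelete could be something like
name -> new java.io.File(indexDir, name).delete(), where indexDir is
whatever directory the caller is working on.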
Mike