Hello Dmitry,

This would be a nice, juicy contribution.  Your outline of the changes
sounds good, but unfortunately I do not know whether or how this will
affect the performance of Lucene.  Moreover, we don't even have any
tools to measure the performance before and after the changes. :(

Regardless, I am looking forward to your changes, and this time I'll
make sure we do a better job with them than we did with your Term
Vector patches (more about that in a separate email).


--- Dmitry Serebrennikov <[EMAIL PROTECTED]> wrote:
> Greetings, Luceeners!
> Looks like lots of good stuff is happening with the code as of late.
> It's great to see this momentum!
> Here's some more action coming your way...
> ---------------------------------------------------------
> We all love Lucene, but most would agree that it tends to use a very 
> large number of file handles.
> This is especially true for applications that have one or more of the
> factors below:
>     a) use a merged index over a large number of indexes
>     b) experience index updates concurrent with searching
>     c) search through unoptimized indexes
>     d) use high merge factor settings to speed up indexing
>     e) have a large number of indexed fields
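
To put rough numbers on the factors above, here is a back-of-the-envelope sketch. It assumes every segment keeps a fixed set of files open plus one extra file per indexed field, and that a merged search holds all of them open at once. The class name, the method names, and the figure of 7 fixed files in the usage note are illustrative assumptions, not measurements, and the deleted documents file is ignored.

```java
/** Rough estimate of open file handles; every input here is an assumption. */
final class HandleEstimate {

    /** Handles needed when every segment file is opened separately. */
    static long separateFiles(int indexes, int segmentsPerIndex,
                              int fixedFilesPerSegment, int indexedFields) {
        // Assumption: one additional per-field file for each indexed field.
        long filesPerSegment = (long) fixedFilesPerSegment + indexedFields;
        return (long) indexes * segmentsPerIndex * filesPerSegment;
    }

    /** Handles needed when each segment is stored as one combined file. */
    static long combinedFiles(int indexes, int segmentsPerIndex) {
        return (long) indexes * segmentsPerIndex;
    }
}
```

With 10 indexes of 20 segments each, 7 fixed files per segment and 10 indexed fields, separateFiles(10, 20, 7, 10) gives 10 * 20 * 17 handles, while combinedFiles(10, 20) gives 10 * 20.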
> For a long time I've been contemplating an idea that can help
> drastically reduce the number of file handles needed by Lucene. Now I
> am finally going to get a few days to make this happen (pending final
> approval by the powers that be). So, I wanted to put out the general
> plan of action and seek community comment on it early on. Over the
> next day or so, I intend to implement the changes outlined below
> (unless of course I get responses that steer me in a different
> direction). As I get more solid results (down to the diffs), I'll
> post them for further review. But the sooner I get feedback, the more
> chance there is that I will actually be able to incorporate it.
> Hopefully, this will result in a set of patches that will solve the
> problems I am after, and be useful enough to the general Lucene
> population to be included into the tree.
> So, here goes.
> Lucene's indexes are built out of segments. Each segment consists of
> a number of files, which are written when the segment is created
> during indexing. Once the IndexWriter is closed, the segment files
> are not modified, ever (except the file that lists deleted documents,
> if any).
> The proposed change is as follows:
>     - add code to the IndexWriter.close() method to combine all of the
> segment's files into a single file with a header that indicates the
> start offset and length of each of the new file's components,
> corresponding to the individual files in the current segments. This
> will be done in such a way that the file will be able to contain any
> number of components - this way we can support evolution of the
> segment structure in the future. The deleted documents file will
> remain separate.
>     - add a new segment reader, or add code to the existing one, to
> work with these types of segments
>     - when this new segment reader opens its files, it can open one
> file object from the Directory for the combined file and then clone
> it for each of the files formerly in the segment. Each cloned file
> object would maintain its own position into the combined file and
> will have its own buffer as they did before. They will also need to
> know the starting offset and length of their fragment of the
> combined file.
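
For discussion, the layout and the cloning idea above could be sketched roughly as follows. This is only an in-memory illustration under my own assumptions: the class CombinedSegmentSketch, its pack() and open() methods, and the header layout (component count, then name, offset and length per component) are hypothetical and are not Lucene code or a proposed on-disk format. A java.nio.ByteBuffer stands in for the Directory file object, with duplicate() and slice() playing the role of clone.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of a combined segment file; not Lucene's format. */
final class CombinedSegmentSketch {

    /** Packs named components into one array: header first, then data. */
    static byte[] pack(Map<String, byte[]> components) {
        // First pass: measure the header so data offsets can be computed.
        Map<String, byte[]> encodedNames = new LinkedHashMap<>();
        int headerSize = Integer.BYTES; // component count
        int dataSize = 0;
        for (Map.Entry<String, byte[]> e : components.entrySet()) {
            byte[] name = e.getKey().getBytes(StandardCharsets.UTF_8);
            encodedNames.put(e.getKey(), name);
            // name length + name bytes + start offset + length
            headerSize += Integer.BYTES + name.length + 2 * Integer.BYTES;
            dataSize += e.getValue().length;
        }
        ByteBuffer out = ByteBuffer.allocate(headerSize + dataSize);
        out.putInt(components.size());
        int offset = headerSize;
        for (Map.Entry<String, byte[]> e : components.entrySet()) {
            byte[] name = encodedNames.get(e.getKey());
            out.putInt(name.length);
            out.put(name);
            out.putInt(offset);
            out.putInt(e.getValue().length);
            offset += e.getValue().length;
        }
        // Second pass: append the component data in the same order.
        for (byte[] data : components.values()) {
            out.put(data);
        }
        return out.array();
    }

    /** Reads the header and returns one independent view per component. */
    static Map<String, ByteBuffer> open(byte[] combined) {
        ByteBuffer in = ByteBuffer.wrap(combined);
        int count = in.getInt();
        Map<String, ByteBuffer> views = new LinkedHashMap<>();
        for (int i = 0; i < count; i++) {
            byte[] name = new byte[in.getInt()];
            in.get(name);
            int offset = in.getInt();
            int length = in.getInt();
            // duplicate() shares the bytes but keeps its own position and
            // limit, much like a cloned file object in the proposal.
            ByteBuffer window = in.duplicate();
            window.limit(offset + length);
            window.position(offset);
            // slice() makes position 0 the start of this component.
            views.put(new String(name, StandardCharsets.UTF_8), window.slice());
        }
        return views;
    }
}
```

Reading from one view does not move the position of any other view, and each view ends at its own length, so code written against the individual files would not need to know that they now share one underlying file.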
> Some questions to solicit feedback:
> *) I don't know all of the classes that will need changes yet, but I 
> think this can be accomplished with moderate effort in the index and 
> maybe the store packages. Does this seem reasonable?
> *) I can't see any adverse effects of this change except possibly
> one. Since fewer OS file handles will be used, the way OS caching is
> applied to Lucene indexes will change. I know that Lucene relies on
> OS-level file caching for a good part of its performance magic, but I
> lack the right experience to know what effect the proposed change
> will have on the performance. There should be the same number of disk
> accesses overall, but obviously they will be concentrated in a single
> file and will be more spread out within it. The disk should not
> really thrash any more than before, since the same data will be read
> in the same order, just now it will be in a single file rather than
> in different files. However, if OS file caching is optimal only when
> a given file handle experiences sequential reads, this can be a
> problem. Can anyone shed some light on what we can expect with this
> change? I am primarily interested in Solaris and Windows (NT/2000) at
> this time, but I'd like to know of possible impact on other OSes as
> well.
> *) Given the above, is this a worthwhile idea? If not, can we modify
> it so as to limit the performance impact?
> Thanks for your consideration and feedback.
> Dmitry.
