Hello Dmitry,

This would be a nice, juicy contribution. Your outline of the changes sounds good, but unfortunately I do not know whether or how this will affect the performance of Lucene. Moreover, we don't even have any tools to measure the performance before and after the changes. :(
Regardless, I am looking forward to your changes, and this time I'll make sure we do a better job with them than we did with your Term Vector patches (more about that in a separate email).

Otis

--- Dmitry Serebrennikov <[EMAIL PROTECTED]> wrote:
> Greetings, Luceeners!
>
> Looks like lots of good stuff is happening with the code as of late.
> It's great to see this momentum!
> Here's some more action coming your way...
>
> ---------------------------------------------------------
> We all love Lucene, but most would agree that it tends to use a very
> large number of file handles.
> This is especially true for applications that have one or more of the
> factors below:
>   a) use a merged index over a large number of indexes
>   b) experience index updates concurrent with searching
>   c) search through unoptimized indexes
>   d) use high merge factor settings to speed up indexing
>   e) have a large number of indexed fields
>
> For a long time I've been contemplating an idea that can help
> drastically reduce the number of file handles needed by Lucene. Now I am
> finally going to get a few days to make this happen (pending final
> approval by the powers that be). So, I wanted to put out the general
> plan of action and seek community comment on it early on. Over the next
> day or so, I intend to implement the changes outlined below (unless, of
> course, I get responses that steer me in a different direction). As I get
> more solid results (down to the diffs), I'll post them for further
> review. But the sooner I get feedback, the more chance there is that I
> will actually be able to incorporate it. Hopefully, this will result in
> a set of patches that will solve the problems I am after, and be useful
> enough to the general Lucene population to be included into the tree.
>
> So, here goes.
>
> Lucene's indexes are built out of segments. Each segment consists of a
> number of files, which are written when the segment is created during
> indexing.
> Once the IndexWriter is closed, the segment files are never modified
> (except the file that lists deleted documents, if any).
>
> The proposed change is as follows:
>   - add code to the IndexWriter.close() method to combine all of the
>     segment's files into a single file with a header that indicates the
>     start offset and length of each of the new file's components,
>     corresponding to the individual files in the current segments. This
>     will be done in such a way that the file will be able to contain any
>     number of components - this way we can support evolution of the
>     segment structure in the future. The deleted documents file will
>     remain separate.
>   - add a new segment reader, or add code to the existing one, to work
>     with these types of segments
>   - when this new segment reader opens its files, it can open one file
>     object from the Directory for the combined file and then clone it for
>     each of the files formerly in the segment. Each cloned file object
>     would maintain its own position into the combined file and would have
>     its own buffer, as they did before. They will also need to know the
>     starting offset and length of their fragment of the combined file.
>
> Some questions to solicit feedback:
> *) I don't know all of the classes that will need changes yet, but I
> think this can be accomplished with moderate effort in the index and
> maybe the store packages. Does this seem reasonable?
> *) I can't see any adverse effects of this change except possibly one.
> Since fewer OS file handles will be used, the way OS caching is applied
> to Lucene indexes will change. I know that Lucene relies on OS-level
> file caching for a good part of its performance magic, but I lack the
> right experience to know what effect the proposed change will have on
> the performance. There should be the same number of disk accesses
> overall, but obviously they will be concentrated in a single file and
> will be more spread out.
> The disk should not really thrash any more than
> before, since the same data will be read in the same order, just now it
> will be in a single file rather than in different files. However, if OS
> file caching is optimal only when a given file handle experiences
> sequential reads, this can be a problem. Can anyone shed some light on
> what we can expect with this change? I am primarily interested in
> Solaris and Windows (NT/2000) at this time, but I'd like to know of
> possible impact on other OSes as well.
> *) Given the above, is this a worthwhile idea? If not, can we modify it
> so as to limit the performance impact?
>
> Thanks for your consideration and feedback.
> Dmitry.
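
For reference, the combined-file idea in the quoted proposal (one file, a header giving a start offset and length per component, and a cloned reader per component that keeps its own position) could be sketched roughly as below. This is only an illustration under stated assumptions, not Lucene code: the class name `CompoundSegmentSketch`, the `pack` and `open` methods, and the header layout are all hypothetical, and an in-memory `ByteBuffer` stands in for the single on-disk file, with `duplicate()` playing the role of cloning the file object.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of a combined segment file; not part of Lucene. */
final class CompoundSegmentSketch {

    /** Assumed layout: [count] then [nameLen, name, offset, length] per component, then data. */
    static ByteBuffer pack(Map<String, byte[]> components) {
        // First pass: size the header so every data offset is known up front.
        Map<String, byte[]> encodedNames = new LinkedHashMap<>();
        int headerSize = Integer.BYTES; // component count
        int dataSize = 0;
        for (Map.Entry<String, byte[]> e : components.entrySet()) {
            byte[] name = e.getKey().getBytes(StandardCharsets.UTF_8);
            encodedNames.put(e.getKey(), name);
            headerSize += Integer.BYTES + name.length + 2 * Long.BYTES;
            dataSize += e.getValue().length;
        }

        ByteBuffer out = ByteBuffer.allocate(headerSize + dataSize);
        out.putInt(components.size());
        long offset = headerSize;
        for (Map.Entry<String, byte[]> e : components.entrySet()) {
            byte[] name = encodedNames.get(e.getKey());
            out.putInt(name.length).put(name);
            out.putLong(offset).putLong(e.getValue().length);
            offset += e.getValue().length;
        }
        // Second pass: append the component data in the same order as the header.
        for (byte[] data : components.values()) {
            out.put(data);
        }
        out.flip();
        return out;
    }

    /** Reads the header and returns one independent view per component. */
    static Map<String, ByteBuffer> open(ByteBuffer combined) {
        ByteBuffer header = combined.duplicate(); // shared bytes, separate position
        header.position(0);
        int count = header.getInt();
        Map<String, ByteBuffer> views = new LinkedHashMap<>();
        for (int i = 0; i < count; i++) {
            byte[] name = new byte[header.getInt()];
            header.get(name);
            int offset = Math.toIntExact(header.getLong());
            int length = Math.toIntExact(header.getLong());
            // "Clone" the one underlying object and restrict it to this fragment.
            ByteBuffer view = combined.duplicate();
            view.position(offset);
            view.limit(offset + length);
            views.put(new String(name, StandardCharsets.UTF_8), view.slice());
        }
        return views;
    }

    public static void main(String[] args) {
        Map<String, byte[]> parts = new LinkedHashMap<>();
        parts.put("_1.fnm", new byte[] {1, 2, 3}); // illustrative component names
        parts.put("_1.frq", new byte[] {4, 5});
        Map<String, ByteBuffer> views = open(pack(parts));
        views.forEach((name, view) ->
                System.out.println(name + ": " + view.remaining() + " bytes"));
    }
}
```

Because the header stores a name, offset, and length for each entry, the file can hold any number of components, which is the extensibility the proposal asks for. A real implementation would wrap a single open file rather than a buffer, and the deleted-documents file would stay outside the combined file.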