[ http://issues.apache.org/jira/browse/LUCENE-624?page=all ]
Michael Busch closed LUCENE-624.
--------------------------------
Resolution: Won't Fix
Assignee: Michael Busch
I'm closing this issue, because:
- no votes or comments for almost half a year
- only indexing performance benefits slightly from this feature
- another config parameter in IndexWriter will probably confuse users more than
help them
> Segment size limit for compound files
> -------------------------------------
>
> Key: LUCENE-624
> URL: http://issues.apache.org/jira/browse/LUCENE-624
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Reporter: Michael Busch
> Assigned To: Michael Busch
> Priority: Minor
> Attachments: cfs_seg_size_limit.patch
>
>
> Hello everyone,
> I implemented an improvement targeting compound file usage. Compound files
> are used to decrease the number of index files, because operating systems
> can't handle too many open file descriptors. On the other hand, a
> disadvantage of the compound file format is its worse performance compared
> to multi-file indexes:
> http://www.gossamer-threads.com/lists/lucene/java-user/8950
> In the book "Lucene in Action" it's said that compound file format is about
> 5-10% slower than multi-file format.
> The patch I'm proposing here adds the ability to the IndexWriter to use
> compound format only for segments that do not contain more documents than a
> specific limit "CompoundFileSegmentSizeLimit", which the user can set.
> Because merges are exponential, a Lucene index usually contains only a few
> very big segments but many more small segments. The best performance is
> really only needed for the big segments, whereas slightly worse
> performance for small segments shouldn't play a big role in the overall
> search performance.
> Consider the following example:
> Index Size: 1,500,000
> Merge factor: 10
> Max buffered docs: 100
> Number of indexed fields: 10
> Max. OS file descriptors: 1024
> In the worst case, a non-optimized index could contain the following number
> of segments:
> 1 x 1,000,000
> 9 x 100,000
> 9 x 10,000
> 9 x 1,000
> 9 x 100
> That's 37 segments. A multi-file format index would have:
> 37 segments * (7 files per segment + 10 files for indexed fields) = 629 files
> ==> only about 2 open indexes per machine could be handled by the operating
> system
> A compound-file format index would have:
> 37 segments * 1 cfs file = 37 files ==> about 27 open indexes could be
> handled by the operating system, but performance would be 5-10% worse.
> A compound-file format index with CompoundFileSegmentSizeLimit = 1,000,000
> would have:
> 36 segments * 1 cfs file + 1 segment * (7 + 10 files) = 53 ==> about 20 open
> indexes could be handled by the OS
> The OS can now handle about 20 open indexes instead of just 2, while keeping
> multi-file format performance for the largest segment.
> I'm going to create diffs on the current HEAD and will attach the patch files
> soon. Please let me know what you think about this improvement.
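
The file-count arithmetic in the example above can be reproduced with a short sketch. This is not the attached patch: the class name, the method names, and the useCompoundFile predicate are hypothetical and only illustrate the idea. One assumption to note: the example counts the 1,000,000-document segment as multi-file when the limit is 1,000,000, so the sketch uses a strict comparison (docCount < limit); the actual patch may treat the boundary differently.

```java
public class CfsFileCountSketch {
    // Figures taken from the example in the issue description.
    static final int FILES_PER_SEGMENT = 7;        // fixed files per multi-file segment
    static final int INDEXED_FIELDS = 10;          // one extra file per indexed field
    static final int MAX_FILE_DESCRIPTORS = 1024;  // OS limit used in the example

    // Worst-case layout from the example: {documents per segment, number of segments}.
    static final int[][] SEGMENTS = {
        {1_000_000, 1}, {100_000, 9}, {10_000, 9}, {1_000, 9}, {100, 9}
    };

    // Hypothetical predicate: compound format only for segments below the limit.
    // The strict comparison is an assumption chosen to match the example's numbers.
    static boolean useCompoundFile(int docCount, int sizeLimit) {
        return docCount < sizeLimit;
    }

    // Total number of index files for a segment layout under a given size limit.
    static int countFiles(int[][] segments, int sizeLimit) {
        int files = 0;
        for (int[] segment : segments) {
            int docCount = segment[0];
            int segmentCount = segment[1];
            int filesPerSegment = useCompoundFile(docCount, sizeLimit)
                    ? 1                                   // a single .cfs file
                    : FILES_PER_SEGMENT + INDEXED_FIELDS; // 17 files in the example
            files += segmentCount * filesPerSegment;
        }
        return files;
    }

    // Prints the file count and the unrounded descriptors-per-index ratio.
    static void report(String label, int files) {
        System.out.printf("%-20s %4d files, %.1f indexes per %d descriptors%n",
                label, files, (double) MAX_FILE_DESCRIPTORS / files, MAX_FILE_DESCRIPTORS);
    }

    public static void main(String[] args) {
        report("multi-file only", countFiles(SEGMENTS, 0));               // limit 0: nothing is compound
        report("compound only", countFiles(SEGMENTS, Integer.MAX_VALUE)); // every segment is compound
        report("limit = 1,000,000", countFiles(SEGMENTS, 1_000_000));     // only the largest is multi-file
    }
}
```

The three calls correspond to the three cases in the example (629, 37, and 53 files). The ratios are printed unrounded, so they can be compared with the approximate figures quoted in the description.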