Re: potential indexing performance improvement for compound index - cut IO - have more files though
Doron Cohen wrote:
> Doug Cutting wrote:
> > > Therefore, a "semi compound" segment file can be defined, that would be
> > > made of 4 files (instead of 1):
> > > - File 0: .fdx .tis .tvx
> > > - File 1: .fdt .tii .tvd
> > > - File 2: .frq .tvf
> > > - File 3: .fnm .prx .fN
> >
> > I think this is a promising direction. Perhaps instead of adding a
> > third index format, we can significantly improve the non-compound format
> > without too much effort. For example, simply writing all the norms into
> > a single file could have a large impact on total file handles and would
> > be a rather simple change. We could start with that, then see if there
> > are further incremental improvements to be had.
>
> We can start with that - at least it would set the number of segment files
> to a fixed number - 11 - currently it depends on the number of fields with
> norms.

Okay, started with this step - see issue 756:
http://issues.apache.org/jira/browse/LUCENE-756

> One advantage of keeping a plain non-compound format is educational /
> debugging - it is often helpful to actually see the files being created on
> disk. (Although, just concatenating all norms to a single file is simple
> enough in that regard.)

- To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Re: potential indexing performance improvement for compound index - cut IO - have more files though
Some work on NIO-based FSDirectory has already been done. Some performance
info is included, too:
http://issues.apache.org/jira/browse/LUCENE-519
http://issues.apache.org/jira/browse/LUCENE-414

Otis

----- Original Message -----
From: Doug Cutting <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org
Sent: Sunday, December 17, 2006 2:31:42 PM
Subject: Re: potential indexing performance improvement for compound index - cut IO - have more files though

Doron Cohen wrote:
> Also, if nio proves to be faster in this scenario, it might make sense to
> keep current FSDirectory, and just add FSDirectoryNio implementation.

If nio isn't considerably slower for single-threaded applications, I'd vote
to simply switch FSDirectory to use nio, simplifying the public API by
reducing choices. But if classic io is faster for single-threaded apps, and
nio faster for multi-threaded, that would suggest adding a new, public,
nio-based Directory implementation.

Doug
Re: potential indexing performance improvement for compound index - cut IO - have more files though
I think the important issues are index size, stability and number of
concurrent readers.

We achieved the best performance by using a pool of file descriptors to a
segment so we could avoid the synchronization block, but this only worked
for large, relatively unchanging segments.

On Dec 18, 2006, at 2:51 PM, Doug Cutting wrote:
> robert engels wrote:
> > Using a shared FileChannel.pread actually performs a synchronization
> > under Windows.
>
> Sigh. Still, it'd be no worse than current FSDirectory on Windows.
>
> Doug
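Robert's pool-of-descriptors idea can be sketched roughly like this; the class and method names are hypothetical illustrations, not code from Lucene or the system he describes. Each thread borrows one of several pre-opened handles, so concurrent readers never contend on a single shared seek+read pair:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: a pool of open file handles to one segment file.
// Works best for large, rarely-changing segments, as noted in the mail,
// because every handle must be reopened when the file is replaced.
class DescriptorPool {
    private final BlockingQueue<RandomAccessFile> pool;

    DescriptorPool(String path, int size) throws IOException {
        pool = new ArrayBlockingQueue<RandomAccessFile>(size);
        for (int i = 0; i < size; i++) {
            pool.add(new RandomAccessFile(path, "r"));
        }
    }

    // Borrow a handle, read at an absolute position, return the handle.
    int read(long position, byte[] buf) throws IOException, InterruptedException {
        RandomAccessFile raf = pool.take();   // blocks if all handles are in use
        try {
            raf.seek(position);
            return raf.read(buf, 0, buf.length);
        } finally {
            pool.put(raf);
        }
    }
}
```

The trade-off is the one the thread is circling: more descriptors per segment in exchange for less lock contention.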
Re: potential indexing performance improvement for compound index - cut IO - have more files though
robert engels wrote:
> Using a shared FileChannel.pread actually performs a synchronization
> under Windows.

Sigh. Still, it'd be no worse than current FSDirectory on Windows.

Doug
Re: potential indexing performance improvement for compound index - cut IO - have more files though
A word of caution here... Using a shared FileChannel.pread actually
performs a synchronization under Windows. See JDK bug
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6265734

I submitted this, and it was verified using the supplied test case.

On Dec 17, 2006, at 1:31 PM, Doug Cutting wrote:
> Doron Cohen wrote:
> > Also, if nio proves to be faster in this scenario, it might make sense
> > to keep current FSDirectory, and just add FSDirectoryNio implementation.
>
> If nio isn't considerably slower for single-threaded applications, I'd
> vote to simply switch FSDirectory to use nio, simplifying the public API
> by reducing choices. But if classic io is faster for single-threaded
> apps, and nio faster for multi-threaded, that would suggest adding a new,
> public, nio-based Directory implementation.
>
> Doug
Re: potential indexing performance improvement for compound index - cut IO - have more files though
Doron Cohen wrote:
> Also, if nio proves to be faster in this scenario, it might make sense to
> keep current FSDirectory, and just add FSDirectoryNio implementation.

If nio isn't considerably slower for single-threaded applications, I'd vote
to simply switch FSDirectory to use nio, simplifying the public API by
reducing choices. But if classic io is faster for single-threaded apps, and
nio faster for multi-threaded, that would suggest adding a new, public,
nio-based Directory implementation.

Doug
Re: potential indexing performance improvement for compound index - cut IO - have more files though
Doug Cutting wrote:
> > Therefore, a "semi compound" segment file can be defined, that would be
> > made of 4 files (instead of 1):
> > - File 0: .fdx .tis .tvx
> > - File 1: .fdt .tii .tvd
> > - File 2: .frq .tvf
> > - File 3: .fnm .prx .fN
>
> I think this is a promising direction. Perhaps instead of adding a
> third index format, we can significantly improve the non-compound format
> without too much effort. For example, simply writing all the norms into
> a single file could have a large impact on total file handles and would
> be a rather simple change. We could start with that, then see if there
> are further incremental improvements to be had.

We can start with that - at least it would set the number of segment files
to a fixed number - 11 - currently it depends on the number of fields with
norms.

One advantage of keeping a plain non-compound format is educational /
debugging - it is often helpful to actually see the files being created on
disk. (Although, just concatenating all norms to a single file is simple
enough in that regard.)
Re: potential indexing performance improvement for compound index - cut IO - have more files though
Doug Cutting wrote:
> Doug Cutting wrote:
> > Yes. On 32-bit systems with indexes larger than 1GB or so, memory
> > mapping is impractical, so synchronization is required around shared
> > file handles (using Java's classic i/o APIs, w/o pread). The
> > non-compound format, with more files, has fewer synchronization
> > bottlenecks. One could of course achieve the same improvements in other
> > ways, e.g., by pooling multiple IndexReaders per index, but in straight
> > A-to-B comparisons, folks see better throughput with non-compound
> > indexes for multi-threaded applications.
>
> On second thought, a good fix for this might be to simply convert
> FSDirectory to use nio's pread support, eliminating file handle
> synchronization even when mmap isn't used.

Comparing the two for a small index (100,000 docs of the Reuters
collection, index size 170MB) showed no evident search performance
advantage for non-compound. For 300 parallel searches with traversing of
docs, compound was faster. But this is a small index, not in the 1GB range,
and search was fast anyhow.

I think it would make sense to first verify the advantage of nio over io in
this multi-reading scenario with a synthetic scenario. Also, if nio proves
to be faster in this scenario, it might make sense to keep the current
FSDirectory, and just add an FSDirectoryNio implementation.
Re: potential indexing performance improvement for compound index - cut IO - have more files though
Doug Cutting wrote:
> Yes. On 32-bit systems with indexes larger than 1GB or so, memory
> mapping is impractical, so synchronization is required around shared
> file handles (using Java's classic i/o APIs, w/o pread). The
> non-compound format, with more files, has fewer synchronization
> bottlenecks. One could of course achieve the same improvements in other
> ways, e.g., by pooling multiple IndexReaders per index, but in straight
> A-to-B comparisons, folks see better throughput with non-compound
> indexes for multi-threaded applications.

On second thought, a good fix for this might be to simply convert
FSDirectory to use nio's pread support, eliminating file handle
synchronization even when mmap isn't used.

Doug
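The pread approach Doug mentions relies on FileChannel's positional read, which takes an explicit file offset instead of a shared file pointer, so no seek+read lock is needed. A minimal sketch of the idea (helper class and name are illustrative, not Lucene API; and, as noted elsewhere in the thread, Windows JVMs of the era still synchronized internally):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

// Sketch: a positional read that never touches the channel's file pointer,
// so multiple threads can share one open FileChannel without external locks.
class PositionalRead {
    static int pread(FileChannel ch, long position, byte[] buf) throws IOException {
        ByteBuffer bb = ByteBuffer.wrap(buf);
        int total = 0;
        while (bb.hasRemaining()) {
            // read(dst, position) is stateless w.r.t. the channel's position
            int n = ch.read(bb, position + total);
            if (n < 0) break;                  // hit end of file
            total += n;
        }
        return total;
    }
}
```

Contrast with classic java.io, where RandomAccessFile.seek followed by read mutates shared state and therefore must be wrapped in a synchronized block when the handle is shared.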
Re: potential indexing performance improvement for compound index - cut IO - have more files though
Doug Cutting wrote:
> I'm not yet convinced that the costs of this mid-point justify its
> benefits.

That was too negative. Let me try a more positive angle.

Doron Cohen wrote:
> Therefore, a "semi compound" segment file can be defined, that would be
> made of 4 files (instead of 1):
> - File 0: .fdx .tis .tvx
> - File 1: .fdt .tii .tvd
> - File 2: .frq .tvf
> - File 3: .fnm .prx .fN

I think this is a promising direction. Perhaps instead of adding a third
index format, we can significantly improve the non-compound format without
too much effort. For example, simply writing all the norms into a single
file could have a large impact on total file handles and would be a rather
simple change. We could start with that, then see if there are further
incremental improvements to be had.

Doug
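The single-norms-file idea can be sketched as follows. This is an assumption-laden illustration, not the actual Lucene patch: it assumes exactly one norm byte per document per field, so field k's slice simply starts at offset k * docCount, and all class and method names are hypothetical.

```java
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.RandomAccessFile;

// Sketch: concatenate each field's norm bytes into one file instead of
// one .fN file per field - one open handle regardless of field count.
class SingleNormsFile {
    static void writeNorms(File out, byte[][] normsPerField) throws IOException {
        OutputStream os = new BufferedOutputStream(new FileOutputStream(out));
        try {
            for (byte[] fieldNorms : normsPerField) {
                os.write(fieldNorms);           // field k starts at k * docCount
            }
        } finally {
            os.close();
        }
    }

    static byte[] readNorms(File in, int field, int docCount) throws IOException {
        RandomAccessFile raf = new RandomAccessFile(in, "r");
        try {
            raf.seek((long) field * docCount);  // jump to this field's slice
            byte[] norms = new byte[docCount];
            raf.readFully(norms);
            return norms;
        } finally {
            raf.close();
        }
    }
}
```

This captures why the change is cheap: norms are fixed-width, so no per-field index structure is needed, only a field-to-slot numbering.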
Re: potential indexing performance improvement for compound index - cut IO - have more files though
Marvin Humphrey wrote:
> Out of curiosity, does the non-compound format yield any search-time
> benefits?

Yes. On 32-bit systems with indexes larger than 1GB or so, memory mapping
is impractical, so synchronization is required around shared file handles
(using Java's classic i/o APIs, w/o pread). The non-compound format, with
more files, has fewer synchronization bottlenecks. One could of course
achieve the same improvements in other ways, e.g., by pooling multiple
IndexReaders per index, but in straight A-to-B comparisons, folks see
better throughput with non-compound indexes for multi-threaded
applications.

Doug
Re: potential indexing performance improvement for compound index - cut IO - have more files though
On Dec 15, 2006, at 2:04 PM, Otis Gospodnetic wrote:
> I think Doron is right on the money here. I know one "customer" who'd be
> happy to trade its file descriptors for less IO - simpy.com. It's exactly
> what Doron describes - a busy system with a LOT of indices. File
> descriptors are kept under control by carefully closing IndexSearchers,
> plus I can always increase the max open-files limit. What I can't easily
> increase is the disk IO. Sure, I could go from CFS to the multi-file
> format, but it would be nice to have that third, middle ground choice.

Out of curiosity, does the non-compound format yield any search-time
benefits? I would think that would be the case only if the system-level
file stream feeding the buffered IndexInput objects were maintaining its
own (unnecessary) buffers.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Re: potential indexing performance improvement for compound index - cut IO - have more files though
Otis Gospodnetic wrote:
> I think Doron is right on the money here. I know one "customer" who'd be
> happy to trade its file descriptors for less IO - simpy.com. It's exactly
> what Doron describes - a busy system with a LOT of indices. File
> descriptors are kept under control by carefully closing IndexSearchers,
> plus I can always increase the max open-files limit. What I can't easily
> increase is the disk IO. Sure, I could go from CFS to the multi-file
> format, but it would be nice to have that third, middle ground choice.

The problem is that adding that middle ground isn't free: it will
complicate the code and make it harder to maintain and evolve. If you have
good control over file handles, then non-compound format should work just
fine, no?

I'm not yet convinced that the costs of this mid-point justify its
benefits. Perhaps the changes are simpler than I imagine. Perhaps it can
be done very simply and elegantly with little impact on the code. If so,
then my concerns will be reduced.

Doug
Re: potential indexing performance improvement for compound index - cut IO - have more files though
I think Doron is right on the money here. I know one "customer" who'd be
happy to trade its file descriptors for less IO - simpy.com. It's exactly
what Doron describes - a busy system with a LOT of indices. File
descriptors are kept under control by carefully closing IndexSearchers,
plus I can always increase the max open-files limit. What I can't easily
increase is the disk IO. Sure, I could go from CFS to the multi-file
format, but it would be nice to have that third, middle ground choice.

Otis

----- Original Message -----
From: Doron Cohen <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org
Sent: Friday, December 15, 2006 2:55:41 PM
Subject: Re: potential indexing performance improvement for compound index - cut IO - have more files though

"Mike Klaas" <[EMAIL PROTECTED]> wrote:
> My main comment is that the benefits of this change can be achieved by
> using the non-compound index format. For people that care about the
> difference in performance, it isn't difficult to configure your system
> to mitigate the problems of the non-compound format, and they probably
> have already done so.
>
> It would help the people who are file-descriptor conscious, but it
> also increases lucene's fd footprint by a factor of four.

That's right - people worried about indexing performance can easily apply
setUseCompound(false). My guess though is that most people just keep the
default setting.

Large systems that maintain many indexes would be worried about the number
of file descriptors and would use compound format. But it is not clear to
me what would be the preference in such systems - four times the file
descriptors, or twice as much IO? If such a third choice is supported -
"semi compound" - how many systems would {be able to / choose to} use it?
Depending on the specific system, maybe.

I verified the IO factor by counting bytes read in
FSIndexInput.readInternal(byte[],int,int) and written in
FSIndexOutput.flushBuffer(byte[],int):

round  vect   stor   cmpnd  runCnt  recsPerRun  rec/s  elapsedSec  write  read
    0  true   true   true        1          10  153.4      651.74   2 GB  1.9 GB
    1  true   true   false       1          10  169.5      589.82   1 GB  0.9 GB
    2  false  false  true        1          10  151.4      660.41   2 GB  1.9 GB
    3  false  false  false       1          10  168.0      595.37   1 GB  0.9 GB

Indeed, there is a factor of two for both read bytes and written bytes.
Re: potential indexing performance improvement for compound index - cut IO - have more files though
"Mike Klaas" <[EMAIL PROTECTED]> wrote:
> My main comment is that the benefits of this change can be achieved by
> using the non-compound index format. For people that care about the
> difference in performance, it isn't difficult to configure your system
> to mitigate the problems of the non-compound format, and they probably
> have already done so.
>
> It would help the people who are file-descriptor conscious, but it
> also increases lucene's fd footprint by a factor of four.

That's right - people worried about indexing performance can easily apply
setUseCompound(false). My guess though is that most people just keep the
default setting.

Large systems that maintain many indexes would be worried about the number
of file descriptors and would use compound format. But it is not clear to
me what would be the preference in such systems - four times the file
descriptors, or twice as much IO? If such a third choice is supported -
"semi compound" - how many systems would {be able to / choose to} use it?
Depending on the specific system, maybe.

I verified the IO factor by counting bytes read in
FSIndexInput.readInternal(byte[],int,int) and written in
FSIndexOutput.flushBuffer(byte[],int):

round  vect   stor   cmpnd  runCnt  recsPerRun  rec/s  elapsedSec  write  read
    0  true   true   true        1          10  153.4      651.74   2 GB  1.9 GB
    1  true   true   false       1          10  169.5      589.82   1 GB  0.9 GB
    2  false  false  true        1          10  151.4      660.41   2 GB  1.9 GB
    3  false  false  false       1          10  168.0      595.37   1 GB  0.9 GB

Indeed, there is a factor of two for both read bytes and written bytes.
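The byte-counting instrumentation described above could look roughly like this as a standalone tally; the hook method names are hypothetical, since the actual measurement patched FSIndexInput.readInternal and FSIndexOutput.flushBuffer directly:

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch: global counters incremented wherever the directory implementation
// moves bytes, so total index IO can be reported after a run.
class IoCounters {
    static final AtomicLong bytesRead = new AtomicLong();
    static final AtomicLong bytesWritten = new AtomicLong();

    // would be called from readInternal(byte[] b, int offset, int len)
    static void countRead(int len)  { bytesRead.addAndGet(len); }

    // would be called from flushBuffer(byte[] b, int len)
    static void countWrite(int len) { bytesWritten.addAndGet(len); }
}
```

AtomicLong keeps the tallies correct when multiple merge or search threads read and write concurrently, without adding another synchronized block to the IO path.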
Re: potential indexing performance improvement for compound index - cut IO - have more files though
On 12/14/06, Doron Cohen <[EMAIL PROTECTED]> wrote:
> But anyhow, this is not a negligible difference, and for real large
> indexes, and busy systems, when the just written non-compound segment is
> not in the system caches, it might have more effect. Possibly, search
> performance during indexing would be improved by less indexing IO. Also,
> delay for addDocument call that triggers a merge should become smaller.
>
> Thanks for your comments, also (but not only) on (1) and (3) above.

My main comment is that the benefits of this change can be achieved by
using the non-compound index format. For people that care about the
difference in performance, it isn't difficult to configure your system to
mitigate the problems of the non-compound format, and they probably have
already done so.

It would help the people who are file-descriptor conscious, but it also
increases lucene's fd footprint by a factor of four.

-Mike
potential indexing performance improvement for compound index - cut IO - have more files though
Hi, I would like to propose and get feedback on a potential indexing
performance improvement for the case that compound file is used (this is
the default).

In compound segment mode, each merge operation is ended by writing a
compound file. To be more precise, the merge result is first written to the
directory as a non-compound segment file, and then it is 'converted' into a
compound segment file. This conversion involves reading the entire
(non-compound) segment, and writing it as a compound segment file. This
means that compound mode indexing does twice as much index writing compared
to non-compound mode (and there's also the reading of the non-compound
segment).

The reason for this two-step process in writing compound segment files is
that per-segment files cannot be written sequentially, one by one - several
files are created together, written interleaved. But I think that there is
an intermediate state - between one-compound-segment-file and
non-compound-many-files. To my understanding, at merge time, the following
apply:
- .fnm - field infos - independent of other files.
- .fdx .fdt - stored fields - interleaved with each other, independent of
  other files.
- .tis .tii .frq .prx - dictionary and postings - interleaved with each
  other, independent of other files.
- .tvx .tvd .tvf - term vectors - interleaved with each other, independent
  of other files.
- .fN - norms - all these files written sequentially, independent of other
  files.

Therefore, a "semi compound" segment file can be defined, that would be
made of 4 files (instead of 1):
- File 0: .fdx .tis .tvx
- File 1: .fdt .tii .tvd
- File 2: .frq .tvf
- File 3: .fnm .prx .fN

A merge should be able to write this segment representation at once - no
need to read and write again.

A few questions:
(1) Is this correct at all, or have I overlooked something?
(2) What performance gain would that buy?
(3) Is it reasonable to have 4 files per segment compared to 1 file per
segment?
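The proposed four-way grouping can be expressed as a small lookup table mapping each per-segment extension to its "semi compound" file number. The bucket assignment is exactly the one proposed above; the class itself is purely illustrative:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: assign each per-segment extension to one of four files such that
// extensions written interleaved during a merge never share a file.
class SemiCompoundLayout {
    static final Map<String, Integer> FILE_OF = new HashMap<String, Integer>();
    static {
        for (String ext : new String[] {"fdx", "tis", "tvx"}) FILE_OF.put(ext, 0);
        for (String ext : new String[] {"fdt", "tii", "tvd"}) FILE_OF.put(ext, 1);
        for (String ext : new String[] {"frq", "tvf"})        FILE_OF.put(ext, 2);
        for (String ext : new String[] {"fnm", "prx"})        FILE_OF.put(ext, 3);
    }

    static int fileFor(String ext) {
        if (ext.matches("f\\d+")) return 3;   // .fN norms files (one per field)
        Integer file = FILE_OF.get(ext);
        return file == null ? -1 : file;      // -1: not a known segment extension
    }
}
```

Any merge writer could then consult fileFor() to pick the output stream for each logical file, writing all four outputs in one pass.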
For (2), the indexing performance of non-compound is an upper bound. I
compared indexing speeds of compound and non-compound, using the Reuters
input set. Tried with stored+vectors, and without stored fields:

round  vect   stor   cmpnd  runCnt  recsPerRun  rec/s  elapsedSec
    0  true   true   true        1       21578  150.2      143.69
    1  true   true   false       1       21578  178.9      120.58
    2  false  false  true        1       21578  164.7      131.03
    3  false  false  false       1       21578  184.3      117.07

This is a 19% speed-up with stored+vectors, and a 12% speed-up with no
stored fields. As a side comment, it says something about IO vs. CPU in
Lucene indexing that cutting half (I think) the file output speeds things
up by less than 20%.

But anyhow, this is not a negligible difference, and for real large
indexes, and busy systems, when the just-written non-compound segment is
not in the system caches, it might have more effect. Possibly, search
performance during indexing would be improved by less indexing IO. Also,
the delay for an addDocument call that triggers a merge should become
smaller.

Thanks for your comments, also (but not only) on (1) and (3) above.

Doron

- To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
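As a quick sanity check, the quoted percentages follow from the rec/s column of the table above (the helper name is illustrative):

```java
// Sketch: relative speed-up of non-compound over compound indexing,
// rounded to whole percent, computed from records-per-second rates.
class SpeedupCheck {
    static long speedupPercent(double nonCompoundRecsPerSec, double compoundRecsPerSec) {
        return Math.round(100.0 * (nonCompoundRecsPerSec - compoundRecsPerSec)
                          / compoundRecsPerSec);
    }
}
```

With the table's rates: rounds 0 vs 1 give (178.9 - 150.2) / 150.2, about 19%, and rounds 2 vs 3 give (184.3 - 164.7) / 164.7, about 12%, matching the figures stated in the mail.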