[ https://issues.apache.org/jira/browse/LUCENE-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated LUCENE-1044:
---------------------------------------

    Attachment: LUCENE-1044.take6.patch

New rev of this patch. All tests pass. I think it's ready to commit, but I'll wait a few days for comments.

This patch makes a small change to the segments_N file: it adds a checksum to the end. I added ChecksumIndexInput/Output, which wrap an existing IndexInput/Output, for this. The checksum is used to verify that the file is "intact" before trusting its contents when opening the index. We need this to guard against the machine crashing after we've written segments_N but before we've succeeded in syncing it.

Unfortunately, in performance testing, I still see a sizable (~30-50%) hit to indexing throughput on Windows computers (XP Pro laptop & Win 2003 Server R64 computer). It seems that calling sync causes IO in other threads (ie flushing a new segment) to slow down drastically. Note that this happens only when autoCommit=true; if it's false then performance is only slightly worse (because we sync only on closing the writer).

So I tried sleeping after writing and before syncing. I sleep based on the number of bytes written, for up to 10 seconds, and amazingly, this greatly reduces the performance loss on the Windows computers and doesn't hurt performance on the Linux/OS X computers. I think this must be because calling sync immediately forces the OS to write dirty buffers to disk "in a rush" (severely impacting IO writes from other threads), whereas if you wait first, you let the OS schedule those writes on its own, at good times (maybe when the IO system is "relatively" idle).

It's disappointing to have to "game" the OS to gain back this performance. I wish Java had a "waitUntilSync'd" that did the same thing as fsync, but without "rushing" the OS.

On Linux 2.6.22 on a RAID5 array I still see a net performance cost of ~12%, sleeping or no sleeping. On Mac OS X it's a ~3% loss.
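To make the checksum idea above concrete, here is a minimal sketch in plain Java. It is not the patch's ChecksumIndexInput/Output. The class name ChecksumTrailer, the 8-byte trailer layout, and the choice of java.util.zip.CRC32 are all assumptions made for illustration; the message above says only that a checksum is appended to the end of segments_N and verified before the contents are trusted.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.zip.CRC32;

// Hypothetical helper for illustration only; NOT Lucene's ChecksumIndexInput/Output.
final class ChecksumTrailer {
    // Assumed layout: payload bytes followed by the CRC32 value stored as an 8-byte long.
    private static final int TRAILER_BYTES = Long.BYTES;

    private static long crc(byte[] data, int off, int len) {
        CRC32 crc = new CRC32();
        crc.update(data, off, len);
        return crc.getValue();
    }

    // Writes the payload followed by a checksum of the payload.
    static void write(Path file, byte[] payload) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(payload.length + TRAILER_BYTES);
        buf.put(payload);
        buf.putLong(crc(payload, 0, payload.length));
        Files.write(file, buf.array());
    }

    // Returns the payload only if the stored checksum matches the recomputed one.
    static byte[] readVerified(Path file) throws IOException {
        byte[] all = Files.readAllBytes(file);
        if (all.length < TRAILER_BYTES) {
            throw new IOException("file too short to hold a checksum: " + file);
        }
        int payloadLen = all.length - TRAILER_BYTES;
        long stored = ByteBuffer.wrap(all, payloadLen, TRAILER_BYTES).getLong();
        if (stored != crc(all, 0, payloadLen)) {
            throw new IOException("checksum mismatch, file is not intact: " + file);
        }
        return Arrays.copyOf(all, payloadLen);
    }
}
```

A reader that hits the IOException can decline to trust that file, which is the kind of guard the checksum is meant to provide when a crash lands between the write and the sync.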
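The sleep-before-sync approach described above could look roughly like the following sketch. The message gives only two facts: the sleep is based on the number of bytes written, and it is capped at 10 seconds. Everything else here is an assumption for illustration, including the class name ThrottledSync, the assumedBytesPerSecond parameter used to turn bytes into a pause, and the choice to sync through FileChannel.force.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch; the patch's actual sleep formula is not given in the message.
final class ThrottledSync {
    // The 10 second cap comes from the message above.
    static final long MAX_SLEEP_MS = 10_000;

    // Converts bytes written into a pause, using an assumed drain rate, capped at MAX_SLEEP_MS.
    static long sleepMillisFor(long bytesWritten, long assumedBytesPerSecond) {
        if (assumedBytesPerSecond <= 0) {
            throw new IllegalArgumentException("assumedBytesPerSecond must be positive");
        }
        double ms = Math.max(0L, bytesWritten) * 1000.0 / assumedBytesPerSecond;
        return (long) Math.min((double) MAX_SLEEP_MS, ms);
    }

    // Pauses first so the OS can schedule the writes itself, then forces the file to disk.
    static void syncAfterPause(Path file, long bytesWritten, long assumedBytesPerSecond)
            throws IOException, InterruptedException {
        Thread.sleep(sleepMillisFor(bytesWritten, assumedBytesPerSecond));
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.WRITE)) {
            ch.force(true); // flush file content and metadata to the storage device
        }
    }
}
```

The pause length is kept in a separate pure function so the cap and the scaling can be checked without actually sleeping.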
Other fixes:

  * DirectoryIndexReader's doCommit now also syncs

  * Improved logic on when we must sync-before-CFS: it's not necessary if the just-merged segments are not referenced by the last commit point (ie if they were all flushed during this writer session)

  * Created SegmentInfos.commit() method, which writes and then syncs the next segments_N file

  * Simplified sync() logic now that merge threads are stopped before writer is closed

  * Changed CMS.newMergeThread to name its threads

  * More test cases

  * Various other small fixes

Here are test details. I index first 200K Wikipedia docs with this alg:

  analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
  doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
  docs.file=/Volumes/External/lucene/wiki.txt
  doc.stored = true
  doc.term.vector = true
  doc.term.vector.offsets = true
  doc.term.vector.positions = true
  doc.maker.forever = false
  directory=FSDirectory

  { "BuildIndex"
    CreateIndex
    { "AddDocs" AddDoc > : 200000
    CloseIndex
  }

  RepSumByPref BuildIndex

Win2003 R64, JVM 1.6.0_03
  trunk: 523 sec
  patch: 547 sec (5% slower)

Win XP Pro, laptop hard drive, JVM 1.4.2_15-b02
  trunk: 1237 sec
  patch: 1278 sec (3% slower)

Linux ReiserFS on 6 drive RAID 5 array, JVM 1.5.0_08
  trunk: 483 sec
  patch: 539 sec (12% slower)

Mac OS X 10.4 4-drive RAID 0 array, JVM 1.5.0_13
  trunk: 268 sec
  patch: 276 sec (3% slower)

> Behavior on hard power shutdown
> -------------------------------
>
>                 Key: LUCENE-1044
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1044
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>         Environment: Windows Server 2003, Standard Edition, Sun Hotspot Java 1.5
>            Reporter: venkat rangan
>            Assignee: Michael McCandless
>             Fix For: 2.4
>
>         Attachments: FSyncPerfTest.java, LUCENE-1044.patch,
> LUCENE-1044.take2.patch, LUCENE-1044.take3.patch, LUCENE-1044.take4.patch,
> LUCENE-1044.take5.patch, LUCENE-1044.take6.patch
>
>
> When indexing a large number of documents, upon a hard power failure (e.g.
> pull the power cord), the index seems to get corrupted. We start a Java
> application as a Windows Service, and feed it documents. In some cases
> (after an index size of 1.7GB, with 30-40 index segment .cfs files), the
> following is observed.
> The 'segments' file contains only zeros. Its size is 265 bytes - all bytes
> are zeros.
> The 'deleted' file also contains only zeros. Its size is 85 bytes - all bytes
> are zeros.
> Before corruption, the segments file and deleted file appear to be correct.
> After this corruption, the index is corrupted and lost.
> This is a problem observed in Lucene 1.4.3. We are not able to upgrade our
> customer deployments to 1.9 or later version, but would be happy to back-port
> a patch, if the patch is small enough and if this problem is already solved.