[jira] Created: (LUCENE-665) temporary file access denied on Windows
temporary file access denied on Windows
---------------------------------------

                 Key: LUCENE-665
                 URL: http://issues.apache.org/jira/browse/LUCENE-665
             Project: Lucene - Java
          Issue Type: Bug
          Components: Store
    Affects Versions: 2.0.0
         Environment: Windows
            Reporter: Doron Cohen
         Attachments: FSDirectory_Retry_Logic.patch, Test_Output.txt, TestInterleavedAddAndRemoves.java

When interleaving adds and removes there is frequent opening/closing of readers and writers. I tried to measure performance in such a scenario (for issue 565), but the performance test failed - the indexing process crashed consistently with file access denied errors: cannot create a lock file in lockFile.createNewFile(), and cannot rename file.

This is related to:
- issue 516 (a closed issue: TestFSDirectory fails on Windows) - http://issues.apache.org/jira/browse/LUCENE-516
- user list questions about file errors:
  - http://www.nabble.com/OutOfMemory-and-IOException-Access-Denied-errors-tf1649795.html
  - http://www.nabble.com/running-a-lucene-indexing-app-as-a-windows-service-on-xp%2C-crashing-tf2053536.html
- the discussion on lock-less commits - http://www.nabble.com/Lock-less-commits-tf2126935.html

My test setup is XP (SP1), Java 1.5 - both Sun and IBM SDKs. I noticed that the problem is more frequent when locks are created on one disk and the index on another. Both are NTFS with the Windows indexing service enabled. I suspect this indexing service might be involved - keeping files busy for a while - but I don't know for sure.

After experimenting with it I conclude that these problems - at least in my scenario - are due to a temporary situation: the FS, or the OS, is *temporarily* holding references to files or folders, preventing them from being renamed or deleted, and preventing new files from being created in certain directories.

So I added retry logic to FSDirectory for the cases where the error was related to Access Denied. This is the same approach described in http://www.nabble.com/running-a-lucene-indexing-app-as-a-windows-service-on-xp%2C-crashing-tf2053536.html - there, in addition to the retry, gc() is invoked (I did not call gc()). It is based on the *hope* that an access-denied situation will vanish after a small delay, so that the retry succeeds. I modified FSDirectory this way for Access Denied errors when creating a new file and when renaming a file.

This worked fine for me. The performance test that failed before now manages to complete. There should be no performance implications from this modification, because only the cases that would otherwise wrongly fail now delay a few extra millis and retry.

I am attaching a patch - FSDirectory_Retry_Logic.patch - with these changes to FSDirectory. All "ant test" tests pass with this patch. I am also attaching a test case that demonstrates the problem - at least on my machine. There are two test cases in that test file - one that works in the system temp directory (like most Lucene tests) and one that creates the index on a different disk. The latter case can only run if the path (D: , tmp) is valid.

It would be great if people who experienced these problems could try out this patch and comment on whether it made any difference for them. If it turns out useful for others as well, including this patch in the code might help to relieve some of those frustrating user cases.

A comment on the state of the proposed patch:
- It is not ready-to-deploy code - it has some debug printing, showing the cases where the retry logic actually took place.
- I am not sure the current 30ms is the right delay... why not 50ms? 10ms? This is currently defined by a constant.
- Should a call to gc() be added? (I think not.)
- Should the retry also be attempted on non-access-denied exceptions? (I think not.)
- I feel it is somewhat voodoo programming, and though I don't like it, it seems to work...

Attached files:
1. TestInterleavedAddAndRemoves.java - the LONG test that fails on XP without the patch and passes with the patch.
2. FSDirectory_Retry_Logic.patch
3. Test_Output.txt - output of the test with the patch, on my XP. Only the createNewFile() case had to be bypassed in this test, but in another program I also saw renameFile() being bypassed.

- Doron
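[Editorial note] For readers skimming the thread, here is a minimal sketch of the retry idea described above. It is an illustration only, not the code in FSDirectory_Retry_Logic.patch: the retry count, the delay value, and the check on the exception message are assumptions (and a later comment in this thread recommends not relying on the exception message text at all).

```java
import java.io.File;
import java.io.IOException;

class RetrySketch {
    // Assumed values; the patch uses a constant around 30ms and its own retry bound.
    private static final int RETRY_DELAY_MS = 30;
    private static final int MAX_RETRIES = 10;

    /** Try to create a new file, retrying when Windows temporarily denies access. */
    static boolean createNewFileWithRetry(File f) throws IOException {
        IOException last = null;
        for (int i = 0; i < MAX_RETRIES; i++) {
            try {
                return f.createNewFile();
            } catch (IOException e) {
                // Only retry errors that look like the transient "Access is denied" case.
                if (e.getMessage() == null || e.getMessage().indexOf("Access is denied") < 0) {
                    throw e;
                }
                last = e;
                try {
                    Thread.sleep(RETRY_DELAY_MS);
                } catch (InterruptedException ie) {
                    throw new IOException("interrupted while retrying createNewFile: " + ie);
                }
            }
        }
        throw last; // still failing after all retries - give up with the last error
    }
}
```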
[jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
[ http://issues.apache.org/jira/browse/LUCENE-565?page=all ]

Doron Cohen updated LUCENE-565:
-------------------------------
    Attachment: TestBufferedDeletesPerf.java
                perf-test-res.JPG
                perfres.log

I ran a performance test for interleaved adds and removes and compared IndexModifier with NewIndexModifier. A few setups were tested, with several combinations of consecutive adds before a delete takes place, maxBufferedDocs, and the number of total test iterations, where each iteration does the consecutive adds and then the deletes. Each setup ran in this order - original IndexModifier, new one, original, new one - and the best time of the two runs was used.

Results indicate that NewIndexModifier is far faster for most setups. Attached are the performance test, the performance results, and the log of the run. The performance test is written as a JUnit test, and it fails if the original IndexModifier is faster than the new one by more than 1 second (a difference smaller than 1 second is considered noise).

The test was run on XP (SP1) with the IBM JDK 1.5. The test initially failed with access denied errors due to what seems to be an XP issue, so in order to run this test on XP (and probably other Windows platforms) the patch from http://issues.apache.org/jira/browse/LUCENE-665 should be applied first.

It is interesting to note that, in addition to the performance gain, NewIndexModifier seems less sensitive to the access denied XP problems, because it closes/reopens readers and writers less frequently; indeed, at least in my runs, these errors had to be bypassed (by the retry patch) only for the current index modifier.

- Doron

Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
----------------------------------------------------------------------------------

                 Key: LUCENE-565
                 URL: http://issues.apache.org/jira/browse/LUCENE-565
             Project: Lucene - Java
          Issue Type: Bug
          Components: Index
            Reporter: Ning Li
         Attachments: IndexWriter.java, IndexWriter.July09.patch, IndexWriter.patch, NewIndexModifier.July09.patch, NewIndexWriter.Aug23.patch, NewIndexWriter.July18.patch, perf-test-res.JPG, perfres.log, TestBufferedDeletesPerf.java, TestWriterDelete.java

Today, applications have to open/close an IndexWriter and open/close an IndexReader directly or indirectly (via IndexModifier) in order to handle a mix of inserts and deletes. This performs well when inserts and deletes come in fairly large batches. However, performance can degrade dramatically when inserts and deletes are interleaved in small batches. This is because the ramDirectory is flushed to disk whenever an IndexWriter is closed, causing a lot of small segments to be created on disk, which eventually need to be merged.

We would like to propose a small API change to eliminate this problem. We are aware that this kind of change has come up in discussions before. See http://www.gossamer-threads.com/lists/lucene/java-dev/23049?search_string=indexwriter%20delete;#23049 . The difference this time is that we have implemented the change and tested its performance, as described below.

API Changes
-----------
We propose adding a deleteDocuments(Term term) method to IndexWriter. Using this method, inserts and deletes can be interleaved using the same IndexWriter. Note that, with this change, it would be very easy to add another method to IndexWriter for updating documents, allowing applications to avoid a separate delete and insert to update a document. Also note that this change can co-exist with the existing APIs for deleting documents using an IndexReader. But if our proposal is accepted, we think those APIs should probably be deprecated.
Coding Changes
--------------
Coding changes are localized to IndexWriter. Internally, the new deleteDocuments() method works by buffering the terms to be deleted. Deletes are deferred until the ramDirectory is flushed to disk, either because it becomes full or because the IndexWriter is closed. Using Java synchronization, care is taken to ensure that an interleaved sequence of inserts and deletes for the same document is properly serialized. We have attached a modified version of IndexWriter from Release 1.9.1 with these changes. Only a few hundred lines of coding changes are needed. All changes are commented with CHANGE. We have also attached a modified version of an example from Chapter 2.2 of Lucene in Action.

Performance Results
-------------------
To test the performance of our proposed changes, we ran some experiments using the TREC WT 10G dataset. The experiments were run on a dual 2.4 GHz Intel Xeon server running Linux. The disk storage was configured as a RAID0 array with 5 drives. Before indexes were built, the input documents were parsed to remove the HTML from them (i.e., only the text was indexed). This was done to minimize the impact of parsing on performance. A simple WhitespaceAnalyzer was used during index build.

We experimented with three workloads:
- Insert only. 1.6M documents were inserted and the final index size was 2.3GB.
- Insert/delete (big batches). The same documents were inserted, but 25% were deleted. 1000 documents were deleted for every 4000 inserted.
- Insert/delete (small batches). In this case, 5 documents were deleted for every 20 inserted.

                                  current       current         new
  Workload                        IndexWriter   IndexModifier   IndexWriter
  --------------------------------------------------------------------------
  Insert only                     116 min       119 min         116 min
  Insert/delete (big batches)     --            135 min         125 min
  Insert/delete (small batches)   --            338 min         134 min
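[Editorial note] To make the buffering idea above concrete, here is a rough sketch of how deferred deletes could be buffered in an IndexWriter-like class. It is an illustration under assumed names (the fields, the maxBufferedDeleteTerms limit, and the flush steps are assumptions), not the attached NewIndexWriter/NewIndexModifier patch code.

```java
import java.util.HashMap;
import org.apache.lucene.index.Term;

// Sketch: deleteDocuments(Term) only records the term; buffered terms are applied
// when the in-memory documents are flushed to disk as a segment.
class BufferedDeletesSketch {
    // term -> number of documents buffered when the delete arrived, so that documents
    // added *after* the delete are not removed by it.
    private HashMap bufferedDeleteTerms = new HashMap();
    private int numBufferedDocs = 0;
    private int maxBufferedDeleteTerms = 1000; // assumed limit

    synchronized void addDocument(/* Document doc */) {
        numBufferedDocs++;
        // ... buffer the document in the ramDirectory; flush when maxBufferedDocs is hit ...
    }

    synchronized void deleteDocuments(Term term) {
        bufferedDeleteTerms.put(term, new Integer(numBufferedDocs));
        if (bufferedDeleteTerms.size() >= maxBufferedDeleteTerms) {
            flush();
        }
    }

    synchronized void flush() {
        // 1. write the buffered documents to disk as a new segment
        // 2. delete documents matching each buffered term from the affected segments,
        //    honoring the recorded document counts
        // 3. clear the buffers
        bufferedDeleteTerms.clear();
        numBufferedDocs = 0;
    }
}
```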
[jira] Commented: (LUCENE-665) temporary file access denied on Windows
[ http://issues.apache.org/jira/browse/LUCENE-665?page=comments#action_12430919 ]

Doron Cohen commented on LUCENE-665:

> Just to confirm, is it the COMMIT lock that's throwing these unhandled exceptions (not the WRITE lock)? If so, lockless commits would fix this.

In my tests so far, these errors appeared only for commit locks. However, I consider this a coincidence - as far as I can understand there is nothing special about commit locks compared to write locks - in particular they both use createNewFile(). So I agree that lockless commits would prevent this, which is good, but we cannot count on it not happening for write locks as well. Also, the more I think about it the more I like lock-less commits; still, they would take a while to get into Lucene, while this simple fix can help easily now. Last, even with lock-less commits there would still be calls to createNewFile() for the write lock, and there would be intensive calls to renameFile() and other file IO operations. Having safety code like the retry logic, invoked only in the rare cases of these unexpected errors, would reduce some nasty errors and make more users happy.

> Can you provide more details on the exceptions you're seeing? Especially on the cannot rename file exception?

Here is one from my run log; it occurs at the call to optimize(), at the end of all the add-remove iterations:

[junit] java.io.IOException: Cannot rename C:\Documents and Settings\tpowner\Local Settings\Temp\test.perf\index_24\deleteable.new to C:\Documents and Settings\tpowner\Local Settings\Temp\test.perf\index_24\deletable
[junit]     at org.apache.lucene.store.FSDirectory.doRenameFile(FSDirectory.java:328)
[junit]     at org.apache.lucene.store.FSDirectory.renameFile(FSDirectory.java:280)
[junit]     at org.apache.lucene.index.IndexWriter.writeDeleteableFiles(IndexWriter.java:967)
[junit]     at org.apache.lucene.index.IndexWriter.deleteSegments(IndexWriter.java:911)
[junit]     at org.apache.lucene.index.IndexWriter.commitChanges(IndexWriter.java:872)
[junit]     at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:823)
[junit]     at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:798)
[junit]     at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:614)
[junit]     at org.apache.lucene.index.IndexModifier.optimize(IndexModifier.java:304)
[junit]     at org.apache.lucene.index.TestBufferedDeletesPerf.doOptimize(TestBufferedDeletesPerf.java:266)
[junit]     at org.apache.lucene.index.TestBufferedDeletesPerf.measureInterleavedAddRemove(TestBufferedDeletesPerf.java:218)
[junit]     at org.apache.lucene.index.TestBufferedDeletesPerf.doTestBufferedDeletesPerf(TestBufferedDeletesPerf.java:144)
[junit]     at org.apache.lucene.index.TestBufferedDeletesPerf.testBufferedDeletesPerfCase7(TestBufferedDeletesPerf.java:134)
[junit]     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[junit]     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
[junit]     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
[junit]     at java.lang.reflect.Method.invoke(Method.java:585)
[junit]     at junit.framework.TestCase.runTest(TestCase.java:154)
[junit]     at junit.framework.TestCase.runBare(TestCase.java:127)
[junit]     at junit.framework.TestResult$1.protect(TestResult.java:106)
[junit]     at junit.framework.TestResult.runProtected(TestResult.java:124)
[junit]     at junit.framework.TestResult.run(TestResult.java:109)
[junit]     at junit.framework.TestCase.run(TestCase.java:118)
[junit]     at junit.framework.TestSuite.runTest(TestSuite.java:208)
[junit]     at junit.framework.TestSuite.run(TestSuite.java:203)
[junit]     at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.run(JUnitTestRunner.java:297)
[junit]     at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.launch(JUnitTestRunner.java:672)
[junit]     at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestRunner.java:567)
[junit] Caused by: java.io.FileNotFoundException: C:\Documents and Settings\tpowner\Local Settings\Temp\test.perf\index_24\deletable (Access is denied)
[junit]     at java.io.FileOutputStream.open(Native Method)
[junit]     at java.io.FileOutputStream.<init>(FileOutputStream.java:179)
[junit]     at java.io.FileOutputStream.<init>(FileOutputStream.java:131)
[junit]     at org.apache.lucene.store.FSDirectory.doRenameFile(FSDirectory.java:312)
[junit]     ... 27 more

This exception, by the way, is from the performance test for interleaved adds and removes - issue 565 - so the IndexWriter line numbers here reflect the recent patch from issue 565 being applied (though the same errors are obtained with
[jira] Commented: (LUCENE-665) temporary file access denied on Windows
[ http://issues.apache.org/jira/browse/LUCENE-665?page=comments#action_12431100 ]

Doron Cohen commented on LUCENE-665:

> obtain() is supposed to return success or failure immediately. I'd be tempted to override obtain(timeout) for FS locks and keep the retry logic there.

Right, this is the right place for the retry. This way the changes are limited to FSDirectory, and obtain() remains unchanged. I am testing this now and will submit an updated patch, where:
- UNEXPECTED_ERROR_RETRY_DELAY is set to 100ms.
- the timeout in obtain(timeout) is always respected (even in the presence of those unexpected IO errors).
- IOExceptions bubble up as discussed.
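[Editorial note] A rough sketch of what overriding obtain(timeout) with this retry behavior could look like follows. It is an illustration under assumed names and semantics, not the FSDirs_Retry_Logic_3.patch code; only the 100ms delay constant is taken from the comment above.

```java
import java.io.IOException;
import org.apache.lucene.store.Lock;

// Illustrative Lock subclass: retry obtain() when it throws an unexpected IOException
// (for example a transient "Access is denied"), while still honoring the caller's timeout.
abstract class RetryingLockSketch extends Lock {

    static final long UNEXPECTED_ERROR_RETRY_DELAY = 100; // ms, per the comment above

    public boolean obtain(long lockWaitTimeout) throws IOException {
        long deadline = System.currentTimeMillis() + lockWaitTimeout;
        IOException lastError = null;
        while (true) {
            try {
                if (obtain()) {
                    return true;               // lock acquired
                }
            } catch (IOException e) {
                lastError = e;                  // unexpected error - remember it and retry
            }
            if (System.currentTimeMillis() >= deadline) {
                if (lastError != null) {
                    throw lastError;            // let the IOException bubble up, as discussed
                }
                throw new IOException("Lock obtain timed out: " + this);
            }
            try {
                Thread.sleep(UNEXPECTED_ERROR_RETRY_DELAY);
            } catch (InterruptedException ie) {
                throw new IOException("interrupted while waiting to retry lock obtain: " + ie);
            }
        }
    }
}
```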
[jira] Commented: (LUCENE-635) [PATCH] Decouple locking implementation from Directory implementation
[ http://issues.apache.org/jira/browse/LUCENE-635?page=comments#action_12431341 ]

Doron Cohen commented on LUCENE-635:

While updating my patch for 665 according to the changes here, I noticed something. I may be wrong here, but it seems to me that until this change, all the actual FS access operations were performed by FSDirectory, through the Directory API. The new SimpleFSLock and SimpleFSLockFactory access the FS directly, not through the FSDirectory API.

The Directory abstraction in Lucene is what allows Lucene-in-RAM, Lucene-in-DB, etc. to be developed. It is a nice feature. We could say: well, now the abstraction is made of two interfaces - Lock and Directory - just make sure you use 'matching' implementations of them. But this seems weaker than before. Alternatively, we could keep all file access going through FSDirectory - one possibility is to add a Directory object to LockFactory (as a class member); SimpleFSLockFactory could require that Directory object to be an FSDirectory (cast, and fail otherwise); also, FSDirectory would need to be extended with createSingleFile(), mkdirs() and isDirectory().

[PATCH] Decouple locking implementation from Directory implementation
----------------------------------------------------------------------

                 Key: LUCENE-635
                 URL: http://issues.apache.org/jira/browse/LUCENE-635
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Index
    Affects Versions: 2.0.0
            Reporter: Michael McCandless
         Assigned To: Yonik Seeley
            Priority: Minor
             Fix For: 2.0.1
         Attachments: LUCENE-635-Aug27.patch, LUCENE-635-Aug3.patch, patch-Jul26.tar

This is a spinoff of http://issues.apache.org/jira/browse/LUCENE-305. I've opened this new issue to capture that it's wider scope than LUCENE-305. This is a patch originally created by Jeff Patterson (see above link) and then modified as described here: http://issues.apache.org/jira/browse/LUCENE-305#action_12418493 with some small additional changes:

* For each FSDirectory.getDirectory(), I made a corresponding version that also accepts a LockFactory instance. So, you can construct an FSDirectory with your own LockFactory.
* Cascaded defaulting for FSDirectory's LockFactory implementation: if you pass in a LockFactory instance, it's used; else if setDisableLocks was called, we use NoLockFactory; else, if the system property org.apache.lucene.store.FSDirectoryLockFactoryClass is defined, we use that; finally, we'll use the original locking implementation (SimpleFSLockFactory).

The gist is that all locking code has been moved out of *Directory and into subclasses of a new abstract LockFactory class. You can now set the LockFactory of a Directory to change how it does locking. For example, you can create an FSDirectory but set its locking to SingleInstanceLockFactory (if you know all writing/reading will take place in a single JVM). The changes pass all unit tests (on Ubuntu Linux Sun Java 1.5 and Windows XP Sun Java 1.4), and I added another TestCase to test the LockFactory code. Note that LockFactory defaults are not changed: FSDirectory defaults to SimpleFSLockFactory and RAMDirectory defaults to SingleInstanceLockFactory. Next step (separate issue) is to create a LockFactory that uses the OS native locks (through java.nio).
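[Editorial note] A small sketch of the suggestion in the comment above - giving a LockFactory a Directory member so lock files go through the Directory API. This assumes the LockFactory class from the LUCENE-635 patch is present; the class and method names here are illustrative, not the patch's API.

```java
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.LockFactory;

// Sketch only: every lock factory carries the Directory it locks for, and a
// filesystem-based factory can insist on an FSDirectory, failing otherwise.
abstract class DirectoryLockFactorySketch extends LockFactory {
    protected Directory dir;

    void setDirectory(Directory dir) {
        if (!(dir instanceof FSDirectory)) {
            throw new IllegalArgumentException("this lock factory requires an FSDirectory");
        }
        this.dir = dir; // all subsequent lock-file IO would be done via this Directory
    }
}
```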
[jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
[ http://issues.apache.org/jira/browse/LUCENE-565?page=comments#action_12431354 ]

Doron Cohen commented on LUCENE-565:

Is it that results that were previously returned are suddenly (say, after updates) not returned anymore (indicating something bad happened to the existing index)? Or is it that the search does not reflect recent changes? I don't remember how often Solr closes and re-opens the writer/modifier... With this patch a delete does not immediately cause a flush to disk - flushes are controlled by closing the NewIndexModifier (and re-opening, since there is no flush() method) and by the limits for max-buffered-docs and max-buffered-deletes. If this seems relevant to your case, what limits are in effect?
[jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
[ http://issues.apache.org/jira/browse/LUCENE-565?page=comments#action_12431419 ]

Doron Cohen commented on LUCENE-565:

Just to make sure about the scenario - are you (1) using NewIndexModifier at all, or (2) just letting Solr use this IndexWriter (with the code changes introduced to enable NewIndexModifier) instead of Lucene's svn-head (or a certain release) IndexModifier? As is, Solr would not use NewIndexModifier or IndexModifier at all. In case (2) above, the buffered deletes logic is not in effect at all. I wonder if it is possible to re-create this with a simple stand-alone Lucene (test) program rather than with Solr - it would be easier to analyze.
[jira] Updated: (LUCENE-665) temporary file access denied on Windows
[ http://issues.apache.org/jira/browse/LUCENE-665?page=all ]

Doron Cohen updated LUCENE-665:
-------------------------------
    Attachment: FSDirs_Retry_Logic_3.patch

I am attaching an updated patch - FSDirs_Retry_Logic_3.patch. In this update:
- merged with the code changes from issue 635 (decouple locking from directory).
- modified per the recommendations in the comments above:
  - do not rely on specific exception message text.
  - override lock.obtain(timeout) and handle unexpected exceptions there.
  - do not modify the logic of obtain() (no changes to this method).
  - UNEXPECTED_ERROR_RETRY_DELAY set to 100ms.
  - debug prints commented out.

All "ant test" tests pass. My stress IO test passes as well.
[jira] Commented: (LUCENE-635) [PATCH] Decouple locking implementation from Directory implementation
[ http://issues.apache.org/jira/browse/LUCENE-635?page=comments#action_12431666 ]

Doron Cohen commented on LUCENE-635:

> We could (as you're suggesting) indeed extend FSDirectory so that it provided the low level methods required by a locking implementation, and then alter SimpleFSLockFactory/NativeFSLockFactory (or make a new LockFactory) so that all underlying IO is through the FSDirectory instead.

Yes, this is exactly (and only) what I am suggesting to consider - to include a Directory member within the LockFactory, so that it is clear that any LockFactory implementation operates in the realm of a directory (implementation) and uses it for all actual store accesses.
[jira] Commented: (LUCENE-665) temporary file access denied on Windows
[ http://issues.apache.org/jira/browse/LUCENE-665?page=comments#action_12431801 ]

Doron Cohen commented on LUCENE-665:

I think I know which software is causing/exposing this behavior in my environment: the SVN client I am using - TortoiseSVN. I tried the following sequence:
1) Run with TortoiseSVN installed - the test generates these access denied errors (and bypasses them).
2) Uninstall TortoiseSVN (+ reboot), run the test - it passes with no access denied errors.
3) Install TortoiseSVN again (+ reboot), run the test - the same access denied errors again.

I am using the most recent stable TortoiseSVN version - 1.3.5 build 6804 - 32 bit, for svn-1.3.2, downloaded from http://tortoisesvn.tigris.org/.

There is an interesting discussion thread about this type of error on Windows platforms in the svn forums - http://svn.haxx.se/dev/archive-2003-10/0136.shtml. In that case it was svn that suffered from these errors. It says: "...Windows allows applications to tag-along to see when a file has been written - they will wait for it to close and then do whatever they do, usually opening a file descriptor or handle. This would prevent that file from being renamed for a brief period..."

TortoiseSVN is a shell extension integrated into Windows Explorer. As such, it probably demonstrates the tag-along behavior described above. (BTW, it is a great svn client in my opinion.)

Here is another excerpt from that discussion thread:
"sleep(1) would work, I suppose. ;~)"
"Most of the time, but not all the time. The only way I've made it work well on all the machines I've tried it on is to put it into a sleep(1) and retry loop of at *least* 20 or so attempts. Anything less and it still fails on some machines. That implies it is very dependent on machine speed or something, which means sleep times/retry times are just guessing games at best. If I could just get it recreated outside of Subversion and prove it's a Microsoft problem... although it probably still wouldn't get fixed for months at least."

We don't know that this is a bug in TortoiseSVN. We cannot tell that there are no other such tag-along applications on users' machines. One cannot seriously expect this Win32 behavior to be fixed. I guess the question is - is it worth it for Lucene to attempt to at least reduce the chances of failure in this case? (I say yes :-)
[jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
[ http://issues.apache.org/jira/browse/LUCENE-565?page=all ]

Doron Cohen updated LUCENE-565:
-------------------------------
    Attachment: perf-test-res2.JPG

Updated performance test results - perf-test-res2.JPG - on average, the new code is *9* times faster!

What has changed? In the previous test I forgot to set max-buffered-deletes. After fixing that, I removed the test cases with a max-buffer of 5,000 and up, because they consumed too much memory, and added the more practical (I think) cases of 2000 and 3000.

Here is a textual summary of the data in the attached image:

  max buf add/del       10      10     100    1000    2000    3000
  iterations             1      10     100     100     200     300
  adds/iteration        10      10      10      10      10      10
  dels/iteration         5       5       5       5       5       5
  orig time (sec)     0.13    0.86    9.57    8.88   22.74   44.01
  new time (sec)      0.20    0.95    1.74    1.30    2.16    3.08
  Improvement (sec)  -0.07   -0.09    7.83    7.58   20.58   40.94
  Improvement (%)     -55%    -11%     82%     85%     90%     93%

Note: for the first two cases the new code is slower, by 55% and 11% respectively, but these are very short test cases - the absolute difference there is less than 100ms, compared to the other cases, where the difference is measured in seconds and tens of seconds.
[jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
[ http://issues.apache.org/jira/browse/LUCENE-565?page=comments#action_12432216 ]

Doron Cohen commented on LUCENE-565:

I agree - I also suspected it might change the merge behavior (and I also have recollections from the repeated attempts to get that simple IndexWriter buffered-docs patch correct... :-). I guess I just wanted to get a feeling for whether there is interest in including this patch before I delve into it too much - and the perf test was meant to show me whether it really helps. I was a bit surprised that it is 9 times faster in an interleaved add/delete scenario. I guess this by itself now justifies delving into this patch and analyzing the merge behavior as you suggest - will do. I think ideally this patch should not modify the merge behavior.

About the test - I was trying to test what I thought is a realistic use scenario (max-buf, etc.). I have a fixed version of the perf test that is easier to modify for different scenarios - I can upload it here if there is interest.
[jira] Updated: (LUCENE-665) temporary file access denied on Windows
[ http://issues.apache.org/jira/browse/LUCENE-665?page=all ]

Doron Cohen updated LUCENE-665:
-------------------------------
    Attachment: FSWinDirectory.patch

The attached patch - FSWinDirectory.patch - implements the retry logic for FS operations in a separate, non-default directory class, as discussed above. By default this new class is not used. Applications can start using it by replacing the IMPL class in FSDirectory with the new class FSWinDirectory. There are two ways to do this - by setting a system property (this is the original mechanism), or by calling the (new) FSDirectory static method setFSDirImplClass(name).

There are 3 new classes in this patch:
- FSWinDirectory (extends FSDirectory)
- SimpleFSWinLockFactory (extends SimpleFSLockFactory)
- TestWinLockFactory (extends TestLockFactory)

A few simple modifications were required in FSDirectory, SimpleFSLockFactory and TestLockFactory in order to allow inheritance.

Tests:
- "ant test" passes with the new code.
- For testing, I modified my copy of build-common.xml to set a system property so that the new FSWinDirectory class was always in effect, and ran the tests - all passed.
- My stress test TestInterleavedAddAndRemoves fails in my environment by default and passes when FSWinDirectory is in effect.
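[Editorial note] As a usage illustration only, assuming the FSWinDirectory patch is applied: the static setter below is the one named in the update above, while the package of FSWinDirectory and the index path are hypothetical (the exact system-property alternative is not spelled out in this thread).

```java
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class UseWinDirectorySketch {
    public static void main(String[] args) throws Exception {
        // Select FSWinDirectory as the FSDirectory implementation before opening any index.
        // setFSDirImplClass(name) is the new static method described in the patch notes above.
        FSDirectory.setFSDirImplClass("org.apache.lucene.store.FSWinDirectory");

        Directory dir = FSDirectory.getDirectory("D:/tmp/index", true);
        // ... open IndexWriter / IndexReader on dir as usual ...
        dir.close();
    }
}
```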
[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene
[ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12436980 ] Doron Cohen commented on LUCENE-675: Few things that would be nice to have in this performance package/framework - () indexing only overall time. () indexing only time changes as the index grows (might be the case that indexing performance starts to misbehave from a certain size or so). () search single user while indexing () search only single user () search only concurrent users () short queries () long queries () wild card queries () range queries () queries with rare words () queries with common words () tokenization/analysis only (above indexing measurements include tokenization, but it would be important to be able to prove to oneself that tokenization/analysis time is not hurt by a recent change). () parametric control over: () () location of test input data. () () location of output index. () () location of output log/results. () () total collection size (total number of bytes/characters read from collection) () () document (average) size (bytes/chars) - test can break input data and recompose it into documents of desired size. () () implicit iteration size - merge-factor, max-buffered-docs () () explicit iteration size - how often the perf test calls () () long queries text () () short queries text () () which parts of the test framework capabilities to run () () number of users / threads. () () queries pace - how many queries are fired in, say, a minute. Additional points: () Would help if all test run parameters are maintained in a properties (or xml config) file, so one can easily modify the test input/output without having to recompile the code. () Output to allow easy creation of graphs or so - perhaps best would be to have an result object, so others can easily extend with additional output formats. () index size as part of output. () number of index files as part of output (?) () indexing input module that can loop over the input collection. This allows to test indexing of a collection larger than the actual input collection being used. Lucene benchmark: objective performance test for Lucene --- Key: LUCENE-675 URL: http://issues.apache.org/jira/browse/LUCENE-675 Project: Lucene - Java Issue Type: Improvement Reporter: Andrzej Bialecki Assigned To: Grant Ingersoll Attachments: LuceneBenchmark.java We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests. Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
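As a rough illustration of the properties-file idea in the list above, a driver could read its knobs from a plain java.util.Properties file so that runs can be reconfigured without recompiling. The file name and every key below are hypothetical:

    import java.io.FileInputStream;
    import java.util.Properties;

    public class BenchmarkConfig {
      public static void main(String[] args) throws Exception {
        Properties p = new Properties();
        p.load(new FileInputStream("benchmark.properties")); // hypothetical file

        // Hypothetical keys mirroring some of the parameters listed above.
        String docDir   = p.getProperty("input.dir", "work/reuters-out");
        String indexDir = p.getProperty("index.dir", "work/index");
        int mergeFactor = Integer.parseInt(p.getProperty("merge.factor", "10"));
        int maxBuffered = Integer.parseInt(p.getProperty("max.buffered.docs", "10"));
        int threads     = Integer.parseInt(p.getProperty("search.threads", "1"));

        System.out.println("indexing " + docDir + " into " + indexDir
            + " (mergeFactor=" + mergeFactor
            + ", maxBufferedDocs=" + maxBuffered
            + ", searchThreads=" + threads + ")");
      }
    }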
[jira] Updated: (LUCENE-665) temporary file access denied on Windows
[ http://issues.apache.org/jira/browse/LUCENE-665?page=all ] Doron Cohen updated LUCENE-665: --- Attachment: FSWinDirectory_26_Sep_06.patch Updated the patch according to review comments by Hoss, plus: - protect the currMillis usage from system clock modifications. - all Windows-specific code is now in a single Java file with two inner classes, for cleaner javadocs (waitForRetry() is now private). Tested as with the previous patch: - ant test passes with the new code. - For testing, I modified build-common.xml to set a system property so that the new FSWinDirectory class was always in effect, and ran the tests - all passed. - my stress test TestInterleavedAddAndRemoves fails in my env by default and passes when FSWinDirectory is in effect. temporary file access denied on Windows --- Key: LUCENE-665 URL: http://issues.apache.org/jira/browse/LUCENE-665
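For readers who have not looked at the attached patches, a stripped-down sketch of the retry idea being iterated on in this issue. The delay, retry count, and helper name are assumptions; the real patch keeps this logic inside the FSWinDirectory classes rather than in a standalone helper:

    import java.io.File;
    import java.io.IOException;

    public class AccessDeniedRetry {

      private static final int RETRY_DELAY_MS = 30; // assumed value
      private static final int MAX_RETRIES = 10;    // assumed value

      /** Rename with retries, for the transient access-denied situations described above. */
      public static void renameWithRetry(File from, File to) throws IOException {
        for (int i = 0; i < MAX_RETRIES; i++) {
          if (from.renameTo(to)) {
            return; // succeeded
          }
          try {
            Thread.sleep(RETRY_DELAY_MS); // hope the transient condition clears
          } catch (InterruptedException ie) {
            throw new IOException("interrupted while retrying rename");
          }
        }
        throw new IOException("could not rename " + from + " to " + to);
      }
    }

Counting retries, rather than watching wall-clock time, also sidesteps the system-clock concern mentioned in the update above.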
[jira] Updated: (LUCENE-664) [PATCH] small fixes to the new scoring.html doc
[ http://issues.apache.org/jira/browse/LUCENE-664?page=all ] Doron Cohen updated LUCENE-664: --- Attachment: boosts_plus_scoring_formula.patch (1) added a section in Scoring.xml for Search Results Boosts, on ways to boost in Lucene, at search time and at indexing time. (2) updated the presentation of the scoring formula in Similarity.java, to: - closely reflect the scoring code/process. - distinguish between indexing time factors and search time factors, and - point to differences between a scoring notion (e.g.tf, idf) to the way it is computed. As result the scoring formula is presented differently in Similarity.java and in Scoring.html. I can update this if there are no objections to the updated formula presentation. [PATCH] small fixes to the new scoring.html doc --- Key: LUCENE-664 URL: http://issues.apache.org/jira/browse/LUCENE-664 Project: Lucene - Java Issue Type: Improvement Components: Website Affects Versions: 2.0.1 Reporter: Michael McCandless Attachments: boosts_plus_scoring_formula.patch, lucene.uxf, scoring-small-fixes.patch, scoring-small-fixes2.patch, scoring-small-fixes3.patch This is an awesome initiative. We need more docs that cleanly explain the inner workings of Lucene in general... thanks Grant Steve others! I have a few small initial proposed fixes, largely just adding some more description around the components of the formula. But also a couple typos, another link out to Wikipedia, a missing closing ), etc. I've only made it through the Understanding the Scoring Formula section so far. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-664) [PATCH] small fixes to the new scoring.html doc
[ http://issues.apache.org/jira/browse/LUCENE-664?page=comments#action_12438854 ] Doron Cohen commented on LUCENE-664: Hi Grant, For part 1, I am OK with having it after the scoring formula. For part 2, my motivation was to make it clearer: - what's inside the sum and what's outside (as you said). - what's decided at indexing time and what's still controllable at search time. - how boosts and encoding/decoding play in. - what's fixed and what can be modified by subclassing, say, DefaultSimilarity. So {indexBoost, searchBoost, normalizer} were the tools to clear this up, and also to make the formula shorter and easier to read at a glance. Naturally, after delving so deep into it, it is now clear to me, but you are right, it would be good to hear from others how they like this part. Thanks, Doron [PATCH] small fixes to the new scoring.html doc --- Key: LUCENE-664 URL: http://issues.apache.org/jira/browse/LUCENE-664
[jira] Commented: (LUCENE-664) [PATCH] small fixes to the new scoring.html doc
[ http://issues.apache.org/jira/browse/LUCENE-664?page=comments#action_12439370 ] Doron Cohen commented on LUCENE-664: Two quick questions: I think 'norm' is a good term for the product of lengthNorm(d) and field boost. That's what it is called consistently in the code and API. This quantity is represented in two places, but seems like a logical candidate for the sort of factoring done here. Norm would also include the doc boost, right? So this means replacing *indexBoost* with *norm*? This could be placed to the left of the sigma, since it does not depend on t. I think that norm depends on the field name, and there may be terms of more than one field in the query? [PATCH] small fixes to the new scoring.html doc --- Key: LUCENE-664 URL: http://issues.apache.org/jira/browse/LUCENE-664
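For concreteness, the index-time quantity being discussed can be written out as a product of its factors. This is a sketch of the proposal in this thread, with the notation assumed, not a quote from the final documentation:

    norm(t,d) = boost(d) \cdot lengthNorm(field(t), d) \cdot \prod_{f \in d,\ name(f) = field(t)} boost(f)

Written this way, the factor depends on the field of t, so in a query that mixes terms from more than one field it cannot simply be pulled out in front of the sum over terms.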
[jira] Commented: (LUCENE-664) [PATCH] small fixes to the new scoring.html doc
[ http://issues.apache.org/jira/browse/LUCENE-664?page=comments#action_12440592 ] Doron Cohen commented on LUCENE-664: Going to work on this now, according to comments by Doug and Grant. Will give the include idea a try - a client-side iframe as Chris suggested - and see how it works. Iframes don't rely on Javascript (which might be turned off for some users). There are downsides to iframes too - possible scrollbars etc. - so I need to see how it looks, and need to check whether it is possible to somehow also include it in Scoring.html; otherwise I guess we just link to it from there. http://www.boutell.com/newfaq/creating/include.html [PATCH] small fixes to the new scoring.html doc --- Key: LUCENE-664 URL: http://issues.apache.org/jira/browse/LUCENE-664
[jira] Commented: (LUCENE-664) [PATCH] small fixes to the new scoring.html doc
[ http://issues.apache.org/jira/browse/LUCENE-664?page=comments#action_12440648 ] Doron Cohen commented on LUCENE-664: I played with including the formula from a separate file, Client Side Include. === Summary === I think the include is not going to work well enough and hence is not worth investing in. So, bottom line, I give up for now on include, and I will make the changes in Similarity.java. === Details === I know of 3 ways to do this: Javascript, Iframe, Object-Embed. An iframe can work for both the javadocs and the xdocs - I think that Embed would also work, though I did not try it. Both Iframe and Embed have a problem of appearance: you have to decide on the size of the frame shown, in pixels. If you set too large an area, blank space will remain. If you set too small an area, scrollbars show up. If the user changes the text size, the required area size changes, but not the allocated area, so scrollbars or blank space keep showing up and disappearing. Very ugly. Iframe also has an issue with inner-link navigation: once you navigate to an anchor text in the iframe part (this works), the back action for some reason does not work (both Firefox and IE). The Javascript approach should not have these issues, because the imported text becomes part of the embedding page (the imported text is dynamically generated). I saw that the Javadocs themselves use Javascript (at least in 1.5), so I feel better about using it. However, to use Javascript you have to put some Javascript code in the HTML header, as well as an onload event in the BODY tag, and I didn't find a way to do this with Javadocs. (Another tricky issue with Javascript is that outgoing links from the imported text have the base address of the embedding page. So references going out from the embedded text would have to differ between Similarity.html and Scoring.html (which are in separate directories). But I think this can be resolved by passing a 'base' param to the include() function.) Bottom line, I give up for now on include, and so I will make the changes in Similarity.java. [PATCH] small fixes to the new scoring.html doc --- Key: LUCENE-664 URL: http://issues.apache.org/jira/browse/LUCENE-664
[jira] Updated: (LUCENE-664) [PATCH] small fixes to the new scoring.html doc
[ http://issues.apache.org/jira/browse/LUCENE-664?page=all ] Doron Cohen updated LUCENE-664: --- Attachment: scoring_formula_2.patch I am attaching scoring_formula_2.patch - a modified scoring formula as suggested. Additional changes here: - order of the explanation parts: the detailed norm part moved to the end; tf and idf moved to the start, so most of the stuff is visible at first glance. - links in the formula go to the appropriate explanation bullet. - the formula itself is framed (border=1) for easier orientation within all the other text. [PATCH] small fixes to the new scoring.html doc --- Key: LUCENE-664 URL: http://issues.apache.org/jira/browse/LUCENE-664
[jira] Commented: (LUCENE-664) [PATCH] small fixes to the new scoring.html doc
[ http://issues.apache.org/jira/browse/LUCENE-664?page=comments#action_12441194 ] Doron Cohen commented on LUCENE-664: One comment for Scoring.html: the last sentence in the Score Boosting paragraph says: At scoring (search) time, this norm is brought into the score of document as indexBoost, as shown by the formula in Similarity. To fix this, we should replace indexBoost by norm(t,d). [PATCH] small fixes to the new scoring.html doc --- Key: LUCENE-664 URL: http://issues.apache.org/jira/browse/LUCENE-664
[jira] Commented: (LUCENE-664) [PATCH] small fixes to the new scoring.html doc
[ http://issues.apache.org/jira/browse/LUCENE-664?page=comments#action_12441280 ] Doron Cohen commented on LUCENE-664: I just noticed that the link to TermScorer in Understanding the Scoring Formula is broken b/c TermScorer has package visibility. Can be fixed by saying instead "..., especially the scorer for TermQuery" and linking to TermQuery. [PATCH] small fixes to the new scoring.html doc --- Key: LUCENE-664 URL: http://issues.apache.org/jira/browse/LUCENE-664
[jira] Commented: (LUCENE-678) [PATCH] LockFactory implementation based on OS native locks (java.nio.*)
[ http://issues.apache.org/jira/browse/LUCENE-678?page=comments#action_12443304 ] Doron Cohen commented on LUCENE-678: The patch added a call to writer.close() in TestLockFactory - testFSDirectoryTwoCreates(). This is just before the 2nd attempt to create an index writer with override. This line should probably be removed, as it cancels the second part of that test case, right? [PATCH] LockFactory implementation based on OS native locks (java.nio.*) Key: LUCENE-678 URL: http://issues.apache.org/jira/browse/LUCENE-678 Project: Lucene - Java Issue Type: New Feature Components: Store Affects Versions: 2.1 Reporter: Michael McCandless Assigned To: Yonik Seeley Priority: Minor Fix For: 2.0.1 Attachments: LUCENE-678-patch.txt The current default locking for FSDirectory is SimpleFSLockFactory. It uses java.io.File.createNewFile for its locking, which has this spooky warning in Sun's javadocs: Note: this method should not be used for file-locking, as the resulting protocol cannot be made to work reliably. The FileLock facility should be used instead. So, this patch provides a LockFactory implementation based on FileLock (using java.nio.*). All unit tests pass with this patch, on OS X (10.4.8), Linux (Ubuntu 6.06), and Windows XP SP2. Another benefit of native locks is the OS automatically frees them if the JVM exits before Lucene can free its locks. Many people seem to hit this (old lock files still on disk) now. I've created this new class: org.apache.lucene.store.NativeFSLockFactory and added a couple test cases to the existing TestLockFactory. I've left SimpleFSLockFactory as the default locking for FSDirectory for now. I think we should get some usage / experience with NativeFSLockFactory and then later on make it the default locking implementation? I also tested changing FSDirectory's default locking to NativeFSLockFactory and all unit tests still pass (on the above platforms). One important note about locking over NFS: some NFS servers and/or clients do not support it, or, it's a configuration option or mode that must be explicitly enabled. When it's misconfigured it's able to take a long time (35 seconds in my case) before throwing an exception. To handle this, I acquire release a random test lock on creating the NativeFSLockFactory to verify locking is configured properly. A few other small changes in the patch: - Added a failure reason to Lock.java so that in obtain(lockWaitTimeout), if there is a persistent IOException in trying to obtain the lock, this can be messaged included in the Lock obtain timed out that's raised. - Corrected javadoc in SimpleFSLockFactory: it previously said the wrong system property for overriding lock class via system properties - Fixed unhandled IOException when opening an IndexWriter for create, if the locks dir does not exist (just added lockDir.exists() check in clearAllLocks method of SimpleFSLockFactory NativeFSLockFactory. - Fixed a few small unrelated issues with TestLockFactory, and also fixed tests to accept NativeFSLockFactory as the default locking implementation for FSDirectory. - Fixed a typo in javadoc in FieldsReader.java - Added some more javadoc for the LockFactory.setLockPrefix -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-678) [PATCH] LockFactory implementation based on OS native locks (java.nio.*)
[ http://issues.apache.org/jira/browse/LUCENE-678?page=comments#action_1244 ] Doron Cohen commented on LUCENE-678: Michael, I must be misunderstanding something then... That test case is verifying that the 2nd index writer indeed removes any leftover lock files created by the first one. Can there be any leftovers once the first writer was closed? If it is not intended to test that case (though previously it was), could you explain why the change? Thanks, Doron [PATCH] LockFactory implementation based on OS native locks (java.nio.*) Key: LUCENE-678 URL: http://issues.apache.org/jira/browse/LUCENE-678
[jira] Commented: (LUCENE-686) Resources not always reclaimed in scorers after each search
[ http://issues.apache.org/jira/browse/LUCENE-686?page=comments#action_12444742 ] Doron Cohen commented on LUCENE-686: An example of how current Lucene code relies on not having to close resources, in PhraseQuery:

    ... scorer(IndexReader reader) {
      ...
      for (int i = 0; i < terms.size(); i++) {
        TermPositions p = reader.termPositions((Term) terms.elementAt(i));
        if (p == null) return null;   <--- a change would be required here
        tps[i] = p;
      }

If close() has to be respected, this code would need to change to close all the TermPositions that were obtained just before the one that was not found. Resources not always reclaimed in scorers after each search --- Key: LUCENE-686 URL: http://issues.apache.org/jira/browse/LUCENE-686 Project: Lucene - Java Issue Type: Bug Components: Search Environment: All Reporter: Ning Li Attachments: ScorerResourceGC.patch Resources are not always reclaimed in scorers after each search. For example, close() is not always called for term docs in TermScorer. A test will be attached to show when resources are not reclaimed.
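A sketch of the kind of change the comment points at: before the early return, the TermPositions obtained so far would have to be closed. Names follow the excerpt above; this is not a committed fix:

    // Inside PhraseQuery's scorer(IndexReader reader), continuing the excerpt:
    TermPositions[] tps = new TermPositions[terms.size()];
    for (int i = 0; i < terms.size(); i++) {
      TermPositions p = reader.termPositions((Term) terms.elementAt(i));
      if (p == null) {
        // release the TermPositions opened in earlier iterations before bailing out
        for (int j = 0; j < i; j++) {
          tps[j].close();
        }
        return null;
      }
      tps[i] = p;
    }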
[jira] Commented: (LUCENE-697) Scorer.skipTo affects sloppyPhrase scoring
[ http://issues.apache.org/jira/browse/LUCENE-697?page=comments#action_12444744 ] Doron Cohen commented on LUCENE-697: I can reproduce this by uncommenting this line. Interesting to notice that: (1) the sequence next() next() skip() skip() next() next() (!= instead of ==) passes the tests. This is expected, because the problem is in the initialization (at least in PhraseQuery). (2) the sequence next() skip() next() ((.. & 0x01) != 0 instead of (.. & 0x02) == 0) does not pass. This is surprising to me - because seemingly there are no initialization issues here. But I think the cause is that, at least in PhraseQuery, it is not just an initialization issue. Yonik, this is unassigned - are you working on a fix for this? Scorer.skipTo affects sloppyPhrase scoring -- Key: LUCENE-697 URL: http://issues.apache.org/jira/browse/LUCENE-697 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 2.0.0 Reporter: Yonik Seeley If you mix skipTo() and next(), you get different scores than what is returned to a hit collector.
[jira] Assigned: (LUCENE-697) Scorer.skipTo affects sloppyPhrase scoring
[ http://issues.apache.org/jira/browse/LUCENE-697?page=all ] Doron Cohen reassigned LUCENE-697: -- Assignee: Doron Cohen Scorer.skipTo affects sloppyPhrase scoring -- Key: LUCENE-697 URL: http://issues.apache.org/jira/browse/LUCENE-697
[jira] Commented: (LUCENE-569) NearSpans skipTo bug
[ http://issues.apache.org/jira/browse/LUCENE-569?page=comments#action_12445284 ] Doron Cohen commented on LUCENE-569: It seems that having assert() in NearSpansOrdered.java now requires Java 1.5 in order to compile Lucene. This would require 1.5 for running Lucene. Do we want to include this now? NearSpans skipTo bug Key: LUCENE-569 URL: http://issues.apache.org/jira/browse/LUCENE-569 Project: Lucene - Java Issue Type: Bug Components: Search Reporter: Hoss Man Assigned To: Hoss Man Attachments: common-build.assertions.patch, LUCENE-569.ubber.patch, NearSpans20060903.patch, NearSpansOrdered.java, NearSpansUnordered.java, SpanNearQuery20060622.patch, SpanScorer.explain.testcase.patch, TestNearSpans.java NearSpans appears to have a bug in skipTo that causes it to skip over some matching documents completely. I discovered this bug while investigating problems with SpanWeight.explain, but as far as I can tell the bug is not specific to Explanations ... it seems like it could potentially result in incorrect matching in some situations where a SpanNearQuery is nested in another query such that skipTo will be used ... I tried to create a high level test case to exploit the bug when searching, but I could not. TestCase exploiting the class using NearSpan and SpanScorer will follow...
[jira] Updated: (LUCENE-697) Scorer.skipTo affects sloppyPhrase scoring
[ http://issues.apache.org/jira/browse/LUCENE-697?page=all ] Doron Cohen updated LUCENE-697: --- Attachment: sloppy_phrase_skipTo.patch This was tricky, for me anyhow, but I think I found it. The difference in scoring between using next() and using skipTo() (or a combination of the two) is caused by two (valid) orders of the sorted PhrasePositions. Currently PhrasePositions sorting is defined by doc and position, where position already considers the offset of the term within the (phrase) query. If, however, two PhrasePositions have the same doc and same position, the sort takes no decision, which falls down to more than one valid sort (by the current sort definition). The difference between using next() and skipTo() in this regard is that skipTo() always calls sort(), sorting the entire set, while next() only calls sort() at initialization and then maintains the sorting as part of the scoring process. This would be clearer with the following example - taken from Yonik's test case that is failing now: - Doc1: w1 w3 w2 w3 zz - Query: w3 w2~2 When starting scoring in this doc, both PhrasePositions pp(w3) and pp(w2) have doc(w2)=doc(w3)=1. Note that, for the second w3 that matches, we would have pos(w2)=2+1=3 and pos(w3)=3+0=3. So, after scoring doc1(w3 w2), if the sort result places pp(w2) at the top, we would also score for doc1(w3 w2). However, if pp(w3) is placed by the sort at the top (== smallest), we would not also score for doc1(w3 w2). Current behavior is inconsistent: skipTo() would take the two while next() won't, and I think it is possible to create a case where it would be the other way around. So the behavior should definitely be made consistent. The next question to be asked is: do we want to sum (or max) the frequency for both (or more) cases? I think yes - sum. To fix this I am changing the PhrasePositions comparison, so that in case positions are equal, the actual position in the document (ignoring the offset in the query phrase) is considered. In addition, I added missing calls to clear the priority queue before starting to sort, and to mark that no more initialization is required when skipTo() is called. I tested with the sequence that Yonik added: - skip skip next next skip skip And also with the sequences: - skip skip skip skip skip skip - next next next next next next - skip next skip next skip next - next skip next skip next skip - next next skip skip next next The latter 5 cases are now commented out; the first case is in effect. This scoring code still does not feel natural to me, so (as always, actually) comments will be appreciated. - Doron Scorer.skipTo affects sloppyPhrase scoring -- Key: LUCENE-697 URL: http://issues.apache.org/jira/browse/LUCENE-697
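A sketch of the comparison change described above. It assumes, as in PhraseScorer, that each PhrasePositions carries its offset within the query and a position that already has that offset subtracted; the exact field and method names are assumptions, not the committed patch:

    // Hypothetical PhraseQueue.lessThan(): order by doc, then by the
    // offset-adjusted position, and on a tie fall back to the actual
    // in-document position (position + offset), so that next() and
    // skipTo() always see the same ordering.
    protected final boolean lessThan(Object o1, Object o2) {
      PhrasePositions pp1 = (PhrasePositions) o1;
      PhrasePositions pp2 = (PhrasePositions) o2;
      if (pp1.doc != pp2.doc) {
        return pp1.doc < pp2.doc;
      }
      if (pp1.position != pp2.position) {
        return pp1.position < pp2.position;
      }
      // same doc, same adjusted position: comparing position + offset
      // reduces to comparing the offsets.
      return pp1.offset < pp2.offset;
    }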
[jira] Commented: (LUCENE-569) NearSpans skipTo bug
[ http://issues.apache.org/jira/browse/LUCENE-569?page=comments#action_12445294 ] Doron Cohen commented on LUCENE-569: Chris Hostetter wrote: Really? ... the build.xml currently sets the javac -source and -target to 1.4, so if that were true I would expect it to have failed, and the documentation for J2SE 1.4.2 indicates that assertion support exists in 1.4. While writing this I attempted an ant clean test using Java 1.4 and everything seemed to work fine. You are right Chris, my mistake - compilation passed for me with 1.5 but failed with 1.4, so I assumed this was the case, but apparently for 1.4 I had 1.3 set for the source compatibility (in Eclipse). I changed it to 1.4 and it passed with no problems. Sorry for this noise, Doron NearSpans skipTo bug Key: LUCENE-569 URL: http://issues.apache.org/jira/browse/LUCENE-569
[jira] Updated: (LUCENE-697) Scorer.skipTo affects sloppyPhrase scoring
[ http://issues.apache.org/jira/browse/LUCENE-697?page=all ] Doron Cohen updated LUCENE-697: --- Lucene Fields: [Patch Available] (was: [New]) Scorer.skipTo affects sloppyPhrase scoring -- Key: LUCENE-697 URL: http://issues.apache.org/jira/browse/LUCENE-697
[jira] Commented: (LUCENE-665) temporary file access denied on Windows
[ http://issues.apache.org/jira/browse/LUCENE-665?page=comments#action_12445507 ] Doron Cohen commented on LUCENE-665: Michael, I am not able to reproduce this with native locks (did not try with lock-less commits). Which brings me to think that native locks should be made the default? There is another thing that bothers me with locks, in NFS or other shared-fs situations: locks are maintained in a specified folder, but a lock file name is derived from the full path of the index dir - actually the canonical name of this dir. So, if the same index is accessed by two machines, the drive / mount / fs root of that index dir must be named the same on all the machines on which Lucene is invoked to access/maintain that index. The documentation for File.getCanonicalPath() says that it is system dependent. So I am not even sure how it can be guaranteed that Lucene used on Linux and Lucene used on Windows (say) that access the same index would be able to lock on the same index. And for two Windows machines, the admin would have to verify that the index fs (samba/afs/nfs) mounts with the same drive letter. This seems like a limitation on one hand, and also a source of possible problems when users misconfigure their mount names. I may be missing something trivial here, because it seems too wrong to be true... I'll let the list comment on that... temporary file access denied on Windows --- Key: LUCENE-665 URL: http://issues.apache.org/jira/browse/LUCENE-665
[jira] Commented: (LUCENE-665) temporary file access denied on Windows
[ http://issues.apache.org/jira/browse/LUCENE-665?page=comments#action_12445724 ] Doron Cohen commented on LUCENE-665: Odd that just by using native locking, it stopped your issues. Agree - I did not expect that to happen, since indeed I saw in the past exceptions on renameFile, though most exceptions were in lock activity. So I ran it many times, with an antivirus scan, etc., but it always passes. Therefore I would not object to closing this issue - if I cannot test it, I cannot fix it. But for the same reason, I would like to see native locks becoming the default. setLockPrefix() - I'll take this one to a separate thread on the dev list. temporary file access denied on Windows --- Key: LUCENE-665 URL: http://issues.apache.org/jira/browse/LUCENE-665
[jira] Commented: (LUCENE-697) Scorer.skipTo affects sloppyPhrase scoring
[ http://issues.apache.org/jira/browse/LUCENE-697?page=comments#action_12445795 ] Doron Cohen commented on LUCENE-697: An updated version of this patch - sloppy_phrase_skipTo.patch2. I modified QueryUtils.java (test util) to test all the sequences, not just one. It is also now quite easy to add a new sequence to be tested, if needed. Other changes in this patch remain: - PhraseQueue: this is the fix. - ExactPhraseScorer: added a call to clear the queue - not a must, but cleaner this way. - PhraseScorer: added a mark that init was done at skip - again not a must, just cleaner this way. All ant test tests pass. - Doron Scorer.skipTo affects sloppyPhrase scoring -- Key: LUCENE-697 URL: http://issues.apache.org/jira/browse/LUCENE-697
[jira] Created: (LUCENE-706) Index File Format - Example for frequency file .frq is wrong
Index File Format - Example for frequency file .frq is wrong Key: LUCENE-706 URL: http://issues.apache.org/jira/browse/LUCENE-706 Project: Lucene - Java Issue Type: Improvement Components: Website Environment: not applicable Reporter: Doron Cohen Assigned To: Doron Cohen Priority: Trivial Reported by Johan Stuyts - http://www.nabble.com/Possible-documentation-error--p7012445.html - Frequency file example says: For example, the TermFreqs for a term which occurs once in document seven and three times in document eleven would be the following sequence of VInts: 15, 22, 3 It should be: For example, the TermFreqs for a term which occurs once in document seven and three times in document eleven would be the following sequence of VInts: 15, 8, 3 -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
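A small sketch of the encoding behind the corrected example: each entry stores the gap from the previous document number, shifted left by one, with the low bit set when the frequency is one (in which case the frequency itself is omitted):

    import java.util.ArrayList;
    import java.util.List;

    public class FrqExample {

      /** Encode (doc, freq) pairs the way the TermFreqs example describes. */
      static List encode(int[][] postings) {
        List vints = new ArrayList();
        int lastDoc = 0;
        for (int i = 0; i < postings.length; i++) {
          int doc = postings[i][0], freq = postings[i][1];
          int gap = doc - lastDoc;
          if (freq == 1) {
            vints.add(new Integer((gap << 1) | 1)); // gap with low bit set, freq omitted
          } else {
            vints.add(new Integer(gap << 1));       // gap with low bit clear
            vints.add(new Integer(freq));           // followed by the frequency
          }
          lastDoc = doc;
        }
        return vints;
      }

      public static void main(String[] args) {
        // once in document seven, three times in document eleven -> [15, 8, 3]
        System.out.println(encode(new int[][] { { 7, 1 }, { 11, 3 } }));
      }
    }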
[jira] Updated: (LUCENE-706) Index File Format - Example for frequency file .frq is wrong
[ http://issues.apache.org/jira/browse/LUCENE-706?page=all ] Doron Cohen updated LUCENE-706: --- Lucene Fields: [New, Patch Available] (was: [New]) Index File Format - Example for frequency file .frq is wrong Key: LUCENE-706 URL: http://issues.apache.org/jira/browse/LUCENE-706 Attachments: file-format-frq-example.patch
[jira] Commented: (LUCENE-706) Index File Format - Example for frequency file .frq is wrong
[ http://issues.apache.org/jira/browse/LUCENE-706?page=comments#action_12447049 ] Doron Cohen commented on LUCENE-706: Right, sorry - I copied that hex data from an .frq of an index with a different example, where the frequencies were 1 in doc 6 and 3 in doc 10, so there you would get 2 * 6 + 1 = 13. For the correct example of freq 1 in doc 7 and 3 in doc 11 the .frq content is 0F 08 03 as it should be. (Meaning that the documentation should still be fixed...;-) Index File Format - Example for frequency file .frq is wrong Key: LUCENE-706 URL: http://issues.apache.org/jira/browse/LUCENE-706 Project: Lucene - Java Issue Type: Improvement Components: Website Environment: not applicable Reporter: Doron Cohen Assigned To: Grant Ingersoll Priority: Trivial Attachments: file-format-frq-example.patch Reported by Johan Stuyts - http://www.nabble.com/Possible-documentation-error--p7012445.html - Frequency file example says: For example, the TermFreqs for a term which occurs once in document seven and three times in document eleven would be the following sequence of VInts: 15, 22, 3 It should be: For example, the TermFreqs for a term which occurs once in document seven and three times in document eleven would be the following sequence of VInts: 15, 8, 3 -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
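The VInt arithmetic behind both examples can be reproduced in a few lines of Java: the gap to the previous document number is shifted left by one bit, the low bit is set when the frequency is one, and otherwise the frequency follows as a separate VInt. The helper below is illustrative only and is not part of the attached patch.

    /**
     * Illustrates the TermFreqs encoding from the file-format document:
     * freq 1 in doc 7 and freq 3 in doc 11 encode as the VInts 15, 8, 3.
     */
    public class TermFreqsExample {
      public static void main(String[] args) {
        int[] docs = {7, 11};
        int[] freqs = {1, 3};
        int lastDoc = 0;
        StringBuffer out = new StringBuffer();
        for (int i = 0; i < docs.length; i++) {
          int delta = docs[i] - lastDoc;           // gap to the previous document
          if (freqs[i] == 1) {
            out.append(delta * 2 + 1).append(' '); // low bit set: freq is 1
          } else {
            out.append(delta * 2).append(' ');     // low bit clear: freq follows
            out.append(freqs[i]).append(' ');
          }
          lastDoc = docs[i];
        }
        System.out.println(out);                   // prints: 15 8 3
      }
    }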
[jira] Updated: (LUCENE-675) Lucene benchmark: objective performance test for Lucene
[ http://issues.apache.org/jira/browse/LUCENE-675?page=all ] Doron Cohen updated LUCENE-675: --- Attachment: timedata.zip I tried it and it is working nicely! - 1st run downloaded the documents from the Web before starting to index. 2nd run started right off - as input docs are already in place - great. Seems the only output is what is printed to stdout, right? I got something like this: [echo] Working Directory: work [java] Testing 4 different permutations. [java] #-- ID: td-00_10_10, Sun Nov 05 22:40:49 PST 2006, heap=1065484288 -- [java] # source=work\reuters-out, [EMAIL PROTECTED]:\devoss\lucene\java\trunk\contrib\benchmark\work\index [java] # maxBufferedDocs=10, mergeFactor=10, compound=true, optimize=true [java] # Query data: R-reopen, W-warmup, T-retrieve, N-no [java] # qd-0110 R W NT [body:salomon] [java] # qd-0111 R W T [body:salomon] [java] # qd-0100 R NW NT [body:salomon] ... [java] # qd-14011 NR W T [body:fo*] [java] # qd-14000 NR NW NT [body:fo*] [java] # qd-14001 NR NW T [body:fo*] [java] Start Time: Sun Nov 05 22:41:38 PST 2006 [java] - processed 500, run id=0 [java] - processed 1000, run id=0 [java] - processed 1500, run id=0 [java] - processed 2000, run id=0 [java] End Time: Sun Nov 05 22:41:48 PST 2006 [java] warm = Warm Index Reader [java] srch = Search Index [java] trav = Traverse Hits list, optionally retrieving document [java] # testData id operation runCnt recCnt rec/s avgFreeMem avgTotalMem [java] td-00_100_100 addDocument 1 2000 472.0321 4493681 22611558 [java] td-00_100_100 optimize 1 1 2.857143 4229488 22716416 [java] td-00_100_100 qd-0110-warm 1 2000 4.0 4250992 22716416 [java] td-00_100_100 qd-0110-srch 1 1 Infinity 4221288 22716416 ... [java] td-00_100_100 qd-4110-srch 1 1 Infinity 3993624 22716416 [java] td-00_100_100 qd-4110-trav 1 0 NaN 3993624 22716416 [java] td-00_100_100 qd-4111-warm 1 2000 5.0 3853192 22716416 ... BUILD SUCCESSFUL Total time: 1 minute 0 seconds I think the Infinity and NaN are caused by an op time too short for the divide-by-seconds. This can be avoided by modifying getRate() in TimeData: public double getRate() { double rps = (double) count * 1000.0 / (double) (elapsed > 0 ? elapsed : 1); return rps; } I very much like the logic of loading test data from the Web, and the scaleUp and maximumDocumentsToIndex params are handy. It seems that all the test logic and some of its data (queries) are Java coded. I initially thought of a setting where we define tasks/jobs that are parameterized, like: - createIndex(params) - writeToIndex(params): - addDocs() - optimize() - readFromIndex(params): - searchIndex() - fetchData() ..and compose a test by an XML that says which of these simple jobs to run, with what params, in which order, serial/parallel, how long/often etc. Then creating a different test is as easy as creating a different XML that configures that test. On the other hand, chances are, I know, that most useful cases would be those already defined here - standard and micro-standard, so one can ask why bother changing to define these building blocks. I am not sure here, but thought I'd bring it up. About using the driver - seems nice and clean to me. I don't know the Digester but it seems to read the config from the XML correctly. Other comments: 1. I think there is a redundant call to params.showRunData(params.getId()) in runBenchmark(File,Options); 2. Seems that rec/sec would be a bit more accurately computed by aggregating elapsed times (instead of rates) in showRunData() 3. If TimeData is not found (only memData) I think an additional 0.0 should be printed 4. Column alignment with tabs and floats is imperfect. :-) 5. It would be nice I think to also get a summary of the results by task - e.g. srch, optimize, something like: [java] # testData id operation runCnt recCnt rec/s avgFreeMem avgTotalMem [java] warm 60 2000 42,628.8 8,235,758 23,048,192 [java] srch 120 1 571.4 8,300,613 23,048,192 [java] optimize 1 1 2.9 9,375,732 23,048,192 [java] trav 120 107 30,517.8 8,326,046 23,048,192 [java] addDocument 1 2000 441.8 7,310,929 22,206,872 Attached timedata.zip has modified TimeData.java and
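The two suggestions above (guarding getRate() against a zero elapsed time, and computing aggregate rec/s from summed counts and summed elapsed times rather than by averaging rates) amount to something like the following sketch; the field and method names here are assumptions, not the actual TimeData API.

    /** Sketch of the rate computations suggested above (names are assumptions). */
    public class RateUtil {
      /** Records per second for one measurement; avoids Infinity/NaN when elapsed is 0. */
      public static double getRate(long count, long elapsedMillis) {
        return count * 1000.0 / (elapsedMillis > 0 ? elapsedMillis : 1);
      }

      /** Aggregate rec/s computed from summed counts and summed elapsed times. */
      public static double aggregateRate(long[] counts, long[] elapsedMillis) {
        long totalCount = 0;
        long totalElapsed = 0;
        for (int i = 0; i < counts.length; i++) {
          totalCount += counts[i];
          totalElapsed += elapsedMillis[i];
        }
        return getRate(totalCount, totalElapsed);
      }
    }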
[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene
[ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12449117 ] Doron Cohen commented on LUCENE-675: I looked at extending the benchmark with: - different test scenarios, i.e. other sequences of operations. - multithreaded tests, e.g. several queries in parallel. - rate of events, e.g. 2 queries arriving per second, or one query per second in parallel with 20 new documents in a minute. - different data sources (input documents, queries). For this I made lots of changes to the benchmark code, using parts of it and rewriting other parts. I would like to submit this code in a few days - it is running already but some functionality is missing. I would like to describe how it works to hopefully get early feedback. There are several basic tasks defined - all extending an (abstract) class PerfTask: - AddDocTask - OptimizeTask - CreateIndexTask etc. To further extend the benchmark 'framework', new tasks can be added. Each task must implement the abstract method: doLogic(). For instance, in AddDocTask this method (doLogic) would call indexWriter.addDocument(). There are also setup() and tearDown() methods for performing work that should not be timed for that task. A special TaskSequence task contains other tasks. It is either parallel or sequential, which tells if it executes its child tasks serially or in parallel. TaskSequence also supports rate: the pace in which its child tasks are fired can be controlled. With these tasks, it is possible to describe a performance test 'algorithm' in a simple syntax. ('algorithm' may be too big a word for this...?) A test invocation takes two parameters: - test.properties - file with various config properties. - test.alg - file with the algorithm. By convention, for each task class OpNameTask, the command OpName is valid in test.alg. Adding a single document is done by: AddDoc Adding 3 documents: AddDoc AddDoc AddDoc Or, alternatively: { AddDoc } : 3 So, '{' and '}' indicate a serial sequence of (child) tasks. To fire 100 queries in a row: { Search } : 100 To fire 100 queries in parallel: [ Search ] : 100 So, '[' and ']' indicate a parallel group of tasks. To fire 100 queries in a row, 2 queries per second (120 per minute): { Search } : 100 : 120 Similar, but in parallel: [ Search ] : 100 : 120 A sequence task can be named for identifying it in reports: { QueriesA Search } : 100 : 120 And there are tasks that create reports. There are more tasks, and more to tell on the alg syntax, but this post is already long.. I find this quite powerful for perf testing. What do you (and you) think? - Doron Lucene benchmark: objective performance test for Lucene --- Key: LUCENE-675 URL: http://issues.apache.org/jira/browse/LUCENE-675 Project: Lucene - Java Issue Type: Improvement Reporter: Andrzej Bialecki Assigned To: Grant Ingersoll Attachments: benchmark.patch, BenchmarkingIndexer.pm, extract_reuters.plx, LuceneBenchmark.java, LuceneIndexer.java, timedata.zip We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests. Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. 
The benchmarking suite could automatically retrieve it from known locations, and cache it locally. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
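To make the task framework described in the comment above more concrete, a rough sketch of what a task and a concrete AddDocTask could look like follows. The names mirror the ones mentioned there (PerfTask, doLogic(), setup(), tearDown()), but the exact signatures in the eventual patch may differ.

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;

    /** Abstract unit of work; only the time spent in doLogic() should be measured. */
    abstract class PerfTask {
      public abstract int doLogic() throws Exception;  // the timed work; returns records processed
      public void setup() throws Exception {}          // untimed preparation
      public void tearDown() throws Exception {}       // untimed cleanup

      /** Runs the task and returns the elapsed time of doLogic() in milliseconds. */
      public final long run() throws Exception {
        setup();
        long t0 = System.currentTimeMillis();
        doLogic();
        long elapsed = System.currentTimeMillis() - t0;
        tearDown();
        return elapsed;
      }
    }

    /** Example concrete task: add one document to the index under test. */
    class AddDocTask extends PerfTask {
      private final IndexWriter writer;
      private final Document doc;

      AddDocTask(IndexWriter writer, Document doc) {
        this.writer = writer;
        this.doc = doc;
      }

      public int doLogic() throws Exception {
        writer.addDocument(doc);  // only this call is timed by run()
        return 1;
      }
    }

A TaskSequence would then hold a list of such tasks and either run them serially or start a thread per child task, optionally sleeping between task firings to honor a requested rate.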
[jira] Updated: (LUCENE-675) Lucene benchmark: objective performance test for Lucene
[ http://issues.apache.org/jira/browse/LUCENE-675?page=all ] Doron Cohen updated LUCENE-675: --- Attachment: tiny.alg tiny.properties I am attaching a sample tiny.* - the .alg and .properties files I currently use - I think they may help to understand how this works. Lucene benchmark: objective performance test for Lucene --- Key: LUCENE-675 URL: http://issues.apache.org/jira/browse/LUCENE-675 Project: Lucene - Java Issue Type: Improvement Reporter: Andrzej Bialecki Assigned To: Grant Ingersoll Attachments: benchmark.patch, BenchmarkingIndexer.pm, extract_reuters.plx, LuceneBenchmark.java, LuceneIndexer.java, timedata.zip, tiny.alg, tiny.properties We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests. Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene
[ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12449419 ] Doron Cohen commented on LUCENE-675: Sounds good. In this case I will add my stuff under a new package: org.apache.lucene.benchmark2 (this package would have no dependencies on org.apache.lucene.benchmark). I will also add targets in build.xml, and add .alg and .properties files under conf. Makes sense? Do you already know when you are going to commit it? Lucene benchmark: objective performance test for Lucene --- Key: LUCENE-675 URL: http://issues.apache.org/jira/browse/LUCENE-675 Project: Lucene - Java Issue Type: Improvement Reporter: Andrzej Bialecki Assigned To: Grant Ingersoll Attachments: benchmark.patch, BenchmarkingIndexer.pm, extract_reuters.plx, LuceneBenchmark.java, LuceneIndexer.java, timedata.zip, tiny.alg, tiny.properties We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests. Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene
[ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12449779 ] Doron Cohen commented on LUCENE-675: Good point on names with numbers - I'm renaming the package to taskBenchmark, as I think of it as task-sequence based more than as properties based. Lucene benchmark: objective performance test for Lucene --- Key: LUCENE-675 URL: http://issues.apache.org/jira/browse/LUCENE-675 Project: Lucene - Java Issue Type: Improvement Reporter: Andrzej Bialecki Assigned To: Grant Ingersoll Attachments: benchmark.patch, BenchmarkingIndexer.pm, extract_reuters.plx, LuceneBenchmark.java, LuceneIndexer.java, timedata.zip, tiny.alg, tiny.properties We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests. Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene
[ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12449947 ] Doron Cohen commented on LUCENE-675: Would be nice to get some feedback on what I already have at this point for the task based benchmark framework for Lucene. So I am packing it as a zip file. I would probably resubmit as a patch when Grant commits the current benchmark code. See attached taskBenchmark.zip. To try out taskBenchmark, unzip under contrib/benchmark, on top of Grant's benchmark.patch. This would do 3 changes: 1. replace build.xml - only change there is adding two targets: run-task-standard and run-task-micro-standard. 2. add 4 new files under conf: - task-standard.properties - task-standard.alg - task-micro-standard.properties - task-micro-standard.alg 3. add a src package 'taskBenchmark' side by side with current 'benchmark' package. To try it out, go to contrib/benchmark and try 'ant run-task-standard' or 'ant run-task-micro-standard'. See inside the .alg files for how a test is specified. The algorithm syntax and the entire package is documented in the package javadoc for taskBenchmark (package.html). Regards, Doron Lucene benchmark: objective performance test for Lucene --- Key: LUCENE-675 URL: http://issues.apache.org/jira/browse/LUCENE-675 Project: Lucene - Java Issue Type: Improvement Reporter: Andrzej Bialecki Assigned To: Grant Ingersoll Attachments: benchmark.patch, BenchmarkingIndexer.pm, extract_reuters.plx, LuceneBenchmark.java, LuceneIndexer.java, timedata.zip, tiny.alg, tiny.properties We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests. Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-675) Lucene benchmark: objective performance test for Lucene
[ http://issues.apache.org/jira/browse/LUCENE-675?page=all ] Doron Cohen updated LUCENE-675: --- Attachment: benchmark.byTask.patch I am attaching benchmark.byTask.patch - to be applied in the contrib/benchmark directory. The root package of the byTask classes was modified to org.apache.lucene.benchmark.byTask, along the lines of Grant's suggestion - seems better because it keeps all benchmark classes under lucene.benchmark. I added a sample .alg under conf and added some documentation. The entry point - documentation wise - is the package doc for org.apache.lucene.benchmark.byTask. Thanks for any comments on this! PS. Before submitting the patch file, I tried to apply it myself on a clean version of the code, just to make sure that it works. But I got errors like this -- Could not retrieve revision 0 of ...\byTask\.. -- for every file under a new folder. So I am not sure if it is just my (Windows) svn patch applying utility, or whether it is really impossible to apply a patch that creates files in (yet) nonexistent directories. I searched Lucene mailing lists and SVN mailing lists and went through the SVN book again, but nowhere could I find the expected behavior for applying a patch containing new directories. In fact, svn diff would not even show you files that are new (again, this is the Windows svn 1.4.2 version). (I used Tortoise SVN to create the patch.) This is rather annoying and I might be misunderstanding something basic about SVN, but I thought it'd be better to share this experience here - might save some time for others trying to apply this patch or other patches ... Lucene benchmark: objective performance test for Lucene --- Key: LUCENE-675 URL: http://issues.apache.org/jira/browse/LUCENE-675 Project: Lucene - Java Issue Type: Improvement Reporter: Andrzej Bialecki Assigned To: Grant Ingersoll Attachments: benchmark.byTask.patch, benchmark.patch, BenchmarkingIndexer.pm, extract_reuters.plx, LuceneBenchmark.java, LuceneIndexer.java, taskBenchmark.zip, timedata.zip, tiny.alg, tiny.properties We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests. Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-717) src builds fail because of no lib directory
[ http://issues.apache.org/jira/browse/LUCENE-717?page=comments#action_12453672 ] Doron Cohen commented on LUCENE-717: That's because junit.jar is required for compiling and running the tests. (Guess we can't distribute junit.jar with Lucene.) This is from common-build.xml: ## JUnit not found. Please make sure junit.jar is in ANT_HOME/lib, or made available to Ant using other mechanisms like -lib or CLASSPATH. ## src builds fail because of no lib directory - Key: LUCENE-717 URL: http://issues.apache.org/jira/browse/LUCENE-717 Project: Lucene - Java Issue Type: Bug Components: Other Affects Versions: 2.0.0 Reporter: Hoss Man I just downloaded http://mirrors.ibiblio.org/pub/mirrors/apache/lucene/java/lucene-2.0.0-src.tar.gz and noticed that you can't compile and run the tests from that src build because it doesn't inlcude the lib dir (and the build file won't attempt to make it if it doesn't exist) ... [EMAIL PROTECTED]:~/tmp/l2$ tar -xzvf lucene-2.0.0-src.tar.gz ... [EMAIL PROTECTED]:~/tmp/l2$ cd lucene-2.0.0/ [EMAIL PROTECTED]:~/tmp/l2/lucene-2.0.0$ ant test ... test: [mkdir] Created dir: /home/hossman/tmp/l2/lucene-2.0.0/build/test BUILD FAILED /home/hossman/tmp/l2/lucene-2.0.0/common-build.xml:169: /home/hossman/tmp/l2/lucene-2.0.0/lib not found. (it's refrenced in junit.classpath, but i'm not relaly sure why) -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-717) src builds fail because of no lib directory
[ http://issues.apache.org/jira/browse/LUCENE-717?page=all ] Doron Cohen updated LUCENE-717: --- Lucene Fields: [New, Patch Available] (was: [New]) src builds fail because of no lib directory - Key: LUCENE-717 URL: http://issues.apache.org/jira/browse/LUCENE-717 Project: Lucene - Java Issue Type: Bug Components: Other Affects Versions: 2.0.0 Reporter: Hoss Man Attachments: common-build.xml.patch.txt I just downloaded http://mirrors.ibiblio.org/pub/mirrors/apache/lucene/java/lucene-2.0.0-src.tar.gz and noticed that you can't compile and run the tests from that src build because it doesn't inlcude the lib dir (and the build file won't attempt to make it if it doesn't exist) ... [EMAIL PROTECTED]:~/tmp/l2$ tar -xzvf lucene-2.0.0-src.tar.gz ... [EMAIL PROTECTED]:~/tmp/l2$ cd lucene-2.0.0/ [EMAIL PROTECTED]:~/tmp/l2/lucene-2.0.0$ ant test ... test: [mkdir] Created dir: /home/hossman/tmp/l2/lucene-2.0.0/build/test BUILD FAILED /home/hossman/tmp/l2/lucene-2.0.0/common-build.xml:169: /home/hossman/tmp/l2/lucene-2.0.0/lib not found. (it's refrenced in junit.classpath, but i'm not relaly sure why) -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-717) src builds fail because of no lib directory
[ http://issues.apache.org/jira/browse/LUCENE-717?page=comments#action_12453738 ] Doron Cohen commented on LUCENE-717: I'm ok with this... src builds fail because of no lib directory - Key: LUCENE-717 URL: http://issues.apache.org/jira/browse/LUCENE-717 Project: Lucene - Java Issue Type: Bug Components: Other Affects Versions: 2.0.0 Reporter: Hoss Man Attachments: common-build.xml.patch.txt I just downloaded http://mirrors.ibiblio.org/pub/mirrors/apache/lucene/java/lucene-2.0.0-src.tar.gz and noticed that you can't compile and run the tests from that src build because it doesn't inlcude the lib dir (and the build file won't attempt to make it if it doesn't exist) ... [EMAIL PROTECTED]:~/tmp/l2$ tar -xzvf lucene-2.0.0-src.tar.gz ... [EMAIL PROTECTED]:~/tmp/l2$ cd lucene-2.0.0/ [EMAIL PROTECTED]:~/tmp/l2/lucene-2.0.0$ ant test ... test: [mkdir] Created dir: /home/hossman/tmp/l2/lucene-2.0.0/build/test BUILD FAILED /home/hossman/tmp/l2/lucene-2.0.0/common-build.xml:169: /home/hossman/tmp/l2/lucene-2.0.0/lib not found. (it's refrenced in junit.classpath, but i'm not relaly sure why) -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-708) Setup nightly build website links and docs
[ http://issues.apache.org/jira/browse/LUCENE-708?page=comments#action_12454375 ] Doron Cohen commented on LUCENE-708: Could 'official' be the most recent release (currently 2.0)? So there would be: Official (2.0), Nightly, 1.9.1, 1.9, 1.4.3. This way everyday users would not have to get into trunk/svn details to understand what docs they are seeing. Setup nightly build website links and docs -- Key: LUCENE-708 URL: http://issues.apache.org/jira/browse/LUCENE-708 Project: Lucene - Java Issue Type: Improvement Components: Website Reporter: Grant Ingersoll Assigned To: Grant Ingersoll Priority: Minor Per discussion on mailing list, we are going to setup a Nightly Build link on the website linking to the docs (and javadocs) generated by the nightly build process. The build process may need to be modified to complete this task. Going forward, the main website will, for the most part, only be updated per releases (I imagine exceptions will be made for News items and per committer's discretion). The Javadocs linked to from the main website will always be for the latest release. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Created: (LUCENE-736) Sloppy Phrase Scoring Misbehavior
Sloppy Phrase Scoring Misbehavior - Key: LUCENE-736 URL: http://issues.apache.org/jira/browse/LUCENE-736 Project: Lucene - Java Issue Type: Bug Components: Search Reporter: Doron Cohen Assigned To: Doron Cohen Priority: Minor This is an extension of https://issues.apache.org/jira/browse/LUCENE-697 In addition to the abnormalities Yonik pointed out in 697, there seem to be other issues with sloppy phrase search and scoring. 1) A phrase with a repeated word would be detected in a document although it is not there. I.e. document = A B D C E , query = B C B would not find this document (as expected), but query B C B~2 would find it. I think that no matter how large the slop is, this document should not be a match. 2) A document containing both orders of a query, symmetrically, would score differently for the query and for its reversed form. I.e. document = A B C B A would score differently for queries B C~2 and C B~2, although it is symmetric to both. I will attach test cases that show both these problems and the one reported by Yonik in 697. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
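Case 2 can be reproduced with a small self-contained program along the lines of the sketch below; the field name "f" and the use of WhitespaceAnalyzer are just choices for the example, not taken from the attached tests. Per the report above, the two printed scores differ even though the document is symmetric.

    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.PhraseQuery;
    import org.apache.lucene.store.RAMDirectory;

    public class SloppySymmetryExample {
      public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new WhitespaceAnalyzer(), true);
        Document doc = new Document();
        doc.add(new Field("f", "A B C B A", Field.Store.NO, Field.Index.TOKENIZED));
        writer.addDocument(doc);
        writer.close();

        IndexSearcher searcher = new IndexSearcher(dir);
        System.out.println("B C ~2 : " + score(searcher, "B", "C"));
        System.out.println("C B ~2 : " + score(searcher, "C", "B"));
        searcher.close();
      }

      // Builds the sloppy phrase <t1 t2>~2 and returns the top score for the single document.
      private static float score(IndexSearcher searcher, String t1, String t2) throws Exception {
        PhraseQuery q = new PhraseQuery();
        q.setSlop(2);
        q.add(new Term("f", t1));
        q.add(new Term("f", t2));
        Hits hits = searcher.search(q);
        return hits.length() == 0 ? 0.0f : hits.score(0);
      }
    }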
[jira] Updated: (LUCENE-736) Sloppy Phrase Scoring Misbehavior
[ http://issues.apache.org/jira/browse/LUCENE-736?page=all ] Doron Cohen updated LUCENE-736: --- Attachment: sloppy_phrase_tests.patch.txt sloppy_phrase_tests.patch.txt contains: - two test cases added in TestPhraseQuery. These new tests currently fail. - skipTo() behavior tests that were originally in issue 697. This too currently fails. Sloppy Phrase Scoring Misbehavior - Key: LUCENE-736 URL: http://issues.apache.org/jira/browse/LUCENE-736 Project: Lucene - Java Issue Type: Bug Components: Search Reporter: Doron Cohen Assigned To: Doron Cohen Priority: Minor Attachments: sloppy_phrase_tests.patch.txt This is an extension of https://issues.apache.org/jira/browse/LUCENE-697 In addition to abnormalities Yonik pointed out in 697, there seem to be other issues with slopy phrase search and scoring. 1) A phrase with a repeated word would be detected in a document although it is not there. I.e. document = A B D C E , query = B C B would not find this document (as expected), but query B C B~2 would find it. I think that no matter how large the slop is, this document should not be a match. 2) A document containing both orders of a query, symmetrically, would score differently for the queru and for its reveresed form. I.e. document = A B C B A would score differently for queries B C~2 and C B~2, although it is symmetric to both. I will attach test cases that show both these problems and the one reported by Yonik in 697. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-736) Sloppy Phrase Scoring Misbehavior
[ http://issues.apache.org/jira/browse/LUCENE-736?page=all ] Doron Cohen updated LUCENE-736: --- Attachment: sloppy_phrase_java.patch.txt perf-search-new.log perf-search-orig.log Attached sloppy_phrase_java.patch.txt fixes the failing new tests. This also includes the fix for the skipTo() bug from issue 697. The fix does not guarantee that document A B C B A would score A B C~4 and C B A~4 the same. It does that for B C~2 and C B~2. This is because a general fix for that (at least the one that I devised) would be too expensive. Although this is an interesting case, I'd like to think it is not an important one. This fix comes with a performance cost: about 15% degradation in CPU activity of sloppy phrase scoring, as the attached perf logs show. Here is the summary of these tests: Orig: SearchSameRdr_3000 runCnt=4 recsPerRun=3000 rec/s=216.1 elapsedSec=55.52 New: SearchSameRdr_3000 runCnt=4 recsPerRun=3000 rec/s=187.8 elapsedSec=63.91 I think that in a real life scenario - real index, real documents, real queries - this extra CPU will be overshadowed by IO, but I also believe we should refrain from slowing down search, so, unhappy with this degradation (anyone would :-), I would look for other ways to fix this - ideas are welcome. The perf test was done using the task benchmark framework (see issue 675). The logs also show the queries that were searched. All tests pass with the new code. Sloppy Phrase Scoring Misbehavior - Key: LUCENE-736 URL: http://issues.apache.org/jira/browse/LUCENE-736 Project: Lucene - Java Issue Type: Bug Components: Search Reporter: Doron Cohen Assigned To: Doron Cohen Priority: Minor Attachments: perf-search-new.log, perf-search-orig.log, sloppy_phrase_java.patch.txt, sloppy_phrase_tests.patch.txt This is an extension of https://issues.apache.org/jira/browse/LUCENE-697 In addition to abnormalities Yonik pointed out in 697, there seem to be other issues with slopy phrase search and scoring. 1) A phrase with a repeated word would be detected in a document although it is not there. I.e. document = A B D C E , query = B C B would not find this document (as expected), but query B C B~2 would find it. I think that no matter how large the slop is, this document should not be a match. 2) A document containing both orders of a query, symmetrically, would score differently for the queru and for its reveresed form. I.e. document = A B C B A would score differently for queries B C~2 and C B~2, although it is symmetric to both. I will attach test cases that show both these problems and the one reported by Yonik in 697. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-736) Sloppy Phrase Scoring Misbehavior
[ http://issues.apache.org/jira/browse/LUCENE-736?page=all ] Doron Cohen updated LUCENE-736: --- Attachment: sloppy_phrase.patch2.txt res-search-orig2.log res-search-new2.log The change to fix case 2 was not the main performance degradation cause. I agree with Yonik that case 2 is much more important than case 1. So I modified the fix to handle case 2 but not case 1. Also extended the perf test to also create the reversed form of the sloppy phrases (slop increased for reversed cases so that queries would match docs). The cost of this fix dropped from 15% more CPU time to about 3%. I feel ok with this. Orig: SearchSameRdr_6000 runCnt=4 recsPerRun=6000 rec/s=194.2 elapsedSec=123.59 avgUsedMem=8,032,732 avgTotalMem=11,333,632 New: SearchSameRdr_6000 runCnt=4 recsPerRun=6000 rec/s=187.5 elapsedSec=128.02 avgUsedMem=8,172,258 avgTotalMem=11,333,632 Attached sloppy_phrase.patch2.txt has the updated fix, including both java and test parts. Some of the asserts in the new tests were commented out because the patch takes the decision not to fix case 1 above. Also attaching the updated perf test logs - res-search-new2.log and res-search-orig2.log. I did not compare scoring of similar cases between sloppy phrases and near spans as Paul suggested - perhaps next week - not sure this should hold up progress on this issue. Sloppy Phrase Scoring Misbehavior - Key: LUCENE-736 URL: http://issues.apache.org/jira/browse/LUCENE-736 Project: Lucene - Java Issue Type: Bug Components: Search Reporter: Doron Cohen Assigned To: Doron Cohen Priority: Minor Attachments: perf-search-new.log, perf-search-orig.log, res-search-new2.log, res-search-orig2.log, sloppy_phrase.patch2.txt, sloppy_phrase_java.patch.txt, sloppy_phrase_tests.patch.txt This is an extension of https://issues.apache.org/jira/browse/LUCENE-697 In addition to abnormalities Yonik pointed out in 697, there seem to be other issues with slopy phrase search and scoring. 1) A phrase with a repeated word would be detected in a document although it is not there. I.e. document = A B D C E , query = B C B would not find this document (as expected), but query B C B~2 would find it. I think that no matter how large the slop is, this document should not be a match. 2) A document containing both orders of a query, symmetrically, would score differently for the queru and for its reveresed form. I.e. document = A B C B A would score differently for queries B C~2 and C B~2, although it is symmetric to both. I will attach test cases that show both these problems and the one reported by Yonik in 697. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-736) Sloppy Phrase Scoring Misbehavior
[ http://issues.apache.org/jira/browse/LUCENE-736?page=all ] Doron Cohen updated LUCENE-736: --- Lucene Fields: [New, Patch Available] (was: [New]) Sloppy Phrase Scoring Misbehavior - Key: LUCENE-736 URL: http://issues.apache.org/jira/browse/LUCENE-736 Project: Lucene - Java Issue Type: Bug Components: Search Reporter: Doron Cohen Assigned To: Doron Cohen Priority: Minor Attachments: perf-search-new.log, perf-search-orig.log, res-search-new2.log, res-search-orig2.log, sloppy_phrase.patch2.txt, sloppy_phrase_java.patch.txt, sloppy_phrase_tests.patch.txt This is an extension of https://issues.apache.org/jira/browse/LUCENE-697 In addition to abnormalities Yonik pointed out in 697, there seem to be other issues with slopy phrase search and scoring. 1) A phrase with a repeated word would be detected in a document although it is not there. I.e. document = A B D C E , query = B C B would not find this document (as expected), but query B C B~2 would find it. I think that no matter how large the slop is, this document should not be a match. 2) A document containing both orders of a query, symmetrically, would score differently for the queru and for its reveresed form. I.e. document = A B C B A would score differently for queries B C~2 and C B~2, although it is symmetric to both. I will attach test cases that show both these problems and the one reported by Yonik in 697. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-736) Sloppy Phrase Scoring Misbehavior
[ http://issues.apache.org/jira/browse/LUCENE-736?page=comments#action_12455422 ] Doron Cohen commented on LUCENE-736: There is a bug in my recent patch (sloppy_phrase.patch2.txt): - for the case of phrase with repetitions, some additional computation is required before starting each doc. - this does not affect the regular/common case of phrase with no repetitions. I extended the test to expose this and will commit an updated patch later today. Sloppy Phrase Scoring Misbehavior - Key: LUCENE-736 URL: http://issues.apache.org/jira/browse/LUCENE-736 Project: Lucene - Java Issue Type: Bug Components: Search Reporter: Doron Cohen Assigned To: Doron Cohen Priority: Minor Attachments: perf-search-new.log, perf-search-orig.log, res-search-new2.log, res-search-orig2.log, sloppy_phrase.patch2.txt, sloppy_phrase_java.patch.txt, sloppy_phrase_tests.patch.txt This is an extension of https://issues.apache.org/jira/browse/LUCENE-697 In addition to abnormalities Yonik pointed out in 697, there seem to be other issues with slopy phrase search and scoring. 1) A phrase with a repeated word would be detected in a document although it is not there. I.e. document = A B D C E , query = B C B would not find this document (as expected), but query B C B~2 would find it. I think that no matter how large the slop is, this document should not be a match. 2) A document containing both orders of a query, symmetrically, would score differently for the queru and for its reveresed form. I.e. document = A B C B A would score differently for queries B C~2 and C B~2, although it is symmetric to both. I will attach test cases that show both these problems and the one reported by Yonik in 697. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Created: (LUCENE-738) read/write .del as d-gaps when the deleted bit vector is sufficiently sparse
read/write .del as d-gaps when the deleted bit vector is sufficiently sparse - Key: LUCENE-738 URL: http://issues.apache.org/jira/browse/LUCENE-738 Project: Lucene - Java Issue Type: Improvement Components: Store Affects Versions: 2.1 Reporter: Doron Cohen Assigned To: Doron Cohen The .del file of a segment maintains info on deleted documents in that segment. The file exists only for segments having deleted docs, so it does not exist for newly created segments (e.g. those resulting from a merge). Each time an index reader that deleted any document is closed, the .del file is rewritten. In fact, since the lock-less commits change, a new (generation of the) .del file is created on each such occasion. For small indexes there is no real problem with the current situation. But for very large indexes, each time such an index reader is closed, creating such a new bit vector seems like unnecessary overhead in cases where the bit vector is sparse (just a few docs were deleted). For instance, for an index with a segment of 1M docs, the sequence {open reader; delete 1 doc from that segment; close reader;} would write a file of ~128KB. Repeat this sequence 8 times: 8 new files with a total size of 1MB are written to disk. Whether this is a bottleneck or not depends on the application's delete pattern, but for the case where deleted docs are sparse, writing just the d-gaps would save space and time. I have this (simple) change to BitVector running and am currently trying some performance tests to convince myself of the worthiness of this. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
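A rough sketch of the d-gap idea, assuming the sparse form simply stores the number of deleted docs followed by the gaps between consecutive deleted doc ids (the actual on-disk format in the patch may differ): with a handful of deletions in a 1M-doc segment this is a few small numbers instead of a ~128KB bit vector.

    import java.util.ArrayList;
    import java.util.BitSet;
    import java.util.List;

    /** Sketch of d-gap encoding for a sparse deleted-docs bit vector (format is illustrative). */
    public class DGapSketch {
      /** Returns the deleted-doc count followed by the gaps between consecutive deleted doc ids. */
      public static List encode(BitSet deleted) {
        List out = new ArrayList();
        out.add(new Integer(deleted.cardinality()));  // number of deleted docs
        int last = -1;
        for (int doc = deleted.nextSetBit(0); doc >= 0; doc = deleted.nextSetBit(doc + 1)) {
          out.add(new Integer(doc - last));           // gap to the previous deleted doc
          last = doc;
        }
        return out;
      }

      public static void main(String[] args) {
        BitSet deleted = new BitSet(1000000);         // 1M-doc segment
        deleted.set(17);                              // a single deleted doc
        // Encodes as [1, 18]: two small numbers instead of writing out every bit.
        System.out.println(encode(deleted));
      }
    }

Written as VInts, each gap costs only a byte or two for sparse deletions, which is where the space and IO savings come from; once deletions become dense, the plain bit vector is the more compact choice.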
[jira] Updated: (LUCENE-738) read/write .del as d-gaps when the deleted bit vector is sufficiently sparse
[ http://issues.apache.org/jira/browse/LUCENE-738?page=all ] Doron Cohen updated LUCENE-738: --- Attachment: del.dgap.patch.txt Patch added: del.dgap.patch.txt, for the above option (1) - writing d-gaps for the ids of deleted docs. The patch changes the index format, but is backwards compatible. I still need to update the FileFormats document - will add that part of the patch later. read/write .del as d-gaps when the deleted bit vector is sufficiently sparse Key: LUCENE-738 URL: http://issues.apache.org/jira/browse/LUCENE-738 Project: Lucene - Java Issue Type: Improvement Components: Store Affects Versions: 2.1 Reporter: Doron Cohen Assigned To: Doron Cohen Attachments: del.dgap.patch.txt .del file of a segment maintains info on deleted documents in that segment. The file exists only for segments having deleted docs, so it does not exists for newly created segments (e.g. resulted from merge). Each time closing an index reader that deleted any document, the .del file is rewritten. In fact, since the lock-less commits change a new (generation of) .del file is created in each such occasion. For small indexes there is no real problem with current situation. But for very large indexes, each time such an index reader is closed, creating such new bit-vector seems like unnecessary overhead in cases that the bit vector is sparse (just a few docs were deleted). For instance, for an index with a segment of 1M docs, the sequence: {open reader; delete 1 doc from that segment; close reader;} would write a file of ~128KB. Repeat this sequence 8 times: 8 new files of total size of 1MB are written to disk. Whether this is a bottleneck or not depends on the application deletes pattern, but for the case that deleted docs are sparse, writing just the d-gaps would save space and time. I have this (simple) change to BitVector running and currently trying some performance tests to, yet, convince myself on the worthiness of this. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-738) read/write .del as d-gaps when the deleted bit vector is sufficiently sparse
[ http://issues.apache.org/jira/browse/LUCENE-738?page=all ] Doron Cohen updated LUCENE-738: --- Lucene Fields: [Patch Available] (was: [New]) read/write .del as d-gaps when the deleted bit vector is sufficiently sparse Key: LUCENE-738 URL: http://issues.apache.org/jira/browse/LUCENE-738 Project: Lucene - Java Issue Type: Improvement Components: Store Affects Versions: 2.1 Reporter: Doron Cohen Assigned To: Doron Cohen Attachments: del.dgap.patch.txt .del file of a segment maintains info on deleted documents in that segment. The file exists only for segments having deleted docs, so it does not exists for newly created segments (e.g. resulted from merge). Each time closing an index reader that deleted any document, the .del file is rewritten. In fact, since the lock-less commits change a new (generation of) .del file is created in each such occasion. For small indexes there is no real problem with current situation. But for very large indexes, each time such an index reader is closed, creating such new bit-vector seems like unnecessary overhead in cases that the bit vector is sparse (just a few docs were deleted). For instance, for an index with a segment of 1M docs, the sequence: {open reader; delete 1 doc from that segment; close reader;} would write a file of ~128KB. Repeat this sequence 8 times: 8 new files of total size of 1MB are written to disk. Whether this is a bottleneck or not depends on the application deletes pattern, but for the case that deleted docs are sparse, writing just the d-gaps would save space and time. I have this (simple) change to BitVector running and currently trying some performance tests to, yet, convince myself on the worthiness of this. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-738) read/write .del as d-gaps when the deleted bit vector is sufficiently sparse
[ http://issues.apache.org/jira/browse/LUCENE-738?page=all ] Doron Cohen updated LUCENE-738: --- Attachment: FileFormatDoc.patch.txt FileFormat document updated to reflect this format change. read/write .del as d-gaps when the deleted bit vector is sufficiently sparse Key: LUCENE-738 URL: http://issues.apache.org/jira/browse/LUCENE-738 Project: Lucene - Java Issue Type: Improvement Components: Store Affects Versions: 2.1 Reporter: Doron Cohen Assigned To: Doron Cohen Attachments: del.dgap.patch.txt, FileFormatDoc.patch.txt .del file of a segment maintains info on deleted documents in that segment. The file exists only for segments having deleted docs, so it does not exists for newly created segments (e.g. resulted from merge). Each time closing an index reader that deleted any document, the .del file is rewritten. In fact, since the lock-less commits change a new (generation of) .del file is created in each such occasion. For small indexes there is no real problem with current situation. But for very large indexes, each time such an index reader is closed, creating such new bit-vector seems like unnecessary overhead in cases that the bit vector is sparse (just a few docs were deleted). For instance, for an index with a segment of 1M docs, the sequence: {open reader; delete 1 doc from that segment; close reader;} would write a file of ~128KB. Repeat this sequence 8 times: 8 new files of total size of 1MB are written to disk. Whether this is a bottleneck or not depends on the application deletes pattern, but for the case that deleted docs are sparse, writing just the d-gaps would save space and time. I have this (simple) change to BitVector running and currently trying some performance tests to, yet, convince myself on the worthiness of this. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-740) Bugs in contrib/snowball/.../SnowballProgram.java - Kraaij-Pohlmann gives Index-OOB Exception
[ http://issues.apache.org/jira/browse/LUCENE-740?page=comments#action_12457462 ] Doron Cohen commented on LUCENE-740: In addition to the SnowballProgram bug fix, there are a few updates at snowball.tartarus.org compared to the snowball stemmers in Lucene, and a Hungarian stemmer was added. Any reason not to update all the stemmers along with this fix? Bugs in contrib/snowball/.../SnowballProgram.java - Kraaij-Pohlmann gives Index-OOB Exception -- Key: LUCENE-740 URL: http://issues.apache.org/jira/browse/LUCENE-740 Project: Lucene - Java Issue Type: Bug Components: Analysis Affects Versions: 1.9 Environment: linux amd64 Reporter: Andreas Kohn Priority: Minor Attachments: lucene-1.9.1-SnowballProgram.java (copied from mail to java-user) while playing with the various stemmers of Lucene(-1.9.1), I got an index out of bounds exception: lucene-1.9.1java -cp build/contrib/snowball/lucene-snowball-1.9.2-dev.jar net.sf.snowball.TestApp Kp bla.txt Exception in thread main java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:615) at net.sf.snowball.TestApp.main(TestApp.java:56) Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: 11 at java.lang.StringBuffer.charAt(StringBuffer.java:303) at net.sf.snowball.SnowballProgram.find_among_b(SnowballProgram.java:270) at net.sf.snowball.ext.KpStemmer.r_Step_4(KpStemmer.java:1122) at net.sf.snowball.ext.KpStemmer.stem(KpStemmer.java:1997) This happens when executing lucene-1.9.1java -cp build/contrib/snowball/lucene-snowball-1.9.2-dev.jar net.sf.snowball.TestApp Kp bla.txt bla.txt contains just this word: 'spijsvertering'. After some debugging, and some tests with the original snowball distribution from snowball.tartarus.org, it seems that the attached change is needed to avoid the exception. (The change comes from tartarus' SnowballProgram.java) -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-740) Bugs in contrib/snowball/.../SnowballProgram.java - Kraaij-Pohlmann gives Index-OOB Exception
[ http://issues.apache.org/jira/browse/LUCENE-740?page=all ] Doron Cohen updated LUCENE-740: --- Attachment: snowball.patch.txt Updated + new stemmers and SnowballProgram fix from http://snowball.tartarus.org Bugs in contrib/snowball/.../SnowballProgram.java - Kraaij-Pohlmann gives Index-OOB Exception -- Key: LUCENE-740 URL: http://issues.apache.org/jira/browse/LUCENE-740 Project: Lucene - Java Issue Type: Bug Components: Analysis Affects Versions: 1.9 Environment: linux amd64 Reporter: Andreas Kohn Priority: Minor Attachments: lucene-1.9.1-SnowballProgram.java, snowball.patch.txt (copied from mail to java-user) while playing with the various stemmers of Lucene(-1.9.1), I got an index out of bounds exception: lucene-1.9.1java -cp build/contrib/snowball/lucene-snowball-1.9.2-dev.jar net.sf.snowball.TestApp Kp bla.txt Exception in thread main java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:615) at net.sf.snowball.TestApp.main(TestApp.java:56) Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: 11 at java.lang.StringBuffer.charAt(StringBuffer.java:303) at net.sf.snowball.SnowballProgram.find_among_b(SnowballProgram.java:270) at net.sf.snowball.ext.KpStemmer.r_Step_4(KpStemmer.java:1122) at net.sf.snowball.ext.KpStemmer.stem(KpStemmer.java:1997) This happens when executing lucene-1.9.1java -cp build/contrib/snowball/lucene-snowball-1.9.2-dev.jar net.sf.snowball.TestApp Kp bla.txt bla.txt contains just this word: 'spijsvertering'. After some debugging, and some tests with the original snowball distribution from snowball.tartarus.org, it seems that the attached change is needed to avoid the exception. (The change comes from tartarus' SnowballProgram.java) -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-740) Bugs in contrib/snowball/.../SnowballProgram.java - Kraaij-Pohlmann gives Index-OOB Exception
[ http://issues.apache.org/jira/browse/LUCENE-740?page=comments#action_12457605 ] Doron Cohen commented on LUCENE-740: Attached snowball.patch.txt has the latest and greatest, plus a new test case in TestSnowball that demonstrates this Kp stemmer bug. Lucene tests and contrib/snowball tests pass. Bugs in contrib/snowball/.../SnowballProgram.java - Kraaij-Pohlmann gives Index-OOB Exception -- Key: LUCENE-740 URL: http://issues.apache.org/jira/browse/LUCENE-740 Project: Lucene - Java Issue Type: Bug Components: Analysis Affects Versions: 1.9 Environment: linux amd64 Reporter: Andreas Kohn Priority: Minor Attachments: lucene-1.9.1-SnowballProgram.java, snowball.patch.txt (copied from mail to java-user) while playing with the various stemmers of Lucene(-1.9.1), I got an index out of bounds exception: lucene-1.9.1java -cp build/contrib/snowball/lucene-snowball-1.9.2-dev.jar net.sf.snowball.TestApp Kp bla.txt Exception in thread main java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:615) at net.sf.snowball.TestApp.main(TestApp.java:56) Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: 11 at java.lang.StringBuffer.charAt(StringBuffer.java:303) at net.sf.snowball.SnowballProgram.find_among_b(SnowballProgram.java:270) at net.sf.snowball.ext.KpStemmer.r_Step_4(KpStemmer.java:1122) at net.sf.snowball.ext.KpStemmer.stem(KpStemmer.java:1997) This happens when executing lucene-1.9.1java -cp build/contrib/snowball/lucene-snowball-1.9.2-dev.jar net.sf.snowball.TestApp Kp bla.txt bla.txt contains just this word: 'spijsvertering'. After some debugging, and some tests with the original snowball distribution from snowball.tartarus.org, it seems that the attached change is needed to avoid the exception. (The change comes from tartarus' SnowballProgram.java) -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
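The failure itself can be reproduced stand-alone through the generated stemmer's API (setCurrent()/stem()/getCurrent()); the sketch below is not the attached TestSnowball change, just a minimal reproduction of the reported exception on the word from the report.

    import net.sf.snowball.ext.KpStemmer;

    /** Minimal reproduction of the reported Kraaij-Pohlmann failure. */
    public class KpStemmerRepro {
      public static void main(String[] args) {
        KpStemmer stemmer = new KpStemmer();
        stemmer.setCurrent("spijsvertering");  // the word from bla.txt in the report
        // Without the SnowballProgram fix this throws StringIndexOutOfBoundsException.
        stemmer.stem();
        System.out.println(stemmer.getCurrent());
      }
    }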
[jira] Commented: (LUCENE-740) Bugs in contrib/snowball/.../SnowballProgram.java - Kraaij-Pohlmann gives Index-OOB Exception
[ http://issues.apache.org/jira/browse/LUCENE-740?page=comments#action_12457619 ] Doron Cohen commented on LUCENE-740:
Two comments:
1. Testing: there's only limited testing in Lucene's contrib for these stemmers - we could probably add a simple test for each stemmer.
2. Licensing: when attaching the patch I granted it for ASF inclusion, but this only covers my (minimal) changes to this code. The stemmers themselves go under the Snowball license - http://snowball.tartarus.org/license.php
[jira] Created: (LUCENE-756) Maintain norms in a single file .nrm
Maintain norms in a single file .nrm
Key: LUCENE-756
URL: http://issues.apache.org/jira/browse/LUCENE-756
Project: Lucene - Java
Issue Type: Improvement
Reporter: Doron Cohen
Priority: Minor

Non-compound indexes are ~10% faster at indexing and perform about 50% of the IO activity of compound indexes, but their file-descriptor footprint is much higher. By maintaining all field norms in a single .nrm file, we can bound the number of files used by non-compound indexes, and possibly allow more applications to use this format. More details on the motivation for this in: http://www.nabble.com/potential-indexing-perormance-improvement-for-compound-index---cut-IO---have-more-files-though-tf2826909.html (in particular http://www.nabble.com/Re%3A-potential-indexing-perormance-improvement-for-compound-index---cut-IO---have-more-files-though-p7910403.html).
[jira] Assigned: (LUCENE-756) Maintain norms in a single file .nrm
[ http://issues.apache.org/jira/browse/LUCENE-756?page=all ] Doron Cohen reassigned LUCENE-756: -- Assignee: Doron Cohen
[jira] Updated: (LUCENE-756) Maintain norms in a single file .nrm
[ http://issues.apache.org/jira/browse/LUCENE-756?page=all ] Doron Cohen updated LUCENE-756:
---
Attachment: nrm.patch.txt

The attached patch - nrm.patch.txt - moves field norms maintenance to a single .nrm file. The modification is backwards compatible - existing indexes with a file per norm are still read; the first merge creates a single .nrm file. All tests pass. No performance degradations were observed as a result of this change, but my tests so far were not very extensive.
[jira] Updated: (LUCENE-756) Maintain norms in a single file .nrm
[ http://issues.apache.org/jira/browse/LUCENE-756?page=all ] Doron Cohen updated LUCENE-756: --- Lucene Fields: [Patch Available] (was: [New])
[jira] Updated: (LUCENE-756) Maintain norms in a single file .nrm
[ http://issues.apache.org/jira/browse/LUCENE-756?page=all ] Doron Cohen updated LUCENE-756: --- Component/s: Index
[jira] Updated: (LUCENE-756) Maintain norms in a single file .nrm
[ http://issues.apache.org/jira/browse/LUCENE-756?page=all ] Doron Cohen updated LUCENE-756: --- Attachment: (was: nrm.patch.txt)
[jira] Updated: (LUCENE-756) Maintain norms in a single file .nrm
[ http://issues.apache.org/jira/browse/LUCENE-756?page=all ] Doron Cohen updated LUCENE-756:
---
Attachment: nrm.patch.txt

Replacing the patch file (the previous file was garbage - svn stat instead of svn diff). A few words on how this patch works:
- A segment.nrm file was added.
- addDocument (DocumentWriter) still writes each norm to a separate file - but that's in memory.
- At merge, all norms are written to a single file.
- CFS now also maintains all norms in a single file.
- The IndexWriter merge decision now considers hasSeparateNorms() not only for CFS but also for non-compound indexes.
- SegmentReader.openNorms() still creates ready-to-load Norm objects (which read the norms only when needed), but each Norm object is now assigned a normSeek value, which is nonzero if the norm file is segment.nrm (see the sketch below).
- Existing indexes, created prior to this change, are managed the same way as segments resulting from addDocument.
Tests:
- I verified that the (contrib) tests for FieldNormModifier and LengthNormModifier also pass.
Remaining:
- I might add a test.
- More benchmarking?
- Update the fileFormats document.
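To make the normSeek idea concrete, here is a minimal sketch (hypothetical class shape, not code from the patch) of a lazily loaded per-field Norm that remembers its offset into the shared segment.nrm file:

  import java.io.IOException;
  import org.apache.lucene.store.IndexInput;

  class Norm {
    private final IndexInput in;   // open on segment.nrm (or on a legacy per-field norms file)
    private final long normSeek;   // 0 for legacy per-field files, the field's offset within .nrm otherwise
    private byte[] bytes;          // loaded on first use

    Norm(IndexInput in, long normSeek) {
      this.in = in;
      this.normSeek = normSeek;
    }

    synchronized byte[] bytes(int maxDoc) throws IOException {
      if (bytes == null) {                 // lazy: seek to this field's slice and read it once
        bytes = new byte[maxDoc];
        in.seek(normSeek);
        in.readBytes(bytes, 0, maxDoc);
      }
      return bytes;
    }
  }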
[jira] Commented: (LUCENE-756) Maintain norms in a single file .nrm
[ http://issues.apache.org/jira/browse/LUCENE-756?page=comments#action_12460292 ] Doron Cohen commented on LUCENE-756:
Does this mean a separate file outside the final .cfs files? Oh no - there's a single .nrm file inside the .cfs file (instead of multiple .fN files inside the .cfs file). As before, only .sN files (separated norm files) are outside the .cfs file.
[jira] Commented: (LUCENE-756) Maintain norms in a single file .nrm
[ http://issues.apache.org/jira/browse/LUCENE-756?page=comments#action_12460316 ] Doron Cohen commented on LUCENE-756:
Thanks for the comments, Doug. You're right of course; I will add both the header and the constant (that would be either today or only in a week from now).
[jira] Updated: (LUCENE-756) Maintain norms in a single file .nrm
[ http://issues.apache.org/jira/browse/LUCENE-756?page=all ] Doron Cohen updated LUCENE-756:
---
Attachment: nrm.patch.2.txt

nrm.patch.2.txt: updated as Doug suggested:
- The .nrm extension is now maintained in a constant.
- The .nrm file now has a 4-byte header.
And the fileFormats document is updated. Also, I checked again that the seeks for the various field norms are lazy - performed only when bytes are actually read with refill().
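For illustration, the two changes amount to something like the following fragment (the constant name and the header bytes shown here are only illustrative; the authoritative definitions are in the patch and the fileFormats document):

  static final String NORMS_EXTENSION = "nrm";

  // when creating segment.nrm, write a small fixed header before the per-field norm bytes
  void writeNormsHeader(IndexOutput out) throws IOException {
    byte[] header = new byte[] { 'N', 'R', 'M', -1 };   // 4-byte magic/version header
    out.writeBytes(header, header.length);
  }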
[jira] Commented: (LUCENE-756) Maintain norms in a single file .nrm
[ https://issues.apache.org/jira/browse/LUCENE-756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12462069 ] Doron Cohen commented on LUCENE-756:
I am updating the patch (nrm.patch.3.txt):
- Using a single constant for the norms file extension: static final String NORMS_EXTENSION = "nrm"; (this is more in line with the existing extension constants in the code). As a side comment, there are various extension names (e.g. .cfs) in the code that are also candidates for factoring into constants, but this is a separate issue.
- Adding a test - TestNorms. This test verifies that norm values assigned with field.setBoost() are preserved during the life cycle of an index, including adding documents, updating norm values (separate norms), addIndexes(), and optimize (a condensed sketch follows below).
All tests pass. On my side this is ready to go in.
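A condensed sketch of the kind of check described (not the actual TestNorms): index one boosted field and verify that the decoded norm round-trips through the index, up to the precision of the 8-bit norm encoding.

  Directory dir = new RAMDirectory();
  IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
  Document doc = new Document();
  Field f = new Field("body", "some text", Field.Store.NO, Field.Index.TOKENIZED);
  f.setBoost(2.0f);                        // the value that must survive the index life cycle
  doc.add(f);
  writer.addDocument(doc);
  writer.optimize();
  writer.close();

  IndexReader reader = IndexReader.open(dir);
  byte[] norms = reader.norms("body");     // read back from the norms storage
  float decoded = Similarity.decodeNorm(norms[0]);
  // decoded should reflect boost * lengthNorm("body", numTokens), within the norm encoding precision
  reader.close();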
[jira] Updated: (LUCENE-756) Maintain norms in a single file .nrm
[ https://issues.apache.org/jira/browse/LUCENE-756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doron Cohen updated LUCENE-756: --- Attachment: nrm.patch.3.txt
[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene
[ https://issues.apache.org/jira/browse/LUCENE-675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12462287 ] Doron Cohen commented on LUCENE-675:
Grant, thanks for trying this out - I will update the patch shortly. I am using this for benchmarking - it is quite easy to add new stuff - and in fact I added some stuff lately but did not update here because I wasn't sure whether others are interested. I will verify what I have against svn head and pack it here as an updated patch. Regards, Doron

Lucene benchmark: objective performance test for Lucene
Key: LUCENE-675
URL: https://issues.apache.org/jira/browse/LUCENE-675
Project: Lucene - Java
Issue Type: Improvement
Reporter: Andrzej Bialecki
Assigned To: Grant Ingersoll
Priority: Minor
Attachments: benchmark.byTask.patch, benchmark.patch, BenchmarkingIndexer.pm, extract_reuters.plx, LuceneBenchmark.java, LuceneIndexer.java, taskBenchmark.zip, timedata.zip, tiny.alg, tiny.properties

We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests. Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally.
[jira] Updated: (LUCENE-675) Lucene benchmark: objective performance test for Lucene
[ https://issues.apache.org/jira/browse/LUCENE-675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doron Cohen updated LUCENE-675: --- Attachment: byTask.2.patch.txt
[jira] Commented: (LUCENE-756) Maintain norms in a single file .nrm
[ https://issues.apache.org/jira/browse/LUCENE-756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12462774 ] Doron Cohen commented on LUCENE-756:
Thanks for committing this, Yonik! It seems the added test TestNorms was not committed..?
[jira] Commented: (LUCENE-140) docs out of order
[ https://issues.apache.org/jira/browse/LUCENE-140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463176 ] Doron Cohen commented on LUCENE-140:
Amazed by this long-lasting bug report, I was going down similar routes to Mike, and I noticed 3 things:

(1) The sequence of ops brought by Jason is wrong:
-a- Open an IndexReader (#1) over an existing index (this reader is used for searching while updating the index)
-b- Using this reader (#1) do a search for the document(s) that you would like to update; obtain their document ID numbers
-c- Create an IndexWriter and add several new documents to the index (for me, this writing is done in other threads) (*)
-d- Close the IndexWriter (*)
-e- Open another IndexReader (#2) over the index
-f- Delete the previously found documents by their document ID numbers using reader #2
-g- Close the #2 reader
-h- Create another IndexWriter (#2) and re-add the updated documents
-i- Close the IndexWriter #2
-j- Close the original IndexReader (#1) and open a new reader for general searching
The problem here is that the docIDs found in (b) may be altered in step (d), so step (f) would delete the wrong docs. In particular, it might attempt to delete IDs that are out of range. This might expose exactly the BitVector problem, and would explain the whole thing, but I too cannot see how it explains the delete-by-term case.

(2) BitVector silently ignores attempts to delete slightly-out-of-bound docs that fall in the higher byte - this is the problem that Mike fixed. I think the fix is okay - though some applications might now get exceptions they did not get in the past - but I believe this is for their own good. However, when I first ran into this I didn't notice that BitVector.size() would become wrong as a result - nice catch Mike! I think, however, that the test Mike added does not expose the docs-out-of-order bug - I tried this test without the fix and it only fails on the gotException assert; if you comment out that assert the test passes. The following test would expose the out-of-order bug - it fails with out-of-order before the fix, and succeeds with it:

  public void testOutOfOrder() throws IOException {
    String tempDir = System.getProperty("java.io.tmpdir");
    if (tempDir == null) {
      throw new IOException("java.io.tmpdir undefined, cannot run test: " + getName());
    }
    File indexDir = new File(tempDir, "lucenetestindexTemp");
    Directory dir = FSDirectory.getDirectory(indexDir, true);

    boolean create = true;
    int numDocs = 0;
    int maxDoc = 0;
    while (numDocs < 100) {
      IndexWriter iw = new IndexWriter(dir, anlzr, create);   // anlzr is the test's analyzer
      create = false;
      iw.setUseCompoundFile(false);
      for (int i = 0; i < 2; i++) {
        Document d = new Document();
        d.add(new Field("body", "body" + i, Store.NO, Index.UN_TOKENIZED));
        iw.addDocument(d);
      }
      iw.optimize();
      iw.close();

      IndexReader ir = IndexReader.open(dir);
      numDocs = ir.numDocs();
      maxDoc = ir.maxDoc();
      assertEquals(numDocs, maxDoc);
      for (int i = 7; i >= -1; i--) {
        try {
          ir.deleteDocument(maxDoc + i);   // attempt (slightly) out-of-range deletes
        } catch (ArrayIndexOutOfBoundsException e) {
          // expected for out-of-range ids
        }
      }
      ir.close();
    }
  }

Mike, do you agree?

(3) The maxDoc() computation in SegmentReader is based (on some paths) on RandomAccessFile.length(). IIRC I saw cases (in a previous project) where File.length() or RAF.length() (not sure which of the two) did not always reflect the real length if the system was very busy IO-wise, unless FD.sync() was called (with a performance hit).
This post seems relevant - RAF.length over 2GB in NFS - http://forum.java.sun.com/thread.jspa?threadID=708670&messageID=4103657 - Not sure if this can be the case here, but at least we can discuss whether it is better to always store the length.
[jira] Commented: (LUCENE-140) docs out of order
[ https://issues.apache.org/jira/browse/LUCENE-140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463483 ] Doron Cohen commented on LUCENE-140:
Jed, is it possible that when re-creating the index, while the IndexWriter is constructed with create=true, the FSDirectory is opened with create=false? I suspect so, because otherwise the old .del files would have been deleted. If so, newly created segments, which have the same names as segments from previous (bad) runs, would, when opened, read the (bad) old .del file. This would possibly expose the bug fixed by Michael. I may be over-speculating here, but if this is the case, it can also explain why changing the merge factor from 4 to 10 exposed the problem. In fact, let me speculate even further - if, when creating the index from scratch, the FSDirectory is (mistakenly) opened with create=false, then as long as you always repeated the same sequence of adding and deleting docs, you were likely to almost never suffer from this mistake, because segments created with the same names as (old) .del files simply see docs as deleted before the docs are actually deleted by the program. The search behaves wrongly, not finding these docs before they are actually deleted, but no exception is thrown when adding docs. However, once the merge factor was changed from 4 to 10, the matching between old .del files and new segments (with the same names) was broken, and the out-of-order exception appeared. ...and if this is not the case, we would need to look for something else. (A sketch of the suspected misuse follows the quoted report below.)

docs out of order
Key: LUCENE-140
URL: https://issues.apache.org/jira/browse/LUCENE-140
Project: Lucene - Java
Issue Type: Bug
Components: Index
Affects Versions: unspecified
Environment: Operating System: Linux; Platform: PC
Reporter: legez
Assigned To: Michael McCandless
Attachments: bug23650.txt, corrupted.part1.rar, corrupted.part2.rar, indexing-failure.log, LUCENE-140-2007-01-09-instrumentation.patch

Hello, I can not find out why (and what) it is happening all the time. I got an exception:

  java.lang.IllegalStateException: docs out of order
    at org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:219)
    at org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:191)
    at org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:172)
    at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:135)
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:88)
    at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:341)
    at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:250)
    at Optimize.main(Optimize.java:29)

It happens in both 1.2 and 1.3rc1 (anyway, what happened to it? I can not find it either in the download or in the version list in this form). Everything seems OK: I can search through the index, but I can not optimize it. Even worse, after this exception, every time I add new documents and close the IndexWriter a new segment is created! I think it has all the documents added before, because of its size. My index is quite big: 500.000 docs, about 5gb of index directory. It is _repeatable_. I drop the index, reindex everything, then add a few docs, try to optimize and receive the above exception.
My documents' structure is:

  static Document indexIt(String id_strony, Reader reader, String data_wydania,
      String id_wydania, String id_gazety, String data_wstawienia) {
    Document doc = new Document();
    doc.add(Field.Keyword("id", id_strony));
    doc.add(Field.Keyword("data_wydania", data_wydania));
    doc.add(Field.Keyword("id_wydania", id_wydania));
    doc.add(Field.Text("id_gazety", id_gazety));
    doc.add(Field.Keyword("data_wstawienia", data_wstawienia));
    doc.add(Field.Text("tresc", reader));
    return doc;
  }

Sincerely, legez
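The misuse suspected in the comment above would look roughly like this (hypothetical sketch; indexPath and analyzer stand for the application's own values):

  // Rebuilding the index "from scratch", but only the writer is told to create:
  Directory dir = FSDirectory.getDirectory(indexPath, false);   // false: old files, including stale .del files, survive
  IndexWriter writer = new IndexWriter(dir, analyzer, true);    // create=true starts a new segments file
  // New segments can then reuse names for which old .del files still exist on disk,
  // so readers may pick up deletions that never belonged to the new index.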
[jira] Updated: (LUCENE-675) Lucene benchmark: objective performance test for Lucene
[ https://issues.apache.org/jira/browse/LUCENE-675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doron Cohen updated LUCENE-675: --- Attachment: byTask.jre1.4.patch.txt
[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene
[ https://issues.apache.org/jira/browse/LUCENE-675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463830 ] Doron Cohen commented on LUCENE-675:
Oops... I had the impression that compiling with compliance level 1.4 is sufficient to prevent this, but I guess I need to read again what that compliance-level setting guarantees exactly. Anyhow, there are 3 things that require 1.5:
- Boolean.parseBoolean() --> Boolean.valueOf().booleanValue()
- String.contains() --> indexOf()
- Class.getSimpleName() --> ?
Modifying Class.getSimpleName() to Class.getName() would not be very nice - query prints and task-name prints would be quite ugly. To fix that I added a method simpleName(Class) to byTask.util.Format (see the sketch below). I am attaching an updated patch - byTask.jre1.4.patch.txt - that includes this method and removes the Java 1.5 dependency. Thanks for catching this! Doron
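In code, the 1.4-safe replacements look roughly like this (SomeTask is a hypothetical class used only for illustration, and the simpleName() body shown is just one possible shape of the helper added to byTask.util.Format):

  boolean b = Boolean.valueOf(value).booleanValue();   // instead of Boolean.parseBoolean(value)
  boolean found = text.indexOf(fragment) >= 0;         // instead of text.contains(fragment)
  String name = Format.simpleName(SomeTask.class);     // instead of SomeTask.class.getSimpleName()

  // a possible shape for the helper:
  public static String simpleName(Class cls) {
    String name = cls.getName();
    int i = name.lastIndexOf('.');
    return i < 0 ? name : name.substring(i + 1);
  }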
[jira] Commented: (LUCENE-741) Field norm modifier (CLI tool)
[ https://issues.apache.org/jira/browse/LUCENE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12464105 ] Doron Cohen commented on LUCENE-741:
I was looking at what it would take to make this work with the .nrm file as well. I expected there would be a test that fails currently, but there is none. So I looked into the tests and the implementation and have a few questions:

(1) Under contrib, FieldNormModifier and LengthNormModifier seem quite similar, right? The first one sets norms with:
  reader.setNorm(d, fieldName, sim.encodeNorm(sim.lengthNorm(fieldName, termCounts[d])));
The latter with:
  byte norm = sim.encodeNorm(sim.lengthNorm(fieldName, termCounts[d]));
  reader.setNorm(d, fieldName, norm);
Do we need to keep both? (A combined sketch follows at the end of this comment.)

(2) TestFieldNormModifier.testFieldWithNoNorm() calls resetNorms() for a field that does not exist. Some work is done by the modifier to collect the term frequencies, and then reader.setNorm() is called but it does nothing, because there are no norms. And indeed the test verifies that there are still no norms for this field. Confusing, I think. For some reason I assumed that calling resetNorms() for a field that has none would implicitly set omitNorms to false for that field and compute it - the inverse of killNorms(). Since this is not the case, perhaps resetNorms() should throw an exception in this case?

(3) I would feel safer about this feature if the test were more strict - something like TestNorms - have several fields, modify some, each in a unique way, remove some others, then at the end verify that all the values of each field's norms are exactly as expected.

(4) For killNorms to work, you can first revert the index to not use .nrm, and then kill as before. The code knows to read .fN files, both for backwards compatibility and for reading segments created by DocumentWriter. The following steps will do this:
- read the norms using reader.norm(field)
- write them into .fN files
- remove the .nrm file
- modify the SegmentInfo to know that it has no .nrm file.

(5) It would be more efficient to optimize (and remove the .nrm file) once in the application, so perhaps modify the public API to take an array of fields and operate on all of them?

Field norm modifier (CLI tool)
Key: LUCENE-741
URL: https://issues.apache.org/jira/browse/LUCENE-741
Project: Lucene - Java
Issue Type: New Feature
Components: Index
Reporter: Otis Gospodnetic
Assigned To: Otis Gospodnetic
Priority: Minor
Attachments: LUCENE-741.patch, LUCENE-741.patch

I took Chris' LengthNormModifier (contrib/misc) and modified it slightly, to allow us to set fake norms on existing fields, effectively making it equivalent to Field.Index.NO_NORMS. This is related to LUCENE-448 (NO_NORMS patch) and LUCENE-496 (LengthNormModifier contrib from Chris).
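Put together, the shared core of both modifiers is essentially this loop (a sketch assembled from the two fragments quoted in point (1) above; reader, sim, fieldName and termCounts are assumed to be set up as in those tools):

  for (int d = 0; d < reader.maxDoc(); d++) {
    if (!reader.isDeleted(d)) {
      byte norm = sim.encodeNorm(sim.lengthNorm(fieldName, termCounts[d]));
      reader.setNorm(d, fieldName, norm);   // overwrites the stored norm for this doc/field
    }
  }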
[jira] Updated: (LUCENE-741) Field norm modifier (CLI tool)
[ https://issues.apache.org/jira/browse/LUCENE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doron Cohen updated LUCENE-741: --- Attachment: for.nrm.patch
[jira] Updated: (LUCENE-741) Field norm modifier (CLI tool)
[ https://issues.apache.org/jira/browse/LUCENE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doron Cohen updated LUCENE-741: --- Attachment: (was: for.nrm.patch)
[jira] Updated: (LUCENE-741) Field norm modifier (CLI tool)
[ https://issues.apache.org/jira/browse/LUCENE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doron Cohen updated LUCENE-741: --- Attachment: for.nrm.patch
[jira] Commented: (LUCENE-741) Field norm modifier (CLI tool)
[ https://issues.apache.org/jira/browse/LUCENE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12464109 ] Doron Cohen commented on LUCENE-741:
The attached for.nrm.patch was very noisy, so I replaced it with one created with: svn diff -x --ignore-eol-style contrib/miscellaneous
It is relative to trunk. A test was added to TestFieldNormModifier.java - testModifiedNormValuesCombinedWithKill - that verifies exactly what the values of the norms are after modification. FieldNormModifier was modified to handle the .nrm file as outlined above.
[jira] Commented: (LUCENE-665) temporary file access denied on Windows
[ https://issues.apache.org/jira/browse/LUCENE-665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12464250 ] Doron Cohen commented on LUCENE-665:
Hi Michael, funny that I got this email with a reply-to to you rather than to the list. The funnier part is that I really wanted to reply to you directly rather than to the list. Is JIRA a mind reader? Yes, I would like to close the issue - I already said that in my Oct 30 post. I would like to do this myself - should I close or resolve the issue? Or perhaps first resolve and then close? I think I read somewhere about the life cycle of an issue but I cannot find it. I am also wondering if it should be resolved as won't-fix or as duplicate? Thanks, Doron
[jira] Resolved: (LUCENE-665) temporary file access denied on Windows
[ https://issues.apache.org/jira/browse/LUCENE-665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doron Cohen resolved LUCENE-665.
Resolution: Won't Fix
With lockless commits this is no longer reproducible, and although in theory it seems that in some cases it should still be possible to reproduce this, practice suggests otherwise, and there seems to be no sufficient justification for introducing retry logic (which is not a 100% solution anyhow).
It would be great if people who experienced these problems could try out this patch and comment whether it made any difference for them. If it turns out useful for others as well, including this patch in the code might help to relieve some of those user frustrations. A comment on the state of the proposed patch: - It is not ready-to-deploy code - it has some debug printing, showing the cases where the retry logic actually took place. - I am not sure if the current 30ms is the right delay... why not 50ms? 10ms? This is currently defined by a constant. - Should a call to gc() be added? (I think not.) - Should the retry be attempted also on non-access-denied exceptions? (I think not.) - I feel it is somewhat voodoo programming, and though I don't like it, it seems to work... Attached files: 1. TestInterleavedAddAndRemoves.java - the LONG test that fails on XP without the patch and passes with the patch. 2. FSDirectory_Retry_Logic.patch 3. Test_Output.txt - output of the test with the patch, on my XP. Only the createNewFile() case had to be bypassed in this test, but for another program I also saw the renameFile() being bypassed. - Doron --
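For readers curious what the retry approach discussed in this issue looks like in code, here is a minimal sketch. The class and method names, the bound MAX_RETRIES, and the message check are illustrative assumptions and are not taken from the attached patch; the real patch wraps the corresponding calls inside FSDirectory.

import java.io.File;
import java.io.IOException;

// Minimal sketch of the retry idea; all names and constants here are made up for illustration.
public class AccessDeniedRetrySketch {

    private static final int MAX_RETRIES = 10;    // assumed bound, not from the patch
    private static final int RETRY_DELAY_MS = 30; // the 30ms delay discussed above

    // Try to create a new file, retrying only when the IOException looks like the
    // temporary Windows "Access is denied" condition described in this issue.
    public static boolean createNewFileWithRetry(File f) throws IOException {
        IOException last = null;
        for (int i = 0; i < MAX_RETRIES; i++) {
            try {
                return f.createNewFile();
            } catch (IOException e) {
                String msg = e.getMessage();
                if (msg == null || msg.indexOf("Access is denied") < 0) {
                    throw e; // not the access-denied case - do not retry
                }
                last = e;
                try {
                    Thread.sleep(RETRY_DELAY_MS); // hope the temporary condition clears
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw e;
                }
            }
        }
        throw last;
    }
}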
[jira] Commented: (LUCENE-771) Change default write lock file location to index directory (not java.io.tmpdir)
[ https://issues.apache.org/jira/browse/LUCENE-771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12464395 ] Doron Cohen commented on LUCENE-771: Is that true? I thought that for previous format changes, the combination of (1) point-in-time index reading by readers, (2) backwards compatibility, and (3) locks allowed us not to require this. Change default write lock file location to index directory (not java.io.tmpdir) --- Key: LUCENE-771 URL: https://issues.apache.org/jira/browse/LUCENE-771 Project: Lucene - Java Issue Type: Improvement Components: Store Affects Versions: 2.1 Reporter: Michael McCandless Assigned To: Michael McCandless Priority: Minor Fix For: 2.1 Now that readers are read-only, we no longer need to store lock files in a different global lock directory than the index directory. This has been a source of confusion and has caused problems for users in the past. Furthermore, once the write lock is stored in the index directory, it no longer needs the big digest prefix that was previously required to make sure lock files from different indexes did not conflict in the global lock directory. This way, all files related to an index will appear in a single directory. And you can easily list that directory to see if a write.lock is present to check whether a writer is open on the index. Note that this change just affects how FSDirectory creates its default lockFactory if no lockFactory was specified. It is still possible (just no longer the default) to pick a different directory to store your lock files by pre-instantiating your own LockFactory. As part of this I would like to remove LOCK_DIR and the no-argument constructor in SimpleFSLockFactory and NativeFSLockFactory. I don't think we should have the notion of a global default lock directory anymore. This is actually an API change. However, neither SimpleFSLockFactory nor NativeFSLockFactory have been released yet, so I think this API removal is allowed? Finally I want to deprecate (but not yet remove, because this has been in the API for many releases) the static LOCK_DIR that's in FSDirectory. But it's now entirely unused. See here for discussion leading to this: http://www.gossamer-threads.com/lists/lucene/java-dev/43940 -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
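As a concrete illustration of the "pre-instantiate your own LockFactory" escape hatch mentioned in the issue, a sketch along these lines should keep lock files in a separate directory even after the default changes. It assumes an FSDirectory.getDirectory overload that accepts a LockFactory and the SimpleFSLockFactory(File) constructor discussed here; the paths are made up.

import java.io.File;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.SimpleFSLockFactory;

public class SeparateLockDirSketch {
    public static void main(String[] args) throws Exception {
        File indexDir = new File("/path/to/index"); // hypothetical paths
        File lockDir = new File("/path/to/locks");

        // Pre-instantiate the lock factory so lock files live outside the index
        // directory, which will no longer be the default once this change goes in.
        SimpleFSLockFactory lockFactory = new SimpleFSLockFactory(lockDir);
        Directory dir = FSDirectory.getDirectory(indexDir, lockFactory);

        // ... open an IndexWriter / IndexReader on 'dir' as usual ...
        dir.close();
    }
}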
[jira] Commented: (LUCENE-756) Maintain norms in a single file .nrm
[ https://issues.apache.org/jira/browse/LUCENE-756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12465260 ] Doron Cohen commented on LUCENE-756: Michael, I like this improvement! (At first I considered adding such a FORMAT level but decided it was not worth it, aiming at backwards compatibility with pre-lockless indexes. Then I had to add that file check - a wrong trade-off indeed.) Two minor comments: - getHasMergedNorms() is private and now the method has no logic - I would remove that method and refer to hasMergedNorms instead. - the term merged (in hasMergedNorms) is a little overloaded with other semantics (in Lucene), though I cannot think of another matching descriptive (short) term. Thanks for improving this, Doron Maintain norms in a single file .nrm Key: LUCENE-756 URL: https://issues.apache.org/jira/browse/LUCENE-756 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Doron Cohen Assigned To: Doron Cohen Priority: Minor Attachments: index.premergednorms.cfs.zip, index.premergednorms.nocfs.zip, LUCENE-756-Jan16.patch, LUCENE-756-Jan16.Take2.patch, nrm.patch.2.txt, nrm.patch.3.txt, nrm.patch.txt Non-compound indexes are ~10% faster at indexing, and perform 50% of the IO activity compared to compound indexes. But their file descriptor footprint is much higher. By maintaining all field norms in a single .nrm file, we can bound the number of files used by non-compound indexes, and possibly allow more applications to use this format. More details on the motivation for this in: http://www.nabble.com/potential-indexing-perormance-improvement-for-compound-index---cut-IO---have-more-files-though-tf2826909.html (in particular http://www.nabble.com/Re%3A-potential-indexing-perormance-improvement-for-compound-index---cut-IO---have-more-files-though-p7910403.html). -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-781) NPE in MultiReader.isCurrent() and getVersion()
[ https://issues.apache.org/jira/browse/LUCENE-781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12466636 ] Doron Cohen commented on LUCENE-781: I checked - the fix is working and the code seems right. While we are looking at this, there are a few more IndexReader methods which are not implemented by MultiReader. These 3 methods seem ok: - document(int) would work because IndexReader would delegate to document(int, FieldSelector), which is implemented in MultiReader. - termDocs(Term) and termPositions(Term) would both work because the IndexReader implementations go to termDocs() or termPositions(), both of which are implemented in MultiReader. These 3 methods should probably be fixed: - isOptimized() would fail - similar to isCurrent(). - setNorm(int, String, float) would fail too, for a similar reason. - directory() would not fail, but would fall back to returning the directory of reader[0] - is this correct behavior? This is because the MultiReader constructor calls super with reader[0] - again, I am not sure this is correct. (Why allow creating a multi-reader with no readers at all?) NPE in MultiReader.isCurrent() and getVersion() --- Key: LUCENE-781 URL: https://issues.apache.org/jira/browse/LUCENE-781 Project: Lucene - Java Issue Type: Bug Components: Index Reporter: Daniel Naber Attachments: multireader.diff, multireader_test.diff I'm attaching a fix for the NPE in MultiReader.isCurrent() plus a testcase. For getVersion(), we should throw a better exception than NPE. I will commit unless someone objects or has a better idea. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-781) NPE in MultiReader.isCurrent() and getVersion()
[ https://issues.apache.org/jira/browse/LUCENE-781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12466852 ] Doron Cohen commented on LUCENE-781: I thought it would not break MultiReader, just do unnecessary work for that method...? The same new test (using that readers[] constructor) would also fail in previous versions. I think the main difference is that for the MultiReader created inside IndexReader, (1) all readers share the same directory, and (2) it maintains a SegmentInfos read from that single directory. Now this is not the case for the other (but still valid (?)) usage of MultiReader - because there is no single directory (well, not necessarily) and hence no SegmentInfos for the MultiReader. So it seems a possible fix would be: - define a boolean predicate, e.g. isWholeIndex, in MultiReader - it would be true when constructed with a non-null dir and a non-null segmentInfos - and base the operation upon it: if isWholeIndex, call super.isCurrent(); otherwise do the (multi) logic in the current fix. NPE in MultiReader.isCurrent() and getVersion() --- Key: LUCENE-781 URL: https://issues.apache.org/jira/browse/LUCENE-781 Project: Lucene - Java Issue Type: Bug Components: Index Reporter: Daniel Naber Attachments: multireader.diff, multireader_test.diff I'm attaching a fix for the NPE in MultiReader.isCurrent() plus a testcase. For getVersion(), we should throw a better exception than NPE. I will commit unless someone objects or has a better idea. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
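A standalone sketch of the "multi" branch of the idea proposed above (the helper class and method names are illustrative, not the committed fix): when a MultiReader is built from arbitrary sub-readers rather than from a single directory plus SegmentInfos, it can only be considered current if every sub-reader is current.

import java.io.IOException;
import org.apache.lucene.index.IndexReader;

public class MultiIsCurrentSketch {
    // Sketch of the non-isWholeIndex branch: with no single directory/SegmentInfos
    // to consult, fall back to asking each sub-reader whether it is current.
    public static boolean allCurrent(IndexReader[] subReaders) throws IOException {
        for (int i = 0; i < subReaders.length; i++) {
            if (!subReaders[i].isCurrent()) {
                return false;
            }
        }
        return true;
    }
}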
[jira] Commented: (LUCENE-781) NPE in MultiReader.isCurrent() and getVersion()
[ https://issues.apache.org/jira/browse/LUCENE-781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12467953 ] Doron Cohen commented on LUCENE-781: One could write an application that groups readers into MultiReaders at more than one level, i.e. r1, r2, r3 grouped into rr1; r4, r5, r6 grouped into rr2; rr1, rr2 grouped into rrr. If rrr.isCurrent() throws unsupported, the application needs to ask the question recursively. I am not aware of such an application, so you could argue this is only theoretical; still, it demonstrates a strength of Lucene. Also, here too, as argued above, even if the answer is false (not current), the application would need to apply the same recursive logic to reopen the non-current reader and reconstruct the multi-reader. So I agree it is valid to throw unsupported. It just feels a bit uncomfortable to throw unsupported for an existing API method with a well-defined meaning that is quite easy to implement (relying on the fact that it was never implemented correctly anyhow). NPE in MultiReader.isCurrent() and getVersion() --- Key: LUCENE-781 URL: https://issues.apache.org/jira/browse/LUCENE-781 Project: Lucene - Java Issue Type: Bug Components: Index Reporter: Daniel Naber Attachments: multireader.diff, multireader_test.diff I'm attaching a fix for the NPE in MultiReader.isCurrent() plus a testcase. For getVersion(), we should throw a better exception than NPE. I will commit unless someone objects or has a better idea. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
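For concreteness, the multi-level grouping described in that comment would look roughly like the sketch below; the index paths are hypothetical, and before a fix the final isCurrent() call is exactly where the NPE / unsupported-operation question arises.

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;

public class NestedMultiReaderSketch {
    public static void main(String[] args) throws Exception {
        IndexReader r1 = IndexReader.open("/indexes/a"); // hypothetical index paths
        IndexReader r2 = IndexReader.open("/indexes/b");
        IndexReader r3 = IndexReader.open("/indexes/c");
        IndexReader r4 = IndexReader.open("/indexes/d");
        IndexReader r5 = IndexReader.open("/indexes/e");
        IndexReader r6 = IndexReader.open("/indexes/f");

        MultiReader rr1 = new MultiReader(new IndexReader[] { r1, r2, r3 });
        MultiReader rr2 = new MultiReader(new IndexReader[] { r4, r5, r6 });
        MultiReader rrr = new MultiReader(new IndexReader[] { rr1, rr2 });

        // If isCurrent() threw UnsupportedOperationException here, the application
        // would have to recurse into rr1, rr2 and r1..r6 to answer the question itself.
        System.out.println("current? " + rrr.isCurrent());

        rrr.close(); // closing the top-level reader also closes the nested readers
    }
}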
[jira] Updated: (LUCENE-788) contrib/benchmark assumes Locale.US for parsing dates in Reuters collection
[ https://issues.apache.org/jira/browse/LUCENE-788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doron Cohen updated LUCENE-788: --- Attachment: 788_benchmark_parseDate_locale_Jan_27.patch Locale.US passed to SimpleDateFormat constructor. contrib/benchmark assumes Locale.US for parsing dates in Reuters collection --- Key: LUCENE-788 URL: https://issues.apache.org/jira/browse/LUCENE-788 Project: Lucene - Java Issue Type: Bug Affects Versions: 2.1 Reporter: Doron Cohen Assigned To: Doron Cohen Priority: Minor Fix For: 2.1 Attachments: 788_benchmark_parseDate_locale_Jan_27.patch SimpleDateFormat used for parsing dates in Reuters documents is instantiated without specifying a locale. So it is using the default locale. If that happens to be US, it will work. But for another locale a parse exception is likely. Affects both StandardBenchmarker and ReutersDocMaker. Fix is trivial - specify Locale.US for SimpleDateFormat's constructor. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
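A minimal illustration of the fix: passing Locale.US explicitly makes parsing independent of the JVM's default locale. The date string and pattern below are examples, not necessarily the exact ones used in contrib/benchmark.

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

public class ReutersDateParseSketch {
    public static void main(String[] args) throws Exception {
        // Without the Locale.US argument this parse can fail on, e.g., a French or
        // German default locale, because "FEB" is not a month name there.
        SimpleDateFormat df = new SimpleDateFormat("dd-MMM-yyyy", Locale.US);
        Date d = df.parse("26-FEB-1987");
        System.out.println(d);
    }
}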
[jira] Updated: (LUCENE-788) contrib/benchmark assumes Locale.US for parsing dates in Reuters collection
[ https://issues.apache.org/jira/browse/LUCENE-788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doron Cohen updated LUCENE-788: --- Lucene Fields: [New, Patch Available] (was: [New]) contrib/benchmark assumes Locale.US for parsing dates in Reuters collection --- Key: LUCENE-788 URL: https://issues.apache.org/jira/browse/LUCENE-788 Project: Lucene - Java Issue Type: Bug Affects Versions: 2.1 Reporter: Doron Cohen Assigned To: Doron Cohen Priority: Minor Fix For: 2.1 Attachments: 788_benchmark_parseDate_locale_Jan_27.patch SimpleDateFormat used for parsing dates in Reuters documents is instantiated without specifying a locale. So it is using the default locale. If that happens to be US, it will work. But for another locale a parse exception is likely. Affects both StandardBenchmarker and ReutersDocMaker. Fix is trivial - specify Locale.US for SimpleDateFormat's constructor. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-804) build.xml: result of dist-src should support build-contrib
[ https://issues.apache.org/jira/browse/LUCENE-804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doron Cohen updated LUCENE-804: --- Attachment: 804.build.xml.patch 804.build.xml.patch removes downloadable jars from the src dist and adds back in jars that are (currently) not downloadable. This allows the src dist to compile contribs (or even to re-dist). The size effect of this on the src dist: reduced from 8.9 MB to 6.8 MB. build.xml: result of dist-src should support build-contrib -- Key: LUCENE-804 URL: https://issues.apache.org/jira/browse/LUCENE-804 Project: Lucene - Java Issue Type: Task Components: Other Affects Versions: 2.1 Reporter: Doron Cohen Priority: Minor Fix For: 2.1 Attachments: 804.build.xml.patch Currently the packed src distribution would fail to run ant build-contrib. It would be much nicer if that worked. In fact, it would be nicer if you could even re-pack with it. For now I marked this for 2.1, although I am not yet sure if this is a stopper. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-804) build.xml: result of dist-src should support build-contrib
[ https://issues.apache.org/jira/browse/LUCENE-804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doron Cohen updated LUCENE-804: --- Lucene Fields: [New, Patch Available] (was: [New]) build.xml: result of dist-src should support build-contrib -- Key: LUCENE-804 URL: https://issues.apache.org/jira/browse/LUCENE-804 Project: Lucene - Java Issue Type: Task Components: Other Affects Versions: 2.1 Reporter: Doron Cohen Priority: Minor Fix For: 2.1 Attachments: 804.build.xml.patch Currently the packed src distribution would fail to run ant build-contrib. It would be much nicer if that worked. In fact, it would be nicer if you could even re-pack with it. For now I marked this for 2.1, although I am not yet sure if this is a stopper. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-804) build.xml: result of dist-src should support build-contrib
[ https://issues.apache.org/jira/browse/LUCENE-804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doron Cohen updated LUCENE-804: --- Lucene Fields: [Patch Available] (was: [Patch Available, New]) Affects Version/s: (was: 2.1) Fix Version/s: (was: 2.1) Assignee: Doron Cohen [1] Modifying 'fix version' to not be 2.1, thereby clarifying that, since releases are to be more frequent, this should not be regarded as a stopper for release 2.1. [2] I would like to commit this in a day or so (unless anyone points out a problem with this). build.xml: result of dist-src should support build-contrib -- Key: LUCENE-804 URL: https://issues.apache.org/jira/browse/LUCENE-804 Project: Lucene - Java Issue Type: Task Components: Other Reporter: Doron Cohen Assigned To: Doron Cohen Priority: Minor Attachments: 804.build.xml.patch Currently the packed src distribution would fail to run ant build-contrib. It would be much nicer if that worked. In fact, it would be nicer if you could even re-pack with it. For now I marked this for 2.1, although I am not yet sure if this is a stopper. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Created: (LUCENE-808) bufferDeleteTerm in IndexWriter might flush prematurely
bufferDeleteTerm in IndexWriter might flush prematurely --- Key: LUCENE-808 URL: https://issues.apache.org/jira/browse/LUCENE-808 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 2.1 Reporter: Doron Cohen Successive calls to remove-by-the-same-term would increment numBufferedDeleteTerms although all but the first are no-ops if no docs were added in between. Hence deletes would be flushed too soon. It is a minor problem and should be rare, but it seems cleaner to fix this. The attached patch also fixes TestIndexWriterDelete.testNonRAMDelete(), which somehow relied on this behavior. All tests pass. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
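A standalone sketch of the behavior described above; the class, fields, and method names are illustrative and are not the actual IndexWriter internals. The idea is to count a buffered delete term only when it is not a no-op, i.e. when the term is new or documents were added since it was last buffered.

import java.util.HashMap;
import java.util.Map;

public class BufferedDeletesSketch {
    private final Map bufferedDeleteTerms = new HashMap(); // term -> doc count when last buffered
    private int numBufferedDeleteTerms = 0;
    private int docCount = 0;

    void addDocument() {
        docCount++;
    }

    void bufferDeleteTerm(String term) {
        Integer last = (Integer) bufferedDeleteTerms.get(term);
        // Only count the delete when it is not a no-op: either the term is new,
        // or documents were added since it was last buffered.
        if (last == null || last.intValue() != docCount) {
            numBufferedDeleteTerms++;
        }
        bufferedDeleteTerms.put(term, new Integer(docCount));
    }

    boolean shouldFlushDeletes(int maxBufferedDeleteTerms) {
        return numBufferedDeleteTerms >= maxBufferedDeleteTerms;
    }
}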