[jira] Commented: (LUCENE-600) ParallelWriter companion to ParallelReader
[ https://issues.apache.org/jira/browse/LUCENE-600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749450#action_12749450 ]

Chuck Williams commented on LUCENE-600:
---------------------------------------

I contributed the first patch to make flush-by-size possible; see LUCENE-709. There is no incompatibility with ParallelWriter, even the early version contributed here 3 years ago.

We've been doing efficient updating of selected mutable fields for a long time now and filed for a patent on the method. See published patent application 20090193406.

ParallelWriter companion to ParallelReader
-------------------------------------------

                Key: LUCENE-600
                URL: https://issues.apache.org/jira/browse/LUCENE-600
            Project: Lucene - Java
         Issue Type: Improvement
         Components: Index
   Affects Versions: 2.1
           Reporter: Chuck Williams
           Priority: Minor
        Attachments: ParallelWriter.patch

A new class ParallelWriter is provided that serves as a companion to ParallelReader. ParallelWriter meets all of the doc-id synchronization requirements of ParallelReader, subject to:

1. ParallelWriter.addDocument() is synchronized, which might have an adverse effect on performance. The writes to the sub-indexes are, however, done in parallel.

2. The application must ensure that the ParallelReader is never reopened inside ParallelWriter.addDocument(), else it might find the sub-indexes out of sync.

3. The application must deal with recovery from ParallelWriter.addDocument() exceptions. Recovery must restore the synchronization of doc-ids, e.g. by deleting any trailing document(s) in one sub-index that were not successfully added to all sub-indexes, and then optimizing all sub-indexes.

A new interface, Writable, is provided to abstract IndexWriter and ParallelWriter. This is in the same spirit as the existing Searchable and Fieldable classes.

This implementation uses Java 1.5. The patch applies against today's svn head. All tests pass, including the new TestParallelWriter.
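Since the later comments refer back to this addDocument() contract, here is a minimal, hypothetical sketch of it in Java 1.5 style. The class and member names are invented for illustration; this is not the attached ParallelWriter.patch.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

// Illustrative sketch only; not the attached ParallelWriter.patch.
// One logical document is split into one sub-document per sub-index, and all
// sub-indexes are written in lock-step so doc-ids stay aligned for ParallelReader.
public class ParallelWriterSketch {
  private final List<IndexWriter> subWriters;   // one writer per parallel sub-index
  private final ExecutorService pool = Executors.newCachedThreadPool();

  public ParallelWriterSketch(List<IndexWriter> subWriters) {
    this.subWriters = subWriters;
  }

  // Synchronized so two logical documents cannot interleave their sub-index
  // appends (requirement 1 above); the per-sub-index writes run in parallel.
  public synchronized void addDocument(List<Document> subDocuments) throws IOException {
    List<Future<Object>> results = new ArrayList<Future<Object>>();
    for (int i = 0; i < subWriters.size(); i++) {
      final IndexWriter writer = subWriters.get(i);
      final Document subDoc = subDocuments.get(i);
      results.add(pool.submit(new Callable<Object>() {
        public Object call() throws IOException {
          writer.addDocument(subDoc);
          return null;
        }
      }));
    }
    for (Future<Object> result : results) {
      try {
        result.get();  // surface failures; the caller must then restore doc-id sync (requirement 3)
      } catch (ExecutionException e) {
        throw new IOException("sub-index add failed: " + e.getCause());
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IOException("interrupted while adding to sub-indexes");
      }
    }
  }
}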
[jira] Commented: (LUCENE-600) ParallelWriter companion to ParallelReader
[ https://issues.apache.org/jira/browse/LUCENE-600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749656#action_12749656 ]

Chuck Williams commented on LUCENE-600:
---------------------------------------

The version attached here is from over 3 years ago. Our version has evolved along with Lucene and the whole apparatus is fully functional with the latest Lucene.

The fields in each subindex are disjoint. A logical Document is the collection of all fields from each real Document in each real subindex with the same doc-id (i.e., the model Doug started with ParallelReader). There is no issue with deletion by query or term as it deletes the whole logical Document. Field updates in our scheme don't use deletion.

Merge-by-size is only an issue if you allow it to be decided independently in each subindex. In practice that is not very important since one subindex is size-dominant (the one containing the document body field). One can merge-by-size that subindex and force the others to merge consistently.

The only reason for the corresponding-segment constraint is that deletion changes doc-id's by purging deleted documents. I know some Lucene apps address this by never purging deleted documents, which is ok in some domains where deletion is rare. I think there are other ways to resolve it as well.
[jira] Commented: (LUCENE-600) ParallelWriter companion to ParallelReader
[ https://issues.apache.org/jira/browse/LUCENE-600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749660#action_12749660 ]

Chuck Williams commented on LUCENE-600:
---------------------------------------

Erratum: "deletion changes doc-id's by purging deleted documents" -- *merging* changes doc-id's by purging deleted documents.
[jira] Commented: (LUCENE-1052) Add an termInfosIndexDivisor to IndexReader
[ https://issues.apache.org/jira/browse/LUCENE-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544055 ]

Chuck Williams commented on LUCENE-1052:
----------------------------------------

I agree a general configuration system would be much better.

Doug, we use a similar method to what you described in our application. TermInfosConfigurer is slightly different though, since the desired config is a method that implements a formula rather than just a value. This could still be done more generally by allowing methods as well as properties or setters on a higher-level configuration object. I didn't want to take on the broader issue just for this feature.

Michael, I agree with both of your points. I'd be happy to clean up this patch if you guys provide some guidance for what would make it acceptable to commit.

Add an termInfosIndexDivisor to IndexReader
-------------------------------------------

                Key: LUCENE-1052
                URL: https://issues.apache.org/jira/browse/LUCENE-1052
            Project: Lucene - Java
         Issue Type: Improvement
         Components: Index
   Affects Versions: 2.2
           Reporter: Michael McCandless
           Assignee: Michael McCandless
           Priority: Minor
            Fix For: 2.3
        Attachments: LUCENE-1052.patch, termInfosConfigurer.patch

The termIndexInterval, set during indexing time, lets you trade off how much RAM is used by a reader to load the indexed terms vs. the cost of seeking to the specific term you want to load. But the downside is you must set it at indexing time.

This issue adds an indexDivisor to TermInfosReader so that on opening a reader you can further sub-sample the termIndexInterval to use less RAM. E.g., a setting of 2 means only every 2 * termIndexInterval'th term is loaded into RAM. This is particularly useful if your index has a great many terms (e.g., you accidentally indexed binary terms).

Spinoff from this thread: http://www.gossamer-threads.com/lists/lucene/java-dev/54371
[jira] Commented: (LUCENE-1052) Add an termInfosIndexDivisor to IndexReader
[ https://issues.apache.org/jira/browse/LUCENE-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544136 ]

Chuck Williams commented on LUCENE-1052:
----------------------------------------

I can report that in our application having a formula is critical. We have no control over the content our users index, nor in fact do they. These are arbitrary documents. We find a surprising number of them contain embedded encoded binary data. When those are indexed, Lucene's memory consumption skyrockets, either bringing the whole app down with an OOM or slowing performance to a crawl due to excessive GC's reclaiming a tiny remaining working memory space. Our users won't accept a solution like "wait until the problem occurs and then increment your termIndexDivisor". They expect our app to manage this automatically.

I agree that making TermInfosReader, SegmentReader, etc. public classes is not a great solution. The current patch does not do that. It simply adds a configurable class that can be used to provide formula parameters as opposed to just value parameters. At least for us, this special case is sufficiently important to outweigh any considerations of the complexity of an additional class. A single configuration class could be used at the IndexReader level that provides for both static and dynamically-varying properties through getters, some of which take parameters.

Here is another possible solution. My current thought is that the bound should always be a multiple of sqrt(numDocs). E.g., see Heaps' law here: http://nlp.stanford.edu/IR-book/html/htmledition/heaps-law-estimating-the-number-of-terms-1.html

I'm currently using this formula in my TermInfosConfigurer:

  int bound = (int) (1+TERM_BOUNDING_MULTIPLIER*Math.sqrt(1+segmentNumDocs)/TERM_INDEX_INTERVAL);

This has Heaps' law as its foundation. I provide TERM_BOUNDING_MULTIPLIER as the config parameter, with 0 meaning don't do this. I also provide a TERM_INDEX_DIVISOR_OVERRIDE that overrides the dynamic bounding with a manually specified constant amount.

If that approach would be acceptable to Lucene in general, then we just need two static parameters. However, I don't have enough experience with how well this formula works in our user base yet to know whether or not we'll tune it further.
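Plugging illustrative numbers into that formula (the multiplier below is invented for the example, not the value used in the application): with TERM_INDEX_INTERVAL = 128 and TERM_BOUNDING_MULTIPLIER = 1000, a segment of 1,000,000 documents gets a bound of about 1 + 1000 * sqrt(1,000,001) / 128, roughly 7,813 cached index terms. The cache therefore grows with the square root of the segment's document count rather than linearly with its (possibly polluted) term count.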
[jira] Updated: (LUCENE-1052) Add an termInfosIndexDivisor to IndexReader
[ https://issues.apache.org/jira/browse/LUCENE-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chuck Williams updated LUCENE-1052:
-----------------------------------

    Attachment: termInfosConfigurer.patch

termInfosConfigurer.patch extends the termInfosIndexDivisor mechanism to allow dynamic management of this parameter. A new interface, TermInfosConfigurer, allows specification of a method, getMaxTermsCached(), that bounds the size of the in-memory term infos as a function of the segment name, segment numDocs, and total segment terms. This bound is then used to automatically set termInfosIndexDivisor whenever a TermInfosReader reads the term index.

This mechanism provides a simple way to ensure that the total amount of memory consumed by the term cache is bounded by, say, O(log(numDocs)).

All Lucene core tests pass. I'm using another version of this same patch in Lucene 2.1+ in an application that has indexes with binary term pollution, using the TermInfosConfigurer to dynamically bound the term cache in the polluted segments. I tried to test contrib, but it appears gdata-server needs external libraries I don't have in order to compile.

Michael, this patch applies cleanly to today's Lucene trunk. I'd appreciate it if you could verify one thing. Lucene 2.3 has the incremental reopen mechanism (can't wait to get that!), new since Lucene 2.1. It appears that reopen of a segment reuses the same TermInfosReader and thus does not need to configure a new one. I've implemented that part of the patch with this assumption.
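From the description above, the interface is roughly the following. This is a reconstruction for illustration; the exact names and types in termInfosConfigurer.patch may differ.

package org.apache.lucene.index;

/**
 * Sketch of the TermInfosConfigurer idea described above: a callback that
 * bounds how many index terms a TermInfosReader may cache for one segment.
 * Reconstructed from the comment; parameter names are illustrative.
 */
public interface TermInfosConfigurer {
  /**
   * @param segmentName     name of the segment whose term index is being read
   * @param segmentNumDocs  number of documents in that segment
   * @param segmentNumTerms total number of terms in that segment
   * @return the maximum number of index terms to hold in memory; the reader
   *         then picks the smallest indexDivisor that stays under this bound
   */
  long getMaxTermsCached(String segmentName, int segmentNumDocs, long segmentNumTerms);
}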
[jira] Commented: (LUCENE-1052) Add an termInfosIndexDivisor to IndexReader
[ https://issues.apache.org/jira/browse/LUCENE-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543383 ]

Chuck Williams commented on LUCENE-1052:
----------------------------------------

I believe this needs to be a formula, as a reasonable bound on the number of terms is in general a function of the number of documents in the segment and the nature of the index (e.g., types of fields). A common thing to do would be to enforce that RAM usage for cached terms grows no faster than logarithmically in the number of documents. The specific formula that is appropriate will depend on the index, i.e. on the application. It might be of the form c*ln(numDocs+k), where c and k are constants dependent on the index.

One consequence of this approach, or any approach along these lines, is that the indexDivisor will vary across the segments, both in a single index and across indexes. It seems to me from the code that this should work fine.

This leaves the issue of how to best specify an arbitrary formula. This requires a method to compute the max cached terms allowed for a segment based on the number of docs in the segment, the number of terms in the segment's index, and possibly other factors. The most direct way to do this is to introduce an interface, e.g. TermInfosConfigurer, to define the method signature, and to add setTermInfosConfigurer as an alternative to setTermInfosIndexDivisor. It would need to be in all the same places.

A more general approach would be to introduce an IndexConfigurer class which over time could hold additional methods like this. It could even replace the current setters on IndexReader (as well as IndexWriter, etc.) with a more general mechanism that would allow dynamic parameters used to configure any classes in the index structure. Each constructor would be passed the IndexConfigurer and call getters or other methods on it to obtain its config. The methods could provide constant values or dynamic formulas.

I'm going to implement the straightforward solution at the moment in our older version of Lucene, then will sync up to whatever you guys decide is best for the trunk later.
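As a small illustration of how a per-segment bound turns into a per-segment divisor, assuming the usual model that a reader with divisor d holds roughly numIndexTerms / d index terms in memory (the class below is illustrative, not part of any patch):

// Illustrative only: how a per-segment term bound yields a per-segment divisor.
// With termIndexInterval = 128, a segment's term index holds about
// numTerms / 128 entries; a divisor d keeps only every d'th of those in RAM.
public final class IndexDivisorMath {
  public static int smallestDivisor(long numIndexTerms, long maxTermsCached) {
    if (maxTermsCached <= 0 || numIndexTerms <= maxTermsCached) {
      return 1;                               // no sub-sampling needed
    }
    // ceiling division: smallest d with numIndexTerms / d <= maxTermsCached
    return (int) ((numIndexTerms + maxTermsCached - 1) / maxTermsCached);
  }

  public static void main(String[] args) {
    // A "polluted" segment: 10,000,000 terms indexed at interval 128 leaves
    // 78,125 index terms; a bound of 10,000 forces a divisor of 8 for that
    // segment, while clean segments in the same index keep divisor 1.
    long indexTerms = 10000000L / 128;
    System.out.println(smallestDivisor(indexTerms, 10000));
  }
}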
[jira] Commented: (LUCENE-1052) Add an termInfosIndexDivisor to IndexReader
[ https://issues.apache.org/jira/browse/LUCENE-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543306 ]

Chuck Williams commented on LUCENE-1052:
----------------------------------------

Michael, thanks for creating an excellent production version of this idea and committing it!

I'd like to take it one step further to eliminate the need to call IndexReader.setTermInfosIndexDivisor up front. The idea is to instead specify a maximum number of index terms to cache in memory. This could then allow TermInfosReader to set indexDivisor automatically to the smallest value that yields a cache size less than the maximum.

This seems a simple and extremely useful extension. Unfortunately, I'm still on an older Lucene, but will post my update. If you like this idea, you may want to just add the feature directly to your implementation in the trunk.
Re: Term pollution from binary data
Doug Cutting wrote on 11/07/2007 09:26 AM:
Hadoop's MapFile is similar to Lucene's term index, and supports a feature where only a subset of the index entries are loaded (determined by io.map.index.skip). It would not be difficult to add such a feature to Lucene by changing TermInfosReader#ensureIndexIsRead(). Here's a (totally untested) patch.

Doug, thanks for this suggestion and your quick patch. I fleshed this out in the version of Lucene we are using, a bit after 2.1. There was an off-by-1 bug plus a few missing pieces. The attached patch is for 2.1+, but might be useful as it at least contains the corrections and missing elements. It also contains extensions to the tests to exercise the patch.

I tried integrating this into 2.3, but enough has changed so that it was not straightforward (primarily for the test case extensions -- the implementation seems like it will apply with just a bit of manual merging). Unfortunately, I have so many local changes that it has become difficult to track the latest Lucene. The task of syncing up will come soon. I'll post a proper patch against the trunk in jira at a future date if the issue is not already resolved before then.

Michael McCandless wrote on 11/08/2007 12:43 AM:
I'll open an issue and work through this patch.

Michael, I did not see the issue, else I would have posted this there. Unfortunately, I'm pretty far behind on lucene mail these days.

One thing is: I'd prefer to not use a system property for this, since it's so global, but I'm not sure how to better do it.

Agree strongly that this should not be global. Whether ctors or an index-specific properties object or whatever, it is important to be able to set this on some indexes and not others in a single application.

Thanks for picking this up!

Chuck

Index: src/test/org/apache/lucene/index/DocHelper.java
===================================================================
--- src/test/org/apache/lucene/index/DocHelper.java (revision 2247)
+++ src/test/org/apache/lucene/index/DocHelper.java (working copy)
@@ -254,10 +254,25 @@
    */
   public static void writeDoc(Directory dir, Analyzer analyzer, Similarity similarity, String segment, Document doc) throws IOException {
-    DocumentWriter writer = new DocumentWriter(dir, analyzer, similarity, 50);
-    writer.addDocument(segment, doc);
+    writeDoc(dir, analyzer, similarity, segment, doc, IndexWriter.DEFAULT_TERM_INDEX_INTERVAL);
   }
 
+  /**
+   * Writes the document to the directory segment using the analyzer and the similarity score
+   * @param dir
+   * @param analyzer
+   * @param similarity
+   * @param segment
+   * @param doc
+   * @param termIndexInterval
+   * @throws IOException
+   */
+  public static void writeDoc(Directory dir, Analyzer analyzer, Similarity similarity, String segment, Document doc, int termIndexInterval) throws IOException
+  {
+    DocumentWriter writer = new DocumentWriter(dir, analyzer, similarity, 50, termIndexInterval);
+    writer.addDocument(segment, doc);
+  }
+
   public static int numFields(Document doc) {
     return doc.getFields().size();
   }

Index: src/test/org/apache/lucene/index/TestSegmentTermDocs.java
===================================================================
--- src/test/org/apache/lucene/index/TestSegmentTermDocs.java (revision 2247)
+++ src/test/org/apache/lucene/index/TestSegmentTermDocs.java (working copy)
@@ -25,6 +25,7 @@
 import org.apache.lucene.document.Field;
 
 import java.io.IOException;
+import org.apache.lucene.search.Similarity;
 
 public class TestSegmentTermDocs extends TestCase {
   private Document testDoc = new Document();
@@ -212,6 +213,23 @@
     dir.close();
   }
 
+  public void testIndexDivisor() throws IOException {
+    dir = new RAMDirectory();
+    testDoc = new Document();
+    DocHelper.setupDoc(testDoc);
+    DocHelper.writeDoc(dir, new WhitespaceAnalyzer(), Similarity.getDefault(), "test", testDoc, 3);
+
+    assertNull(System.getProperty("lucene.term.index.divisor"));
+    System.setProperty("lucene.term.index.divisor", "2");
+    try {
+      testTermDocs();
+      testBadSeek();
+      testSkipTo();
+    } finally {
+      System.clearProperty("lucene.term.index.divisor");
+    }
+  }
+
   private void addDoc(IndexWriter writer, String value) throws IOException {
     Document doc = new Document();

Index: src/test/org/apache/lucene/index/TestSegmentReader.java
===================================================================
--- src/test/org/apache/lucene/index/TestSegmentReader.java (revision 2247)
+++ src/test/org/apache/lucene/index/TestSegmentReader.java (working copy)
@@ -23,10 +23,12 @@
 import java.util.List;
 
 import junit.framework.TestCase;
+import org.apache.lucene.analysis.WhitespaceAnalyzer;
 import org.apache.lucene.document.Document;
 import org.apache.lucene.document.Fieldable;
 import org.apache.lucene.search.DefaultSimilarity;
+import org.apache.lucene.search.Similarity;
 import
Term pollution from binary data
Hi All,

We are experiencing OOM's when binary data contained in text files (e.g., a base64 section of a text file) is indexed. We have extensive recognition of file types but have encountered binary sections inside of otherwise normal text files. We are using the default value of 128 for termIndexInterval.

The problem arises because binary data generates a large set of random tokens, leading to totalTerms/termIndexInterval terms stored in memory. Increasing the -Xmx is not viable as it is already maxed.

Does anybody know of a better solution to this problem than writing some kind of binary section recognizer/filter?

It appears that termIndexInterval is factored into the stored index and thus cannot be changed dynamically to work around the problem after an index has become polluted. Other than identifying the documents containing binary data, deleting them, and then optimizing the whole index, has anybody found a better way to recover from this problem?

Thanks for any insights or suggestions,

Chuck
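As a rough back-of-the-envelope illustration of the scale involved (numbers invented for the example, not measured from the indexes described above):

public class TermPollutionEstimate {
  public static void main(String[] args) {
    // Invented numbers for illustration only.
    long pollutedTerms = 50000000L;        // unique random tokens produced by binary data
    int termIndexInterval = 128;           // Lucene's default
    long indexTermsInRam = pollutedTerms / termIndexInterval;   // ~390,625 entries
    long bytesPerEntry = 100;              // rough cost of one cached Term + TermInfo
    System.out.println(indexTermsInRam + " index terms, ~"
        + (indexTermsInRam * bytesPerEntry / (1024 * 1024)) + " MB in one segment");
  }
}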
[jira] Created: (LUCENE-1037) Corrupt index: term out of order after forced stop during indexing
Corrupt index: term out of order after forced stop during indexing
-------------------------------------------------------------------

                Key: LUCENE-1037
                URL: https://issues.apache.org/jira/browse/LUCENE-1037
            Project: Lucene - Java
         Issue Type: Bug
         Components: Index
   Affects Versions: 2.0.1
        Environment: Windows Server 2003
           Reporter: Chuck Williams

In testing a reboot during active indexing, upon restart this exception occurred:

Caused by: java.io.IOException: term out of order (ancestorForwarders:.compareTo(descendantMoneyAmounts:$0.351) = 0)
        at org.apache.lucene.index.TermInfosWriter.add(TermInfosWriter.java:96)
        at org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:322)
        at org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:289)
        at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:253)
        at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:96)
        at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:1398)
        at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:835)
        at ... (application code)

The ancestorForwarders: term has no text. The application never creates such a term. It seems the reboot occurred while this term was being written, but such a segment should not be linked into the index and so should not be visible after restart.

The application uses parallel subindexes accessed with ParallelReader. This reboot caught the system in a state where the indexes were out of sync, i.e. a new document had parts indexed in one subindex but not yet indexed in another. The application detects this condition upon restart, uses IndexReader.deleteDocument() to delete the parts that were indexed from those subindexes, and then does optimize() on all the subindexes to bring the doc-ids back into sync.

The optimize() failed, presumably on a subindex that was being written at the time of the reboot. This subindex would not have completed its document part and so no deleteDocument() would have been performed on it prior to the optimize().

The version of Lucene here is from January 2007. I see one other reference to this exception in LUCENE-848. There is a note there that the exception is likely a core problem, but I don't see any follow-up to track it down.

Any ideas how this could happen?
Re: Lucene 2.1, soon
How about a direct solution with a reference count scheme?

Segments files could be reference-counted, as well as individual segments, either directly, possibly by interning SegmentInfo instances, or indirectly by reference counting all files via Directory. The most recent checkpoint and snapshot would have an implicit reference since they can be opened. Each reader and writer creates a reference when it opens a segments file.

This way segments files and each segment's files would be deleted precisely when they are no longer used, which would both support NFS and improve performance on Windows.

Chuck

Marvin Humphrey wrote on 01/18/2007 11:40 AM:
I wrote: I'd be cool with making it impossible to put an index on an NFS volume prior to version 4.

Elaborating and clarifying...

IndexReader attempts to establish a read lock on the relevant segments_N file. It doesn't bother to see whether the locking attempt succeeds, though. IndexFileDeleter, before deleting any files, always touches a test file, attempts to lock it, and verifies that the lock succeeds. If the locking test fails, it throws an exception rather than proceed. In addition, the locking test is run at index creation time, so that the user knows as soon as possible that their index is in a problematic location.

I think the only way this would fail under NFS is if the client machine with the reader is using NFS version 3, while the machine with the writer is using version 4. But before this issue arose I didn't have that much experience with the intricacies of NFS, so I could be off-base.

This does bring back the permissions issue with IndexReader. A search app may not have permission to establish a read lock on a file within the index directory, and in that case, an IndexFileDeleter could delete files out from under it.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
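A minimal sketch of the reference-counting idea, assuming a simple per-file counter keyed by file name (hypothetical class, not the deletion policy Lucene later adopted):

import java.io.File;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of per-file reference counting for index files.
public class RefCountingDeleter {
  private final File indexDir;
  private final Map<String, Integer> refCounts = new HashMap<String, Integer>();

  public RefCountingDeleter(File indexDir) {
    this.indexDir = indexDir;
  }

  // Called when a reader or writer starts using a file (e.g. on opening a segments_N).
  public synchronized void incRef(String fileName) {
    Integer count = refCounts.get(fileName);
    refCounts.put(fileName, count == null ? 1 : count + 1);
  }

  // Called when that reader or writer is done; deletes the file once unused.
  public synchronized void decRef(String fileName) {
    Integer count = refCounts.get(fileName);
    if (count == null) {
      return;                                   // never referenced or already deleted
    }
    if (count <= 1) {
      refCounts.remove(fileName);
      new File(indexDir, fileName).delete();    // no one uses it any more
    } else {
      refCounts.put(fileName, count - 1);
    }
  }
}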
Re: adding explicit commits to Lucene?
I don't see how to do commits without at least some new methods. There needs to be some way to roll back changes rather than committing them. If the commit action is IndexWriter.close() (even if just an interface) the user still needs another method to roll back.

There are reasons to close an IndexWriter other than committing changes, such as to flush all the ram segments to disk to free memory or save state. We now have IndexWriter.flushRamSegments() for this case, but are there others?

As was already pointed out, to delete documents you have to find them, which may require a reader accessing the current snapshot rather than the current checkpoint. There needs to be some way to specify this distinction.

Chuck

Yonik Seeley wrote on 01/17/2007 06:48 AM:
On 1/17/07, Michael McCandless [EMAIL PROTECTED] wrote:
If this approach works well we could at some point deprecate the delete* methods on IndexReader and make package protected versions that IndexWriter calls.

If we do API changes in the future, it would be nice to make the search side more efficient w.r.t. deleted documents... at least remove the synchronization for isDeleted for read-only readers, and perhaps even have a subclass that is a no-op for isDeleted for read-only readers.

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server
Re: Lucene 2.1, soon
Grant Ingersoll wrote on 01/17/2007 01:42 AM:
Also, I'm curious as to how many people use NFS in live systems.

I've got the requirement to support large indexes and collections of indexes on NAS devices, which from linux pretty much means NFS or CIFS. This doesn't seem unusual.

Chuck
[jira] Commented: (LUCENE-756) Maintain norms in a single file .nrm
[ https://issues.apache.org/jira/browse/LUCENE-756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12465240 ]

Chuck Williams commented on LUCENE-756:
---------------------------------------

I may have the only app that will be broken by the 10-day backwards incompatibility, but the change seems worth it. I need to create some large indexes to take on the road for demos. Is the index format in the latest patch final?

Maintain norms in a single file .nrm
------------------------------------

                Key: LUCENE-756
                URL: https://issues.apache.org/jira/browse/LUCENE-756
            Project: Lucene - Java
         Issue Type: Improvement
         Components: Index
           Reporter: Doron Cohen
        Assigned To: Doron Cohen
           Priority: Minor
        Attachments: index.premergednorms.cfs.zip, index.premergednorms.nocfs.zip, LUCENE-756-Jan16.patch, LUCENE-756-Jan16.Take2.patch, nrm.patch.2.txt, nrm.patch.3.txt, nrm.patch.txt

Non-compound indexes are ~10% faster at indexing and perform 50% of the IO activity compared to compound indexes, but their file descriptor footprint is much higher. By maintaining all field norms in a single .nrm file, we can bound the number of files used by non-compound indexes, and possibly allow more applications to use this format.

More details on the motivation for this in: http://www.nabble.com/potential-indexing-perormance-improvement-for-compound-index---cut-IO---have-more-files-though-tf2826909.html (in particular http://www.nabble.com/Re%3A-potential-indexing-perormance-improvement-for-compound-index---cut-IO---have-more-files-though-p7910403.html).
Re: adding explicit commits to Lucene?
Yonik Seeley wrote on 01/16/2007 11:29 AM:
On 1/16/07, robert engels [EMAIL PROTECTED] wrote:
You have the same problem if there is an existing reader open, so what is the difference? You can't remove the segments there either.

The disk space for the segments is currently removed if no one has them open... this is quite a bit different than guaranteeing that a reader in the future will be able to open an index in the past.

To me the key benefit of explicit commits is that ongoing adds and their associated merges update only the segments of the current snapshot. The current snapshot can be aborted, falling back to the last checkpoint without having made any changes to its segments at all. Once a commit is done the committed snapshot becomes the new checkpoint.

Lucene does not have this desirable property now even for adding a single document, since that document may cause a merge with consequences arbitrarily deep into the index.

For the single-transaction use case it is only necessary that the segments in the current checkpoint and those in the current snapshot are maintained. Revising the current snapshot can delete segments in the prior snapshot, and committing can delete segments in the prior checkpoint.

Of course support for multiple parallel transactions would be even better, but is also a huge can of worms as anyone who has spent time chasing database deadlocks and understanding all the different types of locks that modern databases use can attest. The single-transaction case seems straightforward to implement per Michael's suggestion and enables valuable use cases as the thread has enumerated.

Chuck
Re: adding explicit commits to Lucene?
robert engels wrote on 01/15/2007 08:01 AM:
Is your parallel adding code available?

There is an early version in LUCENE-600, but without the enhancements described. I didn't update that version because it didn't capture any interest and requires Java 1.5, and so it seems will not be committed. I could update jira with the new version, but would have to create a clean patch that applies against the lucene head. My local copy is diverged due to a number of uncommitted patches and so patches generated from it contain other stuff.

My use case for parallel subindexes is as an enabler for fast bulk updates. Only the subindexes containing changing fields need to be updated, so long as the update algorithm does not change doc-ids. Even though this requires rewriting entire segments using techniques similar to those used in merging (but not purging deleted docs), I'm still getting 30x (when many fields change) to many hundreds-x (when only a few fields change) faster update performance than the batched delete-add method on very large indexes (millions of documents, some very large).

Chuck
Re: adding explicit commits to Lucene?
Ning Li wrote on 01/15/2007 06:29 PM:
On 1/14/07, Michael McCandless [EMAIL PROTECTED] wrote:
* The support deleteDocuments in IndexWriter (LUCENE-565) feature could have a more efficient implementation (just like Solr) when autoCommit is false, because deletes don't need to be flushed until commit() is called. Whereas, now, they must be aggressively flushed on each checkpoint.

If a reader can only open snapshots both for search and for modification, I think another change is needed besides the ones listed: assume the latest snapshot is segments_5 and the latest checkpoint is segmentsx_7 with 2 new segments, then a reader opens snapshot segments_5, performs a few deletes and writes a new checkpoint segmentsx_8. The summary file segmentsx_8 should include the 2 new segments which are in segmentsx_7 but not in segments_5. Such segments to include are easily identifiable only if they are not merged with segments in the latest snapshot... All these won't be necessary if a reader always opens the latest checkpoint for modification, which will also support deletion of non-committed documents.

This problem seems worse. I don't see how a reader and a writer can independently compute and write checkpoints. The adds in the writer don't just create new segments, they replace existing ones through merging. And the merging changes doc-ids by expunging deletes. It seems that all deletes must be based on the most recent checkpoint, or merging of checkpoints to create the next snapshot will be considerably more complex.

Chuck
Re: adding explicit commits to Lucene?
robert engels wrote on 01/15/2007 08:11 PM:
If that is all you need, I think it is far simpler: If you have an OID, then all that is required is to write to a separate disk file the operations (delete this OID, insert this document, etc...). Once the file is permanently on disk, then it is simple to just keep playing the file back until it succeeds.

There is no guarantee a given operation will ever succeed, so this doesn't work.

This is what we do in our search server. I am not completely familiar with parallel reader, but in reading the JavaDoc I don't see the benefit - since you have to write the documents to both indexes anyway??? Why is it of any benefit to break the document into multiple parts?

I'm sure Doug had reasons to write it. My reason to use it is for fast bulk updates, updating one subindex without having to update the others.

If you have OIDs available, parallel reader can be accomplished in a far simpler and more efficient manner - we have a completely federated server implementation that was trivial - less than 100 lines of code. We did it simpler, and create a hash from the OID, and store the document into a different index depending on the hash, then run the query across all indexes in parallel, joining the results.

Lucene has this built in via MultiSearcher and RemoteSearchable. It is a bit more complex due to the necessity to normalize Weights, e.g. to ensure the same docFreq's, which reflect the union of all indexes, are used for the search in each.

Federated searching addresses different requirements than ParallelReader. Yes, I agree that ParallelReader could be done using UID's, but I believe it would be a considerably more expensive representation to search. The method used in federated search to distribute the same query to each index is not applicable. Breaking the query up into parts that are applied against each parallel index, with each query part referencing only the fields in a single parallel index, would be a challenge with complex nested queries supporting all of the operators, and much less efficient than ParallelReader. Modifying all the primitive Query subclasses to use UID's instead of doc-ids would be an alternative, but would be a lot of work and not nearly as efficient as the existing Lucene index representation that sorts postings by doc-id.

To illustrate this, consider the simple query f:a AND g:b, where f and g are in two different parallel indexes. Performing the f and g queries separately on the different indexes to get possibly very long lists of results and then joining those by UID will be much slower than a BooleanQuery operating on ParallelReader with doc-id sorted postings. The alternative of a UID-based BooleanQuery would have similar challenges unless the postings were sorted by UID. But hey, that's permanent doc-ids.

On Jan 15, 2007, at 11:49 PM, Chuck Williams wrote:
My interest is transactions, not making doc-id's permanent. Specifically, the ability to ensure that a group of adds either all go into the index or none go into the index, and to ensure that if none go into the index that the index is not changed in any way. I have UID's but they cannot ensure the latter property, i.e. they cannot ensure side-effect-free rollbacks.

Yes, if you have no reliance on internal Lucene structures like doc-id's and segments, then that shouldn't matter. But many capabilities have such reliance for good reasons. E.g., ParallelReader, which is a public supported class in Lucene, requires doc-id synchronization.
There are similar good reasons for an application to take advantage of doc-ids. Lucene uses doc-id's in many of its API's and so it is not surprising that many applications rely on them, and I'm sure misuse them, not fully understanding the semantics and uncertainties of doc-id changes due to merging segments with deletes. Applications can use doc-ids for legitimate and beneficial purposes while remaining semantically valid. Making such capabilities efficient and robust in all cases is facilitated by application control over when doc-id's and segment structure change at a granularity larger than the single Document.

If I had a vote it would be +1 on the direction Michael has proposed, assuming it can be done robustly and without performance penalty.

Chuck

robert engels wrote on 01/15/2007 07:34 PM:
I honestly think that having a unique OID as an indexed field and putting a layer on top of Lucene is the best solution to all of this. It makes it almost trivial, and you can implement transaction handling in a variety of ways. Attempting to make the doc ids permanent is a tough challenge, considering the original design called for them to be non-permanent. It seems doubtful that you cannot have some sort of primary key any way and be this concerned about the transactional nature of Lucene. I vote -1 on all of this. I think it will detract from the simple and efficient storage
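For concreteness, the f:a AND g:b example above looks roughly like this against the Lucene 2.x API of the era (directory variables and field names are illustrative; doc-ids are assumed to be kept in sync across the two sub-indexes):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.ParallelReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;

// Sketch: field "f" lives in subIndexDir1, field "g" in subIndexDir2.
public class ParallelQueryExample {
  public static Hits search(Directory subIndexDir1, Directory subIndexDir2) throws Exception {
    ParallelReader reader = new ParallelReader();
    reader.add(IndexReader.open(subIndexDir1));
    reader.add(IndexReader.open(subIndexDir2));

    BooleanQuery query = new BooleanQuery();
    query.add(new TermQuery(new Term("f", "a")), BooleanClause.Occur.MUST);
    query.add(new TermQuery(new Term("g", "b")), BooleanClause.Occur.MUST);

    // The conjunction is evaluated by advancing both postings lists in
    // doc-id order; no join on an application-level UID is needed.
    IndexSearcher searcher = new IndexSearcher(reader);
    return searcher.search(query);
  }
}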
[jira] Commented: (LUCENE-769) [PATCH] Performance improvement for some cases of sorted search
[ https://issues.apache.org/jira/browse/LUCENE-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12464055 ]

Chuck Williams commented on LUCENE-769:
---------------------------------------

Robert, could you attach your current implementation of reopen() as well? The attachment did not come through in your java-dev message today, or the one from 12/11. I'd like to look at an incremental implementation of reopen() for FieldCache. Thanks.

[PATCH] Performance improvement for some cases of sorted search
----------------------------------------------------------------

                Key: LUCENE-769
                URL: https://issues.apache.org/jira/browse/LUCENE-769
            Project: Lucene - Java
         Issue Type: Improvement
   Affects Versions: 2.0.0
           Reporter: Artem Vasiliev
        Attachments: DocCachingSorting.patch, DocCachingSorting.patch, QueryFilter.java, StoredFieldSorting.patch

It's a small addition to Lucene that significantly lowers memory consumption and improves performance for the scenario of sorted searches with frequent index updates and relatively big indexes (1mln docs). This solution supports only single-field sorting currently (which seems to be quite a popular use case). Multiple-field support can be added without much trouble.

The solution is this: documents from the sorting set (instead of the given field's values from the whole index - the current FieldCache approach) are cached in a WeakHashMap, so the cached items are candidates for GC. Their field values are then fetched from the cache and compared while sorting.
[jira] Commented: (LUCENE-769) [PATCH] Performance improvement for some cases of sorted search
[ https://issues.apache.org/jira/browse/LUCENE-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463729 ]

Chuck Williams commented on LUCENE-769:
---------------------------------------

The test case uses only tiny documents, and the reported timings for multiple searches with FieldCache make it appear that the version of Lucene used contains the bug that caused FieldCaches to be frequently recomputed unnecessarily. I suggest trying the test with much larger documents, of realistic size, and using current Lucene source.

I'm sure the patch will make things much slower with the current implementation. As Hoss suggests, performance would be improved considerably by using a FieldSelector to obtain just the sort field, but even so it will be slow unless the sort field is arranged to be early in the documents, ideally the first field, and a LOAD_AND_BREAK FieldSelector is used.

Another important performance variable will be the number of documents retrieved in the test query. If the number of documents satisfying the query is a sizable percentage of the total collection size, I'm pretty sure the patch will be much slower than using FieldCache.
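For readers who have not used FieldSelector, the LOAD_AND_BREAK pattern mentioned above looks roughly like this against the Lucene 2.1+ API (the field name "sortkey" is illustrative): only the sort field is read from the stored fields, and the reader stops scanning the document as soon as that field has been loaded, which only helps much if the field is stored early in the document.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.FieldSelector;
import org.apache.lucene.document.FieldSelectorResult;
import org.apache.lucene.index.IndexReader;

// Sketch of loading just a sort field via FieldSelector.
public class SortFieldLoader {
  // Loads only "sortkey" and stops reading the stored document right after it.
  private static final FieldSelector SORT_FIELD_ONLY = new FieldSelector() {
    public FieldSelectorResult accept(String fieldName) {
      return "sortkey".equals(fieldName)
          ? FieldSelectorResult.LOAD_AND_BREAK
          : FieldSelectorResult.NO_LOAD;
    }
  };

  public static String sortValue(IndexReader reader, int docId) throws Exception {
    Document doc = reader.document(docId, SORT_FIELD_ONLY);
    return doc.get("sortkey");
  }
}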
[jira] Commented: (LUCENE-767) maxDoc should be explicitly stored in the index, not derived from file length
[ https://issues.apache.org/jira/browse/LUCENE-767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463322 ]

Chuck Williams commented on LUCENE-767:
---------------------------------------

Isn't maxDoc always the same as the docCount of the segment, which is stored? I.e., couldn't SegmentReader.maxDoc() be equivalently defined as:

  public int maxDoc() {
    return si.docCount;
  }

Since maxDoc == numDocs == docCount for a newly merged segment, and deletion with a reader never changes numDocs or maxDoc, it seems to me these values should always be the same. All Lucene tests pass with this definition.

I have code that relies on this equivalence and so would appreciate knowledge of any case where this equivalence might not hold.

maxDoc should be explicitly stored in the index, not derived from file length
------------------------------------------------------------------------------

                Key: LUCENE-767
                URL: https://issues.apache.org/jira/browse/LUCENE-767
            Project: Lucene - Java
         Issue Type: Improvement
   Affects Versions: 1.9, 2.0.0, 2.0.1, 2.1
           Reporter: Michael McCandless
        Assigned To: Michael McCandless
           Priority: Minor

This is a spinoff of LUCENE-140.

In general we should rely on as little as possible from the file system. Right now, maxDoc is derived by checking the file length of the FieldsReader index file (.fdx), which makes me nervous. I think we should explicitly store it instead.

Note that there are no known cases where this is actually causing a problem. There was some speculation in the discussion of LUCENE-140 that it could be one of the possible causes, but in digging / discussion there were no specifically relevant JVM bugs found (yet!). So this would be a defensive fix at this point.
[jira] Commented: (LUCENE-510) IndexOutput.writeString() should write length in bytes
[ https://issues.apache.org/jira/browse/LUCENE-510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12462122 ]

Chuck Williams commented on LUCENE-510:
---------------------------------------

Has an improvement been made to eliminate the reported 20% indexing hit? That would be a big price to pay. To me the performance benefits in algorithms that scan for selected fields (e.g., FieldsReader.doc() with a FieldSelector) are much more important than standard UTF-8 compliance.

A 20% hit seems surprising. The pre-scan over the string to be written shouldn't cost much compared to the cost of tokenizing and indexing that string (assuming it is in an indexed field).

In case it is relevant, I had a related issue in my bulk updater: a case where a vint required at the beginning of a record by the lucene index format was not known until after the end. I solved this with a fixed-length vint record that was estimated up front and revised if necessary after the whole record was processed. The vint representation still works if more bytes than necessary are written.

IndexOutput.writeString() should write length in bytes
-------------------------------------------------------

                Key: LUCENE-510
                URL: https://issues.apache.org/jira/browse/LUCENE-510
            Project: Lucene - Java
         Issue Type: Improvement
         Components: Store
   Affects Versions: 2.1
           Reporter: Doug Cutting
        Assigned To: Grant Ingersoll
            Fix For: 2.1
        Attachments: SortExternal.java, strings.diff, TestSortExternal.java

We should change the format of strings written to indexes so that the length of the string is in bytes, not Java characters. This issue has been discussed at: http://www.mail-archive.com/java-dev@lucene.apache.org/msg01970.html

We must increment the file format number to indicate this change. At least the format number in the segments file should change.

I'm targeting this for 2.1, i.e., we shouldn't commit it to trunk until after 2.0 is released, to minimize incompatible changes between 1.9 and 2.0 (other than removal of deprecated features).
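To illustrate the last point, that a vint still decodes correctly when written with more bytes than strictly necessary, here is a small sketch of the encoding (7 payload bits per byte, low-order first, high bit as the continuation flag). The helper methods are illustrative, not Lucene API.

// Illustrative sketch of Lucene-style vint encoding with padding; not Lucene API.
public class PaddedVIntDemo {

  // Write value as exactly 'width' vint bytes by emitting zero-payload
  // continuation bytes; assumes value fits in 7*width bits.
  static byte[] writePaddedVInt(int value, int width) {
    byte[] out = new byte[width];
    for (int i = 0; i < width; i++) {
      int payload = value & 0x7F;
      value >>>= 7;
      out[i] = (byte) (i < width - 1 ? (payload | 0x80) : payload); // 0x80 = "more bytes follow"
    }
    return out;
  }

  static int readVInt(byte[] in) {
    int value = 0;
    for (int i = 0, shift = 0; i < in.length; i++, shift += 7) {
      value |= (in[i] & 0x7F) << shift;
      if ((in[i] & 0x80) == 0) break;        // continuation bit clear: done
    }
    return value;
  }

  public static void main(String[] args) {
    // 5 normally encodes as one byte (0x05); padded to three bytes it is
    // 0x85 0x80 0x00 and still reads back as 5.
    System.out.println(readVInt(writePaddedVInt(5, 1)));  // 5
    System.out.println(readVInt(writePaddedVInt(5, 3)));  // 5
  }
}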
[jira] Commented: (LUCENE-762) [PATCH] Efficiently retrieve sizes of field values
[ http://issues.apache.org/jira/browse/LUCENE-762?page=comments#action_12461460 ]

Chuck Williams commented on LUCENE-762:
---------------------------------------

Hi Grant,

Maybe even better would be to have an appropriate method on FieldSelectorResult. E.g.:

  FieldSelectorResult.readField(doc, fieldsStream, fi, binary, compressed, tokenized)

This would eliminate the tests or map lookup in performance-critical code.
[jira] Created: (LUCENE-762) [PATCH] Efficiently retrieve sizes of field values
[PATCH] Efficiently retrieve sizes of field values
---------------------------------------------------

                Key: LUCENE-762
                URL: http://issues.apache.org/jira/browse/LUCENE-762
            Project: Lucene - Java
         Issue Type: New Feature
         Components: Store
   Affects Versions: 2.1
           Reporter: Chuck Williams

Sometimes an application would like to know how large a document is before retrieving it. This can be important for memory management or choosing between algorithms, especially in cases where documents might be very large.

This patch extends the existing FieldSelector mechanism with two new FieldSelectorResults: SIZE and SIZE_AND_BREAK. SIZE creates fields on the retrieved document that store field sizes instead of actual values. SIZE_AND_BREAK is especially efficient if one field comprises the bulk of the document size (e.g., the body field) and can thus be used as a reasonable size approximation.
[jira] Updated: (LUCENE-762) [PATCH] Efficiently retrieve sizes of field values
[ http://issues.apache.org/jira/browse/LUCENE-762?page=all ] Chuck Williams updated LUCENE-762: -- Attachment: SizeFieldSelector.patch [PATCH] Efficiently retrieve sizes of field values -- Key: LUCENE-762 URL: http://issues.apache.org/jira/browse/LUCENE-762 Project: Lucene - Java Issue Type: New Feature Components: Store Affects Versions: 2.1 Reporter: Chuck Williams Attachments: SizeFieldSelector.patch Sometimes an application would like to know how large a document is before retrieving it. This can be important for memory management or choosing between algorithms, especially in cases where documents might be very large. This patch extends the existing FieldSelector mechanism with two new FieldSelectorResults: SIZE and SIZE_AND_BREAK. SIZE creates fields on the retrieved document that store field sizes instead of actual values. SIZE_AND_BREAK is especially efficient if one field comprises the bulk of the document size (e.g., the body field) and can thus be used as a reasonable size approximation. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-754) FieldCache keeps hard references to readers, doesn't prevent multiple threads from creating same instance
[ http://issues.apache.org/jira/browse/LUCENE-754?page=comments#action_12459763 ] Chuck Williams commented on LUCENE-754: --- Cool! This should solve at least part of my problem. Trying this now (along with finalizer removal patch that is already installed here). Will report back results. Thanks! FieldCache keeps hard references to readers, doesn't prevent multiple threads from creating same instance - Key: LUCENE-754 URL: http://issues.apache.org/jira/browse/LUCENE-754 Project: Lucene - Java Issue Type: Bug Reporter: Yonik Seeley Assigned To: Yonik Seeley Attachments: FieldCache.patch -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-754) FieldCache keeps hard references to readers, doesn't prevent multiple threads from creating same instance
[ http://issues.apache.org/jira/browse/LUCENE-754?page=comments#action_12459791 ] Chuck Williams commented on LUCENE-754: --- This patch, together with LUCENE-750 (already committed) solved our problem completely. It sped up simultaneous multi-threaded searches with a new ParallelReader against a 1 million item investigation that has a unique id sort field (i.e., 1 million entry FIeldCache must be created) by a factor of 15x. Thanks Yonik! +1 to commit this. FieldCache keeps hard references to readers, doesn't prevent multiple threads from creating same instance - Key: LUCENE-754 URL: http://issues.apache.org/jira/browse/LUCENE-754 Project: Lucene - Java Issue Type: Bug Reporter: Yonik Seeley Assigned To: Yonik Seeley Attachments: FieldCache.patch -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: 15 minute hang in IndexInput.clone() involving finalizers
The problem appears to be this. We have an approximately 1 million item index. It uses 6 parallel subindexes with ParallelReader, so each of these subindexes has 1 million items. Each subindex has the same segment structure, with 15 segments in each at the moment. I mentioned before that the issue arose just after a deleteAdd update that closed the reader after the deletes, added with the writer, and then reopened the reader. We have been using a default sort that looks at score first and then id of the item. Each id is unique, with an integer sort field. So the query just after the IndexReader refresh has to create a new FieldCache comparator for this integer field. That generates a ParallelTermDocs that iterates the id field, which is of course in only one of the subindexes. So we have to build a field cache with 1,000,000 entries, which requires cloning the freqStream in the SegmentReader for each segment. This should only be 15 clones as I interpret the code. There were 4 threads doing this simultaneously, so make that 60 clones. I can see reading 1 million terms and building the comparator taking a while, although not the 15-20 minutes it does, and am baffled at how every thread dump on many trials of this issue end up with every one inside the clone()! The clone just doesn't do much, the most expensive thing being copying the 1024 byte buffer in BufferedIndexInput. Applying the patch moved the issue somewhat, but not materially. The setup of the FieldCache comparator still takes the same amount of time and all thread dumps still find the stack inside Object.clone() working on finalizers. I'll study this further and look for an optimization, submitting a patch if I find one. One interesting thing is that it appears that all 4 threads simultaneously doing this query are building a field cache. It seems the synchronization in FieldCacheIImpl.get() with the CreationPlaceHolder should have prevented that, but for some reason it is not. Any further suggestions would be welcome! 
For easy access, here is the thread dump again without the patch: == Thread Connection thread group.HttpConnection-26493-7 === java.lang.ref.Finalizer.add(Unknown Source) java.lang.ref.Finalizer.init(Unknown Source) java.lang.ref.Finalizer.register(Unknown Source) java.lang.Object.clone(Native Method) org.apache.lucene.store.IndexInput.clone(IndexInput.java:175) org.apache.lucene.store.BufferedIndexInput.clone(BufferedIndexInput.java:128) org.apache.lucene.store.FSIndexInput.clone(FSDirectory.java:562) org.apache.lucene.index.SegmentTermDocs.init(SegmentTermDocs.java:45) org.apache.lucene.index.SegmentReader.termDocs(SegmentReader.java:333) org.apache.lucene.index.MultiTermDocs.termDocs(MultiReader.java:416) org.apache.lucene.index.MultiTermDocs.termDocs(MultiReader.java:409) org.apache.lucene.index.MultiTermDocs.next(MultiReader.java:361) org.apache.lucene.index.ParallelReader$ParallelTermDocs.next(ParallelReader.java:353) org.apache.lucene.search.FieldCacheImpl$3.createValue(FieldCacheImpl.java:173) org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:72) org.apache.lucene.search.FieldCacheImpl.getInts(FieldCacheImpl.java:154) org.apache.lucene.search.FieldCacheImpl.getInts(FieldCacheImpl.java:148) org.apache.lucene.search.FieldSortedHitQueue.comparatorInt(FieldSortedHitQueue.java:204) org.apache.lucene.search.FieldSortedHitQueue$1.createValue(FieldSortedHitQueue.java:175) org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:72) org.apache.lucene.search.FieldSortedHitQueue.getCachedComparator(FieldSortedHitQueue.java:155) org.apache.lucene.search.FieldSortedHitQueue.init(FieldSortedHitQueue.java:56) org.apache.lucene.search.TopFieldDocCollector.init(TopFieldDocCollector.java:41) And here is the top of the stack with the patch (rest is the same): == Thread Connection thread group.HttpConnection-26493-3 === java.lang.ref.Finalizer.init(Unknown Source) java.lang.ref.Finalizer.register(Unknown Source) java.lang.Object.clone(Native Method) org.apache.lucene.store.IndexInput.clone(IndexInput.java:175) org.apache.lucene.store.BufferedIndexInput.clone(BufferedIndexInput.java:128) org.apache.lucene.store.FSIndexInput.clone(FSDirectory.java:564) org.apache.lucene.index.SegmentTermDocs.init(SegmentTermDocs.java:45) Thanks, Chuck Chuck Williams wrote on 12/15/2006 08:22 AM: Yonik and Robert, thanks for the suggestions and pointer to the patch! We've looked at the synchronization involved with finalizers and don't see how it could cause the issue as running the finalizers themselves is outside the lock. The code inside the lock is simple fixed-time list manipulation, not even a loop. On the other hand, we don't see how anything else could cause
15 minute hang in IndexInput.clone() involving finalizers
Hi All, I've had a bizarre anomaly arise in an application and am wondering if anybody has ever seen anything like this. Certain queries, in not easy to reproduce cases, take 15-20 minutes to execute rather than a few seconds. The same query is fast some times and anomalously slow others. This is on a 1,000,000 document collection, but the problem seems independent of that. I took a bunch of thread dumps during the anomaly period. There are 4 threads executing the same query at the same time, and all 4 appear to spend almost the entire time trying to register finalizers as part of cloning an IndexInput within an application call to create a TopFieldDocCollector into which the results will be collected. The actual search has not been launched yet, and will be reasonably quick when it is. All 4 threads show this unchanging stack trace during the 15-20 minutes: == Thread Connection thread group.HttpConnection-26493-11 === java.lang.ref.Finalizer.add(Unknown Source) java.lang.ref.Finalizer.init(Unknown Source) java.lang.ref.Finalizer.register(Unknown Source) java.lang.Object.clone(Native Method) org.apache.lucene.store.IndexInput.clone(IndexInput.java:175) org.apache.lucene.store.BufferedIndexInput.clone(BufferedIndexInput.java:128) org.apache.lucene.store.FSIndexInput.clone(FSDirectory.java:562) org.apache.lucene.index.SegmentTermDocs.init(SegmentTermDocs.java:45) org.apache.lucene.index.SegmentReader.termDocs(SegmentReader.java:333) org.apache.lucene.index.MultiTermDocs.termDocs(MultiReader.java:416) org.apache.lucene.index.MultiTermDocs.termDocs(MultiReader.java:409) org.apache.lucene.index.MultiTermDocs.next(MultiReader.java:361) org.apache.lucene.index.ParallelReader$ParallelTermDocs.next(ParallelReader.java:353) org.apache.lucene.search.FieldCacheImpl$3.createValue(FieldCacheImpl.java:173) org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:72) org.apache.lucene.search.FieldCacheImpl.getInts(FieldCacheImpl.java:154) org.apache.lucene.search.FieldCacheImpl.getInts(FieldCacheImpl.java:148) org.apache.lucene.search.FieldSortedHitQueue.comparatorInt(FieldSortedHitQueue.java:204) org.apache.lucene.search.FieldSortedHitQueue$1.createValue(FieldSortedHitQueue.java:175) org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:72) org.apache.lucene.search.FieldSortedHitQueue.getCachedComparator(FieldSortedHitQueue.java:155) org.apache.lucene.search.FieldSortedHitQueue.init(FieldSortedHitQueue.java:56) org.apache.lucene.search.TopFieldDocCollector.init(TopFieldDocCollector.java:41) ... application stack Another factor appears to be that this anomaly usually (maybe always) happens just after a series of deleteAdd updates, i.e. just after a series of deleting with the IndexReader, closing it to add a modified version of that document with the IndexWriter, and then reopening the IndexReader. A query just after reopening the IndexReader is most likely to trigger this issue. I have not seen this problem on any other collecitons with the same application, and so it may be specific to this collection or to its size. Any thoughts or ideas would be appreciated. Thanks, Chuck - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Locale string compare: Java vs. C#
Surprising but it looks to me like a bug in Java's collation rules for en-US. According to http://developer.mimer.com/collations/charts/UCA_latin.htm, \u00D8 (which is Latin Capital Letter O With Stroke) should be before U, implying -1 is the correct result. Java is returning 1 for all strengths of the collator. Maybe there is some other subtlety with this character... Chuck George Aroush wrote on 12/13/2006 04:20 PM: Hi folks, Over at Lucene.Net, I have run into a NUnit test which is failing with Lucene.Net (C#) but is passing with Lucene (Java). The two tests that fail are: TestInternationalMultiSearcherSort and TestInternationalSort After several hours of investigation, I narrowed the problem to what I believe is a difference in the way Java and .NET implement compare. The code in question is this method (found in FieldSortedHitQueue.java): public final int compare (final ScoreDoc i, final ScoreDoc j) { return collator.compare (index[i.doc], index[j.doc]); } To demonstrate the compare problem (Java vs. .NET) I crated this simple code both in Java and C#: // Java code: you get back 1 for 'res' String s1 = H\u00D8T; String s2 = HUT; Collator collator = Collator.getInstance (Locale.US); int diff = collator.compare(s1, s2); // C# code: you get back -1 for 'res' string s1 = H\u00D8T; string s2 = HUT; System.Globalization.CultureInfo locale = new System.Globalization.CultureInfo(en-US); System.Globalization.CompareInfo collator = locale.CompareInfo; int res = collator.Compare(s1, s2); Java will give me back a 1 while .NET gives me back -1. So, what I am trying to figure out is who is doing the right thing? Or am I missing additional calls before I can compare? My goal is to understand why the difference exist and thus based on that understanding I can judge how serious this issue is and find a fix for it or just document it as a language difference between Java and .NET. Btw, this is based on Lucene 2.0 for both Java and C# Lucene. Regards, -- George Aroush - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Efficiently expunging deletions of recently added documents
Thanks Ning. This is all very helpful. I'll make sure to be consistent with the new merge policy and its invariant conditions. Chuck Ning Li wrote on 12/05/2006 08:01 AM: An old issue (http://issues.apache.org/jira/browse/LUCENE-325 new method expungeDeleted() added to IndexWriter) requested a similar functionality as described in the latter half of your email. The patch for that issue breaks the invariants of the new merge policy. An algorithm similar to that of addIndexesNoOptimize() (http://issues.apache.org/jira/browse/LUCENE-528 Optimization for IndexWriter.addIndexes()) would solve the problem. Ning On 12/5/06, Ning Li [EMAIL PROTECTED] wrote: I'd like to open up the API to mergeSegments() in IndexWriter and am wondering if there are potential problems with this. I'm worried that opening up mergeSegments() could easily break the invariants currently guaranteed by the new merge policy(http://issues.apache.org/jira/browse/LUCENE-672). The two invariants say that if M does not change and segment doc count is not reaching maxMergeDocs: B for maxBufferedDocs, f(n) defined as ceil(log_M(ceil(n/B))) 1: If i (left*) and i+1 (right*) are two consecutive segments of doc counts x and y, then f(x) = f(y). 2: The number of committed segments on the same level (f(n)) = M. Ning - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Attached proposed modifications to Lucene 2.0 to support Field.Store.Encrypted
Mike Klaas wrote on 12/05/2006 11:38 AM: On 12/5/06, negrinv [EMAIL PROTECTED] wrote: Chris Hostetter wrote: If the code was not already in the core, and someone asked about adding it I would argue against doing so on the grounds that some helpfull utility methods (possibly in a contrib) would be just as usefull, and would have no performance cost for people who don't care about compression. Perhaps, if you look at compression on its own, but once you see compression in the context of all the other field options it makes sense to have it added to Lucene, it's about having everything in one place for ease of implementation that offsets the performance issue, in my opinion. Note that built-in compression is deprecated, for similar reasons as are being given for the encrypted fields. Built-in compression is also memory-hungry and slow due to the copying it does. External compression is much faster, especially if you extend Field binary values to support a binary length parameter (which I submitted a patch for a long time ago). Here is another argument against adding Field encryption to the lucene core. Changes in index format make life complex for any implementations that deal with index files directly. There are a number of Lucene sister projects that do this, plus a number of applications. I have a fast bulk updater that directly manipulates index files and am busy upgrading it right now to the 2.1 index format with lockless commits (which is not fully documented in the new index file formats, by the way, e.g. the segmentN.sM separate norm files are missing). It's a pain. In general, I think changes to Lucene index format should only be driven by compelling benefits. Moving encryption from external to internal to get a minor application simplification is not sufficiently compelling to me. Chuck - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Efficiently expunging deletions of recently added documents
Hi All, I'd like to open up the API to mergeSegments() in IndexWriter and am wondering if there are potential problems with this. I use ParallelReader and ParallelWriter (in jira) extensively as these provide the basis for fast bulk updates of small metadata fields. ParallelReader requires that the subindexes be strictly synchronized by matching doc ids. The thorniest problem arises when writing a new document (with ParallelWriter) generates an exception in some of the subindexes but not others, as this leaves the subindexes out of sync. I have recovery for this now that works by deleting the successfully added subdocuments that are parallel to any unsuccessful subdocument and then optimizing to expunge the unsuccessful doc-id from those segments where it had been added. Optimization is prohibitively expensive for large indexes, and unnecessary for this recovery. A much better solution is to have an API in IndexWriter to expunge a given set of deleted doc ids. This could merge only enough recent segments to fully encompass the specified docs, which in this case is not much since they will be recently added. The result should be orders of magnitude performance improvement to the recovery. I'm planning to make this change and submit a patch for it unless I've missed something that somebody can point out. At the same time, I'll update the ParallelWriter submission as there are a number of bug fixes plus a substantial general (non-recovery-case) performance improvement I've just identified and am about to implement. Thanks for any thoughts. suggestions, or problems you can point out. Chuck - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: [jira] Resolved: (LUCENE-709) [PATCH] Enable application-level management of IndexWriter.ramDirectory size
Michael Busch wrote on 11/22/2006 08:47 AM: Ning Li wrote: A possible design could be: First, in addDocument(), compute the byte size of a ram segment after the ram segment is created. In the synchronized block, when the newly created segment is added to ramSegmentInfos, also add its byte size to the total byte size of ram segments. Then, in maybeFlushRamSegments(), either one of two conditions can trigger a flush: number of ram segments reaching maxBufferedDocs, and total byte size of ram segments exceeding a threshold. There is a flaw in this approach as you exceed the threshold before flushing. With very large documents, that can cause an OOM. This is exactly how I implemented it in my private version a couple of weeks ago. It works good and I don't see performance problems with this design. I named the new parameter in IndexWriter: setMaxBufferSize(long). I implemented it externally because I need to check the size before adding a new document. To make this work, I have a notion of size of Document (via a Sized interface). I agree that it would be better to do this in IndexWriter, but more machinery would be needed. Lucene would need to estimate the size of the new ram segment and check the threshold prior to consuming the space. The API that Yonik committed last night (thanks Yonik!) provides the flexibility to address both use cases. It's a tiny bit more work for the app, but at least in my case, is necessary to tune for best performance (by minimizing memory usage variance as a function of size parameters) and avoid OOM's. Chuck - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-709) [PATCH] Enable application-level management of IndexWriter.ramDirectory size
[ http://issues.apache.org/jira/browse/LUCENE-709?page=all ] Chuck Williams updated LUCENE-709: -- Attachment: ramDirSizeManagement.patch This one should be golden as it addresses all the issues that have been raised and I believe the syncrhonization is fairly well optimized. Size is now computed based on buffer size, and so is a more accurate accounting of actual memory usage. I've added all the various checking and FileNotFoundExceptions that Doug suggested. I've also changed RamFile.buffers to an ArrayList per Yonik's last suggestion. This is probably better than cosmetic since it does allow some unnecessary syncrhonization to be eliminated. Unfortunately, my local Lucene differs now fairly substantially from the head -- wish you guys would commit more of my patches so merging wasn't so difficult :-) -- so I'm not using the version submitted here, but I did merge it into the head carefully and all tests pass, including the new RAMDIrectory tests specifically for the functionality this patch provides. [PATCH] Enable application-level management of IndexWriter.ramDirectory size Key: LUCENE-709 URL: http://issues.apache.org/jira/browse/LUCENE-709 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 2.0.1 Environment: All Reporter: Chuck Williams Attachments: ramdir.patch, ramdir.patch, ramDirSizeManagement.patch, ramDirSizeManagement.patch, ramDirSizeManagement.patch, ramDirSizeManagement.patch IndexWriter currently only supports bounding of in the in-memory index cache using maxBufferedDocs, which limits it to a fixed number of documents. When document sizes vary substantially, especially when documents cannot be truncated, this leads either to inefficiencies from a too-small value or OutOfMemoryErrors from a too large value. This simple patch exposes IndexWriter.flushRamSegments(), and provides access to size information about IndexWriter.ramDirectory so that an application can manage this based on total number of bytes consumed by the in-memory cache, thereby allow a larger number of smaller documents or a smaller number of larger documents. This can lead to much better performance while elimianting the possibility of OutOfMemoryErrors. The actual job of managing to a size constraint, or any other constraint, is left up the applicatation. The addition of synchronized to flushRamSegments() is only for safety of an external call. It has no significant effect on internal calls since they all come from a sychronized caller. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-723) QueryParser support for MatchAllDocs
[ http://issues.apache.org/jira/browse/LUCENE-723?page=comments#action_12451849 ] Chuck Williams commented on LUCENE-723: --- +1 With this could also come negative-only queries, e.g. -foo as a shortcut for *:* -foo QueryParser support for MatchAllDocs Key: LUCENE-723 URL: http://issues.apache.org/jira/browse/LUCENE-723 Project: Lucene - Java Issue Type: New Feature Components: QueryParser Affects Versions: 2.0.0 Reporter: Yonik Seeley Assigned To: Yonik Seeley Priority: Minor It seems like there really should be QueryParser support for MatchAllDocsQuery. I propose *:* (brings back memories of DOS :-) -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-709) [PATCH] Enable application-level management of IndexWriter.ramDirectory size
[ http://issues.apache.org/jira/browse/LUCENE-709?page=all ] Chuck Williams updated LUCENE-709: -- Attachment: ramDirSizeManagement.patch I've just attached my version of this patch. It includes a multi-threaded test case. I believe it is sound. A few notes: 1. Re. Yonik's comment about my synchronization scenario. Synhronizing as described does resolve the issue. No higher level synchronization is requried. It doesn't matter how concurent operations on the directory are ordered or intereleaved, so long as any computation that does a loop sees some instance of the directory that corresponds to its actual content at any polnt in time. The result of the loop will then be accurate for that instant. 2. Lucene has this same syncrhonization bug today in RAMDIrectory.list(). It can return a list of files that never comprised the contents of the directory. This is fixed in the attached. 3. Also, the long synchronization bug exists in RAMDirectory.fileModified() as well as RAMDIrectory.fileLength() since both are public. These are fixed in the attached. 4. I moved the synchronization off of the Hashtable (replacing it with a HashMap) up to the RAMDirectory as there are some operations that require synchronization at the directory level. Using just one lock seems better. As all Hashtable operations were already synchonized, I don't believe any material additional synchronization is added. 5. Lucene currently make the assumption that if a file is being written by a stream then no other streams are simultaneously reading or writing it. I've maintained this assumption as an optimization, allowing the streams to access fields directly without syncrhonization. This is documented in the comments, as is the locking order. 5. sizeInBytes is now maintained incrementally, efficiently. 6. Yonik, your version (which I just now saw) has a bug in RAMDIrectory.renameFile(). The to file may already exist, in which case it is overwritten and it's size must be subtracted. I actually hit this in my test case for my implementation and fixed it (since Lucene renames a new version of the segments file). All Lucene tests, including the new test, pass. Some contrib tests fail, I believe none of these failures are in any way related to this patch. [PATCH] Enable application-level management of IndexWriter.ramDirectory size Key: LUCENE-709 URL: http://issues.apache.org/jira/browse/LUCENE-709 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 2.0.1 Environment: All Reporter: Chuck Williams Attachments: ramdir.patch, ramdir.patch, ramDirSizeManagement.patch, ramDirSizeManagement.patch, ramDirSizeManagement.patch IndexWriter currently only supports bounding of in the in-memory index cache using maxBufferedDocs, which limits it to a fixed number of documents. When document sizes vary substantially, especially when documents cannot be truncated, this leads either to inefficiencies from a too-small value or OutOfMemoryErrors from a too large value. This simple patch exposes IndexWriter.flushRamSegments(), and provides access to size information about IndexWriter.ramDirectory so that an application can manage this based on total number of bytes consumed by the in-memory cache, thereby allow a larger number of smaller documents or a smaller number of larger documents. This can lead to much better performance while elimianting the possibility of OutOfMemoryErrors. The actual job of managing to a size constraint, or any other constraint, is left up the applicatation. 
The addition of synchronized to flushRamSegments() is only for safety of an external call. It has no significant effect on internal calls since they all come from a sychronized caller. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-709) [PATCH] Enable application-level management of IndexWriter.ramDirectory size
[ http://issues.apache.org/jira/browse/LUCENE-709?page=comments#action_12450894 ] Chuck Williams commented on LUCENE-709: --- I didn't see Yonik's new version or comments until after my attach. Throwing IOExceptions when files that should exist don't is clearly a good thing. I'll add that to mine if you guys decide it is the one you would like to use. Counting buffer sizes rather than file length may be slightly more accurate, but at least for me it is not material. There are other inaccuracies as well (non-file-storage space in the RAMFiles and RAMDIrectory). If you guys decide to go with Yonik's version, I think my test case should still be used, and that the other synchronization errors I've fixed should be fixed (e.g., RAMDIrectory.list()). [PATCH] Enable application-level management of IndexWriter.ramDirectory size Key: LUCENE-709 URL: http://issues.apache.org/jira/browse/LUCENE-709 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 2.0.1 Environment: All Reporter: Chuck Williams Attachments: ramdir.patch, ramdir.patch, ramDirSizeManagement.patch, ramDirSizeManagement.patch, ramDirSizeManagement.patch IndexWriter currently only supports bounding of in the in-memory index cache using maxBufferedDocs, which limits it to a fixed number of documents. When document sizes vary substantially, especially when documents cannot be truncated, this leads either to inefficiencies from a too-small value or OutOfMemoryErrors from a too large value. This simple patch exposes IndexWriter.flushRamSegments(), and provides access to size information about IndexWriter.ramDirectory so that an application can manage this based on total number of bytes consumed by the in-memory cache, thereby allow a larger number of smaller documents or a smaller number of larger documents. This can lead to much better performance while elimianting the possibility of OutOfMemoryErrors. The actual job of managing to a size constraint, or any other constraint, is left up the applicatation. The addition of synchronized to flushRamSegments() is only for safety of an external call. It has no significant effect on internal calls since they all come from a sychronized caller. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-709) [PATCH] Enable application-level management of IndexWriter.ramDirectory size
[ http://issues.apache.org/jira/browse/LUCENE-709?page=comments#action_12450260 ] Chuck Williams commented on LUCENE-709: --- Not synchronizing on the Hashtable, even if using an Enumerator, creates problems as the contents of the hash table may change during the sizeInBytes() iteration. Files might be deleted and/or added to the directory concurrently, causing the size to be computed from an invalid intermediate state. Using an Enumerator would cause the invalid value to be returned without an exception, while using an Iterator instead generates a ConcurrentModificationException. Synchronizing on files avoids the problem altogether without much cost as the loop is fast. Hashtable uses a single class, Hashtable.Enumerator, for both its iterator and its enumerator. There are a couple minor differences in the respective methods, such as the above, but not much. The issue with RAMFile.length being a long is an issue, but, this bug already exists in Lucene without sizeInBytes(). See RAMDirectory.fileLength(), which has the same problem now. I'll submit another verison of the patch that encapsulates RAMFile.length into a sychronized getter and setter. It's only used in a few places (RAMDIrectory, RAMInputStream and RAMOutputStream). [PATCH] Enable application-level management of IndexWriter.ramDirectory size Key: LUCENE-709 URL: http://issues.apache.org/jira/browse/LUCENE-709 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 2.0.1 Environment: All Reporter: Chuck Williams Attachments: ramDirSizeManagement.patch, ramDirSizeManagement.patch IndexWriter currently only supports bounding of in the in-memory index cache using maxBufferedDocs, which limits it to a fixed number of documents. When document sizes vary substantially, especially when documents cannot be truncated, this leads either to inefficiencies from a too-small value or OutOfMemoryErrors from a too large value. This simple patch exposes IndexWriter.flushRamSegments(), and provides access to size information about IndexWriter.ramDirectory so that an application can manage this based on total number of bytes consumed by the in-memory cache, thereby allow a larger number of smaller documents or a smaller number of larger documents. This can lead to much better performance while elimianting the possibility of OutOfMemoryErrors. The actual job of managing to a size constraint, or any other constraint, is left up the applicatation. The addition of synchronized to flushRamSegments() is only for safety of an external call. It has no significant effect on internal calls since they all come from a sychronized caller. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-709) [PATCH] Enable application-level management of IndexWriter.ramDirectory size
[ http://issues.apache.org/jira/browse/LUCENE-709?page=comments#action_12450301 ] Chuck Williams commented on LUCENE-709: --- I hadn' t considered the case of such large values for maxBufferedDocs, and agree that the loop execution time is non-trivial in such cases. Incremental management of the size seems most important, especially considering that this will also eliminate the cost of the synchronization. I still think the syncrhonization adds safety since it guarantees that the loop sees a state of the directory that did exist at some time. At that time, the directory did have the reported size. Without the synchronization the loop may compute a size for a set of files that never comprised the contents of the directory at any instant. Consider this case: 1. Thread 1 adds a new document, creating a new segment with new index files, leading to segment merging, that creates new larger segment index files, and then deletes all replaced segment index files. Thread 1 then adds a second document, creating new segment index files. 2. Thread 2 is computing sizeInBytes and happens to see a state where all the new files from both the first and second documents are added, but the deletions are not seen. This could happen if the deleted files happen to be earlier in the hash array than the added files for either document. In this case sizeInBytes() without the synchronization computes a larger size for the directory than ever actually existed. Re. RAMDIrectory.fileLength(), it is not used within Lucene at all, but it is public, and the restriction that is not valid when index operations are happening concurrently is not specified. I think that is a bug. I'll rethink the patch based on your observations, Yonik, and resubmit. Thanks. [PATCH] Enable application-level management of IndexWriter.ramDirectory size Key: LUCENE-709 URL: http://issues.apache.org/jira/browse/LUCENE-709 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 2.0.1 Environment: All Reporter: Chuck Williams Attachments: ramDirSizeManagement.patch, ramDirSizeManagement.patch IndexWriter currently only supports bounding of in the in-memory index cache using maxBufferedDocs, which limits it to a fixed number of documents. When document sizes vary substantially, especially when documents cannot be truncated, this leads either to inefficiencies from a too-small value or OutOfMemoryErrors from a too large value. This simple patch exposes IndexWriter.flushRamSegments(), and provides access to size information about IndexWriter.ramDirectory so that an application can manage this based on total number of bytes consumed by the in-memory cache, thereby allow a larger number of smaller documents or a smaller number of larger documents. This can lead to much better performance while elimianting the possibility of OutOfMemoryErrors. The actual job of managing to a size constraint, or any other constraint, is left up the applicatation. The addition of synchronized to flushRamSegments() is only for safety of an external call. It has no significant effect on internal calls since they all come from a sychronized caller. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: ParallelMultiSearcher reimplementation
Doug Cutting wrote on 11/13/2006 10:50 AM: Chuck Williams wrote: I followed this same logic in ParallelWriter and got burned. My first implementation (still the version submitted as a patch in jira) used dynamic threads to add the subdocuments to the parallel subindexes simultaneously. This hit a problem with abnormal native heap OOM's in the jvm. At first I thought it was simply a thread stack size / java heap size configuration issue, but adjusting these did not resolve the issue. This was on linux. ps -L showed large numbers of defunct threads. jconsole showed enormous growing total-ever-allocated thread counts. I switched to a thread pool and the issue went away with the same config settings. Can you demonstrate the problem with a standalone program? Way back in the 90's I implemented a system at Excite that spawned one or more Java threads per request, and it ran for days on end, handling 20 or more requests per second. The thread spawning overhead was insignificant. That was JDK 1.2 on Solaris. Have things gotten that much worse in the interim? Today Hadoop's RPC allocates a thread per connection, and we see good performance. So I certainly have counterexamples. Are you pushing memory to the limit? In my case, we need a maximally sized Java heap (about 2.5G on linux) and so carefully minimize the thread stack and perm space sizes. My suspicion is that it takes a while after a thread is defunct before all resources are reclaimed. We are hitting our server with 50 simultaneous threads doing indexing, each of which writes 6 parallel subindexes in a separate thread. This yields hundreds of threads created per second in tight total thread stack space; the process continually bumped over the native heap limit. With the change to thread pools, and therefore no dynamic creation and destruction of thread stacks, all works fine. Unless you are running with a maximal Java heap, you are unlikely to have the issue as there is plenty of space left over for the native heap, so a delay in thread stack reclamation would yield a larger average process size, but would not cause OOM's. Chuck - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-709) [PATCH] Enable application-level management of IndexWriter.ramDirectory size
[ http://issues.apache.org/jira/browse/LUCENE-709?page=comments#action_12448923 ] Chuck Williams commented on LUCENE-709: --- Mea Culpa! Bad bug on my part. Thanks for spotting it! I believe the solution is simple. RAMDirectory.files is a Hashtable, i.e. it is synchronized. Hashtable.values() tracks all changes to the ram directory as they occur. The fail-fast iterator does not accept concurrent modificaitons. So, the answer is to stop concurrent modifications during sizeInBytes(). This is accomplised by synchronizing on the same objects as the modificaitons already use, i.e. files. I'm attaching a new version of the the patch that I believe is correct. Please emabarass me again if there is another mistake! [PATCH] Enable application-level management of IndexWriter.ramDirectory size Key: LUCENE-709 URL: http://issues.apache.org/jira/browse/LUCENE-709 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 2.0.1 Environment: All Reporter: Chuck Williams Attachments: ramDirSizeManagement.patch, ramDirSizeManagement.patch IndexWriter currently only supports bounding of in the in-memory index cache using maxBufferedDocs, which limits it to a fixed number of documents. When document sizes vary substantially, especially when documents cannot be truncated, this leads either to inefficiencies from a too-small value or OutOfMemoryErrors from a too large value. This simple patch exposes IndexWriter.flushRamSegments(), and provides access to size information about IndexWriter.ramDirectory so that an application can manage this based on total number of bytes consumed by the in-memory cache, thereby allow a larger number of smaller documents or a smaller number of larger documents. This can lead to much better performance while elimianting the possibility of OutOfMemoryErrors. The actual job of managing to a size constraint, or any other constraint, is left up the applicatation. The addition of synchronized to flushRamSegments() is only for safety of an external call. It has no significant effect on internal calls since they all come from a sychronized caller. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-709) [PATCH] Enable application-level management of IndexWriter.ramDirectory size
[ http://issues.apache.org/jira/browse/LUCENE-709?page=all ] Chuck Williams updated LUCENE-709: -- Attachment: ramDirSizeManagement.patch [PATCH] Enable application-level management of IndexWriter.ramDirectory size Key: LUCENE-709 URL: http://issues.apache.org/jira/browse/LUCENE-709 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 2.0.1 Environment: All Reporter: Chuck Williams Attachments: ramDirSizeManagement.patch, ramDirSizeManagement.patch IndexWriter currently only supports bounding of in the in-memory index cache using maxBufferedDocs, which limits it to a fixed number of documents. When document sizes vary substantially, especially when documents cannot be truncated, this leads either to inefficiencies from a too-small value or OutOfMemoryErrors from a too large value. This simple patch exposes IndexWriter.flushRamSegments(), and provides access to size information about IndexWriter.ramDirectory so that an application can manage this based on total number of bytes consumed by the in-memory cache, thereby allow a larger number of smaller documents or a smaller number of larger documents. This can lead to much better performance while elimianting the possibility of OutOfMemoryErrors. The actual job of managing to a size constraint, or any other constraint, is left up the applicatation. The addition of synchronized to flushRamSegments() is only for safety of an external call. It has no significant effect on internal calls since they all come from a sychronized caller. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Dynamically varying maxBufferedDocs
Hi All, Does anybody have experience dynamically varying maxBufferedDocs? In my app, I can never truncate docs and so work with maxFieldLength set to Integer.MAX_VALUE. Some documents are large, over 100 MBytes. Most documents are tiny. So a fixed value of maxBufferedDocs to avoid OOM's is too small for good ongoing performance. It appears to me that the merging code will work fine if the initial segment sizes vary. E.g., a simple solution is to make IndexWriter.flushRamSegments() public and manage this externally (for which I already have all the needed apparatus, including size information, the necessary thread synchronization, etc.). A better solution might be to build a size-management option into the maxBufferedDocs mechanism in lucene, but at least for my purposes, that doesn' t appear necessary as a first step. My main concern is that the mergeFactor escalation merging logic will somehow behave poorly in the presence of dynamically varying initial segment sizes. I'm going to try this now, but am wondering if anybody has tried things along these lines and might offer useful suggestions or admonitions. Thanks for any advice, Chuck - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Dynamically varying maxBufferedDocs
Thanks Yonik! Poor wording on my part. I won't vary maxBufferedDocs, just am making flushRamSegments() public and calling it externally (properly synchronized), earlier than it would otherwise be called from ongoing addDocument-driven merging. Sounds like this should work. Chuck Yonik Seeley wrote on 11/09/2006 08:37 AM: On 11/9/06, Chuck Williams [EMAIL PROTECTED] wrote: My main concern is that the mergeFactor escalation merging logic will somehow behave poorly in the presence of dynamically varying initial segment sizes. Things will work as expected with varying segments sizes, but *not* varying maxBufferedDocuments. The level of a segment is defined by maxBufferedDocuments. If there were a solution to flush early w/o maxBufferedDocuments changing, things would work fine. -Yonik http://incubator.apache.org/solr Solr, the open-source Lucene search server - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Dynamically varying maxBufferedDocs
Yonik Seeley wrote on 11/09/2006 08:50 AM: For best behavior, you probably want to be using the current (svn-trunk) version of Lucene with the new merge policy. It ensures there are mergeFactor segments with size = maxBufferedDocs before triggering a merge. This makes for faster indexing in the presence of deleted docs or partially full segments. I've got quite a few local patches unfortunately. It will take a while to sync up. If I don't already have this new logic, can I pick it up by just merging with the latest IndexWriter or are the changes more extensive? Thanks again, Chuck - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Dynamically varying maxBufferedDocs
Chuck Williams wrote on 11/09/2006 08:55 AM: Yonik Seeley wrote on 11/09/2006 08:50 AM: For best behavior, you probably want to be using the current (svn-trunk) version of Lucene with the new merge policy. It ensures there are mergeFactor segments with size = maxBufferedDocs before triggering a merge. This makes for faster indexing in the presence of deleted docs or partially full segments. I've got quite a few local patches unfortunately. It will take a while to sync up. If I don't already have this new logic, can I pick it up by just merging with the latest IndexWriter or are the changes more extensive? I must already have the new merge logic as the only diff between my IndexWriter and latest svn is the change just made to make flushRamSegments public. Yonik, thanks for your help. This should work well! Chuck - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Dynamically varying maxBufferedDocs
This sounds good. Michael, I'd love to see your patch, Chuck Michael Busch wrote on 11/09/2006 09:13 AM: I had the same problem with large documents causing memory problems. I solved this problem by introducing a new setting in IndexWriter setMaxBufferSize(long). Now a merge is either triggered when bufferedDocs==maxBufferedDocs *or* the size of the bufferedDocs = maxBufferSize. I made these changes based on the new merge policy Yonik mentioned, so if anyone is interested I could open a Jira issue and submit a patch. - Michael Yonik Seeley wrote: On 11/9/06, Chuck Williams [EMAIL PROTECTED] wrote: Thanks Yonik! Poor wording on my part. I won't vary maxBufferedDocs, just am making flushRamSegments() public and calling it externally (properly synchronized), earlier than it would otherwise be called from ongoing addDocument-driven merging. Sounds like this should work. Yep. For best behavior, you probably want to be using the current (svn-trunk) version of Lucene with the new merge policy. It ensures there are mergeFactor segments with size = maxBufferedDocs before triggering a merge. This makes for faster indexing in the presence of deleted docs or partially full segments. -Yonik http://incubator.apache.org/solr Solr, the open-source Lucene search server - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Dynamically varying maxBufferedDocs
Michael Busch wrote on 11/09/2006 09:56 AM: This sounds good. Michael, I'd love to see your patch, Chuck Ok, I'll probably need a few days before I can submit it (have to code unit tests and check if it compiles with the current head), because I'm quite busy with other stuff right now. But you will get it soon :-) I've just written my patch and will submit it too once it is fully tested. I took this approach: 1. Add sizeInBytes() to RAMDirectory 2. Make flushRamSegments() plus new numRamDocs() and ramSizeInBytes() public in IndexWriter This does not provide the facility in IndexWriter, but it does provide a nice api to manage this externally. I didn't do it in IndexWriter for two reasons: 1. I use ParallelWriter, which has to manage this differently 2. There is no general mechanism in lucene to size documents. I use have an interface for my readers in reader-valued fields to support this. In general, there are things the application knows that lucene doesn't know that help to manage the size bounds Chuck - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Created: (LUCENE-709) [PATCH] Enable application-level management of IndexWriter.ramDirectory size
[PATCH] Enable application-level management of IndexWriter.ramDirectory size Key: LUCENE-709 URL: http://issues.apache.org/jira/browse/LUCENE-709 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 2.0.1 Environment: All Reporter: Chuck Williams IndexWriter currently only supports bounding of in the in-memory index cache using maxBufferedDocs, which limits it to a fixed number of documents. When document sizes vary substantially, especially when documents cannot be truncated, this leads either to inefficiencies from a too-small value or OutOfMemoryErrors from a too large value. This simple patch exposes IndexWriter.flushRamSegments(), and provides access to size information about IndexWriter.ramDirectory so that an application can manage this based on total number of bytes consumed by the in-memory cache, thereby allow a larger number of smaller documents or a smaller number of larger documents. This can lead to much better performance while elimianting the possibility of OutOfMemoryErrors. The actual job of managing to a size constraint, or any other constraint, is left up the applicatation. The addition of synchronized to flushRamSegments() is only for safety of an external call. It has no significant effect on internal calls since they all come from a sychronized caller. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-709) [PATCH] Enable application-level management of IndexWriter.ramDirectory size
[ http://issues.apache.org/jira/browse/LUCENE-709?page=all ] Chuck Williams updated LUCENE-709: -- Attachment: ramDirSizeManagement.patch [PATCH] Enable application-level management of IndexWriter.ramDirectory size Key: LUCENE-709 URL: http://issues.apache.org/jira/browse/LUCENE-709 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 2.0.1 Environment: All Reporter: Chuck Williams Attachments: ramDirSizeManagement.patch IndexWriter currently only supports bounding of in the in-memory index cache using maxBufferedDocs, which limits it to a fixed number of documents. When document sizes vary substantially, especially when documents cannot be truncated, this leads either to inefficiencies from a too-small value or OutOfMemoryErrors from a too large value. This simple patch exposes IndexWriter.flushRamSegments(), and provides access to size information about IndexWriter.ramDirectory so that an application can manage this based on total number of bytes consumed by the in-memory cache, thereby allow a larger number of smaller documents or a smaller number of larger documents. This can lead to much better performance while elimianting the possibility of OutOfMemoryErrors. The actual job of managing to a size constraint, or any other constraint, is left up the applicatation. The addition of synchronized to flushRamSegments() is only for safety of an external call. It has no significant effect on internal calls since they all come from a sychronized caller. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: ParallelMultiSearcher reimplementation
Doug Cutting wrote on 11/03/2006 12:18 PM: Chuck Williams wrote: Why would a thread pool be more controversial? Dynamically creating and garbaging threads has many downsides. The JVM already pools native threads, so mostly what's saved by thread pools is the allocation initialization of new Thread instances. There are also downsides to thread pools. They alter ThreadLocal semantics and generally add complexity that may not be warranted. Like most optimizations, use of thread pools should be motivated by benchmarks. I followed this same logic in ParallelWriter and got burned. My first implementation (still the version submitted as a patch in jira) used dynamic threads to add the subdocuments to the parallel subindexes simultaneously. This hit a problem with abnormal native heap OOM's in the jvm. At first I thought it was simply a thread stack size / java heap size configuration issue, but adjusting these did not resolve the issue. This was on linux. ps -L showed large numbers of defunct threads. jconsole showed enormous growing total-ever-allocated thread counts. I switched to a thread pool and the issue went away with the same config settings. So, I'm not convinced the jvm does such a good job a pooling OS native threads. Re. ThreadLocals, I agree the semantics are different, but arguably they are most useful with thread pools. With dynamic threads, you get a reallocation every time, while with thread pools you avoid constant reallocations. Chuck - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: ParallelMultiSearcher reimplementation
Chris Hostetter wrote on 11/03/2006 09:40 AM: : Is there any timeline for when Java 1.5 packages will be allowed? I don't think i'll incite too much rioting to say no, there is no timeline .. I may incite some rioting by saying my guess is 1.5 packages will be supported when the patches requiring them become highly desired. Not being shy about inciting riots, the problem with this approach is that people using Java 1.5 are discouraged from submitting patches to begin with. Doug Cutting wrote on 11/03/2006 08:39 AM: Please consider breaking these into separate patches, one to permit ParallelMultiSearcher w/ HitCollector to not be single-threaded, and another to re-implement things with a thread pool. The latter is more controversial, and it would be a shame to have the former wait on it. Why would a thread pool be more controversial? Dynamically creating and garbage-collecting threads has many downsides. Chuck - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Include BM25 in Lucene?
Vic Bancroft wrote on 10/17/2006 02:44 AM: In some of my group's usage of lucene over large document collections, we have split the documents across several machines. This has led to a concern about whether the inverse document frequency was appropriate, since the score seems to be dependent on the partitioning of documents over indexing hosts. We have not formulated an experiment to determine if it seriously affects our results, though it has been discussed. What version of Lucene are you using? Are you using ParallelMultiSearcher to manage the distributed indexes or have you implemented your own mechanism? There was a bug a couple years ago, in the 1.4.3 version as I recall, where ParallelMultiSearcher was not computing df's appropriately, but that has been fixed for a long time now. The df's are the sum of the df's from each distributed index and thus are independent of the partitioning. Chuck - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
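A small illustration of why the summed df is partition-independent, assuming an array of Searchables like the sub-searchers handed to a MultiSearcher/ParallelMultiSearcher (classes from org.apache.lucene.search and org.apache.lucene.index):

    // df for a term is the sum over all distributed indexes, so it does not
    // depend on how the documents were split across hosts.
    Term t = new Term("body", "lucene");
    int df = 0;
    for (Searchable s : searchables) {
        df += s.docFreq(t);
    }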
Re: Ferret's changes
David Balmain wrote on 10/10/2006 08:53 PM: On 10/11/06, Chuck Williams [EMAIL PROTECTED] wrote: I personally would always store term vectors since I use a StandardTokenizer and Stemming. In this case highlighting matches in small documents is not trivial. Ferret's highlighter matches even sloppy phrase queries and phrases with gaps between the terms correctly. I couldn't do this without the use of term vectors. I use stemming as well, but am not yet matching phrases like that. Perhaps term vectors will be useful to achieve this, although they come at a high cost and it doesn't seem difficult or expensive to do the matching directly on the text of small items. I suppose it would be possible for the single conceptual field 'body' to be represented with two physical fields 'smallBody' and 'largeBody' where the former stores term vectors and the latter does not. If I really wanted to solve this problem I would use this solution. It is pretty easy to search multiple fields when I need to. Ferret's Query language even supports it: smallBody|largeBody:phrase to search for Couldn't agree more. I have a number of extensions to Lucene's query parser, including this for multiple fields: {smallBody largeBody}:phrase to search for In the end, I think the benefits of my model far outweigh the costs. For me at least anyway. Based on the performance figures so far, it seems they do! I think dynamic term vectors have a substantial benefit, but can easily be implemented in a model where all field indexing properties are fixed. Chuck - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
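For concreteness, a hedged sketch of what the smallBody/largeBody split looks like as a plain Lucene query, without either query-parser extension (classes from org.apache.lucene.search and org.apache.lucene.index; example terms are illustrative and assumed already analyzed):

    // Search the one logical body field across its two physical fields.
    PhraseQuery small = new PhraseQuery();
    small.add(new Term("smallBody", "phrase"));
    small.add(new Term("smallBody", "to"));
    small.add(new Term("smallBody", "search"));
    PhraseQuery large = new PhraseQuery();
    large.add(new Term("largeBody", "phrase"));
    large.add(new Term("largeBody", "to"));
    large.add(new Term("largeBody", "search"));
    BooleanQuery body = new BooleanQuery();
    body.add(small, BooleanClause.Occur.SHOULD);   // a match in either physical field suffices
    body.add(large, BooleanClause.Occur.SHOULD);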
Re: Define end-of-paragraph
Reuven Ivgi wrote on 10/02/2006 09:32 PM: I want to divide a document into paragraphs, still having proximity search within each paragraph How can I do that? Is your issue that you want the paragraphs to be in a single document, but you want to limit proximity search to find matches only within a single paragraph? If so, you could parse your document into paragraphs and when generating tokens for it place large gaps at the paragraph boundaries. Each Token in lucene has a startOffset and endOffset that you can set as you generate Tokens inside TokenStream.next() for the TokenStream returned by your Analyzer. Those classes and methods are all in org.apache.lucene.analysis. Or alternatively, you could make each paragraph a separate field value and use Analyzer.getPositionIncrementGap() to achieve essentially the same thing (except that your Documents could get unwieldy if you have that many paragraphs). If this is not what you are trying to do, then please explain your objectives precisely. Good luck, Chuck - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
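A minimal sketch of the multi-value approach, assuming StandardAnalyzer and an arbitrary gap of 1000 (any value larger than the slop you ever query with will do); classes are from org.apache.lucene.analysis.standard and org.apache.lucene.document:

    // Analyzer that leaves a large position gap between successive values of a field,
    // so proximity/sloppy phrase queries cannot match across paragraph boundaries.
    class ParagraphGapAnalyzer extends StandardAnalyzer {
        public int getPositionIncrementGap(String fieldName) {
            return 1000;
        }
    }

    // Index each paragraph as a separate value of the same field.
    Document doc = new Document();
    for (String paragraph : paragraphs) {
        doc.add(new Field("body", paragraph, Field.Store.NO, Field.Index.TOKENIZED));
    }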
Re: Define end-of-paragraph
Hi Reuven, In my haste last night, I pointed you at the wrong fields on Token. You need to set the position to create inter-paragraph gaps, not the offsets, so you want Token.setPositionIncrement() for that approach, or Analyzer.getPositionIncrementGap() if you use the multi-field approach. You will likely have performance problems with Documents that have thousands of fields, so I would not recommend that approach. Are you only matching paragraphs rather than whole documents? If so, another approach would be to make each paragraph a separate document. Then you could store document and paragraph id's in separate fields and have all the information you want. If you need whole document matching, but want the paragraph number of matches, one approach might be to use SpanQuery's together with a position-encoding of paragraph numbers. E.g., place your paragraphs starting at positions 0, 1, 2, 3, ... Then from the positions on the spans you find, you can identify what paragraph you are in. I'm sure you can come up with many other ways to represent this information as well. Hope this helps, Chuck Reuven Ivgi wrote on 10/02/2006 11:27 PM: Hello, To be more precise, the basic entity I am using is a document, each with paragraphs, which may number up to a few thousand. I need the proximity search within a paragraph, yet, I want to get as a search result the paragraph number also. Maybe defining each paragraph as a separate field is the best way? What do you think? Thanks in advance Reuven Ivgi -Original Message- From: Chuck Williams [mailto:[EMAIL PROTECTED] Sent: Tuesday, October 03, 2006 10:58 AM To: java-dev@lucene.apache.org Subject: Re: Define end-of-paragraph Reuven Ivgi wrote on 10/02/2006 09:32 PM: I want to divide a document into paragraphs, still having proximity search within each paragraph How can I do that? Is your issue that you want the paragraphs to be in a single document, but you want to limit proximity search to find matches only within a single paragraph? If so, you could parse your document into paragraphs and when generating tokens for it place large gaps at the paragraph boundaries. Each Token in lucene has a startOffset and endOffset that you can set as you generate Tokens inside TokenStream.next() for the TokenStream returned by your Analyzer. Those classes and methods are all in org.apache.lucene.analysis. Or alternatively, you could make each paragraph a separate field value and use Analyzer.getPositionIncrementGap() to achieve essentially the same thing (except that your Documents could get unwieldy if you have that many paragraphs). If this is not what you are trying to do, then please explain your objectives precisely. Good luck, Chuck - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] __ This email has been scanned by the MessageLabs Email Security System. For more information please visit http://www.messagelabs.com/email __ - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: After kill -9 index was corrupt
Hi All, I found this issue. There is no problem in Lucene, and I'd like to leave this thread with that assertion to avoid confusing future archive searchers/readers. The index was actually not corrupt at all. I use ParallelReader and ParallelWriter. A kill -9 can leave the subindexes out of sync. My recovery code repairs this on restart by noticing the indexes are out-of-sync, deleting the document(s) that were added to some subindex(es) but not the other(s), then optimizing to resync the doc-ids. The issue is that my bulk updater does not at present support compound file format and the recovery code forgot to turn that off prior to the optimize! Thus a .cfs file was created, which confused the bulk updater -- it did not see a segment that was inside the cfs. Sorry for the false alarm and thanks to all who helped with the original question/concern, Chuck Chuck Williams wrote on 09/11/2006 12:10 PM: I do have one module that does custom index operations. This is my bulk updater. It creates new index files for the segments it modifies and a new segments file, then uses the same commit mechanism as merging. I.e., it copies its new segments file into segments with the commit lock only after all the new index files are closed. In the problem scenario, I don't have any indication that the bulk updater was complicit but am of course fully exploring that possibility as well. The index was only reopened by the process after the kill -9 of the old process was completed, so there were not any threads still working on the old process. This remains a mystery. Thanks for your analysis and suggestions. If you have more ideas, please keep them coming! Chuck robert engels wrote on 09/11/2006 10:06 AM: I am not stating that you did not uncover a problem. I am only stating that it is not due to OS level caching. Maybe your sequence of events triggered a reread of the index, while some thread was still writing. The reread sees the 'unused segments' and deletes them, and then the other thread writes the updated 'segments' file. From what you state, it seems that you are using some custom code for index writing? (Maybe the NewIndexModified stuff)? Possibly there is an issue there. Do you maybe have your own cleanup code that attempts to remove unused segments from the directory? If so, that appears to be the likely culprit to me. On Sep 11, 2006, at 2:56 PM, Chuck Williams wrote: robert engels wrote on 09/11/2006 07:34 AM: A kill -9 should not affect the OS's writing of dirty buffers (including directory modifications). If this were the case, massive system corruption would almost always occur every time a kill -9 was used with any program. The only thing a kill -9 affects is user level buffering. The OS always maintains a consistent view of directory modifications and/or file modifications that were requested by programs. This entire discussion is pointless. Thanks everyone for your analysis. It appears I do not have any explanation. In my case, the process was in gc-limbo due to the memory leak and having butted up against its -Xmx. The process was kill -9'd and then restarted. The OS never crashed. The server this is on is healthy; it has been used continually since this happened without being rebooted and with no file system or any other issues. When the process was killed, one thread was merging segments as part of flushing the ram buffer while closing the index, due to the prior kill -15. When Lucene restarted, the segments file contained a segment name for which there were no corresponding index data files.
Chuck - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
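A minimal sketch of the recovery fix described above, assuming subIndexDir and analyzer identify the sub-index being repaired and that the trailing, partially added document(s) have already been deleted via an IndexReader:

    // Resync doc-ids by optimizing, but without producing a .cfs file
    // the bulk updater cannot read.
    IndexWriter writer = new IndexWriter(subIndexDir, analyzer, false);
    writer.setUseCompoundFile(false);   // the setting the recovery code forgot
    writer.optimize();
    writer.close();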
Re: After kill -9 index was corrupt
Paul Elschot wrote on 09/10/2006 09:15 PM: On Monday 11 September 2006 02:24, Chuck Williams wrote: Hi All, An application of ours under development had a memory leak that caused it to slow interminably. On linux, the application did not respond to kill -15 in a reasonable time, so kill -9 was used to forcibly terminate it. After this the segments file contained a reference to a segment whose index files were not present. I.e., the index was corrupt and Lucene could not open it. A thread dump at the time of the kill -9 shows that Lucene was merging segments inside IndexWriter.close(). Since segment merging only commits (updates the segments file) after the newly merged segment(s) are complete, I expect this is not the actual problem. Could a kill -9 prevent data from reaching disk for files that were previously closed? If so, then Lucene's index can become corrupt after kill -9. In this case, it is possible that a prior merge created new segment index files, updated the segments file, closed everything, the segments file made it to disk, but the index data files and/or their directory entries did not. If this is the case, it seems to me that flush() and FileDescriptor.sync() are required on each index file prior to close() to guarantee no corruption. Additionally a FileDescriptor.sync() is also probably required on the index directory to ensure the directory entries have been persisted. Shouldn't the sync be done after closing the files? I'm using sync in a (un*x) shell script after merges before backups. I'd prefer to have some more of this syncing built into Lucene because the shell sync syncs all disks which might be more than needed. So far I've had no problems, so there was no need to investigate further. I believe FileDescriptor.sync() uses fsync and not sync on linux. A FileDescriptor is no longer valid after the stream is closed, so sync() could not be done on a closed stream. I think the correct protocol is flush() the stream, sync() its FD, then close() it. Paul, do you know if kill -9 can create the situation where bytes from a closed file never make it to disk in linux? I think Lucene needs sync() in any event to be robust with respect to OS crashes, but am wondering if this explains my kill -9 problem as well. It seems bogus to me that a closed file's bytes would fail to be persisted unless the OS crashed, but I can't find any other explanation and I can't find any definitive information to affirm or refute this possible side effect of kill -9. The issue I've got is that my index can never lose documents. So I've implemented journaling on top of Lucene where only the last maxBufferedDocs documents are journaled and the whole journal is reset after close(). My application has no way to know when the bytes make it to disk, and so cannot manage its journal properly unless Lucene ensures index integrity with sync()'s. Chuck - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
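A sketch of the flush()/sync()/close() ordering described above, in plain java.io terms rather than Lucene's FSDirectory output classes (segmentFile and data are assumed):

    FileOutputStream out = new FileOutputStream(segmentFile);
    try {
        out.write(data);
        out.flush();          // push user-level buffers down to the OS
        out.getFD().sync();   // force the OS to write the bytes to the device (fsync)
    } finally {
        out.close();          // the FileDescriptor is invalid after this, so sync() must come first
    }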
Re: After kill -9 index was corrupt
robert engels wrote on 09/11/2006 07:34 AM: A kill -9 should not affect the OS's writing of dirty buffers (including directory modifications). If this were the case, massive system corruption would almost always occur every time a kill -9 was used with any program. The only thing a kill -9 affects is user level buffering. The OS always maintains a consistent view of directory modifications and/or file modifications that were requested by programs. This entire discussion is pointless. Thanks everyone for your analysis. It appears I do not have any explanation. In my case, the process was in gc-limbo due to the memory leak and having butted up against its -Xmx. The process was kill -9'd and then restarted. The OS never crashed. The server this is on is healthy; it has been used continually since this happened without being rebooted and with no file system or any other issues. When the process was killed, one thread was merging segments as part of flushing the ram buffer while closing the index, due to the prior kill -15. When Lucene restarted, the segments file contained a segment name for which there were no corresponding index data files. Chuck - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: After kill -9 index was corrupt
I do have one module that does custom index operations. This is my bulk updater. It creates new index files for the segments it modifies and a new segments file, then uses the same commit mechanism as merging. I.e., it copies its new segments file into segments with the commit lock only after all the new index files are closed. In the problem scenario, I don't have any indication that the bulk updater was complicit but am of course fully exploring that possibility as well. The index was only reopened by the process after the kill -9 of the old process was completed, so there were not any threads still working on the old process. This remains a mystery. Thanks for your analysis and suggestions. If you have more ideas, please keep them coming! Chuck robert engels wrote on 09/11/2006 10:06 AM: I am not stating that you did not uncover a problem. I am only stating that it is not due to OS level caching. Maybe your sequence of events triggered a reread of the index, while some thread was still writing. The reread sees the 'unused segments' and deletes them, and then the other thread writes the updated 'segments' file. From what you state, it seems that you are using some custom code for index writing? (Maybe the NewIndexModified stuff)? Possibly there is an issue there. Do you maybe have your own cleanup code that attempts to remove unused segments from the directory? If so, that appears to be the likely culprit to me. On Sep 11, 2006, at 2:56 PM, Chuck Williams wrote: robert engels wrote on 09/11/2006 07:34 AM: A kill -9 should not affect the OS's writing of dirty buffers (including directory modifications). If this were the case, massive system corruption would almost always occur every time a kill -9 was used with any program. The only thing a kill -9 affects is user level buffering. The OS always maintains a consistent view of directory modifications and/or file modifications that were requested by programs. This entire discussion is pointless. Thanks everyone for your analysis. It appears I do not have any explanation. In my case, the process was in gc-limbo due to the memory leak and having butted up against its -Xmx. The process was kill -9'd and then restarted. The OS never crashed. The server this is on is healthy; it has been used continually since this happened without being rebooted and with no file system or any other issues. When the process was killed, one thread was merging segments as part of flushing the ram buffer while closing the index, due to the prior kill -15. When Lucene restarted, the segments file contained a segment name for which there were no corresponding index data files. Chuck - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
After kill -9 index was corrupt
Hi All, An application of ours under development had a memory leak that caused it to slow interminably. On linux, the application did not respond to kill -15 in a reasonable time, so kill -9 was used to forcibly terminate it. After this the segments file contained a reference to a segment whose index files were not present. I.e., the index was corrupt and Lucene could not open it. A thread dump at the time of the kill -9 shows that Lucene was merging segments inside IndexWriter.close(). Since segment merging only commits (updates the segments file) after the newly merged segment(s) are complete, I expect this is not the actual problem. Could a kill -9 prevent data from reaching disk for files that were previously closed? If so, then Lucene's index can become corrupt after kill -9. In this case, it is possible that a prior merge created new segment index files, updated the segments file, closed everything, the segments file made it to disk, but the index data files and/or their directory entries did not. If this is the case, it seems to me that flush() and FileDescriptor.sync() are required on each index file prior to close() to guarantee no corruption. Additionally a FileDescriptor.sync() is also probably required on the index directory to ensure the directory entries have been persisted. A power failure or other operating system crash could cause this, not just kill -9. Does this seem like a possible explanation and fix for what happened? Could the same kind of problem happen on Windows? If this is the issue, then how would people feel about having Lucene do sync()'s a) always? or b) as an index configuration option? I need to fix whatever happened and so would submit a patch to resolve it. Thanks for advice and suggestions, Chuck - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Combining search steps without re-searching
I presume your search steps are anded, as in typical drill-downs? From a Lucene standpoint, each sequence of steps is a BooleanQuery of required clauses, one for each step. To add a step, you extend the BooleanQuery with a new clause. To not re-evaluate the full query, you'd need some query that regenerated the results of the prior step more efficiently than BooleanQuery. For example, if you happened to generate the entire result set for each step, presumably not feasible, then the results might be cached for regeneration. Assuming you cannot generate the entire result set, it's not obvious to me how having partially generated S1 and ... Sn-1 will help you generate S1 and ... Sn any faster. You will already get the benefit of OS caching with Lucene as it stands. You might find further caching extensions to the query types you use to be a performance gain that achieves what you want. You might also consider some kind of query optimization by extending the rewrite() methods. Chuck Fernando Mato Mira wrote on 08/28/2006 12:21 AM: Hello, We think we would have a problem if we try to use lucene because we do search combinations which might have hundreds of steps, so creating a combined query and executing again each time might be a problem. What would it entail to overhaul Lucene to do search combinations by taking advantage of the results already generated? Thanks - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
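As a concrete sketch of the anded drill-down described above (field names are invented for illustration; classes from org.apache.lucene.search and org.apache.lucene.index):

    // Each drill-down step just adds one more required clause and re-runs the search.
    BooleanQuery drillDown = new BooleanQuery();
    drillDown.add(new TermQuery(new Term("color", "red")), BooleanClause.Occur.MUST);   // step 1
    drillDown.add(new TermQuery(new Term("size", "large")), BooleanClause.Occur.MUST);  // step 2
    // step n: drillDown.add(nthClause, BooleanClause.Occur.MUST);
    Hits hits = searcher.search(drillDown);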
Re: Combining search steps without re-searching
Andrzej Bialecki wrote on 08/28/2006 09:19 AM: Chuck Williams wrote: I presume your search steps are anded, as in typical drill-downs? From a Lucene standpoint, each sequence of steps is a BooleanQuery of required clauses, one for each step. To add a step, you extend the BooleanQuery with a new clause. To not re-evaluate the full query, ... umm, guys, wouldn't a series of QueryFilter's work much better in this case? If some of the clauses are repeatable, then filtering results through a cached BitSet in such a filtered query would work nicely, right? If the possible initial steps comprise a small finite set, I could see that as a winner. In my app for instance, the drill-down selectors are dynamic and drawn from a large set of possibilities. It's hard to see how any small set of filters would be much of a benefit. A large set of filters would consume too much space. For a 10-million-document node, at 1.25 megabytes per filter, even a couple hundred filters add up to something significant. As I understand things, filters take considerably more time to initially create but then can more than make this up through repetitive use. So they are a winner iff there are a small number of specific steps that are frequently and disproportionately used. Chuck - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
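For the arithmetic above: a filter is one bit per document, so 10,000,000 documents / 8 bits per byte = 1,250,000 bytes, i.e. roughly 1.25 MB per cached filter. A hedged usage sketch for the case where one drill-down step really is frequent enough to cache (restOfQuery is assumed; classes from org.apache.lucene.search):

    // QueryFilter keeps a cached bit set per reader, so repeated use of the same
    // drill-down step avoids re-evaluating that clause.
    Filter colorRed = new QueryFilter(new TermQuery(new Term("color", "red")));
    Hits hits = searcher.search(restOfQuery, colorRed);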
[jira] Created: (LUCENE-659) [PATCH] PerFieldAnalyzerWrapper fails to implement getPositionIncrementGap()
[PATCH] PerFieldAnalyzerWrapper fails to implement getPositionIncrementGap() Key: LUCENE-659 URL: http://issues.apache.org/jira/browse/LUCENE-659 Project: Lucene - Java Issue Type: Bug Components: Analysis Affects Versions: 2.0.1, 2.1 Environment: Any Reporter: Chuck Williams Attachments: PerFieldAnalyzerWrapper.patch The attached patch causes PerFieldAnalyzerWrapper to delegate calls to getPositionIncrementGap() to the analyzer that is appropriate for the field in question. The current behavior without this patch is to always use the default value from Analyzer, which is a bug because PerFieldAnalyzerWrapper should behave just as if it was the analyzer for the selected field. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
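A minimal sketch of the delegation this patch describes; since PerFieldAnalyzerWrapper's internal map is private, the sketch keeps its own field-to-analyzer map rather than patching the real class:

    import java.io.Reader;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;

    class PerFieldGapAnalyzer extends Analyzer {
        private final Analyzer defaultAnalyzer;
        private final Map<String, Analyzer> analyzers = new HashMap<String, Analyzer>();

        PerFieldGapAnalyzer(Analyzer defaultAnalyzer) { this.defaultAnalyzer = defaultAnalyzer; }

        void addAnalyzer(String fieldName, Analyzer a) { analyzers.put(fieldName, a); }

        private Analyzer analyzerFor(String fieldName) {
            Analyzer a = analyzers.get(fieldName);
            return a != null ? a : defaultAnalyzer;
        }

        public TokenStream tokenStream(String fieldName, Reader reader) {
            return analyzerFor(fieldName).tokenStream(fieldName, reader);
        }

        // The point of the patch: delegate instead of inheriting Analyzer's default.
        public int getPositionIncrementGap(String fieldName) {
            return analyzerFor(fieldName).getPositionIncrementGap(fieldName);
        }
    }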
Strange behavior of positionIncrementGap
Hi All, There is a strange treatment of positionIncrementGap in DocumentWriter.invertDocument(). The gap is inserted between all values of a field, except it is not inserted between values if the prefix of the value list up to that point has not yet generated a token. For example, if a field F has values A, B, and C, the following example cases arise: 1. A and B both generate no tokens == no positionIncrementGaps are generated 2. A has no tokens but B does == just the gap between B and C 3. A has tokens but B and C do not == both gaps between A and B, and between B and C are generated So, empty fields are treated anomalously. They are ignored for gap purposes at the beginning of the field list, but included if they occur later in the field list. This issue caused a subtle bug in my bulk update operation because to modify values and update the postings it must reanalyze them with precisely the same positions used when they were originally indexed. So, I had to match this previously unnoticed strange behavior. I could post a patch to fix this, but am concerned it might introduce upward incompatibilities in various implementations and applications that are dependent on Lucene index format. If that is not a concern in this case, please let me know and I'll post a patch. I at least wanted to report it. Chuck - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Strange behavior of positionIncrementGap
Chris Hostetter wrote on 08/11/2006 09:08 AM: (using lower case to indicate no tokens produced and upper case to indicate tokens were produced) ... 1) a b C _gap_ D ...results in: C _gap_ D 2) a B _gap_ C _gap_ D ...results in: B _gap_ C _gap_ D 3) A _gap_ b _gap_ c _gap_ D ...results in: A _double_gap_ D ...is that the behavior you are seeing? Almost. The only difference is that case 3 has 3 gaps, so it's A _triple_gap_ D. Only case #3 seems wrongish to me there. ... i started to explain why i thought it made sense to go ahead and fix this, where by fix i meant only insert one gap in case#3 ... and then realized i was actually arguing in favor of the current behavior for case#3, here is why... based on the semi-frequently discussed usage of token gap sizes to denote sentence/paragraph/page boundaries for the purpose of sloppy phrase queries, it certainly seems worthwhile to fix to me (so that queries like find Erik within 3 pages of Otis still work even if one of those pages is blank ... ...that's when i realized the current behavior of case#3 is actually important for accurate matching, otherwise a search for two words within a certain number of pages would have a false match if those pages were blank. case #1 seems fine, but case #2 seems like the wrong case to me now, because trying to find occurrences of B on page #1 using a SpanFirst query will have false positives ... it seems like the positionIncrementGap should always be called/used after any field value is added (even if the value results in no tokens) before the next value is added (even if that value results in no tokens) Does this jive with what you were expecting, and the patch you were considering? Precisely. The same concern about SpanFirstQuery also applies to case 1. My bulk update code was always generating the positionIncrementGap between all field values, so if there are 4 values it would always generate 3 gaps independent of whether or not the values generate tokens. For your cases it generated: 1) a b C D ...results in: _gap_ _gap_ C _gap_ D 2) a B C D ...results in: _gap_ B _gap_ C _gap_ D 3) A b c D ...results in: A _gap_ _gap_ _gap_ D This seems a natural behavior and is consistent with the use cases you describe (which are essentially the same reason I'm using gaps, and presumably the main purpose of gaps). Hoss, do you think it would be ok to fix given the potential upward incompatibility for index-format-dependent implementations? Chuck - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
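In pseudocode terms, the fix being agreed on amounts to something like the following inside the inverter; invertValue() is a hypothetical helper (not a Lucene method) that tokenizes one value and returns the last position used:

    // Add the gap between every pair of values, whether or not a value produced tokens.
    int position = 0;
    for (int i = 0; i < values.length; i++) {
        if (i > 0) {
            position += analyzer.getPositionIncrementGap(fieldName);
        }
        position = invertValue(values[i], position);   // hypothetical helper
    }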
Re: Using Lucene for Semantic search
I have built such a system, although not with Lucene at the time. I doubt you need to modify anything in Lucene to achieve this. You may want to index words, stems and/or concepts from the ontology. Concepts from the ontology may relate to words or phrases. Lucene's token structure is flexible, supporting all of these. E.g., you can create your own Analyzer that looks up words and phrases in your ontology and then generates appropriate concept tokens that supplement the word/stem tokens. Concept tokens can similarly span phrases. Presuming you want some kind of word sense disambiguation through context, you can either integrate your model into the Analyzer or create a separate pre-processor. The same Analyzer or a variant of it could be used to map the Query into tokens to search. This would support concept--concept searches, useful for example in cross-language search. Word sense disambiguation is generally more difficult in typically short queries, so there are alternatives worth considering. E.g., you could expand queries (or index tokens) into the full set of possibilities (synonym words or concepts). If you have an a-priori or contextual ranking of those possibilities, you can generate boosts in Lucene to reflect that. If all you want is ontologic search, there are your hooks. If you want more sophisticated query transformations, e.g. for natural language QA, you probably want a custom query pre-processor to generate the specific queries you want. Hope these thoughts are useful, Chuck Chris Wildgoose wrote on 07/20/2006 11:19 AM: I have been working with Lucene for some time, and I have an interest in developing a Semantic Search solution. I was looking into extending lucene for this. I know this would involve some significant re-engineering of the indexing procedure to support the ability to assign words to nodes within an ontology. In addition the query would need to be modified. I was wondering whether anyone out there had gone down this path? Chris - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
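A hedged sketch of the analyzer hook described above, using the pre-2.9 TokenStream API; Ontology and conceptFor() are hypothetical stand-ins for whatever lookup your ontology provides:

    import java.io.IOException;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    class ConceptFilter extends TokenFilter {
        private final Ontology ontology;   // hypothetical: maps a word/stem to a concept id
        private Token pendingConcept;

        ConceptFilter(TokenStream input, Ontology ontology) {
            super(input);
            this.ontology = ontology;
        }

        public Token next() throws IOException {
            if (pendingConcept != null) {          // emit a buffered concept token first
                Token concept = pendingConcept;
                pendingConcept = null;
                return concept;
            }
            Token word = input.next();
            if (word == null) return null;
            String conceptId = ontology.conceptFor(word.termText());
            if (conceptId != null) {
                pendingConcept = new Token(conceptId, word.startOffset(), word.endOffset(), "concept");
                pendingConcept.setPositionIncrement(0);   // same position as the word it came from
            }
            return word;
        }
    }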
Re: Lucene/Netbean Newbie looking for help
Hi Peter, I'm also a Netbeans user, albeit a very happy one who would never consider eclipse! The following sequence of steps has worked for me in netbeans 4.0 and 5.0 (haven't upgraded to 5.5 quite yet). The reason for the unusual directory structure is that Lucene's interleaving of the core and the various contribs within a single directory tree is incompatible with netbeans standard assumptions. This is worked around by having all the project files external to the Lucene directory tree; each can point at its build script, source package, etc., in the same directory tree. 1. Create a parent directory for all of your projects, say Projects. 2. Check lucene out of svn into Projects/LuceneTrunk. 3. Create new netbeans projects for core and whatever contribs you use, all parallel to Projects/LuceneTrunk. E.g., Projects/Lucene (the core), Projects/Highlighter, Projects/Snowball, etc.. For each project (e.g., Lucene), do: 1. File - New Project - General - Java Project with Existing Ant Script 2. Set the project location: Projects/LuceneTrunk 3. Set the build script (defaults correctly): ../LuceneTrunk/build.xml 4. Set the project name: Lucene 5. Set the project location: Projects/Lucene 6. Update the ant targets (build == jar, not compile; rest are correct; add custom targets for jar-demo, javacc, javadocs and docs) 7. Set the source package folders: ../LuceneTrunk/src/java 8. Set the test package folders: ../LuceneTrunk/src/test and ../LuceneTrunk/src/demo 9. Finish (no classpath settings) 10. Build the source (Lucene project context menu - Build) 11. Set the class path for src/demo (Lucene context menu - Properties - Java Sources Classpath - select src/demo - Add Jar/Folder LuceneTrunk/build/lucene-core-version-dev.jar 12. Build the demos (Lucene context menu - jar-demo) 13. Set the classpath for src/test (as above, add both the core jar and the demo jar) 14. Now run the tests (Lucene context menu - Test Project) All works great. From here on, all netbeans features are available (debugging, refactoring, code database, completion, ...) You can also of course run ant from the command line, should you ever want to. Good luck, Chuck peter decrem wrote on 07/10/2006 07:05 PM: I am trying to contribute to the dot lucene port, but I am having no luck in getting the tests to compile and debug for the java version. I tried eclipse and failed and now I am stuck in Netbean. More specifically I am using Netbean 5.5 (same problems with 5.0). My understanding is that it comes with junit standard (3.8). I did create a build.properties file for javacc. It compiles but I get the following error when I run the tests: compile-core: compile-demo: common.compile-test: compile-test: test: C:\lucene-1.9.1\common-build.xml:169: C:\lucene-1.9.1\lib not found. BUILD FAILED (total time: 0 seconds) The relevant code in common-build.xml is: target name=test depends=compile-test description=Runs unit tests fail unless=junit.present ## JUnit not found. Please make sure junit.jar is in ANT_HOME/lib, or made available to Ant using other mechanisms like -lib or CLASSPATH. ## /fail mkdir dir=${junit.output.dir}/ junit printsummary=off haltonfailure=no line 169 XX- errorProperty=tests.failed failureProperty=tests.failed classpath refid=junit.classpath/ !-- TODO: create propertyset for test properties, so each project can have its own set -- sysproperty key=dataDir file=src/test/ sysproperty key=tempDir file=${build.dir}/test/ Any suggestions? Or any pointers to getting the tests to work in netbeans are appreciated.
- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] -- *Chuck Williams* Manawiz Principal V: (808)885-8688 C: (415)846-9018 [EMAIL PROTECTED] mailto:[EMAIL PROTECTED] Skype: manawiz AIM: hawimanawiz Yahoo: jcwxx - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Global field semantics
David Balmain wrote on 07/10/2006 01:04 AM: The only problem I could find with this solution is that fields are no longer in alphabetical order in the term dictionary but I couldn't think of a use-case where this is necessary although I'm sure there probably is one. So presumably fields are still contiguous, you keep a pointer to where each field starts, and terms within the field remain in alphabetical order? Chuck - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Global field semantics
Chris Hostetter wrote on 07/10/2006 02:06 AM: As near as i can tell, the large issue can be summarized with the following sentiment: Performance gains could be realized if Field properties were made fixed and homogeneous for all Documents in an index. This is certainly a large issue, as David says he has achieved a 5x performance gain. My interest in global field semantics originally sprang from functionality considerations, not performance considerations. I've got many features that require reasoning about field semantics. I previously mentioned a very simple one: validating fields in the query parser. More interesting examples are: 1. Multiple inheritance on the fields of documents that record the sources of each inherited value to support efficient incremental maintenance 2. Record-valued fields that store facets with values (e.g., time and user information for who set that value). These cannot easily be broken into multiple fields because the fields in question are multi-valued. 3. Join fields that reference id's of objects stored in separate indices (supporting queries that reference the fields in the joined index) Managing these kinds of rich semantic features in query parsing and indexing is greatly facilitated by a global field model. I've built this into my app, and then started thinking about benefits in Lucene generally from such a model. 1) all Fields and their properties must be predeclared before any document is ever added to the index, and any Field not declared is illegal. 2) a Field springs into existence the first time a Document is added with a value for it -- but after that all newly added Documents with a value for that field must conform to the Field properties initially used. (have I missed any general approaches?) Yes. Here is (an elaboration of) the global model with exceptions idea we reached: 3) There is a global field model in Lucene that contains the list of all known fields and their default semantics. The class that contains this model supports a number of implicit and explicit methods to construct and query the model. The model can be evolved. The model is used many places in Lucene, in some cases according to application-settable properties. E.g.: a) Creating a Field uses the properties of the model so they need not be specified at each construction. A global model property determines whether or not field properties may be overridden, and whether or not fields may be created that are not in the model (in which case, they are automatically added to the model). b) The query parser has hooks that affect Query generation based on the model properties of the field (not just for certain special query types like Term's and RangeQuery's). The application can easily provide methods to implement these hooks. This is essential for features like 2 and 3 above (and beneficial for 1). How would something like this work? docA.add(new Field("f", "bar", Store.YES, Index.UN_TOKENIZED)); docA.add(new Field("f", "foo", Store.NO, Index.TOKENIZED)); docB.add(new Field("f", "x y", Store.YES, Index.TOKENIZED)); docB.add(new Field("f", "z", Store.NO, Index.UN_TOKENIZED)); The application could determine whether or not this kind of operation was supported according to the global enforcement properties of the model. If this is needed, the ability to have exceptions at the Field level would permit it. Hoss, do you have a use case requiring Store and Index variance like this? The impact of this flexibility on David's 5x is another question...
Chuck - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
Yonik Seeley wrote on 07/10/2006 09:27 AM: I'll rephrase my original question: When implementing NewIndexModifier, what type of efficiencies do we get by using the new protected methods of IndexWriter vs using the public APIs of IndexReader and IndexWriter? I won't comment on Ning's implementation, but will comment wrt this issue for related work I've done with bulk update. I needed at least package-level access to several of the private capabilities in the index package (e.g., from SegmentMerger: resetSkip(), bufferSkip(), writeSkip(); from IndexWriter: readDeletableFiles(), writeDeletableFiles(); etc.). I think the index package and its api's have not been designed from the standpoint of update (batched delete/add or bulk), and are not nearly as friendly to application-level specialization/customization as other parts of Lucene. As part of the new index representation being considered now, I hope that these issues are addressed, and would be happy to participate in addressing them (especially if gcj releases 1.5 support and 1.5 code becomes acceptable). Chuck - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-509) Performance optimization when retrieving a single field from a document
[ http://issues.apache.org/jira/browse/LUCENE-509?page=comments#action_12419926 ] Chuck Williams commented on LUCENE-509: --- LUCENE-545 does resolve this in a more general way, although the code to get precisely one field value efficiently is slightly clunky, requiring something like this (for a single-valued field): final String seekfield = retrievefield.intern(); String value = reader.document(doc, new FieldSelector() { public FieldSelectorResult accept(String field) { if (field == seekfield) return FieldSelectorResult.LOAD_AND_BREAK; else return FieldSelectorResult.NO_LOAD; } }).get(seekfield); Even with this, a Document, a Field and a FieldSelector are created unnecessarily. There are important cases where fast single-field-access is important. E.g., I have cases where it is necessary to obtain the id field for all results of a query, leading to (an obviously refactored version of) the above code in a HitCollector. I think some special optimization for the single-field access case makes sense if benchmarks show it is material, but that it should be integrated with the mechanism of LUCENE-545. $0.02, Chuck Performance optimization when retrieving a single field from a document --- Key: LUCENE-509 URL: http://issues.apache.org/jira/browse/LUCENE-509 Project: Lucene - Java Type: Improvement Components: Index Versions: 1.9, 2.0.0 Reporter: Steven Tamm Assignee: Otis Gospodnetic Attachments: DocField.patch, DocField_2.patch, DocField_3.patch, DocField_4.patch, DocField_4b.patch If you just want to retrieve a single field from a Document, the only way to do it is to retrieve all the fields from the Document and then search it. This patch is an optimization that allows you to retrieve a specific field from a document without instantiating a lot of field and string objects. This reduces our memory consumption on a per query basis by around 20% when a lot of documents are returned. I've added a lot of comments saying you should only call it if you only ever need one field. There's also a unit test. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
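For the HitCollector use case mentioned in the comment above, a hedged sketch; ids, reader, and an idSelector built like the snippet above (but for an "id" field) are assumed to be in scope:

    // Collect the id field of every hit, loading only that one stored field per document.
    searcher.search(query, new HitCollector() {
        public void collect(int doc, float score) {
            try {
                ids.add(reader.document(doc, idSelector).get("id"));
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
    });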
Re: Global field semantics
Marvin Humphrey wrote on 07/08/2006 11:13 PM: On Jul 8, 2006, at 9:46 AM, Chuck Williams wrote: Many things would be cleaner in Lucene if fields had a global semantics, i.e., if properties like text vs. binary, Index, Store, TermVector, the appropriate Analyzer, the assignment of Directory in ParallelReader (or ParallelWriter), etc. were a function of just the field name and the index. In June, Dave Balmain and I discussed the issue extensively on the Ferret list. It might have been nice to use the Lucy list, since a lot of the discussion was about Lucy, but the Lucy lists didn't exist at the time. http://rubyforge.org/pipermail/ferret-talk/2006-June/000536.html I think there are a number of problems with that proposal and hope it was not adopted. As my earlier example showed, there is at least one valid use case where storing a term vector is not an invariant property of a field; specifically, when using term vectors to optimize excerpt generation, it is best to store them only for fields that have long values. This is even a counter-example to Karl's proposal, since a single Document may have multiple fields of the same name, some with long values and others with short values; multiple fields of the same name may legitimately have different TermVector settings even on a single Document. As another counter-example from my own app which I'd forgotten yesterday, an important case where the Analyzer will vary across documents is for i18n, where different languages require different analyzers. Refuting again my own argument about this not being consistent with query parsing, the language of the query is a distinct property from the languages of various documents in the collection. In my app, I let the user specify the language of the query, while the language of each Document is determined automatically. So, analyzers vary for both queries and documents, but independently. I haven't thought of cases where Index or Store would legitimately vary across Fields or Documents, but am less convinced there aren't important use cases for these as well. Similarly, although it is important to allow term vectors to be on or off at the field level, I don't see any obvious need to vary the type of term vector (positions, offsets or both). There are significant benefits to global semantics, as evidenced by the fact that several of us independently came to desire this. However, deciding what can be global and what cannot is more subtle. Perhaps the best thing at the Lucene level is to have a notion of default semantics for a field name. Whenever a Field of that name is constructed, those semantics would be used unless the constructor overrides them. This would allow additional constructors on Field with simpler signatures for the common case of invariant Field properties. It would also allow applications to access the class that holds the default field information for an index. The application will know which properties it can rely on as invariant and whether or not the set of fields is closed. This approach would preserve upward compatibility and provide, I believe, most of the benefits we all seek. Thoughts? Chuck - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
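None of the following exists in Lucene; purely as a thought experiment, the default-semantics-per-field-name idea sketched above might look something like this (all names are hypothetical):

    // Hypothetical registry of default per-field-name semantics for an index.
    class FieldDefaults {
        private static final Map<String, FieldDefaults> BY_NAME = new HashMap<String, FieldDefaults>();
        private final Field.Store store;
        private final Field.Index index;
        private final Field.TermVector vector;

        FieldDefaults(Field.Store store, Field.Index index, Field.TermVector vector) {
            this.store = store; this.index = index; this.vector = vector;
        }

        static void declare(String name, FieldDefaults d) { BY_NAME.put(name, d); }

        // The "simpler constructor" proposed above, as a factory method; the application
        // decides whether names missing from the registry are allowed or an error.
        static Field newField(String name, String value) {
            FieldDefaults d = BY_NAME.get(name);
            return new Field(name, value, d.store, d.index, d.vector);
        }
    }

    // Usage: declare once, then construct fields without repeating the properties.
    // FieldDefaults.declare("title", new FieldDefaults(Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.NO));
    // doc.add(FieldDefaults.newField("title", "Global field semantics"));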
Re: Global field semantics
David Balmain wrote on 07/09/2006 06:44 PM: On 7/10/06, Chuck Williams [EMAIL PROTECTED] wrote: Marvin Humphrey wrote on 07/08/2006 11:13 PM: On Jul 8, 2006, at 9:46 AM, Chuck Williams wrote: Many things would be cleaner in Lucene if fields had a global semantics, i.e., if properties like text vs. binary, Index, Store, TermVector, the appropriate Analyzer, the assignment of Directory in ParallelReader (or ParallelWriter), etc. were a function of just the field name and the index. In June, Dave Balmain and I discussed the issue extensively on the Ferret list. It might have been nice to use the Lucy list, since a lot of the discussion was about Lucy, but the Lucy lists didn't exist at the time. http://rubyforge.org/pipermail/ferret-talk/2006-June/000536.html I think there are a number of problems with that proposal and hope it was not adopted. Hi Chuck, Actually, it was adopted and I'm quite happy with the solution. I'd be very interested to hear what the number of problems are, besides the example you've already given. Even if you never use Ferret, it can only help me improve my software. Hi David, Thanks for your reply. I'm not aware of other problems beyond the ones I've already cited. After thinking of these, my confidence that there were not others waned. I'll start by covering your term-vector example. By adding fixed index-wide field properties to Ferret I was able to obtain a huge speed improvement during indexing. This is very interesting. Can you say how much? With the CPU time I gain in Ferret I could easily re-analyze large fields and build term vectors for them separately. It's a little more work for less common use cases like yours but in the end, everyone benefits in terms of performance. Does Ferret work this way, or would that be up to the application? As my earlier example showed, there is at least one valid use case where storing a term vector is not an invariant property of a field; specifically, when using term vectors to optimize excerpt generation, it is best to store them only for fields that have long values. This is even a counter-example to Karl's proposal, since a single Document may have multiple fields of the same name, some with long values and others with short values; multiple fields of the same name may legitimately have different TermVector settings even on a single Document. I think you'll find if you look at the DocumentWriter#writePostings method that it's one in, all in in terms of storing term vectors for a field. That is, if you have 5 content fields and only one of those is set to store term vectors, then all of the fields will store term vectors. Right you are, and clearly necessarily so since the values of the multiple fields are implicitly concatenated (with positionIncrementGap). So, Lucene already limits my term vector optimization to the Document level. As it happens, I only use it for large body fields, of which each of my Documents has at most one. I haven't thought of cases where Index or Store would legitimately vary across Fields or Documents, but am less convinced there aren't important use cases for these as well. Similarly, although it is important to allow term vectors to be on or off at the field level, I don't see any obvious need to vary the type of term vector (positions, offsets or both). I think Store could definitely legitimately vary across Fields or Documents for the same reason your term vectors do. Perhaps you are indexing pages from the web and you want to cache only the smaller pages.
That's an interesting example, but not as compelling an objection to me (and seemingly not to you either!). The app could always store an empty string without much consequence in this scenario. There are significant benefits to global semantics, as evidenced by the fact that several of us independently came to desire this. However, deciding what can be global and what cannot is more subtle. I agree. I can't see global field semantics making it into Lucene in the short term. It's a rather large change, particularly if you want to make full use of the performance benefits it affords. Could you summarize where these derive from? Perhaps the best thing at the Lucene level is to have a notion of default semantics for a field name. Whenever a Field of that name is constructed, those semantics would be used unless the constructor overrides them. This would allow additional constructors on Field with simpler signatures for the common case of invariant Field properties. It would also allow applications to access the class that holds the default field information for an index. The application will know which properties it can rely on as invariant and whether or not the set of fields is closed. This approach would preserve upward compatibility and provide, I believe, most of the benefits we all seek. Thoughts? If this is all you
Global field semantics
Many things would be cleaner in Lucene if fields had a global semantics, i.e., if properties like text vs. binary, Index, Store, TermVector, the appropriate Analyzer, the assignment of Directory in ParallelReader (or ParallelWriter), etc. were a function of just the field name and the index. This approach would naturally admit a class, say IndexFieldSet, that would hold global field semantics for an index. Lucene today allows many field properties to vary at the Field level. E.g., the same field name might be tokenized in one Field on a Document while it is untokenized in another Field on the same or different Document. Does anybody know how often this flexibility is used? Are there interesting use cases for which it is important? It seems to me this functionality is already problematic and not fully supported; e.g., indexing can manage tokenization-variant fields, but query parsing cannot. Various extensions to Lucene exacerbate this kind of problem. Perhaps more controversially, the notion of global field semantics would be even stronger if the set of fields is closed. This would allow, for example, QueryParser to validate field names. This has a number of benefits, including for example avoiding false-negative no results due to misspelling a field name. Has this been considered before? Are there good reasons this path has not been followed? Thanks for any info, Chuck - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Java 1.5 (was Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided))
Doug Cutting wrote on 07/08/2006 09:41 AM: Chuck Williams wrote: I only work in 1.5 and use its features extensively. I don't think about 1.4 at all, and so have no idea how heavily dependent the code in question is on 1.5. Unfortunately, I won't be able to contribute anything substantial to Lucene so long as it has a 1.4 requirement. The 1.5 decision requires a consensus. You're making ultimatums, which does not help to build consensus. By stating an inflexible position you've become a fact that informs the process. My statement was not intended as an ultimatum at all. Rather, it is simply a fact. I prefer to contribute to Lucene, but my workload simply does not allow time to be spent on backporting. I think we should try to minimize the number of inconvenienced people. Both developers and users are people. Some developers are happy to continue in 1.4, adding new features that users who are confined to 1.4 JVMs will be able to use. Other developers will only contribute 1.5 code, perhaps (unless we find a technical workaround) excluding users confined to 1.4 JVMs. But it is difficult to compare the inconvenience of a developer who refuses to code back-compatibly to a user who is deprived new features. Doug, respectfully, this issue is inflammatory in its nature. I've found a couple of your comments to be inflammatory, although I suspect you did not intend them that way. Specifically the term refuses above and your prior comment about considering use of your veto power if the committers were to vote to move to 1.5. I'm not refusing to do anything. I am overwhelmed in a crunch for the next several months and simply informing the community that I have code that others may find valuable that might be contributed, but that it requires 1.5 and that I cannot backport it. I cannot unilaterally decide to contribute the code, needing the agreement of the company I'm working for. They are only interested in the contribution if there is interest in having it in the core. These are simply facts. I suspect I'm not the only person in this kind of situation. Since GCJ is effectively available on all platforms, we could say that we will start accepting 1.5 features when a GCJ release supports those features. Does that seem reasonable? Seems like a reasonable compromise to me. If I had a vote on this it would be +1. Chuck - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Global field semantics
karl wettin wrote on 07/08/2006 10:27 AM: On Sat, 2006-07-08 at 09:46 -0700, Chuck Williams wrote: Many things would be cleaner in Lucene if fields had a global semantics, Has this been considered before? Are there good reasons this path has not been followed? I've been posting some advocacy about the current Field. Basically I would like to see a more normalized field setting per document (instead of normalizing it in the writer), and I've been talking about something like this: [Document]#--- {1..*} -[Value]--[Field +name +store +index +vector] A | {0..*} | [Index] And what I'm after would look like this: [Document]#--- {1..*} -[Value] A | {*..1} | [Field +store +index +vector +analyzer +directory] A | {1..1} | [FieldName] A | {0..*} | [Index] The key points are to have Index be a first-class object and to have field names uniquely specify field properties. Karl, do you have specific reasons or use cases to normalize fields at Document rather than at Index? Chuck - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Global field semantics
karl wettin wrote on 07/08/2006 12:27 PM: On Sat, 2006-07-08 at 11:08 -0700, Chuck Williams wrote: Karl, do you have specific reasons or use cases to normalize fields at Document rather than at Index? Nothing more than that the way the API looks it implies features that do not exist. Boost, store, index and vectors. I've learned, but I'm certain lots of newbies make the same assumptions as I did. I forgot one of my own use cases! My app uses term vectors as an optimization for determining excerpts (aka summaries). Term vectors increase the index size. For large documents, the performance benefits of using term vectors to find excerpts are large, but for small documents they are non-existent or negative. So, to optimize performance and minimize index size, I store term vectors on the relevant fields only when their values are sufficiently large. This is a concrete example of using the same field name with different Field.TermVector values on different Documents. Are there any similar examples for Field.Index or Field.Store? Chuck - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
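Concretely, the sort of thing described above; the length threshold is application-chosen, nothing here is Lucene policy (classes from org.apache.lucene.document):

    // Store term vectors only when the value is long enough for vector-based
    // excerpt generation to pay for the extra index size.
    Field.TermVector tv = body.length() > LONG_VALUE_THRESHOLD
            ? Field.TermVector.WITH_POSITIONS_OFFSETS
            : Field.TermVector.NO;
    doc.add(new Field("body", body, Field.Store.NO, Field.Index.TOKENIZED, tv));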
Re: Java 1.5 (was ommented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided))
DM Smith wrote on 07/07/2006 07:07 PM: Otis, First let me say, I don't want to rehash the arguments for or against Java 1.5. This is an emotional issue for people on both sides. However, I think you have identified that the core people need to make a decision and the rest of us need to go with it. It would be most helpful to have clarity on this issue. On Jul 7, 2006, at 1:17 PM, Otis Gospodnetic wrote: Hi Chuck, I think bulk update would be good (although I'm not sure how it would be different from batching deletes and adds, but I'm sure there is a difference, or else you wouldn't have done it). Bulk update works by rewriting all segments that contain a document to be modified in a single linear pass. This is orders of magnitude faster than delete/add if the set of documents to be updated is large, especially if only a few small fields are mutable on Documents that have many possibly large immutable fields. E.g., on a somewhat slow development machine I updated several fields on 1,000,000 large documents in 43 seconds. There is an existing patch in jira that takes this same approach (LUCENE-382). However the limitations in that patch are substantial: only optimized indexes, stored fields are not updated, updates are independent of the existing field value, etc. These limitations make that implementation not suitable for many use cases. My implementation eliminates all of those limitations, providing a fast flexible solution for applying an arbitrary value transformation to selected documents and fields in the index (doc.field.new_value = f(doc, field.old_value, doc.other_field_values) for arbitrary f). It also works with ParallelReader (and the ParallelWriter I've already contributed). This allows the mutable fields to be segregated into a separate subindex. Only that subindex need be updated. This alone is an enormous advantage over a large number of delete/add's where the same optimization is not possible due to the doc-id synchronization requirements of ParallelReader. There is a substantial amount of code required to do this, and it is completely dependent on the index representation. To simplify merge issues with ongoing Lucene changes, I had to copy and edit certain private methods out of the existing index code (and make extensive use of the package-only api's). Beyond normal benefits of open sourcing code, my interest in contributing this is to see the index code refactored to take bulk update into account. This is increased by the current focus on a new flexible index representation. I would like to see bulk update as one of the operations supported in the new representation. So I think you should contribute your code. This will give us a real example of having something possibly valuable, and written with 1.5 features, so we can finalize 1.4 vs. 1.5 discussion, probably with a vote on lucene-dev. I doubt any single contribution will change anyone's mind. I would like to have clarity on the 1.5 decision before deciding whether or not to contribute this and other things. My ParallelWriter contribution, which also requires 1.5, is already sitting in jira. I only work in 1.5 and use its features extensively. I don't think about 1.4 at all, and so have no idea how heavily dependent the code in question is on 1.5. Unfortunately, I won't be able to contribute anything substantial to Lucene so long as it has a 1.4 requirement. Chuck - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
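To make the transformation signature concrete, here is a hypothetical shape for f in new_value = f(doc, old_value, other_field_values); it is only an illustration, not the actual (uncontributed) bulk updater:

    import org.apache.lucene.document.Document;

    // One callback per (document, field) pair visited during the single linear rewrite pass.
    public interface FieldTransformer {
        // doc exposes the other field values; oldValue is the field's current stored value
        // (null if absent); the returned string becomes the field's new value.
        String transform(Document doc, String field, String oldValue);
    }

    // Example: normalize a small mutable "status" field while leaving absent values alone.
    class LowerCaseStatus implements FieldTransformer {
        public String transform(Document doc, String field, String oldValue) {
            return oldValue == null ? null : oldValue.toLowerCase();
        }
    }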
Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
robert engels wrote on 07/06/2006 12:24 PM: I guess we just chose a much simpler way to do this... Even with your code changes, to see the modifications made using the IndexWriter, it must be closed, and a new IndexReader opened. So a far simpler way is to get the collection of updates first, then:

    using the opened indexreader,
      for each doc in collection
        delete document using key
      endfor
    open indexwriter
      for each doc in collection
        add document
      endfor
    open indexreader

I don't see how your way is any faster. You must always flush to disk and open the indexreader to see the changes.

With the patch you can have ongoing writes and deletes happening asynchronously with reads and searches. Reopening the IndexReader to refresh its view is an independent decision. The IndexWriter need never be closed. Without the patch, you have to close the IndexWriter to do any deletes. If the requirements of your app prohibit batching updates for very long, this could be a frequent occurrence. So, it seems to me the patch has benefit for apps that do frequent updates and need reasonably quick access to those changes. Bulk updates, however, require yet another approach. Sorry to change topics here, but I'm wondering if there was a final decision on the question of Java 1.5 in the core. If I submitted a bulk update capability that required Java 1.5, would it be eligible for inclusion in the core or not? Chuck - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
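For reference, the delete-then-add cycle sketched in pseudocode above looks roughly like this against the pre-patch API (the "uid" key field and the update list are illustrative):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.Directory;
    import java.io.IOException;
    import java.util.Iterator;
    import java.util.List;

    public class DeleteThenAddUpdater {
        // Pre-patch update cycle: delete by primary key with an IndexReader, then add the
        // replacement documents with an IndexWriter, then reopen a reader to see them.
        public static void update(Directory dir, List updates /* of Document */) throws IOException {
            IndexReader reader = IndexReader.open(dir);
            for (Iterator it = updates.iterator(); it.hasNext();) {
                Document doc = (Document) it.next();
                reader.deleteDocuments(new Term("uid", doc.get("uid")));
            }
            reader.close();   // releases the write lock so the writer can open

            IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), false);
            for (Iterator it = updates.iterator(); it.hasNext();) {
                writer.addDocument((Document) it.next());
            }
            writer.close();   // flush to disk; open a new IndexReader to see the changes
        }
    }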
Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
Robert, Either you or I are missing something basic. I'm not sure which. As I understand things, an IndexWriter and an IndexReader cannot both have the write lock at the same time (they use the same write lock file name). Only an IndexReader can delete and only an IndexWriter can add. So to update, you need to close the IndexWriter, have the IndexReader delete, and then reopen the IndexWriter. With the patch, you never need to close the IndexWriter, as I said before. This provides a benefit in cases where updates cannot be combined into large batches. In this case without the patch the IndexWriter must be closed and reopened frequently, whereas with the patch it does not. Have I got something wrong? Chuck robert engels wrote on 07/06/2006 03:08 PM: I think I finally see how this is supposed to optimize - basically because it remember the terms, and then does the batch deletions. We avoid all of this messiness by just making sure each document has a primary key and we always remove/update by primary key and we can keep the operations in an ordered list (actually set since the keys are unique, and that way multiple updates to the same document in a batch can be coalesced). I guess still don't see why the change is so involved though... I would just maintain an ordered list of operations (deletes an adds) on the buffered writer. When the buffered writer is closed: Create a RamDirectory. Perform all deletions in a batch on the main IndexReader. Perform ordered deletes and adds on the RamDirectory. Merge the RamDirectory with the main index. This could all be encapsulated in a BufferedIndexWriter class. On Jul 6, 2006, at 4:34 PM, robert engels wrote: I guess I don't see the difference... You need the write lock to use the indexWriter, and you also need the write lock to perform a deletion, so if you just get the write lock you can perform the deletion and the add, then close the writer. I have asked how this submission optimizes anything, and I still can't seem to get an answer? On Jul 6, 2006, at 4:27 PM, Otis Gospodnetic wrote: I think that patch is for a different scenario, the one where you can't wait to batch deletes and adds, and want/need to execute them more frequently and in order they really are happening, without grouping them. Otis - Original Message From: robert engels [EMAIL PROTECTED] To: java-dev@lucene.apache.org Sent: Thursday, July 6, 2006 3:24:13 PM Subject: Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided) I guess we just chose a much simpler way to do this... Even with you code changes, to see the modification made using the IndexWriter, it must be closed, and a new IndexReader opened. So a far simpler way is to get the collection of updates first, then using opened indexreader, for each doc in collection delete document using key endfor open indexwriter for each doc in collection add document endfor open indexreader I don't see how your way is any faster. You must always flush to disk and open the indexreader to see the changes. On Jul 6, 2006, at 2:07 PM, Ning Li wrote: Hi Otis and Robert, I added an overview of my changes in JIRA. Hope that helps. Anyway, my test did exercise the small batches, in that in our incremental updates we delete the documents with the unique term, and then add the new (which is what I assumed this was improving), and I saw o appreciable difference. Robert, could you describe a bit more how your test is set up? Or a short code snippet will help me explain. 
Without the patch, when inserts and deletes are interleaved in small batches, the performance can degrade dramatically because the ramDirectory is flushed to disk whenever an IndexWriter is closed, causing a lot of small segments to be created on disk, which eventually need to be merged. Is this how your test is set up? And, what are the maxBufferedDocs and the maxBufferedDeleteTerms in your test? You won't see a performance improvement if they are about the same as the small batch size. The patch works by internally buffering inserts and deletes into larger batches. Regards, Ning - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
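Assuming the patch's proposed IndexWriter additions (a deleteDocuments(Term) method plus a setMaxBufferedDeleteTerms(int) knob, per the description above; the exact names in the committed version may differ), interleaved updates would look roughly like:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.Directory;
    import java.io.IOException;

    public class InterleavedUpdates {
        public static IndexWriter openWriter(Directory dir) throws IOException {
            IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), false);
            writer.setMaxBufferedDocs(1000);          // flush added docs in batches
            writer.setMaxBufferedDeleteTerms(1000);   // per the patch: flush buffered deletes in batches
            return writer;
        }

        // Delete-then-add on one writer; the patch buffers both and preserves their order,
        // so small interleaved updates no longer force a flush of tiny segments.
        public static void updateOne(IndexWriter writer, String uid, Document newDoc) throws IOException {
            writer.deleteDocuments(new Term("uid", uid));
            writer.addDocument(newDoc);
        }
    }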
Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
The need to close the IndexWriter is no different with the patch for deletes than it already is for adds. This is a separate issue that can be managed asynchronously using the existing mechanism in the application. The patch ensures the proper order of operations, so the benefit remains. Applications can now freely add and delete without worrying about deletes forcing a close of the IndexWriter. I think we are all in agreement that delete really belongs in IndexWriter. I agree with Otis that IndexModifier should be deprecated for several reasons. I use an IndexManager that coordinates all of search, read, add, delete, update, etc. It manages the refreshes, the batches, bulk updates, etc. And does it all more efficiently than IndexModifier. Haven't heard an answer yet whether or not 1.5 code contributions would be eligible for the core. Chuck robert engels wrote on 07/06/2006 08:01 PM: I think you still need to close the IndexWriter at some point, in order to search the new documents. In effect all of the changes using the buffered IndexWriter are meaningless until the IndexWriter is closed and a new IndexReader opened. Given that, it doesn't make much difference when you do the buffering... My statement about getting the lock once was not entirely correct as you point out, it needs to be grabbed in two stages, but a far more simple design (as I proposed) could be used - obviously some changes for lock management would be needed. I DO think that the deletion code should be moved to IndexWriter - it makes more sense there. The current design IS a bit goofy... I don't see why you would delete using an IndexReader - why be able to see deletions in the current IndexReader but not be able to see additions? What is the benefit? I really like the idea of the BufferedWriter - it is similar to what is proposed but I think the implementation would be far simpler and more straightforward. It would be similar to IndexModifier without the warning that you should do all the deletions first, and then all the additions - the BufferedWriter would manage this for you. On Jul 6, 2006, at 9:16 PM, Chuck Williams wrote: Robert, Either you or I are missing something basic. I'm not sure which. As I understand things, an IndexWriter and an IndexReader cannot both have the write lock at the same time (they use the same write lock file name). Only an IndexReader can delete and only an IndexWriter can add. So to update, you need to close the IndexWriter, have the IndexReader delete, and then reopen the IndexWriter. With the patch, you never need to close the IndexWriter, as I said before. This provides a benefit in cases where updates cannot be combined into large batches. In this case without the patch the IndexWriter must be closed and reopened frequently, whereas with the patch it does not. Have I got something wrong? Chuck robert engels wrote on 07/06/2006 03:08 PM: I think I finally see how this is supposed to optimize - basically because it remember the terms, and then does the batch deletions. We avoid all of this messiness by just making sure each document has a primary key and we always remove/update by primary key and we can keep the operations in an ordered list (actually set since the keys are unique, and that way multiple updates to the same document in a batch can be coalesced). I guess still don't see why the change is so involved though... I would just maintain an ordered list of operations (deletes an adds) on the buffered writer. When the buffered writer is closed: Create a RamDirectory.
Perform all deletions in a batch on the main IndexReader. Perform ordered deletes and adds on the RamDirectory. Merge the RamDirectory with the main index. This could all be encapsulated in a BufferedIndexWriter class. On Jul 6, 2006, at 4:34 PM, robert engels wrote: I guess I don't see the difference... You need the write lock to use the indexWriter, and you also need the write lock to perform a deletion, so if you just get the write lock you can perform the deletion and the add, then close the writer. I have asked how this submission optimizes anything, and I still can't seem to get an answer? On Jul 6, 2006, at 4:27 PM, Otis Gospodnetic wrote: I think that patch is for a different scenario, the one where you can't wait to batch deletes and adds, and want/need to execute them more frequently and in order they really are happening, without grouping them. Otis - Original Message From: robert engels [EMAIL PROTECTED] To: java-dev@lucene.apache.org Sent: Thursday, July 6, 2006 3:24:13 PM Subject: Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided) I guess we just chose a much simpler way to do this... Even with you code changes, to see the modification made using the IndexWriter, it must be closed, and a new IndexReader opened. So a far
Re: Memory Leak IndexSearcher
I'd suggest forcing gc after each n iteration(s) of your loop to eliminate the garbage factor. Also, you can run a profiler to see which objects are leaking (e.g., the netbeans profiler is excellent). Those steps should identify any issues quickly. Chuck robert engels wrote on 07/03/2006 07:40 AM: Did you try what was suggested? (-Xmx16m) and did you get an OOM? If not, there is no memory leak. On Jul 3, 2006, at 12:33 PM, Bruno Vieira wrote: Thanks for the answer, but I have isolated the cycle inside a loop on a static void main (String args[]) Class to test this issue.In this case there were no classes referencing the IndexSercher and the problem still happened. 2006/7/3, robert engels [EMAIL PROTECTED]: You may not have a memory leak at all. It could just be garbage waiting to be collected. I am fairly certain there are no memory leaks in the current Lucene code base (outside of the ThreadLocal issue). A simple way to verify this would be to add -Xmx16m on the command line. If there were a memory leak then it will eventually fail with an OOM. If there is a memory leak, then it is probably because your code is holding on to IndexReader references in some static var or map. On Jul 3, 2006, at 9:43 AM, Bruno Vieira wrote: Hi everyone, I am working on a project with around 35000 documents (8 text fields with 256 chars at most for each field) on lucene. But unfortunately this index is updated at every moment and I need that these new items be in the results of my search as fast as possible. I have an IndexSearcher, then I do a search getting the last 10 results with ordering by a name field and the memory allocated is 13mb, I close the IndexSearcher because the lucene database was updated by and external application and I create a new IndexSearcher, do the same search again wanting to get the last 10 results with ordering by a name field and the memory allocated is 15mb. At every time I do this cycle the memory increase in 2mb, so in a moment I have a memory leak. If the database is not updated and i do not create a new IndexSearcher i can do searches forever without memory leak. Why when I close an IndexSearcher (indexSearcher.close(); indexSearcher = new IndexSearcher(/database/) ;)after some searches with ordering and open a new one the memory is not free ? Thanks to any suggestions. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
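A minimal version of the suggested experiment, run with -Xmx16m (the index path, field, and query are illustrative):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.TermQuery;

    public class SearcherLeakTest {
        // Repeatedly close and reopen an IndexSearcher, searching with a sort each time.
        // A real leak fails quickly with OutOfMemoryError under -Xmx16m; if the heap
        // stabilizes, the growth seen earlier was just garbage awaiting collection.
        public static void main(String[] args) throws Exception {
            String indexDir = "/path/to/index";
            for (int i = 0; i < 10000; i++) {
                IndexSearcher searcher = new IndexSearcher(indexDir);
                Hits hits = searcher.search(new TermQuery(new Term("name", "smith")), new Sort("name"));
                hits.length();   // touch the results
                searcher.close();
                if (i % 100 == 0) {
                    System.gc();   // force gc periodically, per the suggestion above
                    long used = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
                    System.out.println(i + ": used=" + used);
                }
            }
        }
    }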
Re: Combining Hits and HitCollector
IMHO, Hits is the worst class in Lucene. Its atrocities are numerous, including the hardwired 50 and the strange normalization of dividing all scores by the top score if the top score happens to be greater than 1.0 (which destroys any notion of score values having any absolute meaning, although many apps erroneously assume they do). It is quite easy to use a TopDocsCollector or a TopFieldDocCollector and do a better job than Hits does. For faceted search I use a SamplingHitCollector to gather the facet-determination sample. It takes as one of its constructor parameters, rankingCollector, an arbitrary HitCollector to gather the top-scoring or top-sorted results. Then it only takes one line of code to combine the two collectors: rankingCollector.collect(doc, score) within SamplingHitCollector.collect(). This all notwithstanding, a built-in class that combined Hits with a second HitCollector probably would be used by many people, although I would recommend the approach above as a better alternative. Chuck Nadav Har'El wrote on 06/27/2006 09:08 AM: Hi, Searcher.search(Query) returns a Hits object, useful for the display of top results. Searcher.search(Query, HitCollector) runs a HitCollector for doing some sort of processing over all results. Unfortunately, there is currently no method to do both at the same time. For some uses, for example faceted search (that was discussed on this list a few times in the past), you need to do both: go over all results (and, for example, count how many results belong to each value), and at the same time build a Hits object (for displaying the top search results). Changing Searcher and/or Hits to allow for doing both things at once should not be too hard, but before I go and do it (and submit the change as a patch), I was wondering if I'm not reinventing the wheel, and if perhaps someone has already done this, or there were already discussions on how or how not to do it. Thanks, Nadav. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
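The one-line combination described above, sketched as a standalone wrapper (SamplingHitCollector itself is Chuck's own class; the sampling policy here is illustrative):

    import org.apache.lucene.search.HitCollector;

    public class ForwardingSamplingCollector extends HitCollector {
        private final HitCollector rankingCollector;   // gathers top-scoring or top-sorted results
        private final int sampleInterval;
        private int seen = 0;
        private int sampled = 0;

        public ForwardingSamplingCollector(HitCollector rankingCollector, int sampleInterval) {
            this.rankingCollector = rankingCollector;
            this.sampleInterval = sampleInterval;
        }

        public void collect(int doc, float score) {
            rankingCollector.collect(doc, score);   // the one line that combines the two collectors
            if (seen++ % sampleInterval == 0) {
                sampled++;   // a real implementation would record doc for facet counting here
            }
        }

        public int getSampleSize() {
            return sampled;
        }
    }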
[jira] Commented: (LUCENE-609) Lazy field loading breaks backward compat
[ http://issues.apache.org/jira/browse/LUCENE-609?page=comments#action_12417188 ] Chuck Williams commented on LUCENE-609: --- I'm late to the discussion and have only read the patch file, but it seems invalid to me. Won't getField() get a class cast exception when it encounters a Fieldable that is not a Field? The semantics of getField() would have to be something like, only get this field if it is a Field rather than some other kind of Fieldable, which means it would have to do type testing on the members of fields. I think it is much better to remove this patch and leave Fieldable as is. Searchable was the same kind of thing. IndexReader is an abstract super class for the different types of readers. When I did ParallelWriter, I had the same problem and had to introduce Writable since IndexWriter is not an abstract class and ParallelWriter is a different implementation. I think it is best to introduce all the abstract classes now for fundamental types that have multiple implementations. Chuck Lazy field loading breaks backward compat - Key: LUCENE-609 URL: http://issues.apache.org/jira/browse/LUCENE-609 Project: Lucene - Java Type: Bug Components: Other Versions: 2.0.1 Reporter: Yonik Seeley Assignee: Yonik Seeley Fix For: 2.0.1 Attachments: fieldable_patch.diff Document.getField() and Document.getFields() have changed in a non backward compatible manner. Simple code like the following no longer compiles: Field x = mydoc.getField(x); -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Results (Re: Survey: Lucene and Java 1.4 vs. 1.5)
that there are great new features in 1.6 - that would improve the Lucene core if they were used - I think that that is when this issue gets revisited. This isn't the type of question that should be decided by a poll. This OG: The poll was about what you use, and not what version of Java Lucene should support. I hope this wasn't misinterpreted by those who took the poll. should be decided by thoughtfully looking at the consequences of each choice. For me - the negative consequences of choosing 1.5 - leaving behind a lot of users - are much worse than the negative consequences of staying at 1.4 - making a couple dozen highly skilled developers check an extra box in their Lucene development environments? OG: I don't think the checkbox will remove 1.5-style for loops or generics and other stuff if that is already in the code. If any developers have actually read this far (sorry - it got kind of long) - thanks again for all of your great work - Lucene is a great tool - and a great community. OG: Thanks Dan, and please don't take my email(s) wrong. I'm quite clear-headed on this issue, and am trying to be objective. I personally wouldn't get hurt if we stayed with 1.4; I'd just feel bad and guilty if we had to reject contributions that have 1.5 bits in them. OG: How about this. I noticed the "significant number of people left behind" statement in a few people's arguments. How small of a percentage of 1.4 users do you think we should look for before we can move to 1.5? What does the 1.5:1.4 ratio need to be? This is not a question for Dan only. I would really be interested in what others think about this. How small does the percentage of 1.4 users need to be before we can have 1.5 in Lucene? Thanks, Otis - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] -- *Chuck Williams* Manawiz Principal V: (808)885-8688 C: (415)846-9018 [EMAIL PROTECTED] mailto:[EMAIL PROTECTED] Skype: manawiz AIM: hawimanawiz Yahoo: jcwxx - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Results (Re: Survey: Lucene and Java 1.4 vs. 1.5)
Ray Tsang wrote on 06/19/2006 09:06 AM: On 6/17/06, Chuck Williams [EMAIL PROTECTED] wrote: Ray Tsang wrote on 06/17/2006 06:29 AM: I think the problem right now isn't whether we are going to have 1.5 code or not. We will eventually have to have 1.5 code anyways. But we need a sound plan that will make the transition easy. I believe the transition from 1.4 to 1.5 is not an over night thing. I disagree. 1.5 was specifically designed to make transition easy, including the inclusion of non-features that ensure smooth interoperability (e.g., raw types and no runtime presence whatsoever of generics -- quite different from how it was done in .Net 2.0 for example). But will 1.4 jvm be able to run the new Lucene w/ 1.5 core? If 1.5 features are fully embraced, no. Secondly can we specifically find places where some people _will_ contribute code immediately if it's 1.5 is accepted? I already have. That's what started this second round of debate. What is it? ParallelWriter (see LUCENE-600). I have quite a few more behind that. Whether or not various people will find them useful is tbd, but they are all working well for me and essential to meet my requirements, and some are for things often requested on the various lists (e.g., a general purpose fast bulk index updater that supports arbitrary transformations on the values of fields). Who else? How many? Do we have statistics? We have statistics of number of users between 1.4 vs. 1.5 (which btw didn't present a significant polarization), but how about actual numbers potential of contributions between the 2? There has been a proposal to poll java-dev for this. Wagers on the outcome? Like what I have suggested before, why not have contribution modules that act as a transition into 1.5 code? Much like what other framework has a tiger module. This module may have say, a 1.5 compatible layer on top of 1.4 core, or other components of lucene that was made to be extensible, e.g. 1.5 version of QueryParser, Directory, etc. I think this would make it unnecessarily complex. How is it unnecessary or complex? If it only means layering, extending classes, adding implementations, it should be relatively easy with the existing design. It's something we do everyday regardless what lucene's direction takes. Contributing to Lucene is a volunteer effort. The more difficult you make it, the fewer people will do it. That's what this is all about. Accept 1.5 contributions and I believe you will get more high quality contributions. Of course, this comes at a high cost for those who cannot transition to 1.5, since they would need to stick with Lucene 2.0.x. If I had a vote on this, honestly I'm not sure how I would vote. It's a tough call either way. Do you support a significant minority of users and contributors who are stuck on an old java platform, or do you strike forward with a more robust contributing community from the majority at the cost of cutting out the minority from the latest and greatest? My first comment on this topic was something like, why would somebody who is on an old java platform expect to have the latest and greatest lucene?. I think if I was stuck on 1.4, I wouldn't be happy about a 1.5 decision for lucene 2.1+, but I would understand it, accept it, and do whatever I could to speed my transition to 1.5. Chuck - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-561) ParallelReader fails on deletes and on seeks of previously unused fields
[ http://issues.apache.org/jira/browse/LUCENE-561?page=comments#action_12416790 ] Chuck Williams commented on LUCENE-561: --- Christian, That is a different bug than this one. This bug has been fixed. Chuck ParallelReader fails on deletes and on seeks of previously unused fields Key: LUCENE-561 URL: http://issues.apache.org/jira/browse/LUCENE-561 Project: Lucene - Java Type: Bug Components: Index Versions: 2.0.0 Environment: All Reporter: Chuck Williams Assignee: Yonik Seeley Fix For: 2.0.0 Attachments: ParallelReaderBugs.patch, ParallelReaderBugs.patch In using ParallelReader I've hit two bugs: 1. ParallelReader.doDelete() and doUndeleteAll() call doDelete() and doUndeleteAll() on the subreaders, but these methods do not set hasChanges. Thus the changes are lost when the readers are closed. The fix is to call deleteDocument() and undeleteAll() on the subreaders instead. 2. ParallelReader discovers the fields in each subindex by using IndexReader.getFieldNames() which only finds fields that have occurred on at least one document. In general a parallel index is designed with assignments of fields to sub-indexes and term seeks (including searches) may be done on any of those fields, even if no documents in a particular state of the index have yet had an assigned field. Seeks/searches on fields that have not yet been indexed generated an NPE in ParallelReader's various inner class seek() and next() methods because fieldToReader.get() returns null on the unseen field. The fix is to extend the add() methods to supply the correct list of fields for each subindex. Patch that corrects both of these issues attached. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-398) ParallelReader crashes when trying to merge into a new index
[ http://issues.apache.org/jira/browse/LUCENE-398?page=comments#action_12416837 ] Chuck Williams commented on LUCENE-398: --- Christian, I'm going to open a new issue on this in order to rename it, post a revised patch, and hopefully get the attention of a committer. Chuck ParallelReader crashes when trying to merge into a new index Key: LUCENE-398 URL: http://issues.apache.org/jira/browse/LUCENE-398 Project: Lucene - Java Type: Bug Components: Index Versions: unspecified Environment: Operating System: All Platform: All Reporter: Sebastian Kirsch Assignee: Lucene Developers Attachments: ParallelReader.diff, ParallelReaderTest1.java, parallelreader.diff, patch-next.diff ParallelReader causes a NullPointerException in org.apache.lucene.index.ParallelReader$ParallelTermPositions.seek(ParallelReader.java:318) when trying to merge into a new index. See test case and sample output: $ svn diff Index: src/test/org/apache/lucene/index/TestParallelReader.java === --- src/test/org/apache/lucene/index/TestParallelReader.java(revision 179785) +++ src/test/org/apache/lucene/index/TestParallelReader.java(working copy) @@ -57,6 +57,13 @@ } + public void testMerge() throws Exception { +Directory dir = new RAMDirectory(); +IndexWriter w = new IndexWriter(dir, new StandardAnalyzer(), true); +w.addIndexes(new IndexReader[] { ((IndexSearcher) parallel).getIndexReader() }); +w.close(); + } + private void queryTest(Query query) throws IOException { Hits parallelHits = parallel.search(query); Hits singleHits = single.search(query); $ ant -Dtestcase=TestParallelReader test Buildfile: build.xml [...] test: [mkdir] Created dir: /Users/skirsch/text/lectures/da/thirdparty/lucene-trunk/build/test [junit] Testsuite: org.apache.lucene.index.TestParallelReader [junit] Tests run: 2, Failures: 0, Errors: 1, Time elapsed: 1.993 sec [junit] Testcase: testMerge(org.apache.lucene.index.TestParallelReader): Caused an ERROR [junit] null [junit] java.lang.NullPointerException [junit] at org.apache.lucene.index.ParallelReader$ParallelTermPositions.seek(ParallelReader.java:318) [junit] at org.apache.lucene.index.ParallelReader$ParallelTermDocs.seek(ParallelReader.java:294) [junit] at org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:325) [junit] at org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:296) [junit] at org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:270) [junit] at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:234) [junit] at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:96) [junit] at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:596) [junit] at org.apache.lucene.index.TestParallelReader.testMerge(TestParallelReader.java:63) [junit] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) [junit] at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) [junit] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) [junit] Test org.apache.lucene.index.TestParallelReader FAILED BUILD FAILED /Users/skirsch/text/lectures/da/thirdparty/lucene-trunk/common-build.xml:188: Tests failed! Total time: 16 seconds $ -- This message is automatically generated by JIRA. 
- If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-607) ParallelTermEnum is BROKEN
[ http://issues.apache.org/jira/browse/LUCENE-607?page=all ] Chuck Williams updated LUCENE-607: -- Attachment: ParallelTermEnum.patch ParallelTermEnum is BROKEN -- Key: LUCENE-607 URL: http://issues.apache.org/jira/browse/LUCENE-607 Project: Lucene - Java Type: Bug Components: Index Versions: 2.0.0 Reporter: Chuck Williams Priority: Critical Attachments: ParallelTermEnum.patch ParallelTermEnum.next() fails to advance properly to new fields. This is a serious bug. Christian Kohlschuetter diagnosed this as the root problem underlying LUCENE-398 and posted a first patch there. I've addressed a couple issues in the patch (close skipped field TermEnum's, generate field iterator only once, integrated Christian's test case as a Lucene test) and packaged in all the revised patch here. All Lucene tests pass, and I've further tested in this in my app, which makes extensive use of ParallelReader. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-398) ParallelReader crashes when trying to merge into a new index
[ http://issues.apache.org/jira/browse/LUCENE-398?page=comments#action_12416838 ] Chuck Williams commented on LUCENE-398: --- Revised patch posted in LUCENE-607 ParallelReader crashes when trying to merge into a new index Key: LUCENE-398 URL: http://issues.apache.org/jira/browse/LUCENE-398 Project: Lucene - Java Type: Bug Components: Index Versions: unspecified Environment: Operating System: All Platform: All Reporter: Sebastian Kirsch Assignee: Lucene Developers Attachments: ParallelReader.diff, ParallelReaderTest1.java, parallelreader.diff, patch-next.diff ParallelReader causes a NullPointerException in org.apache.lucene.index.ParallelReader$ParallelTermPositions.seek(ParallelReader.java:318) when trying to merge into a new index. See test case and sample output: $ svn diff Index: src/test/org/apache/lucene/index/TestParallelReader.java === --- src/test/org/apache/lucene/index/TestParallelReader.java(revision 179785) +++ src/test/org/apache/lucene/index/TestParallelReader.java(working copy) @@ -57,6 +57,13 @@ } + public void testMerge() throws Exception { +Directory dir = new RAMDirectory(); +IndexWriter w = new IndexWriter(dir, new StandardAnalyzer(), true); +w.addIndexes(new IndexReader[] { ((IndexSearcher) parallel).getIndexReader() }); +w.close(); + } + private void queryTest(Query query) throws IOException { Hits parallelHits = parallel.search(query); Hits singleHits = single.search(query); $ ant -Dtestcase=TestParallelReader test Buildfile: build.xml [...] test: [mkdir] Created dir: /Users/skirsch/text/lectures/da/thirdparty/lucene-trunk/build/test [junit] Testsuite: org.apache.lucene.index.TestParallelReader [junit] Tests run: 2, Failures: 0, Errors: 1, Time elapsed: 1.993 sec [junit] Testcase: testMerge(org.apache.lucene.index.TestParallelReader): Caused an ERROR [junit] null [junit] java.lang.NullPointerException [junit] at org.apache.lucene.index.ParallelReader$ParallelTermPositions.seek(ParallelReader.java:318) [junit] at org.apache.lucene.index.ParallelReader$ParallelTermDocs.seek(ParallelReader.java:294) [junit] at org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:325) [junit] at org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:296) [junit] at org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:270) [junit] at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:234) [junit] at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:96) [junit] at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:596) [junit] at org.apache.lucene.index.TestParallelReader.testMerge(TestParallelReader.java:63) [junit] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) [junit] at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) [junit] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) [junit] Test org.apache.lucene.index.TestParallelReader FAILED BUILD FAILED /Users/skirsch/text/lectures/da/thirdparty/lucene-trunk/common-build.xml:188: Tests failed! Total time: 16 seconds $ -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Soccer-themed question: null fields?
JMA wrote on 06/17/2006 10:16 PM: 1) Is there a way to find a document that has null fields? For example, if I have two fields (FIRST_NAME, LAST_NAME) for World Cup players: FIRST_NAME: Brian LAST_NAME: McBride FIRST_NAME: Agustin LAST_NAME: Delgado FIRST_NAME: Zinha LAST_NAME: (null or blank) FIRST_NAME: Kaka LAST_NAME: (null or blank) ... and so on What's the way to find all players that use only their first name? By far the best way is to store a special token into null fields and then just match on this. One less-performant alternative if you have no control over the index is to enable prefix wildcard queries and then write a query like this: FIRST_NAME:* -LAST_NAME:* To enable prefix wildcard queries, you need to regenerate QueryParser.java from QueryParser.jj after replacing the wildcard production (search for OG, as Otis has nicely included the appropriate production as a comment). 2) Is there a way to count field terms? For example, if instead we have one field... NAME: Brian McBride NAME: Agustin Delgado NAME: Zinha NAME: Kaka Can I answer the same question by finding all documents where the number of terms in the NAME field is 1 and only 1? Is there a way to do that? You would need to write your own Query subclass, and I can't think of any way to achieve this that would not be very slow. Not recommended. Chuck - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
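A minimal sketch of the special-token approach (the sentinel value and field handling are made up for illustration):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.TermQuery;

    public class NullFieldSentinel {
        static final String NULL_TOKEN = "__null__";   // illustrative sentinel value

        // At index time, store the sentinel when the last name is missing or blank.
        public static void addLastName(Document doc, String lastName) {
            String value = (lastName == null || lastName.trim().length() == 0) ? NULL_TOKEN : lastName;
            doc.add(new Field("LAST_NAME", value, Field.Store.YES, Field.Index.UN_TOKENIZED));
        }

        // At search time, players using only a first name match the sentinel directly.
        public static TermQuery firstNameOnlyQuery() {
            return new TermQuery(new Term("LAST_NAME", NULL_TOKEN));
        }
    }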
Re: Results (Re: Survey: Lucene and Java 1.4 vs. 1.5)
Tatu Saloranta wrote on 06/17/2006 06:54 AM: And it's bit curious as to what the current mad rush regarding migration is -- beyond the convenience and syntactic sugar, only the concurrency package seems like a tempting immediate reason? The only people who keep bringing up these non-arguments are those on the con side. You should read the arguments on the pro side -- they are not these. I hope it can be a practical decision made with cool minds. Agreed. I think a key part of this is to listen to what the other side is saying. This all boils down to people and the environments they use. People using 1.4 want the latest and greatest Lucene and don't understand why it's important to use 1.5 anyway. People using 1.5 are writing 1.5 code everyday and want to be able to make contributions to Lucene without backporting and retesting. Also, they don't want to consciously write code that might be a Lucene contribution in 1.4 because a) the cognitive shift back to 1.4 is not easy once you are fully indoctrinated into 1.5 (primarily generics), and b) 1.4 code is not type-safe in the sense that 1.5 code is. So, do 1.4 people live with Lucene 2.0.x until they move to 1.5, or do 1.5 people get limited or cut out from making contributions. Neither option is appealing, especially to those negative affected. Chuck - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-602) [PATCH] Filtering tokens for position and term vector storage
[ http://issues.apache.org/jira/browse/LUCENE-602?page=all ] Chuck Williams updated LUCENE-602: -- Attachment: TokenSelectorSoloAll.patch TokenSelectorSoloAll.patch applies against today's svn head. It only requires Java 1.4. [PATCH] Filtering tokens for position and term vector storage - Key: LUCENE-602 URL: http://issues.apache.org/jira/browse/LUCENE-602 Project: Lucene - Java Type: New Feature Components: Index Versions: 2.1 Reporter: Chuck Williams Attachments: TokenSelectorSoloAll.patch This patch provides a new TokenSelector mechanism to select tokens of interest and creates two new IndexWriter configuration parameters: termVectorTokenSelector and positionsTokenSelector. termVectorTokenSelector, if non-null, selects which index tokens will be stored in term vectors. If positionsTokenSelector is non-null, then any tokens it rejects will have only their first position stored in each document (it is necessary to store one position to keep the doc freq correct so the token is not garbage collected in merges). This mechanism provides a simple solution to the problem of minimizing the index size overhead caused by storing extra tokens that facilitate queries, in those cases where the mere existence of the extra tokens is sufficient. For example, in my test data using reverse tokens to speed prefix wildcard matching, I obtained the following index overheads:
1. With no TokenSelectors: 60% larger with reverse tokens than without
2. With termVectorTokenSelector rejecting reverse tokens: 36% larger
3. With both positionsTokenSelector and termVectorTokenSelector rejecting reverse tokens: 25% larger
It is possible to obtain the same effect by using a separate field that has one occurrence of each reverse token and no term vectors, but this can be hard or impossible to do and a performance problem, as it requires either rereading the content or storing all the tokens for subsequent processing. The solution with TokenSelectors is very easy to use and fast. Otis, thanks for leaving a comment in QueryParser.jj with the correct production to enable prefix wildcards! With this, it is a straightforward matter to override the wildcard query factory method and use reverse tokens effectively. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
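To make the mechanism concrete, a selector rejecting reversed tokens might look roughly like this; the TokenSelector callback and the IndexWriter setters follow the description above, and their exact signatures in the patch may differ (the reversal marker is an assumption):

    import org.apache.lucene.analysis.Token;

    // Accept ordinary tokens, reject the reversed ones an analyzer might add to speed
    // prefix wildcard matching; '\u0001' as a reversal marker is purely illustrative.
    public class RejectReversedTokens /* implements TokenSelector, per the patch */ {
        public boolean accept(String field, Token token) {
            return !token.termText().startsWith("\u0001");
        }
    }

    // Hypothetical wiring through the two configuration parameters described above:
    //   writer.setTermVectorTokenSelector(new RejectReversedTokens());
    //   writer.setPositionsTokenSelector(new RejectReversedTokens());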
[jira] Updated: (LUCENE-602) [PATCH] Filtering tokens for position and term vector storage
[ http://issues.apache.org/jira/browse/LUCENE-602?page=all ] Chuck Williams updated LUCENE-602: -- Attachment: TokenSelectorAllWithParallelWriter.patch TokenSelectorAllWithParallelWriter.patch contains ParallelWriter as well (LUCENE-600) as it is also affected. [PATCH] Filtering tokens for position and term vector storage - Key: LUCENE-602 URL: http://issues.apache.org/jira/browse/LUCENE-602 Project: Lucene - Java Type: New Feature Components: Index Versions: 2.1 Reporter: Chuck Williams Attachments: TokenSelectorAllWithParallelWriter.patch, TokenSelectorSoloAll.patch This patch provides a new TokenSelector mechanism to select tokens of interest and creates two new IndexWriter configuration parameters: termVectorTokenSelector and positionsTokenSelector. termVectorTokenSelector, if non-null, selects which index tokens will be stored in term vectors. If positionsTokenSelector is non-null, then any tokens it rejects will have only their first position stored in each document (it is necessary to store one position to keep the doc freq properly to avoid the token being garbage collected in merges). This mechanism provides a simple solution to the problem of minimzing index size overhead cause by storing extra tokens that facilitate queries, in those cases where the mere existence of the extra tokens is sufficient. For example, in my test data using reverse tokens to speed prefix wildcard matching, I obtained the following index overheads: 1. With no TokenSelectors: 60% larger with reverse tokens than without 2. With termVectorTokenSelector rejecting reverse tokens: 36% larger 3. With both positionsTokenSelector and termVectorTokenSelector rejecting reverse tokens: 25% larger It is possible to obtain the same effect by using a separate field that has one occurrence of each reverse token and no term vectors, but this can be hard or impossible to do and a performance problem as it requires either rereading the content or storing all the tokens for subsequent processing. The solution with TokenSelectors is very easy to use and fast. Otis, thanks for leaving a comment in QueryParser.jj with the correct production to enable prefix wildcards! With this, it is a straightforward matter to override the wildcard query factory method and use reverse tokens effectively. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Java 1.5 was [jira] Updated: (LUCENE-600) ParallelWriter companion to ParallelReader
I think the last discussion ended with the main counter-argument being lack of support by gjc. Current top of GJC News: *June 6, 2006* RMS approved the plan to use the Eclipse compiler as the new gcj front end. Work is being done on the |gcj-eclipse| branch; it can already build libgcj. This project will allow us to ship a 1.5 compiler in the relatively near future. The old |gcjx| branch and project is now dead. In addition to performance, productivity and functionality benefits, my main argument for 1.5 is that it is used by the vast majority of lucene community members. Everything I write is in 1.5 and I don't have time to backport. I have a significant body of code from which to extract and contribute patches that others would likely find useful. How many others are in a similar position? On the side, not leaving valued community members behind is important. I think the pmc / committers just need to make a decision which will impact one group or the other. Chuck Grant Ingersoll wrote on 06/13/2006 03:35 AM: Well, we have our first Java 1.5 patch... Now that we have had a week or two to digest the comments, do we want to reopen the discussion? Chuck Williams (JIRA) wrote: [ http://issues.apache.org/jira/browse/LUCENE-600?page=all ] Chuck Williams updated LUCENE-600: -- Attachment: ParallelWriter.patch Patch to create and integrate ParallelWriter, Writable and TestParallelWriter -- also modifies build to use java 1.5. ParallelWriter companion to ParallelReader -- Key: LUCENE-600 URL: http://issues.apache.org/jira/browse/LUCENE-600 Project: Lucene - Java Type: Improvement Components: Index Versions: 2.1 Reporter: Chuck Williams Attachments: ParallelWriter.patch A new class ParallelWriter is provided that serves as a companion to ParallelReader. ParallelWriter meets all of the doc-id synchronization requirements of ParallelReader, subject to: 1. ParallelWriter.addDocument() is synchronized, which might have an adverse effect on performance. The writes to the sub-indexes are, however, done in parallel. 2. The application must ensure that the ParallelReader is never reopened inside ParallelWriter.addDocument(), else it might find the sub-indexes out of sync. 3. The application must deal with recovery from ParallelWriter.addDocument() exceptions. Recovery must restore the synchronization of doc-ids, e.g. by deleting any trailing document(s) in one sub-index that were not successfully added to all sub-indexes, and then optimizing all sub-indexes. A new interface, Writable, is provided to abstract IndexWriter and ParallelWriter. This is in the same spirit as the existing Searchable and Fieldable classes. This implementation uses java 1.5. The patch applies against today's svn head. All tests pass, including the new TestParallelWriter. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Fwd: How to combine results from several indices
You can try that approach, but I think you will find it more difficult. E.g., all of the primitive query classes are written specifically to use doc-ids. So, you either need to do you searches separately on each subindex and then write your own routine to join the results, or you would need to rewrite all the queries. I use two different indexing combining techniques: 1. ParallelReader/ParallelWriter for performance reasons in various circumstances; e.g., fast access to frequently used fields (in combination with lazy fields -- very useful for fast categorical analysis of large samples), fast bulk updates of mutable fields by copying a much smaller subindex, etc. 2. Subindex query rewriting for accessing different types of objects in separate indices. A query on the main index may contain a subquery that retrieves objects in a different index and rewrites itself into a disjunction of the uid's of those objects. This approach works well assuming you can arrange indexing of fields in the main index with subindex uid values, and the disjunction expansions are not too large. Maybe approach 2 is more what you need? It's pretty simple to do. E.g., take a look at MultiTermQuery for a non-primitive query that rewrites itself dependent on the index. You need a similar class that rewrites itself dependent on a different index. Chuck wu fox wrote on 06/13/2006 02:18 AM: thank you very much Chuck.But I still wondered is there any way that I can revise ParallelReader so that it do not need the same doc id .Can IndexReader comebine different doc according some mapping rules ?for example I can override Document method that combine docs from indices acoording to same uuid or override some other methods,I think it is much easier to do than a writer:) Thank you for your help again. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
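A rough sketch of the subindex query rewriting idea in approach 2 (class, field, and method names are illustrative, not Chuck's actual code); it assumes the uid disjunction stays well under BooleanQuery.getMaxClauseCount():

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.Searcher;
    import org.apache.lucene.search.TermQuery;
    import java.io.IOException;

    public class SubIndexJoinQuery extends Query {
        private final Query innerQuery;        // query evaluated against the other index
        private final Searcher otherSearcher;  // searcher over the other index
        private final String uidField;         // uid field indexed in both indexes

        public SubIndexJoinQuery(Query innerQuery, Searcher otherSearcher, String uidField) {
            this.innerQuery = innerQuery;
            this.otherSearcher = otherSearcher;
            this.uidField = uidField;
        }

        // At rewrite time, run the inner query on the other index and expand this query
        // into a disjunction of the matching objects' uid terms.
        public Query rewrite(IndexReader reader) throws IOException {
            BooleanQuery expanded = new BooleanQuery();
            Hits hits = otherSearcher.search(innerQuery);
            for (int i = 0; i < hits.length(); i++) {
                String uid = hits.doc(i).get(uidField);
                expanded.add(new TermQuery(new Term(uidField, uid)), BooleanClause.Occur.SHOULD);
            }
            expanded.setBoost(getBoost());
            return expanded;
        }

        public String toString(String field) {
            return "subIndexJoin(" + innerQuery.toString(field) + ")";
        }
    }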
Re: Fwd: How to combine results from several indices
Hi Wu, The simplest solution is to synchronize calls to a ParallelWriter.addDocument() method that calls IndexWriter.addDocument() for each sub-index. This will work assuming there are no exceptions and assuming you never refresh your IndexReader within ParallelWriter.addDocument(). If exceptions occur writing one of the sub-indexes, then you need to recover them. The best approach I've found is to delete the unequal final subdocuments and optimize all the subindexes to restore equal doc ids. This approach has the consequence of single-threading all index writing. I'm working on a solution to avoid this, but it may require deeper integration into the higher level IndexManager mechanism (which does reader reopening, journaling, recovery, and a lot of other things). If you can get by with single threading, I have a ParallelWriter class now that I could contribute. If not, I'm considering the more general solution now, but will only be able to contribute it if it can be kept separate from the much larger IndexManager mechanism (which is more specific to my app and thus not likely a fit for your app anyway). Chuck wu fox wrote on 06/12/2006 02:43 AM: Hi Chuck: I am still looking forward to a solution which ensure to to meet the constraints of ParallelReader so that I can use it for my seach programm. I have tried a lot of methods but none of them is good enough for me because of obvious bugs. Can you help me? thanks in advance - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
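The simplest form of that idea, sketched without the exception recovery described above:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;
    import java.io.IOException;

    // One logical document is split into one sub-document per sub-index, and all
    // sub-writers are kept in lockstep so doc-ids stay aligned for ParallelReader.
    public class SimpleParallelAdder {
        private final IndexWriter[] subWriters;

        public SimpleParallelAdder(IndexWriter[] subWriters) {
            this.subWriters = subWriters;
        }

        // subDocs[i] holds the fields destined for sub-index i; one entry per sub-writer.
        public synchronized void addDocument(Document[] subDocs) throws IOException {
            for (int i = 0; i < subWriters.length; i++) {
                subWriters[i].addDocument(subDocs[i]);
            }
        }
    }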