The 2GB segment size limit
Hi,

Recently an index I've been building passed the 2 GB mark, and after I optimize()ed it into one segment over 2 GB, it stopped working. Apparently, this is a known problem (on 32 bit JVMs), mentioned in the FAQ, http://wiki.apache.org/lucene-java/LuceneFAQ under the question "Is there a way to limit the size of an index". My first problem is that it looks to me like this FAQ entry is passing outdated advice. My second problem is that we document a bug instead of fixing it.

The first thing the FAQ does is to recommend IndexWriter.setMaxMergeDocs(). This solution has two serious problems: first, normally one doesn't know how many documents one can index before reaching 2 GB, and second, a call to optimize() appears to ignore this setting and merge everything again - no good!

The second solution the FAQ recommends (using MultiSearcher) is unwieldy and, in my opinion, should be unnecessary (since we have the concept of segments, why do we need separate indices in that case?). The third option, labeled the optimal solution, is to write a new FSDirectory implementation that represents files over 2 GB as several files, broken on the 2 GB mark. But has anyone ever implemented this?

Does anyone have any experience with the 2 GB problem? Is one of these recommendations *really* the recommended solution? What about the new LogByteSizeMergePolicy and its setMaxMergeMB argument - wouldn't it be better to use that? Does anybody know if optimize() also obeys this flag? If not, shouldn't it?

In short, I'd like to understand the best practices for solving the 2 GB problem, and improve the FAQ in this regard. Moreover, I wonder, instead of documenting around the problem, should we perhaps make the default behavior more correct? In other words, imagine that we set LogByteSizeMergePolicy.DEFAULT_MAX_MERGE_MB to 1024 (or 1023, to be on the safe side?). Then, segments larger than 1 GB will never be merged with anything else.
Some users (with multi-gigabyte indices on a 64 bit CPU) may not like this default, but they can change it - at least with this default Lucene's behavior will be correct on all CPUs and JVMs.

I have one last question that I wonder if anyone can answer before I start digging into the code. We use merges not just for merging segments, but also as an opportunity to clean up segments from deleted documents. If some segment is bigger than the maximum and is never merged again, does this also mean deleted documents will never ever get cleaned up from it? This can be a serious problem on huge dynamic indices (e.g., imagine a crawl of the Web or some large intranet).

Nowadays, 2 GB indices are less rare than they used to be, and 32 bit JVMs are still quite common, so I think this is a problem we should solve properly.

Thanks,
Nadav.

--
Nadav Har'El | Wednesday, Jun 25 2008, 22 Sivan 5768
[EMAIL PROTECTED] | Phone +972-523-790466, ICQ 13349191
http://nadav.harel.org.il | Committee: A group of people that keeps minutes and wastes hours.

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Fwd: changing index format
On Wednesday 25 June 2008 07:03:59, John Wang wrote:

Hi guys: Perhaps I should have posted this to this list in the first place. I am trying to work on a patch to expose minDoc and maxDoc for each term. These values can be retrieved while constructing the TermInfo. Knowing these two values can be very helpful in caching a DocIdSet for a given Term. This would help to determine what type of underlying implementation to use, e.g. BitSet, HashSet, or ArraySet, etc.

I suppose you know about https://issues.apache.org/jira/browse/LUCENE-1296 ? But how about using TermScorer? In the trunk it's a subclass of DocIdSetIterator (via Scorer) and the caching is already done by Lucene and the underlying OS file cache. TermScorer does some extra work for its scoring, but I don't think that would affect performance.

The problem I am having is stated below, I don't know how to add the minDoc and maxDoc values to the index while keeping backward compatibility.

I doubt they would help very much. The most important info for this is maxDoc from the index reader and the document frequency of the term, and these are easily determined. Btw, I've just started to add encoding intervals of consecutive doc ids to SortedVIntList. For very high document frequencies, that might actually be faster than TermScorer and more compact than the current index. Once I've got some working code I'll open an issue for it.

Regards,
Paul Elschot
Re: The 2GB segment size limit
Nadav Har'El wrote:

Recently an index I've been building passed the 2 GB mark, and after I optimize()ed it into one segment over 2 GB, it stopped working.

Nadav, which platform did you hit this on? I think I've created a 2 GB index on 32 bit WinXP just fine. How many platforms are really affected by this?

Apparently, this is a known problem (on 32 bit JVMs), and mentioned in the FAQ, http://wiki.apache.org/lucene-java/LuceneFAQ under the question "Is there a way to limit the size of an index". My first problem is that it looks to me like this FAQ entry is passing outdated advice. My second problem is that we document a bug, instead of fixing it. The first thing the FAQ does is to recommend IndexWriter.setMaxMergeDocs(). This solution has two serious problems: First, normally one doesn't know how many documents one can index before reaching 2 GB, and second, a call to optimize() appears to ignore this setting and merge everything again - no good!

And a third problem is: that limit applies to the input segments (to the merge), not the output segment. So the example given of setting maxMergeDocs to 7M is very likely too high, because if you merge 10 segments of 7M docs each, you'll easily get a resulting segment over 2 GB.

The second solution the FAQ recommends (using MultiSearcher) is unwieldy and in my opinion, should be unnecessary (since we have the concept of segments, why do we need separate indices in that case?). The third option, labeled the optimal solution, is to write a new FSDirectory implementation that represents files over 2 GB as several files, broken on the 2 GB mark. But has anyone ever implemented this?

I agree these two workarounds sound quite challenging to do in practice...

Does anyone have any experience with the 2 GB problem? Is one of these recommendations *really* the recommended solution? What about the new LogByteSizeMergePolicy and its setMaxMergeMB argument - wouldn't it be better to use that? Does anybody know if optimize() also obeys this flag?
If not, shouldn't it?

optimize() doesn't obey it, and the same problem (input vs output) applies to maxMergeMB as well. To make optimize() obey these limits, one would have to write their own MergePolicy.

In short, I'd like to understand the best practices for solving the 2 GB problem, and improve the FAQ in this regard. Moreover, I wonder, instead of documenting around the problem, should we perhaps make the default behavior more correct? In other words, imagine that we set LogByteSizeMergePolicy.DEFAULT_MAX_MERGE_MB to 1024 (or 1023, to be on the safe side?). Then, segments larger than 1 GB will never be merged with anything else. Some users (with multi-gigabyte indices on a 64 bit CPU) may not like this default, but they can change it - at least with this default Lucene's behavior will be correct on all CPUs and JVMs.

I think we should understand how widespread this really is in our userbase. If it's a minority being affected by it, I think the current defaults are correct (and it's this minority that should change Lucene to not produce too large a segment).

I have one last question that I wonder if anyone can answer before I start digging into the code. We use merges not just for merging segments, but also as an opportunity to clean up segments from deleted documents. If some segment is bigger than the maximum and is never merged again, does this also mean deleted documents will never ever get cleaned up from it? This can be a serious problem on huge dynamic indices (e.g., imagine a crawl of the Web or some large intranet).

Right, the deletes will not be cleaned up. But you can use expungeDeletes()? Or, make a MergePolicy that favors merges that would clean up deletes.

Mike
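The input-vs-output distinction above is easy to check with back-of-the-envelope arithmetic. A minimal sketch in plain Java (the 2048 MB ceiling and the default mergeFactor of 10 come from the thread; the helper name is mine):

```java
public class MergeSizeMath {
    // A mergeFactor-way merge produces an output segment roughly the size
    // of the sum of its inputs, so a cap meant to bound the *output* must
    // be divided across the inputs.
    static long maxInputMb(long outputCeilingMb, int mergeFactor) {
        return outputCeilingMb / mergeFactor;
    }

    public static void main(String[] args) {
        // To keep merged segments under 2 GB with the default mergeFactor
        // of 10, each input segment must stay under roughly 204 MB.
        System.out.println(maxInputMb(2048, 10) + " MB per input segment");
        // The FAQ's maxMergeDocs=7M example fails the same way: ten 7M-doc
        // inputs yield one ~70M-doc output segment.
        System.out.println(10 * 7_000_000L + " docs in the merged segment");
    }
}
```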
Re: ReaderCommit
Jason Rutherglen wrote:

For Ocean I created a workaround where the IndexCommits from IndexDeletionPolicy are saved in a map in order to achieve deleting based on the IndexReader. It would be more straightforward to delete from the IndexCommit in IndexReader.

It seems like we are mixing up deleting a whole commit point vs deleting individual documents? Or does Ocean somehow decide to delete a whole commit point based on which documents have been deleted?

I realize people want to get away from IndexReader performing updates, however, for my use case, realtime search updating from IndexReader makes sense mainly for obtaining the doc ids of deletions. With IndexWriter managing the merges it would seem difficult to expose doc numbers, but perhaps there is a way.

IndexWriter can now delete by query, but it sounds like that's not sufficient for Ocean? Under the hood, IndexWriter has the infrastructure to hold pending deleted docIDs and update these docIDs when a merge is committed. Ie, previously we forced a flush of all pending deletes on every flush/merge, but now we buffer the docIDs across flushes/merges. This means IndexWriter *could* delete by docID, however, none of this is exposed publicly. Also, this doesn't solve the problem of how you would get the docIDs to delete in the first place (ie one must still use a separate IndexReader for that). I'm not sure this helps you (Ocean) since you presumably need to flush deletes very quickly to have realtime search...

Mike
Re: SegmentReader with custom setting of deletedDocs, single reusable FieldsReader
Jason Rutherglen wrote:

One of the bottlenecks I have noticed testing Ocean realtime search is the delete process, which involves writing several files for each possibly single delete of a document in SegmentReader. The best way to handle the deletes is to simply keep them in memory without flushing them to disk, saving on writing out an entire BitVector per delete. The deletes are saved in the transaction log, which is replayed on recovery. I am not sure of the best way to approach this, perhaps it is creating a custom class that inherits from SegmentReader. It could reuse the existing reopen and also provide a way to set the deletedDocs BitVector. Also it would be able to reuse FieldsReader by providing locking around FieldsReader for all SegmentReaders of the segment to use. Otherwise in the current architecture each new SegmentReader opens a new FieldsReader, which is non-optimal. The deletes would be saved to disk, but instead of per delete, periodically, like a checkpoint.

Or ... maybe you could do the deletes through IndexWriter (somehow, if we can get docIDs properly) and then SegmentReaders could somehow tap into the buffered deleted docIDs that IndexWriter already maintains. IndexWriter is already doing this buffering, flush/commit anyway.

We've also discussed at one point creating an IndexReader impl that searches the RAM buffer that DocumentsWriter writes to when adding documents. I think it's easier than it sounds, on first glance, because DocumentsWriter is in fact writing the postings in nearly the same format as is used when the segment is flushed. So if we had this IndexReader impl, plus extended SegmentReader so it could tap into pending deletes buffered in IndexWriter, you could get realtime search without having to use Directory as an intermediary. Though, it is clearly quite a bit more work :)

Mike
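Jason's buffering idea, keeping deletes in memory and writing the BitVector out only at checkpoints rather than once per delete, can be sketched in plain Java (java.util.BitSet stands in for Lucene's BitVector; the checkpoint interval and flush counter are illustrative, not Ocean's actual code):

```java
import java.util.BitSet;

public class BufferedDeletes {
    private final BitSet deletedDocs = new BitSet(); // in-memory only
    private final int checkpointEvery;
    private int pending = 0;
    private int flushes = 0; // each would be one disk write of the whole BitVector

    public BufferedDeletes(int checkpointEvery) {
        this.checkpointEvery = checkpointEvery;
    }

    /** Mark a doc deleted in memory; write to disk only at checkpoints. */
    public void delete(int docId) {
        deletedDocs.set(docId);
        if (++pending >= checkpointEvery) {
            flushes++;       // one file write per checkpoint, not per delete
            pending = 0;
        }
    }

    public boolean isDeleted(int docId) { return deletedDocs.get(docId); }

    public int flushCount() { return flushes; }
}
```

Durability between checkpoints comes from the transaction log Jason mentions: on recovery the deletes since the last checkpoint are replayed.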
Re: changing index format
John Wang wrote:

The problem I am having is stated below, I don't know how to add the minDoc and maxDoc values to the index while keeping backward compatibility.

Unfortunately, the TermInfo file format just isn't extensible at the moment, so I think for now you'll have to break backward compatibility if you really want to store these new fields in the _X.tis/.tii files. EG here is another recent example of wanting to alter what's stored in TermInfo: https://issues.apache.org/jira/browse/LUCENE-1278

For flexible indexing we clearly need to fix this, so that any plugin in the indexing chain could stuff whatever it wants into the TermInfo, and also override how TermInfo is read/written. Even the things we now store in TermInfo should be optional. EG say you choose not to store locations (prx) for a given field. Then, you would not need the long proxPointer.

Mike
[jira] Commented: (LUCENE-1314) IndexReader.reopen(boolean force)
[ https://issues.apache.org/jira/browse/LUCENE-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12607954#action_12607954 ]

Michael McCandless commented on LUCENE-1314:

bq. In my SegmentReader subclass I am passing a lock and passing a reference to fieldsReader for global locking and a single fieldsReader across all instances. Otherwise there are too many instances of fieldsReader and file descriptors will be used up.

Maybe instead we should just fix access to FieldsReader to be thread safe, either by making FieldsReader itself thread safe, or by doing something similar to what's done for TermVectorsReader (where each thread makes a shallow clone of the original TermVectorsReader, held in a ThreadLocal instance). If we do that, then in SegmentReader.doReopen() we never have to clone FieldsReader.

IndexReader.reopen(boolean force)
Key: LUCENE-1314
URL: https://issues.apache.org/jira/browse/LUCENE-1314
Project: Lucene - Java
Issue Type: New Feature
Components: Index
Affects Versions: 2.3.1
Reporter: Jason Rutherglen
Assignee: Michael McCandless
Priority: Minor
Attachments: lucene-1314.patch, lucene-1314.patch, lucene-1314.patch

Based on discussion http://www.nabble.com/IndexReader.reopen-issue-td18070256.html. The problem is reopen returns the same reader if there are no changes, so if docs are deleted from the new reader, they are also reflected in the previous reader, which is not always desired behavior.

-- This message is automatically generated by JIRA. You can reply to this email to add a comment to the issue online.
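The TermVectorsReader pattern Michael refers to, one master reader plus a per-thread shallow clone held in a ThreadLocal, looks roughly like this (a plain-Java sketch with a stub class; none of these names are Lucene's, and Lucene's own code predates the Java 8 ThreadLocal.withInitial used here):

```java
public class PerThreadReader {
    /** Stand-in for FieldsReader/TermVectorsReader: a shallow clone shares
     *  the underlying file but gives each thread its own read state. */
    static class StubReader implements Cloneable {
        @Override
        public StubReader clone() {
            try {
                return (StubReader) super.clone();
            } catch (CloneNotSupportedException e) {
                throw new AssertionError(e); // cannot happen: we are Cloneable
            }
        }
    }

    private final StubReader master = new StubReader();

    // Each thread lazily gets its own shallow clone; reads then need no lock.
    private final ThreadLocal<StubReader> perThread =
        ThreadLocal.withInitial(() -> master.clone());

    public StubReader get() { return perThread.get(); }
}
```

The trade-off Jason raises later in the thread still applies: a clone per thread per reader can multiply file descriptors when many readers are open.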
Re: SegmentReader with custom setting of deletedDocs, single reusable FieldsReader
I understand what you are saying. I am not sure it is worth the "clearly quite a bit more work" given how easy it is to simply have more control over the IndexReader deletedDocs BitVector, which seems like a feature that should be in there anyways, perhaps even allowing SortedVIntList to be used. The other issue with going down the path of integrating too much with IndexWriter is I am not sure how to integrate the realtime document additions to IndexWriter, which is handled best by InstantiatedIndex. When merging needs to happen in Ocean, IndexWriter.addIndexes(IndexReader[] readers) is used to merge SegmentReaders and InstantiatedIndexReaders.

One of the things I do not understand about IndexWriter deletes is that it does not reuse an already open TermInfosReader with the tii loaded. Isn't this slower than deleting using an already open IndexReader? In any case, the method of using deletedDocs in SegmentReader using the patch given seems to work quite well in Ocean now. I think long term there is probably some way to integrate more with IndexWriter, but really that is something more in line with removing the concept of IndexReader and IndexWriter and creating an IndexReaderWriter class.

On Wed, Jun 25, 2008 at 6:29 AM, Michael McCandless [EMAIL PROTECTED] wrote:

Jason Rutherglen wrote: One of the bottlenecks I have noticed testing Ocean realtime search is the delete process, which involves writing several files for each possibly single delete of a document in SegmentReader. The best way to handle the deletes is to simply keep them in memory without flushing them to disk, saving on writing out an entire BitVector per delete. The deletes are saved in the transaction log, which is replayed on recovery. I am not sure of the best way to approach this, perhaps it is creating a custom class that inherits from SegmentReader. It could reuse the existing reopen and also provide a way to set the deletedDocs BitVector.
Also it would be able to reuse FieldsReader by providing locking around FieldsReader for all SegmentReaders of the segment to use. Otherwise in the current architecture each new SegmentReader opens a new FieldsReader, which is non-optimal. The deletes would be saved to disk, but instead of per delete, periodically, like a checkpoint.

Or ... maybe you could do the deletes through IndexWriter (somehow, if we can get docIDs properly) and then SegmentReaders could somehow tap into the buffered deleted docIDs that IndexWriter already maintains. IndexWriter is already doing this buffering, flush/commit anyway. We've also discussed at one point creating an IndexReader impl that searches the RAM buffer that DocumentsWriter writes to when adding documents. I think it's easier than it sounds, on first glance, because DocumentsWriter is in fact writing the postings in nearly the same format as is used when the segment is flushed. So if we had this IndexReader impl, plus extended SegmentReader so it could tap into pending deletes buffered in IndexWriter, you could get realtime search without having to use Directory as an intermediary. Though, it is clearly quite a bit more work :)

Mike
Re: SegmentReader with custom setting of deletedDocs, single reusable FieldsReader
On Wed, Jun 25, 2008 at 6:29 AM, Michael McCandless [EMAIL PROTECTED] wrote:

We've also discussed at one point creating an IndexReader impl that searches the RAM buffer that DocumentsWriter writes to when adding documents. I think it's easier than it sounds, on first glance, because DocumentsWriter is in fact writing the postings in nearly the same format as is used when the segment is flushed.

That would be very nice, and should also make it much easier to implement updateable documents (changing/adding/removing single fields).

-Yonik
[jira] Commented: (LUCENE-1314) IndexReader.reopen(boolean force)
[ https://issues.apache.org/jira/browse/LUCENE-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12608039#action_12608039 ]

Jason Rutherglen commented on LUCENE-1314:
--

Here is the code of the SegmentReader subclass. Using the clone terminology would work as well; inside of SegmentReader the clone would most likely reuse SegmentReader.reopenSegment. The subclass turns off locking by overriding acquireWriteLock and having it do nothing. I do not know a general fix for the locking issue mentioned: it holds a lock, and then you can't do deletions in the second object. Perhaps there is a way using lockless commits. It is possible to have SegmentReader behave so that if deletes occur to an earlier IndexReader and a flush is tried, it fails, rather than failing in a newer IndexReader like it would now. This would require keeping track of later IndexReaders, which is something Ocean does outside of IndexReader.

As far as the FieldsReader, given how many SegmentReaders Ocean creates (up to one per update), a shallow clone threadlocal would still potentially create many file descriptors. I would rather see a synchronized FieldsReader, or simply use the approach in the code below. The external lock used seems ok because there is little competition for reading Documents, no more than a normal Lucene application using a single IndexReader loading documents for N results.
{code}
public class OceanSegmentReader extends SegmentReader {
  protected ReentrantLock fieldsReaderLock;

  public OceanSegmentReader() {
    openNewFieldsReader = false;
  }

  protected void doInitialize() {
    fieldsReaderLock = new ReentrantLock();
  }

  protected void acquireWriteLock() throws IOException {
  }

  protected synchronized DirectoryIndexReader doReopen(SegmentInfos infos, boolean force)
      throws CorruptIndexException, IOException {
    OceanSegmentReader segmentReader = (OceanSegmentReader) super.doReopen(infos, force);
    segmentReader.fieldsReaderLock = fieldsReaderLock;
    return segmentReader;
  }

  /**
   * @throws CorruptIndexException if the index is corrupt
   * @throws IOException if there is a low-level IO error
   */
  public synchronized Document document(int n, FieldSelector fieldSelector)
      throws CorruptIndexException, IOException {
    ensureOpen();
    if (isDeleted(n))
      throw new IllegalArgumentException("attempt to access a deleted document");
    fieldsReaderLock.lock();
    try {
      return getFieldsReader().doc(n, fieldSelector);
    } finally {
      fieldsReaderLock.unlock();
    }
  }
}
{code}

IndexReader.reopen(boolean force)
Key: LUCENE-1314
URL: https://issues.apache.org/jira/browse/LUCENE-1314
Project: Lucene - Java
Issue Type: New Feature
Components: Index
Affects Versions: 2.3.1
Reporter: Jason Rutherglen
Assignee: Michael McCandless
Priority: Minor
Attachments: lucene-1314.patch, lucene-1314.patch, lucene-1314.patch

Based on discussion http://www.nabble.com/IndexReader.reopen-issue-td18070256.html. The problem is reopen returns the same reader if there are no changes, so if docs are deleted from the new reader, they are also reflected in the previous reader, which is not always desired behavior.

-- This message is automatically generated by JIRA. You can reply to this email to add a comment to the issue online.
Re: per-field similarity
+1

On 24 Jun 2008, at 22:28, Yonik Seeley wrote:

Something to consider for Lucene 3 is to have something to retrieve Similarity per-field rather than passing the field name into some functions... benefits:
- Would allow customizing most Similarity functions per-field
- Performance: Similarity for a field could be looked up once at the beginning of a query and reused, eliminating hash lookups for every Similarity function called that needs to be different depending on the field name.

Might also consider passing in more optional context when retrieving the similarity for a field (such as a Query, if searching). Something like Similarity.getSimilarity(String field, Query q). Multi-field queries (boolean query) could pass null for the field. Perhaps it could even be back compatible...

-Yonik
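The per-field lookup Yonik describes could be as simple as a map consulted once at the start of a query and then reused for every scored document. A hedged sketch (the Similarity interface here is a one-method stub for illustration, not Lucene's class):

```java
import java.util.HashMap;
import java.util.Map;

public class PerFieldSimilarity {
    /** Stub with a single Similarity function, standing in for the real class. */
    interface Similarity {
        float lengthNorm(int numTerms);
    }

    // Default mirrors the classic 1/sqrt(numTerms) length norm.
    private final Similarity defaultSim = numTerms -> 1f / (float) Math.sqrt(numTerms);
    private final Map<String, Similarity> byField = new HashMap<>();

    public void register(String field, Similarity sim) {
        byField.put(field, sim);
    }

    /** Looked up once at query start; no per-document hash lookups after that. */
    public Similarity forField(String field) {
        return byField.getOrDefault(field, defaultSim);
    }
}
```

A scorer would call forField(field) once in its constructor and hold the result, which is exactly the hash-lookup saving Yonik points out.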
Is there a reason MemoryIndex does not implement Serializable?
It seems like it could, it even has serialVersionUID defined.
Re: Fwd: changing index format
Thanks Paul and Mike for the feedback.

Paul, for us, the sparsity of the docIds determines which data structure to use. Where cardinality gives some of that, min/max docId would also help. Example: say maxdoc=100, cardinality = 7, docids: {0,1,...6} or {3,4,...9}; using arrayDocIdSet would take 28 bytes, while a bitset would take only 1. Furthermore, knowing min/maxDocId would help in predetermining the size needed in construction of a given DocIdSet data structure, to avoid growth.

Thanks for pointing me to SortedVIntList; what is the underlying compression algorithm? We have developed a DocIdSet implementation using a variation of the P4Delta compression algorithm (http://cis.poly.edu/cs912/indexcomp.pdf) that we would like to contribute sometime. From our benchmark, we get about 70% compression (30% of the original size) of arrays, which also gives you iteration in compressed format with performance similar to OpenBitSet. (Iterating over arrays is much faster than over OpenBitSet.)

I am not sure TermScorer serves the purpose here. TermScorer reads a batch of 32 at a time (I don't understand why 32 is picked, or whether it should be customizable), and we can't rely on getting lucky with the underlying OS caching it for us. Many times, we want to move the construction of some filters ahead of time, while the IndexReader loads. Here is an example: say we have a field called gender with only 2 terms: M, F. And our query is always of the form content:"query text" AND gender:M/F; it is ideal to keep the DocIdSet for M and F in memory for the life of the IndexReader. I can't imagine constructing a TermScorer for each query is similar in performance.

Reading the trunk code for TermScorer, I don't see the internal termDocs being closed in skipTo. skipTo returns a boolean which tells the caller if the end is reached; the caller may not/should not call next again to have it closed. So wouldn't this scenario leak? Also in explain(docid), what happens if termDocs is already closed from the next() call?
Thanks
-John

On Wed, Jun 25, 2008 at 12:45 AM, Paul Elschot [EMAIL PROTECTED] wrote:

On Wednesday 25 June 2008 07:03:59, John Wang wrote:

Hi guys: Perhaps I should have posted this to this list in the first place. I am trying to work on a patch to expose minDoc and maxDoc for each term. These values can be retrieved while constructing the TermInfo. Knowing these two values can be very helpful in caching a DocIdSet for a given Term. This would help to determine what type of underlying implementation to use, e.g. BitSet, HashSet, or ArraySet, etc.

I suppose you know about https://issues.apache.org/jira/browse/LUCENE-1296 ? But how about using TermScorer? In the trunk it's a subclass of DocIdSetIterator (via Scorer) and the caching is already done by Lucene and the underlying OS file cache. TermScorer does some extra work for its scoring, but I don't think that would affect performance.

The problem I am having is stated below, I don't know how to add the minDoc and maxDoc values to the index while keeping backward compatibility.

I doubt they would help very much. The most important info for this is maxDoc from the index reader and the document frequency of the term, and these are easily determined. Btw, I've just started to add encoding intervals of consecutive doc ids to SortedVIntList. For very high document frequencies, that might actually be faster than TermScorer and more compact than the current index. Once I've got some working code I'll open an issue for it.

Regards,
Paul Elschot
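John's 28-bytes-vs-1-byte example can be reproduced with simple size arithmetic: an int-array set costs 4 bytes per docid, while a bit set that spans only [minDoc, maxDoc] (the point of exposing min/max) costs one bit per position in that range. A sketch (class and method names, and the byte-count model, are illustrative):

```java
public class DocIdSetChooser {
    /** Bytes for an int-array set: 4 bytes per doc id. */
    static long arrayBytes(int cardinality) {
        return 4L * cardinality;
    }

    /** Bytes for a bit set spanning only [minDoc, maxDoc]: ceil(range/8). */
    static long bitSetBytes(int minDoc, int maxDoc) {
        return ((long) (maxDoc - minDoc) + 8) / 8;
    }

    /** Pick the smaller representation for the given term statistics. */
    static String choose(int cardinality, int minDoc, int maxDoc) {
        return arrayBytes(cardinality) <= bitSetBytes(minDoc, maxDoc)
            ? "array" : "bitset";
    }
}
```

With cardinality 7 and docids {0,1,...6}, the array costs 28 bytes and the min/max-bounded bitset costs a single byte; without min/max the bitset would have to span all of maxdoc.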
Re: Is there a reason MemoryIndex does not implement Serializable?
No reason. Done!

Erik

On Jun 25, 2008, at 11:05 AM, Jason Rutherglen wrote:

It seems like it could, it even has serialVersionUID defined.
Re: SegmentReader with custom setting of deletedDocs, single reusable FieldsReader
I read other parts of the email but glanced over this part. Would terms be automatically sorted as they came in? If implemented, it would be nice to be able to get an encoded representation (probably a byte array) of the document and postings which could be written to a log, and then reentered in another IndexWriter, recreating the document and postings.

On Wed, Jun 25, 2008 at 8:41 AM, Yonik Seeley [EMAIL PROTECTED] wrote:

On Wed, Jun 25, 2008 at 6:29 AM, Michael McCandless [EMAIL PROTECTED] wrote: We've also discussed at one point creating an IndexReader impl that searches the RAM buffer that DocumentsWriter writes to when adding documents. I think it's easier than it sounds, on first glance, because DocumentsWriter is in fact writing the postings in nearly the same format as is used when the segment is flushed.

That would be very nice, and should also make it much easier to implement updateable documents (changing/adding/removing single fields).

-Yonik
[jira] Created: (LUCENE-1316) Avoidable synchronization bottleneck in MatchAlldocsQuery$MatchAllScorer
Avoidable synchronization bottleneck in MatchAlldocsQuery$MatchAllScorer

Key: LUCENE-1316
URL: https://issues.apache.org/jira/browse/LUCENE-1316
Project: Lucene - Java
Issue Type: Bug
Components: Query/Scoring
Affects Versions: 2.3
Environment: All
Reporter: Todd Feak
Priority: Minor

The isDeleted() method on IndexReader has been mentioned a number of times as a potential synchronization bottleneck. However, the reason this bottleneck occurs is actually at a higher level that wasn't focused on (at least in the threads I read). In every case I saw where a stack trace was provided to show the lock/block, higher in the stack you see the MatchAllScorer.next() method. In Solr particularly, this scorer is used for NOT queries. We saw incredibly poor performance (an order of magnitude) in our load tests for NOT queries, due to this bottleneck.

The problem is that every single document is run through this isDeleted() method, which is synchronized. Having an optimized index exacerbates this issue, as there is only a single SegmentReader to synchronize on, causing a major thread pileup waiting for the lock. By simply having the MatchAllScorer see if there have been any deletions in the reader, much of this can be avoided. Especially in a read-only environment for production where you have slaves doing all the high load searching.

I modified line 67 in MatchAllDocsQuery
FROM: if (!reader.isDeleted(id)) {
TO: if (!reader.hasDeletions() || !reader.isDeleted(id)) {

In our micro load test for NOT queries only, this was a major performance improvement. We also got the same query results. I don't believe this will improve the situation for indexes that have deletions. Please consider making this adjustment for a future bug fix release.

-- This message is automatically generated by JIRA. You can reply to this email to add a comment to the issue online.
[jira] Updated: (LUCENE-1316) Avoidable synchronization bottleneck in MatchAlldocsQuery$MatchAllScorer
[ https://issues.apache.org/jira/browse/LUCENE-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Feak updated LUCENE-1316:
--

Further investigation indicates that the ValueSourceQuery$ValueSourceScorer may suffer from the same issue and benefit from a similar optimization.

Avoidable synchronization bottleneck in MatchAlldocsQuery$MatchAllScorer
Key: LUCENE-1316
URL: https://issues.apache.org/jira/browse/LUCENE-1316
Project: Lucene - Java
Issue Type: Bug
Components: Query/Scoring
Affects Versions: 2.3
Environment: All
Reporter: Todd Feak
Priority: Minor
Attachments: MatchAllDocsQuery.java
Original Estimate: 1h
Remaining Estimate: 1h

-- This message is automatically generated by JIRA. You can reply to this email to add a comment to the issue online.
Re: Fwd: changing index format
Hi Paul: Regarding your comment on adding required/prohibited to BooleanQuery: Based on the new DocIdSet and DocIdSetIterator abstractions, we also developed decorators such as AndDocIdSet, OrDocIdSet and NotDocIdSet, furthermore a DocIdSetQuery class that honors the Query api contracts. Given these tools, we are able to build a customized scored BooleanQuery-like query infrastructure. We'd be happy to contribute them. Thanks -John On Wed, Jun 25, 2008 at 9:29 AM, Paul Elschot [EMAIL PROTECTED] wrote: Op Wednesday 25 June 2008 17:05:17 schreef John Wang: Thanks Paul and Mike for the feedback. Paul, for us, sparsity of the docIds determines which data structure to use. Where cardinality gives some of that, min/max docId would also help, example: say maxdoc=100, cardinality = 7, docids: {0,1,...6} or {3,4...9}, using arrayDocIdSet would take 28 bytes and bitset would take only 1. Furthermore, knowing min/maxDocId would help predetermine the size needed in construction of a given DocIdSet datastructure, to avoid growth. Thanks for pointing me to SortedVIntList, what is the underlying compression algorithm? A SortedVIntList uses a byte array to store the docid differences as a series of VInts, with a VInt being a series of bytes in which the high bit is a continuation bit, and the remaining bits are data for an unsigned integer. The same VInt is used in a Lucene index in various places. We have developed a DocIdSet implementation using a variation of the P4Delta compression algorithm ( http://cis.poly.edu/cs912/indexcomp.pdf) that we would like to contribute sometime. From our benchmark, we get about 70% compression (30% of the original size) of arrays, which also gives you iteration in compressed format with performance similar to OpenBitSet. (Iterating over arrays is much faster than over OpenBitSet.) Andrzej recently pointed to a paper on PForDelta, and since then I have a java implementation rather low on my todo list. 
Needless to say, I'm interested to see it contributed. I am not sure TermScorer serves the purpose here. TermScorer reads a batch of 32 at a time (I don't understand why 32 was picked, or whether it should be customizable), and we can't rely on getting lucky with the underlying OS cache. Many times, we want to move the construction of some filters ahead while the IndexReader reads. Here is an example: say we have a field called gender with only 2 terms: M, F. And our query is always of the form content:query text AND gender:M/F, so it is ideal to keep the DocIdSet for M and F in memory for the life of the IndexReader. I can't imagine constructing a TermScorer for each query is similar in performance. Well, you can give TermScorer a try before writing other code. Adding a DocIdSet as required or prohibited to a BooleanQuery would be nice, but that is not yet possible. Reading the trunk code for TermScorer, I don't see that the internal termDocs is closed in skipTo. skipTo returns a boolean which tells the caller if the end is reached, and the caller may not/should not call next again to have it closed. So wouldn't this scenario leak? Closing of Scorers has been discussed before, and the only conclusion I remember now is that there is no bug in the current code. Also in explain(docid), what happens if termDocs is already closed from the next() call? When explain() is called on a Scorer, next() and skipTo() should not be called. A Scorer can either explain, or search, but not both. Regards, Paul Elschot Thanks -John On Wed, Jun 25, 2008 at 12:45 AM, Paul Elschot [EMAIL PROTECTED] wrote: Op Wednesday 25 June 2008 07:03:59 schreef John Wang: Hi guys: Perhaps I should have posted this to this list in the first place. I am trying to work on a patch to expose minDoc and maxDoc for each term. These values can be retrieved while constructing the TermInfo. Knowing these two values can be very helpful in caching a DocIdSet for a given Term. 
This would help to determine what type of underlying implementation to use, e.g. BitSet, HashSet, or ArraySet. I suppose you know about https://issues.apache.org/jira/browse/LUCENE-1296 ? But how about using TermScorer? In the trunk it's a subclass of DocIdSetIterator (via Scorer) and the caching is already done by Lucene and the underlying OS file cache. TermScorer does some extra work for its scoring, but I don't think that would affect performance. The problem I am having is stated below: I don't know how to add the minDoc and maxDoc values to the index while keeping backward compatibility. I doubt they would help very much. The most important info for this is maxDoc from the index reader and the document frequency of the term, and these are easily determined. Btw, I've just started to add
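Paul's description of the VInt format above (the low seven bits of each byte carry data, a set high bit means another byte follows) can be sketched as follows. The class and helper names here are illustrative only; Lucene's real implementation lives in its index I/O classes:

```java
import java.io.ByteArrayOutputStream;

// Illustrative sketch of the VInt encoding described above: each byte holds
// 7 data bits, and a set high bit signals that another byte follows.
class VInt {
    static byte[] write(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // low 7 bits, continuation bit set
            value >>>= 7;
        }
        out.write(value); // final byte, high bit clear
        return out.toByteArray();
    }

    static int read(byte[] bytes) {
        int value = 0, shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift; // accumulate 7 bits per byte
            shift += 7;
            if ((b & 0x80) == 0) break;   // high bit clear: last byte
        }
        return value;
    }
}
```

Small docid deltas in a SortedVIntList thus cost a single byte each, which is where the space savings over a raw int[] come from.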
Re: SegmentReader with custom setting of deletedDocs, single reusable FieldsReader
On Wed, Jun 25, 2008 at 11:30 AM, Jason Rutherglen [EMAIL PROTECTED] wrote: I read other parts of the email but glanced over this part. Would terms be automatically sorted as they came in? If implemented, it would be nice to be able to get an encoded representation (probably a byte array) of the document and postings which could be written to a log, and then reentered in another IndexWriter, recreating the document and postings. I was talking simpler... If one could open an IndexReader on the index (including uncommitted documents in the open IndexWriter), then you can easily search for a document and retrieve its stored fields in order to re-index it with changes (and still maintain decent performance). -Yonik On Wed, Jun 25, 2008 at 8:41 AM, Yonik Seeley [EMAIL PROTECTED] wrote: On Wed, Jun 25, 2008 at 6:29 AM, Michael McCandless [EMAIL PROTECTED] wrote: We've also discussed at one point creating an IndexReader impl that searches the RAM buffer that DocumentsWriter writes to when adding documents. I think it's easier than it sounds, on first glance, because DocumentsWriter is in fact writing the postings in nearly the same format as is used when the segment is flushed. That would be very nice, and should also make it much easier to implement updateable documents (changing/adding/removing single fields). -Yonik
BooleanQuery and DocIdSet; Was: Fwd: changing index format
Op Wednesday 25 June 2008 18:45:16 schreef John Wang: Hi Paul: Regarding your comment on adding required/prohibited to BooleanQuery: Based on the new DocIdSet and DocIdSetIterator abstractions, we also developed decorators such as AndDocIdSet, OrDocIdSet and NotDocIdSet, furthermore a DocIdSetQuery class that honors the Query api contracts. Given these tools, we are able to build a customized scored BooleanQuery-like query infrastructure. We'd be happy to contribute them. Another thing to be removed from near the end of my todo list? Perhaps I could even take a vacation :) More seriously: would DocIdSetQuery be superfluous when a DocIdSet could be added directly to a BooleanQuery? Could you elaborate a bit on the customized scoring? Regards, Paul Elschot
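The AndDocIdSet decorator John describes boils down to a leapfrog intersection of sorted docid streams. A minimal sketch over plain int arrays, standing in for DocIdSetIterator's next()/skipTo() purely for illustration:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a conjunction (AND) over two sorted docid lists, the core of an
// AndDocIdSet-style decorator. Plain arrays stand in for DocIdSetIterators.
class AndSketch {
    static int[] and(int[] a, int[] b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) { out.add(a[i]); i++; j++; } // match: emit docid
            else if (a[i] < b[j]) i++; // advance the lagging side (skipTo analogue)
            else j++;
        }
        int[] result = new int[out.size()];
        for (int k = 0; k < result.length; k++) result[k] = out.get(k);
        return result;
    }
}
```

A real decorator would advance the lagging iterator with skipTo() rather than one docid at a time, which is what makes the conjunction cheap on sparse sets.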
[jira] Commented: (LUCENE-1316) Avoidable synchronization bottleneck in MatchAlldocsQuery$MatchAllScorer
[ https://issues.apache.org/jira/browse/LUCENE-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12608128#action_12608128 ] Yonik Seeley commented on LUCENE-1316: -- Although this doesn't solve the general problem, this probably still makes sense to do now for the no-deletes case. Todd, can you produce a patch? See http://wiki.apache.org/lucene-java/HowToContribute
[jira] Commented: (LUCENE-1316) Avoidable synchronization bottleneck in MatchAlldocsQuery$MatchAllScorer
[ https://issues.apache.org/jira/browse/LUCENE-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12608129#action_12608129 ] Hoss Man commented on LUCENE-1316: -- rather than attempting localized optimizations of individual Query classes, a more generalized improvement would probably be to change SegmentReader.isDeleted so that instead of the entire method being synchronized, it first checks whether the segment has any deletions, and only then enters a synchronized block to check deletedDocs.get(n).
[jira] Updated: (LUCENE-1316) Avoidable synchronization bottleneck in MatchAlldocsQuery$MatchAllScorer
[ https://issues.apache.org/jira/browse/LUCENE-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Feak updated LUCENE-1316: -- I like Hoss' suggestion better. I'll try that fix locally and if it provides the same improvement, I will submit a patch for you.
[jira] Commented: (LUCENE-1316) Avoidable synchronization bottleneck in MatchAlldocsQuery$MatchAllScorer
[ https://issues.apache.org/jira/browse/LUCENE-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12608134#action_12608134 ] Yonik Seeley commented on LUCENE-1316: -- bq. a more generalized improvement would probably be to change SegmentReader.isDeleted so that instead of the entire method being synchronized Right, but that's not totally back compatible. Code that depended on deletes being instantly visible across threads would no longer be guaranteed.
Re: per-field similarity
: Might also consider passing in more optional context when retrieving : the similarity for a field (such as a Query, if searching). : Something like Similarity.getSimilarity(String field, Query q). I assume you mean Searcher.getSimilarity(String fieldName, Query q) to replace the current Searcher.getSimilarity(), right? (where in both cases we are talking about an instance method and not a static method) There have been some discussions about this in the past; I think at one point Doug suggested almost the exact same thing in this thread... http://www.nabble.com/-jira--Created%3A-%28LUCENE-577%29-SweetSpotSimiliarity-to4533741.html#a4536312 ...it could probably be done in a completely backwards compatible way. -Hoss
[jira] Commented: (LUCENE-1316) Avoidable synchronization bottleneck in MatchAlldocsQuery$MatchAllScorer
[ https://issues.apache.org/jira/browse/LUCENE-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12608137#action_12608137 ] Hoss Man commented on LUCENE-1316: -- bq. Code that depended on deletes being instantly visible across threads would no longer be guaranteed. you lost me there ... why would deletes stop being instantly visible if we changed this...
{code}
public synchronized boolean isDeleted(int n) {
  return (deletedDocs != null && deletedDocs.get(n));
}
{code}
...to this...
{code}
public boolean isDeleted(int n) {
  if (null == deletedDocs) return false;
  synchronized (this) {
    return deletedDocs.get(n);
  }
}
{code}
?
Re: per-field similarity
On Wed, Jun 25, 2008 at 2:19 PM, Chris Hostetter [EMAIL PROTECTED] wrote: : Might also consider passing in more optional context when retrieving : the similarity for a field (such as a Query, if searching). : Something like Similarity.getSimilarity(String field, Query q). i assume you mean Searcher.getSimilarity(String fieldName, Query q) to replace the current Searcher.getSimilarity() right? No, I meant Similarity (it's more like a factory method on the Similarity class). Searcher.getSimilarity() could remain unchanged. A Similarity is what is passed into the IndexWriter, and you would want the same per-field flexibility there. (where in both cases we are talking about an instance method and not a static method) Right. -Yonik
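The factory-method shape Yonik describes might look like the following. None of these class or method names are real Lucene API; this is purely a sketch of the proposal's dispatch shape:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the proposed per-field factory method on a
// Similarity-like class. All names here are assumptions, not Lucene's API.
class SimilaritySketch {
    // Default: the same similarity for every field.
    SimilaritySketch getSimilarity(String field) { return this; }

    // Stand-in for one of Similarity's scoring factors.
    float lengthNorm(int numTokens) { return (float) (1.0 / Math.sqrt(numTokens)); }
}

class PerFieldSimilaritySketch extends SimilaritySketch {
    private final Map<String, SimilaritySketch> perField = new HashMap<>();

    void register(String field, SimilaritySketch sim) { perField.put(field, sim); }

    @Override
    SimilaritySketch getSimilarity(String field) {
        SimilaritySketch sim = perField.get(field);
        return sim != null ? sim : this; // fall back to the default behavior
    }
}
```

Because both IndexWriter and Searcher take a Similarity, the same dispatch would cover indexing-time norms and search-time scoring, which is the point about wanting the flexibility on Similarity itself rather than on Searcher.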
[jira] Commented: (LUCENE-1316) Avoidable synchronization bottleneck in MatchAlldocsQuery$MatchAllScorer
[ https://issues.apache.org/jira/browse/LUCENE-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12608146#action_12608146 ] robert engels commented on LUCENE-1316: --- According to the Java memory model, hasDeletions() would need to be synchronized as well, since if another thread did perform a deletion, it would need to be up to date. This might work in later JVMs by declaring the deletedDocs variable volatile, but no guarantees. Seems better to ALLOW this behavior (that a reader might not see up-to-date deletions made during a query) and do a single synchronized check of deletions at the start.
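The approach robert sketches (a volatile deletedDocs so the no-deletions check needs no lock, with mutation still synchronized) could look like this. Field and method names loosely follow SegmentReader but are assumptions, not the actual Lucene source:

```java
import java.util.BitSet;

// Sketch of the volatile-based fast path discussed above: readers pay a single
// volatile read on the common no-deletions path and only synchronize when a
// deletions bitset exists. Relies on the Java 5 memory model for volatile.
class DeletedDocsSketch {
    private volatile BitSet deletedDocs; // null until the first delete

    boolean isDeleted(int n) {
        BitSet docs = deletedDocs;       // one volatile read
        if (docs == null) return false;  // fast path: nothing deleted
        synchronized (this) {            // slow path: guard bitset reads
            return deletedDocs.get(n);
        }
    }

    synchronized void delete(int n) {
        if (deletedDocs == null) deletedDocs = new BitSet();
        deletedDocs.set(n);
    }
}
```

The trade-off is that once deletions exist, each check pays both a volatile read and a monitor acquire.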
[jira] Commented: (LUCENE-1316) Avoidable synchronization bottleneck in MatchAlldocsQuery$MatchAllScorer
[ https://issues.apache.org/jira/browse/LUCENE-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12608147#action_12608147 ] Yonik Seeley commented on LUCENE-1316: -- bq. why would deletes stop being instantly visible It's minor, but before, if thread A deleted a document, and then thread B checked if it was deleted, thread B was guaranteed to see that it was in fact deleted. If the check for deletedDocs == null were moved outside of the synchronized block, there's no guarantee when thread B will see (if ever) that deletedDocs has been set (no memory barrier). It's not a major issue since client code shouldn't be written that way IMO, but it's worth factoring into the decision.
[jira] Commented: (LUCENE-1316) Avoidable synchronization bottleneck in MatchAlldocsQuery$MatchAllScorer
[ https://issues.apache.org/jira/browse/LUCENE-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12608149#action_12608149 ] robert engels commented on LUCENE-1316: --- The Pattern#5 referenced (cheap read-write lock) is exactly what is trying to be accomplished.
[jira] Commented: (LUCENE-1316) Avoidable synchronization bottleneck in MatchAlldocsQuery$MatchAllScorer
[ https://issues.apache.org/jira/browse/LUCENE-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12608160#action_12608160 ] Yonik Seeley commented on LUCENE-1316: -- bq. declaring the deletedDocs volatile should do the trick. Right... that would be cheaper when no docs were deleted. But would it be more expensive when there were deleted docs (a volatile + a synchronized?) I don't know if lock coarsening could do anything with this case... Avoidable synchronization bottleneck in MatchAlldocsQuery$MatchAllScorer Key: LUCENE-1316 URL: https://issues.apache.org/jira/browse/LUCENE-1316 Project: Lucene - Java Issue Type: Bug Components: Query/Scoring Affects Versions: 2.3 Environment: All Reporter: Todd Feak Priority: Minor Attachments: MatchAllDocsQuery.java Original Estimate: 1h Remaining Estimate: 1h The isDeleted() method on IndexReader has been mentioned a number of times as a potential synchronization bottleneck. However, the reason this bottleneck occurs is actually at a higher level that wasn't focused on (at least in the threads I read). In every case I saw where a stack trace was provided to show the lock/block, higher in the stack you see the MatchAllScorer.next() method. In Solr paricularly, this scorer is used for NOT queries. We saw incredibly poor performance (order of magnitude) on our load tests for NOT queries, due to this bottleneck. The problem is that every single document is run through this isDeleted() method, which is synchronized. Having an optimized index exacerbates this issues, as there is only a single SegmentReader to synchronize on, causing a major thread pileup waiting for the lock. By simply having the MatchAllScorer see if there have been any deletions in the reader, much of this can be avoided. Especially in a read-only environment for production where you have slaves doing all the high load searching. 
I modified line 67 in MatchAllDocsQuery FROM:

    if (!reader.isDeleted(id)) {

TO:

    if (!reader.hasDeletions() || !reader.isDeleted(id)) {

In our micro load test for NOT queries only, this was a major performance improvement, and we got the same query results. I don't believe this will improve the situation for indexes that have deletions. Please consider making this adjustment for a future bug fix release.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
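The change described in the issue can be sketched outside of Lucene with a minimal stand-in reader (the `Reader` interface, `matchAll`, and `reader` below are illustrative names, not Lucene API). The short-circuit on `hasDeletions()` keeps the synchronized `isDeleted()` call out of the loop entirely when nothing has been deleted:

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class MatchAllSketch {
    // Minimal stand-in for the slice of IndexReader the scorer touches.
    interface Reader {
        int maxDoc();
        boolean hasDeletions();     // cheap check
        boolean isDeleted(int doc); // the (synchronized, in real Lucene) hot call
    }

    // Collect all live doc ids using the proposed short-circuit: on a
    // delete-free reader, isDeleted() is never entered at all.
    static List<Integer> matchAll(Reader r) {
        List<Integer> hits = new ArrayList<>();
        for (int id = 0; id < r.maxDoc(); id++) {
            if (!r.hasDeletions() || !r.isDeleted(id)) {
                hits.add(id);
            }
        }
        return hits;
    }

    // Convenience factory for the sketch; a null BitSet means no deletions.
    static Reader reader(int docCount, BitSet deleted) {
        return new Reader() {
            public int maxDoc() { return docCount; }
            public boolean hasDeletions() { return deleted != null && !deleted.isEmpty(); }
            public boolean isDeleted(int doc) { return deleted != null && deleted.get(doc); }
        };
    }

    public static void main(String[] args) {
        BitSet deleted = new BitSet();
        deleted.set(1);
        System.out.println(matchAll(reader(3, deleted))); // [0, 2]
        System.out.println(matchAll(reader(3, null)));    // [0, 1, 2]
    }
}
```

As the reporter notes, this only helps readers with no deletions; when deletions exist, every document still goes through the synchronized call.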
[jira] Commented: (LUCENE-1316) Avoidable synchronization bottleneck in MatchAlldocsQuery$MatchAllScorer
[ https://issues.apache.org/jira/browse/LUCENE-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12608162#action_12608162 ] Mark Miller commented on LUCENE-1316:
--
If I remember correctly, volatile does not work correctly until Java 1.5. At best, I think it was implementation-dependent under the old memory model.
[jira] Issue Comment Edited: (LUCENE-1316) Avoidable synchronization bottleneck in MatchAlldocsQuery$MatchAllScorer
[ https://issues.apache.org/jira/browse/LUCENE-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12608162#action_12608162 ] [EMAIL PROTECTED] edited comment on LUCENE-1316 at 6/25/08 12:40 PM:
--
If I remember correctly, volatile does not work correctly until Java 1.5. At best, I think it was implementation-dependent under the old memory model.

*edit* Maybe it's OK under certain circumstances: http://www.ibm.com/developerworks/library/j-jtp02244.html (Problem #2: Reordering volatile and nonvolatile stores)

was (Author: [EMAIL PROTECTED]): If I remember correctly, volatile does not work correctly until Java 1.5. At best, I think it was implementation-dependent under the old memory model.
[jira] Commented: (LUCENE-1316) Avoidable synchronization bottleneck in MatchAlldocsQuery$MatchAllScorer
[ https://issues.apache.org/jira/browse/LUCENE-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12608183#action_12608183 ] Hoss Man commented on LUCENE-1316:
--
bq. if thread A deleted a document, and then thread B checked if it was deleted, thread B was guaranteed to see that it was in fact deleted.

Hmmm, I'll take your word for it, but I don't follow the rationale: the current synchronization just ensures that either the isDeleted() call completes before the delete() call starts or vice versa -- but you have no guarantee that thread B would run after thread A and get true. Unless... is your point that without synchronization on the null check, there's no guarantee that B will ever see the change to deletedDocs even if it does execute after delete()?

Either way: Robert's point about hasDeletions() needing to be synchronized seems like a bigger issue -- isn't that a bug in the current implementation? Assuming we fix that, it seems like the original issue is back to square one: synchronization bottlenecks when there are no deletions.
[jira] Commented: (LUCENE-1316) Avoidable synchronization bottleneck in MatchAlldocsQuery$MatchAllScorer
[ https://issues.apache.org/jira/browse/LUCENE-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12608187#action_12608187 ] robert engels commented on LUCENE-1316:
--
Hoss, that is indeed the case: one thread would see deletedDocs as null, even though another thread has set it. hasDeletions() does not need to be synchronized if deletedDocs is volatile.
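Robert's suggestion can be sketched as follows; `SegmentState` is an illustrative stand-in for the relevant slice of SegmentReader, not the actual class. The point is that a volatile write safely publishes the bit set under the Java 5 memory model, so hasDeletions() can be a lock-free null check:

```java
import java.util.BitSet;

public class VolatileSketch {
    static class SegmentState {
        // volatile guarantees that once a writer assigns deletedDocs, readers
        // on other threads see a fully constructed BitSet (Java 5+ memory model).
        private volatile BitSet deletedDocs; // null => no deletions yet

        // Writers still serialize among themselves; copy-on-write so readers
        // never observe a BitSet while it is being mutated.
        synchronized void delete(int doc) {
            BitSet bits = (deletedDocs == null) ? new BitSet() : (BitSet) deletedDocs.clone();
            bits.set(doc);
            deletedDocs = bits; // publish via volatile write
        }

        boolean hasDeletions() {
            return deletedDocs != null; // volatile read, no lock needed
        }
    }

    public static void main(String[] args) {
        SegmentState s = new SegmentState();
        System.out.println(s.hasDeletions()); // false
        s.delete(5);
        System.out.println(s.hasDeletions()); // true
    }
}
```

This matches the trade-off Yonik raises above: the no-deletions path becomes lock-free, while the deletions path pays a volatile read on top of whatever synchronization the real code still needs.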
Re: BooleanQuery and DocIdSet; Was: Fwd: changing index format
I am not sure; BooleanQuery takes something that can score, e.g. a Clause or a Query - the contract requires some sort of scoring functionality. We use DocIdSetQuery for some of the scoring capabilities such as constant score (with boosting), age decay, and using the new scoring API in 2.3. Maybe I am misunderstanding the point of the question.

Thanks
-John

On Wed, Jun 25, 2008 at 10:32 AM, Paul Elschot [EMAIL PROTECTED] wrote:

On Wednesday 25 June 2008 18:45:16, John Wang wrote:
: Hi Paul: Regarding your comment on adding required/prohibited to
: BooleanQuery: Based on the new DocIdSet and DocIdSetIterator
: abstractions, we also developed decorators such as AndDocIdSet,
: OrDocIdSet and NotDocIdSet, and furthermore a DocIdSetQuery class that
: honors the Query API contracts. Given these tools, we are able to build
: a customized scored BooleanQuery-like query infrastructure. We'd be
: happy to contribute them.

Another thing to be removed from near the end of my todo list? Perhaps I could even take a vacation :) More seriously: would DocIdSetQuery be superfluous when a DocIdSet could be added directly to a BooleanQuery? Could you elaborate a bit on the customized scoring?

Regards,
Paul Elschot
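For illustration, here is roughly what a NOT decorator over doc-id iterators amounts to; `NotDocIdSetSketch` and `not` are hypothetical names for this sketch, not the contributed classes. Given a sorted ascending iterator of excluded doc ids, it emits every id in [0, maxDoc) that is not excluded:

```java
import java.util.Arrays;
import java.util.PrimitiveIterator;
import java.util.stream.IntStream;

public class NotDocIdSetSketch {
    // Complement of a sorted ascending doc-id iterator over [0, maxDoc).
    static int[] not(PrimitiveIterator.OfInt excluded, int maxDoc) {
        int[] out = new int[maxDoc];
        int n = 0;
        // -1 is a safe sentinel since doc ids are nonnegative
        int nextExcluded = excluded.hasNext() ? excluded.nextInt() : -1;
        for (int id = 0; id < maxDoc; id++) {
            if (id == nextExcluded) {
                nextExcluded = excluded.hasNext() ? excluded.nextInt() : -1;
            } else {
                out[n++] = id;
            }
        }
        return Arrays.copyOf(out, n);
    }

    public static void main(String[] args) {
        PrimitiveIterator.OfInt ex = IntStream.of(1, 3).iterator();
        System.out.println(Arrays.toString(not(ex, 5))); // [0, 2, 4]
    }
}
```

AND and OR decorators follow the same shape, merging two sorted iterators instead of complementing one.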
[jira] Commented: (LUCENE-1316) Avoidable synchronization bottleneck in MatchAlldocsQuery$MatchAllScorer
[ https://issues.apache.org/jira/browse/LUCENE-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12608189#action_12608189 ] Yonik Seeley commented on LUCENE-1316:
--
bq. is your point that without synchronization on the null check there's no guarantee that B will ever see the change to deletedDocs even if it does execute after delete()

Right... it's about the memory barrier. The reality is that there is normally a need for higher-level synchronization anyway. That's why it was always silly for things like StringBuffer to be synchronized.

bq. assuming we fix that then it seems like the original issue is back to square one: synchro bottlenecks when there are no deletions.

A scorer could just check once when initialized... there has never been any guarantee about in-flight queries immediately seeing deleted-docs changes - now *that* really wouldn't make sense. TermScorer grabs the whole bit vector at the start and never checks again.
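Yonik's check-once approach can be sketched like this; `liveDocsCursor` is a hypothetical name, and the snapshot semantics (an in-flight query not seeing later deletes) mirror what he describes TermScorer doing with its bit vector:

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.stream.IntStream;

public class SnapshotSketch {
    // Enumerate live doc ids against a deletions snapshot taken once at
    // creation; deletes made afterwards are invisible to this "query",
    // by design - so no per-doc synchronization is needed.
    static int[] liveDocsCursor(int maxDoc, BitSet deletedAtInit) {
        BitSet snapshot = (deletedAtInit == null)
                ? new BitSet()
                : (BitSet) deletedAtInit.clone(); // the one-time check/copy
        return IntStream.range(0, maxDoc)
                .filter(id -> !snapshot.get(id))
                .toArray();
    }

    public static void main(String[] args) {
        BitSet deleted = new BitSet();
        deleted.set(0);
        System.out.println(Arrays.toString(liveDocsCursor(4, deleted))); // [1, 2, 3]
    }
}
```

Compared with the per-doc short-circuit in the patch, this pays one copy (or one volatile read, if the reference is merely captured) up front and then iterates with no shared state at all.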
Re: How to do a query using less than or greater than
: and how to use them? For a concrete example I'm looking to do a query
: on a date field to find documents earlier than a specified date or
: later than a specified date. Ex: date:(< 20070101) or date:(> 20070101).
: I looked at the range query feature but it didn't appear to cover this
: case. Anyone have any suggestions?

RangeQuery (and ConstantScoreRangeQuery) can both cover this case by setting either the upper or lower term to null.

Incidentally... http://people.apache.org/~hossman/#java-dev

Please Use [EMAIL PROTECTED] Not [EMAIL PROTECTED]

Your question is better suited for the [EMAIL PROTECTED] mailing list ... not the [EMAIL PROTECTED] list. java-dev is for discussing development of the internals of the Lucene Java library ... it is *not* the appropriate place to ask questions about how to use the Lucene Java library when developing your own applications. Please resend your message to the java-user mailing list, where you are likely to get more/better responses since that list also has a larger number of subscribers.

-Hoss
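What an open-ended range amounts to can be shown with plain string comparison standing in for Lucene's term ordering; `inRange` is an illustrative helper for this sketch, not the RangeQuery API. A null bound simply means "unbounded on that side", which is how a less-than or greater-than query falls out of a range query:

```java
public class OpenRangeSketch {
    // An inclusive range check where a null lower/upper bound means
    // "unbounded on that side" - the semantics Hoss describes for
    // passing a null term to RangeQuery.
    static boolean inRange(String term, String lower, String upper) {
        return (lower == null || term.compareTo(lower) >= 0)
            && (upper == null || term.compareTo(upper) <= 0);
    }

    public static void main(String[] args) {
        // date earlier than 20070101: range (null, "20070101"]
        System.out.println(inRange("20061231", null, "20070101")); // true
        System.out.println(inRange("20070202", null, "20070101")); // false
        // date later than 20070101: range ["20070101", null)
        System.out.println(inRange("20070202", "20070101", null)); // true
    }
}
```

Note this relies on dates being stored in a lexicographically sortable form such as yyyymmdd, as in the question's example.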
Re: per-field similarity
On Wed, Jun 25, 2008 at 5:06 PM, Chris Hostetter [EMAIL PROTECTED] wrote:
: Hmmm... that seems like it would be confusing: particularly since in the
: IndexWriter case the Query param would never make sense. Changing
: IndexWriter.getSimilarity to take a String fieldName, and changing
: Searcher.getSimilarity to take String fieldName, Query q, seem like they
: would be more straightforward.

That would require a user to subclass both IndexWriter and Searcher. Since Similarity is already passed around, adding a factory method there seems like the easiest approach. It's also a class, so we could easily add a method. An optional Query param or other context (or more than one factory method) was just a quick idea... it may or may not ultimately make sense.

: (There's also the potential ambiguity of how many times do i call
: Similarity.getSimilarity() before i stop? ... it may seem silly, but if
: you're working in a Query or Scorer or Weight you may not be sure if
: it's been done yet)

Once per level? When creating the Weight, I would think. If you call it again, the default impl would return this. It might be a little cleaner to pass around a SimilarityFactory, but that ship has sailed IMO (along with many others :-)

-Yonik
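The factory-method idea can be sketched as follows; the classes here are minimal stand-ins for this sketch, not the actual Lucene Similarity API. The key property Yonik describes is that the default `getSimilarity(field)` returns `this`, so existing Similarity subclasses keep working unchanged and calling it twice is harmless:

```java
public class PerFieldSimilaritySketch {
    // Stand-in for Similarity with the proposed factory hook added.
    static class Similarity {
        float lengthNorm(int numTerms) {
            return (float) (1.0 / Math.sqrt(numTerms));
        }
        // Default impl returns this, so per-field behavior is strictly opt-in
        // and repeated calls are idempotent.
        Similarity getSimilarity(String field) { return this; }
    }

    // Example subclass: disable length normalization for the "title" field.
    static class TitleBoostSimilarity extends Similarity {
        private final Similarity flatNorm = new Similarity() {
            float lengthNorm(int numTerms) { return 1.0f; } // ignore field length
        };
        @Override
        Similarity getSimilarity(String field) {
            return "title".equals(field) ? flatNorm : this;
        }
    }

    public static void main(String[] args) {
        Similarity sim = new TitleBoostSimilarity();
        System.out.println(sim.getSimilarity("title").lengthNorm(4)); // 1.0
        System.out.println(sim.getSimilarity("body").lengthNorm(4));  // 0.5
    }
}
```

Callers (e.g. a Weight at creation time, per the "once per level" suggestion) would resolve the per-field Similarity once and hold onto it.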
Re: per-field similarity
On 24-Jun-08, at 1:28 PM, Yonik Seeley wrote:
: Something to consider for Lucene 3 is to have something to retrieve
: Similarity per-field rather than passing the field name into some
: functions...

+1. I've felt that this was the proper (and more useful) way to do things for a long time (http://markmail.org/message/56bk6wrbwallyjvr).

-Mike
Re: How to do a query using less than or greater than
Chris, that's exactly what I was looking for. Thanks for the info and the clarification on where to post my questions.

Regards,
Kyle