[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12651837#action_12651837 ]

Paul Elschot commented on LUCENE-855:

On the face of it, this has some overlap with the recent FieldCacheRangeFilter of LUCENE-1461. Any comments?

MemoryCachedRangeFilter to boost performance of Range queries

Key: LUCENE-855
URL: https://issues.apache.org/jira/browse/LUCENE-855
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Affects Versions: 2.1
Reporter: Andy Liu
Attachments: contrib-filters.tar.gz, FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, FieldCacheRangeFilter_Lucene_2.3.0.patch, MemoryCachedRangeFilter.patch, MemoryCachedRangeFilter_1.4.patch, TestRangeFilterPerformanceComparison.java, TestRangeFilterPerformanceComparison.java

Currently RangeFilter uses TermEnum and TermDocs to find documents that fall within the specified range. This requires iterating through every single term in the index and can get rather slow for large document sets.

MemoryCachedRangeFilter reads all (docId, value) pairs of a given field, sorts them by value, and stores them in a SortedFieldCache. During bits(), binary searches are used to find the start and end indices of the lower and upper bound values. The BitSet is populated by all the docId values that fall in between the start and end indices.

TestMemoryCachedRangeFilterPerformance creates a 100K RAMDirectory-backed index with random date values within a 5 year range. Executing bits() 1000 times on standard RangeQuery using random date intervals took 63904ms. Using MemoryCachedRangeFilter, it took 876ms. The performance increase is less dramatic when a field has fewer unique terms or when the index has fewer documents.

Currently MemoryCachedRangeFilter only works with numeric values (values are stored in a long[] array) but it can be easily changed to support Strings. A side benefit of storing the values as longs is that there's no longer any need to make the values lexicographically comparable, e.g. by padding numeric values with zeros.

The downside of using MemoryCachedRangeFilter is that there's a fairly significant memory requirement. So it's designed to be used in situations where range filter performance is critical and memory consumption is not an issue. The memory requirements are: (sizeof(int) + sizeof(long)) * numDocs.

MemoryCachedRangeFilter also requires a warmup step which can take a while to run on large datasets (it took 40s to run on a 3M document corpus). Warmup can be called explicitly, or it is called automatically the first time MemoryCachedRangeFilter is applied using a given field.

So in summary, MemoryCachedRangeFilter can be useful when:
- Performance is critical
- Memory is not an issue
- Field contains many unique numeric values
- Index contains a large number of documents

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
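The approach described for MemoryCachedRangeFilter (sort the (docId, value) pairs by value, then binary-search both bounds in bits()) might look roughly like the sketch below. This is not the attached patch: the class name SortedRangeSketch, its constructor taking one long per document, and the inclusive bounds are all assumptions made for illustration, and it uses only java.util classes rather than any Lucene API.

```java
import java.util.Arrays;
import java.util.BitSet;

// Hypothetical illustration only; not the MemoryCachedRangeFilter patch.
final class SortedRangeSketch {
    private final long[] values; // field values, sorted ascending
    private final int[] docIds;  // docIds[i] is the document that holds values[i]

    // valueByDoc[d] is the (assumed) numeric field value of document d.
    SortedRangeSketch(long[] valueByDoc) {
        int n = valueByDoc.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        // The "warmup" step: sort document ids by their field value.
        Arrays.sort(order, (a, b) -> Long.compare(valueByDoc[a], valueByDoc[b]));
        values = new long[n];
        docIds = new int[n];
        for (int i = 0; i < n; i++) {
            docIds[i] = order[i];
            values[i] = valueByDoc[order[i]];
        }
    }

    // First index i such that values[i] >= key (values.length if none).
    private int lowerBound(long key) {
        int lo = 0, hi = values.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (values[mid] < key) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    // First index i such that values[i] > key (values.length if none).
    private int upperBound(long key) {
        int lo = 0, hi = values.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (values[mid] <= key) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    // Sets one bit per document whose value lies in [lower, upper], both inclusive.
    BitSet bits(long lower, long upper) {
        BitSet result = new BitSet(values.length);
        int start = lowerBound(lower);
        int end = upperBound(upper);
        for (int i = start; i < end; i++) {
            result.set(docIds[i]);
        }
        return result;
    }
}
```

With two parallel arrays like these, the per-document cost is 4 bytes for the int plus 8 bytes for the long, which matches the (sizeof(int) + sizeof(long)) * numDocs formula; for the 3M document corpus mentioned above that works out to about 36 MB.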
[jira] Commented: (LUCENE-1471) Faster MultiSearcher.search merge docs
[ https://issues.apache.org/jira/browse/LUCENE-1471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12651876#action_12651876 ]

Luke Nezda commented on LUCENE-1471:

I had a look at this code and it looks like an easy opportunity. Here's my analysis:

* let m = searchables.length
* let n = nDocs

- Current performance: n * m * lg( n )
    n * m * lg( n ) + // fill queue
    n * lg( n ) // drain queue into scoreDocs[]
  if each searcher read has n worse documents than the one before it
- Possible performance: n * lg( m )
    m * lg( m ) + // init queue
    n * lg( m ) + // drain fill queue

I'll attach a patch for the {{MultiSearcher}} {{search()}} methods that supports searching both with and without {{Sort}}. It's a little kludgy: I had to remove {{final}} from {{FieldDocSortedHitQueue}}'s {{lessThan}} method and do some casting. All tests pass.

I doubt much search time is tied up here, since this is all in-memory and n and m are usually small.

Faster MultiSearcher.search merge docs

Key: LUCENE-1471
URL: https://issues.apache.org/jira/browse/LUCENE-1471
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Affects Versions: 2.4
Reporter: Jason Rutherglen
Priority: Minor
Original Estimate: 8h
Remaining Estimate: 8h

MultiSearcher.search places sorted search results from individual searchers into a PriorityQueue. This can be made more efficient by taking advantage of the fact that the results returned are already sorted. The proposed solution places the sub-searcher results iterator into a custom PriorityQueue that produces the sorted ScoreDocs.
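The "possible performance" case in the analysis above is a k-way merge: the queue holds one cursor per sub-searcher (m entries) instead of every hit, so each of the n emitted results costs lg( m ). A minimal sketch of that idea follows. It is not the attached multisearcher.patch; the names KWayMergeSketch, Cursor and mergeTop are hypothetical, hits are reduced to bare float scores, and it uses java.util.PriorityQueue rather than Lucene's own PriorityQueue class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration only; not the attached patch.
final class KWayMergeSketch {

    // Cursor over one sub-searcher's hits, which are already sorted best-first.
    private static final class Cursor {
        final float[] scores;
        int pos;

        Cursor(float[] scores) {
            this.scores = scores;
        }

        float current() {
            return scores[pos];
        }
    }

    // Returns the n best scores across all lists, best first.
    // Each inner array must already be sorted in descending order.
    static List<Float> mergeTop(float[][] sortedDescending, int n) {
        // Queue of at most m cursors, ordered by each cursor's current (best remaining) score.
        PriorityQueue<Cursor> queue = new PriorityQueue<>(
                Math.max(1, sortedDescending.length),
                (a, b) -> Float.compare(b.current(), a.current()));

        // Init queue: m insertions of lg(m) each.
        for (float[] hits : sortedDescending) {
            if (hits.length > 0) {
                queue.add(new Cursor(hits));
            }
        }

        // Drain and refill: at most n iterations of lg(m) each.
        List<Float> merged = new ArrayList<>();
        while (merged.size() < n && !queue.isEmpty()) {
            Cursor top = queue.poll();
            merged.add(top.current());
            top.pos++;
            if (top.pos < top.scores.length) {
                queue.add(top); // re-insert with its next-best hit as the key
            }
        }
        return merged;
    }
}
```

A cursor's key only changes while the cursor is outside the queue (between poll() and add()), so the heap ordering is never invalidated.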
[jira] Updated: (LUCENE-1471) Faster MultiSearcher.search merge docs
[ https://issues.apache.org/jira/browse/LUCENE-1471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Luke Nezda updated LUCENE-1471:

Attachment: multisearcher.patch
[jira] Commented: (LUCENE-1369) Eliminate unnecessary uses of Hashtable and Vector
[ https://issues.apache.org/jira/browse/LUCENE-1369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12651889#action_12651889 ]

Mark Miller commented on LUCENE-1369:

Did we break our back compat guarantee here? This changes some protected signatures in queryparser. If someone was overriding them (which is what they are intended for), dropping in the new jar could cause hard-to-track-down silent changes (the new method is called; the old one you may have overridden is not).

There is a similar issue with adding more expressive range query syntax that I plan to finish up, so what's the verdict on these types of changes? Might as well do as many at once as we can if we are going to do it.

Eliminate unnecessary uses of Hashtable and Vector

Key: LUCENE-1369
URL: https://issues.apache.org/jira/browse/LUCENE-1369
Project: Lucene - Java
Issue Type: Improvement
Affects Versions: 2.3.2
Reporter: DM Smith
Assignee: Michael McCandless
Priority: Minor
Fix For: 2.4
Attachments: LUCENE-1369.patch

Lucene uses Vector, Hashtable and Enumeration when it doesn't need to. Changing to ArrayList and HashMap may provide better performance.

There are a few places Vector shows up in the API. IMHO, List should have been used for parameters and return values.

There are a few distinct usages of these classes:
# Internal, but ArrayList or HashMap would do as well. These can simply be replaced.
# Internal, and synchronization is required. Either leave as is or use a collections synchronization wrapper.
# As a parameter to a method where List or Map would do as well. For contrib, just replace. For core, deprecate current and add new method signature.
# Generated by JavaCC. (All *.jj files.) Nothing to be done here.
# As a base class. Not sure what to do here. (Only applies to SegmentInfos extends Vector, but it is not used in a safe manner in all places. Perhaps implements List would be better.)
# As a return value from a package-protected method, but synchronization is not used. Change return type.
# As a return value from a final method. Change to List or Map.

In using a Vector the following iteration pattern is frequently used:

    for (int i = 0; i < v.size(); i++) {
      Object o = v.elementAt(i);
    }

This is an indication that synchronization is unimportant. The list could change during iteration.
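The back-compat concern in Mark Miller's comment (a subclass override that silently stops being called once a protected parameter type changes from Vector to List) can be reproduced in a few lines. The classes below are hypothetical stand-ins, not the real QueryParser code: BaseParser plays the library class after the signature change, and CustomParser plays user code written against the old Vector signature.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Vector;

// Hypothetical library class AFTER the change: the hook now takes a List.
class BaseParser {
    String parse() {
        List<String> clauses = new ArrayList<>();
        clauses.add("a");
        return buildQuery(clauses); // the library only ever calls the List variant
    }

    protected String buildQuery(List<String> clauses) {
        return "base:" + clauses;
    }
}

// Hypothetical user code written against the OLD hook, buildQuery(Vector).
// It still compiles and links, but it is now an overload, not an override.
class CustomParser extends BaseParser {
    protected String buildQuery(Vector<String> clauses) {
        return "custom:" + clauses;
    }
}

class OverrideDemo {
    public static void main(String[] args) {
        // The customization is skipped without any compile-time or runtime error.
        System.out.println(new CustomParser().parse());
    }
}
```

Annotating the subclass method with @Override would turn this into a compile error when rebuilding against the new jar, but it does not help users who only swap the jar without recompiling.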
[jira] Commented: (LUCENE-1369) Eliminate unnecessary uses of Hashtable and Vector
[ https://issues.apache.org/jira/browse/LUCENE-1369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12651894#action_12651894 ]

Yonik Seeley commented on LUCENE-1369:

It's definitely iffy - that's why I didn't do these replacements in QueryParser when I did the others.
[jira] Created: (LUCENE-1472) DateTools.stringToDate() can cause lock contention under load
DateTools.stringToDate() can cause lock contention under load

Key: LUCENE-1472
URL: https://issues.apache.org/jira/browse/LUCENE-1472
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Affects Versions: 2.3.2
Reporter: Mark Lassau
Priority: Minor

Load testing our application (the JIRA Issue Tracker) has shown that threads spend a lot of time blocked in DateTools.stringToDate().

The stringToDate() method uses a singleton SimpleDateFormat object to parse the dates. Each call to parse is *synchronized* because SimpleDateFormat is not thread safe.
[jira] Commented: (LUCENE-1472) DateTools.stringToDate() can cause lock contention under load
[ https://issues.apache.org/jira/browse/LUCENE-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12651899#action_12651899 ]

Mark Lassau commented on LUCENE-1472:

The following methods would potentially suffer contention as well, depending on usage patterns of the particular app:
* stringToTime()
* dateToString()
* timeToString()
[jira] Commented: (LUCENE-1472) DateTools.stringToDate() can cause lock contention under load
[ https://issues.apache.org/jira/browse/LUCENE-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12651900#action_12651900 ]

Mark Lassau commented on LUCENE-1472:

[SimpleDateFormat javadoc|http://java.sun.com/javase/6/docs/api/java/text/SimpleDateFormat.html]:
{quote}
Date formats are not synchronized. It is recommended to create separate format instances for each thread. If multiple threads access a format concurrently, it must be synchronized externally.
{quote}
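One common way to follow the javadoc's advice of "separate format instances for each thread" is a ThreadLocal, which removes the shared lock entirely. The sketch below is only an illustration of that technique, not a proposed patch for DateTools: the class name PerThreadDateParser, the "yyyyMMdd" pattern and the GMT time zone are all assumptions chosen for the example.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Hypothetical illustration only; not the DateTools implementation.
final class PerThreadDateParser {

    // Each thread lazily gets its own SimpleDateFormat, so no synchronization is needed.
    private static final ThreadLocal<SimpleDateFormat> FORMAT =
            ThreadLocal.withInitial(() -> {
                SimpleDateFormat f = new SimpleDateFormat("yyyyMMdd"); // assumed pattern
                f.setTimeZone(TimeZone.getTimeZone("GMT"));            // assumed time zone
                f.setLenient(false);
                return f;
            });

    private PerThreadDateParser() {
    }

    // Parses a day string such as "20081202" using the calling thread's own formatter.
    static Date parseDay(String s) {
        try {
            return FORMAT.get().parse(s);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Unparseable date: " + s, e);
        }
    }
}
```

The trade-off is one formatter instance per thread that ever calls parseDay(), which is usually small next to the cost of threads blocking on a single shared lock. The same pattern would apply to the formatting direction (dateToString() and friends) mentioned in the earlier comment.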
[jira] Updated: (LUCENE-1472) DateTools.stringToDate() can cause lock contention under load
[ https://issues.apache.org/jira/browse/LUCENE-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Lassau updated LUCENE-1472:

Description:
Load testing our application (the JIRA Issue Tracker) has shown that threads spend a lot of time blocked in DateTools.stringToDate(). The stringToDate() method uses a singleton SimpleDateFormat object to parse the dates. Each call to SimpleDateFormat.parse() is *synchronized* because SimpleDateFormat is not thread safe.

was:
Load testing our application (the JIRA Issue Tracker) has shown that threads spend a lot of time blocked in DateTools.stringToDate(). The stringToDate() method uses a singleton SimpleDateFormat object to parse the dates. Each call to parse is *synchronized* because SimpleDateFormat is not thread safe.
Re: InstantiatedIndexWriter
Hi Karl,

I can update InstantiatedIndexWriter to work with the new TokenStream API. What about MemoryIndex? Is it incompatible now as well?

Jason

On 11/26/08, Karl Wettin [EMAIL PROTECTED] wrote:

I was just about to get on with LUCENE-1462 when I noticed the new TokenStream API. (Yeah, I've been really busy with other stuff for a while now.)

Rather than keeping InstantiatedIndexWriter in sync with IndexWriter, I'm considering suggesting that we simply delete InstantiatedIndexWriter.

There is one major caveat that would go away if we removed InstantiatedIndexWriter: it lacks read/write locks at commit time. Also, the javadocs say "consider using II as an immutable store" all over the place.

I'm a bit split here. I can see the use of being able to add a few documents to an existing II, but at the same time these indices are meant to be really small, so creating a new one from an IndexReader is really no big deal. This operation means a few seconds of overhead if one needs to append data to the II.

I say that we should remove it from trunk. Less hassle. Or would this remove good functionality? I never use it; it was written in order to understand Lucene. But if people find it very useful then of course it should be kept in there.

That might be a problem for some people. For instance, I think Jason Rutherglen's realtime search uses this class.

karl