[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2008-11-30 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12651837#action_12651837
 ] 

Paul Elschot commented on LUCENE-855:
-

On the face of it, this has some overlap with the recent FieldCacheRangeFilter
of LUCENE-1461.
Any comments?

 MemoryCachedRangeFilter to boost performance of Range queries
 -

 Key: LUCENE-855
 URL: https://issues.apache.org/jira/browse/LUCENE-855
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.1
Reporter: Andy Liu
 Attachments: contrib-filters.tar.gz, FieldCacheRangeFilter.patch, 
 FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, 
 FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, 
 FieldCacheRangeFilter.patch, FieldCacheRangeFilter_Lucene_2.3.0.patch, 
 MemoryCachedRangeFilter.patch, MemoryCachedRangeFilter_1.4.patch, 
 TestRangeFilterPerformanceComparison.java, 
 TestRangeFilterPerformanceComparison.java


 Currently RangeFilter uses TermEnum and TermDocs to find documents that fall 
 within the specified range.  This requires iterating through every single 
 term in the index and can get rather slow for large document sets.
 MemoryCachedRangeFilter reads all docId, value pairs of a given field, 
 sorts by value, and stores in a SortedFieldCache.  During bits(), binary 
 searches are used to find the start and end indices of the lower and upper 
 bound values.  The BitSet is populated by all the docId values that fall in 
 between the start and end indices.
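
 The approach described above can be sketched roughly as follows.  This is a
 minimal illustration, not the code from the attached patch: the class name
 SortedFieldCacheSketch, its constructor argument and the inclusive-range
 semantics of bits() are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.BitSet;

/** Hypothetical stand-in for the SortedFieldCache idea; not the patch's actual code. */
final class SortedFieldCacheSketch {
    private final long[] values; // field values in ascending order
    private final int[] docIds;  // docIds[i] is the document that holds values[i]

    /** Warmup: pair every docId with its value and sort the pairs by value. */
    SortedFieldCacheSketch(long[] valueByDocId) {
        int n = valueByDocId.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (a, b) -> Long.compare(valueByDocId[a], valueByDocId[b]));
        values = new long[n];
        docIds = new int[n];
        for (int i = 0; i < n; i++) {
            docIds[i] = order[i];
            values[i] = valueByDocId[order[i]];
        }
    }

    /** Index of the first entry whose value is >= key (values.length if none). */
    private int lowerBound(long key) {
        int lo = 0;
        int hi = values.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (values[mid] < key) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    /** Index of the first entry whose value is > key (values.length if none). */
    private int upperBound(long key) {
        int lo = 0;
        int hi = values.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (values[mid] <= key) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    /** Documents whose value lies in the inclusive range [lower, upper]. */
    BitSet bits(long lower, long upper) {
        BitSet result = new BitSet(values.length);
        if (lower > upper) {
            return result;
        }
        int start = lowerBound(lower);
        int end = upperBound(upper);
        for (int i = start; i < end; i++) {
            result.set(docIds[i]);
        }
        return result;
    }
}
```

 Each call to bits() costs two binary searches plus one pass over the
 matching entries, independent of how many distinct terms the field has.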
 TestMemoryCachedRangeFilterPerformance creates a 100K RAMDirectory-backed 
 index with random date values within a 5 year range.  Executing bits() 1000 
 times on standard RangeQuery using random date intervals took 63904ms.  Using 
 MemoryCachedRangeFilter, it took 876ms (roughly 73 times faster).  The
 performance increase is less dramatic when a field has fewer unique terms or
 the index has fewer documents.
 Currently MemoryCachedRangeFilter only works with numeric values (values are
 stored in a long[] array) but it can be easily changed to support Strings.  A
 side benefit of storing the values as longs is that there's no longer any
 need to make the values lexicographically comparable, i.e. by padding
 numeric values with zeros.
 The downside of using MemoryCachedRangeFilter is there's a fairly significant 
 memory requirement.  So it's designed to be used in situations where range 
 filter performance is critical and memory consumption is not an issue.  The 
 memory requirements are: (sizeof(int) + sizeof(long)) * numDocs.  
 MemoryCachedRangeFilter also requires a warmup step which can take a while to 
 run in large datasets (it took 40s to run on a 3M document corpus).  Warmup 
 can be called explicitly or is automatically called the first time 
 MemoryCachedRangeFilter is applied using a given field.
 So in summary, MemoryCachedRangeFilter can be useful when:
 - Performance is critical
 - Memory is not an issue
 - Field contains many unique numeric values
 - Index contains a large number of documents
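
 To put the memory formula quoted above, (sizeof(int) + sizeof(long)) *
 numDocs, into numbers: in Java that is 4 + 8 = 12 bytes per document.  The
 helper below is a hypothetical illustration only; it ignores array headers
 and any other JVM overhead.

```java
/** Hypothetical helper illustrating the memory formula from the description. */
final class RangeFilterMemorySketch {
    /** Bytes for one int docId plus one long value per document. */
    static long estimateBytes(long numDocs) {
        return (Integer.BYTES + Long.BYTES) * numDocs;
    }
}
```

 For the 3M document corpus mentioned above, estimateBytes(3000000L) gives
 36,000,000 bytes, i.e. roughly 34 MiB per cached field.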

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1471) Faster MultiSearcher.search merge docs

2008-11-30 Thread Luke Nezda (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12651876#action_12651876
 ] 

Luke Nezda commented on LUCENE-1471:


I had a look at this code and it looks like an easy opportunity.  Here's my
analysis:
  * let m = searchables.length
  * let n = nDocs
  - Current performance: n * m * lg( n )
      n * m * lg( n ) + // fill queue
      n * lg( n )       // drain queue into scoreDocs[]
    if each searcher read has n worse documents than the one before it
  - Possible performance: n * lg( m )
      m * lg( m ) +     // init queue
      n * lg( m )       // drain & fill queue

I'll attach a patch for the {{MultiSearcher}} {{search()}} methods that supports
searching with and without {{Sort}}.  It's a little kludgy - had to remove {{final}} from
{{FieldDocSortedHitQueue}}'s {{lessThan}} method and do some casting.  All
tests pass.  I doubt much search time is tied up here since this is all 
in-memory and n and m are usually small.
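
The n * lg( m ) bound comes from a standard k-way merge: the queue holds one
cursor per sub-searcher instead of every hit.  A minimal sketch of that idea
follows.  It is not the attached patch; the Hit and Cursor classes and the
merge() signature are hypothetical, and ties between equal scores are not
broken by document id the way a real implementation would need to.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

final class SortedHitMergeSketch {
    /** Hypothetical minimal hit: a document id and its score. */
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    /** Position within one sub-searcher's hits, which are sorted by descending score. */
    private static final class Cursor {
        final List<Hit> hits;
        int pos;

        Cursor(List<Hit> hits) {
            this.hits = hits;
        }

        Hit current() {
            return hits.get(pos);
        }
    }

    /** Returns the top n hits across all sub-searchers in descending score order. */
    static List<Hit> merge(List<List<Hit>> perSearcher, int n) {
        // The queue never holds more than m cursors, so each operation is O(lg m).
        PriorityQueue<Cursor> queue = new PriorityQueue<>(
                Math.max(1, perSearcher.size()),
                (a, b) -> Float.compare(b.current().score, a.current().score));
        for (List<Hit> hits : perSearcher) {
            if (!hits.isEmpty()) {
                queue.add(new Cursor(hits)); // init queue
            }
        }
        List<Hit> merged = new ArrayList<>();
        while (merged.size() < n && !queue.isEmpty()) {
            Cursor best = queue.poll();      // drain the current best hit
            merged.add(best.current());
            best.pos++;
            if (best.pos < best.hits.size()) {
                queue.add(best);             // refill from the same sub-searcher
            }
        }
        return merged;
    }
}
```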

 Faster MultiSearcher.search merge docs 
 ---

 Key: LUCENE-1471
 URL: https://issues.apache.org/jira/browse/LUCENE-1471
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.4
Reporter: Jason Rutherglen
Priority: Minor
   Original Estimate: 8h
  Remaining Estimate: 8h

 MultiSearcher.search places sorted search results from individual searchers
 into a PriorityQueue.  This can be made more efficient by taking
 advantage of the fact that the results returned are already sorted.
 The proposed solution places each sub-searcher's results iterator into a custom
 PriorityQueue that produces the sorted ScoreDocs.




[jira] Updated: (LUCENE-1471) Faster MultiSearcher.search merge docs

2008-11-30 Thread Luke Nezda (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke Nezda updated LUCENE-1471:
---

Attachment: multisearcher.patch

 Faster MultiSearcher.search merge docs 
 ---

 Key: LUCENE-1471
 URL: https://issues.apache.org/jira/browse/LUCENE-1471
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.4
Reporter: Jason Rutherglen
Priority: Minor
 Attachments: multisearcher.patch

   Original Estimate: 8h
  Remaining Estimate: 8h





[jira] Commented: (LUCENE-1369) Eliminate unnecessary uses of Hashtable and Vector

2008-11-30 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12651889#action_12651889
 ] 

Mark Miller commented on LUCENE-1369:
-

Did we break our back compat guarantee here? This changes some protected
signatures in queryparser. If someone was overriding them (which is what they
are intended for), dropping in the new jar could cause hard-to-track-down
silent changes (the new method is called, the old one you may have overridden
is not). There is a similar issue with adding more expressive range query
syntax that I plan to finish up, so what's the verdict on these types of
changes? Might as well do as many at once as we can if we are going to do it.
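
The silent change described above can be illustrated with a small sketch.
All class and method names here are hypothetical stand-ins rather than the
actual QueryParser signatures: the base class represents the parser after a
protected hook was switched from Vector to List, and the subclass represents
user code written against the old signature.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Vector;

/** Hypothetical parser after the change: the protected hook now takes a List. */
class NewParserSketch {
    protected String buildQuery(List<String> clauses) {
        return "default:" + clauses;
    }

    public String parse(String text) {
        List<String> clauses = new ArrayList<>();
        clauses.add(text);
        return buildQuery(clauses); // always dispatches to the List signature
    }
}

/** Hypothetical user subclass written against the old Vector signature. */
class CustomParserSketch extends NewParserSketch {
    // Against the old base class this was an override. Against the new one it
    // is merely an overload, so parse() never reaches it and nothing warns.
    protected String buildQuery(Vector<String> clauses) {
        return "custom:" + clauses;
    }
}
```

Calling new CustomParserSketch().parse("foo") yields "default:[foo]": the
customization is skipped without any compile-time or runtime error.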

 Eliminate unnecessary uses of Hashtable and Vector
 --

 Key: LUCENE-1369
 URL: https://issues.apache.org/jira/browse/LUCENE-1369
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 2.3.2
Reporter: DM Smith
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.4

 Attachments: LUCENE-1369.patch


 Lucene uses Vector, Hashtable and Enumeration when it doesn't need to. 
 Changing to ArrayList and HashMap may provide better performance.
 There are a few places Vector shows up in the API. IMHO, List should have 
 been used for parameters and return values.
 There are a few distinct usages of these classes:
 # Internal, but ArrayList or HashMap would do as well. These can simply
 be replaced.
 # Internal, and synchronization is required. Either leave as is or use a
 collections synchronization wrapper.
 # As a parameter to a method where List or Map would do as well. For contrib, 
 just replace. For core, deprecate current and add new method signature.
 # Generated by JavaCC. (All *.jj files.) Nothing to be done here.
 # As a base class. Not sure what to do here. (Only applies to SegmentInfos 
 extends Vector, but it is not used in a safe manner in all places. Perhaps, 
 implements List would be better.)
 # As a return value from a package protected method, but synchronization is 
 not used. Change return type.
 # As a return value to a final method. Change to List or Map.
 In using a Vector the following iteration pattern is frequently used.
 for (int i = 0; i < v.size(); i++) {
   Object o = v.elementAt(i);
 }
 This is an indication that synchronization is unimportant. The list could 
 change during iteration.
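
 As a rough illustration of the first two cases in the list above (the class
 and method names are hypothetical, not taken from the patch): unsynchronized
 internal code can simply iterate a List, while code that really needs
 synchronization can use a Collections.synchronizedList wrapper, provided the
 caller holds the list's lock for the whole iteration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class ListMigrationSketch {
    /** Internal use without synchronization: any List, e.g. an ArrayList, will do. */
    static int totalLength(List<String> items) {
        int total = 0;
        for (String item : items) {
            total += item.length();
        }
        return total;
    }

    /** Creates a list whose individual operations are synchronized. */
    static List<String> newSharedList() {
        return Collections.synchronizedList(new ArrayList<String>());
    }

    /** Synchronization required: lock the wrapper so the list cannot change mid-iteration. */
    static int totalLengthShared(List<String> shared) {
        synchronized (shared) {
            int total = 0;
            for (String item : shared) {
                total += item.length();
            }
            return total;
        }
    }
}
```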




[jira] Commented: (LUCENE-1369) Eliminate unnecessary uses of Hashtable and Vector

2008-11-30 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12651894#action_12651894
 ] 

Yonik Seeley commented on LUCENE-1369:
--

It's definitely iffy - that's why I didn't do these replacements in QueryParser 
when I did the others.

 Eliminate unnecessary uses of Hashtable and Vector
 --

 Key: LUCENE-1369
 URL: https://issues.apache.org/jira/browse/LUCENE-1369
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 2.3.2
Reporter: DM Smith
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.4

 Attachments: LUCENE-1369.patch






[jira] Created: (LUCENE-1472) DateTools.stringToDate() can cause lock contention under load

2008-11-30 Thread Mark Lassau (JIRA)
DateTools.stringToDate() can cause lock contention under load
-

 Key: LUCENE-1472
 URL: https://issues.apache.org/jira/browse/LUCENE-1472
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.3.2
Reporter: Mark Lassau
Priority: Minor


Load testing our application (the JIRA Issue Tracker) has shown that threads 
spend a lot of time blocked in DateTools.stringToDate().

The stringToDate() method uses a singleton SimpleDateFormat object to parse the 
dates.
Each call to parse is *synchronized* because SimpleDateFormat is not thread 
safe.
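
One common way to avoid this kind of contention is to give each thread its
own SimpleDateFormat via a ThreadLocal, so no lock is needed.  The sketch
below is an assumption about how that could look, not a proposed patch: the
class name is hypothetical, and the pattern and GMT time zone are chosen for
illustration only.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

/** Hypothetical lock-free alternative to a shared, synchronized SimpleDateFormat. */
final class PerThreadDateParserSketch {
    // One formatter per thread; the pattern and time zone are assumptions for this example.
    private static final ThreadLocal<SimpleDateFormat> FORMAT =
            new ThreadLocal<SimpleDateFormat>() {
                @Override
                protected SimpleDateFormat initialValue() {
                    SimpleDateFormat format =
                            new SimpleDateFormat("yyyyMMddHHmmssSSS", Locale.US);
                    format.setTimeZone(TimeZone.getTimeZone("GMT"));
                    return format;
                }
            };

    /** Parses without synchronization; each thread only touches its own instance. */
    static Date stringToDate(String text) throws ParseException {
        return FORMAT.get().parse(text);
    }

    /** Formats without synchronization, using the same per-thread instance. */
    static String dateToString(Date date) {
        return FORMAT.get().format(date);
    }
}
```

The trade-off is one formatter instance per thread that touches the class,
which is usually a small cost compared with threads blocking on a shared lock.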






[jira] Commented: (LUCENE-1472) DateTools.stringToDate() can cause lock contention under load

2008-11-30 Thread Mark Lassau (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12651899#action_12651899
 ] 

Mark Lassau commented on LUCENE-1472:
-

The following methods would potentially suffer contention as well depending on 
usage patterns of the particular app:
* stringToTime()
* dateToString()
* timeToString()


 DateTools.stringToDate() can cause lock contention under load
 -

 Key: LUCENE-1472
 URL: https://issues.apache.org/jira/browse/LUCENE-1472
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.3.2
Reporter: Mark Lassau
Priority: Minor





[jira] Commented: (LUCENE-1472) DateTools.stringToDate() can cause lock contention under load

2008-11-30 Thread Mark Lassau (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12651900#action_12651900
 ] 

Mark Lassau commented on LUCENE-1472:
-

[SimpleDateFormat 
javadoc|http://java.sun.com/javase/6/docs/api/java/text/SimpleDateFormat.html]:

{quote}
Date formats are not synchronized.
It is recommended to create separate format instances for each thread.
If multiple threads access a format concurrently, it must be synchronized 
externally.
{quote}


 DateTools.stringToDate() can cause lock contention under load
 -

 Key: LUCENE-1472
 URL: https://issues.apache.org/jira/browse/LUCENE-1472
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.3.2
Reporter: Mark Lassau
Priority: Minor





[jira] Updated: (LUCENE-1472) DateTools.stringToDate() can cause lock contention under load

2008-11-30 Thread Mark Lassau (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Lassau updated LUCENE-1472:


Description: 
Load testing our application (the JIRA Issue Tracker) has shown that threads 
spend a lot of time blocked in DateTools.stringToDate().

The stringToDate() method uses a singleton SimpleDateFormat object to parse the 
dates.
Each call to SimpleDateFormat.parse() is *synchronized* because 
SimpleDateFormat is not thread safe.



  was:
Load testing our application (the JIRA Issue Tracker) has shown that threads 
spend a lot of time blocked in DateTools.stringToDate().

The stringToDate() method uses a singleton SimpleDateFormat object to parse the 
dates.
Each call to parse is *synchronized* because SimpleDateFormat is not thread 
safe.




 DateTools.stringToDate() can cause lock contention under load
 -

 Key: LUCENE-1472
 URL: https://issues.apache.org/jira/browse/LUCENE-1472
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.3.2
Reporter: Mark Lassau
Priority: Minor





Re: InstantiatedIndexWriter

2008-11-30 Thread Jason Rutherglen
Hi Karl,

I can update InstantiatedIndexWriter to work with the new TokenStream
API.  What about MemoryIndex?  Is it incompatible now as well?

Jason

On 11/26/08, Karl Wettin [EMAIL PROTECTED] wrote:
 I was just about to get on with LUCENE-1462 when I noticed the new
 TokenStream API. (Yeah, I've been really busy with other stuff for a
 while now.)

 Rather than keeping InstantiatedIndexWriter in sync with IndexWriter
 I'm considering suggesting that we simply delete
 InstantiatedIndexWriter.

 There is one major caveat that would go away if we removed
 InstantiatedIndexWriter: it lacks read/write locks at commit time.
 Also, the javadocs say "consider using II as an immutable store" all
 over the place.

 I'm a bit split here. I can see the use of being able to add a few
 documents to an existing II, but at the same time these indices are
 meant to be really small, so creating a new one from an IndexReader is
 really no big deal. This operation means a few seconds of overhead if
 one needs to append data to the II.


 I say that we should remove it from trunk. Less hassle. Or would this
 remove good functionality? I never use it; it was written in order to
 understand Lucene. But if people find it very useful then of course
 it should be kept in there.

 That might be a problem for some people. For instance, I think Jason
 Rutherglen's realtime search uses this class.


   karl
