[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it

2011-10-24 Thread Uwe Schindler (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1536:
--

Attachment: LUCENE-1536.patch

Here the updated patch after some changes in trunk. It also adds missCount 
checks back to Caching*Filters, I lost then during cleanup.

 if a filter can support random access API, we should use it
 ---

 Key: LUCENE-1536
 URL: https://issues.apache.org/jira/browse/LUCENE-1536
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/search
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Fix For: 4.0

 Attachments: CachedFilterIndexReader.java, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536_hack.patch, changes-yonik-uwe.patch, luceneutil.patch


 I ran some performance tests, comparing applying a filter via
 random-access API instead of current trunk's iterator API.
 This was inspired by LUCENE-1476, where we realized deletions should
 really be implemented just like a filter, but then in testing found
 that switching deletions to iterator was a very sizable performance
 hit.
 Some notes on the test:
   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
   * I test across multiple queries.  1-X means an OR query, eg 1-4
 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
 AND 3 AND 4.  u s means united states (phrase search).
   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
 95, 98, 99, 99.9 (filter is non-null but all bits are set),
 100 (filter=null, control)).
   * Method high means I use random-access filter API in
 IndexSearcher's main loop.  Method low means I use random-access
 filter API down in SegmentTermDocs (just like deleted docs
 today).
   * Baseline (QPS) is current trunk, where filter is applied as iterator up
 high (ie in IndexSearcher's search loop).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it

2011-10-11 Thread Uwe Schindler (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1536:
--

Attachment: LUCENE-1536.patch

Patch that fixes the Weight.scoreDocsOutOfOrder method to return the inner 
weight's setting. The scorer can still return docs in order, but that was 
identical behaviour in previous unpatched trunk (IS looked at the out-of- order 
setting of the weight and uses correct collector, but once a filter was 
applied, the documents came in order). My patch only missed to pass this 
setting to our wrapper query.

Mike: If you have time, can you check this? We may need a test, that uses a 
larger index and tests FilteredQuery on top of it, the current indexes used for 
filtering are simply too small and in most cases have only one segment :(

There is no need for Robert's hack (that does not work correctly with aceptDocs 
!= liveDocs), if different BooleanScorers return significant different scores, 
it as a bug, not a problem in FilteredQuery. Slight score changes and therefor 
different order in results is not a problem at all - this is just my opinion.

bq. If we want to check that the results are identical, the benchmark test must 
explicitely request docs-in-order on trunk vs. patch to be consistent. But then 
it's no longer a benchmark.

This is of course untrue, sorry. If the weight returns that docs *may* come out 
of order, the collector should handle this.

 if a filter can support random access API, we should use it
 ---

 Key: LUCENE-1536
 URL: https://issues.apache.org/jira/browse/LUCENE-1536
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/search
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Fix For: 4.0

 Attachments: CachedFilterIndexReader.java, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536_hack.patch, changes-yonik-uwe.patch, 
 luceneutil.patch


 I ran some performance tests, comparing applying a filter via
 random-access API instead of current trunk's iterator API.
 This was inspired by LUCENE-1476, where we realized deletions should
 really be implemented just like a filter, but then in testing found
 that switching deletions to iterator was a very sizable performance
 hit.
 Some notes on the test:
   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
   * I test across multiple queries.  1-X means an OR query, eg 1-4
 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
 AND 3 AND 4.  u s means united states (phrase search).
   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
 95, 98, 99, 99.9 (filter is non-null but all bits are set),
 100 (filter=null, control)).
   * Method high means I use random-access filter API in
 IndexSearcher's main loop.  Method low means I use random-access
 filter API down in SegmentTermDocs (just like deleted docs
 today).
   * Baseline (QPS) is current trunk, where filter is applied as iterator up
 high (ie in IndexSearcher's search loop).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it

2011-10-11 Thread Uwe Schindler (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1536:
--

Attachment: (was: LUCENE-1536.patch)

 if a filter can support random access API, we should use it
 ---

 Key: LUCENE-1536
 URL: https://issues.apache.org/jira/browse/LUCENE-1536
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/search
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Fix For: 4.0

 Attachments: CachedFilterIndexReader.java, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536_hack.patch, changes-yonik-uwe.patch, luceneutil.patch


 I ran some performance tests, comparing applying a filter via
 random-access API instead of current trunk's iterator API.
 This was inspired by LUCENE-1476, where we realized deletions should
 really be implemented just like a filter, but then in testing found
 that switching deletions to iterator was a very sizable performance
 hit.
 Some notes on the test:
   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
   * I test across multiple queries.  1-X means an OR query, eg 1-4
 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
 AND 3 AND 4.  u s means united states (phrase search).
   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
 95, 98, 99, 99.9 (filter is non-null but all bits are set),
 100 (filter=null, control)).
   * Method high means I use random-access filter API in
 IndexSearcher's main loop.  Method low means I use random-access
 filter API down in SegmentTermDocs (just like deleted docs
 today).
   * Baseline (QPS) is current trunk, where filter is applied as iterator up
 high (ie in IndexSearcher's search loop).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it

2011-10-11 Thread Uwe Schindler (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1536:
--

Attachment: LUCENE-1536.patch

Sorry last patch upload was somehow corrupted, maybe because of JIRA issues. 
This is the one only implementing Weight.scoreDocsOutOfOrder(), no hacks - to 
have a clean start.

 if a filter can support random access API, we should use it
 ---

 Key: LUCENE-1536
 URL: https://issues.apache.org/jira/browse/LUCENE-1536
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/search
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Fix For: 4.0

 Attachments: CachedFilterIndexReader.java, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536_hack.patch, changes-yonik-uwe.patch, luceneutil.patch


 I ran some performance tests, comparing applying a filter via
 random-access API instead of current trunk's iterator API.
 This was inspired by LUCENE-1476, where we realized deletions should
 really be implemented just like a filter, but then in testing found
 that switching deletions to iterator was a very sizable performance
 hit.
 Some notes on the test:
   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
   * I test across multiple queries.  1-X means an OR query, eg 1-4
 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
 AND 3 AND 4.  u s means united states (phrase search).
   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
 95, 98, 99, 99.9 (filter is non-null but all bits are set),
 100 (filter=null, control)).
   * Method high means I use random-access filter API in
 IndexSearcher's main loop.  Method low means I use random-access
 filter API down in SegmentTermDocs (just like deleted docs
 today).
   * Baseline (QPS) is current trunk, where filter is applied as iterator up
 high (ie in IndexSearcher's search loop).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it

2011-10-11 Thread Uwe Schindler (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1536:
--

Attachment: (was: LUCENE-1536.patch)

 if a filter can support random access API, we should use it
 ---

 Key: LUCENE-1536
 URL: https://issues.apache.org/jira/browse/LUCENE-1536
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/search
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Fix For: 4.0

 Attachments: CachedFilterIndexReader.java, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536_hack.patch, changes-yonik-uwe.patch, luceneutil.patch


 I ran some performance tests, comparing applying a filter via
 random-access API instead of current trunk's iterator API.
 This was inspired by LUCENE-1476, where we realized deletions should
 really be implemented just like a filter, but then in testing found
 that switching deletions to iterator was a very sizable performance
 hit.
 Some notes on the test:
   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
   * I test across multiple queries.  1-X means an OR query, eg 1-4
 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
 AND 3 AND 4.  u s means united states (phrase search).
   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
 95, 98, 99, 99.9 (filter is non-null but all bits are set),
 100 (filter=null, control)).
   * Method high means I use random-access filter API in
 IndexSearcher's main loop.  Method low means I use random-access
 filter API down in SegmentTermDocs (just like deleted docs
 today).
   * Baseline (QPS) is current trunk, where filter is applied as iterator up
 high (ie in IndexSearcher's search loop).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it

2011-10-11 Thread Uwe Schindler (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1536:
--

Attachment: LUCENE-1536.patch

 if a filter can support random access API, we should use it
 ---

 Key: LUCENE-1536
 URL: https://issues.apache.org/jira/browse/LUCENE-1536
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/search
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Fix For: 4.0

 Attachments: CachedFilterIndexReader.java, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536_hack.patch, changes-yonik-uwe.patch, 
 luceneutil.patch


 I ran some performance tests, comparing applying a filter via
 random-access API instead of current trunk's iterator API.
 This was inspired by LUCENE-1476, where we realized deletions should
 really be implemented just like a filter, but then in testing found
 that switching deletions to iterator was a very sizable performance
 hit.
 Some notes on the test:
   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
   * I test across multiple queries.  1-X means an OR query, eg 1-4
 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
 AND 3 AND 4.  u s means united states (phrase search).
   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
 95, 98, 99, 99.9 (filter is non-null but all bits are set),
 100 (filter=null, control)).
   * Method high means I use random-access filter API in
 IndexSearcher's main loop.  Method low means I use random-access
 filter API down in SegmentTermDocs (just like deleted docs
 today).
   * Baseline (QPS) is current trunk, where filter is applied as iterator up
 high (ie in IndexSearcher's search loop).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it

2011-10-10 Thread Robert Muir (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-1536:


Attachment: LUCENE-1536_hack.patch

hack patch that computes the heuristic up front in weight init, so it scores 
all segments consistently and returns the proper scoresDocsOutOfOrder for BS1.

Uwe's new test (the nestedFilterQuery) doesnt pass yet, don't know why.

I recomputed the benchmarks:
{noformat}
Task   QPS trunkStdDev trunk   QPS patchStdDev patch  Pct 
diff
  PhraseF1.0   11.990.207.790.23  -37% -  
-31%
TermF0.5  135.147.62  116.570.36  -18% -   
-8%
   AndHighHighF100.0   17.340.78   15.440.15  -15% -   
-5%
AndHighHighF95.0   17.280.66   15.480.17  -14% -   
-5%
AndHighHighF90.0   17.310.76   15.580.19  -14% -   
-4%
AndHighHighF99.0   17.051.02   15.450.17  -15% -   
-2%
AndHighHighF75.0   17.470.78   16.030.15  -12% -   
-3%
 AndHighHighF5.0   20.690.95   19.780.23   -9% -
1%
 AndHighHighF1.0   35.111.46   33.640.36   -8% -
1%
 AndHighHighF0.1  136.043.70  132.001.41   -6% -
0%
 AndHighHigh   18.250.70   17.740.20   -7% -
2%
 AndHighHighF0.5   49.841.72   48.580.49   -6% -
1%
TermF0.1  351.18   11.01  345.851.73   -4% -
2%
Fuzzy2F100.0   95.524.21   94.332.07   -7% -
5%
  SloppyPhraseF100.08.010.287.910.09   -5% -
3%
 Fuzzy2F90.0   95.423.86   94.511.74   -6% -
5%
 Fuzzy2F95.0   95.204.86   94.331.83   -7% -
6%
  Fuzzy1F1.0   54.021.67   53.561.07   -5% -
4%
  PhraseF2.07.730.077.680.18   -3% -
2%
   SloppyPhraseF99.07.990.237.950.10   -4% -
3%
AndHighHighF50.0   17.540.79   17.460.12   -5% -
4%
  Fuzzy2F0.1  105.393.93  105.343.74   -7% -
7%
  SpanNearF100.03.160.063.160.04   -2% -
2%
 Fuzzy2F99.0   94.026.86   94.211.97   -8% -   
10%
 Fuzzy2F75.0   95.563.51   95.762.02   -5% -
6%
WildcardF2.0   52.790.27   53.050.57   -1% -
2%
  Fuzzy1F0.5   58.121.83   58.431.22   -4% -
5%
  PhraseF0.1   66.340.78   66.731.68   -3% -
4%
SloppyPhraseF0.1   56.151.52   56.790.64   -2% -
5%
SloppyPhrase8.080.268.180.08   -2% -
5%
PKLookup  176.595.07  178.965.71   -4% -
7%
SpanNearF0.1   32.360.56   32.830.54   -1% -
4%
  OrHighHighF0.1   78.200.52   79.440.740% -
3%
   SloppyPhraseF95.07.910.088.050.090% -
3%
  Fuzzy2   94.873.72   96.491.62   -3% -
7%
  OrHighHighF0.5   31.410.47   31.960.330% -
4%
   SpanNearF99.03.120.063.180.030% -
4%
WildcardF0.5   61.970.56   63.280.820% -
4%
  PhraseF0.5   19.780.26   20.290.310% -
5%
SpanNear3.190.083.270.05   -1% -
6%
WildcardF0.1   67.450.64   69.240.890% -
4%
   SloppyPhraseF90.08.000.298.210.12   -2% -
8%
   SpanNearF95.03.130.043.230.031% -
5%
Wildcard   43.190.34   44.641.400% -
7%
 Fuzzy2F50.0   95.124.22   98.692.28   -2% -   
11%
  Fuzzy1   55.284.53   57.680.76   -4% -   
15%
  OrHighHigh   12.130.99   12.710.43   -6% -   
18%
  Phrase3.600.043.810.043% -
7%
   SpanNearF90.03.150.053.350.043% -
9%
Term   71.690.40   76.534.130% -   
13%
 PhraseF99.03.430.033.680.045% -
9%
PhraseF100.03.390.053.670.045% -   
10%
   SloppyPhraseF75.08.04

[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it

2011-10-08 Thread Uwe Schindler (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1536:
--

Attachment: changes-yonik-uwe.patch
LUCENE-1536.patch
LUCENE-1536-rewrite.patch

Attached you will find a new patch LUCENE-1536.patch, incorporating Yonik's 
changes plus some minor improvements:
- changed Javadocs of DIS.bits() to explain what you should do/not do.
- Added another early exit condition in FilteredQuery#Weight.scorer(): As we 
already get the first matching doc of the filter iterator before looking at 
bits or creating the query scorer, we should erly exit, if the first matching 
doc is Disi.NO_MORE_DOCS. This   saves us from creating the Query Scorer.
- I removed Robert's safety TODO in SolrIndexSearcher. It no longer disabled 
random access completely. After Yoniks changes, all places in Solr that are not 
random access secure are disabled - e.g. SolrIndexSearcher.FilterImpl (not sure 
what this class does, maybe it should also implement bits()?) - we should do 
that in a Solr specific optimization issue.

Some other cool thing with filters is ANDing filters without ChainedFilter 
(this approach is is very effective with random access as it does not allocate 
additional BitsSet). If you want to AND together several filters and apply them 
to a Query, do the following:

{code:java}
IS.search(new FilteredQuery(query,filter2), filter1,...);
{code}

You can chain even more filters in by adding more FilteredQueries. What this 
does:
IS will automatically create another FilteredQuery to apply the filter and get 
the Weight of the top-level FilteredQuery. The scorer of this one will be 
top-level, get the filter and if it is random access, it will execute the 
filter with acceptDocs==liveDocs. The result bits of this filter will be passed 
to Weight.scorer of the second FilteredQuery as acceptDocs. This one will pass 
the acceptDocs (which are already filtered) to its Filter and if again random 
access pass those as acceptDocs to the inner Query's scorer. Finally the 
top-level IS will execute scorer.score(Collector), which in fact is the inner 
Query's scorer (no wrappers!) with all filtering applied in acceptDocs. This is 
incredible cool :-)

One thing about large patches in an issue:
If you are working on an issue and have you local changes in your checkout and 
posted a patch to an issue and somebody else, posted an updated patch to an 
issue, it is often nice to see the diff between those patches. I wanted to see 
what Yonik changed, but a 140 K patch is not easy to handle. The trick is 
interdiff from patchutils package: You can call interdiff 
LUCENE-1536-original.patch LUCENE-1536-yonik.patch and you get a patch of only 
changes applied by Yonik. This patch can even be applied to your local already 
patched checkout.

The changes-yonik-uwe.patch was generated that way and shows, what changes I 
did in my last patch in contrast to Yoniks original.

 if a filter can support random access API, we should use it
 ---

 Key: LUCENE-1536
 URL: https://issues.apache.org/jira/browse/LUCENE-1536
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/search
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Fix For: 4.0

 Attachments: CachedFilterIndexReader.java, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, changes-yonik-uwe.patch, luceneutil.patch


 I ran some performance tests, comparing applying a filter via
 random-access API instead of current trunk's iterator API.
 This was inspired by LUCENE-1476, where we realized deletions should
 really be implemented just like a filter, but then in testing found
 that switching deletions to iterator was a very sizable performance
 hit.
 Some notes on the test:
   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
   * I test across multiple queries.  1-X means an OR query, eg 1-4
 means 1 OR 2 OR 3 OR 4, whereas +1-4 

[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it

2011-10-08 Thread Robert Muir (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-1536:


Attachment: LUCENE-1536.patch

patch moving the heuristic to FilteredQuery, AssertingIndexSearcher overrides 
wrapFilter() and returns random.nextBoolean().

In some tests we explicitly tested on/off, i fixed these to still do this, 
however i think its a little overkill (since newSearcher randomizes this 
anyway), but i left them working the same way.

 if a filter can support random access API, we should use it
 ---

 Key: LUCENE-1536
 URL: https://issues.apache.org/jira/browse/LUCENE-1536
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/search
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Fix For: 4.0

 Attachments: CachedFilterIndexReader.java, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, changes-yonik-uwe.patch, 
 luceneutil.patch


 I ran some performance tests, comparing applying a filter via
 random-access API instead of current trunk's iterator API.
 This was inspired by LUCENE-1476, where we realized deletions should
 really be implemented just like a filter, but then in testing found
 that switching deletions to iterator was a very sizable performance
 hit.
 Some notes on the test:
   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
   * I test across multiple queries.  1-X means an OR query, eg 1-4
 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
 AND 3 AND 4.  u s means united states (phrase search).
   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
 95, 98, 99, 99.9 (filter is non-null but all bits are set),
 100 (filter=null, control)).
   * Method high means I use random-access filter API in
 IndexSearcher's main loop.  Method low means I use random-access
 filter API down in SegmentTermDocs (just like deleted docs
 today).
   * Baseline (QPS) is current trunk, where filter is applied as iterator up
 high (ie in IndexSearcher's search loop).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it

2011-10-08 Thread Uwe Schindler (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1536:
--

Attachment: LUCENE-1536.patch

Roberts patch with TestFilteredQuery cleaned up a little bit (not thousands of 
anonymous subclasses)

 if a filter can support random access API, we should use it
 ---

 Key: LUCENE-1536
 URL: https://issues.apache.org/jira/browse/LUCENE-1536
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/search
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Fix For: 4.0

 Attachments: CachedFilterIndexReader.java, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 changes-yonik-uwe.patch, luceneutil.patch


 I ran some performance tests, comparing applying a filter via
 random-access API instead of current trunk's iterator API.
 This was inspired by LUCENE-1476, where we realized deletions should
 really be implemented just like a filter, but then in testing found
 that switching deletions to iterator was a very sizable performance
 hit.
 Some notes on the test:
   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
   * I test across multiple queries.  1-X means an OR query, eg 1-4
 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
 AND 3 AND 4.  u s means united states (phrase search).
   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
 95, 98, 99, 99.9 (filter is non-null but all bits are set),
 100 (filter=null, control)).
   * Method high means I use random-access filter API in
 IndexSearcher's main loop.  Method low means I use random-access
 filter API down in SegmentTermDocs (just like deleted docs
 today).
   * Baseline (QPS) is current trunk, where filter is applied as iterator up
 high (ie in IndexSearcher's search loop).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it

2011-10-08 Thread Uwe Schindler (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1536:
--

Attachment: LUCENE-1536.patch

Improved the test for random access with BucktScorer to also chain two filters. 
This should pass the acceptDocs down so both filters are ANDed together.

Also made AssertingIndexSearcher also use the autodetection. In half of all 
cases it uses autodetection from parent class, in the other half it decides 
between random access and iterator access.

 if a filter can support random access API, we should use it
 ---

 Key: LUCENE-1536
 URL: https://issues.apache.org/jira/browse/LUCENE-1536
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/search
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Fix For: 4.0

 Attachments: CachedFilterIndexReader.java, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 changes-yonik-uwe.patch, luceneutil.patch


 I ran some performance tests, comparing applying a filter via
 random-access API instead of current trunk's iterator API.
 This was inspired by LUCENE-1476, where we realized deletions should
 really be implemented just like a filter, but then in testing found
 that switching deletions to iterator was a very sizable performance
 hit.
 Some notes on the test:
   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
   * I test across multiple queries.  1-X means an OR query, eg 1-4
 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
 AND 3 AND 4.  u s means united states (phrase search).
   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
 95, 98, 99, 99.9 (filter is non-null but all bits are set),
 100 (filter=null, control)).
   * Method high means I use random-access filter API in
 IndexSearcher's main loop.  Method low means I use random-access
 filter API down in SegmentTermDocs (just like deleted docs
 today).
   * Baseline (QPS) is current trunk, where filter is applied as iterator up
 high (ie in IndexSearcher's search loop).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it

2011-10-07 Thread Uwe Schindler (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1536:
--

Attachment: LUCENE-1536-rewrite.patch

Improved patch with modules and contrib fixed, Solr still on TODO:
- BooleanFilter and ChainedFilter do not apply acceptDocs to their Filter 
clauses. Instead they apply the acceptDocs on the final DocIdSet using 
BitsFilteredDocIdSet (see below). This improves filter performance, as the 
deleted documents are not applied on each clause.
- New helper class BitsFilteredDocIdSet, which supplies a wrap method, that can 
apply a Bits (e.g. acceptDocs) to a DocIdSet. This is useful for Filters, that 
build DocIdSets without respecting the acceptDocs parameter and only want to 
apply the deletions live.
- CachingWrapperFilter and CachingSpanFilter now also use the acceptDocs 
wrapper, as the filters are cached without acceptDocs.
- Fixed Javadocs, small changes to IndexSearcher

 if a filter can support random access API, we should use it
 ---

 Key: LUCENE-1536
 URL: https://issues.apache.org/jira/browse/LUCENE-1536
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/search
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Fix For: 4.0

 Attachments: CachedFilterIndexReader.java, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch


 I ran some performance tests, comparing applying a filter via
 random-access API instead of current trunk's iterator API.
 This was inspired by LUCENE-1476, where we realized deletions should
 really be implemented just like a filter, but then in testing found
 that switching deletions to iterator was a very sizable performance
 hit.
 Some notes on the test:
   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
   * I test across multiple queries.  1-X means an OR query, eg 1-4
 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
 AND 3 AND 4.  u s means united states (phrase search).
   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
 95, 98, 99, 99.9 (filter is non-null but all bits are set),
 100 (filter=null, control)).
   * Method high means I use random-access filter API in
 IndexSearcher's main loop.  Method low means I use random-access
 filter API down in SegmentTermDocs (just like deleted docs
 today).
   * Baseline (QPS) is current trunk, where filter is applied as iterator up
 high (ie in IndexSearcher's search loop).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it

2011-10-07 Thread Robert Muir (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-1536:


Attachment: LUCENE-1536.patch

updated patch: i fixed solr to the api changes and simply disabled the 
optimization in SolrIndexSearcher.

I think this is the most conservative way to proceed, we can then open a 
followup issue to make whatever changes are necessary to Solr APIs so it can 
use the optimization (looks complex)

 if a filter can support random access API, we should use it
 ---

 Key: LUCENE-1536
 URL: https://issues.apache.org/jira/browse/LUCENE-1536
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/search
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Fix For: 4.0

 Attachments: CachedFilterIndexReader.java, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch


 I ran some performance tests, comparing applying a filter via
 random-access API instead of current trunk's iterator API.
 This was inspired by LUCENE-1476, where we realized deletions should
 really be implemented just like a filter, but then in testing found
 that switching deletions to iterator was a very sizable performance
 hit.
 Some notes on the test:
   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
   * I test across multiple queries.  1-X means an OR query, eg 1-4
 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
 AND 3 AND 4.  u s means united states (phrase search).
   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
 95, 98, 99, 99.9 (filter is non-null but all bits are set),
 100 (filter=null, control)).
   * Method high means I use random-access filter API in
 IndexSearcher's main loop.  Method low means I use random-access
 filter API down in SegmentTermDocs (just like deleted docs
 today).
   * Baseline (QPS) is current trunk, where filter is applied as iterator up
 high (ie in IndexSearcher's search loop).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it

2011-10-07 Thread Uwe Schindler (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1536:
--

Attachment: LUCENE-1536-rewrite.patch

Patch with Solr fixes by Robert improved. Now usage of acceptDocs is correct. 
There is room to optimize, but this is the correct way to solve (as Filters are 
required to respect deleted docs in trunk).

Now the remaining caching wrapper changes, then we are ready.

 if a filter can support random access API, we should use it
 ---

 Key: LUCENE-1536
 URL: https://issues.apache.org/jira/browse/LUCENE-1536
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/search
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Fix For: 4.0

 Attachments: CachedFilterIndexReader.java, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch


 I ran some performance tests, comparing applying a filter via
 random-access API instead of current trunk's iterator API.
 This was inspired by LUCENE-1476, where we realized deletions should
 really be implemented just like a filter, but then in testing found
 that switching deletions to iterator was a very sizable performance
 hit.
 Some notes on the test:
   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
   * I test across multiple queries.  1-X means an OR query, eg 1-4
 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
 AND 3 AND 4.  u s means united states (phrase search).
   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
 95, 98, 99, 99.9 (filter is non-null but all bits are set),
 100 (filter=null, control)).
   * Method high means I use random-access filter API in
 IndexSearcher's main loop.  Method low means I use random-access
 filter API down in SegmentTermDocs (just like deleted docs
 today).
   * Baseline (QPS) is current trunk, where filter is applied as iterator up
 high (ie in IndexSearcher's search loop).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it

2011-10-07 Thread Robert Muir (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-1536:


Attachment: luceneutil.patch

 if a filter can support random access API, we should use it
 ---

 Key: LUCENE-1536
 URL: https://issues.apache.org/jira/browse/LUCENE-1536
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/search
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Fix For: 4.0

 Attachments: CachedFilterIndexReader.java, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 luceneutil.patch


 I ran some performance tests, comparing applying a filter via
 random-access API instead of current trunk's iterator API.
 This was inspired by LUCENE-1476, where we realized deletions should
 really be implemented just like a filter, but then in testing found
 that switching deletions to iterator was a very sizable performance
 hit.
 Some notes on the test:
   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
   * I test across multiple queries.  1-X means an OR query, eg 1-4
 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
 AND 3 AND 4.  u s means united states (phrase search).
   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
 95, 98, 99, 99.9 (filter is non-null but all bits are set),
 100 (filter=null, control)).
   * Method high means I use random-access filter API in
 IndexSearcher's main loop.  Method low means I use random-access
 filter API down in SegmentTermDocs (just like deleted docs
 today).
   * Baseline (QPS) is current trunk, where filter is applied as iterator up
 high (ie in IndexSearcher's search loop).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it

2011-10-06 Thread Uwe Schindler (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1536:
--

Attachment: LUCENE-1536-rewrite.patch

A first rewrite of Lucene core to pass acceptDocs down to Filter.getDocIdSet:
- optimized and simpliefied CachingWrapper* - no deletesmode anymore
- FieldCacheTermsFilter has optimized DocIdSet
- Added bits() to all DocIdSet
- IndexSearcher.searchWithFilter was rewritten to pass liveDocs down.
- AndBits is no longer needed

The tests are not yet rewritten, still 55 compile errors This patch is just 
for review

 if a filter can support random access API, we should use it
 ---

 Key: LUCENE-1536
 URL: https://issues.apache.org/jira/browse/LUCENE-1536
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/search
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Fix For: 4.0

 Attachments: CachedFilterIndexReader.java, LUCENE-1536-rewrite.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch


 I ran some performance tests, comparing applying a filter via
 random-access API instead of current trunk's iterator API.
 This was inspired by LUCENE-1476, where we realized deletions should
 really be implemented just like a filter, but then in testing found
 that switching deletions to iterator was a very sizable performance
 hit.
 Some notes on the test:
   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
   * I test across multiple queries.  1-X means an OR query, eg 1-4
 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
 AND 3 AND 4.  u s means united states (phrase search).
   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
 95, 98, 99, 99.9 (filter is non-null but all bits are set),
 100 (filter=null, control)).
   * Method high means I use random-access filter API in
 IndexSearcher's main loop.  Method low means I use random-access
 filter API down in SegmentTermDocs (just like deleted docs
 today).
   * Baseline (QPS) is current trunk, where filter is applied as iterator up
 high (ie in IndexSearcher's search loop).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it

2011-10-06 Thread Robert Muir (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-1536:


Attachment: LUCENE-1536.patch

adding back this optimization, again.

before committing please give me time to write tests to ensure we aren't losing 
these optimizations.

 if a filter can support random access API, we should use it
 ---

 Key: LUCENE-1536
 URL: https://issues.apache.org/jira/browse/LUCENE-1536
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/search
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Fix For: 4.0

 Attachments: CachedFilterIndexReader.java, LUCENE-1536-rewrite.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch


 I ran some performance tests, comparing applying a filter via
 random-access API instead of current trunk's iterator API.
 This was inspired by LUCENE-1476, where we realized deletions should
 really be implemented just like a filter, but then in testing found
 that switching deletions to iterator was a very sizable performance
 hit.
 Some notes on the test:
   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
   * I test across multiple queries.  1-X means an OR query, eg 1-4
 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
 AND 3 AND 4.  u s means united states (phrase search).
   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
 95, 98, 99, 99.9 (filter is non-null but all bits are set),
 100 (filter=null, control)).
   * Method high means I use random-access filter API in
 IndexSearcher's main loop.  Method low means I use random-access
 filter API down in SegmentTermDocs (just like deleted docs
 today).
   * Baseline (QPS) is current trunk, where filter is applied as iterator up
 high (ie in IndexSearcher's search loop).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it

2011-10-06 Thread Uwe Schindler (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1536:
--

Attachment: LUCENE-1536-rewrite.patch

Here a patch with almost all core tests rewritten (I left out the 
CachingWrapper tests, as I nuked DeletesMode). Its just for demonstartion.

Some tests have really stupid filters and work only with optimized indexes. I 
added asserts in those filters (except one), that acceptDocs==null. The 
remaining one uses QueryUtils and I have no idea whats going on there, that the 
acceptDocs!=null.

When looking at the code in IndexSearcher, I would propose to remove all Filter 
special handling in IndexSaercher and move all code over to FilteredQuery (with 
all our optimizations). If you call IS.search(query, filter,...), IndexSearcher 
would simply wrap with FilteredQuery and we would have no code duplication and 
much easier maintainability in IS.

 if a filter can support random access API, we should use it
 ---

 Key: LUCENE-1536
 URL: https://issues.apache.org/jira/browse/LUCENE-1536
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/search
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Fix For: 4.0

 Attachments: CachedFilterIndexReader.java, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch


 I ran some performance tests, comparing applying a filter via
 random-access API instead of current trunk's iterator API.
 This was inspired by LUCENE-1476, where we realized deletions should
 really be implemented just like a filter, but then in testing found
 that switching deletions to iterator was a very sizable performance
 hit.
 Some notes on the test:
   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
   * I test across multiple queries.  1-X means an OR query, eg 1-4
 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
 AND 3 AND 4.  u s means united states (phrase search).
   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
 95, 98, 99, 99.9 (filter is non-null but all bits are set),
 100 (filter=null, control)).
   * Method high means I use random-access filter API in
 IndexSearcher's main loop.  Method low means I use random-access
 filter API down in SegmentTermDocs (just like deleted docs
 today).
   * Baseline (QPS) is current trunk, where filter is applied as iterator up
 high (ie in IndexSearcher's search loop).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it

2011-10-06 Thread Uwe Schindler (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1536:
--

Attachment: LUCENE-1536-rewrite.patch

New patch (still only Lucene Core, no contrib/modules/solr modified):
- Nuked Filter handling completely from IndexSearcher. Algorithms and Random 
access optimizations were added to FilteredQuery. IS.search(Query, Filter,...) 
now only wraps the query with the Filter, if filter!=null (small helper method).
- The random access threshhold is still in 
IndexSearcher.setFilterRandomAccessThreshold(), FilteredQuery gets it in it's 
weight from IndexSearcher. This is maybe not the best solutions, we can also 
add a setter to FilteredQuery and IS passes it to FilteredQuery.

What do you think? Mike: Can you do perf tests?

 if a filter can support random access API, we should use it
 ---

 Key: LUCENE-1536
 URL: https://issues.apache.org/jira/browse/LUCENE-1536
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/search
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Fix For: 4.0

 Attachments: CachedFilterIndexReader.java, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch


 I ran some performance tests, comparing applying a filter via
 random-access API instead of current trunk's iterator API.
 This was inspired by LUCENE-1476, where we realized deletions should
 really be implemented just like a filter, but then in testing found
 that switching deletions to iterator was a very sizable performance
 hit.
 Some notes on the test:
   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
   * I test across multiple queries.  1-X means an OR query, eg 1-4
 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
 AND 3 AND 4.  u s means united states (phrase search).
   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
 95, 98, 99, 99.9 (filter is non-null but all bits are set),
 100 (filter=null, control)).
   * Method high means I use random-access filter API in
 IndexSearcher's main loop.  Method low means I use random-access
 filter API down in SegmentTermDocs (just like deleted docs
 today).
   * Baseline (QPS) is current trunk, where filter is applied as iterator up
 high (ie in IndexSearcher's search loop).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it

2011-10-06 Thread Uwe Schindler (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1536:
--

Attachment: LUCENE-1536-rewrite.patch

New patch:
- Fixed the FilteredQuery-Scorer's advance by logic change. Its now much easier 
to understand. The corresponding tests are in TestFilteredQuery: All tests are 
executed 2 times: as random access filter and as iterator filter. Also 
FilteredQuery is added to BQ, so the conventional scorer (nextDoc/advance) is 
tested.

The tests for CachingWrapper* are still disabled, have to rewrite them 
tomorrow. Then we can change contrib and Solr.

 if a filter can support random access API, we should use it
 ---

 Key: LUCENE-1536
 URL: https://issues.apache.org/jira/browse/LUCENE-1536
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/search
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Fix For: 4.0

 Attachments: CachedFilterIndexReader.java, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, 
 LUCENE-1536-rewrite.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch


 I ran some performance tests, comparing applying a filter via
 random-access API instead of current trunk's iterator API.
 This was inspired by LUCENE-1476, where we realized deletions should
 really be implemented just like a filter, but then in testing found
 that switching deletions to iterator was a very sizable performance
 hit.
 Some notes on the test:
   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
   * I test across multiple queries.  1-X means an OR query, eg 1-4
 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
 AND 3 AND 4.  u s means united states (phrase search).
   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
 95, 98, 99, 99.9 (filter is non-null but all bits are set),
 100 (filter=null, control)).
   * Method high means I use random-access filter API in
 IndexSearcher's main loop.  Method low means I use random-access
 filter API down in SegmentTermDocs (just like deleted docs
 today).
   * Baseline (QPS) is current trunk, where filter is applied as iterator up
 high (ie in IndexSearcher's search loop).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it

2011-10-05 Thread Uwe Schindler (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1536:
--

Attachment: LUCENE-1536.patch

New patch:
- Fixed FieldCacheRangeFilterDocIdSet to implement bits
- Removed deleted docs handling in this filter. We explicitely pass false. No 
the iterator and get(bits) may return deleted docs.
- Made AndBits final (speed and no need to subclass)

The question here: I see no filter that explicitely returns true for the 
deletedDocs handling. But e.g. most filters actually dont have deleted docs, 
like MultiTermQueryWF,...

 if a filter can support random access API, we should use it
 ---

 Key: LUCENE-1536
 URL: https://issues.apache.org/jira/browse/LUCENE-1536
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/search
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Fix For: 4.0

 Attachments: CachedFilterIndexReader.java, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch


 I ran some performance tests, comparing applying a filter via
 random-access API instead of current trunk's iterator API.
 This was inspired by LUCENE-1476, where we realized deletions should
 really be implemented just like a filter, but then in testing found
 that switching deletions to iterator was a very sizable performance
 hit.
 Some notes on the test:
   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
   * I test across multiple queries.  1-X means an OR query, eg 1-4
 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
 AND 3 AND 4.  u s means united states (phrase search).
   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
 95, 98, 99, 99.9 (filter is non-null but all bits are set),
 100 (filter=null, control)).
   * Method high means I use random-access filter API in
 IndexSearcher's main loop.  Method low means I use random-access
 filter API down in SegmentTermDocs (just like deleted docs
 today).
   * Baseline (QPS) is current trunk, where filter is applied as iterator up
 high (ie in IndexSearcher's search loop).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it

2011-10-05 Thread Uwe Schindler (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1536:
--

Attachment: LUCENE-1536.patch

I modified my patch for FieldCacheRangeFilter to enforce that it does not 
respect deletes (it throws UOE on the setter).

I checked some combinations, setContainsOnlyLiveDocs() cannot go in like that!!!

CachingWrapperFilter should have its own handling of this and return the 
correct setting, but not modify an already created bitset/whatever DocIdSet. My 
above patch does no longer respect deleted docs in FieldCacheRangeFilter and 
returns false from the beginning. But if you cache this filter, it suddenly 
tries to set the boolean to true - test fail

Also all Filters that internally use deleted docs, should return true from the 
beginning. We can maybe add the setter to FixedBitSet, so filter impls can 
easily set this bit, but it should not be a setter in DocIdSet!

 if a filter can support random access API, we should use it
 ---

 Key: LUCENE-1536
 URL: https://issues.apache.org/jira/browse/LUCENE-1536
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/search
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Fix For: 4.0

 Attachments: CachedFilterIndexReader.java, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch


 I ran some performance tests, comparing applying a filter via
 random-access API instead of current trunk's iterator API.
 This was inspired by LUCENE-1476, where we realized deletions should
 really be implemented just like a filter, but then in testing found
 that switching deletions to iterator was a very sizable performance
 hit.
 Some notes on the test:
   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
   * I test across multiple queries.  1-X means an OR query, eg 1-4
 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
 AND 3 AND 4.  u s means united states (phrase search).
   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
 95, 98, 99, 99.9 (filter is non-null but all bits are set),
 100 (filter=null, control)).
   * Method high means I use random-access filter API in
 IndexSearcher's main loop.  Method low means I use random-access
 filter API down in SegmentTermDocs (just like deleted docs
 today).
   * Baseline (QPS) is current trunk, where filter is applied as iterator up
 high (ie in IndexSearcher's search loop).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it

2011-10-04 Thread Robert Muir (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-1536:


Attachment: LUCENE-1536.patch

updated patch:
* implements heuristic with getter/setter (defaults to random access if we 
estimate filter accepts  1% of documents)
* sets the inOrder/topLevel stuff correct when we use random access -- but we 
should add a test that we get BS1 here!
* when the filter is sparse, but contains only live docs, we pass null as 
liveDocs.

 if a filter can support random access API, we should use it
 ---

 Key: LUCENE-1536
 URL: https://issues.apache.org/jira/browse/LUCENE-1536
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/search
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Fix For: 4.0

 Attachments: CachedFilterIndexReader.java, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch


 I ran some performance tests, comparing applying a filter via
 random-access API instead of current trunk's iterator API.
 This was inspired by LUCENE-1476, where we realized deletions should
 really be implemented just like a filter, but then in testing found
 that switching deletions to iterator was a very sizable performance
 hit.
 Some notes on the test:
   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
   * I test across multiple queries.  1-X means an OR query, eg 1-4
 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
 AND 3 AND 4.  u s means united states (phrase search).
   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
 95, 98, 99, 99.9 (filter is non-null but all bits are set),
 100 (filter=null, control)).
   * Method high means I use random-access filter API in
 IndexSearcher's main loop.  Method low means I use random-access
 filter API down in SegmentTermDocs (just like deleted docs
 today).
   * Baseline (QPS) is current trunk, where filter is applied as iterator up
 high (ie in IndexSearcher's search loop).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it

2011-10-04 Thread Robert Muir (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-1536:


Attachment: LUCENE-1536.patch

just a code style tweak, a test to make sure we get BS1 with random access 
filters, and randomization of the parameter value in LuceneTestCase.


 if a filter can support random access API, we should use it
 ---

 Key: LUCENE-1536
 URL: https://issues.apache.org/jira/browse/LUCENE-1536
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/search
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Fix For: 4.0

 Attachments: CachedFilterIndexReader.java, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch


 I ran some performance tests, comparing applying a filter via
 random-access API instead of current trunk's iterator API.
 This was inspired by LUCENE-1476, where we realized deletions should
 really be implemented just like a filter, but then in testing found
 that switching deletions to iterator was a very sizable performance
 hit.
 Some notes on the test:
   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
   * I test across multiple queries.  1-X means an OR query, eg 1-4
 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
 AND 3 AND 4.  u s means united states (phrase search).
   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
 95, 98, 99, 99.9 (filter is non-null but all bits are set),
 100 (filter=null, control)).
   * Method high means I use random-access filter API in
 IndexSearcher's main loop.  Method low means I use random-access
 filter API down in SegmentTermDocs (just like deleted docs
 today).
   * Baseline (QPS) is current trunk, where filter is applied as iterator up
 high (ie in IndexSearcher's search loop).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it

2011-10-04 Thread Chris Male (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Male updated LUCENE-1536:
---

Attachment: LUCENE-1536.patch

Small improvements:

- Changed isContainsOnlyLiveDocs to containsOnlyLiveDocs
- Changed visibility of containsOnlyLiveDocs field since with the 
getter/setters it doesn't need to be protected.
- Tidied the documentation on the getter/setters of the threshold in IS

I think we're good to go?

 if a filter can support random access API, we should use it
 ---

 Key: LUCENE-1536
 URL: https://issues.apache.org/jira/browse/LUCENE-1536
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/search
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Fix For: 4.0

 Attachments: CachedFilterIndexReader.java, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch


 I ran some performance tests, comparing applying a filter via
 random-access API instead of current trunk's iterator API.
 This was inspired by LUCENE-1476, where we realized deletions should
 really be implemented just like a filter, but then in testing found
 that switching deletions to iterator was a very sizable performance
 hit.
 Some notes on the test:
   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
   * I test across multiple queries.  1-X means an OR query, eg 1-4
 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
 AND 3 AND 4.  u s means united states (phrase search).
   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
 95, 98, 99, 99.9 (filter is non-null but all bits are set),
 100 (filter=null, control)).
   * Method high means I use random-access filter API in
 IndexSearcher's main loop.  Method low means I use random-access
 filter API down in SegmentTermDocs (just like deleted docs
 today).
   * Baseline (QPS) is current trunk, where filter is applied as iterator up
 high (ie in IndexSearcher's search loop).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it

2011-10-02 Thread Chris Male (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Male updated LUCENE-1536:
---

Attachment: LUCENE-1536.patch

Patch updated following the changes in LUCENE-3474.  (Note, this patch also 
fixes a missed change in LUCENE-3474, in 
MultiPhraseQuery.UnionDocsAndPositionsEnum).

New patch features:

- liveDocsOnly() - containsOnlyLiveDocs()
- getRandomAccessBits() and useRandomAccess() are gone, replaced by some 
additional logic in IS.  Now we check if the DocIdSet is a Bits impl, and then 
we consult a protected method in IS called useRandomAccess().  
- At the moment useRandomAccess() just returns true, while we discuss what 
heuristics are best to apply here.


 if a filter can support random access API, we should use it
 ---

 Key: LUCENE-1536
 URL: https://issues.apache.org/jira/browse/LUCENE-1536
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/search
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Fix For: 4.0

 Attachments: CachedFilterIndexReader.java, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch


 I ran some performance tests, comparing applying a filter via
 random-access API instead of current trunk's iterator API.
 This was inspired by LUCENE-1476, where we realized deletions should
 really be implemented just like a filter, but then in testing found
 that switching deletions to iterator was a very sizable performance
 hit.
 Some notes on the test:
   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
   * I test across multiple queries.  1-X means an OR query, eg 1-4
 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
 AND 3 AND 4.  u s means united states (phrase search).
   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
 95, 98, 99, 99.9 (filter is non-null but all bits are set),
 100 (filter=null, control)).
   * Method high means I use random-access filter API in
 IndexSearcher's main loop.  Method low means I use random-access
 filter API down in SegmentTermDocs (just like deleted docs
 today).
   * Baseline (QPS) is current trunk, where filter is applied as iterator up
 high (ie in IndexSearcher's search loop).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it

2011-09-28 Thread Michael McCandless (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1536:
---

Attachment: LUCENE-1536.patch

New patch folding in Uwe's ideas -- FCRF always uses random access; remove 
matchDoc and just override Bits.get; fixed CWF jdocs to not reference 
OpenBitSetDISI.

 if a filter can support random access API, we should use it
 ---

 Key: LUCENE-1536
 URL: https://issues.apache.org/jira/browse/LUCENE-1536
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/search
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Fix For: 4.0

 Attachments: CachedFilterIndexReader.java, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch


 I ran some performance tests, comparing applying a filter via
 random-access API instead of current trunk's iterator API.
 This was inspired by LUCENE-1476, where we realized deletions should
 really be implemented just like a filter, but then in testing found
 that switching deletions to iterator was a very sizable performance
 hit.
 Some notes on the test:
   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
   * I test across multiple queries.  1-X means an OR query, eg 1-4
 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
 AND 3 AND 4.  u s means united states (phrase search).
   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
 95, 98, 99, 99.9 (filter is non-null but all bits are set),
 100 (filter=null, control)).
   * Method high means I use random-access filter API in
 IndexSearcher's main loop.  Method low means I use random-access
 filter API down in SegmentTermDocs (just like deleted docs
 today).
   * Baseline (QPS) is current trunk, where filter is applied as iterator up
 high (ie in IndexSearcher's search loop).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it

2011-09-27 Thread Chris Male (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Male updated LUCENE-1536:
---

Attachment: LUCENE-1536.patch

New patch which adds greater control over the random-access to DocIdSet.

Also fixes most of the nocommits.

 if a filter can support random access API, we should use it
 ---

 Key: LUCENE-1536
 URL: https://issues.apache.org/jira/browse/LUCENE-1536
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/search
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Fix For: 4.0

 Attachments: CachedFilterIndexReader.java, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch


 I ran some performance tests, comparing applying a filter via
 random-access API instead of current trunk's iterator API.
 This was inspired by LUCENE-1476, where we realized deletions should
 really be implemented just like a filter, but then in testing found
 that switching deletions to iterator was a very sizable performance
 hit.
 Some notes on the test:
   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
   * I test across multiple queries.  1-X means an OR query, eg 1-4
 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
 AND 3 AND 4.  u s means united states (phrase search).
   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
 95, 98, 99, 99.9 (filter is non-null but all bits are set),
 100 (filter=null, control)).
   * Method high means I use random-access filter API in
 IndexSearcher's main loop.  Method low means I use random-access
 filter API down in SegmentTermDocs (just like deleted docs
 today).
   * Baseline (QPS) is current trunk, where filter is applied as iterator up
 high (ie in IndexSearcher's search loop).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it

2011-09-27 Thread Michael McCandless (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1536:
---

Attachment: LUCENE-1536.patch

New patch, fixing a few failing tests, adding a couple comments.  I
downgraded the 2 nocommits in LTC to TODOs, and removed the other one
(false is correct default for those two?).

For DocIdSet, can we nuke getRandomAccessBits?  Ie, if
supportRandomAccess() returns true, then we can cast the instance to
Bits?  Maybe we should rename supportRandomAccess to useRandomAccess?
(Ie, it may support it, but we only want to use random access when the
filter is dense enough).

I changed MTQWF's and CachingWrapperFilter's threshold to a double
percent (of the reader's maxDoc) instead, default 1.0%.

Hmm FieldCacheRangeFilter never enables random access... not sure
how/when we should compute that.

I think we are getting close!


 if a filter can support random access API, we should use it
 ---

 Key: LUCENE-1536
 URL: https://issues.apache.org/jira/browse/LUCENE-1536
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/search
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Fix For: 4.0

 Attachments: CachedFilterIndexReader.java, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch


 I ran some performance tests, comparing applying a filter via
 random-access API instead of current trunk's iterator API.
 This was inspired by LUCENE-1476, where we realized deletions should
 really be implemented just like a filter, but then in testing found
 that switching deletions to iterator was a very sizable performance
 hit.
 Some notes on the test:
   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
   * I test across multiple queries.  1-X means an OR query, eg 1-4
 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
 AND 3 AND 4.  u s means united states (phrase search).
   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
 95, 98, 99, 99.9 (filter is non-null but all bits are set),
 100 (filter=null, control)).
   * Method high means I use random-access filter API in
 IndexSearcher's main loop.  Method low means I use random-access
 filter API down in SegmentTermDocs (just like deleted docs
 today).
   * Baseline (QPS) is current trunk, where filter is applied as iterator up
 high (ie in IndexSearcher's search loop).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it

2011-09-22 Thread Chris Male (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Male updated LUCENE-1536:
---

Attachment: LUCENE-1536.patch

I think I have managed to update this to trunk.  The only dated aspect is the 
use of OpenBitSet, which if I understand correctly, has been replaced by 
FixedBitSet for most cases.

 if a filter can support random access API, we should use it
 ---

 Key: LUCENE-1536
 URL: https://issues.apache.org/jira/browse/LUCENE-1536
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/search
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Fix For: 4.0

 Attachments: CachedFilterIndexReader.java, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch


 I ran some performance tests, comparing applying a filter via
 random-access API instead of current trunk's iterator API.
 This was inspired by LUCENE-1476, where we realized deletions should
 really be implemented just like a filter, but then in testing found
 that switching deletions to iterator was a very sizable performance
 hit.
 Some notes on the test:
   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
   * I test across multiple queries.  1-X means an OR query, eg 1-4
 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
 AND 3 AND 4.  u s means united states (phrase search).
   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
 95, 98, 99, 99.9 (filter is non-null but all bits are set),
 100 (filter=null, control)).
   * Method high means I use random-access filter API in
 IndexSearcher's main loop.  Method low means I use random-access
 filter API down in SegmentTermDocs (just like deleted docs
 today).
   * Baseline (QPS) is current trunk, where filter is applied as iterator up
 high (ie in IndexSearcher's search loop).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it

2011-09-22 Thread Chris Male (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Male updated LUCENE-1536:
---

Attachment: LUCENE-1536.patch

New patch which includes the missed merging, some small tidy-ups and converting 
over to FixedBitSet.

 if a filter can support random access API, we should use it
 ---

 Key: LUCENE-1536
 URL: https://issues.apache.org/jira/browse/LUCENE-1536
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/search
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Fix For: 4.0

 Attachments: CachedFilterIndexReader.java, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch


 I ran some performance tests, comparing applying a filter via
 random-access API instead of current trunk's iterator API.
 This was inspired by LUCENE-1476, where we realized deletions should
 really be implemented just like a filter, but then in testing found
 that switching deletions to iterator was a very sizable performance
 hit.
 Some notes on the test:
   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
   * I test across multiple queries.  1-X means an OR query, eg 1-4
 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
 AND 3 AND 4.  u s means united states (phrase search).
   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
 95, 98, 99, 99.9 (filter is non-null but all bits are set),
 100 (filter=null, control)).
   * Method high means I use random-access filter API in
 IndexSearcher's main loop.  Method low means I use random-access
 filter API down in SegmentTermDocs (just like deleted docs
 today).
   * Baseline (QPS) is current trunk, where filter is applied as iterator up
 high (ie in IndexSearcher's search loop).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it

2011-07-09 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1536:
---

Attachment: LUCENE-1536.patch

Patch.  Lots of nocommits still but tests pass.

 if a filter can support random access API, we should use it
 ---

 Key: LUCENE-1536
 URL: https://issues.apache.org/jira/browse/LUCENE-1536
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/search
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Fix For: 4.0

 Attachments: CachedFilterIndexReader.java, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch


 I ran some performance tests, comparing applying a filter via
 random-access API instead of current trunk's iterator API.
 This was inspired by LUCENE-1476, where we realized deletions should
 really be implemented just like a filter, but then in testing found
 that switching deletions to iterator was a very sizable performance
 hit.
 Some notes on the test:
   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
   * I test across multiple queries.  1-X means an OR query, eg 1-4
 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
 AND 3 AND 4.  u s means united states (phrase search).
   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
 95, 98, 99, 99.9 (filter is non-null but all bits are set),
 100 (filter=null, control)).
   * Method high means I use random-access filter API in
 IndexSearcher's main loop.  Method low means I use random-access
 filter API down in SegmentTermDocs (just like deleted docs
 today).
   * Baseline (QPS) is current trunk, where filter is applied as iterator up
 high (ie in IndexSearcher's search loop).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it

2011-06-26 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1536:
---

Attachment: LUCENE-1536.patch

Initial patch for trunk... lots of nocommits, but tests all pass and I
think this is [roughly] the approach we should take to get fast(er)
Filter perf.

Conceptually, this change is fairly easy, because the flex APIs all
accept a Bits to apply low-level filtering.  However, this Bits is
inverted vs the Filter that callers pass to IndexSearcher (skipDocs vs
keepDocs), so, my patch inverts 1) the meaning of this first arg to
the Docs/AndPositions enums (it becomes an acceptDocs instead of
skipDocs), and 2) deleted docs coming back from IndexReaders (renames
IR.getDeletedDocs - IR.getNotDeletedDocs).

That change (inverting the Bits to be keepDocs not skipDocs) is the
vast majority of the patch.

The real change is to add DocIdSet.getRandomAccessBits and
bitsIncludesDeletedDocs, which IndexSearcher then consults to figure
out whether to push the filter low instead of high.  I then fixed
OpenBitSet to return this from getRandomAccessBits, and fixed
CachingWrapperFilter to turn this on/off as well as state whether
deleted docs were folded into the filter.

This means filters cached with CachingWrapperFilter will apply low,
and if it's DeletesMode.RECACHE then it's a single filter that's
applied (else I wrap with an AND NOT deleted check per docID), but
custom filters are also free to impl these methods to have their
filters applied low.


 if a filter can support random access API, we should use it
 ---

 Key: LUCENE-1536
 URL: https://issues.apache.org/jira/browse/LUCENE-1536
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/search
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Fix For: 4.0

 Attachments: CachedFilterIndexReader.java, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch


 I ran some performance tests, comparing applying a filter via
 random-access API instead of current trunk's iterator API.
 This was inspired by LUCENE-1476, where we realized deletions should
 really be implemented just like a filter, but then in testing found
 that switching deletions to iterator was a very sizable performance
 hit.
 Some notes on the test:
   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
   * I test across multiple queries.  1-X means an OR query, eg 1-4
 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
 AND 3 AND 4.  u s means united states (phrase search).
   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
 95, 98, 99, 99.9 (filter is non-null but all bits are set),
 100 (filter=null, control)).
   * Method high means I use random-access filter API in
 IndexSearcher's main loop.  Method low means I use random-access
 filter API down in SegmentTermDocs (just like deleted docs
 today).
   * Baseline (QPS) is current trunk, where filter is applied as iterator up
 high (ie in IndexSearcher's search loop).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1536) if a filter can support random access API, we should use it

2011-03-14 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-1536:


Labels: mentor  (was: gsoc2011 lucene-gsoc-11)

 if a filter can support random access API, we should use it
 ---

 Key: LUCENE-1536
 URL: https://issues.apache.org/jira/browse/LUCENE-1536
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
  Labels: mentor
 Fix For: 4.0

 Attachments: CachedFilterIndexReader.java, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch


 I ran some performance tests, comparing applying a filter via
 random-access API instead of current trunk's iterator API.
 This was inspired by LUCENE-1476, where we realized deletions should
 really be implemented just like a filter, but then in testing found
 that switching deletions to iterator was a very sizable performance
 hit.
 Some notes on the test:
   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
   * I test across multiple queries.  1-X means an OR query, eg 1-4
 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
 AND 3 AND 4.  u s means united states (phrase search).
   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
 95, 98, 99, 99.9 (filter is non-null but all bits are set),
 100 (filter=null, control)).
   * Method high means I use random-access filter API in
 IndexSearcher's main loop.  Method low means I use random-access
 filter API down in SegmentTermDocs (just like deleted docs
 today).
   * Baseline (QPS) is current trunk, where filter is applied as iterator up
 high (ie in IndexSearcher's search loop).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1536) if a filter can support random access API, we should use it

2011-03-14 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-1536:


Labels: gsoc2011, lucene-gsoc-11 mentor,  (was: mentor)

 if a filter can support random access API, we should use it
 ---

 Key: LUCENE-1536
 URL: https://issues.apache.org/jira/browse/LUCENE-1536
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
  Labels: gsoc2011,, lucene-gsoc-11, mentor,
 Fix For: 4.0

 Attachments: CachedFilterIndexReader.java, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch


 I ran some performance tests, comparing applying a filter via
 random-access API instead of current trunk's iterator API.
 This was inspired by LUCENE-1476, where we realized deletions should
 really be implemented just like a filter, but then in testing found
 that switching deletions to iterator was a very sizable performance
 hit.
 Some notes on the test:
   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
   * I test across multiple queries.  1-X means an OR query, eg 1-4
 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
 AND 3 AND 4.  u s means united states (phrase search).
   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
 95, 98, 99, 99.9 (filter is non-null but all bits are set),
 100 (filter=null, control)).
   * Method high means I use random-access filter API in
 IndexSearcher's main loop.  Method low means I use random-access
 filter API down in SegmentTermDocs (just like deleted docs
 today).
   * Baseline (QPS) is current trunk, where filter is applied as iterator up
 high (ie in IndexSearcher's search loop).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1536) if a filter can support random access API, we should use it

2011-03-14 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-1536:


Labels: gsoc2011 lucene-gsoc-11 mentor  (was: gsoc2011, lucene-gsoc-11 
mentor,)

 if a filter can support random access API, we should use it
 ---

 Key: LUCENE-1536
 URL: https://issues.apache.org/jira/browse/LUCENE-1536
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Fix For: 4.0

 Attachments: CachedFilterIndexReader.java, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch


 I ran some performance tests, comparing applying a filter via
 random-access API instead of current trunk's iterator API.
 This was inspired by LUCENE-1476, where we realized deletions should
 really be implemented just like a filter, but then in testing found
 that switching deletions to iterator was a very sizable performance
 hit.
 Some notes on the test:
   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
   * I test across multiple queries.  1-X means an OR query, eg 1-4
 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
 AND 3 AND 4.  u s means united states (phrase search).
   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
 95, 98, 99, 99.9 (filter is non-null but all bits are set),
 100 (filter=null, control)).
   * Method high means I use random-access filter API in
 IndexSearcher's main loop.  Method low means I use random-access
 filter API down in SegmentTermDocs (just like deleted docs
 today).
   * Baseline (QPS) is current trunk, where filter is applied as iterator up
 high (ie in IndexSearcher's search loop).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1536) if a filter can support random access API, we should use it

2011-03-10 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1536:
---

Labels: gsoc2011 lucene-gsoc-11  (was: )

 if a filter can support random access API, we should use it
 ---

 Key: LUCENE-1536
 URL: https://issues.apache.org/jira/browse/LUCENE-1536
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
  Labels: gsoc2011, lucene-gsoc-11
 Fix For: 4.0

 Attachments: CachedFilterIndexReader.java, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch


 I ran some performance tests, comparing applying a filter via
 random-access API instead of current trunk's iterator API.
 This was inspired by LUCENE-1476, where we realized deletions should
 really be implemented just like a filter, but then in testing found
 that switching deletions to iterator was a very sizable performance
 hit.
 Some notes on the test:
   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
   * I test across multiple queries.  1-X means an OR query, eg 1-4
 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
 AND 3 AND 4.  u s means united states (phrase search).
   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
 95, 98, 99, 99.9 (filter is non-null but all bits are set),
 100 (filter=null, control)).
   * Method high means I use random-access filter API in
 IndexSearcher's main loop.  Method low means I use random-access
 filter API down in SegmentTermDocs (just like deleted docs
 today).
   * Baseline (QPS) is current trunk, where filter is applied as iterator up
 high (ie in IndexSearcher's search loop).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1536) if a filter can support random access API, we should use it

2010-09-13 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1536:
---

Attachment: CachedFilterIndexReader.java

By subclassing FilterIndexReader, and taking advantage of how the flex
APIs now let you pass a custom skipDocs when pulling the postings, I
created a prototype class (attached, named CachedFilterIndexReader) that
up-front compiles the deleted docs for each segment with the negation
of a Filter (that you provide), and returns a reader that applies that
filter.

This is nice because it's fully external to Lucene, and it gives
awesome gains in many cases (see
http://chbits.blogspot.com/2010/09/fast-search-filters-using-flex.html).

I don't think we should commit this class -- we should instead fix
Filters correctly!  But it's a nice workaround until we do that.


 if a filter can support random access API, we should use it
 ---

 Key: LUCENE-1536
 URL: https://issues.apache.org/jira/browse/LUCENE-1536
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 4.0

 Attachments: CachedFilterIndexReader.java, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch


 I ran some performance tests, comparing applying a filter via
 random-access API instead of current trunk's iterator API.
 This was inspired by LUCENE-1476, where we realized deletions should
 really be implemented just like a filter, but then in testing found
 that switching deletions to iterator was a very sizable performance
 hit.
 Some notes on the test:
   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
   * I test across multiple queries.  1-X means an OR query, eg 1-4
 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
 AND 3 AND 4.  u s means united states (phrase search).
   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
 95, 98, 99, 99.9 (filter is non-null but all bits are set),
 100 (filter=null, control)).
   * Method high means I use random-access filter API in
 IndexSearcher's main loop.  Method low means I use random-access
 filter API down in SegmentTermDocs (just like deleted docs
 today).
   * Baseline (QPS) is current trunk, where filter is applied as iterator up
 high (ie in IndexSearcher's search loop).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1536) if a filter can support random access API, we should use it

2009-09-15 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated LUCENE-1536:
-

Attachment: LUCENE-1536.patch

* Added a IndexReader.termDocs(term, filter) method which
  culminates in passing a RandomAccessDocIdSet to the scorers.
  ConstantScoreQuery and FilterQuery need to AND the filters
  together. 

* Does explain need the filter?

* Spans aren't supported yet





 if a filter can support random access API, we should use it
 ---

 Key: LUCENE-1536
 URL: https://issues.apache.org/jira/browse/LUCENE-1536
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 3.1

 Attachments: LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch


 I ran some performance tests, comparing applying a filter via
 random-access API instead of current trunk's iterator API.
 This was inspired by LUCENE-1476, where we realized deletions should
 really be implemented just like a filter, but then in testing found
 that switching deletions to iterator was a very sizable performance
 hit.
 Some notes on the test:
   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
   * I test across multiple queries.  1-X means an OR query, eg 1-4
 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
 AND 3 AND 4.  u s means united states (phrase search).
   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
 95, 98, 99, 99.9 (filter is non-null but all bits are set),
 100 (filter=null, control)).
   * Method high means I use random-access filter API in
 IndexSearcher's main loop.  Method low means I use random-access
 filter API down in SegmentTermDocs (just like deleted docs
 today).
   * Baseline (QPS) is current trunk, where filter is applied as iterator up
 high (ie in IndexSearcher's search loop).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1536) if a filter can support random access API, we should use it

2009-06-11 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1536:
---

Fix Version/s: (was: 2.9)
   3.1

Moving out.

 if a filter can support random access API, we should use it
 ---

 Key: LUCENE-1536
 URL: https://issues.apache.org/jira/browse/LUCENE-1536
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 3.1

 Attachments: LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch


 I ran some performance tests, comparing applying a filter via
 random-access API instead of current trunk's iterator API.
 This was inspired by LUCENE-1476, where we realized deletions should
 really be implemented just like a filter, but then in testing found
 that switching deletions to iterator was a very sizable performance
 hit.
 Some notes on the test:
   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
   * I test across multiple queries.  1-X means an OR query, eg 1-4
 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
 AND 3 AND 4.  u s means united states (phrase search).
   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
 95, 98, 99, 99.9 (filter is non-null but all bits are set),
 100 (filter=null, control)).
   * Method high means I use random-access filter API in
 IndexSearcher's main loop.  Method low means I use random-access
 filter API down in SegmentTermDocs (just like deleted docs
 today).
   * Baseline (QPS) is current trunk, where filter is applied as iterator up
 high (ie in IndexSearcher's search loop).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1536) if a filter can support random access API, we should use it

2009-04-21 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated LUCENE-1536:
-

Attachment: LUCENE-1536.patch

* Filter has a parameter for included deletes. IndexSearcher
uses it for setting the RandomAccessDocIdSet on the Scorers

* MatchAllDocsScorer in addition to TermScorer properly supports
setting a RandomAccessDocIdSet 

* Added more test cases

* AndRandomAccessDocIdSet.iterator() and BitVector.iterator()
needs to be implemented 

 if a filter can support random access API, we should use it
 ---

 Key: LUCENE-1536
 URL: https://issues.apache.org/jira/browse/LUCENE-1536
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch


 I ran some performance tests, comparing applying a filter via
 random-access API instead of current trunk's iterator API.
 This was inspired by LUCENE-1476, where we realized deletions should
 really be implemented just like a filter, but then in testing found
 that switching deletions to iterator was a very sizable performance
 hit.
 Some notes on the test:
   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
   * I test across multiple queries.  1-X means an OR query, eg 1-4
 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
 AND 3 AND 4.  u s means united states (phrase search).
   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
 95, 98, 99, 99.9 (filter is non-null but all bits are set),
 100 (filter=null, control)).
   * Method high means I use random-access filter API in
 IndexSearcher's main loop.  Method low means I use random-access
 filter API down in SegmentTermDocs (just like deleted docs
 today).
   * Baseline (QPS) is current trunk, where filter is applied as iterator up
 high (ie in IndexSearcher's search loop).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1536) if a filter can support random access API, we should use it

2009-04-20 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated LUCENE-1536:
-

Attachment: LUCENE-1536.patch

The patch is a start at something that will work within the API
(I think). 

* Created a RandomAccessDocIdSet class that BitVector and
OpenBitSet implement. 

* SegmentTermDocs has a setter for RandomAccessDocIdSet with the
option of including the deletedDocs

* IndexSearcher.doSearch does the main work of setting the
RandomAccessDocIdSet on the TermScorers

* BooleanScorer2 and other classes implement
getSequentialSubScorers

* AndRandomAccessDocIdSet iterates over an array of
RandomAccessDocIdSets

 if a filter can support random access API, we should use it
 ---

 Key: LUCENE-1536
 URL: https://issues.apache.org/jira/browse/LUCENE-1536
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1536.patch, LUCENE-1536.patch


 I ran some performance tests, comparing applying a filter via
 random-access API instead of current trunk's iterator API.
 This was inspired by LUCENE-1476, where we realized deletions should
 really be implemented just like a filter, but then in testing found
 that switching deletions to iterator was a very sizable performance
 hit.
 Some notes on the test:
   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
   * I test across multiple queries.  1-X means an OR query, eg 1-4
 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
 AND 3 AND 4.  u s means united states (phrase search).
   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
 95, 98, 99, 99.9 (filter is non-null but all bits are set),
 100 (filter=null, control)).
   * Method high means I use random-access filter API in
 IndexSearcher's main loop.  Method low means I use random-access
 filter API down in SegmentTermDocs (just like deleted docs
 today).
   * Baseline (QPS) is current trunk, where filter is applied as iterator up
 high (ie in IndexSearcher's search loop).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1536) if a filter can support random access API, we should use it

2009-04-14 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1536:
---

Fix Version/s: 2.9

Given the sizable performance gains, I think we should try to do this for 2.9, 
if possible.

 if a filter can support random access API, we should use it
 ---

 Key: LUCENE-1536
 URL: https://issues.apache.org/jira/browse/LUCENE-1536
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1536.patch


 I ran some performance tests, comparing applying a filter via
 random-access API instead of current trunk's iterator API.
 This was inspired by LUCENE-1476, where we realized deletions should
 really be implemented just like a filter, but then in testing found
 that switching deletions to iterator was a very sizable performance
 hit.
 Some notes on the test:
   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
   * I test across multiple queries.  1-X means an OR query, eg 1-4
 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
 AND 3 AND 4.  u s means united states (phrase search).
   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
 95, 98, 99, 99.9 (filter is non-null but all bits are set),
 100 (filter=null, control)).
   * Method high means I use random-access filter API in
 IndexSearcher's main loop.  Method low means I use random-access
 filter API down in SegmentTermDocs (just like deleted docs
 today).
   * Baseline (QPS) is current trunk, where filter is applied as iterator up
 high (ie in IndexSearcher's search loop).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org