[jira] Commented: (LUCENE-1632) boolean docid set iterator improvement

2009-05-12 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708659#action_12708659
 ] 

John Wang commented on LUCENE-1632:
---

I think we have an improvement for ConjunctionScorer as well, with about a 10% 
gain. We need to clean it up for a patch.

To be clear, these are not algorithmic changes; this is code tuning performed 
on the same algorithm.
The naming is kept consistent with the current Lucene class names, e.g. 
DocIdSet, DocIdSetIterator.

Feel free to do more code tuning if you feel it would improve performance 
further.
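The benchmarks below compare the AND/OR doc-id set iterators against the existing scorers. As a rough illustration of what a conjunction ("AND") iterator does, here is a minimal leapfrog-intersection sketch over sorted doc-id streams; the class and method names are illustrative, not Lucene's actual API.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the leapfrog conjunction ("AND") loop: repeatedly
// advance each stream to the current candidate doc; whenever a stream
// overshoots, its doc becomes the new candidate. A doc is emitted only when
// every stream lands on it. Each int[] stands in for one sorted doc-id stream.
public class ConjunctionSketch {
    public static List<Integer> intersect(int[][] streams) {
        List<Integer> hits = new ArrayList<>();
        int[] pos = new int[streams.length];          // cursor per stream
        int candidate = streams[0].length > 0 ? streams[0][0] : -1;
        outer:
        while (candidate >= 0) {
            for (int i = 0; i < streams.length; i++) {
                // advance stream i to the first doc >= candidate
                while (pos[i] < streams[i].length && streams[i][pos[i]] < candidate) pos[i]++;
                if (pos[i] == streams[i].length) break outer;   // stream exhausted: done
                if (streams[i][pos[i]] > candidate) {
                    candidate = streams[i][pos[i]];             // overshoot: new candidate
                    continue outer;
                }
            }
            hits.add(candidate);   // every stream matched the candidate
            candidate++;           // look for the next doc
        }
        return hits;
    }
}
```

Per the comment above, the patch keeps this kind of algorithm unchanged and only tunes the code around it.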

> boolean docid set iterator improvement
> --
>
> Key: LUCENE-1632
> URL: https://issues.apache.org/jira/browse/LUCENE-1632
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.4
>Reporter: John Wang
> Attachments: Lucene-1632-patch.txt
>
>
> This was first brought up in LUCENE-1345, but that conversation has 
> digressed. As suggested, creating a separate issue to track it.
> Added perf comparisons of boolean set iterators with the current scorers.
> See patch.
> System: Ubuntu, 
> java version "1.6.0_11"
> Intel Core 2 Duo 2.44 GHz
> new milliseconds=470
> new milliseconds=534
> new milliseconds=450
> new milliseconds=443
> new milliseconds=444
> new milliseconds=445
> new milliseconds=449
> new milliseconds=441
> new milliseconds=444
> new milliseconds=445
> new total milliseconds=4565
> old milliseconds=529
> old milliseconds=491
> old milliseconds=428
> old milliseconds=549
> old milliseconds=427
> old milliseconds=424
> old milliseconds=420
> old milliseconds=424
> old milliseconds=423
> old milliseconds=422
> old total milliseconds=4537
> New/Old Time 4565/4537 (100.61715%)
> OrDocIdSetIterator milliseconds=1138
> OrDocIdSetIterator milliseconds=1106
> OrDocIdSetIterator milliseconds=1065
> OrDocIdSetIterator milliseconds=1066
> OrDocIdSetIterator milliseconds=1065
> OrDocIdSetIterator milliseconds=1067
> OrDocIdSetIterator milliseconds=1072
> OrDocIdSetIterator milliseconds=1118
> OrDocIdSetIterator milliseconds=1065
> OrDocIdSetIterator milliseconds=1069
> OrDocIdSetIterator total milliseconds=10831
> DisjunctionMaxScorer milliseconds=1914
> DisjunctionMaxScorer milliseconds=1981
> DisjunctionMaxScorer milliseconds=1861
> DisjunctionMaxScorer milliseconds=1893
> DisjunctionMaxScorer milliseconds=1886
> DisjunctionMaxScorer milliseconds=1885
> DisjunctionMaxScorer milliseconds=1887
> DisjunctionMaxScorer milliseconds=1889
> DisjunctionMaxScorer milliseconds=1891
> DisjunctionMaxScorer milliseconds=1888
> DisjunctionMaxScorer total milliseconds=18975
> Or/DisjunctionMax Time 10831/18975 (57.080368%)
> OrDocIdSetIterator milliseconds=1079
> OrDocIdSetIterator milliseconds=1075
> OrDocIdSetIterator milliseconds=1076
> OrDocIdSetIterator milliseconds=1093
> OrDocIdSetIterator milliseconds=1077
> OrDocIdSetIterator milliseconds=1074
> OrDocIdSetIterator milliseconds=1078
> OrDocIdSetIterator milliseconds=1075
> OrDocIdSetIterator milliseconds=1074
> OrDocIdSetIterator milliseconds=1074
> OrDocIdSetIterator total milliseconds=10775
> DisjunctionSumScorer milliseconds=1398
> DisjunctionSumScorer milliseconds=1322
> DisjunctionSumScorer milliseconds=1320
> DisjunctionSumScorer milliseconds=1305
> DisjunctionSumScorer milliseconds=1304
> DisjunctionSumScorer milliseconds=1301
> DisjunctionSumScorer milliseconds=1304
> DisjunctionSumScorer milliseconds=1300
> DisjunctionSumScorer milliseconds=1301
> DisjunctionSumScorer milliseconds=1317
> DisjunctionSumScorer total milliseconds=13172
> Or/DisjunctionSum Time 10775/13172 (81.80231%)
> AndDocIdSetIterator milliseconds=330
> AndDocIdSetIterator milliseconds=336
> AndDocIdSetIterator milliseconds=298
> AndDocIdSetIterator milliseconds=299
> AndDocIdSetIterator milliseconds=310
> AndDocIdSetIterator milliseconds=298
> AndDocIdSetIterator milliseconds=298
> AndDocIdSetIterator milliseconds=334
> AndDocIdSetIterator milliseconds=298
> AndDocIdSetIterator milliseconds=299
> AndDocIdSetIterator total milliseconds=3100
> ConjunctionScorer milliseconds=332
> ConjunctionScorer milliseconds=307
> ConjunctionScorer milliseconds=302
> ConjunctionScorer milliseconds=350
> ConjunctionScorer milliseconds=300
> ConjunctionScorer milliseconds=304
> ConjunctionScorer milliseconds=305
> ConjunctionScorer milliseconds=303
> ConjunctionScorer milliseconds=303
> ConjunctionScorer milliseconds=299
> ConjunctionScorer total milliseconds=3105
> And/Conjunction Time 3100/3105 (99.83897%)
> main contributors to the patch: Anmol Bhasin & Yasuhiro Matsuda

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr.

[jira] Commented: (LUCENE-1410) PFOR implementation

2009-05-12 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708590#action_12708590
 ] 

Paul Elschot commented on LUCENE-1410:
--

A very recent paper with some improvements to PFOR:
Yan, Ding, Suel,
Inverted Index Compression and Query Processing with Optimized Document 
Ordering,
WWW 2009, April 20-24 2009, Madrid, Spain

Roughly quoting par. 4.2, Optimizing PForDelta compression:
For an exception, we store its lower b bits in its corresponding slot instead 
of the offset to the next exception, while we store the higher overflow bits 
and the offset in two separate arrays. These two arrays are compressed using 
the Simple16 method.
Also, b is chosen to optimize decompression speed. This makes the dependence of 
b on the data quite simple (in the PFOR above, this dependence is more 
complex), and it improves compression speed.
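The exception layout just described can be sketched as follows; the names are illustrative (not the paper's code), and the Simple16 compression of the two side arrays is omitted.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: every value keeps its lower b bits in its slot; each
// exception (a value needing more than b bits) records its overflow bits and
// its slot position in two side arrays. The paper compresses those side
// arrays with Simple16; that step is omitted here.
public class PForExceptionSketch {
    final int b;
    final int[] slots;        // lower b bits of every value
    final int[] excPositions; // slot index of each exception
    final int[] excHighBits;  // overflow (high) bits of each exception

    PForExceptionSketch(int b, int[] slots, int[] pos, int[] high) {
        this.b = b; this.slots = slots; this.excPositions = pos; this.excHighBits = high;
    }

    public static PForExceptionSketch encode(int[] values, int b) {
        int mask = (1 << b) - 1;
        int[] slots = new int[values.length];
        List<Integer> pos = new ArrayList<>();
        List<Integer> high = new ArrayList<>();
        for (int i = 0; i < values.length; i++) {
            slots[i] = values[i] & mask;     // low bits stay in the slot
            int overflow = values[i] >>> b;  // anything above b bits is an exception
            if (overflow != 0) { pos.add(i); high.add(overflow); }
        }
        int[] p = new int[pos.size()], h = new int[high.size()];
        for (int j = 0; j < p.length; j++) { p[j] = pos.get(j); h[j] = high.get(j); }
        return new PForExceptionSketch(b, slots, p, h);
    }

    public int[] decode() {
        int[] out = slots.clone();
        for (int j = 0; j < excPositions.length; j++)
            out[excPositions[j]] |= excHighBits[j] << b;  // patch the exceptions back in
        return out;
    }
}
```

Because the slots no longer chain exceptions together, decoding is a tight patch loop over the side arrays rather than a pointer walk.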

Btw, the document ordering there is by URL. For many terms this gives shorter 
deltas between doc ids, allowing a higher decompression speed of the doc ids.


> PFOR implementation
> ---
>
> Key: LUCENE-1410
> URL: https://issues.apache.org/jira/browse/LUCENE-1410
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Other
>Reporter: Paul Elschot
>Priority: Minor
> Attachments: autogen.tgz, LUCENE-1410b.patch, LUCENE-1410c.patch, 
> LUCENE-1410d.patch, LUCENE-1410e.patch, TermQueryTests.tgz, TestPFor2.java, 
> TestPFor2.java, TestPFor2.java
>
>   Original Estimate: 21840h
>  Remaining Estimate: 21840h
>
> Implementation of Patched Frame of Reference.




[jira] Commented: (LUCENE-1313) Realtime Search

2009-05-12 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708578#action_12708578
 ] 

Jason Rutherglen commented on LUCENE-1313:
--

I think the easiest way to handle the ram buf size vs. the ram
dir size is to allow each to grow on request. I have some code
I need to test that implements it. This way we're growing based
on demand and availability. The only thing we may want to add is
a way to grow, and perhaps automatically flush based on the
growth requested, and perhaps prioritize requests?

> Realtime Search
> ---
>
> Key: LUCENE-1313
> URL: https://issues.apache.org/jira/browse/LUCENE-1313
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.4.1
>Reporter: Jason Rutherglen
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1313.jar, LUCENE-1313.patch, LUCENE-1313.patch, 
> LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, 
> LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, 
> lucene-1313.patch, lucene-1313.patch, lucene-1313.patch, lucene-1313.patch
>
>
> Realtime search with transactional semantics.  
> Possible future directions:
>   * Optimistic concurrency
>   * Replication
> Encoding each transaction into a set of bytes by writing to a RAMDirectory 
> enables replication.  It is difficult to replicate using other methods 
> because while the document may easily be serialized, the analyzer cannot.
> I think this issue can hold realtime benchmarks which include indexing and 
> searching concurrently.




InstantiatedIndex Memory required

2009-05-12 Thread thiruvee

Hi

So far I have been using RAMDirectory for my indexes. To meet the SLA of our
project, I thought of using InstantiatedIndex. But when I use it, I am not
able to get any output from it, and it throws an out-of-memory error.

What is the ratio between index size and memory size when using
InstantiatedIndex?
Here are my index details:

Index size: 200 MB
RAM size: 1 GB


If I try with a small test index of size 100 KB, it works.
Please help me with this.

Thanks 
Ravichandra






-- 
View this message in context: 
http://www.nabble.com/InstantiatedIndex-Memory-required-tp23506231p23506231.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.





[jira] Updated: (LUCENE-1634) LogMergePolicy should use the number of deleted docs when deciding which segments to merge

2009-05-12 Thread Yasuhiro Matsuda (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yasuhiro Matsuda updated LUCENE-1634:
-

Attachment: LUCENE-1634.patch

I posted a patch.

> LogMergePolicy should use the number of deleted docs when deciding which 
> segments to merge
> --
>
> Key: LUCENE-1634
> URL: https://issues.apache.org/jira/browse/LUCENE-1634
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Yasuhiro Matsuda
> Attachments: LUCENE-1634.patch
>
>
> I found that the IndexWriter.optimize(int) method does not pick up large 
> segments with a lot of deletes, even when most of the docs are deleted. The 
> existence of such segments affected query performance significantly.
> I created an index with 1 million docs, then went over all docs and updated a 
> few thousand at a time.  I ran optimize(20) occasionally. What I saw were large 
> segments with most of their docs deleted. Although these segments held few 
> valid docs, they remained in the directory for a very long time until more 
> segments of comparable or bigger size were created.
> This is because LogMergePolicy.findMergeForOptimize uses the size of segments 
> but does not take the number of deleted documents into consideration when it 
> decides which segments to merge. So, a simple fix is to use the delete count 
> to calibrate the segment size. I can create a patch for this.




[jira] Created: (LUCENE-1634) LogMergePolicy should use the number of deleted docs when deciding which segments to merge

2009-05-12 Thread Yasuhiro Matsuda (JIRA)
LogMergePolicy should use the number of deleted docs when deciding which 
segments to merge
--

 Key: LUCENE-1634
 URL: https://issues.apache.org/jira/browse/LUCENE-1634
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Yasuhiro Matsuda


I found that the IndexWriter.optimize(int) method does not pick up large 
segments with a lot of deletes, even when most of the docs are deleted. The 
existence of such segments affected query performance significantly.

I created an index with 1 million docs, then went over all docs and updated a 
few thousand at a time.  I ran optimize(20) occasionally. What I saw were large 
segments with most of their docs deleted. Although these segments held few 
valid docs, they remained in the directory for a very long time until more 
segments of comparable or bigger size were created.

This is because LogMergePolicy.findMergeForOptimize uses the size of segments 
but does not take the number of deleted documents into consideration when it 
decides which segments to merge. So, a simple fix is to use the delete count to 
calibrate the segment size. I can create a patch for this.
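The proposed calibration can be sketched as follows; the method name and its use are illustrative, not LogMergePolicy's actual code.

```java
// Illustrative sketch: scale a segment's byte size by its live-doc ratio
// before the merge policy compares sizes, so a large segment that is mostly
// deletes compares as a small one and becomes eligible for merging.
public class SegmentSizeSketch {
    public static long calibratedSize(long sizeInBytes, int docCount, int delCount) {
        if (docCount <= 0) return sizeInBytes;               // nothing to calibrate
        double liveRatio = (double) (docCount - delCount) / docCount;
        return (long) (sizeInBytes * liveRatio);             // size weighted by live docs
    }
}
```

With this weighting, a 1 GB segment with 90% of its docs deleted would be treated like a 100 MB segment when optimize(int) picks merge candidates.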






[jira] Assigned: (LUCENE-1455) org.apache.lucene.ant.HtmlDocument creates a FileInputStream in its constructor that it doesn't close

2009-05-12 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller reassigned LUCENE-1455:
---

Assignee: Mark Miller

> org.apache.lucene.ant.HtmlDocument creates a FileInputStream in its 
> constructor that it doesn't close
> -
>
> Key: LUCENE-1455
> URL: https://issues.apache.org/jira/browse/LUCENE-1455
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/*
>Reporter: Mark Miller
>Assignee: Mark Miller
>Priority: Minor
> Fix For: 2.9
>
>
> A look through the jtidy source code doesn't show a close that I can find in 
> parse (it seems to be standard that you close your own streams anyway), so this 
> looks like a small descriptor leak to me.
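A minimal sketch of the kind of fix implied, assuming the caller that opens the FileInputStream closes it after parsing; the class and method names are illustrative, not the actual HtmlDocument or jtidy code.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Illustrative sketch: open the stream, hand it to the parser, and close it
// in a finally block so the descriptor is released even if parsing throws.
public class StreamCloseSketch {
    public static String parseFile(String path) throws IOException {
        InputStream in = new FileInputStream(path);
        try {
            return parse(in);   // hand the stream to the parser
        } finally {
            in.close();         // caller closes its own stream
        }
    }

    // Stand-in for the HTML parser: just reads the stream to the end.
    private static String parse(InputStream in) throws IOException {
        byte[] buf = new byte[4096];
        int total = 0, n;
        while ((n = in.read(buf)) != -1) total += n;
        return "parsed " + total + " bytes";
    }
}
```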




[jira] Assigned: (LUCENE-1598) While you could use a custom Sort Comparator source with remote searchable before, you can no longer do so with FieldComparatorSource

2009-05-12 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller reassigned LUCENE-1598:
---

Assignee: Mark Miller

> While you could use a custom Sort Comparator source with remote searchable 
> before, you can no longer do so with FieldComparatorSource
> -
>
> Key: LUCENE-1598
> URL: https://issues.apache.org/jira/browse/LUCENE-1598
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Search
>Reporter: Mark Miller
>Assignee: Mark Miller
>Priority: Minor
> Fix For: 2.9
>
>
> FieldComparatorSource is not serializable, but can live on a SortField




[jira] Resolved: (LUCENE-1633) Copy/Paste-Typo in toString() for SpanQueryFilter

2009-05-12 Thread Yonik Seeley (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley resolved LUCENE-1633.
--

   Resolution: Fixed
Fix Version/s: 2.9

Committed.

> Copy/Paste-Typo in toString() for SpanQueryFilter
> -
>
> Key: LUCENE-1633
> URL: https://issues.apache.org/jira/browse/LUCENE-1633
> Project: Lucene - Java
>  Issue Type: Bug
>Reporter: Bernd Fondermann
>Priority: Trivial
> Fix For: 2.9
>
> Attachments: fix_SpanQueryFilter_toString.patch
>
>
>    public String toString() {
> -    return "QueryWrapperFilter(" + query + ")";
> +    return "SpanQueryFilter(" + query + ")";
>    }
> says it all.




[jira] Updated: (LUCENE-1633) Copy/Paste-Typo in toString() for SpanQueryFilter

2009-05-12 Thread Bernd Fondermann (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bernd Fondermann updated LUCENE-1633:
-

Attachment: fix_SpanQueryFilter_toString.patch

> Copy/Paste-Typo in toString() for SpanQueryFilter
> -
>
> Key: LUCENE-1633
> URL: https://issues.apache.org/jira/browse/LUCENE-1633
> Project: Lucene - Java
>  Issue Type: Bug
>Reporter: Bernd Fondermann
>Priority: Trivial
> Attachments: fix_SpanQueryFilter_toString.patch
>
>
>    public String toString() {
> -    return "QueryWrapperFilter(" + query + ")";
> +    return "SpanQueryFilter(" + query + ")";
>    }
> says it all.




[jira] Created: (LUCENE-1633) Copy/Paste-Typo in toString() for SpanQueryFilter

2009-05-12 Thread Bernd Fondermann (JIRA)
Copy/Paste-Typo in toString() for SpanQueryFilter
-

 Key: LUCENE-1633
 URL: https://issues.apache.org/jira/browse/LUCENE-1633
 Project: Lucene - Java
  Issue Type: Bug
Reporter: Bernd Fondermann
Priority: Trivial


   public String toString() {
-    return "QueryWrapperFilter(" + query + ")";
+    return "SpanQueryFilter(" + query + ")";
   }

says it all.
