[jira] Created: (LUCENE-478) CJK char list

2005-12-07 Thread John Wang (JIRA)
CJK char list
-

 Key: LUCENE-478
 URL: http://issues.apache.org/jira/browse/LUCENE-478
 Project: Lucene - Java
Type: Bug
  Components: Analysis  
Versions: 1.4
Reporter: John Wang
Priority: Minor


It seems the character list in the CJK section of StandardTokenizer.jj is not 
quite complete. The following is a more complete list:

< CJK:  // non-alphabets
  [
   "\u1100"-"\u11ff",
   "\u3040"-"\u30ff",
   "\u3130"-"\u318f",
   "\u31f0"-"\u31ff",
   "\u3300"-"\u337f",
   "\u3400"-"\u4dbf",
   "\u4e00"-"\u9fff",
   "\uac00"-"\ud7a3",
   "\uf900"-"\ufaff",
   "\uff65"-"\uffdc"   
  ]
  >
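For readers who want to sanity-check the proposed ranges outside of JavaCC, here is a minimal standalone sketch (the class and method names are hypothetical, not part of Lucene) that tests whether a character falls in one of the ranges listed above:

```java
// Standalone sketch: membership test for the proposed CJK ranges.
// Range endpoints are copied verbatim from the list above.
public class CJKRanges {
    // inclusive {start, end} pairs, mirroring the grammar snippet
    private static final char[][] RANGES = {
        {'\u1100', '\u11ff'}, {'\u3040', '\u30ff'}, {'\u3130', '\u318f'},
        {'\u31f0', '\u31ff'}, {'\u3300', '\u337f'}, {'\u3400', '\u4dbf'},
        {'\u4e00', '\u9fff'}, {'\uac00', '\ud7a3'}, {'\uf900', '\ufaff'},
        {'\uff65', '\uffdc'},
    };

    public static boolean isCJK(char c) {
        for (char[] r : RANGES) {
            if (c >= r[0] && c <= r[1]) return true;
        }
        return false;
    }
}
```

For example, U+4E2D (a common CJK ideograph) falls in the 4e00-9fff block, while a plain ASCII letter does not.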



-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-478) CJK char list

2006-01-01 Thread John Wang (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-478?page=comments#action_12361497 ] 

John Wang commented on LUCENE-478:
--

Yes I am.

Our i18n team has provided a more up-to-date list and I thought I'd contribute 
it back.

-John

> CJK char list
> -
>
>  Key: LUCENE-478
>  URL: http://issues.apache.org/jira/browse/LUCENE-478
>  Project: Lucene - Java
> Type: Bug
>   Components: Analysis
> Versions: 1.4
> Reporter: John Wang
> Priority: Minor

>
> Seems the character list in the CJK section of the StandardTokenizer.jj is 
> not quite complete. Following is a more complete list:
> < CJK:  // non-alphabets
>   [
>  "\u1100"-"\u11ff",
>"\u3040"-"\u30ff",
>"\u3130"-"\u318f",
>"\u31f0"-"\u31ff",
>"\u3300"-"\u337f",
>"\u3400"-"\u4dbf",
>"\u4e00"-"\u9fff",
>"\uac00"-"\ud7a3",
>"\uf900"-"\ufaff",
>"\uff65"-"\uffdc"   
>   ]
>   >




[jira] Created: (LUCENE-570) Expose directory on IndexReader

2006-05-15 Thread John Wang (JIRA)
Expose directory on IndexReader
---

 Key: LUCENE-570
 URL: http://issues.apache.org/jira/browse/LUCENE-570
 Project: Lucene - Java
Type: Improvement

  Components: Index  
Versions: 1.9
Reporter: John Wang
Priority: Trivial


It would be really useful to expose the index directory on the IndexReader 
class.




[jira] Created: (LUCENE-1506) Adding FilteredDocIdSet and FilteredDocIdSetIterator

2008-12-31 Thread John Wang (JIRA)
Adding FilteredDocIdSet and FilteredDocIdSetIterator


 Key: LUCENE-1506
 URL: https://issues.apache.org/jira/browse/LUCENE-1506
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.4
Reporter: John Wang


Adding two convenience classes: FilteredDocIdSet and FilteredDocIdSetIterator.
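To illustrate the idea behind these classes, here is a minimal standalone sketch (simplified stand-ins, not the actual Lucene API): the wrapper advances an underlying doc-id iterator and exposes only docs for which a match check passes, so the check is evaluated lazily, per candidate doc.

```java
import java.util.function.IntPredicate;

public class FilteredDocIds {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    // A cursor over a sorted array of doc ids (stand-in for a DocIdSetIterator).
    static class DocIdIterator {
        private final int[] docs;
        private int pos = -1;
        DocIdIterator(int[] docs) { this.docs = docs; }
        int nextDoc() {
            pos++;
            return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
        }
    }

    // The filtering wrapper: match decides lazily, per candidate doc.
    static class FilteredIterator extends DocIdIterator {
        private final IntPredicate match;
        FilteredIterator(int[] docs, IntPredicate match) {
            super(docs);
            this.match = match;
        }
        @Override
        int nextDoc() {
            int doc;
            while ((doc = super.nextDoc()) != NO_MORE_DOCS) {
                if (match.test(doc)) return doc; // only matching docs escape
            }
            return NO_MORE_DOCS;
        }
    }
}
```

Iterating `{1, 2, 3, 4, 5}` with an is-even predicate would yield only 2 and 4, without ever materializing a full set.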

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1506) Adding FilteredDocIdSet and FilteredDocIdSetIterator

2008-12-31 Thread John Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Wang updated LUCENE-1506:
--

Attachment: filteredDocidset.txt

> Adding FilteredDocIdSet and FilteredDocIdSetIterator
> 
>
> Key: LUCENE-1506
> URL: https://issues.apache.org/jira/browse/LUCENE-1506
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.4
>Reporter: John Wang
> Attachments: filteredDocidset.txt
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Adding 2 convenience classes: FilteredDocIdSet and FilteredDocIDSetIterator.




[jira] Created: (LUCENE-1507) adding EmptyDocIdSet/Iterator

2008-12-31 Thread John Wang (JIRA)
adding EmptyDocIdSet/Iterator
-

 Key: LUCENE-1507
 URL: https://issues.apache.org/jira/browse/LUCENE-1507
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.4
Reporter: John Wang


Adding convenience classes for EmptyDocIdSet and EmptyDocIdSetIterator
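The empty-set case is trivial but worth a sketch (again a simplified stand-in, not the actual Lucene class): an iterator that is always exhausted, so both stepping and skipping report no docs.

```java
public class EmptyIds {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    // Stand-in for an EmptyDocIdSetIterator: never yields any doc.
    static class EmptyDocIdIterator {
        int nextDoc() { return NO_MORE_DOCS; }           // nothing to iterate
        int advance(int target) { return NO_MORE_DOCS; } // skipping finds nothing
    }
}
```

A shared singleton of such a class lets callers return "no matches" without null checks or allocating per query.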




[jira] Updated: (LUCENE-1507) adding EmptyDocIdSet/Iterator

2008-12-31 Thread John Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Wang updated LUCENE-1507:
--

Attachment: emptydocidset.txt

> adding EmptyDocIdSet/Iterator
> -
>
> Key: LUCENE-1507
> URL: https://issues.apache.org/jira/browse/LUCENE-1507
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.4
>Reporter: John Wang
> Attachments: emptydocidset.txt
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Adding convenience classes for EmptyDocIdSet and EmptyDocIdSetIterator




[jira] Commented: (LUCENE-1345) Allow Filter as clause to BooleanQuery

2009-01-09 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662627#action_12662627
 ] 

John Wang commented on LUCENE-1345:
---

Added perf comparisons of the boolean set iterators with the current scorers.

See patch.

System: Ubuntu
java version "1.6.0_11"
Intel Core 2 Duo, 2.44 GHz

new  milliseconds=470
new  milliseconds=534
new  milliseconds=450
new  milliseconds=443
new  milliseconds=444
new  milliseconds=445
new  milliseconds=449
new  milliseconds=441
new  milliseconds=444
new  milliseconds=445
new total milliseconds=4565
old  milliseconds=529
old  milliseconds=491
old  milliseconds=428
old  milliseconds=549
old  milliseconds=427
old  milliseconds=424
old  milliseconds=420
old  milliseconds=424
old  milliseconds=423
old  milliseconds=422
old total milliseconds=4537

New/Old Time 4565/4537 (100.61715%)
OrDocIdSetIterator  milliseconds=1138
OrDocIdSetIterator  milliseconds=1106
OrDocIdSetIterator  milliseconds=1065
OrDocIdSetIterator  milliseconds=1066
OrDocIdSetIterator  milliseconds=1065
OrDocIdSetIterator  milliseconds=1067
OrDocIdSetIterator  milliseconds=1072
OrDocIdSetIterator  milliseconds=1118
OrDocIdSetIterator  milliseconds=1065
OrDocIdSetIterator  milliseconds=1069
OrDocIdSetIterator total milliseconds=10831
DisjunctionMaxScorer  milliseconds=1914
DisjunctionMaxScorer  milliseconds=1981
DisjunctionMaxScorer  milliseconds=1861
DisjunctionMaxScorer  milliseconds=1893
DisjunctionMaxScorer  milliseconds=1886
DisjunctionMaxScorer  milliseconds=1885
DisjunctionMaxScorer  milliseconds=1887
DisjunctionMaxScorer  milliseconds=1889
DisjunctionMaxScorer  milliseconds=1891
DisjunctionMaxScorer  milliseconds=1888
DisjunctionMaxScorer total milliseconds=18975
Or/DisjunctionMax Time 10831/18975 (57.080368%)
OrDocIdSetIterator  milliseconds=1079
OrDocIdSetIterator  milliseconds=1075
OrDocIdSetIterator  milliseconds=1076
OrDocIdSetIterator  milliseconds=1093
OrDocIdSetIterator  milliseconds=1077
OrDocIdSetIterator  milliseconds=1074
OrDocIdSetIterator  milliseconds=1078
OrDocIdSetIterator  milliseconds=1075
OrDocIdSetIterator  milliseconds=1074
OrDocIdSetIterator  milliseconds=1074
OrDocIdSetIterator total milliseconds=10775
DisjunctionSumScorer  milliseconds=1398
DisjunctionSumScorer  milliseconds=1322
DisjunctionSumScorer  milliseconds=1320
DisjunctionSumScorer  milliseconds=1305
DisjunctionSumScorer  milliseconds=1304
DisjunctionSumScorer  milliseconds=1301
DisjunctionSumScorer  milliseconds=1304
DisjunctionSumScorer  milliseconds=1300
DisjunctionSumScorer  milliseconds=1301
DisjunctionSumScorer  milliseconds=1317
DisjunctionSumScorer total milliseconds=13172
Or/DisjunctionSum Time 10775/13172 (81.80231%)
AndDocIdSetIterator  milliseconds=330
AndDocIdSetIterator  milliseconds=336
AndDocIdSetIterator  milliseconds=298
AndDocIdSetIterator  milliseconds=299
AndDocIdSetIterator  milliseconds=310
AndDocIdSetIterator  milliseconds=298
AndDocIdSetIterator  milliseconds=298
AndDocIdSetIterator  milliseconds=334
AndDocIdSetIterator  milliseconds=298
AndDocIdSetIterator  milliseconds=299
AndDocIdSetIterator total milliseconds=3100
ConjunctionScorer  milliseconds=332
ConjunctionScorer  milliseconds=307
ConjunctionScorer  milliseconds=302
ConjunctionScorer  milliseconds=350
ConjunctionScorer  milliseconds=300
ConjunctionScorer  milliseconds=304
ConjunctionScorer  milliseconds=305
ConjunctionScorer  milliseconds=303
ConjunctionScorer  milliseconds=303
ConjunctionScorer  milliseconds=299
ConjunctionScorer total milliseconds=3105
And/Conjunction Time 3100/3105 (99.83897%)


> Allow Filter as clause to BooleanQuery
> --
>
> Key: LUCENE-1345
> URL: https://issues.apache.org/jira/browse/LUCENE-1345
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Paul Elschot
>Priority: Minor
> Attachments: DisjunctionDISI.java, DisjunctionDISI.patch, 
> DisjunctionDISI.patch, LUCENE-1345.patch, LUCENE-1345.patch, 
> OpenBitSetIteratorExperiment.java, TestIteratorPerf.java, 
> TestIteratorPerf.java
>
>





[jira] Updated: (LUCENE-1345) Allow Filter as clause to BooleanQuery

2009-01-09 Thread John Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Wang updated LUCENE-1345:
--

Attachment: booleansetperf.txt

Added And/Or/Not DocIdSet/Iterators.

Code ported over from Kamikaze:
http://code.google.com/p/lucene-ext/

Perf test updated.

Main contributors to the patch: Anmol Bhasin & Yasuhiro Matsuda


> Allow Filter as clause to BooleanQuery
> --
>
> Key: LUCENE-1345
> URL: https://issues.apache.org/jira/browse/LUCENE-1345
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Paul Elschot
>Priority: Minor
> Attachments: booleansetperf.txt, DisjunctionDISI.java, 
> DisjunctionDISI.patch, DisjunctionDISI.patch, LUCENE-1345.patch, 
> LUCENE-1345.patch, OpenBitSetIteratorExperiment.java, TestIteratorPerf.java, 
> TestIteratorPerf.java
>
>





[jira] Commented: (LUCENE-1345) Allow Filter as clause to BooleanQuery

2009-01-09 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662632#action_12662632
 ] 

John Wang commented on LUCENE-1345:
---

Given the performance improvements we see, can we consider raising the priority?

> Allow Filter as clause to BooleanQuery
> --
>
> Key: LUCENE-1345
> URL: https://issues.apache.org/jira/browse/LUCENE-1345
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Paul Elschot
>Priority: Minor
> Attachments: booleansetperf.txt, DisjunctionDISI.java, 
> DisjunctionDISI.patch, DisjunctionDISI.patch, LUCENE-1345.patch, 
> LUCENE-1345.patch, OpenBitSetIteratorExperiment.java, TestIteratorPerf.java, 
> TestIteratorPerf.java
>
>





[jira] Commented: (LUCENE-1345) Allow Filter as clause to BooleanQuery

2009-01-10 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662673#action_12662673
 ] 

John Wang commented on LUCENE-1345:
---

Filters, by definition (AFAIK), do not participate in scoring. Since score 
gathering is done at the BooleanQuery level, does that mean BooleanQuery would 
need an instanceof check to see whether a clause is a Filter?

Or do we always hard-code the filter's score as 0? That is also dangerous if 
people augment scores at the HitCollector level, or if the score-gathering 
logic changes to something not as straightforward as summing.

My two cents.
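The concern about hard-coding a score of 0 can be made concrete: a zero-scoring clause is neutral when clause scores are summed, but not under other combination rules. A small standalone illustration with hypothetical scores (the combination rules here are simplified examples, not Lucene's actual scoring):

```java
import java.util.stream.DoubleStream;

public class ScoreCombine {
    // Sum-based combination: a 0-score filter clause changes nothing.
    static double sum(double... clauseScores) {
        return DoubleStream.of(clauseScores).sum();
    }

    // A different rule, e.g. min-based: a 0-score clause drags the result to 0.
    static double min(double... clauseScores) {
        return DoubleStream.of(clauseScores).min().orElse(0.0);
    }
}
```

Under `sum`, adding a 0.0 clause leaves the combined score unchanged; under `min` it collapses the score to 0, which is exactly the kind of breakage the comment above warns about.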

> Allow Filter as clause to BooleanQuery
> --
>
> Key: LUCENE-1345
> URL: https://issues.apache.org/jira/browse/LUCENE-1345
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Paul Elschot
>Priority: Minor
> Fix For: 2.9
>
> Attachments: booleansetperf.txt, DisjunctionDISI.java, 
> DisjunctionDISI.patch, DisjunctionDISI.patch, LUCENE-1345.patch, 
> LUCENE-1345.patch, OpenBitSetIteratorExperiment.java, TestIteratorPerf.java, 
> TestIteratorPerf.java
>
>





[jira] Commented: (LUCENE-1506) Adding FilteredDocIdSet and FilteredDocIdSetIterator

2009-01-28 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668244#action_12668244
 ] 

John Wang commented on LUCENE-1506:
---

A Filter calculates a DocIdSet given an IndexReader. Imagine a large index 
where the logic to decide whether a document is in the set is non-trivial; 
building this DocIdSet can be expensive.

In the case where the driving query produces a very small result set, the 
validation can be performed only on that small set via the match call.

Yes, in terms of functionality one can do this with a plain filter, but it is 
wasteful to perform the validation calculation on the entire index when the 
set of candidate hits is small.

> Adding FilteredDocIdSet and FilteredDocIdSetIterator
> 
>
> Key: LUCENE-1506
> URL: https://issues.apache.org/jira/browse/LUCENE-1506
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.4
>Reporter: John Wang
> Attachments: filteredDocidset.txt
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Adding 2 convenience classes: FilteredDocIdSet and FilteredDocIDSetIterator.




[jira] Commented: (LUCENE-1506) Adding FilteredDocIdSet and FilteredDocIdSetIterator

2009-01-29 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668668#action_12668668
 ] 

John Wang commented on LUCENE-1506:
---

sure, will work on that.

> Adding FilteredDocIdSet and FilteredDocIdSetIterator
> 
>
> Key: LUCENE-1506
> URL: https://issues.apache.org/jira/browse/LUCENE-1506
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.4
>Reporter: John Wang
>Assignee: Michael McCandless
> Attachments: filteredDocidset.txt
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Adding 2 convenience classes: FilteredDocIdSet and FilteredDocIDSetIterator.




[jira] Updated: (LUCENE-1506) Adding FilteredDocIdSet and FilteredDocIdSetIterator

2009-01-29 Thread John Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Wang updated LUCENE-1506:
--

Attachment: filteredDocidset2.txt

javadoc and unit test added

> Adding FilteredDocIdSet and FilteredDocIdSetIterator
> 
>
> Key: LUCENE-1506
> URL: https://issues.apache.org/jira/browse/LUCENE-1506
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.4
>Reporter: John Wang
>Assignee: Michael McCandless
> Attachments: filteredDocidset.txt, filteredDocidset2.txt
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Adding 2 convenience classes: FilteredDocIdSet and FilteredDocIDSetIterator.




[jira] Commented: (LUCENE-1506) Adding FilteredDocIdSet and FilteredDocIdSetIterator

2009-01-30 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669105#action_12669105
 ] 

John Wang commented on LUCENE-1506:
---

Thanks Michael!

> Adding FilteredDocIdSet and FilteredDocIdSetIterator
> 
>
> Key: LUCENE-1506
> URL: https://issues.apache.org/jira/browse/LUCENE-1506
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.4
>Reporter: John Wang
>Assignee: Michael McCandless
> Attachments: filteredDocidset.txt, filteredDocidset2.txt, 
> LUCENE-1506.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Adding 2 convenience classes: FilteredDocIdSet and FilteredDocIDSetIterator.




[jira] Created: (LUCENE-1612) expose lastDocId in the posting from the TermEnum API

2009-04-24 Thread John Wang (JIRA)
expose lastDocId in the posting from the TermEnum API
-

 Key: LUCENE-1612
 URL: https://issues.apache.org/jira/browse/LUCENE-1612
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.4
Reporter: John Wang


We currently have docFreq() on the TermEnum API, which gives the number of 
docs in the posting.
It would be good to also have the max doc id in the posting. That information 
is useful when constructing a custom DocIdSet, e.g. to determine the 
sparseness of the doc list and decide whether or not to use a BitSet.

I have written a patch to do this; the problem is that TermInfosWriter 
encodes values as VInt/VLong, so there is very little flexibility to add 
lastDocId while keeping the index backward compatible. (If a plain int were 
used for, say, docFreq, a bit could be used to flag the reading of a new 
piece of information.)

output.writeVInt(ti.docFreq);                           // write doc freq
output.writeVLong(ti.freqPointer - lastTi.freqPointer); // write pointers
output.writeVLong(ti.proxPointer - lastTi.proxPointer);

Anyway, the patch is attached, with TestSegmentTermEnum modified to test this. 
TestBackwardsCompatibility fails for the reasons described above.
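The flag-bit idea hinted at in the description can be sketched standalone (this is an illustration of the technique, not Lucene's actual on-disk format, and the names are hypothetical): shift the value left by one and use the low bit to signal that an extra field such as lastDocId follows.

```java
public class FlagVInt {
    // Encode docFreq with a low-bit flag indicating whether an extra
    // field (e.g. lastDocId) follows in the stream.
    static int encode(int docFreq, boolean hasLastDocId) {
        return (docFreq << 1) | (hasLastDocId ? 1 : 0);
    }

    // Decoder: flag is the low bit, the frequency is the remaining bits.
    static int decodeFreq(int encoded) { return encoded >>> 1; }
    static boolean decodeFlag(int encoded) { return (encoded & 1) != 0; }
}
```

The cost is one bit per value, and old readers would still misinterpret the shifted values, so a format-version bump is needed either way; that is essentially the backward-compatibility bind described above.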




[jira] Updated: (LUCENE-1612) expose lastDocId in the posting from the TermEnum API

2009-04-24 Thread John Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Wang updated LUCENE-1612:
--

Attachment: lucene-1612-patch.txt

Patch attached, with test. The index is not backwards compatible.

> expose lastDocId in the posting from the TermEnum API
> -
>
> Key: LUCENE-1612
> URL: https://issues.apache.org/jira/browse/LUCENE-1612
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4
>Reporter: John Wang
> Attachments: lucene-1612-patch.txt
>
>
> We currently have on the TermEnum api: docFreq() which gives the number docs 
> in the posting.
> It would be good to also have the max docid in the posting. That information 
> is useful when construction a custom DocIdSet, .e.g determine sparseness of 
> the doc list to decide whether or not to use a BitSet.
> I have written a patch to do this, the problem with it is the TermInfosWriter 
> encodes values in VInt/VLong, there is very little flexibility to add in 
> lastDocId while making the index backward compatible. (If simple int is used 
> for say, docFreq, a bit can be used to flag reading of a new piece of 
> information)
> output.writeVInt(ti.docFreq);   // write doc freq
> output.writeVLong(ti.freqPointer - lastTi.freqPointer); // write pointers
> output.writeVLong(ti.proxPointer - lastTi.proxPointer);
> Anyway, patch is attached with:TestSegmentTermEnum modified to test this. 
> TestBackwardsCompatibility fails due to reasons described above.




[jira] Created: (LUCENE-1613) TermEnum.docFreq() is not updated when there are deletes

2009-04-24 Thread John Wang (JIRA)
TermEnum.docFreq() is not updated when there are deletes


 Key: LUCENE-1613
 URL: https://issues.apache.org/jira/browse/LUCENE-1613
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 2.4
Reporter: John Wang


TermEnum.docFreq() is used in many places, especially scoring. However, if 
there are deletes in the index that have not yet been merged away, this value 
is not updated.

Attached is a test case.




[jira] Updated: (LUCENE-1613) TermEnum.docFreq() is not updated when there are deletes

2009-04-24 Thread John Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Wang updated LUCENE-1613:
--

Attachment: TestDeleteAndDocFreq.java

Test showing docFreq not updated when there are deletes.

> TermEnum.docFreq() is not updated with there are deletes
> 
>
> Key: LUCENE-1613
> URL: https://issues.apache.org/jira/browse/LUCENE-1613
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Search
>Affects Versions: 2.4
>Reporter: John Wang
> Attachments: TestDeleteAndDocFreq.java
>
>
> TermEnum.docFreq is used in many places, especially scoring. However, if 
> there are deletes in the index and it is not yet merged, this value is not 
> updated.
> Attached is a test case.




[jira] Commented: (LUCENE-1613) TermEnum.docFreq() is not updated when there are deletes

2009-04-24 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702649#action_12702649
 ] 

John Wang commented on LUCENE-1613:
---

I understand this is a rather difficult problem to fix, but I thought a JIRA 
ticket would still be good for tracking purposes. I will let the committers 
decide on the urgency of this issue.

> TermEnum.docFreq() is not updated with there are deletes
> 
>
> Key: LUCENE-1613
> URL: https://issues.apache.org/jira/browse/LUCENE-1613
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Search
>Affects Versions: 2.4
>Reporter: John Wang
> Attachments: TestDeleteAndDocFreq.java
>
>
> TermEnum.docFreq is used in many places, especially scoring. However, if 
> there are deletes in the index and it is not yet merged, this value is not 
> updated.
> Attached is a test case.




[jira] Commented: (LUCENE-1613) TermEnum.docFreq() is not updated when there are deletes

2009-04-25 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702691#action_12702691
 ] 

John Wang commented on LUCENE-1613:
---

Michael: we actually ran into this in facet search. For a null search, 
instead of counting results from a MatchAllDocsQuery, we were just using the 
docFreq() method to avoid facet counting. The problem came when there were 
updates. We did get around it, but it was rather cumbersome.

I agree the fix is non-trivial; I just wanted to open an issue for tracking 
purposes in case we think of something.

> TermEnum.docFreq() is not updated with there are deletes
> 
>
> Key: LUCENE-1613
> URL: https://issues.apache.org/jira/browse/LUCENE-1613
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Search
>Affects Versions: 2.4
>Reporter: John Wang
> Attachments: TestDeleteAndDocFreq.java
>
>
> TermEnum.docFreq is used in many places, especially scoring. However, if 
> there are deletes in the index and it is not yet merged, this value is not 
> updated.
> Attached is a test case.




[jira] Commented: (LUCENE-1612) expose lastDocId in the posting from the TermEnum API

2009-04-25 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702692#action_12702692
 ] 

John Wang commented on LUCENE-1612:
---

Excellent point, Michael! What do you suggest for moving forward with this?

> expose lastDocId in the posting from the TermEnum API
> -
>
> Key: LUCENE-1612
> URL: https://issues.apache.org/jira/browse/LUCENE-1612
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4
>Reporter: John Wang
> Attachments: lucene-1612-patch.txt
>
>
> We currently have on the TermEnum api: docFreq() which gives the number docs 
> in the posting.
> It would be good to also have the max docid in the posting. That information 
> is useful when construction a custom DocIdSet, .e.g determine sparseness of 
> the doc list to decide whether or not to use a BitSet.
> I have written a patch to do this, the problem with it is the TermInfosWriter 
> encodes values in VInt/VLong, there is very little flexibility to add in 
> lastDocId while making the index backward compatible. (If simple int is used 
> for say, docFreq, a bit can be used to flag reading of a new piece of 
> information)
> output.writeVInt(ti.docFreq);   // write doc freq
> output.writeVLong(ti.freqPointer - lastTi.freqPointer); // write pointers
> output.writeVLong(ti.proxPointer - lastTi.proxPointer);
> Anyway, patch is attached with:TestSegmentTermEnum modified to test this. 
> TestBackwardsCompatibility fails due to reasons described above.




[jira] Commented: (LUCENE-1612) expose lastDocId in the posting from the TermEnum API

2009-04-25 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702827#action_12702827
 ] 

John Wang commented on LUCENE-1612:
---

I am fine with waiting for LUCENE-1458. But Michael, how would that help the 
merge of postings you described? Merging would be outside of the codec, no?

> expose lastDocId in the posting from the TermEnum API
> -
>
> Key: LUCENE-1612
> URL: https://issues.apache.org/jira/browse/LUCENE-1612
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4
>Reporter: John Wang
> Attachments: lucene-1612-patch.txt
>
>
> We currently have on the TermEnum api: docFreq() which gives the number docs 
> in the posting.
> It would be good to also have the max docid in the posting. That information 
> is useful when construction a custom DocIdSet, .e.g determine sparseness of 
> the doc list to decide whether or not to use a BitSet.
> I have written a patch to do this, the problem with it is the TermInfosWriter 
> encodes values in VInt/VLong, there is very little flexibility to add in 
> lastDocId while making the index backward compatible. (If simple int is used 
> for say, docFreq, a bit can be used to flag reading of a new piece of 
> information)
> output.writeVInt(ti.docFreq);   // write doc freq
> output.writeVLong(ti.freqPointer - lastTi.freqPointer); // write pointers
> output.writeVLong(ti.proxPointer - lastTi.proxPointer);
> Anyway, patch is attached with:TestSegmentTermEnum modified to test this. 
> TestBackwardsCompatibility fails due to reasons described above.




[jira] Created: (LUCENE-1632) boolean docid set iterator improvement

2009-05-09 Thread John Wang (JIRA)
boolean docid set iterator improvement
--

 Key: LUCENE-1632
 URL: https://issues.apache.org/jira/browse/LUCENE-1632
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 2.4
Reporter: John Wang
 Attachments: Lucene-1632-patch.txt

This was first brought up in LUCENE-1345, but that conversation has digressed. 
As suggested, creating a separate issue to track it.
Added performance comparisons of the boolean set iterators with the current 
scorers; see patch.

System: Ubuntu
java version "1.6.0_11"
Intel Core 2 Duo 2.44GHz

new milliseconds=470
new milliseconds=534
new milliseconds=450
new milliseconds=443
new milliseconds=444
new milliseconds=445
new milliseconds=449
new milliseconds=441
new milliseconds=444
new milliseconds=445
new total milliseconds=4565
old milliseconds=529
old milliseconds=491
old milliseconds=428
old milliseconds=549
old milliseconds=427
old milliseconds=424
old milliseconds=420
old milliseconds=424
old milliseconds=423
old milliseconds=422
old total milliseconds=4537

New/Old Time 4565/4537 (100.61715%)
OrDocIdSetIterator milliseconds=1138
OrDocIdSetIterator milliseconds=1106
OrDocIdSetIterator milliseconds=1065
OrDocIdSetIterator milliseconds=1066
OrDocIdSetIterator milliseconds=1065
OrDocIdSetIterator milliseconds=1067
OrDocIdSetIterator milliseconds=1072
OrDocIdSetIterator milliseconds=1118
OrDocIdSetIterator milliseconds=1065
OrDocIdSetIterator milliseconds=1069
OrDocIdSetIterator total milliseconds=10831
DisjunctionMaxScorer milliseconds=1914
DisjunctionMaxScorer milliseconds=1981
DisjunctionMaxScorer milliseconds=1861
DisjunctionMaxScorer milliseconds=1893
DisjunctionMaxScorer milliseconds=1886
DisjunctionMaxScorer milliseconds=1885
DisjunctionMaxScorer milliseconds=1887
DisjunctionMaxScorer milliseconds=1889
DisjunctionMaxScorer milliseconds=1891
DisjunctionMaxScorer milliseconds=1888
DisjunctionMaxScorer total milliseconds=18975
Or/DisjunctionMax Time 10831/18975 (57.080368%)
OrDocIdSetIterator milliseconds=1079
OrDocIdSetIterator milliseconds=1075
OrDocIdSetIterator milliseconds=1076
OrDocIdSetIterator milliseconds=1093
OrDocIdSetIterator milliseconds=1077
OrDocIdSetIterator milliseconds=1074
OrDocIdSetIterator milliseconds=1078
OrDocIdSetIterator milliseconds=1075
OrDocIdSetIterator milliseconds=1074
OrDocIdSetIterator milliseconds=1074
OrDocIdSetIterator total milliseconds=10775
DisjunctionSumScorer milliseconds=1398
DisjunctionSumScorer milliseconds=1322
DisjunctionSumScorer milliseconds=1320
DisjunctionSumScorer milliseconds=1305
DisjunctionSumScorer milliseconds=1304
DisjunctionSumScorer milliseconds=1301
DisjunctionSumScorer milliseconds=1304
DisjunctionSumScorer milliseconds=1300
DisjunctionSumScorer milliseconds=1301
DisjunctionSumScorer milliseconds=1317
DisjunctionSumScorer total milliseconds=13172
Or/DisjunctionSum Time 10775/13172 (81.80231%)
AndDocIdSetIterator milliseconds=330
AndDocIdSetIterator milliseconds=336
AndDocIdSetIterator milliseconds=298
AndDocIdSetIterator milliseconds=299
AndDocIdSetIterator milliseconds=310
AndDocIdSetIterator milliseconds=298
AndDocIdSetIterator milliseconds=298
AndDocIdSetIterator milliseconds=334
AndDocIdSetIterator milliseconds=298
AndDocIdSetIterator milliseconds=299
AndDocIdSetIterator total milliseconds=3100
ConjunctionScorer milliseconds=332
ConjunctionScorer milliseconds=307
ConjunctionScorer milliseconds=302
ConjunctionScorer milliseconds=350
ConjunctionScorer milliseconds=300
ConjunctionScorer milliseconds=304
ConjunctionScorer milliseconds=305
ConjunctionScorer milliseconds=303
ConjunctionScorer milliseconds=303
ConjunctionScorer milliseconds=299
ConjunctionScorer total milliseconds=3105
And/Conjunction Time 3100/3105 (99.83897%)


Main contributors to the patch: Anmol Bhasin & Yasuhiro Matsuda


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1632) boolean docid set iterator improvement

2009-05-09 Thread John Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Wang updated LUCENE-1632:
--

Attachment: Lucene-1632-patch.txt

> boolean docid set iterator improvement
> --
>
> Key: LUCENE-1632
> URL: https://issues.apache.org/jira/browse/LUCENE-1632
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.4
>Reporter: John Wang
> Attachments: Lucene-1632-patch.txt
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1632) boolean docid set iterator improvement

2009-05-12 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708659#action_12708659
 ] 

John Wang commented on LUCENE-1632:
---

I think we have an improvement for ConjunctionScorer as well, of about 10%. We 
need to clean it up for a patch.

To be clear, these are not algorithmic changes; this is code tuning performed 
on the same algorithm.
The naming is kept consistent with the current Lucene class names, e.g. 
DocIdSet, DocIdSetIterator.

Feel free to do more code tuning if you feel it would improve performance 
further.

> boolean docid set iterator improvement
> --
>
> Key: LUCENE-1632
> URL: https://issues.apache.org/jira/browse/LUCENE-1632
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.4
>Reporter: John Wang
> Attachments: Lucene-1632-patch.txt
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr.

[jira] Commented: (LUCENE-1634) LogMergePolicy should use the number of deleted docs when deciding which segments to merge

2009-05-13 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708986#action_12708986
 ] 

John Wang commented on LUCENE-1634:
---

This is actually referring to the optimize(int) call, which selectively merges 
segments to ensure the total number of segments in the index is less than or 
equal to the specified number.

> LogMergePolicy should use the number of deleted docs when deciding which 
> segments to merge
> --
>
> Key: LUCENE-1634
> URL: https://issues.apache.org/jira/browse/LUCENE-1634
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Yasuhiro Matsuda
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1634.patch
>
>
> I found that IndexWriter.optimize(int) method does not pick up large segments 
> with a lot of deletes even when most of the docs are deleted. And the 
> existence of such segments affected the query performance significantly.
> I created an index with 1 million docs, then went over all docs and updated a 
> few thousand at a time.  I ran optimize(20) occasionally. What I saw were large 
> segments with most of their docs deleted. Although these segments had hardly 
> any valid docs, they remained in the directory for a very long time, until more 
> segments with comparable or bigger sizes were created.
> This is because LogMergePolicy.findMergeForOptimize uses the size of segments 
> but does not take the number of deleted documents into consideration when it 
> decides which segments to merge. So, a simple fix is to use the delete count 
> to calibrate the segment size. I can create a patch for this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1634) LogMergePolicy should use the number of deleted docs when deciding which segments to merge

2009-05-13 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708989#action_12708989
 ] 

John Wang commented on LUCENE-1634:
---

Comment on implementing a custom merge policy:
As the API current stands, I think the behavior is to assume a subclass of 
LogMergePolicy. And one cannot subclass LogMergePolicy without injecting the 
class into the org.apache.lucene.index package, (because the api signature: 
size(org.apache.lucene.index.SegmentInfo info), SegmentInfo is not an exposed 
API.

> LogMergePolicy should use the number of deleted docs when deciding which 
> segments to merge
> --
>
> Key: LUCENE-1634
> URL: https://issues.apache.org/jira/browse/LUCENE-1634
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Yasuhiro Matsuda
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1634.patch
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1634) LogMergePolicy should use the number of deleted docs when deciding which segments to merge

2009-05-13 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708995#action_12708995
 ] 

John Wang commented on LUCENE-1634:
---

The current Lucene implementation of optimize(int) selects segments to merge 
based on the file size of the segment: say the index has 10 segments and 
optimize(6) is called; Lucene finds the 4 smallest segments by number of bytes 
in the segment files. 

This selection criterion is flawed because a segment can be very large in 
terms of bytes but very small in terms of numDocs (if it has many deleted docs). 
Having these segment files around impacts performance considerably. 

This is what this patch tries to fix, in a non-intrusive manner, by extending 
LogMergePolicy and normalizing the calculation of the segment size to include 
the delete count.
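The normalization described above can be sketched as a stand-alone computation (the class and method names here are illustrative, not the patch's):

```java
// Hypothetical sketch of delete-calibrated segment sizing: scale the raw byte
// size by the fraction of documents that are still live.
public class SegmentSizeCalibration {
    static long calibratedSize(long sizeInBytes, int docCount, int delCount) {
        if (docCount <= 0) {
            return 0;
        }
        double liveRatio = (double) (docCount - delCount) / docCount;
        return (long) (sizeInBytes * liveRatio);
    }

    public static void main(String[] args) {
        // A 1 GB segment with 90% of its docs deleted is treated as ~100 MB,
        // making it a merge candidate ahead of smaller fully-live segments.
        System.out.println(calibratedSize(1_000_000_000L, 1_000_000, 900_000));  // 100000000
    }
}
```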


> LogMergePolicy should use the number of deleted docs when deciding which 
> segments to merge
> --
>
> Key: LUCENE-1634
> URL: https://issues.apache.org/jira/browse/LUCENE-1634
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Yasuhiro Matsuda
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1634.patch
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1634) LogMergePolicy should use the number of deleted docs when deciding which segments to merge

2009-05-13 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709002#action_12709002
 ] 

John Wang commented on LUCENE-1634:
---

RE: implementing custom MergePolicy
Let me describe in detail on problems of implementing a custom MergePolicy:

1) In the IndexWriter code, methods on MergePolicy such as findMergesForOptimize 
are called. I believe that is the contract for implementing your own 
MergePolicy. However, it is "hidden" by the javadoc in terms of documentation, 
and furthermore, these methods are package-protected. So to implement your own 
MergePolicy, you have to resort to sneaking the class into the package.

2) Not only is set/getUseCompoundFile no longer applicable if LogMergePolicy is 
not used; popular methods such as set/getMergeFactor etc. are also only 
applicable to LogMergePolicy. (Just to clarify, useCompoundFile is a 
package-protected method on the base MergePolicy class, so my guess is 
that set/getUseCompoundFile should be applicable to all implementations of 
MergePolicy.)

This brings up another issue about the practice of having to "sneak" classes 
into a package. We are looking at making our Lucene code OSGi-compliant, and 
this becomes an issue because we cannot have multiple "bundles" exporting the 
same package. That means I would have to repackage Lucene to include the 
classes I have snuck into its packages. I would like to use a standard 
distribution of the Lucene jar (as suggested/echoed by some Luceners).


> LogMergePolicy should use the number of deleted docs when deciding which 
> segments to merge
> --
>
> Key: LUCENE-1634
> URL: https://issues.apache.org/jira/browse/LUCENE-1634
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Yasuhiro Matsuda
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1634.patch
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1634) LogMergePolicy should use the number of deleted docs when deciding which segments to merge

2009-05-13 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709003#action_12709003
 ] 

John Wang commented on LUCENE-1634:
---

>> So let's proceed with this patch, once you've added setter/getter.
Can you please elaborate on this? Add a setter/getter on what? The number of 
target segments is already an input parameter. Do you mean some sort of 
normalization factor for how much to "punish" segments with many deleted 
docs?

> LogMergePolicy should use the number of deleted docs when deciding which 
> segments to merge
> --
>
> Key: LUCENE-1634
> URL: https://issues.apache.org/jira/browse/LUCENE-1634
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Yasuhiro Matsuda
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1634.patch
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1634) LogMergePolicy should use the number of deleted docs when deciding which segments to merge

2009-05-13 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709024#action_12709024
 ] 

John Wang commented on LUCENE-1634:
---

>>I mean a setter/getter to turn on/off "taking deletions into account" in 
>>Log*MergePolicy.
Makes sense. What do you suggest the default behavior should be?
Also, do you think a setter/getter is the right approach, since it is very much 
hidden from the API? E.g. one would have to do this:

LogMergePolicy policy = (LogMergePolicy) idxWriter.getMergePolicy();
policy.setTurnOnSegmentCalcWithDeletes(true);

Do you think instead we could just add a static setter/getter on the 
LogMergePolicy class?


> LogMergePolicy should use the number of deleted docs when deciding which 
> segments to merge
> --
>
> Key: LUCENE-1634
> URL: https://issues.apache.org/jira/browse/LUCENE-1634
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Yasuhiro Matsuda
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1634.patch
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1642) IndexWriter.addIndexesNoOptimize ignores the compound file setting of the destination index

2009-05-18 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12710620#action_12710620
 ] 

John Wang commented on LUCENE-1642:
---

In the 2.4.0 branch, IndexWriter.java, method resolveExternalSegments, line 3072:
  final MergePolicy.OneMerge newMerge = new 
MergePolicy.OneMerge(segmentInfos.range(i, 1+i), info.getUseCompoundFile());

info.getUseCompoundFile() seems to be wrong: it is on the source index, not the 
target index. It should be getUseCompoundFile() instead.

> IndexWriter.addIndexesNoOptimize ignores the compound file setting of the 
> destination index
> ---
>
> Key: LUCENE-1642
> URL: https://issues.apache.org/jira/browse/LUCENE-1642
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Reporter: Yasuhiro Matsuda
>Priority: Minor
>
> IndexWriter.addIndexesNoOptimize(Directory[]) ignores the compound file 
> setting of the destination index. It is using the compound file flags of 
> segments in the source indexes.
> This sometimes causes undesired increase of the number of files in the 
> destination index when non-compound file indexes are added until merge kicks 
> in.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-1061) Adding a factory to QueryParser to instantiate query instances

2007-11-19 Thread John Wang (JIRA)
Adding a factory to QueryParser to instantiate query instances
--

 Key: LUCENE-1061
 URL: https://issues.apache.org/jira/browse/LUCENE-1061
 Project: Lucene - Java
  Issue Type: Improvement
  Components: QueryParser
Reporter: John Wang
 Attachments: lucene_patch.txt

With the new efforts around payloads and scoring functions, it would be nice to 
plug in custom query implementations while using the same QueryParser.
Included is a patch with some refactoring of the QueryParser to take a factory 
that produces query instances.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1061) Adding a factory to QueryParser to instantiate query instances

2007-11-19 Thread John Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Wang updated LUCENE-1061:
--

Attachment: lucene_patch.txt

This patch introduces a new file, QueryBuilder, which is just a factory for 
instantiating query objects.

The QueryParser class is modified to use the factory to build the final 
query.

This is backward compatible.
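As a rough illustration of the factory idea (the interface and method names below are hypothetical and only stand in for the patch's QueryBuilder API), a parser can delegate query construction like this:

```java
import java.util.List;

// Hypothetical factory interface: the parser asks it for query objects instead
// of instantiating TermQuery/BooleanQuery directly.
interface QueryFactory<Q> {
    Q term(String field, String text);
    Q bool(List<Q> clauses);
}

// Toy implementation that renders queries as strings, standing in for one that
// would return custom Query subclasses (e.g. payload-scoring variants).
public class QueryFactoryDemo implements QueryFactory<String> {
    public String term(String field, String text) { return field + ":" + text; }
    public String bool(List<String> clauses) { return String.join(" ", clauses); }

    public static void main(String[] args) {
        QueryFactory<String> f = new QueryFactoryDemo();
        System.out.println(f.bool(List.of(f.term("title", "lucene"), f.term("body", "search"))));
        // prints: title:lucene body:search
    }
}
```

Because a default factory would produce the same queries the parser built before, existing callers are unaffected, which is what keeps such a change backward compatible.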

> Adding a factory to QueryParser to instantiate query instances
> --
>
> Key: LUCENE-1061
> URL: https://issues.apache.org/jira/browse/LUCENE-1061
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: QueryParser
>Reporter: John Wang
> Attachments: lucene_patch.txt
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1061) Adding a factory to QueryParser to instantiate query instances

2007-11-20 Thread John Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Wang updated LUCENE-1061:
--

Fix Version/s: 2.3
Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])
Affects Version/s: 2.3

> Adding a factory to QueryParser to instantiate query instances
> --
>
> Key: LUCENE-1061
> URL: https://issues.apache.org/jira/browse/LUCENE-1061
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: QueryParser
>Affects Versions: 2.3
>Reporter: John Wang
> Fix For: 2.3
>
> Attachments: lucene_patch.txt
>
>
> With the new efforts with Payload and scoring functions, it would be nice to 
> plugin custom query implementations while using the same QueryParser.
> Included is a patch with some refactoring the QueryParser to take a factory 
> that produces query instances.




[jira] Created: (LUCENE-1246) Missing a null check in BooleanQuery.toString(String)

2008-03-27 Thread John Wang (JIRA)
Missing a null check in BooleanQuery.toString(String)
-

 Key: LUCENE-1246
 URL: https://issues.apache.org/jira/browse/LUCENE-1246
 Project: Lucene - Java
  Issue Type: Bug
  Components: Query/Scoring
Affects Versions: 2.3.1
Reporter: John Wang


Our query parser/tokenizer in some situations creates a null query, which is 
then added as a clause to a BooleanQuery.
When we try to log the query, an NPE is thrown from log(booleanQuery).

In BooleanQuery.toString(String), a simple null check is overlooked.
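
A sketch of the kind of guard being requested (a hypothetical miniature, not the actual Lucene source): skip null clauses while building the string instead of dereferencing them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical miniature of the scenario: a boolean-style query whose clause
// list may contain nulls, with a toString that guards against them.
class MiniBooleanQuery {
    private final List<Object> clauses = new ArrayList<>();

    void add(Object clause) {
        clauses.add(clause); // a null clause may slip in from an upstream parser
    }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder();
        for (Object c : clauses) {
            if (c == null) continue; // the missing null check: skip, don't NPE
            if (sb.length() > 0) sb.append(' ');
            sb.append(c.toString());
        }
        return sb.toString();
    }
}
```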





[jira] Updated: (LUCENE-1246) Missing a null check in BooleanQuery.toString(String)

2008-03-27 Thread John Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Wang updated LUCENE-1246:
--

Attachment: BooleanQueryNPE.txt

Patch added to fix the NPE.

> Missing a null check in BooleanQuery.toString(String)
> -
>
> Key: LUCENE-1246
> URL: https://issues.apache.org/jira/browse/LUCENE-1246
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Query/Scoring
>Affects Versions: 2.3.1
>Reporter: John Wang
> Attachments: BooleanQueryNPE.txt
>
>
> Our queryParser/tokenizer in some situations creates null query and was added 
> as a clause to Boolean query.
> When we try to log the query, NPE is thrown from log(booleanQuery).
> In BooleanQuery.toString(String), a simple null check is overlooked.




[jira] Commented: (LUCENE-1061) Adding a factory to QueryParser to instantiate query instances

2008-08-28 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12626638#action_12626638
 ] 

John Wang commented on LUCENE-1061:
---

This looks great! Either subclassing or using a factory pattern works well in 
this case. Great job and thanks!

> Adding a factory to QueryParser to instantiate query instances
> --
>
> Key: LUCENE-1061
> URL: https://issues.apache.org/jira/browse/LUCENE-1061
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: QueryParser
>Affects Versions: 2.3
>Reporter: John Wang
>Assignee: Michael McCandless
> Fix For: 2.4
>
> Attachments: LUCENE-1061.patch, LUCENE-1061.patch, lucene_patch.txt
>
>
> With the new efforts with Payload and scoring functions, it would be nice to 
> plugin custom query implementations while using the same QueryParser.
> Included is a patch with some refactoring the QueryParser to take a factory 
> that produces query instances.




[jira] Commented: (LUCENE-1473) Implement Externalizable in main top level searcher classes

2008-12-02 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12652594#action_12652594
 ] 

John Wang commented on LUCENE-1473:
---

The fact that an object implements Serializable implies that it can be 
serialized. It is well-known good Java programming practice to include a suid 
(serialVersionUID, as a static field) in a class that declares itself 
Serializable. If the class is not meant to be serialized, why does it implement 
Serializable? Furthermore, what is the reason to avoid it being serialized? I 
find the "cost of support" reason kind of ridiculous; it could be applied to 
any bug fix, because at the end of the day this is a bug.

I don't understand the issue of "extra bytes" in the term dictionary, since the 
Term instance is not actually serialized into the index (at least I really hope 
it is not).

The serialVersionUID (suid) is a long because that is how Java defines it. Here 
is a link to some information on the subject:
http://java.sun.com/developer/technicalArticles/Programming/serialization/

Use case: deploying Lucene in a distributed environment, we have a 
broker/server architecture (standard stuff), and we want to roll out search 
servers running Lucene 2.4 instance by instance. The problem is that the 
broker, which is running 2.3, sends a Query object to the searcher via Java 
serialization, and because of exactly this problem, 2.3 brokers cannot talk to 
2.4 search servers even though the Query class was not changed.

To me, this is a very valid use case. The breakage happened because two 
different people did the releases with different compilers.

At the risk of pissing off the Lucene powerhouse, I feel I have to express some 
candor. I am growing more and more frustrated with the lack of openness in this 
project and its unwillingness to work with the developer community. This is a 
rather trivial issue, and it has taken seven back-and-forths to reiterate 
standard Java behavior that has been around for years.

Lucene is a great project and has enjoyed great success, and I think it is in 
everyone's interest to make sure Lucene grows in a healthy environment.
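
The practice being asked for is a one-line declaration (a hypothetical sketch, not the Lucene source): pin serialVersionUID so that recompiling with a different compiler does not change the computed default suid.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical sketch: a serializable value class with a pinned suid.
class SearchTerm implements Serializable {
    // Without this field the JVM derives a default suid from class internals,
    // and two builds of the "same" class can become wire-incompatible.
    private static final long serialVersionUID = 1L;

    final String field;
    final String text;

    SearchTerm(String field, String text) {
        this.field = field;
        this.text = text;
    }

    // Serialize and deserialize, as a broker-to-searcher hop would.
    static SearchTerm roundTrip(SearchTerm t)
            throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(t);
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            return (SearchTerm) in.readObject();
        }
    }
}
```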



> Implement Externalizable in main top level searcher classes
> ---
>
> Key: LUCENE-1473
> URL: https://issues.apache.org/jira/browse/LUCENE-1473
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Search
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Priority: Minor
> Attachments: LUCENE-1473.patch
>
>
> To maintain serialization compatibility between Lucene versions, major 
> classes can implement Externalizable.  This will make Serialization faster 
> due to no reflection required and maintain backwards compatibility.  




[jira] Commented: (LUCENE-1473) Implement standard Serialization across Lucene versions

2008-12-04 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653378#action_12653378
 ] 

John Wang commented on LUCENE-1473:
---

Mike:

   If you have class A implementing Serializable, with a defined suid, say 1.

   Let A2 be a newer version of class A, with the suid unchanged, still 1.

Let's say A2 has a new field.

   Imagine A is running in VM1 and A2 is running in VM2. Serialization 
between VM1 and VM2 of class A is fine; A simply will not see the new field, 
which is acceptable since VM1 does not make use of it.

   You can argue that A2 will not get the needed field from a serialized A, 
but isn't that better than crashing?

In either case, I think that behavior is better than the current one. 
(Maybe that is why both Eclipse and FindBugs report the lack of a suid 
definition in the Lucene code as a warning.)

   I agree that implementing Externalizable is more work, but it would 
make the serialization story correct.

-John
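
A minimal sketch of the Externalizable idea (hypothetical class and field names, not Lucene code): the class owns its wire format, so a newer version can add fields without breaking older peers.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;

// Hypothetical sketch: an Externalizable class with an explicit wire version.
class VersionedQuery implements Externalizable {
    private static final long serialVersionUID = 1L; // fixed across releases
    private static final int WIRE_VERSION = 2;       // bumped when fields are added

    String term = "";
    float boost = 1.0f; // field added in wire version 2

    public VersionedQuery() {} // Externalizable requires a public no-arg constructor

    VersionedQuery(String term, float boost) {
        this.term = term;
        this.boost = boost;
    }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeInt(WIRE_VERSION);
        out.writeUTF(term);
        out.writeFloat(boost);
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException {
        int version = in.readInt();
        term = in.readUTF();
        // A version-1 writer never wrote a boost; fall back to the default.
        boost = version >= 2 ? in.readFloat() : 1.0f;
    }

    static VersionedQuery roundTrip(VersionedQuery q)
            throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(q);
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            return (VersionedQuery) in.readObject();
        }
    }
}
```

The extra work over plain Serializable is the explicit read/write pair, which is exactly what makes old-version/new-version interoperation deterministic.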


> Implement standard Serialization across Lucene versions
> ---
>
> Key: LUCENE-1473
> URL: https://issues.apache.org/jira/browse/LUCENE-1473
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Search
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Priority: Minor
> Attachments: LUCENE-1473.patch
>
>   Original Estimate: 8h
>  Remaining Estimate: 8h
>
> To maintain serialization compatibility between Lucene versions, 
> serialVersionUID needs to be added to classes that implement 
> java.io.Serializable.  java.io.Externalizable may be implemented in classes 
> for faster performance.




[jira] Commented: (LUCENE-1473) Implement standard Serialization across Lucene versions

2008-12-04 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653545#action_12653545
 ] 

John Wang commented on LUCENE-1473:
---

The discussion here is whether it is better to fail 100% of the time or 10% of 
the time. (These are just illustrative numbers.)
I do buy Doug's comment about getting into a weird state due to data 
serialization, but that is something Externalizable would solve.
This discussion has digressed into general Java serialization design, when it 
was originally scoped to only a few Lucene classes.

If it is documented that Lucene only supports serialization between classes 
from the same jar, is that really enough? Doesn't it also depend on the 
compiler, if someone builds their own jar?

Furthermore, in a distributed environment with many machines it is always a 
good idea to upgrade bit by bit. Is taking that ability away by imposing this 
restriction a good trade-off against just implementing Externalizable for a few 
classes, if Serializable is deemed dangerous? I am not so sure it is, 
given the Lucene classes we are talking about.

> Implement standard Serialization across Lucene versions
> ---
>
> Key: LUCENE-1473
> URL: https://issues.apache.org/jira/browse/LUCENE-1473
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Search
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Priority: Minor
> Attachments: LUCENE-1473.patch
>
>   Original Estimate: 8h
>  Remaining Estimate: 8h
>
> To maintain serialization compatibility between Lucene versions, 
> serialVersionUID needs to be added to classes that implement 
> java.io.Serializable.  java.io.Externalizable may be implemented in classes 
> for faster performance.




[jira] Commented: (LUCENE-1473) Implement standard Serialization across Lucene versions

2008-12-04 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653563#action_12653563
 ] 

John Wang commented on LUCENE-1473:
---

For our problem, it is Query and all its derived and encapsulated classes. I 
guess the title of the bug is too generic.

As for my comment about other Lucene classes, one can just go to the Lucene 
javadoc, click on "Tree", and look for Serializable. If you want me to, I can 
go and fetch the complete list, but here are some examples:

1) Document (Field etc.)
2) OpenBitSet, Filter ...
3) Sort, SortField
4) Term
5) TopDocs, Hits etc.

These are from the top-level API.



> Implement standard Serialization across Lucene versions
> ---
>
> Key: LUCENE-1473
> URL: https://issues.apache.org/jira/browse/LUCENE-1473
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Search
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Priority: Minor
> Attachments: LUCENE-1473.patch
>
>   Original Estimate: 8h
>  Remaining Estimate: 8h
>
> To maintain serialization compatibility between Lucene versions, 
> serialVersionUID needs to be added to classes that implement 
> java.io.Serializable.  java.io.Externalizable may be implemented in classes 
> for faster performance.




[jira] Commented: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl

2009-12-05 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786408#action_12786408
 ] 

John Wang commented on LUCENE-1526:
---

Yes, we still see the issue. In the performance/stress test, after 20+ minutes 
of running, latency spiked from 5 ms to 550 ms, and the file handle leakage was 
severe enough that the test crashed. This is the code:

http://code.google.com/p/zoie/source/browse/branches/BR_DELETE_OPT/java/proj/zoie/impl/indexing/luceneNRT/ThrottledLuceneNRTDataConsumer.java

Our logging indicates there are at most 3 index reader instances in the open 
state, yet the file handle count is very high.

> For near real-time search, use paged copy-on-write BitVector impl
> -
>
> Key: LUCENE-1526
> URL: https://issues.apache.org/jira/browse/LUCENE-1526
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Priority: Minor
> Attachments: LUCENE-1526.patch, LUCENE-1526.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader. 
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to by ANDed together as the current reader's
> delete docs. 
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader. 




[jira] Commented: (LUCENE-1613) TermEnum.docFreq() is not updated with there are deletes

2009-12-06 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786750#action_12786750
 ] 

John Wang commented on LUCENE-1613:
---

Maybe just add a javadoc comment on the call to explain the behavior in this 
case?

Often docFreq is called in a read-only context, and calling expungeDeletes in 
that context is not a good idea.

I agree it is not trivial to fix while keeping the performance. I don't mind 
closing the bug either.

> TermEnum.docFreq() is not updated with there are deletes
> 
>
> Key: LUCENE-1613
> URL: https://issues.apache.org/jira/browse/LUCENE-1613
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Search
>Affects Versions: 2.4
>Reporter: John Wang
> Attachments: TestDeleteAndDocFreq.java
>
>
> TermEnum.docFreq is used in many places, especially scoring. However, if 
> there are deletes in the index and it is not yet merged, this value is not 
> updated.
> Attached is a test case.




[jira] Commented: (LUCENE-2120) Possible file handle leak in near real-time reader

2009-12-14 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790506#action_12790506
 ] 

John Wang commented on LUCENE-2120:
---

Hi Michael:

bq: Why does Zoie even retain 3 readers? Why not keep only the current one?

One RAM reader for the batch currently accumulating events, one RAM reader 
covering the window while the disk reader indexes the previous batch, and one 
disk reader.

bq: It looks like the test uses both Wikipedia & Medline for document sources? 
Do I really need both?

By default it only runs with Medline data; you don't need both. 
perf/settings/index.properties->data.type dictates which to use: file->medline, 
wiki->wikipedia.

Also, you should use the branch BR_DELETE_OPT.

It has the optimization you suggested for handling deleted docs, i.e. not 
checking each hit candidate with the IntSetAccelerator.
I have also added a DataConsumer to handle delayed reopens for the NRT case; 
you see the file handle leakage quickly with it. See perf/conf/zoie.properties 
to turn on ThrottledLuceneNRTDataConsumer.

On my Mac, I use lsof to see the file handle count.

-John
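
Counting handles can be done like this on a POSIX system (a sketch; the pid assignment below is illustrative, and lsof availability varies by platform):

```shell
# For illustration, inspect this shell itself; substitute the searcher JVM's pid.
PID=$$

# Count open file descriptors with lsof, when it is installed (macOS/Linux).
command -v lsof >/dev/null && lsof -p "$PID" | wc -l

# On Linux, /proc exposes the same information without lsof.
ls "/proc/$PID/fd" | wc -l
```

Sampling this count every few seconds during the stress run makes a leak show up as a monotonically growing number even while the reader count stays flat.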

> Possible file handle leak in near real-time reader
> --
>
> Key: LUCENE-2120
> URL: https://issues.apache.org/jira/browse/LUCENE-2120
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 3.1
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 3.1
>
>
> Spinoff of LUCENE-1526: Jake/John hit file descriptor exhaustion when testing 
> NRT.
> I've tried to repro this, stress testing NRT, saturating reopens, indexing, 
> searching, but haven't found any issue.
> Let's try to get to the bottom of it, here...




[jira] Created: (LUCENE-2159) Tool to expand the index for perf/stress testing.

2009-12-15 Thread John Wang (JIRA)
Tool to expand the index for perf/stress testing.
-

 Key: LUCENE-2159
 URL: https://issues.apache.org/jira/browse/LUCENE-2159
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Affects Versions: 3.0
Reporter: John Wang


Sometimes it is useful to take a small-ish index and expand it into a large 
index with K segments for perf/stress testing. 

This tool does that. See attached class.




[jira] Updated: (LUCENE-2159) Tool to expand the index for perf/stress testing.

2009-12-15 Thread John Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Wang updated LUCENE-2159:
--

Attachment: ExpandIndex.java

I have put it under contrib/misc, in package org.apache.lucene.index

> Tool to expand the index for perf/stress testing.
> -
>
> Key: LUCENE-2159
> URL: https://issues.apache.org/jira/browse/LUCENE-2159
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Affects Versions: 3.0
>Reporter: John Wang
> Attachments: ExpandIndex.java
>
>
> Sometimes it is useful to take a small-ish index and expand it into a large 
> index with K segments for perf/stress testing. 
> This tool does that. See attached class.




[jira] Created: (LUCENE-2160) Tool to rename a field

2009-12-15 Thread John Wang (JIRA)
Tool to rename a field
--

 Key: LUCENE-2160
 URL: https://issues.apache.org/jira/browse/LUCENE-2160
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Affects Versions: 3.0
Reporter: John Wang


We found it useful to be able to rename a field.
It can save a lot of reindexing time/cost when being used in conjunction with 
ParallelReader to update partially a field.




[jira] Updated: (LUCENE-2160) Tool to rename a field

2009-12-15 Thread John Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Wang updated LUCENE-2160:
--

Attachment: RenameField.java

Part of the code was originally posted on Nabble, but has since been removed:
www.nabble.com/file/p15221929/fieldrename



> Tool to rename a field
> --
>
> Key: LUCENE-2160
> URL: https://issues.apache.org/jira/browse/LUCENE-2160
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Affects Versions: 3.0
>Reporter: John Wang
> Attachments: RenameField.java
>
>
> We found it useful to be able to rename a field.
> It can save a lot of reindexing time/cost when being used in conjunction with 
> ParallelReader to update partially a field.




[jira] Closed: (LUCENE-2007) Add DocsetQuery to turn a DocIdSet into a query

2009-12-15 Thread John Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Wang closed LUCENE-2007.
-

Resolution: Won't Fix

> Add DocsetQuery to turn a DocIdSet into a query
> ---
>
> Key: LUCENE-2007
> URL: https://issues.apache.org/jira/browse/LUCENE-2007
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: John Wang
> Attachments: LUCENE-2007-2.patch, LUCENE-2007.patch
>
>
> Added a class DocsetQuery that can be constructed from a DocIdSet.




[jira] Commented: (LUCENE-2160) Tool to rename a field

2009-12-15 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790797#action_12790797
 ] 

John Wang commented on LUCENE-2160:
---

Good point. But do you ever sort across fields?

> Tool to rename a field
> --
>
> Key: LUCENE-2160
> URL: https://issues.apache.org/jira/browse/LUCENE-2160
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Affects Versions: 3.0
>Reporter: John Wang
> Attachments: RenameField.java
>
>
> We found it useful to be able to rename a field.
> It can save a lot of reindexing time/cost when being used in conjunction with 
> ParallelReader to update partially a field.




[jira] Commented: (LUCENE-2160) Tool to rename a field

2009-12-15 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790809#action_12790809
 ] 

John Wang commented on LUCENE-2160:
---

Just did a test: 

You are right, IndexReader.terms(Term) would no longer find the renamed field 
if the field ordering is disturbed. If the order is preserved, it is ok; 
e.g., with fields "a","c","f", renaming "c" -> "d" would be ok.

Our use case, however, is this:

We messed up our data in, say, field "c". We rename it to "c_bak", create a 
parallel index with a single field named "c", and merge the indexes. "c_bak" is 
then never accessed.

Would this work?

> Tool to rename a field
> --
>
> Key: LUCENE-2160
> URL: https://issues.apache.org/jira/browse/LUCENE-2160
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Affects Versions: 3.0
>Reporter: John Wang
> Attachments: RenameField.java, RenameField.java
>
>
> We found it useful to be able to rename a field.
> It can save a lot of reindexing time/cost when being used in conjunction with 
> ParallelReader to update partially a field.




[jira] Updated: (LUCENE-2160) Tool to rename a field

2009-12-15 Thread John Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Wang updated LUCENE-2160:
--

Attachment: RenameField.java

Fixed a problem with cfs files.

> Tool to rename a field
> --
>
> Key: LUCENE-2160
> URL: https://issues.apache.org/jira/browse/LUCENE-2160
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Affects Versions: 3.0
>Reporter: John Wang
> Attachments: RenameField.java, RenameField.java
>
>
> We found it useful to be able to rename a field.
> It can save a lot of reindexing time/cost when being used in conjunction with 
> ParallelReader to update partially a field.




[jira] Commented: (LUCENE-2120) Possible file handle leak in near real-time reader

2009-12-15 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790985#action_12790985
 ] 

John Wang commented on LUCENE-2120:
---

bq. is this what the private static int MAX_READER_GENERATION = 3

No, I misunderstood you. This is just a number I think is safe for making sure 
you don't close a reader while it is being searched.

bq. you resolve the deleted docs in the BG

That is not really true. The deletions are stored in a DocIdSet, and a special 
TermDocs does the skipping to avoid deleted docs.

bq. It's sort of a "warm in the background" tradeoff, ie, give me my reader 
very quickly, even if the first searches against it must run a bit slower since 
they double check deletions, until the warming is done vs Lucene which 
forcefully "warms" (making reopen time longer) before returning the reader to 
you.

I am not sure I understand what you mean by "double check deletions". Warming 
is done in the background; in the foreground you search against RAM indexes 
that hold the transient indexing updates. So one guarantee is that any time a 
search is running, the disk index is warm, and you don't have to pay the cost 
of warming.


BTW, I have merged BR_DELETE_OPT down to trunk and did a release. Feel free to 
take it from trunk.



> Possible file handle leak in near real-time reader
> --
>
> Key: LUCENE-2120
> URL: https://issues.apache.org/jira/browse/LUCENE-2120
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 3.1
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 3.1
>
>
> Spinoff of LUCENE-1526: Jake/John hit file descriptor exhaustion when testing 
> NRT.
> I've tried to repro this, stress testing NRT, saturating reopens, indexing, 
> searching, but haven't found any issue.
> Let's try to get to the bottom of it, here...




[jira] Commented: (LUCENE-2160) Tool to rename a field

2009-12-15 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791046#action_12791046
 ] 

John Wang commented on LUCENE-2160:
---

Did some more digging around the field-ordering issue. Is it possible to change 
the FieldInfos file store so that the number in the byNumber ArrayList is 
updated along with the byName HashMap, and then rewrite the file? Or is the 
number already assumed to be in sort order by the tii file?

> Tool to rename a field
> --
>
> Key: LUCENE-2160
> URL: https://issues.apache.org/jira/browse/LUCENE-2160
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Affects Versions: 3.0
>Reporter: John Wang
> Attachments: RenameField.java, RenameField.java
>
>
> We found it useful to be able to rename a field.
> It can save a lot of reindexing time/cost when being used in conjunction with 
> ParallelReader to update partially a field.




[jira] Commented: (LUCENE-2160) Tool to rename a field

2009-12-15 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791051#action_12791051
 ] 

John Wang commented on LUCENE-2160:
---

Looking at the file format wiki more closely, I see that front coding applies 
to all the terms across all fields, so my comment above would not work.
Do you think it makes sense to have a tii/tis file for each indexed field? 
Would the new codec allow for it?

-John

> Tool to rename a field
> --
>
> Key: LUCENE-2160
> URL: https://issues.apache.org/jira/browse/LUCENE-2160
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Affects Versions: 3.0
>Reporter: John Wang
> Attachments: RenameField.java, RenameField.java
>
>
> We found it useful to be able to rename a field.
> It can save a lot of reindexing time/cost when being used in conjunction with 
> ParallelReader to update partially a field.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2120) Possible file handle leak in near real-time reader

2009-12-22 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793803#action_12793803
 ] 

John Wang commented on LUCENE-2120:
---

Yes, we have done perf tests.
We see no indexing throughput improvement; query throughput improved by 40%.

> Possible file handle leak in near real-time reader
> --
>
> Key: LUCENE-2120
> URL: https://issues.apache.org/jira/browse/LUCENE-2120
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 3.1
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 3.1
>
>
> Spinoff of LUCENE-1526: Jake/John hit file descriptor exhaustion when testing 
> NRT.
> I've tried to repro this, stress testing NRT, saturating reopens, indexing, 
> searching, but haven't found any issue.
> Let's try to get to the bottom of it, here...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2120) Possible file handle leak in near real-time reader

2009-12-26 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794617#action_12794617
 ] 

John Wang commented on LUCENE-2120:
---

Michael:

I wrote a little test to measure and to understand the perf:

I compared these two methods (the first checks deletes by random access into a 
BitVector, the second iterates a sorted deleted-doc list):

static long testBits(BitVector bv, int numhits) throws Exception {
  long start = System.nanoTime();
  for (int i = 0; i < numhits; i++) {
    bv.get(i);
  }
  long end = System.nanoTime();
  return (end - start);
}

static long testSkip(DocIdSetIterator delIter, int numhits) throws Exception {
  long start = System.nanoTime();
  int nextDelDoc = delIter.nextDoc();
  for (int i = 0; i < numhits; i++) {
    if (i >= nextDelDoc) {
      if (i == nextDelDoc) {
        // doc i is deleted
      }
      nextDelDoc = delIter.advance(i);
    }
  }
  long end = System.nanoTime();
  return (end - start);
}


I stripped everything down to the bare bones to understand the perf implications.

Here are the results on my MacBook Pro (times in nanoseconds), for each numHits and delete count:

5M 500: 
bits: 42417850
skip: 15234650

5M 100:
bits: 43053650
skip: 15268850

5M 10k:
bits: 41694350
skip: 17504900

5M 100k:
bits: 41737350
skip: 42966000

1M 1000:
bits: 8722700
skip: 3249100

1M 10k:
bits: 8210650
skip: 6119700

1M 25k:
bits: 8558150
skip: 9477850

You can see that BitVector starts to win once the delete density reaches about 2%, 
and this is pretty consistent across the different numHits parameters.

In real-life scenarios we see numDeletes being very small. However, it would be 
a great improvement if we could choose between the two approaches depending on the result set.
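The "choose depending on the result set" idea can be sketched as a simple density check. The ~2% crossover threshold below is taken from the numbers above and is an assumption that would need tuning per deployment; the class and method names are illustrative, not Lucene API:

```java
class DeleteCheckChooser {
    // Crossover density observed in the benchmark above; tunable assumption.
    static final double CROSSOVER = 0.02;

    // true  -> use the random-access BitVector check (dense deletes)
    // false -> use the sorted-array skip/advance iteration (sparse deletes)
    static boolean useBitVector(int numDeletes, int maxDoc) {
        return numDeletes > CROSSOVER * maxDoc;
    }
}
```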

> Possible file handle leak in near real-time reader
> --
>
> Key: LUCENE-2120
> URL: https://issues.apache.org/jira/browse/LUCENE-2120
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 3.1
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 3.1
>
>
> Spinoff of LUCENE-1526: Jake/John hit file descriptor exhaustion when testing 
> NRT.
> I've tried to repro this, stress testing NRT, saturating reopens, indexing, 
> searching, but haven't found any issue.
> Let's try to get to the bottom of it, here...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2120) Possible file handle leak in near real-time reader

2009-12-26 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794618#action_12794618
 ] 

John Wang commented on LUCENE-2120:
---

I realized that in my ArrayDocIdSet.skip I was always doing the binary search 
between index 0 and array.length-1; after optimizing this, the crossover point 
is at 4% instead of 2%.

One important distinction is memory consumption: the skip algorithm is much 
more compact when the deleted set is sparse compared to the corpus.
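A sketch of that optimization: keep a cursor so each advance() binary-searches only the remaining suffix of the sorted doc-id array instead of [0, length-1] every time. The class and names here are illustrative, not the actual ArrayDocIdSet code:

```java
class SortedIntCursor {
    private final int[] docs; // sorted ascending, e.g. deleted doc ids
    private int cursor = 0;   // lower bound for the next binary search

    SortedIntCursor(int[] docs) { this.docs = docs; }

    // Returns the smallest doc >= target, or Integer.MAX_VALUE when exhausted.
    int advance(int target) {
        int lo = cursor, hi = docs.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (docs[mid] < target) lo = mid + 1;
            else hi = mid - 1;
        }
        cursor = lo; // next call searches only the remaining suffix
        return lo < docs.length ? docs[lo] : Integer.MAX_VALUE;
    }
}
```

Since callers advance with non-decreasing targets, shrinking the search range this way costs nothing and trims the log factor as the iteration proceeds.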

> Possible file handle leak in near real-time reader
> --
>
> Key: LUCENE-2120
> URL: https://issues.apache.org/jira/browse/LUCENE-2120
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 3.1
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 3.1
>
>
> Spinoff of LUCENE-1526: Jake/John hit file descriptor exhaustion when testing 
> NRT.
> I've tried to repro this, stress testing NRT, saturating reopens, indexing, 
> searching, but haven't found any issue.
> Let's try to get to the bottom of it, here...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2120) Possible file handle leak in near real-time reader

2009-12-28 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794998#action_12794998
 ] 

John Wang commented on LUCENE-2120:
---

Hi Michael:

 You are absolutely right! Adding an integer increment in the loop made a 
difference. On my laptop, I see the break-even point at 1.5%.

 For what we are using it for in Zoie, the trade-off is worthwhile, 
because we rely on Lucene to do the delete check and use this iteration only for 
skipping over transient deletes, i.e. the ones that have not yet made it to the 
index. Normally they are << corpus size, e.g. 100 - 1k out of 5M docs. The 
memory cost for this is also very small in comparison due to the sparsity of 
the deleted docset.

-John 

> Possible file handle leak in near real-time reader
> --
>
> Key: LUCENE-2120
> URL: https://issues.apache.org/jira/browse/LUCENE-2120
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 3.1
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 3.1
>
>
> Spinoff of LUCENE-1526: Jake/John hit file descriptor exhaustion when testing 
> NRT.
> I've tried to repro this, stress testing NRT, saturating reopens, indexing, 
> searching, but haven't found any issue.
> Let's try to get to the bottom of it, here...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-2252) stored field retrieve slow

2010-02-06 Thread John Wang (JIRA)
stored field retrieve slow
--

 Key: LUCENE-2252
 URL: https://issues.apache.org/jira/browse/LUCENE-2252
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Store
Affects Versions: 3.0
Reporter: John Wang


IndexReader.document() on a stored field is rather slow. I did a simple 
multi-threaded test and profiled it:

40+% of the time is spent getting the offset from the index file
30+% of the time is spent reading the count (i.e. the number of fields to load)

Granted, I ran it on my laptop where the disk isn't that great, but there still 
seems to be much room for improvement, e.g. loading the field index file into 
memory (for a 5M-doc index the extra memory footprint is 20MB, peanuts compared 
to other stuff being loaded).

On a related note, are there plans to have custom segments as part of the 
flexible indexing feature?
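A minimal sketch of the "load the field index file into memory" idea: read the per-doc file pointers once into an array, so each document fetch costs an in-memory array lookup instead of a file seek. The stream layout below (one long per doc) is an assumption for illustration, not Lucene's actual .fdx reading code:

```java
import java.io.DataInputStream;
import java.io.IOException;

class OffsetCache {
    // Reads numDocs file pointers (one long per doc) into memory up front.
    static long[] loadOffsets(DataInputStream in, int numDocs) throws IOException {
        long[] offsets = new long[numDocs];
        for (int i = 0; i < numDocs; i++) {
            offsets[i] = in.readLong();
        }
        return offsets;
    }
}
```

With the array resident, resolving where doc N's stored fields start is `offsets[N]`, removing the per-call seek that the profile above attributes 40+% of the time to.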

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2252) stored field retrieve slow

2010-02-06 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830599#action_12830599
 ] 

John Wang commented on LUCENE-2252:
---

Thanks Uwe for the pointer. Will check that out!

Robert, we can get away with 4 bytes per doc, assuming we are not storing 2GB of 
data per doc. This memory is less than the data structure that has to be held in 
memory for just one field cache entry for sorting. I understand it is always 
better to use less memory, but sometimes we do have to make trade-off decisions.
But you are right, different applications have different needs/requirements, so 
having support for custom segments would be a good thing, e.g. LUCENE-1914.

> stored field retrieve slow
> --
>
> Key: LUCENE-2252
> URL: https://issues.apache.org/jira/browse/LUCENE-2252
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Store
>Affects Versions: 3.0
>Reporter: John Wang
>
> IndexReader.document() on a stored field is rather slow. Did a simple 
> multi-threaded test and profiled it:
> 40+% time is spent in getting the offset from the index file
> 30+% time is spent in reading the count (e.g. number of fields to load)
> Although I ran it on my lap top where the disk isn't that great, but still 
> seems to be much room in improvement, e.g. load field index file into memory 
> (for a 5M doc index, the extra memory footprint is 20MB, peanuts comparing to 
> other stuff being loaded)
> A related note, are there plans to have custom segments as part of flexible 
> indexing feature?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2252) stored field retrieve slow

2010-02-06 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830627#action_12830627
 ] 

John Wang commented on LUCENE-2252:
---

bq. I do not understand, I think the fdx index is the raw offset into fdt for 
some doc, and must remain a long if you have more than 2GB total across all 
docs.

As stated earlier, assuming we are not storing 2GB of data per doc, you don't 
need to keep a long per doc. There are many ways of representing this without 
paying much of a performance penalty. Off the top of my head, this would work:

Since file positions are always positive, you can use the first bit to indicate 
that MAX_INT has been exceeded; if so, add MAX_INT to the masked bits. You get 
away with an int per doc.

I am sure there are tons of other neat tricks for this that the Mikes or Yonik 
can come up with :)
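The first-bit trick might look like the sketch below, under the comment's own assumption that offsets stay below 2 * Integer.MAX_VALUE (the class and method names are hypothetical):

```java
class PackedOffset {
    private static final int FLAG = 0x80000000; // top bit: MAX_INT exceeded

    // Packs a non-negative offset < 2 * Integer.MAX_VALUE into 4 bytes.
    static int pack(long offset) {
        if (offset < 0 || offset >= 2L * Integer.MAX_VALUE) {
            throw new IllegalArgumentException("offset out of range: " + offset);
        }
        if (offset <= Integer.MAX_VALUE) {
            return (int) offset;
        }
        // Store the overflow past MAX_INT in the low 31 bits, flag the top bit.
        return (int) (offset - Integer.MAX_VALUE) | FLAG;
    }

    static long unpack(int packed) {
        if ((packed & FLAG) != 0) {
            return (long) (packed & ~FLAG) + Integer.MAX_VALUE;
        }
        return packed;
    }
}
```

This halves the per-doc cost of an in-memory offset array (int[] instead of long[]) at the price of one branch and a mask on each lookup.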

bq. John, do you have a specific use case where this is the bottleneck, or are 
you just looking for places to optimize in general?

Hi Yonik, I understand this may not be a common use case. I am trying to use 
Lucene as a store solution, e.g. supporting just get()/put() operations as a 
content store. We wrote something simple in house and I compared it against 
Lucene, and the difference was dramatic. So after profiling, it just seems this 
is an area with lots of room for improvement (posted earlier).

Reasons:
1) Our current setup is that the content is stored outside of the search 
cluster. Being able to fetch the data for rendering/highlighting within our 
search cluster would be good.
2) If the index contains the original data, changing the indexing schema, e.g. 
reindexing, can be done within each partition/node. Getting data from our 
authoritative datastore is expensive.

Perhaps LUCENE-1912 is the right way to go rather than "fixing" stored fields. 
If you also agree, I can just dup it over.

Thanks

-John


> stored field retrieve slow
> --
>
> Key: LUCENE-2252
> URL: https://issues.apache.org/jira/browse/LUCENE-2252
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Store
>Affects Versions: 3.0
>Reporter: John Wang
>
> IndexReader.document() on a stored field is rather slow. Did a simple 
> multi-threaded test and profiled it:
> 40+% time is spent in getting the offset from the index file
> 30+% time is spent in reading the count (e.g. number of fields to load)
> Although I ran it on my lap top where the disk isn't that great, but still 
> seems to be much room in improvement, e.g. load field index file into memory 
> (for a 5M doc index, the extra memory footprint is 20MB, peanuts comparing to 
> other stuff being loaded)
> A related note, are there plans to have custom segments as part of flexible 
> indexing feature?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2252) stored field retrieve slow

2010-02-06 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830628#action_12830628
 ] 

John Wang commented on LUCENE-2252:
---

Sorry, I meant LUCENE-1914

> stored field retrieve slow
> --
>
> Key: LUCENE-2252
> URL: https://issues.apache.org/jira/browse/LUCENE-2252
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Store
>Affects Versions: 3.0
>Reporter: John Wang
>
> IndexReader.document() on a stored field is rather slow. Did a simple 
> multi-threaded test and profiled it:
> 40+% time is spent in getting the offset from the index file
> 30+% time is spent in reading the count (e.g. number of fields to load)
> Although I ran it on my lap top where the disk isn't that great, but still 
> seems to be much room in improvement, e.g. load field index file into memory 
> (for a 5M doc index, the extra memory footprint is 20MB, peanuts comparing to 
> other stuff being loaded)
> A related note, are there plans to have custom segments as part of flexible 
> indexing feature?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2252) stored field retrieve slow

2010-02-06 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830641#action_12830641
 ] 

John Wang commented on LUCENE-2252:
---

bq. I still think 4 bytes/doc is too much (its too much wasted ram for 
virtually no gain)

That depends on the application. On modern machines (at least the machines we 
are using, e.g. a MacBook Pro) we can afford it :) I am not sure I agree with 
"virtually no gain" if you look at the numbers I posted; IMHO, the gain is 
significant.

I hate to get into a subjective argument on this though.

bq. I dont understand why you need something like a custom segment file to do 
this, why cant you just simply use Directory to load this particular file into 
memory for your use case?

Having a custom segment allows me to avoid getting into this subjective 
argument about what is too much memory or how big the gain is, since that just 
depends on my application, right?

Furthermore, for the question at hand, even if we do use the Directory 
implementation Uwe suggested, it is not optimal. For my use case, the cost of 
the seek/read for the count on the data file is very wasteful. Also, even for 
getting the position, I can do a random access into an array compared to an 
in-memory seek/read/parse.

The very simple store mechanism we have written outside of Lucene has a gain of 
>85x, yes, 8500%, over Lucene stored fields. We would, however, like to take 
advantage of some of the good stuff already in Lucene, e.g. the merge 
mechanism (which is very nicely done), delete handling, etc.


> stored field retrieve slow
> --
>
> Key: LUCENE-2252
> URL: https://issues.apache.org/jira/browse/LUCENE-2252
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Store
>Affects Versions: 3.0
>Reporter: John Wang
>
> IndexReader.document() on a stored field is rather slow. Did a simple 
> multi-threaded test and profiled it:
> 40+% time is spent in getting the offset from the index file
> 30+% time is spent in reading the count (e.g. number of fields to load)
> Although I ran it on my lap top where the disk isn't that great, but still 
> seems to be much room in improvement, e.g. load field index file into memory 
> (for a 5M doc index, the extra memory footprint is 20MB, peanuts comparing to 
> other stuff being loaded)
> A related note, are there plans to have custom segments as part of flexible 
> indexing feature?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1574) PooledSegmentReader, pools SegmentReader underlying byte arrays

2009-08-01 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12737950#action_12737950
 ] 

John Wang commented on LUCENE-1574:
---

Re: Zoie and deleted docs:
That is no longer true; Zoie is now using a Bloom filter over an int hash set 
from fastutil, for exactly the perf reason Jason pointed out.

> PooledSegmentReader, pools SegmentReader underlying byte arrays
> ---
>
> Key: LUCENE-1574
> URL: https://issues.apache.org/jira/browse/LUCENE-1574
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/*
>Affects Versions: 2.4.1
>Reporter: Jason Rutherglen
>Priority: Minor
> Fix For: 3.1
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> PooledSegmentReader pools the underlying byte arrays of deleted docs and 
> norms for realtime search.  It is designed for use with IndexReader.clone 
> which can create many copies of byte arrays, which are of the same length for 
> a given segment.  When pooled they can be reused which could save on memory.  
> Do we want to benchmark the memory usage comparison of PooledSegmentReader vs 
> GC?  Many times GC is enough for these smaller objects.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-1819) MatchAllDocsQuery.toString(String field) does not honor the javadoc contract

2009-08-17 Thread John Wang (JIRA)
MatchAllDocsQuery.toString(String field) does not honor the javadoc contract


 Key: LUCENE-1819
 URL: https://issues.apache.org/jira/browse/LUCENE-1819
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 2.4.1
Reporter: John Wang


Should be 

public String toString(String field){
  return "*:*";
}

QueryParser needs to be able to parse the String form of this query.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1819) MatchAllDocsQuery.toString(String field) does not honor the javadoc contract

2009-08-18 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744530#action_12744530
 ] 

John Wang commented on LUCENE-1819:
---

Thanks, Mark, for taking care of this issue!
w.r.t. this class, funny coincidence :)

What are your thoughts on QueryParser being able to know about custom Query 
implementations? E.g. if I were to write a MyQuery class and implement the 
toString method a certain way, how would QueryParser know about MyQuery? Is it 
possible to extend QueryParser?

> MatchAllDocsQuery.toString(String field) does not honor the javadoc contract
> 
>
> Key: LUCENE-1819
> URL: https://issues.apache.org/jira/browse/LUCENE-1819
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Search
>Affects Versions: 2.4.1
>Reporter: John Wang
>Assignee: Mark Miller
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1819.patch
>
>
> Should be 
> public String toString(String field){
>   return "*:*";
> }
> QueryParser needs to be able to parse the String form of this query.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-1914) allow for custom segment files

2009-09-17 Thread John Wang (JIRA)
allow for custom segment files
--

 Key: LUCENE-1914
 URL: https://issues.apache.org/jira/browse/LUCENE-1914
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.9
Reporter: John Wang


Create a plugin framework where one can provide some sort of callback to write to 
a custom segment file, given a doc, and provide some sort of merge logic. 
This is in light of the flexible indexing effort.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-1922) exposing the ability to get the number of unique term count per field

2009-09-21 Thread John Wang (JIRA)
exposing the ability to get the number of unique term count per field
-

 Key: LUCENE-1922
 URL: https://issues.apache.org/jira/browse/LUCENE-1922
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Affects Versions: 2.4.1
Reporter: John Wang


Add an API to get the unique term count for a given field, e.g.:

IndexReader.getUniqueTermCount(String field)

This issue has a dependency on LUCENE-1458

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-09-23 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12758968#action_12758968
 ] 

John Wang commented on LUCENE-1458:
---

This is awesome!
Feel free to take code from Kamikaze for the P4Delta stuff.
The impl in Kamikaze assumes no decompression at load time, i.e. the DocIdSet 
can be traversed in compressed form.

> Further steps towards flexible indexing
> ---
>
> Key: LUCENE-1458
> URL: https://issues.apache.org/jira/browse/LUCENE-1458
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.9
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2
>
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback.  All tests pass, though back compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (eg call TermPostions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package private APIs on that branch, then fix nightly build to use the
> tip of that branch?]
> There's still plenty to do before this is committable! This is a
> rather large change:
>   * Switches to a new more efficient terms dict format.  This still
> uses tii/tis files, but the tii only stores term & long offset
> (not a TermInfo).  At seek points, tis encodes term & freq/prox
> offsets absolutely instead of with deltas.  Also, tis/tii
> are structured by field, so we don't have to record field number
> in every term.
> .
> On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
> -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
> RAM usage when loading terms dict index is significantly less
> since we only load an array of offsets and an array of String (no
> more TermInfo array).  It should be faster to init too.
> .
> This part is basically done.
>   * Introduces modular reader codec that strongly decouples terms dict
> from docs/positions readers.  EG there is no more TermInfo used
> when reading the new format.
> .
> There's nice symmetry now between reading & writing in the codec
> chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
> This part is basically done.
>   * Introduces a new "flex" API for iterating through the fields,
> terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
> This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
> old API on top of the new API to keep back-compat.
> 
> Next steps:
>   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
> fix any hidden assumptions.
>   * Expose new API out of IndexReader, deprecate old API but emulate
> old API on top of new one, switch all core/contrib users to the
> new API.
>   * Maybe switch to AttributeSources as the base class for TermsEnum,
> DocsEnum, PostingsEnum -- this would give readers API flexibility
> (not just index-file-format flexibility).  EG if someone wanted
> to store payload at the term-doc level instead of
> term-doc-position level, you could just add a new attribute.
>   * Test performance & iterate.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-1924) BalancedSegmentMergePolicy, contributed from the Zoie project for realtime indexing

2009-09-23 Thread John Wang (JIRA)
BalancedSegmentMergePolicy, contributed from the Zoie project for realtime 
indexing
---

 Key: LUCENE-1924
 URL: https://issues.apache.org/jira/browse/LUCENE-1924
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.9
Reporter: John Wang
 Attachments: BalancedSegmentMergePolicy.java

Written by Yasuhiro Matsuda for the Zoie realtime indexing system; it handles 
high update rates while avoiding large segment merges.
Detailed write-up is at:

http://code.google.com/p/zoie/wiki/ZoieMergePolicy


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1924) BalancedSegmentMergePolicy, contributed from the Zoie project for realtime indexing

2009-09-23 Thread John Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Wang updated LUCENE-1924:
--

Attachment: BalancedSegmentMergePolicy.java

this is a stand-alone class

> BalancedSegmentMergePolicy, contributed from the Zoie project for realtime 
> indexing
> ---
>
> Key: LUCENE-1924
> URL: https://issues.apache.org/jira/browse/LUCENE-1924
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.9
>Reporter: John Wang
> Attachments: BalancedSegmentMergePolicy.java
>
>
> Written by Yasuhiro Matsuda for Zoie realtime indexing system used to handle 
> high update rates to avoid large segment merges.
> Detailed write-up is at:
> http://code.google.com/p/zoie/wiki/ZoieMergePolicy

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-1925) In IndexSearcher class, make subReader and docCount arrays protected so sub classes can access them

2009-09-23 Thread John Wang (JIRA)
In IndexSearcher class, make subReader and docCount arrays protected so sub 
classes can access them
---

 Key: LUCENE-1925
 URL: https://issues.apache.org/jira/browse/LUCENE-1925
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.9
Reporter: John Wang


Please make these two member variables protected so subclasses can access them, 
e.g.:

  protected IndexReader[] subReaders;
  protected int[] docStarts;

Thanks

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1924) BalancedSegmentMergePolicy, contributed from the Zoie project for realtime indexing

2009-09-24 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12759108#action_12759108
 ] 

John Wang commented on LUCENE-1924:
---

I had put them in the core package under org.apache.lucene.index.

Because it requires access to package-protected classes, e.g. SegmentInfo, it 
needs to be in that package.

In terms of which module, that's totally up to you, as you would know the best 
place to put it.

A question on the side: MergePolicy is something the API suggests is 
customizable, yet SegmentInfo, part of its signature, is package protected. 
Should this be opened up to allow for full customization of MergePolicy, 
amongst other things?

Thanks

-John

> BalancedSegmentMergePolicy, contributed from the Zoie project for realtime 
> indexing
> ---
>
> Key: LUCENE-1924
> URL: https://issues.apache.org/jira/browse/LUCENE-1924
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.9
>Reporter: John Wang
>Assignee: Michael McCandless
> Attachments: BalancedSegmentMergePolicy.java
>
>
> Written by Yasuhiro Matsuda for Zoie realtime indexing system used to handle 
> high update rates to avoid large segment merges.
> Detailed write-up is at:
> http://code.google.com/p/zoie/wiki/ZoieMergePolicy

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-09-24 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12759110#action_12759110
 ] 

John Wang commented on LUCENE-1458:
---

Hi Mike:

 We have been using Kamikaze in our social graph engine in addition to our 
search system. A person's network can be rather large, and decompressing it in 
memory for some network operations is not feasible for us, hence we made it a 
requirement that the DocIdSetIterator be able to walk the DocIdSet's P4Delta 
implementation in compressed form.

 We do not decode the P4Delta set and make a second pass for boolean set 
operations; we cannot afford that in either memory cost or latency. The P4Delta 
set adheres to the DocIdSet/Iterator API, and the And/Or/Not is performed at 
that level of abstraction using the next() and skipTo() methods.
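A minimal sketch (hypothetical, not Kamikaze's actual code) of how an AND can be performed purely at the iterator level with next()/skipTo(), so the underlying sets never need full decompression:

```java
// Hypothetical sketch of AND over DocIdSetIterator-style cursors using
// only next()/skipTo(); the backing sets could stay compressed.
import java.util.ArrayList;
import java.util.List;

public class AndIteration {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    // Simple iterator over a sorted int array, standing in for a compressed set.
    static class IntArrayIterator {
        private final int[] docs;
        private int pos = -1;

        IntArrayIterator(int[] docs) { this.docs = docs; }

        int next() {
            pos++;
            return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
        }

        // Advance to the first doc >= target (the "skipTo" operation).
        int skipTo(int target) {
            while (true) {
                int d = next();
                if (d >= target) return d;
            }
        }
    }

    // Leapfrog intersection: repeatedly skip the lagging iterator to the leader.
    static List<Integer> and(IntArrayIterator a, IntArrayIterator b) {
        List<Integer> out = new ArrayList<>();
        int da = a.next(), db = b.next();
        while (da != NO_MORE_DOCS && db != NO_MORE_DOCS) {
            if (da == db) { out.add(da); da = a.next(); db = b.next(); }
            else if (da < db) da = a.skipTo(db);
            else db = b.skipTo(da);
        }
        return out;
    }

    public static void main(String[] args) {
        IntArrayIterator a = new IntArrayIterator(new int[] {1, 3, 7, 9, 12});
        IntArrayIterator b = new IntArrayIterator(new int[] {3, 5, 9, 12, 20});
        System.out.println(and(a, b)); // [3, 9, 12]
    }
}
```

Or/Not work the same way at this level of abstraction, advancing each cursor only as far as needed.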

-John


> Further steps towards flexible indexing
> ---
>
> Key: LUCENE-1458
> URL: https://issues.apache.org/jira/browse/LUCENE-1458
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.9
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2
>
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback.  All tests pass, though back compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (eg call TermPositions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package private APIs on that branch, then fix nightly build to use the
> tip of that branch?]
> There's still plenty to do before this is committable! This is a
> rather large change:
>   * Switches to a new more efficient terms dict format.  This still
> uses tii/tis files, but the tii only stores term & long offset
> (not a TermInfo).  At seek points, tis encodes term & freq/prox
> offsets absolutely instead of with deltas.  Also, tis/tii
> are structured by field, so we don't have to record field number
> in every term.
> .
> On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
> -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
> RAM usage when loading terms dict index is significantly less
> since we only load an array of offsets and an array of String (no
> more TermInfo array).  It should be faster to init too.
> .
> This part is basically done.
>   * Introduces modular reader codec that strongly decouples terms dict
> from docs/positions readers.  EG there is no more TermInfo used
> when reading the new format.
> .
> There's nice symmetry now between reading & writing in the codec
> chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
> This part is basically done.
>   * Introduces a new "flex" API for iterating through the fields,
> terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
> This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
> old API on top of the new API to keep back-compat.
> 
> Next steps:
>   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
> fix any hidden assumptions.
>   * Expose new API out of IndexReader, deprecate old API but emulate
> old API on top of new one, switch all core/contrib users to the
> new API.
>   * Maybe switch to AttributeSources as the base class for TermsEnum,
> DocsEnum, PostingsEnum -- this would give readers API flexibility
> (not just index-file-format flexibility).  EG if someone wanted
> to store payload at the term-doc level instead of
> term-doc-position level, you could just add a new attribute.
>   * Test performance & iterate.
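The terms-dict trick described in the quoted issue (absolute values at seek points, deltas in between) can be sketched, with hypothetical names and a toy interval, as:

```java
// Hypothetical sketch of encoding a sorted sequence of file offsets:
// every INTERVAL-th entry (a seek point) is stored absolutely, the rest
// as deltas, so a reader can seek without scanning from the beginning.
import java.util.ArrayList;
import java.util.List;

public class SeekPointEncoding {
    static final int INTERVAL = 4; // every 4th entry is stored absolutely

    static List<Long> encode(long[] offsets) {
        List<Long> out = new ArrayList<>();
        for (int i = 0; i < offsets.length; i++) {
            if (i % INTERVAL == 0) out.add(offsets[i]);     // seek point: absolute
            else out.add(offsets[i] - offsets[i - 1]);      // otherwise: delta
        }
        return out;
    }

    // Decode entry i starting from the nearest seek point at or before it.
    static long decodeAt(List<Long> encoded, int i) {
        int seek = (i / INTERVAL) * INTERVAL;
        long value = encoded.get(seek);                     // absolute at seek point
        for (int j = seek + 1; j <= i; j++) value += encoded.get(j);
        return value;
    }

    public static void main(String[] args) {
        long[] offsets = {0, 10, 25, 27, 40, 52, 60, 61, 90};
        List<Long> enc = encode(offsets);
        System.out.println(enc);              // [0, 10, 15, 2, 40, 12, 8, 1, 90]
        System.out.println(decodeAt(enc, 7)); // 61
    }
}
```

The deltas stay small (and compress well), while the periodic absolute values bound how far a decode has to scan after a seek.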




[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-09-24 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12759112#action_12759112
 ] 

John Wang commented on LUCENE-1458:
---

Just an FYI: Kamikaze was originally started as our sandbox for Lucene 
contributions until 2.4 was ready. (We needed the DocIdSet/Iterator abstraction 
that was migrated from Solr.)

It has three components:

1) P4Delta
2) Logical boolean operations on DocIdSets/Iterators (I created a JIRA ticket 
and a patch for Lucene a while ago with performance numbers; it is 
significantly faster than DisjunctionScorer)
3) An algorithm to determine which DocIdSet implementation to use given some 
parameters, e.g. minId, maxId, id count, etc. It learns and adjusts from the 
application's behavior if not all parameters are given.

So please feel free to incorporate anything you see fit, or move it to contrib.
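Component 3 — choosing a DocIdSet representation from parameters like min/max id and count — might be sketched with a simple density heuristic. The thresholds and names below are hypothetical, not Kamikaze's actual algorithm:

```java
// Hypothetical density heuristic for picking a doc-id set representation:
// dense ranges favor a bitset, sparse ones a plain sorted int array, and
// mid-density ranges a compressed (e.g. P4Delta-style) encoding.
public class DocIdSetChooser {
    enum Representation { BITSET, COMPRESSED, INT_ARRAY }

    static Representation choose(int minId, int maxId, int count) {
        int range = maxId - minId + 1;
        double density = (double) count / range;
        if (density > 0.25) return Representation.BITSET;     // ~1 bit/doc beats 32 bits/id
        if (density > 0.01) return Representation.COMPRESSED; // small deltas compress well
        return Representation.INT_ARRAY;                      // very sparse: plain ids
    }

    public static void main(String[] args) {
        System.out.println(choose(0, 999, 500));        // BITSET
        System.out.println(choose(0, 999999, 50000));   // COMPRESSED
        System.out.println(choose(0, 999999, 100));     // INT_ARRAY
    }
}
```

A learning variant would adjust these thresholds from observed access patterns when the parameters are not all known up front.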


> Further steps towards flexible indexing
> ---
>
> Key: LUCENE-1458
> URL: https://issues.apache.org/jira/browse/LUCENE-1458
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.9
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2
>
>




[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-09-24 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12759116#action_12759116
 ] 

John Wang commented on LUCENE-1458:
---

Hi Uwe:

 Thanks for the pointer to the isCacheable method. We will definitely 
incorporate it.

-John


> Further steps towards flexible indexing
> ---
>
> Key: LUCENE-1458
> URL: https://issues.apache.org/jira/browse/LUCENE-1458
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.9
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2
>
>




[jira] Commented: (LUCENE-1924) BalancedSegmentMergePolicy, contributed from the Zoie project for realtime indexing

2009-09-24 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12759132#action_12759132
 ] 

John Wang commented on LUCENE-1924:
---

Awesome, didn't realize!

Thanks

-John

> BalancedSegmentMergePolicy, contributed from the Zoie project for realtime 
> indexing
> ---
>
> Key: LUCENE-1924
> URL: https://issues.apache.org/jira/browse/LUCENE-1924
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.9
>Reporter: John Wang
>Assignee: Michael McCandless
> Attachments: BalancedSegmentMergePolicy.java
>
>




[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-09-26 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12759910#action_12759910
 ] 

John Wang commented on LUCENE-1458:
---

Hi Mike:

 Truly awesome work!

 Quick question: are codecs per index or per field? From the wiki, they seem 
to be per index; if so, is it possible to make them per field?

Thanks

-John
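A per-field codec arrangement like the one asked about could, in principle, be a simple field-to-codec mapping with a default fallback. This is a hypothetical sketch, not the flex-indexing API:

```java
// Hypothetical per-field codec registry: fields not explicitly mapped
// fall back to a default codec.
import java.util.HashMap;
import java.util.Map;

public class PerFieldCodecs {
    interface Codec { String name(); }

    static class NamedCodec implements Codec {
        private final String name;
        NamedCodec(String name) { this.name = name; }
        public String name() { return name; }
    }

    private final Map<String, Codec> perField = new HashMap<>();
    private final Codec defaultCodec;

    PerFieldCodecs(Codec defaultCodec) { this.defaultCodec = defaultCodec; }

    void register(String field, Codec codec) { perField.put(field, codec); }

    Codec codecFor(String field) { return perField.getOrDefault(field, defaultCodec); }

    public static void main(String[] args) {
        PerFieldCodecs codecs = new PerFieldCodecs(new NamedCodec("standard"));
        codecs.register("id", new NamedCodec("pulsing"));
        System.out.println(codecs.codecFor("id").name());   // pulsing
        System.out.println(codecs.codecFor("body").name()); // standard
    }
}
```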

> Further steps towards flexible indexing
> ---
>
> Key: LUCENE-1458
> URL: https://issues.apache.org/jira/browse/LUCENE-1458
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.9
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2
>
>




[jira] Created: (LUCENE-1931) no hits query - query object that returns no hits

2009-09-28 Thread John Wang (JIRA)
no hits query - query object that returns no hits
-

 Key: LUCENE-1931
 URL: https://issues.apache.org/jira/browse/LUCENE-1931
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Search
Affects Versions: 2.9
Reporter: John Wang
 Attachments: nohitsquery.patch

Query implementation that returns no hits.




[jira] Updated: (LUCENE-1931) no hits query - query object that returns no hits

2009-09-28 Thread John Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Wang updated LUCENE-1931:
--

Attachment: nohitsquery.patch

2 classes:

NoHitsQuery.java: query implementation
TestNoHitsQuery.java: unit test
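The essence of such a query is an iterator that is exhausted from the start. A standalone sketch of that matching behavior (hypothetical names, not the attached patch):

```java
// Hypothetical sketch of the matching behavior behind a "no hits" query:
// an iterator that is exhausted before producing any document.
public class NoHitsIteratorDemo {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    // Stand-in for a DocIdSetIterator that never matches anything.
    static class EmptyIterator {
        int next() { return NO_MORE_DOCS; }
        int skipTo(int target) { return NO_MORE_DOCS; }
    }

    static int countHits(EmptyIterator it) {
        int hits = 0;
        while (it.next() != NO_MORE_DOCS) hits++;
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(countHits(new EmptyIterator())); // 0
    }
}
```

Such a query is handy as a neutral element when composing boolean queries programmatically.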

> no hits query - query object that returns no hits
> -
>
> Key: LUCENE-1931
> URL: https://issues.apache.org/jira/browse/LUCENE-1931
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Affects Versions: 2.9
>Reporter: John Wang
> Attachments: nohitsquery.patch
>
>
> Query implementation that returns no hits.




[jira] Commented: (LUCENE-1931) no hits query - query object that returns no hits

2009-09-28 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12760268#action_12760268
 ] 

John Wang commented on LUCENE-1931:
---

Good to know!

Thanks

-John

> no hits query - query object that returns no hits
> -
>
> Key: LUCENE-1931
> URL: https://issues.apache.org/jira/browse/LUCENE-1931
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Affects Versions: 2.9
>Reporter: John Wang
> Attachments: nohitsquery.patch
>
>
> Query implementation that returns no hits.




[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-10-05 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762224#action_12762224
 ] 

John Wang commented on LUCENE-1458:
---

Hi Yonik:

 These are indeed useful features. LUCENE-1922 addresses 1); perhaps we 
can add 2) to the same issue to track?

Thanks

-John

> Further steps towards flexible indexing
> ---
>
> Key: LUCENE-1458
> URL: https://issues.apache.org/jira/browse/LUCENE-1458
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.9
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2
>
>




[jira] Created: (LUCENE-1969) adding kamikaze to lucene contrib

2009-10-10 Thread John Wang (JIRA)
adding kamikaze to lucene contrib
-

 Key: LUCENE-1969
 URL: https://issues.apache.org/jira/browse/LUCENE-1969
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Affects Versions: 2.9
Reporter: John Wang


Adding kamikaze to lucene contrib




[jira] Updated: (LUCENE-1969) adding kamikaze to lucene contrib

2009-10-10 Thread John Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Wang updated LUCENE-1969:
--

Attachment: kamikaze-contrib.patch

kamikaze contrib

> adding kamikaze to lucene contrib
> -
>
> Key: LUCENE-1969
> URL: https://issues.apache.org/jira/browse/LUCENE-1969
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Affects Versions: 2.9
>Reporter: John Wang
> Attachments: kamikaze-contrib.patch
>
>
> Adding kamikaze to lucene contrib




[jira] Commented: (LUCENE-1969) adding kamikaze to lucene contrib

2009-10-13 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12765306#action_12765306
 ] 

John Wang commented on LUCENE-1969:
---

My bad! The build.xml was not updated with the package name changes. I will 
update and post the fixed build.xml.

> adding kamikaze to lucene contrib
> -
>
> Key: LUCENE-1969
> URL: https://issues.apache.org/jira/browse/LUCENE-1969
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Affects Versions: 2.9
>Reporter: John Wang
> Attachments: kamikaze-contrib.patch
>
>
> Adding kamikaze to lucene contrib




[jira] Updated: (LUCENE-1969) adding kamikaze to lucene contrib

2009-10-13 Thread John Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Wang updated LUCENE-1969:
--

Attachment: build.xml

updated build.xml with package name changes.

> adding kamikaze to lucene contrib
> -
>
> Key: LUCENE-1969
> URL: https://issues.apache.org/jira/browse/LUCENE-1969
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Affects Versions: 2.9
>Reporter: John Wang
> Attachments: build.xml, kamikaze-contrib.patch
>
>
> Adding kamikaze to lucene contrib




[jira] Updated: (LUCENE-1969) adding kamikaze to lucene contrib

2009-10-13 Thread John Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Wang updated LUCENE-1969:
--

Attachment: kamikaze.contrib.patch2

Again, it was the package name. I redid the local run and all tests pass. 
Sorry about the back and forth.

Re: Michael
I selected the "Grant license to ASF..." radio button, and Kamikaze is already 
licensed under Apache 2.0. Is that form still needed? Since Kamikaze is 
contributed by LinkedIn, I am not sure who should be signing it.

Re: Yonik
Which package do you mean?

> adding kamikaze to lucene contrib
> -
>
> Key: LUCENE-1969
> URL: https://issues.apache.org/jira/browse/LUCENE-1969
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Affects Versions: 2.9
>Reporter: John Wang
>Assignee: Michael McCandless
> Attachments: build.xml, kamikaze-contrib.patch, 
> kamikaze.contrib.patch2, kamikaze.test.out
>
>
> Adding kamikaze to lucene contrib




[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API

2009-10-22 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12769045#action_12769045
 ] 

John Wang commented on LUCENE-1997:
---

My machine HW spec:

Model Name: MacBook Pro
  Model Identifier: MacBookPro3,1
  Processor Name:   Intel Core 2 Duo
  Processor Speed:  2.4 GHz
  Number Of Processors: 1
  Total Number Of Cores:2
  L2 Cache: 4 MB
  Memory:   4 GB
  Bus Speed:800 MHz

> Explore performance of multi-PQ vs single-PQ sorting API
> 
>
> Key: LUCENE-1997
> URL: https://issues.apache.org/jira/browse/LUCENE-1997
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.9
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Attachments: LUCENE-1997.patch, LUCENE-1997.patch
>
>
> Spinoff from recent "lucene 2.9 sorting algorithm" thread on java-dev,
> where a simpler (non-segment-based) comparator API is proposed that
> gathers results into multiple PQs (one per segment) and then merges
> them in the end.
> I started from John's multi-PQ code and worked it into
> contrib/benchmark so that we could run perf tests.  Then I generified
> the Python script I use for running search benchmarks (in
> contrib/benchmark/sortBench.py).
> The script first creates indexes with 1M docs (based on
> SortableSingleDocSource, and based on wikipedia, if available).  Then
> it runs various combinations:
>   * Index with 20 balanced segments vs index with the "normal" log
> segment size
>   * Queries with different numbers of hits (only for wikipedia index)
>   * Different top N
>   * Different sorts (by title, for wikipedia, and by random string,
> random int, and country for the random index)
> For each test, 7 search rounds are run and the best QPS is kept.  The
> script runs singlePQ then multiPQ, and records the resulting best QPS
> for each and produces table (in Jira format) as output.
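The multi-PQ idea described above (one queue per segment, merged at the end) can be sketched at the algorithm level. This is a stand-in using Python's heapq, not Lucene's actual PriorityQueue code; the function name and the (sort_key, doc_id) tuple shape are illustrative assumptions:

```python
# Sketch of the multi-PQ approach: keep a top-n heap per segment,
# then merge the per-segment results into one global top-n list.
import heapq

def top_n_multi_pq(segments, n):
    """Each segment is an iterable of (sort_key, doc_id) hits.
    Returns the n globally-best hits in ascending key order."""
    per_segment = []
    for seg in segments:
        # nsmallest keeps only the n best entries for this segment
        per_segment.append(heapq.nsmallest(n, seg))
    # final merge: at most n entries contributed per segment
    return heapq.nsmallest(n, heapq.merge(*per_segment))

# hypothetical data: two "segments" of (key, docid) hits
seg1 = [(5, 0), (1, 1), (9, 2)]
seg2 = [(2, 100), (7, 101)]
print(top_n_multi_pq([seg1, seg2], 2))  # -> [(1, 1), (2, 100)]
```

The single-PQ alternative would instead push all hits from every segment through one shared heap, which is the trade-off the benchmark measures.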




[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API

2009-10-22 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12769090#action_12769090
 ] 

John Wang commented on LUCENE-1997:
---

bq. topN: 100

I had made changes to sortBench.py to look at each run, and forgot to add 100 
back in :) My bad.


> Explore performance of multi-PQ vs single-PQ sorting API
> 
>
> Key: LUCENE-1997
> URL: https://issues.apache.org/jira/browse/LUCENE-1997
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.9
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Attachments: LUCENE-1997.patch, LUCENE-1997.patch
>
>




[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API

2009-10-23 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12769116#action_12769116
 ] 

John Wang commented on LUCENE-1997:
---

I think I found the reason for the discrepancy: a 32-bit vs. 64-bit JVM:

32-bit run:
jwang-mn:benchmark jwang$ python -u sortBench.py -report john3

||Seg size||Query||Tot hits||Sort||Top N||QPS old||QPS new||Pct change||
|log||100|rand string|10|92.24|103.65|{color:green}12.4%{color}|
|log||100|rand string|25|91.88|102.06|{color:green}11.1%{color}|
|log||100|rand string|50|91.72|99.07|{color:green}8.0%{color}|
|log||100|rand string|100|106.26|90.61|{color:red}-14.7%{color}|
|log||100|rand string|500|86.38|59.88|{color:red}-30.7%{color}|
|log||100|rand string|1000|74.88|39.93|{color:red}-46.7%{color}|
|log||100|country|10|92.33|103.79|{color:green}12.4%{color}|
|log||100|country|25|92.27|101.60|{color:green}10.1%{color}|
|log||100|country|50|91.58|99.14|{color:green}8.3%{color}|
|log||100|country|100|100.76|82.25|{color:red}-18.4%{color}|
|log||100|country|500|75.18|48.65|{color:red}-35.3%{color}|
|log||100|country|1000|67.68|32.67|{color:red}-51.7%{color}|
|log||100|rand int|10|88.14|101.93|{color:green}15.6%{color}|
|log||100|rand int|25|95.02|96.14|{color:green}1.2%{color}|
|log||100|rand int|50|96.54|89.61|{color:red}-7.2%{color}|
|log||100|rand int|100|88.58|92.06|{color:green}3.9%{color}|
|log||100|rand int|500|103.60|62.25|{color:red}-39.9%{color}|
|log||100|rand int|1000|92.36|40.84|{color:red}-55.8%{color}|

64-bit run:
jwang-mn:benchmark jwang$ python -u sortBench.py -report john4

||Seg size||Query||Tot hits||Sort||Top N||QPS old||QPS new||Pct change||
|log||100|rand string|10|119.59|107.52|{color:red}-10.1%{color}|
|log||100|rand string|25|119.25|105.05|{color:red}-11.9%{color}|
|log||100|rand string|50|117.22|101.99|{color:red}-13.0%{color}|
|log||100|rand string|100|95.78|86.19|{color:red}-10.0%{color}|
|log||100|rand string|500|76.05|54.71|{color:red}-28.1%{color}|
|log||100|rand string|1000|68.37|38.94|{color:red}-43.0%{color}|
|log||100|country|10|119.68|108.12|{color:red}-9.7%{color}|
|log||100|country|25|119.10|105.72|{color:red}-11.2%{color}|
|log||100|country|50|115.85|99.70|{color:red}-13.9%{color}|
|log||100|country|100|97.44|91.03|{color:red}-6.6%{color}|
|log||100|country|500|78.92|40.97|{color:red}-48.1%{color}|
|log||100|country|1000|68.48|30.43|{color:red}-55.6%{color}|
|log||100|rand int|10|121.64|108.82|{color:red}-10.5%{color}|
|log||100|rand int|25|121.68|113.92|{color:red}-6.4%{color}|
|log||100|rand int|50|120.80|110.45|{color:red}-8.6%{color}|
|log||100|rand int|100|101.36|95.68|{color:red}-5.6%{color}|
|log||100|rand int|500|90.15|60.29|{color:red}-33.1%{color}|
|log||100|rand int|1000|80.23|40.67|{color:red}-49.3%{color}|



> Explore performance of multi-PQ vs single-PQ sorting API
> 
>
> Key: LUCENE-1997
> URL: https://issues.apache.org/jira/browse/LUCENE-1997
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.9
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Attachments: LUCENE-1997.patch, LUCENE-1997.patch
>
>




[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API

2009-10-23 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12769119#action_12769119
 ] 

John Wang commented on LUCENE-1997:
---

I wrote a small test and verified that the 64-bit VM's string compare is much 
faster than the 32-bit VM's (which kinda makes sense).
The numbers above now all make sense.

> Explore performance of multi-PQ vs single-PQ sorting API
> 
>
> Key: LUCENE-1997
> URL: https://issues.apache.org/jira/browse/LUCENE-1997
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.9
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Attachments: LUCENE-1997.patch, LUCENE-1997.patch
>
>




[jira] Created: (LUCENE-2007) Add DocsetQuery to turn a DocIdSet into a query

2009-10-24 Thread John Wang (JIRA)
Add DocsetQuery to turn a DocIdSet into a query
---

 Key: LUCENE-2007
 URL: https://issues.apache.org/jira/browse/LUCENE-2007
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Search
Reporter: John Wang


Added a class DocsetQuery that can be constructed from a DocIdSet.




[jira] Updated: (LUCENE-2007) Add DocsetQuery to turn a DocIdSet into a query

2009-10-24 Thread John Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Wang updated LUCENE-2007:
--

Attachment: LUCENE-2007.patch

Contributed from Bobo.
Still needs work:

1) reader.isDeleted is now called, which perhaps should be optimized to avoid a 
synchronized call.
2) Don't know what the query syntax would be for this, so toString is not 
properly implemented.

> Add DocsetQuery to turn a DocIdSet into a query
> ---
>
> Key: LUCENE-2007
> URL: https://issues.apache.org/jira/browse/LUCENE-2007
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: John Wang
> Attachments: LUCENE-2007.patch
>
>
> Added a class DocsetQuery that can be constructed from a DocIdSet.




[jira] Updated: (LUCENE-2007) Add DocsetQuery to turn a DocIdSet into a query

2009-10-24 Thread John Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Wang updated LUCENE-2007:
--

Attachment: LUCENE-2007-2.patch

Fixed to use reader.termDocs(null) for the deletion check.
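The change can be sketched at the algorithm level (a Python stand-in, not the Lucene API; the function name is illustrative): rather than asking "is doc d deleted?" for every candidate (as reader.isDeleted does), advance through the ascending stream of live doc ids, which is what reader.termDocs(null) enumerates, and keep only the candidates that appear in it.

```python
# Sketch: intersect an ascending stream of candidate doc ids with an
# ascending stream of live (non-deleted) doc ids, leapfrog-style.
def filter_deleted(candidates, live_docs):
    """Both arguments are ascending iterables of doc ids.
    Yields the candidates that are still live."""
    live = iter(live_docs)
    cur = next(live, None)
    for doc in candidates:
        # advance the live-doc cursor up to the candidate
        while cur is not None and cur < doc:
            cur = next(live, None)
        if cur == doc:
            yield doc

# doc 8 was deleted (absent from the live stream), so it is dropped:
print(list(filter_deleted([1, 3, 5, 8], [0, 1, 2, 3, 4, 5, 6])))  # -> [1, 3, 5]
```

This turns many per-doc checks into one sequential pass over each stream, which is why it avoids the per-call synchronization cost.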


> Add DocsetQuery to turn a DocIdSet into a query
> ---
>
> Key: LUCENE-2007
> URL: https://issues.apache.org/jira/browse/LUCENE-2007
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: John Wang
> Attachments: LUCENE-2007-2.patch, LUCENE-2007.patch
>
>
> Added a class DocsetQuery that can be constructed from a DocIdSet.




[jira] Commented: (LUCENE-2007) Add DocsetQuery to turn a DocIdSet into a query

2009-10-24 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12769674#action_12769674
 ] 

John Wang commented on LUCENE-2007:
---

Both Paul and Uwe are absolutely correct!
I had blindly added this without looking around for other solutions within 
Lucene.
The constructor of ConstantScoreQuery using Filter is exactly the answer to 
Uwe's question on segmented readers.
Thanks!

-John

> Add DocsetQuery to turn a DocIdSet into a query
> ---
>
> Key: LUCENE-2007
> URL: https://issues.apache.org/jira/browse/LUCENE-2007
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: John Wang
> Attachments: LUCENE-2007-2.patch, LUCENE-2007.patch
>
>
> Added a class DocsetQuery that can be constructed from a DocIdSet.




[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API

2009-11-02 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772710#action_12772710
 ] 

John Wang commented on LUCENE-1997:
---

Hi Michael:

Any plans/decisions on moving forward with multi-PQ within Lucene? I am 
planning on making the change locally for my project, but I would rather not 
duplicate the work if you are planning on doing this within Lucene.

Thanks

-John

> Explore performance of multi-PQ vs single-PQ sorting API
> 
>
> Key: LUCENE-1997
> URL: https://issues.apache.org/jira/browse/LUCENE-1997
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.9
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Attachments: LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, 
> LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, 
> LUCENE-1997.patch, LUCENE-1997.patch
>
>




[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API

2009-11-02 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772754#action_12772754
 ] 

John Wang commented on LUCENE-1997:
---

Hi Michael:

Thanks for the heads up. I will work on it locally then.

I am a bit confused about the memory concern: since most users don't go beyond 
page one, I can't see how memory is even an issue here compared to the amount 
of memory Lucene uses overall. Am I missing something?

-John
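The memory trade-off being questioned can be made concrete with a back-of-envelope sketch. The per-entry byte count below is my own rough assumption, not a figure from the issue; the point is only the shape of the formula (multi-PQ holds one top-N queue per segment instead of one global queue):

```python
# Rough estimate of the *extra* transient memory multi-PQ needs during
# one search, relative to single-PQ.
def multi_pq_extra_bytes(num_segments, top_n, bytes_per_entry=16):
    # single-PQ: top_n entries total; multi-PQ: top_n entries per segment
    return (num_segments - 1) * top_n * bytes_per_entry

# e.g. the benchmark's 20-balanced-segment index, top 100, assuming
# ~16 bytes per queue entry:
print(multi_pq_extra_bytes(20, 100))  # -> 30400, i.e. ~30 KB per query
```

Under these assumptions the overhead is tens of kilobytes per in-flight query, which supports the argument that it is small next to Lucene's overall footprint, though it does scale with segment count and top N.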

> Explore performance of multi-PQ vs single-PQ sorting API
> 
>
> Key: LUCENE-1997
> URL: https://issues.apache.org/jira/browse/LUCENE-1997
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.9
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Attachments: LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, 
> LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, 
> LUCENE-1997.patch, LUCENE-1997.patch
>
>



