Question on Lucene search

2009-01-18 Thread fell

Hi all, 

I am new to Lucene and I need to know the following: 

In case I have indexed some data using Lucene and it contains the fields: 
Location, City, Country. 

Suppose the data is as follows in the index in each of the above fields: 
1) R G Heights 
2) London 
3) United Kingdom 

If I try to search the index with each of the following queries: 
1) RG Heights (note that R and G have no space between them), or 
2) RGHeights (no space at all), or 
3) R  G  Heights (extra spaces between tokens), or 
4) Kingdom United (reversed word order), 

please tell me, for each of the queries above, whether Lucene would come up 
with a positive result or report 'no hits'. 

Thanks!
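
Whether these hit depends entirely on the analyzer: Lucene matches the tokens
produced from the query text against the tokens produced at index time. A
minimal sketch of checking this (assuming StandardAnalyzer and the 2.x-era API
used elsewhere in this digest; all names are illustrative, not from the thread):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.RAMDirectory;

    public class TokenMatchSketch {
      public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        StandardAnalyzer analyzer = new StandardAnalyzer();
        IndexWriter writer = new IndexWriter(dir, analyzer, true,
            IndexWriter.MaxFieldLength.LIMITED);
        Document doc = new Document();
        // "R G Heights" is indexed as the tokens [r, g, heights]
        doc.add(new Field("location", "R G Heights",
            Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("country", "United Kingdom",
            Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.close();

        IndexSearcher searcher = new IndexSearcher(dir);
        QueryParser parser = new QueryParser("location", analyzer);
        // "RG Heights"    -> [rg, heights]: "rg" matches nothing, but
        //                    "heights" still hits under the default OR operator
        // "RGHeights"     -> [rgheights]: a single token that matches nothing
        // "R  G  Heights" -> [r, g, heights]: extra whitespace is irrelevant
        System.out.println(searcher.search(parser.parse("RG Heights"), 10).totalHits);
        System.out.println(searcher.search(parser.parse("RGHeights"), 10).totalHits);
        System.out.println(searcher.search(parser.parse("R  G  Heights"), 10).totalHits);
        // "Kingdom United" -> [kingdom, united]: term order does not matter
        // for plain term queries (it would for a phrase query)
        QueryParser country = new QueryParser("country", analyzer);
        System.out.println(searcher.search(country.parse("Kingdom United"), 10).totalHits);
      }
    }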



Re: [jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector

2009-01-18 Thread Mark Miller
One more run, just as a check, with far fewer unique terms (20k). I didn't 
notice that I hadn't clamped down enough on the unique terms in the last run. 
Back up to 21 segments this time, same wildcard search, 7718 hits, and the new 
method is still approximately 20% faster than the old. The last run was 16 
segments with far more unique terms; this one is 21 segments with far fewer.
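
For context, a minimal sketch of the idea being benchmarked here (2.9-era API;
this is the technique under discussion, not the actual LUCENE-1483 patch):
drive one collector across each segment in turn, rebasing segment-local doc
ids by the segment's doc base.

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.HitCollector;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    class PerSegmentSearch {
      static void search(IndexReader multiReader, Query query,
                         final HitCollector collector) throws IOException {
        int docBase = 0;
        for (IndexReader sub : multiReader.getSequentialSubReaders()) {
          final int base = docBase;
          new IndexSearcher(sub).search(query, new HitCollector() {
            public void collect(int doc, float score) {
              collector.collect(base + doc, score); // rebase to a global doc id
            }
          });
          docBase += sub.maxDoc();
        }
      }
    }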



7718
Segments file=segments_l numSegments=21 version=FORMAT_USER_DATA [Lucene 
2.9]

 1 of 21: name=_bbxo docCount=29349
   compound=true
   hasProx=true
   numFiles=2
   size (MB)=11.92
   docStoreOffset=0
   docStoreSegment=_bbxo
   docStoreIsCompoundFile=true
   no deletions
   test: open reader...OK
   test: fields, norms...OK [1 fields]
   test: terms, freq, prox...OK [21840 terms; 3875263 terms/docs pairs; 4516618 tokens]
   test: stored fields...OK [0 total field count; avg 0 fields per doc]
   test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]


 2 of 21: name=_bbxp docCount=29459
   compound=true
   hasProx=true
   numFiles=2
   size (MB)=11.982
   docStoreOffset=29349
   docStoreSegment=_bbxo
   docStoreIsCompoundFile=true
   no deletions
   test: open reader...OK
   test: fields, norms...OK [1 fields]
   test: terms, freq, prox...OK [21840 terms; 3895590 terms/docs pairs; 4540859 tokens]
   test: stored fields...OK [0 total field count; avg 0 fields per doc]
   test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]


 3 of 21: name=_bbxq docCount=29300
   compound=true
   hasProx=true
   numFiles=2
   size (MB)=11.97
   docStoreOffset=58808
   docStoreSegment=_bbxo
   docStoreIsCompoundFile=true
   no deletions
   test: open reader...OK
   test: fields, norms...OK [1 fields]
   test: terms, freq, prox...OK [21840 terms; 3890419 terms/docs pairs; 4536052 tokens]
   test: stored fields...OK [0 total field count; avg 0 fields per doc]
   test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]


 4 of 21: name=_bbxr docCount=29480
   compound=true
   hasProx=true
   numFiles=2
   size (MB)=11.971
   docStoreOffset=88108
   docStoreSegment=_bbxo
   docStoreIsCompoundFile=true
   no deletions
   test: open reader...OK
   test: fields, norms...OK [1 fields]
   test: terms, freq, prox...OK [21840 terms; 3894211 terms/docs pairs; 4538397 tokens]
   test: stored fields...OK [0 total field count; avg 0 fields per doc]
   test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]


 5 of 21: name=_bbxs docCount=29470
   compound=true
   hasProx=true
   numFiles=2
   size (MB)=11.979
   docStoreOffset=117588
   docStoreSegment=_bbxo
   docStoreIsCompoundFile=true
   no deletions
   test: open reader...OK
   test: fields, norms...OK [1 fields]
   test: terms, freq, prox...OK [21840 terms; 3895226 terms/docs pairs; 4540446 tokens]
   test: stored fields...OK [0 total field count; avg 0 fields per doc]
   test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]


 6 of 21: name=_bbxt docCount=29450
   compound=true
   hasProx=true
   numFiles=2
   size (MB)=11.98
   docStoreOffset=147058
   docStoreSegment=_bbxo
   docStoreIsCompoundFile=true
   no deletions
   test: open reader...OK
   test: fields, norms...OK [1 fields]
   test: terms, freq, prox...OK [21840 terms; 3892708 terms/docs pairs; 4538338 tokens]
   test: stored fields...OK [0 total field count; avg 0 fields per doc]
   test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]


 7 of 21: name=_bbxu docCount=29509
   compound=true
   hasProx=true
   numFiles=2
   size (MB)=11.978
   docStoreOffset=176508
   docStoreSegment=_bbxo
   docStoreIsCompoundFile=true
   no deletions
   test: open reader...OK
   test: fields, norms...OK [1 fields]
   test: terms, freq, prox...OK [21840 terms; 3894189 terms/docs pairs; 4538376 tokens]
   test: stored fields...OK [0 total field count; avg 0 fields per doc]
   test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]


 8 of 21: name=_bbxv docCount=29401
   compound=true
   hasProx=true
   numFiles=2
   size (MB)=11.976
   docStoreOffset=206017
   docStoreSegment=_bbxo
   docStoreIsCompoundFile=true
   no deletions
   test: open reader...OK
   test: fields, norms...OK [1 fields]
   test: terms, freq, prox...OK [21840 terms; 3891986 terms/docs pairs; 4538746 tokens]
   test: stored fields...OK [0 total field count; avg 0 fields per doc]
   test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]


 9 of 21: name=_bbxw docCount=29476
   compound=true
   hasProx=true
   numFiles=2
   size (MB)=11.988
   docStoreOffset=235418
   docStoreSegment=_bbxo
   docStoreIsCompoundFile=true
   no deletions
   test: open reader...OK
   test: fields,

[jira] Issue Comment Edited: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector

2009-01-18 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12664984#action_12664984
 ] 

markrmil...@gmail.com edited comment on LUCENE-1483 at 1/18/09 8:20 PM:
--

My previous results had a few oddities going on (I was playing around loosely). 
Being a little more careful, here is an example of the difference, and the 
hotspots. Timings are probably not completely comparable, as my machine couldn't 
keep up with profiling the second version very well; it's much slower without 
profiling as well, though:

Index is 60 docs, 46 segments

Load the fieldcache on one multireader

||method||time||invocations||
|FieldCacheImpl.createValue|156536(98%)|1|
|MultiTermDocs.next()|148499(93.5%)|621803|
|MultiTermDocs(int)|140397(88.4%)|1002938|
|SegmentTermDocs.seek(Term)|138332(87.1%)|1002938|

load the fieldcache on each sub reader of the multireader, one at a time

||method||time||invocations||
|FieldCacheImpl.createValue|7815(80.4%)|46|
|SegmentTermDocs.next()|3315(34.1%)|642046|
|SegmentTermEnum.next()|1936(19.9%)|42046|
|SegmentTermDocs.seek(TermEnum)|874(9%)|42046|


*edit*
wrong values





  was (Author: markrmil...@gmail.com):
My previous results had a few oddities going on (I was playing around loosely). 
Being a little more careful, here is an example of the difference, and the 
hotspots. Timings are probably not completely comparable, as my machine couldn't 
keep up with profiling the second version very well; it's much slower without 
profiling as well, though:

Index is 60 docs, 46 segments, 63849 unique terms.

Load the fieldcache on one multireader

||method||time||invocations||
|FieldCacheImpl.createValue|156536(98%)|1|
|MultiTermDocs.next()|148499(93.5%)|621803|
|MultiTermDocs(int)|140397(88.4%)|1002938|
|SegmentTermDocs.seek(Term)|138332(87.1%)|1002938|

load the fieldcache on each sub reader of the multireader, one at a time

||method||time||invocations||
|FieldCacheImpl.createValue|7815(80.4%)|46|
|SegmentTermDocs.next()|3315(34.1%)|642046|
|SegmentTermEnum.next()|1936(19.9%)|42046|
|SegmentTermDocs.seek(TermEnum)|874(9%)|42046|


*edit*
wrong values




  
> Change IndexSearcher multisegment searches to search each individual segment 
> using a single HitCollector
> 
>
> Key: LUCENE-1483
> URL: https://issues.apache.org/jira/browse/LUCENE-1483
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.9
>Reporter: Mark Miller
>Priority: Minor
> Attachments: LUCENE-1483-partial.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> sortBench.py, sortCollate.py
>
>
> FieldCache and Filters are forced down to a single segment reader, allowing 
> for individual segment reloading on reopen.




Re: [jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector

2009-01-18 Thread Mark Miller
Man, I'm not paying attention. I switched the analyzer but didn't take the 
field off UNANALYZED. Here are the correct results. It's actually only about 
20-30% faster for the index I used, so a lot of that could be the gains that 
we were seeing in general anyway. Perhaps a bit more, too. 
Still a total of 60 docs.




   WildcardQuery query = new WildcardQuery(new Term("string", "00011*"));
   query.setConstantScoreRewrite(true);

   TopDocs results = searcher.search(query, 10);
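
(setConstantScoreRewrite is the 2.9-era switch that makes a multi-term query
rewrite to a constant-scoring form instead of a BooleanQuery over every
matching term.)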

26817 hits


Segments file=segments_h numSegments=16 version=FORMAT_USER_DATA [Lucene 
2.9]

 1 of 16: name=_bbs7 docCount=114352
   compound=true
   hasProx=true
   numFiles=2
   size (MB)=92.409
   docStoreOffset=0
   docStoreSegment=_bbr3
   docStoreIsCompoundFile=true
   no deletions
   test: open reader...OK
   test: fields, norms...OK [1 fields]
   test: terms, freq, prox...OK [488265 terms; 25989763 terms/docs pairs; 29071032 tokens]
   test: stored fields...OK [0 total field count; avg 0 fields per doc]
   test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]


 2 of 16: name=_bbtc docCount=114210
   compound=true
   hasProx=true
   numFiles=2
   size (MB)=92.42
   docStoreOffset=114352
   docStoreSegment=_bbr3
   docStoreIsCompoundFile=true
   no deletions
   test: open reader...OK
   test: fields, norms...OK [1 fields]
   test: terms, freq, prox...OK [488268 terms; 25992529 terms/docs pairs; 29076321 tokens]
   test: stored fields...OK [0 total field count; avg 0 fields per doc]
   test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]


 3 of 16: name=_bbuh docCount=114299
   compound=true
   hasProx=true
   numFiles=2
   size (MB)=92.42
   docStoreOffset=228562
   docStoreSegment=_bbr3
   docStoreIsCompoundFile=true
   no deletions
   test: open reader...OK
   test: fields, norms...OK [1 fields]
   test: terms, freq, prox...OK [488263 terms; 25997224 terms/docs pairs; 29078745 tokens]
   test: stored fields...OK [0 total field count; avg 0 fields per doc]
   test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]


 4 of 16: name=_bbvm docCount=114177
   compound=true
   hasProx=true
   numFiles=2
   size (MB)=92.444
   docStoreOffset=342861
   docStoreSegment=_bbr3
   docStoreIsCompoundFile=true
   no deletions
   test: open reader...OK
   test: fields, norms...OK [1 fields]
   test: terms, freq, prox...OK [488267 terms; 25992316 terms/docs pairs; 29080800 tokens]
   test: stored fields...OK [0 total field count; avg 0 fields per doc]
   test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]


 5 of 16: name=_bbwr docCount=114308
   compound=true
   hasProx=true
   numFiles=2
   size (MB)=92.452
   docStoreOffset=457038
   docStoreSegment=_bbr3
   docStoreIsCompoundFile=true
   no deletions
   test: open reader...OK
   test: fields, norms...OK [1 fields]
   test: terms, freq, prox...OK [488268 terms; 26005497 terms/docs pairs; 29088895 tokens]
   test: stored fields...OK [0 total field count; avg 0 fields per doc]
   test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]


 6 of 16: name=_bbws docCount=2760
   compound=true
   hasProx=true
   numFiles=2
   size (MB)=3.26
   docStoreOffset=571346
   docStoreSegment=_bbr3
   docStoreIsCompoundFile=true
   no deletions
   test: open reader...OK
   test: fields, norms...OK [1 fields]
   test: terms, freq, prox...OK [168214 terms; 647045 terms/docs pairs; 724715 tokens]
   test: stored fields...OK [0 total field count; avg 0 fields per doc]
   test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]


 7 of 16: name=_bbwt docCount=2820
   compound=true
   hasProx=true
   numFiles=2
   size (MB)=3.259
   docStoreOffset=574106
   docStoreSegment=_bbr3
   docStoreIsCompoundFile=true
   no deletions
   test: open reader...OK
   test: fields, norms...OK [1 fields]
   test: terms, freq, prox...OK [168255 terms; 649189 terms/docs pairs; 725671 tokens]
   test: stored fields...OK [0 total field count; avg 0 fields per doc]
   test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]


 8 of 16: name=_bbwu docCount=2844
   compound=true
   hasProx=true
   numFiles=2
   size (MB)=3.263
   docStoreOffset=576926
   docStoreSegment=_bbr3
   docStoreIsCompoundFile=true
   no deletions
   test: open reader...OK
   test: fields, norms...OK [1 fields]
   test: terms, freq, prox...OK [168210 terms; 649731 terms/docs pairs; 727095 tokens]
   test: stored fields...OK [0 total field count; avg 0 fields per doc]
   test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]


 9 of 16: name=_bbwv docCount=2871
   compound=true
   hasProx=true
   numFiles=2
   size (MB)=3.266
   docStore

Re: [jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector

2009-01-18 Thread Mark Miller

Oh yeah, that's a constant-score wildcard query.

Mark Miller wrote:
Just checked it out, and it's not a bad win on multi-term queries. It's 
not the same exponential gain as field cache loading, but I bet lots 
of 2-3x type stuff. You appear to save a decent amount by not applying 
every term to each segment, because of the logarithmic sizing.


My query of new WildcardQuery(new Term("string", "00*")) gets 789 
hits, and takes half the time with this patch. The index is below:


Segments file=segments_a numSegments=34 version=FORMAT_USER_DATA 
[Lucene 2.9]

 1 of 34: name=_bb48 docCount=159845
   compound=true
   hasProx=true
   numFiles=2
   size (MB)=294.529
   docStoreOffset=0
   docStoreSegment=_bb34
   docStoreIsCompoundFile=true
   no deletions
   test: open reader...OK
   test: fields, norms...OK [1 fields]
   test: terms, freq, prox...OK [159845 terms; 159845 terms/docs pairs; 159845 tokens]
   test: stored fields...OK [0 total field count; avg 0 fields per doc]
   test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]


 2 of 34: name=_bb5d docCount=159977
   compound=true
   hasProx=true
   numFiles=2
   size (MB)=294.681
   docStoreOffset=159845
   docStoreSegment=_bb34
   docStoreIsCompoundFile=true
   no deletions
   test: open reader...OK
   test: fields, norms...OK [1 fields]
   test: terms, freq, prox...OK [159977 terms; 159977 terms/docs pairs; 159977 tokens]
   test: stored fields...OK [0 total field count; avg 0 fields per doc]
   test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]


 3 of 34: name=_bb6i docCount=159691
   compound=true
   hasProx=true
   numFiles=2
   size (MB)=294.701
   docStoreOffset=319822
   docStoreSegment=_bb34
   docStoreIsCompoundFile=true
   no deletions
   test: open reader...OK
   test: fields, norms...OK [1 fields]
   test: terms, freq, prox...OK [159691 terms; 159691 terms/docs pairs; 159691 tokens]
   test: stored fields...OK [0 total field count; avg 0 fields per doc]
   test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]


 4 of 34: name=_bb6j docCount=3978
   compound=true
   hasProx=true
   numFiles=2
   size (MB)=7.382
   docStoreOffset=479513
   docStoreSegment=_bb34
   docStoreIsCompoundFile=true
   no deletions
   test: open reader...OK
   test: fields, norms...OK [1 fields]
   test: terms, freq, prox...OK [3978 terms; 3978 terms/docs pairs; 3978 tokens]
   test: stored fields...OK [0 total field count; avg 0 fields per doc]
   test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]


 5 of 34: name=_bb6k docCount=4002
   compound=true
   hasProx=true
   numFiles=2
   size (MB)=7.353
   docStoreOffset=483491
   docStoreSegment=_bb34
   docStoreIsCompoundFile=true
   no deletions
   test: open reader...OK
   test: fields, norms...OK [1 fields]
   test: terms, freq, prox...OK [4002 terms; 4002 terms/docs pairs; 4002 tokens]
   test: stored fields...OK [0 total field count; avg 0 fields per doc]
   test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]


 6 of 34: name=_bb6l docCount=3959
   compound=true
   hasProx=true
   numFiles=2
   size (MB)=7.365
   docStoreOffset=487493
   docStoreSegment=_bb34
   docStoreIsCompoundFile=true
   no deletions
   test: open reader...OK
   test: fields, norms...OK [1 fields]
   test: terms, freq, prox...OK [3959 terms; 3959 terms/docs pairs; 3959 tokens]
   test: stored fields...OK [0 total field count; avg 0 fields per doc]
   test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]


 7 of 34: name=_bb6m docCount=3938
   compound=true
   hasProx=true
   numFiles=2
   size (MB)=7.35
   docStoreOffset=491452
   docStoreSegment=_bb34
   docStoreIsCompoundFile=true
   no deletions
   test: open reader...OK
   test: fields, norms...OK [1 fields]
   test: terms, freq, prox...OK [3938 terms; 3938 terms/docs pairs; 3938 tokens]
   test: stored fields...OK [0 total field count; avg 0 fields per doc]
   test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]


 8 of 34: name=_bb6n docCount=4020
   compound=true
   hasProx=true
   numFiles=2
   size (MB)=7.379
   docStoreOffset=495390
   docStoreSegment=_bb34
   docStoreIsCompoundFile=true
   no deletions
   test: open reader...OK
   test: fields, norms...OK [1 fields]
   test: terms, freq, prox...OK [4020 terms; 4020 terms/docs pairs; 4020 tokens]
   test: stored fields...OK [0 total field count; avg 0 fields per doc]
   test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]


 9 of 34: name=_bb6o docCount=3973
   compound=true
   hasProx=true
   numFiles=2
   size (MB)=7.385
   docStoreOffset=499410
   docStoreSegment=_bb34
   docStoreIsCompoundFile=true

Re: [jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector

2009-01-18 Thread Mark Miller
Just checked it out, and it's not a bad win on multi-term queries. It's 
not the same exponential gain as field cache loading, but I bet lots of 
2-3x type stuff. You appear to save a decent amount by not applying 
every term to each segment, because of the logarithmic sizing.


My query of new WildcardQuery(new Term("string", "00*")) gets 789 hits, 
and takes half the time with this patch. The index is below:


Segments file=segments_a numSegments=34 version=FORMAT_USER_DATA [Lucene 
2.9]

 1 of 34: name=_bb48 docCount=159845
   compound=true
   hasProx=true
   numFiles=2
   size (MB)=294.529
   docStoreOffset=0
   docStoreSegment=_bb34
   docStoreIsCompoundFile=true
   no deletions
   test: open reader...OK
   test: fields, norms...OK [1 fields]
   test: terms, freq, prox...OK [159845 terms; 159845 terms/docs pairs; 159845 tokens]
   test: stored fields...OK [0 total field count; avg 0 fields per doc]
   test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]


 2 of 34: name=_bb5d docCount=159977
   compound=true
   hasProx=true
   numFiles=2
   size (MB)=294.681
   docStoreOffset=159845
   docStoreSegment=_bb34
   docStoreIsCompoundFile=true
   no deletions
   test: open reader...OK
   test: fields, norms...OK [1 fields]
   test: terms, freq, prox...OK [159977 terms; 159977 terms/docs pairs; 159977 tokens]
   test: stored fields...OK [0 total field count; avg 0 fields per doc]
   test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]


 3 of 34: name=_bb6i docCount=159691
   compound=true
   hasProx=true
   numFiles=2
   size (MB)=294.701
   docStoreOffset=319822
   docStoreSegment=_bb34
   docStoreIsCompoundFile=true
   no deletions
   test: open reader...OK
   test: fields, norms...OK [1 fields]
   test: terms, freq, prox...OK [159691 terms; 159691 terms/docs pairs; 159691 tokens]
   test: stored fields...OK [0 total field count; avg 0 fields per doc]
   test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]


 4 of 34: name=_bb6j docCount=3978
   compound=true
   hasProx=true
   numFiles=2
   size (MB)=7.382
   docStoreOffset=479513
   docStoreSegment=_bb34
   docStoreIsCompoundFile=true
   no deletions
   test: open reader...OK
   test: fields, norms...OK [1 fields]
   test: terms, freq, prox...OK [3978 terms; 3978 terms/docs pairs; 3978 tokens]
   test: stored fields...OK [0 total field count; avg 0 fields per doc]
   test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]


 5 of 34: name=_bb6k docCount=4002
   compound=true
   hasProx=true
   numFiles=2
   size (MB)=7.353
   docStoreOffset=483491
   docStoreSegment=_bb34
   docStoreIsCompoundFile=true
   no deletions
   test: open reader...OK
   test: fields, norms...OK [1 fields]
   test: terms, freq, prox...OK [4002 terms; 4002 terms/docs pairs; 4002 tokens]
   test: stored fields...OK [0 total field count; avg 0 fields per doc]
   test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]


 6 of 34: name=_bb6l docCount=3959
   compound=true
   hasProx=true
   numFiles=2
   size (MB)=7.365
   docStoreOffset=487493
   docStoreSegment=_bb34
   docStoreIsCompoundFile=true
   no deletions
   test: open reader...OK
   test: fields, norms...OK [1 fields]
   test: terms, freq, prox...OK [3959 terms; 3959 terms/docs pairs; 3959 tokens]
   test: stored fields...OK [0 total field count; avg 0 fields per doc]
   test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]


 7 of 34: name=_bb6m docCount=3938
   compound=true
   hasProx=true
   numFiles=2
   size (MB)=7.35
   docStoreOffset=491452
   docStoreSegment=_bb34
   docStoreIsCompoundFile=true
   no deletions
   test: open reader...OK
   test: fields, norms...OK [1 fields]
   test: terms, freq, prox...OK [3938 terms; 3938 terms/docs pairs; 3938 tokens]
   test: stored fields...OK [0 total field count; avg 0 fields per doc]
   test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]


 8 of 34: name=_bb6n docCount=4020
   compound=true
   hasProx=true
   numFiles=2
   size (MB)=7.379
   docStoreOffset=495390
   docStoreSegment=_bb34
   docStoreIsCompoundFile=true
   no deletions
   test: open reader...OK
   test: fields, norms...OK [1 fields]
   test: terms, freq, prox...OK [4020 terms; 4020 terms/docs pairs; 4020 tokens]
   test: stored fields...OK [0 total field count; avg 0 fields per doc]
   test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]


 9 of 34: name=_bb6o docCount=3973
   compound=true
   hasProx=true
   numFiles=2
   size (MB)=7.385
   docStoreOffset=499410
   docStoreSegment=_bb34
   docStoreIsCompoundFile=true
   no deletions
   test: open reader...OK
   test: fields, nor

Re: Filesystem based bitset

2009-01-18 Thread Paul Elschot
On Friday 09 January 2009 22:30:14 Marvin Humphrey wrote:
> On Fri, Jan 09, 2009 at 08:11:31PM +0100, Karl Wettin wrote:
> 
> > SSD is pretty close to RAM when it comes to seeking. Wouldn't that  
> > mean that a bitset stored on an SSD would be more or less as fast as a  
> > bitset in RAM? 
> 
> Provided that your index can fit in the system i/o cache and stay there, you
> get the speed of RAM regardless of the underlying permanent storage type.
> There's no reason to wait on SSDs before implementing such a feature.

Since this started by thinking out loud, I'd like to continue doing that.

I've been thinking about how to add a decent skipTo() to something that
compresses better than an (Open)BitSet, and this turns out to be an
integer set implemented as a B+ tree (all leaves on the same level) of
integers only, with key/data compression by a frame of reference for
every node (see LUCENE-1410).
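
For concreteness, a tiny sketch of frame-of-reference coding for one node of
sorted doc ids (illustrative only; LUCENE-1410 has the real design): store the
node's minimum plus the bit width of the largest offset, and pack every offset
at that fixed width, which keeps random access (and hence skipTo) cheap.

    // Illustrative frame-of-reference coding for one node of sorted ints.
    class ForBlock {
      final int frame;      // minimum value in the node
      final int bits;       // bits needed for the largest offset
      final long[] packed;  // offsets, each 'bits' wide

      ForBlock(int[] sortedValues) {
        frame = sortedValues[0];
        int maxOffset = sortedValues[sortedValues.length - 1] - frame;
        bits = Math.max(1, 32 - Integer.numberOfLeadingZeros(maxOffset));
        packed = new long[(sortedValues.length * bits + 63) / 64];
        for (int i = 0; i < sortedValues.length; i++) {
          long offset = sortedValues[i] - frame;
          int pos = i * bits;
          packed[pos >> 6] |= offset << (pos & 63);
          if ((pos & 63) + bits > 64) {  // offset straddles two longs
            packed[(pos >> 6) + 1] |= offset >>> (64 - (pos & 63));
          }
        }
      }

      int get(int i) {  // O(1) access; no need to decode the whole node
        int pos = i * bits;
        long v = packed[pos >> 6] >>> (pos & 63);
        if ((pos & 63) + bits > 64) {
          v |= packed[(pos >> 6) + 1] << (64 - (pos & 63));
        }
        return frame + (int) (v & ((1L << bits) - 1));
      }
    }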

I found a Java implementation of a B+ tree on SourceForge: bplusdotnet,
in the BplusJ package; see http://bplusdotnet.sourceforge.net/ .
This has nice transaction semantics on a file system and it has a BSD licence,
so it could be used as a starting point, but:
- it only has strings as index values, so it will need quite some simplification
to work on integers as keys and data, and
- it has no built-in compression, as far as I could see on first inspection.

The questions:

Would someone know of a better starting point for a B+ tree of integers
with node compression?

For example, how close is the current Lucene code base to implementing
a B+ tree for the doc ids of a single term?

How valuable are transaction semantics for such an integer set? It is
tempting to try and implement such an integer set by starting from the
ground up, but I don't have any practical programming experience with
transaction semantics, so it may be better to start from something that
has transactions right from the start.

Regards,
Paul Elschot


[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector

2009-01-18 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665008#action_12665008
 ] 

Mark Miller commented on LUCENE-1483:
-

Okay, I think I have it. I tried to count the terms per segment by getting a 
term enum and looping through it for each sub reader, but something must have 
been off with that count. Taking a second look with CheckIndex, all of the 
docs/terms are piled into the first couple of segments; the rest are a long 
tail of segments with few terms. So it makes sense: for every segment with few 
terms in it, all of the unique terms for the whole index get checked against 
it. A segment with even one term will be seeked against for every unique term 
in the whole index. That's what was happening in this case, since it's like a 
logarithmic drop. I'll try playing around some more with less of a thin tail 
of segments, but I guess that is enough to explain the drop in seeks in this case.
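
(For scale, from the CheckIndex output earlier in the thread: the five large 
segments each hold ~488k unique terms and the eleven small ones ~168k each, 
yet under the old method every segment, however small, gets seeked once for 
every unique term in the whole index.)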

> Change IndexSearcher multisegment searches to search each individual segment 
> using a single HitCollector
> 
>
> Key: LUCENE-1483
> URL: https://issues.apache.org/jira/browse/LUCENE-1483
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.9
>Reporter: Mark Miller
>Priority: Minor
> Attachments: LUCENE-1483-partial.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> sortBench.py, sortCollate.py
>
>
> FieldCache and Filters are forced down to a single segment reader, allowing 
> for individual segment reloading on reopen.




[jira] Issue Comment Edited: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector

2009-01-18 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12664984#action_12664984
 ] 

markrmil...@gmail.com edited comment on LUCENE-1483 at 1/18/09 2:19 PM:
--

My previous results had a few oddities going on (I was playing around loosely). 
Being a little more careful, here is an example of the difference, and the 
hotspots. Timings are probably not completely comparable, as my machine couldn't 
keep up with profiling the second version very well; it's much slower without 
profiling as well, though:

Index is 60 docs, 46 segments, 63849 unique terms.

Load the fieldcache on one multireader

||method||time||invocations||
|FieldCacheImpl.createValue|156536(98%)|1|
|MultiTermDocs.next()|148499(93.5%)|621803|
|MultiTermDocs(int)|140397(88.4%)|1002938|
|SegmentTermDocs.seek(Term)|138332(87.1%)|1002938|

load the fieldcache on each sub reader of the multireader, one at a time

||method||time||invocations||
|FieldCacheImpl.createValue|7815(80.4%)|46|
|SegmentTermDocs.next()|3315(34.1%)|642046|
|SegmentTermEnum.next()|1936(19.9%)|42046|
|SegmentTermDocs.seek(TermEnum)|874(9%)|42046|


*edit*
wrong values





  was (Author: markrmil...@gmail.com):
My previous results had a few oddities going on (I was playing around loosely). 
Being a little more careful, here is an example of the difference, and the 
hotspots. Timings are probably not completely comparable, as my machine couldn't 
keep up with profiling the second version very well; it's much slower without 
profiling as well, though:

Index is 60 docs, 46 segments, 63849 unique terms.

Load the fieldcache on one multireader

||method||time||invocations||
|FieldCacheImpl.createValue|156536(98%)|1|
|MultiTermDocs.next()|148499(93.5%)|621803|
|MultiTermDocs(int)|140397(88.4%)|1002938|
|SegmentTermDocs.seek(Term)|138332(87.1%)|1002938|

load the fieldcache on each sub reader of the multireader, one at a time

||method||time||invocations||
|FieldCacheImpl.createValue|7815(80.4%)|46|
|SegmentTermDocs.next()|3315(34.1%)|642046|
|SegmentTermEnum.next()|1936(19.9%)|42046|
|SegmentTermDocs.seek(TermEnum)|874(9%)|42046|


Unique terms per segment:
21312,41837,41843,41849,41854,41860,41865,41870,41878,41883,41888,41894,41902,41906,41910,41912,41916,41921,41924
41930,41932,41936,41943,41947,41951,41956,41960,41964,41970,41974,41979,41982,41989,41994,41999,42002,42005
42007,42011,42016,42020,42026,42033,42039,42044,42046




  
> Change IndexSearcher multisegment searches to search each individual segment 
> using a single HitCollector
> 
>
> Key: LUCENE-1483
> URL: https://issues.apache.org/jira/browse/LUCENE-1483
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.9
>Reporter: Mark Miller
>Priority: Minor
> Attachments: LUCENE-1483-partial.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> sortBench.py, sortCollate.py
>
>
> FieldCache and Filters are forced down to a single segment reader, allowing 
> for individual segment reloading on reopen.




[jira] Resolved: (LUCENE-1124) short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity

2009-01-18 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller resolved LUCENE-1124.
-

Resolution: Fixed

Thanks! Committed.

> short circuit FuzzyQuery.rewrite when input token length is small compared to 
> minSimilarity
> ---
>
> Key: LUCENE-1124
> URL: https://issues.apache.org/jira/browse/LUCENE-1124
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Query/Scoring
>Reporter: Hoss Man
>Assignee: Mark Miller
>Priority: Trivial
> Fix For: 2.9
>
> Attachments: LUCENE-1124.patch, LUCENE-1124.patch, LUCENE-1124.patch
>
>
> I found this (unreplied to) email floating around in my Lucene folder from 
> during the holidays...
> {noformat}
> From: Timo Nentwig
> To: java-dev
> Subject: Fuzzy makes no sense for short tokens
> Date: Mon, 31 Dec 2007 16:01:11 +0100
> Message-Id: <200712311601.12255.luc...@nitwit.de>
> Hi!
> it generally makes no sense to search fuzzy for short tokens because changing
> even only a single character of course already results in a high edit
> distance. So it actually only makes sense in this case:
>if( token.length() > 1f / (1f - minSimilarity) )
> E.g. changing one character in a 3-letter token (foo) results in an edit
> distance of 0.6. And if minSimilarity (which is by default: 0.5 :-) is higher
> we can save all the expensive rewrite() logic.
> {noformat}
> I don't know much about FuzzyQueries, but this reasoning seems sound ... 
> FuzzyQuery.rewrite should be able to completely skip all TermEnumeration in 
> the event that the input token is shorter than some simple math on the 
> minSimilarity.  (I'm not smart enough to be certain that the math above is 
> right, however ... it's been a while since I looked at Levenshtein distances 
> ... tests needed)
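
A sketch of the proposed short circuit (the helper name is hypothetical;
similarity here is computed roughly as 1 - editDistance/length, so one edit
to a term of length n yields 1 - 1/n, and when n <= 1/(1 - minSimilarity)
nothing but the exact term can clear the threshold):

    // Hypothetical helper: only enumerate terms when a single edit can
    // still keep the similarity at or above minSimilarity.
    static boolean worthEnumerating(String term, float minSimilarity) {
      return term.length() > 1f / (1f - minSimilarity);
    }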




[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector

2009-01-18 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12664988#action_12664988
 ] 

Michael McCandless commented on LUCENE-1483:


{quote}
In fact this probably causes the underlying buffer in
BufferedIndexReader to get reloaded many times whenever we cross a
boundary
{quote}
OK I was wrong about this -- there is logic in TermInfosReader.get to not go 
backwards to the last index term.  So we are in fact reading each tis file 
sequentially...

> Change IndexSearcher multisegment searches to search each individual segment 
> using a single HitCollector
> 
>
> Key: LUCENE-1483
> URL: https://issues.apache.org/jira/browse/LUCENE-1483
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.9
>Reporter: Mark Miller
>Priority: Minor
> Attachments: LUCENE-1483-partial.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> sortBench.py, sortCollate.py
>
>
> FieldCache and Filters are forced down to a single segment reader, allowing 
> for individual segment reloading on reopen.




[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector

2009-01-18 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12664984#action_12664984
 ] 

Mark Miller commented on LUCENE-1483:
-

My previous results had a few oddities going on (I was playing around loosely). 
Being a little more careful, here is an example of the difference, and the 
hotspots. Timings are probably not completely comparable, as my machine couldn't 
keep up with profiling the second version very well; it's much slower without 
profiling as well, though:

Index is 60 docs, 46 segments, 63849 unique terms.

Load the fieldcache on one multireader

||method||time||invocations||
|FieldCacheImpl.createValue|156536(98%)|1|
|MultiTermDocs.next()|148499(93.5%)|621803|
|MultiTermDocs(int)|140397(88.4%)|1002938|
|SegmentTermDocs.seek(Term)|138332(87.1%)|1002938|

load the fieldcache on each sub reader of the multireader, one at a time

||method||time||invocations||
|FieldCacheImpl.createValue|7815(80.4%)|46|
|SegmentTermDocs.next()|3315(34.1%)|642046|
|SegmentTermEnum.next()|1936(19.9%)|42046|
|SegmentTermDocs.seek(TermEnum)|874(9%)|42046|


Unique terms per segment:
21312,41837,41843,41849,41854,41860,41865,41870,41878,41883,41888,41894,41902,41906,41910,41912,41916,41921,41924
41930,41932,41936,41943,41947,41951,41956,41960,41964,41970,41974,41979,41982,41989,41994,41999,42002,42005
42007,42011,42016,42020,42026,42033,42039,42044,42046
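
The two loading strategies profiled above, in sketch form (2.9-era API; the
field name is illustrative):

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.FieldCache;

    class FieldCacheLoading {
      // One pass through the MultiReader: each term lookup fans out to every
      // segment, including the N-1 segments that don't contain the term.
      static String[] viaMultiReader(IndexReader multiReader) throws IOException {
        return FieldCache.DEFAULT.getStrings(multiReader, "string");
      }

      // Per-subreader: each segment walks only its own term dictionary.
      static void perSegment(IndexReader multiReader) throws IOException {
        for (IndexReader sub : multiReader.getSequentialSubReaders()) {
          String[] values = FieldCache.DEFAULT.getStrings(sub, "string");
        }
      }
    }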





> Change IndexSearcher multisegment searches to search each individual segment 
> using a single HitCollector
> 
>
> Key: LUCENE-1483
> URL: https://issues.apache.org/jira/browse/LUCENE-1483
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.9
>Reporter: Mark Miller
>Priority: Minor
> Attachments: LUCENE-1483-partial.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> sortBench.py, sortCollate.py
>
>
> FieldCache and Filters are forced down to a single segment reader, allowing 
> for individual segment reloading on reopen.




[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector

2009-01-18 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12664983#action_12664983
 ] 

Yonik Seeley commented on LUCENE-1483:
--

bq. think the massive slowness of iterating through all terms & docs from a 
MultiTermEnum/Docs may come from asking the N-1 SegmentReaders to seek to a 
non-existent (for them) term.

I've seen cases where the MultiTermEnum was the bottleneck (compared to the 
MultiTermDocs) when iterating over all docs for all terms in a field.  But 
quickly looking at the code, MultiTermEnum.next() looks pretty efficient.

> Change IndexSearcher multisegment searches to search each individual segment 
> using a single HitCollector
> 
>
> Key: LUCENE-1483
> URL: https://issues.apache.org/jira/browse/LUCENE-1483
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.9
>Reporter: Mark Miller
>Priority: Minor
> Attachments: LUCENE-1483-partial.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> sortBench.py, sortCollate.py
>
>
> FieldCache and Filters are forced down to a single segment reader, allowing 
> for individual segment reloading on reopen.




[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector

2009-01-18 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12664979#action_12664979
 ] 

Michael McCandless commented on LUCENE-1483:



bq. Even still, you are seeing like a 40% diff, but small enough times to not 
matter. 

Right, good point.

I think the massive slowness of iterating through all terms & docs
from a MultiTermEnum/Docs may come from asking the N-1 SegmentReaders
to seek to a non-existent (for them) term.

Ie when we ask MultiTermDocs to seek to a unique title X, only the
particular segment that title X comes from actually has it, whereas
the others do a costly seek to the index term just before it then scan
to look for the non-existent term, and then repeat that for the next
title, etc.

In fact this probably causes the underlying buffer in
BufferedIndexReader to get reloaded many times whenever we cross a
boundary (ie, we keep flipping between buffer N and N+1, then back to
N then N+1 again, etc.) -- maybe that's the source of the massive slowness?

BTW I think this change may also speed up Range/PrefixQuery as well.


> Change IndexSearcher multisegment searches to search each individual segment 
> using a single HitCollector
> 
>
> Key: LUCENE-1483
> URL: https://issues.apache.org/jira/browse/LUCENE-1483
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.9
>Reporter: Mark Miller
>Priority: Minor
> Attachments: LUCENE-1483-partial.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> sortBench.py, sortCollate.py
>
>
> FieldCache and Filters are forced down to a single segment reader, allowing 
> for individual segment reloading on reopen.




[jira] Updated: (LUCENE-1314) IndexReader.clone

2009-01-18 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1314:
---

Attachment: LUCENE-1314.patch


New patch attached.  All tests pass.  Changes:

  * Simplified semantics: if you clone non-readOnly reader1 to
non-readOnly reader2, I now simply clear hasChanges & writeLock on
reader1 and transfer them to reader2, but do not set readOnly in
reader1.  This means reader1 is free to attempt to acquire the
write lock if it wants, and it simply fails if it's stale (ie, we
just re-use the existing code path to catch this, rather than add
a new check), and this way we never have a case where an existing
reader "becomes" readOnly -- it can only be born readOnly.

  * Added reopen(readOnly) (what Jason referred to above).  I think
the semantics are well defined: it returns a new reader if either
the index has changed or readOnly is different (a usage sketch
follows this list).

  * Added test for "clone readOnly to non-readOnly" case, which
failed, and fixed various places where we were not respecting
"openReadOnly" correctly.

  * Share common source (ReadOnlySegmentReader.noWrite) for throwing
exception on attempting change to a readOnly reader

  * Fixed a sneaky pre-existing bug with reopen (added test case): if
you have a non-readOnly reader on a single segment index, then add
a segment, then reopen it and try to do a delete w/ new reader, it
fails.  This is because we were incorrectly sharing the original
SegmentReader instance which still had its own SegmentInfos, so it
attempts to double-acquire write lock during a single deleteDocument
call.

  * More tests
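
A sketch of how the new semantics read from the caller's side (method shapes
taken from the notes above, not from the committed code):

    // dir is an existing index Directory
    IndexReader writable = IndexReader.open(dir, false);  // non-readOnly reader
    IndexReader snapshot = writable.reopen(true);   // new reader if the index
                                                    // changed OR readOnly differs
    IndexReader cloned = (IndexReader) writable.clone();  // hasChanges & writeLock
                                                          // transfer to the clone;
                                                          // 'writable' may still try
                                                          // to re-acquire the lock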


> IndexReader.clone
> -
>
> Key: LUCENE-1314
> URL: https://issues.apache.org/jira/browse/LUCENE-1314
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.3.1
>Reporter: Jason Rutherglen
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1314.patch, LUCENE-1314.patch, LUCENE-1314.patch, 
> LUCENE-1314.patch, LUCENE-1314.patch, LUCENE-1314.patch, LUCENE-1314.patch, 
> LUCENE-1314.patch, LUCENE-1314.patch, LUCENE-1314.patch, LUCENE-1314.patch, 
> LUCENE-1314.patch, LUCENE-1314.patch, LUCENE-1314.patch, LUCENE-1314.patch, 
> LUCENE-1314.patch, LUCENE-1314.patch, LUCENE-1314.patch, lucene-1314.patch, 
> lucene-1314.patch, lucene-1314.patch, lucene-1314.patch, lucene-1314.patch, 
> lucene-1314.patch, lucene-1314.patch, lucene-1314.patch, lucene-1314.patch, 
> lucene-1314.patch, lucene-1314.patch, lucene-1314.patch
>
>
> Based on discussion 
> http://www.nabble.com/IndexReader.reopen-issue-td18070256.html.  The problem 
> is reopen returns the same reader if there are no changes, so if docs are 
> deleted from the new reader, they are also reflected in the previous reader 
> which is not always desired behavior.




[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector

2009-01-18 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12664965#action_12664965
 ] 

Mark Miller commented on LUCENE-1483:
-

I think it's pretty costly even for non-id type fields. In your enum case, there 
are what, 50 unique values? Even still, you are seeing something like a 40% diff, 
but with times small enough not to matter.

My test example has 20,000 unique terms for 600,000 documents (lots of overlap, 
2-8 char strings, 1-9, I think), so quite a bit short of a primary key - but it 
still was WAY faster with the new method.

Old method, non-optimized, 79 segments - 1.5 million seeks, WAY slow.
Old method, optimized, 1 segment - 20,000 seeks, pretty darn fast.
New method, non-optimized, 79 segments - 40,000 seeks, pretty darn fast.
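
(Those counts appear to line up: the old method pays roughly #segments x 
#unique terms = 79 x 20,000 ~ 1.6 million seeks, while the new method pays 
only the sum of each segment's own unique terms, about 40,000 here.)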


bq.While there is a big difference between searching a single segment vs 
multisegments for these things, we already knew about that - thats why you 
optimize.

{quote}Right, but for realtime search you don't have the luxury of
optimizing. This patch makes warming time after reopen much faster
for a many-segment index for apps that use FieldCache with mostly unique String
fields.{quote}

Right, I got you - I know we can't optimize. I was just realizing that 
explaining why 100 segments was so slow did not explain why the new method 
on 100 segments was so fast. I still don't think I fully understand why that 
is. I don't think getting to use the unique terms at each segment saves 
enough seeks to account for what I am seeing. Especially in this test case, 
the terms should be pretty evenly distributed across segments...


> Change IndexSearcher multisegment searches to search each individual segment 
> using a single HitCollector
> 
>
> Key: LUCENE-1483
> URL: https://issues.apache.org/jira/browse/LUCENE-1483
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.9
>Reporter: Mark Miller
>Priority: Minor
> Attachments: LUCENE-1483-partial.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> sortBench.py, sortCollate.py
>
>
> FieldCache and Filters are forced down to a single segment reader, allowing 
> for individual segment reloading on reopen.




[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector

2009-01-18 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12664948#action_12664948
 ] 

Michael McCandless commented on LUCENE-1483:



bq. As we call next on MultiTermDocs it will get a TermDocs for each Reader and 
call seek to get to the Term. The seek appears pretty slow, and we do it for 
the number of Readers x the number of Terms to be loaded.

Right -- the uninverting we do to populate the FieldCache is very
costly through MultiReader for fields that are mostly unique String
(eg a title field, or a "primary key" id field, etc.).

Enum type fields (like country) don't have this problem (1.0 sec vs
0.6 sec to populate FieldCache through MultiReader for the 100 segment
index).

But, while this change sidesteps the problem for Lucene's core, for
apps that directly load the FieldCache for the MultiReader the problem
is still there.

Once we have column stride fields (LUCENE-1231) it should then be far
faster to load the FieldCache for unique String fields.

bq. While there is a big difference between searching a single segment vs 
multisegments for these things, we already knew about that - thats why you 
optimize.

Right, but for realtime search you don't have the luxury of
optimizing.  This patch makes warming time after reopen much faster
for a many-segment index for apps that use FieldCache with mostly unique String
fields.


> Change IndexSearcher multisegment searches to search each individual segment 
> using a single HitCollector
> 
>
> Key: LUCENE-1483
> URL: https://issues.apache.org/jira/browse/LUCENE-1483
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.9
>Reporter: Mark Miller
>Priority: Minor
> Attachments: LUCENE-1483-partial.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> sortBench.py, sortCollate.py
>
>
> FieldCache and Filters are forced down to a single segment reader, allowing 
> for individual segment reloading on reopen.
