Question on Lucene search
Hi all, I am new to Lucene and I need to know the following. Suppose I have indexed some data with the fields Location, City, and Country, containing these values: 1) R G Heights, 2) London, 3) United Kingdom. If I search the index with the following queries: 1) RG Heights (please note: no space between R and G), 2) RGHeights (no space at all), 3) R  G Heights (extra space between tokens), 4) Kingdom United. Please tell me whether Lucene would come up with a positive result or report no hits, for each of the queries above. Thanks!

--
View this message in context: http://www.nabble.com/Question-on-Lucene-search-tp21537509p21537509.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
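The answer depends entirely on the analyzer, but with a typical analyzer that lowercases and splits on whitespace (StandardAnalyzer behaves this way for plain words), "R G Heights" is indexed as the three tokens [r, g, heights], and matching compares query tokens against those exact tokens. A minimal plain-Java sketch of that token-level comparison (no Lucene dependency; the `tokenize` helper is an illustrative stand-in, not a Lucene API):

```java
import java.util.Arrays;
import java.util.List;

public class TokenMatchSketch {
    // Illustrative stand-in for a whitespace-splitting, lowercasing analyzer.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.toLowerCase().trim().split("\\s+"));
    }

    // A phrase query matches only if the query tokens equal the indexed tokens, in order.
    static boolean phraseMatches(String indexed, String query) {
        return tokenize(indexed).equals(tokenize(query));
    }

    public static void main(String[] args) {
        String indexed = "R G Heights";
        System.out.println(phraseMatches(indexed, "RG Heights"));   // false: [rg, heights] != [r, g, heights]
        System.out.println(phraseMatches(indexed, "RGHeights"));    // false: one token [rgheights]
        System.out.println(phraseMatches(indexed, "R  G Heights")); // true: extra whitespace collapses
        System.out.println(phraseMatches("United Kingdom", "Kingdom United")); // false as a phrase: order matters
    }
}
```

So under those assumptions queries 1 and 2 would get no hits, query 3 would match (analyzers discard the extra whitespace), and query 4 matches only if the terms are combined with a boolean AND rather than as a phrase.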
Re: [jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
One more, just for a check with far fewer unique terms (20k). I didn't catch that I hadn't clamped down enough on the uniques last time. Back up to 21 segments this time, same wildcard search, 7718 hits, and the new method is still approx 20% faster than the old. The last run was 16 segments, though, with way more uniques; this one is 21 segments and way fewer uniques.

7718 hits

Segments file=segments_l numSegments=21 version=FORMAT_USER_DATA [Lucene 2.9]

1 of 21: name=_bbxo docCount=29349 compound=true hasProx=true numFiles=2 size (MB)=11.92 docStoreOffset=0 docStoreSegment=_bbxo docStoreIsCompoundFile=true no deletions
    test: open reader.OK
    test: fields, norms...OK [1 fields]
    test: terms, freq, prox...OK [21840 terms; 3875263 terms/docs pairs; 4516618 tokens]
    test: stored fields...OK [0 total field count; avg 0 fields per doc]
    test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]

2 of 21: name=_bbxp docCount=29459 compound=true hasProx=true numFiles=2 size (MB)=11.982 docStoreOffset=29349 docStoreSegment=_bbxo docStoreIsCompoundFile=true no deletions
    test: open reader.OK
    test: fields, norms...OK [1 fields]
    test: terms, freq, prox...OK [21840 terms; 3895590 terms/docs pairs; 4540859 tokens]
    test: stored fields...OK [0 total field count; avg 0 fields per doc]
    test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]

3 of 21: name=_bbxq docCount=29300 compound=true hasProx=true numFiles=2 size (MB)=11.97 docStoreOffset=58808 docStoreSegment=_bbxo docStoreIsCompoundFile=true no deletions
    test: open reader.OK
    test: fields, norms...OK [1 fields]
    test: terms, freq, prox...OK [21840 terms; 3890419 terms/docs pairs; 4536052 tokens]
    test: stored fields...OK [0 total field count; avg 0 fields per doc]
    test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]

4 of 21: name=_bbxr docCount=29480 compound=true hasProx=true numFiles=2 size (MB)=11.971 docStoreOffset=88108 docStoreSegment=_bbxo docStoreIsCompoundFile=true no deletions
    test: open reader.OK
    test: fields, norms...OK [1 fields]
    test: terms, freq, prox...OK [21840 terms; 3894211 terms/docs pairs; 4538397 tokens]
    test: stored fields...OK [0 total field count; avg 0 fields per doc]
    test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]

5 of 21: name=_bbxs docCount=29470 compound=true hasProx=true numFiles=2 size (MB)=11.979 docStoreOffset=117588 docStoreSegment=_bbxo docStoreIsCompoundFile=true no deletions
    test: open reader.OK
    test: fields, norms...OK [1 fields]
    test: terms, freq, prox...OK [21840 terms; 3895226 terms/docs pairs; 4540446 tokens]
    test: stored fields...OK [0 total field count; avg 0 fields per doc]
    test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]

6 of 21: name=_bbxt docCount=29450 compound=true hasProx=true numFiles=2 size (MB)=11.98 docStoreOffset=147058 docStoreSegment=_bbxo docStoreIsCompoundFile=true no deletions
    test: open reader.OK
    test: fields, norms...OK [1 fields]
    test: terms, freq, prox...OK [21840 terms; 3892708 terms/docs pairs; 4538338 tokens]
    test: stored fields...OK [0 total field count; avg 0 fields per doc]
    test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]

7 of 21: name=_bbxu docCount=29509 compound=true hasProx=true numFiles=2 size (MB)=11.978 docStoreOffset=176508 docStoreSegment=_bbxo docStoreIsCompoundFile=true no deletions
    test: open reader.OK
    test: fields, norms...OK [1 fields]
    test: terms, freq, prox...OK [21840 terms; 3894189 terms/docs pairs; 4538376 tokens]
    test: stored fields...OK [0 total field count; avg 0 fields per doc]
    test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]

8 of 21: name=_bbxv docCount=29401 compound=true hasProx=true numFiles=2 size (MB)=11.976 docStoreOffset=206017 docStoreSegment=_bbxo docStoreIsCompoundFile=true no deletions
    test: open reader.OK
    test: fields, norms...OK [1 fields]
    test: terms, freq, prox...OK [21840 terms; 3891986 terms/docs pairs; 4538746 tokens]
    test: stored fields...OK [0 total field count; avg 0 fields per doc]
    test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]

9 of 21: name=_bbxw docCount=29476 compound=true hasProx=true numFiles=2 size (MB)=11.988 docStoreOffset=235418 docStoreSegment=_bbxo docStoreIsCompoundFile=true no deletions
    test: open reader.OK
    test: fields,
[jira] Issue Comment Edited: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12664984#action_12664984 ]

markrmil...@gmail.com edited comment on LUCENE-1483 at 1/18/09 8:20 PM:
--

My previous results had a few oddities going on with them (I was loosely playing around). Being a little more careful, here is an example of the difference, and the hotspots. Timings are probably not completely comparable, as my computer couldn't keep up profiling the second version very well; it's much slower without profiling as well, though. Index is 60 docs, 46 segments.

Load the FieldCache on one multireader:

||method||time||invocations||
|FieldCacheImpl.createValue|156536 (98%)|1|
|MultiTermDocs.next()|148499 (93.5%)|621803|
|MultiTermDocs(int)|140397 (88.4%)|1002938|
|SegmentTermDocs.seek(Term)|138332 (87.1%)|1002938|

Load the FieldCache on each sub reader of the multireader, one at a time:

||method||time||invocations||
|FieldCacheImpl.createValue|7815 (80.4%)|46|
|SegmentTermDocs.next()|3315 (34.1%)|642046|
|SegmentTermEnum.next()|1936 (19.9%)|42046|
|SegmentTermDocs.seek(TermEnum)|874 (9%)|42046|

*edit* wrong values

was (Author: markrmil...@gmail.com):

My previous results had a few oddities going on with them (I was loosely playing around). Being a little more careful, here is an example of the difference, and the hotspots. Timings are probably not completely comparable, as my computer couldn't keep up profiling the second version very well; it's much slower without profiling as well, though. Index is 60 docs, 46 segments, 63849 unique terms.

Load the FieldCache on one multireader:

||method||time||invocations||
|FieldCacheImpl.createValue|156536 (98%)|1|
|MultiTermDocs.next()|148499 (93.5%)|621803|
|MultiTermDocs(int)|140397 (88.4%)|1002938|
|SegmentTermDocs.seek(Term)|138332 (87.1%)|1002938|

Load the FieldCache on each sub reader of the multireader, one at a time:

||method||time||invocations||
|FieldCacheImpl.createValue|7815 (80.4%)|46|
|SegmentTermDocs.next()|3315 (34.1%)|642046|
|SegmentTermEnum.next()|1936 (19.9%)|42046|
|SegmentTermDocs.seek(TermEnum)|874 (9%)|42046|

*edit* wrong values

> Change IndexSearcher multisegment searches to search each individual segment
> using a single HitCollector
>
> Key: LUCENE-1483
> URL: https://issues.apache.org/jira/browse/LUCENE-1483
> Project: Lucene - Java
> Issue Type: Improvement
> Affects Versions: 2.9
> Reporter: Mark Miller
> Priority: Minor
> Attachments: LUCENE-1483-partial.patch, LUCENE-1483.patch,
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch,
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch,
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch,
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch,
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch,
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch,
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch,
> sortBench.py, sortCollate.py
>
> FieldCache and Filters are forced down to a single segment reader, allowing
> for individual segment reloading on reopen.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
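The shape of the two profiles above comes down to how many term-positioning operations each approach needs: through one MultiReader, positioning on a term ends up seeking in every segment, while loading per segment just walks each segment's term dictionary sequentially with next(). A rough counting sketch of that scaling (the numbers and the simple cost model are illustrative assumptions, not taken from the actual profiler run):

```java
public class SeekCountSketch {
    // Through one MultiReader, positioning on each term does a seek in every
    // segment, so the work grows as (unique terms in index) x (segments).
    static long multiReaderSeekOps(long uniqueTermsInIndex, int segments) {
        return uniqueTermsInIndex * segments;
    }

    // Per segment, the term dictionary is walked sequentially with next(),
    // so the work is just the sum of each segment's own term count,
    // and each step is a cheap sequential read rather than a seek.
    static long perSegmentNextOps(long[] uniqueTermsPerSegment) {
        long total = 0;
        for (long t : uniqueTermsPerSegment) total += t;
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical index loosely shaped like the one profiled above:
        // 46 segments, ~42k unique terms each, ~64k unique across the index.
        long[] perSegment = new long[46];
        java.util.Arrays.fill(perSegment, 42000L);
        System.out.println("multireader seek ops: " + multiReaderSeekOps(63849, 46));
        System.out.println("per-segment next ops: " + perSegmentNextOps(perSegment));
    }
}
```

The model is deliberately crude, but it shows why the multireader cost explodes with segment count while the per-segment cost stays proportional to the terms actually present.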
Re: [jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
Man, I'm not paying attention. I switched the analyzer but didn't take it off UNANALYZED. Here are the correct results. It's actually only about 20-30% faster for the index I used, so a lot of that could be the gains that we were seeing in general anyway; perhaps a bit more, too. Still a total of 60 docs.

WildcardQuery query = new WildcardQuery(new Term("string", "00011*"));
query.setConstantScoreRewrite(true);
TopDocs results = searcher.search(query, 10);

26817 hits

Segments file=segments_h numSegments=16 version=FORMAT_USER_DATA [Lucene 2.9]

1 of 16: name=_bbs7 docCount=114352 compound=true hasProx=true numFiles=2 size (MB)=92.409 docStoreOffset=0 docStoreSegment=_bbr3 docStoreIsCompoundFile=true no deletions
    test: open reader.OK
    test: fields, norms...OK [1 fields]
    test: terms, freq, prox...OK [488265 terms; 25989763 terms/docs pairs; 29071032 tokens]
    test: stored fields...OK [0 total field count; avg 0 fields per doc]
    test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]

2 of 16: name=_bbtc docCount=114210 compound=true hasProx=true numFiles=2 size (MB)=92.42 docStoreOffset=114352 docStoreSegment=_bbr3 docStoreIsCompoundFile=true no deletions
    test: open reader.OK
    test: fields, norms...OK [1 fields]
    test: terms, freq, prox...OK [488268 terms; 25992529 terms/docs pairs; 29076321 tokens]
    test: stored fields...OK [0 total field count; avg 0 fields per doc]
    test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]

3 of 16: name=_bbuh docCount=114299 compound=true hasProx=true numFiles=2 size (MB)=92.42 docStoreOffset=228562 docStoreSegment=_bbr3 docStoreIsCompoundFile=true no deletions
    test: open reader.OK
    test: fields, norms...OK [1 fields]
    test: terms, freq, prox...OK [488263 terms; 25997224 terms/docs pairs; 29078745 tokens]
    test: stored fields...OK [0 total field count; avg 0 fields per doc]
    test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]

4 of 16: name=_bbvm docCount=114177 compound=true hasProx=true numFiles=2 size (MB)=92.444 docStoreOffset=342861 docStoreSegment=_bbr3 docStoreIsCompoundFile=true no deletions
    test: open reader.OK
    test: fields, norms...OK [1 fields]
    test: terms, freq, prox...OK [488267 terms; 25992316 terms/docs pairs; 29080800 tokens]
    test: stored fields...OK [0 total field count; avg 0 fields per doc]
    test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]

5 of 16: name=_bbwr docCount=114308 compound=true hasProx=true numFiles=2 size (MB)=92.452 docStoreOffset=457038 docStoreSegment=_bbr3 docStoreIsCompoundFile=true no deletions
    test: open reader.OK
    test: fields, norms...OK [1 fields]
    test: terms, freq, prox...OK [488268 terms; 26005497 terms/docs pairs; 29088895 tokens]
    test: stored fields...OK [0 total field count; avg 0 fields per doc]
    test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]

6 of 16: name=_bbws docCount=2760 compound=true hasProx=true numFiles=2 size (MB)=3.26 docStoreOffset=571346 docStoreSegment=_bbr3 docStoreIsCompoundFile=true no deletions
    test: open reader.OK
    test: fields, norms...OK [1 fields]
    test: terms, freq, prox...OK [168214 terms; 647045 terms/docs pairs; 724715 tokens]
    test: stored fields...OK [0 total field count; avg 0 fields per doc]
    test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]

7 of 16: name=_bbwt docCount=2820 compound=true hasProx=true numFiles=2 size (MB)=3.259 docStoreOffset=574106 docStoreSegment=_bbr3 docStoreIsCompoundFile=true no deletions
    test: open reader.OK
    test: fields, norms...OK [1 fields]
    test: terms, freq, prox...OK [168255 terms; 649189 terms/docs pairs; 725671 tokens]
    test: stored fields...OK [0 total field count; avg 0 fields per doc]
    test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]

8 of 16: name=_bbwu docCount=2844 compound=true hasProx=true numFiles=2 size (MB)=3.263 docStoreOffset=576926 docStoreSegment=_bbr3 docStoreIsCompoundFile=true no deletions
    test: open reader.OK
    test: fields, norms...OK [1 fields]
    test: terms, freq, prox...OK [168210 terms; 649731 terms/docs pairs; 727095 tokens]
    test: stored fields...OK [0 total field count; avg 0 fields per doc]
    test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]

9 of 16: name=_bbwv docCount=2871 compound=true hasProx=true numFiles=2 size (MB)=3.266 docStore
Re: [jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
Oh yeah, that's a constant-score wildcard query.

Mark Miller wrote:

Just checked it out, and it's not a bad win on multi-term queries. It's not the same exponential gain as FieldCache loading, but I bet lots of 2-3x type stuff. You appear to save a decent amount by not applying every term to each segment, because of the logarithmic sizing. My query of new WildcardQuery(new Term("string", "00*")) gets 789 hits, and takes half the time with this patch. The index is below:

Segments file=segments_a numSegments=34 version=FORMAT_USER_DATA [Lucene 2.9]

1 of 34: name=_bb48 docCount=159845 compound=true hasProx=true numFiles=2 size (MB)=294.529 docStoreOffset=0 docStoreSegment=_bb34 docStoreIsCompoundFile=true no deletions
    test: open reader.OK
    test: fields, norms...OK [1 fields]
    test: terms, freq, prox...OK [159845 terms; 159845 terms/docs pairs; 159845 tokens]
    test: stored fields...OK [0 total field count; avg 0 fields per doc]
    test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]

2 of 34: name=_bb5d docCount=159977 compound=true hasProx=true numFiles=2 size (MB)=294.681 docStoreOffset=159845 docStoreSegment=_bb34 docStoreIsCompoundFile=true no deletions
    test: open reader.OK
    test: fields, norms...OK [1 fields]
    test: terms, freq, prox...OK [159977 terms; 159977 terms/docs pairs; 159977 tokens]
    test: stored fields...OK [0 total field count; avg 0 fields per doc]
    test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]

3 of 34: name=_bb6i docCount=159691 compound=true hasProx=true numFiles=2 size (MB)=294.701 docStoreOffset=319822 docStoreSegment=_bb34 docStoreIsCompoundFile=true no deletions
    test: open reader.OK
    test: fields, norms...OK [1 fields]
    test: terms, freq, prox...OK [159691 terms; 159691 terms/docs pairs; 159691 tokens]
    test: stored fields...OK [0 total field count; avg 0 fields per doc]
    test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]

4 of 34: name=_bb6j docCount=3978 compound=true hasProx=true numFiles=2 size (MB)=7.382 docStoreOffset=479513 docStoreSegment=_bb34 docStoreIsCompoundFile=true no deletions
    test: open reader.OK
    test: fields, norms...OK [1 fields]
    test: terms, freq, prox...OK [3978 terms; 3978 terms/docs pairs; 3978 tokens]
    test: stored fields...OK [0 total field count; avg 0 fields per doc]
    test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]

5 of 34: name=_bb6k docCount=4002 compound=true hasProx=true numFiles=2 size (MB)=7.353 docStoreOffset=483491 docStoreSegment=_bb34 docStoreIsCompoundFile=true no deletions
    test: open reader.OK
    test: fields, norms...OK [1 fields]
    test: terms, freq, prox...OK [4002 terms; 4002 terms/docs pairs; 4002 tokens]
    test: stored fields...OK [0 total field count; avg 0 fields per doc]
    test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]

6 of 34: name=_bb6l docCount=3959 compound=true hasProx=true numFiles=2 size (MB)=7.365 docStoreOffset=487493 docStoreSegment=_bb34 docStoreIsCompoundFile=true no deletions
    test: open reader.OK
    test: fields, norms...OK [1 fields]
    test: terms, freq, prox...OK [3959 terms; 3959 terms/docs pairs; 3959 tokens]
    test: stored fields...OK [0 total field count; avg 0 fields per doc]
    test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]

7 of 34: name=_bb6m docCount=3938 compound=true hasProx=true numFiles=2 size (MB)=7.35 docStoreOffset=491452 docStoreSegment=_bb34 docStoreIsCompoundFile=true no deletions
    test: open reader.OK
    test: fields, norms...OK [1 fields]
    test: terms, freq, prox...OK [3938 terms; 3938 terms/docs pairs; 3938 tokens]
    test: stored fields...OK [0 total field count; avg 0 fields per doc]
    test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]

8 of 34: name=_bb6n docCount=4020 compound=true hasProx=true numFiles=2 size (MB)=7.379 docStoreOffset=495390 docStoreSegment=_bb34 docStoreIsCompoundFile=true no deletions
    test: open reader.OK
    test: fields, norms...OK [1 fields]
    test: terms, freq, prox...OK [4020 terms; 4020 terms/docs pairs; 4020 tokens]
    test: stored fields...OK [0 total field count; avg 0 fields per doc]
    test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]

9 of 34: name=_bb6o docCount=3973 compound=true hasProx=true numFiles=2 size (MB)=7.385 docStoreOffset=499410 docStoreSegment=_bb34 docStoreIsCompoundFile=true
Re: [jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
Just checked it out, and it's not a bad win on multi-term queries. It's not the same exponential gain as FieldCache loading, but I bet lots of 2-3x type stuff. You appear to save a decent amount by not applying every term to each segment, because of the logarithmic sizing. My query of new WildcardQuery(new Term("string", "00*")) gets 789 hits, and takes half the time with this patch. The index is below:

Segments file=segments_a numSegments=34 version=FORMAT_USER_DATA [Lucene 2.9]

1 of 34: name=_bb48 docCount=159845 compound=true hasProx=true numFiles=2 size (MB)=294.529 docStoreOffset=0 docStoreSegment=_bb34 docStoreIsCompoundFile=true no deletions
    test: open reader.OK
    test: fields, norms...OK [1 fields]
    test: terms, freq, prox...OK [159845 terms; 159845 terms/docs pairs; 159845 tokens]
    test: stored fields...OK [0 total field count; avg 0 fields per doc]
    test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]

2 of 34: name=_bb5d docCount=159977 compound=true hasProx=true numFiles=2 size (MB)=294.681 docStoreOffset=159845 docStoreSegment=_bb34 docStoreIsCompoundFile=true no deletions
    test: open reader.OK
    test: fields, norms...OK [1 fields]
    test: terms, freq, prox...OK [159977 terms; 159977 terms/docs pairs; 159977 tokens]
    test: stored fields...OK [0 total field count; avg 0 fields per doc]
    test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]

3 of 34: name=_bb6i docCount=159691 compound=true hasProx=true numFiles=2 size (MB)=294.701 docStoreOffset=319822 docStoreSegment=_bb34 docStoreIsCompoundFile=true no deletions
    test: open reader.OK
    test: fields, norms...OK [1 fields]
    test: terms, freq, prox...OK [159691 terms; 159691 terms/docs pairs; 159691 tokens]
    test: stored fields...OK [0 total field count; avg 0 fields per doc]
    test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]

4 of 34: name=_bb6j docCount=3978 compound=true hasProx=true numFiles=2 size (MB)=7.382 docStoreOffset=479513 docStoreSegment=_bb34 docStoreIsCompoundFile=true no deletions
    test: open reader.OK
    test: fields, norms...OK [1 fields]
    test: terms, freq, prox...OK [3978 terms; 3978 terms/docs pairs; 3978 tokens]
    test: stored fields...OK [0 total field count; avg 0 fields per doc]
    test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]

5 of 34: name=_bb6k docCount=4002 compound=true hasProx=true numFiles=2 size (MB)=7.353 docStoreOffset=483491 docStoreSegment=_bb34 docStoreIsCompoundFile=true no deletions
    test: open reader.OK
    test: fields, norms...OK [1 fields]
    test: terms, freq, prox...OK [4002 terms; 4002 terms/docs pairs; 4002 tokens]
    test: stored fields...OK [0 total field count; avg 0 fields per doc]
    test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]

6 of 34: name=_bb6l docCount=3959 compound=true hasProx=true numFiles=2 size (MB)=7.365 docStoreOffset=487493 docStoreSegment=_bb34 docStoreIsCompoundFile=true no deletions
    test: open reader.OK
    test: fields, norms...OK [1 fields]
    test: terms, freq, prox...OK [3959 terms; 3959 terms/docs pairs; 3959 tokens]
    test: stored fields...OK [0 total field count; avg 0 fields per doc]
    test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]

7 of 34: name=_bb6m docCount=3938 compound=true hasProx=true numFiles=2 size (MB)=7.35 docStoreOffset=491452 docStoreSegment=_bb34 docStoreIsCompoundFile=true no deletions
    test: open reader.OK
    test: fields, norms...OK [1 fields]
    test: terms, freq, prox...OK [3938 terms; 3938 terms/docs pairs; 3938 tokens]
    test: stored fields...OK [0 total field count; avg 0 fields per doc]
    test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]

8 of 34: name=_bb6n docCount=4020 compound=true hasProx=true numFiles=2 size (MB)=7.379 docStoreOffset=495390 docStoreSegment=_bb34 docStoreIsCompoundFile=true no deletions
    test: open reader.OK
    test: fields, norms...OK [1 fields]
    test: terms, freq, prox...OK [4020 terms; 4020 terms/docs pairs; 4020 tokens]
    test: stored fields...OK [0 total field count; avg 0 fields per doc]
    test: term vectors...OK [0 total vector count; avg 0 term/freq vector fields per doc]

9 of 34: name=_bb6o docCount=3973 compound=true hasProx=true numFiles=2 size (MB)=7.385 docStoreOffset=499410 docStoreSegment=_bb34 docStoreIsCompoundFile=true no deletions
    test: open reader.OK
    test: fields, nor
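The constant-score rewrite discussed in this thread can be pictured in miniature: instead of expanding "00*" into one scoring clause per matching term, the term dictionary is walked once per segment, the matching terms' postings are OR'd into a bitset, and every hit gets the same score. A self-contained sketch using a TreeMap as a stand-in term dictionary (illustrative only, not Lucene's actual implementation):

```java
import java.util.BitSet;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class ConstantScoreWildcardSketch {
    // Collect all docs whose terms start with `prefix` into one bitset.
    // A sorted dictionary lets us seek straight to the prefix and stop at
    // the first non-matching term, just like a term enum seek + scan.
    static BitSet matchPrefix(SortedMap<String, int[]> termDict, String prefix, int maxDoc) {
        BitSet hits = new BitSet(maxDoc);
        for (Map.Entry<String, int[]> e : termDict.tailMap(prefix).entrySet()) {
            if (!e.getKey().startsWith(prefix)) break; // past the prefix range
            for (int doc : e.getValue()) hits.set(doc); // OR in this term's postings
        }
        return hits;
    }

    public static void main(String[] args) {
        SortedMap<String, int[]> dict = new TreeMap<>();
        dict.put("0001", new int[]{0, 3});
        dict.put("0002", new int[]{1});
        dict.put("0100", new int[]{2});
        System.out.println(matchPrefix(dict, "00", 4).cardinality());  // 4: all docs match "00*"
        System.out.println(matchPrefix(dict, "000", 4).cardinality()); // 3: "0100" is excluded
    }
}
```

The win Mark describes comes from doing this per segment: a small segment's dictionary only contains the terms it actually has, so its scan ends quickly instead of being probed for every term in the whole index.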
Re: Filesystem based bitset
On Friday 09 January 2009 22:30:14 Marvin Humphrey wrote:
> On Fri, Jan 09, 2009 at 08:11:31PM +0100, Karl Wettin wrote:
> > SSD is pretty close to RAM when it comes to seeking. Wouldn't that
> > mean that a bitset stored on an SSD would be more or less as fast as a
> > bitset in RAM?
>
> Provided that your index can fit in the system i/o cache and stay there, you
> get the speed of RAM regardless of the underlying permanent storage type.
> There's no reason to wait on SSDs before implementing such a feature.

Since this started by thinking out loud, I'd like to continue doing that.

I've been thinking about how to add a decent skipTo() to something that compresses better than an (Open)BitSet, and this turns out to be an integer set implemented as a B plus tree (all leaves on the same level) of only integers, with key/data compression by a frame of reference for every node (see LUCENE-1410).

I found a Java implementation of a B plus tree on SourceForge: BplusDotNet in the BplusJ package, see http://bplusdotnet.sourceforge.net/ . This has nice transaction semantics on a file system and it has a BSD licence, so it could be used as a starting point, but:
- it only has strings as index values, so it will need quite some simplification to work on integers as keys and data, and
- it has no built-in compression, as far as I could see on first inspection.

The questions: Would someone know of a better starting point for a B plus tree of integers with node compression? For example, how close is the current Lucene code base to implementing a B plus tree for the doc ids of a single term? How valuable are transaction semantics for such an integer set?

It is tempting to try to implement such an integer set from the ground up, but I don't have any practical programming experience with transaction semantics, so it may be better to start from something that has transactions right from the start.

Regards,
Paul Elschot
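The frame-of-reference compression mentioned above (see LUCENE-1410) stores a block of sorted doc ids as a base value plus small deltas, which keeps each tree node compact while still allowing binary search inside a decoded node for skipTo(). A rough sketch of encoding one leaf node, assuming sorted non-negative ints (illustrative only; this is not the LUCENE-1410 code, and a real implementation would bit-pack the deltas instead of keeping an int[]):

```java
import java.util.Arrays;

public class FrameOfRefSketch {
    // One encoded leaf: the minimum value is the "frame of reference",
    // `bits` is the width a bit-packed encoding would need per delta.
    static final class Block {
        final int min;
        final int bits;
        final int[] deltas;
        Block(int min, int bits, int[] deltas) { this.min = min; this.bits = bits; this.deltas = deltas; }
    }

    static Block encode(int[] sortedDocIds) {
        int min = sortedDocIds[0];
        int max = sortedDocIds[sortedDocIds.length - 1];
        // Bits needed to represent the largest delta (at least 1).
        int bits = 32 - Integer.numberOfLeadingZeros(Math.max(1, max - min));
        int[] deltas = new int[sortedDocIds.length];
        for (int i = 0; i < sortedDocIds.length; i++) deltas[i] = sortedDocIds[i] - min;
        return new Block(min, bits, deltas);
    }

    // skipTo: first doc id >= target in this block, or -1 if none.
    // Deltas stay sorted, so binary search works directly on them.
    static int skipTo(Block b, int target) {
        int i = Arrays.binarySearch(b.deltas, target - b.min);
        if (i < 0) i = -i - 1; // insertion point for a miss
        return i < b.deltas.length ? b.deltas[i] + b.min : -1;
    }
}
```

The point of the frame of reference is that deltas within a node are small and uniform-width, so a node decodes cheaply and skipTo() only needs to decode the nodes along one root-to-leaf path.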
[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665008#action_12665008 ]

Mark Miller commented on LUCENE-1483:
--

Okay, I think I have it. I tried to count the terms per segment by getting a term enum and looping through it for each sub reader, but something must have been off with that count. Taking a second look with CheckIndex, all of the docs/terms are piled into the first couple of segments; the rest are a long tail with few terms. So it makes sense then: for every segment with few terms in it, all of the unique terms for the whole index get checked against it. A segment with even one term will be hit for every unique term in the whole index. That's what was happening in this case, as it's like a logarithmic drop. I'll try playing around some more with less of a thin tail of segments, but I guess that is enough to explain the drop in seeks in this case.
[jira] Issue Comment Edited: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12664984#action_12664984 ]

markrmil...@gmail.com edited comment on LUCENE-1483 at 1/18/09 2:19 PM:
--

My previous results had a few oddities going on with them (I was loosely playing around). Being a little more careful, here is an example of the difference, and the hotspots. Timings are probably not completely comparable, as my computer couldn't keep up profiling the second version very well; it's much slower without profiling as well, though. Index is 60 docs, 46 segments, 63849 unique terms.

Load the FieldCache on one multireader:

||method||time||invocations||
|FieldCacheImpl.createValue|156536 (98%)|1|
|MultiTermDocs.next()|148499 (93.5%)|621803|
|MultiTermDocs(int)|140397 (88.4%)|1002938|
|SegmentTermDocs.seek(Term)|138332 (87.1%)|1002938|

Load the FieldCache on each sub reader of the multireader, one at a time:

||method||time||invocations||
|FieldCacheImpl.createValue|7815 (80.4%)|46|
|SegmentTermDocs.next()|3315 (34.1%)|642046|
|SegmentTermEnum.next()|1936 (19.9%)|42046|
|SegmentTermDocs.seek(TermEnum)|874 (9%)|42046|

*edit* wrong values

was (Author: markrmil...@gmail.com):

My previous results had a few oddities going on with them (I was loosely playing around). Being a little more careful, here is an example of the difference, and the hotspots. Timings are probably not completely comparable, as my computer couldn't keep up profiling the second version very well; it's much slower without profiling as well, though. Index is 60 docs, 46 segments, 63849 unique terms.

Load the FieldCache on one multireader:

||method||time||invocations||
|FieldCacheImpl.createValue|156536 (98%)|1|
|MultiTermDocs.next()|148499 (93.5%)|621803|
|MultiTermDocs(int)|140397 (88.4%)|1002938|
|SegmentTermDocs.seek(Term)|138332 (87.1%)|1002938|

Load the FieldCache on each sub reader of the multireader, one at a time:

||method||time||invocations||
|FieldCacheImpl.createValue|7815 (80.4%)|46|
|SegmentTermDocs.next()|3315 (34.1%)|642046|
|SegmentTermEnum.next()|1936 (19.9%)|42046|
|SegmentTermDocs.seek(TermEnum)|874 (9%)|42046|

Unique terms per segment:
21312, 41837, 41843, 41849, 41854, 41860, 41865, 41870, 41878, 41883, 41888, 41894, 41902, 41906, 41910, 41912, 41916, 41921, 41924,
41930, 41932, 41936, 41943, 41947, 41951, 41956, 41960, 41964, 41970, 41974, 41979, 41982, 41989, 41994, 41999, 42002, 42005,
42007, 42011, 42016, 42020, 42026, 42033, 42039, 42044, 42046
[jira] Resolved: (LUCENE-1124) short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity
[ https://issues.apache.org/jira/browse/LUCENE-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller resolved LUCENE-1124. - Resolution: Fixed Thanks! Committed. > short circuit FuzzyQuery.rewrite when input token length is small compared to > minSimilarity > --- > > Key: LUCENE-1124 > URL: https://issues.apache.org/jira/browse/LUCENE-1124 > Project: Lucene - Java > Issue Type: Improvement > Components: Query/Scoring >Reporter: Hoss Man >Assignee: Mark Miller >Priority: Trivial > Fix For: 2.9 > > Attachments: LUCENE-1124.patch, LUCENE-1124.patch, LUCENE-1124.patch > > > I found this (unreplied to) email floating around in my Lucene folder from > during the holidays... > {noformat} > From: Timo Nentwig > To: java-dev > Subject: Fuzzy makes no sense for short tokens > Date: Mon, 31 Dec 2007 16:01:11 +0100 > Message-Id: <200712311601.12255.luc...@nitwit.de> > Hi! > it generally makes no sense to search fuzzy for short tokens because changing > even only a single character of course already results in a high edit > distance. So it actually only makes sense in this case: >if( token.length() > 1f / (1f - minSimilarity) ) > E.g. changing one character in a 3-letter token (foo) results in an edit > distance of 0.6. And if minSimilarity (which is by default: 0.5 :-) is higher > we can save all the expensive rewrite() logic. > {noformat} > I don't know much about FuzzyQueries, but this reasoning seems sound ... > FuzzyQuery.rewrite should be able to completely skip all TermEnumeration in > the event that the input token is shorter then some simple math on the > minSimilarity. (i'm not smart enough to be certain that the math above is > right however ... it's been a while since i looked at Levenstein distances > ... tests needed) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. 
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
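The short-circuit condition from the LUCENE-1124 email above is easy to sanity-check outside Lucene. A minimal sketch (the function name is hypothetical; it only restates the inequality from the email):

```python
def fuzzy_worth_enumerating(token_len: int, min_similarity: float) -> bool:
    """Return True if a non-exact fuzzy match is still possible.

    The best a non-exact term can score is one edit away, i.e.
    similarity = 1 - 1/token_len. If even that falls below
    min_similarity, FuzzyQuery.rewrite could skip term enumeration
    entirely and keep only the exact term.
    """
    return token_len > 1.0 / (1.0 - min_similarity)

# With the default minSimilarity of 0.5 the cutoff is length 2:
# a 2-char token can never beat 0.5 with an edit, a 3-char token can.
```

Note this sketch assumes min_similarity < 1.0; the division blows up at exactly 1.0, which the real rewrite would need to special-case.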
[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12664988#action_12664988 ] Michael McCandless commented on LUCENE-1483: {quote} In fact this probably causes the underlying buffer in BufferedIndexReader to get reloaded many times whenever we cross a boundary {quote} OK I was wrong about this -- there is logic in TermInfosReader.get to not go backwards to the last index term. So we are in fact reading each tis file sequentially... > Change IndexSearcher multisegment searches to search each individual segment > using a single HitCollector > > > Key: LUCENE-1483 > URL: https://issues.apache.org/jira/browse/LUCENE-1483 > Project: Lucene - Java > Issue Type: Improvement >Affects Versions: 2.9 >Reporter: Mark Miller >Priority: Minor > Attachments: LUCENE-1483-partial.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > sortBench.py, sortCollate.py > > > FieldCache and Filters are forced down to a single segment reader, allowing > for individual segment reloading on reopen. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12664984#action_12664984 ] Mark Miller commented on LUCENE-1483: - My previous results had a few oddities going with them (I was loosely playing around). Being a little more careful, here is an example of the difference, and the hotspots. Timings are probably not completely comparable, as my computer couldn't keep up with profiling the second version very well - it's much slower without profiling as well, though: Index is 60 docs, 46 segments, 63849 unique terms.

Load the FieldCache on one MultiReader:
||method||time||invocations||
|FieldCacheImpl.createValue|156536(98%)|1|
|MultiTermDocs.next()|148499(93.5%)|621803|
|MultiTermDocs(int)|140397(88.4%)|1002938|
|SegmentTermDocs.seek(Term)|138332(87.1%)|1002938|

Load the FieldCache on each sub-reader of the MultiReader, one at a time:
||method||time||invocations||
|FieldCacheImpl.createValue|7815(80.4%)|46|
|SegmentTermDocs.next()|3315(34.1%)|642046|
|SegmentTermEnum.next()|1936(19.9%)|42046|
|SegmentTermDocs.seek(TermEnum)|874(9%)|42046|

Unique terms per segment: 21312, 41837, 41843, 41849, 41854, 41860, 41865, 41870, 41878, 41883, 41888, 41894, 41902, 41906, 41910, 41912, 41916, 41921, 41924, 41930, 41932, 41936, 41943, 41947, 41951, 41956, 41960, 41964, 41970, 41974, 41979, 41982, 41989, 41994, 41999, 42002, 42005, 42007, 42011, 42016, 42020, 42026, 42033, 42039, 42044, 42046
[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12664983#action_12664983 ] Yonik Seeley commented on LUCENE-1483: -- bq. think the massive slowness of iterating through all terms & docs from a MultiTermEnum/Docs may come from asking the N-1 SegmentReaders to seek to a non-existent (for them) term. I've seen cases where the MultiTermEnum was the bottleneck (compared to the MultiTermDocs) when iterating over all docs for all terms in a field. But quickly looking at the code, MultiTermEnum.next() looks pretty efficient.
[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12664979#action_12664979 ] Michael McCandless commented on LUCENE-1483: bq. Even still, you are seeing like a 40% diff, but small enough times to not matter. Right, good point. I think the massive slowness of iterating through all terms & docs from a MultiTermEnum/Docs may come from asking the N-1 SegmentReaders to seek to a non-existent (for them) term. I.e., when we ask MultiTermDocs to seek to a unique title X, only the particular segment that title X comes from actually has it, whereas the others do a costly seek to the index term just before it and then scan to look for the non-existent term, and then repeat that for the next title, etc. In fact this probably causes the underlying buffer in BufferedIndexReader to get reloaded many times whenever we cross a boundary (ie, we keep flipping between buffer N and N+1, then back to N then N+1 again, etc.) -- maybe that's the source of the massive slowness? BTW I think this change may also speed up Range/PrefixQuery. 
[jira] Updated: (LUCENE-1314) IndexReader.clone
[ https://issues.apache.org/jira/browse/LUCENE-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-1314: --- Attachment: LUCENE-1314.patch New patch attached. All tests pass. Changes:
* Simplified semantics: if you clone non-readOnly reader1 to non-readOnly reader2, I now simply clear hasChanges & writeLock on reader1 and transfer them to reader2, but do not set readOnly in reader1. This means reader1 is free to attempt to acquire the write lock if it wants, and it simply fails if it's stale (ie, we just re-use the existing code path to catch this, rather than add a new check), and this way we never have a case where an existing reader "becomes" readOnly -- it can only be born readOnly.
* Added reopen(readOnly) (what Jason referred to above). I think the semantics are well defined: it returns a new reader if either the index has changed or readOnly is different.
* Added test for "clone readOnly to non-readOnly" case, which failed, and fixed various places where we were not respecting "openReadOnly" correctly.
* Share common source (ReadOnlySegmentReader.noWrite) for throwing exception on attempting change to a readOnly reader
* Fixed a sneaky pre-existing bug with reopen (added test case): if you have a non-readOnly reader on a single segment index, then add a segment, then reopen it and try to do a delete w/ new reader, it fails. This is because we were incorrectly sharing the original SegmentReader instance which still had its own SegmentInfos, so it attempts to double-acquire write lock during a single deleteDocument call.
* More tests > IndexReader.clone > - > > Key: LUCENE-1314 > URL: https://issues.apache.org/jira/browse/LUCENE-1314 > Project: Lucene - Java > Issue Type: New Feature > Components: Index >Affects Versions: 2.3.1 >Reporter: Jason Rutherglen >Assignee: Michael McCandless >Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1314.patch, LUCENE-1314.patch, LUCENE-1314.patch, > LUCENE-1314.patch, LUCENE-1314.patch, LUCENE-1314.patch, LUCENE-1314.patch, > LUCENE-1314.patch, LUCENE-1314.patch, LUCENE-1314.patch, LUCENE-1314.patch, > LUCENE-1314.patch, LUCENE-1314.patch, LUCENE-1314.patch, LUCENE-1314.patch, > LUCENE-1314.patch, LUCENE-1314.patch, LUCENE-1314.patch, lucene-1314.patch, > lucene-1314.patch, lucene-1314.patch, lucene-1314.patch, lucene-1314.patch, > lucene-1314.patch, lucene-1314.patch, lucene-1314.patch, lucene-1314.patch, > lucene-1314.patch, lucene-1314.patch, lucene-1314.patch > > > Based on discussion > http://www.nabble.com/IndexReader.reopen-issue-td18070256.html. The problem > is reopen returns the same reader if there are no changes, so if docs are > deleted from the new reader, they are also reflected in the previous reader > which is not always desired behavior. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
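The ownership-transfer rule in the first bullet of the LUCENE-1314 update above can be modeled with a toy sketch (class and attribute names are hypothetical; the real SegmentReader logic is far more involved):

```python
class ToyReader:
    """Toy model of the clone semantics: pending changes and the write
    lock move to a writable clone; the original stays writable but holds
    no lock, so a later write attempt on it simply fails if it is stale."""

    def __init__(self, read_only=False):
        self.read_only = read_only   # a reader can only be *born* read-only
        self.has_changes = False
        self.write_lock = None       # stand-in for the index write lock

    def clone(self, read_only=False):
        c = ToyReader(read_only)
        if not self.read_only and not read_only:
            # Transfer hasChanges & writeLock to the clone, clear them here.
            c.has_changes, c.write_lock = self.has_changes, self.write_lock
            self.has_changes, self.write_lock = False, None
        return c
```

Usage-wise, cloning a writable reader that holds pending deletes hands those deletes to the clone; the original is left clean and lock-free, matching the "never becomes readOnly" point above.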
[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12664965#action_12664965 ] Mark Miller commented on LUCENE-1483: - I think it's pretty costly even for non-id type fields. In your enum case, there are what, 50 unique values? Even still, you are seeing like a 40% diff, but small enough times to not matter. My test example has 20,000 unique terms for 600,000 documents (lots of overlap, 2-8 char strings, 1-9, I think), so quite a bit short of a primary key - but it still was WAY faster with the new method.
Old method, non-optimized, 79 segments - 1.5 million seeks, WAY slow.
Old method, optimized, 1 segment - 20,000 seeks, pretty darn fast.
New method, non-optimized, 79 segments - 40,000 seeks, pretty darn fast.
bq. While there is a big difference between searching a single segment vs multisegments for these things, we already knew about that - thats why you optimize. {quote}Right, but for realtime search you don't have the luxury of optimizing. This patch makes warming time after reopen much faster for a many-segment index for apps that use FieldCache with mostly unique String fields.{quote} Right, I got you - I know we can't optimize. I was just realizing that explaining why 100 segments was so slow was not explaining why the new method on 100 segments was so fast. I still don't think I fully have why that is. I don't think getting to use the unique terms at each segment saves enough seeks for what I am seeing. Especially in this test case, the terms should be pretty evenly distributed across segments... 
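A back-of-envelope check on the seek counts Mark quotes above. The per-term segment overlap is an assumption, not something measured in the thread:

```python
segments = 79
unique_terms = 20_000

# Old method: MultiTermDocs.seek fans out to every sub-reader for every
# unique term, whether or not that segment actually contains the term.
old_seeks = segments * unique_terms   # 1,580,000 -- the "1.5 million seeks"

# New method: each segment only walks its own term dictionary. If each
# unique term appears in roughly two segments on average (an assumed
# figure that would reproduce the reported count), that is about:
assumed_segments_per_term = 2
new_seeks = unique_terms * assumed_segments_per_term   # 40,000
```

The fan-out factor is what makes the old method scale with segment count even when the work per segment is tiny.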
[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12664948#action_12664948 ] Michael McCandless commented on LUCENE-1483: bq. As we call next on MultiTermDocs it will get a TermDocs for each Reader and call seek to get to the Term. The seek appears pretty slow, and we do it for the number of Readers x the number of Terms to be loaded. Right -- the uninverting we do to populate the FieldCache is very costly through MultiReader for fields that are mostly unique String (eg a title field, or a "primary key" id field, etc.). Enum type fields (like country) don't have this problem (1.0 sec vs 0.6 sec to populate FieldCache through MultiReader for the 100 segment index). But, with this change, we sidestep this problem for Lucene's core, but for apps that directly load FieldCache for the MultiReader the problem is still there. Once we have column stride fields (LUCENE-1231) it should then be far faster to load the FieldCache for unique String fields. bq. While there is a big difference between searching a single segment vs multisegments for these things, we already knew about that - thats why you optimize. Right, but for realtime search you don't have the luxury of optimizing. This patch makes warming time after reopen much faster for a many-segment index for apps that use FieldCache with mostly unique String fields. 
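The warming-after-reopen win described above can be illustrated with a toy per-segment cache (names are hypothetical; the real FieldCache keys on the IndexReader instance, not on a segment name):

```python
field_cache = {}   # segment name -> uninverted values (stand-in)

def warm(segment_names):
    """Warm the cache, paying the uninversion cost only for unseen segments.

    With per-segment caches, a reopen that flushed one new segment
    re-uses the cached entries for every unchanged segment.
    """
    newly_loaded = [s for s in segment_names if s not in field_cache]
    for s in newly_loaded:
        field_cache[s] = "uninverted(%s)" % s   # the costly step in real life
    return newly_loaded

# First open: every segment must be uninverted.
warm(["_1", "_2", "_3"])
```

After a reopen that added a segment "_4", only that segment is paid for; a top-level MultiReader cache would instead rebuild everything from scratch.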