Re: Filesystem based bitset
Hi Paul, not really an answer to your questions, I just thought you may find it useful as confirmation that this packing of integers into a (B or some other) tree is a good one. I have seen integer-set distributions that can profit hugely from the tree organization on top. Have a look at: http://www.iis.uni-stuttgart.de/intset/ (not meant for on-disk storage, but the idea is quite similar).

cheers, eks

From: Paul Elschot paul.elsc...@xs4all.nl
To: java-dev@lucene.apache.org
Sent: Sunday, 18 January, 2009 23:51:36
Subject: Re: Filesystem based bitset

On Friday 09 January 2009 22:30:14 Marvin Humphrey wrote:
> On Fri, Jan 09, 2009 at 08:11:31PM +0100, Karl Wettin wrote:
> > SSD is pretty close to RAM when it comes to seeking. Wouldn't that mean that a bitset stored on an SSD would be more or less as fast as a bitset in RAM?
> Provided that your index can fit in the system i/o cache and stay there, you get the speed of RAM regardless of the underlying permanent storage type. There's no reason to wait on SSDs before implementing such a feature.

Since this started by thinking out loud, I'd like to continue doing that.

I've been thinking about how to add a decent skipTo() to something that compresses better than an (Open)BitSet, and this turns out to be an integer set implemented as a B plus tree (all leaves on the same level) of only integers, with key/data compression by a frame of reference for every node (see LUCENE-1410).

I found a Java implementation of a B plus tree on SourceForge: BplusDotNet in the BplusJ package, see http://bplusdotnet.sourceforge.net/ . This has nice transaction semantics on a file system and it has a BSD licence, so it could be used as a starting point, but:
- it only has strings as index values, so it will need quite some simplification to work on integers as keys and data, and
- it has no built-in compression, as far as I could see on first inspection.

The questions:

Would someone know of a better starting point for a B plus tree of integers with node compression? For example, how close is the current Lucene code base to implementing a B plus tree for the doc ids of a single term?

How valuable are transaction semantics for such an integer set? It is tempting to try and implement such an integer set by starting from the ground up, but I don't have any practical programming experience with transaction semantics, so it may be better to start from something that has transactions right from the start.

Regards,
Paul Elschot
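To make the frame-of-reference idea concrete, here is a minimal sketch of one tree node of sorted doc ids encoded against a base value with a fixed bit width. This is illustrative only, not the LUCENE-1410 code; all names are invented for the sketch.

{code:java}
// Illustrative frame-of-reference (FOR) encoding for one tree node of
// sorted doc ids: store the smallest value once, then each offset from
// it in just enough bits. Invented names; not the LUCENE-1410 code.
class ForBlock {
  final int base;         // smallest doc id in the node
  final int bitsPerValue; // bits needed for the largest offset
  final long[] packed;    // offsets, bit-packed

  ForBlock(int[] sortedDocIds) {
    base = sortedDocIds[0];
    int maxOffset = sortedDocIds[sortedDocIds.length - 1] - base;
    bitsPerValue = Math.max(1, 32 - Integer.numberOfLeadingZeros(maxOffset));
    packed = new long[(sortedDocIds.length * bitsPerValue + 63) / 64];
    for (int i = 0; i < sortedDocIds.length; i++) {
      long offset = sortedDocIds[i] - base;
      int bitPos = i * bitsPerValue;
      packed[bitPos >> 6] |= offset << (bitPos & 63);
      if ((bitPos & 63) + bitsPerValue > 64) { // value spills into next word
        packed[(bitPos >> 6) + 1] |= offset >>> (64 - (bitPos & 63));
      }
    }
  }

  int get(int i) { // decode the i-th doc id
    int bitPos = i * bitsPerValue;
    long bits = packed[bitPos >> 6] >>> (bitPos & 63);
    if ((bitPos & 63) + bitsPerValue > 64) {
      bits |= packed[(bitPos >> 6) + 1] << (64 - (bitPos & 63));
    }
    return base + (int) (bits & ((1L << bitsPerValue) - 1));
  }
}
{code}

On top of a tree of such nodes, skipTo(target) descends by comparing target against each node's base, so only the blocks on that path need decoding.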
Re: [jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
I'm also seeing decent gains (~13%) for sort-by-relevance (ie the default sort) term queries w/ a large number (~97K and ~386K) of hits on 10 and 36 segment indices. So I agree, LUCENE-1483 is not just about speeding up sort-by-field queries. It seems to give good speedups all around, and of course warming time for sort-by-field searches goes way, way down. We just gotta wrap it up now!

Mike

Mark Miller wrote:

One more, just as a check with much fewer unique terms (20k). Didn't catch that I didn't clamp down enough on the uniques last time. Back up to 21 segments this time, same wildcard search, 7718 hits, and the new method is still approx 20% faster than the old. The last run was 16 segments though with way more uniques; this one is 21 segments and way fewer uniques.

7718

Segments file=segments_l numSegments=21 version=FORMAT_USER_DATA [Lucene 2.9]

  1 of 21: name=_bbxo docCount=29349 compound=true hasProx=true numFiles=2 size (MB)=11.92 docStoreOffset=0 docStoreSegment=_bbxo docStoreIsCompoundFile=true
    no deletions
    test: open reader.........OK
    test: fields, norms.......OK [1 fields]
    test: terms, freq, prox...OK [21840 terms; 3875263 terms/docs pairs; 4516618 tokens]
    test: stored fields.......OK [0 total field count; avg 0 fields per doc]
    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]

  2 of 21: name=_bbxp docCount=29459 compound=true hasProx=true numFiles=2 size (MB)=11.982 docStoreOffset=29349 docStoreSegment=_bbxo docStoreIsCompoundFile=true
    no deletions
    test: open reader.........OK
    test: fields, norms.......OK [1 fields]
    test: terms, freq, prox...OK [21840 terms; 3895590 terms/docs pairs; 4540859 tokens]
    test: stored fields.......OK [0 total field count; avg 0 fields per doc]
    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]

  3 of 21: name=_bbxq docCount=29300 compound=true hasProx=true numFiles=2 size (MB)=11.97 docStoreOffset=58808 docStoreSegment=_bbxo docStoreIsCompoundFile=true
    no deletions
    test: open reader.........OK
    test: fields, norms.......OK [1 fields]
    test: terms, freq, prox...OK [21840 terms; 3890419 terms/docs pairs; 4536052 tokens]
    test: stored fields.......OK [0 total field count; avg 0 fields per doc]
    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]

  4 of 21: name=_bbxr docCount=29480 compound=true hasProx=true numFiles=2 size (MB)=11.971 docStoreOffset=88108 docStoreSegment=_bbxo docStoreIsCompoundFile=true
    no deletions
    test: open reader.........OK
    test: fields, norms.......OK [1 fields]
    test: terms, freq, prox...OK [21840 terms; 3894211 terms/docs pairs; 4538397 tokens]
    test: stored fields.......OK [0 total field count; avg 0 fields per doc]
    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]

  5 of 21: name=_bbxs docCount=29470 compound=true hasProx=true numFiles=2 size (MB)=11.979 docStoreOffset=117588 docStoreSegment=_bbxo docStoreIsCompoundFile=true
    no deletions
    test: open reader.........OK
    test: fields, norms.......OK [1 fields]
    test: terms, freq, prox...OK [21840 terms; 3895226 terms/docs pairs; 4540446 tokens]
    test: stored fields.......OK [0 total field count; avg 0 fields per doc]
    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]

  6 of 21: name=_bbxt docCount=29450 compound=true hasProx=true numFiles=2 size (MB)=11.98 docStoreOffset=147058 docStoreSegment=_bbxo docStoreIsCompoundFile=true
    no deletions
    test: open reader.........OK
    test: fields, norms.......OK [1 fields]
    test: terms, freq, prox...OK [21840 terms; 3892708 terms/docs pairs; 4538338 tokens]
    test: stored fields.......OK [0 total field count; avg 0 fields per doc]
    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]

  7 of 21: name=_bbxu docCount=29509 compound=true hasProx=true numFiles=2 size (MB)=11.978 docStoreOffset=176508 docStoreSegment=_bbxo docStoreIsCompoundFile=true
    no deletions
    test: open reader.........OK
    test: fields, norms.......OK [1 fields]
    test: terms, freq, prox...OK [21840 terms; 3894189 terms/docs pairs; 4538376 tokens]
    test: stored fields.......OK [0 total field count; avg 0 fields per doc]
    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]

  8 of 21: name=_bbxv docCount=29401 compound=true hasProx=true numFiles=2 size (MB)=11.976 docStoreOffset=206017 docStoreSegment=_bbxo docStoreIsCompoundFile=true
    no deletions
    test: open reader.........OK
    test: fields, norms.......OK [1 fields]
    test: terms, freq, prox...OK [21840 terms; 3891986 terms/docs pairs; 4538746 tokens]
    test: stored fields.......OK [0 total field count; avg 0 fields per doc]
Re: Filesystem based bitset
Paul Elschot wrote:
> Since this started by thinking out loud, I'd like to continue doing that. I've been thinking about how to add a decent skipTo() to something that compresses better than an (Open)BitSet, and this turns out to be an integer set implemented as a B plus tree (all leaves on the same level) of only integers with key/data compression by a frame of reference for every node (see LUCENE-1410).

Sounds great! With flexible indexing (LUCENE-1458, which I'm needing to get back to finish...) you could experiment with these sorts of changes to the postings format by implementing your own codec.

> For example, how close is the current lucene code base to implementing a b plus tree for the doc ids of a single term?

I'm not sure this is a good fit -- B+ trees are great at insertion/deletion of entries, but we never do that with our postings (they are write once). Though if the set operations are substantially faster (??) than the doc-at-a-time iteration Lucene does today, then maybe it is compelling? But we'd have to change up how AND/OR queries work to translate into these set operations.

> How valuable are transaction semantics for such an integer set? It is tempting to try and implement such an integer set by starting from the ground up, but I don't have any practical programming experience with transaction semantics, so it may be better to start from something that has transactions right from the start.

If we use this to store/access deleted docs in RAM, then transactions are very important for realtime search. With LUCENE-1314 (IndexReader.clone) a cloned reader carries over the deletes from the original reader but must copy on write as soon as a new deletion is made. With BitVector for deleted docs, this operation is very costly. But if we used a B+ tree (or something similar) in RAM to hold the deleted docs, and that lets us incrementally copy-on-write only the nodes/blocks affected by the changes, that would be very useful.

It could also be useful for storing deleted docs in the index, ie, this is an alternative to tombstones, in which case its transactional support would be good, to avoid writing an entire BitVector when only a few additional docs became deleted, during commit. This would fit nicely with Lucene's already transactional index storage, ie rather than storing the deletion generation (an int) that we store today, we'd store some reference into the B+ tree indicating the commit point to use for deletions. But I think this usage (changing how deletions are stored on disk) is less compelling than changing how deletions are stored/used in RAM.

Mike

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
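To picture the copy-on-write idea described above, here is a toy sketch. All names are invented and this is not Lucene code; a real structure would be tree-shaped (like the one Eks Dev referenced), but the sharing principle is the same.

{code:java}
// Toy copy-on-write deleted-docs set: doc ids live in fixed 1024-bit
// blocks, and mutation copies only the block it touches plus the block
// table, never the other blocks. A cloned reader can therefore add
// deletions without duplicating the whole bitset. Invented names.
final class CowDeletedDocs {
  private static final int BLOCK_BITS = 1024;
  private final long[][] blocks; // shared between versions, never mutated in place

  CowDeletedDocs(int maxDoc) {
    blocks = new long[(maxDoc + BLOCK_BITS - 1) / BLOCK_BITS][];
  }

  private CowDeletedDocs(long[][] blocks) { this.blocks = blocks; }

  boolean isDeleted(int docId) {
    long[] block = blocks[docId / BLOCK_BITS];
    return block != null
        && (block[(docId % BLOCK_BITS) >> 6] & (1L << (docId & 63))) != 0;
  }

  // Returns a new set with docId deleted; the old set is untouched, and
  // all blocks except the one containing docId are shared between them.
  CowDeletedDocs delete(int docId) {
    long[][] newBlocks = blocks.clone(); // copies block references only
    int b = docId / BLOCK_BITS;
    long[] block = newBlocks[b] == null
        ? new long[BLOCK_BITS / 64] : newBlocks[b].clone();
    block[(docId % BLOCK_BITS) >> 6] |= 1L << (docId & 63);
    newBlocks[b] = block;
    return new CowDeletedDocs(newBlocks);
  }
}
{code}

Cloning the flat block table is still O(number of blocks) per change; a B+ tree would copy only the O(log n) spine nodes, which is where the tree structure earns its keep.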
[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665102#action_12665102 ]

Michael McCandless commented on LUCENE-1483:
--------------------------------------------

I'm working on another iteration of this patch, cleaning things up, adding javadocs, etc., in preparation for committing...

Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
---------------------------------------------------------------------------------------------------------

                 Key: LUCENE-1483
                 URL: https://issues.apache.org/jira/browse/LUCENE-1483
             Project: Lucene - Java
          Issue Type: Improvement
    Affects Versions: 2.9
            Reporter: Mark Miller
            Priority: Minor
         Attachments: LUCENE-1483-partial.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, sortBench.py, sortCollate.py

FieldCache and Filters are forced down to a single segment reader, allowing for individual segment reloading on reopen.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
Using full norms (Was: Bubbling up newer records)
Hello,

Michael McCandless wrote:
> The upcoming Lucene in Action revision (now available online through Manning's MEAP) has a basic example of this (boosting by recency) in the Advanced Search chapter, using function queries.

I have never used function queries before, but it was very easy to boost more recent documents with the help of FieldScoreQuery. This may be quite a common usage.

The result is based on a computation at search time, but the same result could be accomplished using document boost at indexing time (and certainly faster, with less memory used). There is a difference, though: document boost is used to compute the document's norm value, which is stored with precision loss (a float encoded as a byte).

The question: Is it still really an issue to encode norms as bytes? Do we lose less than we gain? Can someone imagine any real disadvantage of storing norms as full 4-byte floats? Nowadays?

Best regards,
Jiri Kuhn.
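For context, a small sketch of the precision loss in question, assuming Lucene 2.4's static Similarity.encodeNorm/decodeNorm (the one-byte norm encoding); the demo scaffolding itself is invented:

{code:java}
import org.apache.lucene.search.Similarity;

// Round-trips a few boosts through the one-byte norm encoding to show
// the precision loss being discussed. Assumes Lucene 2.4's static
// Similarity.encodeNorm/decodeNorm; the main() wrapper is illustrative.
public class NormPrecisionDemo {
  public static void main(String[] args) {
    for (float boost : new float[] { 1.0f, 1.1f, 1.25f, 2.0f, 3.3f }) {
      byte b = Similarity.encodeNorm(boost);
      float back = Similarity.decodeNorm(b);
      System.out.println(boost + " -> byte " + b + " -> " + back);
    }
  }
}
{code}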
Re: Question on Lucene search
Please ask your question on java-u...@lucene.apache.org.

Thanks,
Grant

On Jan 19, 2009, at 1:20 AM, fell wrote:

Hi all, I am new to Lucene and I need to know the following. Say I have indexed some data using Lucene and it contains the fields Location, City, Country, with the following data in each of the above fields:
1) R G Heights
2) London
3) United Kingdom
If I try to search the index by putting the following in my query:
1) RG Heights (note: R and G do not have a space in the middle), or
2) RGHeights (no space at all), or
3) R  G Heights (extra space between tokens), or
4) Kingdom United.
Please tell me if Lucene would come up with a positive result or would tell me 'no hits'. Please let me know this for each of the queries above! Thanks!
--
View this message in context: http://www.nabble.com/Question-on-Lucene-search-tp21537509p21537509.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Created: (LUCENE-1522) another highlighter
another highlighter
-------------------

                 Key: LUCENE-1522
                 URL: https://issues.apache.org/jira/browse/LUCENE-1522
             Project: Lucene - Java
          Issue Type: Improvement
          Components: contrib/highlighter
            Reporter: Koji Sekiguchi
            Priority: Minor

I've written this highlighter for my project to support a bi-gram token stream. The idea was inherited from my previous project with my colleague and LUCENE-644. This approach needs highlight fields to be TermVector.WITH_POSITIONS_OFFSETS, but is fast and can support N-grams. This depends on LUCENE-1448 to get refined term offsets.

usage:
{code:java}
Highlighter h = new Highlighter();
FieldQuery fq = h.getFieldQuery( query );
// docId=0, fieldName="content", fragCharSize=100, numFragments=3
String[] fragments = h.getBestFragments( fq, reader, 0, "content", 100, 3 );
{code}

features:
- fast for large docs
- supports fixed-size N-grams (e.g. (2,2), not (1,3)) (can solve LUCENE-1489)
- supports PhraseQuery, phrase-unit highlighting with slops
{noformat}
q="w1 w2"
<b>w1 w2</b>
---
q="w1 w2"~1
<b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
{noformat}
- highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
- easy to apply the patch due to an independent package (contrib/highlighter2)
- uses Java 1.5
- looks at query boost to score fragments (currently doesn't see idf, but it should be possible)
- pluggable FragListBuilder
- pluggable FragmentsBuilder

to do:
- term positions can be unnecessary when phraseHighlight==false
- collect performance numbers

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1522) another highlighter
[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated LUCENE-1522:
-----------------------------------

    Attachment: LUCENE-1522.patch

To apply this patch, LUCENE-1448 also needs to be applied:

{code}
$ svn co -r713975 http://svn.apache.org/repos/asf/lucene/java/trunk
$ cd trunk
$ patch -p0 < LUCENE-1448.patch
$ patch -p0 < LUCENE-1522.patch
{code}

another highlighter
-------------------

                 Key: LUCENE-1522
                 URL: https://issues.apache.org/jira/browse/LUCENE-1522
             Project: Lucene - Java
          Issue Type: Improvement
          Components: contrib/highlighter
            Reporter: Koji Sekiguchi
            Priority: Minor
         Attachments: LUCENE-1522.patch

I've written this highlighter for my project to support a bi-gram token stream. The idea was inherited from my previous project with my colleague and LUCENE-644. This approach needs highlight fields to be TermVector.WITH_POSITIONS_OFFSETS, but is fast and can support N-grams. This depends on LUCENE-1448 to get refined term offsets.

usage:
{code:java}
Highlighter h = new Highlighter();
FieldQuery fq = h.getFieldQuery( query );
// docId=0, fieldName="content", fragCharSize=100, numFragments=3
String[] fragments = h.getBestFragments( fq, reader, 0, "content", 100, 3 );
{code}

features:
- fast for large docs
- supports fixed-size N-grams (e.g. (2,2), not (1,3)) (can solve LUCENE-1489)
- supports PhraseQuery, phrase-unit highlighting with slops
{noformat}
q="w1 w2"
<b>w1 w2</b>
---
q="w1 w2"~1
<b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
{noformat}
- highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
- easy to apply the patch due to an independent package (contrib/highlighter2)
- uses Java 1.5
- looks at query boost to score fragments (currently doesn't see idf, but it should be possible)
- pluggable FragListBuilder
- pluggable FragmentsBuilder

to do:
- term positions can be unnecessary when phraseHighlight==false
- collect performance numbers

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Filesystem based bitset
On Monday 19 January 2009 11:32:17 Michael McCandless wrote:
> Paul Elschot wrote:
> > Since this started by thinking out loud, I'd like to continue doing that. I've been thinking about how to add a decent skipTo() to something that compresses better than an (Open)BitSet, and this turns out to be an integer set implemented as a B plus tree (all leaves on the same level) of only integers with key/data compression by a frame of reference for every node (see LUCENE-1410).
> Sounds great! With flexible indexing (LUCENE-1458, which I'm needing to get back to finish...) you could experiment with these sorts of changes to the postings format by implementing your own codec.

I'll take a look there.

> > For example, how close is the current lucene code base to implementing a b plus tree for the doc ids of a single term?
> I'm not sure this is a good fit -- B+ trees are great at insertion/deletion of entries, but we never do that with our postings (they are write once). Though if the set operations are substantially faster (??) than the doc-at-a-time iteration Lucene does today, then maybe it is compelling? But we'd have to change up how AND/OR queries work to translate into these set operations.

The idea is to implement a DocIdSetIterator on top of this, with the usual next() and skipTo(), so it should fit in the current Lucene framework.

> > How valuable are transaction semantics for such an integer set? It is tempting to try and implement such an integer set by starting from the ground up, but I don't have any practical programming experience with transaction semantics, so it may be better to start from something that has transactions right from the start.
> If we use this to store/access deleted docs in RAM, then transactions are very important for realtime search. With LUCENE-1314 (IndexReader.clone) a cloned reader carries over the deletes from the original reader but must copy on write as soon as a new deletion is made. With BitVector for deleted docs, this operation is very costly. But if we used a B+ tree (or something similar) in RAM to hold the deleted docs, and that lets us incrementally copy-on-write only the nodes/blocks affected by the changes, that would be very useful.

The one referenced by Eks Dev would be a good starting point for that; it's basically a binary tree of BitSets of at most 1024 bits at the leaves.

> It could also be useful for storing deleted docs in the index, ie, this is an alternative to tombstones, in which case its transactional support would be good, to avoid writing an entire BitVector when only a few additional docs became deleted, during commit. This would fit nicely with Lucene's already transactional index storage, ie rather than storing the deletion generation (an int) that we store today, we'd store some reference into the B+ tree indicating the commit point to use for deletions. But I think this usage (changing how deletions are stored on disk) is less compelling than changing how deletions are stored/used in RAM.

Thanks,
Paul Elschot
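As a sketch of what "fit in the current framework" means, here is a minimal DocIdSetIterator (the 2.4-era contract with next()/skipTo()) over a plain sorted int[] standing in for the compressed tree; the backing array is a placeholder assumption:

{code:java}
import java.util.Arrays;
import org.apache.lucene.search.DocIdSetIterator;

// Minimal sketch: exposes a sorted array of doc ids through the 2.4-era
// DocIdSetIterator contract. A tree-backed set would replace the array
// and make skipTo() a tree descent instead of a binary search.
public class SortedIntsIterator extends DocIdSetIterator {
  private final int[] docIds; // sorted, no duplicates
  private int pos = -1;

  public SortedIntsIterator(int[] sortedDocIds) { this.docIds = sortedDocIds; }

  public int doc() { return docIds[pos]; }

  public boolean next() { return ++pos < docIds.length; }

  public boolean skipTo(int target) {
    // Position on the first doc id >= target. (A real implementation
    // would only search forward of the current position.)
    int i = Arrays.binarySearch(docIds, target);
    pos = i >= 0 ? i : -i - 1;
    return pos < docIds.length;
  }
}
{code}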
[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665252#action_12665252 ]

Michael Busch commented on LUCENE-1483:
---------------------------------------

Mark and Mike, this issue and the patch are amazingly long, and catching up here after vacation is pretty hard. Maybe you could update the description of this issue with a summary (maybe a bullet list?) that describes the main goals and changes here? That would be great...

Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
---------------------------------------------------------------------------------------------------------

                 Key: LUCENE-1483
                 URL: https://issues.apache.org/jira/browse/LUCENE-1483
             Project: Lucene - Java
          Issue Type: Improvement
    Affects Versions: 2.9
            Reporter: Mark Miller
            Priority: Minor
         Attachments: LUCENE-1483-partial.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, sortBench.py, sortCollate.py

FieldCache and Filters are forced down to a single segment reader, allowing for individual segment reloading on reopen.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1522) another highlighter
[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated LUCENE-1522:
-----------------------------------

    Description:

I've written this highlighter for my project to support a bi-gram token stream. The idea was inherited from my previous project with my colleague and LUCENE-644. This approach needs highlight fields to be TermVector.WITH_POSITIONS_OFFSETS, but is fast and can support N-grams. This depends on LUCENE-1448 to get refined term offsets.

usage:
{code:java}
TopDocs docs = searcher.search( query, 10 );
Highlighter h = new Highlighter();
FieldQuery fq = h.getFieldQuery( query );
for( ScoreDoc scoreDoc : docs.scoreDocs ){
  // fieldName="content", fragCharSize=100, numFragments=3
  String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc, "content", 100, 3 );
  if( fragments != null ){
    for( String fragment : fragments )
      System.out.println( fragment );
  }
}
{code}

features:
- fast for large docs
- supports fixed-size N-grams (e.g. (2,2), not (1,3)) (can solve LUCENE-1489)
- supports PhraseQuery, phrase-unit highlighting with slops
{noformat}
q="w1 w2"
<b>w1 w2</b>
---
q="w1 w2"~1
<b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
{noformat}
- highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
- easy to apply the patch due to an independent package (contrib/highlighter2)
- uses Java 1.5
- looks at query boost to score fragments (currently doesn't see idf, but it should be possible)
- pluggable FragListBuilder
- pluggable FragmentsBuilder

to do:
- term positions can be unnecessary when phraseHighlight==false
- collect performance numbers

  was:

I've written this highlighter for my project to support a bi-gram token stream. The idea was inherited from my previous project with my colleague and LUCENE-644. This approach needs highlight fields to be TermVector.WITH_POSITIONS_OFFSETS, but is fast and can support N-grams. This depends on LUCENE-1448 to get refined term offsets.

usage:
{code:java}
Highlighter h = new Highlighter();
FieldQuery fq = h.getFieldQuery( query );
// docId=0, fieldName="content", fragCharSize=100, numFragments=3
String[] fragments = h.getBestFragments( fq, reader, 0, "content", 100, 3 );
{code}

features:
- fast for large docs
- supports fixed-size N-grams (e.g. (2,2), not (1,3)) (can solve LUCENE-1489)
- supports PhraseQuery, phrase-unit highlighting with slops
{noformat}
q="w1 w2"
<b>w1 w2</b>
---
q="w1 w2"~1
<b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
{noformat}
- highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
- easy to apply the patch due to an independent package (contrib/highlighter2)
- uses Java 1.5
- looks at query boost to score fragments (currently doesn't see idf, but it should be possible)
- pluggable FragListBuilder
- pluggable FragmentsBuilder

to do:
- term positions can be unnecessary when phraseHighlight==false
- collect performance numbers

    Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])

another highlighter
-------------------

                 Key: LUCENE-1522
                 URL: https://issues.apache.org/jira/browse/LUCENE-1522
             Project: Lucene - Java
          Issue Type: Improvement
          Components: contrib/highlighter
            Reporter: Koji Sekiguchi
            Priority: Minor
         Attachments: LUCENE-1522.patch

I've written this highlighter for my project to support a bi-gram token stream. The idea was inherited from my previous project with my colleague and LUCENE-644. This approach needs highlight fields to be TermVector.WITH_POSITIONS_OFFSETS, but is fast and can support N-grams. This depends on LUCENE-1448 to get refined term offsets.

usage:
{code:java}
TopDocs docs = searcher.search( query, 10 );
Highlighter h = new Highlighter();
FieldQuery fq = h.getFieldQuery( query );
for( ScoreDoc scoreDoc : docs.scoreDocs ){
  // fieldName="content", fragCharSize=100, numFragments=3
  String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc, "content", 100, 3 );
  if( fragments != null ){
    for( String fragment : fragments )
      System.out.println( fragment );
  }
}
{code}

features:
- fast for large docs
- supports fixed-size N-grams (e.g. (2,2), not (1,3)) (can solve LUCENE-1489)
- supports PhraseQuery, phrase-unit highlighting with slops
{noformat}
q="w1 w2"
<b>w1 w2</b>
---
q="w1 w2"~1
<b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
{noformat}
- highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
- easy to apply the patch due to an independent package (contrib/highlighter2)
- uses Java 1.5
- looks at query boost to score fragments (currently doesn't see idf, but it should be possible)
- pluggable FragListBuilder
- pluggable FragmentsBuilder

to do:
- term positions can be unnecessary when phraseHighlight==false
- collect performance numbers

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller updated LUCENE-1483:
--------------------------------

    Description:

This issue changes how an IndexSearcher searches over multiple segments. The current method of searching multiple segments is to use a MultiSegmentReader and treat all of the segments as one. This causes filters and FieldCaches to be keyed to the MultiReader and makes reopen expensive: if only a few segments change, the FieldCache is still loaded for all of them.

This patch changes things by searching each individual segment one at a time, but sharing the HitCollector used across each segment. This allows FieldCaches and Filters to be keyed on individual SegmentReads, making reopen much cheaper.

FieldCache loading over multiple segments can be much faster as well: with the old method, all unique terms for every segment are enumerated against each segment, and because of the likely logarithmic change in terms per segment, this can be very wasteful. Searching individual segments avoids this cost. The term/document statistics from the multireader are used to score results for each segment.

When sorting, it's more difficult to use a single HitCollector for each sub-searcher, because ordinals are not comparable across segments. To account for this, a new sort-enabled HitCollector is introduced that is able to collect and sort across segments (because of its ability to compare ordinals across segments). This TopFieldCollector class will collect the values/ordinals for a given segment, and upon moving to the next segment, translate any ordinals/values so that they can be compared against the values for the new segment. This is done lazily.

All in all, the switch seems to provide numerous performance benefits, in both sorted and non-sorted search. We were seeing a good loss on indices with lots of segments (1000?) and certain queue sizes / queries, but the latest results seem to show that's been mostly taken care of (you shouldn't be using such a large queue on such a segmented index anyway).

  was: FieldCache and Filters are forced down to a single segment reader, allowing for individual segment reloading on reopen.

Here is a start at a better summary. It could be improved.

Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
---------------------------------------------------------------------------------------------------------

                 Key: LUCENE-1483
                 URL: https://issues.apache.org/jira/browse/LUCENE-1483
             Project: Lucene - Java
          Issue Type: Improvement
    Affects Versions: 2.9
            Reporter: Mark Miller
            Priority: Minor
         Attachments: LUCENE-1483-partial.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, sortBench.py, sortCollate.py

This issue changes how an IndexSearcher searches over multiple segments. The current method of searching multiple segments is to use a MultiSegmentReader and treat all of the segments as one. This causes filters and FieldCaches to be keyed to the MultiReader and makes reopen expensive: if only a few segments change, the FieldCache is still loaded for all of them.

This patch changes things by searching each individual segment one at a time, but sharing the HitCollector used across each segment. This allows FieldCaches and Filters to be keyed on individual SegmentReads, making reopen much cheaper. FieldCache loading over multiple segments can be much faster as well: with the old method, all unique terms for every segment are enumerated against each segment, and because of the likely logarithmic change in terms per segment, this can be very wasteful. Searching individual segments avoids this cost. The term/document statistics from the multireader are used to score results for each segment.

When sorting, it's more difficult to use a single HitCollector for each sub-searcher, because ordinals are not comparable across segments. To account for this, a new sort-enabled HitCollector is introduced that is able to collect and sort across segments (because of its ability to compare ordinals across segments). This TopFieldCollector class will collect the values/ordinals for a given segment, and upon moving to the next segment, translate any ordinals/values so that they can be compared against the values for the new segment. This is done lazily.
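A rough illustration of the collection flow described above. This is a sketch under assumptions, not the patch's code: the description names MultiReaderHitCollector, but the exact signatures (setNextReader taking a docBase, etc.) are invented here for illustration, and real scoring is stood in for by collecting every live doc.

{code:java}
import java.io.IOException;
import org.apache.lucene.index.IndexReader;

// Sketch: one collector is shared across segments; before each segment is
// scored it learns the segment's doc id offset (docBase), so it can map
// segment-relative doc ids back into the MultiReader's doc id space.
abstract class MultiReaderHitCollector {
  public abstract void setNextReader(IndexReader reader, int docBase) throws IOException;
  public abstract void collect(int doc, float score); // doc is segment-relative
}

class CountingCollector extends MultiReaderHitCollector {
  private int docBase;
  int totalHits;
  int lastGlobalDoc;
  public void setNextReader(IndexReader reader, int docBase) { this.docBase = docBase; }
  public void collect(int doc, float score) {
    lastGlobalDoc = docBase + doc; // re-base into the global doc id space
    totalHits++;
  }
}

class PerSegmentSearch {
  // Drive the shared collector over each segment in turn.
  static void searchEachSegment(IndexReader[] subReaders, int[] starts,
                                MultiReaderHitCollector c) throws IOException {
    for (int i = 0; i < subReaders.length; i++) {
      c.setNextReader(subReaders[i], starts[i]);
      for (int doc = 0; doc < subReaders[i].maxDoc(); doc++) {
        if (!subReaders[i].isDeleted(doc)) c.collect(doc, 1.0f); // stand-in scoring
      }
    }
  }
}
{code}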
[jira] Updated: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller updated LUCENE-1483:
--------------------------------

    Description:

This issue changes how an IndexSearcher searches over multiple segments. The current method of searching multiple segments is to use a MultiSegmentReader and treat all of the segments as one. This causes filters and FieldCaches to be keyed to the MultiReader and makes reopen expensive: if only a few segments change, the FieldCache is still loaded for all of them.

This patch changes things by searching each individual segment one at a time, but sharing the HitCollector used across each segment. This allows FieldCaches and Filters to be keyed on individual SegmentReads, making reopen much cheaper.

FieldCache loading over multiple segments can be much faster as well: with the old method, all unique terms for every segment are enumerated against each segment, and because of the likely logarithmic change in terms per segment, this can be very wasteful. Searching individual segments avoids this cost. The term/document statistics from the multireader are used to score results for each segment.

When sorting, it's more difficult to use a single HitCollector for each sub-searcher, because ordinals are not comparable across segments. To account for this, a new sort-enabled HitCollector is introduced that is able to collect and sort across segments (because of its ability to compare ordinals across segments). This TopFieldCollector class will collect the values/ordinals for a given segment, and upon moving to the next segment, translate any ordinals/values so that they can be compared against the values for the new segment. This is done lazily.

All in all, the switch seems to provide numerous performance benefits, in both sorted and non-sorted search. We were seeing a good loss on indices with lots of segments (1000?) and certain queue sizes / queries, but the latest results seem to show that's been mostly taken care of (you shouldn't be using such a large queue on such a segmented index anyway).

* Introduces
** MultiReaderHitCollector - a HitCollector that can collect across multiple IndexReaders. Old HitCollectors are wrapped to support multiple IndexReaders.
** TopFieldCollector - a HitCollector that can compare values/ordinals across IndexReaders and sort on fields.
** FieldValueHitQueue - a priority queue that is part of the TopFieldCollector implementation.
** FieldComparator - a new comparator class that works across IndexReaders. Part of the TopFieldCollector implementation.
** FieldComparatorSource - a new class to allow for custom comparators.
* Alters
** IndexSearcher uses a single HitCollector to collect hits against each individual SegmentReader. All the other changes stem from this ;)
* Deprecates
** TopFieldDocCollector
** FieldSortedHitQueue

  was:

This issue changes how an IndexSearcher searches over multiple segments. The current method of searching multiple segments is to use a MultiSegmentReader and treat all of the segments as one. This causes filters and FieldCaches to be keyed to the MultiReader and makes reopen expensive: if only a few segments change, the FieldCache is still loaded for all of them.

This patch changes things by searching each individual segment one at a time, but sharing the HitCollector used across each segment. This allows FieldCaches and Filters to be keyed on individual SegmentReads, making reopen much cheaper.

FieldCache loading over multiple segments can be much faster as well: with the old method, all unique terms for every segment are enumerated against each segment, and because of the likely logarithmic change in terms per segment, this can be very wasteful. Searching individual segments avoids this cost. The term/document statistics from the multireader are used to score results for each segment.

When sorting, it's more difficult to use a single HitCollector for each sub-searcher, because ordinals are not comparable across segments. To account for this, a new sort-enabled HitCollector is introduced that is able to collect and sort across segments (because of its ability to compare ordinals across segments). This TopFieldCollector class will collect the values/ordinals for a given segment, and upon moving to the next segment, translate any ordinals/values so that they can be compared against the values for the new segment. This is done lazily.

All in all, the switch seems to provide numerous performance benefits, in both sorted and non-sorted search. We were seeing a good loss on indices with lots of segments (1000?) and certain queue sizes / queries, but the latest results seem to show that's been mostly taken care of (you shouldn't be using such a large queue on such a segmented index anyway).

Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
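The lazy ordinal translation above can be pictured like this. This is a hedged sketch only, with all names invented; the patch's FieldComparator machinery is more involved, but the idea is that on a segment change each buffered slot's term value is re-looked-up in the new segment's sorted term table to get a comparable ordinal.

{code:java}
import java.util.Arrays;

// Sketch: ordinals are only meaningful within one segment, so on a segment
// change each queue slot's stored value is re-mapped to an ordinal in the
// new segment's sorted unique-term table (e.g. from the FieldCache).
class OrdinalSlots {
  final String[] values; // term value held in each queue slot
  final int[] ords;      // that value's ordinal in the current segment

  OrdinalSlots(int numSlots) {
    values = new String[numSlots];
    ords = new int[numSlots];
  }

  void setNextLookup(String[] sortedTerms) {
    for (int slot = 0; slot < values.length; slot++) {
      if (values[slot] == null) continue; // empty slot
      int i = Arrays.binarySearch(sortedTerms, values[slot]);
      // Absent values take the insertion point; slots that land on the
      // same ordinal would still be tie-broken by comparing values
      // directly (elided here).
      ords[slot] = i >= 0 ? i : -i - 1;
    }
  }
}
{code}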
[jira] Updated: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller updated LUCENE-1483:
--------------------------------

    Description:

This issue changes how an IndexSearcher searches over multiple segments. The current method of searching multiple segments is to use a MultiSegmentReader and treat all of the segments as one. This causes filters and FieldCaches to be keyed to the MultiReader and makes reopen expensive: if only a few segments change, the FieldCache is still loaded for all of them.

This patch changes things by searching each individual segment one at a time, but sharing the HitCollector used across each segment. This allows FieldCaches and Filters to be keyed on individual SegmentReaders, making reopen much cheaper.

FieldCache loading over multiple segments can be much faster as well: with the old method, all unique terms for every segment are enumerated against each segment, and because of the likely logarithmic change in terms per segment, this can be very wasteful. Searching individual segments avoids this cost. The term/document statistics from the multireader are used to score results for each segment.

When sorting, it's more difficult to use a single HitCollector for each sub-searcher, because ordinals are not comparable across segments. To account for this, a new sort-enabled HitCollector is introduced that is able to collect and sort across segments (because of its ability to compare ordinals across segments). This TopFieldCollector class will collect the values/ordinals for a given segment, and upon moving to the next segment, translate any ordinals/values so that they can be compared against the values for the new segment. This is done lazily.

All in all, the switch seems to provide numerous performance benefits, in both sorted and non-sorted search. We were seeing a good loss on indices with lots of segments (1000?) and certain queue sizes / queries, but the latest results seem to show that's been mostly taken care of (you shouldn't be using such a large queue on such a segmented index anyway).

* Introduces
** MultiReaderHitCollector - a HitCollector that can collect across multiple IndexReaders. Old HitCollectors are wrapped to support multiple IndexReaders.
** TopFieldCollector - a HitCollector that can compare values/ordinals across IndexReaders and sort on fields.
** FieldValueHitQueue - a priority queue that is part of the TopFieldCollector implementation.
** FieldComparator - a new comparator class that works across IndexReaders. Part of the TopFieldCollector implementation.
** FieldComparatorSource - a new class to allow for custom comparators.
* Alters
** IndexSearcher uses a single HitCollector to collect hits against each individual SegmentReader. All the other changes stem from this ;)
* Deprecates
** TopFieldDocCollector
** FieldSortedHitQueue

  was:

This issue changes how an IndexSearcher searches over multiple segments. The current method of searching multiple segments is to use a MultiSegmentReader and treat all of the segments as one. This causes filters and FieldCaches to be keyed to the MultiReader and makes reopen expensive: if only a few segments change, the FieldCache is still loaded for all of them.

This patch changes things by searching each individual segment one at a time, but sharing the HitCollector used across each segment. This allows FieldCaches and Filters to be keyed on individual SegmentReads, making reopen much cheaper.

FieldCache loading over multiple segments can be much faster as well: with the old method, all unique terms for every segment are enumerated against each segment, and because of the likely logarithmic change in terms per segment, this can be very wasteful. Searching individual segments avoids this cost. The term/document statistics from the multireader are used to score results for each segment.

When sorting, it's more difficult to use a single HitCollector for each sub-searcher, because ordinals are not comparable across segments. To account for this, a new sort-enabled HitCollector is introduced that is able to collect and sort across segments (because of its ability to compare ordinals across segments). This TopFieldCollector class will collect the values/ordinals for a given segment, and upon moving to the next segment, translate any ordinals/values so that they can be compared against the values for the new segment. This is done lazily.

All in all, the switch seems to provide numerous performance benefits, in both sorted and non-sorted search. We were seeing a good loss on indices with lots of segments (1000?) and certain queue sizes / queries, but the latest results seem to show that's been mostly taken care of (you shouldn't be using such a large queue on such a segmented index anyway).

* Introduces
** MultiReaderHitCollector - a HitCollector that can collect across multiple IndexReaders. Old HitCollectors are wrapped to support multiple IndexReaders.
[jira] Updated: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller updated LUCENE-1483:
--------------------------------

    Description:

This issue changes how an IndexSearcher searches over multiple segments. The current method of searching multiple segments is to use a MultiSegmentReader and treat all of the segments as one. This causes filters and FieldCaches to be keyed to the MultiReader and makes reopen expensive: if only a few segments change, the FieldCache is still loaded for all of them.

This patch changes things by searching each individual segment one at a time, but sharing the HitCollector used across each segment. This allows FieldCaches and Filters to be keyed on individual SegmentReaders, making reopen much cheaper.

FieldCache loading over multiple segments can be much faster as well: with the old method, all unique terms for every segment are enumerated against each segment, and because of the likely logarithmic change in terms per segment, this can be very wasteful. Searching individual segments avoids this cost. The term/document statistics from the multireader are used to score results for each segment.

When sorting, it's more difficult to use a single HitCollector for each sub-searcher, because ordinals are not comparable across segments. To account for this, a new field-sort-enabled HitCollector is introduced that is able to collect and sort across segments (because of its ability to compare ordinals across segments). This TopFieldCollector class will collect the values/ordinals for a given segment, and upon moving to the next segment, translate any ordinals/values so that they can be compared against the values for the new segment. This is done lazily.

All in all, the switch seems to provide numerous performance benefits, in both sorted and non-sorted search. We were seeing a good loss on indices with lots of segments (1000?) and certain queue sizes / queries, but the latest results seem to show that's been mostly taken care of (you shouldn't be using such a large queue on such a segmented index anyway).

* Introduces
** MultiReaderHitCollector - a HitCollector that can collect across multiple IndexReaders. Old HitCollectors are wrapped to support multiple IndexReaders.
** TopFieldCollector - a HitCollector that can compare values/ordinals across IndexReaders and sort on fields.
** FieldValueHitQueue - a priority queue that is part of the TopFieldCollector implementation.
** FieldComparator - a new comparator class that works across IndexReaders. Part of the TopFieldCollector implementation.
** FieldComparatorSource - a new class to allow for custom comparators.
* Alters
** IndexSearcher uses a single HitCollector to collect hits against each individual SegmentReader. All the other changes stem from this ;)
* Deprecates
** TopFieldDocCollector
** FieldSortedHitQueue

  was:

This issue changes how an IndexSearcher searches over multiple segments. The current method of searching multiple segments is to use a MultiSegmentReader and treat all of the segments as one. This causes filters and FieldCaches to be keyed to the MultiReader and makes reopen expensive: if only a few segments change, the FieldCache is still loaded for all of them.

This patch changes things by searching each individual segment one at a time, but sharing the HitCollector used across each segment. This allows FieldCaches and Filters to be keyed on individual SegmentReaders, making reopen much cheaper.

FieldCache loading over multiple segments can be much faster as well: with the old method, all unique terms for every segment are enumerated against each segment, and because of the likely logarithmic change in terms per segment, this can be very wasteful. Searching individual segments avoids this cost. The term/document statistics from the multireader are used to score results for each segment.

When sorting, it's more difficult to use a single HitCollector for each sub-searcher, because ordinals are not comparable across segments. To account for this, a new sort-enabled HitCollector is introduced that is able to collect and sort across segments (because of its ability to compare ordinals across segments). This TopFieldCollector class will collect the values/ordinals for a given segment, and upon moving to the next segment, translate any ordinals/values so that they can be compared against the values for the new segment. This is done lazily.

All in all, the switch seems to provide numerous performance benefits, in both sorted and non-sorted search. We were seeing a good loss on indices with lots of segments (1000?) and certain queue sizes / queries, but the latest results seem to show that's been mostly taken care of (you shouldn't be using such a large queue on such a segmented index anyway).

* Introduces
** MultiReaderHitCollector - a HitCollector that can collect across multiple IndexReaders. Old HitCollectors are wrapped to support multiple IndexReaders.
[jira] Closed: (LUCENE-1519) Change Primitive Data Types from int to long in class SegmentMerger.java
[ https://issues.apache.org/jira/browse/LUCENE-1519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Deepak closed LUCENE-1519.
--------------------------

No problem

Change Primitive Data Types from int to long in class SegmentMerger.java
-------------------------------------------------------------------------

                 Key: LUCENE-1519
                 URL: https://issues.apache.org/jira/browse/LUCENE-1519
             Project: Lucene - Java
          Issue Type: Bug
    Affects Versions: 2.4
         Environment: lucene 2.4.0, jdk1.6.0_03/07/11
            Reporter: Deepak
            Assignee: Michael McCandless
             Fix For: 2.9
   Original Estimate: 4h
  Remaining Estimate: 4h

Hi,
We are getting an exception while optimizing:

  mergeFields produced an invalid result: docCount is 385282378 but fdx file size is 3082259028; now aborting this merge to prevent index corruption

I have checked the code for class SegmentMerger.java and found this check:

{code:java}
if (4+docCount*8 != fdxFileLength) {
  // This is most likely a bug in Sun JRE 1.6.0_04/_05;
  // we detect that the bug has struck, here, and
  // throw an exception to prevent the corruption from
  // entering the index. See LUCENE-1282 for details.
  throw new RuntimeException("mergeFields produced an invalid result: docCount is " + docCount + " but fdx file size is " + fdxFileLength + "; now aborting this merge to prevent index corruption");
}
{code}

In our case docCount is 385282378 and fdxFileLength is 3082259028. Even though 4+385282378*8 is equal to 3082259028, the above check will not work, because the number 3082259028 is out of int range. So the type of the variable docCount needs to be changed to long.

I have written a small test for this:

{code:java}
public class SegmentMergerTest {
  public static void main(String[] args) {
    int docCount = 385282378;
    long fdxFileLength = 3082259028L;
    if (4+docCount*8 != fdxFileLength)
      System.out.println("No Match " + (4+docCount*8));
    else
      System.out.println("Match " + (4+docCount*8));
  }
}
{code}

The test above prints "No Match", but if you change the data type of docCount to long, it prints "Match".

Can you please advise us if this issue will be fixed in the next release?

Regards,
Deepak

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
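The fix the report implies is to force the arithmetic into 64 bits before the multiply overflows. A sketch of the obvious correction (the class name is invented; the committed change may differ):

{code:java}
// Same test with the arithmetic promoted to long before the multiply can
// overflow: 8L * 385282378 = 3082259024, plus 4 gives 3082259028, which
// now matches the fdx file length.
public class SegmentMergerFixTest {
  public static void main(String[] args) {
    int docCount = 385282378;
    long fdxFileLength = 3082259028L;
    if (4L + 8L * docCount != fdxFileLength)
      System.out.println("No Match " + (4L + 8L * docCount));
    else
      System.out.println("Match " + (4L + 8L * docCount)); // prints: Match 3082259028
  }
}
{code}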
Committed revision 735928.
Committed revision 735928.

Adding myself to contrib committers list / testing karma

Thanks
Patrick

scootie:site pjaol$ svn diff docs/*.html
Index: docs/whoweare.html
===================================================================
--- docs/whoweare.html (revision 735927)
+++ docs/whoweare.html (working copy)
@@ -285,6 +285,9 @@
 <b>Wolfgang Hoschek</b> (whosc...@...)</li>
 <li>
+<b>Patrick O'Leary</b> (pj...@...)</li>
+
+<li>
 <b>Uwe Schindler</b> (uschind...@...)</li>
 <li>
@@ -300,7 +303,7 @@
 </div>
-<a name="N10087"></a><a name="emeritus"></a>
+<a name="N1008C"></a><a name="emeritus"></a>
 <h2 class="boxed">Emeritus Committers</h2>
 <div class="section">
 <ul>

scootie:site pjaol$ svn diff src/documentation/content/xdocs/whoweare.xml
Index: src/documentation/content/xdocs/whoweare.xml
===================================================================
--- src/documentation/content/xdocs/whoweare.xml (revision 735927)
+++ src/documentation/content/xdocs/whoweare.xml (working copy)
@@ -31,6 +31,7 @@
 <section id="contrib"><title>Contrib Committers</title>
 <ul>
 <li><b>Wolfgang Hoschek</b> (whosc...@...)</li>
+<li><b>Patrick O'Leary</b> (pj...@...)</li>
 <li><b>Uwe Schindler</b> (uschind...@...)</li>
 <li><b>Andi Vajda</b> (va...@...)</li>
 <li><b>Karl Wettin</b> (ka...@...)</li>