Re: LocalLucene sorting issue
Ryan McKinley wrote:
> In order to get spatial lucene into solr, we need to figure out how to fix the memory leak described in: https://issues.apache.org/jira/browse/LUCENE-1304
> Reading the posts on LUCENE-1304, it seems to point to LUCENE-1483 as the _real_ solution, while LUCENE-1304 would just be a deprecated band-aid (for the record, band-aids are quite useful). Before delving into this again: it looks like LUCENE-1483 is finished, but I don't understand how it fixes the CustomSort stuff. Also, I don't see what the deprecated sorting stuff should be replaced with...
> thanks for any pointers
> ryan

The fix is that with LUCENE-1483, comparators are no longer cached, as long as you use the new API. The new API is FieldComparator (org.apache.lucene.search.FieldComparator), and you supply one with a FieldComparatorSource. FieldComparator may look a little complicated, but it's fairly straightforward for the primitive (non-String) types - you should be able to roughly copy one. There is a new SortField constructor that takes a FieldComparatorSource.

--
- Mark
http://www.lucidimagination.com

- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
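Roughly, the shape of the new API is a comparator that copies each competing hit's value into a "slot" and compares slot-to-slot, with nothing cached per reader. The sketch below mirrors the FieldComparator method contract (compare/compareBottom/copy/setBottom/value) but stands alone: the class name and the precomputed distances array are invented for illustration. A real implementation would extend org.apache.lucene.search.FieldComparator, be produced by a FieldComparatorSource, and be handed to the new SortField constructor.

```java
// A minimal slot-based comparator in the style of Lucene's FieldComparator:
// hit values are copied into "slots" and compared slot-to-slot, so nothing
// needs to be cached per reader (the source of the LUCENE-1304 leak).
public class DistanceComparator {
    private final double[] values;    // one slot per queue entry
    private final double[] distances; // per-doc values (illustrative stand-in)
    private double bottom;            // value of the weakest queued hit

    public DistanceComparator(int numHits, double[] distances) {
        this.values = new double[numHits];
        this.distances = distances;
    }

    public int compare(int slot1, int slot2) {
        return Double.compare(values[slot1], values[slot2]);
    }

    // Compare the weakest queued hit against a candidate doc.
    public int compareBottom(int doc) {
        return Double.compare(bottom, distances[doc]);
    }

    // A competitive doc gets its value copied into a slot.
    public void copy(int slot, int doc) {
        values[slot] = distances[doc];
    }

    public void setBottom(int slot) {
        bottom = values[slot];
    }

    public Double value(int slot) {
        return values[slot];
    }
}
```

The per-reader state (here the distances array) lives in plain fields of the comparator instance, not in a static cache keyed by reader, which is what made the old ScoreDocComparator setup leak.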
[jira] Commented: (LUCENE-1570) QueryParser.setAllowLeadingWildcard could provide finer granularity
[ https://issues.apache.org/jira/browse/LUCENE-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688494#action_12688494 ] Mark Miller commented on LUCENE-1570:
-
Yonik spit out a bit of a better answer while I typed - right, you do have access to the field in getWildcardQuery, and the leading check happens there, so you can override it. My brain always runs towards building the support in, but in this case it may be cleaner to leave it out anyway. It's somewhat of a niche concern. Just had the new QueryParser on my mind.

> QueryParser.setAllowLeadingWildcard could provide finer granularity
> ---
> Key: LUCENE-1570
> URL: https://issues.apache.org/jira/browse/LUCENE-1570
> Project: Lucene - Java
> Issue Type: Improvement
> Components: QueryParser
> Affects Versions: 2.4.1
> Reporter: Jonathan Watt
>
> It's great that Lucene now allows support for leading wildcards to be turned on. However, leading wildcard searches are more expensive, so it would be useful to be able to turn it on only for certain search fields. I'm specifically thinking of wiki searches where it may be too expensive to allow leading wildcards in the 'content:' field, but it would still be very useful to be able to selectively turn on support for 'path:' and perhaps other fields such as 'title:'. Would this be possible?

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (LUCENE-1570) QueryParser.setAllowLeadingWildcard could provide finer granularity
[ https://issues.apache.org/jira/browse/LUCENE-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688486#action_12688486 ] Mark Miller commented on LUCENE-1570:
-
I've wanted this in the past. It's certainly possible, but I am not sure how easy it would be to do with the current queryparser (been a long time since I have been there). There appears to be a new parser on the horizon though, and it sounds as if it will allow these types of additions much more elegantly (the current queryparser does not use a syntax tree representation, and it's kind of hairy to build on). If I remember right, the current QueryParser simply attaches semantic actions to grammar production rules - difficult to read, edit, and maintain - and has not been super friendly for building upon. Also if I remember right, I think this new parser will use abstract syntax trees, which lets you split up syntax and semantics, and also keep things a bit more modular - you can do things like have a pluggable syntax reader that feeds a pluggable query output writer. At least for the basics - it sounds like these guys have made something pretty cool, but I have not seen the code yet and have only a brief memory of its description. Point being, it can be done, I think it's useful, but it might make sense to see how much easier it can be done with this new parser.

> QueryParser.setAllowLeadingWildcard could provide finer granularity
> Key: LUCENE-1570 URL: https://issues.apache.org/jira/browse/LUCENE-1570
[jira] Commented: (LUCENE-1570) QueryParser.setAllowLeadingWildcard could provide finer granularity
[ https://issues.apache.org/jira/browse/LUCENE-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688482#action_12688482 ] Yonik Seeley commented on LUCENE-1570:
--
This is pretty easy to implement by overriding QueryParser.getWildcardQuery().

> QueryParser.setAllowLeadingWildcard could provide finer granularity
> Key: LUCENE-1570 URL: https://issues.apache.org/jira/browse/LUCENE-1570
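A sketch of that override's decision logic, kept standalone here: in a real subclass you would override getWildcardQuery(String field, String termText), which receives the field name, and either delegate to super or throw a ParseException based on a check like the one below. The field names 'path' and 'title' are purely illustrative.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class PerFieldLeadingWildcard {
    // fields where leading wildcards are allowed - an illustrative choice only
    static final Set<String> LEADING_OK =
            new HashSet<String>(Arrays.asList("path", "title"));

    // Mirrors the leading-wildcard check inside QueryParser.getWildcardQuery:
    // reject "*foo" / "?foo" unless the field opts in.
    static boolean allowWildcard(String field, String termText) {
        boolean leading = termText.startsWith("*") || termText.startsWith("?");
        return !leading || LEADING_OK.contains(field);
    }
}
```

This keeps the expensive leading-wildcard scan off large fields like 'content:' while still permitting it on cheap, short fields, which is exactly the per-field granularity the issue asks for.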
LocalLucene sorting issue
In order to get spatial lucene into solr, we need to figure out how to fix the memory leak described in: https://issues.apache.org/jira/browse/LUCENE-1304 Reading the posts on LUCENE-1304, it seems to point to LUCENE-1483 as the _real_ solution, while LUCENE-1304 would just be a deprecated band-aid (for the record, band-aids are quite useful). Before delving into this again: it looks like LUCENE-1483 is finished, but I don't understand how it fixes the CustomSort stuff. Also, I don't see what the deprecated sorting stuff should be replaced with... thanks for any pointers ryan
[jira] Created: (LUCENE-1570) QueryParser.setAllowLeadingWildcard could provide finer granularity
QueryParser.setAllowLeadingWildcard could provide finer granularity
---
Key: LUCENE-1570
URL: https://issues.apache.org/jira/browse/LUCENE-1570
Project: Lucene - Java
Issue Type: Improvement
Components: QueryParser
Affects Versions: 2.4.1
Reporter: Jonathan Watt

It's great that Lucene now allows support for leading wildcards to be turned on. However, leading wildcard searches are more expensive, so it would be useful to be able to turn it on only for certain search fields. I'm specifically thinking of wiki searches where it may be too expensive to allow leading wildcards in the 'content:' field, but it would still be very useful to be able to selectively turn on support for 'path:' and perhaps other fields such as 'title:'. Would this be possible?
Improve worst-case performance of TrieRange queries
Let me give an example to explain my idea - I'm using dates in my example, because it's easier to imagine :) Let's say we have the following posting lists. There are 20 docs in the index and an X means that a doc contains the corresponding term:

Jan  X X
Feb  XX X
Mar  X
Apr  XX
May  X
Jun
Jul  XX
Aug  X X
Sep  X
Oct  X
Nov  X X
Dec  X X

Then we index another term 'All'. It gets added for any document that has a numeric value in this bucket:

All  X XX

If the query is [Jun TO Jul], then we process the query normally (ORing terms Jun and Jul). If the query is [Feb TO Nov], then we basically translate that into All AND NOT (Jan OR Dec). Since you only evaluate the complement of the terms, you can (almost) double the worst-case performance.

Downsides:
- you have to have another BitSet in memory to perform the AND NOT operation, so it needs more memory
- this complement approach is only this simple for numeric fields where one document has only a single value; similar things are doable for multi-valued numeric fields, however more complex and possibly with less performance gain
- you need to index an additional term per bucket, so the index size increases slightly

Does this make sense? Maybe this has even been discussed already?

-Michael
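The [Feb TO Nov] rewrite above can be sketched with java.util.BitSet. The posting lists here are made-up stand-ins for the ones in the example, and the class name is hypothetical:

```java
import java.util.BitSet;

public class ComplementRangeDemo {
    static final int NUM_DOCS = 20;

    // Evaluate a range covering all buckets except `outside` as
    // All AND NOT (outside1 OR outside2 OR ...), e.g.
    // [Feb TO Nov] = All AND NOT (Jan OR Dec).
    static BitSet rangeByComplement(BitSet all, BitSet... outside) {
        BitSet excluded = new BitSet(NUM_DOCS);
        for (BitSet b : outside) {
            excluded.or(b);              // Jan OR Dec
        }
        BitSet result = (BitSet) all.clone();
        result.andNot(excluded);         // All AND NOT (...)
        return result;
    }

    static BitSet bits(int... docs) {
        BitSet b = new BitSet(NUM_DOCS);
        for (int d : docs) b.set(d);
        return b;
    }

    public static void main(String[] args) {
        BitSet jan = bits(0, 5);                  // docs containing 'Jan'
        BitSet dec = bits(9, 19);                 // docs containing 'Dec'
        BitSet all = bits(0, 2, 5, 7, 9, 12, 19); // docs with any value
        // prints {2, 7, 12}: every valued doc outside Jan and Dec
        System.out.println(rangeByComplement(all, jan, dec));
    }
}
```

Only the two excluded month terms are ORed, instead of the ten included ones, which is where the worst-case win comes from; the extra BitSet for `excluded` is the memory cost noted above.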
[jira] Commented: (LUCENE-1522) another highlighter
[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688454#action_12688454 ] Michael McCandless commented on LUCENE-1522:
bq. I think this is an unrealistic requirement in some cases (e.g. AND queries).
I agree.

> another highlighter
> ---
> Key: LUCENE-1522
> URL: https://issues.apache.org/jira/browse/LUCENE-1522
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/highlighter
> Reporter: Koji Sekiguchi
> Assignee: Michael McCandless
> Priority: Minor
> Fix For: 2.9
> Attachments: colored-tag-sample.png, LUCENE-1522.patch, LUCENE-1522.patch
>
> I've written this highlighter for my project to support a bi-gram token stream (a general token stream (e.g. WhitespaceTokenizer) is also supported; see test code in patch). The idea was inherited from my previous project with my colleague and LUCENE-644. This approach needs highlight fields to be TermVector.WITH_POSITIONS_OFFSETS, but is fast and can support N-grams. This depends on LUCENE-1448 to get refined term offsets.
> usage:
> {code:java}
> TopDocs docs = searcher.search( query, 10 );
> Highlighter h = new Highlighter();
> FieldQuery fq = h.getFieldQuery( query );
> for( ScoreDoc scoreDoc : docs.scoreDocs ){
>   // fieldName="content", fragCharSize=100, numFragments=3
>   String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc, "content", 100, 3 );
>   if( fragments != null ){
>     for( String fragment : fragments )
>       System.out.println( fragment );
>   }
> }
> {code}
> features:
> - fast for large docs
> - supports not only whitespace-based token streams, but also "fixed size" N-grams (e.g. (2,2), not (1,3)) (can solve LUCENE-1489)
> - supports PhraseQuery, phrase-unit highlighting with slops
> {noformat}
> q="w1 w2"
> w1 w2
> ---
> q="w1 w2"~1
> w1 w3 w2 w3 w1 w2
> {noformat}
> - highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
> - easy to apply patch due to independent package (contrib/highlighter2)
> - uses Java 1.5
> - uses query boost to score fragments (currently doesn't see idf, but it should be possible)
> - pluggable FragListBuilder
> - pluggable FragmentsBuilder
> to do:
> - term positions can be unnecessary when phraseHighlight==false
> - collect performance numbers
[jira] Commented: (LUCENE-1522) another highlighter
[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688451#action_12688451 ] Michael Busch commented on LUCENE-1522:
---
{quote} (Meaning, if you were to copy & paste the full excerpt you are looking at, index it as a document, would your current search match it). {quote}
I think this is an unrealistic requirement in some cases (e.g. AND queries). I agree it makes sense for phrases to show them entirely in a fragment (even if that means not showing the beginning of a sentence). But often you have only one or two lines of text to display an extract. Then it might be a better choice to show two decently sized fragments with some context around the highlighted terms, rather than showing e.g. 4 short fragments just to show all 4 highlighted query terms (e.g. for the query '+a +b +c +d').

> another highlighter
> Key: LUCENE-1522 URL: https://issues.apache.org/jira/browse/LUCENE-1522
[jira] Commented: (LUCENE-1561) Maybe rename Field.omitTf, and strengthen the javadocs
[ https://issues.apache.org/jira/browse/LUCENE-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688449#action_12688449 ] Mike Klaas commented on LUCENE-1561:
I agree that it is going to be almost impossible to convey that phrase queries don't work by renaming the flag. I agree with Eks Dev that a positive formulation is the only chance, although this deviates from the current omit* flags.
termPresenceOnly()
trackTermPresenceOnly()
onlyTermPresence()
omitEverythingButTermPresence() // just kidding

> Maybe rename Field.omitTf, and strengthen the javadocs
> --
> Key: LUCENE-1561
> URL: https://issues.apache.org/jira/browse/LUCENE-1561
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Affects Versions: 2.4.1
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: 2.9
> Attachments: LUCENE-1561.patch
>
> Spinoff from here:
> http://www.nabble.com/search-problem-when-indexed-using-Field.setOmitTf()-td22456141.html
> Maybe rename omitTf to something like omitTermPositions, and make it clear what queries will silently fail to work as a result.
[jira] Commented: (LUCENE-1522) another highlighter
[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688448#action_12688448 ] Mark Miller commented on LUCENE-1522:
-
{quote}But that's really quite a serious problem; it's the kind that immediately erodes users' trust. Though if this user had used SpanScorer it would have been fixed (right?).{quote}
Right - my point was more that it was a common complaint and has been solved in one way or another for a long time. Even back when that post occurred, there was a JIRA highlighter that worked with phrase queries, I think. There have been at least one or two besides the SpanScorer.
{quote}Is there any reason not to use SpanScorer (vs QueryScorer)?{quote}
It is slower when working with position-sensitive clauses - because it actually does some work. For non-position-sensitive terms, it's the same speed as the standard scorer. Makes sense to me to always use it, but if you don't care and want every term highlighted, why pay the price, I guess...
{quote} Well... I'd still like to explore some way to better integrate w/ core (just don't have enough time, but maybe if I keep talking about it here, someone else will get the itch + time . {quote}
Right - don't get me wrong - I was just getting thoughts in my head down. These types of brain dumps you higher-level guys do definitely lead to work getting done - the SpanScorer came directly from these types of discussions, and quite a bit later - the original discussion happened before my time.
{quote} Well this is open source after all. Things get "naturally prioritized". A lot of the sweat that is given has been fragmented by the 3 or 4 alternate highlighters. Yeah also another common theme in open-source development, though it's in good company: evolution and capitalism share the same "flaw". {quote}
Right. I suppose I was just suggesting that something more practical might make more sense (more musing than suggesting).
And practical in terms of how much activity we have seen in the highlighter area (fairly low, and not usually to the extent needed to get something committed and in use). And the split work on the highlighters is fine - but if we had the right highlighter base, more work could have been concentrated on the highlighter that's most used. Not really a complaint, but an idea for the future. If we can get something better going, perhaps we can get to the point where people work with the current implementation rather than creating a new one every time.

> another highlighter
> Key: LUCENE-1522 URL: https://issues.apache.org/jira/browse/LUCENE-1522
Re: Modularization
On 23-Mar-09, at 2:41 PM, Michael McCandless wrote:
> I agree, but at least we need some clear criteria so the future decision process is more straightforward. Towards that... it seems like there are good reasons why something should be put into contrib:
> * It uses a version of the JDK higher than what core can allow
> * It has external dependencies
> * Its quality is debatable (or at least not proven)
> * It's of somewhat narrow usage/interest (eg: contrib/bdb)
> But I don't think "it doesn't have to be in core" (the "software modularity" goal) is the right reason to put something in contrib.

Agreed. I don't think that building on the existing 'contrib' is the way to go. Frequently-used, high-quality components should be more properly part of "Lucene", whether that means that they move to core, or into a new blessed modules section.

> Getting back to the original topic: Trie(Numeric)RangeFilter runs on JDK 1.4, has no external dependencies, looks to be high quality, and likely will have wide appeal. Doesn't it belong in core?

+1. It is important that Lucene come blessed with very good quality defaults. Fast range queries are a common requirement. Similarly, I wouldn't be happy to have a new, wicked QueryParser relegated to contrib, where it is unlikely to be found by non-savvy users. At the very least, I agree with Michael that it should be findable in the same "place".

It does make sense to separate the machinery/building blocks (base Query, Weight, Scorer, Filter classes, Similarity interface, etc.) from the Query/Filter implementations that use them. But whether this is done by putting them in separate directories or via a global core/modules distinction seems unimportant.

-Mike
[jira] Commented: (LUCENE-1522) another highlighter
[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688439#action_12688439 ] Michael McCandless commented on LUCENE-1522:
bq. I think you are reading more into that than I see - that guy is just frustrated that PhraseQueries don't highlight correctly
But that's really quite a serious problem; it's the kind that immediately erodes users' trust. Though if this user had used SpanScorer it would have been fixed (right?). Is there any reason not to use SpanScorer (vs QueryScorer)? The "final inch" (search UI) is exceptionally important!
bq. When users see the PhraseQuery look right, I haven't seen any other repeated complaints really.
OK.
bq. And I think we have positional solved fairly well with the current API - it's just too darn slow.
Well... I'd still like to explore some way to better integrate w/ core (just don't have enough time, but maybe if I keep talking about it here, someone else will get the itch + time ;). I think an IndexReader impl around loaded TermVectors can get us OK performance (no re-analysis nor linear scan of a resynthesized TokenStream).
bq. Not that I am against things being sweet and perfect, and getting exact matches, but there has been lots of talk in the past about integrating the highlighter into core and making things really fast and efficient - and generally it comes down to what work actually gets done (and all this stuff ends up at the hard end of the pool).
Well this is open source after all. Things get "naturally prioritized".
bq. A lot of the sweat that is given has been fragmented by the 3 or 4 alternate highlighters.
Yeah, also another common theme in open-source development, though it's in good company: evolution and capitalism share the same "flaw".
> another highlighter
> Key: LUCENE-1522 URL: https://issues.apache.org/jira/browse/LUCENE-1522
Re: Modularization
>> I think we are considering this for Lucene 3.0 (should be the release after next) which will allow Java 1.5.
>
> So where are you going to put 1.6 and 1.7 contribs?

This is a good point: core Lucene must remain on "old" JREs, but we should not force all contrib packages to do so.

> - contrib has always had a lower bar and stuff was committed under that lower bar - there should be no blanket promotion.

OK so that was the past, and I agree. I assume by this you're also advocating that going forward this is an ongoing reason to put something into contrib? I agree with that. Ie, if a contribution is made, but it's not clear the quality is up to core's standards, I would much rather have some place to commit it (contrib) than to reject it, because once it has a home here, it has a chance to gain interest, grow, improve, etc. But: do you think, for this reason, the web site should continue to present the dichotomy?

> - contrib items may have different dependencies... putting it all under the same source root can make a developer's job harder

That's a good point & criterion for leaving something in contrib.

> - many contrib items are less related to lucene-java core indexing and searching... if there is no contrib, then they don't belong in the lucene-java project at all.

But most contrib packages are very related to Lucene. Though I agree some contrib packages likely have very narrow appeal/usage (eg, contrib/db, for using BDB as the raw store for an index). And I agree (as above): I would like to have somewhere for contributions to go, rather than reject them.

> - right now it's clear - core can't have dependencies on non-core classes. If everything is stuck in the same source tree, that goes away.

Well... this gets to Hoss's motivation, which I appreciate, to keep the core tiny. But that's just good software design and you don't need a divorced directory structure to achieve that.

> I think there are a lot of benefits to continue considering very carefully if something is "core" or not.

I agree, but at least we need some clear criteria so the future decision process is more straightforward. Towards that... it seems like there are good reasons why something should be put into contrib:

* It uses a version of the JDK higher than what core can allow
* It has external dependencies
* Its quality is debatable (or at least not proven)
* It's of somewhat narrow usage/interest (eg: contrib/bdb)

But I don't think "it doesn't have to be in core" (the "software modularity" goal) is the right reason to put something in contrib. Getting back to the original topic: Trie(Numeric)RangeFilter runs on JDK 1.4, has no external dependencies, looks to be high quality, and likely will have wide appeal. Doesn't it belong in core?

Mike
[jira] Commented: (LUCENE-1561) Maybe rename Field.omitTf, and strengthen the javadocs
[ https://issues.apache.org/jira/browse/LUCENE-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688429#action_12688429 ] Eks Dev commented on LUCENE-1561:
-
maybe something along the lines of:
usePureBooleanPostings()
minimalInvertedList()

> Maybe rename Field.omitTf, and strengthen the javadocs
> Key: LUCENE-1561 URL: https://issues.apache.org/jira/browse/LUCENE-1561
[jira] Issue Comment Edited: (LUCENE-1522) another highlighter
[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688419#action_12688419 ] Mark Miller edited comment on LUCENE-1522 at 3/23/09 2:12 PM: -- I think you are reading more into that than I see - that guy is just frustrated that PhraseQueries don't highlight correctly. That was/is a common occurrence and you can find tons of examples. There are one or two JIRA highlighters that address it, and then there is the Span highlighter (more interestingly, there is a link to the birth of the Span highlighter idea on that page - thanks M. Harwood). Once users see the PhraseQuery look right, I haven't seen any other repeated complaints really. While it would be nice to match boolean logic fully, I almost don't think it's worth the effort. You likely have an interest in those terms anyway - it's not a given that the terms that caused the match (non positional) matter. I have not seen a complaint on that one - mostly just positional type stuff. And I think we have positional solved fairly well with the current API - it's just too darn slow. Not that I am against things being sweet and perfect, and getting exact matches, but there has been lots of talk in the past about integrating the highlighter into core and making things really fast and efficient - and generally it comes down to what work actually gets done (and all this stuff ends up at the hard end of the pool). When I wrote the SpanScorer, many times it was discussed how things should *really* be done. Most methods involved working with core - but what has been there for a couple of years now is the SpanScorer that plugs into the current highlighter API, and nothing else has made any progress. Not really an argument, just kind of thinking out loud at this point... 
I'm all for improving the speed and accuracy of the highlighter at the end of the day, but it's a tall order considering how much attention the Highlighter has managed to receive in the past. It's large on ideas and low on sweat. *edit* A lot of the sweat that is given has been fragmented by the 3 or 4 alternate highlighters. > another highlighter > --- > > Key: LUCENE-1522 > URL: https://issues.apache.org/jira/browse/LUCENE-1522 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/highlighter >Reporter: Koji Sekiguchi >Assignee: Michael McCandless >Priority: Minor > Fix For: 2.9 > > Attachments: colored-tag-sample.png, LUCENE-1522.patch, > LUCENE-1522.patch > > > I've written this highlighter for my project to support bi-gram token stream > (general token stream (e.g. WhitespaceTokenizer) also supported. see test > code in patch). The idea was inherited from my previous project with my > colleague and LUCENE-644. This approach needs highlight fields to be > TermVector.WITH_POSITIONS_OFFSETS, but is fast and can support N-gram
[jira] Commented: (LUCENE-1522) another highlighter
[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688419#action_12688419 ] Mark Miller commented on LUCENE-1522: - I think you are reading more into that than I see - that guy is just frustrated that PhraseQueries don't highlight correctly. That was/is a common occurrence and you can find tons of examples. There are one or two JIRA highlighters that address it, and then there is the Span highlighter (more interestingly, there is a link to the birth of the Span highlighter idea on that page - thanks M. Harwood). Once users see the PhraseQuery look right, I haven't seen any other repeated complaints really. While it would be nice to match boolean logic fully, I almost don't think it's worth the effort. You likely have an interest in those terms anyway - it's not a given that the terms that caused the match (non positional) matter. I have not seen a complaint on that one - mostly just positional type stuff. And I think we have positional solved fairly well with the current API - it's just too darn slow. Not that I am against things being sweet and perfect, and getting exact matches, but there has been lots of talk in the past about integrating the highlighter into core and making things really fast and efficient - and generally it comes down to what work actually gets done (and all this stuff ends up at the hard end of the pool). When I wrote the SpanScorer, many times it was discussed how things should *really* be done. Most methods involved working with core - but what has been there for a couple of years now is the SpanScorer that plugs into the current highlighter API, and nothing else has made any progress. Not really an argument, just kind of thinking out loud at this point... I'm all for improving the speed and accuracy of the highlighter at the end of the day, but it's a tall order considering how much attention the Highlighter has managed to receive in the past. 
It's large on ideas and low on sweat. > another highlighter > --- > > Key: LUCENE-1522 > URL: https://issues.apache.org/jira/browse/LUCENE-1522 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/highlighter >Reporter: Koji Sekiguchi >Assignee: Michael McCandless >Priority: Minor > Fix For: 2.9 > > Attachments: colored-tag-sample.png, LUCENE-1522.patch, > LUCENE-1522.patch > > > I've written this highlighter for my project to support bi-gram token stream > (general token stream (e.g. WhitespaceTokenizer) also supported. see test > code in patch). The idea was inherited from my previous project with my > colleague and LUCENE-644. This approach needs highlight fields to be > TermVector.WITH_POSITIONS_OFFSETS, but is fast and can support N-grams. This > depends on LUCENE-1448 to get refined term offsets. > usage: > {code:java} > TopDocs docs = searcher.search( query, 10 ); > Highlighter h = new Highlighter(); > FieldQuery fq = h.getFieldQuery( query ); > for( ScoreDoc scoreDoc : docs.scoreDocs ){ > // fieldName="content", fragCharSize=100, numFragments=3 > String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc, > "content", 100, 3 ); > if( fragments != null ){ > for( String fragment : fragments ) > System.out.println( fragment ); > } > } > {code} > features: > - fast for large docs > - supports not only whitespace-based token stream, but also "fixed size" > N-gram (e.g. 
(2,2), not (1,3)) (can solve LUCENE-1489) > - supports PhraseQuery, phrase-unit highlighting with slops > {noformat} > q="w1 w2" > w1 w2 > --- > q="w1 w2"~1 > w1 w3 w2 w3 w1 w2 > {noformat} > - highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS > - easy to apply patch due to independent package (contrib/highlighter2) > - uses Java 1.5 > - looks query boost to score fragments (currently doesn't see idf, but it > should be possible) > - pluggable FragListBuilder > - pluggable FragmentsBuilder > to do: > - term positions can be unnecessary when phraseHighlight==false > - collects performance numbers -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1410) PFOR implementation
[ https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688409#action_12688409 ] Paul Elschot commented on LUCENE-1410: -- The encoding in the Google research slides is another one. They use 2 bits prefixing the first byte and indicating the number of bytes used for the encoded number (1-4), and then they group 4 of those prefixes together to get a single byte of 4 prefixes followed by the non-prefixed bytes of the 4 encoded numbers. This requires a 256-way switch (indexed jump) for every 4 encoded numbers, and I would expect that jump to limit performance somewhat when compared to pfor, which has a 32-way switch for 32/64/128 encoded numbers. But since the prefixes only indicate the numbers of bytes used for the encoded numbers, no shifts and masks are needed, only byte moves. So it could well be worthwhile to give this encoding a try, too, especially for lists of numbers shorter than 16 or 32. > PFOR implementation > --- > > Key: LUCENE-1410 > URL: https://issues.apache.org/jira/browse/LUCENE-1410 > Project: Lucene - Java > Issue Type: New Feature > Components: Other >Reporter: Paul Elschot >Priority: Minor > Attachments: autogen.tgz, LUCENE-1410b.patch, LUCENE-1410c.patch, > LUCENE-1410d.patch, LUCENE-1410e.patch, TermQueryTests.tgz, TestPFor2.java, > TestPFor2.java, TestPFor2.java > > Original Estimate: 21840h > Remaining Estimate: 21840h > > Implementation of Patched Frame of Reference. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
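For concreteness, the group scheme Paul describes (2-bit length prefixes for four values, packed into one selector byte, followed by the unprefixed value bytes) can be sketched as below. This is a simplified, loop-based illustration, not code from any Lucene patch; the class and method names are made up, and little-endian byte order is an arbitrary choice of the sketch:

```java
import java.util.Arrays;

public class GroupVarint {
    // Encode 4 non-negative ints: one selector byte holding four 2-bit
    // (byteLength - 1) prefixes, then each value's bytes, little-endian.
    static byte[] encode4(int[] v) {
        byte[] out = new byte[1 + 16]; // worst case: 4 values x 4 bytes
        int pos = 1, sel = 0;
        for (int i = 0; i < 4; i++) {
            int len = numBytes(v[i]);
            sel |= (len - 1) << (i * 2);
            for (int b = 0; b < len; b++) {
                out[pos++] = (byte) (v[i] >>> (8 * b));
            }
        }
        out[0] = (byte) sel;
        return Arrays.copyOf(out, pos);
    }

    // Decode 4 ints: read the selector, then reassemble each value from
    // its announced number of bytes - no shifts by variable masks needed,
    // just byte moves, which is the appeal of the scheme.
    static int[] decode4(byte[] in) {
        int[] v = new int[4];
        int sel = in[0] & 0xFF, pos = 1;
        for (int i = 0; i < 4; i++) {
            int len = ((sel >>> (i * 2)) & 3) + 1;
            int x = 0;
            for (int b = 0; b < len; b++) {
                x |= (in[pos++] & 0xFF) << (8 * b);
            }
            v[i] = x;
        }
        return v;
    }

    static int numBytes(int x) {
        if ((x >>> 8) == 0) return 1;
        if ((x >>> 16) == 0) return 2;
        if ((x >>> 24) == 0) return 3;
        return 4;
    }
}
```

A production decoder would replace the inner loop with the 256-way indexed jump on the selector byte that Paul mentions, unrolling the byte moves per selector value.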
[jira] Commented: (LUCENE-1522) another highlighter
[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688408#action_12688408 ] Michael McCandless commented on LUCENE-1522: Randomly searching in Google I came across this: http://stackoverflow.com/questions/82151/is-there-a-fast-accurate-highlighter-for-lucene ...which emphasizes how important it is that the highlighter only highlight "matching" fragdocs when possible. (Meaning, if you were to copy & paste the full excerpt you are looking at, index it as a document, would your current search match it). > another highlighter > --- > > Key: LUCENE-1522 > URL: https://issues.apache.org/jira/browse/LUCENE-1522 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/highlighter >Reporter: Koji Sekiguchi >Assignee: Michael McCandless >Priority: Minor > Fix For: 2.9 > > Attachments: colored-tag-sample.png, LUCENE-1522.patch, > LUCENE-1522.patch > > > I've written this highlighter for my project to support bi-gram token stream > (general token stream (e.g. WhitespaceTokenizer) also supported. see test > code in patch). The idea was inherited from my previous project with my > colleague and LUCENE-644. This approach needs highlight fields to be > TermVector.WITH_POSITIONS_OFFSETS, but is fast and can support N-grams. This > depends on LUCENE-1448 to get refined term offsets. > usage: > {code:java} > TopDocs docs = searcher.search( query, 10 ); > Highlighter h = new Highlighter(); > FieldQuery fq = h.getFieldQuery( query ); > for( ScoreDoc scoreDoc : docs.scoreDocs ){ > // fieldName="content", fragCharSize=100, numFragments=3 > String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc, > "content", 100, 3 ); > if( fragments != null ){ > for( String fragment : fragments ) > System.out.println( fragment ); > } > } > {code} > features: > - fast for large docs > - supports not only whitespace-based token stream, but also "fixed size" > N-gram (e.g. 
(2,2), not (1,3)) (can solve LUCENE-1489) > - supports PhraseQuery, phrase-unit highlighting with slops > {noformat} > q="w1 w2" > w1 w2 > --- > q="w1 w2"~1 > w1 w3 w2 w3 w1 w2 > {noformat} > - highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS > - easy to apply patch due to independent package (contrib/highlighter2) > - uses Java 1.5 > - looks query boost to score fragments (currently doesn't see idf, but it > should be possible) > - pluggable FragListBuilder > - pluggable FragmentsBuilder > to do: > - term positions can be unnecessary when phraseHighlight==false > - collects performance numbers -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1561) Maybe rename Field.omitTf, and strengthen the javadocs
[ https://issues.apache.org/jira/browse/LUCENE-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688387#action_12688387 ] Michael McCandless commented on LUCENE-1561: Naming is the hardest part!! > Maybe rename Field.omitTf, and strengthen the javadocs > -- > > Key: LUCENE-1561 > URL: https://issues.apache.org/jira/browse/LUCENE-1561 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Affects Versions: 2.4.1 >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 2.9 > > Attachments: LUCENE-1561.patch > > > Spinoff from here: > > http://www.nabble.com/search-problem-when-indexed-using-Field.setOmitTf()-td22456141.html > Maybe rename omitTf to something like omitTermPositions, and make it clear > what queries will silently fail to work as a result. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1561) Maybe rename Field.omitTf, and strengthen the javadocs
[ https://issues.apache.org/jira/browse/LUCENE-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688385#action_12688385 ] Otis Gospodnetic commented on LUCENE-1561: -- Might be good to keep a consistent name across Lucene/Solr. More info coming up in SOLR-1079. > Maybe rename Field.omitTf, and strengthen the javadocs > -- > > Key: LUCENE-1561 > URL: https://issues.apache.org/jira/browse/LUCENE-1561 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Affects Versions: 2.4.1 >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 2.9 > > Attachments: LUCENE-1561.patch > > > Spinoff from here: > > http://www.nabble.com/search-problem-when-indexed-using-Field.setOmitTf()-td22456141.html > Maybe rename omitTf to something like omitTermPositions, and make it clear > what queries will silently fail to work as a result. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Modularization
On Mon, Mar 23, 2009 at 22:13, Mark Miller wrote: > Earwin Burrfoot wrote: >>> >>> - contrib has always had a lower bar and stuff was committed under >>> that lower bar - there should be no blanket promotion. >>> - contrib items may have different dependencies... putting it all >>> under the same source root can make a developers job harder >>> - many contrib items are less related to lucene-java core indexing and >>> searching... if there is no contrib, then they don't belong in the >>> lucene-java project at all. >>> - right now it's clear - core can't have dependencies on non-core >>> classes. If everything is stuck in the same source tree, that goes >>> away. >>> >> >> Adding to this, afaik contribs have no java 1.4 restriction. If you >> merge them into the core, you must either enforce it for contribs, or >> lift it from the core. I think both variants may be a reason for >> several heart attacks :) >> One could argue that five years after 1.5 was released Lucene is going >> to use it, so the point is no longer relevant. Sorry, 1.7 is just >> behind the door. >> >> > > I think we are considering this for Lucene 3.0 (should be the release after > next) which will allow Java 1.5. So where are you going to put 1.6 and 1.7 contribs? -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785 - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Modularization
Earwin Burrfoot wrote: - contrib has always had a lower bar and stuff was committed under that lower bar - there should be no blanket promotion. - contrib items may have different dependencies... putting it all under the same source root can make a developers job harder - many contrib items are less related to lucene-java core indexing and searching... if there is no contrib, then they don't belong in the lucene-java project at all. - right now it's clear - core can't have dependencies on non-core classes. If everything is stuck in the same source tree, that goes away. Adding to this, afaik contribs have no java 1.4 restriction. If you merge them into the core, you must either enforce it for contribs, or lift it from the core. I think both variants may be a reason for several heart attacks :) One could argue that five years after 1.5 was released Lucene is going to use it, so the point is no longer relevant. Sorry, 1.7 is just behind the door. I think we are considering this for Lucene 3.0 (should be the release after next) which will allow Java 1.5. - Mark - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Modularization
> - contrib has always had a lower bar and stuff was committed under > that lower bar - there should be no blanket promotion. > - contrib items may have different dependencies... putting it all > under the same source root can make a developers job harder > - many contrib items are less related to lucene-java core indexing and > searching... if there is no contrib, then they don't belong in the > lucene-java project at all. > - right now it's clear - core can't have dependencies on non-core > classes. If everything is stuck in the same source tree, that goes > away. Adding to this, afaik contribs have no java 1.4 restriction. If you merge them into the core, you must either enforce it for contribs, or lift it from the core. I think both variants may be a reason for several heart attacks :) One could argue that five years after 1.5 was released Lucene is going to use it, so the point is no longer relevant. Sorry, 1.7 is just behind the door. -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785 - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Modularization
Are you arguing for no change, Yonik? I agree with all of your points in any case. What appeals to me most so far is: Take the best of contrib and up its status to something like "modules". Equal to core, different requirements, dependencies, etc. Perhaps take queryparser out of core, but frankly I wouldn't mind just leaving core as it is. Reintroduce the sandbox (I believe contrib was the sandbox, part of the lower-bar history) and put lesser contrib there along with new stuff that's unproven. Contrib doesn't appeal to me as a name anyway. That would give core, modules, and the sandbox (perhaps sandbox is a module?). Things could move from sandbox to core or the modules. Modules get new requirements similar to core - back-compat guarantees and CHANGES.txt per module. Yonik Seeley wrote: On Mon, Mar 23, 2009 at 11:10 AM, Michael McCandless wrote: 4. Move contrib/* under src/java/*, updating the javadocs to state back compatibility promises per class/package. - contrib has always had a lower bar and stuff was committed under that lower bar - there should be no blanket promotion. - contrib items may have different dependencies... putting it all under the same source root can make a developer's job harder - many contrib items are less related to lucene-java core indexing and searching... if there is no contrib, then they don't belong in the lucene-java project at all. - right now it's clear - core can't have dependencies on non-core classes. If everything is stuck in the same source tree, that goes away. I think there are a lot of benefits to continue considering very carefully if something is "core" or not. -Yonik - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Is TopDocCollector's collect() implementation correct?
Shai Erera wrote: > As a side comment, why not add setNextReader to HitCollector and > then a getDocId(int doc) method which will do the doc + base > arithmetic? One problem is this breaks back compatibility on any current subclasses of HitCollector. Another problem is: not all collectors would need to add the base on each doc. EG a collector that puts hits into separate pqueues per segment could defer the addition until the end when only the top results are pulled out of each pqueue. Also, I am concerned about the method call overhead. This is the absolute ultimate hot spot for Lucene and we should worry about causing even a single added instruction in this path. That said... I would like to [eventually] change the collection API along the lines of what Marvin proposed for "Matcher" in Lucy, here: http://markmail.org/message/jxshhiqr6wvq77xu Specifically, I think it should be the collector's job to ask for the score for this doc, rather than Lucene's job to pre-compute it, so that collectors that don't need the score won't waste CPU. EG, if you are sorting by field (and don't present the relevance score) you shouldn't compute it. Then, we could add other "somewhat expensive" things you might retrieve, such as a way to ask which terms participated in the match (discussed today on java-user), and/or all term positions that participated (discussed in LUCENE-1522). EG, a top doc collector could choose to call these methods only when the doc was competitive. > Anyway, I don't want to add topDocs and getTotalHits to > HitCollector, it will destroy its generic purpose. I agree. > An interface is also problematic, as it just means all of these > collectors have these methods declared, but they need to implement > them. An abstract class grants you w/ both. I'm confused on this objection -- only collectors that do let you ask for the top N set of docs would implement this interface? (Ie it'd only be the TopXXXCollector's that'd implement the interface). 
While interfaces clearly have the future problem of back-compatibility, this case may be simple enough to make an exception. > So it looks like HitCollector itself is "deprecated" as far as the > Lucene core code sees it. I think HitCollector has a purpose, which is to be the simplest way to make a custom collector. Ie I think it makes sense to offer a simple way and a high performance way. Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
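The pull model Mike describes - the collector asking for the score only when it needs it, instead of having it pre-computed for every hit - might look roughly like the sketch below. All the names here are hypothetical, loosely after Marvin's "Matcher" proposal for Lucy; this is not an actual Lucene API:

```java
public class PullModelSketch {
    // Hypothetical: the matcher exposes the score lazily; nothing is
    // computed unless a collector asks.
    interface Matcher {
        float score();
    }

    static abstract class PullCollector {
        protected Matcher matcher;
        void setMatcher(Matcher m) { this.matcher = m; }
        abstract void collect(int doc);
    }

    // A collector that only counts hits never triggers scoring at all -
    // e.g. a sort-by-field collector that doesn't present relevance.
    static class CountingCollector extends PullCollector {
        int hits;
        void collect(int doc) { hits++; }
    }

    // A collector that ranks by score pays for scoring, once per hit,
    // and could equally defer the call until the doc is competitive.
    static class MaxScoreCollector extends PullCollector {
        float max = Float.NEGATIVE_INFINITY;
        void collect(int doc) { max = Math.max(max, matcher.score()); }
    }
}
```

The same lazy pattern would extend to the "somewhat expensive" extras Mike mentions - matching terms or positions - as further pull methods on the matcher that a top-N collector calls only for competitive docs.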
Re: Is TopDocCollector's collect() implementation correct?
ok I missed 1483 completely. As a side comment, why not add setNextReader to HitCollector and then a getDocId(int doc) method which will do the doc + base arithmetic? I think it's very easy for someone to forget to add that (+ base) to doc. You could then just change TopDocCollector to call getDocId() instead of duplicating it into TopScoreDocCollector. Isn't that something you'd want all HitCollector implementations to use? Considering some extensions of HitCollector we have - we will now probably want to change them to extend MultiReaderHitCollector, but we'll have to remember to do that +base arithmetic everywhere, instead of calling getDocId(). I understand that changing the call to getDocId is the same as adding "+ base", from an effort perspective, but I think it's better this way. It does involve an additional method call, but I wonder how well compilers will handle that. Anyway, I don't want to add topDocs and getTotalHits to HitCollector, it will destroy its generic purpose. An interface is also problematic, as it just means all of these collectors have these methods declared, but they still need to implement them. An abstract class grants you both. So in case you agree that the logic of MultiReaderHitCollector can (and should?) be in HitCollector, we can create an abstract class called ScoringCollector (or, if nobody objects, TopDocsCollector) which will implement these two methods. In case you disagree, we can have that abstract class extend MultiReaderHitCollector instead. I'm in favor of the first option since, at least as it looks in the code, HitCollector is not extended by any class anymore, except TopDocCollector which is marked as deprecated, and 3 anonymous implementations. So it looks like HitCollector itself is "deprecated" as far as the Lucene core code sees it. What do you think? Shai On Mon, Mar 23, 2009 at 4:43 PM, Michael McCandless < luc...@mikemccandless.com> wrote: > > If we're already creating a new TopScoreDocCollector (when was it > > added? 
I must have been dozing off while this happened...) > > This was LUCENE-1483. > > > How about if we introduce an abstract ScoringCollector (about the > > name later) which implements topDocs() and getTotalHits() and there > > will be several implementations of it, such as: > > TopScoreDocCollector, which sorts the documents by their score, in > > descending order only, TopFieldDocCollector - for sorting by fields, > > and additional sort-by collectors. > > This sounds good... but the challenge is we also need to get both > HitCollector and MultiReaderHitCollector in there. > > HitCollector is the simplest way to create a custom collector. > MultiReaderHitCollector (added with LUCENE-1483) is the more > performant way, since it lets your collector operate per-segment. All > non-deprecated core collectors in Lucene now subclass > MultiReaderHitCollector. > > So would we make separate subclasses for each of them to add > getTotalHits() / topDocs()? EG TopDocsHitCollector and > TopDocsMultiReaderHitCollector? It's getting confusing. > > Or maybe we just add totalHits() and topDocs() to HitCollector even > though for advanced case (non-top-N-collection) the methods would not > be used? > > Or... maybe this is a time when an interface is the lesser evil: we > could make a TopDocs interface that the necessary classes implement? > > Mike > > - > To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-dev-h...@lucene.apache.org > >
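The "+ base" bookkeeping under discussion can be sketched as below. The signatures are simplified and hypothetical (the real MultiReaderHitCollector.setNextReader also receives the IndexReader, and collect takes a score); the point is just Shai's suggestion of centralizing the arithmetic in a helper so subclasses can't forget it:

```java
public class DocBaseSketch {
    // Stand-in for the per-segment collector of LUCENE-1483: collect()
    // sees segment-relative doc ids, and setNextReader() announces each
    // segment's starting offset within the combined index.
    static abstract class SegmentCollector {
        protected int base;
        void setNextReader(int docBase) { base = docBase; }
        // The suggested helper: one place for the easy-to-forget "+ base".
        protected int getDocId(int doc) { return base + doc; }
        abstract void collect(int doc);
    }

    // A collector that records index-wide doc ids via the helper.
    static class GlobalIdCollector extends SegmentCollector {
        final java.util.List<Integer> docs = new java.util.ArrayList<>();
        void collect(int doc) { docs.add(getDocId(doc)); }
    }
}
```

Mike's counterpoint fits this sketch too: a collector keeping one pqueue per segment would skip getDocId() during collection entirely and add the base only when the final top results are pulled out, which is one reason not to bake the addition into the base class unconditionally.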
[jira] Resolved: (LUCENE-1555) Deadlock while optimize
[ https://issues.apache.org/jira/browse/LUCENE-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-1555. Resolution: Incomplete Need more details here. > Deadlock while optimize > --- > > Key: LUCENE-1555 > URL: https://issues.apache.org/jira/browse/LUCENE-1555 > Project: Lucene - Java > Issue Type: Bug > Components: Index >Affects Versions: 2.4 > Environment: ubuntu 8.04, java 1.6 update 07, Lucene 2.4.0 >Reporter: Stefan Heidrich >Assignee: Michael McCandless > > Sometimes after starting the thread with the indexer, the thread will hang in > the following threads. > Thread [Lucene Merge Thread #0] (Ausgesetzt) > IndexWriter.commitMerge(MergePolicy$OneMerge, SegmentMerger, int) Line: > 3751 > IndexWriter.mergeMiddle(MergePolicy$OneMerge) Line: 4240 > IndexWriter.merge(MergePolicy$OneMerge) Line: 3877 > ConcurrentMergeScheduler.doMerge(MergePolicy$OneMerge) Line: 205 > ConcurrentMergeScheduler$MergeThread.run() Line: 260 > Thread [Indexer] (Ausgesetzt) > Object.wait(long) Line: not available [native Methode] > IndexWriter.doWait() Line: 4491 > IndexWriter.optimize(int, boolean) Line: 2268 > IndexWriter.optimize(boolean) Line: 2203 > IndexWriter.optimize() Line: 2183 > Indexer.run() Line: 263 > If you need more informations, please let me know. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Modularization
On Mon, Mar 23, 2009 at 11:10 AM, Michael McCandless wrote: > 4. Move contrib/* under src/java/*, updating the javadocs to state > back compatibility promises per class/package. - contrib has always had a lower bar and stuff was committed under that lower bar - there should be no blanket promotion. - contrib items may have different dependencies... putting it all under the same source root can make a developer's job harder - many contrib items are less related to lucene-java core indexing and searching... if there is no contrib, then they don't belong in the lucene-java project at all. - right now it's clear - core can't have dependencies on non-core classes. If everything is stuck in the same source tree, that goes away. I think there are a lot of benefits to continue considering very carefully if something is "core" or not. -Yonik - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Modularization
Michael Busch wrote: >> And I don't think the sudden separation of "core" vs "contrib" >> should be so prominent (or even visible); it's really a detail of >> how we manage source control. > >> When looking at the website I'd like read that Lucene can do hit >> highlighting, powerful query parsing, spell checking, analyze >> different languages, etc. I could care less that some of these >> happen to live under a "contrib" subdirectory somewhere in the >> source control system. > > OK, so I think we all agree about the packaging. But I believe it is > also important how the source code is organized. Maybe Lucene > consumers don't care too much, however, Lucene is an open source > project. So we also want to attract possible contributors with a > nicely organized code base. If there is a clear separation between > the different components on a source code level, becoming familiar > with Lucene as a contributor might not be so overwhelming. +1 We want the source code to be well organized: consumability by Lucene developers (not just Lucene users) is also important for Lucene's future growth. > Besides that, I think a one-to-one mapping between the packaging and > the source code has no disadvantages. (and it would certainly make > the build scripts easier!) Right. So, towards that... why even break out contrib vs core, in source control? Can't we simply migrate contrib/* into core, in the right places? >> Could we, instead, adopt some standard way (in the package >> javadocs) of stating the maturity/activity/back compat policies/etc >> of a given package? > > This makes sense; e.g. we could release new modules as beta versions > (= use at own risk, no backwards-compatibility). In fact we already have a 2.9 Jira issue opened to better document the back-compat/JDK version requirements of all packages. 
I think, like we've done with core lately when a new feature is added, we could have the default assumption be full back compatibility, but then those classes/methods/packages that are very new and may change would simply say so clearly in their javadocs.

> And if we start a new module (e.g. a GSoC project) we could exclude it
> from a release easily if it's truly experimental and not in a
> release-able state.

Right.

>> So I think the beginnings of a rough proposal is taking shape, for 3.0:
>>
>> 1. Fix the web site to give a better intro to Lucene's features,
>>    without exposing the (false, to the Lucene consumer) core
>>    vs. contrib distinction.
>>
>> 2. When releasing, we make a single JAR holding core & contrib
>>    classes for a given area. The final JAR files don't contain a
>>    "core" vs "contrib" distinction.
>>
>> 3. We create a "bundled" JAR that has the common packages "typically"
>>    needed (index/search core, analyzers, queries, highlighter,
>>    spellchecker).
>
> +1 to all three points.

OK. So I guess I'm proposing adding:

4. Move contrib/* under src/java/*, updating the javadocs to state back compatibility promises per class/package.

I think net/net this'd be a great simplification?

Mike
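As a concrete illustration of the "say so clearly in their javadocs" idea above, a per-class back-compat note could look something like the sketch below. The class name and the exact wording are hypothetical, purely illustrative - no such convention has been agreed on in this thread.

```java
/**
 * Expert: a hypothetical class in a brand-new module.
 *
 * <p><b>NOTE:</b> This API is new and experimental, and may change in
 * incompatible ways in a future release. The project's default
 * back-compatibility promise does not yet apply to this class.</p>
 */
public class NewModuleFeature {
    // Implementation would go here; the point is the javadoc note above,
    // which overrides the default "full back compatibility" assumption.
}
```

The default assumption stays "full back compatibility"; only classes carrying a note like this opt out.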
Re: Is TopDocCollector's collect() implementation correct?
> If we're already creating a new TopScoreDocCollector (when was it
> added? I must have been dozing off while this happened...)

This was LUCENE-1483.

> How about if we introduce an abstract ScoringCollector (about the name
> later) which implements topDocs() and getTotalHits(), and there will
> be several implementations of it, such as: TopScoreDocCollector, which
> sorts the documents by their score, in descending order only,
> TopFieldDocCollector - for sorting by fields, and additional sort-by
> collectors.

This sounds good... but the challenge is we also need to get both HitCollector and MultiReaderHitCollector in there. HitCollector is the simplest way to create a custom collector. MultiReaderHitCollector (added with LUCENE-1483) is the more performant way, since it lets your collector operate per-segment. All non-deprecated core collectors in Lucene now subclass MultiReaderHitCollector.

So would we make separate subclasses for each of them to add getTotalHits() / topDocs()? EG TopDocsHitCollector and TopDocsMultiReaderHitCollector? It's getting confusing.

Or maybe we just add totalHits() and topDocs() to HitCollector even though for the advanced case (non-top-N collection) the methods would not be used?

Or... maybe this is a time when an interface is the lesser evil: we could make a TopDocs interface that the necessary classes implement?

Mike
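For readers following along: the topDocs()/getTotalHits() behavior being discussed is, at its core, top-N collection over a min-heap. The sketch below is a self-contained plain-Java analogue of what a top-N collector does internally - the class name TopNSketch and its nested ScoreDoc are hypothetical stand-ins, not the actual Lucene 2.4/2.9 API, and the per-segment docBase handling of MultiReaderHitCollector is deliberately omitted.

```java
import java.util.PriorityQueue;

// Illustrative sketch only: plain-Java top-N collection mirroring what a
// top-docs collector does internally. Names are hypothetical, not Lucene API.
public class TopNSketch {
    public static final class ScoreDoc {
        public final int doc;
        public final float score;
        ScoreDoc(int doc, float score) { this.doc = doc; this.score = score; }
    }

    private final int n;
    private int totalHits;
    // Min-heap ordered by score: the weakest kept hit sits on top,
    // so it is the one evicted when a better hit arrives.
    private final PriorityQueue<ScoreDoc> pq;

    public TopNSketch(int n) {
        this.n = n;
        this.pq = new PriorityQueue<>(n, (a, b) -> Float.compare(a.score, b.score));
    }

    // Analogue of HitCollector.collect(int doc, float score).
    public void collect(int doc, float score) {
        totalHits++;
        if (pq.size() < n) {
            pq.add(new ScoreDoc(doc, score));
        } else if (score > pq.peek().score) {
            pq.poll();                         // drop current weakest hit
            pq.add(new ScoreDoc(doc, score));
        }
    }

    // Analogue of getTotalHits(): every hit seen, not just the kept top N.
    public int getTotalHits() { return totalHits; }

    // Analogue of topDocs(): kept hits in descending score order.
    public ScoreDoc[] topDocs() {
        PriorityQueue<ScoreDoc> copy = new PriorityQueue<>(pq);
        ScoreDoc[] out = new ScoreDoc[copy.size()];
        for (int i = out.length - 1; i >= 0; i--) out[i] = copy.poll();
        return out;
    }
}
```

The design question in the thread is precisely where collect() versus getTotalHits()/topDocs() should live in the class hierarchy - the two halves are separable, which is why an interface for the top-docs half is on the table.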
[jira] Commented: (LUCENE-1410) PFOR implementation
[ https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688284#action_12688284 ]

Eks Dev commented on LUCENE-1410:
-

It looks like Google went there as well (block encoding); see:

Blog: http://blogs.sun.com/searchguy/entry/google_s_postings_format
Slides: http://research.google.com/people/jeff/WSDM09-keynote.pdf (slides 47-63)

> PFOR implementation
> ---
>
> Key: LUCENE-1410
> URL: https://issues.apache.org/jira/browse/LUCENE-1410
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Other
> Reporter: Paul Elschot
> Priority: Minor
> Attachments: autogen.tgz, LUCENE-1410b.patch, LUCENE-1410c.patch,
> LUCENE-1410d.patch, LUCENE-1410e.patch, TermQueryTests.tgz, TestPFor2.java,
> TestPFor2.java, TestPFor2.java
>
> Original Estimate: 21840h
> Remaining Estimate: 21840h
>
> Implementation of Patched Frame of Reference.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
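For context on what "Patched Frame of Reference" means: the sketch below is NOT the LUCENE-1410 patch, just a minimal plain-Java illustration of the underlying frame-of-reference (FOR) idea - store a block's first value as a base, then bit-pack the deltas at a fixed width. PFOR's "patching" step (storing outlier deltas as exceptions so one large value doesn't inflate the bit width of the whole block) is omitted here for brevity; all names are made up for illustration.

```java
// Minimal frame-of-reference (FOR) sketch - not the LUCENE-1410 patch.
// A block of ascending ints (e.g. doc ids in a postings block) is stored
// as a base value plus deltas packed at a fixed bit width. PFOR would
// additionally patch outliers as exceptions; that step is omitted here.
public class ForSketch {
    public static final class Block {
        public final int base;      // first value of the block
        public final int bits;      // bit width per delta
        public final long[] packed; // deltas, bit-packed into longs
        public final int count;
        Block(int base, int bits, long[] packed, int count) {
            this.base = base; this.bits = bits;
            this.packed = packed; this.count = count;
        }
    }

    public static Block encode(int[] values) {
        int base = values[0];
        int maxDelta = 0;
        for (int v : values) maxDelta = Math.max(maxDelta, v - base);
        // Smallest width that holds the largest delta (at least 1 bit).
        int bits = Math.max(1, 32 - Integer.numberOfLeadingZeros(maxDelta));
        long[] packed = new long[(values.length * bits + 63) / 64];
        for (int i = 0; i < values.length; i++) {
            long delta = values[i] - base;
            int bitPos = i * bits;
            packed[bitPos >> 6] |= delta << (bitPos & 63);
            if ((bitPos & 63) + bits > 64) {  // delta straddles a word boundary
                packed[(bitPos >> 6) + 1] |= delta >>> (64 - (bitPos & 63));
            }
        }
        return new Block(base, bits, packed, values.length);
    }

    public static int[] decode(Block b) {
        int[] out = new int[b.count];
        long mask = (1L << b.bits) - 1;  // b.bits <= 32, so no overflow
        for (int i = 0; i < b.count; i++) {
            int bitPos = i * b.bits;
            long delta = b.packed[bitPos >> 6] >>> (bitPos & 63);
            if ((bitPos & 63) + b.bits > 64) {  // reassemble straddled delta
                delta |= b.packed[(bitPos >> 6) + 1] << (64 - (bitPos & 63));
            }
            out[i] = b.base + (int) (delta & mask);
        }
        return out;
    }
}
```

The Google slides linked above describe the same family of block-oriented formats: fixed-width groups decode with simple shifts and masks, which is exactly the branch-free inner loop PFOR is after.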