Re: LocalLucene sorting issue

2009-03-23 Thread Mark Miller

Ryan McKinley wrote:
> In order to get spatial lucene into solr, we need to figure out how to
> fix the memory leak described in:
>
> https://issues.apache.org/jira/browse/LUCENE-1304
>
> Reading the posts on LUCENE-1304, it seems to point to LUCENE-1483 as
> the _real_ solution while LUCENE-1304 would just be a deprecated
> band-aid (for the record, band-aids are quite useful).
>
> Before delving into this again, it looks like LUCENE-1483 is finished,
> but I don't understand how it fixes the CustomSort stuff.  Also, I
> don't see what the deprecated sorting stuff should be replaced with...
The fix is that, with LUCENE-1483, comparators are no longer cached, as long 
as you use the new API. The new API is the FieldComparator, and you supply 
one with a FieldComparatorSource. The FieldComparator may look a little 
complicated, but it's fairly straightforward for the primitive (non-String) 
types - you should be able to roughly copy one.


org.apache.lucene.search.FieldComparator

There is a new SortField constructor that takes a FieldComparatorSource.
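To make the "roughly copy one" suggestion concrete, here is a self-contained sketch of the slot-based pattern FieldComparator uses. The method names (compare, setBottom, compareBottom, copy) are recalled from the 2.9 API; the real abstract class also has setNextReader and value, and you would hand a factory for it to Lucene via FieldComparatorSource.newComparator. The float[] of per-document values here is a stand-in for whatever your comparator reads from the index.

```java
// Sketch of the slot-based comparator pattern (not the real Lucene class).
// "Slots" hold copied values for the hits currently in the priority queue.
class FloatSlotComparator {
    private final float[] slots;       // one entry per queued hit
    private float bottom;              // value of the weakest queued hit
    private final float[] fieldValues; // stand-in for per-document field values

    FloatSlotComparator(int numHits, float[] fieldValues) {
        this.slots = new float[numHits];
        this.fieldValues = fieldValues;
    }

    // Compare two queued hits by their copied values.
    public int compare(int slot1, int slot2) {
        return Float.compare(slots[slot1], slots[slot2]);
    }

    // Record which queued hit is currently the weakest.
    public void setBottom(int slot) {
        bottom = slots[slot];
    }

    // Compare an incoming doc against the weakest queued hit.
    public int compareBottom(int doc) {
        return Float.compare(bottom, fieldValues[doc]);
    }

    // Copy an incoming doc's value into a slot when it enters the queue.
    public void copy(int slot, int doc) {
        slots[slot] = fieldValues[doc];
    }
}
```

For another primitive type you would swap the float[] and Float.compare for the corresponding type, which is why copying one of the existing primitive comparators is straightforward.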


> thanks for any pointers
>
> ryan

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org




--
- Mark

http://www.lucidimagination.com







[jira] Commented: (LUCENE-1570) QueryParser.setAllowLeadingWildcard could provide finer granularity

2009-03-23 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688494#action_12688494
 ] 

Mark Miller commented on LUCENE-1570:
-

Yonik spit out a bit of a better answer while I typed - right, you do have 
access to the field in getWildcardQuery, and the leading check happens there, 
so you can override it. My brain always runs towards building the support in, 
but in this case it may be cleaner to leave it out anyway. It's somewhat of a 
niche concern. Just had the new QueryParser on my mind.

> QueryParser.setAllowLeadingWildcard could provide finer granularity
> ---
>
> Key: LUCENE-1570
> URL: https://issues.apache.org/jira/browse/LUCENE-1570
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: QueryParser
>Affects Versions: 2.4.1
>Reporter: Jonathan Watt
>
> It's great that Lucene now allows support for leading wildcards to be turned 
> on. However, leading wildcard searches are more expensive, so it would be 
> useful to be able to turn it on only for certain search fields. I'm 
> specifically thinking of wiki searches where it may be too expensive to allow 
> leading wildcards in the 'content:' field, but it would still be very useful 
> to be able to selectively turn on support for 'path:' and perhaps other 
> fields such as 'title:'. Would this be possible?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Commented: (LUCENE-1570) QueryParser.setAllowLeadingWildcard could provide finer granularity

2009-03-23 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688486#action_12688486
 ] 

Mark Miller commented on LUCENE-1570:
-

I've wanted this in the past. It's certainly possible, but I am not sure how 
easy it would be to do with the current queryparser (it's been a long time since 
I have been there). There appears to be a new parser on the horizon though, and 
it sounds as if it will allow these types of additions much more elegantly (the 
current queryparser does not use a syntax tree representation, and it's kind of 
hairy to build on).

If I remember right, the current QueryParser simply attaches semantic actions 
to grammar production rules - difficult to read, edit, and maintain - and it 
has not been super friendly for building upon.

Also if I remember right, I think this new parser will use abstract syntax 
trees, which lets you split up syntax and semantics, and also keep things a bit 
more modular - you can do things like have a pluggable syntax reader that feeds 
a pluggable query output writer. At least for the basics - it sounds like these 
guys have made something pretty cool, but I have not seen the code yet and have 
only a brief memory of its description.

Point being, it can be done, and I think it's useful, but it might make sense 
to see how much easier it can be done with this new parser.





[jira] Commented: (LUCENE-1570) QueryParser.setAllowLeadingWildcard could provide finer granularity

2009-03-23 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688482#action_12688482
 ] 

Yonik Seeley commented on LUCENE-1570:
--

This is pretty easy to implement by overriding QueryParser.getWildcardQuery().
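As a rough illustration of the override Yonik describes: the per-field decision itself is plain logic. In a real subclass you would call something like the helper below from an overridden getWildcardQuery(String field, String termStr) and either delegate to super.getWildcardQuery or throw ParseException. The field names and the "allowed set" idea here are illustrative assumptions, not anything from the issue.

```java
import java.util.Set;

// Sketch of a per-field leading-wildcard policy. In practice this check
// would live inside a QueryParser subclass's getWildcardQuery override.
class LeadingWildcardPolicy {
    private final Set<String> fieldsAllowingLeading;

    LeadingWildcardPolicy(Set<String> fieldsAllowingLeading) {
        this.fieldsAllowingLeading = fieldsAllowingLeading;
    }

    // True if this wildcard term is acceptable for this field:
    // either it has no leading wildcard, or the field is whitelisted.
    public boolean isAllowed(String field, String termStr) {
        boolean leading = termStr.startsWith("*") || termStr.startsWith("?");
        return !leading || fieldsAllowingLeading.contains(field);
    }
}
```

So for the reporter's wiki case, the whitelist would hold 'path' and 'title' while 'content' stays restricted, with the parser-wide setAllowLeadingWildcard(true) enabling the leading forms at all.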





LocalLucene sorting issue

2009-03-23 Thread Ryan McKinley
In order to get spatial lucene into solr, we need to figure out how to  
fix the memory leak described in:

https://issues.apache.org/jira/browse/LUCENE-1304

Reading the posts on LUCENE-1304, it seems to point to LUCENE-1483 as  
the _real_ solution while LUCENE-1304 would just be a deprecated band- 
aid (for the record, band-aids are quite useful).


Before delving into this again, it looks like LUCENE-1483 is finished,  
but I don't understand how it fixes the CustomSort stuff.  Also, I  
don't see what the deprecated sorting stuff should be replaced with...


thanks for any pointers

ryan




[jira] Created: (LUCENE-1570) QueryParser.setAllowLeadingWildcard could provide finer granularity

2009-03-23 Thread Jonathan Watt (JIRA)
QueryParser.setAllowLeadingWildcard could provide finer granularity
---

 Key: LUCENE-1570
 URL: https://issues.apache.org/jira/browse/LUCENE-1570
 Project: Lucene - Java
  Issue Type: Improvement
  Components: QueryParser
Affects Versions: 2.4.1
Reporter: Jonathan Watt


It's great that Lucene now allows support for leading wildcards to be turned 
on. However, leading wildcard searches are more expensive, so it would be 
useful to be able to turn it on only for certain search fields. I'm 
specifically thinking of wiki searches where it may be too expensive to allow 
leading wildcards in the 'content:' field, but it would still be very useful to 
be able to selectively turn on support for 'path:' and perhaps other fields 
such as 'title:'. Would this be possible?




Improve worst-case performance of TrieRange queries

2009-03-23 Thread Michael Busch

Let me give an example to explain my idea - I'm using dates in my
example, because it's easier to imagine :)

Let's say we have the following posting lists. There are 20 docs in the
index and an X means that a doc contains the corresponding term:

Jan X   X
Feb XX  X
Mar  X
Apr XX
May X
Jun
Jul   XX
Aug   X  X
Sep   X
Oct   X
Nov  X  X
Dec X X

Then we index another term 'ALL'. It gets added for any document that 
has a numeric value in this bucket:


All X XX

If the query is [Jun TO Jul], then we process the query normally (ORing 
terms Jun and Jul). If the query is [Feb TO Nov], then we basically 
translate that into All AND NOT (Jan OR Dec).


Since you only evaluate the complement of the terms, you can (almost) 
double the worst case performance.
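The rewrite rule can be sketched in plain code. This is my reading of the idea, using the month buckets from the example (real TrieRange terms would differ): OR the buckets inside the range, except when that would touch more than half of them, in which case evaluate ALL AND NOT (the buckets outside the range).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of the complement rewrite for a range over ordered buckets.
class RangeRewrite {
    static final List<String> BUCKETS = Arrays.asList(
        "Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec");

    // Returns the terms to union. If complement[0] is set true, the caller
    // should instead compute ALL AND NOT (union of the returned terms).
    static List<String> termsFor(String lo, String hi, boolean[] complement) {
        int from = BUCKETS.indexOf(lo), to = BUCKETS.indexOf(hi);
        List<String> inside = new ArrayList<>(BUCKETS.subList(from, to + 1));
        if (inside.size() <= BUCKETS.size() / 2) {
            complement[0] = false;
            return inside;                            // [Jun TO Jul] -> Jun OR Jul
        }
        complement[0] = true;                         // [Feb TO Nov] ->
        List<String> outside = new ArrayList<>(BUCKETS); // ALL AND NOT (Jan OR Dec)
        outside.removeAll(inside);
        return outside;
    }
}
```

The half-way threshold is what caps the number of posting lists ever touched at roughly half the buckets, which is where the (almost) doubled worst case comes from.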


Downsides:
- you have to have another BitSet in memory to perform the AND NOT 
operation, so it needs more memory
- this complement approach is only this simple for numeric fields where 
one document has only a single value; similar things are doable for 
multi-valued numeric fields, however more complex and possibly with less 
performance gain
- you need to index an additional term per bucket, so the index size 
increases slightly


Does this make sense? Maybe this has even been discussed already?

-Michael




[jira] Commented: (LUCENE-1522) another highlighter

2009-03-23 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688454#action_12688454
 ] 

Michael McCandless commented on LUCENE-1522:


bq. I think this is an unrealistic requirement in some cases (e.g. AND queries).

I agree.

> another highlighter
> ---
>
> Key: LUCENE-1522
> URL: https://issues.apache.org/jira/browse/LUCENE-1522
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/highlighter
>Reporter: Koji Sekiguchi
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
> Attachments: colored-tag-sample.png, LUCENE-1522.patch, 
> LUCENE-1522.patch
>
>
> I've written this highlighter for my project to support bi-gram token stream 
> (general token stream (e.g. WhitespaceTokenizer) also supported. see test 
> code in patch). The idea was inherited from my previous project with my 
> colleague and LUCENE-644. This approach needs highlight fields to be 
> TermVector.WITH_POSITIONS_OFFSETS, but is fast and can support N-grams. This 
> depends on LUCENE-1448 to get refined term offsets.
> usage:
> {code:java}
> TopDocs docs = searcher.search( query, 10 );
> Highlighter h = new Highlighter();
> FieldQuery fq = h.getFieldQuery( query );
> for( ScoreDoc scoreDoc : docs.scoreDocs ){
>   // fieldName="content", fragCharSize=100, numFragments=3
>   String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc, 
> "content", 100, 3 );
>   if( fragments != null ){
> for( String fragment : fragments )
>   System.out.println( fragment );
>   }
> }
> {code}
> features:
> - fast for large docs
> - supports not only whitespace-based token stream, but also "fixed size" 
> N-gram (e.g. (2,2), not (1,3)) (can solve LUCENE-1489)
> - supports PhraseQuery, phrase-unit highlighting with slops
> {noformat}
> q="w1 w2"
> w1 w2
> ---
> q="w1 w2"~1
> w1 w3 w2 w3 w1 w2
> {noformat}
> - highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
> - easy to apply patch due to independent package (contrib/highlighter2)
> - uses Java 1.5
> - looks query boost to score fragments (currently doesn't see idf, but it 
> should be possible)
> - pluggable FragListBuilder
> - pluggable FragmentsBuilder
> to do:
> - term positions can be unnecessary when phraseHighlight==false
> - collects performance numbers




[jira] Commented: (LUCENE-1522) another highlighter

2009-03-23 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688451#action_12688451
 ] 

Michael Busch commented on LUCENE-1522:
---

{quote}
(Meaning, if you were to copy & paste the full excerpt you are looking at, 
index it as a document, would your current search match it).
{quote}

I think this is an unrealistic requirement in some cases (e.g. AND queries). I 
agree it makes sense for phrases to show them entirely in a fragment (even if 
that means not showing the beginning of a sentence). But often you have only 
one or two lines of text to display an extract. Then it might be a better 
choice to show two decently sized fragments with some context around the 
highlighted terms, rather than showing e.g. 4 short fragments just to show all 
4 highlighted query terms (e.g. for the query '+a +b +c +d').







[jira] Commented: (LUCENE-1561) Maybe rename Field.omitTf, and strengthen the javadocs

2009-03-23 Thread Mike Klaas (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688449#action_12688449
 ] 

Mike Klaas commented on LUCENE-1561:


I agree that it is going to be almost impossible to convey that phrase queries 
don't work by renaming the flag.  I agree with Eks Dev that a positive 
formulation is the only chance, although this deviates from the current omit* 
flags.

termPresenceOnly()
trackTermPresenceOnly()
onlyTermPresence()
omitEverythingButTermPresence() // just kidding


> Maybe rename Field.omitTf, and strengthen the javadocs
> --
>
> Key: LUCENE-1561
> URL: https://issues.apache.org/jira/browse/LUCENE-1561
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4.1
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 2.9
>
> Attachments: LUCENE-1561.patch
>
>
> Spinoff from here:
>   
> http://www.nabble.com/search-problem-when-indexed-using-Field.setOmitTf()-td22456141.html
> Maybe rename omitTf to something like omitTermPositions, and make it clear 
> what queries will silently fail to work as a result.




[jira] Commented: (LUCENE-1522) another highlighter

2009-03-23 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688448#action_12688448
 ] 

Mark Miller commented on LUCENE-1522:
-

{quote}But that's really quite a serious problem; it's the kind that
immediately erodes users' trust. Though if this user had used
SpanScorer it would have been fixed (right?).{quote}

Right - my point was more that it was a common complaint and has been solved in 
one way or another for a long time. Even back when that post occurred, there was 
a JIRA highlighter that worked with phrase queries, I think. There have been at 
least one or two besides the SpanScorer.

{quote}Is there any reason not to use SpanScorer (vs QueryScorer)?{quote}

It is slower when working with position-sensitive clauses - because it actually 
does some work. For non-position-sensitive terms, it's the same speed as the 
standard scorer. Makes sense to me to always use it, but if you don't care and 
want every term highlighted, why pay the price, I guess...

{quote}
Well... I'd still like to explore some way to better integrate w/ core
(just don't have enough time, but maybe if I keep talking about it
here, someone else will get the itch + time .
{quote}

Right - don't get me wrong - I was just getting the thoughts in my head down. 
These types of brain dumps you higher-level guys do definitely lead to work 
getting done - the SpanScorer came directly from these types of discussions, 
though quite a bit later - the original discussion happened before my time.

{quote}
Well this is open source after all. Things get "naturally
prioritized".

A lot of the sweat that is given has been fragmented by the 3 or 4 
alternate highlighters.

Yeah also another common theme in open-source development, though it's
in good company: evolution and capitalism share the same "flaw".
{quote}

Right. I suppose I was just suggesting that something more practical might make 
more sense (more musing than suggesting). And practical in terms of how much 
activity we have seen in the highlighter area (fairly low, and not usually to 
the extent needed to get something committed and in use).

And the split work on the highlighters is fine - but if we had the right 
highlighter base, more work could have been concentrated on the highlighter 
that's most used. Not really a complaint, but an idea for the future. If we can 
get something better going, perhaps we can get to the point where people work 
with the current implementation rather than creating a new one every time.



Re: Modularization

2009-03-23 Thread Mike Klaas


On 23-Mar-09, at 2:41 PM, Michael McCandless wrote:


I agree, but at least we need some clear criteria so the future
decision process is more straightforward.  Towards that... it seems
like there are good reasons why something should be put into contrib:

 * It uses a version of JDK higher than what core can allow

 * It has external dependencies

 * Its quality is debatable (or at least not proven)

 * It's of somewhat narrow usage/interest (eg: contrib/bdb)

But I don't think "it doesn't have to be in core" (the "software
modularity" goal) is the right reason to put something in contrib.


Agreed.  I don't think that building on the existing 'contrib' is the  
way to go.  Frequently-used, high-quality components should be more  
properly part of "Lucene", whether that means that they move to core,  
or in a new blessed modules section.



Getting back to the original topic: Trie(Numeric)RangeFilter runs on
JDK 1.4, has no external dependencies, looks to be high quality, and
likely will have wide appeal.  Doesn't it belong in core?


+1.  It is important that Lucene come blessed with very good quality  
defaults.  Fast range queries are a common requirement.  Similarly, I  
wouldn't be happy to have a new, wicked QueryParser be relegated to  
contrib where it is unlikely to be found by non-savvy users.  At the  
very least, I agree with Michael that it should be findable in the  
same "place".


It does make sense to separate the machinery/building blocks (base  
Query, Weight, Scorer, Filter classes, Similarity interface, etc.)  
from the Query/Filter implementations that use them.  But whether this  
is done by putting them in separate directories or via global core/ 
modules distinction seems unimportant.


-Mike




[jira] Commented: (LUCENE-1522) another highlighter

2009-03-23 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688439#action_12688439
 ] 

Michael McCandless commented on LUCENE-1522:


bq. I think you are reading more into that than I see - that guy is just 
frustrated that PhraseQueries don't highlight correctly

But that's really quite a serious problem; it's the kind that
immediately erodes users' trust.  Though if this user had used
SpanScorer it would have been fixed (right?).

Is there any reason not to use SpanScorer (vs QueryScorer)?

The "final inch" (search UI) is exceptionally important!

bq. When users see the PhraseQuery look right, I haven't seen any other repeated 
complaints really.

OK.

bq. And I think we have positional solved fairly well with the current API - 
it's just too darn slow.

Well... I'd still like to explore some way to better integrate w/ core
(just don't have enough time, but maybe if I keep talking about it
here, someone else will get the itch + time ;).

I think an IndexReader impl around loaded TermVectors can get us OK
performance (no re-analysis nor linear scan of resynthesized
TokenStream).

bq. Not that I am against things being sweet and perfect, and getting exact 
matches, but there has been lots of talk in the past about integrating the 
highlighter into core and making things really fast and efficient - and 
generally it comes down to what work actually gets done (and all this stuff 
ends up at the hard end of the pool).

Well this is open source after all.  Things get "naturally
prioritized".

bq. A lot of the sweat that is given has been fragmented by the 3 or 4 
alternate highlighters.

Yeah also another common theme in open-source development, though it's
in good company: evolution and capitalism share the same "flaw".






Re: Modularization

2009-03-23 Thread Michael McCandless
>> I think we are considering this for Lucene 3.0 (should be the
>> release after next) which will allow Java 1.5.
>
> So where are you going to put 1.6 and 1.7 contribs?

This is a good point: core Lucene must remain on "old" JREs, but we
should not force all contrib packages to do so.

> - contrib has always had a lower bar and stuff was committed under
> that lower bar - there should be no blanket promotion.

OK so that was the past, and I agree.

I assume by this you're also advocating that going forward this is an
ongoing reason to put something into contrib?  I agree with that. Ie,
if a contribution is made, but it's not clear the quality is up to
core's standards, I would much rather have some place to commit it
(contrib) than to reject it, because once it has a home here, it has a
chance to gain interest, grow, improve, etc.

But: do you think, for this reason, the web site should continue to
present the dichotomy?

> - contrib items may have different dependencies... putting it all
> under the same source root can make a developers job harder

That's a good point & criterion for leaving something in contrib.

> - many contrib items are less related to lucene-java core indexing
> and searching... if there is no contrib, then they don't belong in
> the lucene-java project at all.

But most contrib packages are very related to Lucene.

Though I agree some contrib packages likely have very narrow
appeal/usage (eg, contrib/db, for using BDB as the raw store for an
index).

And I agree (as above): I would like to have somewhere for
contributions to go, rather than reject them.

> - right now it's clear - core can't have dependencies on non-core
> classes.  If everything is stuck in the same source tree, that goes
> away.

Well... this gets to Hoss's motivation, which I appreciate, to keep
the core tiny.

But that's just good software design and you don't need a divorced
directory structure to achieve that.

> I think there are a lot of benefits to continue considering very
> carefully if something is "core" or not.

I agree, but at least we need some clear criteria so the future
decision process is more straightforward.  Towards that... it seems
like there are good reasons why something should be put into contrib:

  * It uses a version of JDK higher than what core can allow

  * It has external dependencies

  * Its quality is debatable (or at least not proven)

  * It's of somewhat narrow usage/interest (eg: contrib/bdb)

But I don't think "it doesn't have to be in core" (the "software
modularity" goal) is the right reason to put something in contrib.

Getting back to the original topic: Trie(Numeric)RangeFilter runs on
JDK 1.4, has no external dependencies, looks to be high quality, and
likely will have wide appeal.  Doesn't it belong in core?

Mike

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1561) Maybe rename Field.omitTf, and strengthen the javadocs

2009-03-23 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688429#action_12688429
 ] 

Eks Dev commented on LUCENE-1561:
-

maybe something along the lines of:

usePureBooleanPostings()
minimalInvertedList()




> Maybe rename Field.omitTf, and strengthen the javadocs
> --
>
> Key: LUCENE-1561
> URL: https://issues.apache.org/jira/browse/LUCENE-1561
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4.1
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 2.9
>
> Attachments: LUCENE-1561.patch
>
>
> Spinoff from here:
>   
> http://www.nabble.com/search-problem-when-indexed-using-Field.setOmitTf()-td22456141.html
> Maybe rename omitTf to something like omitTermPositions, and make it clear 
> what queries will silently fail to work as a result.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-1522) another highlighter

2009-03-23 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688419#action_12688419
 ] 

Mark Miller edited comment on LUCENE-1522 at 3/23/09 2:12 PM:
--

I think you are reading more into that than I see - that guy is just frustrated 
that PhraseQueries don't highlight correctly. That was/is a common occurrence 
and you can find tons of examples. There are one or two JIRA highlighters that 
address it, and then there is the Span highlighter (more interestingly, there 
is a link to the birth of the Span highlighter idea on that page - thanks M. 
Harwood).

When users see the PhraseQuery look right, I haven't seen any other repeated 
complaints really. While it would be nice to match boolean logic fully, I 
almost don't think it's worth the effort. You likely have an interest in those 
terms anyway - it's not a given that the terms that caused the match 
(non-positional) matter. I have not seen a complaint on that one - mostly just 
positional-type stuff. And I think we have positional solved fairly well with 
the current API - it's just too darn slow. Not that I am against things being 
sweet and perfect, and getting exact matches, but there has been lots of talk 
in the past about integrating the highlighter into core and making things 
really fast and efficient - and generally it comes down to what work actually 
gets done (and all this stuff ends up at the hard end of the pool).

When I wrote the SpanScorer, many times it was discussed how things should 
*really* be done. Most methods involved working with core - but what has been 
there for a couple of years now is the SpanScorer that plugs into the current 
highlighter API, and nothing else has made any progress. Not really an 
argument, just kind of thinking out loud at this point...

I'm all for improving the speed and accuracy of the highlighter at the end of 
the day, but it's a tall order considering how much attention the Highlighter 
has managed to receive in the past. It's large on ideas and low on sweat.

*edit*
A lot of the sweat that is given has been fragmented by the 3 or 4 alternate 
highlighters.

*edit*
A lot of the sweat that is given has been fragmented by the 3 or 4 alternate 
highlighters.

  was (Author: markrmil...@gmail.com):
I think you are reading more into that than I see - that guy is just 
frustrated that PhraseQueries don't highlight correctly. That was/is a common 
occurrence and you can find tons of examples. There are one or two JIRA 
highlighters that address it, and then there is the Span highlighter (more 
interestingly, there is a link to the birth of the Span highlighter idea on 
that page - thanks M. Harwood).

When users see the PhraseQuery look right, I haven't seen any other repeated 
complaints really. While it would be nice to match boolean logic fully, I 
almost don't think it's worth the effort. You likely have an interest in those 
terms anyway - it's not a given that the terms that caused the match 
(non-positional) matter. I have not seen a complaint on that one - mostly just 
positional-type stuff. And I think we have positional solved fairly well with 
the current API - it's just too darn slow. Not that I am against things being 
sweet and perfect, and getting exact matches, but there has been lots of talk 
in the past about integrating the highlighter into core and making things 
really fast and efficient - and generally it comes down to what work actually 
gets done (and all this stuff ends up at the hard end of the pool).

When I wrote the SpanScorer, many times it was discussed how things should 
*really* be done. Most methods involved working with core - but what has been 
there for a couple of years now is the SpanScorer that plugs into the current 
highlighter API, and nothing else has made any progress. Not really an 
argument, just kind of thinking out loud at this point...

I'm all for improving the speed and accuracy of the highlighter at the end of 
the day, but it's a tall order considering how much attention the Highlighter 
has managed to receive in the past. It's large on ideas and low on sweat.
  
> another highlighter
> ---
>
> Key: LUCENE-1522
> URL: https://issues.apache.org/jira/browse/LUCENE-1522
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/highlighter
>Reporter: Koji Sekiguchi
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
> Attachments: colored-tag-sample.png, LUCENE-1522.patch, 
> LUCENE-1522.patch
>
>
> I've written this highlighter for my project to support bi-gram token stream 
> (general token stream (e.g. WhitespaceTokenizer) also supported. see test 
> code in patch). The idea was inherited from my previous project with my 
> colleague and LUCENE-644. This approach needs highlight fields to be 
> TermVector.WITH_POSITIONS_OFFSETS, but is fast and can support N-gram

[jira] Commented: (LUCENE-1522) another highlighter

2009-03-23 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688419#action_12688419
 ] 

Mark Miller commented on LUCENE-1522:
-

I think you are reading more into that than I see - that guy is just frustrated 
that PhraseQueries don't highlight correctly. That was/is a common occurrence 
and you can find tons of examples. There are one or two JIRA highlighters that 
address it, and then there is the Span highlighter (more interestingly, there 
is a link to the birth of the Span highlighter idea on that page - thanks M. 
Harwood).

When users see the PhraseQuery look right, I haven't seen any other repeated 
complaints really. While it would be nice to match boolean logic fully, I 
almost don't think it's worth the effort. You likely have an interest in those 
terms anyway - it's not a given that the terms that caused the match 
(non-positional) matter. I have not seen a complaint on that one - mostly just 
positional-type stuff. And I think we have positional solved fairly well with 
the current API - it's just too darn slow. Not that I am against things being 
sweet and perfect, and getting exact matches, but there has been lots of talk 
in the past about integrating the highlighter into core and making things 
really fast and efficient - and generally it comes down to what work actually 
gets done (and all this stuff ends up at the hard end of the pool).

When I wrote the SpanScorer, many times it was discussed how things should 
*really* be done. Most methods involved working with core - but what has been 
there for a couple of years now is the SpanScorer that plugs into the current 
highlighter API, and nothing else has made any progress. Not really an 
argument, just kind of thinking out loud at this point...

I'm all for improving the speed and accuracy of the highlighter at the end of 
the day, but it's a tall order considering how much attention the Highlighter 
has managed to receive in the past. It's large on ideas and low on sweat.

> another highlighter
> ---
>
> Key: LUCENE-1522
> URL: https://issues.apache.org/jira/browse/LUCENE-1522
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/highlighter
>Reporter: Koji Sekiguchi
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
> Attachments: colored-tag-sample.png, LUCENE-1522.patch, 
> LUCENE-1522.patch
>
>
> I've written this highlighter for my project to support bi-gram token stream 
> (general token stream (e.g. WhitespaceTokenizer) also supported. see test 
> code in patch). The idea was inherited from my previous project with my 
> colleague and LUCENE-644. This approach needs highlight fields to be 
> TermVector.WITH_POSITIONS_OFFSETS, but is fast and can support N-grams. This 
> depends on LUCENE-1448 to get refined term offsets.
> usage:
> {code:java}
> TopDocs docs = searcher.search( query, 10 );
> Highlighter h = new Highlighter();
> FieldQuery fq = h.getFieldQuery( query );
> for( ScoreDoc scoreDoc : docs.scoreDocs ){
>   // fieldName="content", fragCharSize=100, numFragments=3
>   String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc, 
> "content", 100, 3 );
>   if( fragments != null ){
> for( String fragment : fragments )
>   System.out.println( fragment );
>   }
> }
> {code}
> features:
> - fast for large docs
> - supports not only whitespace-based token stream, but also "fixed size" 
> N-gram (e.g. (2,2), not (1,3)) (can solve LUCENE-1489)
> - supports PhraseQuery, phrase-unit highlighting with slops
> {noformat}
> q="w1 w2"
> w1 w2
> ---
> q="w1 w2"~1
> w1 w3 w2 w3 w1 w2
> {noformat}
> - highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
> - easy to apply patch due to independent package (contrib/highlighter2)
> - uses Java 1.5
> - looks query boost to score fragments (currently doesn't see idf, but it 
> should be possible)
> - pluggable FragListBuilder
> - pluggable FragmentsBuilder
> to do:
> - term positions can be unnecessary when phraseHighlight==false
> - collects performance numbers

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1410) PFOR implementation

2009-03-23 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688409#action_12688409
 ] 

Paul Elschot commented on LUCENE-1410:
--

The encoding in the google research slides is another one.
They use 2 bits prefixing the first byte to indicate the number of bytes used 
for the encoded number (1-4), and then they group 4 of those prefixes together 
to get a single byte of 4 prefixes followed by the non-prefixed bytes of the 4 
encoded numbers.
This requires a 256-way switch (indexed jump) for every 4 encoded numbers, and 
I would expect that jump to limit performance somewhat when compared to pfor, 
which has a 32-way switch for 32/64/128 encoded numbers.
But since the prefixes only indicate the numbers of bytes used for the encoded 
numbers, no shifts and masks are needed, only byte moves.
So it could well be worthwhile to give this encoding a try, too, especially for 
lists of numbers shorter than 16 or 32.
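The scheme described above is usually called "group varint". A rough Java sketch to make the byte layout concrete (the class and method names are mine, not from any patch on this issue; the decode loop below uses shifts for clarity, whereas an optimized version would dispatch on the header byte and do only byte moves, as described above):

```java
import java.io.ByteArrayOutputStream;

public class GroupVarint {

    // Number of bytes (1-4) needed for a non-negative int.
    static int numBytes(int x) {
        if ((x >>> 8) == 0) return 1;
        if ((x >>> 16) == 0) return 2;
        if ((x >>> 24) == 0) return 3;
        return 4;
    }

    // Encode a group of 4 ints: one header byte holding four 2-bit
    // length codes (bytes - 1), then the value bytes, little-endian.
    static byte[] encode(int[] v) {
        int header = 0;
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        for (int i = 0; i < 4; i++) {
            int n = numBytes(v[i]);
            header |= (n - 1) << (2 * i);
            for (int j = 0; j < n; j++) {
                body.write((v[i] >>> (8 * j)) & 0xFF);
            }
        }
        byte[] bodyBytes = body.toByteArray();
        byte[] out = new byte[1 + bodyBytes.length];
        out[0] = (byte) header;
        System.arraycopy(bodyBytes, 0, out, 1, bodyBytes.length);
        return out;
    }

    // Decode 4 ints starting at off; returns the offset just past the
    // group. An optimized decoder would use a 256-entry table indexed
    // by the header byte instead of this inner loop.
    static int decode(byte[] src, int off, int[] dst) {
        int header = src[off++] & 0xFF;
        for (int i = 0; i < 4; i++) {
            int n = ((header >>> (2 * i)) & 3) + 1;
            int v = 0;
            for (int j = 0; j < n; j++) {
                v |= (src[off++] & 0xFF) << (8 * j);
            }
            dst[i] = v;
        }
        return off;
    }

    public static void main(String[] args) {
        int[] in = { 5, 300, 70000, 16777216 }; // needs 1, 2, 3, 4 bytes
        byte[] enc = encode(in);
        int[] out = new int[4];
        int end = decode(enc, 0, out);
        if (end != 11) throw new AssertionError("length " + end);
        for (int i = 0; i < 4; i++) {
            if (in[i] != out[i]) throw new AssertionError("mismatch at " + i);
        }
        System.out.println("round-trip ok, 11 bytes for 4 ints");
    }
}
```

For the four sample values the group costs 1 header byte plus 1+2+3+4 value bytes, versus 1+2+3+4 bytes plus 10 continuation bits for plain vbyte, so the win is mainly in decode speed, not size.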

> PFOR implementation
> ---
>
> Key: LUCENE-1410
> URL: https://issues.apache.org/jira/browse/LUCENE-1410
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Other
>Reporter: Paul Elschot
>Priority: Minor
> Attachments: autogen.tgz, LUCENE-1410b.patch, LUCENE-1410c.patch, 
> LUCENE-1410d.patch, LUCENE-1410e.patch, TermQueryTests.tgz, TestPFor2.java, 
> TestPFor2.java, TestPFor2.java
>
>   Original Estimate: 21840h
>  Remaining Estimate: 21840h
>
> Implementation of Patched Frame of Reference.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1522) another highlighter

2009-03-23 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688408#action_12688408
 ] 

Michael McCandless commented on LUCENE-1522:


Randomly searching in Google I came across this:


http://stackoverflow.com/questions/82151/is-there-a-fast-accurate-highlighter-for-lucene

...which emphasizes how important it is that the highlighter only highlight 
"matching" fragdocs when possible.

(Meaning: if you were to copy & paste the full excerpt you are looking at and 
index it as a document, would your current search match it?)

> another highlighter
> ---
>
> Key: LUCENE-1522
> URL: https://issues.apache.org/jira/browse/LUCENE-1522
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/highlighter
>Reporter: Koji Sekiguchi
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
> Attachments: colored-tag-sample.png, LUCENE-1522.patch, 
> LUCENE-1522.patch
>
>
> I've written this highlighter for my project to support bi-gram token stream 
> (general token stream (e.g. WhitespaceTokenizer) also supported. see test 
> code in patch). The idea was inherited from my previous project with my 
> colleague and LUCENE-644. This approach needs highlight fields to be 
> TermVector.WITH_POSITIONS_OFFSETS, but is fast and can support N-grams. This 
> depends on LUCENE-1448 to get refined term offsets.
> usage:
> {code:java}
> TopDocs docs = searcher.search( query, 10 );
> Highlighter h = new Highlighter();
> FieldQuery fq = h.getFieldQuery( query );
> for( ScoreDoc scoreDoc : docs.scoreDocs ){
>   // fieldName="content", fragCharSize=100, numFragments=3
>   String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc, 
> "content", 100, 3 );
>   if( fragments != null ){
> for( String fragment : fragments )
>   System.out.println( fragment );
>   }
> }
> {code}
> features:
> - fast for large docs
> - supports not only whitespace-based token stream, but also "fixed size" 
> N-gram (e.g. (2,2), not (1,3)) (can solve LUCENE-1489)
> - supports PhraseQuery, phrase-unit highlighting with slops
> {noformat}
> q="w1 w2"
> w1 w2
> ---
> q="w1 w2"~1
> w1 w3 w2 w3 w1 w2
> {noformat}
> - highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
> - easy to apply patch due to independent package (contrib/highlighter2)
> - uses Java 1.5
> - looks query boost to score fragments (currently doesn't see idf, but it 
> should be possible)
> - pluggable FragListBuilder
> - pluggable FragmentsBuilder
> to do:
> - term positions can be unnecessary when phraseHighlight==false
> - collects performance numbers

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1561) Maybe rename Field.omitTf, and strengthen the javadocs

2009-03-23 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688387#action_12688387
 ] 

Michael McCandless commented on LUCENE-1561:


Naming is the hardest part!!

> Maybe rename Field.omitTf, and strengthen the javadocs
> --
>
> Key: LUCENE-1561
> URL: https://issues.apache.org/jira/browse/LUCENE-1561
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4.1
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 2.9
>
> Attachments: LUCENE-1561.patch
>
>
> Spinoff from here:
>   
> http://www.nabble.com/search-problem-when-indexed-using-Field.setOmitTf()-td22456141.html
> Maybe rename omitTf to something like omitTermPositions, and make it clear 
> what queries will silently fail to work as a result.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1561) Maybe rename Field.omitTf, and strengthen the javadocs

2009-03-23 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688385#action_12688385
 ] 

Otis Gospodnetic commented on LUCENE-1561:
--

Might be good to keep a consistent name across Lucene/Solr.
More info coming up in SOLR-1079.


> Maybe rename Field.omitTf, and strengthen the javadocs
> --
>
> Key: LUCENE-1561
> URL: https://issues.apache.org/jira/browse/LUCENE-1561
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4.1
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 2.9
>
> Attachments: LUCENE-1561.patch
>
>
> Spinoff from here:
>   
> http://www.nabble.com/search-problem-when-indexed-using-Field.setOmitTf()-td22456141.html
> Maybe rename omitTf to something like omitTermPositions, and make it clear 
> what queries will silently fail to work as a result.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Modularization

2009-03-23 Thread Earwin Burrfoot
On Mon, Mar 23, 2009 at 22:13, Mark Miller  wrote:
> Earwin Burrfoot wrote:
>>>
>>> - contrib has always had a lower bar and stuff was committed under
>>> that lower bar - there should be no blanket promotion.
>>> - contrib items may have different dependencies... putting it all
>>> under the same source root can make a developers job harder
>>> - many contrib items are less related to lucene-java core indexing and
>>> searching... if there is no contrib, then they don't belong in the
>>> lucene-java project at all.
>>> - right now it's clear - core can't have dependencies on non-core
>>> classes.  If everything is stuck in the same source tree, that goes
>>> away.
>>>
>>
>> Adding to this, afaik contribs have no java 1.4 restriction. If you
>> merge them into the core, you must either enforce it for contribs, or
>> lift it from the core. I think both variants may be a reason for
>> several heart attacks :)
>> One could argue that five years after 1.5 was released Lucene is going
>> to use it, so the point is no longer relevant. Sorry, 1.7 is just
>> behind the door.
>>
>>
>
> I think we are considering this for Lucene 3.0 (should be the release after
> next) which will allow Java 1.5.

So where are you going to put 1.6 and 1.7 contribs?

-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Modularization

2009-03-23 Thread Mark Miller

Earwin Burrfoot wrote:

- contrib has always had a lower bar and stuff was committed under
that lower bar - there should be no blanket promotion.
- contrib items may have different dependencies... putting it all
under the same source root can make a developers job harder
- many contrib items are less related to lucene-java core indexing and
searching... if there is no contrib, then they don't belong in the
lucene-java project at all.
- right now it's clear - core can't have dependencies on non-core
classes.  If everything is stuck in the same source tree, that goes
away.


Adding to this, afaik contribs have no java 1.4 restriction. If you
merge them into the core, you must either enforce it for contribs, or
lift it from the core. I think both variants may be a reason for
several heart attacks :)
One could argue that five years after 1.5 was released Lucene is going
to use it, so the point is no longer relevant. Sorry, 1.7 is just
behind the door.

  
I think we are considering this for Lucene 3.0 (should be the release 
after next) which will allow Java 1.5.


- Mark

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Modularization

2009-03-23 Thread Earwin Burrfoot
> - contrib has always had a lower bar and stuff was committed under
> that lower bar - there should be no blanket promotion.
> - contrib items may have different dependencies... putting it all
> under the same source root can make a developers job harder
> - many contrib items are less related to lucene-java core indexing and
> searching... if there is no contrib, then they don't belong in the
> lucene-java project at all.
> - right now it's clear - core can't have dependencies on non-core
> classes.  If everything is stuck in the same source tree, that goes
> away.
Adding to this, afaik contribs have no java 1.4 restriction. If you
merge them into the core, you must either enforce it for contribs, or
lift it from the core. I think both variants may be a reason for
several heart attacks :)
One could argue that five years after 1.5 was released Lucene is going
to use it, so the point is no longer relevant. Sorry, 1.7 is just
behind the door.

-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Modularization

2009-03-23 Thread Mark Miller
Are you arguing for no change, Yonik? I agree with all of your points in 
any case.


What appeals to me most so far is:

Take the best of contrib and up its status to something like "modules". 
Equal to core, different requirements, dependencies, etc. Perhaps take 
queryparser out of core, but frankly I wouldn't mind just leaving core 
as it is.


Reintroduce the sandbox (I believe contrib began as the sandbox, part of the 
lower-bar history) and put lesser contrib there and new stuff that's unproven. 
Contrib doesn't appeal to me as a name anyway.


That would give core, modules, and the sandbox (perhaps sandbox is a 
module?). Things could move from sandbox to core or the modules. Modules 
get new requirements similar to core - back compat guarantees and 
changes.txt per module.



Yonik Seeley wrote:

On Mon, Mar 23, 2009 at 11:10 AM, Michael McCandless
 wrote:
  

  4. Move contrib/* under src/java/*, updating the javadocs to state
  back compatibility promises per class/package.



- contrib has always had a lower bar and stuff was committed under
that lower bar - there should be no blanket promotion.
- contrib items may have different dependencies... putting it all
under the same source root can make a developers job harder
- many contrib items are less related to lucene-java core indexing and
searching... if there is no contrib, then they don't belong in the
lucene-java project at all.
- right now it's clear - core can't have dependencies on non-core
classes.  If everything is stuck in the same source tree, that goes
away.

I think there are a lot of benefits to continue considering very
carefully if something is "core" or not.

-Yonik

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org

  



--
- Mark

http://www.lucidimagination.com




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Is TopDocCollector's collect() implementation correct?

2009-03-23 Thread Michael McCandless
Shai Erera  wrote:

> As a side comment, why not add setNextReader to HitCollector and
> then a getDocId(int doc) method which will do the doc + base
> arithmetic?

One problem is this breaks back compatibility on any current
subclasses of HitCollector.

Another problem is: not all collectors would need to add the base on
each doc.  EG a collector that puts hits into separate pqueues per
segment could defer the addition until the end when only the top
results are pulled out of each pqueue.

Also, I am concerned about the method call overhead.  This is the
absolute ultimate hot spot for Lucene and we should worry about
causing even a single added instruction in this path.

That said... I would like to [eventually] change the collection API
along the lines of what Marvin proposed for "Matcher" in Lucy, here:

  http://markmail.org/message/jxshhiqr6wvq77xu

Specifically, I think it should be the collector's job to ask for the
score for this doc, rather than Lucene's job to pre-compute it, so
that collectors that don't need the score won't waste CPU.  EG, if you
are sorting by field (and don't present the relevance score) you
shouldn't compute it.

Then, we could add other "somewhat expensive" things you might
retrieve, such as a way to ask which terms participated in the match
(discussed today on java-user), and/or all term positions that
participated (discussed in LUCENE-1522).  EG, a top doc collector
could choose to call these methods only when the doc was competitive.
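The "collector pulls the score on demand" idea above can be sketched roughly as follows. All names here (ScoreSource, setScorer, the toy search loop) are hypothetical illustrations, not the current Lucene API; the point is just that a collector which never calls score() pays no scoring cost:

```java
public class PullScoring {

    // Hypothetical handle the search hands to the collector: computes
    // the score for the current doc only when asked.
    interface ScoreSource {
        float score();
    }

    // Hypothetical collector base class.
    static abstract class Collector {
        protected ScoreSource scorer;
        void setScorer(ScoreSource s) { scorer = s; }
        abstract void collect(int doc);
    }

    // Never asks for the score (think: counting hits, or sorting by a
    // field without presenting relevance) - zero scoring CPU spent.
    static class CountingCollector extends Collector {
        int hits;
        void collect(int doc) { hits++; }
    }

    // Pulls the score for every hit it sees.
    static class BestDocCollector extends Collector {
        int bestDoc = -1;
        float best = Float.NEGATIVE_INFINITY;
        void collect(int doc) {
            float s = scorer.score();
            if (s > best) { best = s; bestDoc = doc; }
        }
    }

    static int scoreComputations; // counts simulated scoring work

    // Toy "search": docs 0..9 match; scoring is simulated and counted.
    static void search(Collector c) {
        final int[] current = new int[1];
        c.setScorer(new ScoreSource() {
            public float score() {
                scoreComputations++;
                return 1.0f / (1 + current[0]);
            }
        });
        for (current[0] = 0; current[0] < 10; current[0]++) {
            c.collect(current[0]);
        }
    }

    public static void main(String[] args) {
        scoreComputations = 0;
        CountingCollector count = new CountingCollector();
        search(count);
        if (count.hits != 10 || scoreComputations != 0)
            throw new AssertionError("counting collector should not score");

        BestDocCollector best = new BestDocCollector();
        search(best);
        if (best.bestDoc != 0 || scoreComputations != 10)
            throw new AssertionError("best-doc collector scores every hit");
        System.out.println("counting collector triggered 0 score computations");
    }
}
```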

> Anyway, I don't want to add topDocs and getTotalHits to
> HitCollector, it will destroy its generic purpose.

I agree.

> An interface is also problematic, as it just means all of these
> collectors have these methods declared, but they need to implement
> them. An abstract class grants you w/ both.

I'm confused on this objection -- only collectors that do let you ask
for the top N set of docs would implement this interface?  (Ie it'd
only be the TopXXXCollector's that'd implement the interface).  While
interfaces clearly have the future problem of back-compatibility, this
case may be simple enough to make an exception.

> So it looks like HitCollector itself is "deprecated" as far as the
> Lucene core code sees it.

I think HitCollector has a purpose, which is to be the simplest way to
make a custom collector.  Ie I think it makes sense to offer a simple
way and a high performance way.

Mike

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Is TopDocCollector's collect() implementation correct?

2009-03-23 Thread Shai Erera
ok I missed 1483 completely.

As a side comment, why not add setNextReader to HitCollector and then a
getDocId(int doc) method which will do the doc + base arithmetic? I think
it's very easy for someone to forget to add that (+ base) to doc. You could
then just change TopDocCollector to call getDocId() instead of duplicating
it into TopScoreDocCollector.
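A sketch of the suggested helper, with hypothetical names and a simplified
setNextReader signature (the real per-segment API would also receive the
IndexReader, not just the doc base):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DocBaseDemo {

    // Hypothetical base class: it tracks the per-segment doc base
    // itself, so subclasses cannot forget the "+ base" arithmetic.
    static abstract class BaseAwareHitCollector {
        private int docBase;

        // Called by the searcher when it advances to the next sub-reader.
        void setNextReader(int docBase) { this.docBase = docBase; }

        // Maps a segment-relative doc id to a top-level doc id.
        protected final int getDocId(int doc) { return doc + docBase; }

        abstract void collect(int doc, float score);
    }

    // Example subclass: records top-level ids via getDocId().
    static class IdGatherer extends BaseAwareHitCollector {
        final List<Integer> ids = new ArrayList<Integer>();
        void collect(int doc, float score) { ids.add(getDocId(doc)); }
    }

    public static void main(String[] args) {
        IdGatherer c = new IdGatherer();
        c.setNextReader(0);   // first segment starts at top-level doc 0
        c.collect(0, 1f);
        c.collect(2, 1f);
        c.setNextReader(3);   // second segment: 3 docs came before it
        c.collect(1, 1f);
        if (!c.ids.equals(Arrays.asList(0, 2, 4)))
            throw new AssertionError(c.ids);
        System.out.println("top-level ids: " + c.ids);
    }
}
```

The extra virtual call in getDocId() is the overhead concern raised below; whether the JIT inlines it away in this hot loop is exactly the open question.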

Isn't that something you'd want all HitCollector implementations to use? 
Consider some extensions of HitCollector we have - we will now probably want 
to change them to extend MultiReaderHitCollector, but we'll have to remember 
to do that +base arithmetic everywhere, instead of calling getDocId(). I 
understand that changing the call to getDocId is the same as adding "+ base", 
from an effort perspective, but I think it's better this way. It does involve 
an additional method call, but I wonder how well compilers will handle that.

Anyway, I don't want to add topDocs and getTotalHits to HitCollector, it
will destroy its generic purpose. An interface is also problematic, as it
just means all of these collectors have these methods declared, but they
need to implement them. An abstract class grants you w/ both.

So in case you agree that the logic of MultiReaderHitCollector can (and
should?) be in HitCollector, we can create an abstract class called
ScoringCollector (or if nobody objects TopDocsCollector) which will
implement these two methods.
In case you disagree, we can have that abstract class extend
MultiReaderHitCollector instead.

I'm in favor of the first option as, at least as it looks in the code, 
HitCollector is not extended by any class anymore, except TopDocCollector, 
which is marked as deprecated, and 3 anonymous implementations. So it looks 
like HitCollector itself is "deprecated" as far as the Lucene core code sees 
it.

What do you think?

Shai

On Mon, Mar 23, 2009 at 4:43 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> > If we're already creating a new TopScoreDocCollector (when was it
> > added?  I must have been dozing off while this happened...)
>
> This was LUCENE-1483.
>
> > How about if we introduce an abstract ScoringCollector (about the
> > name later) which implements topDocs() and getTotalHits() and there
> > will be several implementations of it, such as:
> > TopScoreDocCollector, which sorts the documents by their score, in
> > descending order only, TopFieldDocCollector - for sorting by fields,
> > and additional sort-by collectors.
>
> This sounds good... but the challenge is we also need to get both
> HitCollector and MultiReaderHitCollector in there.
>
> HitCollector is the simplest way to create a custom collector.
> MultiReaderHitCollector (added with LUCENE-1483) is the more
> performant way, since it lets your collector operate per-segment.  All
> non-deprecated core collectors in Lucene now subclass
> MultiReaderHitCollector.
>
> So would we make separate subclasses for each of them to add
> getTotalHits() / topDocs()?  EG TopDocsHitCollector and
> TopDocsMultiReaderHitCollector?  It's getting confusing.
>
> Or maybe we just add totalHits() and topDocs() to HitCollector even
> though for the advanced case (non-top-N collection) the methods would
> not be used?
>
> Or... maybe this is a time when an interface is the lesser evil: we
> could make a TopDocs interface that the necessary classes implement?
>
> Mike
>
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
>


[jira] Resolved: (LUCENE-1555) Deadlock while optimize

2009-03-23 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1555.


Resolution: Incomplete

Need more details here.

> Deadlock while optimize
> ---
>
> Key: LUCENE-1555
> URL: https://issues.apache.org/jira/browse/LUCENE-1555
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 2.4
> Environment: ubuntu 8.04, java 1.6 update 07, Lucene 2.4.0
>Reporter: Stefan Heidrich
>Assignee: Michael McCandless
>
> Sometimes after starting the thread with the indexer, the thread will hang 
> with the following thread states.
> Thread [Lucene Merge Thread #0] (Suspended)  
>   IndexWriter.commitMerge(MergePolicy$OneMerge, SegmentMerger, int) Line: 
> 3751
>   IndexWriter.mergeMiddle(MergePolicy$OneMerge) Line: 4240
>   IndexWriter.merge(MergePolicy$OneMerge) Line: 3877  
>   ConcurrentMergeScheduler.doMerge(MergePolicy$OneMerge) Line: 205
>   ConcurrentMergeScheduler$MergeThread.run() Line: 260
> Thread [Indexer] (Suspended) 
>   Object.wait(long) Line: not available [native method]  
>   IndexWriter.doWait() Line: 4491 
>   IndexWriter.optimize(int, boolean) Line: 2268   
>   IndexWriter.optimize(boolean) Line: 2203
>   IndexWriter.optimize() Line: 2183   
>   Indexer.run() Line: 263 
> If you need more information, please let me know.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Modularization

2009-03-23 Thread Yonik Seeley
On Mon, Mar 23, 2009 at 11:10 AM, Michael McCandless wrote:
>   4. Move contrib/* under src/java/*, updating the javadocs to state
>       back compatibility promises per class/package.

- contrib has always had a lower bar and stuff was committed under
that lower bar - there should be no blanket promotion.
- contrib items may have different dependencies... putting it all
under the same source root can make a developer's job harder
- many contrib items are less related to lucene-java core indexing and
searching... if there is no contrib, then they don't belong in the
lucene-java project at all.
- right now it's clear - core can't have dependencies on non-core
classes.  If everything is stuck in the same source tree, that goes
away.

I think there are a lot of benefits to continue considering very
carefully if something is "core" or not.

-Yonik

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Modularization

2009-03-23 Thread Michael McCandless
Michael Busch wrote:

>> And I don't think the sudden separation of "core" vs "contrib"
>> should be so prominent (or even visible); it's really a detail of
>> how we manage source control.
>
>> When looking at the website I'd like read that Lucene can do hit
>> highlighting, powerful query parsing, spell checking, analyze
>> different languages, etc.  I couldn't care less that some of these
>> happen to live under a "contrib" subdirectory somewhere in the
>> source control system.
>
> OK, so I think we all agree about the packaging. But I believe it is
> also important how the source code is organized. Maybe Lucene
> consumers don't care too much, however, Lucene is an open source
> project. So we also want to attract possible contributors with a
> nicely organized code base. If there is a clear separation between
> the different components on a source code level, becoming familiar
> with Lucene as a contributor might not be so overwhelming.

+1

We want the source code to be well organized: consumability by Lucene
developers (not just Lucene users) is also important for Lucene's
future growth.

> Besides that, I think a one-to-one mapping between the packaging and
> the source code has no disadvantages. (and it would certainly make
> the build scripts easier!)

Right.

So, towards that... why even break out contrib vs core, in source
control?  Can't we simply migrate contrib/* into core, in the right
places?

>> Could we, instead, adopt some standard way (in the package
>> javadocs) of stating the maturity/activity/back compat policies/etc
>> of a given package?
>
> This makes sense; e.g. we could release new modules as beta versions
> (= use at own risk, no backwards-compatibility).

In fact we already have a 2.9 Jira issue opened to better document the
back-compat/JDK version requirements of all packages.

I think, like we've done with core lately when a new feature is added,
we could have the default assumption be full back compatibility, but
then those classes/methods/packages that are very new and may change
simply say so clearly in their javadocs.

> And if we start a new module (e.g. a GSoC project) we could exclude
> it from a release easily if it's truly experimental and not in a
> release-able state.

Right.

>> So I think the beginnings of a rough proposal is taking shape, for
>>3.0:

>>   1. Fix web site to give a better intro to Lucene's features,
>>   without exposing the core vs. contrib distinction, which is false
>>   to the Lucene consumer
>>
>>   2. When releasing, we make a single JAR holding core & contrib
>>   classes for a given area.  The final JAR files don't contain a
>>   "core" vs "contrib" distinction.
>>
>>   3. We create a "bundled" JAR that has the common packages
>>   "typically" needed (index/search core, analyzers, queries,
>>   highlighter, spellchecker)
>
> +1 to all three points.

OK.

So I guess I'm proposing adding:

   4. Move contrib/* under src/java/*, updating the javadocs to state
   back compatibility promises per class/package.

I think net/net this'd be a great simplification?

Mike

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Is TopDocCollector's collect() implementation correct?

2009-03-23 Thread Michael McCandless
> If we're already creating a new TopScoreDocCollector (when was it
> added?  I must have been dozing off while this happened...)

This was LUCENE-1483.

> How about if we introduce an abstract ScoringCollector (about the
> name later) which implements topDocs() and getTotalHits() and there
> will be several implementations of it, such as:
> TopScoreDocCollector, which sorts the documents by their score, in
> descending order only, TopFieldDocCollector - for sorting by fields,
> and additional sort-by collectors.

This sounds good... but the challenge is we also need to get both
HitCollector and MultiReaderHitCollector in there.

HitCollector is the simplest way to create a custom collector.
MultiReaderHitCollector (added with LUCENE-1483) is the more
performant way, since it lets your collector operate per-segment.  All
non-deprecated core collectors in Lucene now subclass
MultiReaderHitCollector.
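The per-segment idea is that the collector is notified each time search moves to a new segment and receives that segment's starting offset ("docBase"), so it can remap segment-local doc ids to global ones. Here is a self-contained toy sketch of just that remapping; the class and method names below are made up for illustration and are not the actual LUCENE-1483 API:

```java
import java.util.ArrayList;
import java.util.List;

public class PerSegmentDemo {
    // Illustrative only: the real class added in LUCENE-1483 is
    // MultiReaderHitCollector; this sketch keeps only the docBase idea.
    static abstract class SegmentCollector {
        // Called once per segment; docBase is the segment's global doc offset.
        abstract void setNextSegment(int docBase);
        // Called per hit with a segment-local doc id.
        abstract void collect(int doc, float score);
    }

    static List<Integer> run() {
        final List<Integer> globalIds = new ArrayList<Integer>();
        SegmentCollector c = new SegmentCollector() {
            int docBase;
            void setNextSegment(int docBase) { this.docBase = docBase; }
            void collect(int doc, float score) { globalIds.add(docBase + doc); }
        };
        c.setNextSegment(0);   // first segment starts at global doc 0
        c.collect(0, 1.0f);
        c.collect(2, 0.5f);
        c.setNextSegment(3);   // second segment: its local doc 0 is global doc 3
        c.collect(1, 0.9f);
        return globalIds;      // local ids remapped to global ids
    }

    public static void main(String[] args) {
        System.out.println(run());   // prints [0, 2, 4]
    }
}
```

Because the collector only ever sees one segment at a time, per-document work (field caches, comparators) can be done against the smaller segment reader, which is where the speedup comes from.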

So would we make separate subclasses for each of them to add
getTotalHits() / topDocs()?  EG TopDocsHitCollector and
TopDocsMultiReaderHitCollector?  It's getting confusing.

Or maybe we just add totalHits() and topDocs() to HitCollector even
though for the advanced case (non-top-N collection) the methods would not
be used?

Or... maybe this is a time when an interface is the lesser evil: we
could make a TopDocs interface that the necessary classes implement?

Mike

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1410) PFOR implementation

2009-03-23 Thread Eks Dev (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688284#action_12688284 ]

Eks Dev commented on LUCENE-1410:
-

It looks like Google went there as well (block encoding); see:

Blog: http://blogs.sun.com/searchguy/entry/google_s_postings_format
Slides 47-63: http://research.google.com/people/jeff/WSDM09-keynote.pdf
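The core PFOR trick is to bit-pack most values of a block at one small fixed width and store the rare outliers separately as "patches" that are re-applied at decode time. A toy sketch of that idea, with no actual bit packing and made-up names, just to show the encode/patch/decode round trip:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class PForDemo {
    // Toy encoded form: a chosen bit width, one slot per input value, and a
    // patch list of (position -> actual value) for values that did not fit.
    // A real PFOR implementation packs the slots into `bits`-wide fields.
    static class Block {
        int bits;
        int[] slots;
        Map<Integer, Integer> patches = new HashMap<Integer, Integer>();
    }

    static Block encode(int[] values, int bits) {
        Block b = new Block();
        b.bits = bits;
        b.slots = new int[values.length];
        int max = (1 << bits) - 1;           // largest value that fits
        for (int i = 0; i < values.length; i++) {
            if (values[i] <= max) {
                b.slots[i] = values[i];      // fits in the packed area
            } else {
                b.slots[i] = 0;              // placeholder slot
                b.patches.put(i, values[i]); // exception stored separately
            }
        }
        return b;
    }

    static int[] decode(Block b) {
        int[] out = b.slots.clone();
        for (Map.Entry<Integer, Integer> e : b.patches.entrySet())
            out[e.getKey()] = e.getValue();  // patch the exceptions back in
        return out;
    }

    public static void main(String[] args) {
        int[] gaps = {3, 1, 4, 1, 500, 2};   // e.g. doc-id deltas; 500 is an outlier
        Block b = encode(gaps, 4);           // 4 bits cover values up to 15
        System.out.println(Arrays.toString(decode(b)));
    }
}
```

The win is that the common small values cost only `bits` bits each, while the occasional large value doesn't force the whole block to a wider encoding.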



> PFOR implementation
> ---
>
> Key: LUCENE-1410
> URL: https://issues.apache.org/jira/browse/LUCENE-1410
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Other
>Reporter: Paul Elschot
>Priority: Minor
> Attachments: autogen.tgz, LUCENE-1410b.patch, LUCENE-1410c.patch, 
> LUCENE-1410d.patch, LUCENE-1410e.patch, TermQueryTests.tgz, TestPFor2.java, 
> TestPFor2.java, TestPFor2.java
>
>   Original Estimate: 21840h
>  Remaining Estimate: 21840h
>
> Implementation of Patched Frame of Reference.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org