Re: IndexWriter.rollback() logic

2009-03-18 Thread Nadav Har'El
On Mon, Feb 23, 2009, Jason Rutherglen wrote about Re: IndexWriter.rollback() 
logic:
 Howdy An,
 
 Commit means the changes are committed, there's no rollback at that point.
 
 Also in the future, please post your questions to java-dev@lucene.apache.org

Actually, An does make a good point that needs to be corrected (by developers,
not by users ;-)) - the javadoc is a bit misleading. rollback's javadoc says

  Close the IndexWriter without committing any of the changes that have
  occurred since it was opened. This removes any temporary files that had
  been created, after which the state of the index will be the same as it
  was when this writer was first opened. 

But, this isn't exactly true - it doesn't always revert to the state of the
open(), but rather to the last commit() if such was done. For most intents
and purposes (including this one), commit() is equivalent to a close()
followed by a new open(), but a person reading this javadoc wouldn't know that.
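A minimal sketch of the distinction (hypothetical, not code from the thread; openWriter and the constructor details are elided/invented, but commit()/rollback() are IndexWriter's real methods):

```java
// Sketch only: illustrates that rollback() reverts to the last commit(),
// not to the state at open().
IndexWriter writer = openWriter(dir);  // hypothetical helper; args elided
writer.addDocument(docA);
writer.commit();            // checkpoint: docA is now durable
writer.addDocument(docB);   // pending change, not yet committed
writer.rollback();          // discards docB only; the index now matches
                            // the last commit(), not the state at open()
```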

-- 
Nadav Har'El                | Wednesday, Mar 18 2009, 22 Adar 5769
IBM Haifa Research Lab      |-----------------------------------------
http://nadav.harel.org.il   |Hi! I'm a signature virus! Copy me into
                            |your signature to help me spread!

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Make TermScorer non final

2009-03-18 Thread Simon Willnauer
Nothing different, I'm just concerned about the performance, as the
SpanQuerys take about twice as long as a term query.
I ran a little benchmark and found BoostingTermQuery to be 1.5 times
slower than TermQuery without any payloads in the index.
In some use cases this could be important, especially where the power of
a span query is not required.

Maybe I'm missing something; if so, please let me know.
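For what it's worth, the general shape of such a micro-benchmark can be sketched with a tiny standalone harness (the two tasks below are placeholders, not real TermQuery/BoostingTermQuery executions, which would need a searcher and an index):

```java
import java.util.function.Supplier;

// Generic micro-benchmark skeleton; the timed tasks are stand-ins only.
public class QueryBench {
    // Average wall-clock nanoseconds per call, with a warm-up pass so the
    // JIT has compiled the hot path before timing starts.
    static long avgNanos(Supplier<?> task, int iters) {
        for (int i = 0; i < iters; i++) task.get();   // warm-up
        long start = System.nanoTime();
        for (int i = 0; i < iters; i++) task.get();
        return Math.max(1, (System.nanoTime() - start) / iters);
    }

    public static void main(String[] args) {
        // placeholders for "run a TermQuery" / "run a BoostingTermQuery"
        long base = avgNanos(() -> Integer.sum(1, 2), 200_000);
        long payload = avgNanos(() -> Math.sin(System.nanoTime()), 200_000);
        System.out.printf("slowdown factor: %.2fx%n", (double) payload / base);
    }
}
```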

simon
On Tue, Mar 17, 2009 at 11:15 PM, Grant Ingersoll gsing...@apache.org wrote:
 What does PayloadTermQuery do that BoostingTermQuery doesn't do?

 -Grant

 On Mar 17, 2009, at 1:27 PM, Simon Willnauer wrote:

 Hi, I looked at TermScorer today in order to implement a TermQuery that
 utilizes Payloads from the index.
 I realized that this class is final in the current trunk. It's kind of
 obvious that it is declared final for optimization purposes.
 I want to know if it is possible to make it non-final in the next
 release or later, to use it in a PayloadTermQuery class.
 I would like to reuse this code and do some additional cleanups, like
 removing the code redundancy in score() / score(HitCollector, int).

 Thanks,
 Simon




Re: IndexWriter.rollback() logic

2009-03-18 Thread Michael McCandless


Nadav Har'El wrote:

On Mon, Feb 23, 2009, Jason Rutherglen wrote about Re: IndexWriter.rollback()
logic:

Howdy An,

Commit means the changes are committed, there's no rollback at that point.

Also in the future please post your questions to java-dev@lucene.apache.org


Actually, An does make a good point that needs to be corrected (by developers,
not by users ;-)) - the javadoc is a bit misleading. rollback's javadoc says

  Close the IndexWriter without committing any of the changes that have
  occurred since it was opened. This removes any temporary files that had
  been created, after which the state of the index will be the same as it
  was when this writer was first opened.

But, this isn't exactly true - it doesn't always revert to the state of the
open(), but rather to the last commit() if such was done. For most intents
and purposes (including this one), commit() is equivalent to a close()
followed by a new open(), but a person reading this javadoc wouldn't know that.

Thanks Nadav; I'll fix the javadocs.

Mike




Re: IndexWriter.rollback() logic

2009-03-18 Thread Michael McCandless


Also, rollback is still possible after a commit as long as you're using
a deletion policy that keeps more than one commit around, by
opening the IndexWriter on a prior commit point.
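A hedged sketch of that approach (KeepAllCommitsPolicy and pickPriorCommit are invented names for illustration; IndexReader.listCommits and the IndexWriter constructor taking an IndexCommit follow the 2.9-era API, so exact signatures may differ):

```java
// 1. Index with a deletion policy that keeps every commit point alive
//    (an IndexDeletionPolicy implementation; invented class name).
IndexDeletionPolicy keepAll = new KeepAllCommitsPolicy();

// 2. Later, list the surviving commits and reopen the writer on an
//    earlier one, effectively rolling back past the newer commit(s).
IndexCommit prior = pickPriorCommit(IndexReader.listCommits(dir)); // hypothetical chooser
IndexWriter w = new IndexWriter(dir, analyzer, keepAll,
                                IndexWriter.MaxFieldLength.UNLIMITED, prior);
```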

Mike

Nadav Har'El wrote:

On Mon, Feb 23, 2009, Jason Rutherglen wrote about Re: IndexWriter.rollback()
logic:

Howdy An,

Commit means the changes are committed, there's no rollback at that point.

Also in the future please post your questions to java-dev@lucene.apache.org


Actually, An does make a good point that needs to be corrected (by developers,
not by users ;-)) - the javadoc is a bit misleading. rollback's javadoc says

  Close the IndexWriter without committing any of the changes that have
  occurred since it was opened. This removes any temporary files that had
  been created, after which the state of the index will be the same as it
  was when this writer was first opened.

But, this isn't exactly true - it doesn't always revert to the state of the
open(), but rather to the last commit() if such was done. For most intents
and purposes (including this one), commit() is equivalent to a close()
followed by a new open(), but a person reading this javadoc wouldn't know that.

--
Nadav Har'El                | Wednesday, Mar 18 2009, 22 Adar 5769
IBM Haifa Research Lab      |-----------------------------------------
http://nadav.harel.org.il   |Hi! I'm a signature virus! Copy me into
                            |your signature to help me spread!




Re: Make TermScorer non final

2009-03-18 Thread Grant Ingersoll
See https://issues.apache.org/jira/browse/LUCENE-1017 for some  
background.  Have you measured BTQ versus the SpanTermQuery?  Position  
based stuff is often slower.


SpanQueries could use some performance assessments, that is for sure.   
Ideally, I think you should compare:

TermQuery v. SpanTQ v. BTQ

-Grant


On Mar 18, 2009, at 5:43 AM, Simon Willnauer wrote:


Nothing different, I'm just concerned about the performance, as the
SpanQuerys take about twice as long as a term query.
I ran a little benchmark and found BoostingTermQuery to be 1.5 times
slower than TermQuery without any payloads in the index.
In some use cases this could be important, especially where the power of
a span query is not required.

Maybe I'm missing something; if so, please let me know.

simon
On Tue, Mar 17, 2009 at 11:15 PM, Grant Ingersoll gsing...@apache.org wrote:

What does PayloadTermQuery do that BoostingTermQuery doesn't do?

-Grant

On Mar 17, 2009, at 1:27 PM, Simon Willnauer wrote:


Hi, I looked at TermScorer today in order to implement a TermQuery that
utilizes Payloads from the index.
I realized that this class is final in the current trunk. It's kind of
obvious that it is declared final for optimization purposes.
I want to know if it is possible to make it non-final in the next
release or later, to use it in a PayloadTermQuery class.
I would like to reuse this code and do some additional cleanups, like
removing the code redundancy in score() / score(HitCollector, int).

Thanks,
Simon




[jira] Commented: (LUCENE-1522) another highlighter

2009-03-18 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12682987#action_12682987
 ] 

Michael McCandless commented on LUCENE-1522:



OK to sum up here with observations / wish list / ideas /
controversies / etc. for Lucene's future merged highlighter:

  * Fragmenter should aim for fast eye + brain scanning
consumability (eg, try hard to start on sentence boundaries,
include context)

  * Let's try for single source -- each Query/Weight/Scorer should be
able to enumerate the set of term positions/spans that caused it
to match a specific doc (like explain(), but provides
positions/spans detailing the match).  Trying to reverse
engineer the matching is brittle

  * Sliding window is better than static top down fragmentation

  * To scale, we should make a simple IndexReader impl on top of term
vectors, but still allow the re-index single doc on the fly
option

  * Favoring breadth (more unique terms instead of many occurrences of
certain terms) seems important, except for too-many-term queries
where this gets unwieldy

  * Prefer a single fragment if it scores well enough, but fall back
to several, if necessary, to show breadth

  * Produce structured output so non-HTML front ends (eg Flex) can
render

  * Try to include context around the hits, when possible (eg the
favor-the-middle-of-the-sentence approach that Michael described)

  * Maybe or maybe don't let IDF affect fragment scoring

  * Performance is important -- use TermVectors if present, add early
termination if you've already found a good enough fragdoc, etc.

  * Maybe a tree-based fragdoc enumeration / searching model; I think
this'd be even more efficient than sliding window, especially for
large docs

  * Multi-color, heat-map, out-of-the-box default HTML UIs are nice

  * It's all very subjective and quite a good challenge!!

In the meantime, it seems like we should commit this H2 and give users
the choice?  We can then iterate over time on our wish list.


 another highlighter
 ---

 Key: LUCENE-1522
 URL: https://issues.apache.org/jira/browse/LUCENE-1522
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/highlighter
Reporter: Koji Sekiguchi
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: colored-tag-sample.png, LUCENE-1522.patch, 
 LUCENE-1522.patch


 I've written this highlighter for my project to support bi-gram token streams 
 (a general token stream (e.g. WhitespaceTokenizer) is also supported; see the 
 test code in the patch). The idea was inherited from my previous project with 
 my colleague and LUCENE-644. This approach needs highlight fields to be 
 TermVector.WITH_POSITIONS_OFFSETS, but is fast and can support N-grams. This 
 depends on LUCENE-1448 to get refined term offsets.
 usage:
 {code:java}
 TopDocs docs = searcher.search( query, 10 );
 Highlighter h = new Highlighter();
 FieldQuery fq = h.getFieldQuery( query );
 for( ScoreDoc scoreDoc : docs.scoreDocs ){
   // fieldName=content, fragCharSize=100, numFragments=3
   String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc,
 "content", 100, 3 );
   if( fragments != null ){
 for( String fragment : fragments )
   System.out.println( fragment );
   }
 }
 {code}
 features:
 - fast for large docs
 - supports not only whitespace-based token stream, but also fixed size 
 N-gram (e.g. (2,2), not (1,3)) (can solve LUCENE-1489)
 - supports PhraseQuery, phrase-unit highlighting with slops
 {noformat}
 q=w1 w2
 <b>w1 w2</b>
 ---
 q=w1 w2~1
 <b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
 {noformat}
 - highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
 - easy to apply patch due to independent package (contrib/highlighter2)
 - uses Java 1.5
 - looks at query boost to score fragments (currently doesn't see idf, but it 
 should be possible)
 - pluggable FragListBuilder
 - pluggable FragmentsBuilder
 to do:
 - term positions can be unnecessary when phraseHighlight==false
 - collect performance numbers

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Commented: (LUCENE-1522) another highlighter

2009-03-18 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12682985#action_12682985
 ] 

Michael McCandless commented on LUCENE-1522:



{quote}
 ANDQuery, ORQuery, and RequiredOptionalQuery just return the union of the
 spans produced by their children.
 
 Hmm - it seems like that loses information. Ie, for ANDQuery, you lose the 
 fact that you should try to include a match from each of the sub-clauses' 
 spans.

A good idea. ANDQuery's highlightSpans() method could probably be improved by
post-processing the child spans to take this into account. That way we
wouldn't have to gum up the main Highlighter code with a bunch of conditionals
which afford special treatment to certain query types.
{quote}

I think we may need a tree-structured result returned by the
Weight/Scorer, compactly representing the space of valid fragdocs
for this one doc.  And then somehow we walk that tree,
enumerating/scoring individual valid fragdocs that are created from
that tree.

{quote}
 What I meant was: all other things being equal, do you more strongly
 favor a fragment that has all N of the terms in a query vs another
 fragment that has fewer than N but say higher net number of occurrences.

No, the diversity of the terms in a fragment isn't factored in. The span 
objects only tell the Highlighter that a particular range of characters 
was important; they don't say why.

However, note that IDF would prevent a bunch of hits on "the" from causing too
hot a hotspot in the heat map. So you're likely to see fragments with high
discriminatory value.
{quote}

This still seems subjectively wrong to me.  If I search for "president
bush", probably "bush" is the rarer term, and so you would favor showing
me a single fragment that had "bush" occur twice, over a fragment that
had a single occurrence of "president" and "bush"?

{quote}
 Google picks more than one fragment; it seems like it picks one or two
 fragments.

I probably overstated my opposition to supplying an excerpt containing more
than one fragment. It seems OK to me to select more than one, so long as they
all scan easily, and so long as the excerpts don't get long enough to force
excessive scrolling and slow down the time it takes the user to scan the whole
results page.

What bothers me is that the excerpts don't scan easily right now. I consider
that a much more important defect than the fact that the fragdoc doesn't hit 
every term (which isn't even possible for large queries), and it seemed to me 
that pursuing exhaustive term matching was likely to yield even more highly 
fragmented, visually chaotic fragdocs.
{quote}

Which excerpts don't scan easily right now?  Google's, KS's, Lucene's
H1 or H2?

I think with a tree structure representing the search space for all
fragdocs, we could then efficiently enumerate fragdocs with an
appropriate scoring model (favoring sentence starts or surrounding
context, breadth of terms, etc.).  This way we can do a real search
(on all fragdocs) subject to the preference for
consumability/breadth.



Re: Make TermScorer non final

2009-03-18 Thread Michael McCandless


Coming from the discussions in LUCENE-1522 (improving highlighter), I
think at some point we should merge Span*Query into their normal
counterparts, if possible.

Ie, there should be only one TermQuery that can do both what the
current TermQuery does, and also what SpanTermQuery does.  It's able
to enumerate the spans/payloads for a given document, and if you don't
request those, the performance should hopefully be equal to that of
the current TermQuery.

The highlighter would in fact request spans for a normal TermQuery,
on a single-doc index at a time, in order to locate the hits.

Likewise for SpanOrQuery, SpanAndQuery.

I have no real sense of how much work this is, what problems would
ensue (eg possible difference in scoring, etc.), but from
highlighter's standpoint, ideally all queries need to be able to
enumerate the collection of positions that established the match.

Mike

Grant Ingersoll wrote:

See https://issues.apache.org/jira/browse/LUCENE-1017 for some
background.  Have you measured BTQ versus the SpanTermQuery?
Position based stuff is often slower.


SpanQueries could use some performance assessments, that is for
sure.  Ideally, I think you should compare:

TermQuery v. SpanTQ v. BTQ

-Grant


On Mar 18, 2009, at 5:43 AM, Simon Willnauer wrote:

Nothing different, I'm just concerned about the performance, as the
SpanQuerys take about twice as long as a term query.
I ran a little benchmark and found BoostingTermQuery to be 1.5 times
slower than TermQuery without any payloads in the index.
In some use cases this could be important, especially where the power
of a span query is not required.

Maybe I'm missing something; if so, please let me know.

simon
On Tue, Mar 17, 2009 at 11:15 PM, Grant Ingersoll gsing...@apache.org wrote:

What does PayloadTermQuery do that BoostingTermQuery doesn't do?

-Grant

On Mar 17, 2009, at 1:27 PM, Simon Willnauer wrote:

Hi, I looked at TermScorer today in order to implement a TermQuery that
utilizes Payloads from the index.
I realized that this class is final in the current trunk. It's kind of
obvious that it is declared final for optimization purposes.
I want to know if it is possible to make it non-final in the next
release or later, to use it in a PayloadTermQuery class.
I would like to reuse this code and do some additional cleanups, like
removing the code redundancy in score() / score(HitCollector, int).

Thanks,
Simon




Re: Make TermScorer non final

2009-03-18 Thread Mark Miller

In some usecases this could be important especially where the power of
a span query is not required.


I think the power of a span query is required for payloads, though - a term 
query will not hit each position to do payload loading; there is no need for 
a term query to enumerate positions. Right?




Simon Willnauer wrote:

Nothing different, I'm just concerned about the performance, as the
SpanQuerys take about twice as long as a term query.
I ran a little benchmark and found BoostingTermQuery to be 1.5 times
slower than TermQuery without any payloads in the index.
In some use cases this could be important, especially where the power of
a span query is not required.

Maybe I'm missing something; if so, please let me know.

simon
On Tue, Mar 17, 2009 at 11:15 PM, Grant Ingersoll gsing...@apache.org wrote:

What does PayloadTermQuery do that BoostingTermQuery doesn't do?

-Grant

On Mar 17, 2009, at 1:27 PM, Simon Willnauer wrote:

Hi, I looked at TermScorer today in order to implement a TermQuery that
utilizes Payloads from the index.
I realized that this class is final in the current trunk. It's kind of
obvious that it is declared final for optimization purposes.
I want to know if it is possible to make it non-final in the next
release or later, to use it in a PayloadTermQuery class.
I would like to reuse this code and do some additional cleanups, like
removing the code redundancy in score() / score(HitCollector, int).

Thanks,
Simon




--
- Mark

http://www.lucidimagination.com







Re: Make TermScorer non final

2009-03-18 Thread Simon Willnauer
On Wed, Mar 18, 2009 at 1:32 PM, Mark Miller markrmil...@gmail.com wrote:
 In some usecases this could be important especially where the power of
 a span query is not required.

 I think the power of a spanquery is required for payloads though - the term
 query will not hit each position to do payload loading - there is no need
 for termquery to enumerate positions. Right?
No, you are right; a term query does not need to enumerate the TermPositions.
This doesn't mean that it cannot look at them, though. Issue
https://issues.apache.org/jira/browse/LUCENE-1017 apparently did
some measurements without finding a significant performance improvement. I
didn't expect a large improvement anyway, but without knowledge of
that issue it was worth looking into.

One thing I want to mention as an aside: as long as TermScorer is final there
is no problem with the implementation besides some redundant code.
TermScorer does not use the float score() method to calculate the
score in score(HitCollector, int); rather, it duplicates the code, for
performance reasons I assume. I have cleaned up this code a little
as I was going to implement payloads using this class. If it is
desirable to have this code cleaned up, I can submit a patch.
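The cleanup Simon describes can be illustrated with a stripped-down standalone sketch (invented names; this is not Lucene's actual TermScorer): the bulk-scoring loop delegates to score(), so the scoring formula lives in one place instead of being duplicated inline.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a toy scorer showing the de-duplication idea.
interface ScoreCollector { void collect(int doc, float score); }

class SketchScorer {
    private final float[] tf;   // pretend per-doc term frequencies
    private int doc = -1;

    SketchScorer(float[] tf) { this.tf = tf; }

    boolean next() { return ++doc < tf.length; }

    // The scoring formula lives in exactly one place...
    float score() { return (float) Math.sqrt(tf[doc]); }

    // ...and the bulk loop reuses it rather than inlining a copy.
    void scoreAll(ScoreCollector collector) {
        while (next()) collector.collect(doc, score());
    }
}

public class TermScorerCleanupSketch {
    public static void main(String[] args) {
        List<Float> scores = new ArrayList<>();
        new SketchScorer(new float[] {1f, 4f, 9f})
            .scoreAll((d, s) -> scores.add(s));
        System.out.println(scores); // prints [1.0, 2.0, 3.0]
    }
}
```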

Thanks,

simon




 Simon Willnauer wrote:

 Nothing different, I'm just concerned about the performance, as the
 SpanQuerys take about twice as long as a term query.
 I ran a little benchmark and found BoostingTermQuery to be 1.5 times
 slower than TermQuery without any payloads in the index.
 In some use cases this could be important, especially where the power of
 a span query is not required.

 Maybe I'm missing something; if so, please let me know.

 simon
 On Tue, Mar 17, 2009 at 11:15 PM, Grant Ingersoll gsing...@apache.org
 wrote:

 What does PayloadTermQuery do that BoostingTermQuery doesn't do?

 -Grant

 On Mar 17, 2009, at 1:27 PM, Simon Willnauer wrote:

 Hi, I looked at TermScorer today in order to implement a TermQuery that
 utilizes Payloads from the index.
 I realized that this class is final in the current trunk. It's kind of
 obvious that it is declared final for optimization purposes.
 I want to know if it is possible to make it non-final in the next
 release or later, to use it in a PayloadTermQuery class.
 I would like to reuse this code and do some additional cleanups, like
 removing the code redundancy in score() / score(HitCollector, int).

 Thanks,
 Simon







[jira] Commented: (LUCENE-1522) another highlighter

2009-03-18 Thread Marvin Humphrey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683030#action_12683030
 ] 

Marvin Humphrey commented on LUCENE-1522:
-

 I think we may need a tree-structured result returned by the
 Weight/Scorer, compactly representing the space of valid fragdocs
 for this one doc. And then somehow we walk that tree,
 enumerating/scoring individual valid fragdocs that are created from
 that tree.

Something like that.  An array of span scores is too limited; a full-fledged
class would do better.  Designing that class requires striking a balance
between what information we think is useful and what information Highlighter
can sanely reduce.  By proposing the tree structure, you're suggesting that
Highlighter will reverse engineer boolean matching; that sounds like a lot of
work to me.

 However, note that IDF would prevent a bunch of hits on "the" from causing
 too hot a hotspot in the heat map. So you're likely to see fragments with
 high discriminatory value.

 This still seems subjectively wrong to me. If I search for "president
 bush", probably "bush" is the rarer term and so you would favor showing
 me a single fragment that had "bush" occur twice, over a fragment that
 had a single occurrence of "president" and "bush"?

We've ended up in a false dichotomy.  Favoring high-IDF terms -- or more
accurately, high-scoring character position spans -- and favoring fragments
with high term diversity are not mutually exclusive.

Still, the KS highlighter probably wouldn't do what you describe.  The proximity
boosting accelerates as the spans approach each other, and maxes out if
they're adjacent.  So "bush bush" might be preferred over "president bush",
but "bush or bush" probably wouldn't.

I don't think that there's anything wrong with preferring high term diversity;
the KS highlighter doesn't happen to support favoring fragments with high term
diversity now, but would be improved by adding that capability.  I just don't
think term diversity is so important that it qualifies as a base litmus
test.

There are other ways of choosing good fragments, and IDF is one of them.  If
you want to show why a doc matched a query, it makes sense to show the section
of the document that contributed most to the score, surrounded by a little
context.  
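As a toy illustration of that idea (purely illustrative; not the KS or Lucene highlighter algorithm), a sliding window can be scored by the summed IDF of the query terms it contains, and the best-scoring window shown:

```java
import java.util.Map;

// Toy sliding-window fragment picker: the window whose query terms carry
// the most summed IDF is the "section that contributed most to the score".
public class FragmentSketch {
    static int bestWindowStart(String[] tokens, Map<String, Double> idf, int window) {
        int bestStart = 0;
        double bestScore = -1;
        for (int start = 0; start + window <= tokens.length; start++) {
            double score = 0;
            for (int i = start; i < start + window; i++) {
                score += idf.getOrDefault(tokens[i], 0.0);
            }
            if (score > bestScore) { bestScore = score; bestStart = start; }
        }
        return bestStart;
    }

    public static void main(String[] args) {
        String[] doc = "the president met bush at the white house".split(" ");
        Map<String, Double> idf = Map.of("president", 2.0, "bush", 4.0);
        int start = bestWindowStart(doc, idf, 3);
        System.out.println(start); // prints 1: window "president met bush"
    }
}
```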

 Which excerpts don't scan easily right now? Google's, KS's, Lucene's
 H1 or H2?

Lucene H1.  Too many ellipses, and fragments don't prefer to start on sentence
boundaries.

I have to qualify the assertion that the fragments don't scan well with the
caveat that I'm basing it on a personal impression.  However, I'm pretty
confident about that impression.  I would be stunned if there were not studies
out there demonstrating that sentence fragments which begin at the top are
easier to consume than sentence fragments which begin in the middle.


[jira] Assigned: (LUCENE-1550) Add N-Gram String Matching for Spell Checking

2009-03-18 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll reassigned LUCENE-1550:
---

Assignee: Grant Ingersoll

 Add N-Gram String Matching for Spell Checking
 -

 Key: LUCENE-1550
 URL: https://issues.apache.org/jira/browse/LUCENE-1550
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/spellchecker
Affects Versions: 2.9
Reporter: Thomas Morton
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1550.patch


 N-Gram version of edit distance based on paper by Grzegorz Kondrak, N-gram 
 similarity and distance. Proceedings of the Twelfth International Conference 
 on String Processing and Information Retrieval (SPIRE 2005), pp. 115-126,  
 Buenos Aires, Argentina, November 2005. 
 http://www.cs.ualberta.ca/~kondrak/papers/spire05.pdf






[jira] Commented: (LUCENE-1522) another highlighter

2009-03-18 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683032#action_12683032
 ] 

Mark Miller commented on LUCENE-1522:
-

bq. Lucene H1. Too many ellipses, and fragments don't prefer to start on 
sentence boundaries.

That's not necessarily a property of the Highlighter, just the basic 
implementations we currently supply for the pluggable classes. You can supply a 
custom fragmenter and you can control the number of fragments.




[jira] Commented: (LUCENE-1522) another highlighter

2009-03-18 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683053#action_12683053
 ] 

Michael McCandless commented on LUCENE-1522:



{quote}
Something like that. An array of span scores is too limited; a full fledged
class would do better. Designing that class requires striking a balance
between what information we think is useful and what information Highlighter
can sanely reduce.
{quote}

Agreed, and I'm not sure about the tree structure (just floating
ideas...).  It could very well be overkill.

{quote}
By proposing the tree structure, you're suggesting that 
Highlighter will reverse engineer boolean matching; that sounds like a lot of 
work to me.
{quote}

It wouldn't be reverse engineered: BooleanQuery/Weight/Scorer2 itself
will have returned that.  Ie we would add a method to the scorer:
getSpanTree().

{quote}
Still, the KS highlighter probably wouldn't do what you describe.  The proximity
boosting accelerates as the spans approach each other, and maxes out if 
they're adjacent.  So "bush bush" might be preferred over "president bush", 
but "bush or bush" probably wouldn't.
{quote}

OK, it sounds like one can simply use different models to score
fragdocs and it's still an open debate how much each of these criteria
(IDF, showing surround context, being on sentence boundary, diversity
of terms) should impact the score.  I agree, the basic litmus test I
proposed is too strong.

{quote}
bq. Lucene H1. Too many ellipses, and fragments don't prefer to start on 
sentence boundaries.

That's not necessarily a property of the Highlighter, just the basic
implementations we currently supply for the pluggable classes. You can
supply a custom fragmenter and you can control the number of
fragments.
{quote}

I agree: H1 is very pluggable and one could plug in a better
fragmenter, but we don't offer such an impl in H1, and this is a case
where out-of-the-box defaults are very important.





[jira] Assigned: (LUCENE-1145) DisjunctionSumScorer small tweak

2009-03-18 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-1145:
--

Assignee: Michael McCandless

 DisjunctionSumScorer small tweak
 

 Key: LUCENE-1145
 URL: https://issues.apache.org/jira/browse/LUCENE-1145
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
 Environment: all
Reporter: Eks Dev
Assignee: Michael McCandless
Priority: Trivial
 Fix For: 2.9

 Attachments: DisjunctionSumScorerOptimization.patch, 
 DSSQueueSizeOptimization.patch, TestScorerPerformance.java


 Move ScorerDocQueue initialization from next() and skipTo() methods to the 
 Constructor. Makes DisjunctionSumScorer a bit faster (less than 1% on my 
 tests). 
 Downside (if this is one, I cannot judge) would be throwing IOException from 
 DisjunctionSumScorer constructors as we touch HardDisk there. I see no 
 problem as this IOException does not propagate too far (the only modification 
 I made is in BooleanScorer2)
 if (scorerDocQueue == null) {
   initScorerDocQueue();
 }
  
 Attached test is just a quick & dirty rip of TestScorerPerf from the standard 
 Lucene test package. Not included as patch as I do not like it.
 All tests pass; patch made on trunk revision 613923




GSoC 09 project ideas...

2009-03-18 Thread Zaid Md. Abdul Wahab Sheikh
Hi lucene,
In this link http://wiki.apache.org/general/SummerOfCode2009 , there are no
project ideas for Lucene proper. (Only ideas for Mahout listed). Please put
up some ideas for Lucene there or please mention some popular open issues
that might be suitable as a GSoC project.
I would very much like to work on Lucene during Summer of Code 09. I am
currently researching/doing a project on Realtime search.
It seems a contrib exists for realtime search in Lucene.
http://issues.apache.org/jira/browse/LUCENE-1313. Can anyone give me an
update on its status? Is that sufficient/complete, or should I start
investigating possibilities of integrating 'realtime' search in Lucene.
Please comment.

Z.S.


[jira] Commented: (LUCENE-1145) DisjunctionSumScorer small tweak

2009-03-18 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683058#action_12683058
 ] 

Michael McCandless commented on LUCENE-1145:


I plan to commit shortly.




[jira] Commented: (LUCENE-1522) another highlighter

2009-03-18 Thread Marvin Humphrey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683064#action_12683064
 ] 

Marvin Humphrey commented on LUCENE-1522:
-

 OK, it sounds like one can simply use different models to score
 fragdocs and it's still an open debate how much each of these criteria
 (IDF, showing surround context, being on sentence boundary, diversity
 of terms) should impact the score. 

With Michael Busch's priority queue approach, the algorithm for choosing the
fragments can be abstracted into the class of object we put in the queue and
its lessThan() method.  The output from the queue just has to be something the
Highlighter can chew.
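For illustration, that pluggable-ordering idea could look roughly like the following self-contained sketch (FragCandidate and all other names here are invented, standing in for whatever object ends up in the queue; java.util.PriorityQueue with a Comparator plays the role of Lucene's PriorityQueue and its lessThan()):

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class FragmentQueueSketch {
    // Hypothetical fragment candidate: a score plus character offsets.
    static final class FragCandidate {
        final float score;
        final int start, end;
        FragCandidate(float score, int start, int end) {
            this.score = score; this.start = start; this.end = end;
        }
    }

    // Keep the topN best candidates. The Comparator plays the role of
    // lessThan(): swapping it swaps the whole fragment-selection policy
    // without touching the highlighter itself.
    static List<FragCandidate> topFragments(List<FragCandidate> candidates,
                                            int topN,
                                            Comparator<FragCandidate> lessThan) {
        PriorityQueue<FragCandidate> pq = new PriorityQueue<>(topN, lessThan);
        for (FragCandidate c : candidates) {
            pq.offer(c);
            if (pq.size() > topN) {
                pq.poll();              // evict the current "least" candidate
            }
        }
        List<FragCandidate> out = new java.util.ArrayList<>(pq);
        out.sort(lessThan.reversed()); // best first
        return out;
    }

    public static void main(String[] args) {
        Comparator<FragCandidate> byScore =
            Comparator.comparingDouble(c -> c.score);
        List<FragCandidate> best = topFragments(Arrays.asList(
            new FragCandidate(0.3f, 0, 10),
            new FragCandidate(0.9f, 20, 30),
            new FragCandidate(0.5f, 40, 50)), 2, byScore);
        System.out.println(best.get(0).score);  // highest score first
    }
}
```

A proximity-boosting or sentence-boundary policy would then just be a different Comparator.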




Re: GSoC 09 project ideas...

2009-03-18 Thread Jason Rutherglen
Hi Z.S.,

I'll update LUCENE-1313 after LUCENE-1516 is committed.  I can post the
basic new patch I have for LUCENE-1313 (heavily simplified compared to the
previous patches), however it will assume LUCENE-1516.  The other area that
will need to be addressed is standard benchmarking for different realtime
search approaches as we don't know what will be best yet.

What areas in regard to realtime search are you working on?

-J

On Wed, Mar 18, 2009 at 9:04 AM, Zaid Md. Abdul Wahab Sheikh 
sheikh.z...@gmail.com wrote:




Re: GSoC 09 project ideas...

2009-03-18 Thread Michael McCandless


I think creating a better Highlighter for Lucene, which is actively
being discussed:

https://issues.apache.org/jira/browse/LUCENE-1522

would make a good GSoC project, but I don't think I have time to mentor.

Realtime search is currently in progress already, being tracked/iterated
here:

https://issues.apache.org/jira/browse/LUCENE-1516

The original Ocean (LUCENE-1313) that you found was a more ambitious
approach, which after discussions here eventually led to the simpler
approach in LUCENE-1516.

Mike






[jira] Resolved: (LUCENE-1145) DisjunctionSumScorer small tweak

2009-03-18 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1145.


Resolution: Fixed

Thanks Eks and Paul!




[jira] Updated: (LUCENE-1472) DateTools.stringToDate() can cause lock contention under load

2009-03-18 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1472:
---

Fix Version/s: (was: 2.9)

Removing 2.9 target.

 DateTools.stringToDate() can cause lock contention under load
 -

 Key: LUCENE-1472
 URL: https://issues.apache.org/jira/browse/LUCENE-1472
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.3.2
Reporter: Mark Lassau
Priority: Minor

 Load testing our application (the JIRA Issue Tracker) has shown that threads 
 spend a lot of time blocked in DateTools.stringToDate().
 The stringToDate() method uses a singleton SimpleDateFormat object to parse 
 the dates.
 Each call to SimpleDateFormat.parse() is *synchronized* because 
 SimpleDateFormat is not thread safe.
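One common way around that lock is a per-thread formatter. This is only a sketch of the general technique, not the JIRA fix; the "yyyyMMdd" pattern is an example here, while DateTools actually uses its own resolution-dependent formats:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

public class ThreadLocalDateParser {
    // SimpleDateFormat is not thread safe, so a single shared instance must
    // synchronize parse() -- the contention described above. Giving each
    // thread its own instance removes the lock entirely.
    private static final ThreadLocal<SimpleDateFormat> FMT =
        ThreadLocal.withInitial(() -> new SimpleDateFormat("yyyyMMdd"));

    public static Date parse(String s) throws ParseException {
        return FMT.get().parse(s);
    }
}
```

Each thread pays the formatter's construction cost once, and parse() calls then proceed without contention.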




[jira] Commented: (LUCENE-1522) another highlighter

2009-03-18 Thread David Kaelbling (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683079#action_12683079
 ] 

David Kaelbling commented on LUCENE-1522:
-

Hi,

Our application wants to find and highlight all the hits in a document,
not just the best one(s).  If future highlighters still allowed this,
even if only by judicious use of subclasses, I would be happy :-)

Thanks,
David

-- 
David Kaelbling
Senior Software Engineer
Black Duck Software, Inc.

dkaelbl...@blackducksoftware.com
T +1.781.810.2041
F +1.781.891.5145

http://www.blackducksoftware.com







[jira] Updated: (LUCENE-1561) Maybe rename Field.omitTf, and strengthen the javadocs

2009-03-18 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1561:
---

Attachment: LUCENE-1561.patch

Attached patch.  I renamed to omitTermFreqAndPositions, and added a NOTE to 
the javadoc about positional queries silently not working when you use this 
option.  I plan to commit in a day or so.

 Maybe rename Field.omitTf, and strengthen the javadocs
 --

 Key: LUCENE-1561
 URL: https://issues.apache.org/jira/browse/LUCENE-1561
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.4.1
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: LUCENE-1561.patch


 Spinoff from here:
   
 http://www.nabble.com/search-problem-when-indexed-using-Field.setOmitTf()-td22456141.html
 Maybe rename omitTf to something like omitTermPositions, and make it clear 
 what queries will silently fail to work as a result.




[jira] Assigned: (LUCENE-1561) Maybe rename Field.omitTf, and strengthen the javadocs

2009-03-18 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-1561:
--

Assignee: Michael McCandless




[jira] Assigned: (LUCENE-1490) CJKTokenizer convert HALFWIDTH_AND_FULLWIDTH_FORMS wrong

2009-03-18 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-1490:
--

Assignee: Michael McCandless

 CJKTokenizer convert   HALFWIDTH_AND_FULLWIDTH_FORMS wrong
 --

 Key: LUCENE-1490
 URL: https://issues.apache.org/jira/browse/LUCENE-1490
 Project: Lucene - Java
  Issue Type: Bug
Reporter: Daniel Cheng
Assignee: Michael McCandless
 Fix For: 2.4, 2.9


 CJKTokenizer have these lines..
 if (ub == 
 Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS) {
 /** convert  HALFWIDTH_AND_FULLWIDTH_FORMS to BASIC_LATIN 
 */
 int i = (int) c;
 i = i - 65248;
 c = (char) i;
 }
 This is wrong. Some characters in the block (e.g. U+FF68) have no BASIC_LATIN 
 counterparts.
 Only 65281-65374 can be converted this way.
 The fix is
  if (ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS 
  && i <= 65374 && i >= 65281) {
 /** convert  HALFWIDTH_AND_FULLWIDTH_FORMS to BASIC_LATIN 
 */
 int i = (int) c;
 i = i - 65248;
 c = (char) i;
 }
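A self-contained version of the corrected logic (a sketch of the fix quoted above, with the bounds check done after computing the code point, and the range taken from the 65281-65374 note in the report):

```java
public class FullwidthToLatin {
    // Only fullwidth forms U+FF01..U+FF5E (65281..65374) have BASIC_LATIN
    // counterparts exactly 65248 code points below them; anything else in
    // the block (e.g. U+FF68) is returned unchanged.
    public static char normalize(char c) {
        int i = (int) c;
        if (Character.UnicodeBlock.of(c) ==
                Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS
                && i >= 65281 && i <= 65374) {
            return (char) (i - 65248);
        }
        return c;
    }

    public static void main(String[] args) {
        System.out.println(normalize('\uFF21'));  // fullwidth A prints as A
    }
}
```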




[jira] Resolved: (LUCENE-1490) CJKTokenizer convert HALFWIDTH_AND_FULLWIDTH_FORMS wrong

2009-03-18 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1490.


Resolution: Fixed

Thanks Daniel!




Re: Make TermScorer non final

2009-03-18 Thread Grant Ingersoll


On Mar 18, 2009, at 7:57 AM, Michael McCandless wrote:



Coming from the discussions in LUCENE-1522 (improving highlighter), I
think at some point we should merge Span*Query into their normal
counterparts, if possible.

Ie, there should be only one TermQuery that can do both what the
current TermQuery does, and also what SpanTermQuery does.  It's able
to enumerate the spans/payloads for a given document, and if you don't
request those, the performance should hopefully be equal to that of
the current TermQuery.

The highlighter would in fact request spans for a normal TermQuery,
on a single doc index at a time, in order to locate the hits.

Likewise for SpanOrQuery, SpanAndQuery.

I have no real sense of how much work this is, what problems would
ensue (eg possible difference in scoring, etc.), but from
highlighter's standpoint, ideally all queries need to be able to
enumerate the collection of positions that established the match.


Maybe they should all implement a common Interface that provides
highlighting info?  I don't know what it would be, but it seems easier
to do that than to merge them all, but I'm not sure.  Not that I
wouldn't want to see a simpler query system.  There are some cool
things you can do w/ spans, but they still have some fundamental flaws
that make them annoying.  Namely, often times one of the reasons you
want Spans is b/c you care about what is going on around the match,
i.e. co-occurrence data, yet it is still annoying/difficult to get
that information w/o pivoting around either term vectors or
re-analyzing the document.  With the new Attribute stuff, however, it
might be getting a little easier, as one could now store offset
information at the term level (which you can do w/ payloads, too) and
then use that to index into the original String.
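One rough shape of that "common Interface" idea, sketched below. Every name here is invented for illustration; this is not a proposal of an actual Lucene API, just what "any query type reports where it matched" might look like:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class HighlightInfoSketch {
    // Any query type that can report where it matched -- span-based or not --
    // would expose its match positions to a highlighter through one interface.
    interface MatchPositionSource {
        /** {startPosition, endPosition} pairs that established the match in doc. */
        List<int[]> matchPositions(int doc);
    }

    // Trivial stand-in implementation, e.g. what a single-term match
    // might return (each position is its own one-term "span").
    static class SingleTermMatches implements MatchPositionSource {
        private final int[] positions;
        SingleTermMatches(int... positions) { this.positions = positions; }
        public List<int[]> matchPositions(int doc) {
            return Arrays.stream(positions)
                         .mapToObj(p -> new int[]{p, p})
                         .collect(Collectors.toList());
        }
    }
}
```

A highlighter could then consume MatchPositionSource uniformly, regardless of which query produced it.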





[jira] Updated: (LUCENE-1526) Tombstone deletions in IndexReader

2009-03-18 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1526:
---

Fix Version/s: (was: 2.9)

I don't think we should block 2.9 for this.

 Tombstone deletions in IndexReader
 --

 Key: LUCENE-1526
 URL: https://issues.apache.org/jira/browse/LUCENE-1526
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.4
Reporter: Jason Rutherglen
Priority: Minor
   Original Estimate: 168h
  Remaining Estimate: 168h

 SegmentReader currently uses a BitVector to represent deleted docs.
 When performing rapid clone (see LUCENE-1314) and delete operations,
 performing a copy on write of the BitVector can become costly because
 the entire underlying byte array must be created and copied. A way to
 make this clone delete process faster is to implement tombstones, a
 term coined by Marvin Humphrey. Tombstones represent new deletions
 plus the incremental deletions from previously reopened readers in
 the current reader. 
 The proposed implementation of tombstones is to accumulate deletions
 into an int array represented as a DocIdSet. With LUCENE-1476,
 SegmentTermDocs iterates over deleted docs using a DocIdSet rather
 than accessing the BitVector by calling get. This allows a BitVector
 and a set of tombstones to be ANDed together as the current reader's
 delete docs. 
 A tombstone merge policy needs to be defined to determine when to
 merge tombstone DocIdSets into a new deleted docs BitVector as too
 many tombstones would eventually be detrimental to performance. A
 probable implementation will merge tombstones based on the number of
 tombstones and the total number of documents in the tombstones. The
 merge policy may be set in the clone/reopen methods or on the
 IndexReader. 
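The core of the idea above can be sketched in a few lines, with java.util.BitSet standing in for Lucene's BitVector (names and structure here are illustrative only, not the proposed patch):

```java
import java.util.Arrays;
import java.util.BitSet;

public class TombstoneDeletes {
    // The base deletions stay in the (expensive to copy) bit array; each
    // incremental delete generation is a small sorted int[] tombstone.
    // Cloning a reader then only copies the cheap tombstone list, and a doc
    // counts as deleted if it is in the base set or in any tombstone.
    private final BitSet base;
    private final int[][] tombstones;

    public TombstoneDeletes(BitSet base, int[][] tombstones) {
        this.base = base;
        this.tombstones = tombstones;
    }

    public boolean isDeleted(int doc) {
        if (base.get(doc)) {
            return true;
        }
        for (int[] ts : tombstones) {
            if (Arrays.binarySearch(ts, doc) >= 0) {  // each tombstone is sorted
                return true;
            }
        }
        return false;
    }
}
```

The merge policy mentioned above would then decide when the tombstone list has grown large enough to fold back into a fresh bit array.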




[jira] Updated: (LUCENE-1533) Deleted documents as a Filter or top level Query

2009-03-18 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1533:
---

Fix Version/s: (was: 2.9)

Clearing fix version.

 Deleted documents as a Filter or top level Query
 

 Key: LUCENE-1533
 URL: https://issues.apache.org/jira/browse/LUCENE-1533
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.4
Reporter: Jason Rutherglen
Priority: Minor
   Original Estimate: 504h
  Remaining Estimate: 504h

 In exploring alternative and perhaps faster ways to implement the
 deleted documents functionality, the idea of filtering the deleted
 documents at a higher level came up. This system would save on
 checking the deleted docs BitVector of each doc read from the posting
 list by SegmentTermDocs. This is equivalent to an AND NOT deleted
 docs query.
 If the patch improves the speed of indexes with delete documents,
 many core unit tests will need to change, or alternatively the
 functionality provided by this patch can be an IndexReader option.
 I'm thinking the first implementation will be a Filter in
 IndexSearcher. 
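 The "AND NOT deleted docs" formulation boils down to a merge of two sorted doc-id streams. A minimal sketch (not the proposed Filter implementation; the names are invented):

```java
import java.util.Arrays;

// Hypothetical sketch: filter a posting list against a sorted list of deleted
// doc ids with a leap-frog merge ("postings AND NOT deleted"), instead of
// calling BitVector.get() for every doc inside the posting-list loop.
class AndNotSketch {
    static int[] andNot(int[] postings, int[] deletedSorted) {
        int[] out = new int[postings.length];
        int n = 0, j = 0;
        for (int doc : postings) {
            // advance the deleted cursor until it is at or past this doc
            while (j < deletedSorted.length && deletedSorted[j] < doc) j++;
            if (j == deletedSorted.length || deletedSorted[j] != doc) out[n++] = doc;
        }
        return Arrays.copyOf(out, n);
    }
}
```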




Re: GSoC 09 project ideas...

2009-03-18 Thread Grant Ingersoll


On Mar 18, 2009, at 12:04 PM, Zaid Md. Abdul Wahab Sheikh wrote:


Hi lucene,
In this link http://wiki.apache.org/general/SummerOfCode2009, there
are no project ideas for Lucene proper (only ideas for Mahout are listed).


This requires someone (has to be a committer) willing to mentor.  I'd  
love to see a Lucene GSOC project, but I'm already mentoring on Mahout  
and don't have time for more than one.


Please put up some ideas for Lucene there or please mention some  
popular open issues that might be suitable as a GSoC project.


As for ideas, what the others said would be good; I'd also add:
design/implement the query side of the new TokenStream Attribute stuff
so that we are closer to flexible indexing.


New/updated demo would be great, one that shows off more of Lucene.

-Grant

[jira] Assigned: (LUCENE-652) Compressed fields should be externalized (from Fields into Document)

2009-03-18 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-652:
-

Assignee: Michael McCandless

 Compressed fields should be externalized (from Fields into Document)
 --

 Key: LUCENE-652
 URL: https://issues.apache.org/jira/browse/LUCENE-652
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 1.9, 2.0.0, 2.1
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9


 Right now, as of 2.0 release, Lucene supports compressed stored fields.  
 However, after discussion on java-dev, the suggestion arose, from Robert 
 Engels, that it would be better if this logic were moved into the Document 
 level.  This way the indexing level just stores opaque binary fields, and 
 then Document handles compress/uncompressing as needed.
 This approach would have prevented issues like LUCENE-629 because merging of 
 segments would never need to decompress.
 See this thread for the recent discussion:
 http://www.gossamer-threads.com/lists/lucene/java-dev/38836
 When we do this we should also work on related issue LUCENE-648.




move TrieRange* to core?

2009-03-18 Thread Michael McCandless

I think we should move TrieRange* into core before 2.9?

It's received a lot of attention, from both developers (Uwe & Yonik did
lots of iterations, and Solr is folding it in) and user interest.

It's a simpler & more scalable way to index numeric fields that you
intend to sort and/or do range querying on; we can do away with tricky
number padding.
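For readers who have not met the padding trick: term comparison is lexicographic, so numeric fields must be encoded at a fixed width before classic range queries order them correctly. A sketch (the helper name is invented):

```java
// Sketch of the zero-padding workaround TrieRange replaces: without a fixed
// width, "10" sorts before "2" in term (string) order.
class PaddingSketch {
    static String pad(long value, int width) {
        if (value < 0) throw new IllegalArgumentException("negative values need extra encoding");
        return String.format("%0" + width + "d", value); // e.g. 42 -> "0000000042"
    }
}
```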

Plus it's just plain cool :)

I also think we should change its name.  I know and love trie, but
it's a very technical term that's not immediately meaningful to users
of Lucene's API.  Plus I've learned from doing too many renamings
lately that it's best to try to get the name right at the start.

Maybe just NumberUtils, IntRangeFilter, LongRangeFilter,
AbstractNumberRangeFilter?

Thoughts?

Mike




[jira] Updated: (LUCENE-652) Compressed fields should be externalized (from Fields into Document)

2009-03-18 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-652:
--

Attachment: LUCENE-652.patch

I added o.a.l.document.CompressionTools, with static methods to
compress & decompress, and deprecated Field.Store.COMPRESS.

I also found two separate bugs:

  * With Field.Store.COMPRESS we were running compression twice
(unnecessarily); I've fixed that.

  * If you try to make a Field(byte[], int offset, int length,
Store.COMPRESS), you'll hit an AIOOBE.  I think we don't need to
fix this one since it's in now-deprecated code, and with 2.9,
users can migrate to CompressionTools.

I plan to commit in a day or two.
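The Document-level idea can be sketched with plain java.util.zip (an illustration of the concept only; CompressionTools' actual method names and signatures may differ):

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical sketch: compression handled at the Document level, while the
// index itself only ever sees opaque binary field values.
class CompressSketch {
    static byte[] compress(byte[] value) {
        Deflater deflater = new Deflater();
        deflater.setInput(value);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) out.write(buf, 0, deflater.deflate(buf));
        deflater.end();
        return out.toByteArray();
    }

    static byte[] decompress(byte[] value) {
        Inflater inflater = new Inflater();
        inflater.setInput(value);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        try {
            while (!inflater.finished()) out.write(buf, 0, inflater.inflate(buf));
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("field value is not valid deflate data", e);
        }
        inflater.end();
        return out.toByteArray();
    }
}
```

Because segment merging then just copies stored bytes, nothing needs to decompress during a merge, which is the point of the issue.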


 Compressed fields should be externalized (from Fields into Document)
 --

 Key: LUCENE-652
 URL: https://issues.apache.org/jira/browse/LUCENE-652
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 1.9, 2.0.0, 2.1
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-652.patch


 Right now, as of 2.0 release, Lucene supports compressed stored fields.  
 However, after discussion on java-dev, the suggestion arose, from Robert 
 Engels, that it would be better if this logic were moved into the Document 
 level.  This way the indexing level just stores opaque binary fields, and 
 then Document handles compress/uncompressing as needed.
 This approach would have prevented issues like LUCENE-629 because merging of 
 segments would never need to decompress.
 See this thread for the recent discussion:
 http://www.gossamer-threads.com/lists/lucene/java-dev/38836
 When we do this we should also work on related issue LUCENE-648.




Re: move TrieRange* to core?

2009-03-18 Thread Andi Vajda


On Mar 18, 2009, at 13:01, Michael McCandless  
luc...@mikemccandless.com wrote:



I think we should move TrieRange* into core before 2.9?

It's received a lot of attention, from both developers (Uwe & Yonik did
lots of iterations, and Solr is folding it in) and user interest.

It's a simpler & more scalable way to index numeric fields that you
intend to sort and/or do range querying on; we can do away with tricky
number padding.

Plus it's just plain cool :)

I also think we should change its name.  I know and love trie, but
it's a very technical term that's not immediately meaningful to users
of Lucene's API.  Plus I've learned from doing too many renamings
lately that it's best to try to get the name right at the start.

Maybe just NumberUtils, IntRangeFilter, LongRangeFilter,
AbstractNumberRangeFilter?


+1

How about NumericRangeFilter ?

Andi..




Thoughts?

Mike




[jira] Commented: (LUCENE-1496) Move solr NumberUtils to lucene

2009-03-18 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683149#action_12683149
 ] 

Michael McCandless commented on LUCENE-1496:


If we move trie/* into core, what do we need/want to fold in from Solr's 
NumberUtils?

 Move solr NumberUtils to lucene
 ---

 Key: LUCENE-1496
 URL: https://issues.apache.org/jira/browse/LUCENE-1496
 Project: Lucene - Java
  Issue Type: Task
Reporter: Ryan McKinley
Priority: Trivial
 Fix For: 2.9


 solr includes a NumberUtils class with some general utilities for dealing 
 with tokens and numbers.
 This should be in lucene rather then solr.




Re: move TrieRange* to core?

2009-03-18 Thread Earwin Burrfoot
On Wed, Mar 18, 2009 at 23:08, Andi Vajda va...@osafoundation.org wrote:

 On Mar 18, 2009, at 13:01, Michael McCandless luc...@mikemccandless.com
 wrote:

 I think we should move TrieRange* into core before 2.9?

 It's received a lot of attention, from both developers (Uwe & Yonik did
 lots of iterations, and Solr is folding it in) and user interest.

 It's a simpler & more scalable way to index numeric fields that you
 intend to sort and/or do range querying on; we can do away with tricky
 number padding.

 Plus it's just plain cool :)

 I also think we should change its name.  I know and love trie, but
 it's a very technical term that's not immediately meaningful to users
 of Lucene's API.  Plus I've learned from doing too many renamings
 lately that it's best to try to get the name right at the start.

 Maybe just NumberUtils, IntRangeFilter, LongRangeFilter,
 AbstractNumberRangeFilter?

 +1

 How about NumericRangeFilter ?
The idea behind this filter can be applied to more than just numbers,
so I'd like to put the stress on its speed or the idea used -
FastRangeQuery, TrieRangeQuery, SegmentedRangeQuery (from the fact that it
splits the input range into variable-precision segments), or PrefixRangeQuery
(you can reword the algorithm in terms of prefixes).

-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785




[jira] Assigned: (LUCENE-1435) CollationKeyFilter: convert tokens into CollationKeys encoded using IndexableBinaryStringTools

2009-03-18 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-1435:
--

Assignee: Michael McCandless

 CollationKeyFilter: convert tokens into CollationKeys encoded using 
 IndexableBinaryStringTools
 --

 Key: LUCENE-1435
 URL: https://issues.apache.org/jira/browse/LUCENE-1435
 Project: Lucene - Java
  Issue Type: New Feature
Affects Versions: 2.4
Reporter: Steven Rowe
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1435.patch, LUCENE-1435.patch, LUCENE-1435.patch


 Converts each token into its CollationKey using the provided collator, and 
 then encodes the CollationKey with IndexableBinaryStringTools, to allow it to 
 be stored as an index term.
 This will allow for efficient range searches and Sorts over fields that need 
 collation for proper ordering.
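 The motivation is visible in a few lines of plain JDK code (a sketch; the byte-comparison helper is invented, and the IndexableBinaryStringTools encoding step is omitted): a CollationKey's byte form sorts the way the locale's Collator would, even where raw String order disagrees.

```java
import java.text.Collator;
import java.util.Locale;

// Sketch: comparing collation-key bytes (as unsigned) reproduces the
// Collator's ordering, e.g. "Banana" sorts before "apple" by code point
// but after it to an English collator.
class CollationSketch {
    static int compareAsUnsignedBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    static int compareByKey(Collator c, String s1, String s2) {
        byte[] k1 = c.getCollationKey(s1).toByteArray();
        byte[] k2 = c.getCollationKey(s2).toByteArray();
        return compareAsUnsignedBytes(k1, k2);
    }
}
```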




[jira] Commented: (LUCENE-1435) CollationKeyFilter: convert tokens into CollationKeys encoded using IndexableBinaryStringTools

2009-03-18 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683155#action_12683155
 ] 

Michael McCandless commented on LUCENE-1435:


I think we should commit this to contrib/collation as an external way to get 
faster range filters on fields that require a custom Collator; at some future 
point we can consider allowing a given field to sort its terms in some custom 
way.

Marvin: does KS/Lucy give control over sort order of the terms in a field?

 CollationKeyFilter: convert tokens into CollationKeys encoded using 
 IndexableBinaryStringTools
 --

 Key: LUCENE-1435
 URL: https://issues.apache.org/jira/browse/LUCENE-1435
 Project: Lucene - Java
  Issue Type: New Feature
Affects Versions: 2.4
Reporter: Steven Rowe
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1435.patch, LUCENE-1435.patch, LUCENE-1435.patch


 Converts each token into its CollationKey using the provided collator, and 
 then encodes the CollationKey with IndexableBinaryStringTools, to allow it to 
 be stored as an index term.
 This will allow for efficient range searches and Sorts over fields that need 
 collation for proper ordering.




[jira] Assigned: (LUCENE-1434) IndexableBinaryStringTools: convert arbitrary byte sequences into Strings that can be used as index terms, and vice versa

2009-03-18 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-1434:
--

Assignee: Michael McCandless

 IndexableBinaryStringTools: convert arbitrary byte sequences into Strings 
 that can be used as index terms, and vice versa
 -

 Key: LUCENE-1434
 URL: https://issues.apache.org/jira/browse/LUCENE-1434
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Other
Affects Versions: 2.4
Reporter: Steven Rowe
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1434.patch


 Provides support for converting byte sequences to Strings that can be used as 
 index terms, and back again. The resulting Strings preserve the original byte 
 sequences' sort order (assuming the bytes are interpreted as unsigned).
 The Strings are constructed using a Base 8000h encoding of the original 
 binary data - each char of an encoded String represents a 15-bit chunk from 
 the byte sequence.  Base 8000h was chosen because it allows for all lower 15 
 bits of char to be used without restriction; the surrogate range 
 [U+D800-U+DFFF] does not represent valid chars, and would require complicated 
 handling to avoid them and allow use of char's high bit.
 This class is intended to serve as a mechanism to allow CollationKeys to 
 serve as index terms.
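 The 15-bit packing can be illustrated with a simplified sketch (not the exact encoding IndexableBinaryStringTools uses; the class name is invented):

```java
// Simplified sketch of a "Base 8000h" encoding: stream the bytes big-endian
// and emit one char per 15-bit chunk, so every output char stays in
// [U+0000, U+7FFF], safely below the surrogate range.
class Base8000Sketch {
    static String encode(byte[] data) {
        StringBuilder out = new StringBuilder();
        int buffer = 0, bits = 0;
        for (byte b : data) {
            buffer = (buffer << 8) | (b & 0xFF);
            bits += 8;
            if (bits >= 15) {
                bits -= 15;
                out.append((char) ((buffer >>> bits) & 0x7FFF));
            }
        }
        if (bits > 0)  // flush leftover bits, left-aligned in the final chunk
            out.append((char) ((buffer << (15 - bits)) & 0x7FFF));
        return out.toString();
    }
}
```

 Big-endian packing keeps String comparison consistent with unsigned byte comparison for equal-length inputs; the real class additionally makes the encoding reversible.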




RE: move TrieRange* to core?

2009-03-18 Thread Uwe Schindler
I have no problem with it! Thanks!

What I would like to be fixed before moving it to core is the fact that an
additional helper field is needed for the trie values. If everything could
be in one field and the field is still sortable, it would be fine. For that,
the order of terms in the FieldCache should be fixed. As current trie fields
of highest precision sort before all other lower precision fields, the
simplest fix would be to only index the first term from the TermEnum at
the document's index in the FieldCache.

Another way would be to just invert the order and let the higher precision
fields appear last in the TermEnum. Both would be possible, but there
should be a clear statement about which term for multi-term fields is put into
the FieldCache (maybe configurable). See LUCENE-1372 for that.

If all terms could be in one field, the API to TrieRange could be simpler
and easier on the GC. The trieCodeLong/Int() method would just
return a TokenStream that can be indexed using new
Field(Name, TokenStream), more effectively reusing the Token's char buffer
during trie encoding. This is how it is done by Solr at
the moment (but with the additional allocation of the array) - I do not like
the array allocations for each term and the whole trie encoding at the
moment (1x char[], 1x String[], additional copying, ...).

I would be happy to have it in core, I could prepare the patch, when the
above is fixed!

As for names: NumberUtils, IntRangeFilter, and LongRangeFilter are fine;
AbstractNumberRangeFilter is internal only (just to have less code
duplication, like StringBuffer and StringBuilder in the JDK, both coming from an
internal superclass invisible to the outside).

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

 -Original Message-
 From: Michael McCandless [mailto:luc...@mikemccandless.com]
 Sent: Wednesday, March 18, 2009 9:02 PM
 To: java-dev@lucene.apache.org
 Subject: move TrieRange* to core?
 
 I think we should move TrieRange* into core before 2.9?
 
  It's received a lot of attention, from both developers (Uwe & Yonik did
  lots of iterations, and Solr is folding it in) and user interest.

  It's a simpler & more scalable way to index numeric fields that you
  intend to sort and/or do range querying on; we can do away with tricky
  number padding.
 
 Plus it's just plain cool :)
 
 I also think we should change its name.  I know and love trie, but
 it's a very technical term that's not immediately meaningful to users
 of Lucene's API.  Plus I've learned from doing too many renamings
 lately that it's best to try to get the name right at the start.
 
 Maybe just NumberUtils, IntRangeFilter, LongRangeFilter,
 AbstractNumberRangeFilter?
 
 Thoughts?
 
 Mike
 



[jira] Commented: (LUCENE-1435) CollationKeyFilter: convert tokens into CollationKeys encoded using IndexableBinaryStringTools

2009-03-18 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683167#action_12683167
 ] 

Michael McCandless commented on LUCENE-1435:


Steven, I'm hitting compilation errors, e.g.:

{code}
[javac] 
/tango/mike/src/lucene.collation/contrib/collation/src/test/org/apache/lucene/collation/CollationTestBase.java:42:
 package org.apache.lucene.queryParser.analyzing does not exist
[javac] import org.apache.lucene.queryParser.analyzing.AnalyzingQueryParser;
[javac]   ^
[javac] 
/tango/mike/src/lucene.collation/contrib/collation/src/test/org/apache/lucene/collation/CollationTestBase.java:89:
 cannot find symbol
{code}

What is AnalyzingQueryParser?

 CollationKeyFilter: convert tokens into CollationKeys encoded using 
 IndexableBinaryStringTools
 --

 Key: LUCENE-1435
 URL: https://issues.apache.org/jira/browse/LUCENE-1435
 Project: Lucene - Java
  Issue Type: New Feature
Affects Versions: 2.4
Reporter: Steven Rowe
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1435.patch, LUCENE-1435.patch, LUCENE-1435.patch


 Converts each token into its CollationKey using the provided collator, and 
 then encodes the CollationKey with IndexableBinaryStringTools, to allow it to 
 be stored as an index term.
 This will allow for efficient range searches and Sorts over fields that need 
 collation for proper ordering.




[jira] Commented: (LUCENE-1434) IndexableBinaryStringTools: convert arbitrary byte sequences into Strings that can be used as index terms, and vice versa

2009-03-18 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683171#action_12683171
 ] 

Michael McCandless commented on LUCENE-1434:


This looks good.  I plan to commit shortly!

 IndexableBinaryStringTools: convert arbitrary byte sequences into Strings 
 that can be used as index terms, and vice versa
 -

 Key: LUCENE-1434
 URL: https://issues.apache.org/jira/browse/LUCENE-1434
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Other
Affects Versions: 2.4
Reporter: Steven Rowe
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1434.patch


 Provides support for converting byte sequences to Strings that can be used as 
 index terms, and back again. The resulting Strings preserve the original byte 
 sequences' sort order (assuming the bytes are interpreted as unsigned).
 The Strings are constructed using a Base 8000h encoding of the original 
 binary data - each char of an encoded String represents a 15-bit chunk from 
 the byte sequence.  Base 8000h was chosen because it allows for all lower 15 
 bits of char to be used without restriction; the surrogate range 
 [U+D800-U+DFFF] does not represent valid chars, and would require complicated 
 handling to avoid them and allow use of char's high bit.
 This class is intended to serve as a mechanism to allow CollationKeys to 
 serve as index terms.




[jira] Commented: (LUCENE-1435) CollationKeyFilter: convert tokens into CollationKeys encoded using IndexableBinaryStringTools

2009-03-18 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683174#action_12683174
 ] 

Steven Rowe commented on LUCENE-1435:
-

It's in contrib/miscellaneous/

I used AnalyzingQueryParser in the tests to allow CollationKeyFilter to be 
applied to the terms in the range query - the standard QueryParser doesn't 
analyze range terms.

From:

http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/queryParser/analyzing/AnalyzingQueryParser.html

bq. Overrides Lucene's default QueryParser so that Fuzzy-, Prefix-, Range-, and 
WildcardQuerys are also passed through the given analyzer, but wild card 
characters (like *) don't get removed from the search terms. 

This is a (test-only) cross-contrib dependency.  I'm not sure why I didn't have 
trouble with compilation - I haven't looked at this in months.  I'll take a 
look later on tonight.

 CollationKeyFilter: convert tokens into CollationKeys encoded using 
 IndexableBinaryStringTools
 --

 Key: LUCENE-1435
 URL: https://issues.apache.org/jira/browse/LUCENE-1435
 Project: Lucene - Java
  Issue Type: New Feature
Affects Versions: 2.4
Reporter: Steven Rowe
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1435.patch, LUCENE-1435.patch, LUCENE-1435.patch


 Converts each token into its CollationKey using the provided collator, and 
 then encodes the CollationKey with IndexableBinaryStringTools, to allow it to 
 be stored as an index term.
 This will allow for efficient range searches and Sorts over fields that need 
 collation for proper ordering.




File Formats Correction

2009-03-18 Thread Mark Miller

Just a note so I don't forget:

The file formats page says there are 4 files used for term vectors, but
there are only 3 that I can see: tvx, tvd, tvf.


http://lucene.apache.org/java/2_4_1/fileformats.html

--
- Mark

http://www.lucidimagination.com







[jira] Resolved: (LUCENE-1434) IndexableBinaryStringTools: convert arbitrary byte sequences into Strings that can be used as index terms, and vice versa

2009-03-18 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1434.


Resolution: Fixed

Thanks Steven!

 IndexableBinaryStringTools: convert arbitrary byte sequences into Strings 
 that can be used as index terms, and vice versa
 -

 Key: LUCENE-1434
 URL: https://issues.apache.org/jira/browse/LUCENE-1434
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Other
Affects Versions: 2.4
Reporter: Steven Rowe
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1434.patch


 Provides support for converting byte sequences to Strings that can be used as 
 index terms, and back again. The resulting Strings preserve the original byte 
 sequences' sort order (assuming the bytes are interpreted as unsigned).
 The Strings are constructed using a Base 8000h encoding of the original 
 binary data - each char of an encoded String represents a 15-bit chunk from 
 the byte sequence.  Base 8000h was chosen because it allows for all lower 15 
 bits of char to be used without restriction; the surrogate range 
 [U+D800-U+DFFF] does not represent valid chars, and would require complicated 
 handling to avoid them and allow use of char's high bit.
 This class is intended to serve as a mechanism to allow CollationKeys to 
 serve as index terms.




[jira] Commented: (LUCENE-1435) CollationKeyFilter: convert tokens into CollationKeys encoded using IndexableBinaryStringTools

2009-03-18 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683182#action_12683182
 ] 

Michael McCandless commented on LUCENE-1435:


OK, thanks for the pointer -- I learn something new every day!

 CollationKeyFilter: convert tokens into CollationKeys encoded using 
 IndexableBinaryStringTools
 --

 Key: LUCENE-1435
 URL: https://issues.apache.org/jira/browse/LUCENE-1435
 Project: Lucene - Java
  Issue Type: New Feature
Affects Versions: 2.4
Reporter: Steven Rowe
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1435.patch, LUCENE-1435.patch, LUCENE-1435.patch


 Converts each token into its CollationKey using the provided collator, and 
 then encodes the CollationKey with IndexableBinaryStringTools, to allow it to 
 be stored as an index term.
 This will allow for efficient range searches and Sorts over fields that need 
 collation for proper ordering.




Re: File Formats Correction

2009-03-18 Thread Michael McCandless


Indeed!  I'll fix on trunk.

Mike

Mark Miller wrote:


Just a note so I don't forget:

The file formats page says there are 4 files used for term vectors,
but there are only 3 that I can see: tvx, tvd, tvf.


http://lucene.apache.org/java/2_4_1/fileformats.html

--
- Mark

http://www.lucidimagination.com







RE: move TrieRange* to core?

2009-03-18 Thread Uwe Schindler
  I think we should move TrieRange* into core before 2.9?
 
  It's received a lot of attention, from both developers (Uwe & Yonik did
  lots of iterations, and Solr is folding it in) and user interest.

  It's a simpler & more scalable way to index numeric fields that you
  intend to sort and/or do range querying on; we can do away with tricky
  number padding.
 
  Plus it's just plain cool :)
 
  I also think we should change its name.  I know and love trie, but
  it's a very technical term that's not immediately meaningful to users
  of Lucene's API.  Plus I've learned from doing too many renamings
  lately that it's best to try to get the name right at the start.
 
  Maybe just NumberUtils, IntRangeFilter, LongRangeFilter,
  AbstractNumberRangeFilter?
 
  +1
 
  How about NumericRangeFilter ?
 The idea behind this filter can be applied to more than just numbers,
 so I'd like to put the stress on its speed or the idea used -
 FastRangeQuery, TrieRangeQuery, SegmentedRangeQuery (from the fact that it
 splits the input range into variable-precision segments), or PrefixRangeQuery
 (you can reword the algorithm in terms of prefixes).

A trie is also known as a prefix tree; because of that and the usage, I called it 
TrieRange [see http://en.wikipedia.org/wiki/Trie: the original term trie 
comes from retrieval. Following the etymology, the inventor, Edward Fredkin, 
pronounces it [tɹi] (tree). However, it is pronounced [tɹaɪ] (try) by other 
authors].

So we have two possibilities:

- a generic name completely hiding the internals -- but then the complexity 
of the helper field should be hidden too; how should precisionStep be called and 
justified then?
- a name describing how it works, like Earwin suggested - so we could stay with 
TrieRange.

The name TrieRangeQuery first appeared in [1], so it should be noted 
somewhere, even if it is renamed to NumberRangeFilter or something else... :-) 
I would be happy with a renaming to NumberRangeFilter, but trie should 
appear somewhere in the docs.

Uwe

[1] Schindler, U., Diepenbroek, M., 2008. Generic XML-based Framework for 
Metadata Portals. Computers & Geosciences 34 (12), 1947-1955. 
doi:10.1016/j.cageo.2008.02.023





Re: move TrieRange* to core?

2009-03-18 Thread Michael McCandless


Uwe Schindler wrote:

I would be happy with a renaming to NumberRangeFilter, but trie  
should appear somewhere in the docs.


I like this approach (and referencing the original paper); I think
it's important the javadocs give enough detail about how it works so
that one can understand the big picture and what precisionStep does,
but I think the name should more reflect how it's used rather than how
it's implemented.

I realize TrieRangeFilter can be used for anything that can be
accurately represented as an int or long in Java (e.g. Date), but I would
expect numeric sorting/range-filtering to be the vast majority of cases.

Naming is the hardest part!

Mike




Re: move TrieRange* to core?

2009-03-18 Thread Michael McCandless


Uwe Schindler wrote:


I have no problem with it! Thanks!

What I would like to be fixed before moving it to core is the fact that an 
additional helper field is needed for the trie values. If everything could 
be in one field and the field is still sortable, it would be fine. For that, 
the order of terms in the FieldCache should be fixed. As the current trie 
fields of highest precision order before all other lower-precision fields, 
the simplest fix would be to only index the first term from the TermEnum at 
the document's index in the FieldCache.

Another way would be to just invert the order and let the higher-precision 
fields appear last in the TermEnum. Both would be possible, but there 
should be a clear statement of which term for multi-term fields is put into 
the FieldCache (maybe configurable). See LUCENE-1372 for that.


Though, won't this make loading the field cache more costly since
you'll iterate through many more terms?

If all terms could be in one field, the API to TrieRange could be simpler 
and easier on the GC. The trieCodeLong/Int() method would just return a 
TokenStream that can be indexed using new Field(Name, TokenStream), more 
effectively reusing the Token's char buffer during trie encoding. This is 
how it is done by Solr at the moment (but with the additional allocation of 
the array) - I do not like the array allocations for each term and the whole 
trie encoding at the moment (1x char[], 1x String[], additional copying, ...).


I agree it'd be awesome to have a less GC costly translation
during indexing.

I would be happy to have it in core; I could prepare the patch when the 
above is fixed!


OK.

Mike




Re: move TrieRange* to core?

2009-03-18 Thread Michael McCandless
Michael McCandless luc...@mikemccandless.com wrote:

 Though, won't this make loading the field cache more costly since
 you'll iterate through many more terms?

Or... do the full precision fields always order above all lower
precision fields across all docs?

If so... maybe we could extend FieldCache's parser to allow it to
stop-early?  Ie it'd get the TermEnum, iterate through all the full
precision terms first, asking your parser to convert to long/int, and
then when your parser sees the very first not-full-precision term, it
tells FieldCache to stop.

Would that work?

Mike




RE: move TrieRange* to core?

2009-03-18 Thread Uwe Schindler
  Though, won't this make loading the field cache more costly since
  you'll iterate through many more terms?
 
 Or... do the full precision fields always order above all lower
 precision fields across all docs?

The highest precision terms have a shift value of 0. As the first char of
the encoded value is the shift, the terms are ordered by shift value first,
and so the highest precision comes first (because 0 is the smallest
shift).
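A tiny illustration of that ordering (a toy encoding, not the real TrieUtils format):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Shows why a leading shift character groups all full-precision (shift 0)
// terms ahead of every lower-precision term in the sorted TermEnum.
public class ShiftOrderingSketch {

    static String term(int shift, String payload) {
        return (char) ('0' + shift) + payload; // first char encodes the shift
    }

    static List<String> termEnumOrder(List<String> terms) {
        List<String> sorted = new ArrayList<>(terms);
        Collections.sort(sorted); // TermEnum order is lexicographic
        return sorted;
    }
}
```

Sorting any mix of encoded terms puts every shift-0 term before the first shift-4 term, which is exactly the property the stop-early idea below the quote relies on.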

 If so... maybe we could extend FieldCache's parser to allow it to
 stop-early?  Ie it'd get the TermEnum, iterate through all the full
 precision terms first, asking your parser to convert to long/int, and
 then when your parser sees the very first not-full-precision term, it
 tells FieldCache to stop.
 
 Would that work?

Yes, good idea! In this case it is really better that the higher-precision
terms come first. The question is how to implement that / extend the current
API.
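One possible shape of such a stop-early parser, sketched with hypothetical names; StopEarlyException, this IntParser, and the fill loop are stand-ins, not the existing FieldCache API:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stop-early parser for trie terms whose first char encodes
// the shift ('0' = full precision).
public class StopEarlyParserSketch {

    static class StopEarlyException extends RuntimeException {}

    interface IntParser {
        int parseInt(String term); // may throw StopEarlyException to end the scan
    }

    static final IntParser TRIE_PARSER = term -> {
        if (term.charAt(0) != '0') {
            throw new StopEarlyException(); // precision dropped: stop filling
        }
        return Integer.parseInt(term.substring(1), 16);
    };

    // Mimics the FieldCache fill loop: parse terms in TermEnum order until
    // the parser signals that only lower-precision terms remain.
    static List<Integer> fill(List<String> sortedTerms, IntParser parser) {
        List<Integer> values = new ArrayList<>();
        for (String t : sortedTerms) {
            try {
                values.add(parser.parseInt(t));
            } catch (StopEarlyException e) {
                break;
            }
        }
        return values;
    }
}
```

Because the full-precision terms sort first, the loop never touches the lower-precision terms at all, so loading the cache costs no more than for a plain numeric field.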





[jira] Commented: (LUCENE-1490) CJKTokenizer convert HALFWIDTH_AND_FULLWIDTH_FORMS wrong

2009-03-18 Thread Daniel Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683240#action_12683240
 ] 

Daniel Cheng commented on LUCENE-1490:
--

This was discovered by Chan 
http://www.cnblogs.com/jjstar/archive/2006/12/20/598016.html

 CJKTokenizer convert   HALFWIDTH_AND_FULLWIDTH_FORMS wrong
 --

 Key: LUCENE-1490
 URL: https://issues.apache.org/jira/browse/LUCENE-1490
 Project: Lucene - Java
  Issue Type: Bug
Reporter: Daniel Cheng
Assignee: Michael McCandless
 Fix For: 2.4, 2.9


 CJKTokenizer has these lines:

     if (ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS) {
         /** convert HALFWIDTH_AND_FULLWIDTH_FORMS to BASIC_LATIN */
         int i = (int) c;
         i = i - 65248;
         c = (char) i;
     }

 This is wrong. Some characters in the block (e.g. U+FF68) have no BASIC_LATIN
 counterparts.
 Only 65281-65374 can be converted this way.
 The fix is:

     int i = (int) c;
     if (ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS
             && i >= 65281 && i <= 65374) {
         /** convert HALFWIDTH_AND_FULLWIDTH_FORMS to BASIC_LATIN */
         i = i - 65248;
         c = (char) i;
     }
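A quick sanity check of the offset and the proposed bounds, as a hypothetical standalone helper (not CJKTokenizer code):

```java
// Mirrors the proposed fix: only code points in 65281-65374 (U+FF01-U+FF5E)
// map onto Basic Latin by subtracting 65248.
public class FullwidthSketch {

    static char toHalfwidth(char c) {
        int i = (int) c;
        if (i >= 65281 && i <= 65374) {
            return (char) (i - 65248); // fullwidth 'Ａ' (U+FF21) -> 'A'
        }
        return c; // e.g. U+FF68 has no Basic Latin counterpart, left as-is
    }
}
```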

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





Re: move TrieRange* to core?

2009-03-18 Thread Michael McCandless
Uwe Schindler u...@thetaphi.de wrote:

 If so... maybe we could extend FieldCache's parser to allow it to
 stop-early?  Ie it'd get the TermEnum, iterate through all the full
 precision terms first, asking your parser to convert to long/int,
 and then when your parser sees the very first not-full-precision
 term, it tells FieldCache to stop.

 Would that work?

 Yes, good idea! In this case it is really better, that the higher
 precision terms come first. The question is how to implement that /
 extend the current API.

Maybe, to also allow extensibility for LUCENE-1372, we should let a
parser optionally just do the whole loop?

Ie, you're given an IndexReader & a String field, and you return an
int[].

We could eg make an AdvancedIntParser abstract class, implementing
IntParser, and then getInts would check if the parser you passed in is
an instance of AdvancedIntParser, and would just call its getInts
method if so.

It's a bit ugly, because AdvancedIntParser would have to implement a
no-op parseInt.  But it should be back compatible.
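A rough sketch of that idea; all names are illustrative, and the real FieldCache would hand the parser an IndexReader and field name rather than the term list used here:

```java
import java.util.List;

// Sketch of the AdvancedIntParser proposal: a parser subtype that takes
// over the whole fill loop, dispatched via instanceof for back compatibility.
public class AdvancedParserSketch {

    interface IntParser {
        int parseInt(String term); // existing per-term hook
    }

    static abstract class AdvancedIntParser implements IntParser {
        public int parseInt(String term) {
            // no-op: never called, the default loop is bypassed entirely
            throw new UnsupportedOperationException();
        }
        abstract int[] getInts(List<String> terms);
    }

    static int[] getInts(List<String> terms, IntParser parser) {
        if (parser instanceof AdvancedIntParser) {
            return ((AdvancedIntParser) parser).getInts(terms); // parser drives
        }
        int[] result = new int[terms.size()];
        for (int i = 0; i < terms.size(); i++) {
            result[i] = parser.parseInt(terms.get(i)); // default per-term loop
        }
        return result;
    }

    // Example advanced parser that parses decimal terms in one pass.
    static final AdvancedIntParser DECIMAL = new AdvancedIntParser() {
        int[] getInts(List<String> terms) {
            return terms.stream().mapToInt(Integer::parseInt).toArray();
        }
    };
}
```

Existing IntParser implementations keep working unchanged, which is the back-compatibility point made above.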

Mike




[jira] Created: (LUCENE-1567) New flexible query parser

2009-03-18 Thread Luis Alves (JIRA)
New flexible query parser
-

 Key: LUCENE-1567
 URL: https://issues.apache.org/jira/browse/LUCENE-1567
 Project: Lucene - Java
  Issue Type: New Feature
  Components: QueryParser
 Environment: N/A
Reporter: Luis Alves


From the "New flexible query parser" thread by Michael Busch:

in my team at IBM we have used a different query parser than Lucene's in
our products for quite a while. Recently we spent a significant amount
of time in refactoring the code and designing a very generic
architecture, so that this query parser can be easily used for different
products with varying query syntaxes.

This work was originally driven by Andreas Neumann (who, however, left
our team); most of the code was written by Luis Alves, who has been a
bit active in Lucene in the past, and Adriano Campos, who joined our
team at IBM half a year ago. Adriano is Apache committer and PMC member
on the Tuscany project and getting familiar with Lucene now too.

We think this code is much more flexible and extensible than the current
Lucene query parser, and would therefore like to contribute it to
Lucene. I'd like to give a very brief architecture overview here,
Adriano and Luis can then answer more detailed questions as they're much
more familiar with the code than I am.
The goal was to separate the syntax and semantics of a query. E.g. 'a AND
b', '+a +b', 'AND(a,b)' could be different syntaxes for the same query.
We distinguish the semantics of the different query components, e.g.
whether and how to tokenize/lemmatize/normalize the different terms or
which Query objects to create for the terms. We wanted to be able to
write a parser with a new syntax, while reusing the underlying
semantics, as quickly as possible.
In fact, Adriano is currently working on a 100% Lucene-syntax compatible
implementation to make it easy for people who are using Lucene's query
parser to switch.

The query parser has three layers and its core is what we call the
QueryNodeTree. It is a tree that initially represents the syntax of the
original query, e.g. for 'a AND b':
  AND
 /   \
A B

The three layers are:
1. QueryParser
2. QueryNodeProcessor
3. QueryBuilder

1. The upper layer is the parsing layer which simply transforms the
query text string into a QueryNodeTree. Currently our implementations of
this layer use javacc.
2. The query node processors do most of the work. It is in fact a
configurable chain of processors. Each processor can walk the tree and
modify nodes or even the tree's structure. That makes it possible to
e.g. do query optimization before the query is executed or to tokenize
terms.
3. The third layer is also a configurable chain of builders, which
transform the QueryNodeTree into Lucene Query objects.
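[A toy caricature of those three layers, for readers skimming the architecture; Strings stand in for the real QueryNodeTree and Query types, and every name below is illustrative only:]

```java
import java.util.List;

// Layer 1 parses text to a tree, layer 2 runs a configurable processor
// chain over it, layer 3 builds the final Query from the processed tree.
public class QueryPipelineSketch {

    interface QueryNodeProcessor {
        String process(String tree); // may rewrite nodes or the whole tree
    }

    // Layer 1: query text -> syntax tree (the real code uses javacc here).
    static String parse(String queryText) {
        return "TREE(" + queryText + ")";
    }

    // Layer 2: run the processor chain, e.g. optimization or tokenization.
    static String runProcessors(String tree, List<QueryNodeProcessor> chain) {
        for (QueryNodeProcessor p : chain) {
            tree = p.process(tree);
        }
        return tree;
    }

    // Layer 3: processed tree -> Lucene Query object (a String here).
    static String build(String tree) {
        return "Query[" + tree + "]";
    }
}
```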

Furthermore, the query parser uses flexible configuration objects, which
are based on AttributeSource/Attribute. It also uses message classes that
allow attaching resource bundles. This makes it possible to translate
messages, which is an important feature of a query parser.

This design allows us to develop different query syntaxes very quickly.
Adriano wrote the Lucene-compatible syntax in a matter of hours, and the
underlying processors and builders in a few days. We now have a 100%
compatible Lucene query parser, which means the syntax is identical and
all query parser test cases pass on the new one too using a wrapper.


Recent posts show that there is demand for query syntax improvements,
e.g. improved range query syntax or operator precedence. There are
already different QP implementations in Lucene+contrib, however I think
we did not keep them all up to date and in sync. This is not too
surprising, because usually when fixes and changes are made to the main
query parser, people don't make the corresponding changes in the contrib
parsers. (I'm guilty here too)
With this new architecture it will be much easier to maintain different
query syntaxes, as the actual code for the first layer is not very much.
All syntaxes would benefit from patches and improvements we make to the
underlying layers, which will make supporting different syntaxes much
more manageable.







[jira] Commented: (LUCENE-1567) New flexible query parser

2009-03-18 Thread Luis Alves (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683308#action_12683308
 ] 

Luis Alves commented on LUCENE-1567:


Should the Flexible Query Parser patch be committed to the trunk
as a replacement for the old query parser? 

The current implementation uses Java 1.5 syntax.
Is that OK if we commit it to the trunk?



 New flexible query parser
 -

 Key: LUCENE-1567
 URL: https://issues.apache.org/jira/browse/LUCENE-1567
 Project: Lucene - Java
  Issue Type: New Feature
  Components: QueryParser
 Environment: N/A
Reporter: Luis Alves





[jira] Commented: (LUCENE-1567) New flexible query parser

2009-03-18 Thread Adriano Crestani (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683313#action_12683313
 ] 

Adriano Crestani commented on LUCENE-1567:
--

It's probably not OK, since the Lucene build script will probably fail because of 
that. We are working on a patch which we will upload to this JIRA soon; it will 
only be for the community to review the new query parser code, not to be 
committed against the trunk. I think somebody could create a sandbox and commit 
the code; it would be easier for others to review the new query parser.

I think the right question is whether we should include this new parser in 
release 2.9; if yes, then we definitely need to change the code to be Java 1.4 
compatible. Anyway, before taking this decision, the code must be available to 
the community : )

Best Regards,

 New flexible query parser
 -

 Key: LUCENE-1567
 URL: https://issues.apache.org/jira/browse/LUCENE-1567
 Project: Lucene - Java
  Issue Type: New Feature
  Components: QueryParser
 Environment: N/A
Reporter: Luis Alves
