Re: IndexWriter.rollback() logic

2009-03-18 Thread Nadav Har'El
On Mon, Feb 23, 2009, Jason Rutherglen wrote about Re: IndexWriter.rollback() 
logic:
 Howdy An,
 
 Commit means the changes are committed, there's no rollback at that point.
 
 Also in the future, please post your questions to java-dev@lucene.apache.org

Actually, An does make a good point that needs to be corrected (by developers,
not by users ;-)) - the javadoc is a bit misleading. rollback's javadoc says

  Close the IndexWriter without committing any of the changes that have
  occurred since it was opened. This removes any temporary files that had
  been created, after which the state of the index will be the same as it
  was when this writer was first opened. 

But, this isn't exactly true - it doesn't always revert to the state of the
open(), but rather to the last commit() if such was done. For most intents
and purposes (including this one), commit() is equivalent to a close()
followed by a new open(), but a person reading this javadoc wouldn't know that.
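A minimal sketch of the distinction (hypothetical, not code from the thread; openWriter and the constructor details are elided/invented, but commit()/rollback() are IndexWriter's real methods):

```java
// Sketch only: illustrates that rollback() reverts to the last commit(),
// not to the state at open().
IndexWriter writer = openWriter(dir);  // hypothetical helper; args elided
writer.addDocument(docA);
writer.commit();            // checkpoint: docA is now durable
writer.addDocument(docB);   // pending change, not yet committed
writer.rollback();          // discards docB only; the index now matches
                            // the last commit(), not the state at open()
```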

-- 
Nadav Har'El                | Wednesday, Mar 18 2009, 22 Adar 5769
IBM Haifa Research Lab      |-----------------------------------------
http://nadav.harel.org.il   |Hi! I'm a signature virus! Copy me into
                            |your signature to help me spread!

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Make TermScorer non final

2009-03-18 Thread Simon Willnauer
Nothing different, I'm just concerned about the performance, as the
SpanQuerys take about twice as long as a term query.
I ran a little benchmark and found BoostingTermQuery to be 1.5 times
slower than TermQuery without any payloads in the index.
In some use cases this could be important, especially where the power of
a span query is not required.

Maybe I'm missing something; if so, please let me know.
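For what it's worth, the general shape of such a micro-benchmark can be sketched with a tiny standalone harness (the two tasks below are placeholders, not real TermQuery/BoostingTermQuery executions, which would need a searcher and an index):

```java
import java.util.function.Supplier;

// Generic micro-benchmark skeleton; the timed tasks are stand-ins only.
public class QueryBench {
    // Average wall-clock nanoseconds per call, with a warm-up pass so the
    // JIT has compiled the hot path before timing starts.
    static long avgNanos(Supplier<?> task, int iters) {
        for (int i = 0; i < iters; i++) task.get();   // warm-up
        long start = System.nanoTime();
        for (int i = 0; i < iters; i++) task.get();
        return Math.max(1, (System.nanoTime() - start) / iters);
    }

    public static void main(String[] args) {
        // placeholders for "run a TermQuery" / "run a BoostingTermQuery"
        long base = avgNanos(() -> Integer.sum(1, 2), 200_000);
        long payload = avgNanos(() -> Math.sin(System.nanoTime()), 200_000);
        System.out.printf("slowdown factor: %.2fx%n", (double) payload / base);
    }
}
```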

simon
On Tue, Mar 17, 2009 at 11:15 PM, Grant Ingersoll gsing...@apache.org wrote:
 What does PayloadTermQuery do that BoostingTermQuery doesn't do?

 -Grant

 On Mar 17, 2009, at 1:27 PM, Simon Willnauer wrote:

 Hi, I looked at TermScorer today in order to implement a TermQuery that
 utilizes Payloads from the index.
 I realized that this class is final in the current trunk. It's kind of
 obvious that it is declared final for optimization purposes.
 I want to know if it is possible to make it non-final in the next
 release or later, to use it in a PayloadTermQuery class.
 I would like to reuse this code and do some additional cleanups, like
 removing the code redundancy in score() / score(HitCollector, int).

 Thanks,
 Simon




Re: IndexWriter.rollback() logic

2009-03-18 Thread Michael McCandless


Nadav Har'El wrote:

On Mon, Feb 23, 2009, Jason Rutherglen wrote about Re: IndexWriter.rollback()
logic:

Howdy An,

Commit means the changes are committed, there's no rollback at that point.

Also in the future please post your questions to java-dev@lucene.apache.org


Actually, An does make a good point that needs to be corrected (by developers,
not by users ;-)) - the javadoc is a bit misleading. rollback's javadoc says

  Close the IndexWriter without committing any of the changes that have
  occurred since it was opened. This removes any temporary files that had
  been created, after which the state of the index will be the same as it
  was when this writer was first opened.

But, this isn't exactly true - it doesn't always revert to the state of the
open(), but rather to the last commit() if such was done. For most intents
and purposes (including this one), commit() is equivalent to a close()
followed by a new open(), but a person reading this javadoc wouldn't know that.

Thanks Nadav; I'll fix the javadocs.

Mike




Re: IndexWriter.rollback() logic

2009-03-18 Thread Michael McCandless


Also, rollback is still possible after a commit as long as you're using
a deletion policy that keeps more than one commit around, by
opening the IndexWriter on a prior commit point.
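A hedged sketch of that approach (KeepAllCommitsPolicy and pickPriorCommit are invented names for illustration; IndexReader.listCommits and the IndexWriter constructor taking an IndexCommit follow the 2.9-era API, so exact signatures may differ):

```java
// 1. Index with a deletion policy that keeps every commit point alive
//    (an IndexDeletionPolicy implementation; invented class name).
IndexDeletionPolicy keepAll = new KeepAllCommitsPolicy();

// 2. Later, list the surviving commits and reopen the writer on an
//    earlier one, effectively rolling back past the newer commit(s).
IndexCommit prior = pickPriorCommit(IndexReader.listCommits(dir)); // hypothetical chooser
IndexWriter w = new IndexWriter(dir, analyzer, keepAll,
                                IndexWriter.MaxFieldLength.UNLIMITED, prior);
```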

Mike

Nadav Har'El wrote:

On Mon, Feb 23, 2009, Jason Rutherglen wrote about Re: IndexWriter.rollback()
logic:

Howdy An,

Commit means the changes are committed, there's no rollback at that point.

Also in the future please post your questions to java-dev@lucene.apache.org


Actually, An does make a good point that needs to be corrected (by developers,
not by users ;-)) - the javadoc is a bit misleading. rollback's javadoc says

  Close the IndexWriter without committing any of the changes that have
  occurred since it was opened. This removes any temporary files that had
  been created, after which the state of the index will be the same as it
  was when this writer was first opened.

But, this isn't exactly true - it doesn't always revert to the state of the
open(), but rather to the last commit() if such was done. For most intents
and purposes (including this one), commit() is equivalent to a close()
followed by a new open(), but a person reading this javadoc wouldn't know that.

--
Nadav Har'El                | Wednesday, Mar 18 2009, 22 Adar 5769
IBM Haifa Research Lab      |-----------------------------------------
http://nadav.harel.org.il   |Hi! I'm a signature virus! Copy me into
                            |your signature to help me spread!




Re: Make TermScorer non final

2009-03-18 Thread Grant Ingersoll
See https://issues.apache.org/jira/browse/LUCENE-1017 for some  
background.  Have you measured BTQ versus the SpanTermQuery?  Position  
based stuff is often slower.


SpanQueries could use some performance assessments, that is for sure.   
Ideally, I think you should compare:

TermQuery v. SpanTQ v. BTQ

-Grant


On Mar 18, 2009, at 5:43 AM, Simon Willnauer wrote:


Nothing different, I'm just concerned about the performance, as the
SpanQuerys take about twice as long as a term query.
I ran a little benchmark and found BoostingTermQuery to be 1.5 times
slower than TermQuery without any payloads in the index.
In some use cases this could be important, especially where the power of
a span query is not required.

Maybe I'm missing something; if so, please let me know.

simon
On Tue, Mar 17, 2009 at 11:15 PM, Grant Ingersoll gsing...@apache.org wrote:

What does PayloadTermQuery do that BoostingTermQuery doesn't do?

-Grant

On Mar 17, 2009, at 1:27 PM, Simon Willnauer wrote:


Hi, I looked at TermScorer today in order to implement a TermQuery that
utilizes Payloads from the index.
I realized that this class is final in the current trunk. It's kind of
obvious that it is declared final for optimization purposes.
I want to know if it is possible to make it non-final in the next
release or later, to use it in a PayloadTermQuery class.
I would like to reuse this code and do some additional cleanups, like
removing the code redundancy in score() / score(HitCollector, int).

Thanks,
Simon




[jira] Commented: (LUCENE-1522) another highlighter

2009-03-18 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12682987#action_12682987
 ] 

Michael McCandless commented on LUCENE-1522:



OK to sum up here with observations / wish list / ideas /
controversies / etc. for Lucene's future merged highlighter:

  * Fragmenter should aim for fast eye + brain scanning
consumability (eg, try hard to start on sentence boundaries,
include context)

  * Let's try for single source -- each Query/Weight/Scorer should be
able to enumerate the set of term positions/spans that caused it
to match a specific doc (like explain(), but provides
positions/spans detailing the match).  Trying to reverse
engineer the matching is brittle

  * Sliding window is better than static top down fragmentation

  * To scale, we should make a simple IndexReader impl on top of term
vectors, but still allow the re-index single doc on the fly
option

  * Favoring breadth (more unique terms instead of many occurrences of
certain terms) seems important, except for too-many-term queries
where this gets unwieldy

  * Prefer a single fragment if it scores well enough, but fall back
to several, if necessary, to show breadth

  * Produce structured output so non-HTML front ends (eg Flex) can
render

  * Try to include context around the hits, when possible (eg the
favor-the-middle-of-the-sentence approach that Michael described)

  * Maybe or maybe don't let IDF affect fragment scoring

  * Performance is important -- use TermVectors if present, add early
termination if you've already found a good enough fragdoc, etc.

  * Maybe a tree-based fragdoc enumeration / searching model; I think
this'd be even more efficient than sliding window, especially for
large docs

  * Multi-color, heat-map, out-of-the-box default HTML UIs are nice

  * It's all very subjective and quite a good challenge!!

In the meantime, it seems like we should commit this H2 and give users
the choice?  We can then iterate over time on our wish list.


 another highlighter
 ---

 Key: LUCENE-1522
 URL: https://issues.apache.org/jira/browse/LUCENE-1522
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/highlighter
Reporter: Koji Sekiguchi
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: colored-tag-sample.png, LUCENE-1522.patch, 
 LUCENE-1522.patch


 I've written this highlighter for my project to support bi-gram token streams 
 (a general token stream (e.g. WhitespaceTokenizer) is also supported; see the 
 test code in the patch). The idea was inherited from my previous project with 
 my colleague and LUCENE-644. This approach needs highlight fields to be 
 TermVector.WITH_POSITIONS_OFFSETS, but is fast and can support N-grams. This 
 depends on LUCENE-1448 to get refined term offsets.
 usage:
 {code:java}
 TopDocs docs = searcher.search( query, 10 );
 Highlighter h = new Highlighter();
 FieldQuery fq = h.getFieldQuery( query );
 for( ScoreDoc scoreDoc : docs.scoreDocs ){
   // fieldName=content, fragCharSize=100, numFragments=3
   String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc,
 "content", 100, 3 );
   if( fragments != null ){
 for( String fragment : fragments )
   System.out.println( fragment );
   }
 }
 {code}
 features:
 - fast for large docs
 - supports not only whitespace-based token stream, but also fixed size 
 N-gram (e.g. (2,2), not (1,3)) (can solve LUCENE-1489)
 - supports PhraseQuery, phrase-unit highlighting with slops
 {noformat}
 q=w1 w2
 <b>w1 w2</b>
 ---
 q=w1 w2~1
 <b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
 {noformat}
 - highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
 - easy to apply patch due to independent package (contrib/highlighter2)
 - uses Java 1.5
 - looks at query boost to score fragments (currently doesn't see idf, but it 
 should be possible)
 - pluggable FragListBuilder
 - pluggable FragmentsBuilder
 to do:
 - term positions can be unnecessary when phraseHighlight==false
 - collect performance numbers

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Commented: (LUCENE-1522) another highlighter

2009-03-18 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12682985#action_12682985
 ] 

Michael McCandless commented on LUCENE-1522:



{quote}
 ANDQuery, ORQuery, and RequiredOptionalQuery just return the union of the
 spans produced by their children.
 
 Hmm - it seems like that loses information. Ie, for ANDQuery, you lose the 
 fact that you should try to include a match from each of the sub-clauses' 
 spans.

A good idea. ANDQuery's highlightSpans() method could probably be improved by
post-processing the child spans to take this into account. That way we
wouldn't have to gum up the main Highlighter code with a bunch of conditionals
which afford special treatment to certain query types.
{quote}

I think we may need a tree-structured result returned by the
Weight/Scorer, compactly representing the space of valid fragdocs
for this one doc.  And then somehow we walk that tree,
enumerating/scoring individual valid fragdocs that are created from
that tree.

{quote}
 What I meant was: all other things being equal, do you more strongly
 favor a fragment that has all N of the terms in a query vs another
 fragment that has fewer than N but say higher net number of occurrences.

No, the diversity of the terms in a fragment isn't factored in. The span 
objects only tell the Highlighter that a particular range of characters 
was important; they don't say why.

However, note that IDF would prevent a bunch of hits on "the" from causing too
hot a hotspot in the heat map. So you're likely to see fragments with high
discriminatory value.
{quote}

This still seems subjectively wrong to me.  If I search for "president
bush", probably "bush" is the rarer term, and so you would favor showing
me a single fragment that had "bush" occur twice, over a fragment that
had a single occurrence of "president" and "bush"?

{quote}
 Google picks more than one fragment; it seems like it picks one or two
 fragments.

I probably overstated my opposition to supplying an excerpt containing more
than one fragment. It seems OK to me to select more than one, so long as they
all scan easily, and so long as the excerpts don't get long enough to force
excessive scrolling and slow down the time it takes the user to scan the whole
results page.

What bothers me is that the excerpts don't scan easily right now. I consider
that a much more important defect than the fact that the fragdoc doesn't hit 
every term (which isn't even possible for large queries), and it seemed to me 
that pursuing exhaustive term matching was likely to yield even more highly 
fragmented, visually chaotic fragdocs.
{quote}

Which excerpts don't scan easily right now?  Google's, KS's, Lucene's
H1 or H2?

I think with a tree structure representing the search space for all
fragdocs, we could then efficiently enumerate fragdocs with an
appropriate scoring model (favoring sentence starts or surrounding
context, breadth of terms, etc.).  This way we can do a real search
(on all fragdocs) subject to the preference for
consumability/breadth.



Re: Make TermScorer non final

2009-03-18 Thread Michael McCandless


Coming from the discussions in LUCENE-1522 (improving highlighter), I
think at some point we should merge Span*Query into their normal
counterparts, if possible.

Ie, there should be only one TermQuery that can do both what the
current TermQuery does, and also what SpanTermQuery does.  It's able
to enumerate the spans/payloads for a given document, and if you don't
request those, the performance should hopefully be equal to that of
the current TermQuery.

The highlighter would in fact request spans for a normal TermQuery,
on a single-doc index at a time, in order to locate the hits.

Likewise for SpanOrQuery, SpanAndQuery.

I have no real sense of how much work this is, what problems would
ensue (eg possible difference in scoring, etc.), but from
highlighter's standpoint, ideally all queries need to be able to
enumerate the collection of positions that established the match.

Mike

Grant Ingersoll wrote:

See https://issues.apache.org/jira/browse/LUCENE-1017 for some
background.  Have you measured BTQ versus the SpanTermQuery?
Position based stuff is often slower.


SpanQueries could use some performance assessments, that is for
sure.  Ideally, I think you should compare:

TermQuery v. SpanTQ v. BTQ

-Grant


On Mar 18, 2009, at 5:43 AM, Simon Willnauer wrote:

Nothing different, I'm just concerned about the performance, as the
SpanQuerys take about twice as long as a term query.
I ran a little benchmark and found BoostingTermQuery to be 1.5 times
slower than TermQuery without any payloads in the index.
In some use cases this could be important, especially where the power
of a span query is not required.

Maybe I'm missing something; if so, please let me know.

simon
On Tue, Mar 17, 2009 at 11:15 PM, Grant Ingersoll gsing...@apache.org wrote:

What does PayloadTermQuery do that BoostingTermQuery doesn't do?

-Grant

On Mar 17, 2009, at 1:27 PM, Simon Willnauer wrote:

Hi, I looked at TermScorer today in order to implement a TermQuery that
utilizes Payloads from the index.
I realized that this class is final in the current trunk. It's kind of
obvious that it is declared final for optimization purposes.
I want to know if it is possible to make it non-final in the next
release or later, to use it in a PayloadTermQuery class.
I would like to reuse this code and do some additional cleanups, like
removing the code redundancy in score() / score(HitCollector, int).

Thanks,
Simon




Re: Make TermScorer non final

2009-03-18 Thread Mark Miller

In some usecases this could be important especially where the power of
a span query is not required.


I think the power of a span query is required for payloads, though - a term 
query will not hit each position to do payload loading; there is no need for 
a term query to enumerate positions. Right?




Simon Willnauer wrote:

Nothing different, I'm just concerned about the performance, as the
SpanQuerys take about twice as long as a term query.
I ran a little benchmark and found BoostingTermQuery to be 1.5 times
slower than TermQuery without any payloads in the index.
In some use cases this could be important, especially where the power of
a span query is not required.

Maybe I'm missing something; if so, please let me know.

simon
On Tue, Mar 17, 2009 at 11:15 PM, Grant Ingersoll gsing...@apache.org wrote:

What does PayloadTermQuery do that BoostingTermQuery doesn't do?

-Grant

On Mar 17, 2009, at 1:27 PM, Simon Willnauer wrote:

Hi, I looked at TermScorer today in order to implement a TermQuery that
utilizes Payloads from the index.
I realized that this class is final in the current trunk. It's kind of
obvious that it is declared final for optimization purposes.
I want to know if it is possible to make it non-final in the next
release or later, to use it in a PayloadTermQuery class.
I would like to reuse this code and do some additional cleanups, like
removing the code redundancy in score() / score(HitCollector, int).

Thanks,
Simon




--
- Mark

http://www.lucidimagination.com







Re: Make TermScorer non final

2009-03-18 Thread Simon Willnauer
On Wed, Mar 18, 2009 at 1:32 PM, Mark Miller markrmil...@gmail.com wrote:
 In some usecases this could be important especially where the power of
 a span query is not required.

 I think the power of a spanquery is required for payloads though - the term
 query will not hit each position to do payload loading - there is no need
 for termquery to enumerate positions. Right?
No, you are right; a term query does not need to enumerate the TermPositions.
This doesn't mean that it cannot look at them, though. Issue
https://issues.apache.org/jira/browse/LUCENE-1017 apparently did
some measurements without finding a significant performance improvement. I
didn't expect a large improvement anyway, but without knowledge of
that issue it was worth looking into.

One thing I want to mention as an aside: as long as TermScorer is final there
is no problem with the implementation besides some redundant code.
TermScorer does not use the float score() method to calculate the
score in score(HitCollector, int); rather, it duplicates the code, for
performance reasons I assume. I have cleaned up this code a little
as I was going to implement payloads using this class. If it is
desirable to have this code cleaned up, I can submit a patch.
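The cleanup Simon describes can be illustrated with a stripped-down standalone sketch (invented names; this is not Lucene's actual TermScorer): the bulk-scoring loop delegates to score(), so the scoring formula lives in one place instead of being duplicated inline.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a toy scorer showing the de-duplication idea.
interface ScoreCollector { void collect(int doc, float score); }

class SketchScorer {
    private final float[] tf;   // pretend per-doc term frequencies
    private int doc = -1;

    SketchScorer(float[] tf) { this.tf = tf; }

    boolean next() { return ++doc < tf.length; }

    // The scoring formula lives in exactly one place...
    float score() { return (float) Math.sqrt(tf[doc]); }

    // ...and the bulk loop reuses it rather than inlining a copy.
    void scoreAll(ScoreCollector collector) {
        while (next()) collector.collect(doc, score());
    }
}

public class TermScorerCleanupSketch {
    public static void main(String[] args) {
        List<Float> scores = new ArrayList<>();
        new SketchScorer(new float[] {1f, 4f, 9f})
            .scoreAll((d, s) -> scores.add(s));
        System.out.println(scores); // prints [1.0, 2.0, 3.0]
    }
}
```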

Thanks,

simon




 Simon Willnauer wrote:

 Nothing different, I'm just concerned about the performance, as the
 SpanQuerys take about twice as long as a term query.
 I ran a little benchmark and found BoostingTermQuery to be 1.5 times
 slower than TermQuery without any payloads in the index.
 In some use cases this could be important, especially where the power of
 a span query is not required.

 Maybe I'm missing something; if so, please let me know.

 simon
 On Tue, Mar 17, 2009 at 11:15 PM, Grant Ingersoll gsing...@apache.org
 wrote:

 What does PayloadTermQuery do that BoostingTermQuery doesn't do?

 -Grant

 On Mar 17, 2009, at 1:27 PM, Simon Willnauer wrote:

 Hi, I looked at TermScorer today in order to implement a TermQuery that
 utilizes Payloads from the index.
 I realized that this class is final in the current trunk. It's kind of
 obvious that it is declared final for optimization purposes.
 I want to know if it is possible to make it non-final in the next
 release or later, to use it in a PayloadTermQuery class.
 I would like to reuse this code and do some additional cleanups, like
 removing the code redundancy in score() / score(HitCollector, int).

 Thanks,
 Simon







[jira] Commented: (LUCENE-1522) another highlighter

2009-03-18 Thread Marvin Humphrey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683030#action_12683030
 ] 

Marvin Humphrey commented on LUCENE-1522:
-

 I think we may need a tree-structured result returned by the
 Weight/Scorer, compactly representing the space of valid fragdocs
 for this one doc. And then somehow we walk that tree,
 enumerating/scoring individual valid fragdocs that are created from
 that tree.

Something like that.  An array of span scores is too limited; a full-fledged
class would do better.  Designing that class requires striking a balance
between what information we think is useful and what information Highlighter
can sanely reduce.  By proposing the tree structure, you're suggesting that
Highlighter will reverse engineer boolean matching; that sounds like a lot of
work to me.

 However, note that IDF would prevent a bunch of hits on "the" from causing
 too hot a hotspot in the heat map. So you're likely to see fragments with
 high discriminatory value.

 This still seems subjectively wrong to me. If I search for "president
 bush", probably "bush" is the rarer term and so you would favor showing
 me a single fragment that had "bush" occur twice, over a fragment that
 had a single occurrence of "president" and "bush"?

We've ended up in a false dichotomy.  Favoring high-IDF terms -- or more
accurately, high-scoring character position spans -- and favoring fragments
with high term diversity are not mutually exclusive.

Still, the KS highlighter probably wouldn't do what you describe.  The proximity
boosting accelerates as the spans approach each other, and maxes out if
they're adjacent.  So "bush bush" might be preferred over "president bush",
but "bush or bush" probably wouldn't.

I don't think that there's anything wrong with preferring high term diversity;
the KS highlighter doesn't happen to support favoring fragments with high term
diversity now, but would be improved by adding that capability.  I just don't
think term diversity is so important that it qualifies as a base litmus
test.

There are other ways of choosing good fragments, and IDF is one of them.  If
you want to show why a doc matched a query, it makes sense to show the section
of the document that contributed most to the score, surrounded by a little
context.  
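As a toy illustration of that idea (purely illustrative; not the KS or Lucene highlighter algorithm), a sliding window can be scored by the summed IDF of the query terms it contains, and the best-scoring window shown:

```java
import java.util.Map;

// Toy sliding-window fragment picker: the window whose query terms carry
// the most summed IDF is the "section that contributed most to the score".
public class FragmentSketch {
    static int bestWindowStart(String[] tokens, Map<String, Double> idf, int window) {
        int bestStart = 0;
        double bestScore = -1;
        for (int start = 0; start + window <= tokens.length; start++) {
            double score = 0;
            for (int i = start; i < start + window; i++) {
                score += idf.getOrDefault(tokens[i], 0.0);
            }
            if (score > bestScore) { bestScore = score; bestStart = start; }
        }
        return bestStart;
    }

    public static void main(String[] args) {
        String[] doc = "the president met bush at the white house".split(" ");
        Map<String, Double> idf = Map.of("president", 2.0, "bush", 4.0);
        int start = bestWindowStart(doc, idf, 3);
        System.out.println(start); // prints 1: window "president met bush"
    }
}
```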

 Which excerpts don't scan easily right now? Google's, KS's, Lucene's
 H1 or H2?

Lucene H1.  Too many ellipses, and fragments don't prefer to start on sentence
boundaries.

I have to qualify the assertion that the fragments don't scan well with the
caveat that I'm basing it on a personal impression.  However, I'm pretty
confident about that impression.  I would be stunned if there were not studies
out there demonstrating that sentence fragments which begin at the top are
easier to consume than sentence fragments which begin in the middle.


[jira] Assigned: (LUCENE-1550) Add N-Gram String Matching for Spell Checking

2009-03-18 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll reassigned LUCENE-1550:
---

Assignee: Grant Ingersoll

 Add N-Gram String Matching for Spell Checking
 -

 Key: LUCENE-1550
 URL: https://issues.apache.org/jira/browse/LUCENE-1550
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/spellchecker
Affects Versions: 2.9
Reporter: Thomas Morton
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1550.patch


 N-Gram version of edit distance based on paper by Grzegorz Kondrak, N-gram 
 similarity and distance. Proceedings of the Twelfth International Conference 
 on String Processing and Information Retrieval (SPIRE 2005), pp. 115-126,  
 Buenos Aires, Argentina, November 2005. 
 http://www.cs.ualberta.ca/~kondrak/papers/spire05.pdf






[jira] Commented: (LUCENE-1522) another highlighter

2009-03-18 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683032#action_12683032
 ] 

Mark Miller commented on LUCENE-1522:
-

bq. Lucene H1. Too many ellipses, and fragments don't prefer to start on 
sentence boundaries.

That's not necessarily a property of the Highlighter, just the basic 
implementations we currently supply for the pluggable classes. You can supply a 
custom fragmenter and you can control the number of fragments.




[jira] Commented: (LUCENE-1522) another highlighter

2009-03-18 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683053#action_12683053
 ] 

Michael McCandless commented on LUCENE-1522:



{quote}
Something like that. An array of span scores is too limited; a full fledged
class would do better. Designing that class requires striking a balance
between what information we think is useful and what information Highlighter
can sanely reduce.
{quote}

Agreed, and I'm not sure about the tree structure (just floating
ideas...).  It could very well be overkill.

{quote}
By proposing the tree structure, you're suggesting that 
Highlighter will reverse engineer boolean matching; that sounds like a lot of 
work to me.
{quote}

It wouldn't be reverse engineered: BooleanQuery/Weight/Scorer2 itself
will have returned that.  Ie we would add a method to the scorer:
getSpanTree().

{quote}
Still, the KS highlighter probably wouldn't do what you describe.  The proximity
boosting accelerates as the spans approach each other, and maxes out if 
they're adjacent.  So "bush bush" might be preferred over "president bush", 
but "bush or bush" probably wouldn't.
{quote}

OK, it sounds like one can simply use different models to score
fragdocs and it's still an open debate how much each of these criteria
(IDF, showing surround context, being on sentence boundary, diversity
of terms) should impact the score.  I agree, the basic litmus test I
proposed is too strong.

{quote}
bq. Lucene H1. Too many ellipses, and fragments don't prefer to start on 
sentence boundaries.

That's not necessarily a property of the Highlighter, just the basic
implementations we currently supply for the pluggable classes. You can
supply a custom fragmenter and you can control the number of
fragments.
{quote}

I agree: H1 is very pluggable and one could plug in a better
fragmenter, but we don't offer such an impl in H1, and this is a case
where out-of-the-box defaults are very important.





[jira] Assigned: (LUCENE-1145) DisjunctionSumScorer small tweak

2009-03-18 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-1145:
--

Assignee: Michael McCandless

 DisjunctionSumScorer small tweak
 

 Key: LUCENE-1145
 URL: https://issues.apache.org/jira/browse/LUCENE-1145
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
 Environment: all
Reporter: Eks Dev
Assignee: Michael McCandless
Priority: Trivial
 Fix For: 2.9

 Attachments: DisjunctionSumScorerOptimization.patch, 
 DSSQueueSizeOptimization.patch, TestScorerPerformance.java


 Move ScorerDocQueue initialization from next() and skipTo() methods to the 
 Constructor. Makes DisjunctionSumScorer a bit faster (less than 1% on my 
 tests). 
 Downside (if this is one, I cannot judge) would be throwing IOException from 
 DisjunctionSumScorer constructors as we touch HardDisk there. I see no 
 problem as this IOException does not propagate too far (the only modification 
 I made is in BooleanScorer2)
 if (scorerDocQueue == null) {
   initScorerDocQueue();
 }
  
 Attached test is just a quick & dirty rip of TestScorerPerf from the standard 
 Lucene test package. Not included as patch as I do not like it.
 All tests pass; patch made on trunk revision 613923




GSoC 09 project ideas...

2009-03-18 Thread Zaid Md. Abdul Wahab Sheikh
Hi lucene,
In this link http://wiki.apache.org/general/SummerOfCode2009 , there are no
project ideas for Lucene proper. (Only ideas for Mahout listed). Please put
up some ideas for Lucene there or please mention some popular open issues
that might be suitable as a GSoC project.
I would very much like to work on Lucene during Summer of Code 09. I am
currently researching/doing a project on Realtime search.
It seems a contrib exists for realtime search in Lucene.
http://issues.apache.org/jira/browse/LUCENE-1313. Can anyone give me an
update on its status? Is that sufficient/complete, or should I start
investigating possibilities of integrating 'realtime' search in Lucene.
Please comment.

Z.S.


[jira] Commented: (LUCENE-1145) DisjunctionSumScorer small tweak

2009-03-18 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683058#action_12683058
 ] 

Michael McCandless commented on LUCENE-1145:


I plan to commit shortly.




[jira] Commented: (LUCENE-1522) another highlighter

2009-03-18 Thread Marvin Humphrey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683064#action_12683064
 ] 

Marvin Humphrey commented on LUCENE-1522:
-

 OK, it sounds like one can simply use different models to score
 fragdocs and it's still an open debate how much each of these criteria
 (IDF, showing surround context, being on sentence boundary, diversity
 of terms) should impact the score. 

With Michael Busch's priority queue approach, the algorithm for choosing the
fragments can be abstracted into the class of object we put in the queue and
its lessThan() method.  The output from the queue just has to be something the
Highlighter can chew.
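For illustration, that pluggable-ordering idea could look roughly like the following self-contained sketch (FragCandidate and all other names here are invented, standing in for whatever object ends up in the queue; java.util.PriorityQueue with a Comparator plays the role of Lucene's PriorityQueue and its lessThan()):

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class FragmentQueueSketch {
    // Hypothetical fragment candidate: a score plus character offsets.
    static final class FragCandidate {
        final float score;
        final int start, end;
        FragCandidate(float score, int start, int end) {
            this.score = score; this.start = start; this.end = end;
        }
    }

    // Keep the topN best candidates. The Comparator plays the role of
    // lessThan(): swapping it swaps the whole fragment-selection policy
    // without touching the highlighter itself.
    static List<FragCandidate> topFragments(List<FragCandidate> candidates,
                                            int topN,
                                            Comparator<FragCandidate> lessThan) {
        PriorityQueue<FragCandidate> pq = new PriorityQueue<>(topN, lessThan);
        for (FragCandidate c : candidates) {
            pq.offer(c);
            if (pq.size() > topN) {
                pq.poll();              // evict the current "least" candidate
            }
        }
        List<FragCandidate> out = new java.util.ArrayList<>(pq);
        out.sort(lessThan.reversed()); // best first
        return out;
    }

    public static void main(String[] args) {
        Comparator<FragCandidate> byScore =
            Comparator.comparingDouble(c -> c.score);
        List<FragCandidate> best = topFragments(Arrays.asList(
            new FragCandidate(0.3f, 0, 10),
            new FragCandidate(0.9f, 20, 30),
            new FragCandidate(0.5f, 40, 50)), 2, byScore);
        System.out.println(best.get(0).score);  // highest score first
    }
}
```

A proximity-boosting or sentence-boundary policy would then just be a different Comparator.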




Re: GSoC 09 project ideas...

2009-03-18 Thread Jason Rutherglen
Hi Z.S.,

I'll update LUCENE-1313 after LUCENE-1516 is committed.  I can post the
basic new patch I have for LUCENE-1313 (heavily simplified compared to the
previous patches), however it will assume LUCENE-1516.  The other area that
will need to be addressed is standard benchmarking for different realtime
search approaches as we don't know what will be best yet.

What areas in regard to realtime search are you working on?

-J

On Wed, Mar 18, 2009 at 9:04 AM, Zaid Md. Abdul Wahab Sheikh 
sheikh.z...@gmail.com wrote:




Re: GSoC 09 project ideas...

2009-03-18 Thread Michael McCandless


I think creating a better Highlighter for Lucene, which is actively
being discussed:

https://issues.apache.org/jira/browse/LUCENE-1522

would make a good GSoC project, but I don't think I have time to mentor.

Realtime search is currently in progress already, being tracked/iterated
here:

https://issues.apache.org/jira/browse/LUCENE-1516

The original Ocean (LUCENE-1313) that you found was a more ambitious
approach, which after discussions here eventually led to the simpler
approach in LUCENE-1516.

Mike






[jira] Resolved: (LUCENE-1145) DisjunctionSumScorer small tweak

2009-03-18 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1145.


Resolution: Fixed

Thanks Eks and Paul!




[jira] Updated: (LUCENE-1472) DateTools.stringToDate() can cause lock contention under load

2009-03-18 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1472:
---

Fix Version/s: (was: 2.9)

Removing 2.9 target.

 DateTools.stringToDate() can cause lock contention under load
 -

 Key: LUCENE-1472
 URL: https://issues.apache.org/jira/browse/LUCENE-1472
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.3.2
Reporter: Mark Lassau
Priority: Minor

 Load testing our application (the JIRA Issue Tracker) has shown that threads 
 spend a lot of time blocked in DateTools.stringToDate().
 The stringToDate() method uses a singleton SimpleDateFormat object to parse 
 the dates.
 Each call to SimpleDateFormat.parse() is *synchronized* because 
 SimpleDateFormat is not thread safe.
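One common way around that lock is a per-thread formatter. This is only a sketch of the general technique, not the JIRA fix; the "yyyyMMdd" pattern is an example here, while DateTools actually uses its own resolution-dependent formats:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

public class ThreadLocalDateParser {
    // SimpleDateFormat is not thread safe, so a single shared instance must
    // synchronize parse() -- the contention described above. Giving each
    // thread its own instance removes the lock entirely.
    private static final ThreadLocal<SimpleDateFormat> FMT =
        ThreadLocal.withInitial(() -> new SimpleDateFormat("yyyyMMdd"));

    public static Date parse(String s) throws ParseException {
        return FMT.get().parse(s);
    }
}
```

Each thread pays the formatter's construction cost once, and parse() calls then proceed without contention.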




[jira] Commented: (LUCENE-1522) another highlighter

2009-03-18 Thread David Kaelbling (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683079#action_12683079
 ] 

David Kaelbling commented on LUCENE-1522:
-

Hi,

Our application wants to find and highlight all the hits in a document,
not just the best one(s).  If future highlighters still allowed this,
even if only by judicious use of subclasses, I would be happy :-)

Thanks,
David

-- 
David Kaelbling
Senior Software Engineer
Black Duck Software, Inc.

dkaelbl...@blackducksoftware.com
T +1.781.810.2041
F +1.781.891.5145

http://www.blackducksoftware.com







[jira] Updated: (LUCENE-1561) Maybe rename Field.omitTf, and strengthen the javadocs

2009-03-18 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1561:
---

Attachment: LUCENE-1561.patch

Attached patch.  I renamed to omitTermFreqAndPositions, and added a NOTE to 
the javadoc about positional queries silently not working when you use this 
option.  I plan to commit in a day or so.

 Maybe rename Field.omitTf, and strengthen the javadocs
 --

 Key: LUCENE-1561
 URL: https://issues.apache.org/jira/browse/LUCENE-1561
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.4.1
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: LUCENE-1561.patch


 Spinoff from here:
   
 http://www.nabble.com/search-problem-when-indexed-using-Field.setOmitTf()-td22456141.html
 Maybe rename omitTf to something like omitTermPositions, and make it clear 
 what queries will silently fail to work as a result.




[jira] Assigned: (LUCENE-1561) Maybe rename Field.omitTf, and strengthen the javadocs

2009-03-18 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-1561:
--

Assignee: Michael McCandless




[jira] Assigned: (LUCENE-1490) CJKTokenizer convert HALFWIDTH_AND_FULLWIDTH_FORMS wrong

2009-03-18 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-1490:
--

Assignee: Michael McCandless

 CJKTokenizer convert   HALFWIDTH_AND_FULLWIDTH_FORMS wrong
 --

 Key: LUCENE-1490
 URL: https://issues.apache.org/jira/browse/LUCENE-1490
 Project: Lucene - Java
  Issue Type: Bug
Reporter: Daniel Cheng
Assignee: Michael McCandless
 Fix For: 2.4, 2.9


 CJKTokenizer have these lines..
 if (ub == 
 Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS) {
 /** convert  HALFWIDTH_AND_FULLWIDTH_FORMS to BASIC_LATIN 
 */
 int i = (int) c;
 i = i - 65248;
 c = (char) i;
 }
 This is wrong. Some characters in the block (e.g. U+FF68) have no BASIC_LATIN 
 counterparts.
 Only 65281-65374 can be converted this way.
 The fix is
  if (ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS 
  && i <= 65374 && i >= 65281) {
 /** convert  HALFWIDTH_AND_FULLWIDTH_FORMS to BASIC_LATIN 
 */
 int i = (int) c;
 i = i - 65248;
 c = (char) i;
 }
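A self-contained version of the corrected logic (a sketch of the fix quoted above, with the bounds check done after computing the code point, and the range taken from the 65281-65374 note in the report):

```java
public class FullwidthToLatin {
    // Only fullwidth forms U+FF01..U+FF5E (65281..65374) have BASIC_LATIN
    // counterparts exactly 65248 code points below them; anything else in
    // the block (e.g. U+FF68) is returned unchanged.
    public static char normalize(char c) {
        int i = (int) c;
        if (Character.UnicodeBlock.of(c) ==
                Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS
                && i >= 65281 && i <= 65374) {
            return (char) (i - 65248);
        }
        return c;
    }

    public static void main(String[] args) {
        System.out.println(normalize('\uFF21'));  // fullwidth A prints as A
    }
}
```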




[jira] Resolved: (LUCENE-1490) CJKTokenizer convert HALFWIDTH_AND_FULLWIDTH_FORMS wrong

2009-03-18 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1490.


Resolution: Fixed

Thanks Daniel!




Re: Make TermScorer non final

2009-03-18 Thread Grant Ingersoll


On Mar 18, 2009, at 7:57 AM, Michael McCandless wrote:



Coming from the discussions in LUCENE-1522 (improving highlighter), I
think at some point we should merge Span*Query into their normal
counterparts, if possible.

Ie, there should be only one TermQuery that can do both what the
current TermQuery does, and also what SpanTermQuery does.  It's able
to enumerate the spans/payloads for a given document, and if you don't
request those, the performance should hopefully be equal to that of
the current TermQuery.

The highlighter would in fact request spans for a normal TermQuery,
on a single doc index at a time, in order to locate the hits.

Likewise for SpanOrQuery, SpanAndQuery.

I have no real sense of how much work this is, what problems would
ensue (eg possible difference in scoring, etc.), but from
highlighter's standpoint, ideally all queries need to be able to
enumerate the collection of positions that established the match.


Maybe they should all implement a common Interface that provides
highlighting info?  I don't know what it would be, but it seems easier
to do that than to merge them all, but I'm not sure.  Not that I
wouldn't want to see a simpler query system.  There are some cool
things you can do w/ spans, but they still have some fundamental flaws
that make them annoying.  Namely, often times one of the reasons you
want Spans is b/c you care about what is going on around the match,
i.e. co-occurrence data, yet it is still annoying/difficult to get
that information w/o pivoting around either term vectors or
re-analyzing the document.  With the new Attribute stuff, however, it
might be getting a little easier, as one could now store offset
information at the term level (which you can do w/ payloads, too) and
then use that to index into the original String.
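One rough shape of that "common Interface" idea, sketched below. Every name here is invented for illustration; this is not a proposal of an actual Lucene API, just what "any query type reports where it matched" might look like:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class HighlightInfoSketch {
    // Any query type that can report where it matched -- span-based or not --
    // would expose its match positions to a highlighter through one interface.
    interface MatchPositionSource {
        /** {startPosition, endPosition} pairs that established the match in doc. */
        List<int[]> matchPositions(int doc);
    }

    // Trivial stand-in implementation, e.g. what a single-term match
    // might return (each position is its own one-term "span").
    static class SingleTermMatches implements MatchPositionSource {
        private final int[] positions;
        SingleTermMatches(int... positions) { this.positions = positions; }
        public List<int[]> matchPositions(int doc) {
            return Arrays.stream(positions)
                         .mapToObj(p -> new int[]{p, p})
                         .collect(Collectors.toList());
        }
    }
}
```

A highlighter could then consume MatchPositionSource uniformly, regardless of which query produced it.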





[jira] Updated: (LUCENE-1526) Tombstone deletions in IndexReader

2009-03-18 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1526:
---

Fix Version/s: (was: 2.9)

I don't think we should block 2.9 for this.

 Tombstone deletions in IndexReader
 --

 Key: LUCENE-1526
 URL: https://issues.apache.org/jira/browse/LUCENE-1526
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.4
Reporter: Jason Rutherglen
Priority: Minor
   Original Estimate: 168h
  Remaining Estimate: 168h

 SegmentReader currently uses a BitVector to represent deleted docs.
 When performing rapid clone (see LUCENE-1314) and delete operations,
 performing a copy on write of the BitVector can become costly because
 the entire underlying byte array must be created and copied. A way to
 make this clone delete process faster is to implement tombstones, a
 term coined by Marvin Humphrey. Tombstones represent new deletions
 plus the incremental deletions from previously reopened readers in
 the current reader. 
 The proposed implementation of tombstones is to accumulate deletions
 into an int array represented as a DocIdSet. With LUCENE-1476,
 SegmentTermDocs iterates over deleted docs using a DocIdSet rather
 than accessing the BitVector by calling get. This allows a BitVector
 and a set of tombstones to be ANDed together as the current reader's
 delete docs. 
 A tombstone merge policy needs to be defined to determine when to
 merge tombstone DocIdSets into a new deleted docs BitVector as too
 many tombstones would eventually be detrimental to performance. A
 probable implementation will merge tombstones based on the number of
 tombstones and the total number of documents in the tombstones. The
 merge policy may be set in the clone/reopen methods or on the
 IndexReader. 
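The core of the idea above can be sketched in a few lines, with java.util.BitSet standing in for Lucene's BitVector (names and structure here are illustrative only, not the proposed patch):

```java
import java.util.Arrays;
import java.util.BitSet;

public class TombstoneDeletes {
    // The base deletions stay in the (expensive to copy) bit array; each
    // incremental delete generation is a small sorted int[] tombstone.
    // Cloning a reader then only copies the cheap tombstone list, and a doc
    // counts as deleted if it is in the base set or in any tombstone.
    private final BitSet base;
    private final int[][] tombstones;

    public TombstoneDeletes(BitSet base, int[][] tombstones) {
        this.base = base;
        this.tombstones = tombstones;
    }

    public boolean isDeleted(int doc) {
        if (base.get(doc)) {
            return true;
        }
        for (int[] ts : tombstones) {
            if (Arrays.binarySearch(ts, doc) >= 0) {  // each tombstone is sorted
                return true;
            }
        }
        return false;
    }
}
```

The merge policy mentioned above would then decide when the tombstone list has grown large enough to fold back into a fresh bit array.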




[jira] Updated: (LUCENE-1533) Deleted documents as a Filter or top level Query

2009-03-18 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1533:
---

Fix Version/s: (was: 2.9)

Clearing fix version.

 Deleted documents as a Filter or top level Query
 

 Key: LUCENE-1533
 URL: https://issues.apache.org/jira/browse/LUCENE-1533
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.4
Reporter: Jason Rutherglen
Priority: Minor
   Original Estimate: 504h
  Remaining Estimate: 504h

 In exploring alternative and perhaps faster ways to implement the
 deleted documents functionality, the idea of filtering the deleted
 documents at a higher level came up. This system would save on
 checking the deleted docs BitVector of each doc read from the posting
 list by SegmentTermDocs. This is equivalent to an AND NOT deleted
 docs query.
 If the patch improves the speed of indexes with delete documents,
 many core unit tests will need to change, or alternatively the
 functionality provided by this patch can be an IndexReader option.
 I'm thinking the first implementation will be a Filter in
 IndexSearcher. 
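 The "AND NOT deleted docs" formulation boils down to a merge of two sorted doc-id streams. A minimal sketch (not the proposed Filter implementation; the names are invented):

```java
import java.util.Arrays;

// Hypothetical sketch: filter a posting list against a sorted list of deleted
// doc ids with a leap-frog merge ("postings AND NOT deleted"), instead of
// calling BitVector.get() for every doc inside the posting-list loop.
class AndNotSketch {
    static int[] andNot(int[] postings, int[] deletedSorted) {
        int[] out = new int[postings.length];
        int n = 0, j = 0;
        for (int doc : postings) {
            // advance the deleted cursor until it is at or past this doc
            while (j < deletedSorted.length && deletedSorted[j] < doc) j++;
            if (j == deletedSorted.length || deletedSorted[j] != doc) out[n++] = doc;
        }
        return Arrays.copyOf(out, n);
    }
}
```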




Re: GSoC 09 project ideas...

2009-03-18 Thread Grant Ingersoll


On Mar 18, 2009, at 12:04 PM, Zaid Md. Abdul Wahab Sheikh wrote:


Hi lucene,
In this link http://wiki.apache.org/general/SummerOfCode2009, there
are no project ideas for Lucene proper (only ideas for Mahout are listed).


This requires someone (has to be a committer) willing to mentor.  I'd  
love to see a Lucene GSOC project, but I'm already mentoring on Mahout  
and don't have time for more than one.


Please put up some ideas for Lucene there or please mention some  
popular open issues that might be suitable as a GSoC project.


As for ideas, what the others said would be good; I'd also add:
design/implement the query side of the new TokenStream Attribute stuff
so that we are closer to flexible indexing.


New/updated demo would be great, one that shows off more of Lucene.

-Grant

[jira] Assigned: (LUCENE-652) Compressed fields should be externalized (from Fields into Document)

2009-03-18 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-652:
-

Assignee: Michael McCandless

 Compressed fields should be externalized (from Fields into Document)
 --

 Key: LUCENE-652
 URL: https://issues.apache.org/jira/browse/LUCENE-652
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 1.9, 2.0.0, 2.1
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9


 Right now, as of 2.0 release, Lucene supports compressed stored fields.  
 However, after discussion on java-dev, the suggestion arose, from Robert 
 Engels, that it would be better if this logic were moved into the Document 
 level.  This way the indexing level just stores opaque binary fields, and 
 then Document handles compress/uncompressing as needed.
 This approach would have prevented issues like LUCENE-629 because merging of 
 segments would never need to decompress.
 See this thread for the recent discussion:
 http://www.gossamer-threads.com/lists/lucene/java-dev/38836
 When we do this we should also work on related issue LUCENE-648.




move TrieRange* to core?

2009-03-18 Thread Michael McCandless

I think we should move TrieRange* into core before 2.9?

It's received a lot of attention, from both developers (Uwe & Yonik did
lots of iterations, and Solr is folding it in) and user interest.

It's a simpler & more scalable way to index numeric fields that you
intend to sort and/or do range querying on; we can do away with tricky
number padding.
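For readers who have not met the padding trick: term comparison is lexicographic, so numeric fields must be encoded at a fixed width before classic range queries order them correctly. A sketch (the helper name is invented):

```java
// Sketch of the zero-padding workaround TrieRange replaces: without a fixed
// width, "10" sorts before "2" in term (string) order.
class PaddingSketch {
    static String pad(long value, int width) {
        if (value < 0) throw new IllegalArgumentException("negative values need extra encoding");
        return String.format("%0" + width + "d", value); // e.g. 42 -> "0000000042"
    }
}
```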

Plus it's just plain cool :)

I also think we should change its name.  I know and love trie, but
it's a very technical term that's not immediately meaningful to users
of Lucene's API.  Plus I've learned from doing too many renamings
lately that it's best to try to get the name right at the start.

Maybe just NumberUtils, IntRangeFilter, LongRangeFilter,
AbstractNumberRangeFilter?

Thoughts?

Mike




[jira] Updated: (LUCENE-652) Compressed fields should be externalized (from Fields into Document)

2009-03-18 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-652:
--

Attachment: LUCENE-652.patch

I added o.a.l.document.CompressionTools, with static methods to
compress & decompress, and deprecated Field.Store.COMPRESS.

I also found two separate bugs:

  * With Field.Store.COMPRESS we were running compression twice
(unnecessarily); I've fixed that.

  * If you try to make a Field(byte[], int offset, int length,
Store.COMPRESS), you'll hit an AIOOBE.  I think we don't need to
fix this one since it's in now-deprecated code, and with 2.9,
users can migrate to CompressionTools.

I plan to commit in a day or two.
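The Document-level idea can be sketched with plain java.util.zip (an illustration of the concept only; CompressionTools' actual method names and signatures may differ):

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical sketch: compression handled at the Document level, while the
// index itself only ever sees opaque binary field values.
class CompressSketch {
    static byte[] compress(byte[] value) {
        Deflater deflater = new Deflater();
        deflater.setInput(value);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) out.write(buf, 0, deflater.deflate(buf));
        deflater.end();
        return out.toByteArray();
    }

    static byte[] decompress(byte[] value) {
        Inflater inflater = new Inflater();
        inflater.setInput(value);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        try {
            while (!inflater.finished()) out.write(buf, 0, inflater.inflate(buf));
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("field value is not valid deflate data", e);
        }
        inflater.end();
        return out.toByteArray();
    }
}
```

Because segment merging then just copies stored bytes, nothing needs to decompress during a merge, which is the point of the issue.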


 Compressed fields should be externalized (from Fields into Document)
 --

 Key: LUCENE-652
 URL: https://issues.apache.org/jira/browse/LUCENE-652
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 1.9, 2.0.0, 2.1
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-652.patch


 Right now, as of 2.0 release, Lucene supports compressed stored fields.  
 However, after discussion on java-dev, the suggestion arose, from Robert 
 Engels, that it would be better if this logic were moved into the Document 
 level.  This way the indexing level just stores opaque binary fields, and 
 then Document handles compress/uncompressing as needed.
 This approach would have prevented issues like LUCENE-629 because merging of 
 segments would never need to decompress.
 See this thread for the recent discussion:
 http://www.gossamer-threads.com/lists/lucene/java-dev/38836
 When we do this we should also work on related issue LUCENE-648.




Re: move TrieRange* to core?

2009-03-18 Thread Andi Vajda


On Mar 18, 2009, at 13:01, Michael McCandless  
luc...@mikemccandless.com wrote:



I think we should move TrieRange* into core before 2.9?

It's received a lot of attention, from both developers (Uwe & Yonik did
lots of iterations, and Solr is folding it in) and user interest.

It's a simpler & more scalable way to index numeric fields that you
intend to sort and/or do range querying on; we can do away with tricky
number padding.

Plus it's just plain cool :)

I also think we should change its name.  I know and love trie, but
it's a very technical term that's not immediately meaningful to users
of Lucene's API.  Plus I've learned from doing too many renamings
lately that it's best to try to get the name right at the start.

Maybe just NumberUtils, IntRangeFilter, LongRangeFilter,
AbstractNumberRangeFilter?


+1

How about NumericRangeFilter ?

Andi..




Thoughts?

Mike




[jira] Commented: (LUCENE-1496) Move solr NumberUtils to lucene

2009-03-18 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683149#action_12683149
 ] 

Michael McCandless commented on LUCENE-1496:


If we move trie/* into core, what do we need/want to fold in from Solr's 
NumberUtils?

 Move solr NumberUtils to lucene
 ---

 Key: LUCENE-1496
 URL: https://issues.apache.org/jira/browse/LUCENE-1496
 Project: Lucene - Java
  Issue Type: Task
Reporter: Ryan McKinley
Priority: Trivial
 Fix For: 2.9


 solr includes a NumberUtils class with some general utilities for dealing 
 with tokens and numbers.
 This should be in lucene rather then solr.




Re: move TrieRange* to core?

2009-03-18 Thread Earwin Burrfoot
On Wed, Mar 18, 2009 at 23:08, Andi Vajda va...@osafoundation.org wrote:

 On Mar 18, 2009, at 13:01, Michael McCandless luc...@mikemccandless.com
 wrote:

 I think we should move TrieRange* into core before 2.9?

 It's received a lot of attention, from both developers (Uwe & Yonik did
 lots of iterations, and Solr is folding it in) and user interest.

 It's a simpler & more scalable way to index numeric fields that you
 intend to sort and/or do range querying on; we can do away with tricky
 number padding.

 Plus it's just plain cool :)

 I also think we should change its name.  I know and love trie, but
 it's a very technical term that's not immediately meaningful to users
 of Lucene's API.  Plus I've learned from doing too many renamings
 lately that it's best to try to get the name right at the start.

 Maybe just NumberUtils, IntRangeFilter, LongRangeFilter,
 AbstractNumberRangeFilter?

 +1

 How about NumericRangeFilter ?
The idea behind this filter can be applied to more than just numbers,
so I'd like to put the stress on its speed or the idea used -
FastRangeQuery, TrieRangeQuery, SegmentedRangeQuery (from the fact that it
splits the input range into variable-precision segments), or PrefixRangeQuery
(you can reword the algorithm in terms of prefixes).

-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785




[jira] Assigned: (LUCENE-1435) CollationKeyFilter: convert tokens into CollationKeys encoded using IndexableBinaryStringTools

2009-03-18 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-1435:
--

Assignee: Michael McCandless

 CollationKeyFilter: convert tokens into CollationKeys encoded using 
 IndexableBinaryStringTools
 --

 Key: LUCENE-1435
 URL: https://issues.apache.org/jira/browse/LUCENE-1435
 Project: Lucene - Java
  Issue Type: New Feature
Affects Versions: 2.4
Reporter: Steven Rowe
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1435.patch, LUCENE-1435.patch, LUCENE-1435.patch


 Converts each token into its CollationKey using the provided collator, and 
 then encodes the CollationKey with IndexableBinaryStringTools, to allow it to 
 be stored as an index term.
 This will allow for efficient range searches and Sorts over fields that need 
 collation for proper ordering.
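 The motivation is visible in a few lines of plain JDK code (a sketch; the byte-comparison helper is invented, and the IndexableBinaryStringTools encoding step is omitted): a CollationKey's byte form sorts the way the locale's Collator would, even where raw String order disagrees.

```java
import java.text.Collator;
import java.util.Locale;

// Sketch: comparing collation-key bytes (as unsigned) reproduces the
// Collator's ordering, e.g. "Banana" sorts before "apple" by code point
// but after it to an English collator.
class CollationSketch {
    static int compareAsUnsignedBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    static int compareByKey(Collator c, String s1, String s2) {
        byte[] k1 = c.getCollationKey(s1).toByteArray();
        byte[] k2 = c.getCollationKey(s2).toByteArray();
        return compareAsUnsignedBytes(k1, k2);
    }
}
```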




[jira] Commented: (LUCENE-1435) CollationKeyFilter: convert tokens into CollationKeys encoded using IndexableBinaryStringTools

2009-03-18 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683155#action_12683155
 ] 

Michael McCandless commented on LUCENE-1435:


I think we should commit this to contrib/collation as an external way to get 
faster range filters on fields that require a custom Collator; at some future 
point we can consider allowing a given field to sort its terms in some custom 
way.

Marvin: does KS/Lucy give control over sort order of the terms in a field?

 CollationKeyFilter: convert tokens into CollationKeys encoded using 
 IndexableBinaryStringTools
 --

 Key: LUCENE-1435
 URL: https://issues.apache.org/jira/browse/LUCENE-1435
 Project: Lucene - Java
  Issue Type: New Feature
Affects Versions: 2.4
Reporter: Steven Rowe
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1435.patch, LUCENE-1435.patch, LUCENE-1435.patch


 Converts each token into its CollationKey using the provided collator, and 
 then encodes the CollationKey with IndexableBinaryStringTools, to allow it to 
 be stored as an index term.
 This will allow for efficient range searches and Sorts over fields that need 
 collation for proper ordering.




[jira] Assigned: (LUCENE-1434) IndexableBinaryStringTools: convert arbitrary byte sequences into Strings that can be used as index terms, and vice versa

2009-03-18 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-1434:
--

Assignee: Michael McCandless

 IndexableBinaryStringTools: convert arbitrary byte sequences into Strings 
 that can be used as index terms, and vice versa
 -

 Key: LUCENE-1434
 URL: https://issues.apache.org/jira/browse/LUCENE-1434
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Other
Affects Versions: 2.4
Reporter: Steven Rowe
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1434.patch


 Provides support for converting byte sequences to Strings that can be used as 
 index terms, and back again. The resulting Strings preserve the original byte 
 sequences' sort order (assuming the bytes are interpreted as unsigned).
 The Strings are constructed using a Base 8000h encoding of the original 
 binary data - each char of an encoded String represents a 15-bit chunk from 
 the byte sequence.  Base 8000h was chosen because it allows for all lower 15 
 bits of char to be used without restriction; the surrogate range 
 [U+D800-U+DFFF] does not represent valid chars, and would require complicated 
 handling to avoid them and allow use of char's high bit.
 This class is intended to serve as a mechanism to allow CollationKeys to 
 serve as index terms.
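 The 15-bit packing can be illustrated with a simplified sketch (not the exact encoding IndexableBinaryStringTools uses; the class name is invented):

```java
// Simplified sketch of a "Base 8000h" encoding: stream the bytes big-endian
// and emit one char per 15-bit chunk, so every output char stays in
// [U+0000, U+7FFF], safely below the surrogate range.
class Base8000Sketch {
    static String encode(byte[] data) {
        StringBuilder out = new StringBuilder();
        int buffer = 0, bits = 0;
        for (byte b : data) {
            buffer = (buffer << 8) | (b & 0xFF);
            bits += 8;
            if (bits >= 15) {
                bits -= 15;
                out.append((char) ((buffer >>> bits) & 0x7FFF));
            }
        }
        if (bits > 0)  // flush leftover bits, left-aligned in the final chunk
            out.append((char) ((buffer << (15 - bits)) & 0x7FFF));
        return out.toString();
    }
}
```

 Big-endian packing keeps String comparison consistent with unsigned byte comparison for equal-length inputs; the real class additionally makes the encoding reversible.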




RE: move TrieRange* to core?

2009-03-18 Thread Uwe Schindler
I have no problem with it! Thanks!

What I would like to be fixed before moving it to core is the fact that an
additional helper field is needed for the trie values. If everything could
be in one field and the field is still sortable, it would be fine. For that,
the order of terms in the FieldCache should be fixed. As current trie fields
of highest precision sort before all other lower precision fields, the
simplest fix would be to only index the first term from the TermEnum at
the document's index in the FieldCache.

Another way would be to just invert the order and let the higher precision
fields appear last in the TermEnum. Both would be possible, but there
should be a clear statement about which term for multi-term fields is put into
the FieldCache (maybe configurable). See LUCENE-1372 for that.

If all terms could be in one field, the API to TrieRange could be simpler
and easier on the GC. The trieCodeLong/Int() method would just
return a TokenStream that can be indexed using new
Field(Name, TokenStream), more effectively reusing the Token's char buffer
during trie encoding. This is how it is done by Solr at
the moment (but with the additional allocation of the array) - I do not like
the array allocations for each term and the whole trie encoding at the
moment (1x char[], 1x String[], additional copying, ...).

I would be happy to have it in core, I could prepare the patch, when the
above is fixed!

As for names: NumberUtils, IntRangeFilter, and LongRangeFilter are fine;
AbstractNumberRangeFilter is internal only (just to have less code
duplication, like StringBuffer and StringBuilder in the JDK, both coming from an
internal superclass invisible to the outside).

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

 -Original Message-
 From: Michael McCandless [mailto:luc...@mikemccandless.com]
 Sent: Wednesday, March 18, 2009 9:02 PM
 To: java-dev@lucene.apache.org
 Subject: move TrieRange* to core?
 
 I think we should move TrieRange* into core before 2.9?
 
  It's received a lot of attention, from both developers (Uwe & Yonik did
  lots of iterations, and Solr is folding it in) and user interest.

  It's a simpler & more scalable way to index numeric fields that you
  intend to sort and/or do range querying on; we can do away with tricky
  number padding.
 
 Plus it's just plain cool :)
 
 I also think we should change its name.  I know and love trie, but
 it's a very technical term that's not immediately meaningful to users
 of Lucene's API.  Plus I've learned from doing too many renamings
 lately that it's best to try to get the name right at the start.
 
 Maybe just NumberUtils, IntRangeFilter, LongRangeFilter,
 AbstractNumberRangeFilter?
 
 Thoughts?
 
 Mike
 



[jira] Commented: (LUCENE-1435) CollationKeyFilter: convert tokens into CollationKeys encoded using IndexableBinaryStringTools

2009-03-18 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683167#action_12683167
 ] 

Michael McCandless commented on LUCENE-1435:


Steven, I'm hitting compilation errors, e.g.:

{code}
[javac] 
/tango/mike/src/lucene.collation/contrib/collation/src/test/org/apache/lucene/collation/CollationTestBase.java:42:
 package org.apache.lucene.queryParser.analyzing does not exist
[javac] import org.apache.lucene.queryParser.analyzing.AnalyzingQueryParser;
[javac]   ^
[javac] 
/tango/mike/src/lucene.collation/contrib/collation/src/test/org/apache/lucene/collation/CollationTestBase.java:89:
 cannot find symbol
{code}

What is AnalyzingQueryParser?

 CollationKeyFilter: convert tokens into CollationKeys encoded using 
 IndexableBinaryStringTools
 --

 Key: LUCENE-1435
 URL: https://issues.apache.org/jira/browse/LUCENE-1435
 Project: Lucene - Java
  Issue Type: New Feature
Affects Versions: 2.4
Reporter: Steven Rowe
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1435.patch, LUCENE-1435.patch, LUCENE-1435.patch


 Converts each token into its CollationKey using the provided collator, and 
 then encodes the CollationKey with IndexableBinaryStringTools, to allow it to 
 be stored as an index term.
 This will allow for efficient range searches and Sorts over fields that need 
 collation for proper ordering.




[jira] Commented: (LUCENE-1434) IndexableBinaryStringTools: convert arbitrary byte sequences into Strings that can be used as index terms, and vice versa

2009-03-18 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683171#action_12683171
 ] 

Michael McCandless commented on LUCENE-1434:


This looks good.  I plan to commit shortly!

 IndexableBinaryStringTools: convert arbitrary byte sequences into Strings 
 that can be used as index terms, and vice versa
 -

 Key: LUCENE-1434
 URL: https://issues.apache.org/jira/browse/LUCENE-1434
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Other
Affects Versions: 2.4
Reporter: Steven Rowe
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1434.patch


 Provides support for converting byte sequences to Strings that can be used as 
 index terms, and back again. The resulting Strings preserve the original byte 
 sequences' sort order (assuming the bytes are interpreted as unsigned).
 The Strings are constructed using a Base 8000h encoding of the original 
 binary data - each char of an encoded String represents a 15-bit chunk from 
 the byte sequence.  Base 8000h was chosen because it allows for all lower 15 
 bits of char to be used without restriction; the surrogate range 
 [U+D800-U+DFFF] does not represent valid chars, and would require complicated 
 handling to avoid them and allow use of char's high bit.
 This class is intended to serve as a mechanism to allow CollationKeys to 
 serve as index terms.




[jira] Commented: (LUCENE-1435) CollationKeyFilter: convert tokens into CollationKeys encoded using IndexableBinaryStringTools

2009-03-18 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683174#action_12683174
 ] 

Steven Rowe commented on LUCENE-1435:
-

It's in contrib/miscellaneous/

I used AnalyzingQueryParser in the tests to allow CollationKeyFilter to be 
applied to the terms in the range query - the standard QueryParser doesn't 
analyze range terms.

From:

http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/queryParser/analyzing/AnalyzingQueryParser.html

bq. Overrides Lucene's default QueryParser so that Fuzzy-, Prefix-, Range-, and 
WildcardQuerys are also passed through the given analyzer, but wild card 
characters (like *) don't get removed from the search terms. 

This is a (test-only) cross-contrib dependency.  I'm not sure why I didn't have 
trouble with compilation - I haven't looked at this in months.  I'll take a 
look later on tonight.

 CollationKeyFilter: convert tokens into CollationKeys encoded using 
 IndexableBinaryStringTools
 --

 Key: LUCENE-1435
 URL: https://issues.apache.org/jira/browse/LUCENE-1435
 Project: Lucene - Java
  Issue Type: New Feature
Affects Versions: 2.4
Reporter: Steven Rowe
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1435.patch, LUCENE-1435.patch, LUCENE-1435.patch


 Converts each token into its CollationKey using the provided collator, and 
 then encodes the CollationKey with IndexableBinaryStringTools, to allow it to 
 be stored as an index term.
 This will allow for efficient range searches and Sorts over fields that need 
 collation for proper ordering.




File Formats Correction

2009-03-18 Thread Mark Miller

Just a note so I don't forget:

The file formats page says there are 4 files used for term vectors, but
there are only 3 that I can see: tvx, tvd, tvf.


http://lucene.apache.org/java/2_4_1/fileformats.html

--
- Mark

http://www.lucidimagination.com







[jira] Resolved: (LUCENE-1434) IndexableBinaryStringTools: convert arbitrary byte sequences into Strings that can be used as index terms, and vice versa

2009-03-18 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1434.


Resolution: Fixed

Thanks Steven!

 IndexableBinaryStringTools: convert arbitrary byte sequences into Strings 
 that can be used as index terms, and vice versa
 -

 Key: LUCENE-1434
 URL: https://issues.apache.org/jira/browse/LUCENE-1434
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Other
Affects Versions: 2.4
Reporter: Steven Rowe
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1434.patch


 Provides support for converting byte sequences to Strings that can be used as 
 index terms, and back again. The resulting Strings preserve the original byte 
 sequences' sort order (assuming the bytes are interpreted as unsigned).
 The Strings are constructed using a Base 8000h encoding of the original 
 binary data - each char of an encoded String represents a 15-bit chunk from 
 the byte sequence.  Base 8000h was chosen because it allows for all lower 15 
 bits of char to be used without restriction; the surrogate range 
 [U+D800-U+DFFF] does not represent valid chars, and would require complicated 
 handling to avoid them and allow use of char's high bit.
 This class is intended to serve as a mechanism to allow CollationKeys to 
 serve as index terms.




[jira] Commented: (LUCENE-1435) CollationKeyFilter: convert tokens into CollationKeys encoded using IndexableBinaryStringTools

2009-03-18 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683182#action_12683182
 ] 

Michael McCandless commented on LUCENE-1435:


OK, thanks for the pointer -- I learn something new every day!

 CollationKeyFilter: convert tokens into CollationKeys encoded using 
 IndexableBinaryStringTools
 --

 Key: LUCENE-1435
 URL: https://issues.apache.org/jira/browse/LUCENE-1435
 Project: Lucene - Java
  Issue Type: New Feature
Affects Versions: 2.4
Reporter: Steven Rowe
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1435.patch, LUCENE-1435.patch, LUCENE-1435.patch


 Converts each token into its CollationKey using the provided collator, and 
 then encodes the CollationKey with IndexableBinaryStringTools, to allow it to 
 be stored as an index term.
 This will allow for efficient range searches and Sorts over fields that need 
 collation for proper ordering.




Re: File Formats Correction

2009-03-18 Thread Michael McCandless


Indeed!  I'll fix on trunk.

Mike

Mark Miller wrote:


Just a note so I don't forget:

The file formats page says there are 4 files used for term vectors,
but there are only 3 that I can see: tvx, tvd, tvf.


http://lucene.apache.org/java/2_4_1/fileformats.html

--
- Mark

http://www.lucidimagination.com







RE: move TrieRange* to core?

2009-03-18 Thread Uwe Schindler
  I think we should move TrieRange* into core before 2.9?
 
  It's received a lot of attention, from both developers (Uwe & Yonik did
  lots of iterations, and Solr is folding it in) and user interest.

  It's a simpler & more scalable way to index numeric fields that you
  intend to sort and/or do range querying on; we can do away with tricky
  number padding.
 
  Plus it's just plain cool :)
 
  I also think we should change its name.  I know and love trie, but
  it's a very technical term that's not immediately meaningful to users
  of Lucene's API.  Plus I've learned from doing too many renamings
  lately that it's best to try to get the name right at the start.
 
  Maybe just NumberUtils, IntRangeFilter, LongRangeFilter,
  AbstractNumberRangeFilter?
 
  +1
 
  How about NumericRangeFilter ?
 The idea behind this filter can be applied to more than just numbers,
 so I'd like to put the stress on its speed or the idea used -
 FastRangeQuery, TrieRangeQuery, SegmentedRangeQuery (from the fact that it
 splits the input range into variable-precision segments), or PrefixRangeQuery
 (you can reword the algorithm in terms of prefixes).

A trie is also known as a prefix tree; because of that and the usage, I called it 
TrieRange [see http://en.wikipedia.org/wiki/Trie: the original term trie 
comes from retrieval. Following the etymology, the inventor, Edward Fredkin, 
pronounces it [tɹi] (tree). However, it is pronounced [tɹaɪ] (try) by other 
authors].

So we have two possibilities:

- a generic name completely hiding the internals -- but then the complexity 
of the helper field should be hidden too; how should precisionStep be called and 
justified then?
- a name describing how it works, like Earwin suggested - so we could stay with 
TrieRange.

The name TrieRangeQuery first appeared in [1], so it should be noted 
somewhere, even if it is renamed to NumberRangeFilter or something else... :-) 
I would be happy with a renaming to NumberRangeFilter, but trie should 
appear somewhere in the docs.

Uwe

[1] Schindler, U., Diepenbroek, M., 2008. Generic XML-based Framework for 
Metadata Portals. Computers & Geosciences 34 (12), 1947-1955. 
doi:10.1016/j.cageo.2008.02.023





Re: move TrieRange* to core?

2009-03-18 Thread Michael McCandless


Uwe Schindler wrote:

I would be happy with a renaming to NumberRangeFilter, but trie  
should appear somewhere in the docs.


I like this approach (and referencing the original paper); I think
it's important the javadocs give enough detail about how it works so
that one can understand the big picture and what precisionStep does,
but I think the name should more reflect how it's used rather than how
it's implemented.

I realize TrieRangeFilter can be used for anything that can be
accurately represented as an int or long in Java (e.g. Date), but I would
expect numeric sorting/range-filtering to be the vast majority of cases.

Naming is the hardest part!

Mike




Re: move TrieRange* to core?

2009-03-18 Thread Michael McCandless


Uwe Schindler wrote:


I have no problem with it! Thanks!

What I would like to be fixed before moving it to core is the fact that an 
additional helper field is needed for the trie values. If everything could 
be in one field and the field is still sortable, it would be fine. For that, 
the order of terms in the FieldCache should be fixed. As the current trie 
fields of highest precision order before all other lower-precision fields, 
the simplest fix would be to only index the first term from the TermEnum at 
the document's index in the FieldCache.

Another way would be to just invert the order and let the higher-precision 
fields appear last in the TermEnum. Both would be possible, but there 
should be a clear statement of which term for multi-term fields is put into 
the FieldCache (maybe configurable). See LUCENE-1372 for that.


Though, won't this make loading the field cache more costly since
you'll iterate through many more terms?

If all terms could be in one field, the API to TrieRange could be simpler 
and easier on the GC. The trieCodeLong/Int() method would just return a 
TokenStream that can be indexed using new Field(Name, TokenStream), more 
effectively reusing the Token's char buffer during trie encoding. This is 
how it is done by Solr at the moment (but with the additional allocation of 
the array) - I do not like the array allocations for each term and the whole 
trie encoding at the moment (1x char[], 1x String[], additional copying, ...).


I agree it'd be awesome to have a less GC costly translation
during indexing.

I would be happy to have it in core; I could prepare the patch when the 
above is fixed!


OK.

Mike




Re: move TrieRange* to core?

2009-03-18 Thread Michael McCandless
Michael McCandless luc...@mikemccandless.com wrote:

 Though, won't this make loading the field cache more costly since
 you'll iterate through many more terms?

Or... do the full precision fields always order above all lower
precision fields across all docs?

If so... maybe we could extend FieldCache's parser to allow it to
stop-early?  Ie it'd get the TermEnum, iterate through all the full
precision terms first, asking your parser to convert to long/int, and
then when your parser sees the very first not-full-precision term, it
tells FieldCache to stop.

Would that work?

Mike




RE: move TrieRange* to core?

2009-03-18 Thread Uwe Schindler
  Though, won't this make loading the field cache more costly since
  you'll iterate through many more terms?
 
 Or... do the full precision fields always order above all lower
 precision fields across all docs?

The highest precision terms have a shift value of 0. As the first char of
the encoded value is the shift, the terms are ordered by shift value first,
and so the highest precision comes first (because 0 is the smallest
shift).
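A tiny illustration of that ordering (a toy encoding, not the real TrieUtils format):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Shows why a leading shift character groups all full-precision (shift 0)
// terms ahead of every lower-precision term in the sorted TermEnum.
public class ShiftOrderingSketch {

    static String term(int shift, String payload) {
        return (char) ('0' + shift) + payload; // first char encodes the shift
    }

    static List<String> termEnumOrder(List<String> terms) {
        List<String> sorted = new ArrayList<>(terms);
        Collections.sort(sorted); // TermEnum order is lexicographic
        return sorted;
    }
}
```

Sorting any mix of encoded terms puts every shift-0 term before the first shift-4 term, which is exactly the property the stop-early idea below the quote relies on.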

 If so... maybe we could extend FieldCache's parser to allow it to
 stop-early?  Ie it'd get the TermEnum, iterate through all the full
 precision terms first, asking your parser to convert to long/int, and
 then when your parser sees the very first not-full-precision term, it
 tells FieldCache to stop.
 
 Would that work?

Yes, good idea! In this case it is really better that the higher-precision
terms come first. The question is how to implement that / extend the current
API.
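One possible shape of such a stop-early parser, sketched with hypothetical names; StopEarlyException, this IntParser, and the fill loop are stand-ins, not the existing FieldCache API:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stop-early parser for trie terms whose first char encodes
// the shift ('0' = full precision).
public class StopEarlyParserSketch {

    static class StopEarlyException extends RuntimeException {}

    interface IntParser {
        int parseInt(String term); // may throw StopEarlyException to end the scan
    }

    static final IntParser TRIE_PARSER = term -> {
        if (term.charAt(0) != '0') {
            throw new StopEarlyException(); // precision dropped: stop filling
        }
        return Integer.parseInt(term.substring(1), 16);
    };

    // Mimics the FieldCache fill loop: parse terms in TermEnum order until
    // the parser signals that only lower-precision terms remain.
    static List<Integer> fill(List<String> sortedTerms, IntParser parser) {
        List<Integer> values = new ArrayList<>();
        for (String t : sortedTerms) {
            try {
                values.add(parser.parseInt(t));
            } catch (StopEarlyException e) {
                break;
            }
        }
        return values;
    }
}
```

Because the full-precision terms sort first, the loop never touches the lower-precision terms at all, so loading the cache costs no more than for a plain numeric field.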





[jira] Commented: (LUCENE-1490) CJKTokenizer convert HALFWIDTH_AND_FULLWIDTH_FORMS wrong

2009-03-18 Thread Daniel Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683240#action_12683240
 ] 

Daniel Cheng commented on LUCENE-1490:
--

This was discovered by Chan 
http://www.cnblogs.com/jjstar/archive/2006/12/20/598016.html

 CJKTokenizer convert   HALFWIDTH_AND_FULLWIDTH_FORMS wrong
 --

 Key: LUCENE-1490
 URL: https://issues.apache.org/jira/browse/LUCENE-1490
 Project: Lucene - Java
  Issue Type: Bug
Reporter: Daniel Cheng
Assignee: Michael McCandless
 Fix For: 2.4, 2.9


 CJKTokenizer has these lines:

     if (ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS) {
         /** convert HALFWIDTH_AND_FULLWIDTH_FORMS to BASIC_LATIN */
         int i = (int) c;
         i = i - 65248;
         c = (char) i;
     }

 This is wrong. Some characters in the block (e.g. U+FF68) have no BASIC_LATIN
 counterparts.
 Only 65281-65374 can be converted this way.
 The fix is:

     int i = (int) c;
     if (ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS
             && i >= 65281 && i <= 65374) {
         /** convert HALFWIDTH_AND_FULLWIDTH_FORMS to BASIC_LATIN */
         i = i - 65248;
         c = (char) i;
     }
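A quick sanity check of the offset and the proposed bounds, as a hypothetical standalone helper (not CJKTokenizer code):

```java
// Mirrors the proposed fix: only code points in 65281-65374 (U+FF01-U+FF5E)
// map onto Basic Latin by subtracting 65248.
public class FullwidthSketch {

    static char toHalfwidth(char c) {
        int i = (int) c;
        if (i >= 65281 && i <= 65374) {
            return (char) (i - 65248); // fullwidth 'Ａ' (U+FF21) -> 'A'
        }
        return c; // e.g. U+FF68 has no Basic Latin counterpart, left as-is
    }
}
```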

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





Re: move TrieRange* to core?

2009-03-18 Thread Michael McCandless
Uwe Schindler u...@thetaphi.de wrote:

 If so... maybe we could extend FieldCache's parser to allow it to
 stop-early?  Ie it'd get the TermEnum, iterate through all the full
 precision terms first, asking your parser to convert to long/int,
 and then when your parser sees the very first not-full-precision
 term, it tells FieldCache to stop.

 Would that work?

 Yes, good idea! In this case it is really better, that the higher
 precision terms come first. The question is how to implement that /
 extend the current API.

Maybe, to also allow extensibility for LUCENE-1372, we should let a
parser optionally just do the whole loop?

Ie, you're given an IndexReader & a String field, and you return an
int[].

We could eg make an AdvancedIntParser abstract class, implementing
IntParser, and then getInts would check if the parser you passed in is
an instance of AdvancedIntParser, and would just call its getInts
method if so.

It's a bit ugly, because AdvancedIntParser would have to implement a
no-op parseInt.  But it should be back compatible.
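A rough sketch of that idea; all names are illustrative, and the real FieldCache would hand the parser an IndexReader and field name rather than the term list used here:

```java
import java.util.List;

// Sketch of the AdvancedIntParser proposal: a parser subtype that takes
// over the whole fill loop, dispatched via instanceof for back compatibility.
public class AdvancedParserSketch {

    interface IntParser {
        int parseInt(String term); // existing per-term hook
    }

    static abstract class AdvancedIntParser implements IntParser {
        public int parseInt(String term) {
            // no-op: never called, the default loop is bypassed entirely
            throw new UnsupportedOperationException();
        }
        abstract int[] getInts(List<String> terms);
    }

    static int[] getInts(List<String> terms, IntParser parser) {
        if (parser instanceof AdvancedIntParser) {
            return ((AdvancedIntParser) parser).getInts(terms); // parser drives
        }
        int[] result = new int[terms.size()];
        for (int i = 0; i < terms.size(); i++) {
            result[i] = parser.parseInt(terms.get(i)); // default per-term loop
        }
        return result;
    }

    // Example advanced parser that parses decimal terms in one pass.
    static final AdvancedIntParser DECIMAL = new AdvancedIntParser() {
        int[] getInts(List<String> terms) {
            return terms.stream().mapToInt(Integer::parseInt).toArray();
        }
    };
}
```

Existing IntParser implementations keep working unchanged, which is the back-compatibility point made above.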

Mike




[jira] Created: (LUCENE-1567) New flexible query parser

2009-03-18 Thread Luis Alves (JIRA)
New flexible query parser
-

 Key: LUCENE-1567
 URL: https://issues.apache.org/jira/browse/LUCENE-1567
 Project: Lucene - Java
  Issue Type: New Feature
  Components: QueryParser
 Environment: N/A
Reporter: Luis Alves


From the "New flexible query parser" thread by Michael Busch:

in my team at IBM we have used a different query parser than Lucene's in
our products for quite a while. Recently we spent a significant amount
of time in refactoring the code and designing a very generic
architecture, so that this query parser can be easily used for different
products with varying query syntaxes.

This work was originally driven by Andreas Neumann (who, however, left
our team); most of the code was written by Luis Alves, who has been a
bit active in Lucene in the past, and Adriano Campos, who joined our
team at IBM half a year ago. Adriano is Apache committer and PMC member
on the Tuscany project and getting familiar with Lucene now too.

We think this code is much more flexible and extensible than the current
Lucene query parser, and would therefore like to contribute it to
Lucene. I'd like to give a very brief architecture overview here,
Adriano and Luis can then answer more detailed questions as they're much
more familiar with the code than I am.
The goal was to separate the syntax and semantics of a query. E.g. 'a AND
b', '+a +b', 'AND(a,b)' could be different syntaxes for the same query.
We distinguish the semantics of the different query components, e.g.
whether and how to tokenize/lemmatize/normalize the different terms or
which Query objects to create for the terms. We wanted to be able to
write a parser with a new syntax, while reusing the underlying
semantics, as quickly as possible.
In fact, Adriano is currently working on a 100% Lucene-syntax compatible
implementation to make it easy for people who are using Lucene's query
parser to switch.

The query parser has three layers and its core is what we call the
QueryNodeTree. It is a tree that initially represents the syntax of the
original query, e.g. for 'a AND b':
  AND
 /   \
A B

The three layers are:
1. QueryParser
2. QueryNodeProcessor
3. QueryBuilder

1. The upper layer is the parsing layer which simply transforms the
query text string into a QueryNodeTree. Currently our implementations of
this layer use javacc.
2. The query node processors do most of the work. It is in fact a
configurable chain of processors. Each processor can walk the tree and
modify nodes or even the tree's structure. That makes it possible to
e.g. do query optimization before the query is executed or to tokenize
terms.
3. The third layer is also a configurable chain of builders, which
transform the QueryNodeTree into Lucene Query objects.
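[A toy caricature of those three layers, for readers skimming the architecture; Strings stand in for the real QueryNodeTree and Query types, and every name below is illustrative only:]

```java
import java.util.List;

// Layer 1 parses text to a tree, layer 2 runs a configurable processor
// chain over it, layer 3 builds the final Query from the processed tree.
public class QueryPipelineSketch {

    interface QueryNodeProcessor {
        String process(String tree); // may rewrite nodes or the whole tree
    }

    // Layer 1: query text -> syntax tree (the real code uses javacc here).
    static String parse(String queryText) {
        return "TREE(" + queryText + ")";
    }

    // Layer 2: run the processor chain, e.g. optimization or tokenization.
    static String runProcessors(String tree, List<QueryNodeProcessor> chain) {
        for (QueryNodeProcessor p : chain) {
            tree = p.process(tree);
        }
        return tree;
    }

    // Layer 3: processed tree -> Lucene Query object (a String here).
    static String build(String tree) {
        return "Query[" + tree + "]";
    }
}
```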

Furthermore, the query parser uses flexible configuration objects, which
are based on AttributeSource/Attribute. It also uses message classes that
allow attaching resource bundles. This makes it possible to translate
messages, which is an important feature of a query parser.

This design allows us to develop different query syntaxes very quickly.
Adriano wrote the Lucene-compatible syntax in a matter of hours, and the
underlying processors and builders in a few days. We now have a 100%
compatible Lucene query parser, which means the syntax is identical and
all query parser test cases pass on the new one too using a wrapper.


Recent posts show that there is demand for query syntax improvements,
e.g. improved range query syntax or operator precedence. There are
already different QP implementations in Lucene+contrib, however I think
we did not keep them all up to date and in sync. This is not too
surprising, because usually when fixes and changes are made to the main
query parser, people don't make the corresponding changes in the contrib
parsers. (I'm guilty here too)
With this new architecture it will be much easier to maintain different
query syntaxes, as the actual code for the first layer is not very much.
All syntaxes would benefit from patches and improvements we make to the
underlying layers, which will make supporting different syntaxes much
more manageable.







[jira] Commented: (LUCENE-1567) New flexible query parser

2009-03-18 Thread Luis Alves (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683308#action_12683308
 ] 

Luis Alves commented on LUCENE-1567:


Should the Flexible Query Parser patch be committed to the trunk
as a replacement for the old query parser? 

The current implementation uses Java 1.5 syntax.
Is that OK if we commit it to the trunk?



 New flexible query parser
 -

 Key: LUCENE-1567
 URL: https://issues.apache.org/jira/browse/LUCENE-1567
 Project: Lucene - Java
  Issue Type: New Feature
  Components: QueryParser
 Environment: N/A
Reporter: Luis Alves





[jira] Commented: (LUCENE-1567) New flexible query parser

2009-03-18 Thread Adriano Crestani (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683313#action_12683313
 ] 

Adriano Crestani commented on LUCENE-1567:
--

It's probably not OK, since the Lucene build script will probably fail because of 
that. We are working on a patch which we will upload to this JIRA soon; it will 
only be for the community to review the new query parser code, not to be 
committed against the trunk. I think somebody could create a sandbox and commit 
the code; it would be easier for others to review the new query parser.

I think the right question is whether we should include this new parser in 
release 2.9; if yes, then we definitely need to change the code to be Java 1.4 
compatible. Anyway, before taking this decision, the code must be available to 
the community : )

Best Regards,

 New flexible query parser
 -

 Key: LUCENE-1567
 URL: https://issues.apache.org/jira/browse/LUCENE-1567
 Project: Lucene - Java
  Issue Type: New Feature
  Components: QueryParser
 Environment: N/A
Reporter: Luis Alves
