Re: New Token API was Re: Payloads and TrieRangeQuery
On Jun 15, 2009, at 2:11 PM, Grant Ingersoll wrote:

More questions:

1. What about Highlighter and MoreLikeThis? They have not been converted. Also, what are they going to do if the attributes they need are not available? Caveat emptor?

2. Same for TermVectors. What if the user specifies with positions and offsets, but the analyzer doesn't produce them? Caveat emptor? (BTW, this is also true for the new omit-TF stuff.)

3. Also, what about the case where one might have attributes that are meant for downstream TokenFilters, but not necessarily for indexing? Offsets and type come to mind. Is it the case now that those attributes are not automatically added to the index? If they are ignored now, what if I want to add them? I admit, I'm having a hard time finding the code that specifically loops over the Attributes. I recall seeing it, but can no longer find it.

Also, can we add something like an AttributeTermQuery? It seems like it could work similarly to BoostingTermQuery.

So, I think I see #1 covered; how about #2, #3, and the notion of an AttributeTermQuery? Anyone have thoughts on those? I might have some time next week to work up a Query, as it sounds like fun, but don't hold me to it just yet.

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
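Grant's "caveat emptor" question — what a consumer like Highlighter or MoreLikeThis should do when an attribute it needs is absent — suggests a probe-and-degrade pattern. The sketch below illustrates that pattern with simplified stand-in classes; `AttributeSource`, `TermAttribute`, and `OffsetAttribute` here are minimal mock-ups written for this example, not the real Lucene types:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-ins for the classes under discussion -- NOT the real
// Lucene types, just an illustration of the probe-and-degrade pattern.
class AttributeSource {
    private final Map<Class<?>, Object> attributes = new HashMap<>();

    // get-or-create: guarantees one shared instance per attribute type
    @SuppressWarnings("unchecked")
    public <T> T addAttribute(Class<T> clazz) {
        return (T) attributes.computeIfAbsent(clazz, c -> {
            try {
                return c.getDeclaredConstructor().newInstance();
            } catch (Exception e) {
                throw new IllegalArgumentException(e);
            }
        });
    }

    public boolean hasAttribute(Class<?> clazz) {
        return attributes.containsKey(clazz);
    }
}

class TermAttribute { String term; }
class OffsetAttribute { int start, end; }

public class AttributeCheckDemo {
    // A consumer such as a highlighter could probe for optional attributes
    // and degrade gracefully instead of failing ("caveat emptor").
    static String describe(AttributeSource stream) {
        stream.addAttribute(TermAttribute.class); // always needed
        return stream.hasAttribute(OffsetAttribute.class)
                ? "term+offsets available"
                : "term only, no offsets";
    }

    public static void main(String[] args) {
        AttributeSource stream = new AttributeSource();
        System.out.println(describe(stream)); // offsets not registered yet
        stream.addAttribute(OffsetAttribute.class);
        System.out.println(describe(stream));
    }
}
```

Whether the real consumers should silently degrade like this, or throw, is exactly the open question in the thread.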
Re: New Token API was Re: Payloads and TrieRangeQuery
On 6/15/09 10:10 AM, Grant Ingersoll wrote:
> But, as Michael M reminded me, it is complex, so please accept my apologies.

No worries, Grant! I was not really offended, but rather confused... Thanks for clarifying.

Michael
Re: New Token API was Re: Payloads and TrieRangeQuery
Grant Ingersoll wrote:
> 1. What about Highlighter

I would guess Highlighter has not been updated because it's kind of a royal * :)

--
- Mark
http://www.lucidimagination.com
Re: New Token API was Re: Payloads and TrieRangeQuery
Mark Miller wrote:
> Grant Ingersoll wrote:
>> On Jun 14, 2009, at 8:05 PM, Michael Busch wrote:
>>> I'd be happy to discuss other API proposals that anybody brings up here, that have the same advantages and are more intuitive. We could also beef up the documentation and give a better example about how to convert a stream/filter from the old to the new API; a constructive suggestion that Uwe made at the ApacheCon.
>>
>> More questions:
>>
>> 1. What about Highlighter and MoreLikeThis? They have not been converted. Also, what are they going to do if the attributes they need are not available? Caveat emptor?
>>
>> 2. Same for TermVectors. What if the user specifies with positions and offsets, but the analyzer doesn't produce them? Caveat emptor? (BTW, this is also true for the new omit TF stuff)
>>
>> 3. Also, what about the case where one might have attributes that are meant for downstream TokenFilters, but not necessarily for indexing? Offsets and type come to mind. Is it the case now that those attributes are not automatically added to the index? If they are ignored now, what if I want to add them? I admit, I'm having a hard time finding the code that specifically loops over the Attributes. I recall seeing it, but can no longer find it.
>>
>> Also, can we add something like an AttributeTermQuery? Seems like it could work similar to the BoostingTermQuery.
>>
>> I'm sure more will come to me.
>>
>> -Grant
>
> If you are using a CachingTokenFilter, and you do something like pass it to something that hasn't upgraded to the new API (say MemoryIndex#addField(String fieldName, TokenStream stream, float boost)) and you are trying to use the new API, you will get an exception when trying to read the tokens from the CachingTokenFilter a second time - obviously because the old API is cached rather than the new, and when you try and use the new, kak :( . We can obviously fix anything internal, but not external.

Hmm - actually, even if we fix internal, if you are trying to use the old API, you will have the same issue in reverse ;)

--
- Mark
http://www.lucidimagination.com
Re: New Token API was Re: Payloads and TrieRangeQuery
Grant Ingersoll wrote:
> On Jun 14, 2009, at 8:05 PM, Michael Busch wrote:
>> I'd be happy to discuss other API proposals that anybody brings up here, that have the same advantages and are more intuitive. We could also beef up the documentation and give a better example about how to convert a stream/filter from the old to the new API; a constructive suggestion that Uwe made at the ApacheCon.
>
> More questions:
>
> 1. What about Highlighter and MoreLikeThis? They have not been converted. Also, what are they going to do if the attributes they need are not available? Caveat emptor?
>
> 2. Same for TermVectors. What if the user specifies with positions and offsets, but the analyzer doesn't produce them? Caveat emptor? (BTW, this is also true for the new omit TF stuff)
>
> 3. Also, what about the case where one might have attributes that are meant for downstream TokenFilters, but not necessarily for indexing? Offsets and type come to mind. Is it the case now that those attributes are not automatically added to the index? If they are ignored now, what if I want to add them? I admit, I'm having a hard time finding the code that specifically loops over the Attributes. I recall seeing it, but can no longer find it.
>
> Also, can we add something like an AttributeTermQuery? Seems like it could work similar to the BoostingTermQuery.
>
> I'm sure more will come to me.
>
> -Grant

If you are using a CachingTokenFilter, and you do something like pass it to something that hasn't upgraded to the new API (say MemoryIndex#addField(String fieldName, TokenStream stream, float boost)) and you are trying to use the new API, you will get an exception when trying to read the tokens from the CachingTokenFilter a second time - obviously because the old API is cached rather than the new, and when you try and use the new, kak :( . We can obviously fix anything internal, but not external.

--
- Mark
http://www.lucidimagination.com
Re: New Token API was Re: Payloads and TrieRangeQuery
Sounds promising, but I have to think about whether this change has side-effects other than a slowdown for people who create multiple tokens (which would be acceptable, as you said, because that's not recommended anyway and should be rare).

On 6/15/09 1:46 PM, Uwe Schindler wrote:
> Maybe change the deprecation wrapper around next() and next(Token) [the default impl of incrementToken()] to check if the retrieved token is not identical to the attribute, and then just copy the contents to the instance Token? This would be a slowdown, but it would only be the case for the very rare TokenStreams that did not reuse tokens before (and were slow before, too).
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
> From: Michael Busch [mailto:busch...@gmail.com]
> Sent: Monday, June 15, 2009 10:39 PM
> To: java-dev@lucene.apache.org
> Subject: Re: New Token API was Re: Payloads and TrieRangeQuery
>
> I have implemented most of that actually (the interface part and Token implementing all of them).
>
> The problem is a paradigm change with the new API: the assumption is that there is always only one single instance of an Attribute. With the old API, it is recommended to reuse the passed-in token, but you don't have to; you can also return a new one with every call of next(). Now with this change the indexer classes should only know about the interfaces; they shouldn't know Token anymore, which seems fine when Token implements all those interfaces. BUT, since there can be more than one instance of Token, the indexer would have to call getAttribute() for all Attributes it needs after each call of next(). I haven't measured how expensive that is, but it seems like a severe performance hit. That's basically the main reason why the backwards compatibility is ensured in such a goofy way right now.
>
> Michael
>
> On 6/15/09 1:28 PM, Uwe Schindler wrote:
>> And I don't like the *useNewAPI*() methods either. I spent a lot of time thinking about backwards compatibility for this API. It's tricky to do without sacrificing performance. In API patches I find myself spending more time for backwards-compatibility than for the actual new feature! :( I'll try to think about how to simplify this confusing old/new API mix.
>>
>> One solution to fix this useNewAPI problem would be to change the AttributeSource in a way that it returns classes that implement interfaces (as you proposed some weeks ago). The good old Token class then does not need to be deprecated; it could simply implement all 4 interfaces. AttributeSource then must implement a registry of which classes implement which interfaces. So if somebody wants a TermAttribute, he always gets the Token. In all other cases, the default could be a *Impl default class. In this case, next() could simply return this Token (which is all 4 attributes). The reuseableToken is simply thrown away in the deprecated API; the reuseable Token comes from the AttributeSource and is per-instance. Is this an idea?
>>
>> Uwe
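Uwe's registry idea — attributes as interfaces, with Token implementing all of them so that a request for any of its attributes can hand back the one shared Token instance — could look roughly like the sketch below. All names here (`RegistryAttributeSource`, `TermAttr`, `OffsetAttr`) are hypothetical stand-ins for illustration, not the real Lucene API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch of the registry proposal: attributes are interfaces,
// a registry maps each interface to a factory for its default impl, and
// Token implements several interfaces at once so requesting any of them
// can return the same shared Token. NOT the real Lucene implementation.
interface TermAttr { String getTerm(); void setTerm(String t); }
interface OffsetAttr { int getStart(); void setStart(int s); }

class Token implements TermAttr, OffsetAttr {
    private String term = "";
    private int start;
    public String getTerm() { return term; }
    public void setTerm(String t) { term = t; }
    public int getStart() { return start; }
    public void setStart(int s) { start = s; }
}

class RegistryAttributeSource {
    private final Map<Class<?>, Supplier<?>> registry = new HashMap<>();
    private final Map<Class<?>, Object> instances = new HashMap<>();

    public <T> void register(Class<T> iface, Supplier<? extends T> factory) {
        registry.put(iface, factory);
    }

    // Per-source instance: created once via the registered factory,
    // then always the same object is returned for that interface.
    @SuppressWarnings("unchecked")
    public <T> T addAttribute(Class<T> iface) {
        return (T) instances.computeIfAbsent(iface, c -> registry.get(c).get());
    }
}

public class RegistryDemo {
    public static void main(String[] args) {
        RegistryAttributeSource src = new RegistryAttributeSource();
        // Route both attribute interfaces to one shared Token.
        Token shared = new Token();
        src.register(TermAttr.class, () -> shared);
        src.register(OffsetAttr.class, () -> shared);

        TermAttr term = src.addAttribute(TermAttr.class);
        OffsetAttr off = src.addAttribute(OffsetAttr.class);
        term.setTerm("lucene");
        // Same underlying instance: the offset view sees the term too.
        System.out.println(term == off);              // true
        System.out.println(((Token) off).getTerm());  // lucene
    }
}
```

This is the property Uwe is after: a deprecated next() can keep returning the Token, while new-style consumers only ever see the attribute interfaces.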
Re: New Token API was Re: Payloads and TrieRangeQuery
yeah, about 5 seconds in I saw that and decided to stick with what I know :)

On Mon, Jun 15, 2009 at 5:10 PM, Mark Miller wrote:
> I may do the Highlighter. It's annoying though - I'll have to break back compat because Token is part of the public API (Fragmenter, etc).
>
> Robert Muir wrote:
>> Michael OK, I plan on adding some tests for the analyzers that don't have any.
>>
>> I didn't try to migrate things such as highlighter, which are definitely just as important, only because I'm not familiar with that territory.
>>
>> But I think I can figure out what the various language analyzers are trying to do and add tests / convert the remaining ones.
>>
>> On Mon, Jun 15, 2009 at 4:42 PM, Michael Busch wrote:
>>> I agree. It's my fault, the task of changing the contribs (LUCENE-1460) is assigned to me for a while now - I just haven't found the time to do it yet.
>>>
>>> It's great that you started the work on that! I'll try to review the patch in the next couple of days and help with fixing the remaining ones. I'd like to get the AttributeSource improvements patch out first. I'll try that tonight.
>>>
>>> Michael
>>>
>>> On 6/15/09 1:35 PM, Robert Muir wrote:
>>> Michael, again I am terrible with such things myself...
>>>
>>> Personally I am impressed that you have the back compat; even if you don't change any code at all, I think some reformatting of javadocs might make the situation a lot friendlier. I just listed everything that came to my mind immediately.
>>>
>>> I guess I will also mention that one of the reasons I might not use the new API is that since all filters, etc. on the same chain must use the same API, it's discouraging if all the contrib stuff doesn't support the new API; it makes me want to just stick with the old so everything will work. So I think contribs being on the new API is really important, otherwise no one will want to use it.
>>>
>>> On Mon, Jun 15, 2009 at 4:21 PM, Michael Busch wrote:
>>>
>>> This is excellent feedback, Robert!
>>>
>>> I agree this is confusing; especially having a deprecated API and only an experimental one that replaces the old one. We need to change that. And I don't like the *useNewAPI*() methods either. I spent a lot of time thinking about backwards compatibility for this API. It's tricky to do without sacrificing performance. In API patches I find myself spending more time for backwards-compatibility than for the actual new feature! :(
>>>
>>> I'll try to think about how to simplify this confusing old/new API mix.
>>>
>>> However, we need to make the decisions:
>>> a) if we want to release this new API with 2.9,
>>> b) if yes to a), if we want to remove the old API in 3.0?
>>>
>>> If yes to a) and no to b), then we'll have to support both APIs for a presumably very long time, so we then need to have a better solution for the backwards-compatibility here.
>>>
>>> -Michael
>>>
>>> On 6/15/09 1:09 PM, Robert Muir wrote:
>>>
>>> let me try some slightly more constructive feedback:
>>>
>>> new user looks at TokenStream javadocs: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/org/apache/lucene/analysis/TokenStream.html
>>>
>>> immediately they see deprecated, text in red with the words "experimental", warnings in bold - the whole thing is scary! due to the use of 'e.g.' the javadoc for .incrementToken() is cut off in a bad way, and it's probably the most important method to a new user! there's also a stray bold tag gone haywire somewhere, possibly .incrementToken().
>>>
>>> from a technical perspective, the documentation is excellent! but for a new user unfamiliar with lucene, it's unclear exactly what steps to take: use the scary red experimental api or the old deprecated one?
>>>
>>> there's also some fairly advanced stuff such as .captureState and .restoreState that might be better in a different place.
>>>
>>> finally, the .setUseNewAPI() and .setUseNewAPIDefault() are confusing [one is static, one is not], especially because it states all streams and filters in one chain must use the same API - is there a way to simplify this?
>>>
>>> i'm really terrible with javadocs myself, but perhaps we can come up with a way to improve the presentation... maybe that will make the difference.
>>>
>>> On Mon, Jun 15, 2009 at 3:45 PM, Robert Muir wrote:
>>>
>>> Mark, I'll see if I can get tests produced for some of those analyzers.
>>>
>>> as a new user of the new api myself, I think I can safely say the most confusing thing about it is having the old deprecated API mixed in the javadocs with it :)
>>>
>>> On Mon, Jun 15, 2009 at 2:53 PM, Mark Miller wrote:
>>>
>>> Robert Muir wrote:
>>>
>>> Mark, I created an issue for this.
>>>
>>> Thanks Robert, great idea.
>>>
>>> I just think you know, converting an analyzer to the new api is really not that bad.
Some SVN cleanup, was: New Token API was Re: Payloads and TrieRangeQuery
Done, tests pass.

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Michael McCandless [mailto:luc...@mikemccandless.com]
> Sent: Monday, June 15, 2009 10:40 PM
> To: java-dev@lucene.apache.org
> Subject: Re: New Token API was Re: Payloads and TrieRangeQuery
>
> On Mon, Jun 15, 2009 at 4:21 PM, Uwe Schindler wrote:
>> And, in tests: test/o/a/l/index/store is somehow wrong placed. The class inside should be in test/o/a/l/store. Should I move?
>
> Please do!
>
> Mike
Re: New Token API was Re: Payloads and TrieRangeQuery
I may do the Highlighter. It's annoying though - I'll have to break back compat because Token is part of the public API (Fragmenter, etc).

Robert Muir wrote:

Michael OK, I plan on adding some tests for the analyzers that don't have any.

I didn't try to migrate things such as highlighter, which are definitely just as important, only because I'm not familiar with that territory.

But I think I can figure out what the various language analyzers are trying to do and add tests / convert the remaining ones.

On Mon, Jun 15, 2009 at 4:42 PM, Michael Busch wrote:

I agree. It's my fault, the task of changing the contribs (LUCENE-1460) is assigned to me for a while now - I just haven't found the time to do it yet.

It's great that you started the work on that! I'll try to review the patch in the next couple of days and help with fixing the remaining ones. I'd like to get the AttributeSource improvements patch out first. I'll try that tonight.

Michael

On 6/15/09 1:35 PM, Robert Muir wrote:

Michael, again I am terrible with such things myself...

Personally I am impressed that you have the back compat; even if you don't change any code at all, I think some reformatting of javadocs might make the situation a lot friendlier. I just listed everything that came to my mind immediately.

I guess I will also mention that one of the reasons I might not use the new API is that since all filters, etc. on the same chain must use the same API, it's discouraging if all the contrib stuff doesn't support the new API; it makes me want to just stick with the old so everything will work. So I think contribs being on the new API is really important, otherwise no one will want to use it.

On Mon, Jun 15, 2009 at 4:21 PM, Michael Busch wrote:

This is excellent feedback, Robert!

I agree this is confusing; especially having a deprecated API and only an experimental one that replaces the old one. We need to change that. And I don't like the *useNewAPI*() methods either. I spent a lot of time thinking about backwards compatibility for this API. It's tricky to do without sacrificing performance. In API patches I find myself spending more time for backwards-compatibility than for the actual new feature! :(

I'll try to think about how to simplify this confusing old/new API mix.

However, we need to make the decisions:
a) if we want to release this new API with 2.9,
b) if yes to a), if we want to remove the old API in 3.0?

If yes to a) and no to b), then we'll have to support both APIs for a presumably very long time, so we then need to have a better solution for the backwards-compatibility here.

-Michael

On 6/15/09 1:09 PM, Robert Muir wrote:

let me try some slightly more constructive feedback:

new user looks at TokenStream javadocs: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/org/apache/lucene/analysis/TokenStream.html

immediately they see deprecated, text in red with the words "experimental", warnings in bold - the whole thing is scary! due to the use of 'e.g.' the javadoc for .incrementToken() is cut off in a bad way, and it's probably the most important method to a new user! there's also a stray bold tag gone haywire somewhere, possibly .incrementToken().

from a technical perspective, the documentation is excellent! but for a new user unfamiliar with lucene, it's unclear exactly what steps to take: use the scary red experimental api or the old deprecated one?

there's also some fairly advanced stuff such as .captureState and .restoreState that might be better in a different place.

finally, the .setUseNewAPI() and .setUseNewAPIDefault() are confusing [one is static, one is not], especially because it states all streams and filters in one chain must use the same API - is there a way to simplify this?

i'm really terrible with javadocs myself, but perhaps we can come up with a way to improve the presentation... maybe that will make the difference.

On Mon, Jun 15, 2009 at 3:45 PM, Robert Muir wrote:

Mark, I'll see if I can get tests produced for some of those analyzers.

as a new user of the new api myself, I think I can safely say the most confusing thing about it is having the old deprecated API mixed in the javadocs with it :)

On Mon, Jun 15, 2009 at 2:53 PM, Mark Miller wrote:

Robert Muir wrote:

Mark, I created an issue for this.

Thanks Robert, great idea.

I just think you know, converting an analyzer to the new api is really not that bad.

I don't either. I'm really just complaining about the initial readability. Once you know what's up, it's not too much different. I just have found myself having to refigure out what's up (a short task to be sure) over again after I leave it for a while. With the old one, everything was just kind of immediately self evident. That makes me think new users might be a little more confused when they first meet again. I'm not a new user though, so it's only a guess really.

reverse engineering what one of them does is not necessarily obvious, and is completely unrelated but necessary if they are to be migrated.
RE: New Token API was Re: Payloads and TrieRangeQuery
Maybe change the deprecation wrapper around next() and next(Token) [the default impl of incrementToken()] to check if the retrieved token is not identical to the attribute, and then just copy the contents to the instance Token? This would be a slowdown, but it would only be the case for the very rare TokenStreams that did not reuse tokens before (and were slow before, too).

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

From: Michael Busch [mailto:busch...@gmail.com]
Sent: Monday, June 15, 2009 10:39 PM
To: java-dev@lucene.apache.org
Subject: Re: New Token API was Re: Payloads and TrieRangeQuery

I have implemented most of that actually (the interface part and Token implementing all of them).

The problem is a paradigm change with the new API: the assumption is that there is always only one single instance of an Attribute. With the old API, it is recommended to reuse the passed-in token, but you don't have to; you can also return a new one with every call of next(). Now with this change the indexer classes should only know about the interfaces; they shouldn't know Token anymore, which seems fine when Token implements all those interfaces. BUT, since there can be more than one instance of Token, the indexer would have to call getAttribute() for all Attributes it needs after each call of next(). I haven't measured how expensive that is, but it seems like a severe performance hit. That's basically the main reason why the backwards compatibility is ensured in such a goofy way right now.

Michael

On 6/15/09 1:28 PM, Uwe Schindler wrote:
> And I don't like the *useNewAPI*() methods either. I spent a lot of time thinking about backwards compatibility for this API. It's tricky to do without sacrificing performance. In API patches I find myself spending more time for backwards-compatibility than for the actual new feature! :( I'll try to think about how to simplify this confusing old/new API mix.
>
> One solution to fix this useNewAPI problem would be to change the AttributeSource in a way that it returns classes that implement interfaces (as you proposed some weeks ago). The good old Token class then does not need to be deprecated; it could simply implement all 4 interfaces. AttributeSource then must implement a registry of which classes implement which interfaces. So if somebody wants a TermAttribute, he always gets the Token. In all other cases, the default could be a *Impl default class. In this case, next() could simply return this Token (which is all 4 attributes). The reuseableToken is simply thrown away in the deprecated API; the reuseable Token comes from the AttributeSource and is per-instance. Is this an idea?
>
> Uwe
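Uwe's proposed bridge — have the default incrementToken() call the deprecated next(Token) and copy the result into the per-stream instance Token only when the stream returned a different object — might be sketched as below. These are simplified stand-in classes written for illustration, not the actual Lucene code:

```java
// Sketch of the proposed deprecation bridge: incrementToken() delegates to
// the old next(Token); if an old-style stream allocated its own Token
// instead of reusing the one passed in, its contents are copied into the
// single per-stream instance so attribute consumers keep seeing one object.
// Simplified stand-ins, NOT the real Lucene classes.
class Token {
    String term; int start; int end;
    void copyFrom(Token other) {
        term = other.term; start = other.start; end = other.end;
    }
}

abstract class OldStyleStream {
    // The single shared instance the new API would expose as attributes.
    final Token instanceToken = new Token();

    /** @deprecated old API: may return the reusable token or a new one. */
    @Deprecated
    abstract Token next(Token reusable);

    // Default new-API implementation bridging to the old one.
    boolean incrementToken() {
        Token retrieved = next(instanceToken);
        if (retrieved == null) return false;       // end of stream
        if (retrieved != instanceToken) {
            // Slow path (the rare, already-slow case Uwe mentions):
            // the stream ignored the reusable token, so copy it over.
            instanceToken.copyFrom(retrieved);
        }
        return true;
    }
}

public class BridgeDemo {
    public static void main(String[] args) {
        // An old-style stream that ignores the reusable token.
        OldStyleStream s = new OldStyleStream() {
            private int emitted = 0;
            Token next(Token reusable) {
                if (emitted++ > 0) return null;
                Token fresh = new Token();
                fresh.term = "payload"; fresh.start = 0; fresh.end = 7;
                return fresh;
            }
        };
        while (s.incrementToken()) {
            System.out.println(s.instanceToken.term);  // payload
        }
    }
}
```

Streams that already reuse the passed-in token take the fast path (no copy), which is why the slowdown only hits the non-reusing streams.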
Re: New Token API was Re: Payloads and TrieRangeQuery
Michael OK, I plan on adding some tests for the analyzers that don't have any.

I didn't try to migrate things such as highlighter, which are definitely just as important, only because I'm not familiar with that territory.

But I think I can figure out what the various language analyzers are trying to do and add tests / convert the remaining ones.

On Mon, Jun 15, 2009 at 4:42 PM, Michael Busch wrote:
> I agree. It's my fault, the task of changing the contribs (LUCENE-1460) is assigned to me for a while now - I just haven't found the time to do it yet.
>
> It's great that you started the work on that! I'll try to review the patch in the next couple of days and help with fixing the remaining ones. I'd like to get the AttributeSource improvements patch out first. I'll try that tonight.
>
> Michael
>
> On 6/15/09 1:35 PM, Robert Muir wrote:
>
> Michael, again I am terrible with such things myself...
>
> Personally I am impressed that you have the back compat; even if you don't change any code at all, I think some reformatting of javadocs might make the situation a lot friendlier. I just listed everything that came to my mind immediately.
>
> I guess I will also mention that one of the reasons I might not use the new API is that since all filters, etc. on the same chain must use the same API, it's discouraging if all the contrib stuff doesn't support the new API; it makes me want to just stick with the old so everything will work. So I think contribs being on the new API is really important, otherwise no one will want to use it.
>
> On Mon, Jun 15, 2009 at 4:21 PM, Michael Busch wrote:
>
> This is excellent feedback, Robert!
>
> I agree this is confusing; especially having a deprecated API and only an experimental one that replaces the old one. We need to change that. And I don't like the *useNewAPI*() methods either. I spent a lot of time thinking about backwards compatibility for this API. It's tricky to do without sacrificing performance. In API patches I find myself spending more time for backwards-compatibility than for the actual new feature! :(
>
> I'll try to think about how to simplify this confusing old/new API mix.
>
> However, we need to make the decisions:
> a) if we want to release this new API with 2.9,
> b) if yes to a), if we want to remove the old API in 3.0?
>
> If yes to a) and no to b), then we'll have to support both APIs for a presumably very long time, so we then need to have a better solution for the backwards-compatibility here.
>
> -Michael
>
> On 6/15/09 1:09 PM, Robert Muir wrote:
>
> let me try some slightly more constructive feedback:
>
> new user looks at TokenStream javadocs: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/org/apache/lucene/analysis/TokenStream.html
>
> immediately they see deprecated, text in red with the words "experimental", warnings in bold - the whole thing is scary! due to the use of 'e.g.' the javadoc for .incrementToken() is cut off in a bad way, and it's probably the most important method to a new user! there's also a stray bold tag gone haywire somewhere, possibly .incrementToken().
>
> from a technical perspective, the documentation is excellent! but for a new user unfamiliar with lucene, it's unclear exactly what steps to take: use the scary red experimental api or the old deprecated one?
>
> there's also some fairly advanced stuff such as .captureState and .restoreState that might be better in a different place.
>
> finally, the .setUseNewAPI() and .setUseNewAPIDefault() are confusing [one is static, one is not], especially because it states all streams and filters in one chain must use the same API - is there a way to simplify this?
>
> i'm really terrible with javadocs myself, but perhaps we can come up with a way to improve the presentation... maybe that will make the difference.
>
> On Mon, Jun 15, 2009 at 3:45 PM, Robert Muir wrote:
>
> Mark, I'll see if I can get tests produced for some of those analyzers.
>
> as a new user of the new api myself, I think I can safely say the most confusing thing about it is having the old deprecated API mixed in the javadocs with it :)
>
> On Mon, Jun 15, 2009 at 2:53 PM, Mark Miller wrote:
>
> Robert Muir wrote:
>
> Mark, I created an issue for this.
>
> Thanks Robert, great idea.
>
> I just think you know, converting an analyzer to the new api is really not that bad.
>
> I don't either. I'm really just complaining about the initial readability. Once you know what's up, it's not too much different. I just have found myself having to refigure out what's up (a short task to be sure) over again after I leave it for a while. With the old one, everything was just kind of immediately self evident.
>
> That makes me think new users might be a little more confused when they first meet again. I'm not a new user though, so it's only a guess really.
>
> reverse engineering what one of them does is not necessarily obvious, and is completely unrelated but necessary if they are to be migrated.
Re: New Token API was Re: Payloads and TrieRangeQuery
I agree. It's my fault, the task of changing the contribs (LUCENE-1460) is assigned to me for a while now - I just haven't found the time to do it yet. It's great that you started the work on that! I'll try to review the patch in the next couple of days and help with fixing the remaining ones. I'd like to get the AttributeSource improvements patch out first. I'll try that tonight. Michael On 6/15/09 1:35 PM, Robert Muir wrote: Michael, again I am terrible with such things myself... Personally I am impressed that you have the back compat, even if you don't change any code at all I think some reformatting of javadocs might make the situation a lot friendlier. I just listed everything that came to my mind immediately. I guess I will also mention that one of the reasons I might not use the new API is that since all filters, etc on the same chain must use the same API, its discouraging if all the contrib stuff doesn't support the new API, it makes me want to just stick with the old so everything will work. So I think contribs being on the new API is really important otherwise no one will want to use it. On Mon, Jun 15, 2009 at 4:21 PM, Michael Busch wrote: This is excellent feedback, Robert! I agree this is confusing; especially having a deprecated API and only a experimental one that replaces the old one. We need to change that. And I don't like the *useNewAPI*() methods either. I spent a lot of time thinking about backwards compatibility for this API. It's tricky to do without sacrificing performance. In API patches I find myself spending more time for backwards-compatibility than for the actual new feature! :( I'll try to think about how to simplify this confusing old/new API mix. However, we need to make the decisions: a) if we want to release this new API with 2.9, b) if yes to a), if we want to remove the old API in 3.0? 
If yes to a) and no to b), then we'll have to support both APIs for a presumably very long time, so we then need to have a better solution for the backwards-compatibility here. -Michael On 6/15/09 1:09 PM, Robert Muir wrote: let me try some slightly more constructive feedback: new user looks at TokenStream javadocs: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/org/apache/lucene/analysis/TokenStream.html immediately they see deprecated, text in red with the words "experimental", warnings in bold, the whole thing is scary! due to the use of 'e.g.' the javadoc for .incrementToken() is cut off in a bad way, and its probably the most important method to a new user! there's also a stray bold tag gone haywire somewhere, possibly .incrementToken() from a technical perspective, the documentation is excellent! but for a new user unfamiliar with lucene, its unclear exactly what steps to take: use the scary red experimental api or the old deprecated one? theres also some fairly advanced stuff such as .captureState and .restoreState that might be better in a different place. finally, the .setUseNewAPI() and .setUseNewAPIDefault() are confusing [one is static, one is not], especially because it states all streams and filters in one chain must use the same API, is there a way to simplify this? i'm really terrible with javadocs myself, but perhaps we can come up with a way to improve the presentation... maybe that will make the difference. On Mon, Jun 15, 2009 at 3:45 PM, Robert Muir wrote: Mark, I'll see if I can get tests produced for some of those analyzers. as a new user of the new api myself, I think I can safely say the most confusing thing about it is having the old deprecated API mixed in the javadocs with it :) On Mon, Jun 15, 2009 at 2:53 PM, Mark Miller wrote: Robert Muir wrote: Mark, I created an issue for this. Thanks Robert, great idea. I just think you know, converting an analyzer to the new api is really not that bad. I don't either. 
I'm really just complaining about the initial readability. Once you know whats up, its not too much different. I just have found myself having to refigure out whats up (a short task to be sure) over again after I leave it for a while. With the old one, everything was just kind of immediately self evident. That makes me think new users might be a little more confused when they first meet again. I'm not a new user though, so its only a guess really. reverse engineering what one of them does is not necessarily obvious, and is completely unrelated but necessary if they are to be migrated. I'd be willing to assist with some of this but I don't want to really work the issue if its gonna be a waste of time at the end of the day... The chances of this issue being fully reverted are so remote that I really wouldnt let that stop you ... On Mon, Jun 15, 2009 at 1:55 PM, Mark Miller wrote: Robert Muir wrote: As Lucene's contrib hasn't been fully converted either (and its been quite some time now), someone has probably heard that groan before. hope this doesn't sound like a complaint
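[As an aside for readers following this thread: the consumption pattern of the new attribute-based API can be pictured with a minimal sketch. The classes below are hypothetical stand-ins, not the real Lucene types; the point is only the shape of the pattern - a single shared attribute instance is fetched once, before the loop, and `incrementToken()` updates it in place instead of returning a `Token`.]

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for the new-style consumption pattern under
// discussion; these are NOT the real Lucene classes, just a sketch.
public class TokenApiSketch {

    // New-style attribute interface (analogous to TermAttribute).
    interface TermAttribute { String term(); void setTerm(String t); }

    static class Stream {
        private final String[] terms;
        private int i = 0;
        // Single shared attribute instance, created once, reused per token.
        final TermAttribute termAtt = new TermAttribute() {
            private String t;
            public String term() { return t; }
            public void setTerm(String t) { this.t = t; }
        };
        Stream(String... terms) { this.terms = terms; }
        // New API: no Token object is returned; the shared attribute
        // is updated in place for each token.
        boolean incrementToken() {
            if (i == terms.length) return false;
            termAtt.setTerm(terms[i++]);
            return true;
        }
    }

    static List<String> consume(Stream s) {
        List<String> out = new ArrayList<>();
        // The consumer fetches the attribute once, before the loop --
        // this is the invariant that makes the single-instance
        // assumption cheap (no per-token lookups).
        TermAttribute att = s.termAtt;
        while (s.incrementToken()) out.add(att.term());
        return out;
    }

    public static void main(String[] args) {
        System.out.println(consume(new Stream("new", "token", "api")));
        // prints [new, token, api]
    }
}
```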
Re: New Token API was Re: Payloads and TrieRangeQuery
On Mon, Jun 15, 2009 at 4:21 PM, Uwe Schindler wrote: > And, in tests: test/o/a/l/index/store is somehow wrong placed. The class > inside should be in test/o/a/l/store. Should I move? Please do! Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: New Token API was Re: Payloads and TrieRangeQuery
I have implemented most of that actually (the interface part and Token implementing all of them). The problem is a paradigm change with the new API: the assumption is that there is always only one single instance of an Attribute. With the old API, it is recommended to reuse the passed-in token, but you don't have to, you can also return a new one with every call of next(). Now with this change the indexer classes should only know about the interfaces, it shouldn't know Token anymore, which seems fine when Token implements all those interfaces. BUT, since there can be more than one instance of Token, the indexer would have to call getAttribute() for all Attributes it needs after each call of next(). I haven't measured how expensive that is, but it seems like a severe performance hit. That's basically the main reason why the backwards compatibility is ensured in such a goofy way right now. Michael On 6/15/09 1:28 PM, Uwe Schindler wrote: And I don't like the *useNewAPI*() methods either. I spent a lot of time thinking about backwards compatibility for this API. It's tricky to do without sacrificing performance. In API patches I find myself spending more time for backwards-compatibility than for the actual new feature! :( I'll try to think about how to simplify this confusing old/new API mix. One solution to fix this useNewAPI problem would be to change the AttributeSource in a way, to return classes that implement interfaces (as you proposed some weeks ago). The good old Token class then does not need to be deprecated, it could simply implement all 4 interfaces. AttributeSource then must implement a registry, which classes implement which interfaces. So if somebody wants a TermAttribute, he always gets the Token. In all other cases, the default could be a *Impl default class. In this case, next() could simply return this Token (which implements all 4 attributes). 
The reusableToken is simply thrown away in the deprecated API; the reusable Token comes from the AttributeSource and is per-instance. Is this an idea? Uwe
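[The registry idea described above can be sketched roughly as follows. This is a hypothetical illustration, not the actual Lucene implementation: one Token instance implements several attribute interfaces, and the AttributeSource registry hands out that single instance for every interface it implements, so old-API code (which passes a Token around) and new-API code (which asks for attribute interfaces) share the same state.]

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the registry idea -- stand-in classes only.
public class RegistrySketch {

    interface TermAttribute { String term(); void setTerm(String t); }
    interface TypeAttribute { String type(); void setType(String t); }

    // The "good old Token" class, implementing the attribute interfaces directly.
    static class Token implements TermAttribute, TypeAttribute {
        private String term, type;
        public String term() { return term; }
        public void setTerm(String t) { term = t; }
        public String type() { return type; }
        public void setType(String t) { type = t; }
    }

    static class AttributeSource {
        private final Map<Class<?>, Object> registry = new HashMap<>();

        <T> T addAttribute(Class<T> iface) {
            // If a registered instance already implements the requested
            // interface, return it rather than creating a new object.
            for (Object o : registry.values()) {
                if (iface.isInstance(o)) return iface.cast(o);
            }
            // Default here: one Token serves all the interfaces it implements.
            // (A fuller registry would fall back to per-interface *Impl classes.)
            Token token = new Token();
            registry.put(Token.class, token);
            return iface.cast(token);
        }
    }

    public static void main(String[] args) {
        AttributeSource src = new AttributeSource();
        TermAttribute termAtt = src.addAttribute(TermAttribute.class);
        TypeAttribute typeAtt = src.addAttribute(TypeAttribute.class);
        // Both lookups resolve to the same per-AttributeSource Token instance:
        System.out.println(termAtt == (Object) typeAtt); // prints true
    }
}
```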
Re: New Token API was Re: Payloads and TrieRangeQuery
Michael, again I am terrible with such things myself... Personally I am impressed that you have the back compat, even if you don't change any code at all I think some reformatting of javadocs might make the situation a lot friendlier. I just listed everything that came to my mind immediately. I guess I will also mention that one of the reasons I might not use the new API is that since all filters, etc on the same chain must use the same API, its discouraging if all the contrib stuff doesn't support the new API, it makes me want to just stick with the old so everything will work. So I think contribs being on the new API is really important otherwise no one will want to use it. On Mon, Jun 15, 2009 at 4:21 PM, Michael Busch wrote: > This is excellent feedback, Robert! > > I agree this is confusing; especially having a deprecated API and only a > experimental one that replaces the old one. We need to change that. > And I don't like the *useNewAPI*() methods either. I spent a lot of time > thinking about backwards compatibility for this API. It's tricky to do > without sacrificing performance. In API patches I find myself spending more > time for backwards-compatibility than for the actual new feature! :( > > I'll try to think about how to simplify this confusing old/new API mix. > > However, we need to make the decisions: > a) if we want to release this new API with 2.9, > b) if yes to a), if we want to remove the old API in 3.0? > > If yes to a) and no to b), then we'll have to support both APIs for a > presumably very long time, so we then need to have a better solution for the > backwards-compatibility here. 
> > -Michael > > On 6/15/09 1:09 PM, Robert Muir wrote: > > let me try some slightly more constructive feedback: > > new user looks at TokenStream javadocs: > http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/org/apache/lucene/analysis/TokenStream.html > immediately they see deprecated, text in red with the words > "experimental", warnings in bold, the whole thing is scary! > due to the use of 'e.g.' the javadoc for .incrementToken() is cut off > in a bad way, and its probably the most important method to a new > user! > there's also a stray bold tag gone haywire somewhere, possibly > .incrementToken() > > from a technical perspective, the documentation is excellent! but for > a new user unfamiliar with lucene, its unclear exactly what steps to > take: use the scary red experimental api or the old deprecated one? > > theres also some fairly advanced stuff such as .captureState and > .restoreState that might be better in a different place. > > finally, the .setUseNewAPI() and .setUseNewAPIDefault() are confusing > [one is static, one is not], especially because it states all streams > and filters in one chain must use the same API, is there a way to > simplify this? > > i'm really terrible with javadocs myself, but perhaps we can come up > with a way to improve the presentation... maybe that will make the > difference. > > On Mon, Jun 15, 2009 at 3:45 PM, Robert Muir wrote: > > > Mark, I'll see if I can get tests produced for some of those analyzers. > > as a new user of the new api myself, I think I can safely say the most > confusing thing about it is having the old deprecated API mixed in the > javadocs with it :) > > On Mon, Jun 15, 2009 at 2:53 PM, Mark Miller wrote: > > > Robert Muir wrote: > > > Mark, I created an issue for this. > > > > Thanks Robert, great idea. > > > I just think you know, converting an analyzer to the new api is really > not that bad. > > > > I don't either. I'm really just complaining about the initial readability. 
> Once you know whats up, its not too much different. I just have found myself > having to refigure out whats up (a short task to be sure) over again after I > leave it for a while. With the old one, everything was just kind of > immediately self evident. > > That makes me think new users might be a little more confused when they > first meet again. I'm not a new user though, so its only a guess really. > > > reverse engineering what one of them does is not necessarily obvious, > and is completely unrelated but necessary if they are to be migrated. > > I'd be willing to assist with some of this but I don't want to really > work the issue if its gonna be a waste of time at the end of the > day... > > > > The chances of this issue being fully reverted are so remote that I really > wouldnt let that stop you ... > > > On Mon, Jun 15, 2009 at 1:55 PM, Mark Miller wrote: > > > > Robert Muir wrote: > > > > As Lucene's contrib hasn't been fully converted either (and its been > quite > some time now), someone has probably heard that groan before. > > > > > hope this doesn't sound like a complaint, > > > > Complaints are fine in any case. Every now and then, it might cause a > little > rant from me or something, but please don't let that dissuade you :) > Who doesnt like to rant and rave now and then. As long as thoughts and > opinions are coming out in a non negative way (which certainly includes complaints), I think its all good.
RE: New Token API was Re: Payloads and TrieRangeQuery
> And I don't like the *useNewAPI*() methods either. I spent a lot of time > thinking about backwards compatibility for this API. It's tricky to do > without sacrificing performance. In API patches I find myself spending > more time for backwards-compatibility than for the actual new feature! :( > > I'll try to think about how to simplify this confusing old/new API mix. One solution to fix this useNewAPI problem would be to change the AttributeSource in a way, to return classes that implement interfaces (as you proposed some weeks ago). The good old Token class then does not need to be deprecated, it could simply implement all 4 interfaces. AttributeSource then must implement a registry, which classes implement which interfaces. So if somebody wants a TermAttribute, he always gets the Token. In all other cases, the default could be a *Impl default class. In this case, next() could simply return this Token (which implements all 4 attributes). The reusableToken is simply thrown away in the deprecated API; the reusable Token comes from the AttributeSource and is per-instance. Is this an idea? Uwe
RE: New Token API was Re: Payloads and TrieRangeQuery
By the way, there is an empty "de" subdir in SVN inside analysis. Can this be removed? And, in tests: test/o/a/l/index/store is somehow wrongly placed. The class inside should be in test/o/a/l/store. Should I move? - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -Original Message- > From: Uwe Schindler [mailto:u...@thetaphi.de] > Sent: Monday, June 15, 2009 10:18 PM > To: java-dev@lucene.apache.org > Subject: RE: New Token API was Re: Payloads and TrieRangeQuery > > > there's also a stray bold tag gone haywire somewhere, possibly > > .incrementToken() > > I fixed this. This was getting on my nerves the whole day when I wrote > javadocs for NumericTokenStream... > > Uwe
Re: New Token API was Re: Payloads and TrieRangeQuery
This is excellent feedback, Robert! I agree this is confusing; especially having a deprecated API and only a experimental one that replaces the old one. We need to change that. And I don't like the *useNewAPI*() methods either. I spent a lot of time thinking about backwards compatibility for this API. It's tricky to do without sacrificing performance. In API patches I find myself spending more time for backwards-compatibility than for the actual new feature! :( I'll try to think about how to simplify this confusing old/new API mix. However, we need to make the decisions: a) if we want to release this new API with 2.9, b) if yes to a), if we want to remove the old API in 3.0? If yes to a) and no to b), then we'll have to support both APIs for a presumably very long time, so we then need to have a better solution for the backwards-compatibility here. -Michael On 6/15/09 1:09 PM, Robert Muir wrote: let me try some slightly more constructive feedback: new user looks at TokenStream javadocs: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/org/apache/lucene/analysis/TokenStream.html immediately they see deprecated, text in red with the words "experimental", warnings in bold, the whole thing is scary! due to the use of 'e.g.' the javadoc for .incrementToken() is cut off in a bad way, and its probably the most important method to a new user! there's also a stray bold tag gone haywire somewhere, possibly .incrementToken() from a technical perspective, the documentation is excellent! but for a new user unfamiliar with lucene, its unclear exactly what steps to take: use the scary red experimental api or the old deprecated one? theres also some fairly advanced stuff such as .captureState and .restoreState that might be better in a different place. 
finally, the .setUseNewAPI() and .setUseNewAPIDefault() are confusing [one is static, one is not], especially because it states all streams and filters in one chain must use the same API, is there a way to simplify this? i'm really terrible with javadocs myself, but perhaps we can come up with a way to improve the presentation... maybe that will make the difference. On Mon, Jun 15, 2009 at 3:45 PM, Robert Muir wrote: Mark, I'll see if I can get tests produced for some of those analyzers. as a new user of the new api myself, I think I can safely say the most confusing thing about it is having the old deprecated API mixed in the javadocs with it :) On Mon, Jun 15, 2009 at 2:53 PM, Mark Miller wrote: Robert Muir wrote: Mark, I created an issue for this. Thanks Robert, great idea. I just think you know, converting an analyzer to the new api is really not that bad. I don't either. I'm really just complaining about the initial readability. Once you know whats up, its not too much different. I just have found myself having to refigure out whats up (a short task to be sure) over again after I leave it for a while. With the old one, everything was just kind of immediately self evident. That makes me think new users might be a little more confused when they first meet again. I'm not a new user though, so its only a guess really. reverse engineering what one of them does is not necessarily obvious, and is completely unrelated but necessary if they are to be migrated. I'd be willing to assist with some of this but I don't want to really work the issue if its gonna be a waste of time at the end of the day... The chances of this issue being fully reverted are so remote that I really wouldnt let that stop you ... On Mon, Jun 15, 2009 at 1:55 PM, Mark Miller wrote: Robert Muir wrote: As Lucene's contrib hasn't been fully converted either (and its been quite some time now), someone has probably heard that groan before. 
hope this doesn't sound like a complaint, Complaints are fine in any case. Every now and then, it might cause a little rant from me or something, but please don't let that dissuade you :) Who doesnt like to rant and rave now and then. As long as thoughts and opinions are coming out in a non negative way (which certainly includes complaints), I think its all good. but in my opinion this is because many do not have any tests. I converted a few of these and its just grunt work but if there are no tests, its impossible to verify the conversion is correct. Thanks for pointing that out. We probably get lazy with tests, especially in contrib, and this brings up a good point - we should probably push for tests or write them before committing more often. Sometimes I'm sure it just comes down to a tradeoff though - no resources at the time, the class looked clear cut, and it was just contrib anyway. But then here we are ... a healthy dose of grunt work is bad enough when you have tests to check it. -- - Mark http://www.lucidimagination.com
Re: New Token API was Re: Payloads and TrieRangeQuery
Some great points - especially the decision between a deprecated API, and a new experimental one subject to change. Bit of a rock and a hard place for a new user. Perhaps we should add a little note with some guidance. - Mark Robert Muir wrote: let me try some slightly more constructive feedback: new user looks at TokenStream javadocs: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/org/apache/lucene/analysis/TokenStream.html immediately they see deprecated, text in red with the words "experimental", warnings in bold, the whole thing is scary! due to the use of 'e.g.' the javadoc for .incrementToken() is cut off in a bad way, and its probably the most important method to a new user! there's also a stray bold tag gone haywire somewhere, possibly .incrementToken() from a technical perspective, the documentation is excellent! but for a new user unfamiliar with lucene, its unclear exactly what steps to take: use the scary red experimental api or the old deprecated one? theres also some fairly advanced stuff such as .captureState and .restoreState that might be better in a different place. finally, the .setUseNewAPI() and .setUseNewAPIDefault() are confusing [one is static, one is not], especially because it states all streams and filters in one chain must use the same API, is there a way to simplify this? i'm really terrible with javadocs myself, but perhaps we can come up with a way to improve the presentation... maybe that will make the difference. On Mon, Jun 15, 2009 at 3:45 PM, Robert Muir wrote: Mark, I'll see if I can get tests produced for some of those analyzers. as a new user of the new api myself, I think I can safely say the most confusing thing about it is having the old deprecated API mixed in the javadocs with it :) On Mon, Jun 15, 2009 at 2:53 PM, Mark Miller wrote: Robert Muir wrote: Mark, I created an issue for this. Thanks Robert, great idea. I just think you know, converting an analyzer to the new api is really not that bad. 
I don't either. I'm really just complaining about the initial readability. Once you know whats up, its not too much different. I just have found myself having to refigure out whats up (a short task to be sure) over again after I leave it for a while. With the old one, everything was just kind of immediately self evident. That makes me think new users might be a little more confused when they first meet again. I'm not a new user though, so its only a guess really. reverse engineering what one of them does is not necessarily obvious, and is completely unrelated but necessary if they are to be migrated. I'd be willing to assist with some of this but I don't want to really work the issue if its gonna be a waste of time at the end of the day... The chances of this issue being fully reverted are so remote that I really wouldnt let that stop you ... On Mon, Jun 15, 2009 at 1:55 PM, Mark Miller wrote: Robert Muir wrote: As Lucene's contrib hasn't been fully converted either (and its been quite some time now), someone has probably heard that groan before. hope this doesn't sound like a complaint, Complaints are fine in any case. Every now and then, it might cause a little rant from me or something, but please don't let that dissuade you :) Who doesnt like to rant and rave now and then. As long as thoughts and opinions are coming out in a non negative way (which certainly includes complaints), I think its all good. but in my opinion this is because many do not have any tests. I converted a few of these and its just grunt work but if there are no tests, its impossible to verify the conversion is correct. Thanks for pointing that out. We probably get lazy with tests, especially in contrib, and this brings up a good point - we should probably push for tests or write them before committing more often. Sometimes I'm sure it just comes downto a tradeoff though - no resources at the time, the class looked clear cut, and it was just contrib anyway. But then here we are ... 
a healthy dose of grunt work is bad enough when you have tests to check it. -- - Mark http://www.lucidimagination.com
RE: New Token API was Re: Payloads and TrieRangeQuery
> there's also a stray bold tag gone haywire somewhere, possibly > .incrementToken() I fixed this. This was getting on my nerves the whole day when I wrote javadocs for NumericTokenStream... Uwe
Re: New Token API was Re: Payloads and TrieRangeQuery
let me try some slightly more constructive feedback: new user looks at TokenStream javadocs: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/org/apache/lucene/analysis/TokenStream.html immediately they see deprecated, text in red with the words "experimental", warnings in bold, the whole thing is scary! due to the use of 'e.g.' the javadoc for .incrementToken() is cut off in a bad way, and its probably the most important method to a new user! there's also a stray bold tag gone haywire somewhere, possibly .incrementToken() from a technical perspective, the documentation is excellent! but for a new user unfamiliar with lucene, its unclear exactly what steps to take: use the scary red experimental api or the old deprecated one? theres also some fairly advanced stuff such as .captureState and .restoreState that might be better in a different place. finally, the .setUseNewAPI() and .setUseNewAPIDefault() are confusing [one is static, one is not], especially because it states all streams and filters in one chain must use the same API, is there a way to simplify this? i'm really terrible with javadocs myself, but perhaps we can come up with a way to improve the presentation... maybe that will make the difference. On Mon, Jun 15, 2009 at 3:45 PM, Robert Muir wrote: > Mark, I'll see if I can get tests produced for some of those analyzers. > > as a new user of the new api myself, I think I can safely say the most > confusing thing about it is having the old deprecated API mixed in the > javadocs with it :) > > On Mon, Jun 15, 2009 at 2:53 PM, Mark Miller wrote: >> Robert Muir wrote: >>> >>> Mark, I created an issue for this. >>> >> >> Thanks Robert, great idea. >>> >>> I just think you know, converting an analyzer to the new api is really >>> not that bad. >>> >> >> I don't either. I'm really just complaining about the initial readability. >> Once you know whats up, its not too much different. 
I just have found myself >> having to refigure out whats up (a short task to be sure) over again after I >> leave it for a while. With the old one, everything was just kind of >> immediately self evident. >> >> That makes me think new users might be a little more confused when they >> first meet again. I'm not a new user though, so its only a guess really. >>> >>> reverse engineering what one of them does is not necessarily obvious, >>> and is completely unrelated but necessary if they are to be migrated. >>> >>> I'd be willing to assist with some of this but I don't want to really >>> work the issue if its gonna be a waste of time at the end of the >>> day... >>> >> >> The chances of this issue being fully reverted are so remote that I really >> wouldnt let that stop you ... >>> >>> On Mon, Jun 15, 2009 at 1:55 PM, Mark Miller wrote: >>> Robert Muir wrote: >> >> As Lucene's contrib hasn't been fully converted either (and its been >> quite >> some time now), someone has probably heard that groan before. >> >> > > hope this doesn't sound like a complaint, > Complaints are fine in any case. Every now and then, it might cause a little rant from me or something, but please don't let that dissuade you :) Who doesnt like to rant and rave now and then. As long as thoughts and opinions are coming out in a non negative way (which certainly includes complaints), I think its all good. > > but in my opinion this is > because many do not have any tests. > I converted a few of these and its just grunt work but if there are no > tests, its impossible to verify the conversion is correct. > > Thanks for pointing that out. We probably get lazy with tests, especially in contrib, and this brings up a good point - we should probably push for tests or write them before committing more often. Sometimes I'm sure it just comes downto a tradeoff though - no resources at the time, the class looked clear cut, and it was just contrib anyway. But then here we are ... 
a healthy dose of grunt work is bad enough when you have tests to check it. -- - Mark http://www.lucidimagination.com -- Robert Muir rcm...@gmail.com
Re: New Token API was Re: Payloads and TrieRangeQuery
Mark, I'll see if I can get tests produced for some of those analyzers. as a new user of the new api myself, I think I can safely say the most confusing thing about it is having the old deprecated API mixed in the javadocs with it :) On Mon, Jun 15, 2009 at 2:53 PM, Mark Miller wrote: > Robert Muir wrote: >> >> Mark, I created an issue for this. >> > > Thanks Robert, great idea. >> >> I just think you know, converting an analyzer to the new api is really >> not that bad. >> > > I don't either. I'm really just complaining about the initial readability. > Once you know whats up, its not too much different. I just have found myself > having to refigure out whats up (a short task to be sure) over again after I > leave it for a while. With the old one, everything was just kind of > immediately self evident. > > That makes me think new users might be a little more confused when they > first meet again. I'm not a new user though, so its only a guess really. >> >> reverse engineering what one of them does is not necessarily obvious, >> and is completely unrelated but necessary if they are to be migrated. >> >> I'd be willing to assist with some of this but I don't want to really >> work the issue if its gonna be a waste of time at the end of the >> day... >> > > The chances of this issue being fully reverted are so remote that I really > wouldnt let that stop you ... >> >> On Mon, Jun 15, 2009 at 1:55 PM, Mark Miller wrote: >> >>> >>> Robert Muir wrote: >>> > > As Lucene's contrib hasn't been fully converted either (and its been > quite > some time now), someone has probably heard that groan before. > > hope this doesn't sound like a complaint, >>> >>> Complaints are fine in any case. Every now and then, it might cause a >>> little >>> rant from me or something, but please don't let that dissuade you :) >>> Who doesnt like to rant and rave now and then. 
As long as thoughts and >>> opinions are coming out in a non negative way (which certainly includes >>> complaints), >>> I think its all good. >>> but in my opinion this is because many do not have any tests. I converted a few of these and its just grunt work but if there are no tests, its impossible to verify the conversion is correct. >>> >>> Thanks for pointing that out. We probably get lazy with tests, especially >>> in >>> contrib, and this brings up a good point - we should probably push >>> for tests or write them before committing more often. Sometimes I'm sure >>> it >>> just comes downto a tradeoff though - no resources at the time, >>> the class looked clear cut, and it was just contrib anyway. But then here >>> we >>> are ... a healthy dose of grunt work is bad enough when you have tests to >>> check it. >>> >>> -- >>> - Mark >>> >>> http://www.lucidimagination.com >>> >>> >>> >>> >>> - >>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org >>> For additional commands, e-mail: java-dev-h...@lucene.apache.org >>> >>> >>> >> >> >> >> > > > -- > - Mark > > http://www.lucidimagination.com > > > > > - > To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-dev-h...@lucene.apache.org > > -- Robert Muir rcm...@gmail.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
RE: New Token API was Re: Payloads and TrieRangeQuery
> If you understood that, you'd be able to look > at the actual token value if you were interested in what shift was > used. So it's redundant, has a runtime cost, it's not currently used > anywhere, and it's not useful to fields other than Trie. Perhaps it > shouldn't exist (yet)? You are right, you could also decode the shift value from the first char of the token... I think I will remove the ShiftAttribute and only use the token type to distinguish the highest precision from the lower precisions. By this, one could easily add a payload to the real numeric value using a TokenFilter. Uwe
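[The point about the shift being recoverable from the token itself can be pictured with a small sketch. The constants and layout here are a simplified assumption for illustration, not the exact Lucene prefix-coding: the first char of the token carries the shift as an offset from a base char, so a consumer who understands the encoding can read the shift straight off the term without any extra attribute.]

```java
// Illustrative sketch of why a separate ShiftAttribute is redundant when
// the shift is already prefix-coded into the token itself.
public class ShiftDecodeSketch {
    static final char SHIFT_START = (char) 0x20; // assumed base char for the shift prefix

    // Encode: first char carries the shift; the remaining chars carry the
    // shifted value, 7 bits per char (layout chosen just for illustration).
    static String encode(long value, int shift) {
        StringBuilder sb = new StringBuilder();
        sb.append((char) (SHIFT_START + shift));
        long v = value >>> shift;
        for (int i = 8; i >= 0; i--) {
            sb.append((char) ('a' + ((v >>> (7 * i)) & 0x7f)));
        }
        return sb.toString();
    }

    // A consumer interested in the shift can recover it from the token alone:
    static int decodeShift(String token) {
        return token.charAt(0) - SHIFT_START;
    }

    public static void main(String[] args) {
        String token = encode(1234567L, 8);
        System.out.println(decodeShift(token)); // prints 8
    }
}
```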
RE: New Token API was Re: Payloads and TrieRangeQuery
> On Mon, Jun 15, 2009 at 3:00 PM, Uwe Schindler wrote: > > There is a new Attribute called ShiftAttribute (or > NumericShiftAttribute), > > when trie range is moved to core. This attribute contains the > > shifted-away bits from the prefix encoded value during trie indexing. > > I was wondering about this: > To make use of ShiftAttribute, you need to understand the trie > encoding scheme itself. If you understood that, you'd be able to look > at the actual token value if you were interested in what shift was > used. So it's redundant, has a runtime cost, it's not currently used > anywhere, and it's not useful to fields other than Trie. Perhaps it > shouldn't exist (yet)? The idea was to make the indexing process controllable. You were the one who asked e.g. for the possibility to add payloads to trie fields and so on. Using the shift attribute, you have full control of the token types. OK, it's a little bit redundant; you could also use the TypeAttribute (which is already used to mark highest precision and lower precision values). One question about the whole TokenStream: In the original case we discussed about Payloads/Position and TrieRange. If this would be implemented in future versions, the question is, how should I set the PositionIncrement/Offsets in the token stream to create a Position of 0 in the index? I do not understand the indexing process here, especially this deprecated boolean flag about something negative (not sure what the name was). Should I set PositionIncrement to 0 for all Trie fields by default? How about PositionIncrementGap, when indexing more than one field? None of this is really clear. The position would be simpler to implement, but doing this with an attribute that is indexed together with the other attributes like a payload would be the most ideal solution for future versions of TrieRange. 
(Maybe we could also use the Offset attribute for the highest precision bits) Uwe
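[For readers unfamiliar with position increments, the "Position of 0" question above can be pictured with a small sketch. The assumption here, kept deliberately minimal, is that the indexer derives each token's absolute position by summing increments: the first (highest-precision) token of a numeric value gets increment 1, and every lower-precision token gets increment 0, so all precisions stack at a single position.]

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of position-increment arithmetic; Tok is a stand-in
// for a token with a position increment, not a real Lucene class.
public class PosIncrSketch {

    record Tok(String term, int posIncr) {}

    // The indexer derives absolute positions by summing increments.
    static List<Integer> absolutePositions(List<Tok> toks) {
        List<Integer> out = new ArrayList<>();
        int pos = -1; // so the first increment of 1 lands on position 0
        for (Tok t : toks) {
            pos += t.posIncr();
            out.add(pos);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tok> trie = List.of(
                new Tok("full-precision", 1),
                new Tok("shift-8", 0),   // lower precisions stack on the same position
                new Tok("shift-16", 0));
        System.out.println(absolutePositions(trie)); // prints [0, 0, 0]
    }
}
```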
Re: New Token API was Re: Payloads and TrieRangeQuery
On Mon, Jun 15, 2009 at 3:00 PM, Uwe Schindler wrote: > There is a new Attribute called ShiftAttribute (or NumericShiftAttribute), > when trie range is moved to core. This attribute contains the shifted-away > bits from the prefix encoded value during trie indexing. I was wondering about this. To make use of ShiftAttribute, you need to understand the trie encoding scheme itself. If you understood that, you'd be able to look at the actual token value if you were interested in what shift was used. So it's redundant, has a runtime cost, it's not currently used anywhere, and it's not useful to fields other than Trie. Perhaps it shouldn't exist (yet)? -Yonik http://www.lucidimagination.com
RE: New Token API was Re: Payloads and TrieRangeQuery
> Also, what about the case where one might have attributes that are meant > for downstream TokenFilters, but not necessarily for indexing? Offsets > and type come to mind. Is it the case now that those attributes are not > automatically added to the index? If they are ignored now, what if I > want to add them? I admit, I'm having a hard time finding the code that > specifically loops over the Attributes. I recall seeing it, but can no > longer find it. There is a new Attribute called ShiftAttribute (or NumericShiftAttribute), when trie range is moved to core. This attribute contains the shifted-away bits from the prefix encoded value during trie indexing. The idea is to e.g. have TokenFilters that may add additional payloads or other attributes to trie values, but only do this for specific precisions. In the future, it may also be interesting to automatically add this attribute to the index. Maybe we should add read/store methods to attributes that add an attribute to the posting using an IndexOutput/IndexInput (like the serialization methods). Uwe
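The filter Uwe describes can be sketched with minimal stand-in types. The real Lucene classes (AttributeSource, PayloadAttribute, and the proposed ShiftAttribute, which may or may not ship) differ in detail; the point is that a downstream filter keys its behavior off an attribute rather than parsing the token text to recover the shift.

```java
public class ShiftFilterSketch {
    /** Stand-in token carrying stand-ins for two attributes. */
    static class Token {
        final String term;
        final int shift;     // stand-in for the proposed ShiftAttribute
        byte[] payload;      // stand-in for PayloadAttribute
        Token(String term, int shift) { this.term = term; this.shift = shift; }
    }

    /** Attach a payload only to full-precision (shift == 0) trie terms. */
    static void markFullPrecision(Token t, byte[] payload) {
        if (t.shift == 0) {
            t.payload = payload;
        }
    }

    public static void main(String[] args) {
        Token full = new Token("abc", 0);    // full precision
        Token coarse = new Token("ab", 4);   // 4 bits shifted away
        byte[] p = new byte[] {42};
        markFullPrecision(full, p);
        markFullPrecision(coarse, p);
        System.out.println((full.payload != null) + " " + (coarse.payload == null));
    }
}
```

Yonik's objection still applies: the same decision could be made from the TypeAttribute or by decoding the term itself, so the dedicated attribute buys convenience, not capability.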
Re: New Token API was Re: Payloads and TrieRangeQuery
Robert Muir wrote: Mark, I created an issue for this. Thanks Robert, great idea. I just think you know, converting an analyzer to the new api is really not that bad. I don't either. I'm really just complaining about the initial readability. Once you know what's up, it's not too much different. I just have found myself having to refigure out what's up (a short task to be sure) over again after I leave it for a while. With the old one, everything was just kind of immediately self evident. That makes me think new users might be a little more confused when they first meet it. I'm not a new user though, so it's only a guess really. reverse engineering what one of them does is not necessarily obvious, and is completely unrelated but necessary if they are to be migrated. I'd be willing to assist with some of this but I don't want to really work the issue if it's gonna be a waste of time at the end of the day... The chances of this issue being fully reverted are so remote that I really wouldn't let that stop you ... On Mon, Jun 15, 2009 at 1:55 PM, Mark Miller wrote: Robert Muir wrote: As Lucene's contrib hasn't been fully converted either (and it's been quite some time now), someone has probably heard that groan before. hope this doesn't sound like a complaint, Complaints are fine in any case. Every now and then, it might cause a little rant from me or something, but please don't let that dissuade you :) Who doesn't like to rant and rave now and then. As long as thoughts and opinions are coming out in a non-negative way (which certainly includes complaints), I think it's all good. but in my opinion this is because many do not have any tests. I converted a few of these and it's just grunt work but if there are no tests, it's impossible to verify the conversion is correct. Thanks for pointing that out. We probably get lazy with tests, especially in contrib, and this brings up a good point - we should probably push for tests or write them before committing more often.
Sometimes I'm sure it just comes down to a tradeoff though - no resources at the time, the class looked clear cut, and it was just contrib anyway. But then here we are ... a healthy dose of grunt work is bad enough when you have tests to check it. -- - Mark http://www.lucidimagination.com
Re: New Token API was Re: Payloads and TrieRangeQuery
Mark, I created an issue for this. I just think you know, converting an analyzer to the new api is really not that bad. reverse engineering what one of them does is not necessarily obvious, and is completely unrelated but necessary if they are to be migrated. I'd be willing to assist with some of this but I don't want to really work the issue if its gonna be a waste of time at the end of the day... On Mon, Jun 15, 2009 at 1:55 PM, Mark Miller wrote: > Robert Muir wrote: >>> >>> As Lucene's contrib hasn't been fully converted either (and its been >>> quite >>> some time now), someone has probably heard that groan before. >>> >> >> hope this doesn't sound like a complaint, > > Complaints are fine in any case. Every now and then, it might cause a little > rant from me or something, but please don't let that dissuade you :) > Who doesnt like to rant and rave now and then. As long as thoughts and > opinions are coming out in a non negative way (which certainly includes > complaints), > I think its all good. >> >> but in my opinion this is >> because many do not have any tests. >> I converted a few of these and its just grunt work but if there are no >> tests, its impossible to verify the conversion is correct. >> > > Thanks for pointing that out. We probably get lazy with tests, especially in > contrib, and this brings up a good point - we should probably push > for tests or write them before committing more often. Sometimes I'm sure it > just comes downto a tradeoff though - no resources at the time, > the class looked clear cut, and it was just contrib anyway. But then here we > are ... a healthy dose of grunt work is bad enough when you have tests to > check it. 
-- Robert Muir rcm...@gmail.com
Re: New Token API was Re: Payloads and TrieRangeQuery
Robert Muir wrote: As Lucene's contrib hasn't been fully converted either (and it's been quite some time now), someone has probably heard that groan before. hope this doesn't sound like a complaint, Complaints are fine in any case. Every now and then, it might cause a little rant from me or something, but please don't let that dissuade you :) Who doesn't like to rant and rave now and then. As long as thoughts and opinions are coming out in a non-negative way (which certainly includes complaints), I think it's all good. but in my opinion this is because many do not have any tests. I converted a few of these and it's just grunt work but if there are no tests, it's impossible to verify the conversion is correct. Thanks for pointing that out. We probably get lazy with tests, especially in contrib, and this brings up a good point - we should probably push for tests or write them before committing more often. Sometimes I'm sure it just comes down to a tradeoff though - no resources at the time, the class looked clear cut, and it was just contrib anyway. But then here we are ... a healthy dose of grunt work is bad enough when you have tests to check it. -- - Mark http://www.lucidimagination.com
Re: New Token API was Re: Payloads and TrieRangeQuery
On Jun 14, 2009, at 8:05 PM, Michael Busch wrote: I'd be happy to discuss other API proposals that anybody brings up here, that have the same advantages and are more intuitive. We could also beef up the documentation and give a better example about how to convert a stream/filter from the old to the new API; a constructive suggestion that Uwe made at the ApacheCon. More questions: 1. What about Highlighter and MoreLikeThis? They have not been converted. Also, what are they going to do if the attributes they need are not available? Caveat emptor? 2. Same for TermVectors. What if the user specifies with positions and offsets, but the analyzer doesn't produce them? Caveat emptor? (BTW, this is also true for the new omit TF stuff) 3. Also, what about the case where one might have attributes that are meant for downstream TokenFilters, but not necessarily for indexing? Offsets and type come to mind. Is it the case now that those attributes are not automatically added to the index? If they are ignored now, what if I want to add them? I admit, I'm having a hard time finding the code that specifically loops over the Attributes. I recall seeing it, but can no longer find it. Also, can we add something like an AttributeTermQuery? Seems like it could work similar to the BoostingTermQuery. I'm sure more will come to me. -Grant
Re: New Token API was Re: Payloads and TrieRangeQuery
> > As Lucene's contrib hasn't been fully converted either (and it's been quite > some time now), someone has probably heard that groan before. hope this doesn't sound like a complaint, but in my opinion this is because many do not have any tests. I converted a few of these and it's just grunt work but if there are no tests, it's impossible to verify the conversion is correct. -- Robert Muir rcm...@gmail.com
Re: New Token API was Re: Payloads and TrieRangeQuery
Yonik Seeley wrote: The high-level description of the new API looks good (being able to add arbitrary properties to tokens), unfortunately, I've never had the time to try and use it and give any constructive feedback. As far as difficulty of use, I assume this only applies to implementing your own TokenFilter? It seems like most standard users would be just stringing together existing TokenFilters to create custom Analyzers? -Yonik http://www.lucidimagination.com True - it's the implementation. And just trying to understand what's going on the first time you see it. It's not particularly difficult, but it's also not obvious like the previous API was. As a user, I would ask why that is so, and frankly the answer wouldn't do much for me (as a user). I don't know if most 'standard' users implement their own or not. I will say, and perhaps I was in a special situation, I was writing them and modifying them almost as soon as I started playing with Lucene. And even when I wasn't, I needed to understand the code to understand some of the complexities that could occur, and thankfully, that was breezy to do. Right now, if you told me to go convert all of Solr to the new API you would hear a mighty groan. As Lucene's contrib hasn't been fully converted either (and it's been quite some time now), someone has probably heard that groan before. -- - Mark http://www.lucidimagination.com
Re: New Token API was Re: Payloads and TrieRangeQuery
The high-level description of the new API looks good (being able to add arbitrary properties to tokens), unfortunately, I've never had the time to try and use it and give any constructive feedback. As far as difficulty of use, I assume this only applies to implementing your own TokenFilter? It seems like most standard users would be just stringing together existing TokenFilters to create custom Analyzers? -Yonik http://www.lucidimagination.com
Re: New Token API was Re: Payloads and TrieRangeQuery
On Jun 14, 2009, at 8:05 PM, Michael Busch wrote: I'm not sure why this (currently having to implement next() too) is such an issue for you. You brought it up at the Lucene meetup too. No user will ever have to implement both (the new API and the old) in their streams/filters. The only reason why we did it this way is to not sacrifice performance for existing streams/filters when people switch to Lucene 2.9. I explained this point in the jira issue: http://issues.apache.org/jira/browse/LUCENE-1422?focusedCommentId=12644881&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12644881 The only time when we'll ever have to implement both APIs is between now and 2.9, only for new streams and filters that we add before 2.9 is released. I don't think it'd be reasonable to consider this disadvantage as a show stopper. It's an issue b/c I don't like writing dead code and who knows when 2.9 will actually be out. I don't think it is a show stopper either. Add on top of it, that the whole point of customizing the chain is to use it in search and, frankly speaking, somehow I think that part of the patch was held back. I'm not sure what you're implying. Could you elaborate? Sorry, see my response to Michael M. on this. I didn't mean to imply you were doing something malicious, just that it always felt half done to me. Knowing you, you don't strike me as someone who does things half way, so that's why I felt it was held back. But, as Michael M reminded me, it is complex, so please accept my apologies. The search side of the API is currently being developed in Lucene-1458. 1458 will not make it into 2.9. Therefore I agree that it is not very advantageous to switch to the new API right now for Lucene users. On the other hand, I don't think it hurts either. I am not sure I agree here. Forcing people to upgrade their analyzers can be quite involved. Analyzers are one of the main areas that people do custom work. 
Solr, for instance, has 11 custom TokenFilters right now as well as custom Tokenizers, not to mention the ones used during testing that aren't shipped. Upgrading these is a lot of work. I know in previous jobs, I also maintained a fair amount of TokenStream-related stuff. This should not be underestimated. Furthermore, as I said back in the initial discussion, Lucene's Analyzer stuff is often used outside of Lucene. In fact, I often think the Analysis piece should be a standalone jar (not requiring core) and that core should have a dependency on it. In other words, move o.a.l.analysis (and contrib/analysis) to a standalone module that core depends on. This would make it easier for others to consume the Analysis functionality. I personally would vote for reverting until a complete patch that addresses both sides of the problem is submitted and a better solution to cloning is put forth. If we revert now and put a new flexible API like this into 3.x, which I think is necessary to utilize flexible indexing, then we'll have to wait until 4.0 before we can remove the old API. Disadvantages like the one you mentioned above will then probably be present much longer. I mentioned in the following thread that I have started working on a better way of cloning, which will actually be faster compared to the old API. I'll try to get the code out asap. http://markmail.org/message/q7pgh2qlm2w7cxfx I'd be happy to discuss other API proposals that anybody brings up here, that have the same advantages and are more intuitive. We could also beef up the documentation and give a better example about how to convert a stream/filter from the old to the new API; a constructive suggestion that Uwe made at the ApacheCon. My point here was, at the time, that if others wanted to revert, I probably would vote for it. I'm not proposing we do it, as I think we can make do with what we have. Given the discussion here, I would probably change my mind and not support it now.
I think it might be helpful to have some help for people upgrading. Perhaps an abstract class that provides the "core" Token attributes out of the box as a base class that they can then extend? That being said, forcing people to upgrade could at least help them think about the fact that they have no use for the Type attribute or the Offsets attributes. And, testing the cloning stuff would help. I think the current approach underestimates the number of people who need to buffer tokens in memory before handing them out. Sure, it's not as many as the main use case, but it's not zero either.
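The buffering concern Grant raises can be sketched with stand-in types (not Lucene's real AttributeSource/State classes). Under the new API a filter shares one set of mutable attribute instances with its input, so buffering a token means copying its state aside, roughly what Lucene's captureState/restoreState do, rather than just holding a reference:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class BufferingSketch {
    // Shared, mutable "attribute", as in the new API where one attribute
    // instance is reused for every token in the stream.
    String termAtt;

    private final Deque<String> buffer = new ArrayDeque<>();

    /** Copy the current attribute state aside (the captureState idea). */
    void capture() { buffer.add(termAtt); }

    /** Overwrite the shared attribute with the next buffered state. */
    String restoreNext() {
        termAtt = buffer.poll();
        return termAtt;
    }

    public static void main(String[] args) {
        BufferingSketch s = new BufferingSketch();
        s.termAtt = "foo"; s.capture();
        s.termAtt = "bar"; s.capture();  // without capture(), "foo" would be overwritten
        System.out.println(s.restoreNext() + " " + s.restoreNext());  // foo bar
    }
}
```

The cost question in the thread is exactly how cheap that copy can be made for filters that buffer every token.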
Re: New Token API was Re: Payloads and TrieRangeQuery
On Jun 15, 2009, at 12:19 PM, Michael McCandless wrote: I don't think anything was "held back" in this effort. Grant, are you referring to LUCENE-1458? That's "held back" simply because the only person working on it (me) got distracted by other things to work on. I'm sorry, I didn't mean to imply Michael B. was holding back on the work. The patch has always felt half done to me because what's the point of having all of these attributes in the index if you don't have any way of searching them, thus I was struck by the need to get it in prior to making it available in search. I realize it's complex, but here we are forcing people to upgrade for some future, long-term goal.
Re: New Token API was Re: Payloads and TrieRangeQuery
I thought the primary goal of switching to AttributeSource (yes, the name is very generic...) was to allow extensibility to what's created per-Token, so that an app could add their own attrs without costly subclassing/casting per Token, independent of other "things" adding their tokens, etc. EG, trie* takes advantage of this extensibility by adding a ShiftAttribute. Subclassing Token in your app wasn't a good solution for various reasons. I do think the API is somewhat more cumbersome than before, and I don't like that about it (consumability!). But net/net I think the change is good, and it's one of the baby steps for flexible indexing (bullet #11): http://wiki.apache.org/jakarta-lucene/Lucene2Whiteboard I.e. it addresses the flexibility during analysis. I don't think anything was "held back" in this effort. Grant, are you referring to LUCENE-1458? That's "held back" simply because the only person working on it (me) got distracted by other things to work on. Flexible indexing (all of bullet #11) is a complex project, and we need to break it into baby steps like this one. We've already made good progress on it: you can already make custom attrs and a custom (but, package private) indexing chain if you want. Next step is pluggable codecs for writing index files (LUCENE-1458), and APIs for reading them (that generalize Terms/TermDoc/TermPositions we have today). Mike On Sun, Jun 14, 2009 at 11:41 PM, Shai Erera wrote: > The "old" API is deprecated, and therefore when we release 2.9 there might > be some people who'd think they should move away from it, to better prepare > for 3.0 (while in fact this many not be the case). Also, we should make sure > that when we remove all the deprecations, this will still exist (and > therefore, why deprecate it now?), if we think this should indeed be kept > around for at least a while longer. 
> > I personally am all for keeping it around (it will save me a huge > refactoring of an Analyzer package I wrote), but I have to admit it's only > because I've got quite comfortable with the existing API, and did not have > the time to try the new one yet. > > Shai > > On Mon, Jun 15, 2009 at 3:49 AM, Mark Miller wrote: >> >> Mark Miller wrote: >>> >>> I don't know how I feel about rolling the new token api back. >>> >>> I will say that I originally had no issue with it because I am very >>> excited about Lucene-1458. >>> >>> At the same time though, I'm thinking Lucene-1458 is a very advanced >>> issue that will likely be for really expert usage (though I can see benefits >>> falling to general users). >>> >>> I'm slightly iffy about making an intuitive api much less intuitive for >>> an expert future feature that hasn't fully materialized in Lucene yet. It >>> almost seems like that fight should weigh towards general usage and standard >>> users. >>> >>> I don't have a better proposal though, nor the time to consider it at the >>> moment. I was just more curious if anyone else had any thoughts. I hadn't >>> realized Grant had asked a similar question not long ago >>> with no response. Not sure how to take that, but I'd think that would >>> indicate less problems with people than more. On the other hand, you don't >>> have to switch yet (with trunk) and we have yet to release it. I wonder how >>> many non dev, every day users have really had to tussle with the new API >>> yet. Not many people complaining too loudly at the moment though. >>> >>> Asking for a roll back seems a bit extreme without a little more support >>> behind it than we have seen. >>> >>> - Mark >> >> PS >> >> I know you didnt ask for a rollback Grant - just kind of talking in a >> general manner. I see your point on getting the search side in, I'm just not >> sure I agree that it really matters if one hits before the other. Like Mike >> says, you don't >> have to switch to the new API yet. 
Re: New Token API was Re: Payloads and TrieRangeQuery
The "old" API is deprecated, and therefore when we release 2.9 there might be some people who'd think they should move away from it, to better prepare for 3.0 (while in fact this may not be the case). Also, we should make sure that when we remove all the deprecations, this will still exist (and therefore, why deprecate it now?), if we think this should indeed be kept around for at least a while longer. I personally am all for keeping it around (it will save me a huge refactoring of an Analyzer package I wrote), but I have to admit it's only because I've got quite comfortable with the existing API, and did not have the time to try the new one yet. Shai On Mon, Jun 15, 2009 at 3:49 AM, Mark Miller wrote: > Mark Miller wrote: > >> I don't know how I feel about rolling the new token api back. >> >> I will say that I originally had no issue with it because I am very >> excited about Lucene-1458. >> >> At the same time though, I'm thinking Lucene-1458 is a very advanced issue >> that will likely be for really expert usage (though I can see benefits >> falling to general users). >> >> I'm slightly iffy about making an intuitive api much less intuitive for an >> expert future feature that hasn't fully materialized in Lucene yet. It >> almost seems like that fight should weigh towards general usage and standard >> users. >> >> I don't have a better proposal though, nor the time to consider it at the >> moment. I was just more curious if anyone else had any thoughts. I hadn't >> realized Grant had asked a similar question not long ago >> with no response. Not sure how to take that, but I'd think that would >> indicate less problems with people than more. On the other hand, you don't >> have to switch yet (with trunk) and we have yet to release it. I wonder how >> many non dev, every day users have really had to tussle with the new API >> yet. Not many people complaining too loudly at the moment though. 
>> >> Asking for a roll back seems a bit extreme without a little more support >> behind it than we have seen. >> >> - Mark >> > > PS > > I know you didn't ask for a rollback Grant - just kind of talking in a > general manner. I see your point on getting the search side in, I'm just not > sure I agree that it really matters if one hits before the other. Like Mike > says, you don't > have to switch to the new API yet.
Re: New Token API was Re: Payloads and TrieRangeQuery
Mark Miller wrote: I don't know how I feel about rolling the new token api back. I will say that I originally had no issue with it because I am very excited about Lucene-1458. At the same time though, I'm thinking Lucene-1458 is a very advanced issue that will likely be for really expert usage (though I can see benefits falling to general users). I'm slightly iffy about making an intuitive api much less intuitive for an expert future feature that hasn't fully materialized in Lucene yet. It almost seems like that fight should weigh towards general usage and standard users. I don't have a better proposal though, nor the time to consider it at the moment. I was just more curious if anyone else had any thoughts. I hadn't realized Grant had asked a similar question not long ago with no response. Not sure how to take that, but I'd think that would indicate less problems with people than more. On the other hand, you don't have to switch yet (with trunk) and we have yet to release it. I wonder how many non-dev, everyday users have really had to tussle with the new API yet. Not many people complaining too loudly at the moment though. Asking for a roll back seems a bit extreme without a little more support behind it than we have seen. - Mark PS I know you didn't ask for a rollback Grant - just kind of talking in a general manner. I see your point on getting the search side in, I'm just not sure I agree that it really matters if one hits before the other. Like Mike says, you don't have to switch to the new API yet. -- - Mark http://www.lucidimagination.com
Re: New Token API was Re: Payloads and TrieRangeQuery
I don't know how I feel about rolling the new token api back. I will say that I originally had no issue with it because I am very excited about Lucene-1458. At the same time though, I'm thinking Lucene-1458 is a very advanced issue that will likely be for really expert usage (though I can see benefits falling to general users). I'm slightly iffy about making an intuitive api much less intuitive for an expert future feature that hasn't fully materialized in Lucene yet. It almost seems like that fight should weigh towards general usage and standard users. I don't have a better proposal though, nor the time to consider it at the moment. I was just more curious if anyone else had any thoughts. I hadn't realized Grant had asked a similar question not long ago with no response. Not sure how to take that, but I'd think that would indicate less problems with people than more. On the other hand, you don't have to switch yet (with trunk) and we have yet to release it. I wonder how many non dev, every day users have really had to tussle with the new API yet. Not many people complaining too loudly at the moment though. Asking for a roll back seems a bit extreme without a little more support behind it than we have seen. - Mark
Re: New Token API was Re: Payloads and TrieRangeQuery
On 6/14/09 5:17 AM, Grant Ingersoll wrote: Agreed. I've been bringing it up for a while now and made the same comments when it was first introduced, but felt like the lone voice in the wilderness on it and gave way [1], [2], [3]. Now that others are writing/converting, I think it is worth revisiting. I am and always was open to constructive suggestions about how to design this API. I know these new APIs currently don't seem to have many advantages over the previous ones, but they're basically laying the API groundwork for future features like flexible indexing. Some concerns you mentioned were targeted against the first version of the patch in LUCENE-1422. But, you later said you liked how the next patch looked (in thread [2] that you mentioned). That being said, I did just write my first TokenFilter with it, and didn't think it was that hard. There are some gains in it and the API can be simpler if you just need one or two attributes (see DelimitedPayloadTokenFilter), although, just like the move to using char [] in Token, as soon as you do something like store a Token, you lose most of the benefit, I think (for the char [] case, as soon as you need a String in one of your filters, you lose the perf. gain). The annoying parts are that you still have to implement the deprecated next() part, otherwise chances are the thing is unusable by everyone at this point anyway. I'm not sure why this (currently having to implement next() too) is such an issue for you. You brought it up at the Lucene meetup too. No user will ever have to implement both (the new API and the old) in their streams/filters. The only reason why we did it this way is to not sacrifice performance for existing streams/filters when people switch to Lucene 2.9. 
I explained this point in the jira issue: http://issues.apache.org/jira/browse/LUCENE-1422?focusedCommentId=12644881&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12644881 The only time when we'll ever have to implement both APIs is between now and 2.9, only for new streams and filters that we add before 2.9 is released. I don't think it'd be reasonable to consider this disadvantage as a show stopper. Add on top of it, that the whole point of customizing the chain is to use it in search and, frankly speaking, somehow I think that part of the patch was held back. I'm not sure what you're implying. Could you elaborate? The search side of the API is currently being developed in Lucene-1458. 1458 will not make it into 2.9. Therefore I agree that it is not very advantageous to switch to the new API right now for Lucene users. On the other hand, I don't think it hurts either. I personally would vote for reverting until a complete patch that addresses both sides of the problem is submitted and a better solution to cloning is put forth. If we revert now and put a new flexible API like this into 3.x, which I think is necessary to utilize flexible indexing, then we'll have to wait until 4.0 before we can remove the old API. Disadvantages like the one you mentioned above, will then probably be present much longer. I mentioned in the following thread that I have started working on a better way of cloning, which will actually be faster compared to the old API. I'll try to get the code out asap. http://markmail.org/message/q7pgh2qlm2w7cxfx I'd be happy to discuss other API proposals that anybody brings up here, that have the same advantages and are more intuitive. We could also beef up the documentation and give a better example about how to convert a stream/filter from the old to the new API; a constructive suggestion that Uwe made at the ApacheCon. 
-Michael -Grant [1] http://issues.apache.org/jira/browse/LUCENE-1422, [2] http://www.lucidimagination.com/search/document/5daf6d7b8027b4d3/tokenstream_and_token_apis#9e2d0d2b5dc118d4, and the rest of the discussion on that thread. [3] http://www.lucidimagination.com/search/document/4274335abcf31926/new_tokenstream_api_usage On Jun 13, 2009, at 10:32 PM, Mark Miller wrote: What was the big improvement with it again? Advanced, expert custom indexing chains require less casting or something right?
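The old-vs-new filter styles debated above can be put side by side with stand-in types (these are not Lucene's real TokenStream/Token classes). Old style: each call to next() hands back a token object. New style: the filter mutates shared attribute state in place and incrementToken() just reports whether a token was produced.

```java
public class ApiStylesSketch {
    // --- old style: each call returns the next term, null when exhausted ---
    interface OldStream { String next(); }

    static OldStream lowerCaseOld(OldStream in) {
        return () -> {
            String t = in.next();
            return (t == null) ? null : t.toLowerCase();
        };
    }

    // --- new style: shared attribute mutated in place ---
    static class NewStream {
        String termAtt;                               // shared "attribute"
        private final java.util.Iterator<String> it;
        NewStream(java.util.List<String> terms) { it = terms.iterator(); }
        boolean incrementToken() {
            if (!it.hasNext()) return false;
            termAtt = it.next().toLowerCase();        // mutate, don't allocate
            return true;
        }
    }

    public static void main(String[] args) {
        NewStream s = new NewStream(java.util.List.of("Foo", "BAR"));
        StringBuilder sb = new StringBuilder();
        while (s.incrementToken()) sb.append(s.termAtt).append(' ');
        System.out.println(sb.toString().trim());     // foo bar
    }
}
```

The per-token allocation the new style avoids is the performance argument made in the thread; the shared mutable state is the readability cost Mark and Kirill are complaining about.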
New Token API was Re: Payloads and TrieRangeQuery
Agreed. I've been bringing it up for a while now and made the same comments when it was first introduced, but felt like the lone voice in the wilderness on it and gave way [1], [2], [3]. Now that others are writing/converting, I think it is worth revisiting. That being said, I did just write my first TokenFilter with it, and didn't think it was that hard. There are some gains in it and the API can be simpler if you just need one or two attributes (see DelimitedPayloadTokenFilter), although, just like the move to using char [] in Token, as soon as you do something like store a Token, you lose most of the benefit, I think (for the char [] case, as soon as you need a String in one of your filters, you lose the perf. gain). The annoying parts are that you still have to implement the deprecated next() part, otherwise chances are the thing is unusable by everyone at this point anyway. Add on top of it, that the whole point of customizing the chain is to use it in search and, frankly speaking, somehow I think that part of the patch was held back. I personally would vote for reverting until a complete patch that addresses both sides of the problem is submitted and a better solution to cloning is put forth. -Grant [1] http://issues.apache.org/jira/browse/LUCENE-1422, [2] http://www.lucidimagination.com/search/document/5daf6d7b8027b4d3/tokenstream_and_token_apis#9e2d0d2b5dc118d4 , and the rest of the discussion on that thread. [3] http://www.lucidimagination.com/search/document/4274335abcf31926/new_tokenstream_api_usage On Jun 13, 2009, at 10:32 PM, Mark Miller wrote: Yonik Seeley wrote: Even non-API changes have tradeoffs... the indexing improvements (err, total rewrite) made that code *much* harder to understand and debug. It's a net win since the indexing performance improvements were so fantastic. I agree - very hard to follow, worth the improvements. Just to throw something out, the new Token API is not very consumable in my experience. 
The old one was very intuitive and very easy to follow the code. I've had to refigure out what the heck was going on with the new one more than once now. Writing some example code with it is hard to follow or justify to a new user. What was the big improvement with it again? Advanced, expert custom indexing chains require less casting or something right? I dunno - anyone else have any thoughts now that the new API has been in circulation for some time? -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
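For readers following along, the attribute pattern under discussion boils down to this: each stage in the analysis chain asks the source for an attribute type and gets back the one shared, mutable instance, so advancing the stream mutates state in place instead of allocating a new Token per term. A minimal self-contained sketch of the idea (these are not Lucene's classes; the names and the reflection-based factory are illustrative only):

```java
import java.util.HashMap;
import java.util.Map;

// Self-contained sketch of the AttributeSource idea behind the new Token
// API (illustrative only -- these are NOT Lucene's classes). Every stage
// in an analysis chain that asks for an attribute type receives the same
// shared, mutable instance, so advancing the stream mutates state in
// place instead of allocating a new Token per term.
public class AttributeSketch {
    public static class AttributeSource {
        private final Map<Class<?>, Object> attrs = new HashMap<>();

        @SuppressWarnings("unchecked")
        public <A> A addAttribute(Class<A> type) {
            return (A) attrs.computeIfAbsent(type, t -> {
                try {
                    return t.getDeclaredConstructor().newInstance();
                } catch (ReflectiveOperationException e) {
                    throw new IllegalArgumentException(e);
                }
            });
        }
    }

    public static class TermAttribute { public String term; }

    public static void main(String[] args) {
        AttributeSource source = new AttributeSource();
        TermAttribute producer = source.addAttribute(TermAttribute.class);
        // A downstream "filter" asking for the same type gets the SAME object.
        TermAttribute consumer = source.addAttribute(TermAttribute.class);
        producer.term = "trie";
        System.out.println(producer == consumer);  // true: one shared instance
        System.out.println(consumer.term);         // trie
    }
}
```

In the real API a TokenFilter similarly calls addAttribute(...) in its constructor and then reads and writes that shared instance inside incrementToken(), which is where the "only pay for the attributes you use" benefit comes from.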
Re: Payloads and TrieRangeQuery
> Just to throw something out, the new Token API is not very consumable in my > experience. The old one was very intuitive and very easy to follow the code. > > I've had to refigure out what the heck was going on with the new one more > than once now. Writing some example code with it is hard to follow or > justify to a new user. > > What was the big improvement with it again? Advanced, expert custom indexing > chains require less casting or something right? > > I dunno - anyone else have any thoughts now that the new API has been in > circulation for some time? I have an advanced, expert custom indexing chain, and it's still not ported over to the new API. It's counterintuitive, all right, with names not really saying what's going on (please, for an AttributeSource, whose Attribute is it? Attribute is a quality of 'something', but that 'something' is amiss), but the biggest problem for me is that it capitalizes on the idea of token stream even further, making filters whose output is several times the input tokenwise, or which need to inspect a number of tokens before emitting something - much harder to write. I most probably missed something and there IS a way not to trash your memory with non-reused LinkedHashMaps, but then again, there are no pointers. -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785
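The one-to-many case Kirill describes (a filter that emits several output tokens per input token) has to do its own buffering under a pull-style API: queue the extra outputs, and drain the queue before consuming the next input token. A self-contained sketch of that pattern in plain Java (not Lucene's classes; the lowercase-variant expansion rule is invented for illustration):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

// Sketch of the buffering a one-to-many filter needs under a pull-based
// API (plain Java, not Lucene's classes). Extra outputs are queued and
// drained before the next input token is consumed.
public class OneToManyFilter {
    private final Iterator<String> input;
    private final Deque<String> pending = new ArrayDeque<>();
    private String current;

    public OneToManyFilter(List<String> tokens) {
        this.input = tokens.iterator();
    }

    // Analogue of incrementToken(): advance to the next output token.
    public boolean incrementToken() {
        if (!pending.isEmpty()) {   // drain buffered expansions first
            current = pending.poll();
            return true;
        }
        if (!input.hasNext()) {
            return false;           // upstream exhausted
        }
        current = input.next();
        // Made-up expansion rule for illustration: also emit a lowercased
        // variant when the token contains upper-case characters.
        String lower = current.toLowerCase();
        if (!lower.equals(current)) {
            pending.add(lower);
        }
        return true;
    }

    public String term() { return current; }

    public static void main(String[] args) {
        OneToManyFilter f = new OneToManyFilter(List.of("Trie", "range"));
        while (f.incrementToken()) {
            System.out.println(f.term()); // prints: Trie, trie, range
        }
    }
}
```

The pattern itself is API-agnostic; the complaint in the email is that the attribute-based API gives no ready-made support for it, so every such filter ends up hand-rolling a queue like this one.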
Re: Payloads and TrieRangeQuery
Yonik Seeley wrote: Even non-API changes have tradeoffs... the indexing improvements (err, total rewrite) made that code *much* harder to understand and debug. It's a net win since the indexing performance improvements were so fantastic. I agree - very hard to follow, worth the improvements. Just to throw something out, the new Token API is not very consumable in my experience. The old one was very intuitive and very easy to follow the code. I've had to refigure out what the heck was going on with the new one more than once now. Writing some example code with it is hard to follow or justify to a new user. What was the big improvement with it again? Advanced, expert custom indexing chains require less casting or something right? I dunno - anyone else have any thoughts now that the new API has been in circulation for some time? -- - Mark http://www.lucidimagination.com
Re: Payloads and TrieRangeQuery
On Jun 13, 2009, at 8:58 AM, Michael McCandless wrote: OK, good points Grant. I now agree that it's not a simple task, moving core stuff from Solr -> Lucene. So summing this all up: * Some feel Lucene should only aim to be the core "expert" engine used by Solr/Nutch/etc., so things like moving trie to core (with consumable naming, good defaults, etc.) are near zero priority. I agree on the engine part, but don't agree on the expert part. Many people who have their own frameworks and needs should be able to plug in Lucene and it should just work. Likewise, there is a huge install base that must be thought of. Still Solr/Nutch are the single largest users of Lucene and, wearing my PMC hat, I think it makes sense that we make it obvious for newbies coming in where their time is best spent. If someone shows up in Solr-land and just needs Lucene because they want to be next to the metal, we should tell them that. Likewise if they don't want to spend time doing warming, faceting, etc. they should just go use Solr. Also, I have no problem with Trie being in core. If someone wants to do it, go for it. That's how it all works anyway. Do-acracy in action. It's not a priority for me, but that shouldn't stop anyone else. While I see & agree that this is indeed what Solr needs of Lucene, I still think direct consumability of Lucene is important and Lucene should try to have a consumable API, good names for classes, methods, good defaults, etc. Agreed, although I think all the deprecation stuff severely limits Lucene's consumability. You, as a writer of LIA, know this first hand, and I also experience this first hand when doing Lucene training. As I've pointed out countless times lately, so much cruft builds up in Lucene by the time that we get to X.Y release (for Y > 2, as in 2.2) that consumability suffers greatly.
And I don't see those two goals as being in conflict (ie, I don't see Lucene having a consumable API as preventing Solr from using Lucene's advanced APIs), except for the fact that we all have limited time. * We have two communities. Each has its own goal (to make its product good), its own committers, etc. While technically we seem to agree certain things (function queries, NumberUtils, highlighters, analyzers, faceted nav, etc.) logically "belong" as Lucene modules, the logistics and work required and different requirements (both one time, and ongoing) are in fact sizable challenges/barriers. I take the "I know where to put it when I do it" approach, but as is obvious, not everyone has that luxury b/c they aren't committers on both projects. Integrating Tika into Solr was logical, while the DelimitedPayload stuff logically belonged in contrib/analyzers (to me anyway, and one of my primary motivations for that patch is to easily enable Payloads in Solr w/o having to modify how Solr works). Likewise, I think it makes sense for Solr's analyzers (WordDelimiter) to be in contrib/analyzers too, but I don't particularly think moving Solr's faceting stuff to Lucene is necessarily core to Lucene. As seems to be my theme lately, I take it on a "case-by-case" basis. Perhaps once Lucene "modularizes", in the future, such consolidation may be easier, ie if/once there are committers focused on "analyzers" I could see them helping out all around in pulling all analyzers together. * We all are obviously busy and there are more important things to work on than "shuffling stuff around". +1 -Grant
Re: Payloads and TrieRangeQuery
Of course consumability (good APIs) is important, but rational people can disagree when it comes to the specifics... many things come with tradeoffs. Even non-API changes have tradeoffs... the indexing improvements (err, total rewrite) made that code *much* harder to understand and debug. It's a net win since the indexing performance improvements were so fantastic. Deprecating an existing class that's been around for a while simply for the purposes of slightly improving the naming may not be worth it (it's destroying a little piece of the collective memory of all developers who have used Lucene). Avoiding having to specify the type of a sort... I'm skeptical about the benefits vs potentially decreased flexibility and increased code size and complexity - but it's all hypothetical at this point. But on a positive note, Lucene 2.9 looks like it's suddenly progressing fast enough that it's feasible to use it in Solr 1.4 (which I've lobbied for in Solr-land) - which seems like it would be a win for both Lucene and Solr. -Yonik http://www.lucidimagination.com
Re: Payloads and TrieRangeQuery
Very true write-up, Grant! On Sat, Jun 13, 2009 at 2:58 PM, Michael McCandless wrote: > OK, good points Grant. I now agree that it's not a simple task, > moving core stuff from Solr -> Lucene. So summing this all up: > > * Some feel Lucene should only aim to be the core "expert" engine > used by Solr/Nutch/etc., so things like moving trie to core (with > consumable naming, good defaults, etc.) are near zero priority. > > While I see & agree that this is indeed what Solr needs of Lucene, > I still think direct consumability of Lucene is important and > Lucene should try to have a consumable API, good names for classes, > methods, good defaults, etc. > > And I don't see those two goals as being in conflict (ie, I don't > see Lucene having a consumable API as preventing Solr from using > Lucene's advanced APIs), except for the fact that we all have > limited time. > > * We have two communities. Each has its own goal (to make its > product good), its own committers, etc. While technically we > seem to agree certain things (function queries, NumberUtils, > highlighters, analyzers, faceted nav, etc.) logically "belong" as > Lucene modules, the logistics and work required and different > requirements (both one time, and ongoing) are in fact sizable > challenges/barriers. I personally think that we should make it a requirement that for a move from Solr to Lucene there be a patch available for the integration into Solr. There is still a possibility that the solr code will change before the patch can be applied but it will still be easier for the solr team to integrate it. > > Perhaps once Lucene "modularizes", in the future, such > consolidation may be easier, ie if/once there are committers > focused on "analyzers" I could see them helping out all > around in pulling all analyzers together. For questionable Solr stuff there could still be space in contrib to make the great work from the solr community available to others not using solr directly.
This way it would also be possible to give solr committers write access to those contrib modules which in turn simplifies the integration back to solr. I think it's great to offer fine grained modules of solr features as contribs of lucene while keeping the core "clean". Just an idea... > > * We all are obviously busy and there are more important things to > work on than "shuffling stuff around". > > So now I'm off to scrutinize LUCENE-1313... :) > > Mike > > On Fri, Jun 12, 2009 at 5:33 PM, Grant Ingersoll wrote: >> >> On Jun 12, 2009, at 12:20 PM, Michael McCandless wrote: >> >>> On Thu, Jun 11, 2009 at 4:58 PM, Yonik Seeley >>> wrote: >>> In Solr land we can quickly hack something together, spend some time thinking about the external HTTP interface, and immediately make it available to users (those using nightlies at least). It would be a huge burden to say to Solr that anything of interest to the Lucene community should be pulled out into a module that Solr should then use. >>> >>> Sure, new and exciting things should still stay private to Solr... >>> As a separate project, Solr is (and should be) free to follow what's in its own best interest. >>> >>> Of course! >>> >>> I see your point, that moving things down into Lucene is added cost: >>> we have to get consensus that it's a good thing to move (but should >>> not be hard for many things), do all the mechanics to "transplant" the >>> code, take Lucene's "different" requirements into account (that the >>> consumability & stability of the Java API is important), etc. >> >> The problem traditionally has been that people only do the work one way. >> That is, they take it from Solr, but then they never submit patches to Solr >> to use the version in Lucene. And, since many of the Lucene committers are >> not Solr committers, even if they do the Solr work, they can't see it >> through.
>> >> It seems all the pure Lucene devs want the functionality of Solr, but they >> don't want to do any of the work to remove the duplication from Solr. >> Additionally, it is often the case that by the time it gets into Lucene, >> some Solr user has come along and improved the Solr version. The Function >> stuff is example numero uno. >> >> Wearing my PMC hat, I'd say if people are going to be moving stuff around >> like this, then they better be keeping Solr up to date, too, because it is >> otherwise creating a lot of work for Solr to the detriment of it (because >> that time could be spent doing other things). Still, I don't think that is >> all that worthwhile, as it will just create a ton of extra work. People who >> want Solr stuff are free to pull what they need into their project. There >> is absolutely nothing stopping them. >> >> And the fact is, that no matter how much is pulled out of Solr, people will >> still contribute things to Solr because it is its own community and is >> fairly autonomous, a f
Re: Payloads and TrieRangeQuery
OK, good points Grant. I now agree that it's not a simple task, moving core stuff from Solr -> Lucene. So summing this all up: * Some feel Lucene should only aim to be the core "expert" engine used by Solr/Nutch/etc., so things like moving trie to core (with consumable naming, good defaults, etc.) are near zero priority. While I see & agree that this is indeed what Solr needs of Lucene, I still think direct consumability of Lucene is important and Lucene should try to have a consumable API, good names for classes, methods, good defaults, etc. And I don't see those two goals as being in conflict (ie, I don't see Lucene having a consumable API as preventing Solr from using Lucene's advanced APIs), except for the fact that we all have limited time. * We have two communities. Each has its own goal (to make its product good), its own committers, etc. While technically we seem to agree certain things (function queries, NumberUtils, highlighters, analyzers, faceted nav, etc.) logically "belong" as Lucene modules, the logistics and work required and different requirements (both one time, and ongoing) are in fact sizable challenges/barriers. Perhaps once Lucene "modularizes", in the future, such consolidation may be easier, ie if/once there are committers focused on "analyzers" I could see them helping out all around in pulling all analyzers together. * We all are obviously busy and there are more important things to work on than "shuffling stuff around". So now I'm off to scrutinize LUCENE-1313... :) Mike On Fri, Jun 12, 2009 at 5:33 PM, Grant Ingersoll wrote: > > On Jun 12, 2009, at 12:20 PM, Michael McCandless wrote: > >> On Thu, Jun 11, 2009 at 4:58 PM, Yonik Seeley >> wrote: >> >>> In Solr land we can quickly hack something together, spend some time >>> thinking about the external HTTP interface, and immediately make it >>> available to users (those using nightlies at least).
It would be a >>> huge burden to say to Solr that anything of interest to the Lucene >>> community should be pulled out into a module that Solr should then >>> use. >> >> Sure, new and exciting things should still stay private to Solr... >> >>> As a separate project, Solr is (and should be) free to follow >>> what's in its own best interest. >> >> Of course! >> >> I see your point, that moving things down into Lucene is added cost: >> we have to get consensus that it's a good thing to move (but should >> not be hard for many things), do all the mechanics to "transplant" the >> code, take Lucene's "different" requirements into account (that the >> consumability & stability of the Java API is important), etc. > > The problem traditionally has been that people only do the work one way. > That is, they take it from Solr, but then they never submit patches to Solr > to use the version in Lucene. And, since many of the Lucene committers are > not Solr committers, even if they do the Solr work, they can't see it > through. > > It seems all the pure Lucene devs want the functionality of Solr, but they > don't want to do any of the work to remove the duplication from Solr. > Additionally, it is often the case that by the time it gets into Lucene, > some Solr user has come along and improved the Solr version. The Function > stuff is example numero uno. > > Wearing my PMC hat, I'd say if people are going to be moving stuff around > like this, then they better be keeping Solr up to date, too, because it is > otherwise creating a lot of work for Solr to the detriment of it (because > that time could be spent doing other things). Still, I don't think that is > all that worthwhile, as it will just create a ton of extra work. People who > want Solr stuff are free to pull what they need into their project. There > is absolutely nothing stopping them.
> > And the fact is, that no matter how much is pulled out of Solr, people will > still contribute things to Solr because it is its own community and is > fairly autonomous, a few committers that cross over notwithstanding. I'd > venture a fair number of Solr committers know little about Lucene internals. > Heck, given the amount of work you do, Mike, I'd say a fair number of > Lucene committers know very little about the internals of Lucene anymore. > It has been good to see you over in Solr land at least watching what is > going on there to at least help coordinate when Solr finds Lucene errors. > > >> >> But, there is a huge benefit to having it in Lucene: you get a wider >> community involved to help further improve it, you make Lucene >> stronger which improves its & Solr's adoption, etc. >> > > That is not always the case. Pushing things into Lucene from Solr makes it > harder for Solr committers to do their work, unless you are proposing that > all Solr committers should be Lucene committers. > > As for adoption, most people probably should just be starting with Solr > anyway. The
Re: Payloads and TrieRangeQuery
On Jun 12, 2009, at 12:20 PM, Michael McCandless wrote: On Thu, Jun 11, 2009 at 4:58 PM, Yonik Seeley wrote: In Solr land we can quickly hack something together, spend some time thinking about the external HTTP interface, and immediately make it available to users (those using nightlies at least). It would be a huge burden to say to Solr that anything of interest to the Lucene community should be pulled out into a module that Solr should then use. Sure, new and exciting things should still stay private to Solr... As a separate project, Solr is (and should be) free to follow what's in its own best interest. Of course! I see your point, that moving things down into Lucene is added cost: we have to get consensus that it's a good thing to move (but should not be hard for many things), do all the mechanics to "transplant" the code, take Lucene's "different" requirements into account (that the consumability & stability of the Java API is important), etc. The problem traditionally has been that people only do the work one way. That is, they take it from Solr, but then they never submit patches to Solr to use the version in Lucene. And, since many of the Lucene committers are not Solr committers, even if they do the Solr work, they can't see it through. It seems all the pure Lucene devs want the functionality of Solr, but they don't want to do any of the work to remove the duplication from Solr. Additionally, it is often the case that by the time it gets into Lucene, some Solr user has come along and improved the Solr version. The Function stuff is example numero uno. Wearing my PMC hat, I'd say if people are going to be moving stuff around like this, then they better be keeping Solr up to date, too, because it is otherwise creating a lot of work for Solr to the detriment of it (because that time could be spent doing other things). Still, I don't think that is all that worthwhile, as it will just create a ton of extra work.
People who want Solr stuff are free to pull what they need into their project. There is absolutely nothing stopping them. And the fact is, that no matter how much is pulled out of Solr, people will still contribute things to Solr because it is its own community and is fairly autonomous, a few committers that cross over notwithstanding. I'd venture a fair number of Solr committers know little about Lucene internals. Heck, given the amount of work you do, Mike, I'd say a fair number of Lucene committers know very little about the internals of Lucene anymore. It has been good to see you over in Solr land at least watching what is going on there to at least help coordinate when Solr finds Lucene errors. But, there is a huge benefit to having it in Lucene: you get a wider community involved to help further improve it, you make Lucene stronger which improves its & Solr's adoption, etc. That is not always the case. Pushing things into Lucene from Solr makes it harder for Solr committers to do their work, unless you are proposing that all Solr committers should be Lucene committers. As for adoption, most people probably should just be starting with Solr anyway. The fact is that every Lucene committer to a tee will tell you that they have built something that more or less looks like Solr. Lucene is great as a low-level Vector Space implementation with some nice contribs, but much of the interesting stuff in search these days happens at the layer up (and arguably even a layer above that in terms of UI and intelligent search, etc). In Lucene PMC land, that area is Solr and Nutch. My personal opinion is that Lucene should focus on being a really fast, core search library and that the outlet for the higher level stuff is in Solr and Nutch.
It is usually obvious when things belong in the core, because people bring them up in the appropriate place (there are some rare exceptions that you have mentioned). -Grant
Re: Payloads and TrieRangeQuery
On Thu, Jun 11, 2009 at 4:58 PM, Yonik Seeley wrote: > In Solr land we can quickly hack something together, spend some time > thinking about the external HTTP interface, and immediately make it > available to users (those using nightlies at least). It would be a > huge burden to say to Solr that anything of interest to the Lucene > community should be pulled out into a module that Solr should then > use. Sure, new and exciting things should still stay private to Solr... > As a separate project, Solr is (and should be) free to follow > what's in it's own best interest. Of course! I see your point, that moving things down into Lucene is added cost: we have to get consensus that it's a good thing to move (but should not be hard for many things), do all the mechanics to "transplant" the code, take Lucene's "different" requirements into account (that the consumability & stability of the Java API is important), etc. But, there is a huge benefit to having it in Lucene: you get a wider community involved to help further improve it, you make Lucene stronger which improves its & Solr's adoption, etc. What's good for Lucene is good for Solr. Eg why hasn't NumberUtils been folded into Lucene, aeons ago? I realize it's not the perfect solution (and trie* seems to be better), but it's certainly better than the "nothing" we've had for a long time... Why not the custom fragmenters (Gap, Regex) that Solr has to improve highlighting? EG it looks like Solr can approximately produce sentences as fragments. This would be a great addition to Lucene's highlighter. (NOTE: I fully realize that a large number of things do get moved from Solr to Lucene, over time, and that's great; I'm saying we should very much keep that up). But of course we are as usual resource starved... >> For example, Solr would presumably prefer that trie* remain in contrib? > > From a capabilities perspective, it doesn't matter much if it's in > contrib or core I think. 
It's a small amount of work to adapt to > class name changes, but nothing to complain about. > > But it doesn't seem like Trie should be treated specially somehow... Trie is *very* useful. It plugs a serious weakness in Lucene (ootb handling of numeric fields). The things one must now do to have a numeric field work "properly" are crazy. Trie makes Lucene more useful & consumable; it's a powerful feature. It should be treated specially. But: I certainly see your point, that Solr could care less about such consumability, and leaving trie in contrib would be just fine (from Solr's standpoint). > seems to go down a path that makes customer provided filters > second-class citizens. I hate it when Java does stuff like that to me > (their provided classes can do more than mine). I don't really see the connection here. If we make trie* the default for handling of numeric fields in Lucene, how does that hurt customer provided filters? >> There's a single set of Solr developers, but a very wide range of >> direct Lucene users. I don't see how Lucene having good consumability >> actually makes Solr's life harder. Those raw APIs would still be >> accessible to Solr... simple things should be simple (direct Lucene >> users) and complex things should be possible (Solr). > > But with changes come deprecations - forced changes when the > deprecations are removed. Sometimes those are easy to adapt to, > sometimes not. If those required changes don't actually add any > functionality to Solr, it's a net negative if you're looking at it > from Solr's point of view. That doesn't mean Lucene shouldn't - and > I've not complained to Lucene in the past because it wasn't Lucene's > responsibility. Right, so this is why the move of trie from contrib -> core is net negative for Solr (things work fine now, and it only creates work for you). But, if it improves Lucene's adoption, because Lucene is more consumable, that then becomes a positive for Solr. 
And BTW you should "complain" to Lucene more if we're doing things that are not Solr friendly. Honestly, if anything, we don't hear enough from you ;) Such complaints will presumably often match this one ("Solr wants a raw engine; Lucene wants consumability") and we'll just have to agree to disagree. But other times I'm sure we'd get something net/net good out of the resulting discussion. >>> and taking it out of Solr's release cycle and easy ability to change - >>> if Solr needs to make a change to one of the moved classes, it's >>> necessary to get it through the Lucene change process and then upgrade >>> to the latest Lucene trunk - all or nothing. >> >> "Getting through Lucene's change process" should be real simple for >> you all :) > > Y'r kidding, right? ;-) > It's sometimes hard enough to get stuff through either community, let > alone both. Actually I wasn't kidding! Sure there's the "normal" open-source challenges -- getting someone's attention, the mechanics of making a patch & iterating, sometimes massive unrelated d
Re: Payloads and TrieRangeQuery
In Solr land we can quickly hack something together, spend some time thinking about the external HTTP interface, and immediately make it available to users (those using nightlies at least). It would be a huge burden to say to Solr that anything of interest to the Lucene community should be pulled out into a module that Solr should then use. As a separate project, Solr is (and should be) free to follow what's in its own best interest. [...] > For example, Solr would presumably prefer that trie* remain in contrib? From a capabilities perspective, it doesn't matter much if it's in contrib or core I think. It's a small amount of work to adapt to class name changes, but nothing to complain about. But it doesn't seem like Trie should be treated specially somehow... seems to go down a path that makes customer provided filters second-class citizens. I hate it when Java does stuff like that to me (their provided classes can do more than mine). > There's a single set of Solr developers, but a very wide range of > direct Lucene users. I don't see how Lucene having good consumability > actually makes Solr's life harder. Those raw APIs would still be > accessible to Solr... simple things should be simple (direct Lucene > users) and complex things should be possible (Solr). But with changes come deprecations - forced changes when the deprecations are removed. Sometimes those are easy to adapt to, sometimes not. If those required changes don't actually add any functionality to Solr, it's a net negative if you're looking at it from Solr's point of view. That doesn't mean Lucene shouldn't - and I've not complained to Lucene in the past because it wasn't Lucene's responsibility. [...] >> and taking it out of Solr's release cycle and easy ability to change - >> if Solr needs to make a change to one of the moved classes, it's >> necessary to get it through the Lucene change process and then upgrade >> to the latest Lucene trunk - all or nothing.
> > "Getting through Lucene's change process" should be real simple for > you all :) Y'r kidding, right? ;-) It's sometimes hard enough to get stuff through either community, let alone both. For code that's in Solr, we only have to worry about Solr's concerns, not about all users of Lucene. Big difference. > And, Solr upgrades Lucene's JAR fairly often already? And a lot of our users don't like it. It's also become much more difficult due to all the Lucene changes lately. It's something we should be doing less of, not more of unless we formally merge the projects or something ;-) > NumericField would enforce one token, during indexing. That only changes the point at which the user realizes "uh, this won't work", and it's still at a point after they've written their code. Checking like this doesn't even feel like it belongs at the indexing level. -Yonik http://www.lucidimagination.com
Re: Payloads and TrieRangeQuery
On Thu, Jun 11, 2009 at 9:20 AM, Uwe Schindler wrote: > In my opinion, solr and lucene should exchange technology much more. Solr > should concentrate on the "search server" and lucene should provide the > technology. +1 > All additional implementations inside solr like faceting and so > on, could also live in lucene. I would have nice usages for it (I do not use > Solr, but have my own Lucene framework, that I cannot give up because of > various reasons). But e.g. Solr's faceting, Solr's analyzers and so on > would be great in lucene as "modules". +1 > I get a lot of questions all the > time, how to do this and that, because the people don't understand why they > must first map the float to an int. If they do it, the next question is: > "why does this work, I do not want to lose precision" and so on. I will do > it with TrieTokenStream Exactly! Those are new users struggling with trie... Lucene's consumability is important. > In my opinion, the classes should stay Trie* and not Numeric*. Maybe we have > different implementations for numeric Range queries in future. In this case, > I think Yonik is right. The class name should describe how it works if there > may be different and incompatible implementations in future. But... we don't name our other classes according to how they're implemented? EG, JavaCCQueryParser or JFlexQueryParser, or writeSingleDocSegmentInRAM (instead of addDocument, pre-2.3), or bufferPendingDelete (instead of deleteDocument), or RangeQueryAsBooleanQuery, RangeQueryAsConstantScoreFilter, SortUsingFieldCache, etc. I absolutely love "trie", as a developer, but the average user won't know what the trie data structure is. TrieRangeQuery is less consumable than NumericRangeQuery (or, simply RangeQuery). As a user I know Lucene is doing all kinds of cool stuff under the hood to get its job done... but those cool things shouldn't be in the names. The name should describe what it does, not how...
Mike
Re: Payloads and TrieRangeQuery
On Thu, Jun 11, 2009 at 8:46 AM, Yonik Seeley wrote: >>> Really goes into Solr land... my pref for Lucene is to remain a core >>> expert-level full-text search library and keep out things that are >>> easy to do in an application or at another level. >> >> I think this must be the crux of our disagreement. > > Indeed. The itch to scratch w.r.t Solr in Lucene is increased core > functionality, not more magic (that duplicates what Solr already does, > but just in a different way and thus makes the lives of Solr > developers harder). But... Solr's needs are very different from direct users of Lucene. I completely agree that Solr needs & wants only the low-level APIs in Lucene, a raw engine, that doesn't bother with good defaults, consumability, etc. Just the raw stuff. If Lucene existed only for Solr, we'd be done here. But Lucene is used by many direct users, and those users benefit from good defaults & consumability. For example, Solr would presumably prefer that trie* remain in contrib? There's a single set of Solr developers, but a very wide range of direct Lucene users. I don't see how Lucene having good consumability actually makes Solr's life harder. Those raw APIs would still be accessible to Solr... simple things should be simple (direct Lucene users) and complex things should be possible (Solr). BTW, I don't mean to "pick" on trie*; I think there are many other examples where we could improve Lucene's consumability. EG, for highlighter, you should pretty much always use its SpanScorer; yet, it's completely non-obvious (even having read the javadocs) how to do so. Why isn't this the default scorer for Highlighter? Such a situation doesn't affect Solr: you all are experts on all aspects of Lucene, and you can figure it out. But your average user will do the obvious thing but then notice highlighting for phrase searches is buggy, and conclude Lucene is buggy and go and prefer the other search engine they are testing. It's a trap. 
Lucene's consumability is important. > If we asked on java-user about people's priorities/wishes, I bet > column stride fields, near real time indexing, and better performance > would dominate stuff like not having to specify how to sort a field. I think all of the above are important :) >> I feel, instead, that Lucene should stand on its own, as a useful >> search library, with a consumable API, good defaults, etc. Lucene is >> more than "the expert level search API that's embedded in >> Solr". Lucene is consumed directly by apps other than Solr. >> >> In fact, I think there are many things in Solr that naturally belong >> in Lucene (and over time we've been gradually slurping them down). >> The line/criteria has always been rather blurry... > > And conversely, Solr isn't just a wrapper around Lucene and an > incubator for Lucene technology. Of course not: there's lots of good stuff in Solr that should stay in Solr. But eg the neat analyzers/tokenizers, search filters, faceted nav, custom collectors, function queries (now diverged), CharFilter (in progress), improvements to highlighter, etc., should really all be in Lucene instead (as "modules")? > Ask Lucene users if they would like pretty much any substantial piece > of functionality in Solr moved to Lucene as a module and you'll > probably get an affirmative answer. But moving something from Solr to > Lucene can have a lot of negative effects for Solr, including taking > it out of the hands of Solr committers who aren't Lucene committers, We should simply make such Solr committers Lucene committers, if they are indeed working on stuff that should be in Lucene? > and taking it out of Solr's release cycle and easy ability to change - > if Solr needs to make a change to one of the moved classes, it's > necessary to get it through the Lucene change process and then upgrade > to the latest Lucene trunk - all or nothing. 
"Getting through Lucene's change process" should be real simple for you all :) And, Solr upgrades Lucene's JAR fairly often already? > It's also the case that the goals of Lucene classes and Solr classes > are often very different. Lucene is more concerned with Java APIs (as > should be the case), while they are a bit more secondary in Solr... > the external APIs are of primary importance and one doesn't worry as > much (or at all) about the classes implementing that interface or it's > Java API back compatibility (as a generalization... it depends on the > class). Different, yes, but not incompatible? >> In Lucene, we should be able to add a NumericField to a document, >> index it, and then create RangeFilter or Sort on that field and have >> things "just work". > > That feels like a false sense of simplicity, and Lucene isn't for > dummies ;-) One needs to understand how things work under the hood to > avoid shooting oneself in the foot. You need to understand the memory > implications of sorting on different fields, and you need to > understand that to sort on a text field
RE: Payloads and TrieRangeQuery
From: Michael McCandless [mailto:luc...@mikemccandless.com] > On Wed, Jun 10, 2009 at 6:07 PM, Yonik Seeley > wrote: > > > Really goes into Solr land... my pref for Lucene is to remain a core > > expert-level full-text search library and keep out things that are > > easy to do in an application or at another level. > > I think this must be the crux of our disagreement. > > I feel, instead, that Lucene should stand on its own, as a useful > search library, with a consumable API, good defaults, etc. Lucene is > more than "the expert level search API that's embedded in > Solr". Lucene is consumed directly by apps other than Solr. This is my opinion, too. > In fact, I think there are many things in Solr that naturally belong > in Lucene (and over time we've been gradually slurping them down). > The line/criteria has always been rather blurry... There is currently also some overlap, like function queries, which are implemented in both projects in different versions: they were moved to Lucene in the past, but they are still alive in Solr. There is also some overlap in the Analyzers. I would really like to have a very generic, configurable analyzer in Lucene, so I do not need to create a new subclass for each analyzer. For other projects (like my panFMP), it would be really great to just create an Analyzer instance and add some filters into a list with addFilter or something like that. In my opinion, Solr and Lucene should exchange technology much more. Solr should concentrate on the "search server" and Lucene should provide the technology. All additional implementations inside Solr, like faceting and so on, could also live in Lucene. I would have nice usages for it (I do not use Solr, but have my own Lucene framework, which I cannot give up for various reasons). But e.g. Solr's faceting, Solr's analyzers and so on would be great in Lucene as "modules". 
> In Lucene, we should be able to add a NumericField to a document, > index it, and then create RangeFilter or Sort on that field and have > things "just work". > > Unfortunately, we are far from being able to do so today. We can (and > should, for 2.9) make baby steps towards this, by absorbing trie* into > core, but you're still gonna have to "just know" to make the specific > FieldCache parser and specific NumericRangeQuery to match how you > indexed. It's awkward and not consumable, but I think it'll just have to > do for now... I will have time at the weekend (currently I am working hard for PANGAEA, not related to Lucene at the moment) and will create a core trie-range as defined in LUCENE-1673. The usage pattern will be similar to contrib's now, only with a revised API, making it simpler to instantiate the TokenStreams and RangeQueries with the different data types. (I get a lot of questions all the time about how to do this and that, because people don't understand why they must first map the float to an int. If they do it, the next question is: "why does this work, I do not want to lose precision" and so on. I will do it with TrieTokenStream.) In my opinion, the classes should stay Trie* and not Numeric*. Maybe we will have different implementations for numeric range queries in the future. In this case, I think Yonik is right: the class name should describe how it works if there may be different and incompatible implementations in the future. Uwe
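The float-to-int mapping that confuses Uwe's users can be made concrete with a small sketch. This is a hedged illustration of the well-known order-preserving transform (the class and method names here are mine, not necessarily the exact contrib TrieUtils API): positive floats keep their IEEE 754 bit pattern, while for negative floats the 31 magnitude bits are flipped so that "more negative" sorts lower.

```java
public class SortableFloat {
    // Order-preserving float -> int mapping: comparing the resulting ints
    // as signed values gives the same ordering as comparing the floats.
    // No precision is lost; the mapping is a bijection on bit patterns.
    public static int floatToSortableInt(float f) {
        int bits = Float.floatToIntBits(f);
        // For negative floats (sign bit set, bits >> 31 == -1) flip the
        // 31 magnitude bits; positive floats are left unchanged.
        return bits ^ ((bits >> 31) & 0x7fffffff);
    }

    public static void main(String[] args) {
        float[] values = {-2.5f, -1.0f, -0.0f, 0.0f, 1.0f, 2.5f};
        for (int i = 1; i < values.length; i++) {
            if (floatToSortableInt(values[i - 1]) > floatToSortableInt(values[i]))
                throw new AssertionError("ordering broken at index " + i);
        }
        System.out.println("order preserved");
    }
}
```

This is why the "why must I map the float to an int?" question has a good answer: the int is just a sortable re-encoding of the same 32 bits.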
Re: Payloads and TrieRangeQuery
On Thu, Jun 11, 2009 at 7:01 AM, Michael McCandless wrote: > On Wed, Jun 10, 2009 at 6:07 PM, Yonik Seeley > wrote: > >> Really goes into Solr land... my pref for Lucene is to remain a core >> expert-level full-text search library and keep out things that are >> easy to do in an application or at another level. > > I think this must be the crux of our disagreement. Indeed. The itch to scratch w.r.t Solr in Lucene is increased core functionality, not more magic (that duplicates what Solr already does, but just in a different way and thus makes the lives of Solr developers harder). If we asked on java-user about people's priorities/wishes, I bet column stride fields, near real time indexing, and better performance would dominate stuff like not having to specify how to sort a field. > I feel, instead, that Lucene should stand on its own, as a useful > search library, with a consumable API, good defaults, etc. Lucene is > more than "the expert level search API that's embedded in > Solr". Lucene is consumed directly by apps other than Solr. > > In fact, I think there are many things in Solr that naturally belong > in Lucene (and over time we've been gradually slurping them down). > The line/criteria has always been rather blurry... And conversely, Solr isn't just a wrapper around Lucene and an incubator for Lucene technology. Ask Lucene users if they would like pretty much any substantial piece of functionality in Solr moved to Lucene as a module and you'll probably get an affirmative answer. But moving something from Solr to Lucene can have a lot of negative effects for Solr, including taking it out of the hands of Solr committers who aren't Lucene committers, and taking it out of Solr's release cycle and easy ability to change - if Solr needs to make a change to one of the moved classes, it's necessary to get it through the Lucene change process and then upgrade to the latest Lucene trunk - all or nothing. 
It's also the case that the goals of Lucene classes and Solr classes are often very different. Lucene is more concerned with Java APIs (as should be the case), while they are a bit more secondary in Solr... the external APIs are of primary importance and one doesn't worry as much (or at all) about the classes implementing that interface or its Java API back compatibility (as a generalization... it depends on the class). > In Lucene, we should be able to add a NumericField to a document, > index it, and then create RangeFilter or Sort on that field and have > things "just work". That feels like a false sense of simplicity, and Lucene isn't for dummies ;-) One needs to understand how things work under the hood to avoid shooting oneself in the foot. You need to understand the memory implications of sorting on different fields, and you need to understand that to sort on a text field, there really needs to be just one token per field. You need to understand the way Trie is indexed, and that multiple values per field won't work if you use a precision step less than the word size. There have been a lot of bad design decisions (I'm talking software development in general) due to citing "the user will be confused". Often, this hypothetical user doesn't exist (or is an extreme minority), and hence I prefer things of the form "I think this is confusing". Extra magic isn't always a good thing. -Yonik http://www.lucidimagination.com
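The precision-step point above can be illustrated with a toy sketch of how trie indexing emits one term per precision level (the term syntax here is invented for readability; the real contrib encoding is prefix-coded binary, so treat this only as a model of the structure):

```java
import java.util.ArrayList;
import java.util.List;

public class TrieTermsSketch {
    // For a 64-bit value, emit one term per precision level: the full
    // value at shift 0, then progressively coarser prefixes. A range
    // query can then cover a large interval with a few coarse terms plus
    // a few fine-grained terms at the interval's edges.
    public static List<String> terms(long value, int precisionStep) {
        List<String> out = new ArrayList<>();
        for (int shift = 0; shift < 64; shift += precisionStep) {
            out.add(shift + ":" + Long.toHexString(value >>> shift));
        }
        return out;
    }

    public static void main(String[] args) {
        // precisionStep 16 -> 4 terms per value (shifts 0, 16, 32, 48)
        System.out.println(terms(0x123456789ABCDEF0L, 16));
    }
}
```

This also shows why multiple values per field interact badly with small precision steps: every value adds terms at every level, and the coarse levels of different values can collide or interleave in ways the range rewrite must account for.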
Re: Payloads and TrieRangeQuery
On Wed, Jun 10, 2009 at 6:07 PM, Yonik Seeley wrote: > Really goes into Solr land... my pref for Lucene is to remain a core > expert-level full-text search library and keep out things that are > easy to do in an application or at another level. I think this must be the crux of our disagreement. I feel, instead, that Lucene should stand on its own, as a useful search library, with a consumable API, good defaults, etc. Lucene is more than "the expert level search API that's embedded in Solr". Lucene is consumed directly by apps other than Solr. In fact, I think there are many things in Solr that naturally belong in Lucene (and over time we've been gradually slurping them down). The line/criteria has always been rather blurry... In Lucene, we should be able to add a NumericField to a document, index it, and then create RangeFilter or Sort on that field and have things "just work". Unfortunately, we are far from being able to do so today. We can (and should, for 2.9) make baby steps towards this, by absorbing trie* into core, but you're still gonna have to "just know" to make the specific FieldCache parser and specific NumericRangeQuery to match how you indexed. It's awkward and not consumable, but I think it'll just have to do for now... Mike
Re: Payloads and TrieRangeQuery
On Wed, Jun 10, 2009 at 5:45 PM, Michael McCandless wrote: > But, I realize this is a stretch... eg we'd have to fix rewrite to be > per-segment, which certainly seems spooky. A top-level schema would > definitely be cleaner. Really goes into Solr land... my pref for Lucene is to remain a core expert-level full-text search library and keep out things that are easy to do in an application or at another level. Having to specify what type of field you are sorting on really doesn't seem like a hardship. -Yonik http://www.lucidimagination.com
RE: Payloads and TrieRangeQuery
> > Another question not so simple to answer: When embedding these > TermPositions > > into the whole process, how would this work with MultiTermQuery? > > There's no reason why Trie has to use MultiTermQuery, right? No, but it is elegant and simplifies things a lot (see current code in trunk). Uwe
RE: Payloads and TrieRangeQuery
> I think we'd need richer communication between MTQ and its subclasses, > so that eg your enum would return a Query instead of a Term? > > Then you'd either return a TermQuery, or, a BooleanQuery that's > filtering the TermQuery? > > But yes, doing after 3.0 seems good! There is one other thing that needs to wait for 3.x: If you then want to sort against such a field or use the trie values for function queries in a field cache, we can have a really fast numeric UninverterValueSource, because there are fewer terms, each with many documents. The value to store in the cache is only (prefixCodedToLong(term) << positionBits | termPosition) - cool. Would be really fast! (But for that we need the new field cache stuff.) ...now going to sleep with many ideas buzzing around. Uwe
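Uwe's cache value `(prefixCodedToLong(term) << positionBits | termPosition)` is plain bit packing: the coarse value comes from the (fewer, shared) terms, and the low-order bits come back from each document's term position. A hedged sketch with invented names (there is no real `PackedTrieValue` class in Lucene):

```java
public class PackedTrieValue {
    // Recombine the coarse prefix value (from the term) with the fine
    // low-order bits (from the term position) into one long, as Uwe
    // describes for a fast uninverted field cache. positionBits is the
    // number of low-order bits stored in the position instead of the term.
    public static long pack(long prefixValue, int termPosition, int positionBits) {
        return (prefixValue << positionBits)
             | (termPosition & ((1L << positionBits) - 1));
    }

    public static void main(String[] args) {
        int positionBits = 8;
        long original = 0xCAFEBABEL;
        long prefix = original >>> positionBits;    // what the term encodes
        int position = (int) (original & 0xFF);     // what the position encodes
        // Packing recovers the full original value.
        System.out.println(pack(prefix, position, positionBits) == original);
    }
}
```

Because one coarse term covers many documents, the uninverter walks far fewer terms than a full-precision index would need, which is the speedup Uwe is after.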
Re: Payloads and TrieRangeQuery
> Another question not so simple to answer: When embedding these TermPositions > into the whole process, how would this work with MultiTermQuery? There's no reason why Trie has to use MultiTermQuery, right? -Yonik http://www.lucidimagination.com
Re: Payloads and TrieRangeQuery
On Wed, Jun 10, 2009 at 5:24 PM, Yonik Seeley wrote: > On Wed, Jun 10, 2009 at 5:03 PM, Michael McCandless > wrote: >> On Wed, Jun 10, 2009 at 4:04 PM, Earwin Burrfoot wrote: >> * Was the field even indexed w/ Trie, or indexed as "simple text"? > > Why the special treatment for Trie? So that at search time things default properly. Ie, RangeFilter would rewrite to the right impl (if we made a single RangeFilter that handled both term & numeric ranges), and sorting could pick the right parser. Ie, ideally one simply adds NumericField to their Document, indexes it, and then range filtering & sorting "just work". It's confusing now the separate steps you must go through to use trie, because Lucene doesn't remember that you indexed with trie. But, I realize this is a stretch... eg we'd have to fix rewrite to be per-segment, which certainly seems spooky. A top-level schema would definitely be cleaner. >> * We have a bug (or an important improvement) in how Trie encodes >>terms that we need to fix. This one is not easy to handle, since >>such a change could alter the term order, and merging segments >>then becomes problematic. Not sure how to handle that. Yonik, >>has Solr ever had to make a change to NumberUtils? > > Nope. If we needed to, we would make a new field type so that > existing schemas/indexes would continue to work. OK seems like Lucene should take the same approach. Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Payloads and TrieRangeQuery
I think we'd need richer communication between MTQ and its subclasses, so that eg your enum would return a Query instead of a Term? Then you'd either return a TermQuery, or, a BooleanQuery that's filtering the TermQuery? But yes, doing after 3.0 seems good! Mike On Wed, Jun 10, 2009 at 5:26 PM, Uwe Schindler wrote: >> I would like to go forward with moving the classes into the right packages >> and optimize the way, how queries and analyzers are created (only one >> class >> for each). The idea from LUCENE-1673 to use static factories to create >> these >> classes for the different data types seems to be more elegant and simplier >> to maintain than the current way (having a class for each bit size). >> >> So I think I will start with 1673 and try to present something useable, >> soon >> (but without payloads, so the payload/position-bits setting is "0"). > > Another question not so simple to answer: When embedding these TermPositions > into the whole process, how would this work with MultiTermQuery? The current > algorithm is simple: The TrieRangeTermEnum simply enumerates the possible > terms from the index and MTQ creates the BitSet or a BooleanQuery of > TermQueries. How to do this with positions? In both cases there need > specialities (the TermEnum must return that the actual term is a > payload/position one and must filter using TermPositions). For the filter > its then easy, the TermQueries added to BooleanQuery in the other case must > also use the payloads. Questions & more questions. > > I tend to release TrieRange with 2.9 without Positions/Payloads. > > Uwe > > > - > To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-dev-h...@lucene.apache.org > > - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Payloads and TrieRangeQuery
> * Was the field even indexed w/ Trie, or indexed as "simple text"? > It's useful to know this "automatically" at search time, so eg a > RangeQuery can do the right thing by default. FieldInfos seems > like the natural place to store this. It's basically Lucene's > per-segment write-once schema. Eg we use this to record "did any > token in this field have a Payload?", which is analogous. This should really be in a schema of some kind (like in my project, for instance). Why do you do autodetection for tries, but recently removed it for FieldCache? Things should be consistent: either store all settings in the index (and die in the process), or don't store them there at all. > * We have a bug (or an important improvement) in how Trie encodes > terms that we need to fix. This one is not easy to handle, since > such a change could alter the term order, and merging segments > then becomes problematic. Not sure how to handle that. Yonik, > has Solr ever had to make a change to NumberUtils? There are cases when reindexing is inevitable. What's so horrible about it anyway? Even if you have a humongous index, you can rebuild it in a matter of days, and you don't do this often. -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785
RE: Payloads and TrieRangeQuery
> I would like to go forward with moving the classes into the right packages > and optimize the way, how queries and analyzers are created (only one > class > for each). The idea from LUCENE-1673 to use static factories to create > these > classes for the different data types seems to be more elegant and simpler > to maintain than the current way (having a class for each bit size). > > So I think I will start with 1673 and try to present something usable, > soon > (but without payloads, so the payload/position-bits setting is "0"). Another question not so simple to answer: When embedding these TermPositions into the whole process, how would this work with MultiTermQuery? The current algorithm is simple: the TrieRangeTermEnum simply enumerates the possible terms from the index, and MTQ creates the BitSet or a BooleanQuery of TermQueries. How to do this with positions? In both cases special handling is needed (the TermEnum must indicate that the actual term is a payload/position one and must filter using TermPositions). For the filter it's then easy; in the other case, the TermQueries added to the BooleanQuery must also use the payloads. Questions & more questions. I tend to release TrieRange with 2.9 without Positions/Payloads. Uwe
Re: Payloads and TrieRangeQuery
On Wed, Jun 10, 2009 at 5:03 PM, Michael McCandless wrote: > On Wed, Jun 10, 2009 at 4:04 PM, Earwin Burrfoot wrote: > * Was the field even indexed w/ Trie, or indexed as "simple text"? Why the special treatment for Trie? > It's useful to know this "automatically" at search time, so eg a > RangeQuery can do the right thing by default. FieldInfos seems > like the natural place to store this. It's basically Lucene's > per-segment write-once schema. Eg we use this to record "did any > token in this field have a Payload?", which is analogous. It doesn't seem analogous to me. Trie is just another implementation for numerics with its own tradeoffs. > * We have a bug (or an important improvement) in how Trie encodes > terms that we need to fix. This one is not easy to handle, since > such a change could alter the term order, and merging segments > then becomes problematic. Not sure how to handle that. Yonik, > has Solr ever had to make a change to NumberUtils? Nope. If we needed to, we would make a new field type so that existing schemas/indexes would continue to work. -Yonik
Re: Payloads and TrieRangeQuery
On Wed, Jun 10, 2009 at 5:07 PM, Uwe Schindler wrote: > I would really like to leave this optimization out for 2.9. We can still add > this after 2.9 as an optimization. The number of bits encoded into the > TermPosition (this is really a cool idea, thanks Yonik, I was missing > exactly that, because you do not need to convert the bits, you can directly > put them into the index as int and use them on the query side!) is simply 0 > for indexes created with 2.9. With later versions, you could also shift the > lower bits into the TermPosition and tell TrieRange to filter them. I agree, let's aim for after 3.0 for this. (Note that, in theory, 3.0 should follow quickly after 2.9, having "only" removed deprecated APIs, changed settings, etc.) Can you open an issue & mark it as 3.1? > I would like to go forward with moving the classes into the right packages > and optimize the way, how queries and analyzers are created (only one class > for each). The idea from LUCENE-1673 to use static factories to create these > classes for the different data types seems to be more elegant and simpler > to maintain than the current way (having a class for each bit size). +1 > So I think I will start with 1673 and try to present something usable, soon > (but without payloads, so the payload/position-bits setting is "0"). > Now the open question: Which name for the numeric range queries/fields? :-( How about:

Range* -> TermRange*
TrieRange* -> NumericRange*
FieldCacheRangeFilter -> FieldCacheTermRangeFilter
ConstantScoreRangeQuery stays as is (it's deprecated)

Are there any others that need renaming? Mike
RE: Payloads and TrieRangeQuery
> On Wed, Jun 10, 2009 at 3:43 PM, Michael McCandless > wrote: > > On Wed, Jun 10, 2009 at 3:19 PM, Yonik > Seeley wrote: > > > >>> And this information about the trie > >>> structure and where payloads are should be stored in FieldInfos. > >> > >> As is the case today, the info is encoded in the class you use (and > >> it's settings)... no need to add it to the index structure. In any > >> case, it's a completely different issue and shouldn't be tied to > >> TrieRange improvements. > > > > The problem is, because the details of Trie* at index time affect > > what's in each segment, this information needs to be stored per > > segment. > > That's the case with the analysis for every field. If you change your > analyzer in a non-compatible fashion, you need to re-index. I agree with Mike to store information like the data type in the index, but on the other hand, Yonik is correct, too. If I change my analyzer (and TrieTokenStream is in fact one, an analyzer that creates tokens out of a number), I have to reindex. The problem with storing different indexing settings (precisionStep, payload/position bits) per segment makes merging nearly impossible, so I would not do this (see also Earwins comment about that). About releasing 2.9: I would really like to leave this optimization out for 2.9. We can still add this after 2.9 as an optimization. The number of bits encoded into the TermPosition (this is really a cool idea, thanks Yonik, I was missing exactly that, because you do not need to convert the bits, you can directly put them into the index as int and use them on the query side!) is simply 0 for indexes created with 2.9. With later versions, you could also shift the lower bits into the TermPosition and tell TrieRange to filter them. I would like to go forward with moving the classes into the right packages and optimize the way, how queries and analyzers are created (only one class for each). 
The idea from LUCENE-1673 to use static factories to create these classes for the different data types seems to be more elegant and simpler to maintain than the current way (having a class for each bit size). So I think I will start with 1673 and try to present something usable, soon (but without payloads, so the payload/position-bits setting is "0"). Now the open question: Which name for the numeric range queries/fields? :-( Uwe
Re: Payloads and TrieRangeQuery
On Wed, Jun 10, 2009 at 4:04 PM, Earwin Burrfoot wrote: > And then, when you merge segments indexed with different Trie* > settings, you need to convert them to some common form. > Sounds like something too complex and with minimum returns. Oh yeah... tricky. So... there are various situations to handle with trie: * Was the field even indexed w/ Trie, or indexed as "simple text"? It's useful to know this "automatically" at search time, so eg a RangeQuery can do the right thing by default. FieldInfos seems like the natural place to store this. It's basically Lucene's per-segment write-once schema. Eg we use this to record "did any token in this field have a Payload?", which is analogous. * How did you tune your payload-vs-trie-range setting. OK, I agree: this is most similar to "you changed your analyzer in an incompatible way, so, you have to reindex". Plus, during merging we can't [easily] translate this. So we shouldn't try to keep track of this. * We have a bug (or an important improvement) in how Trie encodes terms that we need to fix. This one is not easy to handle, since such a change could alter the term order, and merging segments then becomes problematic. Not sure how to handle that. Yonik, has Solr ever had to make a change to NumberUtils? Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Payloads and TrieRangeQuery
On Wed, Jun 10, 2009 at 3:43 PM, Michael McCandless wrote: > On Wed, Jun 10, 2009 at 3:19 PM, Yonik Seeley > wrote: > >>> And this information about the trie >>> structure and where payloads are should be stored in FieldInfos. >> >> As is the case today, the info is encoded in the class you use (and >> it's settings)... no need to add it to the index structure. In any >> case, it's a completely different issue and shouldn't be tied to >> TrieRange improvements. > > The problem is, because the details of Trie* at index time affect > what's in each segment, this information needs to be stored per > segment. That's the case with the analysis for every field. If you change your analyzer in a non-compatible fashion, you need to re-index. -Yonik http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Payloads and TrieRangeQuery
>>> And this information about the trie >>> structure and where payloads are should be stored in FieldInfos. >> >> As is the case today, the info is encoded in the class you use (and >> it's settings)... no need to add it to the index structure. In any >> case, it's a completely different issue and shouldn't be tied to >> TrieRange improvements. > > The problem is, because the details of Trie* at index time affect > what's in each segment, this information needs to be stored per > segment. And then, when you merge segments indexed with different Trie* settings, you need to convert them to some common form. Sounds like something too complex and with minimum returns. -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Re: Payloads and TrieRangeQuery
On Wed, Jun 10, 2009 at 3:19 PM, Yonik Seeley wrote: >> And this information about the trie >> structure and where payloads are should be stored in FieldInfos. > > As is the case today, the info is encoded in the class you use (and > it's settings)... no need to add it to the index structure. In any > case, it's a completely different issue and shouldn't be tied to > TrieRange improvements. The problem is, because the details of Trie* at index time affect what's in each segment, this information needs to be stored per segment. Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Payloads and TrieRangeQuery
On Wed, Jun 10, 2009 at 3:07 PM, Uwe Schindler wrote: >> I wonder how performance would compare. Without payloads, there are >> many more terms (for the tiny ranges) in the index, and your OR query >> will have lots of these tiny terms. But then these tiny terms don't >> hit many docs, and with BooleanScorer (which we should switch to for >> OR queries) ought not be very costly. > > That is true. The main idea was also to limit seeking during the query. > When splitting the range, you often need to start new TermEnums and iterate > over lots of terms. By catching many docs with fewer terms, you only need to > scan forward in the payloads. OK, though we should separately test "cold" searches (seeking matters) and "hot" searches (seeking doesn't). And we should separately test SSD vs spinning drive for the cold case. Seeking is much less costly (though still more costly than "hot" searches) with SSDs... >> Vs w/ payloads having to use >> TermPositions, having to load, decode & check the payload, and I guess >> assuming on average that 1/2 the docs are filtered out. > > Maybe decoding the payload is not needed; I would encode the bounds as > byte[] and compare the arrays. But you would filter about half of the docs > out. Yonik's idea (encoding in the position) seems great here. > My problem with all this is how to optimize after which shift value to > switch between terms and payloads. Presumably you'd "roughly" balance seek time vs "wasted doc filtered out" time to set the default, and make it configurable. > And this information about the trie > structure and where payloads are should be stored in FieldInfos. > > As we now search on each segment separately, this information can be stored > per segment and also used for each per-segment Filter/Scorer. Right, I think it should, but I agree w/ Yonik (partially) that it's orthogonal. 
> The whole thing works out of the box with TrieRangeFilter (it's just > iterating over terms, getting TermDocs/TermPositions, and setting bits after > checking the payloads when available), for TrieRangeQuery using > BooleanQuery it is more complicated (MTQ cannot simply add the terms from > the FilteredTermEnum to a BooleanQuery). Seems like we should generalize MTQ so that the subclass could return which clause should be added to the BQ for each term? (We also still need to improve MTQ to decouple constant scoring from "use BQ or filter"... there's an issue open for that.) > Until now I had no time to think about it in detail, but if TrieRange > can move to Core and store trie-specific FieldInfos per > segment, it will become clearer how to manage this in the API. I'd really like to see TrieRange in core for 2.9... Mike
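Uwe's byte[]-compare idea above (skip decoding the payload, compare encoded bounds directly) could be sketched roughly as follows. This is a hedged illustration, not TrieRange code: the class name, the big-endian sign-flipped encoding, and the helper methods are all assumptions.

```java
// Sketch: check an encoded payload against encoded range bounds without
// decoding back to a long. Big-endian, sign-flipped encoding makes unsigned
// lexicographic byte comparison agree with signed numeric order.
class PayloadBoundsSketch {
    /** Big-endian encoding with the sign bit flipped, so signed longs sort as unsigned bytes. */
    static byte[] encode(long value) {
        long v = value ^ 0x8000000000000000L; // flip sign bit: signed order -> byte order
        byte[] b = new byte[8];
        for (int i = 0; i < 8; i++) b[i] = (byte) (v >>> (56 - 8 * i));
        return b;
    }

    /** Unsigned lexicographic comparison of equal-length byte arrays. */
    static int compareUnsigned(byte[] a, byte[] b) {
        for (int i = 0; i < a.length; i++) {
            int x = a[i] & 0xFF, y = b[i] & 0xFF;
            if (x != y) return x < y ? -1 : 1;
        }
        return 0;
    }

    /** True if payload lies inside [lower, upper]; all three encoded with encode(). */
    static boolean inRange(byte[] payload, byte[] lower, byte[] upper) {
        return compareUnsigned(payload, lower) >= 0
            && compareUnsigned(payload, upper) <= 0;
    }
}
```

With this shape, the filter loop only ever compares byte arrays; the roughly half of candidate docs falling outside the bounds are rejected without any numeric decoding.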
Re: Payloads and TrieRangeQuery
On Wed, Jun 10, 2009 at 3:07 PM, Uwe Schindler wrote: > My problem with all this is how to optimize after which shift value to > switch between terms and payloads. Just make it a configurable number of bits at the end that are "stored" instead of indexed. People will want to select different tradeoffs anyway. What about using the position (as opposed to a payload) to encode the last bits? Should be faster, no? > And this information about the trie > structure and where payloads are should be stored in FieldInfos. As is the case today, the info is encoded in the class you use (and its settings)... no need to add it to the index structure. In any case, it's a completely different issue and shouldn't be tied to TrieRange improvements. -Yonik http://www.lucidimagination.com
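Yonik's position idea can be sketched as splitting each value into an indexed prefix (the term) and the trailing bits (smuggled into the posting's position, which is cheaper to read than a payload). A hypothetical helper, not Lucene code; the names and the split point are assumptions:

```java
// Sketch: index only the top bits as the term; carry the low `bits` in the
// token's position, so range-edge filtering reads positions, not payloads.
class PositionEncodingSketch {
    /** Returns {termPrefix, position}: the value split after dropping `bits` low bits. */
    static long[] split(long value, int bits) {
        long mask = (1L << bits) - 1;
        return new long[] { value >>> bits, value & mask };
    }

    /** Reassemble the full-precision value from a term prefix and a position. */
    static long join(long prefix, long position, int bits) {
        return (prefix << bits) | position;
    }
}
```

The number of trailing bits would be the configurable tradeoff Yonik mentions: more bits stored in the position means fewer indexed terms but more docs to filter at the range edges.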
RE: Payloads and TrieRangeQuery
> Ooh that sounds compelling! > > So you would not need to use payloads for the "inside" brackets, > right? Only for the edges? Exactly. > I wonder how performance would compare. Without payloads, there are > many more terms (for the tiny ranges) in the index, and your OR query > will have lots of these tiny terms. But then these tiny terms don't > hit many docs, and with BooleanScorer (which we should switch to for > OR queries) ought not be very costly. That is true. The main idea was also to limit seeking during the query. When splitting the range, you often need to start new TermEnums and iterate over lots of terms. By catching many docs with fewer terms, you only need to scan forward in the payloads. > Vs w/ payloads having to use > TermPositions, having to load, decode & check the payload, and I guess > assuming on average that 1/2 the docs are filtered out. Maybe decoding the payload is not needed; I would encode the bounds as byte[] and compare the arrays. But you would filter about half of the docs out. My problem with all this is how to optimize after which shift value to switch between terms and payloads. And this information about the trie structure and where payloads are should be stored in FieldInfos. As we now search on each segment separately, this information can be stored per segment and also used for each per-segment Filter/Scorer. The whole thing works out of the box with TrieRangeFilter (it's just iterating over terms, getting TermDocs/TermPositions, and setting bits after checking the payloads when available), for TrieRangeQuery using BooleanQuery it is more complicated (MTQ cannot simply add the terms from the FilteredTermEnum to a BooleanQuery). Until now I had no time to think about it in detail, but if TrieRange can move to Core and store trie-specific FieldInfos per segment, it will become clearer how to manage this in the API. 
Uwe > On Wed, Jun 10, 2009 at 2:28 PM, Uwe Schindler wrote: > > Hi, sorry I missed the first mail. > > > > > > > > The idea we discussed in Amsterdam during ApacheCon was: > > > > > > > > Instead of indexing all trie precisions from e.g. the leftmost 8 bits > downto > > all 64 bits, the TrieTokenStream only creates terms from e.g. precisions > 8 > > to 56. The last precision is left out. Instead the last term (precision > 56) > > contains the highest precision as payload. > > > > On the query side, TrieRangeQuery would create the filter bitmap as > before > > until it reaches the lowest available precision with the payloads. > Instead > > of further splitting this precision into terms, all TermPositions > instead of > > just TermDocs are listed, but only those set in the result BitSet, that > have > > the payload inside the range bounds. By this the trie query first > selects > > large ranges in the middle like before, but uses the highest (but not > full > > precision term) to select more docids than needed but filters them with > the > > payload. > > > > > > > > With String Dates (the simplified example Michael Busch shows in his > talk): > > > > Searching all docs from 2005-11-10 to 2008-03-11 with current trierange > > variant would select terms 2005-11-10 to 2005-11-30, then the whole > > December, the whole years 2006 and 2007 and so on. With payloads, > trierange > > would select only whole months (November, December, 2006, 2007, Jan, > Feb, > > Mar). At the ends the payloads are used to filter out the days in Nov > 2005 > > and Mar 2008. > > > > > > > > With the latest TrieRange impl this would be possible to implement > (because > > the TrieTokenStreams now used for indexing could create the payloads). > Only > > the searching side would no longer so simple implemented as yet. My > > biggest problem is how to configure this optimal and make the API clean. > > > > > > > > Was it understandable? 
(It's complicated, I know) > > > > > > > > - > > Uwe Schindler > > H.-H.-Meier-Allee 63, D-28213 Bremen > > http://www.thetaphi.de > > eMail: u...@thetaphi.de > > > > > > > > From: Jason Rutherglen [mailto:jason.rutherg...@gmail.com] > > Sent: Wednesday, June 10, 2009 7:59 PM > > To: java-dev@lucene.apache.org > > Subject: Re: Payloads and TrieRangeQuery > > > > > > > > I think instead of ORing postings (trie range, rangequery, etc), have a > > custom Query + Scorer that examines the payload (somehow)? It could > encode > > the multiple levels of trie bits in it? (I'm just guessing here).
Re: Payloads and TrieRangeQuery
Yep, makes sense. It could be a little slower, but it would decrease the number of terms indexed by a factor of 256 (for 8 bits). But the payload part... seems like another case of using that because CSF isn't there yet, right? (well, perhaps except if you didn't want to store the field...) -Yonik http://www.lucidimagination.com On Wed, Jun 10, 2009 at 2:28 PM, Uwe Schindler wrote: > Hi, sorry I missed the first mail. > > > > The idea we discussed in Amsterdam during ApacheCon was: > > > > Instead of indexing all trie precisions from e.g. the leftmost 8 bits downto > all 64 bits, the TrieTokenStream only creates terms from e.g. precisions 8 > to 56. The last precision is left out. Instead the last term (precision 56) > contains the highest precision as payload. > > On the query side, TrieRangeQuery would create the filter bitmap as before > until it reaches the lowest available precision with the payloads. Instead > of further splitting this precision into terms, all TermPositions instead of > just TermDocs are listed, but only those set in the result BitSet, that have > the payload inside the range bounds. By this the trie query first selects > large ranges in the middle like before, but uses the highest (but not full > precision term) to select more docids than needed but filters them with the > payload. > > > > With String Dates (the simplified example Michael Busch shows in his talk): > > Searching all docs from 2005-11-10 to 2008-03-11 with current trierange > variant would select terms 2005-11-10 to 2005-11-30, then the whole > December, the whole years 2006 and 2007 and so on. With payloads, trierange > would select only whole months (November, December, 2006, 2007, Jan, Feb, > Mar). At the ends the payloads are used to filter out the days in Nov 2005 > and Mar 2008. > > > > With the latest TrieRange impl this would be possible to implement (because > the TrieTokenStreams now used for indexing could create the payloads). 
Only > the searching side would no longer so “simple” implemented as yet. My > biggest problem is how to configure this optimal and make the API clean. > > > > Was it understandable? (Its complicated, I know) > > > > - > Uwe Schindler > H.-H.-Meier-Allee 63, D-28213 Bremen > http://www.thetaphi.de > eMail: u...@thetaphi.de > > ____________ > > From: Jason Rutherglen [mailto:jason.rutherg...@gmail.com] > Sent: Wednesday, June 10, 2009 7:59 PM > To: java-dev@lucene.apache.org > Subject: Re: Payloads and TrieRangeQuery > > > > I think instead of ORing postings (trie range, rangequery, etc), have a > custom Query + Scorer that examines the payload (somehow)? It could encode > the multiple levels of trie bits in it? (I'm just guessing here). > > On Wed, Jun 10, 2009 at 4:04 AM, Michael McCandless > wrote: > > Use them how? (Sounds interesting...). > > Mike > > On Tue, Jun 9, 2009 at 10:32 PM, Jason > Rutherglen wrote: >> At the SF Lucene User's group, Michael Busch mentioned using >> payloads with TrieRangeQueries. Is this something that's being >> worked on? I'm interested in what sort performance benefits >> there would be to this method? >> > > - > To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-dev-h...@lucene.apache.org > >
Re: Payloads and TrieRangeQuery
Ooh that sounds compelling! So you would not need to use payloads for the "inside" brackets, right? Only for the edges? I wonder how performance would compare. Without payloads, there are many more terms (for the tiny ranges) in the index, and your OR query will have lots of these tiny terms. But then these tiny terms don't hit many docs, and with BooleanScorer (which we should switch to for OR queries) ought not be very costly. Vs w/ payloads having to use TermPositions, having to load, decode & check the payload, and I guess assuming on average that 1/2 the docs are filtered out. Mike On Wed, Jun 10, 2009 at 2:28 PM, Uwe Schindler wrote: > Hi, sorry I missed the first mail. > > > > The idea we discussed in Amsterdam during ApacheCon was: > > > > Instead of indexing all trie precisions from e.g. the leftmost 8 bits downto > all 64 bits, the TrieTokenStream only creates terms from e.g. precisions 8 > to 56. The last precision is left out. Instead the last term (precision 56) > contains the highest precision as payload. > > On the query side, TrieRangeQuery would create the filter bitmap as before > until it reaches the lowest available precision with the payloads. Instead > of further splitting this precision into terms, all TermPositions instead of > just TermDocs are listed, but only those set in the result BitSet, that have > the payload inside the range bounds. By this the trie query first selects > large ranges in the middle like before, but uses the highest (but not full > precision term) to select more docids than needed but filters them with the > payload. > > > > With String Dates (the simplified example Michael Busch shows in his talk): > > Searching all docs from 2005-11-10 to 2008-03-11 with current trierange > variant would select terms 2005-11-10 to 2005-11-30, then the whole > December, the whole years 2006 and 2007 and so on. With payloads, trierange > would select only whole months (November, December, 2006, 2007, Jan, Feb, > Mar). 
At the ends the payloads are used to filter out the days in Nov 2005 > and Mar 2008. > > > > With the latest TrieRange impl this would be possible to implement (because > the TrieTokenStreams now used for indexing could create the payloads). Only > the searching side would no longer so “simple” implemented as yet. My > biggest problem is how to configure this optimal and make the API clean. > > > > Was it understandable? (Its complicated, I know) > > > > - > Uwe Schindler > H.-H.-Meier-Allee 63, D-28213 Bremen > http://www.thetaphi.de > eMail: u...@thetaphi.de > > ____________ > > From: Jason Rutherglen [mailto:jason.rutherg...@gmail.com] > Sent: Wednesday, June 10, 2009 7:59 PM > To: java-dev@lucene.apache.org > Subject: Re: Payloads and TrieRangeQuery > > > > I think instead of ORing postings (trie range, rangequery, etc), have a > custom Query + Scorer that examines the payload (somehow)? It could encode > the multiple levels of trie bits in it? (I'm just guessing here). > > On Wed, Jun 10, 2009 at 4:04 AM, Michael McCandless > wrote: > > Use them how? (Sounds interesting...). > > Mike > > On Tue, Jun 9, 2009 at 10:32 PM, Jason > Rutherglen wrote: >> At the SF Lucene User's group, Michael Busch mentioned using >> payloads with TrieRangeQueries. Is this something that's being >> worked on? I'm interested in what sort performance benefits >> there would be to this method? >> > > - > To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-dev-h...@lucene.apache.org > >
RE: Payloads and TrieRangeQuery
Hi, sorry I missed the first mail. The idea we discussed in Amsterdam during ApacheCon was: Instead of indexing all trie precisions from e.g. the leftmost 8 bits down to all 64 bits, the TrieTokenStream only creates terms from e.g. precisions 8 to 56. The last precision is left out. Instead the last term (precision 56) contains the highest precision as payload. On the query side, TrieRangeQuery would create the filter bitmap as before until it reaches the lowest available precision with the payloads. Instead of further splitting this precision into terms, all TermPositions instead of just TermDocs are listed, but only those set in the result BitSet that have the payload inside the range bounds. By this the trie query first selects large ranges in the middle like before, but uses the highest (but not full precision) term to select more docids than needed and filters them with the payload. With String Dates (the simplified example Michael Busch shows in his talk): Searching all docs from 2005-11-10 to 2008-03-11 with the current trierange variant would select terms 2005-11-10 to 2005-11-30, then the whole December, the whole years 2006 and 2007, and so on. With payloads, trierange would select only whole months (November, December, 2006, 2007, Jan, Feb, Mar). At the ends the payloads are used to filter out the days in Nov 2005 and Mar 2008. With the latest TrieRange impl this would be possible to implement (because the TrieTokenStreams now used for indexing could create the payloads). Only the searching side would no longer be so "simple" to implement. My biggest problem is how to configure this optimally and make the API clean. Was it understandable? 
(It's complicated, I know) - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de From: Jason Rutherglen [mailto:jason.rutherg...@gmail.com] Sent: Wednesday, June 10, 2009 7:59 PM To: java-dev@lucene.apache.org Subject: Re: Payloads and TrieRangeQuery I think instead of ORing postings (trie range, rangequery, etc), have a custom Query + Scorer that examines the payload (somehow)? It could encode the multiple levels of trie bits in it? (I'm just guessing here). On Wed, Jun 10, 2009 at 4:04 AM, Michael McCandless wrote: Use them how? (Sounds interesting...). Mike On Tue, Jun 9, 2009 at 10:32 PM, Jason Rutherglen wrote: > At the SF Lucene User's group, Michael Busch mentioned using > payloads with TrieRangeQueries. Is this something that's being > worked on? I'm interested in what sort of performance benefits > there would be to this method? >
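Uwe's indexing scheme can be sketched in plain Java. This is a hypothetical illustration only (class, field, and method names are invented, not the actual TrieTokenStream API): index precisions 8..56, leave out the full precision, and attach the full-precision value as a payload on the lowest-precision (shift-8) term.

```java
// Sketch of the proposal: one term per precision, shifts 56 down to 8,
// with the full 64-bit value carried as a payload only on the last level.
class TriePayloadSketch {
    /** One indexed term: a shifted prefix plus, on the last level only, a payload. */
    static final class TrieTerm {
        final int shift;        // number of low-order bits dropped from the value
        final long prefix;      // value >>> shift; this is what gets indexed
        final byte[] payload;   // big-endian full-precision value; null except at shift 8
        TrieTerm(int shift, long prefix, byte[] payload) {
            this.shift = shift; this.prefix = prefix; this.payload = payload;
        }
    }

    /** Big-endian encoding of the full 64-bit value, used as the payload. */
    static byte[] encode(long value) {
        byte[] b = new byte[8];
        for (int i = 0; i < 8; i++) b[i] = (byte) (value >>> (56 - 8 * i));
        return b;
    }

    /** Emit one term per precision, shift 56 down to 8; no full-precision term. */
    static TrieTerm[] terms(long value) {
        TrieTerm[] out = new TrieTerm[7];
        int i = 0;
        for (int shift = 56; shift >= 8; shift -= 8) {
            byte[] payload = (shift == 8) ? encode(value) : null;
            out[i++] = new TrieTerm(shift, value >>> shift, payload);
        }
        return out;
    }
}
```

At query time, the inner brackets of a range are matched purely by the shift-16..56 terms as before; only at the range edges would the shift-8 postings be walked via TermPositions, with the payload checked against the bounds.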
Re: Payloads and TrieRangeQuery
I think instead of ORing postings (trie range, rangequery, etc), have a custom Query + Scorer that examines the payload (somehow)? It could encode the multiple levels of trie bits in it? (I'm just guessing here). On Wed, Jun 10, 2009 at 4:04 AM, Michael McCandless < luc...@mikemccandless.com> wrote: > Use them how? (Sounds interesting...). > > Mike > > On Tue, Jun 9, 2009 at 10:32 PM, Jason > Rutherglen wrote: > > At the SF Lucene User's group, Michael Busch mentioned using > > payloads with TrieRangeQueries. Is this something that's being > > worked on? I'm interested in what sort performance benefits > > there would be to this method? > > > > - > To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-dev-h...@lucene.apache.org > >
Re: Payloads and TrieRangeQuery
Use them how? (Sounds interesting...). Mike On Tue, Jun 9, 2009 at 10:32 PM, Jason Rutherglen wrote: > At the SF Lucene User's group, Michael Busch mentioned using > payloads with TrieRangeQueries. Is this something that's being > worked on? I'm interested in what sort of performance benefits > there would be to this method? >