Re: New Token API was Re: Payloads and TrieRangeQuery
On Jun 15, 2009, at 2:11 PM, Grant Ingersoll wrote:

More questions:

1. What about Highlighter and MoreLikeThis? They have not been converted. Also, what are they going to do if the attributes they need are not available? Caveat emptor?

2. Same for TermVectors. What if the user specifies them with positions and offsets, but the analyzer doesn't produce them? Caveat emptor? (BTW, this is also true for the new omit-TF stuff.)

3. Also, what about the case where one might have attributes that are meant for downstream TokenFilters, but not necessarily for indexing? Offsets and type come to mind. Is it the case now that those attributes are not automatically added to the index? If they are ignored now, what if I want to add them? I admit I'm having a hard time finding the code that specifically loops over the Attributes. I recall seeing it, but can no longer find it.

Also, can we add something like an AttributeTermQuery? It seems like it could work similarly to the BoostingTermQuery.

So, I think I see #1 covered; how about #2, #3, and the notion of an AttributeTermQuery? Anyone have thoughts on those? I might have some time next week to work up a Query, as it sounds like fun, but don't hold me to it just yet.

- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
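[Editor's note: the "code that loops over the Attributes" Grant asks about is the consumer side of the new API. As a rough sketch — using invented toy classes, not the real Lucene TokenStream/TermAttribute — the intended pattern is: fetch each attribute once, then let incrementToken() mutate it in place.]

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Toy stand-in for the real Lucene TermAttribute: holds the current
// token's text and is mutated in place by the stream.
class TermAttribute {
    private String term = "";
    void setTerm(String t) { term = t; }
    String term() { return term; }
}

// Minimal toy stream: splits on whitespace, updates ONE shared TermAttribute.
class WhitespaceStream {
    private final TermAttribute termAtt = new TermAttribute();
    private final Iterator<String> words;

    WhitespaceStream(String text) {
        words = Arrays.asList(text.split("\\s+")).iterator();
    }

    TermAttribute getTermAttribute() { return termAtt; }

    boolean incrementToken() {
        if (!words.hasNext()) return false;
        termAtt.setTerm(words.next());  // mutate the shared instance in place
        return true;
    }
}

public class ConsumeAttributes {
    public static List<String> collectTerms(String text) {
        WhitespaceStream ts = new WhitespaceStream(text);
        TermAttribute term = ts.getTermAttribute();  // fetch ONCE, before the loop
        List<String> terms = new ArrayList<>();
        while (ts.incrementToken()) {
            terms.add(term.term());  // read the updated state at each step
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(collectTerms("payloads and trie range"));
        // prints: [payloads, and, trie, range]
    }
}
```

Under this pattern, an attribute the consumer never fetches is simply never read, which may bear on question #3 above.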
Re: New Token API was Re: Payloads and TrieRangeQuery
On 6/15/09 10:10 AM, Grant Ingersoll wrote: But, as Michael M reminded me, it is complex, so please accept my apologies.

No worries, Grant! I was not really offended, but rather confused... Thanks for clarifying.

Michael
Re: New Token API was Re: Payloads and TrieRangeQuery
Grant Ingersoll wrote: > 1. What about Highlighter

I would guess Highlighter has not been updated because it's kind of a royal * :)

--
- Mark
http://www.lucidimagination.com
Re: New Token API was Re: Payloads and TrieRangeQuery
Mark Miller wrote: Grant Ingersoll wrote: On Jun 14, 2009, at 8:05 PM, Michael Busch wrote: I'd be happy to discuss other API proposals that anybody brings up here, that have the same advantages and are more intuitive. We could also beef up the documentation and give a better example about how to convert a stream/filter from the old to the new API; a constructive suggestion that Uwe made at the ApacheCon. More questions: 1. What about Highlighter and MoreLikeThis? They have not been converted. Also, what are they going to do if the attributes they need are not available? Caveat emptor? 2. Same for TermVectors. What if the user specifies with positions and offsets, but the analyzer doesn't produce them? Caveat emptor? (BTW, this is also true for the new omit TF stuff) 3. Also, what about the case where one might have attributes that are meant for downstream TokenFilters, but not necessarily for indexing? Offsets and type come to mind. Is it the case now that those attributes are not automatically added to the index? If they are ignored now, what if I want to add them? I admit, I'm having a hard time finding the code that specifically loops over the Attributes. I recall seeing it, but can no longer find it. Also, can we add something like an AttributeTermQuery? Seems like it could work similar to the BoostingTermQuery. I'm sure more will come to me. -Grant If you are using a CachingTokenFilter, and you do something like pass it to something that hasn't upgraded to the new API (say MemoryIndex#addField(String fieldName, TokenStream stream, float boost)) and you are trying to use the new API, you will get an exception when trying to read the tokens from the CachingTokenFilter a second time - obviously because the old API is cached rather than the new, and when you try and use the new, kak :( . We can obviously fix anything internal, but not external. 
Hmm - actually, even if we fix internal, if you are trying to use the old API, you will have the same issue in reverse ;)

--
- Mark
http://www.lucidimagination.com
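[Editor's note: Mark's CachingTokenFilter trap can be pictured with a toy model. The names below are invented for illustration and this is NOT the real CachingTokenFilter: the point is only that the cache records tokens in whichever representation the first consumer used, so a second pass through the other API has nothing to replay.]

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of the mismatch: the first pass fills only one of the two
// caches, so replaying through the other API fails.
class ToyCachingFilter {
    private final List<String> source;
    private List<String> oldApiCache;  // tokens captured via next()
    private List<String> newApiCache;  // attribute states captured via incrementToken()

    ToyCachingFilter(List<String> source) {
        this.source = source;
    }

    // First pass through the OLD API fills only the old-style cache.
    List<String> consumeViaOldApi() {
        if (oldApiCache == null) oldApiCache = new ArrayList<>(source);
        return oldApiCache;
    }

    // A later pass through the NEW API finds no cached attribute states.
    List<String> replayViaNewApi() {
        if (newApiCache == null) {
            throw new IllegalStateException(
                "tokens were cached via the old API; no attribute states to restore");
        }
        return newApiCache;
    }
}

public class CachingMismatchDemo {
    public static String demo() {
        ToyCachingFilter f = new ToyCachingFilter(Arrays.asList("foo", "bar"));
        f.consumeViaOldApi();    // e.g. handed to a consumer still on the old API
        try {
            f.replayViaNewApi(); // second read via the new API: kak :(
            return "ok";
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

As Mark notes, the same failure occurs in reverse when the cache is filled via the new API and read via the old one.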
Re: New Token API was Re: Payloads and TrieRangeQuery
Sounds promising, but I have to think about whether there are side-effects of this change other than a slowdown for people who create multiple tokens (which would be acceptable, as you said, because it's not recommended anyway and should be rare).

On 6/15/09 1:46 PM, Uwe Schindler wrote:

Maybe change the deprecation wrapper around next() and next(Token) [the default impl of incrementToken()] to check whether the retrieved token is not identical to the attribute, and then just copy its contents to the instance Token? This would be a slowdown, but only for the very rare TokenStreams that did not reuse the token before (and were slow before, too).

- Uwe Schindler, H.-H.-Meier-Allee 63, D-28213 Bremen, http://www.thetaphi.de, eMail: u...@thetaphi.de

From: Michael Busch [mailto:busch...@gmail.com]
Sent: Monday, June 15, 2009 10:39 PM
To: java-dev@lucene.apache.org
Subject: Re: New Token API was Re: Payloads and TrieRangeQuery

I have implemented most of that actually (the interface part and Token implementing all of them). The problem is a paradigm change with the new API: the assumption is that there is always only one single instance of an Attribute. With the old API it is recommended to reuse the passed-in token, but you don't have to; you can also return a new one with every call of next(). Now with this change the indexer classes should only know about the interfaces; they shouldn't know Token anymore, which seems fine when Token implements all those interfaces. BUT, since there can be more than one instance of Token, the indexer would have to call getAttribute() for all Attributes it needs after each call of next(). I haven't measured how expensive that is, but it seems like a severe performance hit. That's basically the main reason why the backwards compatibility is ensured in such a goofy way right now.

Michael

On 6/15/09 1:28 PM, Uwe Schindler wrote:

And I don't like the *useNewAPI*() methods either. I spent a lot of time thinking about backwards compatibility for this API. It's tricky to do without sacrificing performance. In API patches I find myself spending more time on backwards compatibility than on the actual new feature! :( I'll try to think about how to simplify this confusing old/new API mix.

One solution to fix this useNewAPI problem would be to change the AttributeSource in a way that it returns classes that implement interfaces (as you proposed some weeks ago). The good old Token class then does not need to be deprecated; it could simply implement all 4 interfaces. AttributeSource then must implement a registry of which classes implement which interfaces. So if somebody wants a TermAttribute, he always gets the Token. In all other cases, the default could be a *Impl default class. In this case, next() could simply return this Token (which is all 4 attributes). The reuseableToken is simply thrown away in the deprecated API; the reuseable Token comes from the AttributeSource and is per-instance. Is this an idea?

Uwe
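[Editor's note: a hedged sketch of the registry idea in Uwe's proposal, with simplified invented names rather than the real Lucene API: Token implements every attribute interface, and the AttributeSource registry hands out the single shared Token for any of those interfaces, so the old-style reusable Token and the new-style attributes are one object.]

```java
import java.util.HashMap;
import java.util.Map;

// Simplified attribute interfaces (stand-ins for the real four).
interface Attr { }

interface TermAttr extends Attr {
    String term();
    void setTerm(String t);
}

interface OffsetAttr extends Attr {
    int startOffset();
    void setOffsets(int start, int end);
}

// The "good old Token", implementing all attribute interfaces at once.
class Token implements TermAttr, OffsetAttr {
    private String term = "";
    private int start, end;
    public String term() { return term; }
    public void setTerm(String t) { term = t; }
    public int startOffset() { return start; }
    public void setOffsets(int s, int e) { start = s; end = e; }
}

class AttributeSource {
    private final Token sharedToken = new Token();  // one per-instance Token, reused
    private final Map<Class<? extends Attr>, Attr> registry = new HashMap<>();

    @SuppressWarnings("unchecked")
    <A extends Attr> A addAttribute(Class<A> iface) {
        // Registry lookup: every interface that Token implements maps to the
        // same shared Token; anything else would get a per-interface *Impl.
        return (A) registry.computeIfAbsent(iface, k -> {
            if (k.isInstance(sharedToken)) return sharedToken;
            throw new IllegalArgumentException("no default impl for " + k);
        });
    }
}

public class RegistryDemo {
    public static boolean sameInstance() {
        AttributeSource src = new AttributeSource();
        TermAttr term = src.addAttribute(TermAttr.class);
        OffsetAttr off = src.addAttribute(OffsetAttr.class);
        return term == off;  // both are views of the one shared Token
    }

    public static void main(String[] args) {
        System.out.println(sameInstance());  // prints: true
    }
}
```

With this arrangement, the deprecated next() could simply return the shared Token, which is exactly the per-instance reuse Michael says the indexer needs to avoid calling getAttribute() after every token.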
Re: New Token API was Re: Payloads and TrieRangeQuery
yeah about 5 seconds in I saw that and decided to stick with what I know :) On Mon, Jun 15, 2009 at 5:10 PM, Mark Miller wrote: > I may do the Highlighter. Its annoying though - I'll have to break back > compat because Token is part of the public API (Fragmenter, etc). > > Robert Muir wrote: >> >> Michael OK, I plan on adding some tests for the analyzers that don't have >> any. >> >> I didn't try to migrate things such as highlighter, which are >> definitely just as important, only because I'm not familiar with that >> territory. >> >> But I think I can figure out what the various language analyzers are >> trying to do and add tests / convert the remaining ones. >> >> On Mon, Jun 15, 2009 at 4:42 PM, Michael Busch wrote: >> >>> >>> I agree. It's my fault, the task of changing the contribs (LUCENE-1460) >>> is >>> assigned to me for a while now - I just haven't found the time to do it >>> yet. >>> >>> It's great that you started the work on that! I'll try to review the >>> patch >>> in the next couple of days and help with fixing the remaining ones. I'd >>> like >>> to get the AttributeSource improvements patch out first. I'll try that >>> tonight. >>> >>> Michael >>> >>> On 6/15/09 1:35 PM, Robert Muir wrote: >>> >>> Michael, again I am terrible with such things myself... >>> >>> Personally I am impressed that you have the back compat, even if you >>> don't change any code at all I think some reformatting of javadocs >>> might make the situation a lot friendlier. I just listed everything >>> that came to my mind immediately. >>> >>> I guess I will also mention that one of the reasons I might not use >>> the new API is that since all filters, etc on the same chain must use >>> the same API, its discouraging if all the contrib stuff doesn't >>> support the new API, it makes me want to just stick with the old so >>> everything will work. So I think contribs being on the new API is >>> really important otherwise no one will want to use it. 
>>> >>> On Mon, Jun 15, 2009 at 4:21 PM, Michael Busch wrote: >>> >>> >>> This is excellent feedback, Robert! >>> >>> I agree this is confusing; especially having a deprecated API and only a >>> experimental one that replaces the old one. We need to change that. >>> And I don't like the *useNewAPI*() methods either. I spent a lot of time >>> thinking about backwards compatibility for this API. It's tricky to do >>> without sacrificing performance. In API patches I find myself spending >>> more >>> time for backwards-compatibility than for the actual new feature! :( >>> >>> I'll try to think about how to simplify this confusing old/new API mix. >>> >>> However, we need to make the decisions: >>> a) if we want to release this new API with 2.9, >>> b) if yes to a), if we want to remove the old API in 3.0? >>> >>> If yes to a) and no to b), then we'll have to support both APIs for a >>> presumably very long time, so we then need to have a better solution for >>> the >>> backwards-compatibility here. >>> >>> -Michael >>> >>> On 6/15/09 1:09 PM, Robert Muir wrote: >>> >>> let me try some slightly more constructive feedback: >>> >>> new user looks at TokenStream javadocs: >>> >>> http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/org/apache/lucene/analysis/TokenStream.html >>> immediately they see deprecated, text in red with the words >>> "experimental", warnings in bold, the whole thing is scary! >>> due to the use of 'e.g.' the javadoc for .incrementToken() is cut off >>> in a bad way, and its probably the most important method to a new >>> user! >>> there's also a stray bold tag gone haywire somewhere, possibly >>> .incrementToken() >>> >>> from a technical perspective, the documentation is excellent! but for >>> a new user unfamiliar with lucene, its unclear exactly what steps to >>> take: use the scary red experimental api or the old deprecated one? 
>>> >>> theres also some fairly advanced stuff such as .captureState and >>> .restoreState that might be better in a different place. >>> >>> finally, the .setUseNewAPI() and .setUseNewAPIDefault() are confusing >>> [one is static, one is not], especially because it states all streams >>> and filters in one chain must use the same API, is there a way to >>> simplify this? >>> >>> i'm really terrible with javadocs myself, but perhaps we can come up >>> with a way to improve the presentation... maybe that will make the >>> difference. >>> >>> On Mon, Jun 15, 2009 at 3:45 PM, Robert Muir wrote: >>> >>> >>> Mark, I'll see if I can get tests produced for some of those analyzers. >>> >>> as a new user of the new api myself, I think I can safely say the most >>> confusing thing about it is having the old deprecated API mixed in the >>> javadocs with it :) >>> >>> On Mon, Jun 15, 2009 at 2:53 PM, Mark Miller >>> wrote: >>> >>> >>> Robert Muir wrote: >>> >>> >>> Mark, I created an issue for this. >>> >>> >>> >>> Thanks Robert, great idea. >>> >>> >>> I just think you know, converting an analyzer to the new api is really >>> n
Re: New Token API was Re: Payloads and TrieRangeQuery
I may do the Highlighter. Its annoying though - I'll have to break back compat because Token is part of the public API (Fragmenter, etc). Robert Muir wrote: Michael OK, I plan on adding some tests for the analyzers that don't have any. I didn't try to migrate things such as highlighter, which are definitely just as important, only because I'm not familiar with that territory. But I think I can figure out what the various language analyzers are trying to do and add tests / convert the remaining ones. On Mon, Jun 15, 2009 at 4:42 PM, Michael Busch wrote: I agree. It's my fault, the task of changing the contribs (LUCENE-1460) is assigned to me for a while now - I just haven't found the time to do it yet. It's great that you started the work on that! I'll try to review the patch in the next couple of days and help with fixing the remaining ones. I'd like to get the AttributeSource improvements patch out first. I'll try that tonight. Michael On 6/15/09 1:35 PM, Robert Muir wrote: Michael, again I am terrible with such things myself... Personally I am impressed that you have the back compat, even if you don't change any code at all I think some reformatting of javadocs might make the situation a lot friendlier. I just listed everything that came to my mind immediately. I guess I will also mention that one of the reasons I might not use the new API is that since all filters, etc on the same chain must use the same API, its discouraging if all the contrib stuff doesn't support the new API, it makes me want to just stick with the old so everything will work. So I think contribs being on the new API is really important otherwise no one will want to use it. On Mon, Jun 15, 2009 at 4:21 PM, Michael Busch wrote: This is excellent feedback, Robert! I agree this is confusing; especially having a deprecated API and only a experimental one that replaces the old one. We need to change that. And I don't like the *useNewAPI*() methods either. 
I spent a lot of time thinking about backwards compatibility for this API. It's tricky to do without sacrificing performance. In API patches I find myself spending more time for backwards-compatibility than for the actual new feature! :( I'll try to think about how to simplify this confusing old/new API mix. However, we need to make the decisions: a) if we want to release this new API with 2.9, b) if yes to a), if we want to remove the old API in 3.0? If yes to a) and no to b), then we'll have to support both APIs for a presumably very long time, so we then need to have a better solution for the backwards-compatibility here. -Michael On 6/15/09 1:09 PM, Robert Muir wrote: let me try some slightly more constructive feedback: new user looks at TokenStream javadocs: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/org/apache/lucene/analysis/TokenStream.html immediately they see deprecated, text in red with the words "experimental", warnings in bold, the whole thing is scary! due to the use of 'e.g.' the javadoc for .incrementToken() is cut off in a bad way, and its probably the most important method to a new user! there's also a stray bold tag gone haywire somewhere, possibly .incrementToken() from a technical perspective, the documentation is excellent! but for a new user unfamiliar with lucene, its unclear exactly what steps to take: use the scary red experimental api or the old deprecated one? theres also some fairly advanced stuff such as .captureState and .restoreState that might be better in a different place. finally, the .setUseNewAPI() and .setUseNewAPIDefault() are confusing [one is static, one is not], especially because it states all streams and filters in one chain must use the same API, is there a way to simplify this? i'm really terrible with javadocs myself, but perhaps we can come up with a way to improve the presentation... maybe that will make the difference. 
On Mon, Jun 15, 2009 at 3:45 PM, Robert Muir wrote: Mark, I'll see if I can get tests produced for some of those analyzers. as a new user of the new api myself, I think I can safely say the most confusing thing about it is having the old deprecated API mixed in the javadocs with it :) On Mon, Jun 15, 2009 at 2:53 PM, Mark Miller wrote: Robert Muir wrote: Mark, I created an issue for this. Thanks Robert, great idea. I just think you know, converting an analyzer to the new api is really not that bad. I don't either. I'm really just complaining about the initial readability. Once you know whats up, its not too much different. I just have found myself having to refigure out whats up (a short task to be sure) over again after I leave it for a while. With the old one, everything was just kind of immediately self evident. That makes me think new users might be a little more confused when they first meet again. I'm not a new user though, so its only a guess really. reverse engineering what one of them does is not necessarily obvious, and is completely unrelat
Re: New Token API was Re: Payloads and TrieRangeQuery
Michael OK, I plan on adding some tests for the analyzers that don't have any. I didn't try to migrate things such as highlighter, which are definitely just as important, only because I'm not familiar with that territory. But I think I can figure out what the various language analyzers are trying to do and add tests / convert the remaining ones. On Mon, Jun 15, 2009 at 4:42 PM, Michael Busch wrote: > I agree. It's my fault, the task of changing the contribs (LUCENE-1460) is > assigned to me for a while now - I just haven't found the time to do it yet. > > It's great that you started the work on that! I'll try to review the patch > in the next couple of days and help with fixing the remaining ones. I'd like > to get the AttributeSource improvements patch out first. I'll try that > tonight. > > Michael > > On 6/15/09 1:35 PM, Robert Muir wrote: > > Michael, again I am terrible with such things myself... > > Personally I am impressed that you have the back compat, even if you > don't change any code at all I think some reformatting of javadocs > might make the situation a lot friendlier. I just listed everything > that came to my mind immediately. > > I guess I will also mention that one of the reasons I might not use > the new API is that since all filters, etc on the same chain must use > the same API, its discouraging if all the contrib stuff doesn't > support the new API, it makes me want to just stick with the old so > everything will work. So I think contribs being on the new API is > really important otherwise no one will want to use it. > > On Mon, Jun 15, 2009 at 4:21 PM, Michael Busch wrote: > > > This is excellent feedback, Robert! > > I agree this is confusing; especially having a deprecated API and only a > experimental one that replaces the old one. We need to change that. > And I don't like the *useNewAPI*() methods either. I spent a lot of time > thinking about backwards compatibility for this API. It's tricky to do > without sacrificing performance. 
In API patches I find myself spending more > time for backwards-compatibility than for the actual new feature! :( > > I'll try to think about how to simplify this confusing old/new API mix. > > However, we need to make the decisions: > a) if we want to release this new API with 2.9, > b) if yes to a), if we want to remove the old API in 3.0? > > If yes to a) and no to b), then we'll have to support both APIs for a > presumably very long time, so we then need to have a better solution for the > backwards-compatibility here. > > -Michael > > On 6/15/09 1:09 PM, Robert Muir wrote: > > let me try some slightly more constructive feedback: > > new user looks at TokenStream javadocs: > http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/org/apache/lucene/analysis/TokenStream.html > immediately they see deprecated, text in red with the words > "experimental", warnings in bold, the whole thing is scary! > due to the use of 'e.g.' the javadoc for .incrementToken() is cut off > in a bad way, and its probably the most important method to a new > user! > there's also a stray bold tag gone haywire somewhere, possibly > .incrementToken() > > from a technical perspective, the documentation is excellent! but for > a new user unfamiliar with lucene, its unclear exactly what steps to > take: use the scary red experimental api or the old deprecated one? > > theres also some fairly advanced stuff such as .captureState and > .restoreState that might be better in a different place. > > finally, the .setUseNewAPI() and .setUseNewAPIDefault() are confusing > [one is static, one is not], especially because it states all streams > and filters in one chain must use the same API, is there a way to > simplify this? > > i'm really terrible with javadocs myself, but perhaps we can come up > with a way to improve the presentation... maybe that will make the > difference. 
> > On Mon, Jun 15, 2009 at 3:45 PM, Robert Muir wrote: > > > Mark, I'll see if I can get tests produced for some of those analyzers. > > as a new user of the new api myself, I think I can safely say the most > confusing thing about it is having the old deprecated API mixed in the > javadocs with it :) > > On Mon, Jun 15, 2009 at 2:53 PM, Mark Miller wrote: > > > Robert Muir wrote: > > > Mark, I created an issue for this. > > > > Thanks Robert, great idea. > > > I just think you know, converting an analyzer to the new api is really > not that bad. > > > > I don't either. I'm really just complaining about the initial readability. > Once you know whats up, its not too much different. I just have found myself > having to refigure out whats up (a short task to be sure) over again after I > leave it for a while. With the old one, everything was just kind of > immediately self evident. > > That makes me think new users might be a little more confused when they > first meet again. I'm not a new user though, so its only a guess really. > > > reverse engineering what one of them does is not necessarily obvi
Re: New Token API was Re: Payloads and TrieRangeQuery
I agree. It's my fault, the task of changing the contribs (LUCENE-1460) is assigned to me for a while now - I just haven't found the time to do it yet. It's great that you started the work on that! I'll try to review the patch in the next couple of days and help with fixing the remaining ones. I'd like to get the AttributeSource improvements patch out first. I'll try that tonight. Michael On 6/15/09 1:35 PM, Robert Muir wrote: Michael, again I am terrible with such things myself... Personally I am impressed that you have the back compat, even if you don't change any code at all I think some reformatting of javadocs might make the situation a lot friendlier. I just listed everything that came to my mind immediately. I guess I will also mention that one of the reasons I might not use the new API is that since all filters, etc on the same chain must use the same API, its discouraging if all the contrib stuff doesn't support the new API, it makes me want to just stick with the old so everything will work. So I think contribs being on the new API is really important otherwise no one will want to use it. On Mon, Jun 15, 2009 at 4:21 PM, Michael Busch wrote: This is excellent feedback, Robert! I agree this is confusing; especially having a deprecated API and only a experimental one that replaces the old one. We need to change that. And I don't like the *useNewAPI*() methods either. I spent a lot of time thinking about backwards compatibility for this API. It's tricky to do without sacrificing performance. In API patches I find myself spending more time for backwards-compatibility than for the actual new feature! :( I'll try to think about how to simplify this confusing old/new API mix. However, we need to make the decisions: a) if we want to release this new API with 2.9, b) if yes to a), if we want to remove the old API in 3.0? 
If yes to a) and no to b), then we'll have to support both APIs for a presumably very long time, so we then need to have a better solution for the backwards-compatibility here. -Michael On 6/15/09 1:09 PM, Robert Muir wrote: let me try some slightly more constructive feedback: new user looks at TokenStream javadocs: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/org/apache/lucene/analysis/TokenStream.html immediately they see deprecated, text in red with the words "experimental", warnings in bold, the whole thing is scary! due to the use of 'e.g.' the javadoc for .incrementToken() is cut off in a bad way, and its probably the most important method to a new user! there's also a stray bold tag gone haywire somewhere, possibly .incrementToken() from a technical perspective, the documentation is excellent! but for a new user unfamiliar with lucene, its unclear exactly what steps to take: use the scary red experimental api or the old deprecated one? theres also some fairly advanced stuff such as .captureState and .restoreState that might be better in a different place. finally, the .setUseNewAPI() and .setUseNewAPIDefault() are confusing [one is static, one is not], especially because it states all streams and filters in one chain must use the same API, is there a way to simplify this? i'm really terrible with javadocs myself, but perhaps we can come up with a way to improve the presentation... maybe that will make the difference. On Mon, Jun 15, 2009 at 3:45 PM, Robert Muir wrote: Mark, I'll see if I can get tests produced for some of those analyzers. as a new user of the new api myself, I think I can safely say the most confusing thing about it is having the old deprecated API mixed in the javadocs with it :) On Mon, Jun 15, 2009 at 2:53 PM, Mark Miller wrote: Robert Muir wrote: Mark, I created an issue for this. Thanks Robert, great idea. I just think you know, converting an analyzer to the new api is really not that bad. I don't either. 
I'm really just complaining about the initial readability. Once you know whats up, its not too much different. I just have found myself having to refigure out whats up (a short task to be sure) over again after I leave it for a while. With the old one, everything was just kind of immediately self evident. That makes me think new users might be a little more confused when they first meet again. I'm not a new user though, so its only a guess really. reverse engineering what one of them does is not necessarily obvious, and is completely unrelated but necessary if they are to be migrated. I'd be willing to assist with some of this but I don't want to really work the issue if its gonna be a waste of time at the end of the day... The chances of this issue being fully reverted are so remote that I really wouldnt let that stop you ... On Mon, Jun 15, 2009 at 1:55 PM, Mark Miller wrote: Robert Muir wrote: As Lucene's contrib hasn't been fully converted either (and its been quite some time now), someone has probably heard that groan before. hope this doesn't sound like a complaint
Re: New Token API was Re: Payloads and TrieRangeQuery
On Mon, Jun 15, 2009 at 4:21 PM, Uwe Schindler wrote: > And, in tests: test/o/a/l/index/store is somehow wrong placed. The class > inside should be in test/o/a/l/store. Should I move? Please do! Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: New Token API was Re: Payloads and TrieRangeQuery
I have implemented most of that actually (the interface part and Token implementing all of them). The problem is a paradigm change with the new API: the assumption is that there is always only one single instance of an Attribute. With the old API, it is recommended to reuse the passed-in token, but you don't have to, you can also return a new one with every call of next(). Now with this change the indexer classes should only know about the interfaces, it shouldn't know Token anymore, which seems fine when Token implements all those interfaces. BUT, since there can be more than one instance of Token, the indexer would have to call getAttribute() for all Attributes it needs after each call of next(). I haven't measured how expensive that is, but it seems like a severe performance hit. That's basically the main reason why the backwards compatibility is ensured in such a goofy way right now. Michael On 6/15/09 1:28 PM, Uwe Schindler wrote: And I don't like the *useNewAPI*() methods either. I spent a lot of time thinking about backwards compatibility for this API. It's tricky to do without sacrificing performance. In API patches I find myself spending more time for backwards-compatibility than for the actual new feature! :( I'll try to think about how to simplify this confusing old/new API mix. One solution to fix this useNewAPI problem would be to change the AttributeSource in a way, to return classes that implement interfaces (as you proposed some weeks ago). The good old Token class then does not need to be deprecated, it could simply implement all 4 interfaces. AttributeSource then must implement a registry, which classes implement which interfaces. So if somebody wants a TermAttribute, he always gets the Token. In all other cases, the default could be a *Impl default class. In this case, next() could simply return this Token (which implements all 4 attributes). 
The reusableToken is simply thrown away in the deprecated API, the reusable Token comes from the AttributeSource and is per-instance. Is this an idea? Uwe
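Michael's point about the single-instance assumption is easier to see in code. Below is a minimal, self-contained sketch of the pattern under discussion; the class and method names (ToyTokenStream, the tiny AttributeSource registry) are simplified stand-ins invented for this example, not the actual Lucene 2.9 classes. The consumer looks the attribute up once, before the loop, which is only safe because the stream mutates one shared instance instead of possibly handing back a different Token on every call.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy attribute interface and implementation (illustrative names only).
interface TermAttribute {
    String term();
    void setTerm(String term);
}

class TermAttributeImpl implements TermAttribute {
    private String term = "";
    public String term() { return term; }
    public void setTerm(String term) { this.term = term; }
}

// Minimal AttributeSource: a registry holding at most one shared
// instance per attribute interface.
class AttributeSource {
    private final Map<Class<?>, Object> attributes = new HashMap<>();

    @SuppressWarnings("unchecked")
    <T> T addAttribute(Class<T> iface, T impl) {
        return (T) attributes.computeIfAbsent(iface, k -> impl);
    }
}

// New-style stream over a fixed token list: incrementToken() mutates
// the shared TermAttribute instead of returning a Token per call.
class ToyTokenStream extends AttributeSource {
    private final List<String> tokens;
    private int pos = 0;
    final TermAttribute termAtt =
            addAttribute(TermAttribute.class, new TermAttributeImpl());

    ToyTokenStream(List<String> tokens) { this.tokens = tokens; }

    boolean incrementToken() {
        if (pos >= tokens.size()) return false;
        termAtt.setTerm(tokens.get(pos++)); // reuse the single instance
        return true;
    }
}

public class SingleInstanceDemo {
    // Consumer side: fetch the attribute ONCE, outside the loop.
    // This is exactly the step that breaks if next() may return a
    // different Token object on each call.
    static List<String> collect(List<String> input) {
        ToyTokenStream ts = new ToyTokenStream(input);
        TermAttribute term = ts.termAtt;
        List<String> seen = new ArrayList<>();
        while (ts.incrementToken()) {
            seen.add(term.term());
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(collect(List.of("new", "token", "api")));
    }
}
```

The performance concern in the mail follows directly: if the single-instance guarantee does not hold, the getAttribute() lookup has to move inside the while loop, once per token.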
Re: New Token API was Re: Payloads and TrieRangeQuery
Michael, again I am terrible with such things myself... Personally I am impressed that you have the back compat, even if you don't change any code at all I think some reformatting of javadocs might make the situation a lot friendlier. I just listed everything that came to my mind immediately. I guess I will also mention that one of the reasons I might not use the new API is that since all filters, etc on the same chain must use the same API, its discouraging if all the contrib stuff doesn't support the new API, it makes me want to just stick with the old so everything will work. So I think contribs being on the new API is really important otherwise no one will want to use it. On Mon, Jun 15, 2009 at 4:21 PM, Michael Busch wrote: > This is excellent feedback, Robert! > > I agree this is confusing; especially having a deprecated API and only a > experimental one that replaces the old one. We need to change that. > And I don't like the *useNewAPI*() methods either. I spent a lot of time > thinking about backwards compatibility for this API. It's tricky to do > without sacrificing performance. In API patches I find myself spending more > time for backwards-compatibility than for the actual new feature! :( > > I'll try to think about how to simplify this confusing old/new API mix. > > However, we need to make the decisions: > a) if we want to release this new API with 2.9, > b) if yes to a), if we want to remove the old API in 3.0? > > If yes to a) and no to b), then we'll have to support both APIs for a > presumably very long time, so we then need to have a better solution for the > backwards-compatibility here. 
> > -Michael > > On 6/15/09 1:09 PM, Robert Muir wrote: > > let me try some slightly more constructive feedback: > > new user looks at TokenStream javadocs: > http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/org/apache/lucene/analysis/TokenStream.html > immediately they see deprecated, text in red with the words > "experimental", warnings in bold, the whole thing is scary! > due to the use of 'e.g.' the javadoc for .incrementToken() is cut off > in a bad way, and its probably the most important method to a new > user! > there's also a stray bold tag gone haywire somewhere, possibly > .incrementToken() > > from a technical perspective, the documentation is excellent! but for > a new user unfamiliar with lucene, its unclear exactly what steps to > take: use the scary red experimental api or the old deprecated one? > > theres also some fairly advanced stuff such as .captureState and > .restoreState that might be better in a different place. > > finally, the .setUseNewAPI() and .setUseNewAPIDefault() are confusing > [one is static, one is not], especially because it states all streams > and filters in one chain must use the same API, is there a way to > simplify this? > > i'm really terrible with javadocs myself, but perhaps we can come up > with a way to improve the presentation... maybe that will make the > difference. > > On Mon, Jun 15, 2009 at 3:45 PM, Robert Muir wrote: > > > Mark, I'll see if I can get tests produced for some of those analyzers. > > as a new user of the new api myself, I think I can safely say the most > confusing thing about it is having the old deprecated API mixed in the > javadocs with it :) > > On Mon, Jun 15, 2009 at 2:53 PM, Mark Miller wrote: > > > Robert Muir wrote: > > > Mark, I created an issue for this. > > > > Thanks Robert, great idea. > > > I just think you know, converting an analyzer to the new api is really > not that bad. > > > > I don't either. I'm really just complaining about the initial readability. 
> Once you know whats up, its not too much different. I just have found myself > having to refigure out whats up (a short task to be sure) over again after I > leave it for a while. With the old one, everything was just kind of > immediately self evident. > > That makes me think new users might be a little more confused when they > first meet again. I'm not a new user though, so its only a guess really. > > > reverse engineering what one of them does is not necessarily obvious, > and is completely unrelated but necessary if they are to be migrated. > > I'd be willing to assist with some of this but I don't want to really > work the issue if its gonna be a waste of time at the end of the > day... > > > > The chances of this issue being fully reverted are so remote that I really > wouldnt let that stop you ... > > > On Mon, Jun 15, 2009 at 1:55 PM, Mark Miller wrote: > > > > Robert Muir wrote: > > > > As Lucene's contrib hasn't been fully converted either (and its been > quite > some time now), someone has probably heard that groan before. > > > > > hope this doesn't sound like a complaint, > > > > Complaints are fine in any case. Every now and then, it might cause a > little > rant from me or something, but please don't let that dissuade you :) > Who doesnt like to rant and rave now and then. As long as thoughts and > opinions are coming out in a non negative way (which certainly includes > complaints), I think its all good.
RE: New Token API was Re: Payloads and TrieRangeQuery
> And I don't like the *useNewAPI*() methods either. I spent a lot of time > thinking about backwards compatibility for this API. It's tricky to do > without sacrificing performance. In API patches I find myself spending > more time for backwards-compatibility than for the actual new feature! :( > > I'll try to think about how to simplify this confusing old/new API mix. One solution to fix this useNewAPI problem would be to change the AttributeSource in a way, to return classes that implement interfaces (as you proposed some weeks ago). The good old Token class then does not need to be deprecated, it could simply implement all 4 interfaces. AttributeSource then must implement a registry, which classes implement which interfaces. So if somebody wants a TermAttribute, he always gets the Token. In all other cases, the default could be a *Impl default class. In this case, next() could simply return this Token (which implements all 4 attributes). The reusableToken is simply thrown away in the deprecated API, the reusable Token comes from the AttributeSource and is per-instance. Is this an idea? Uwe
RE: New Token API was Re: Payloads and TrieRangeQuery
By the way, there is an empty "de" subdir in SVN inside analysis. Can this be removed? And, in tests: test/o/a/l/index/store is somehow wrongly placed. The class inside should be in test/o/a/l/store. Should I move? - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -Original Message- > From: Uwe Schindler [mailto:u...@thetaphi.de] > Sent: Monday, June 15, 2009 10:18 PM > To: java-dev@lucene.apache.org > Subject: RE: New Token API was Re: Payloads and TrieRangeQuery > > > there's also a stray bold tag gone haywire somewhere, possibly > > .incrementToken() > > I fixed this. This was getting on my nerves the whole day when I wrote > javadocs for NumericTokenStream... > > Uwe
Re: New Token API was Re: Payloads and TrieRangeQuery
This is excellent feedback, Robert! I agree this is confusing; especially having a deprecated API and only an experimental one that replaces the old one. We need to change that. And I don't like the *useNewAPI*() methods either. I spent a lot of time thinking about backwards compatibility for this API. It's tricky to do without sacrificing performance. In API patches I find myself spending more time for backwards-compatibility than for the actual new feature! :( I'll try to think about how to simplify this confusing old/new API mix. However, we need to make the decisions: a) if we want to release this new API with 2.9, b) if yes to a), if we want to remove the old API in 3.0? If yes to a) and no to b), then we'll have to support both APIs for a presumably very long time, so we then need to have a better solution for the backwards-compatibility here. -Michael On 6/15/09 1:09 PM, Robert Muir wrote: let me try some slightly more constructive feedback: new user looks at TokenStream javadocs: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/org/apache/lucene/analysis/TokenStream.html immediately they see deprecated, text in red with the words "experimental", warnings in bold, the whole thing is scary! due to the use of 'e.g.' the javadoc for .incrementToken() is cut off in a bad way, and its probably the most important method to a new user! there's also a stray bold tag gone haywire somewhere, possibly .incrementToken() from a technical perspective, the documentation is excellent! but for a new user unfamiliar with lucene, its unclear exactly what steps to take: use the scary red experimental api or the old deprecated one? theres also some fairly advanced stuff such as .captureState and .restoreState that might be better in a different place. 
finally, the .setUseNewAPI() and .setUseNewAPIDefault() are confusing [one is static, one is not], especially because it states all streams and filters in one chain must use the same API, is there a way to simplify this? i'm really terrible with javadocs myself, but perhaps we can come up with a way to improve the presentation... maybe that will make the difference. On Mon, Jun 15, 2009 at 3:45 PM, Robert Muir wrote: Mark, I'll see if I can get tests produced for some of those analyzers. as a new user of the new api myself, I think I can safely say the most confusing thing about it is having the old deprecated API mixed in the javadocs with it :) On Mon, Jun 15, 2009 at 2:53 PM, Mark Miller wrote: Robert Muir wrote: Mark, I created an issue for this. Thanks Robert, great idea. I just think you know, converting an analyzer to the new api is really not that bad. I don't either. I'm really just complaining about the initial readability. Once you know whats up, its not too much different. I just have found myself having to refigure out whats up (a short task to be sure) over again after I leave it for a while. With the old one, everything was just kind of immediately self evident. That makes me think new users might be a little more confused when they first meet again. I'm not a new user though, so its only a guess really. reverse engineering what one of them does is not necessarily obvious, and is completely unrelated but necessary if they are to be migrated. I'd be willing to assist with some of this but I don't want to really work the issue if its gonna be a waste of time at the end of the day... The chances of this issue being fully reverted are so remote that I really wouldnt let that stop you ... On Mon, Jun 15, 2009 at 1:55 PM, Mark Miller wrote: Robert Muir wrote: As Lucene's contrib hasn't been fully converted either (and its been quite some time now), someone has probably heard that groan before. 
hope this doesn't sound like a complaint, Complaints are fine in any case. Every now and then, it might cause a little rant from me or something, but please don't let that dissuade you :) Who doesnt like to rant and rave now and then. As long as thoughts and opinions are coming out in a non negative way (which certainly includes complaints), I think its all good. but in my opinion this is because many do not have any tests. I converted a few of these and its just grunt work but if there are no tests, its impossible to verify the conversion is correct. Thanks for pointing that out. We probably get lazy with tests, especially in contrib, and this brings up a good point - we should probably push for tests or write them before committing more often. Sometimes I'm sure it just comes downto a tradeoff though - no resources at the time, the class looked clear cut, and it was just contrib anyway. But then here we are ... a healthy dose of grunt work is bad enough when you have tests to check it. -- - Mark http://www.lucidimagination.com
Re: New Token API was Re: Payloads and TrieRangeQuery
Some great points - especially the decision between a deprecated API, and a new experimental one subject to change. Bit of a rock and a hard place for a new user. Perhaps we should add a little note with some guidance. - Mark Robert Muir wrote: let me try some slightly more constructive feedback: new user looks at TokenStream javadocs: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/org/apache/lucene/analysis/TokenStream.html immediately they see deprecated, text in red with the words "experimental", warnings in bold, the whole thing is scary! due to the use of 'e.g.' the javadoc for .incrementToken() is cut off in a bad way, and its probably the most important method to a new user! there's also a stray bold tag gone haywire somewhere, possibly .incrementToken() from a technical perspective, the documentation is excellent! but for a new user unfamiliar with lucene, its unclear exactly what steps to take: use the scary red experimental api or the old deprecated one? theres also some fairly advanced stuff such as .captureState and .restoreState that might be better in a different place. finally, the .setUseNewAPI() and .setUseNewAPIDefault() are confusing [one is static, one is not], especially because it states all streams and filters in one chain must use the same API, is there a way to simplify this? i'm really terrible with javadocs myself, but perhaps we can come up with a way to improve the presentation... maybe that will make the difference. On Mon, Jun 15, 2009 at 3:45 PM, Robert Muir wrote: Mark, I'll see if I can get tests produced for some of those analyzers. as a new user of the new api myself, I think I can safely say the most confusing thing about it is having the old deprecated API mixed in the javadocs with it :) On Mon, Jun 15, 2009 at 2:53 PM, Mark Miller wrote: Robert Muir wrote: Mark, I created an issue for this. Thanks Robert, great idea. I just think you know, converting an analyzer to the new api is really not that bad. 
I don't either. I'm really just complaining about the initial readability. Once you know whats up, its not too much different. I just have found myself having to refigure out whats up (a short task to be sure) over again after I leave it for a while. With the old one, everything was just kind of immediately self evident. That makes me think new users might be a little more confused when they first meet again. I'm not a new user though, so its only a guess really. reverse engineering what one of them does is not necessarily obvious, and is completely unrelated but necessary if they are to be migrated. I'd be willing to assist with some of this but I don't want to really work the issue if its gonna be a waste of time at the end of the day... The chances of this issue being fully reverted are so remote that I really wouldnt let that stop you ... On Mon, Jun 15, 2009 at 1:55 PM, Mark Miller wrote: Robert Muir wrote: As Lucene's contrib hasn't been fully converted either (and its been quite some time now), someone has probably heard that groan before. hope this doesn't sound like a complaint, Complaints are fine in any case. Every now and then, it might cause a little rant from me or something, but please don't let that dissuade you :) Who doesnt like to rant and rave now and then. As long as thoughts and opinions are coming out in a non negative way (which certainly includes complaints), I think its all good. but in my opinion this is because many do not have any tests. I converted a few of these and its just grunt work but if there are no tests, its impossible to verify the conversion is correct. Thanks for pointing that out. We probably get lazy with tests, especially in contrib, and this brings up a good point - we should probably push for tests or write them before committing more often. Sometimes I'm sure it just comes downto a tradeoff though - no resources at the time, the class looked clear cut, and it was just contrib anyway. But then here we are ... 
a healthy dose of grunt work is bad enough when you have tests to check it. -- - Mark http://www.lucidimagination.com -- Robert Muir rcm...@gmail.com
RE: New Token API was Re: Payloads and TrieRangeQuery
> there's also a stray bold tag gone haywire somewhere, possibly > .incrementToken() I fixed this. This was getting on my nerves the whole day when I wrote javadocs for NumericTokenStream... Uwe
Re: New Token API was Re: Payloads and TrieRangeQuery
let me try some slightly more constructive feedback: new user looks at TokenStream javadocs: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/org/apache/lucene/analysis/TokenStream.html immediately they see deprecated, text in red with the words "experimental", warnings in bold, the whole thing is scary! due to the use of 'e.g.' the javadoc for .incrementToken() is cut off in a bad way, and its probably the most important method to a new user! there's also a stray bold tag gone haywire somewhere, possibly .incrementToken() from a technical perspective, the documentation is excellent! but for a new user unfamiliar with lucene, its unclear exactly what steps to take: use the scary red experimental api or the old deprecated one? theres also some fairly advanced stuff such as .captureState and .restoreState that might be better in a different place. finally, the .setUseNewAPI() and .setUseNewAPIDefault() are confusing [one is static, one is not], especially because it states all streams and filters in one chain must use the same API, is there a way to simplify this? i'm really terrible with javadocs myself, but perhaps we can come up with a way to improve the presentation... maybe that will make the difference. On Mon, Jun 15, 2009 at 3:45 PM, Robert Muir wrote: > Mark, I'll see if I can get tests produced for some of those analyzers. > > as a new user of the new api myself, I think I can safely say the most > confusing thing about it is having the old deprecated API mixed in the > javadocs with it :) > > On Mon, Jun 15, 2009 at 2:53 PM, Mark Miller wrote: >> Robert Muir wrote: >>> >>> Mark, I created an issue for this. >>> >> >> Thanks Robert, great idea. >>> >>> I just think you know, converting an analyzer to the new api is really >>> not that bad. >>> >> >> I don't either. I'm really just complaining about the initial readability. >> Once you know whats up, its not too much different. 
I just have found myself >> having to refigure out whats up (a short task to be sure) over again after I >> leave it for a while. With the old one, everything was just kind of >> immediately self evident. >> >> That makes me think new users might be a little more confused when they >> first meet again. I'm not a new user though, so its only a guess really. >>> >>> reverse engineering what one of them does is not necessarily obvious, >>> and is completely unrelated but necessary if they are to be migrated. >>> >>> I'd be willing to assist with some of this but I don't want to really >>> work the issue if its gonna be a waste of time at the end of the >>> day... >>> >> >> The chances of this issue being fully reverted are so remote that I really >> wouldnt let that stop you ... >>> >>> On Mon, Jun 15, 2009 at 1:55 PM, Mark Miller wrote: >>> Robert Muir wrote: >> >> As Lucene's contrib hasn't been fully converted either (and its been >> quite >> some time now), someone has probably heard that groan before. >> >> > > hope this doesn't sound like a complaint, > Complaints are fine in any case. Every now and then, it might cause a little rant from me or something, but please don't let that dissuade you :) Who doesnt like to rant and rave now and then. As long as thoughts and opinions are coming out in a non negative way (which certainly includes complaints), I think its all good. > > but in my opinion this is > because many do not have any tests. > I converted a few of these and its just grunt work but if there are no > tests, its impossible to verify the conversion is correct. > > Thanks for pointing that out. We probably get lazy with tests, especially in contrib, and this brings up a good point - we should probably push for tests or write them before committing more often. Sometimes I'm sure it just comes downto a tradeoff though - no resources at the time, the class looked clear cut, and it was just contrib anyway. But then here we are ... 
a healthy dose of grunt work is bad enough when you have tests to check it. -- - Mark http://www.lucidimagination.com -- Robert Muir rcm...@gmail.com
Re: New Token API was Re: Payloads and TrieRangeQuery
Mark, I'll see if I can get tests produced for some of those analyzers. as a new user of the new api myself, I think I can safely say the most confusing thing about it is having the old deprecated API mixed in the javadocs with it :) On Mon, Jun 15, 2009 at 2:53 PM, Mark Miller wrote: > Robert Muir wrote: >> >> Mark, I created an issue for this. >> > > Thanks Robert, great idea. >> >> I just think you know, converting an analyzer to the new api is really >> not that bad. >> > > I don't either. I'm really just complaining about the initial readability. > Once you know whats up, its not too much different. I just have found myself > having to refigure out whats up (a short task to be sure) over again after I > leave it for a while. With the old one, everything was just kind of > immediately self evident. > > That makes me think new users might be a little more confused when they > first meet again. I'm not a new user though, so its only a guess really. >> >> reverse engineering what one of them does is not necessarily obvious, >> and is completely unrelated but necessary if they are to be migrated. >> >> I'd be willing to assist with some of this but I don't want to really >> work the issue if its gonna be a waste of time at the end of the >> day... >> > > The chances of this issue being fully reverted are so remote that I really > wouldnt let that stop you ... >> >> On Mon, Jun 15, 2009 at 1:55 PM, Mark Miller wrote: >> >>> >>> Robert Muir wrote: >>> > > As Lucene's contrib hasn't been fully converted either (and its been > quite > some time now), someone has probably heard that groan before. > > hope this doesn't sound like a complaint, >>> >>> Complaints are fine in any case. Every now and then, it might cause a >>> little >>> rant from me or something, but please don't let that dissuade you :) >>> Who doesnt like to rant and rave now and then. 
As long as thoughts and >>> opinions are coming out in a non negative way (which certainly includes >>> complaints), >>> I think its all good. >>> but in my opinion this is because many do not have any tests. I converted a few of these and its just grunt work but if there are no tests, its impossible to verify the conversion is correct. >>> >>> Thanks for pointing that out. We probably get lazy with tests, especially >>> in >>> contrib, and this brings up a good point - we should probably push >>> for tests or write them before committing more often. Sometimes I'm sure >>> it >>> just comes down to a tradeoff though - no resources at the time, >>> the class looked clear cut, and it was just contrib anyway. But then here >>> we >>> are ... a healthy dose of grunt work is bad enough when you have tests to >>> check it. >>> >>> -- >>> - Mark >>> >>> http://www.lucidimagination.com -- Robert Muir rcm...@gmail.com
RE: New Token API was Re: Payloads and TrieRangeQuery
> If you understood that, you'd be able to look > at the actual token value if you were interested in what shift was > used. So it's redundant, has a runtime cost, it's not currently used > anywhere, and it's not useful to fields other than Trie. Perhaps it > shouldn't exist (yet)? You are right, you could also decode the shift value from the first char of the token... I think I will remove the ShiftAttribute and only set the TermType to mark highest vs. lower precisions. That way, one could easily add a payload to the real numeric value using a TokenFilter. Uwe
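A tiny self-contained illustration of the redundancy Yonik and Uwe are discussing. The encoding below is made up for the example (the base character and the hex body are not the real prefix-coded term format): the point is only that once the shift is folded into the first char of the term, any consumer can recover it from the token text alone, so a separate ShiftAttribute carries no extra information.

```java
public class ShiftPrefixDemo {
    // Illustrative base char for carrying the shift; not Lucene's actual constant.
    static final char SHIFT_START = 0x20;

    // Prefix-encode: record the shift in the first char, then the
    // value with its low `shift` bits dropped, in hex.
    static String encode(long value, int shift) {
        return (char) (SHIFT_START + shift) + Long.toString(value >>> shift, 16);
    }

    // The shift is recoverable from the first char alone -- this is why
    // the attribute is argued to be redundant.
    static int decodeShift(String token) {
        return token.charAt(0) - SHIFT_START;
    }

    public static void main(String[] args) {
        String tok = encode(0x1234L, 8);
        System.out.println(decodeShift(tok)); // prints 8
    }
}
```
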
RE: New Token API was Re: Payloads and TrieRangeQuery
> On Mon, Jun 15, 2009 at 3:00 PM, Uwe Schindler wrote: > > There is a new Attribute called ShiftAttribute (or > NumericShiftAttribute), > > when trie range is moved to core. This attribute contains the shifted- > away > > bits from the prefix encoded value during trie indexing. > > I was wondering about this > To make use of ShiftAttribute, you need to understand the trie > encoding scheme itself. If you understood that, you'd be able to look > at the actual token value if you were interested in what shift was > used. So it's redundant, has a runtime cost, it's not currently used > anywhere, and it's not useful to fields other than Trie. Perhaps it > shouldn't exist (yet)? The idea was to make the indexing process controllable. You were the one who asked e.g. for the possibility to add payloads to trie fields and so on. Using the shift attribute, you have full control of the token types. OK, it's a little bit redundant; you could also use the TypeAttribute (which is already used to mark highest precision and lower precision values). One question about the whole TokenStream: In the original case we discussed about Payloads/Position and TrieRange. If this would be implemented in future versions, the question is, how should I set the PositionIncrement/Offsets in the token stream to create a Position of 0 in the index. I do not understand the indexing process here, especially this deprecated boolean flag about something negative (not sure what the name was). Should I set PositionIncrement to 0 for all Trie fields per default? How about PositionIncrementGap, when indexing more than one field? All not really clear. The position would be simpler to implement, but doing this with an attribute that is indexed together with the other attributes like a payload would be the most ideal solution for future versions of TrieRange. 
(Maybe we could also use the Offset attribute for the highest precision bits) Uwe
Re: New Token API was Re: Payloads and TrieRangeQuery
On Mon, Jun 15, 2009 at 3:00 PM, Uwe Schindler wrote: > There is a new Attribute called ShiftAttribute (or NumericShiftAttribute), > when trie range is moved to core. This attribute contains the shifted-away > bits from the prefix encoded value during trie indexing. I was wondering about this. To make use of ShiftAttribute, you need to understand the trie encoding scheme itself. If you understood that, you'd be able to look at the actual token value if you were interested in what shift was used. So it's redundant, has a runtime cost, it's not currently used anywhere, and it's not useful to fields other than Trie. Perhaps it shouldn't exist (yet)? -Yonik http://www.lucidimagination.com
RE: New Token API was Re: Payloads and TrieRangeQuery
> Also, what about the case where one might have attributes that are meant > for downstream TokenFilters, but not necessarily for indexing? Offsets > and type come to mind. Is it the case now that those attributes are not > automatically added to the index? If they are ignored now, what if I > want to add them? I admit, I'm having a hard time finding the code that > specifically loops over the Attributes. I recall seeing it, but can no > longer find it. There is a new Attribute called ShiftAttribute (or NumericShiftAttribute), when trie range is moved to core. This attribute contains the shifted-away bits from the prefix encoded value during trie indexing. The idea is to e.g. have TokenFilters that may add additional payloads or other attributes to trie values, but only do this for specific precisions. In future, it may also be interesting to automatically add this attribute to the index. Maybe we should add a read/store method to attributes, that adds an attribute to the Posting using an IndexOutput/IndexInput (like the serialization methods). Uwe
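To illustrate the idea of a downstream filter that acts only on specific precisions, here is a hedged, self-contained sketch. The token and type names (ToyToken, "fullPrecNumeric") are hypothetical stand-ins, not the real Lucene or trie API: a filter pass attaches a payload only to tokens whose type marks them as full precision, leaving the lower-precision helper terms untouched.

```java
import java.util.List;

// Toy "token": a term plus a type string; names are illustrative only.
class ToyToken {
    final String term;
    final String type;
    byte[] payload; // null until a filter attaches one
    ToyToken(String term, String type) { this.term = term; this.type = type; }
}

// A filter-style pass in the spirit Uwe describes: the token type
// (rather than a separate ShiftAttribute) decides which tokens get
// the extra payload.
public class PrecisionPayloadFilter {
    static List<ToyToken> filter(List<ToyToken> in, byte[] payload) {
        for (ToyToken t : in) {
            if ("fullPrecNumeric".equals(t.type)) { // hypothetical type name
                t.payload = payload;
            }
        }
        return in;
    }

    public static void main(String[] args) {
        List<ToyToken> tokens = List.of(
                new ToyToken("(1234", "fullPrecNumeric"),
                new ToyToken("012", "lowerPrecNumeric"));
        filter(tokens, new byte[]{42});
        System.out.println(tokens.get(0).payload != null); // true
        System.out.println(tokens.get(1).payload == null); // true
    }
}
```
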
Re: New Token API was Re: Payloads and TrieRangeQuery
Robert Muir wrote: Mark, I created an issue for this. Thanks Robert, great idea. I just think you know, converting an analyzer to the new api is really not that bad. I don't either. I'm really just complaining about the initial readability. Once you know whats up, its not too much different. I just have found myself having to refigure out whats up (a short task to be sure) over again after I leave it for a while. With the old one, everything was just kind of immediately self evident. That makes me think new users might be a little more confused when they first meet again. I'm not a new user though, so its only a guess really. reverse engineering what one of them does is not necessarily obvious, and is completely unrelated but necessary if they are to be migrated. I'd be willing to assist with some of this but I don't want to really work the issue if its gonna be a waste of time at the end of the day... The chances of this issue being fully reverted are so remote that I really wouldnt let that stop you ... On Mon, Jun 15, 2009 at 1:55 PM, Mark Miller wrote: Robert Muir wrote: As Lucene's contrib hasn't been fully converted either (and its been quite some time now), someone has probably heard that groan before. hope this doesn't sound like a complaint, Complaints are fine in any case. Every now and then, it might cause a little rant from me or something, but please don't let that dissuade you :) Who doesnt like to rant and rave now and then. As long as thoughts and opinions are coming out in a non negative way (which certainly includes complaints), I think its all good. but in my opinion this is because many do not have any tests. I converted a few of these and its just grunt work but if there are no tests, its impossible to verify the conversion is correct. Thanks for pointing that out. We probably get lazy with tests, especially in contrib, and this brings up a good point - we should probably push for tests or write them before committing more often. 
Sometimes I'm sure it just comes down to a tradeoff though - no resources at the time, the class looked clear cut, and it was just contrib anyway. But then here we are ... a healthy dose of grunt work is bad enough when you have tests to check it. -- - Mark http://www.lucidimagination.com
Re: New Token API was Re: Payloads and TrieRangeQuery
Mark, I created an issue for this. I just think you know, converting an analyzer to the new api is really not that bad. reverse engineering what one of them does is not necessarily obvious, and is completely unrelated but necessary if they are to be migrated. I'd be willing to assist with some of this but I don't want to really work the issue if its gonna be a waste of time at the end of the day... On Mon, Jun 15, 2009 at 1:55 PM, Mark Miller wrote: > Robert Muir wrote: >>> >>> As Lucene's contrib hasn't been fully converted either (and its been >>> quite >>> some time now), someone has probably heard that groan before. >>> >> >> hope this doesn't sound like a complaint, > > Complaints are fine in any case. Every now and then, it might cause a little > rant from me or something, but please don't let that dissuade you :) > Who doesnt like to rant and rave now and then. As long as thoughts and > opinions are coming out in a non negative way (which certainly includes > complaints), > I think its all good. >> >> but in my opinion this is >> because many do not have any tests. >> I converted a few of these and its just grunt work but if there are no >> tests, its impossible to verify the conversion is correct. >> > > Thanks for pointing that out. We probably get lazy with tests, especially in > contrib, and this brings up a good point - we should probably push > for tests or write them before committing more often. Sometimes I'm sure it > just comes downto a tradeoff though - no resources at the time, > the class looked clear cut, and it was just contrib anyway. But then here we > are ... a healthy dose of grunt work is bad enough when you have tests to > check it. 
-- Robert Muir rcm...@gmail.com
Re: New Token API was Re: Payloads and TrieRangeQuery
Robert Muir wrote: As Lucene's contrib hasn't been fully converted either (and its been quite some time now), someone has probably heard that groan before. hope this doesn't sound like a complaint, Complaints are fine in any case. Every now and then, it might cause a little rant from me or something, but please don't let that dissuade you :) Who doesnt like to rant and rave now and then. As long as thoughts and opinions are coming out in a non negative way (which certainly includes complaints), I think its all good. but in my opinion this is because many do not have any tests. I converted a few of these and its just grunt work but if there are no tests, its impossible to verify the conversion is correct. Thanks for pointing that out. We probably get lazy with tests, especially in contrib, and this brings up a good point - we should probably push for tests or write them before committing more often. Sometimes I'm sure it just comes down to a tradeoff though - no resources at the time, the class looked clear cut, and it was just contrib anyway. But then here we are ... a healthy dose of grunt work is bad enough when you have tests to check it. -- - Mark http://www.lucidimagination.com
Re: New Token API was Re: Payloads and TrieRangeQuery
On Jun 14, 2009, at 8:05 PM, Michael Busch wrote: I'd be happy to discuss other API proposals that anybody brings up here, that have the same advantages and are more intuitive. We could also beef up the documentation and give a better example about how to convert a stream/filter from the old to the new API; a constructive suggestion that Uwe made at the ApacheCon. More questions: 1. What about Highlighter and MoreLikeThis? They have not been converted. Also, what are they going to do if the attributes they need are not available? Caveat emptor? 2. Same for TermVectors. What if the user specifies with positions and offsets, but the analyzer doesn't produce them? Caveat emptor? (BTW, this is also true for the new omit TF stuff) 3. Also, what about the case where one might have attributes that are meant for downstream TokenFilters, but not necessarily for indexing? Offsets and type come to mind. Is it the case now that those attributes are not automatically added to the index? If they are ignored now, what if I want to add them? I admit, I'm having a hard time finding the code that specifically loops over the Attributes. I recall seeing it, but can no longer find it. Also, can we add something like an AttributeTermQuery? Seems like it could work similar to the BoostingTermQuery. I'm sure more will come to me. -Grant
Re: New Token API was Re: Payloads and TrieRangeQuery
> > As Lucene's contrib hasn't been fully converted either (and its been quite > some time now), someone has probably heard that groan before. hope this doesn't sound like a complaint, but in my opinion this is because many do not have any tests. I converted a few of these and its just grunt work but if there are no tests, its impossible to verify the conversion is correct. -- Robert Muir rcm...@gmail.com
Re: New Token API was Re: Payloads and TrieRangeQuery
Yonik Seeley wrote: The high-level description of the new API looks good (being able to add arbitrary properties to tokens), unfortunately, I've never had the time to try and use it and give any constructive feedback. As far as difficulty of use, I assume this only applies to implementing your own TokenFilter? It seems like most standard users would be just stringing together existing TokenFilters to create custom Analyzers? -Yonik http://www.lucidimagination.com True - its the implementation. And just trying to understand whats going on the first time you see it. Its not particularly difficult, but its also not obvious like the previous API was. As a user, I would ask why that is so, and frankly the answer wouldn't do much for me (as a user). I don't know if most 'standard' users implement their own or not. I will say, and perhaps I was in a special situation, I was writing them and modifying them almost as soon as I started playing with Lucene. And even when I wasnt, I needed to understand the code to understand some of the complexities that could occur, and thankfully, that was breezy to do. Right now, if you told me to go convert all of Solr to the new API you would hear a mighty groan. As Lucene's contrib hasn't been fully converted either (and its been quite some time now), someone has probably heard that groan before. -- - Mark http://www.lucidimagination.com
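[Editor's note: for anyone trying to picture the new style without opening the source, here is a stripped-down toy model of the pattern under discussion. These are invented names, not Lucene's classes: instead of passing a Token through next(), filters look up shared attribute objects once (in the constructor) and incrementToken() mutates them in place.]

```java
import java.util.HashMap;
import java.util.Map;

// A minimal term attribute; Lucene's real TermAttribute uses char[].
class TermAttr { String term; }

// Toy AttributeSource-like base: attributes live in a map shared down
// the whole filter chain, so there is no per-token allocation or cast.
abstract class MiniStream {
    private final Map<Class<?>, Object> attrs;
    MiniStream() { attrs = new HashMap<>(); }
    MiniStream(MiniStream input) { attrs = input.attrs; } // share with input
    @SuppressWarnings("unchecked")
    <T> T addAttribute(Class<T> cls) {
        return (T) attrs.computeIfAbsent(cls, c -> {
            try { return c.getDeclaredConstructor().newInstance(); }
            catch (ReflectiveOperationException e) { throw new RuntimeException(e); }
        });
    }
    abstract boolean incrementToken();
}

// Tokenizer over a fixed array, standing in for a real Tokenizer.
class ArrayTokenizer extends MiniStream {
    private final String[] terms;
    private int i;
    private final TermAttr term = addAttribute(TermAttr.class);
    ArrayTokenizer(String... terms) { this.terms = terms; }
    boolean incrementToken() {
        if (i == terms.length) return false;
        term.term = terms[i++];
        return true;
    }
}

// Filter in the new style: no Token in, no Token out, just mutate attrs.
class LowerCaseishFilter extends MiniStream {
    private final MiniStream input;
    private final TermAttr term = addAttribute(TermAttr.class);
    LowerCaseishFilter(MiniStream input) { super(input); this.input = input; }
    boolean incrementToken() {
        if (!input.incrementToken()) return false;
        term.term = term.term.toLowerCase();
        return true;
    }
}

public class MiniAnalysisDemo {
    public static void main(String[] args) {
        MiniStream stream = new LowerCaseishFilter(new ArrayTokenizer("Hello", "WORLD"));
        TermAttr term = stream.addAttribute(TermAttr.class);
        while (stream.incrementToken()) {
            System.out.println(term.term); // hello, then world
        }
    }
}
```

The consumer side is the part Yonik asks about: stringing filters together looks almost identical to the old API; only filter authors see the attribute lookup.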
Re: New Token API was Re: Payloads and TrieRangeQuery
The high-level description of the new API looks good (being able to add arbitrary properties to tokens), unfortunately, I've never had the time to try and use it and give any constructive feedback. As far as difficulty of use, I assume this only applies to implementing your own TokenFilter? It seems like most standard users would be just stringing together existing TokenFilters to create custom Analyzers? -Yonik http://www.lucidimagination.com
Re: New Token API was Re: Payloads and TrieRangeQuery
On Jun 14, 2009, at 8:05 PM, Michael Busch wrote: I'm not sure why this (currently having to implement next() too) is such an issue for you. You brought it up at the Lucene meetup too. No user will ever have to implement both (the new API and the old) in their streams/filters. The only reason why we did it this way is to not sacrifice performance for existing streams/filters when people switch to Lucene 2.9. I explained this point in the jira issue: http://issues.apache.org/jira/browse/LUCENE-1422?focusedCommentId=12644881&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12644881 The only time when we'll ever have to implement both APIs is between now and 2.9, only for new streams and filters that we add before 2.9 is released. I don't think it'd be reasonable to consider this disadvantage as a show stopper. It's an issue b/c I don't like writing dead code and who knows when 2.9 will actually be out. I don't think it is a show stopper either. Add on top of it, that the whole point of customizing the chain is to use it in search and, frankly speaking, somehow I think that part of the patch was held back. I'm not sure what you're implying. Could you elaborate? Sorry, see my response to Michael M. on this. I didn't mean to imply you were doing something malicious, just that it always felt half done to me. Knowing you, you don't strike me as someone who does things half way, so that's why I felt it was held back. But, as Michael M reminded me, it is complex, so please accept my apologies. The search side of the API is currently being developed in Lucene-1458. 1458 will not make it into 2.9. Therefore I agree that it is not very advantageous to switch to the new API right now for Lucene users. On the other hand, I don't think it hurts either. I am not sure I agree here. Forcing people to upgrade their analyzers can be quite involved. Analyzers are one of the main areas that people do custom work. 
Solr, for instance, has 11 custom TokenFilters right now as well as custom Tokenizers, not to mention the ones used during testing that aren't shipped. Upgrading these is a lot of work. I know in previous jobs, I also maintained a fair amount of TokenStream-related stuff. This should not be underestimated. Furthermore, as I said back in the initial discussion, Lucene's Analyzer stuff is often used outside of Lucene. In fact, I often think the Analysis piece should be a standalone jar (not requiring core) and that core should have a dependency on it. In other words, move o.a.l.analysis (and contrib/analysis) to a standalone module that core depends on. This would make it easier for others to consume the Analysis functionality. I personally would vote for reverting until a complete patch that addresses both sides of the problem is submitted and a better solution to cloning is put forth. If we revert now and put a new flexible API like this into 3.x, which I think is necessary to utilize flexible indexing, then we'll have to wait until 4.0 before we can remove the old API. Disadvantages like the one you mentioned above, will then probably be present much longer. I mentioned in the following thread that I have started working on a better way of cloning, which will actually be faster compared to the old API. I'll try to get the code out asap. http://markmail.org/message/q7pgh2qlm2w7cxfx I'd be happy to discuss other API proposals that anybody brings up here, that have the same advantages and are more intuitive. We could also beef up the documentation and give a better example about how to convert a stream/filter from the old to the new API; a constructive suggestion that Uwe made at the ApacheCon. My point here was, at the time, that if others wanted to revert, I probably would vote for it. I'm not proposing we do it, as I think we can make do with what we have. Given the discussion here, I would probably change my mind and not support it now. 
I think it might be helpful to have some help for people upgrading. Perhaps an abstract class that provides the "core" Token attributes out of the box as a base class that they can then extend? That being said, forcing people to upgrade could at least help them think about the fact that they have no use for the Type attribute or the Offsets attributes. And, testing the cloning stuff would help. I think the current approach underestimates the number of people who need to buffer tokens in memory before handing them out. Sure, it's not as many as the main use case, but it's not zero either.
Re: New Token API was Re: Payloads and TrieRangeQuery
On Jun 15, 2009, at 12:19 PM, Michael McCandless wrote: I don't think anything was "held back" in this effort. Grant, are you referring to LUCENE-1458? That's "held back" simply because the only person working on it (me) got distracted by other things to work on. I'm sorry, I didn't mean to imply Michael B. was holding back on the work. The patch has always felt half done to me because what's the point of having all of these attributes in the index if you don't have any way of searching them, thus I was struck by the need to get it in prior to making it available in search. I realize it's complex, but here we are forcing people to upgrade for some future, long term goal.
Re: New Token API was Re: Payloads and TrieRangeQuery
I thought the primary goal of switching to AttributeSource (yes, the name is very generic...) was to allow extensibility to what's created per-Token, so that an app could add their own attrs without costly subclassing/casting per Token, independent of other "things" adding their tokens, etc. EG, trie* takes advantage of this extensibility by adding a ShiftAttribute. Subclassing Token in your app wasn't a good solution for various reasons. I do think the API is somewhat more cumbersome than before, and I don't like that about it (consumability!). But net/net I think the change is good, and it's one of the baby steps for flexible indexing (bullet #11): http://wiki.apache.org/jakarta-lucene/Lucene2Whiteboard Ie it addresses the flexibility during analysis. I don't think anything was "held back" in this effort. Grant, are you referring to LUCENE-1458? That's "held back" simply because the only person working on it (me) got distracted by other things to work on. Flexible indexing (all of bullet #11) is a complex project, and we need to break it into baby steps like this one. We've already made good progress on it: you can already make custom attrs and a custom (but, package private) indexing chain if you want. Next step is pluggable codecs for writing index files (LUCENE-1458), and APIs for reading them (that generalize Terms/TermDoc/TermPositions we have today). Mike On Sun, Jun 14, 2009 at 11:41 PM, Shai Erera wrote: > The "old" API is deprecated, and therefore when we release 2.9 there might > be some people who'd think they should move away from it, to better prepare > for 3.0 (while in fact this many not be the case). Also, we should make sure > that when we remove all the deprecations, this will still exist (and > therefore, why deprecate it now?), if we think this should indeed be kept > around for at least a while longer. 
> > I personally am all for keeping it around (it will save me a huge > refactoring of an Analyzer package I wrote), but I have to admit it's only > because I've got quite comfortable with the existing API, and did not have > the time to try the new one yet. > > Shai > > On Mon, Jun 15, 2009 at 3:49 AM, Mark Miller wrote: >> >> Mark Miller wrote: >>> >>> I don't know how I feel about rolling the new token api back. >>> >>> I will say that I originally had no issue with it because I am very >>> excited about Lucene-1458. >>> >>> At the same time though, I'm thinking Lucene-1458 is a very advanced >>> issue that will likely be for really expert usage (though I can see benefits >>> falling to general users). >>> >>> I'm slightly iffy about making an intuitive api much less intuitive for >>> an expert future feature that hasn't fully materialized in Lucene yet. It >>> almost seems like that fight should weigh towards general usage and standard >>> users. >>> >>> I don't have a better proposal though, nor the time to consider it at the >>> moment. I was just more curious if anyone else had any thoughts. I hadn't >>> realized Grant had asked a similar question not long ago >>> with no response. Not sure how to take that, but I'd think that would >>> indicate less problems with people than more. On the other hand, you don't >>> have to switch yet (with trunk) and we have yet to release it. I wonder how >>> many non dev, every day users have really had to tussle with the new API >>> yet. Not many people complaining too loudly at the moment though. >>> >>> Asking for a roll back seems a bit extreme without a little more support >>> behind it than we have seen. >>> >>> - Mark >> >> PS >> >> I know you didnt ask for a rollback Grant - just kind of talking in a >> general manner. I see your point on getting the search side in, I'm just not >> sure I agree that it really matters if one hits before the other. Like Mike >> says, you don't >> have to switch to the new API yet. 
Re: New Token API was Re: Payloads and TrieRangeQuery
The "old" API is deprecated, and therefore when we release 2.9 there might be some people who'd think they should move away from it, to better prepare for 3.0 (while in fact this may not be the case). Also, we should make sure that when we remove all the deprecations, this will still exist (and therefore, why deprecate it now?), if we think this should indeed be kept around for at least a while longer. I personally am all for keeping it around (it will save me a huge refactoring of an Analyzer package I wrote), but I have to admit it's only because I've got quite comfortable with the existing API, and did not have the time to try the new one yet. Shai On Mon, Jun 15, 2009 at 3:49 AM, Mark Miller wrote: > Mark Miller wrote: > >> I don't know how I feel about rolling the new token api back. >> >> I will say that I originally had no issue with it because I am very >> excited about Lucene-1458. >> >> At the same time though, I'm thinking Lucene-1458 is a very advanced issue >> that will likely be for really expert usage (though I can see benefits >> falling to general users). >> >> I'm slightly iffy about making an intuitive api much less intuitive for an >> expert future feature that hasn't fully materialized in Lucene yet. It >> almost seems like that fight should weigh towards general usage and standard >> users. >> >> I don't have a better proposal though, nor the time to consider it at the >> moment. I was just more curious if anyone else had any thoughts. I hadn't >> realized Grant had asked a similar question not long ago >> with no response. Not sure how to take that, but I'd think that would >> indicate less problems with people than more. On the other hand, you don't >> have to switch yet (with trunk) and we have yet to release it. I wonder how >> many non dev, every day users have really had to tussle with the new API >> yet. Not many people complaining too loudly at the moment though. 
>> >> Asking for a roll back seems a bit extreme without a little more support >> behind it than we have seen. >> >> - Mark >> > > PS > > I know you didnt ask for a rollback Grant - just kind of talking in a > general manner. I see your point on getting the search side in, I'm just not > sure I agree that it really matters if one hits before the other. Like Mike > says, you don't > have to switch to the new API yet. > > -- > - Mark > > http://www.lucidimagination.com
Re: New Token API was Re: Payloads and TrieRangeQuery
Mark Miller wrote: I don't know how I feel about rolling the new token api back. I will say that I originally had no issue with it because I am very excited about Lucene-1458. At the same time though, I'm thinking Lucene-1458 is a very advanced issue that will likely be for really expert usage (though I can see benefits falling to general users). I'm slightly iffy about making an intuitive api much less intuitive for an expert future feature that hasn't fully materialized in Lucene yet. It almost seems like that fight should weigh towards general usage and standard users. I don't have a better proposal though, nor the time to consider it at the moment. I was just more curious if anyone else had any thoughts. I hadn't realized Grant had asked a similar question not long ago with no response. Not sure how to take that, but I'd think that would indicate less problems with people than more. On the other hand, you don't have to switch yet (with trunk) and we have yet to release it. I wonder how many non dev, every day users have really had to tussle with the new API yet. Not many people complaining too loudly at the moment though. Asking for a roll back seems a bit extreme without a little more support behind it than we have seen. - Mark PS I know you didnt ask for a rollback Grant - just kind of talking in a general manner. I see your point on getting the search side in, I'm just not sure I agree that it really matters if one hits before the other. Like Mike says, you don't have to switch to the new API yet. -- - Mark http://www.lucidimagination.com
Re: New Token API was Re: Payloads and TrieRangeQuery
I don't know how I feel about rolling the new token api back. I will say that I originally had no issue with it because I am very excited about Lucene-1458. At the same time though, I'm thinking Lucene-1458 is a very advanced issue that will likely be for really expert usage (though I can see benefits falling to general users). I'm slightly iffy about making an intuitive api much less intuitive for an expert future feature that hasn't fully materialized in Lucene yet. It almost seems like that fight should weigh towards general usage and standard users. I don't have a better proposal though, nor the time to consider it at the moment. I was just more curious if anyone else had any thoughts. I hadn't realized Grant had asked a similar question not long ago with no response. Not sure how to take that, but I'd think that would indicate less problems with people than more. On the other hand, you don't have to switch yet (with trunk) and we have yet to release it. I wonder how many non dev, every day users have really had to tussle with the new API yet. Not many people complaining too loudly at the moment though. Asking for a roll back seems a bit extreme without a little more support behind it than we have seen. - Mark
Re: New Token API was Re: Payloads and TrieRangeQuery
On 6/14/09 5:17 AM, Grant Ingersoll wrote: Agreed. I've been bringing it up for a while now and made the same comments when it was first introduced, but felt like the lone voice in the wilderness on it and gave way [1], [2], [3]. Now that others are writing/converting, I think it is worth revisiting. I am and always was open to constructive suggestions about how to design this API. I know these new APIs currently don't seem to have many advantages over the previous ones, but they're basically laying the API groundwork for future features like flexible indexing. Some concerns you mentioned were targeted against the first version of the patch in LUCENE-1422. But, you later said you liked how the next patch looked (in thread [2] that you mentioned). That being said, I did just write my first TokenFilter with it, and didn't think it was that hard. There are some gains in it and the API can be simpler if you just need one or two attributes (see DelimitedPayloadTokenFilter), although, just like the move to using char [] in Token, as soon as you do something like store a Token, you lose most of the benefit, I think (for the char [] case, as soon as you need a String in one of your filters, you lose the perf. gain). The annoying parts are that you still have to implement the deprecated next() part, otherwise chances are the thing is unusable by everyone at this point anyway. I'm not sure why this (currently having to implement next() too) is such an issue for you. You brought it up at the Lucene meetup too. No user will ever have to implement both (the new API and the old) in their streams/filters. The only reason why we did it this way is to not sacrifice performance for existing streams/filters when people switch to Lucene 2.9. 
I explained this point in the jira issue: http://issues.apache.org/jira/browse/LUCENE-1422?focusedCommentId=12644881&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12644881 The only time when we'll ever have to implement both APIs is between now and 2.9, only for new streams and filters that we add before 2.9 is released. I don't think it'd be reasonable to consider this disadvantage as a show stopper. Add on top of it, that the whole point of customizing the chain is to use it in search and, frankly speaking, somehow I think that part of the patch was held back. I'm not sure what you're implying. Could you elaborate? The search side of the API is currently being developed in Lucene-1458. 1458 will not make it into 2.9. Therefore I agree that it is not very advantageous to switch to the new API right now for Lucene users. On the other hand, I don't think it hurts either. I personally would vote for reverting until a complete patch that addresses both sides of the problem is submitted and a better solution to cloning is put forth. If we revert now and put a new flexible API like this into 3.x, which I think is necessary to utilize flexible indexing, then we'll have to wait until 4.0 before we can remove the old API. Disadvantages like the one you mentioned above, will then probably be present much longer. I mentioned in the following thread that I have started working on a better way of cloning, which will actually be faster compared to the old API. I'll try to get the code out asap. http://markmail.org/message/q7pgh2qlm2w7cxfx I'd be happy to discuss other API proposals that anybody brings up here, that have the same advantages and are more intuitive. We could also beef up the documentation and give a better example about how to convert a stream/filter from the old to the new API; a constructive suggestion that Uwe made at the ApacheCon. 
-Michael -Grant [1] http://issues.apache.org/jira/browse/LUCENE-1422, [2] http://www.lucidimagination.com/search/document/5daf6d7b8027b4d3/tokenstream_and_token_apis#9e2d0d2b5dc118d4, and the rest of the discussion on that thread. [3] http://www.lucidimagination.com/search/document/4274335abcf31926/new_tokenstream_api_usage On Jun 13, 2009, at 10:32 PM, Mark Miller wrote: What was the big improvement with it again? Advanced, expert custom indexing chains require less casting or something right?