Re: New Token API was Re: Payloads and TrieRangeQuery
On Jun 15, 2009, at 2:11 PM, Grant Ingersoll wrote:

More questions:

1. What about Highlighter and MoreLikeThis? They have not been converted. Also, what are they going to do if the attributes they need are not available? Caveat emptor?

2. Same for TermVectors. What if the user specifies them with positions and offsets, but the analyzer doesn't produce them? Caveat emptor? (BTW, this is also true for the new omit-TF stuff.)

3. Also, what about the case where one might have attributes that are meant for downstream TokenFilters, but not necessarily for indexing? Offsets and type come to mind. Is it the case now that those attributes are not automatically added to the index? If they are ignored now, what if I want to add them? I admit I'm having a hard time finding the code that specifically loops over the Attributes. I recall seeing it, but can no longer find it.

Also, can we add something like an AttributeTermQuery? It seems like it could work similarly to the BoostingTermQuery.

So, I think I see #1 covered; how about #2, #3, and the notion of an AttributeTermQuery? Anyone have thoughts on those? I might have some time next week to work up a Query, as it sounds like fun, but don't hold me to it just yet.

- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
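[Editor's note: the "code that loops over the Attributes" Grant asks about is the consumer side of the new API. As a rough sketch — using invented toy classes, not the real Lucene TokenStream/TermAttribute — the intended pattern is: fetch each attribute once, then let incrementToken() mutate it in place.]

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Toy stand-in for the real Lucene TermAttribute: holds the current
// token's text and is mutated in place by the stream.
class TermAttribute {
    private String term = "";
    void setTerm(String t) { term = t; }
    String term() { return term; }
}

// Minimal toy stream: splits on whitespace, updates ONE shared TermAttribute.
class WhitespaceStream {
    private final TermAttribute termAtt = new TermAttribute();
    private final Iterator<String> words;

    WhitespaceStream(String text) {
        words = Arrays.asList(text.split("\\s+")).iterator();
    }

    TermAttribute getTermAttribute() { return termAtt; }

    boolean incrementToken() {
        if (!words.hasNext()) return false;
        termAtt.setTerm(words.next());  // mutate the shared instance in place
        return true;
    }
}

public class ConsumeAttributes {
    public static List<String> collectTerms(String text) {
        WhitespaceStream ts = new WhitespaceStream(text);
        TermAttribute term = ts.getTermAttribute();  // fetch ONCE, before the loop
        List<String> terms = new ArrayList<>();
        while (ts.incrementToken()) {
            terms.add(term.term());  // read the updated state at each step
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(collectTerms("payloads and trie range"));
        // prints: [payloads, and, trie, range]
    }
}
```

Under this pattern, an attribute the consumer never fetches is simply never read, which may bear on question #3 above.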
Re: New Token API was Re: Payloads and TrieRangeQuery
On 6/15/09 10:10 AM, Grant Ingersoll wrote: But, as Michael M reminded me, it is complex, so please accept my apologies.

No worries, Grant! I was not really offended, but rather confused... Thanks for clarifying.

Michael
Re: New Token API was Re: Payloads and TrieRangeQuery
Grant Ingersoll wrote: > 1. What about Highlighter

I would guess Highlighter has not been updated because it's kind of a royal * :)

--
- Mark
http://www.lucidimagination.com
Re: New Token API was Re: Payloads and TrieRangeQuery
Mark Miller wrote: Grant Ingersoll wrote: On Jun 14, 2009, at 8:05 PM, Michael Busch wrote: I'd be happy to discuss other API proposals that anybody brings up here, that have the same advantages and are more intuitive. We could also beef up the documentation and give a better example about how to convert a stream/filter from the old to the new API; a constructive suggestion that Uwe made at the ApacheCon. More questions: 1. What about Highlighter and MoreLikeThis? They have not been converted. Also, what are they going to do if the attributes they need are not available? Caveat emptor? 2. Same for TermVectors. What if the user specifies with positions and offsets, but the analyzer doesn't produce them? Caveat emptor? (BTW, this is also true for the new omit TF stuff) 3. Also, what about the case where one might have attributes that are meant for downstream TokenFilters, but not necessarily for indexing? Offsets and type come to mind. Is it the case now that those attributes are not automatically added to the index? If they are ignored now, what if I want to add them? I admit, I'm having a hard time finding the code that specifically loops over the Attributes. I recall seeing it, but can no longer find it. Also, can we add something like an AttributeTermQuery? Seems like it could work similar to the BoostingTermQuery. I'm sure more will come to me. -Grant If you are using a CachingTokenFilter, and you do something like pass it to something that hasn't upgraded to the new API (say MemoryIndex#addField(String fieldName, TokenStream stream, float boost)) and you are trying to use the new API, you will get an exception when trying to read the tokens from the CachingTokenFilter a second time - obviously because the old API is cached rather than the new, and when you try and use the new, kak :( . We can obviously fix anything internal, but not external. 
Hmm - actually, even if we fix internal, if you are trying to use the old API, you will have the same issue in reverse ;)

--
- Mark
http://www.lucidimagination.com
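[Editor's note: Mark's CachingTokenFilter trap can be pictured with a toy model. The names below are invented for illustration and this is NOT the real CachingTokenFilter: the point is only that the cache records tokens in whichever representation the first consumer used, so a second pass through the other API has nothing to replay.]

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of the mismatch: the first pass fills only one of the two
// caches, so replaying through the other API fails.
class ToyCachingFilter {
    private final List<String> source;
    private List<String> oldApiCache;  // tokens captured via next()
    private List<String> newApiCache;  // attribute states captured via incrementToken()

    ToyCachingFilter(List<String> source) {
        this.source = source;
    }

    // First pass through the OLD API fills only the old-style cache.
    List<String> consumeViaOldApi() {
        if (oldApiCache == null) oldApiCache = new ArrayList<>(source);
        return oldApiCache;
    }

    // A later pass through the NEW API finds no cached attribute states.
    List<String> replayViaNewApi() {
        if (newApiCache == null) {
            throw new IllegalStateException(
                "tokens were cached via the old API; no attribute states to restore");
        }
        return newApiCache;
    }
}

public class CachingMismatchDemo {
    public static String demo() {
        ToyCachingFilter f = new ToyCachingFilter(Arrays.asList("foo", "bar"));
        f.consumeViaOldApi();    // e.g. handed to a consumer still on the old API
        try {
            f.replayViaNewApi(); // second read via the new API: kak :(
            return "ok";
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

As Mark notes, the same failure occurs in reverse when the cache is filled via the new API and read via the old one.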
Re: New Token API was Re: Payloads and TrieRangeQuery
Sounds promising, but I have to think about whether there are side-effects of this change other than a slowdown for people who create multiple tokens (which would be acceptable, as you said, because it's not recommended anyway and should be rare).

On 6/15/09 1:46 PM, Uwe Schindler wrote:

Maybe change the deprecation wrapper around next() and next(Token) [the default impl of incrementToken()] to check whether the retrieved token is not identical to the attribute, and then just copy its contents to the instance Token? This would be a slowdown, but only for the very rare TokenStreams that did not reuse the token before (and were slow before, too).

- Uwe Schindler, H.-H.-Meier-Allee 63, D-28213 Bremen, http://www.thetaphi.de, eMail: u...@thetaphi.de

From: Michael Busch [mailto:busch...@gmail.com]
Sent: Monday, June 15, 2009 10:39 PM
To: java-dev@lucene.apache.org
Subject: Re: New Token API was Re: Payloads and TrieRangeQuery

I have implemented most of that actually (the interface part and Token implementing all of them). The problem is a paradigm change with the new API: the assumption is that there is always only one single instance of an Attribute. With the old API it is recommended to reuse the passed-in token, but you don't have to; you can also return a new one with every call of next(). Now with this change the indexer classes should only know about the interfaces; they shouldn't know Token anymore, which seems fine when Token implements all those interfaces. BUT, since there can be more than one instance of Token, the indexer would have to call getAttribute() for all Attributes it needs after each call of next(). I haven't measured how expensive that is, but it seems like a severe performance hit. That's basically the main reason why the backwards compatibility is ensured in such a goofy way right now.

Michael

On 6/15/09 1:28 PM, Uwe Schindler wrote:

And I don't like the *useNewAPI*() methods either. I spent a lot of time thinking about backwards compatibility for this API. It's tricky to do without sacrificing performance. In API patches I find myself spending more time on backwards compatibility than on the actual new feature! :( I'll try to think about how to simplify this confusing old/new API mix.

One solution to fix this useNewAPI problem would be to change the AttributeSource in a way that it returns classes that implement interfaces (as you proposed some weeks ago). The good old Token class then does not need to be deprecated; it could simply implement all 4 interfaces. AttributeSource then must implement a registry of which classes implement which interfaces. So if somebody wants a TermAttribute, he always gets the Token. In all other cases, the default could be a *Impl default class. In this case, next() could simply return this Token (which is all 4 attributes). The reuseableToken is simply thrown away in the deprecated API; the reuseable Token comes from the AttributeSource and is per-instance. Is this an idea?

Uwe
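[Editor's note: a hedged sketch of the registry idea in Uwe's proposal, with simplified invented names rather than the real Lucene API: Token implements every attribute interface, and the AttributeSource registry hands out the single shared Token for any of those interfaces, so the old-style reusable Token and the new-style attributes are one object.]

```java
import java.util.HashMap;
import java.util.Map;

// Simplified attribute interfaces (stand-ins for the real four).
interface Attr { }

interface TermAttr extends Attr {
    String term();
    void setTerm(String t);
}

interface OffsetAttr extends Attr {
    int startOffset();
    void setOffsets(int start, int end);
}

// The "good old Token", implementing all attribute interfaces at once.
class Token implements TermAttr, OffsetAttr {
    private String term = "";
    private int start, end;
    public String term() { return term; }
    public void setTerm(String t) { term = t; }
    public int startOffset() { return start; }
    public void setOffsets(int s, int e) { start = s; end = e; }
}

class AttributeSource {
    private final Token sharedToken = new Token();  // one per-instance Token, reused
    private final Map<Class<? extends Attr>, Attr> registry = new HashMap<>();

    @SuppressWarnings("unchecked")
    <A extends Attr> A addAttribute(Class<A> iface) {
        // Registry lookup: every interface that Token implements maps to the
        // same shared Token; anything else would get a per-interface *Impl.
        return (A) registry.computeIfAbsent(iface, k -> {
            if (k.isInstance(sharedToken)) return sharedToken;
            throw new IllegalArgumentException("no default impl for " + k);
        });
    }
}

public class RegistryDemo {
    public static boolean sameInstance() {
        AttributeSource src = new AttributeSource();
        TermAttr term = src.addAttribute(TermAttr.class);
        OffsetAttr off = src.addAttribute(OffsetAttr.class);
        return term == off;  // both are views of the one shared Token
    }

    public static void main(String[] args) {
        System.out.println(sameInstance());  // prints: true
    }
}
```

With this arrangement, the deprecated next() could simply return the shared Token, which is exactly the per-instance reuse Michael says the indexer needs to avoid calling getAttribute() after every token.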
Re: New Token API was Re: Payloads and TrieRangeQuery
yeah about 5 seconds in I saw that and decided to stick with what I know :) On Mon, Jun 15, 2009 at 5:10 PM, Mark Miller wrote: > I may do the Highlighter. Its annoying though - I'll have to break back > compat because Token is part of the public API (Fragmenter, etc). > > Robert Muir wrote: >> >> Michael OK, I plan on adding some tests for the analyzers that don't have >> any. >> >> I didn't try to migrate things such as highlighter, which are >> definitely just as important, only because I'm not familiar with that >> territory. >> >> But I think I can figure out what the various language analyzers are >> trying to do and add tests / convert the remaining ones. >> >> On Mon, Jun 15, 2009 at 4:42 PM, Michael Busch wrote: >> >>> >>> I agree. It's my fault, the task of changing the contribs (LUCENE-1460) >>> is >>> assigned to me for a while now - I just haven't found the time to do it >>> yet. >>> >>> It's great that you started the work on that! I'll try to review the >>> patch >>> in the next couple of days and help with fixing the remaining ones. I'd >>> like >>> to get the AttributeSource improvements patch out first. I'll try that >>> tonight. >>> >>> Michael >>> >>> On 6/15/09 1:35 PM, Robert Muir wrote: >>> >>> Michael, again I am terrible with such things myself... >>> >>> Personally I am impressed that you have the back compat, even if you >>> don't change any code at all I think some reformatting of javadocs >>> might make the situation a lot friendlier. I just listed everything >>> that came to my mind immediately. >>> >>> I guess I will also mention that one of the reasons I might not use >>> the new API is that since all filters, etc on the same chain must use >>> the same API, its discouraging if all the contrib stuff doesn't >>> support the new API, it makes me want to just stick with the old so >>> everything will work. So I think contribs being on the new API is >>> really important otherwise no one will want to use it. 
>>> >>> On Mon, Jun 15, 2009 at 4:21 PM, Michael Busch wrote: >>> >>> >>> This is excellent feedback, Robert! >>> >>> I agree this is confusing; especially having a deprecated API and only a >>> experimental one that replaces the old one. We need to change that. >>> And I don't like the *useNewAPI*() methods either. I spent a lot of time >>> thinking about backwards compatibility for this API. It's tricky to do >>> without sacrificing performance. In API patches I find myself spending >>> more >>> time for backwards-compatibility than for the actual new feature! :( >>> >>> I'll try to think about how to simplify this confusing old/new API mix. >>> >>> However, we need to make the decisions: >>> a) if we want to release this new API with 2.9, >>> b) if yes to a), if we want to remove the old API in 3.0? >>> >>> If yes to a) and no to b), then we'll have to support both APIs for a >>> presumably very long time, so we then need to have a better solution for >>> the >>> backwards-compatibility here. >>> >>> -Michael >>> >>> On 6/15/09 1:09 PM, Robert Muir wrote: >>> >>> let me try some slightly more constructive feedback: >>> >>> new user looks at TokenStream javadocs: >>> >>> http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/org/apache/lucene/analysis/TokenStream.html >>> immediately they see deprecated, text in red with the words >>> "experimental", warnings in bold, the whole thing is scary! >>> due to the use of 'e.g.' the javadoc for .incrementToken() is cut off >>> in a bad way, and its probably the most important method to a new >>> user! >>> there's also a stray bold tag gone haywire somewhere, possibly >>> .incrementToken() >>> >>> from a technical perspective, the documentation is excellent! but for >>> a new user unfamiliar with lucene, its unclear exactly what steps to >>> take: use the scary red experimental api or the old deprecated one? 
>>> >>> theres also some fairly advanced stuff such as .captureState and >>> .restoreState that might be better in a different place. >>> >>> finally, the .setUseNewAPI() and .setUseNewAPIDefault() are confusing >>> [one is static, one is not], especially because it states all streams >>> and filters in one chain must use the same API, is there a way to >>> simplify this? >>> >>> i'm really terrible with javadocs myself, but perhaps we can come up >>> with a way to improve the presentation... maybe that will make the >>> difference. >>> >>> On Mon, Jun 15, 2009 at 3:45 PM, Robert Muir wrote: >>> >>> >>> Mark, I'll see if I can get tests produced for some of those analyzers. >>> >>> as a new user of the new api myself, I think I can safely say the most >>> confusing thing about it is having the old deprecated API mixed in the >>> javadocs with it :) >>> >>> On Mon, Jun 15, 2009 at 2:53 PM, Mark Miller >>> wrote: >>> >>> >>> Robert Muir wrote: >>> >>> >>> Mark, I created an issue for this. >>> >>> >>> >>> Thanks Robert, great idea. >>> >>> >>> I just think you know, converting an analyzer to the new api is really >>> n
Re: New Token API was Re: Payloads and TrieRangeQuery
I may do the Highlighter. Its annoying though - I'll have to break back compat because Token is part of the public API (Fragmenter, etc). Robert Muir wrote: Michael OK, I plan on adding some tests for the analyzers that don't have any. I didn't try to migrate things such as highlighter, which are definitely just as important, only because I'm not familiar with that territory. But I think I can figure out what the various language analyzers are trying to do and add tests / convert the remaining ones. On Mon, Jun 15, 2009 at 4:42 PM, Michael Busch wrote: I agree. It's my fault, the task of changing the contribs (LUCENE-1460) is assigned to me for a while now - I just haven't found the time to do it yet. It's great that you started the work on that! I'll try to review the patch in the next couple of days and help with fixing the remaining ones. I'd like to get the AttributeSource improvements patch out first. I'll try that tonight. Michael On 6/15/09 1:35 PM, Robert Muir wrote: Michael, again I am terrible with such things myself... Personally I am impressed that you have the back compat, even if you don't change any code at all I think some reformatting of javadocs might make the situation a lot friendlier. I just listed everything that came to my mind immediately. I guess I will also mention that one of the reasons I might not use the new API is that since all filters, etc on the same chain must use the same API, its discouraging if all the contrib stuff doesn't support the new API, it makes me want to just stick with the old so everything will work. So I think contribs being on the new API is really important otherwise no one will want to use it. On Mon, Jun 15, 2009 at 4:21 PM, Michael Busch wrote: This is excellent feedback, Robert! I agree this is confusing; especially having a deprecated API and only a experimental one that replaces the old one. We need to change that. And I don't like the *useNewAPI*() methods either. 
I spent a lot of time thinking about backwards compatibility for this API. It's tricky to do without sacrificing performance. In API patches I find myself spending more time for backwards-compatibility than for the actual new feature! :( I'll try to think about how to simplify this confusing old/new API mix. However, we need to make the decisions: a) if we want to release this new API with 2.9, b) if yes to a), if we want to remove the old API in 3.0? If yes to a) and no to b), then we'll have to support both APIs for a presumably very long time, so we then need to have a better solution for the backwards-compatibility here. -Michael On 6/15/09 1:09 PM, Robert Muir wrote: let me try some slightly more constructive feedback: new user looks at TokenStream javadocs: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/org/apache/lucene/analysis/TokenStream.html immediately they see deprecated, text in red with the words "experimental", warnings in bold, the whole thing is scary! due to the use of 'e.g.' the javadoc for .incrementToken() is cut off in a bad way, and its probably the most important method to a new user! there's also a stray bold tag gone haywire somewhere, possibly .incrementToken() from a technical perspective, the documentation is excellent! but for a new user unfamiliar with lucene, its unclear exactly what steps to take: use the scary red experimental api or the old deprecated one? theres also some fairly advanced stuff such as .captureState and .restoreState that might be better in a different place. finally, the .setUseNewAPI() and .setUseNewAPIDefault() are confusing [one is static, one is not], especially because it states all streams and filters in one chain must use the same API, is there a way to simplify this? i'm really terrible with javadocs myself, but perhaps we can come up with a way to improve the presentation... maybe that will make the difference. 
On Mon, Jun 15, 2009 at 3:45 PM, Robert Muir wrote: Mark, I'll see if I can get tests produced for some of those analyzers. as a new user of the new api myself, I think I can safely say the most confusing thing about it is having the old deprecated API mixed in the javadocs with it :) On Mon, Jun 15, 2009 at 2:53 PM, Mark Miller wrote: Robert Muir wrote: Mark, I created an issue for this. Thanks Robert, great idea. I just think you know, converting an analyzer to the new api is really not that bad. I don't either. I'm really just complaining about the initial readability. Once you know whats up, its not too much different. I just have found myself having to refigure out whats up (a short task to be sure) over again after I leave it for a while. With the old one, everything was just kind of immediately self evident. That makes me think new users might be a little more confused when they first meet again. I'm not a new user though, so its only a guess really. reverse engineering what one of them does is not necessarily obvious, and is completely unrelat
Re: New Token API was Re: Payloads and TrieRangeQuery
Michael OK, I plan on adding some tests for the analyzers that don't have any. I didn't try to migrate things such as highlighter, which are definitely just as important, only because I'm not familiar with that territory. But I think I can figure out what the various language analyzers are trying to do and add tests / convert the remaining ones. On Mon, Jun 15, 2009 at 4:42 PM, Michael Busch wrote: > I agree. It's my fault, the task of changing the contribs (LUCENE-1460) is > assigned to me for a while now - I just haven't found the time to do it yet. > > It's great that you started the work on that! I'll try to review the patch > in the next couple of days and help with fixing the remaining ones. I'd like > to get the AttributeSource improvements patch out first. I'll try that > tonight. > > Michael > > On 6/15/09 1:35 PM, Robert Muir wrote: > > Michael, again I am terrible with such things myself... > > Personally I am impressed that you have the back compat, even if you > don't change any code at all I think some reformatting of javadocs > might make the situation a lot friendlier. I just listed everything > that came to my mind immediately. > > I guess I will also mention that one of the reasons I might not use > the new API is that since all filters, etc on the same chain must use > the same API, its discouraging if all the contrib stuff doesn't > support the new API, it makes me want to just stick with the old so > everything will work. So I think contribs being on the new API is > really important otherwise no one will want to use it. > > On Mon, Jun 15, 2009 at 4:21 PM, Michael Busch wrote: > > > This is excellent feedback, Robert! > > I agree this is confusing; especially having a deprecated API and only a > experimental one that replaces the old one. We need to change that. > And I don't like the *useNewAPI*() methods either. I spent a lot of time > thinking about backwards compatibility for this API. It's tricky to do > without sacrificing performance. 
In API patches I find myself spending more > time for backwards-compatibility than for the actual new feature! :( > > I'll try to think about how to simplify this confusing old/new API mix. > > However, we need to make the decisions: > a) if we want to release this new API with 2.9, > b) if yes to a), if we want to remove the old API in 3.0? > > If yes to a) and no to b), then we'll have to support both APIs for a > presumably very long time, so we then need to have a better solution for the > backwards-compatibility here. > > -Michael > > On 6/15/09 1:09 PM, Robert Muir wrote: > > let me try some slightly more constructive feedback: > > new user looks at TokenStream javadocs: > http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/org/apache/lucene/analysis/TokenStream.html > immediately they see deprecated, text in red with the words > "experimental", warnings in bold, the whole thing is scary! > due to the use of 'e.g.' the javadoc for .incrementToken() is cut off > in a bad way, and its probably the most important method to a new > user! > there's also a stray bold tag gone haywire somewhere, possibly > .incrementToken() > > from a technical perspective, the documentation is excellent! but for > a new user unfamiliar with lucene, its unclear exactly what steps to > take: use the scary red experimental api or the old deprecated one? > > theres also some fairly advanced stuff such as .captureState and > .restoreState that might be better in a different place. > > finally, the .setUseNewAPI() and .setUseNewAPIDefault() are confusing > [one is static, one is not], especially because it states all streams > and filters in one chain must use the same API, is there a way to > simplify this? > > i'm really terrible with javadocs myself, but perhaps we can come up > with a way to improve the presentation... maybe that will make the > difference. 
> > On Mon, Jun 15, 2009 at 3:45 PM, Robert Muir wrote: > > > Mark, I'll see if I can get tests produced for some of those analyzers. > > as a new user of the new api myself, I think I can safely say the most > confusing thing about it is having the old deprecated API mixed in the > javadocs with it :) > > On Mon, Jun 15, 2009 at 2:53 PM, Mark Miller wrote: > > > Robert Muir wrote: > > > Mark, I created an issue for this. > > > > Thanks Robert, great idea. > > > I just think you know, converting an analyzer to the new api is really > not that bad. > > > > I don't either. I'm really just complaining about the initial readability. > Once you know whats up, its not too much different. I just have found myself > having to refigure out whats up (a short task to be sure) over again after I > leave it for a while. With the old one, everything was just kind of > immediately self evident. > > That makes me think new users might be a little more confused when they > first meet again. I'm not a new user though, so its only a guess really. > > > reverse engineering what one of them does is not necessarily obvi
Re: New Token API was Re: Payloads and TrieRangeQuery
I agree. It's my fault, the task of changing the contribs (LUCENE-1460) is assigned to me for a while now - I just haven't found the time to do it yet. It's great that you started the work on that! I'll try to review the patch in the next couple of days and help with fixing the remaining ones. I'd like to get the AttributeSource improvements patch out first. I'll try that tonight. Michael On 6/15/09 1:35 PM, Robert Muir wrote: Michael, again I am terrible with such things myself... Personally I am impressed that you have the back compat, even if you don't change any code at all I think some reformatting of javadocs might make the situation a lot friendlier. I just listed everything that came to my mind immediately. I guess I will also mention that one of the reasons I might not use the new API is that since all filters, etc on the same chain must use the same API, its discouraging if all the contrib stuff doesn't support the new API, it makes me want to just stick with the old so everything will work. So I think contribs being on the new API is really important otherwise no one will want to use it. On Mon, Jun 15, 2009 at 4:21 PM, Michael Busch wrote: This is excellent feedback, Robert! I agree this is confusing; especially having a deprecated API and only a experimental one that replaces the old one. We need to change that. And I don't like the *useNewAPI*() methods either. I spent a lot of time thinking about backwards compatibility for this API. It's tricky to do without sacrificing performance. In API patches I find myself spending more time for backwards-compatibility than for the actual new feature! :( I'll try to think about how to simplify this confusing old/new API mix. However, we need to make the decisions: a) if we want to release this new API with 2.9, b) if yes to a), if we want to remove the old API in 3.0? 
If yes to a) and no to b), then we'll have to support both APIs for a presumably very long time, so we then need to have a better solution for the backwards-compatibility here. -Michael On 6/15/09 1:09 PM, Robert Muir wrote: let me try some slightly more constructive feedback: new user looks at TokenStream javadocs: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/org/apache/lucene/analysis/TokenStream.html immediately they see deprecated, text in red with the words "experimental", warnings in bold, the whole thing is scary! due to the use of 'e.g.' the javadoc for .incrementToken() is cut off in a bad way, and its probably the most important method to a new user! there's also a stray bold tag gone haywire somewhere, possibly .incrementToken() from a technical perspective, the documentation is excellent! but for a new user unfamiliar with lucene, its unclear exactly what steps to take: use the scary red experimental api or the old deprecated one? theres also some fairly advanced stuff such as .captureState and .restoreState that might be better in a different place. finally, the .setUseNewAPI() and .setUseNewAPIDefault() are confusing [one is static, one is not], especially because it states all streams and filters in one chain must use the same API, is there a way to simplify this? i'm really terrible with javadocs myself, but perhaps we can come up with a way to improve the presentation... maybe that will make the difference. On Mon, Jun 15, 2009 at 3:45 PM, Robert Muir wrote: Mark, I'll see if I can get tests produced for some of those analyzers. as a new user of the new api myself, I think I can safely say the most confusing thing about it is having the old deprecated API mixed in the javadocs with it :) On Mon, Jun 15, 2009 at 2:53 PM, Mark Miller wrote: Robert Muir wrote: Mark, I created an issue for this. Thanks Robert, great idea. I just think you know, converting an analyzer to the new api is really not that bad. I don't either. 
I'm really just complaining about the initial readability. Once you know whats up, its not too much different. I just have found myself having to refigure out whats up (a short task to be sure) over again after I leave it for a while. With the old one, everything was just kind of immediately self evident. That makes me think new users might be a little more confused when they first meet again. I'm not a new user though, so its only a guess really. reverse engineering what one of them does is not necessarily obvious, and is completely unrelated but necessary if they are to be migrated. I'd be willing to assist with some of this but I don't want to really work the issue if its gonna be a waste of time at the end of the day... The chances of this issue being fully reverted are so remote that I really wouldnt let that stop you ... On Mon, Jun 15, 2009 at 1:55 PM, Mark Miller wrote: Robert Muir wrote: As Lucene's contrib hasn't been fully converted either (and its been quite some time now), someone has probably heard that groan before. hope this doesn't sound like a complaint
Re: New Token API was Re: Payloads and TrieRangeQuery
On Mon, Jun 15, 2009 at 4:21 PM, Uwe Schindler wrote: > And, in tests: test/o/a/l/index/store is somehow wrong placed. The class > inside should be in test/o/a/l/store. Should I move? Please do! Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: New Token API was Re: Payloads and TrieRangeQuery
I have implemented most of that actually (the interface part and Token implementing all of them). The problem is a paradigm change with the new API: the assumption is that there is always only one single instance of an Attribute. With the old API, it is recommended to reuse the passed-in token, but you don't have to, you can also return a new one with every call of next(). Now with this change the indexer classes should only know about the interfaces, it shouldn't know Token anymore, which seems fine when Token implements all those interfaces. BUT, since there can be more than one instance of Token, the indexer would have to call getAttribute() for all Attributes it needs after each call of next(). I haven't measured how expensive that is, but it seems like a severe performance hit. That's basically the main reason why the backwards compatibility is ensured in such a goofy way right now. Michael On 6/15/09 1:28 PM, Uwe Schindler wrote: And I don't like the *useNewAPI*() methods either. I spent a lot of time thinking about backwards compatibility for this API. It's tricky to do without sacrificing performance. In API patches I find myself spending more time for backwards-compatibility than for the actual new feature! :( I'll try to think about how to simplify this confusing old/new API mix. One solution to fix this useNewAPI problem would be to change the AttributeSource in a way, to return classes that implement interfaces (as you proposed some weeks ago). The good old Token class then does not need to be deprecated, it could simply implement all 4 interfaces. AttributeSource then must implement a registry, which classes implement which interfaces. So if somebody wants a TermAttribute, he always gets the Token. In all other cases, the default could be a *Impl default class. In this case, next() could simply return this Token (which implements all 4 attributes). 
The reusableToken is simply thrown away in the deprecated API, the reusable Token comes from the AttributeSource and is per-instance. Is this an idea? Uwe
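Michael's point about the single-instance assumption is easier to see in code. Below is a minimal, self-contained sketch of the pattern under discussion; the class and method names (ToyTokenStream, the tiny AttributeSource registry) are simplified stand-ins invented for this example, not the actual Lucene 2.9 classes. The consumer looks the attribute up once, before the loop, which is only safe because the stream mutates one shared instance instead of possibly handing back a different Token on every call.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy attribute interface and implementation (illustrative names only).
interface TermAttribute {
    String term();
    void setTerm(String term);
}

class TermAttributeImpl implements TermAttribute {
    private String term = "";
    public String term() { return term; }
    public void setTerm(String term) { this.term = term; }
}

// Minimal AttributeSource: a registry holding at most one shared
// instance per attribute interface.
class AttributeSource {
    private final Map<Class<?>, Object> attributes = new HashMap<>();

    @SuppressWarnings("unchecked")
    <T> T addAttribute(Class<T> iface, T impl) {
        return (T) attributes.computeIfAbsent(iface, k -> impl);
    }
}

// New-style stream over a fixed token list: incrementToken() mutates
// the shared TermAttribute instead of returning a Token per call.
class ToyTokenStream extends AttributeSource {
    private final List<String> tokens;
    private int pos = 0;
    final TermAttribute termAtt =
            addAttribute(TermAttribute.class, new TermAttributeImpl());

    ToyTokenStream(List<String> tokens) { this.tokens = tokens; }

    boolean incrementToken() {
        if (pos >= tokens.size()) return false;
        termAtt.setTerm(tokens.get(pos++)); // reuse the single instance
        return true;
    }
}

public class SingleInstanceDemo {
    // Consumer side: fetch the attribute ONCE, outside the loop.
    // This is exactly the step that breaks if next() may return a
    // different Token object on each call.
    static List<String> collect(List<String> input) {
        ToyTokenStream ts = new ToyTokenStream(input);
        TermAttribute term = ts.termAtt;
        List<String> seen = new ArrayList<>();
        while (ts.incrementToken()) {
            seen.add(term.term());
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(collect(List.of("new", "token", "api")));
    }
}
```

The performance concern in the mail follows directly: if the single-instance guarantee does not hold, the getAttribute() lookup has to move inside the while loop, once per token.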
Re: New Token API was Re: Payloads and TrieRangeQuery
Michael, again I am terrible with such things myself... Personally I am impressed that you have the back compat, even if you don't change any code at all I think some reformatting of javadocs might make the situation a lot friendlier. I just listed everything that came to my mind immediately. I guess I will also mention that one of the reasons I might not use the new API is that since all filters, etc on the same chain must use the same API, its discouraging if all the contrib stuff doesn't support the new API, it makes me want to just stick with the old so everything will work. So I think contribs being on the new API is really important otherwise no one will want to use it. On Mon, Jun 15, 2009 at 4:21 PM, Michael Busch wrote: > This is excellent feedback, Robert! > > I agree this is confusing; especially having a deprecated API and only a > experimental one that replaces the old one. We need to change that. > And I don't like the *useNewAPI*() methods either. I spent a lot of time > thinking about backwards compatibility for this API. It's tricky to do > without sacrificing performance. In API patches I find myself spending more > time for backwards-compatibility than for the actual new feature! :( > > I'll try to think about how to simplify this confusing old/new API mix. > > However, we need to make the decisions: > a) if we want to release this new API with 2.9, > b) if yes to a), if we want to remove the old API in 3.0? > > If yes to a) and no to b), then we'll have to support both APIs for a > presumably very long time, so we then need to have a better solution for the > backwards-compatibility here. 
> > -Michael > > On 6/15/09 1:09 PM, Robert Muir wrote: > > let me try some slightly more constructive feedback: > > new user looks at TokenStream javadocs: > http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/org/apache/lucene/analysis/TokenStream.html > immediately they see deprecated, text in red with the words > "experimental", warnings in bold, the whole thing is scary! > due to the use of 'e.g.' the javadoc for .incrementToken() is cut off > in a bad way, and its probably the most important method to a new > user! > there's also a stray bold tag gone haywire somewhere, possibly > .incrementToken() > > from a technical perspective, the documentation is excellent! but for > a new user unfamiliar with lucene, its unclear exactly what steps to > take: use the scary red experimental api or the old deprecated one? > > theres also some fairly advanced stuff such as .captureState and > .restoreState that might be better in a different place. > > finally, the .setUseNewAPI() and .setUseNewAPIDefault() are confusing > [one is static, one is not], especially because it states all streams > and filters in one chain must use the same API, is there a way to > simplify this? > > i'm really terrible with javadocs myself, but perhaps we can come up > with a way to improve the presentation... maybe that will make the > difference. > > On Mon, Jun 15, 2009 at 3:45 PM, Robert Muir wrote: > > > Mark, I'll see if I can get tests produced for some of those analyzers. > > as a new user of the new api myself, I think I can safely say the most > confusing thing about it is having the old deprecated API mixed in the > javadocs with it :) > > On Mon, Jun 15, 2009 at 2:53 PM, Mark Miller wrote: > > > Robert Muir wrote: > > > Mark, I created an issue for this. > > > > Thanks Robert, great idea. > > > I just think you know, converting an analyzer to the new api is really > not that bad. > > > > I don't either. I'm really just complaining about the initial readability. 
> Once you know whats up, its not too much different. I just have found myself > having to refigure out whats up (a short task to be sure) over again after I > leave it for a while. With the old one, everything was just kind of > immediately self evident. > > That makes me think new users might be a little more confused when they > first meet again. I'm not a new user though, so its only a guess really. > > > reverse engineering what one of them does is not necessarily obvious, > and is completely unrelated but necessary if they are to be migrated. > > I'd be willing to assist with some of this but I don't want to really > work the issue if its gonna be a waste of time at the end of the > day... > > > > The chances of this issue being fully reverted are so remote that I really > wouldnt let that stop you ... > > > On Mon, Jun 15, 2009 at 1:55 PM, Mark Miller wrote: > > > > Robert Muir wrote: > > > > As Lucene's contrib hasn't been fully converted either (and its been > quite > some time now), someone has probably heard that groan before. > > > > > hope this doesn't sound like a complaint, > > > > Complaints are fine in any case. Every now and then, it might cause a > little > rant from me or something, but please don't let that dissuade you :) > Who doesnt like to rant and rave now and then. As long as thoughts and > opinions are coming out in a non negative way (which certainly includes > complaints), I think its all good.
RE: New Token API was Re: Payloads and TrieRangeQuery
> And I don't like the *useNewAPI*() methods either. I spent a lot of time > thinking about backwards compatibility for this API. It's tricky to do > without sacrificing performance. In API patches I find myself spending > more time for backwards-compatibility than for the actual new feature! :( > > I'll try to think about how to simplify this confusing old/new API mix. One solution to fix this useNewAPI problem would be to change the AttributeSource in a way, to return classes that implement interfaces (as you proposed some weeks ago). The good old Token class then does not need to be deprecated, it could simply implement all 4 interfaces. AttributeSource then must implement a registry, which classes implement which interfaces. So if somebody wants a TermAttribute, he always gets the Token. In all other cases, the default could be a *Impl default class. In this case, next() could simply return this Token (which implements all 4 attributes). The reusableToken is simply thrown away in the deprecated API, the reusable Token comes from the AttributeSource and is per-instance. Is this an idea? Uwe
RE: New Token API was Re: Payloads and TrieRangeQuery
By the way, there is an empty "de" subdir in SVN inside analysis. Can this be removed? And, in tests: test/o/a/l/index/store is somehow wrongly placed. The class inside should be in test/o/a/l/store. Should I move? - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -Original Message- > From: Uwe Schindler [mailto:u...@thetaphi.de] > Sent: Monday, June 15, 2009 10:18 PM > To: java-dev@lucene.apache.org > Subject: RE: New Token API was Re: Payloads and TrieRangeQuery > > > there's also a stray bold tag gone haywire somewhere, possibly > > .incrementToken() > > I fixed this. This was getting on my nerves the whole day when I wrote > javadocs for NumericTokenStream... > > Uwe
Re: New Token API was Re: Payloads and TrieRangeQuery
This is excellent feedback, Robert! I agree this is confusing; especially having a deprecated API and only an experimental one that replaces the old one. We need to change that. And I don't like the *useNewAPI*() methods either. I spent a lot of time thinking about backwards compatibility for this API. It's tricky to do without sacrificing performance. In API patches I find myself spending more time for backwards-compatibility than for the actual new feature! :( I'll try to think about how to simplify this confusing old/new API mix. However, we need to make the decisions: a) if we want to release this new API with 2.9, b) if yes to a), if we want to remove the old API in 3.0? If yes to a) and no to b), then we'll have to support both APIs for a presumably very long time, so we then need to have a better solution for the backwards-compatibility here. -Michael On 6/15/09 1:09 PM, Robert Muir wrote: let me try some slightly more constructive feedback: new user looks at TokenStream javadocs: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/org/apache/lucene/analysis/TokenStream.html immediately they see deprecated, text in red with the words "experimental", warnings in bold, the whole thing is scary! due to the use of 'e.g.' the javadoc for .incrementToken() is cut off in a bad way, and its probably the most important method to a new user! there's also a stray bold tag gone haywire somewhere, possibly .incrementToken() from a technical perspective, the documentation is excellent! but for a new user unfamiliar with lucene, its unclear exactly what steps to take: use the scary red experimental api or the old deprecated one? theres also some fairly advanced stuff such as .captureState and .restoreState that might be better in a different place. 
finally, the .setUseNewAPI() and .setUseNewAPIDefault() are confusing [one is static, one is not], especially because it states all streams and filters in one chain must use the same API, is there a way to simplify this? i'm really terrible with javadocs myself, but perhaps we can come up with a way to improve the presentation... maybe that will make the difference. On Mon, Jun 15, 2009 at 3:45 PM, Robert Muir wrote: Mark, I'll see if I can get tests produced for some of those analyzers. as a new user of the new api myself, I think I can safely say the most confusing thing about it is having the old deprecated API mixed in the javadocs with it :) On Mon, Jun 15, 2009 at 2:53 PM, Mark Miller wrote: Robert Muir wrote: Mark, I created an issue for this. Thanks Robert, great idea. I just think you know, converting an analyzer to the new api is really not that bad. I don't either. I'm really just complaining about the initial readability. Once you know whats up, its not too much different. I just have found myself having to refigure out whats up (a short task to be sure) over again after I leave it for a while. With the old one, everything was just kind of immediately self evident. That makes me think new users might be a little more confused when they first meet again. I'm not a new user though, so its only a guess really. reverse engineering what one of them does is not necessarily obvious, and is completely unrelated but necessary if they are to be migrated. I'd be willing to assist with some of this but I don't want to really work the issue if its gonna be a waste of time at the end of the day... The chances of this issue being fully reverted are so remote that I really wouldnt let that stop you ... On Mon, Jun 15, 2009 at 1:55 PM, Mark Miller wrote: Robert Muir wrote: As Lucene's contrib hasn't been fully converted either (and its been quite some time now), someone has probably heard that groan before. 
hope this doesn't sound like a complaint, Complaints are fine in any case. Every now and then, it might cause a little rant from me or something, but please don't let that dissuade you :) Who doesnt like to rant and rave now and then. As long as thoughts and opinions are coming out in a non negative way (which certainly includes complaints), I think its all good. but in my opinion this is because many do not have any tests. I converted a few of these and its just grunt work but if there are no tests, its impossible to verify the conversion is correct. Thanks for pointing that out. We probably get lazy with tests, especially in contrib, and this brings up a good point - we should probably push for tests or write them before committing more often. Sometimes I'm sure it just comes downto a tradeoff though - no resources at the time, the class looked clear cut, and it was just contrib anyway. But then here we are ... a healthy dose of grunt work is bad enough when you have tests to check it. -- - Mark http://www.lucidimagination.com
Re: New Token API was Re: Payloads and TrieRangeQuery
Some great points - especially the decision between a deprecated API, and a new experimental one subject to change. Bit of a rock and a hard place for a new user. Perhaps we should add a little note with some guidance. - Mark Robert Muir wrote: let me try some slightly more constructive feedback: new user looks at TokenStream javadocs: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/org/apache/lucene/analysis/TokenStream.html immediately they see deprecated, text in red with the words "experimental", warnings in bold, the whole thing is scary! due to the use of 'e.g.' the javadoc for .incrementToken() is cut off in a bad way, and its probably the most important method to a new user! there's also a stray bold tag gone haywire somewhere, possibly .incrementToken() from a technical perspective, the documentation is excellent! but for a new user unfamiliar with lucene, its unclear exactly what steps to take: use the scary red experimental api or the old deprecated one? theres also some fairly advanced stuff such as .captureState and .restoreState that might be better in a different place. finally, the .setUseNewAPI() and .setUseNewAPIDefault() are confusing [one is static, one is not], especially because it states all streams and filters in one chain must use the same API, is there a way to simplify this? i'm really terrible with javadocs myself, but perhaps we can come up with a way to improve the presentation... maybe that will make the difference. On Mon, Jun 15, 2009 at 3:45 PM, Robert Muir wrote: Mark, I'll see if I can get tests produced for some of those analyzers. as a new user of the new api myself, I think I can safely say the most confusing thing about it is having the old deprecated API mixed in the javadocs with it :) On Mon, Jun 15, 2009 at 2:53 PM, Mark Miller wrote: Robert Muir wrote: Mark, I created an issue for this. Thanks Robert, great idea. I just think you know, converting an analyzer to the new api is really not that bad. 
I don't either. I'm really just complaining about the initial readability. Once you know whats up, its not too much different. I just have found myself having to refigure out whats up (a short task to be sure) over again after I leave it for a while. With the old one, everything was just kind of immediately self evident. That makes me think new users might be a little more confused when they first meet again. I'm not a new user though, so its only a guess really. reverse engineering what one of them does is not necessarily obvious, and is completely unrelated but necessary if they are to be migrated. I'd be willing to assist with some of this but I don't want to really work the issue if its gonna be a waste of time at the end of the day... The chances of this issue being fully reverted are so remote that I really wouldnt let that stop you ... On Mon, Jun 15, 2009 at 1:55 PM, Mark Miller wrote: Robert Muir wrote: As Lucene's contrib hasn't been fully converted either (and its been quite some time now), someone has probably heard that groan before. hope this doesn't sound like a complaint, Complaints are fine in any case. Every now and then, it might cause a little rant from me or something, but please don't let that dissuade you :) Who doesnt like to rant and rave now and then. As long as thoughts and opinions are coming out in a non negative way (which certainly includes complaints), I think its all good. but in my opinion this is because many do not have any tests. I converted a few of these and its just grunt work but if there are no tests, its impossible to verify the conversion is correct. Thanks for pointing that out. We probably get lazy with tests, especially in contrib, and this brings up a good point - we should probably push for tests or write them before committing more often. Sometimes I'm sure it just comes downto a tradeoff though - no resources at the time, the class looked clear cut, and it was just contrib anyway. But then here we are ... 
a healthy dose of grunt work is bad enough when you have tests to check it. -- - Mark http://www.lucidimagination.com -- Robert Muir rcm...@gmail.com
RE: New Token API was Re: Payloads and TrieRangeQuery
> there's also a stray bold tag gone haywire somewhere, possibly > .incrementToken() I fixed this. This was getting on my nerves the whole day when I wrote javadocs for NumericTokenStream... Uwe
Re: New Token API was Re: Payloads and TrieRangeQuery
let me try some slightly more constructive feedback: new user looks at TokenStream javadocs: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/org/apache/lucene/analysis/TokenStream.html immediately they see deprecated, text in red with the words "experimental", warnings in bold, the whole thing is scary! due to the use of 'e.g.' the javadoc for .incrementToken() is cut off in a bad way, and its probably the most important method to a new user! there's also a stray bold tag gone haywire somewhere, possibly .incrementToken() from a technical perspective, the documentation is excellent! but for a new user unfamiliar with lucene, its unclear exactly what steps to take: use the scary red experimental api or the old deprecated one? theres also some fairly advanced stuff such as .captureState and .restoreState that might be better in a different place. finally, the .setUseNewAPI() and .setUseNewAPIDefault() are confusing [one is static, one is not], especially because it states all streams and filters in one chain must use the same API, is there a way to simplify this? i'm really terrible with javadocs myself, but perhaps we can come up with a way to improve the presentation... maybe that will make the difference. On Mon, Jun 15, 2009 at 3:45 PM, Robert Muir wrote: > Mark, I'll see if I can get tests produced for some of those analyzers. > > as a new user of the new api myself, I think I can safely say the most > confusing thing about it is having the old deprecated API mixed in the > javadocs with it :) > > On Mon, Jun 15, 2009 at 2:53 PM, Mark Miller wrote: >> Robert Muir wrote: >>> >>> Mark, I created an issue for this. >>> >> >> Thanks Robert, great idea. >>> >>> I just think you know, converting an analyzer to the new api is really >>> not that bad. >>> >> >> I don't either. I'm really just complaining about the initial readability. >> Once you know whats up, its not too much different. 
I just have found myself >> having to refigure out whats up (a short task to be sure) over again after I >> leave it for a while. With the old one, everything was just kind of >> immediately self evident. >> >> That makes me think new users might be a little more confused when they >> first meet again. I'm not a new user though, so its only a guess really. >>> >>> reverse engineering what one of them does is not necessarily obvious, >>> and is completely unrelated but necessary if they are to be migrated. >>> >>> I'd be willing to assist with some of this but I don't want to really >>> work the issue if its gonna be a waste of time at the end of the >>> day... >>> >> >> The chances of this issue being fully reverted are so remote that I really >> wouldnt let that stop you ... >>> >>> On Mon, Jun 15, 2009 at 1:55 PM, Mark Miller wrote: >>> Robert Muir wrote: >> >> As Lucene's contrib hasn't been fully converted either (and its been >> quite >> some time now), someone has probably heard that groan before. >> >> > > hope this doesn't sound like a complaint, > Complaints are fine in any case. Every now and then, it might cause a little rant from me or something, but please don't let that dissuade you :) Who doesnt like to rant and rave now and then. As long as thoughts and opinions are coming out in a non negative way (which certainly includes complaints), I think its all good. > > but in my opinion this is > because many do not have any tests. > I converted a few of these and its just grunt work but if there are no > tests, its impossible to verify the conversion is correct. > > Thanks for pointing that out. We probably get lazy with tests, especially in contrib, and this brings up a good point - we should probably push for tests or write them before committing more often. Sometimes I'm sure it just comes downto a tradeoff though - no resources at the time, the class looked clear cut, and it was just contrib anyway. But then here we are ... 
a healthy dose of grunt work is bad enough when you have tests to check it. -- - Mark http://www.lucidimagination.com -- Robert Muir rcm...@gmail.com
Re: New Token API was Re: Payloads and TrieRangeQuery
Mark, I'll see if I can get tests produced for some of those analyzers. as a new user of the new api myself, I think I can safely say the most confusing thing about it is having the old deprecated API mixed in the javadocs with it :) On Mon, Jun 15, 2009 at 2:53 PM, Mark Miller wrote: > Robert Muir wrote: >> >> Mark, I created an issue for this. >> > > Thanks Robert, great idea. >> >> I just think you know, converting an analyzer to the new api is really >> not that bad. >> > > I don't either. I'm really just complaining about the initial readability. > Once you know whats up, its not too much different. I just have found myself > having to refigure out whats up (a short task to be sure) over again after I > leave it for a while. With the old one, everything was just kind of > immediately self evident. > > That makes me think new users might be a little more confused when they > first meet again. I'm not a new user though, so its only a guess really. >> >> reverse engineering what one of them does is not necessarily obvious, >> and is completely unrelated but necessary if they are to be migrated. >> >> I'd be willing to assist with some of this but I don't want to really >> work the issue if its gonna be a waste of time at the end of the >> day... >> > > The chances of this issue being fully reverted are so remote that I really > wouldnt let that stop you ... >> >> On Mon, Jun 15, 2009 at 1:55 PM, Mark Miller wrote: >> >>> >>> Robert Muir wrote: >>> > > As Lucene's contrib hasn't been fully converted either (and its been > quite > some time now), someone has probably heard that groan before. > > hope this doesn't sound like a complaint, >>> >>> Complaints are fine in any case. Every now and then, it might cause a >>> little >>> rant from me or something, but please don't let that dissuade you :) >>> Who doesnt like to rant and rave now and then. 
As long as thoughts and >>> opinions are coming out in a non negative way (which certainly includes >>> complaints), >>> I think its all good. >>> but in my opinion this is because many do not have any tests. I converted a few of these and its just grunt work but if there are no tests, its impossible to verify the conversion is correct. >>> >>> Thanks for pointing that out. We probably get lazy with tests, especially >>> in >>> contrib, and this brings up a good point - we should probably push >>> for tests or write them before committing more often. Sometimes I'm sure >>> it >>> just comes down to a tradeoff though - no resources at the time, >>> the class looked clear cut, and it was just contrib anyway. But then here >>> we >>> are ... a healthy dose of grunt work is bad enough when you have tests to >>> check it. >>> >>> -- >>> - Mark >>> >>> http://www.lucidimagination.com -- Robert Muir rcm...@gmail.com
RE: New Token API was Re: Payloads and TrieRangeQuery
> If you understood that, you'd be able to look > at the actual token value if you were interested in what shift was > used. So it's redundant, has a runtime cost, it's not currently used > anywhere, and it's not useful to fields other than Trie. Perhaps it > shouldn't exist (yet)? You are right, you could also decode the shift value from the first char of the token... I think I will remove the ShiftAttribute and only set the TermType to mark highest vs. lower precisions. That way, one could easily add a payload to the real numeric value using a TokenFilter. Uwe
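A tiny self-contained illustration of the redundancy Yonik and Uwe are discussing. The encoding below is made up for the example (the base character and the hex body are not the real prefix-coded term format): the point is only that once the shift is folded into the first char of the term, any consumer can recover it from the token text alone, so a separate ShiftAttribute carries no extra information.

```java
public class ShiftPrefixDemo {
    // Illustrative base char for carrying the shift; not Lucene's actual constant.
    static final char SHIFT_START = 0x20;

    // Prefix-encode: record the shift in the first char, then the
    // value with its low `shift` bits dropped, in hex.
    static String encode(long value, int shift) {
        return (char) (SHIFT_START + shift) + Long.toString(value >>> shift, 16);
    }

    // The shift is recoverable from the first char alone -- this is why
    // the attribute is argued to be redundant.
    static int decodeShift(String token) {
        return token.charAt(0) - SHIFT_START;
    }

    public static void main(String[] args) {
        String tok = encode(0x1234L, 8);
        System.out.println(decodeShift(tok)); // prints 8
    }
}
```
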
RE: New Token API was Re: Payloads and TrieRangeQuery
> On Mon, Jun 15, 2009 at 3:00 PM, Uwe Schindler wrote: > > There is a new Attribute called ShiftAttribute (or > NumericShiftAttribute), > > when trie range is moved to core. This attribute contains the shifted- > away > > bits from the prefix encoded value during trie indexing. > > I was wondering about this > To make use of ShiftAttribute, you need to understand the trie > encoding scheme itself. If you understood that, you'd be able to look > at the actual token value if you were interested in what shift was > used. So it's redundant, has a runtime cost, it's not currently used > anywhere, and it's not useful to fields other than Trie. Perhaps it > shouldn't exist (yet)? The idea was to make the indexing process controllable. You were the one who asked e.g. for the possibility to add payloads to trie fields and so on. Using the shift attribute, you have full control of the token types. OK, it's a little bit redundant; you could also use the TypeAttribute (which is already used to mark highest precision and lower precision values). One question about the whole TokenStream: In the original case we discussed about Payloads/Position and TrieRange. If this would be implemented in future versions, the question is, how should I set the PositionIncrement/Offsets in the token stream to create a Position of 0 in the index. I do not understand the indexing process here, especially this deprecated boolean flag about something negative (not sure what the name was). Should I set PositionIncrement to 0 for all Trie fields per default? How about PositionIncrementGap, when indexing more than one field? All not really clear. The position would be simpler to implement, but doing this with an attribute that is indexed together with the other attributes like a payload would be the most ideal solution for future versions of TrieRange. 
(Maybe we could also use the Offset attribute for the highest precision bits) Uwe
Re: New Token API was Re: Payloads and TrieRangeQuery
On Mon, Jun 15, 2009 at 3:00 PM, Uwe Schindler wrote: > There is a new Attribute called ShiftAttribute (or NumericShiftAttribute), > when trie range is moved to core. This attribute contains the shifted-away > bits from the prefix encoded value during trie indexing. I was wondering about this. To make use of ShiftAttribute, you need to understand the trie encoding scheme itself. If you understood that, you'd be able to look at the actual token value if you were interested in what shift was used. So it's redundant, has a runtime cost, it's not currently used anywhere, and it's not useful to fields other than Trie. Perhaps it shouldn't exist (yet)? -Yonik http://www.lucidimagination.com
RE: New Token API was Re: Payloads and TrieRangeQuery
> Also, what about the case where one might have attributes that are meant > for downstream TokenFilters, but not necessarily for indexing? Offsets > and type come to mind. Is it the case now that those attributes are not > automatically added to the index? If they are ignored now, what if I > want to add them? I admit, I'm having a hard time finding the code that > specifically loops over the Attributes. I recall seeing it, but can no > longer find it. There is a new Attribute called ShiftAttribute (or NumericShiftAttribute), when trie range is moved to core. This attribute contains the shifted-away bits from the prefix encoded value during trie indexing. The idea is to e.g. have TokenFilters that may add additional payloads or other attributes to trie values, but only do this for specific precisions. In future, it may also be interesting to automatically add this attribute to the index. Maybe we should add a read/store method to attributes, that adds an attribute to the Posting using an IndexOutput/IndexInput (like the serialization methods). Uwe
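To illustrate the idea of a downstream filter that acts only on specific precisions, here is a hedged, self-contained sketch. The token and type names (ToyToken, "fullPrecNumeric") are hypothetical stand-ins, not the real Lucene or trie API: a filter pass attaches a payload only to tokens whose type marks them as full precision, leaving the lower-precision helper terms untouched.

```java
import java.util.List;

// Toy "token": a term plus a type string; names are illustrative only.
class ToyToken {
    final String term;
    final String type;
    byte[] payload; // null until a filter attaches one
    ToyToken(String term, String type) { this.term = term; this.type = type; }
}

// A filter-style pass in the spirit Uwe describes: the token type
// (rather than a separate ShiftAttribute) decides which tokens get
// the extra payload.
public class PrecisionPayloadFilter {
    static List<ToyToken> filter(List<ToyToken> in, byte[] payload) {
        for (ToyToken t : in) {
            if ("fullPrecNumeric".equals(t.type)) { // hypothetical type name
                t.payload = payload;
            }
        }
        return in;
    }

    public static void main(String[] args) {
        List<ToyToken> tokens = List.of(
                new ToyToken("(1234", "fullPrecNumeric"),
                new ToyToken("012", "lowerPrecNumeric"));
        filter(tokens, new byte[]{42});
        System.out.println(tokens.get(0).payload != null); // true
        System.out.println(tokens.get(1).payload == null); // true
    }
}
```
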
Re: New Token API was Re: Payloads and TrieRangeQuery
Robert Muir wrote: Mark, I created an issue for this. Thanks Robert, great idea. I just think you know, converting an analyzer to the new api is really not that bad. I don't either. I'm really just complaining about the initial readability. Once you know whats up, its not too much different. I just have found myself having to refigure out whats up (a short task to be sure) over again after I leave it for a while. With the old one, everything was just kind of immediately self evident. That makes me think new users might be a little more confused when they first meet again. I'm not a new user though, so its only a guess really. reverse engineering what one of them does is not necessarily obvious, and is completely unrelated but necessary if they are to be migrated. I'd be willing to assist with some of this but I don't want to really work the issue if its gonna be a waste of time at the end of the day... The chances of this issue being fully reverted are so remote that I really wouldnt let that stop you ... On Mon, Jun 15, 2009 at 1:55 PM, Mark Miller wrote: Robert Muir wrote: As Lucene's contrib hasn't been fully converted either (and its been quite some time now), someone has probably heard that groan before. hope this doesn't sound like a complaint, Complaints are fine in any case. Every now and then, it might cause a little rant from me or something, but please don't let that dissuade you :) Who doesnt like to rant and rave now and then. As long as thoughts and opinions are coming out in a non negative way (which certainly includes complaints), I think its all good. but in my opinion this is because many do not have any tests. I converted a few of these and its just grunt work but if there are no tests, its impossible to verify the conversion is correct. Thanks for pointing that out. We probably get lazy with tests, especially in contrib, and this brings up a good point - we should probably push for tests or write them before committing more often. 
Sometimes I'm sure it just comes down to a tradeoff though - no resources at the time, the class looked clear cut, and it was just contrib anyway. But then here we are ... a healthy dose of grunt work is bad enough when you have tests to check it. -- - Mark http://www.lucidimagination.com
Re: New Token API was Re: Payloads and TrieRangeQuery
Mark, I created an issue for this. I just think you know, converting an analyzer to the new api is really not that bad. reverse engineering what one of them does is not necessarily obvious, and is completely unrelated but necessary if they are to be migrated. I'd be willing to assist with some of this but I don't want to really work the issue if its gonna be a waste of time at the end of the day... On Mon, Jun 15, 2009 at 1:55 PM, Mark Miller wrote: > Robert Muir wrote: >>> >>> As Lucene's contrib hasn't been fully converted either (and its been >>> quite >>> some time now), someone has probably heard that groan before. >>> >> >> hope this doesn't sound like a complaint, > > Complaints are fine in any case. Every now and then, it might cause a little > rant from me or something, but please don't let that dissuade you :) > Who doesnt like to rant and rave now and then. As long as thoughts and > opinions are coming out in a non negative way (which certainly includes > complaints), > I think its all good. >> >> but in my opinion this is >> because many do not have any tests. >> I converted a few of these and its just grunt work but if there are no >> tests, its impossible to verify the conversion is correct. >> > > Thanks for pointing that out. We probably get lazy with tests, especially in > contrib, and this brings up a good point - we should probably push > for tests or write them before committing more often. Sometimes I'm sure it > just comes downto a tradeoff though - no resources at the time, > the class looked clear cut, and it was just contrib anyway. But then here we > are ... a healthy dose of grunt work is bad enough when you have tests to > check it. 
-- Robert Muir rcm...@gmail.com
Re: New Token API was Re: Payloads and TrieRangeQuery
Robert Muir wrote: As Lucene's contrib hasn't been fully converted either (and its been quite some time now), someone has probably heard that groan before. hope this doesn't sound like a complaint, Complaints are fine in any case. Every now and then, it might cause a little rant from me or something, but please don't let that dissuade you :) Who doesnt like to rant and rave now and then. As long as thoughts and opinions are coming out in a non negative way (which certainly includes complaints), I think its all good. but in my opinion this is because many do not have any tests. I converted a few of these and its just grunt work but if there are no tests, its impossible to verify the conversion is correct. Thanks for pointing that out. We probably get lazy with tests, especially in contrib, and this brings up a good point - we should probably push for tests or write them before committing more often. Sometimes I'm sure it just comes down to a tradeoff though - no resources at the time, the class looked clear cut, and it was just contrib anyway. But then here we are ... a healthy dose of grunt work is bad enough when you have tests to check it. -- - Mark http://www.lucidimagination.com
Re: New Token API was Re: Payloads and TrieRangeQuery
On Jun 14, 2009, at 8:05 PM, Michael Busch wrote: I'd be happy to discuss other API proposals that anybody brings up here, that have the same advantages and are more intuitive. We could also beef up the documentation and give a better example about how to convert a stream/filter from the old to the new API; a constructive suggestion that Uwe made at the ApacheCon. More questions: 1. What about Highlighter and MoreLikeThis? They have not been converted. Also, what are they going to do if the attributes they need are not available? Caveat emptor? 2. Same for TermVectors. What if the user specifies with positions and offsets, but the analyzer doesn't produce them? Caveat emptor? (BTW, this is also true for the new omit TF stuff) 3. Also, what about the case where one might have attributes that are meant for downstream TokenFilters, but not necessarily for indexing? Offsets and type come to mind. Is it the case now that those attributes are not automatically added to the index? If they are ignored now, what if I want to add them? I admit, I'm having a hard time finding the code that specifically loops over the Attributes. I recall seeing it, but can no longer find it. Also, can we add something like an AttributeTermQuery? Seems like it could work similar to the BoostingTermQuery. I'm sure more will come to me. -Grant
Re: New Token API was Re: Payloads and TrieRangeQuery
> > As Lucene's contrib hasn't been fully converted either (and its been quite > some time now), someone has probably heard that groan before. hope this doesn't sound like a complaint, but in my opinion this is because many do not have any tests. I converted a few of these and its just grunt work but if there are no tests, its impossible to verify the conversion is correct. -- Robert Muir rcm...@gmail.com
Re: New Token API was Re: Payloads and TrieRangeQuery
Yonik Seeley wrote: The high-level description of the new API looks good (being able to add arbitrary properties to tokens), unfortunately, I've never had the time to try and use it and give any constructive feedback. As far as difficulty of use, I assume this only applies to implementing your own TokenFilter? It seems like most standard users would be just stringing together existing TokenFilters to create custom Analyzers? -Yonik http://www.lucidimagination.com True - its the implementation. And just trying to understand whats going on the first time you see it. Its not particularly difficult, but its also not obvious like the previous API was. As a user, I would ask why that is so, and frankly the answer wouldn't do much for me (as a user). I don't know if most 'standard' users implement their own or not. I will say, and perhaps I was in a special situation, I was writing them and modifying them almost as soon as I started playing with Lucene. And even when I wasnt, I needed to understand the code to understand some of the complexities that could occur, and thankfully, that was breezy to do. Right now, if you told me to go convert all of Solr to the new API you would hear a mighty groan. As Lucene's contrib hasn't been fully converted either (and its been quite some time now), someone has probably heard that groan before. -- - Mark http://www.lucidimagination.com
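[Editor's note: for anyone trying to picture the new style without opening the source, here is a stripped-down toy model of the pattern under discussion. These are invented names, not Lucene's classes: instead of passing a Token through next(), filters look up shared attribute objects once (in the constructor) and incrementToken() mutates them in place.]

```java
import java.util.HashMap;
import java.util.Map;

// A minimal term attribute; Lucene's real TermAttribute uses char[].
class TermAttr { String term; }

// Toy AttributeSource-like base: attributes live in a map shared down
// the whole filter chain, so there is no per-token allocation or cast.
abstract class MiniStream {
    private final Map<Class<?>, Object> attrs;
    MiniStream() { attrs = new HashMap<>(); }
    MiniStream(MiniStream input) { attrs = input.attrs; } // share with input
    @SuppressWarnings("unchecked")
    <T> T addAttribute(Class<T> cls) {
        return (T) attrs.computeIfAbsent(cls, c -> {
            try { return c.getDeclaredConstructor().newInstance(); }
            catch (ReflectiveOperationException e) { throw new RuntimeException(e); }
        });
    }
    abstract boolean incrementToken();
}

// Tokenizer over a fixed array, standing in for a real Tokenizer.
class ArrayTokenizer extends MiniStream {
    private final String[] terms;
    private int i;
    private final TermAttr term = addAttribute(TermAttr.class);
    ArrayTokenizer(String... terms) { this.terms = terms; }
    boolean incrementToken() {
        if (i == terms.length) return false;
        term.term = terms[i++];
        return true;
    }
}

// Filter in the new style: no Token in, no Token out, just mutate attrs.
class LowerCaseishFilter extends MiniStream {
    private final MiniStream input;
    private final TermAttr term = addAttribute(TermAttr.class);
    LowerCaseishFilter(MiniStream input) { super(input); this.input = input; }
    boolean incrementToken() {
        if (!input.incrementToken()) return false;
        term.term = term.term.toLowerCase();
        return true;
    }
}

public class MiniAnalysisDemo {
    public static void main(String[] args) {
        MiniStream stream = new LowerCaseishFilter(new ArrayTokenizer("Hello", "WORLD"));
        TermAttr term = stream.addAttribute(TermAttr.class);
        while (stream.incrementToken()) {
            System.out.println(term.term); // hello, then world
        }
    }
}
```

The consumer side is the part Yonik asks about: stringing filters together looks almost identical to the old API; only filter authors see the attribute lookup.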
Re: New Token API was Re: Payloads and TrieRangeQuery
The high-level description of the new API looks good (being able to add arbitrary properties to tokens), unfortunately, I've never had the time to try and use it and give any constructive feedback. As far as difficulty of use, I assume this only applies to implementing your own TokenFilter? It seems like most standard users would be just stringing together existing TokenFilters to create custom Analyzers? -Yonik http://www.lucidimagination.com
Re: New Token API was Re: Payloads and TrieRangeQuery
On Jun 14, 2009, at 8:05 PM, Michael Busch wrote: I'm not sure why this (currently having to implement next() too) is such an issue for you. You brought it up at the Lucene meetup too. No user will ever have to implement both (the new API and the old) in their streams/filters. The only reason why we did it this way is to not sacrifice performance for existing streams/filters when people switch to Lucene 2.9. I explained this point in the jira issue: http://issues.apache.org/jira/browse/LUCENE-1422?focusedCommentId=12644881&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12644881 The only time when we'll ever have to implement both APIs is between now and 2.9, only for new streams and filters that we add before 2.9 is released. I don't think it'd be reasonable to consider this disadvantage as a show stopper. It's an issue b/c I don't like writing dead code and who knows when 2.9 will actually be out. I don't think it is a show stopper either. Add on top of it, that the whole point of customizing the chain is to use it in search and, frankly speaking, somehow I think that part of the patch was held back. I'm not sure what you're implying. Could you elaborate? Sorry, see my response to Michael M. on this. I didn't mean to imply you were doing something malicious, just that it always felt half done to me. Knowing you, you don't strike me as someone who does things half way, so that's why I felt it was held back. But, as Michael M reminded me, it is complex, so please accept my apologies. The search side of the API is currently being developed in Lucene-1458. 1458 will not make it into 2.9. Therefore I agree that it is not very advantageous to switch to the new API right now for Lucene users. On the other hand, I don't think it hurts either. I am not sure I agree here. Forcing people to upgrade their analyzers can be quite involved. Analyzers are one of the main areas that people do custom work. 
Solr, for instance, has 11 custom TokenFilters right now as well as custom Tokenizers, not to mention the ones used during testing that aren't shipped. Upgrading these is a lot of work. I know in previous jobs, I also maintained a fair amount of TokenStream-related stuff. This should not be underestimated. Furthermore, as I said back in the initial discussion, Lucene's Analyzer stuff is often used outside of Lucene. In fact, I often think the Analysis piece should be a standalone jar (not requiring core) and that core should have a dependency on it. In other words, move o.a.l.analysis (and contrib/analysis) to a standalone module that core depends on. This would make it easier for others to consume the Analysis functionality. I personally would vote for reverting until a complete patch that addresses both sides of the problem is submitted and a better solution to cloning is put forth. If we revert now and put a new flexible API like this into 3.x, which I think is necessary to utilize flexible indexing, then we'll have to wait until 4.0 before we can remove the old API. Disadvantages like the one you mentioned above, will then probably be present much longer. I mentioned in the following thread that I have started working on a better way of cloning, which will actually be faster compared to the old API. I'll try to get the code out asap. http://markmail.org/message/q7pgh2qlm2w7cxfx I'd be happy to discuss other API proposals that anybody brings up here, that have the same advantages and are more intuitive. We could also beef up the documentation and give a better example about how to convert a stream/filter from the old to the new API; a constructive suggestion that Uwe made at the ApacheCon. My point here was, at the time, that if others wanted to revert, I probably would vote for it. I'm not proposing we do it, as I think we can make do with what we have. Given the discussion here, I would probably change my mind and not support it now. 
I think it might be helpful to have some help for people upgrading. Perhaps an abstract class that provides the "core" Token attributes out of the box as a base class that they can then extend? That being said, forcing people to upgrade could at least help them think about the fact that they have no use for the Type attribute or the Offsets attributes. And, testing the cloning stuff would help. I think the current approach underestimates the number of people who need to buffer tokens in memory before handing them out. Sure, it's not as many as the main use case, but it's not zero either.
Re: New Token API was Re: Payloads and TrieRangeQuery
On Jun 15, 2009, at 12:19 PM, Michael McCandless wrote: I don't think anything was "held back" in this effort. Grant, are you referring to LUCENE-1458? That's "held back" simply because the only person working on it (me) got distracted by other things to work on. I'm sorry, I didn't mean to imply Michael B. was holding back on the work. The patch has always felt half done to me because what's the point of having all of these attributes in the index if you don't have any way of searching them, thus I was struck by the need to get it in prior to making it available in search. I realize it's complex, but here we are forcing people to upgrade for some future, long term goal.
Re: New Token API was Re: Payloads and TrieRangeQuery
I thought the primary goal of switching to AttributeSource (yes, the name is very generic...) was to allow extensibility to what's created per-Token, so that an app could add their own attrs without costly subclassing/casting per Token, independent of other "things" adding their tokens, etc. EG, trie* takes advantage of this extensibility by adding a ShiftAttribute. Subclassing Token in your app wasn't a good solution for various reasons. I do think the API is somewhat more cumbersome than before, and I don't like that about it (consumability!). But net/net I think the change is good, and it's one of the baby steps for flexible indexing (bullet #11): http://wiki.apache.org/jakarta-lucene/Lucene2Whiteboard Ie it addresses the flexibility during analysis. I don't think anything was "held back" in this effort. Grant, are you referring to LUCENE-1458? That's "held back" simply because the only person working on it (me) got distracted by other things to work on. Flexible indexing (all of bullet #11) is a complex project, and we need to break it into baby steps like this one. We've already made good progress on it: you can already make custom attrs and a custom (but, package private) indexing chain if you want. Next step is pluggable codecs for writing index files (LUCENE-1458), and APIs for reading them (that generalize Terms/TermDoc/TermPositions we have today). Mike On Sun, Jun 14, 2009 at 11:41 PM, Shai Erera wrote: > The "old" API is deprecated, and therefore when we release 2.9 there might > be some people who'd think they should move away from it, to better prepare > for 3.0 (while in fact this many not be the case). Also, we should make sure > that when we remove all the deprecations, this will still exist (and > therefore, why deprecate it now?), if we think this should indeed be kept > around for at least a while longer. 
> > I personally am all for keeping it around (it will save me a huge > refactoring of an Analyzer package I wrote), but I have to admit it's only > because I've got quite comfortable with the existing API, and did not have > the time to try the new one yet. > > Shai > > On Mon, Jun 15, 2009 at 3:49 AM, Mark Miller wrote: >> >> Mark Miller wrote: >>> >>> I don't know how I feel about rolling the new token api back. >>> >>> I will say that I originally had no issue with it because I am very >>> excited about Lucene-1458. >>> >>> At the same time though, I'm thinking Lucene-1458 is a very advanced >>> issue that will likely be for really expert usage (though I can see benefits >>> falling to general users). >>> >>> I'm slightly iffy about making an intuitive api much less intuitive for >>> an expert future feature that hasn't fully materialized in Lucene yet. It >>> almost seems like that fight should weigh towards general usage and standard >>> users. >>> >>> I don't have a better proposal though, nor the time to consider it at the >>> moment. I was just more curious if anyone else had any thoughts. I hadn't >>> realized Grant had asked a similar question not long ago >>> with no response. Not sure how to take that, but I'd think that would >>> indicate less problems with people than more. On the other hand, you don't >>> have to switch yet (with trunk) and we have yet to release it. I wonder how >>> many non dev, every day users have really had to tussle with the new API >>> yet. Not many people complaining too loudly at the moment though. >>> >>> Asking for a roll back seems a bit extreme without a little more support >>> behind it than we have seen. >>> >>> - Mark >> >> PS >> >> I know you didnt ask for a rollback Grant - just kind of talking in a >> general manner. I see your point on getting the search side in, I'm just not >> sure I agree that it really matters if one hits before the other. Like Mike >> says, you don't >> have to switch to the new API yet. 
Re: New Token API was Re: Payloads and TrieRangeQuery
The "old" API is deprecated, and therefore when we release 2.9 there might be some people who'd think they should move away from it, to better prepare for 3.0 (while in fact this may not be the case). Also, we should make sure that when we remove all the deprecations, this will still exist (and therefore, why deprecate it now?), if we think this should indeed be kept around for at least a while longer. I personally am all for keeping it around (it will save me a huge refactoring of an Analyzer package I wrote), but I have to admit it's only because I've got quite comfortable with the existing API, and did not have the time to try the new one yet. Shai On Mon, Jun 15, 2009 at 3:49 AM, Mark Miller wrote: > Mark Miller wrote: > >> I don't know how I feel about rolling the new token api back. >> >> I will say that I originally had no issue with it because I am very >> excited about Lucene-1458. >> >> At the same time though, I'm thinking Lucene-1458 is a very advanced issue >> that will likely be for really expert usage (though I can see benefits >> falling to general users). >> >> I'm slightly iffy about making an intuitive api much less intuitive for an >> expert future feature that hasn't fully materialized in Lucene yet. It >> almost seems like that fight should weigh towards general usage and standard >> users. >> >> I don't have a better proposal though, nor the time to consider it at the >> moment. I was just more curious if anyone else had any thoughts. I hadn't >> realized Grant had asked a similar question not long ago >> with no response. Not sure how to take that, but I'd think that would >> indicate less problems with people than more. On the other hand, you don't >> have to switch yet (with trunk) and we have yet to release it. I wonder how >> many non dev, every day users have really had to tussle with the new API >> yet. Not many people complaining too loudly at the moment though. 
>> >> Asking for a roll back seems a bit extreme without a little more support >> behind it than we have seen. >> >> - Mark >> > > PS > > I know you didnt ask for a rollback Grant - just kind of talking in a > general manner. I see your point on getting the search side in, I'm just not > sure I agree that it really matters if one hits before the other. Like Mike > says, you don't > have to switch to the new API yet. > > -- > - Mark > > http://www.lucidimagination.com
Re: New Token API was Re: Payloads and TrieRangeQuery
Mark Miller wrote: I don't know how I feel about rolling the new token api back. I will say that I originally had no issue with it because I am very excited about Lucene-1458. At the same time though, I'm thinking Lucene-1458 is a very advanced issue that will likely be for really expert usage (though I can see benefits falling to general users). I'm slightly iffy about making an intuitive api much less intuitive for an expert future feature that hasn't fully materialized in Lucene yet. It almost seems like that fight should weigh towards general usage and standard users. I don't have a better proposal though, nor the time to consider it at the moment. I was just more curious if anyone else had any thoughts. I hadn't realized Grant had asked a similar question not long ago with no response. Not sure how to take that, but I'd think that would indicate less problems with people than more. On the other hand, you don't have to switch yet (with trunk) and we have yet to release it. I wonder how many non dev, every day users have really had to tussle with the new API yet. Not many people complaining too loudly at the moment though. Asking for a roll back seems a bit extreme without a little more support behind it than we have seen. - Mark PS I know you didnt ask for a rollback Grant - just kind of talking in a general manner. I see your point on getting the search side in, I'm just not sure I agree that it really matters if one hits before the other. Like Mike says, you don't have to switch to the new API yet. -- - Mark http://www.lucidimagination.com
Re: New Token API was Re: Payloads and TrieRangeQuery
I don't know how I feel about rolling the new token api back. I will say that I originally had no issue with it because I am very excited about Lucene-1458. At the same time though, I'm thinking Lucene-1458 is a very advanced issue that will likely be for really expert usage (though I can see benefits falling to general users). I'm slightly iffy about making an intuitive api much less intuitive for an expert future feature that hasn't fully materialized in Lucene yet. It almost seems like that fight should weigh towards general usage and standard users. I don't have a better proposal though, nor the time to consider it at the moment. I was just more curious if anyone else had any thoughts. I hadn't realized Grant had asked a similar question not long ago with no response. Not sure how to take that, but I'd think that would indicate less problems with people than more. On the other hand, you don't have to switch yet (with trunk) and we have yet to release it. I wonder how many non dev, every day users have really had to tussle with the new API yet. Not many people complaining too loudly at the moment though. Asking for a roll back seems a bit extreme without a little more support behind it than we have seen. - Mark
Re: New Token API was Re: Payloads and TrieRangeQuery
On 6/14/09 5:17 AM, Grant Ingersoll wrote: Agreed. I've been bringing it up for a while now and made the same comments when it was first introduced, but felt like the lone voice in the wilderness on it and gave way [1], [2], [3]. Now that others are writing/converting, I think it is worth revisiting. I am and always was open to constructive suggestions about how to design this API. I know these new APIs currently don't seem to have many advantages over the previous ones, but they're basically laying the API groundwork for future features like flexible indexing. Some concerns you mentioned were targeted against the first version of the patch in LUCENE-1422. But, you later said you liked how the next patch looked (in thread [2] that you mentioned). That being said, I did just write my first TokenFilter with it, and didn't think it was that hard. There are some gains in it and the API can be simpler if you just need one or two attributes (see DelimitedPayloadTokenFilter), although, just like the move to using char [] in Token, as soon as you do something like store a Token, you lose most of the benefit, I think (for the char [] case, as soon as you need a String in one of your filters, you lose the perf. gain). The annoying parts are that you still have to implement the deprecated next() part, otherwise chances are the thing is unusable by everyone at this point anyway. I'm not sure why this (currently having to implement next() too) is such an issue for you. You brought it up at the Lucene meetup too. No user will ever have to implement both (the new API and the old) in their streams/filters. The only reason why we did it this way is to not sacrifice performance for existing streams/filters when people switch to Lucene 2.9. 
I explained this point in the jira issue: http://issues.apache.org/jira/browse/LUCENE-1422?focusedCommentId=12644881&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12644881 The only time when we'll ever have to implement both APIs is between now and 2.9, only for new streams and filters that we add before 2.9 is released. I don't think it'd be reasonable to consider this disadvantage as a show stopper. Add on top of it, that the whole point of customizing the chain is to use it in search and, frankly speaking, somehow I think that part of the patch was held back. I'm not sure what you're implying. Could you elaborate? The search side of the API is currently being developed in Lucene-1458. 1458 will not make it into 2.9. Therefore I agree that it is not very advantageous to switch to the new API right now for Lucene users. On the other hand, I don't think it hurts either. I personally would vote for reverting until a complete patch that addresses both sides of the problem is submitted and a better solution to cloning is put forth. If we revert now and put a new flexible API like this into 3.x, which I think is necessary to utilize flexible indexing, then we'll have to wait until 4.0 before we can remove the old API. Disadvantages like the one you mentioned above, will then probably be present much longer. I mentioned in the following thread that I have started working on a better way of cloning, which will actually be faster compared to the old API. I'll try to get the code out asap. http://markmail.org/message/q7pgh2qlm2w7cxfx I'd be happy to discuss other API proposals that anybody brings up here, that have the same advantages and are more intuitive. We could also beef up the documentation and give a better example about how to convert a stream/filter from the old to the new API; a constructive suggestion that Uwe made at the ApacheCon. 
-Michael -Grant [1] http://issues.apache.org/jira/browse/LUCENE-1422, [2] http://www.lucidimagination.com/search/document/5daf6d7b8027b4d3/tokenstream_and_token_apis#9e2d0d2b5dc118d4, and the rest of the discussion on that thread. [3] http://www.lucidimagination.com/search/document/4274335abcf31926/new_tokenstream_api_usage On Jun 13, 2009, at 10:32 PM, Mark Miller wrote: What was the big improvement with it again? Advanced, expert custom indexing chains require less casting or something right?