Re: New Token API was Re: Payloads and TrieRangeQuery
On Jun 15, 2009, at 2:11 PM, Grant Ingersoll wrote:

More questions:

1. What about Highlighter and MoreLikeThis? They have not been converted. Also, what are they going to do if the attributes they need are not available? Caveat emptor?

2. Same for TermVectors. What if the user specifies with positions and offsets, but the analyzer doesn't produce them? Caveat emptor? (BTW, this is also true for the new omit-TF stuff.)

3. Also, what about the case where one might have attributes that are meant for downstream TokenFilters, but not necessarily for indexing? Offsets and type come to mind. Is it the case now that those attributes are not automatically added to the index? If they are ignored now, what if I want to add them? I admit, I'm having a hard time finding the code that specifically loops over the Attributes. I recall seeing it, but can no longer find it.

Also, can we add something like an AttributeTermQuery? It seems like it could work similarly to BoostingTermQuery.

So, I think I see #1 covered; how about #2, #3, and the notion of an AttributeTermQuery? Anyone have thoughts on those? I might have some time next week to work up a Query, as it sounds like fun, but don't hold me to it just yet.

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
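Grant's "caveat emptor" question — what a consumer like Highlighter or MoreLikeThis should do when an attribute it needs is absent — suggests a probe-and-degrade pattern. The sketch below illustrates that pattern with simplified stand-in classes; `AttributeSource`, `TermAttribute`, and `OffsetAttribute` here are minimal mock-ups written for this example, not the real Lucene types:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-ins for the classes under discussion -- NOT the real
// Lucene types, just an illustration of the probe-and-degrade pattern.
class AttributeSource {
    private final Map<Class<?>, Object> attributes = new HashMap<>();

    // get-or-create: guarantees one shared instance per attribute type
    @SuppressWarnings("unchecked")
    public <T> T addAttribute(Class<T> clazz) {
        return (T) attributes.computeIfAbsent(clazz, c -> {
            try {
                return c.getDeclaredConstructor().newInstance();
            } catch (Exception e) {
                throw new IllegalArgumentException(e);
            }
        });
    }

    public boolean hasAttribute(Class<?> clazz) {
        return attributes.containsKey(clazz);
    }
}

class TermAttribute { String term; }
class OffsetAttribute { int start, end; }

public class AttributeCheckDemo {
    // A consumer such as a highlighter could probe for optional attributes
    // and degrade gracefully instead of failing ("caveat emptor").
    static String describe(AttributeSource stream) {
        stream.addAttribute(TermAttribute.class); // always needed
        return stream.hasAttribute(OffsetAttribute.class)
                ? "term+offsets available"
                : "term only, no offsets";
    }

    public static void main(String[] args) {
        AttributeSource stream = new AttributeSource();
        System.out.println(describe(stream)); // offsets not registered yet
        stream.addAttribute(OffsetAttribute.class);
        System.out.println(describe(stream));
    }
}
```

Whether the real consumers should silently degrade like this, or throw, is exactly the open question in the thread.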
Re: New Token API was Re: Payloads and TrieRangeQuery
On 6/15/09 10:10 AM, Grant Ingersoll wrote:
> But, as Michael M reminded me, it is complex, so please accept my apologies.

No worries, Grant! I was not really offended, but rather confused... Thanks for clarifying.

Michael
Re: New Token API was Re: Payloads and TrieRangeQuery
Grant Ingersoll wrote:
> 1. What about Highlighter

I would guess Highlighter has not been updated because it's kind of a royal * :)

--
- Mark
http://www.lucidimagination.com
Re: New Token API was Re: Payloads and TrieRangeQuery
Mark Miller wrote:
> Grant Ingersoll wrote:
>> On Jun 14, 2009, at 8:05 PM, Michael Busch wrote:
>>> I'd be happy to discuss other API proposals that anybody brings up here, that have the same advantages and are more intuitive. We could also beef up the documentation and give a better example about how to convert a stream/filter from the old to the new API; a constructive suggestion that Uwe made at the ApacheCon.
>>
>> More questions:
>>
>> 1. What about Highlighter and MoreLikeThis? They have not been converted. Also, what are they going to do if the attributes they need are not available? Caveat emptor?
>>
>> 2. Same for TermVectors. What if the user specifies with positions and offsets, but the analyzer doesn't produce them? Caveat emptor? (BTW, this is also true for the new omit TF stuff)
>>
>> 3. Also, what about the case where one might have attributes that are meant for downstream TokenFilters, but not necessarily for indexing? Offsets and type come to mind. Is it the case now that those attributes are not automatically added to the index? If they are ignored now, what if I want to add them? I admit, I'm having a hard time finding the code that specifically loops over the Attributes. I recall seeing it, but can no longer find it.
>>
>> Also, can we add something like an AttributeTermQuery? Seems like it could work similar to the BoostingTermQuery.
>>
>> I'm sure more will come to me.
>>
>> -Grant
>
> If you are using a CachingTokenFilter, and you do something like pass it to something that hasn't upgraded to the new API (say MemoryIndex#addField(String fieldName, TokenStream stream, float boost)) and you are trying to use the new API, you will get an exception when trying to read the tokens from the CachingTokenFilter a second time - obviously because the old API is cached rather than the new, and when you try and use the new, kak :( . We can obviously fix anything internal, but not external.

Hmm - actually, even if we fix internal, if you are trying to use the old API, you will have the same issue in reverse ;)

--
- Mark
http://www.lucidimagination.com
Re: New Token API was Re: Payloads and TrieRangeQuery
Grant Ingersoll wrote:
> On Jun 14, 2009, at 8:05 PM, Michael Busch wrote:
>> I'd be happy to discuss other API proposals that anybody brings up here, that have the same advantages and are more intuitive. We could also beef up the documentation and give a better example about how to convert a stream/filter from the old to the new API; a constructive suggestion that Uwe made at the ApacheCon.
>
> More questions:
>
> 1. What about Highlighter and MoreLikeThis? They have not been converted. Also, what are they going to do if the attributes they need are not available? Caveat emptor?
>
> 2. Same for TermVectors. What if the user specifies with positions and offsets, but the analyzer doesn't produce them? Caveat emptor? (BTW, this is also true for the new omit TF stuff)
>
> 3. Also, what about the case where one might have attributes that are meant for downstream TokenFilters, but not necessarily for indexing? Offsets and type come to mind. Is it the case now that those attributes are not automatically added to the index? If they are ignored now, what if I want to add them? I admit, I'm having a hard time finding the code that specifically loops over the Attributes. I recall seeing it, but can no longer find it.
>
> Also, can we add something like an AttributeTermQuery? Seems like it could work similar to the BoostingTermQuery.
>
> I'm sure more will come to me.
>
> -Grant

If you are using a CachingTokenFilter, and you do something like pass it to something that hasn't upgraded to the new API (say MemoryIndex#addField(String fieldName, TokenStream stream, float boost)) and you are trying to use the new API, you will get an exception when trying to read the tokens from the CachingTokenFilter a second time - obviously because the old API is cached rather than the new, and when you try and use the new, kak :( . We can obviously fix anything internal, but not external.

--
- Mark
http://www.lucidimagination.com
Re: New Token API was Re: Payloads and TrieRangeQuery
Sounds promising, but I have to think about whether this change has side-effects other than a slowdown for people who create multiple tokens (which would be acceptable, as you said, because that's not recommended anyway and should be rare).

On 6/15/09 1:46 PM, Uwe Schindler wrote:
> Maybe change the deprecation wrapper around next() and next(Token) [the default impl of incrementToken()] to check if the retrieved token is not identical to the attribute, and then just copy the contents to the instance Token? This would be a slowdown, but it would only be the case for the very rare TokenStreams that did not reuse tokens before (and were slow before, too).
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
> From: Michael Busch [mailto:busch...@gmail.com]
> Sent: Monday, June 15, 2009 10:39 PM
> To: java-dev@lucene.apache.org
> Subject: Re: New Token API was Re: Payloads and TrieRangeQuery
>
> I have implemented most of that actually (the interface part and Token implementing all of them).
>
> The problem is a paradigm change with the new API: the assumption is that there is always only one single instance of an Attribute. With the old API, it is recommended to reuse the passed-in token, but you don't have to; you can also return a new one with every call of next(). Now with this change the indexer classes should only know about the interfaces; they shouldn't know Token anymore, which seems fine when Token implements all those interfaces. BUT, since there can be more than one instance of Token, the indexer would have to call getAttribute() for all Attributes it needs after each call of next(). I haven't measured how expensive that is, but it seems like a severe performance hit. That's basically the main reason why the backwards compatibility is ensured in such a goofy way right now.
>
> Michael
>
> On 6/15/09 1:28 PM, Uwe Schindler wrote:
>> And I don't like the *useNewAPI*() methods either. I spent a lot of time thinking about backwards compatibility for this API. It's tricky to do without sacrificing performance. In API patches I find myself spending more time for backwards-compatibility than for the actual new feature! :( I'll try to think about how to simplify this confusing old/new API mix.
>>
>> One solution to fix this useNewAPI problem would be to change the AttributeSource in a way that it returns classes that implement interfaces (as you proposed some weeks ago). The good old Token class then does not need to be deprecated; it could simply implement all 4 interfaces. AttributeSource then must implement a registry of which classes implement which interfaces. So if somebody wants a TermAttribute, he always gets the Token. In all other cases, the default could be a *Impl default class. In this case, next() could simply return this Token (which is all 4 attributes). The reuseableToken is simply thrown away in the deprecated API; the reuseable Token comes from the AttributeSource and is per-instance. Is this an idea?
>>
>> Uwe
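Uwe's registry idea — attributes as interfaces, with Token implementing all of them so that a request for any of its attributes can hand back the one shared Token instance — could look roughly like the sketch below. All names here (`RegistryAttributeSource`, `TermAttr`, `OffsetAttr`) are hypothetical stand-ins for illustration, not the real Lucene API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch of the registry proposal: attributes are interfaces,
// a registry maps each interface to a factory for its default impl, and
// Token implements several interfaces at once so requesting any of them
// can return the same shared Token. NOT the real Lucene implementation.
interface TermAttr { String getTerm(); void setTerm(String t); }
interface OffsetAttr { int getStart(); void setStart(int s); }

class Token implements TermAttr, OffsetAttr {
    private String term = "";
    private int start;
    public String getTerm() { return term; }
    public void setTerm(String t) { term = t; }
    public int getStart() { return start; }
    public void setStart(int s) { start = s; }
}

class RegistryAttributeSource {
    private final Map<Class<?>, Supplier<?>> registry = new HashMap<>();
    private final Map<Class<?>, Object> instances = new HashMap<>();

    public <T> void register(Class<T> iface, Supplier<? extends T> factory) {
        registry.put(iface, factory);
    }

    // Per-source instance: created once via the registered factory,
    // then always the same object is returned for that interface.
    @SuppressWarnings("unchecked")
    public <T> T addAttribute(Class<T> iface) {
        return (T) instances.computeIfAbsent(iface, c -> registry.get(c).get());
    }
}

public class RegistryDemo {
    public static void main(String[] args) {
        RegistryAttributeSource src = new RegistryAttributeSource();
        // Route both attribute interfaces to one shared Token.
        Token shared = new Token();
        src.register(TermAttr.class, () -> shared);
        src.register(OffsetAttr.class, () -> shared);

        TermAttr term = src.addAttribute(TermAttr.class);
        OffsetAttr off = src.addAttribute(OffsetAttr.class);
        term.setTerm("lucene");
        // Same underlying instance: the offset view sees the term too.
        System.out.println(term == off);              // true
        System.out.println(((Token) off).getTerm());  // lucene
    }
}
```

This is the property Uwe is after: a deprecated next() can keep returning the Token, while new-style consumers only ever see the attribute interfaces.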
Re: New Token API was Re: Payloads and TrieRangeQuery
yeah, about 5 seconds in I saw that and decided to stick with what I know :)

On Mon, Jun 15, 2009 at 5:10 PM, Mark Miller wrote:
> I may do the Highlighter. It's annoying though - I'll have to break back compat because Token is part of the public API (Fragmenter, etc).
>
> Robert Muir wrote:
>> Michael OK, I plan on adding some tests for the analyzers that don't have any.
>>
>> I didn't try to migrate things such as highlighter, which are definitely just as important, only because I'm not familiar with that territory.
>>
>> But I think I can figure out what the various language analyzers are trying to do and add tests / convert the remaining ones.
>>
>> On Mon, Jun 15, 2009 at 4:42 PM, Michael Busch wrote:
>>> I agree. It's my fault, the task of changing the contribs (LUCENE-1460) is assigned to me for a while now - I just haven't found the time to do it yet.
>>>
>>> It's great that you started the work on that! I'll try to review the patch in the next couple of days and help with fixing the remaining ones. I'd like to get the AttributeSource improvements patch out first. I'll try that tonight.
>>>
>>> Michael
>>>
>>> On 6/15/09 1:35 PM, Robert Muir wrote:
>>> Michael, again I am terrible with such things myself...
>>>
>>> Personally I am impressed that you have the back compat; even if you don't change any code at all, I think some reformatting of javadocs might make the situation a lot friendlier. I just listed everything that came to my mind immediately.
>>>
>>> I guess I will also mention that one of the reasons I might not use the new API is that since all filters, etc. on the same chain must use the same API, it's discouraging if all the contrib stuff doesn't support the new API; it makes me want to just stick with the old so everything will work. So I think contribs being on the new API is really important, otherwise no one will want to use it.
>>>
>>> On Mon, Jun 15, 2009 at 4:21 PM, Michael Busch wrote:
>>>
>>> This is excellent feedback, Robert!
>>>
>>> I agree this is confusing; especially having a deprecated API and only an experimental one that replaces the old one. We need to change that. And I don't like the *useNewAPI*() methods either. I spent a lot of time thinking about backwards compatibility for this API. It's tricky to do without sacrificing performance. In API patches I find myself spending more time for backwards-compatibility than for the actual new feature! :(
>>>
>>> I'll try to think about how to simplify this confusing old/new API mix.
>>>
>>> However, we need to make the decisions:
>>> a) if we want to release this new API with 2.9,
>>> b) if yes to a), if we want to remove the old API in 3.0?
>>>
>>> If yes to a) and no to b), then we'll have to support both APIs for a presumably very long time, so we then need to have a better solution for the backwards-compatibility here.
>>>
>>> -Michael
>>>
>>> On 6/15/09 1:09 PM, Robert Muir wrote:
>>>
>>> let me try some slightly more constructive feedback:
>>>
>>> new user looks at TokenStream javadocs: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/org/apache/lucene/analysis/TokenStream.html
>>>
>>> immediately they see deprecated, text in red with the words "experimental", warnings in bold - the whole thing is scary! due to the use of 'e.g.' the javadoc for .incrementToken() is cut off in a bad way, and it's probably the most important method to a new user! there's also a stray bold tag gone haywire somewhere, possibly .incrementToken().
>>>
>>> from a technical perspective, the documentation is excellent! but for a new user unfamiliar with lucene, it's unclear exactly what steps to take: use the scary red experimental api or the old deprecated one?
>>>
>>> there's also some fairly advanced stuff such as .captureState and .restoreState that might be better in a different place.
>>>
>>> finally, the .setUseNewAPI() and .setUseNewAPIDefault() are confusing [one is static, one is not], especially because it states all streams and filters in one chain must use the same API - is there a way to simplify this?
>>>
>>> i'm really terrible with javadocs myself, but perhaps we can come up with a way to improve the presentation... maybe that will make the difference.
>>>
>>> On Mon, Jun 15, 2009 at 3:45 PM, Robert Muir wrote:
>>>
>>> Mark, I'll see if I can get tests produced for some of those analyzers.
>>>
>>> as a new user of the new api myself, I think I can safely say the most confusing thing about it is having the old deprecated API mixed in the javadocs with it :)
>>>
>>> On Mon, Jun 15, 2009 at 2:53 PM, Mark Miller wrote:
>>>
>>> Robert Muir wrote:
>>>
>>> Mark, I created an issue for this.
>>>
>>> Thanks Robert, great idea.
>>>
>>> I just think you know, converting an analyzer to the new api is really not that bad.
Some SVN cleanup, was: New Token API was Re: Payloads and TrieRangeQuery
Done, tests pass.

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Michael McCandless [mailto:luc...@mikemccandless.com]
> Sent: Monday, June 15, 2009 10:40 PM
> To: java-dev@lucene.apache.org
> Subject: Re: New Token API was Re: Payloads and TrieRangeQuery
>
> On Mon, Jun 15, 2009 at 4:21 PM, Uwe Schindler wrote:
>> And, in tests: test/o/a/l/index/store is somehow wrong placed. The class inside should be in test/o/a/l/store. Should I move?
>
> Please do!
>
> Mike
Re: New Token API was Re: Payloads and TrieRangeQuery
I may do the Highlighter. It's annoying though - I'll have to break back compat because Token is part of the public API (Fragmenter, etc).

Robert Muir wrote:

Michael OK, I plan on adding some tests for the analyzers that don't have any.

I didn't try to migrate things such as highlighter, which are definitely just as important, only because I'm not familiar with that territory.

But I think I can figure out what the various language analyzers are trying to do and add tests / convert the remaining ones.

On Mon, Jun 15, 2009 at 4:42 PM, Michael Busch wrote:

I agree. It's my fault, the task of changing the contribs (LUCENE-1460) is assigned to me for a while now - I just haven't found the time to do it yet.

It's great that you started the work on that! I'll try to review the patch in the next couple of days and help with fixing the remaining ones. I'd like to get the AttributeSource improvements patch out first. I'll try that tonight.

Michael

On 6/15/09 1:35 PM, Robert Muir wrote:

Michael, again I am terrible with such things myself...

Personally I am impressed that you have the back compat; even if you don't change any code at all, I think some reformatting of javadocs might make the situation a lot friendlier. I just listed everything that came to my mind immediately.

I guess I will also mention that one of the reasons I might not use the new API is that since all filters, etc. on the same chain must use the same API, it's discouraging if all the contrib stuff doesn't support the new API; it makes me want to just stick with the old so everything will work. So I think contribs being on the new API is really important, otherwise no one will want to use it.

On Mon, Jun 15, 2009 at 4:21 PM, Michael Busch wrote:

This is excellent feedback, Robert!

I agree this is confusing; especially having a deprecated API and only an experimental one that replaces the old one. We need to change that. And I don't like the *useNewAPI*() methods either. I spent a lot of time thinking about backwards compatibility for this API. It's tricky to do without sacrificing performance. In API patches I find myself spending more time for backwards-compatibility than for the actual new feature! :(

I'll try to think about how to simplify this confusing old/new API mix.

However, we need to make the decisions:
a) if we want to release this new API with 2.9,
b) if yes to a), if we want to remove the old API in 3.0?

If yes to a) and no to b), then we'll have to support both APIs for a presumably very long time, so we then need to have a better solution for the backwards-compatibility here.

-Michael

On 6/15/09 1:09 PM, Robert Muir wrote:

let me try some slightly more constructive feedback:

new user looks at TokenStream javadocs: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/org/apache/lucene/analysis/TokenStream.html

immediately they see deprecated, text in red with the words "experimental", warnings in bold - the whole thing is scary! due to the use of 'e.g.' the javadoc for .incrementToken() is cut off in a bad way, and it's probably the most important method to a new user! there's also a stray bold tag gone haywire somewhere, possibly .incrementToken().

from a technical perspective, the documentation is excellent! but for a new user unfamiliar with lucene, it's unclear exactly what steps to take: use the scary red experimental api or the old deprecated one?

there's also some fairly advanced stuff such as .captureState and .restoreState that might be better in a different place.

finally, the .setUseNewAPI() and .setUseNewAPIDefault() are confusing [one is static, one is not], especially because it states all streams and filters in one chain must use the same API - is there a way to simplify this?

i'm really terrible with javadocs myself, but perhaps we can come up with a way to improve the presentation... maybe that will make the difference.

On Mon, Jun 15, 2009 at 3:45 PM, Robert Muir wrote:

Mark, I'll see if I can get tests produced for some of those analyzers.

as a new user of the new api myself, I think I can safely say the most confusing thing about it is having the old deprecated API mixed in the javadocs with it :)

On Mon, Jun 15, 2009 at 2:53 PM, Mark Miller wrote:

Robert Muir wrote:

Mark, I created an issue for this.

Thanks Robert, great idea.

I just think you know, converting an analyzer to the new api is really not that bad.

I don't either. I'm really just complaining about the initial readability. Once you know what's up, it's not too much different. I just have found myself having to refigure out what's up (a short task to be sure) over again after I leave it for a while. With the old one, everything was just kind of immediately self evident. That makes me think new users might be a little more confused when they first meet again. I'm not a new user though, so it's only a guess really.

reverse engineering what one of them does is not necessarily obvious, and is completely unrelated but necessary if they are to be migrated.
RE: New Token API was Re: Payloads and TrieRangeQuery
Maybe change the deprecation wrapper around next() and next(Token) [the default impl of incrementToken()] to check if the retrieved token is not identical to the attribute, and then just copy the contents to the instance Token? This would be a slowdown, but it would only be the case for the very rare TokenStreams that did not reuse tokens before (and were slow before, too).

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

From: Michael Busch [mailto:busch...@gmail.com]
Sent: Monday, June 15, 2009 10:39 PM
To: java-dev@lucene.apache.org
Subject: Re: New Token API was Re: Payloads and TrieRangeQuery

I have implemented most of that actually (the interface part and Token implementing all of them).

The problem is a paradigm change with the new API: the assumption is that there is always only one single instance of an Attribute. With the old API, it is recommended to reuse the passed-in token, but you don't have to; you can also return a new one with every call of next(). Now with this change the indexer classes should only know about the interfaces; they shouldn't know Token anymore, which seems fine when Token implements all those interfaces. BUT, since there can be more than one instance of Token, the indexer would have to call getAttribute() for all Attributes it needs after each call of next(). I haven't measured how expensive that is, but it seems like a severe performance hit. That's basically the main reason why the backwards compatibility is ensured in such a goofy way right now.

Michael

On 6/15/09 1:28 PM, Uwe Schindler wrote:
> And I don't like the *useNewAPI*() methods either. I spent a lot of time thinking about backwards compatibility for this API. It's tricky to do without sacrificing performance. In API patches I find myself spending more time for backwards-compatibility than for the actual new feature! :( I'll try to think about how to simplify this confusing old/new API mix.
>
> One solution to fix this useNewAPI problem would be to change the AttributeSource in a way that it returns classes that implement interfaces (as you proposed some weeks ago). The good old Token class then does not need to be deprecated; it could simply implement all 4 interfaces. AttributeSource then must implement a registry of which classes implement which interfaces. So if somebody wants a TermAttribute, he always gets the Token. In all other cases, the default could be a *Impl default class. In this case, next() could simply return this Token (which is all 4 attributes). The reuseableToken is simply thrown away in the deprecated API; the reuseable Token comes from the AttributeSource and is per-instance. Is this an idea?
>
> Uwe
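Uwe's proposed bridge — have the default incrementToken() call the deprecated next(Token) and copy the result into the per-stream instance Token only when the stream returned a different object — might be sketched as below. These are simplified stand-in classes written for illustration, not the actual Lucene code:

```java
// Sketch of the proposed deprecation bridge: incrementToken() delegates to
// the old next(Token); if an old-style stream allocated its own Token
// instead of reusing the one passed in, its contents are copied into the
// single per-stream instance so attribute consumers keep seeing one object.
// Simplified stand-ins, NOT the real Lucene classes.
class Token {
    String term; int start; int end;
    void copyFrom(Token other) {
        term = other.term; start = other.start; end = other.end;
    }
}

abstract class OldStyleStream {
    // The single shared instance the new API would expose as attributes.
    final Token instanceToken = new Token();

    /** @deprecated old API: may return the reusable token or a new one. */
    @Deprecated
    abstract Token next(Token reusable);

    // Default new-API implementation bridging to the old one.
    boolean incrementToken() {
        Token retrieved = next(instanceToken);
        if (retrieved == null) return false;       // end of stream
        if (retrieved != instanceToken) {
            // Slow path (the rare, already-slow case Uwe mentions):
            // the stream ignored the reusable token, so copy it over.
            instanceToken.copyFrom(retrieved);
        }
        return true;
    }
}

public class BridgeDemo {
    public static void main(String[] args) {
        // An old-style stream that ignores the reusable token.
        OldStyleStream s = new OldStyleStream() {
            private int emitted = 0;
            Token next(Token reusable) {
                if (emitted++ > 0) return null;
                Token fresh = new Token();
                fresh.term = "payload"; fresh.start = 0; fresh.end = 7;
                return fresh;
            }
        };
        while (s.incrementToken()) {
            System.out.println(s.instanceToken.term);  // payload
        }
    }
}
```

Streams that already reuse the passed-in token take the fast path (no copy), which is why the slowdown only hits the non-reusing streams.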
Re: New Token API was Re: Payloads and TrieRangeQuery
Michael OK, I plan on adding some tests for the analyzers that don't have any.

I didn't try to migrate things such as highlighter, which are definitely just as important, only because I'm not familiar with that territory.

But I think I can figure out what the various language analyzers are trying to do and add tests / convert the remaining ones.

On Mon, Jun 15, 2009 at 4:42 PM, Michael Busch wrote:
> I agree. It's my fault, the task of changing the contribs (LUCENE-1460) is assigned to me for a while now - I just haven't found the time to do it yet.
>
> It's great that you started the work on that! I'll try to review the patch in the next couple of days and help with fixing the remaining ones. I'd like to get the AttributeSource improvements patch out first. I'll try that tonight.
>
> Michael
>
> On 6/15/09 1:35 PM, Robert Muir wrote:
>
> Michael, again I am terrible with such things myself...
>
> Personally I am impressed that you have the back compat; even if you don't change any code at all, I think some reformatting of javadocs might make the situation a lot friendlier. I just listed everything that came to my mind immediately.
>
> I guess I will also mention that one of the reasons I might not use the new API is that since all filters, etc. on the same chain must use the same API, it's discouraging if all the contrib stuff doesn't support the new API; it makes me want to just stick with the old so everything will work. So I think contribs being on the new API is really important, otherwise no one will want to use it.
>
> On Mon, Jun 15, 2009 at 4:21 PM, Michael Busch wrote:
>
> This is excellent feedback, Robert!
>
> I agree this is confusing; especially having a deprecated API and only an experimental one that replaces the old one. We need to change that. And I don't like the *useNewAPI*() methods either. I spent a lot of time thinking about backwards compatibility for this API. It's tricky to do without sacrificing performance. In API patches I find myself spending more time for backwards-compatibility than for the actual new feature! :(
>
> I'll try to think about how to simplify this confusing old/new API mix.
>
> However, we need to make the decisions:
> a) if we want to release this new API with 2.9,
> b) if yes to a), if we want to remove the old API in 3.0?
>
> If yes to a) and no to b), then we'll have to support both APIs for a presumably very long time, so we then need to have a better solution for the backwards-compatibility here.
>
> -Michael
>
> On 6/15/09 1:09 PM, Robert Muir wrote:
>
> let me try some slightly more constructive feedback:
>
> new user looks at TokenStream javadocs: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/org/apache/lucene/analysis/TokenStream.html
>
> immediately they see deprecated, text in red with the words "experimental", warnings in bold - the whole thing is scary! due to the use of 'e.g.' the javadoc for .incrementToken() is cut off in a bad way, and it's probably the most important method to a new user! there's also a stray bold tag gone haywire somewhere, possibly .incrementToken().
>
> from a technical perspective, the documentation is excellent! but for a new user unfamiliar with lucene, it's unclear exactly what steps to take: use the scary red experimental api or the old deprecated one?
>
> there's also some fairly advanced stuff such as .captureState and .restoreState that might be better in a different place.
>
> finally, the .setUseNewAPI() and .setUseNewAPIDefault() are confusing [one is static, one is not], especially because it states all streams and filters in one chain must use the same API - is there a way to simplify this?
>
> i'm really terrible with javadocs myself, but perhaps we can come up with a way to improve the presentation... maybe that will make the difference.
>
> On Mon, Jun 15, 2009 at 3:45 PM, Robert Muir wrote:
>
> Mark, I'll see if I can get tests produced for some of those analyzers.
>
> as a new user of the new api myself, I think I can safely say the most confusing thing about it is having the old deprecated API mixed in the javadocs with it :)
>
> On Mon, Jun 15, 2009 at 2:53 PM, Mark Miller wrote:
>
> Robert Muir wrote:
>
> Mark, I created an issue for this.
>
> Thanks Robert, great idea.
>
> I just think you know, converting an analyzer to the new api is really not that bad.
>
> I don't either. I'm really just complaining about the initial readability. Once you know what's up, it's not too much different. I just have found myself having to refigure out what's up (a short task to be sure) over again after I leave it for a while. With the old one, everything was just kind of immediately self evident.
>
> That makes me think new users might be a little more confused when they first meet again. I'm not a new user though, so it's only a guess really.
>
> reverse engineering what one of them does is not necessarily obvious, and is completely unrelated but necessary if they are to be migrated.
Re: New Token API was Re: Payloads and TrieRangeQuery
I agree. It's my fault, the task of changing the contribs (LUCENE-1460) is assigned to me for a while now - I just haven't found the time to do it yet. It's great that you started the work on that! I'll try to review the patch in the next couple of days and help with fixing the remaining ones. I'd like to get the AttributeSource improvements patch out first. I'll try that tonight. Michael On 6/15/09 1:35 PM, Robert Muir wrote: Michael, again I am terrible with such things myself... Personally I am impressed that you have the back compat, even if you don't change any code at all I think some reformatting of javadocs might make the situation a lot friendlier. I just listed everything that came to my mind immediately. I guess I will also mention that one of the reasons I might not use the new API is that since all filters, etc on the same chain must use the same API, its discouraging if all the contrib stuff doesn't support the new API, it makes me want to just stick with the old so everything will work. So I think contribs being on the new API is really important otherwise no one will want to use it. On Mon, Jun 15, 2009 at 4:21 PM, Michael Busch wrote: This is excellent feedback, Robert! I agree this is confusing; especially having a deprecated API and only a experimental one that replaces the old one. We need to change that. And I don't like the *useNewAPI*() methods either. I spent a lot of time thinking about backwards compatibility for this API. It's tricky to do without sacrificing performance. In API patches I find myself spending more time for backwards-compatibility than for the actual new feature! :( I'll try to think about how to simplify this confusing old/new API mix. However, we need to make the decisions: a) if we want to release this new API with 2.9, b) if yes to a), if we want to remove the old API in 3.0? 
If yes to a) and no to b), then we'll have to support both APIs for a presumably very long time, so we then need to have a better solution for the backwards-compatibility here. -Michael On 6/15/09 1:09 PM, Robert Muir wrote: let me try some slightly more constructive feedback: new user looks at TokenStream javadocs: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/org/apache/lucene/analysis/TokenStream.html immediately they see deprecated, text in red with the words "experimental", warnings in bold, the whole thing is scary! due to the use of 'e.g.' the javadoc for .incrementToken() is cut off in a bad way, and its probably the most important method to a new user! there's also a stray bold tag gone haywire somewhere, possibly .incrementToken() from a technical perspective, the documentation is excellent! but for a new user unfamiliar with lucene, its unclear exactly what steps to take: use the scary red experimental api or the old deprecated one? theres also some fairly advanced stuff such as .captureState and .restoreState that might be better in a different place. finally, the .setUseNewAPI() and .setUseNewAPIDefault() are confusing [one is static, one is not], especially because it states all streams and filters in one chain must use the same API, is there a way to simplify this? i'm really terrible with javadocs myself, but perhaps we can come up with a way to improve the presentation... maybe that will make the difference. On Mon, Jun 15, 2009 at 3:45 PM, Robert Muir wrote: Mark, I'll see if I can get tests produced for some of those analyzers. as a new user of the new api myself, I think I can safely say the most confusing thing about it is having the old deprecated API mixed in the javadocs with it :) On Mon, Jun 15, 2009 at 2:53 PM, Mark Miller wrote: Robert Muir wrote: Mark, I created an issue for this. Thanks Robert, great idea. I just think you know, converting an analyzer to the new api is really not that bad. I don't either. 
I'm really just complaining about the initial readability. Once you know whats up, its not too much different. I just have found myself having to refigure out whats up (a short task to be sure) over again after I leave it for a while. With the old one, everything was just kind of immediately self evident. That makes me think new users might be a little more confused when they first meet again. I'm not a new user though, so its only a guess really. reverse engineering what one of them does is not necessarily obvious, and is completely unrelated but necessary if they are to be migrated. I'd be willing to assist with some of this but I don't want to really work the issue if its gonna be a waste of time at the end of the day... The chances of this issue being fully reverted are so remote that I really wouldnt let that stop you ... On Mon, Jun 15, 2009 at 1:55 PM, Mark Miller wrote: Robert Muir wrote: As Lucene's contrib hasn't been fully converted either (and its been quite some time now), someone has probably heard that groan before. hope this doesn't sound like a complaint
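[As an aside for readers following this thread: the consumption pattern of the new attribute-based API can be pictured with a minimal sketch. The classes below are hypothetical stand-ins, not the real Lucene types; the point is only the shape of the pattern - a single shared attribute instance is fetched once, before the loop, and `incrementToken()` updates it in place instead of returning a `Token`.]

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for the new-style consumption pattern under
// discussion; these are NOT the real Lucene classes, just a sketch.
public class TokenApiSketch {

    // New-style attribute interface (analogous to TermAttribute).
    interface TermAttribute { String term(); void setTerm(String t); }

    static class Stream {
        private final String[] terms;
        private int i = 0;
        // Single shared attribute instance, created once, reused per token.
        final TermAttribute termAtt = new TermAttribute() {
            private String t;
            public String term() { return t; }
            public void setTerm(String t) { this.t = t; }
        };
        Stream(String... terms) { this.terms = terms; }
        // New API: no Token object is returned; the shared attribute
        // is updated in place for each token.
        boolean incrementToken() {
            if (i == terms.length) return false;
            termAtt.setTerm(terms[i++]);
            return true;
        }
    }

    static List<String> consume(Stream s) {
        List<String> out = new ArrayList<>();
        // The consumer fetches the attribute once, before the loop --
        // this is the invariant that makes the single-instance
        // assumption cheap (no per-token lookups).
        TermAttribute att = s.termAtt;
        while (s.incrementToken()) out.add(att.term());
        return out;
    }

    public static void main(String[] args) {
        System.out.println(consume(new Stream("new", "token", "api")));
        // prints [new, token, api]
    }
}
```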
Re: New Token API was Re: Payloads and TrieRangeQuery
On Mon, Jun 15, 2009 at 4:21 PM, Uwe Schindler wrote: > And, in tests: test/o/a/l/index/store is somehow wrong placed. The class > inside should be in test/o/a/l/store. Should I move? Please do! Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: New Token API was Re: Payloads and TrieRangeQuery
I have implemented most of that actually (the interface part and Token implementing all of them). The problem is a paradigm change with the new API: the assumption is that there is always only one single instance of an Attribute. With the old API, it is recommended to reuse the passed-in token, but you don't have to, you can also return a new one with every call of next(). Now with this change the indexer classes should only know about the interfaces, it shouldn't know Token anymore, which seems fine when Token implements all those interfaces. BUT, since there can be more than one instance of Token, the indexer would have to call getAttribute() for all Attributes it needs after each call of next(). I haven't measured how expensive that is, but it seems like a severe performance hit. That's basically the main reason why the backwards compatibility is ensured in such a goofy way right now. Michael On 6/15/09 1:28 PM, Uwe Schindler wrote: And I don't like the *useNewAPI*() methods either. I spent a lot of time thinking about backwards compatibility for this API. It's tricky to do without sacrificing performance. In API patches I find myself spending more time for backwards-compatibility than for the actual new feature! :( I'll try to think about how to simplify this confusing old/new API mix. One solution to fix this useNewAPI problem would be to change the AttributeSource in a way, to return classes that implement interfaces (as you proposed some weeks ago). The good old Token class then does not need to be deprecated, it could simply implement all 4 interfaces. AttributeSource then must implement a registry, which classes implement which interfaces. So if somebody wants a TermAttribute, he always gets the Token. In all other cases, the default could be a *Impl default class. In this case, next() could simply return this Token (which implements all 4 attributes). 
The reusableToken is simply thrown away in the deprecated API; the reusable Token comes from the AttributeSource and is per-instance. Is this an idea? Uwe
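[The registry idea described above can be sketched roughly as follows. This is a hypothetical illustration, not the actual Lucene implementation: one Token instance implements several attribute interfaces, and the AttributeSource registry hands out that single instance for every interface it implements, so old-API code (which passes a Token around) and new-API code (which asks for attribute interfaces) share the same state.]

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the registry idea -- stand-in classes only.
public class RegistrySketch {

    interface TermAttribute { String term(); void setTerm(String t); }
    interface TypeAttribute { String type(); void setType(String t); }

    // The "good old Token" class, implementing the attribute interfaces directly.
    static class Token implements TermAttribute, TypeAttribute {
        private String term, type;
        public String term() { return term; }
        public void setTerm(String t) { term = t; }
        public String type() { return type; }
        public void setType(String t) { type = t; }
    }

    static class AttributeSource {
        private final Map<Class<?>, Object> registry = new HashMap<>();

        <T> T addAttribute(Class<T> iface) {
            // If a registered instance already implements the requested
            // interface, return it rather than creating a new object.
            for (Object o : registry.values()) {
                if (iface.isInstance(o)) return iface.cast(o);
            }
            // Default here: one Token serves all the interfaces it implements.
            // (A fuller registry would fall back to per-interface *Impl classes.)
            Token token = new Token();
            registry.put(Token.class, token);
            return iface.cast(token);
        }
    }

    public static void main(String[] args) {
        AttributeSource src = new AttributeSource();
        TermAttribute termAtt = src.addAttribute(TermAttribute.class);
        TypeAttribute typeAtt = src.addAttribute(TypeAttribute.class);
        // Both lookups resolve to the same per-AttributeSource Token instance:
        System.out.println(termAtt == (Object) typeAtt); // prints true
    }
}
```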
Re: New Token API was Re: Payloads and TrieRangeQuery
Michael, again I am terrible with such things myself... Personally I am impressed that you have the back compat, even if you don't change any code at all I think some reformatting of javadocs might make the situation a lot friendlier. I just listed everything that came to my mind immediately. I guess I will also mention that one of the reasons I might not use the new API is that since all filters, etc on the same chain must use the same API, its discouraging if all the contrib stuff doesn't support the new API, it makes me want to just stick with the old so everything will work. So I think contribs being on the new API is really important otherwise no one will want to use it. On Mon, Jun 15, 2009 at 4:21 PM, Michael Busch wrote: > This is excellent feedback, Robert! > > I agree this is confusing; especially having a deprecated API and only a > experimental one that replaces the old one. We need to change that. > And I don't like the *useNewAPI*() methods either. I spent a lot of time > thinking about backwards compatibility for this API. It's tricky to do > without sacrificing performance. In API patches I find myself spending more > time for backwards-compatibility than for the actual new feature! :( > > I'll try to think about how to simplify this confusing old/new API mix. > > However, we need to make the decisions: > a) if we want to release this new API with 2.9, > b) if yes to a), if we want to remove the old API in 3.0? > > If yes to a) and no to b), then we'll have to support both APIs for a > presumably very long time, so we then need to have a better solution for the > backwards-compatibility here. 
> > -Michael > > On 6/15/09 1:09 PM, Robert Muir wrote: > > let me try some slightly more constructive feedback: > > new user looks at TokenStream javadocs: > http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/org/apache/lucene/analysis/TokenStream.html > immediately they see deprecated, text in red with the words > "experimental", warnings in bold, the whole thing is scary! > due to the use of 'e.g.' the javadoc for .incrementToken() is cut off > in a bad way, and its probably the most important method to a new > user! > there's also a stray bold tag gone haywire somewhere, possibly > .incrementToken() > > from a technical perspective, the documentation is excellent! but for > a new user unfamiliar with lucene, its unclear exactly what steps to > take: use the scary red experimental api or the old deprecated one? > > theres also some fairly advanced stuff such as .captureState and > .restoreState that might be better in a different place. > > finally, the .setUseNewAPI() and .setUseNewAPIDefault() are confusing > [one is static, one is not], especially because it states all streams > and filters in one chain must use the same API, is there a way to > simplify this? > > i'm really terrible with javadocs myself, but perhaps we can come up > with a way to improve the presentation... maybe that will make the > difference. > > On Mon, Jun 15, 2009 at 3:45 PM, Robert Muir wrote: > > > Mark, I'll see if I can get tests produced for some of those analyzers. > > as a new user of the new api myself, I think I can safely say the most > confusing thing about it is having the old deprecated API mixed in the > javadocs with it :) > > On Mon, Jun 15, 2009 at 2:53 PM, Mark Miller wrote: > > > Robert Muir wrote: > > > Mark, I created an issue for this. > > > > Thanks Robert, great idea. > > > I just think you know, converting an analyzer to the new api is really > not that bad. > > > > I don't either. I'm really just complaining about the initial readability. 
> Once you know whats up, its not too much different. I just have found myself > having to refigure out whats up (a short task to be sure) over again after I > leave it for a while. With the old one, everything was just kind of > immediately self evident. > > That makes me think new users might be a little more confused when they > first meet again. I'm not a new user though, so its only a guess really. > > > reverse engineering what one of them does is not necessarily obvious, > and is completely unrelated but necessary if they are to be migrated. > > I'd be willing to assist with some of this but I don't want to really > work the issue if its gonna be a waste of time at the end of the > day... > > > > The chances of this issue being fully reverted are so remote that I really > wouldnt let that stop you ... > > > On Mon, Jun 15, 2009 at 1:55 PM, Mark Miller wrote: > > > > Robert Muir wrote: > > > > As Lucene's contrib hasn't been fully converted either (and its been > quite > some time now), someone has probably heard that groan before. > > > > > hope this doesn't sound like a complaint, > > > > Complaints are fine in any case. Every now and then, it might cause a > little > rant from me or something, but please don't let that dissuade you :) > Who doesnt like to rant and rave now and then. As long as thoughts and > opinions are coming out in a non negative way (which certainly includes complaints), I think its all good.
RE: New Token API was Re: Payloads and TrieRangeQuery
> And I don't like the *useNewAPI*() methods either. I spent a lot of time > thinking about backwards compatibility for this API. It's tricky to do > without sacrificing performance. In API patches I find myself spending > more time for backwards-compatibility than for the actual new feature! :( > > I'll try to think about how to simplify this confusing old/new API mix. One solution to fix this useNewAPI problem would be to change the AttributeSource in a way, to return classes that implement interfaces (as you proposed some weeks ago). The good old Token class then does not need to be deprecated, it could simply implement all 4 interfaces. AttributeSource then must implement a registry, which classes implement which interfaces. So if somebody wants a TermAttribute, he always gets the Token. In all other cases, the default could be a *Impl default class. In this case, next() could simply return this Token (which implements all 4 attributes). The reusableToken is simply thrown away in the deprecated API; the reusable Token comes from the AttributeSource and is per-instance. Is this an idea? Uwe
RE: New Token API was Re: Payloads and TrieRangeQuery
By the way, there is an empty "de" subdir in SVN inside analysis. Can this be removed? And, in tests: test/o/a/l/index/store is somehow wrongly placed. The class inside should be in test/o/a/l/store. Should I move? - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -Original Message- > From: Uwe Schindler [mailto:u...@thetaphi.de] > Sent: Monday, June 15, 2009 10:18 PM > To: java-dev@lucene.apache.org > Subject: RE: New Token API was Re: Payloads and TrieRangeQuery > > > there's also a stray bold tag gone haywire somewhere, possibly > > .incrementToken() > > I fixed this. This was getting on my nerves the whole day when I wrote > javadocs for NumericTokenStream... > > Uwe
Re: New Token API was Re: Payloads and TrieRangeQuery
This is excellent feedback, Robert! I agree this is confusing; especially having a deprecated API and only a experimental one that replaces the old one. We need to change that. And I don't like the *useNewAPI*() methods either. I spent a lot of time thinking about backwards compatibility for this API. It's tricky to do without sacrificing performance. In API patches I find myself spending more time for backwards-compatibility than for the actual new feature! :( I'll try to think about how to simplify this confusing old/new API mix. However, we need to make the decisions: a) if we want to release this new API with 2.9, b) if yes to a), if we want to remove the old API in 3.0? If yes to a) and no to b), then we'll have to support both APIs for a presumably very long time, so we then need to have a better solution for the backwards-compatibility here. -Michael On 6/15/09 1:09 PM, Robert Muir wrote: let me try some slightly more constructive feedback: new user looks at TokenStream javadocs: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/org/apache/lucene/analysis/TokenStream.html immediately they see deprecated, text in red with the words "experimental", warnings in bold, the whole thing is scary! due to the use of 'e.g.' the javadoc for .incrementToken() is cut off in a bad way, and its probably the most important method to a new user! there's also a stray bold tag gone haywire somewhere, possibly .incrementToken() from a technical perspective, the documentation is excellent! but for a new user unfamiliar with lucene, its unclear exactly what steps to take: use the scary red experimental api or the old deprecated one? theres also some fairly advanced stuff such as .captureState and .restoreState that might be better in a different place. 
finally, the .setUseNewAPI() and .setUseNewAPIDefault() are confusing [one is static, one is not], especially because it states all streams and filters in one chain must use the same API, is there a way to simplify this? i'm really terrible with javadocs myself, but perhaps we can come up with a way to improve the presentation... maybe that will make the difference. On Mon, Jun 15, 2009 at 3:45 PM, Robert Muir wrote: Mark, I'll see if I can get tests produced for some of those analyzers. as a new user of the new api myself, I think I can safely say the most confusing thing about it is having the old deprecated API mixed in the javadocs with it :) On Mon, Jun 15, 2009 at 2:53 PM, Mark Miller wrote: Robert Muir wrote: Mark, I created an issue for this. Thanks Robert, great idea. I just think you know, converting an analyzer to the new api is really not that bad. I don't either. I'm really just complaining about the initial readability. Once you know whats up, its not too much different. I just have found myself having to refigure out whats up (a short task to be sure) over again after I leave it for a while. With the old one, everything was just kind of immediately self evident. That makes me think new users might be a little more confused when they first meet again. I'm not a new user though, so its only a guess really. reverse engineering what one of them does is not necessarily obvious, and is completely unrelated but necessary if they are to be migrated. I'd be willing to assist with some of this but I don't want to really work the issue if its gonna be a waste of time at the end of the day... The chances of this issue being fully reverted are so remote that I really wouldnt let that stop you ... On Mon, Jun 15, 2009 at 1:55 PM, Mark Miller wrote: Robert Muir wrote: As Lucene's contrib hasn't been fully converted either (and its been quite some time now), someone has probably heard that groan before. 
hope this doesn't sound like a complaint, Complaints are fine in any case. Every now and then, it might cause a little rant from me or something, but please don't let that dissuade you :) Who doesnt like to rant and rave now and then. As long as thoughts and opinions are coming out in a non negative way (which certainly includes complaints), I think its all good. but in my opinion this is because many do not have any tests. I converted a few of these and its just grunt work but if there are no tests, its impossible to verify the conversion is correct. Thanks for pointing that out. We probably get lazy with tests, especially in contrib, and this brings up a good point - we should probably push for tests or write them before committing more often. Sometimes I'm sure it just comes down to a tradeoff though - no resources at the time, the class looked clear cut, and it was just contrib anyway. But then here we are ... a healthy dose of grunt work is bad enough when you have tests to check it. -- - Mark http://www.lucidimagination.com
Re: New Token API was Re: Payloads and TrieRangeQuery
Some great points - especially the decision between a deprecated API, and a new experimental one subject to change. Bit of a rock and a hard place for a new user. Perhaps we should add a little note with some guidance. - Mark Robert Muir wrote: let me try some slightly more constructive feedback: new user looks at TokenStream javadocs: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/org/apache/lucene/analysis/TokenStream.html immediately they see deprecated, text in red with the words "experimental", warnings in bold, the whole thing is scary! due to the use of 'e.g.' the javadoc for .incrementToken() is cut off in a bad way, and its probably the most important method to a new user! there's also a stray bold tag gone haywire somewhere, possibly .incrementToken() from a technical perspective, the documentation is excellent! but for a new user unfamiliar with lucene, its unclear exactly what steps to take: use the scary red experimental api or the old deprecated one? theres also some fairly advanced stuff such as .captureState and .restoreState that might be better in a different place. finally, the .setUseNewAPI() and .setUseNewAPIDefault() are confusing [one is static, one is not], especially because it states all streams and filters in one chain must use the same API, is there a way to simplify this? i'm really terrible with javadocs myself, but perhaps we can come up with a way to improve the presentation... maybe that will make the difference. On Mon, Jun 15, 2009 at 3:45 PM, Robert Muir wrote: Mark, I'll see if I can get tests produced for some of those analyzers. as a new user of the new api myself, I think I can safely say the most confusing thing about it is having the old deprecated API mixed in the javadocs with it :) On Mon, Jun 15, 2009 at 2:53 PM, Mark Miller wrote: Robert Muir wrote: Mark, I created an issue for this. Thanks Robert, great idea. I just think you know, converting an analyzer to the new api is really not that bad. 
I don't either. I'm really just complaining about the initial readability. Once you know whats up, its not too much different. I just have found myself having to refigure out whats up (a short task to be sure) over again after I leave it for a while. With the old one, everything was just kind of immediately self evident. That makes me think new users might be a little more confused when they first meet again. I'm not a new user though, so its only a guess really. reverse engineering what one of them does is not necessarily obvious, and is completely unrelated but necessary if they are to be migrated. I'd be willing to assist with some of this but I don't want to really work the issue if its gonna be a waste of time at the end of the day... The chances of this issue being fully reverted are so remote that I really wouldnt let that stop you ... On Mon, Jun 15, 2009 at 1:55 PM, Mark Miller wrote: Robert Muir wrote: As Lucene's contrib hasn't been fully converted either (and its been quite some time now), someone has probably heard that groan before. hope this doesn't sound like a complaint, Complaints are fine in any case. Every now and then, it might cause a little rant from me or something, but please don't let that dissuade you :) Who doesnt like to rant and rave now and then. As long as thoughts and opinions are coming out in a non negative way (which certainly includes complaints), I think its all good. but in my opinion this is because many do not have any tests. I converted a few of these and its just grunt work but if there are no tests, its impossible to verify the conversion is correct. Thanks for pointing that out. We probably get lazy with tests, especially in contrib, and this brings up a good point - we should probably push for tests or write them before committing more often. Sometimes I'm sure it just comes downto a tradeoff though - no resources at the time, the class looked clear cut, and it was just contrib anyway. But then here we are ... 
a healthy dose of grunt work is bad enough when you have tests to check it. -- - Mark http://www.lucidimagination.com
RE: New Token API was Re: Payloads and TrieRangeQuery
> there's also a stray bold tag gone haywire somewhere, possibly > .incrementToken() I fixed this. This was getting on my nerves the whole day when I wrote javadocs for NumericTokenStream... Uwe
Re: New Token API was Re: Payloads and TrieRangeQuery
let me try some slightly more constructive feedback: new user looks at TokenStream javadocs: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/org/apache/lucene/analysis/TokenStream.html immediately they see deprecated, text in red with the words "experimental", warnings in bold, the whole thing is scary! due to the use of 'e.g.' the javadoc for .incrementToken() is cut off in a bad way, and its probably the most important method to a new user! there's also a stray bold tag gone haywire somewhere, possibly .incrementToken() from a technical perspective, the documentation is excellent! but for a new user unfamiliar with lucene, its unclear exactly what steps to take: use the scary red experimental api or the old deprecated one? theres also some fairly advanced stuff such as .captureState and .restoreState that might be better in a different place. finally, the .setUseNewAPI() and .setUseNewAPIDefault() are confusing [one is static, one is not], especially because it states all streams and filters in one chain must use the same API, is there a way to simplify this? i'm really terrible with javadocs myself, but perhaps we can come up with a way to improve the presentation... maybe that will make the difference. On Mon, Jun 15, 2009 at 3:45 PM, Robert Muir wrote: > Mark, I'll see if I can get tests produced for some of those analyzers. > > as a new user of the new api myself, I think I can safely say the most > confusing thing about it is having the old deprecated API mixed in the > javadocs with it :) > > On Mon, Jun 15, 2009 at 2:53 PM, Mark Miller wrote: >> Robert Muir wrote: >>> >>> Mark, I created an issue for this. >>> >> >> Thanks Robert, great idea. >>> >>> I just think you know, converting an analyzer to the new api is really >>> not that bad. >>> >> >> I don't either. I'm really just complaining about the initial readability. >> Once you know whats up, its not too much different. 
I just have found myself >> having to refigure out whats up (a short task to be sure) over again after I >> leave it for a while. With the old one, everything was just kind of >> immediately self evident. >> >> That makes me think new users might be a little more confused when they >> first meet again. I'm not a new user though, so its only a guess really. >>> >>> reverse engineering what one of them does is not necessarily obvious, >>> and is completely unrelated but necessary if they are to be migrated. >>> >>> I'd be willing to assist with some of this but I don't want to really >>> work the issue if its gonna be a waste of time at the end of the >>> day... >>> >> >> The chances of this issue being fully reverted are so remote that I really >> wouldnt let that stop you ... >>> >>> On Mon, Jun 15, 2009 at 1:55 PM, Mark Miller wrote: >>> Robert Muir wrote: >> >> As Lucene's contrib hasn't been fully converted either (and its been >> quite >> some time now), someone has probably heard that groan before. >> >> > > hope this doesn't sound like a complaint, > Complaints are fine in any case. Every now and then, it might cause a little rant from me or something, but please don't let that dissuade you :) Who doesnt like to rant and rave now and then. As long as thoughts and opinions are coming out in a non negative way (which certainly includes complaints), I think its all good. > > but in my opinion this is > because many do not have any tests. > I converted a few of these and its just grunt work but if there are no > tests, its impossible to verify the conversion is correct. > > Thanks for pointing that out. We probably get lazy with tests, especially in contrib, and this brings up a good point - we should probably push for tests or write them before committing more often. Sometimes I'm sure it just comes downto a tradeoff though - no resources at the time, the class looked clear cut, and it was just contrib anyway. But then here we are ... 
a healthy dose of grunt work is bad enough when you have tests to check it. -- - Mark http://www.lucidimagination.com -- Robert Muir rcm...@gmail.com
Re: New Token API was Re: Payloads and TrieRangeQuery
Mark, I'll see if I can get tests produced for some of those analyzers. as a new user of the new api myself, I think I can safely say the most confusing thing about it is having the old deprecated API mixed in the javadocs with it :) On Mon, Jun 15, 2009 at 2:53 PM, Mark Miller wrote: > Robert Muir wrote: >> >> Mark, I created an issue for this. >> > > Thanks Robert, great idea. >> >> I just think you know, converting an analyzer to the new api is really >> not that bad. >> > > I don't either. I'm really just complaining about the initial readability. > Once you know whats up, its not too much different. I just have found myself > having to refigure out whats up (a short task to be sure) over again after I > leave it for a while. With the old one, everything was just kind of > immediately self evident. > > That makes me think new users might be a little more confused when they > first meet again. I'm not a new user though, so its only a guess really. >> >> reverse engineering what one of them does is not necessarily obvious, >> and is completely unrelated but necessary if they are to be migrated. >> >> I'd be willing to assist with some of this but I don't want to really >> work the issue if its gonna be a waste of time at the end of the >> day... >> > > The chances of this issue being fully reverted are so remote that I really > wouldnt let that stop you ... >> >> On Mon, Jun 15, 2009 at 1:55 PM, Mark Miller wrote: >> >>> >>> Robert Muir wrote: >>> > > As Lucene's contrib hasn't been fully converted either (and its been > quite > some time now), someone has probably heard that groan before. > > hope this doesn't sound like a complaint, >>> >>> Complaints are fine in any case. Every now and then, it might cause a >>> little >>> rant from me or something, but please don't let that dissuade you :) >>> Who doesnt like to rant and rave now and then. 
As long as thoughts and >>> opinions are coming out in a non negative way (which certainly includes >>> complaints), >>> I think its all good. >>> but in my opinion this is because many do not have any tests. I converted a few of these and its just grunt work but if there are no tests, its impossible to verify the conversion is correct. >>> >>> Thanks for pointing that out. We probably get lazy with tests, especially >>> in >>> contrib, and this brings up a good point - we should probably push >>> for tests or write them before committing more often. Sometimes I'm sure >>> it >>> just comes downto a tradeoff though - no resources at the time, >>> the class looked clear cut, and it was just contrib anyway. But then here >>> we >>> are ... a healthy dose of grunt work is bad enough when you have tests to >>> check it. >>> >>> -- >>> - Mark >>> >>> http://www.lucidimagination.com >>> >>> >>> >>> >>> - >>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org >>> For additional commands, e-mail: java-dev-h...@lucene.apache.org >>> >>> >>> >> >> >> >> > > > -- > - Mark > > http://www.lucidimagination.com > > > > > - > To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-dev-h...@lucene.apache.org > > -- Robert Muir rcm...@gmail.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
RE: New Token API was Re: Payloads and TrieRangeQuery
> If you understood that, you'd be able to look > at the actual token value if you were interested in what shift was > used. So it's redundant, has a runtime cost, it's not currently used > anywhere, and it's not useful to fields other than Trie. Perhaps it > shouldn't exist (yet)? You are right, you could also decode the shift value from the first char of the token... I think I will remove the ShiftAttribute and only use the token type to distinguish the highest precision from the lower precisions. By this, one could easily add a payload to the real numeric value using a TokenFilter. Uwe
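[The point about the shift being recoverable from the token itself can be pictured with a small sketch. The constants and layout here are a simplified assumption for illustration, not the exact Lucene prefix-coding: the first char of the token carries the shift as an offset from a base char, so a consumer who understands the encoding can read the shift straight off the term without any extra attribute.]

```java
// Illustrative sketch of why a separate ShiftAttribute is redundant when
// the shift is already prefix-coded into the token itself.
public class ShiftDecodeSketch {
    static final char SHIFT_START = (char) 0x20; // assumed base char for the shift prefix

    // Encode: first char carries the shift; the remaining chars carry the
    // shifted value, 7 bits per char (layout chosen just for illustration).
    static String encode(long value, int shift) {
        StringBuilder sb = new StringBuilder();
        sb.append((char) (SHIFT_START + shift));
        long v = value >>> shift;
        for (int i = 8; i >= 0; i--) {
            sb.append((char) ('a' + ((v >>> (7 * i)) & 0x7f)));
        }
        return sb.toString();
    }

    // A consumer interested in the shift can recover it from the token alone:
    static int decodeShift(String token) {
        return token.charAt(0) - SHIFT_START;
    }

    public static void main(String[] args) {
        String token = encode(1234567L, 8);
        System.out.println(decodeShift(token)); // prints 8
    }
}
```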
RE: New Token API was Re: Payloads and TrieRangeQuery
> On Mon, Jun 15, 2009 at 3:00 PM, Uwe Schindler wrote: > > There is a new Attribute called ShiftAttribute (or > NumericShiftAttribute), > > when trie range is moved to core. This attribute contains the > > shifted-away bits from the prefix encoded value during trie indexing. > > I was wondering about this: > To make use of ShiftAttribute, you need to understand the trie > encoding scheme itself. If you understood that, you'd be able to look > at the actual token value if you were interested in what shift was > used. So it's redundant, has a runtime cost, it's not currently used > anywhere, and it's not useful to fields other than Trie. Perhaps it > shouldn't exist (yet)? The idea was to make the indexing process controllable. You were the one who asked e.g. for the possibility to add payloads to trie fields and so on. Using the shift attribute, you have full control of the token types. OK, it's a little bit redundant; you could also use the TypeAttribute (which is already used to mark highest precision and lower precision values). One question about the whole TokenStream: In the original case we discussed about Payloads/Position and TrieRange. If this would be implemented in future versions, the question is, how should I set the PositionIncrement/Offsets in the token stream to create a Position of 0 in the index? I do not understand the indexing process here, especially this deprecated boolean flag about something negative (not sure what the name was). Should I set PositionIncrement to 0 for all Trie fields by default? How about PositionIncrementGap, when indexing more than one field? None of this is really clear. The position would be simpler to implement, but doing this with an attribute that is indexed together with the other attributes like a payload would be the most ideal solution for future versions of TrieRange. 
(Maybe we could also use the Offset attribute for the highest precision bits) Uwe
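[For readers unfamiliar with position increments, the "Position of 0" question above can be pictured with a small sketch. The assumption here, kept deliberately minimal, is that the indexer derives each token's absolute position by summing increments: the first (highest-precision) token of a numeric value gets increment 1, and every lower-precision token gets increment 0, so all precisions stack at a single position.]

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of position-increment arithmetic; Tok is a stand-in
// for a token with a position increment, not a real Lucene class.
public class PosIncrSketch {

    record Tok(String term, int posIncr) {}

    // The indexer derives absolute positions by summing increments.
    static List<Integer> absolutePositions(List<Tok> toks) {
        List<Integer> out = new ArrayList<>();
        int pos = -1; // so the first increment of 1 lands on position 0
        for (Tok t : toks) {
            pos += t.posIncr();
            out.add(pos);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tok> trie = List.of(
                new Tok("full-precision", 1),
                new Tok("shift-8", 0),   // lower precisions stack on the same position
                new Tok("shift-16", 0));
        System.out.println(absolutePositions(trie)); // prints [0, 0, 0]
    }
}
```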
Re: New Token API was Re: Payloads and TrieRangeQuery
On Mon, Jun 15, 2009 at 3:00 PM, Uwe Schindler wrote: > There is a new Attribute called ShiftAttribute (or NumericShiftAttribute), > when trie range is moved to core. This attribute contains the shifted-away > bits from the prefix encoded value during trie indexing. I was wondering about this. To make use of ShiftAttribute, you need to understand the trie encoding scheme itself. If you understood that, you'd be able to look at the actual token value if you were interested in what shift was used. So it's redundant, has a runtime cost, it's not currently used anywhere, and it's not useful to fields other than Trie. Perhaps it shouldn't exist (yet)? -Yonik http://www.lucidimagination.com
RE: New Token API was Re: Payloads and TrieRangeQuery
> Also, what about the case where one might have attributes that are meant > for downstream TokenFilters, but not necessarily for indexing? Offsets > and type come to mind. Is it the case now that those attributes are not > automatically added to the index? If they are ignored now, what if I > want to add them? I admit, I'm having a hard time finding the code that > specifically loops over the Attributes. I recall seeing it, but can no > longer find it. There is a new Attribute called ShiftAttribute (or NumericShiftAttribute), when trie range is moved to core. This attribute contains the shifted-away bits from the prefix encoded value during trie indexing. The idea is to e.g. have TokenFilters that may add additional payloads or other attributes to trie values, but only do this for specific precisions. In the future, it may also be interesting to automatically add this attribute to the index. Maybe we should add read/store methods to attributes that add an attribute to the posting using an IndexOutput/IndexInput (like the serialization methods). Uwe
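The filter Uwe describes can be sketched with minimal stand-in types. The real Lucene classes (AttributeSource, PayloadAttribute, and the proposed ShiftAttribute, which may or may not ship) differ in detail; the point is that a downstream filter keys its behavior off an attribute rather than parsing the token text to recover the shift.

```java
public class ShiftFilterSketch {
    /** Stand-in token carrying stand-ins for two attributes. */
    static class Token {
        final String term;
        final int shift;     // stand-in for the proposed ShiftAttribute
        byte[] payload;      // stand-in for PayloadAttribute
        Token(String term, int shift) { this.term = term; this.shift = shift; }
    }

    /** Attach a payload only to full-precision (shift == 0) trie terms. */
    static void markFullPrecision(Token t, byte[] payload) {
        if (t.shift == 0) {
            t.payload = payload;
        }
    }

    public static void main(String[] args) {
        Token full = new Token("abc", 0);    // full precision
        Token coarse = new Token("ab", 4);   // 4 bits shifted away
        byte[] p = new byte[] {42};
        markFullPrecision(full, p);
        markFullPrecision(coarse, p);
        System.out.println((full.payload != null) + " " + (coarse.payload == null));
    }
}
```

Yonik's objection still applies: the same decision could be made from the TypeAttribute or by decoding the term itself, so the dedicated attribute buys convenience, not capability.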
Re: New Token API was Re: Payloads and TrieRangeQuery
Robert Muir wrote: Mark, I created an issue for this. Thanks Robert, great idea. I just think you know, converting an analyzer to the new api is really not that bad. I don't either. I'm really just complaining about the initial readability. Once you know what's up, it's not too much different. I just have found myself having to refigure out what's up (a short task to be sure) over again after I leave it for a while. With the old one, everything was just kind of immediately self evident. That makes me think new users might be a little more confused when they first meet it. I'm not a new user though, so it's only a guess really. reverse engineering what one of them does is not necessarily obvious, and is completely unrelated but necessary if they are to be migrated. I'd be willing to assist with some of this but I don't want to really work the issue if it's gonna be a waste of time at the end of the day... The chances of this issue being fully reverted are so remote that I really wouldn't let that stop you ... On Mon, Jun 15, 2009 at 1:55 PM, Mark Miller wrote: Robert Muir wrote: As Lucene's contrib hasn't been fully converted either (and it's been quite some time now), someone has probably heard that groan before. hope this doesn't sound like a complaint, Complaints are fine in any case. Every now and then, it might cause a little rant from me or something, but please don't let that dissuade you :) Who doesn't like to rant and rave now and then. As long as thoughts and opinions are coming out in a non-negative way (which certainly includes complaints), I think it's all good. but in my opinion this is because many do not have any tests. I converted a few of these and it's just grunt work but if there are no tests, it's impossible to verify the conversion is correct. Thanks for pointing that out. We probably get lazy with tests, especially in contrib, and this brings up a good point - we should probably push for tests or write them before committing more often.
Sometimes I'm sure it just comes down to a tradeoff though - no resources at the time, the class looked clear cut, and it was just contrib anyway. But then here we are ... a healthy dose of grunt work is bad enough when you have tests to check it. -- - Mark http://www.lucidimagination.com
Re: New Token API was Re: Payloads and TrieRangeQuery
Mark, I created an issue for this. I just think you know, converting an analyzer to the new api is really not that bad. reverse engineering what one of them does is not necessarily obvious, and is completely unrelated but necessary if they are to be migrated. I'd be willing to assist with some of this but I don't want to really work the issue if its gonna be a waste of time at the end of the day... On Mon, Jun 15, 2009 at 1:55 PM, Mark Miller wrote: > Robert Muir wrote: >>> >>> As Lucene's contrib hasn't been fully converted either (and its been >>> quite >>> some time now), someone has probably heard that groan before. >>> >> >> hope this doesn't sound like a complaint, > > Complaints are fine in any case. Every now and then, it might cause a little > rant from me or something, but please don't let that dissuade you :) > Who doesnt like to rant and rave now and then. As long as thoughts and > opinions are coming out in a non negative way (which certainly includes > complaints), > I think its all good. >> >> but in my opinion this is >> because many do not have any tests. >> I converted a few of these and its just grunt work but if there are no >> tests, its impossible to verify the conversion is correct. >> > > Thanks for pointing that out. We probably get lazy with tests, especially in > contrib, and this brings up a good point - we should probably push > for tests or write them before committing more often. Sometimes I'm sure it > just comes downto a tradeoff though - no resources at the time, > the class looked clear cut, and it was just contrib anyway. But then here we > are ... a healthy dose of grunt work is bad enough when you have tests to > check it. 
-- Robert Muir rcm...@gmail.com
Re: New Token API was Re: Payloads and TrieRangeQuery
Robert Muir wrote: As Lucene's contrib hasn't been fully converted either (and it's been quite some time now), someone has probably heard that groan before. hope this doesn't sound like a complaint, Complaints are fine in any case. Every now and then, it might cause a little rant from me or something, but please don't let that dissuade you :) Who doesn't like to rant and rave now and then. As long as thoughts and opinions are coming out in a non-negative way (which certainly includes complaints), I think it's all good. but in my opinion this is because many do not have any tests. I converted a few of these and it's just grunt work but if there are no tests, it's impossible to verify the conversion is correct. Thanks for pointing that out. We probably get lazy with tests, especially in contrib, and this brings up a good point - we should probably push for tests or write them before committing more often. Sometimes I'm sure it just comes down to a tradeoff though - no resources at the time, the class looked clear cut, and it was just contrib anyway. But then here we are ... a healthy dose of grunt work is bad enough when you have tests to check it. -- - Mark http://www.lucidimagination.com
Re: New Token API was Re: Payloads and TrieRangeQuery
On Jun 14, 2009, at 8:05 PM, Michael Busch wrote: I'd be happy to discuss other API proposals that anybody brings up here, that have the same advantages and are more intuitive. We could also beef up the documentation and give a better example about how to convert a stream/filter from the old to the new API; a constructive suggestion that Uwe made at the ApacheCon. More questions: 1. What about Highlighter and MoreLikeThis? They have not been converted. Also, what are they going to do if the attributes they need are not available? Caveat emptor? 2. Same for TermVectors. What if the user specifies with positions and offsets, but the analyzer doesn't produce them? Caveat emptor? (BTW, this is also true for the new omit TF stuff) 3. Also, what about the case where one might have attributes that are meant for downstream TokenFilters, but not necessarily for indexing? Offsets and type come to mind. Is it the case now that those attributes are not automatically added to the index? If they are ignored now, what if I want to add them? I admit, I'm having a hard time finding the code that specifically loops over the Attributes. I recall seeing it, but can no longer find it. Also, can we add something like an AttributeTermQuery? Seems like it could work similar to the BoostingTermQuery. I'm sure more will come to me. -Grant
Re: New Token API was Re: Payloads and TrieRangeQuery
> > As Lucene's contrib hasn't been fully converted either (and it's been quite > some time now), someone has probably heard that groan before. hope this doesn't sound like a complaint, but in my opinion this is because many do not have any tests. I converted a few of these and it's just grunt work but if there are no tests, it's impossible to verify the conversion is correct. -- Robert Muir rcm...@gmail.com
Re: New Token API was Re: Payloads and TrieRangeQuery
Yonik Seeley wrote: The high-level description of the new API looks good (being able to add arbitrary properties to tokens), unfortunately, I've never had the time to try and use it and give any constructive feedback. As far as difficulty of use, I assume this only applies to implementing your own TokenFilter? It seems like most standard users would be just stringing together existing TokenFilters to create custom Analyzers? -Yonik http://www.lucidimagination.com True - it's the implementation. And just trying to understand what's going on the first time you see it. It's not particularly difficult, but it's also not obvious like the previous API was. As a user, I would ask why that is so, and frankly the answer wouldn't do much for me (as a user). I don't know if most 'standard' users implement their own or not. I will say, and perhaps I was in a special situation, I was writing them and modifying them almost as soon as I started playing with Lucene. And even when I wasn't, I needed to understand the code to understand some of the complexities that could occur, and thankfully, that was breezy to do. Right now, if you told me to go convert all of Solr to the new API you would hear a mighty groan. As Lucene's contrib hasn't been fully converted either (and it's been quite some time now), someone has probably heard that groan before. -- - Mark http://www.lucidimagination.com
Re: New Token API was Re: Payloads and TrieRangeQuery
The high-level description of the new API looks good (being able to add arbitrary properties to tokens), unfortunately, I've never had the time to try and use it and give any constructive feedback. As far as difficulty of use, I assume this only applies to implementing your own TokenFilter? It seems like most standard users would be just stringing together existing TokenFilters to create custom Analyzers? -Yonik http://www.lucidimagination.com
Re: New Token API was Re: Payloads and TrieRangeQuery
On Jun 14, 2009, at 8:05 PM, Michael Busch wrote: I'm not sure why this (currently having to implement next() too) is such an issue for you. You brought it up at the Lucene meetup too. No user will ever have to implement both (the new API and the old) in their streams/filters. The only reason why we did it this way is to not sacrifice performance for existing streams/filters when people switch to Lucene 2.9. I explained this point in the jira issue: http://issues.apache.org/jira/browse/LUCENE-1422?focusedCommentId=12644881&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12644881 The only time when we'll ever have to implement both APIs is between now and 2.9, only for new streams and filters that we add before 2.9 is released. I don't think it'd be reasonable to consider this disadvantage as a show stopper. It's an issue b/c I don't like writing dead code and who knows when 2.9 will actually be out. I don't think it is a show stopper either. Add on top of it, that the whole point of customizing the chain is to use it in search and, frankly speaking, somehow I think that part of the patch was held back. I'm not sure what you're implying. Could you elaborate? Sorry, see my response to Michael M. on this. I didn't mean to imply you were doing something malicious, just that it always felt half done to me. Knowing you, you don't strike me as someone who does things half way, so that's why I felt it was held back. But, as Michael M reminded me, it is complex, so please accept my apologies. The search side of the API is currently being developed in Lucene-1458. 1458 will not make it into 2.9. Therefore I agree that it is not very advantageous to switch to the new API right now for Lucene users. On the other hand, I don't think it hurts either. I am not sure I agree here. Forcing people to upgrade their analyzers can be quite involved. Analyzers are one of the main areas that people do custom work. 
Solr, for instance, has 11 custom TokenFilters right now as well as custom Tokenizers, not to mention the ones used during testing that aren't shipped. Upgrading these is a lot of work. I know in previous jobs, I also maintained a fair amount of TokenStream-related stuff. This should not be underestimated. Furthermore, as I said back in the initial discussion, Lucene's Analyzer stuff is often used outside of Lucene. In fact, I often think the Analysis piece should be a standalone jar (not requiring core) and that core should have a dependency on it. In other words, move o.a.l.analysis (and contrib/analysis) to a standalone module that core depends on. This would make it easier for others to consume the Analysis functionality. I personally would vote for reverting until a complete patch that addresses both sides of the problem is submitted and a better solution to cloning is put forth. If we revert now and put a new flexible API like this into 3.x, which I think is necessary to utilize flexible indexing, then we'll have to wait until 4.0 before we can remove the old API. Disadvantages like the one you mentioned above will then probably be present much longer. I mentioned in the following thread that I have started working on a better way of cloning, which will actually be faster compared to the old API. I'll try to get the code out asap. http://markmail.org/message/q7pgh2qlm2w7cxfx I'd be happy to discuss other API proposals that anybody brings up here, that have the same advantages and are more intuitive. We could also beef up the documentation and give a better example about how to convert a stream/filter from the old to the new API; a constructive suggestion that Uwe made at the ApacheCon. My point here was, at the time, that if others wanted to revert, I probably would vote for it. I'm not proposing we do it, as I think we can make do with what we have. Given the discussion here, I would probably change my mind and not support it now.
I think it might be helpful to have some help for people upgrading. Perhaps an abstract class that provides the "core" Token attributes out of the box as a base class that they can then extend? That being said, forcing people to upgrade could at least help them think about the fact that they have no use for the Type attribute or the Offsets attributes. And, testing the cloning stuff would help. I think the current approach underestimates the number of people who need to buffer tokens in memory before handing them out. Sure, it's not as many as the main use case, but it's not zero either.
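The buffering concern Grant raises can be sketched with stand-in types (not Lucene's real AttributeSource/State classes). Under the new API a filter shares one set of mutable attribute instances with its input, so buffering a token means copying its state aside, roughly what Lucene's captureState/restoreState do, rather than just holding a reference:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class BufferingSketch {
    // Shared, mutable "attribute", as in the new API where one attribute
    // instance is reused for every token in the stream.
    String termAtt;

    private final Deque<String> buffer = new ArrayDeque<>();

    /** Copy the current attribute state aside (the captureState idea). */
    void capture() { buffer.add(termAtt); }

    /** Overwrite the shared attribute with the next buffered state. */
    String restoreNext() {
        termAtt = buffer.poll();
        return termAtt;
    }

    public static void main(String[] args) {
        BufferingSketch s = new BufferingSketch();
        s.termAtt = "foo"; s.capture();
        s.termAtt = "bar"; s.capture();  // without capture(), "foo" would be overwritten
        System.out.println(s.restoreNext() + " " + s.restoreNext());  // foo bar
    }
}
```

The cost question in the thread is exactly how cheap that copy can be made for filters that buffer every token.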
Re: New Token API was Re: Payloads and TrieRangeQuery
On Jun 15, 2009, at 12:19 PM, Michael McCandless wrote: I don't think anything was "held back" in this effort. Grant, are you referring to LUCENE-1458? That's "held back" simply because the only person working on it (me) got distracted by other things to work on. I'm sorry, I didn't mean to imply Michael B. was holding back on the work. The patch has always felt half done to me because what's the point of having all of these attributes in the index if you don't have any way of searching them, thus I was struck by the need to get it in prior to making it available in search. I realize it's complex, but here we are forcing people to upgrade for some future, long-term goal.
Re: New Token API was Re: Payloads and TrieRangeQuery
I thought the primary goal of switching to AttributeSource (yes, the name is very generic...) was to allow extensibility to what's created per-Token, so that an app could add their own attrs without costly subclassing/casting per Token, independent of other "things" adding their tokens, etc. EG, trie* takes advantage of this extensibility by adding a ShiftAttribute. Subclassing Token in your app wasn't a good solution for various reasons. I do think the API is somewhat more cumbersome than before, and I don't like that about it (consumability!). But net/net I think the change is good, and it's one of the baby steps for flexible indexing (bullet #11): http://wiki.apache.org/jakarta-lucene/Lucene2Whiteboard I.e. it addresses the flexibility during analysis. I don't think anything was "held back" in this effort. Grant, are you referring to LUCENE-1458? That's "held back" simply because the only person working on it (me) got distracted by other things to work on. Flexible indexing (all of bullet #11) is a complex project, and we need to break it into baby steps like this one. We've already made good progress on it: you can already make custom attrs and a custom (but, package private) indexing chain if you want. Next step is pluggable codecs for writing index files (LUCENE-1458), and APIs for reading them (that generalize Terms/TermDoc/TermPositions we have today). Mike On Sun, Jun 14, 2009 at 11:41 PM, Shai Erera wrote: > The "old" API is deprecated, and therefore when we release 2.9 there might > be some people who'd think they should move away from it, to better prepare > for 3.0 (while in fact this many not be the case). Also, we should make sure > that when we remove all the deprecations, this will still exist (and > therefore, why deprecate it now?), if we think this should indeed be kept > around for at least a while longer. 
> > I personally am all for keeping it around (it will save me a huge > refactoring of an Analyzer package I wrote), but I have to admit it's only > because I've got quite comfortable with the existing API, and did not have > the time to try the new one yet. > > Shai > > On Mon, Jun 15, 2009 at 3:49 AM, Mark Miller wrote: >> >> Mark Miller wrote: >>> >>> I don't know how I feel about rolling the new token api back. >>> >>> I will say that I originally had no issue with it because I am very >>> excited about Lucene-1458. >>> >>> At the same time though, I'm thinking Lucene-1458 is a very advanced >>> issue that will likely be for really expert usage (though I can see benefits >>> falling to general users). >>> >>> I'm slightly iffy about making an intuitive api much less intuitive for >>> an expert future feature that hasn't fully materialized in Lucene yet. It >>> almost seems like that fight should weigh towards general usage and standard >>> users. >>> >>> I don't have a better proposal though, nor the time to consider it at the >>> moment. I was just more curious if anyone else had any thoughts. I hadn't >>> realized Grant had asked a similar question not long ago >>> with no response. Not sure how to take that, but I'd think that would >>> indicate less problems with people than more. On the other hand, you don't >>> have to switch yet (with trunk) and we have yet to release it. I wonder how >>> many non dev, every day users have really had to tussle with the new API >>> yet. Not many people complaining too loudly at the moment though. >>> >>> Asking for a roll back seems a bit extreme without a little more support >>> behind it than we have seen. >>> >>> - Mark >> >> PS >> >> I know you didnt ask for a rollback Grant - just kind of talking in a >> general manner. I see your point on getting the search side in, I'm just not >> sure I agree that it really matters if one hits before the other. Like Mike >> says, you don't >> have to switch to the new API yet. 
Re: New Token API was Re: Payloads and TrieRangeQuery
The "old" API is deprecated, and therefore when we release 2.9 there might be some people who'd think they should move away from it, to better prepare for 3.0 (while in fact this may not be the case). Also, we should make sure that when we remove all the deprecations, this will still exist (and therefore, why deprecate it now?), if we think this should indeed be kept around for at least a while longer. I personally am all for keeping it around (it will save me a huge refactoring of an Analyzer package I wrote), but I have to admit it's only because I've got quite comfortable with the existing API, and did not have the time to try the new one yet. Shai On Mon, Jun 15, 2009 at 3:49 AM, Mark Miller wrote: > Mark Miller wrote: > >> I don't know how I feel about rolling the new token api back. >> >> I will say that I originally had no issue with it because I am very >> excited about Lucene-1458. >> >> At the same time though, I'm thinking Lucene-1458 is a very advanced issue >> that will likely be for really expert usage (though I can see benefits >> falling to general users). >> >> I'm slightly iffy about making an intuitive api much less intuitive for an >> expert future feature that hasn't fully materialized in Lucene yet. It >> almost seems like that fight should weigh towards general usage and standard >> users. >> >> I don't have a better proposal though, nor the time to consider it at the >> moment. I was just more curious if anyone else had any thoughts. I hadn't >> realized Grant had asked a similar question not long ago >> with no response. Not sure how to take that, but I'd think that would >> indicate less problems with people than more. On the other hand, you don't >> have to switch yet (with trunk) and we have yet to release it. I wonder how >> many non dev, every day users have really had to tussle with the new API >> yet. Not many people complaining too loudly at the moment though. 
>> >> Asking for a roll back seems a bit extreme without a little more support >> behind it than we have seen. >> >> - Mark >> > > PS > > I know you didn't ask for a rollback Grant - just kind of talking in a > general manner. I see your point on getting the search side in, I'm just not > sure I agree that it really matters if one hits before the other. Like Mike > says, you don't > have to switch to the new API yet.
Re: New Token API was Re: Payloads and TrieRangeQuery
Mark Miller wrote: I don't know how I feel about rolling the new token api back. I will say that I originally had no issue with it because I am very excited about Lucene-1458. At the same time though, I'm thinking Lucene-1458 is a very advanced issue that will likely be for really expert usage (though I can see benefits falling to general users). I'm slightly iffy about making an intuitive api much less intuitive for an expert future feature that hasn't fully materialized in Lucene yet. It almost seems like that fight should weigh towards general usage and standard users. I don't have a better proposal though, nor the time to consider it at the moment. I was just more curious if anyone else had any thoughts. I hadn't realized Grant had asked a similar question not long ago with no response. Not sure how to take that, but I'd think that would indicate less problems with people than more. On the other hand, you don't have to switch yet (with trunk) and we have yet to release it. I wonder how many non-dev, everyday users have really had to tussle with the new API yet. Not many people complaining too loudly at the moment though. Asking for a roll back seems a bit extreme without a little more support behind it than we have seen. - Mark PS I know you didn't ask for a rollback Grant - just kind of talking in a general manner. I see your point on getting the search side in, I'm just not sure I agree that it really matters if one hits before the other. Like Mike says, you don't have to switch to the new API yet. -- - Mark http://www.lucidimagination.com
Re: New Token API was Re: Payloads and TrieRangeQuery
I don't know how I feel about rolling the new token api back. I will say that I originally had no issue with it because I am very excited about Lucene-1458. At the same time though, I'm thinking Lucene-1458 is a very advanced issue that will likely be for really expert usage (though I can see benefits falling to general users). I'm slightly iffy about making an intuitive api much less intuitive for an expert future feature that hasn't fully materialized in Lucene yet. It almost seems like that fight should weigh towards general usage and standard users. I don't have a better proposal though, nor the time to consider it at the moment. I was just more curious if anyone else had any thoughts. I hadn't realized Grant had asked a similar question not long ago with no response. Not sure how to take that, but I'd think that would indicate less problems with people than more. On the other hand, you don't have to switch yet (with trunk) and we have yet to release it. I wonder how many non dev, every day users have really had to tussle with the new API yet. Not many people complaining too loudly at the moment though. Asking for a roll back seems a bit extreme without a little more support behind it than we have seen. - Mark
Re: New Token API was Re: Payloads and TrieRangeQuery
On 6/14/09 5:17 AM, Grant Ingersoll wrote: Agreed. I've been bringing it up for a while now and made the same comments when it was first introduced, but felt like the lone voice in the wilderness on it and gave way [1], [2], [3]. Now that others are writing/converting, I think it is worth revisiting. I am and always was open to constructive suggestions about how to design this API. I know these new APIs currently don't seem to have many advantages over the previous ones, but they're basically laying the API groundwork for future features like flexible indexing. Some concerns you mentioned were targeted against the first version of the patch in LUCENE-1422. But, you later said you liked how the next patch looked (in thread [2] that you mentioned). That being said, I did just write my first TokenFilter with it, and didn't think it was that hard. There are some gains in it and the API can be simpler if you just need one or two attributes (see DelimitedPayloadTokenFilter), although, just like the move to using char [] in Token, as soon as you do something like store a Token, you lose most of the benefit, I think (for the char [] case, as soon as you need a String in one of your filters, you lose the perf. gain). The annoying parts are that you still have to implement the deprecated next() part, otherwise chances are the thing is unusable by everyone at this point anyway. I'm not sure why this (currently having to implement next() too) is such an issue for you. You brought it up at the Lucene meetup too. No user will ever have to implement both (the new API and the old) in their streams/filters. The only reason why we did it this way is to not sacrifice performance for existing streams/filters when people switch to Lucene 2.9. 
I explained this point in the jira issue: http://issues.apache.org/jira/browse/LUCENE-1422?focusedCommentId=12644881&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12644881 The only time when we'll ever have to implement both APIs is between now and 2.9, only for new streams and filters that we add before 2.9 is released. I don't think it'd be reasonable to consider this disadvantage as a show stopper. Add on top of it, that the whole point of customizing the chain is to use it in search and, frankly speaking, somehow I think that part of the patch was held back. I'm not sure what you're implying. Could you elaborate? The search side of the API is currently being developed in Lucene-1458. 1458 will not make it into 2.9. Therefore I agree that it is not very advantageous to switch to the new API right now for Lucene users. On the other hand, I don't think it hurts either. I personally would vote for reverting until a complete patch that addresses both sides of the problem is submitted and a better solution to cloning is put forth. If we revert now and put a new flexible API like this into 3.x, which I think is necessary to utilize flexible indexing, then we'll have to wait until 4.0 before we can remove the old API. Disadvantages like the one you mentioned above, will then probably be present much longer. I mentioned in the following thread that I have started working on a better way of cloning, which will actually be faster compared to the old API. I'll try to get the code out asap. http://markmail.org/message/q7pgh2qlm2w7cxfx I'd be happy to discuss other API proposals that anybody brings up here, that have the same advantages and are more intuitive. We could also beef up the documentation and give a better example about how to convert a stream/filter from the old to the new API; a constructive suggestion that Uwe made at the ApacheCon. 
-Michael -Grant [1] http://issues.apache.org/jira/browse/LUCENE-1422, [2] http://www.lucidimagination.com/search/document/5daf6d7b8027b4d3/tokenstream_and_token_apis#9e2d0d2b5dc118d4, and the rest of the discussion on that thread. [3] http://www.lucidimagination.com/search/document/4274335abcf31926/new_tokenstream_api_usage On Jun 13, 2009, at 10:32 PM, Mark Miller wrote: What was the big improvement with it again? Advanced, expert custom indexing chains require less casting or something right?
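The old-vs-new filter styles debated above can be put side by side with stand-in types (these are not Lucene's real TokenStream/Token classes). Old style: each call to next() hands back a token object. New style: the filter mutates shared attribute state in place and incrementToken() just reports whether a token was produced.

```java
public class ApiStylesSketch {
    // --- old style: each call returns the next term, null when exhausted ---
    interface OldStream { String next(); }

    static OldStream lowerCaseOld(OldStream in) {
        return () -> {
            String t = in.next();
            return (t == null) ? null : t.toLowerCase();
        };
    }

    // --- new style: shared attribute mutated in place ---
    static class NewStream {
        String termAtt;                               // shared "attribute"
        private final java.util.Iterator<String> it;
        NewStream(java.util.List<String> terms) { it = terms.iterator(); }
        boolean incrementToken() {
            if (!it.hasNext()) return false;
            termAtt = it.next().toLowerCase();        // mutate, don't allocate
            return true;
        }
    }

    public static void main(String[] args) {
        NewStream s = new NewStream(java.util.List.of("Foo", "BAR"));
        StringBuilder sb = new StringBuilder();
        while (s.incrementToken()) sb.append(s.termAtt).append(' ');
        System.out.println(sb.toString().trim());     // foo bar
    }
}
```

The per-token allocation the new style avoids is the performance argument made in the thread; the shared mutable state is the readability cost Mark and Kirill are complaining about.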
New Token API was Re: Payloads and TrieRangeQuery
Agreed. I've been bringing it up for a while now and made the same comments when it was first introduced, but felt like the lone voice in the wilderness on it and gave way [1], [2], [3]. Now that others are writing/converting, I think it is worth revisiting. That being said, I did just write my first TokenFilter with it, and didn't think it was that hard. There are some gains in it and the API can be simpler if you just need one or two attributes (see DelimitedPayloadTokenFilter), although, just like the move to using char [] in Token, as soon as you do something like store a Token, you lose most of the benefit, I think (for the char [] case, as soon as you need a String in one of your filters, you lose the perf. gain). The annoying parts are that you still have to implement the deprecated next() part, otherwise chances are the thing is unusable by everyone at this point anyway. Add on top of it, that the whole point of customizing the chain is to use it in search and, frankly speaking, somehow I think that part of the patch was held back. I personally would vote for reverting until a complete patch that addresses both sides of the problem is submitted and a better solution to cloning is put forth. -Grant [1] http://issues.apache.org/jira/browse/LUCENE-1422, [2] http://www.lucidimagination.com/search/document/5daf6d7b8027b4d3/tokenstream_and_token_apis#9e2d0d2b5dc118d4 , and the rest of the discussion on that thread. [3] http://www.lucidimagination.com/search/document/4274335abcf31926/new_tokenstream_api_usage On Jun 13, 2009, at 10:32 PM, Mark Miller wrote: Yonik Seeley wrote: Even non-API changes have tradeoffs... the indexing improvements (err, total rewrite) made that code *much* harder to understand and debug. It's a net win since the indexing performance improvements were so fantastic. I agree - very hard to follow, worth the improvements. Just to throw something out, the new Token API is not very consumable in my experience. 
The old one was very intuitive and very easy to follow the code. I've had to refigure out what the heck was going on with the new one more than once now. Writing some example code with it is hard to follow or justify to a new user. What was the big improvement with it again? Advanced, expert custom indexing chains require less casting or something right? I dunno - anyone else have any thoughts now that the new API has been in circulation for some time? -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
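For readers following along, the attribute pattern under discussion boils down to this: each stage in the analysis chain asks the source for an attribute type and gets back the one shared, mutable instance, so advancing the stream mutates state in place instead of allocating a new Token per term. A minimal self-contained sketch of the idea (these are not Lucene's classes; the names and the reflection-based factory are illustrative only):

```java
import java.util.HashMap;
import java.util.Map;

// Self-contained sketch of the AttributeSource idea behind the new Token
// API (illustrative only -- these are NOT Lucene's classes). Every stage
// in an analysis chain that asks for an attribute type receives the same
// shared, mutable instance, so advancing the stream mutates state in
// place instead of allocating a new Token per term.
public class AttributeSketch {
    public static class AttributeSource {
        private final Map<Class<?>, Object> attrs = new HashMap<>();

        @SuppressWarnings("unchecked")
        public <A> A addAttribute(Class<A> type) {
            return (A) attrs.computeIfAbsent(type, t -> {
                try {
                    return t.getDeclaredConstructor().newInstance();
                } catch (ReflectiveOperationException e) {
                    throw new IllegalArgumentException(e);
                }
            });
        }
    }

    public static class TermAttribute { public String term; }

    public static void main(String[] args) {
        AttributeSource source = new AttributeSource();
        TermAttribute producer = source.addAttribute(TermAttribute.class);
        // A downstream "filter" asking for the same type gets the SAME object.
        TermAttribute consumer = source.addAttribute(TermAttribute.class);
        producer.term = "trie";
        System.out.println(producer == consumer);  // true: one shared instance
        System.out.println(consumer.term);         // trie
    }
}
```

In the real API a TokenFilter similarly calls addAttribute(...) in its constructor and then reads and writes that shared instance inside incrementToken(), which is where the "only pay for the attributes you use" benefit comes from.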
Re: Payloads and TrieRangeQuery
> Just to throw something out, the new Token API is not very consumable in my > experience. The old one was very intuitive and very easy to follow the code. > > I've had to refigure out what the heck was going on with the new one more > than once now. Writing some example code with it is hard to follow or > justify to a new user. > > What was the big improvement with it again? Advanced, expert custom indexing > chains require less casting or something right? > > I dunno - anyone else have any thoughts now that the new API has been in > circulation for some time? I have an advanced, expert custom indexing chain, and it's still not ported over to the new API. It's counterintuitive, all right, with names not really saying what's going on (please, for an AttributeSource, whose Attribute is it? Attribute is a quality of 'something', but that 'something' is amiss), but the biggest problem for me is that it capitalizes on the idea of token stream even further, making filters whose output is several times the input tokenwise, or which need to inspect a number of tokens before emitting something - much harder to write. I most probably missed something and there IS a way not to trash your memory with non-reused LinkedHashMaps, but then again, there are no pointers. -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785
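The one-to-many case Kirill describes (a filter that emits several output tokens per input token) has to do its own buffering under a pull-style API: queue the extra outputs, and drain the queue before consuming the next input token. A self-contained sketch of that pattern in plain Java (not Lucene's classes; the lowercase-variant expansion rule is invented for illustration):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

// Sketch of the buffering a one-to-many filter needs under a pull-based
// API (plain Java, not Lucene's classes). Extra outputs are queued and
// drained before the next input token is consumed.
public class OneToManyFilter {
    private final Iterator<String> input;
    private final Deque<String> pending = new ArrayDeque<>();
    private String current;

    public OneToManyFilter(List<String> tokens) {
        this.input = tokens.iterator();
    }

    // Analogue of incrementToken(): advance to the next output token.
    public boolean incrementToken() {
        if (!pending.isEmpty()) {   // drain buffered expansions first
            current = pending.poll();
            return true;
        }
        if (!input.hasNext()) {
            return false;           // upstream exhausted
        }
        current = input.next();
        // Made-up expansion rule for illustration: also emit a lowercased
        // variant when the token contains upper-case characters.
        String lower = current.toLowerCase();
        if (!lower.equals(current)) {
            pending.add(lower);
        }
        return true;
    }

    public String term() { return current; }

    public static void main(String[] args) {
        OneToManyFilter f = new OneToManyFilter(List.of("Trie", "range"));
        while (f.incrementToken()) {
            System.out.println(f.term()); // prints: Trie, trie, range
        }
    }
}
```

The pattern itself is API-agnostic; the complaint in the email is that the attribute-based API gives no ready-made support for it, so every such filter ends up hand-rolling a queue like this one.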
Re: Payloads and TrieRangeQuery
Yonik Seeley wrote: Even non-API changes have tradeoffs... the indexing improvements (err, total rewrite) made that code *much* harder to understand and debug. It's a net win since the indexing performance improvements were so fantastic. I agree - very hard to follow, worth the improvements. Just to throw something out, the new Token API is not very consumable in my experience. The old one was very intuitive and very easy to follow the code. I've had to refigure out what the heck was going on with the new one more than once now. Writing some example code with it is hard to follow or justify to a new user. What was the big improvement with it again? Advanced, expert custom indexing chains require less casting or something right? I dunno - anyone else have any thoughts now that the new API has been in circulation for some time? -- - Mark http://www.lucidimagination.com
Re: Payloads and TrieRangeQuery
On Jun 13, 2009, at 8:58 AM, Michael McCandless wrote: OK, good points Grant. I now agree that it's not a simple task, moving core stuff from Solr -> Lucene. So summing this all up: * Some feel Lucene should only aim to be the core "expert" engine used by Solr/Nutch/etc., so things like moving trie to core (with consumable naming, good defaults, etc.) are near zero priority. I agree on the engine part, but don't agree on the expert part. Many people who have their own frameworks and needs should be able to plug in Lucene and it should just work. Likewise, there is a huge install base that must be thought of. Still Solr/Nutch are the single largest users of Lucene and, wearing my PMC hat, I think it makes sense that we make it obvious for newbies coming in where their time is best spent. If someone shows up in Solr-land and just needs Lucene because they want to be next to the metal, we should tell them that. Likewise if they don't want to spend time doing warming, faceting, etc. they should just go use Solr. Also, I have no problem with Trie being in core. If someone wants to do it, go for it. That's how it all works anyway. Do-acracy in action. It's not a priority for me, but that shouldn't stop anyone else. While I see & agree that this is indeed what Solr needs of Lucene, I still think direct consumability of Lucene is important and Lucene should try to have a consumable API, good names for classes, methods, good defaults, etc. Agreed, although I think all the deprecation stuff severely limits Lucene's consumability. You, as a writer of LIA, know this first hand, and I also experience this first hand when doing Lucene training. As I've pointed out countless times lately, so much cruft builds up in Lucene by the time that we get to X.Y release (for Y > 2, as in 2.2) that consumability suffers greatly.
And I don't see those two goals as being in conflict (ie, I don't see Lucene having a consumable API as preventing Solr from using Lucene's advanced APIs), except for the fact that we all have limited time. * We have two communities. Each has its own goal (to make its product good), its own committers, etc. While technically we seem to agree certain things (function queries, NumberUtils, highlighters, analyzers, faceted nav, etc.) logically "belong" as Lucene modules, the logistics and work required and different requirements (both one time, and ongoing) are in fact sizable challenges/barriers. I take the "I know where to put it when I do it" approach, but as is obvious, not everyone has that luxury b/c they aren't committers on both projects. Integrating Tika into Solr was logical, while the DelimitedPayload stuff logically belonged in contrib/analyzers (to me anyway, and one of my primary motivations for that patch is to easily enable Payloads in Solr w/o having to modify how Solr works). Likewise, I think it makes sense for Solr's analyzers (WordDelimiter) to be in contrib/analyzers too, but I don't particularly think moving Solr's faceting stuff to Lucene is necessarily core to Lucene. As seems to be my theme lately, I take it on a "case-by-case" basis. Perhaps once Lucene "modularizes", in the future, such consolidation may be easier, ie if/once there are committers focused on "analyzers" I could see them helping out all around in pulling all analyzers together. * We all are obviously busy and there are more important things to work on than "shuffling stuff around". +1 -Grant
Re: Payloads and TrieRangeQuery
Of course consumability (good APIs) is important, but rational people can disagree when it comes to the specifics... many things come with tradeoffs. Even non-API changes have tradeoffs... the indexing improvements (err, total rewrite) made that code *much* harder to understand and debug. It's a net win since the indexing performance improvements were so fantastic. Deprecating an existing class that's been around for a while simply for the purposes of slightly improving the naming may not be worth it (it's destroying a little piece of the collective memory of all developers who have used Lucene). Avoiding having to specify the type of a sort... I'm skeptical about the benefits vs potentially decreased flexibility and increased code size and complexity - but it's all hypothetical at this point. But on a positive note, Lucene 2.9 looks like it's suddenly progressing fast enough that it's feasible to use it in Solr 1.4 (which I've lobbied for in Solr-land) - which seems like it would be a win for both Lucene and Solr. -Yonik http://www.lucidimagination.com
Re: Payloads and TrieRangeQuery
Very true write-up, Grant! On Sat, Jun 13, 2009 at 2:58 PM, Michael McCandless wrote: > OK, good points Grant. I now agree that it's not a simple task, > moving core stuff from Solr -> Lucene. So summing this all up: > > * Some feel Lucene should only aim to be the core "expert" engine > used by Solr/Nutch/etc., so things like moving trie to core (with > consumable naming, good defaults, etc.) are near zero priority. > > While I see & agree that this is indeed what Solr needs of Lucene, > I still think direct consumability of Lucene is important and > Lucene should try to have a consumable API, good names for classes, > methods, good defaults, etc. > > And I don't see those two goals as being in conflict (ie, I don't > see Lucene having a consumable API as preventing Solr from using > Lucene's advanced APIs), except for the fact that we all have > limited time. > > * We have two communities. Each has its own goal (to make its > product good), its own committers, etc. While technically we > seem to agree certain things (function queries, NumberUtils, > highlighters, analyzers, faceted nav, etc.) logically "belong" as > Lucene modules, the logistics and work required and different > requirements (both one time, and ongoing) are in fact sizable > challenges/barriers. I personally think that we should make it a requirement that for a move from Solr to Lucene there be a patch available for the integration into Solr. There is still a possibility that the solr code will change before the patch can be applied but it will still be easier for the solr team to integrate it. > > Perhaps once Lucene "modularizes", in the future, such > consolidation may be easier, ie if/once there are committers > focused on "analyzers" I could see them helping out all > around in pulling all analyzers together. For questionable Solr stuff there could still be space in contrib to make the great work from the solr community available to others not using solr directly.
This way it would also be possible to give solr committers write access to those contrib modules which in turn simplifies the integration back to solr. I think it's great to offer fine grained modules of solr features as contribs of lucene while keeping the core "clean". Just an idea... > > * We all are obviously busy and there are more important things to > work on than "shuffling stuff around". > > So now I'm off to scrutinize LUCENE-1313... :) > > Mike > > On Fri, Jun 12, 2009 at 5:33 PM, Grant Ingersoll wrote: >> >> On Jun 12, 2009, at 12:20 PM, Michael McCandless wrote: >> >>> On Thu, Jun 11, 2009 at 4:58 PM, Yonik Seeley >>> wrote: >>> In Solr land we can quickly hack something together, spend some time thinking about the external HTTP interface, and immediately make it available to users (those using nightlies at least). It would be a huge burden to say to Solr that anything of interest to the Lucene community should be pulled out into a module that Solr should then use. >>> >>> Sure, new and exciting things should still stay private to Solr... >>> As a separate project, Solr is (and should be) free to follow what's in its own best interest. >>> >>> Of course! >>> >>> I see your point, that moving things down into Lucene is added cost: >>> we have to get consensus that it's a good thing to move (but should >>> not be hard for many things), do all the mechanics to "transplant" the >>> code, take Lucene's "different" requirements into account (that the >>> consumability & stability of the Java API is important), etc. >> >> The problem traditionally has been that people only do the work one way. >> That is, they take it from Solr, but then they never submit patches to Solr >> to use the version in Lucene. And, since many of the Lucene committers are >> not Solr committers, even if they do the Solr work, they can't see it >> through.
>> >> It seems all the pure Lucene devs want the functionality of Solr, but they >> don't want to do any of the work to remove the duplication from Solr. >> Additionally, it is often the case that by the time it gets into Lucene, >> some Solr user has come along and improved the Solr version. The Function >> stuff is example numero uno. >> >> Wearing my PMC hat, I'd say if people are going to be moving stuff around >> like this, then they better be keeping Solr up to date, too, because it is >> otherwise creating a lot of work for Solr to the detriment of it (because >> that time could be spent doing other things). Still, I don't think that is >> all that worthwhile, as it will just create a ton of extra work. People who >> want Solr stuff are free to pull what they need into their project. There >> is absolutely nothing stopping them. >> >> And the fact is, that no matter how much is pulled out of Solr, people will >> still contribute things to Solr because it is its own community and is >> fairly autonomous, a f
Re: Payloads and TrieRangeQuery
OK, good points Grant. I now agree that it's not a simple task, moving core stuff from Solr -> Lucene. So summing this all up: * Some feel Lucene should only aim to be the core "expert" engine used by Solr/Nutch/etc., so things like moving trie to core (with consumable naming, good defaults, etc.) are near zero priority. While I see & agree that this is indeed what Solr needs of Lucene, I still think direct consumability of Lucene is important and Lucene should try to have a consumable API, good names for classes, methods, good defaults, etc. And I don't see those two goals as being in conflict (ie, I don't see Lucene having a consumable API as preventing Solr from using Lucene's advanced APIs), except for the fact that we all have limited time. * We have two communities. Each has its own goal (to make its product good), its own committers, etc. While technically we seem to agree certain things (function queries, NumberUtils, highlighters, analyzers, faceted nav, etc.) logically "belong" as Lucene modules, the logistics and work required and different requirements (both one time, and ongoing) are in fact sizable challenges/barriers. Perhaps once Lucene "modularizes", in the future, such consolidation may be easier, ie if/once there are committers focused on "analyzers" I could see them helping out all around in pulling all analyzers together. * We all are obviously busy and there are more important things to work on than "shuffling stuff around". So now I'm off to scrutinize LUCENE-1313... :) Mike On Fri, Jun 12, 2009 at 5:33 PM, Grant Ingersoll wrote: > > On Jun 12, 2009, at 12:20 PM, Michael McCandless wrote: > >> On Thu, Jun 11, 2009 at 4:58 PM, Yonik Seeley >> wrote: >> >>> In Solr land we can quickly hack something together, spend some time >>> thinking about the external HTTP interface, and immediately make it >>> available to users (those using nightlies at least).
It would be a >>> huge burden to say to Solr that anything of interest to the Lucene >>> community should be pulled out into a module that Solr should then >>> use. >> >> Sure, new and exciting things should still stay private to Solr... >> >>> As a separate project, Solr is (and should be) free to follow >>> what's in its own best interest. >> >> Of course! >> >> I see your point, that moving things down into Lucene is added cost: >> we have to get consensus that it's a good thing to move (but should >> not be hard for many things), do all the mechanics to "transplant" the >> code, take Lucene's "different" requirements into account (that the >> consumability & stability of the Java API is important), etc. > > The problem traditionally has been that people only do the work one way. > That is, they take it from Solr, but then they never submit patches to Solr > to use the version in Lucene. And, since many of the Lucene committers are > not Solr committers, even if they do the Solr work, they can't see it > through. > > It seems all the pure Lucene devs want the functionality of Solr, but they > don't want to do any of the work to remove the duplication from Solr. > Additionally, it is often the case that by the time it gets into Lucene, > some Solr user has come along and improved the Solr version. The Function > stuff is example numero uno. > > Wearing my PMC hat, I'd say if people are going to be moving stuff around > like this, then they better be keeping Solr up to date, too, because it is > otherwise creating a lot of work for Solr to the detriment of it (because > that time could be spent doing other things). Still, I don't think that is > all that worthwhile, as it will just create a ton of extra work. People who > want Solr stuff are free to pull what they need into their project. There > is absolutely nothing stopping them.
> > And the fact is, that no matter how much is pulled out of Solr, people will > still contribute things to Solr because it is its own community and is > fairly autonomous, a few committers that cross over notwithstanding. I'd > venture a fair number of Solr committers know little about Lucene internals. > Heck, given the amount of work you do, Mike, I'd say a fair number of > Lucene committers know very little about the internals of Lucene anymore. > It has been good to see you over in Solr land at least watching what is > going on there to at least help coordinate when Solr finds Lucene errors. > > >> >> But, there is a huge benefit to having it in Lucene: you get a wider >> community involved to help further improve it, you make Lucene >> stronger which improves its & Solr's adoption, etc. >> > > That is not always the case. Pushing things into Lucene from Solr makes it > harder for Solr committers to do their work, unless you are proposing that > all Solr committers should be Lucene committers. > > As for adoption, most people probably should just be starting with Solr > anyway. The
Re: Payloads and TrieRangeQuery
On Jun 12, 2009, at 12:20 PM, Michael McCandless wrote: On Thu, Jun 11, 2009 at 4:58 PM, Yonik Seeley wrote: In Solr land we can quickly hack something together, spend some time thinking about the external HTTP interface, and immediately make it available to users (those using nightlies at least). It would be a huge burden to say to Solr that anything of interest to the Lucene community should be pulled out into a module that Solr should then use. Sure, new and exciting things should still stay private to Solr... As a separate project, Solr is (and should be) free to follow what's in its own best interest. Of course! I see your point, that moving things down into Lucene is added cost: we have to get consensus that it's a good thing to move (but should not be hard for many things), do all the mechanics to "transplant" the code, take Lucene's "different" requirements into account (that the consumability & stability of the Java API is important), etc. The problem traditionally has been that people only do the work one way. That is, they take it from Solr, but then they never submit patches to Solr to use the version in Lucene. And, since many of the Lucene committers are not Solr committers, even if they do the Solr work, they can't see it through. It seems all the pure Lucene devs want the functionality of Solr, but they don't want to do any of the work to remove the duplication from Solr. Additionally, it is often the case that by the time it gets into Lucene, some Solr user has come along and improved the Solr version. The Function stuff is example numero uno. Wearing my PMC hat, I'd say if people are going to be moving stuff around like this, then they better be keeping Solr up to date, too, because it is otherwise creating a lot of work for Solr to the detriment of it (because that time could be spent doing other things). Still, I don't think that is all that worthwhile, as it will just create a ton of extra work.
People who want Solr stuff are free to pull what they need into their project. There is absolutely nothing stopping them. And the fact is, that no matter how much is pulled out of Solr, people will still contribute things to Solr because it is its own community and is fairly autonomous, a few committers that cross over notwithstanding. I'd venture a fair number of Solr committers know little about Lucene internals. Heck, given the amount of work you do, Mike, I'd say a fair number of Lucene committers know very little about the internals of Lucene anymore. It has been good to see you over in Solr land at least watching what is going on there to at least help coordinate when Solr finds Lucene errors. But, there is a huge benefit to having it in Lucene: you get a wider community involved to help further improve it, you make Lucene stronger which improves its & Solr's adoption, etc. That is not always the case. Pushing things into Lucene from Solr makes it harder for Solr committers to do their work, unless you are proposing that all Solr committers should be Lucene committers. As for adoption, most people probably should just be starting with Solr anyway. The fact is that every Lucene committer to a tee will tell you that they have built something that more or less looks like Solr. Lucene is great as a low-level Vector Space implementation with some nice contribs, but much of the interesting stuff in search these days happens at the layer up (and arguably even a layer above that in terms of UI and intelligent search, etc). In Lucene PMC land, that area is Solr and Nutch. My personal opinion is that Lucene should focus on being a really fast, core search library and that the outlet for the higher level stuff is in Solr and Nutch.
It is usually obvious when things belong in the core, because people bring them up in the appropriate place (there are some rare exceptions that you have mentioned). -Grant
Re: Payloads and TrieRangeQuery
On Thu, Jun 11, 2009 at 4:58 PM, Yonik Seeley wrote: > In Solr land we can quickly hack something together, spend some time > thinking about the external HTTP interface, and immediately make it > available to users (those using nightlies at least). It would be a > huge burden to say to Solr that anything of interest to the Lucene > community should be pulled out into a module that Solr should then > use. Sure, new and exciting things should still stay private to Solr... > As a separate project, Solr is (and should be) free to follow > what's in it's own best interest. Of course! I see your point, that moving things down into Lucene is added cost: we have to get consensus that it's a good thing to move (but should not be hard for many things), do all the mechanics to "transplant" the code, take Lucene's "different" requirements into account (that the consumability & stability of the Java API is important), etc. But, there is a huge benefit to having it in Lucene: you get a wider community involved to help further improve it, you make Lucene stronger which improves its & Solr's adoption, etc. What's good for Lucene is good for Solr. Eg why hasn't NumberUtils been folded into Lucene, aeons ago? I realize it's not the perfect solution (and trie* seems to be better), but it's certainly better than the "nothing" we've had for a long time... Why not the custom fragmenters (Gap, Regex) that Solr has to improve highlighting? EG it looks like Solr can approximately produce sentences as fragments. This would be a great addition to Lucene's highlighter. (NOTE: I fully realize that a large number of things do get moved from Solr to Lucene, over time, and that's great; I'm saying we should very much keep that up). But of course we are as usual resource starved... >> For example, Solr would presumably prefer that trie* remain in contrib? > > From a capabilities perspective, it doesn't matter much if it's in > contrib or core I think. 
It's a small amount of work to adapt to > class name changes, but nothing to complain about. > > But it doesn't seem like Trie should be treated specially somehow... Trie is *very* useful. It plugs a serious weakness in Lucene (ootb handling of numeric fields). The things one must now do to have a numeric field work "properly" are crazy. Trie makes Lucene more useful & consumable; it's a powerful feature. It should be treated specially. But: I certainly see your point, that Solr could care less about such consumability, and leaving trie in contrib would be just fine (from Solr's standpoint). > seems to go down a path that makes customer provided filters > second-class citizens. I hate it when Java does stuff like that to me > (their provided classes can do more than mine). I don't really see the connection here. If we make trie* the default for handling of numeric fields in Lucene, how does that hurt customer provided filters? >> There's a single set of Solr developers, but a very wide range of >> direct Lucene users. I don't see how Lucene having good consumability >> actually makes Solr's life harder. Those raw APIs would still be >> accessible to Solr... simple things should be simple (direct Lucene >> users) and complex things should be possible (Solr). > > But with changes come deprecations - forced changes when the > deprecations are removed. Sometimes those are easy to adapt to, > sometimes not. If those required changes don't actually add any > functionality to Solr, it's a net negative if you're looking at it > from Solr's point of view. That doesn't mean Lucene shouldn't - and > I've not complained to Lucene in the past because it wasn't Lucene's > responsibility. Right, so this is why the move of trie from contrib -> core is net negative for Solr (things work fine now, and it only creates work for you). But, if it improves Lucene's adoption, because Lucene is more consumable, that then becomes a positive for Solr. 
And BTW you should "complain" to Lucene more if we're doing things that are not Solr friendly. Honestly, if anything, we don't hear enough from you ;) Such complaints will presumably often match this one ("Solr wants a raw engine; Lucene wants consumability") and we'll just have to agree to disagree. But other times I'm sure we'd get something net/net good out of the resulting discussion. >>> and taking it out of Solr's release cycle and easy ability to change - >>> if Solr needs to make a change to one of the moved classes, it's >>> necessary to get it through the Lucene change process and then upgrade >>> to the latest Lucene trunk - all or nothing. >> >> "Getting through Lucene's change process" should be real simple for >> you all :) > > Y'r kidding, right? ;-) > It's sometimes hard enough to get stuff through either community, let > alone both. Actually I wasn't kidding! Sure there's the "normal" open-source challenges -- getting someone's attention, the mechanics of making a patch & iterating, sometimes massive unrelated d
Re: Payloads and TrieRangeQuery
In Solr land we can quickly hack something together, spend some time thinking about the external HTTP interface, and immediately make it available to users (those using nightlies at least). It would be a huge burden to say to Solr that anything of interest to the Lucene community should be pulled out into a module that Solr should then use. As a separate project, Solr is (and should be) free to follow what's in its own best interest. [...] > For example, Solr would presumably prefer that trie* remain in contrib? From a capabilities perspective, it doesn't matter much if it's in contrib or core I think. It's a small amount of work to adapt to class name changes, but nothing to complain about. But it doesn't seem like Trie should be treated specially somehow... seems to go down a path that makes customer provided filters second-class citizens. I hate it when Java does stuff like that to me (their provided classes can do more than mine). > There's a single set of Solr developers, but a very wide range of > direct Lucene users. I don't see how Lucene having good consumability > actually makes Solr's life harder. Those raw APIs would still be > accessible to Solr... simple things should be simple (direct Lucene > users) and complex things should be possible (Solr). But with changes come deprecations - forced changes when the deprecations are removed. Sometimes those are easy to adapt to, sometimes not. If those required changes don't actually add any functionality to Solr, it's a net negative if you're looking at it from Solr's point of view. That doesn't mean Lucene shouldn't - and I've not complained to Lucene in the past because it wasn't Lucene's responsibility. [...] >> and taking it out of Solr's release cycle and easy ability to change - >> if Solr needs to make a change to one of the moved classes, it's >> necessary to get it through the Lucene change process and then upgrade >> to the latest Lucene trunk - all or nothing.
> > "Getting through Lucene's change process" should be real simple for > you all :) Y'r kidding, right? ;-) It's sometimes hard enough to get stuff through either community, let alone both. For code that's in Solr, we only have to worry about Solr's concerns, not about all users of Lucene. Big difference. > And, Solr upgrades Lucene's JAR fairly often already? And a lot of our users don't like it. It's also become much more difficult due to all the Lucene changes lately. It's something we should be doing less of, not more of unless we formally merge the projects or something ;-) > NumericField would enforce one token, during indexing. That only changes the point at which the user realizes "uh, this won't work", and it's still at a point after they've written their code. Checking like this doesn't even feel like it belongs at the indexing level. -Yonik http://www.lucidimagination.com
Re: Payloads and TrieRangeQuery
On Thu, Jun 11, 2009 at 9:20 AM, Uwe Schindler wrote: > In my opinion, solr and lucene should exchange technology much more. Solr > should concentrate on the "search server" and lucene should provide the > technology. +1 > All additional implementations inside solr like faceting and so > on, could also live in lucene. I would have nice usages for it (I do not use > Solr, but have my own Lucene framework, that I cannot give up because of > various reasons). But e.g. Solr's faceting, Solr's analyzers and so on > would be great in lucene as "modules". +1 > I get a lot of questions all the > time, how to do this and that, because the people don't understand why they > must first map the float to an int. If they do it, the next question is: > "why does this work, I do not want to lose precision" and so on. I will do > it with TrieTokenStream Exactly! Those are new users struggling with trie... Lucene's consumability is important. > In my opinion, the classes should stay Trie* and not Numeric*. Maybe we have > different implementations for numeric Range queries in future. In this case, > I think Yonik is right. The class name should describe how it works if there > may be different and incompatible implementations in future. But... we don't name our other classes according to how they're implemented? EG, JavaCCQueryParser or JFlexQueryParser, or writeSingleDocSegmentInRAM (instead of addDocument, pre-2.3), or bufferPendingDelete (instead of deleteDocument), or RangeQueryAsBooleanQuery, RangeQueryAsConstantScoreFilter, SortUsingFieldCache, etc. I absolutely love "trie", as a developer, but the average user won't know what the trie data structure is. TrieRangeQuery is less consumable than NumericRangeQuery (or, simply RangeQuery). As a user I know Lucene is doing all kinds of cool stuff under the hood to get its job done... but those cool things shouldn't be in the names. The name should describe what it does, not how...
Mike
Re: Payloads and TrieRangeQuery
On Thu, Jun 11, 2009 at 8:46 AM, Yonik Seeley wrote: >>> Really goes into Solr land... my pref for Lucene is to remain a core >>> expert-level full-text search library and keep out things that are >>> easy to do in an application or at another level. >> >> I think this must be the crux of our disagreement. > > Indeed. The itch to scratch w.r.t Solr in Lucene is increased core > functionality, not more magic (that duplicates what Solr already does, > but just in a different way and thus makes the lives of Solr > developers harder). But... Solr's needs are very different from direct users of Lucene. I completely agree that Solr needs & wants only the low-level APIs in Lucene, a raw engine, that doesn't bother with good defaults, consumability, etc. Just the raw stuff. If Lucene existed only for Solr, we'd be done here. But Lucene is used by many direct users, and those users benefit from good defaults & consumability. For example, Solr would presumably prefer that trie* remain in contrib? There's a single set of Solr developers, but a very wide range of direct Lucene users. I don't see how Lucene having good consumability actually makes Solr's life harder. Those raw APIs would still be accessible to Solr... simple things should be simple (direct Lucene users) and complex things should be possible (Solr). BTW, I don't mean to "pick" on trie*; I think there are many other examples where we could improve Lucene's consumability. EG, for highlighter, you should pretty much always use its SpanScorer; yet, it's completely non-obvious (even having read the javadocs) how to do so. Why isn't this the default scorer for Highlighter? Such a situation doesn't affect Solr: you all are experts on all aspects of Lucene, and you can figure it out. But your average user will do the obvious thing but then notice highlighting for phrase searches is buggy, and conclude Lucene is buggy and go and prefer the other search engine they are testing. It's a trap. 
Lucene's consumability is important. > If we asked on java-user about people's priorities/wishes, I bet > column stride fields, near real time indexing, and better performance > would dominate stuff like not having to specify how to sort a field. I think all of the above are important :) >> I feel, instead, that Lucene should stand on its own, as a useful >> search library, with a consumable API, good defaults, etc. Lucene is >> more than "the expert level search API that's embedded in >> Solr". Lucene is consumed directly by apps other than Solr. >> >> In fact, I think there are many things in Solr that naturally belong >> in Lucene (and over time we've been gradually slurping them down). >> The line/criteria has always been rather blurry... > > And conversely, Solr isn't just a wrapper around Lucene and an > incubator for Lucene technology. Of course not: there's lots of good stuff in Solr that should stay in Solr. But eg the neat analyzers/tokenizers, search filters, faceted nav, custom collectors, function queries (now diverged), CharFilter (in progress), improvements to highlighter, etc., should really all be in Lucene instead (as "modules")? > Ask Lucene users if they would like pretty much any substantial piece > of functionality in Solr moved to Lucene as a module and you'll > probably get an affirmative answer. But moving something from Solr to > Lucene can have a lot of negative effects for Solr, including taking > it out of the hands of Solr committers who aren't Lucene committers, We should simply make such Solr committers Lucene committers, if they are indeed working on stuff that should be in Lucene? > and taking it out of Solr's release cycle and easy ability to change - > if Solr needs to make a change to one of the moved classes, it's > necessary to get it through the Lucene change process and then upgrade > to the latest Lucene trunk - all or nothing. 
"Getting through Lucene's change process" should be real simple for you all :) And, Solr upgrades Lucene's JAR fairly often already? > It's also the case that the goals of Lucene classes and Solr classes > are often very different. Lucene is more concerned with Java APIs (as > should be the case), while they are a bit more secondary in Solr... > the external APIs are of primary importance and one doesn't worry as > much (or at all) about the classes implementing that interface or it's > Java API back compatibility (as a generalization... it depends on the > class). Different, yes, but not incompatible? >> In Lucene, we should be able to add a NumericField to a document, >> index it, and then create RangeFilter or Sort on that field and have >> things "just work". > > That feels like a false sense of simplicity, and Lucene isn't for > dummies ;-) One needs to understand how things work under the hood to > avoid shooting oneself in the foot. You need to understand the memory > implications of sorting on different fields, and you need to > understand that to sort on a text field
RE: Payloads and TrieRangeQuery
From: Michael McCandless [mailto:luc...@mikemccandless.com] > On Wed, Jun 10, 2009 at 6:07 PM, Yonik Seeley > wrote: > > > Really goes into Solr land... my pref for Lucene is to remain a core > > expert-level full-text search library and keep out things that are > > easy to do in an application or at another level. > > I think this must be the crux of our disagreement. > > I feel, instead, that Lucene should stand on its own, as a useful > search library, with a consumable API, good defaults, etc. Lucene is > more than "the expert level search API that's embedded in > Solr". Lucene is consumed directly by apps other than Solr. This is my opinion, too. > In fact, I think there are many things in Solr that naturally belong > in Lucene (and over time we've been gradually slurping them down). > The line/criteria has always been rather blurry... There is currently also some overlap, like function queries, which are implemented in both projects in different versions: they were moved to Lucene in the past, but they are still alive in Solr. There is also some overlap in the Analyzers. I would really like to have a very generic, configurable analyzer in Lucene, so I do not need to create a new subclass for each analyzer. For other projects (like my panFMP), it would be really great to just create an Analyzer instance and add some filters into a list with addFilter or something like that. In my opinion, Solr and Lucene should exchange technology much more. Solr should concentrate on the "search server" and Lucene should provide the technology. All additional implementations inside Solr, like faceting and so on, could also live in Lucene. I would have nice usages for it (I do not use Solr, but have my own Lucene framework, which I cannot give up for various reasons). But e.g. Solr's faceting, Solr's analyzers and so on would be great in Lucene as "modules". 
> In Lucene, we should be able to add a NumericField to a document, > index it, and then create RangeFilter or Sort on that field and have > things "just work". > > Unfortunately, we are far from being able to do so today. We can (and > should, for 2.9) make baby steps towards this, by absorbing trie* into > core, but you're still gonna have to "just know" to make the specific > FieldCache parser and specific NumericRangeQuery to match how you > indexed. It's awkward and not consumable, but I think it'll just have to > do for now... I will have time at the weekend (currently I am working hard for PANGAEA, not related to Lucene at the moment) and will create a core trie-range as defined in LUCENE-1673. The usage pattern will be similar to contrib's now, only with a revised API, making it simpler to instantiate the TokenStreams and RangeQueries with the different data types. (I get a lot of questions all the time about how to do this and that, because people don't understand why they must first map the float to an int. If they do it, the next question is: "why does this work, I do not want to lose precision" and so on. I will do it with TrieTokenStream.) In my opinion, the classes should stay Trie* and not Numeric*. Maybe we will have different implementations for numeric range queries in the future. In this case, I think Yonik is right: the class name should describe how it works if there may be different and incompatible implementations in the future. Uwe
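The float-to-int mapping that confuses Uwe's users can be made concrete with a small sketch. This is a hedged illustration of the well-known order-preserving transform (the class and method names here are mine, not necessarily the exact contrib TrieUtils API): positive floats keep their IEEE 754 bit pattern, while for negative floats the 31 magnitude bits are flipped so that "more negative" sorts lower.

```java
public class SortableFloat {
    // Order-preserving float -> int mapping: comparing the resulting ints
    // as signed values gives the same ordering as comparing the floats.
    // No precision is lost; the mapping is a bijection on bit patterns.
    public static int floatToSortableInt(float f) {
        int bits = Float.floatToIntBits(f);
        // For negative floats (sign bit set, bits >> 31 == -1) flip the
        // 31 magnitude bits; positive floats are left unchanged.
        return bits ^ ((bits >> 31) & 0x7fffffff);
    }

    public static void main(String[] args) {
        float[] values = {-2.5f, -1.0f, -0.0f, 0.0f, 1.0f, 2.5f};
        for (int i = 1; i < values.length; i++) {
            if (floatToSortableInt(values[i - 1]) > floatToSortableInt(values[i]))
                throw new AssertionError("ordering broken at index " + i);
        }
        System.out.println("order preserved");
    }
}
```

This is why the "why must I map the float to an int?" question has a good answer: the int is just a sortable re-encoding of the same 32 bits.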
Re: Payloads and TrieRangeQuery
On Thu, Jun 11, 2009 at 7:01 AM, Michael McCandless wrote: > On Wed, Jun 10, 2009 at 6:07 PM, Yonik Seeley > wrote: > >> Really goes into Solr land... my pref for Lucene is to remain a core >> expert-level full-text search library and keep out things that are >> easy to do in an application or at another level. > > I think this must be the crux of our disagreement. Indeed. The itch to scratch w.r.t Solr in Lucene is increased core functionality, not more magic (that duplicates what Solr already does, but just in a different way and thus makes the lives of Solr developers harder). If we asked on java-user about people's priorities/wishes, I bet column stride fields, near real time indexing, and better performance would dominate stuff like not having to specify how to sort a field. > I feel, instead, that Lucene should stand on its own, as a useful > search library, with a consumable API, good defaults, etc. Lucene is > more than "the expert level search API that's embedded in > Solr". Lucene is consumed directly by apps other than Solr. > > In fact, I think there are many things in Solr that naturally belong > in Lucene (and over time we've been gradually slurping them down). > The line/criteria has always been rather blurry... And conversely, Solr isn't just a wrapper around Lucene and an incubator for Lucene technology. Ask Lucene users if they would like pretty much any substantial piece of functionality in Solr moved to Lucene as a module and you'll probably get an affirmative answer. But moving something from Solr to Lucene can have a lot of negative effects for Solr, including taking it out of the hands of Solr committers who aren't Lucene committers, and taking it out of Solr's release cycle and easy ability to change - if Solr needs to make a change to one of the moved classes, it's necessary to get it through the Lucene change process and then upgrade to the latest Lucene trunk - all or nothing. 
It's also the case that the goals of Lucene classes and Solr classes are often very different. Lucene is more concerned with Java APIs (as should be the case), while they are a bit more secondary in Solr... the external APIs are of primary importance and one doesn't worry as much (or at all) about the classes implementing that interface or its Java API back compatibility (as a generalization... it depends on the class). > In Lucene, we should be able to add a NumericField to a document, > index it, and then create RangeFilter or Sort on that field and have > things "just work". That feels like a false sense of simplicity, and Lucene isn't for dummies ;-) One needs to understand how things work under the hood to avoid shooting oneself in the foot. You need to understand the memory implications of sorting on different fields, and you need to understand that to sort on a text field, there really needs to be just one token per field. You need to understand the way Trie is indexed, and that multiple values per field won't work if you use a precision step less than the word size. There have been a lot of bad design decisions (I'm talking software development in general) due to citing "the user will be confused". Often, this hypothetical user doesn't exist (or is an extreme minority), and hence I prefer things of the form "I think this is confusing". Extra magic isn't always a good thing. -Yonik http://www.lucidimagination.com
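The precision-step point above can be illustrated with a toy sketch of how trie indexing emits one term per precision level (the term syntax here is invented for readability; the real contrib encoding is prefix-coded binary, so treat this only as a model of the structure):

```java
import java.util.ArrayList;
import java.util.List;

public class TrieTermsSketch {
    // For a 64-bit value, emit one term per precision level: the full
    // value at shift 0, then progressively coarser prefixes. A range
    // query can then cover a large interval with a few coarse terms plus
    // a few fine-grained terms at the interval's edges.
    public static List<String> terms(long value, int precisionStep) {
        List<String> out = new ArrayList<>();
        for (int shift = 0; shift < 64; shift += precisionStep) {
            out.add(shift + ":" + Long.toHexString(value >>> shift));
        }
        return out;
    }

    public static void main(String[] args) {
        // precisionStep 16 -> 4 terms per value (shifts 0, 16, 32, 48)
        System.out.println(terms(0x123456789ABCDEF0L, 16));
    }
}
```

This also shows why multiple values per field interact badly with small precision steps: every value adds terms at every level, and the coarse levels of different values can collide or interleave in ways the range rewrite must account for.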
Re: Payloads and TrieRangeQuery
On Wed, Jun 10, 2009 at 6:07 PM, Yonik Seeley wrote: > Really goes into Solr land... my pref for Lucene is to remain a core > expert-level full-text search library and keep out things that are > easy to do in an application or at another level. I think this must be the crux of our disagreement. I feel, instead, that Lucene should stand on its own, as a useful search library, with a consumable API, good defaults, etc. Lucene is more than "the expert level search API that's embedded in Solr". Lucene is consumed directly by apps other than Solr. In fact, I think there are many things in Solr that naturally belong in Lucene (and over time we've been gradually slurping them down). The line/criteria has always been rather blurry... In Lucene, we should be able to add a NumericField to a document, index it, and then create RangeFilter or Sort on that field and have things "just work". Unfortunately, we are far from being able to do so today. We can (and should, for 2.9) make baby steps towards this, by absorbing trie* into core, but you're still gonna have to "just know" to make the specific FieldCache parser and specific NumericRangeQuery to match how you indexed. It's awkward and not consumable, but I think it'll just have to do for now... Mike
Re: Payloads and TrieRangeQuery
On Wed, Jun 10, 2009 at 5:45 PM, Michael McCandless wrote: > But, I realize this is a stretch... eg we'd have to fix rewrite to be > per-segment, which certainly seems spooky. A top-level schema would > definitely be cleaner. Really goes into Solr land... my pref for Lucene is to remain a core expert-level full-text search library and keep out things that are easy to do in an application or at another level. Having to specify what type of field you are sorting on really doesn't seem like a hardship. -Yonik http://www.lucidimagination.com
RE: Payloads and TrieRangeQuery
> > Another question not so simple to answer: When embedding these > TermPositions > > into the whole process, how would this work with MultiTermQuery? > > There's no reason why Trie has to use MultiTermQuery, right? No, but it is elegant and simplifies things a lot (see current code in trunk). Uwe
RE: Payloads and TrieRangeQuery
> I think we'd need richer communication between MTQ and its subclasses, > so that eg your enum would return a Query instead of a Term? > > Then you'd either return a TermQuery, or, a BooleanQuery that's > filtering the TermQuery? > > But yes, doing after 3.0 seems good! There is one other thing that needs to wait for 3.x: If you then want to sort against such a field or use the trie values for function queries in a field cache, we can have a really fast numeric UninverterValueSource, because there are fewer terms, each with many documents. The value to store in the cache is only (prefixCodedToLong(term) << positionBits | termPosition) - cool. Would be really fast! (But for that we need the new field cache stuff.) ...now going to sleep with many ideas buzzing around. Uwe
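Uwe's cache value `(prefixCodedToLong(term) << positionBits | termPosition)` is plain bit packing: the coarse value comes from the (fewer, shared) terms, and the low-order bits come back from each document's term position. A hedged sketch with invented names (there is no real `PackedTrieValue` class in Lucene):

```java
public class PackedTrieValue {
    // Recombine the coarse prefix value (from the term) with the fine
    // low-order bits (from the term position) into one long, as Uwe
    // describes for a fast uninverted field cache. positionBits is the
    // number of low-order bits stored in the position instead of the term.
    public static long pack(long prefixValue, int termPosition, int positionBits) {
        return (prefixValue << positionBits)
             | (termPosition & ((1L << positionBits) - 1));
    }

    public static void main(String[] args) {
        int positionBits = 8;
        long original = 0xCAFEBABEL;
        long prefix = original >>> positionBits;    // what the term encodes
        int position = (int) (original & 0xFF);     // what the position encodes
        // Packing recovers the full original value.
        System.out.println(pack(prefix, position, positionBits) == original);
    }
}
```

Because one coarse term covers many documents, the uninverter walks far fewer terms than a full-precision index would need, which is the speedup Uwe is after.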
Re: Payloads and TrieRangeQuery
> Another question not so simple to answer: When embedding these TermPositions > into the whole process, how would this work with MultiTermQuery? There's no reason why Trie has to use MultiTermQuery, right? -Yonik http://www.lucidimagination.com
Re: Payloads and TrieRangeQuery
On Wed, Jun 10, 2009 at 5:24 PM, Yonik Seeley wrote: > On Wed, Jun 10, 2009 at 5:03 PM, Michael McCandless > wrote: >> On Wed, Jun 10, 2009 at 4:04 PM, Earwin Burrfoot wrote: >> * Was the field even indexed w/ Trie, or indexed as "simple text"? > > Why the special treatment for Trie? So that at search time things default properly. Ie, RangeFilter would rewrite to the right impl (if we made a single RangeFilter that handled both term & numeric ranges), and sorting could pick the right parser. Ie, ideally one simply adds NumericField to their Document, indexes it, and then range filtering & sorting "just work". It's confusing now the separate steps you must go through to use trie, because Lucene doesn't remember that you indexed with trie. But, I realize this is a stretch... eg we'd have to fix rewrite to be per-segment, which certainly seems spooky. A top-level schema would definitely be cleaner. >> * We have a bug (or an important improvement) in how Trie encodes >>terms that we need to fix. This one is not easy to handle, since >>such a change could alter the term order, and merging segments >>then becomes problematic. Not sure how to handle that. Yonik, >>has Solr ever had to make a change to NumberUtils? > > Nope. If we needed to, we would make a new field type so that > existing schemas/indexes would continue to work. OK seems like Lucene should take the same approach. Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Payloads and TrieRangeQuery
I think we'd need richer communication between MTQ and its subclasses, so that eg your enum would return a Query instead of a Term? Then you'd either return a TermQuery, or, a BooleanQuery that's filtering the TermQuery? But yes, doing after 3.0 seems good! Mike On Wed, Jun 10, 2009 at 5:26 PM, Uwe Schindler wrote: >> I would like to go forward with moving the classes into the right packages >> and optimize the way, how queries and analyzers are created (only one >> class >> for each). The idea from LUCENE-1673 to use static factories to create >> these >> classes for the different data types seems to be more elegant and simplier >> to maintain than the current way (having a class for each bit size). >> >> So I think I will start with 1673 and try to present something useable, >> soon >> (but without payloads, so the payload/position-bits setting is "0"). > > Another question not so simple to answer: When embedding these TermPositions > into the whole process, how would this work with MultiTermQuery? The current > algorithm is simple: The TrieRangeTermEnum simply enumerates the possible > terms from the index and MTQ creates the BitSet or a BooleanQuery of > TermQueries. How to do this with positions? In both cases there need > specialities (the TermEnum must return that the actual term is a > payload/position one and must filter using TermPositions). For the filter > its then easy, the TermQueries added to BooleanQuery in the other case must > also use the payloads. Questions & more questions. > > I tend to release TrieRange with 2.9 without Positions/Payloads. > > Uwe > > > - > To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-dev-h...@lucene.apache.org > > - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Payloads and TrieRangeQuery
> * Was the field even indexed w/ Trie, or indexed as "simple text"? > It's useful to know this "automatically" at search time, so eg a > RangeQuery can do the right thing by default. FieldInfos seems > like the natural place to store this. It's basically Lucene's > per-segment write-once schema. Eg we use this to record "did any > token in this field have a Payload?", which is analogous. This should really be in a schema of some kind (like in my project, for instance). Why do you do autodetection for tries, but recently removed it for FieldCache? Things should be consistent: either store all settings in the index (and die in the process), or don't store them there at all. > * We have a bug (or an important improvement) in how Trie encodes > terms that we need to fix. This one is not easy to handle, since > such a change could alter the term order, and merging segments > then becomes problematic. Not sure how to handle that. Yonik, > has Solr ever had to make a change to NumberUtils? There are cases when reindexing is inevitable. What's so horrible about it anyway? Even if you have a humongous index, you can rebuild it in a matter of days, and you don't do this often. -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785
RE: Payloads and TrieRangeQuery
> I would like to go forward with moving the classes into the right packages > and optimize the way, how queries and analyzers are created (only one > class > for each). The idea from LUCENE-1673 to use static factories to create > these > classes for the different data types seems to be more elegant and simpler > to maintain than the current way (having a class for each bit size). > > So I think I will start with 1673 and try to present something usable, > soon > (but without payloads, so the payload/position-bits setting is "0"). Another question not so simple to answer: When embedding these TermPositions into the whole process, how would this work with MultiTermQuery? The current algorithm is simple: the TrieRangeTermEnum simply enumerates the possible terms from the index, and MTQ creates the BitSet or a BooleanQuery of TermQueries. How to do this with positions? In both cases special handling is needed (the TermEnum must indicate that the actual term is a payload/position one and must filter using TermPositions). For the filter it's then easy; in the other case, the TermQueries added to the BooleanQuery must also use the payloads. Questions & more questions. I tend to release TrieRange with 2.9 without Positions/Payloads. Uwe
Re: Payloads and TrieRangeQuery
On Wed, Jun 10, 2009 at 5:03 PM, Michael McCandless wrote: > On Wed, Jun 10, 2009 at 4:04 PM, Earwin Burrfoot wrote: > * Was the field even indexed w/ Trie, or indexed as "simple text"? Why the special treatment for Trie? > It's useful to know this "automatically" at search time, so eg a > RangeQuery can do the right thing by default. FieldInfos seems > like the natural place to store this. It's basically Lucene's > per-segment write-once schema. Eg we use this to record "did any > token in this field have a Payload?", which is analogous. It doesn't seem analogous to me. Trie is just another implementation for numerics with its own tradeoffs. > * We have a bug (or an important improvement) in how Trie encodes > terms that we need to fix. This one is not easy to handle, since > such a change could alter the term order, and merging segments > then becomes problematic. Not sure how to handle that. Yonik, > has Solr ever had to make a change to NumberUtils? Nope. If we needed to, we would make a new field type so that existing schemas/indexes would continue to work. -Yonik
Re: Payloads and TrieRangeQuery
On Wed, Jun 10, 2009 at 5:07 PM, Uwe Schindler wrote: > I would really like to leave this optimization out for 2.9. We can still add > this after 2.9 as an optimization. The number of bits encoded into the > TermPosition (this is really a cool idea, thanks Yonik, I was missing > exactly that, because you do not need to convert the bits, you can directly > put them into the index as int and use them on the query side!) is simply 0 > for indexes created with 2.9. With later versions, you could also shift the > lower bits into the TermPosition and tell TrieRange to filter them. I agree, let's aim for after 3.0 for this. (Note that, in theory, 3.0 should follow quickly after 2.9, having "only" removed deprecated APIs, changed settings, etc.) Can you open an issue & mark it as 3.1? > I would like to go forward with moving the classes into the right packages > and optimize the way, how queries and analyzers are created (only one class > for each). The idea from LUCENE-1673 to use static factories to create these > classes for the different data types seems to be more elegant and simpler > to maintain than the current way (having a class for each bit size). +1 > So I think I will start with 1673 and try to present something usable, soon > (but without payloads, so the payload/position-bits setting is "0"). > Now the open question: Which name for the numeric range queries/fields? :-( How about:

Range* -> TermRange*
TrieRange* -> NumericRange*
FieldCacheRangeFilter -> FieldCacheTermRangeFilter
ConstantScoreRangeQuery stays as is (it's deprecated)

Are there any others that need renaming? Mike
RE: Payloads and TrieRangeQuery
> On Wed, Jun 10, 2009 at 3:43 PM, Michael McCandless > wrote: > > On Wed, Jun 10, 2009 at 3:19 PM, Yonik > Seeley wrote: > > > >>> And this information about the trie > >>> structure and where payloads are should be stored in FieldInfos. > >> > >> As is the case today, the info is encoded in the class you use (and > >> it's settings)... no need to add it to the index structure. In any > >> case, it's a completely different issue and shouldn't be tied to > >> TrieRange improvements. > > > > The problem is, because the details of Trie* at index time affect > > what's in each segment, this information needs to be stored per > > segment. > > That's the case with the analysis for every field. If you change your > analyzer in a non-compatible fashion, you need to re-index. I agree with Mike to store information like the data type in the index, but on the other hand, Yonik is correct, too. If I change my analyzer (and TrieTokenStream is in fact one, an analyzer that creates tokens out of a number), I have to reindex. The problem with storing different indexing settings (precisionStep, payload/position bits) per segment makes merging nearly impossible, so I would not do this (see also Earwins comment about that). About releasing 2.9: I would really like to leave this optimization out for 2.9. We can still add this after 2.9 as an optimization. The number of bits encoded into the TermPosition (this is really a cool idea, thanks Yonik, I was missing exactly that, because you do not need to convert the bits, you can directly put them into the index as int and use them on the query side!) is simply 0 for indexes created with 2.9. With later versions, you could also shift the lower bits into the TermPosition and tell TrieRange to filter them. I would like to go forward with moving the classes into the right packages and optimize the way, how queries and analyzers are created (only one class for each). 
The idea from LUCENE-1673 to use static factories to create these classes for the different data types seems to be more elegant and simpler to maintain than the current way (having a class for each bit size). So I think I will start with 1673 and try to present something usable, soon (but without payloads, so the payload/position-bits setting is "0"). Now the open question: Which name for the numeric range queries/fields? :-( Uwe
Re: Payloads and TrieRangeQuery
On Wed, Jun 10, 2009 at 4:04 PM, Earwin Burrfoot wrote: > And then, when you merge segments indexed with different Trie* > settings, you need to convert them to some common form. > Sounds like something too complex and with minimum returns. Oh yeah... tricky. So... there are various situations to handle with trie: * Was the field even indexed w/ Trie, or indexed as "simple text"? It's useful to know this "automatically" at search time, so eg a RangeQuery can do the right thing by default. FieldInfos seems like the natural place to store this. It's basically Lucene's per-segment write-once schema. Eg we use this to record "did any token in this field have a Payload?", which is analogous. * How did you tune your payload-vs-trie-range setting. OK, I agree: this is most similar to "you changed your analyzer in an incompatible way, so, you have to reindex". Plus, during merging we can't [easily] translate this. So we shouldn't try to keep track of this. * We have a bug (or an important improvement) in how Trie encodes terms that we need to fix. This one is not easy to handle, since such a change could alter the term order, and merging segments then becomes problematic. Not sure how to handle that. Yonik, has Solr ever had to make a change to NumberUtils? Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Payloads and TrieRangeQuery
On Wed, Jun 10, 2009 at 3:43 PM, Michael McCandless wrote: > On Wed, Jun 10, 2009 at 3:19 PM, Yonik Seeley > wrote: > >>> And this information about the trie >>> structure and where payloads are should be stored in FieldInfos. >> >> As is the case today, the info is encoded in the class you use (and >> it's settings)... no need to add it to the index structure. In any >> case, it's a completely different issue and shouldn't be tied to >> TrieRange improvements. > > The problem is, because the details of Trie* at index time affect > what's in each segment, this information needs to be stored per > segment. That's the case with the analysis for every field. If you change your analyzer in a non-compatible fashion, you need to re-index. -Yonik http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Payloads and TrieRangeQuery
>>> And this information about the trie >>> structure and where payloads are should be stored in FieldInfos. >> >> As is the case today, the info is encoded in the class you use (and >> it's settings)... no need to add it to the index structure. In any >> case, it's a completely different issue and shouldn't be tied to >> TrieRange improvements. > > The problem is, because the details of Trie* at index time affect > what's in each segment, this information needs to be stored per > segment. And then, when you merge segments indexed with different Trie* settings, you need to convert them to some common form. Sounds like something too complex and with minimum returns. -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Re: Payloads and TrieRangeQuery
On Wed, Jun 10, 2009 at 3:19 PM, Yonik Seeley wrote: >> And this information about the trie >> structure and where payloads are should be stored in FieldInfos. > > As is the case today, the info is encoded in the class you use (and > it's settings)... no need to add it to the index structure. In any > case, it's a completely different issue and shouldn't be tied to > TrieRange improvements. The problem is, because the details of Trie* at index time affect what's in each segment, this information needs to be stored per segment. Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Payloads and TrieRangeQuery
On Wed, Jun 10, 2009 at 3:07 PM, Uwe Schindler wrote: >> I wonder how performance would compare. Without payloads, there are >> many more terms (for the tiny ranges) in the index, and your OR query >> will have lots of these tiny terms. But then these tiny terms don't >> hit many docs, and with BooleanScorer (which we should switch to for >> OR queries) ought not be very costly. > > That is true. The main idea was also to limit seeking during the query. > When splitting the range, you often need to start new TermEnums and iterate > over lots of terms. By catching many docs with fewer terms, you only need to > scan forward in the payloads. OK, though we should separately test "cold" searches (seeking matters) and "hot" searches (seeking doesn't). And we should separately test SSD vs spinning drive for the cold case. Seeking is much less costly (though still more costly than "hot" searches) with SSDs... >> Vs w/ payloads having to use >> TermPositions, having to load, decode & check the payload, and I guess >> assuming on average that 1/2 the docs are filtered out. > > Maybe decoding the payload is not needed; I would encode the bounds as > byte[] and compare the arrays. But you would filter about half of the docs > out. Yonik's idea (encoding in the position) seems great here. > My problem with all this is how to optimize after which shift value to > switch between terms and payloads. Presumably you'd "roughly" balance seek time vs "wasted doc filtered out" time to set the default, and make it configurable. > And this information about the trie > structure and where payloads are should be stored in FieldInfos. > > As we now search on each segment separately, this information can be stored > per segment and also used for each per-segment Filter/Scorer. Right, I think it should, but I agree w/ Yonik (partially) that it's orthogonal. 
> The whole thing works out of the box with TrieRangeFilter (it's just > iterating over terms, getting TermDocs/TermPositions, and setting bits after > checking the payloads when available), for TrieRangeQuery using > BooleanQuery it is more complicated (MTQ cannot simply add the terms from > the FilteredTermEnum to a BooleanQuery). Seems like we should generalize MTQ so that the subclass could return which clause should be added to the BQ for each term? (We also still need to improve MTQ to decouple constant scoring from "use BQ or filter"... there's an issue open for that.) > Until now I had no time to think about it in detail, but if TrieRange > can move to Core and store trie-specific FieldInfos per > segment, it will become clearer how to manage this in the API. I'd really like to see TrieRange in core for 2.9... Mike
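Uwe's byte[]-compare idea above (skip decoding the payload, compare encoded bounds directly) could be sketched roughly as follows. This is a hedged illustration, not TrieRange code: the class name, the big-endian sign-flipped encoding, and the helper methods are all assumptions.

```java
// Sketch: check an encoded payload against encoded range bounds without
// decoding back to a long. Big-endian, sign-flipped encoding makes unsigned
// lexicographic byte comparison agree with signed numeric order.
class PayloadBoundsSketch {
    /** Big-endian encoding with the sign bit flipped, so signed longs sort as unsigned bytes. */
    static byte[] encode(long value) {
        long v = value ^ 0x8000000000000000L; // flip sign bit: signed order -> byte order
        byte[] b = new byte[8];
        for (int i = 0; i < 8; i++) b[i] = (byte) (v >>> (56 - 8 * i));
        return b;
    }

    /** Unsigned lexicographic comparison of equal-length byte arrays. */
    static int compareUnsigned(byte[] a, byte[] b) {
        for (int i = 0; i < a.length; i++) {
            int x = a[i] & 0xFF, y = b[i] & 0xFF;
            if (x != y) return x < y ? -1 : 1;
        }
        return 0;
    }

    /** True if payload lies inside [lower, upper]; all three encoded with encode(). */
    static boolean inRange(byte[] payload, byte[] lower, byte[] upper) {
        return compareUnsigned(payload, lower) >= 0
            && compareUnsigned(payload, upper) <= 0;
    }
}
```

With this shape, the filter loop only ever compares byte arrays; the roughly half of candidate docs falling outside the bounds are rejected without any numeric decoding.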
Re: Payloads and TrieRangeQuery
On Wed, Jun 10, 2009 at 3:07 PM, Uwe Schindler wrote: > My problem with all this is how to optimize after which shift value to > switch between terms and payloads. Just make it a configurable number of bits at the end that are "stored" instead of indexed. People will want to select different tradeoffs anyway. What about using the position (as opposed to a payload) to encode the last bits? Should be faster, no? > And this information about the trie > structure and where payloads are should be stored in FieldInfos. As is the case today, the info is encoded in the class you use (and its settings)... no need to add it to the index structure. In any case, it's a completely different issue and shouldn't be tied to TrieRange improvements. -Yonik http://www.lucidimagination.com
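Yonik's position idea can be sketched as splitting each value into an indexed prefix (the term) and the trailing bits (smuggled into the posting's position, which is cheaper to read than a payload). A hypothetical helper, not Lucene code; the names and the split point are assumptions:

```java
// Sketch: index only the top bits as the term; carry the low `bits` in the
// token's position, so range-edge filtering reads positions, not payloads.
class PositionEncodingSketch {
    /** Returns {termPrefix, position}: the value split after dropping `bits` low bits. */
    static long[] split(long value, int bits) {
        long mask = (1L << bits) - 1;
        return new long[] { value >>> bits, value & mask };
    }

    /** Reassemble the full-precision value from a term prefix and a position. */
    static long join(long prefix, long position, int bits) {
        return (prefix << bits) | position;
    }
}
```

The number of trailing bits would be the configurable tradeoff Yonik mentions: more bits stored in the position means fewer indexed terms but more docs to filter at the range edges.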
RE: Payloads and TrieRangeQuery
> Ooh that sounds compelling! > > So you would not need to use payloads for the "inside" brackets, > right? Only for the edges? Exactly. > I wonder how performance would compare. Without payloads, there are > many more terms (for the tiny ranges) in the index, and your OR query > will have lots of these tiny terms. But then these tiny terms don't > hit many docs, and with BooleanScorer (which we should switch to for > OR queries) ought not be very costly. That is true. The main idea was also to limit seeking during the query. When splitting the range, you often need to start new TermEnums and iterate over lots of terms. By catching many docs with fewer terms, you only need to scan forward in the payloads. > Vs w/ payloads having to use > TermPositions, having to load, decode & check the payload, and I guess > assuming on average that 1/2 the docs are filtered out. Maybe decoding the payload is not needed; I would encode the bounds as byte[] and compare the arrays. But you would filter about half of the docs out. My problem with all this is how to optimize after which shift value to switch between terms and payloads. And this information about the trie structure and where payloads are should be stored in FieldInfos. As we now search on each segment separately, this information can be stored per segment and also used for each per-segment Filter/Scorer. The whole thing works out of the box with TrieRangeFilter (it's just iterating over terms, getting TermDocs/TermPositions, and setting bits after checking the payloads when available), for TrieRangeQuery using BooleanQuery it is more complicated (MTQ cannot simply add the terms from the FilteredTermEnum to a BooleanQuery). Until now I had no time to think about it in detail, but if TrieRange can move to Core and store trie-specific FieldInfos per segment, it will become clearer how to manage this in the API. 
Uwe > On Wed, Jun 10, 2009 at 2:28 PM, Uwe Schindler wrote: > > Hi, sorry I missed the first mail. > > > > > > > > The idea we discussed in Amsterdam during ApacheCon was: > > > > > > > > Instead of indexing all trie precisions from e.g. the leftmost 8 bits > downto > > all 64 bits, the TrieTokenStream only creates terms from e.g. precisions > 8 > > to 56. The last precision is left out. Instead the last term (precision > 56) > > contains the highest precision as payload. > > > > On the query side, TrieRangeQuery would create the filter bitmap as > before > > until it reaches the lowest available precision with the payloads. > Instead > > of further splitting this precision into terms, all TermPositions > instead of > > just TermDocs are listed, but only those set in the result BitSet, that > have > > the payload inside the range bounds. By this the trie query first > selects > > large ranges in the middle like before, but uses the highest (but not > full > > precision term) to select more docids than needed but filters them with > the > > payload. > > > > > > > > With String Dates (the simplified example Michael Busch shows in his > talk): > > > > Searching all docs from 2005-11-10 to 2008-03-11 with current trierange > > variant would select terms 2005-11-10 to 2005-11-30, then the whole > > December, the whole years 2006 and 2007 and so on. With payloads, > trierange > > would select only whole months (November, December, 2006, 2007, Jan, > Feb, > > Mar). At the ends the payloads are used to filter out the days in Nov > 2005 > > and Mar 2008. > > > > > > > > With the latest TrieRange impl this would be possible to implement > (because > > the TrieTokenStreams now used for indexing could create the payloads). > Only > > the searching side would no longer so simple implemented as yet. My > > biggest problem is how to configure this optimal and make the API clean. > > > > > > > > Was it understandable? 
(It's complicated, I know) > > > > > > > > - > > Uwe Schindler > > H.-H.-Meier-Allee 63, D-28213 Bremen > > http://www.thetaphi.de > > eMail: u...@thetaphi.de > > > > > > > > From: Jason Rutherglen [mailto:jason.rutherg...@gmail.com] > > Sent: Wednesday, June 10, 2009 7:59 PM > > To: java-dev@lucene.apache.org > > Subject: Re: Payloads and TrieRangeQuery > > > > > > > > I think instead of ORing postings (trie range, rangequery, etc), have a > > custom Query + Scorer that examines the payload (somehow)? It could > encode > > the multiple levels of trie bits in it? (I'm just guessing here).
Re: Payloads and TrieRangeQuery
Yep, makes sense. It could be a little slower, but it would decrease the number of terms indexed by a factor of 256 (for 8 bits). But the payload part... seems like another case of using that because CSF isn't there yet, right? (well, perhaps except if you didn't want to store the field...) -Yonik http://www.lucidimagination.com On Wed, Jun 10, 2009 at 2:28 PM, Uwe Schindler wrote: > Hi, sorry I missed the first mail. > > > > The idea we discussed in Amsterdam during ApacheCon was: > > > > Instead of indexing all trie precisions from e.g. the leftmost 8 bits downto > all 64 bits, the TrieTokenStream only creates terms from e.g. precisions 8 > to 56. The last precision is left out. Instead the last term (precision 56) > contains the highest precision as payload. > > On the query side, TrieRangeQuery would create the filter bitmap as before > until it reaches the lowest available precision with the payloads. Instead > of further splitting this precision into terms, all TermPositions instead of > just TermDocs are listed, but only those set in the result BitSet, that have > the payload inside the range bounds. By this the trie query first selects > large ranges in the middle like before, but uses the highest (but not full > precision term) to select more docids than needed but filters them with the > payload. > > > > With String Dates (the simplified example Michael Busch shows in his talk): > > Searching all docs from 2005-11-10 to 2008-03-11 with current trierange > variant would select terms 2005-11-10 to 2005-11-30, then the whole > December, the whole years 2006 and 2007 and so on. With payloads, trierange > would select only whole months (November, December, 2006, 2007, Jan, Feb, > Mar). At the ends the payloads are used to filter out the days in Nov 2005 > and Mar 2008. > > > > With the latest TrieRange impl this would be possible to implement (because > the TrieTokenStreams now used for indexing could create the payloads). 
Only > the searching side would no longer so “simple” implemented as yet. My > biggest problem is how to configure this optimal and make the API clean. > > > > Was it understandable? (Its complicated, I know) > > > > - > Uwe Schindler > H.-H.-Meier-Allee 63, D-28213 Bremen > http://www.thetaphi.de > eMail: u...@thetaphi.de > > ____________ > > From: Jason Rutherglen [mailto:jason.rutherg...@gmail.com] > Sent: Wednesday, June 10, 2009 7:59 PM > To: java-dev@lucene.apache.org > Subject: Re: Payloads and TrieRangeQuery > > > > I think instead of ORing postings (trie range, rangequery, etc), have a > custom Query + Scorer that examines the payload (somehow)? It could encode > the multiple levels of trie bits in it? (I'm just guessing here). > > On Wed, Jun 10, 2009 at 4:04 AM, Michael McCandless > wrote: > > Use them how? (Sounds interesting...). > > Mike > > On Tue, Jun 9, 2009 at 10:32 PM, Jason > Rutherglen wrote: >> At the SF Lucene User's group, Michael Busch mentioned using >> payloads with TrieRangeQueries. Is this something that's being >> worked on? I'm interested in what sort performance benefits >> there would be to this method? >> > > - > To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-dev-h...@lucene.apache.org > >
Re: Payloads and TrieRangeQuery
Ooh that sounds compelling! So you would not need to use payloads for the "inside" brackets, right? Only for the edges? I wonder how performance would compare. Without payloads, there are many more terms (for the tiny ranges) in the index, and your OR query will have lots of these tiny terms. But then these tiny terms don't hit many docs, and with BooleanScorer (which we should switch to for OR queries) ought not be very costly. Vs w/ payloads having to use TermPositions, having to load, decode & check the payload, and I guess assuming on average that 1/2 the docs are filtered out. Mike On Wed, Jun 10, 2009 at 2:28 PM, Uwe Schindler wrote: > Hi, sorry I missed the first mail. > > > > The idea we discussed in Amsterdam during ApacheCon was: > > > > Instead of indexing all trie precisions from e.g. the leftmost 8 bits downto > all 64 bits, the TrieTokenStream only creates terms from e.g. precisions 8 > to 56. The last precision is left out. Instead the last term (precision 56) > contains the highest precision as payload. > > On the query side, TrieRangeQuery would create the filter bitmap as before > until it reaches the lowest available precision with the payloads. Instead > of further splitting this precision into terms, all TermPositions instead of > just TermDocs are listed, but only those set in the result BitSet, that have > the payload inside the range bounds. By this the trie query first selects > large ranges in the middle like before, but uses the highest (but not full > precision term) to select more docids than needed but filters them with the > payload. > > > > With String Dates (the simplified example Michael Busch shows in his talk): > > Searching all docs from 2005-11-10 to 2008-03-11 with current trierange > variant would select terms 2005-11-10 to 2005-11-30, then the whole > December, the whole years 2006 and 2007 and so on. With payloads, trierange > would select only whole months (November, December, 2006, 2007, Jan, Feb, > Mar). 
At the ends the payloads are used to filter out the days in Nov 2005 > and Mar 2008. > > > > With the latest TrieRange impl this would be possible to implement (because > the TrieTokenStreams now used for indexing could create the payloads). Only > the searching side would no longer so “simple” implemented as yet. My > biggest problem is how to configure this optimal and make the API clean. > > > > Was it understandable? (Its complicated, I know) > > > > - > Uwe Schindler > H.-H.-Meier-Allee 63, D-28213 Bremen > http://www.thetaphi.de > eMail: u...@thetaphi.de > > ____________ > > From: Jason Rutherglen [mailto:jason.rutherg...@gmail.com] > Sent: Wednesday, June 10, 2009 7:59 PM > To: java-dev@lucene.apache.org > Subject: Re: Payloads and TrieRangeQuery > > > > I think instead of ORing postings (trie range, rangequery, etc), have a > custom Query + Scorer that examines the payload (somehow)? It could encode > the multiple levels of trie bits in it? (I'm just guessing here). > > On Wed, Jun 10, 2009 at 4:04 AM, Michael McCandless > wrote: > > Use them how? (Sounds interesting...). > > Mike > > On Tue, Jun 9, 2009 at 10:32 PM, Jason > Rutherglen wrote: >> At the SF Lucene User's group, Michael Busch mentioned using >> payloads with TrieRangeQueries. Is this something that's being >> worked on? I'm interested in what sort performance benefits >> there would be to this method? >> > > - > To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-dev-h...@lucene.apache.org > >
RE: Payloads and TrieRangeQuery
Hi, sorry I missed the first mail. The idea we discussed in Amsterdam during ApacheCon was: Instead of indexing all trie precisions from e.g. the leftmost 8 bits down to all 64 bits, the TrieTokenStream only creates terms from e.g. precisions 8 to 56. The last precision is left out. Instead the last term (precision 56) contains the highest precision as payload. On the query side, TrieRangeQuery would create the filter bitmap as before until it reaches the lowest available precision with the payloads. Instead of further splitting this precision into terms, all TermPositions instead of just TermDocs are listed, but only those set in the result BitSet that have the payload inside the range bounds. By this the trie query first selects large ranges in the middle like before, but uses the highest (but not full precision) term to select more docids than needed and filters them with the payload. With String Dates (the simplified example Michael Busch shows in his talk): Searching all docs from 2005-11-10 to 2008-03-11 with the current trierange variant would select terms 2005-11-10 to 2005-11-30, then the whole December, the whole years 2006 and 2007, and so on. With payloads, trierange would select only whole months (November, December, 2006, 2007, Jan, Feb, Mar). At the ends the payloads are used to filter out the days in Nov 2005 and Mar 2008. With the latest TrieRange impl this would be possible to implement (because the TrieTokenStreams now used for indexing could create the payloads). Only the searching side would no longer be so "simple" to implement. My biggest problem is how to configure this optimally and make the API clean. Was it understandable? 
(It's complicated, I know) - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de From: Jason Rutherglen [mailto:jason.rutherg...@gmail.com] Sent: Wednesday, June 10, 2009 7:59 PM To: java-dev@lucene.apache.org Subject: Re: Payloads and TrieRangeQuery I think instead of ORing postings (trie range, rangequery, etc), have a custom Query + Scorer that examines the payload (somehow)? It could encode the multiple levels of trie bits in it? (I'm just guessing here). On Wed, Jun 10, 2009 at 4:04 AM, Michael McCandless wrote: Use them how? (Sounds interesting...). Mike On Tue, Jun 9, 2009 at 10:32 PM, Jason Rutherglen wrote: > At the SF Lucene User's group, Michael Busch mentioned using > payloads with TrieRangeQueries. Is this something that's being > worked on? I'm interested in what sort of performance benefits > there would be to this method? >
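Uwe's indexing scheme can be sketched in plain Java. This is a hypothetical illustration only (class, field, and method names are invented, not the actual TrieTokenStream API): index precisions 8..56, leave out the full precision, and attach the full-precision value as a payload on the lowest-precision (shift-8) term.

```java
// Sketch of the proposal: one term per precision, shifts 56 down to 8,
// with the full 64-bit value carried as a payload only on the last level.
class TriePayloadSketch {
    /** One indexed term: a shifted prefix plus, on the last level only, a payload. */
    static final class TrieTerm {
        final int shift;        // number of low-order bits dropped from the value
        final long prefix;      // value >>> shift; this is what gets indexed
        final byte[] payload;   // big-endian full-precision value; null except at shift 8
        TrieTerm(int shift, long prefix, byte[] payload) {
            this.shift = shift; this.prefix = prefix; this.payload = payload;
        }
    }

    /** Big-endian encoding of the full 64-bit value, used as the payload. */
    static byte[] encode(long value) {
        byte[] b = new byte[8];
        for (int i = 0; i < 8; i++) b[i] = (byte) (value >>> (56 - 8 * i));
        return b;
    }

    /** Emit one term per precision, shift 56 down to 8; no full-precision term. */
    static TrieTerm[] terms(long value) {
        TrieTerm[] out = new TrieTerm[7];
        int i = 0;
        for (int shift = 56; shift >= 8; shift -= 8) {
            byte[] payload = (shift == 8) ? encode(value) : null;
            out[i++] = new TrieTerm(shift, value >>> shift, payload);
        }
        return out;
    }
}
```

At query time, the inner brackets of a range are matched purely by the shift-16..56 terms as before; only at the range edges would the shift-8 postings be walked via TermPositions, with the payload checked against the bounds.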
Re: Payloads and TrieRangeQuery
I think instead of ORing postings (trie range, rangequery, etc), have a custom Query + Scorer that examines the payload (somehow)? It could encode the multiple levels of trie bits in it? (I'm just guessing here). On Wed, Jun 10, 2009 at 4:04 AM, Michael McCandless < luc...@mikemccandless.com> wrote: > Use them how? (Sounds interesting...). > > Mike > > On Tue, Jun 9, 2009 at 10:32 PM, Jason > Rutherglen wrote: > > At the SF Lucene User's group, Michael Busch mentioned using > > payloads with TrieRangeQueries. Is this something that's being > > worked on? I'm interested in what sort performance benefits > > there would be to this method? > > > > - > To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-dev-h...@lucene.apache.org > >
Re: Payloads and TrieRangeQuery
Use them how? (Sounds interesting...). Mike On Tue, Jun 9, 2009 at 10:32 PM, Jason Rutherglen wrote: > At the SF Lucene User's group, Michael Busch mentioned using > payloads with TrieRangeQueries. Is this something that's being > worked on? I'm interested in what sort of performance benefits > there would be to this method? >