Re: reset versus setReader on TokenStream
ok, let's help improve it: I think these have likely always been confusing. Before the rename they were both called reset: reset() and reset(Reader), even though they are unrelated. I thought the rename would help this :)

Does the TokenStream workflow here help?
http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/analysis/TokenStream.html

Basically reset() is a mandatory thing the consumer must call. It just means 'reset any mutable state so you can be reused for processing again'. This is something on any TokenStream: Tokenizers, TokenFilters, or even some direct descendant you make that parses byte arrays, or whatever. So if you are keeping some state across tokens (like StopFilter's skippedTokens), this is where you would set it back to 0.

setReader(Reader) is only on Tokenizer; it means replace the Reader with a different one to be processed.

The fact that CharTokenizer is doing 'reset()-like stuff' in setReader is bogus IMO, but I don't think it will cause any bugs. Don't emulate it :)

On Wed, Aug 29, 2012 at 3:29 PM, Benson Margulies <ben...@basistech.com> wrote:
> I've read the javadoc through a few times, but I confess that I'm still
> feeling dense. Are all tokenizers responsible for implementing some way
> of retaining the contents of their reader, so that a call to reset
> without a call to setReader rewinds? I note that CharTokenizer doesn't
> implement #reset, which leads me to suspect that I'm not responsible for
> the rewind behavior.

-- 
lucidworks.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
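[To make the reset() contract above concrete, here is a minimal sketch in plain Java. SimpleStream and CountingFilter are invented stand-ins for illustration, not Lucene's actual TokenStream/TokenFilter API; the point is only that reset() clears per-stream mutable state (the way StopFilter would zero its skipped-token count) and chains the call down, and is not a rewind of the input.]

```java
// Simplified stand-ins for illustration only; the real Lucene classes
// (TokenStream/TokenFilter) have a richer API than this.
abstract class SimpleStream {
    // Clear any state accumulated while producing tokens, so the stream
    // can be consumed again. NOT a rewind of the underlying input.
    public void reset() { }
}

class CountingFilter extends SimpleStream {
    private final SimpleStream input;
    int tokensSeen = 0;  // mutable per-stream state, like StopFilter's skippedTokens

    CountingFilter(SimpleStream input) { this.input = input; }

    void onToken() { tokensSeen++; }  // called once per token produced

    @Override
    public void reset() {
        input.reset();   // chain the call down, as TokenFilter does
        tokensSeen = 0;  // back to a clean slate for the next use
    }
}
```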
Re: reset versus setReader on TokenStream
On Wed, Aug 29, 2012 at 3:37 PM, Robert Muir <rcm...@gmail.com> wrote:
> ok, let's help improve it: I think these have likely always been
> confusing. Before they were both reset: reset() and reset(Reader), even
> though they are unrelated. I thought the rename would help this :)
>
> Does the TokenStream workflow here help?
> http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/analysis/TokenStream.html
>
> Basically reset() is a mandatory thing the consumer must call. It just
> means 'reset any mutable state so you can be reused for processing
> again'.

I really did read this. setReader I get; I don't understand what reset
accomplishes. What does it mean to reuse a TokenStream without calling
setReader to supply a new input? If it means reuse the old input, who does
the rewinding?

> This is something on any TokenStream: Tokenizers, TokenFilters, or even
> some direct descendant you make that parses byte arrays, or whatever. So
> if you are keeping some state across tokens (like StopFilter's
> skippedTokens), this is where you would set it back to 0.
>
> setReader(Reader) is only on Tokenizer; it means replace the Reader with
> a different one to be processed.
>
> The fact that CharTokenizer is doing 'reset()-like stuff' in setReader is
> bogus IMO, but I don't think it will cause any bugs. Don't emulate it :)
Re: reset versus setReader on TokenStream
On Wed, Aug 29, 2012 at 3:45 PM, Benson Margulies <ben...@basistech.com> wrote:
> I really did read this. setReader I get; I don't understand what reset
> accomplishes. What does it mean to reuse a TokenStream without calling
> setReader to supply a new input?

TokenStream is more generic; it doesn't have to take Reader. It can take
anything you want: e.g. a String or a byte array of your Word document or
whatever. Tokenizer is a subclass that takes Reader; it's the only thing
that has setReader.

reset() doesn't mean rewind. It just means clearing any accumulated
internal state so the stream is ready for processing again.

So if I made a StringTokenizer class that extends Tokenizer, I would
probably add setString(String s) to it so I could set new string objects
on it, but consumers must always call reset() on the entire chain (the
outer stop filters, synonym filters, all this stuff that might be keeping
state). This reset() call chains down all tokenstreams.
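[Robert's hypothetical StringTokenizer idea could look roughly like the sketch below. This is a self-contained plain-Java stand-in, not a real Lucene Tokenizer (which would extend org.apache.lucene.analysis.Tokenizer and expose tokens via attributes); setString is the invented setter he describes, playing the role setReader plays for Reader-based input, while reset() only clears accumulated position state.]

```java
// Hypothetical sketch of the StringTokenizer idea from the message above.
class StringTokenizerSketch {
    private String text = "";
    private int pos = 0;  // mutable scan position, cleared by reset()

    // Analogue of setReader(Reader): just a setter for new input.
    void setString(String s) { text = s; }

    // Clear accumulated state; the consumer must call this before
    // iterating, and again before every reuse.
    void reset() { pos = 0; }

    // Returns the next whitespace-delimited token, or null when done.
    String nextToken() {
        while (pos < text.length() && text.charAt(pos) == ' ') pos++;
        if (pos >= text.length()) return null;
        int start = pos;
        while (pos < text.length() && text.charAt(pos) != ' ') pos++;
        return text.substring(start, pos);
    }
}
```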
Re: reset versus setReader on TokenStream
Some interlinear commentary on the doc.

  * Resets this stream to the beginning.

To me this implies a rewind. As previously noted, I don't see how this
works for the existing implementations.

  * As all TokenStreams must be reusable,
  * any implementations which have state that needs to be reset between usages
  * of the TokenStream, must implement this method. Note that if your TokenStream
  * caches tokens and feeds them back again after a reset,

What's the alternative? What happens with all the existing Tokenizers that
have no special implementation of #reset()?

  * it is imperative
  * that you clone the tokens when you store them away (on the first pass) as
  * well as when you return them (on future passes after {@link #reset()}).
Re: reset versus setReader on TokenStream
I think I'm beginning to get the idea. Is the following plausible?

At the bottom of the stack, there's an actual source of data -- like a
tokenizer. For one of those, reset() is a bit silly, and something like
setReader is the brains of the operation. Some number of other components
may be stacked up on top of the source of data, and these may have local
state. Calling #reset prepares them for new data to emerge from the actual
source of data.
Re: reset versus setReader on TokenStream
On Wed, Aug 29, 2012 at 3:54 PM, Benson Margulies <ben...@basistech.com> wrote:
> Some interlinear commentary on the doc.
>
>   * Resets this stream to the beginning.
>
> To me this implies a rewind. As previously noted, I don't see how this
> works for the existing implementations.

It's not a rewind. The javadocs here are not good. We need to fix them to
be clear :)

>   * As all TokenStreams must be reusable,
>   * any implementations which have state that needs to be reset between usages
>   * of the TokenStream, must implement this method. Note that if your TokenStream
>   * caches tokens and feeds them back again after a reset,
>
> What's the alternative? What happens with all the existing Tokenizers
> that have no special implementation of #reset()?

Perhaps these Tokenizers have no state to reset()? Lots of tokenstream
classes are stateless. If you are stateless, then you don't need to
implement this method. You get the default implementation: e.g.
TokenFilter's just passes it down the chain (input.reset()), and I think
Tokenizer's/TokenStream's are no-ops.
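[The chaining described above, where the default TokenFilter implementation just passes reset() down via input.reset(), can be sketched with a toy chain. Stream and Filter are invented stand-ins, not the real Lucene classes; the sketch logs each filter as it resets, showing that one reset() call on the outermost stream reaches the whole stack.]

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in classes showing how one reset() call on the outermost
// stream propagates down a stacked chain, TokenFilter-style.
abstract class Stream {
    void reset() { }  // default: stateless, nothing to do (a no-op)
}

class Filter extends Stream {
    private final Stream input;
    private final String name;
    private final List<String> log;  // records who got reset, in order

    Filter(String name, Stream input, List<String> log) {
        this.name = name;
        this.input = input;
        this.log = log;
    }

    @Override
    void reset() {
        input.reset();  // chain down first, like TokenFilter.reset()
        log.add(name);  // then clear our own state (just logged here)
    }
}
```

A consumer would build, say, a stop filter over a synonym filter over a source, then call reset() once on the outer filter; every layer gets cleared.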
Re: reset versus setReader on TokenStream
On Wed, Aug 29, 2012 at 3:58 PM, Benson Margulies <ben...@basistech.com> wrote:
> I think I'm beginning to get the idea. Is the following plausible?
>
> At the bottom of the stack, there's an actual source of data -- like a
> tokenizer. For one of those, reset() is a bit silly, and something like
> setReader is the brains of the operation.

Actually I think setReader() is silly in most cases for Tokenizers. Most
tokenizers should never override this (in fact technically we could make
it final or something, to make it super-clear, but that might be a bit
over the top).

The default implementation in Tokenizer.java should almost always suffice,
as it does what you expect a setter would do in Java:

  public void setReader(Reader input) throws IOException {
    assert input != null : "input must not be null";
    this.input = input;
  }

So let's take your CharTokenizer example:

  @Override
  public void setReader(Reader input) throws IOException {
    super.setReader(input);
    bufferIndex = 0;
    offset = 0;
    dataLen = 0;
    finalOffset = 0;
    ioBuffer.reset(); // make sure to reset the IO buffer!!
  }

Really this is bogus; I think it should not override this method at all,
and instead should do:

  @Override
  public void reset() throws IOException {
    // reset our internal state
    bufferIndex = 0;
    offset = 0;
    dataLen = 0;
    finalOffset = 0;
    ioBuffer.reset(); // make sure to reset the IO buffer!!
  }

Does that make sense?
Re: reset versus setReader on TokenStream
If I'm following, you've created a division of labor between setReader and
reset.

We have a tokenizer that has a good deal of state, since it has to split
the input into chunks. If I'm following here, you'd recommend that we do
nothing special in setReader, but have #reset fix up all the state on the
assumption that we are starting from the beginning of something, and we'd
reinitialize our chunker over what was sitting in the protected 'input'.
If someone called #setReader and neglected to call #reset, awful things
would happen, but you've warned them.

To me, it seemed natural to overload #setReader so that our tokenizer was
in a consistent state once it was called. It occurs to me to wonder about
order: if #reset is called before #setReader, I'm up the creek unless I
copy my reset implementation into a local override of #setReader.
RE: reset versus setReader on TokenStream
Hi,

> To me, it seemed natural to overload #setReader so that our tokenizer
> was in a consistent state once it was called. It occurs to me to wonder
> about order: if #reset is called before #setReader, I'm up the creek
> unless I copy my reset implementation into a local override of
> #setReader.

The order is defined in the TokenStream and Tokenizer JavaDocs. First call
setReader on the Tokenizer, and after that the *consumer* has to call
reset() on the chain of filters. When a user uses your Tokenizer, he will
set a new Reader and then pass it to the indexer. The indexer (the
consumer) will then call reset() before incrementToken() is called for the
first time. In Lucene's BaseTokenStreamTestCase, this is asserted to be
correct.

Uwe
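[The ordering Uwe describes — setReader first, then reset(), then incrementToken() until the stream is exhausted — can be sketched as a tiny state machine, loosely in the spirit of the checks MockTokenizer performs on consumers. WorkflowChecker is an invented stand-in for illustration, not the real Lucene or test-framework API.]

```java
// Toy state machine enforcing the documented consumer workflow:
// setReader() -> reset() -> incrementToken()... Calling out of order
// throws, the way MockTokenizer fails misbehaving consumers.
class WorkflowChecker {
    private enum State { SETREADER, RESET, INCREMENT }
    private State state = State.SETREADER;
    private int remaining;

    // Analogue of Tokenizer.setReader: supply new input (here, just a
    // token count standing in for a Reader).
    void setReader(int tokenCount) {
        remaining = tokenCount;
        state = State.RESET;  // a reset() must come next
    }

    // Mandatory before consuming, and again before every reuse.
    void reset() {
        if (state == State.SETREADER)
            throw new IllegalStateException("setReader() was never called");
        state = State.INCREMENT;
    }

    // Analogue of incrementToken(): true while tokens remain.
    boolean incrementToken() {
        if (state != State.INCREMENT)
            throw new IllegalStateException("reset() must be called first");
        return remaining-- > 0;
    }
}
```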
Re: reset versus setReader on TokenStream
On Wed, Aug 29, 2012 at 4:18 PM, Benson Margulies <ben...@basistech.com> wrote:
> If I'm following, you've created a division of labor between setReader
> and reset.

That's not true. setReader shouldn't be doing any labor; it's really only
a setter! One possibility here is to make it final (though it's not
obvious to me that it would clear up the situation; I think javadocs are
more important here).

> We have a tokenizer that has a good deal of state, since it has to split
> the input into chunks. If I'm following here, you'd recommend that we do
> nothing special in setReader, but have #reset fix up all the state on
> the assumption that we are starting from the beginning of something, and
> we'd reinitialize our chunker over what was sitting in the protected
> 'input'. If someone called #setReader and neglected to call #reset,
> awful things would happen, but you've warned them.

If someone called setReader and neglected to call reset, awful things will
happen to them in general: they would be violating the contracts of the
API and the workflow described in the javadocs. That's why we test as much
consumer code as possible against MockTokenizer (from the test-framework
package). It has a state machine that will fail if you do this.

> To me, it seemed natural to overload #setReader so that our tokenizer
> was in a consistent state once it was called. It occurs to me to wonder
> about order: if #reset is called before #setReader, I'm up the creek
> unless I copy my reset implementation into a local override of
> #setReader.

This would also be a violation on the consumer's part (also detected by
MockTokenizer, in case you have such consumers, like query parsers or
whatever you want to test).