Re: reset versus setReader on TokenStream

2012-08-29 Thread Robert Muir
OK, let's help improve it: I think these have likely always been confusing.

Before, they were both named reset: reset() and reset(Reader), even though
they are unrelated. I thought the rename would help with this :)

Does the TokenStream workflow here help?
http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/analysis/TokenStream.html
Basically reset() is a mandatory thing the consumer must call. It just
means 'reset any mutable state so you can be reused for processing
again'.
This is something on any TokenStream: Tokenizers, TokenFilters, or
even some direct descendant you make that parses byte arrays, or
whatever.

This means if you are keeping some state across tokens (like
StopFilter's #skippedTokens), here is where you would set that back to 0.
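
To make that concrete, here is a rough sketch of such a filter. It is purely
hypothetical (the class name and counter are made up, only loosely modeled on
StopFilter's idea of skipped tokens), but it shows where per-input state gets cleared:

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Hypothetical filter that accumulates state across tokens: it counts how many
// tokens it has passed through for the current input.
public final class CountingFilter extends TokenFilter {
  private int seenTokens; // mutable state that builds up while processing one input

  public CountingFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (input.incrementToken()) {
      seenTokens++;
      return true;
    }
    return false;
  }

  @Override
  public void reset() throws IOException {
    super.reset();  // always chain reset() down to the wrapped stream
    seenTokens = 0; // clear our accumulated state so the filter can be reused
  }
}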

setReader(Reader) is only on Tokenizer; it means replace the Reader
with a different one to be processed.
The fact that CharTokenizer is doing 'reset()-like-stuff' in here is
bogus IMO, but I don't think it will cause any bugs. Don't emulate it
:)

On Wed, Aug 29, 2012 at 3:29 PM, Benson Margulies ben...@basistech.com wrote:
 I've read the javadoc through a few times, but I confess that I'm still
 feeling dense.

 Are all tokenizers responsible for implementing some way of retaining the
 contents of their reader, so that a call to reset without a call to
 setReader rewinds? I note that CharTokenizer doesn't implement #reset,
 which leads me to suspect that I'm not responsible for the rewind behavior.






Re: reset versus setReader on TokenStream

2012-08-29 Thread Benson Margulies
On Wed, Aug 29, 2012 at 3:37 PM, Robert Muir rcm...@gmail.com wrote:

 OK, let's help improve it: I think these have likely always been confusing.

 Before, they were both named reset: reset() and reset(Reader), even though
 they are unrelated. I thought the rename would help with this :)

 Does the TokenStream workflow here help?

 http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/analysis/TokenStream.html
 Basically reset() is a mandatory thing the consumer must call. It just
 means 'reset any mutable state so you can be reused for processing
 again'.


I really did read this. setReader I get; I don't understand what reset
accomplishes. What does it mean to reuse a TokenStream without calling
setReader to supply a new input? If it means reuse the old input, who does
the rewinding?





 This is something on any TokenStream: Tokenizers, TokenFilters, or
 even some direct descendant you make that parses byte arrays, or
 whatever.

 This means if you are keeping some state across tokens (like
 StopFilter's #skippedTokens), here is where you would set that back to 0.

 setReader(Reader) is only on Tokenizer; it means replace the Reader
 with a different one to be processed.
 The fact that CharTokenizer is doing 'reset()-like-stuff' in here is
 bogus IMO, but I don't think it will cause any bugs. Don't emulate it
 :)

 On Wed, Aug 29, 2012 at 3:29 PM, Benson Margulies ben...@basistech.com
 wrote:
  I've read the javadoc through a few times, but I confess that I'm still
  feeling dense.
 
  Are all tokenizers responsible for implementing some way of retaining the
  contents of their reader, so that a call to reset without a call to
  setReader rewinds? I note that CharTokenizer doesn't implement #reset,
  which leads me to suspect that I'm not responsible for the rewind
 behavior.







Re: reset versus setReader on TokenStream

2012-08-29 Thread Robert Muir
On Wed, Aug 29, 2012 at 3:45 PM, Benson Margulies ben...@basistech.com wrote:
 On Wed, Aug 29, 2012 at 3:37 PM, Robert Muir rcm...@gmail.com wrote:

 OK, let's help improve it: I think these have likely always been confusing.

 Before, they were both named reset: reset() and reset(Reader), even though
 they are unrelated. I thought the rename would help with this :)

 Does the TokenStream workflow here help?

 http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/analysis/TokenStream.html
 Basically reset() is a mandatory thing the consumer must call. It just
 means 'reset any mutable state so you can be reused for processing
 again'.


 I really did read this. setReader I get; I don't understand what reset
 accomplishes. What does it mean to reuse a TokenStream without calling
 setReader to supply a new input?

TokenStream is more generic; it doesn't have to take a Reader. It can
take anything you want: e.g. a String or a byte array of your Word
document or whatever.

Tokenizer is a subclass that takes a Reader. It's the only thing that has
setReader.

reset() doesn't mean rewind. It just means clearing any accumulated
internal state so it's ready for processing again.

So if I made a StringTokenizer class that extends TokenStream, I would
probably add setString(String s) to it so I could set new String objects
on it. But consumers must always call reset() on the entire chain (the
outer stop filters, synonym filters, all this stuff that might be keeping
state); this reset() call chains down through all the TokenStreams.
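
As a rough sketch of the kind of String-based source described above (all names
here are hypothetical, and it is built directly on TokenStream since that class
does not require a Reader): setString() is just a setter, analogous to
Tokenizer's setReader, while reset() clears the iteration state so the chain can
be consumed again:

import java.io.IOException;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical whitespace-splitting source that works on a String instead of a Reader.
public final class StringTokenStream extends TokenStream {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private String[] parts;
  private int index;

  // analogous to Tokenizer.setReader(): just a setter for new input
  public void setString(String s) {
    parts = s.split("\\s+");
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (parts == null || index >= parts.length) {
      return false;
    }
    clearAttributes();
    termAtt.setEmpty().append(parts[index]);
    index++;
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    index = 0; // clear mutable iteration state so the stream is ready for processing again
  }
}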




Re: reset versus setReader on TokenStream

2012-08-29 Thread Benson Margulies
 Some interlinear commentary on the doc.

* Resets this stream to the beginning.

To me this implies a rewind.  As previously noted, I don't see how this
works for the existing implementations.

   * As all TokenStreams must be reusable,
   * any implementations which have state that needs to be reset between usages
   * of the TokenStream, must implement this method. Note that if your TokenStream
   * caches tokens and feeds them back again after a reset,

What's the alternative? What happens with all the existing Tokenizers that
have no special implementation of #reset()?

   * it is imperative
   * that you clone the tokens when you store them away (on the first pass) as
   * well as when you return them (on future passes after {@link #reset()}).


Re: reset versus setReader on TokenStream

2012-08-29 Thread Benson Margulies
I think I'm beginning to get the idea. Is the following plausible?

At the bottom of the stack, there's an actual source of data -- like a
tokenizer. For one of those, reset() is a bit silly, and something like
setReader is the brains of the operation.

Some number of other components may be stacked up on top of the source of
data, and these may have local state. Calling #reset prepares them for new
data to emerge from the actual source of data.


Re: reset versus setReader on TokenStream

2012-08-29 Thread Robert Muir
On Wed, Aug 29, 2012 at 3:54 PM, Benson Margulies ben...@basistech.com wrote:
  Some interlinear commentary on the doc.

 * Resets this stream to the beginning.

 To me this implies a rewind.  As previously noted, I don't see how this
 works for the existing implementations.

It's not a rewind. The javadocs here are not good; we need to fix them
to be clear :)


* As all TokenStreams must be reusable,
* any implementations which have state that needs to be reset between usages
* of the TokenStream, must implement this method. Note that if your TokenStream
* caches tokens and feeds them back again after a reset,

 What's the alternative? What happens with all the existing Tokenizers that
 have no special implementation of #reset()?

Perhaps these Tokenizers have no state to reset()? Lots of TokenStream
classes are stateless.
If you are stateless, then you don't need to implement this method. You
get the default implementation: e.g. TokenFilter's just passes it down
the chain (input.reset()), and I think Tokenizer's/TokenStream's are
no-ops.
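
To illustrate the stateless case with a made-up example (the filter below is
hypothetical, not a Lucene class): nothing accumulates across tokens, so there
is no reset() override at all, and the inherited TokenFilter.reset(), which just
calls input.reset(), is enough:

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical stateless filter: it uppercases each term in place and keeps
// no state across tokens, so the default reset() chaining is all it needs.
public final class ShoutingFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  public ShoutingFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    char[] buffer = termAtt.buffer();
    for (int i = 0; i < termAtt.length(); i++) {
      buffer[i] = Character.toUpperCase(buffer[i]);
    }
    return true;
  }
  // no reset() override: there is nothing to clear
}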





Re: reset versus setReader on TokenStream

2012-08-29 Thread Robert Muir
On Wed, Aug 29, 2012 at 3:58 PM, Benson Margulies ben...@basistech.com wrote:
 I think I'm beginning to get the idea. Is the following plausible?

 At the bottom of the stack, there's an actual source of data -- like a
 tokenizer. For one of those, reset() is a bit silly, and something like
 setReader is the brains of the operation.

Actually I think setReader() is silly in most cases for Tokenizers.
Most tokenizers should never override it (in fact, technically we
could make it final or something to make it super-clear, but that
might be a bit over the top).

The default implementation in Tokenizer.java should almost always
suffice, as it does what you expect a setter to do in Java:

  public void setReader(Reader input) throws IOException {
    assert input != null : "input must not be null";
    this.input = input;
  }

So let's take your CharTokenizer example:

  @Override
  public void setReader(Reader input) throws IOException {
    super.setReader(input);
    bufferIndex = 0;
    offset = 0;
    dataLen = 0;
    finalOffset = 0;
    ioBuffer.reset(); // make sure to reset the IO buffer!!
  }

Really this is bogus; I think it should not override this method at
all, and instead should do:

  @Override
  public void reset() throws IOException {
    // reset our internal state
    bufferIndex = 0;
    offset = 0;
    dataLen = 0;
    finalOffset = 0;
    ioBuffer.reset(); // make sure to reset the IO buffer!!
  }

Does that make sense?




Re: reset versus setReader on TokenStream

2012-08-29 Thread Benson Margulies
If I'm following, you've created a division of labor between setReader and
reset.

We have a tokenizer that has a good deal of state, since it has to split
the input into chunks. If I'm following here, you'd recommend that we do
nothing special in setReader, but have #reset fix up all the state on the
assumption that we are starting from the beginning of something, and
we'd reinitialize our chunker over what was sitting in the protected
'input'. If someone called #setReader and neglected to call #reset, awful
things would happen, but you've warned them.

To me, it seemed natural to override #setReader so that our tokenizer was
in a consistent state once it was called. It occurs to me to wonder about
order: if #reset is called before #setReader, I'm up the creek unless I copy my
reset implementation into a local override of #setReader.


RE: reset versus setReader on TokenStream

2012-08-29 Thread Uwe Schindler
Hi,
 
 To me, it seemed natural to override #setReader so that our tokenizer was in a
 consistent state once it was called. It occurs to me to wonder about
 order: if #reset is called before #setReader, I'm up the creek unless I copy my
 reset implementation into a local override of #setReader.

The order is defined in the TokenStream and Tokenizer JavaDocs. First call 
setReader on the Tokenizer, and after that the *consumer* has to call reset() on 
the chain of filters. When a user uses your Tokenizer, he will set a new Reader 
and then pass it to the indexer. The indexer (the consumer) will then call reset() 
before incrementToken() is called for the first time. In Lucene's 
BaseTokenStreamTestCase, this is asserted to be correct.

Uwe
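
A minimal sketch of that consumer-side order, assuming a simple
WhitespaceTokenizer/StopFilter chain purely for illustration (the chain itself
is made up; the point is the setReader -> reset -> incrementToken sequence):

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class ConsumerOrderSketch {
  public static void main(String[] args) throws IOException {
    WhitespaceTokenizer tokenizer =
        new WhitespaceTokenizer(Version.LUCENE_40, new StringReader("a quick brown fox"));
    TokenStream chain =
        new StopFilter(Version.LUCENE_40, tokenizer, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
    CharTermAttribute term = chain.addAttribute(CharTermAttribute.class);

    chain.reset();                 // the consumer resets the whole chain first
    while (chain.incrementToken()) {
      System.out.println(term.toString());
    }
    chain.end();
    chain.close();

    // For the next document: tokenizer.setReader(newReader), then chain.reset() again.
  }
}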


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: reset versus setReader on TokenStream

2012-08-29 Thread Robert Muir
On Wed, Aug 29, 2012 at 4:18 PM, Benson Margulies ben...@basistech.com wrote:
 If I'm following, you've created a division of labor between setReader and
 reset.

That's not true. setReader shouldn't be doing any labor; it's really only
a setter!

One possibility here is to make it final (though it's not obvious to me
that it would clear up the situation; I think javadocs are more
important here).


 We have a tokenizer that has a good deal of state, since it has to split
 the input into chunks. If I'm following here, you'd recommend that we do
 nothing special in setReader, but have #reset fix up all the state on the
 assumption that we are starting from the beginning of something, and
 we'd reinitialize our chunker over what was sitting in the protected
 'input'. If someone called #setReader and neglected to call #reset, awful
 things would happen, but you've warned them.

If someone called setReader and neglected to call reset, awful things
will happen to them in general. They would be violating the contracts
of the API and the workflow described in the javadocs.

That's why we test as much consumer code as possible against
MockTokenizer (from the test-framework package). It has a state machine
that will fail if you do this.
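
Roughly, that means a test can drive consumer code through a MockTokenizer
instead of a real Tokenizer. A sketch, assuming the lucene-test-framework
MockTokenizer API (constructor taking a Reader, a token pattern, and a
lowercase flag); treat the exact failure it raises as a detail of that class:

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.MockTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class MockTokenizerWorkflowSketch {
  public static void main(String[] args) throws IOException {
    // MockTokenizer tracks its state internally; skipping reset() before
    // incrementToken() is expected to fail fast rather than misbehave silently.
    MockTokenizer ts = new MockTokenizer(new StringReader("hello world"),
                                         MockTokenizer.WHITESPACE, false);
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);

    ts.reset();                    // required by the workflow
    while (ts.incrementToken()) {
      System.out.println(term.toString());
    }
    ts.end();
    ts.close();
  }
}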


 To me, it seemed natural to override #setReader so that our tokenizer was
 in a consistent state once it was called. It occurs to me to wonder about
 order: if #reset is called before #setReader, I'm up the creek unless I copy my
 reset implementation into a local override of #setReader.

This would also be a violation on the consumer's part (also detected
by MockTokenizer, in case you have other consumers, like query parsers or
whatever, that you want to test).
