Payload class

2012-08-29 Thread Benson Margulies
I'm failing to find advice in MIGRATE.txt on how to replace 'new
Payload(...)' in migrating to 4.0.  What am I missing?


Re: Payload class

2012-08-29 Thread Robert Muir
Replace with BytesRef (https://issues.apache.org/jira/browse/LUCENE-4122)

I'll add a note to MIGRATE.txt about this.



-- 
lucidworks.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



ResourceLoader?

2012-08-29 Thread Benson Margulies
Our Solr 3.x code used init(ResourceLoader) and then called the loader to
read a file.

What's the new approach to reading content from files in the 'usual place'?


Re: ResourceLoader?

2012-08-29 Thread Robert Muir
On Wed, Aug 29, 2012 at 10:10 AM, Benson Margulies ben...@basistech.com wrote:
 Our Solr 3.x code used init(ResourceLoader) and then called the loader to
 read a file.

 What's the new approach to reading content from files in the 'usual place'?

I'm not aware of init(ResourceLoader), only inform(ResourceLoader). Is
that what you meant?

I added some javadocs on the lifecycle of these factories the other
day (please review, possible doc bugs!):
https://builds.apache.org/job/Lucene-Artifacts-4.x/javadoc/analyzers-common/org/apache/lucene/analysis/util/AbstractAnalysisFactory.html

Here are some examples:

Parses a tab-separated file (using getLines: UTF-8):
http://svn.apache.org/repos/asf/lucene/dev/branches/branch_4x/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/StemmerOverrideFilterFactory.java

Parses a file of its own format (using specified encoding):
http://svn.apache.org/repos/asf/lucene/dev/branches/branch_4x/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseTokenizerFactory.java
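
For readers without the source at hand, here is a rough sketch of the
inform(ResourceLoader) pattern in plain Java, with no Lucene or Solr
dependency. The ResourceLoader interface, OverrideFactory, InMemoryLoader,
and the file format below are hypothetical stand-ins for illustration, not
the real API. The factory does no I/O in its constructor; it reads a
tab-separated UTF-8 resource only once inform() hands it a loader.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a resource loader: maps a resource name to a stream.
interface ResourceLoader {
    InputStream openResource(String name) throws IOException;
}

// Hypothetical factory: configured with a file name, informed with a loader later.
class OverrideFactory {
    private final String fileName;
    private final Map<String, String> overrides = new LinkedHashMap<>();

    OverrideFactory(String fileName) {
        this.fileName = fileName; // no I/O here
    }

    // Called once the loader is available; parses "word<TAB>stem" lines as UTF-8.
    void inform(ResourceLoader loader) throws IOException {
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(loader.openResource(fileName), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.isEmpty() || line.startsWith("#")) continue; // skip blanks and comments
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) overrides.put(parts[0], parts[1]);
            }
        }
    }

    String lookup(String word) {
        return overrides.get(word);
    }
}

// In-memory loader used only for demonstration.
class InMemoryLoader implements ResourceLoader {
    private final Map<String, String> files = new LinkedHashMap<>();

    void put(String name, String content) {
        files.put(name, content);
    }

    @Override
    public InputStream openResource(String name) throws IOException {
        String content = files.get(name);
        if (content == null) throw new IOException("no such resource: " + name);
        return new ByteArrayInputStream(content.getBytes(StandardCharsets.UTF_8));
    }
}
```

Keeping file access inside inform() means the same factory can be pointed at
a different loader (classpath, config directory, in-memory for tests) without
changing its parsing code.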




Re: ResourceLoader?

2012-08-29 Thread Benson Margulies
That's what I meant, thanks.





Re: ResourceLoader?

2012-08-29 Thread Benson Margulies
I'm confused. Isn't inform/ResourceLoader deprecated? But your examples use
it?






Using a char filter in solr createComponents

2012-08-29 Thread Benson Margulies
I'm close to the bottom of my list here.

I've got an Analyzer that, in 3.1, set up a CharFilter in the tokenStream
method. So now I have to migrate that to createComponents. Can someone give
me a shove in the right direction?


Re: ResourceLoader?

2012-08-29 Thread Robert Muir
On Wed, Aug 29, 2012 at 10:27 AM, Benson Margulies ben...@basistech.com wrote:
 I'm confused. Isn't inform/ResourceLoader deprecated? But your example use
 it?


Where is it deprecated? What does the deprecation message say?




Re: ResourceLoader?

2012-08-29 Thread Benson Margulies
On Wed, Aug 29, 2012 at 10:30 AM, Robert Muir rcm...@gmail.com wrote:

 On Wed, Aug 29, 2012 at 10:27 AM, Benson Margulies ben...@basistech.com
 wrote:
  I'm confused. Isn't inform/ResourceLoader deprecated? But your example
 use
  it?
 

 Where is it deprecated? What does the deprecation message say?


I see. It moved from one package to another. Sorry for the noise.






Re: Using a char filter in solr createComponents

2012-08-29 Thread Robert Muir
Sure. You need to wrap the reader in Analyzer.initReader() to do this:

  /**
   * Override this if you want to add a CharFilter chain.
   */
  protected Reader initReader(String fieldName, Reader reader) {
    return reader;
  }

By default it returns the original Reader unchanged.

There are examples in the analysis package javadocs
(https://builds.apache.org/job/Lucene-Artifacts-4.x/javadoc/core/org/apache/lucene/analysis/package-summary.html#package_description),
but we don't yet have one that uses a CharFilter. I'll see if I can
add one.
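
To make the shape of the hook concrete without pulling in Lucene, here is a
self-contained sketch in plain Java. ToyAnalyzer, LowercasingAnalyzer,
LowercasingReader, and tokens() are hypothetical names, and a
java.io.FilterReader stands in for a CharFilter; the real Analyzer and
CharFilter signatures may differ. The only point is the pattern: the base
class routes every Reader through initReader() before tokenizing, and a
subclass overrides the hook to wrap the Reader.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

// Stand-in for a CharFilter: lowercases characters as they are read.
class LowercasingReader extends FilterReader {
    LowercasingReader(Reader in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        return c == -1 ? -1 : Character.toLowerCase((char) c);
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        for (int i = off; i < off + n; i++) {
            buf[i] = Character.toLowerCase(buf[i]);
        }
        return n;
    }
}

// Hypothetical analyzer: splits on whitespace after passing the Reader through initReader().
class ToyAnalyzer {
    // Default hook: return the original Reader unchanged.
    protected Reader initReader(String fieldName, Reader reader) {
        return reader;
    }

    public final List<String> tokens(String fieldName, Reader reader) throws IOException {
        Reader in = initReader(fieldName, reader);
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            if (Character.isWhitespace(c)) {
                if (current.length() > 0) {
                    out.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append((char) c);
            }
        }
        if (current.length() > 0) out.add(current.toString());
        return out;
    }
}

// Subclass that adds the character-filtering step by overriding the hook.
class LowercasingAnalyzer extends ToyAnalyzer {
    @Override
    protected Reader initReader(String fieldName, Reader reader) {
        return new LowercasingReader(reader);
    }
}
```

Because the wrapping happens in the hook rather than in the token-producing
code, the tokenizing logic stays unaware of whether any character filtering
is in place.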




Re: ResourceLoader?

2012-08-29 Thread Benson Margulies
Hang on:

[deprecation] org.apache.solr.util.plugin.ResourceLoaderAware in
org.apache.solr.util.plugin has been deprecated







Re: ResourceLoader?

2012-08-29 Thread Robert Muir
Right, and what does the @deprecated message say? :)




Re: ResourceLoader?

2012-08-29 Thread Chris Male
Yeah I deprecated that originally when I first started the work on
migrating the Analysis Factories, to reduce the upgrade load.

 




-- 
Chris Male | Open Source Search Developer | elasticsearch |
www.elasticsearch.com


Re: ResourceLoader?

2012-08-29 Thread Benson Margulies
On Wed, Aug 29, 2012 at 10:42 AM, Robert Muir rcm...@gmail.com wrote:

 Right and what does the @deprecated message say :)


Yes, indeed, sorry. I got caught in a maze of twisty passages and my brain
turned off. I'm better now.







Re: Using a char filter in solr createComponents

2012-08-29 Thread Robert Muir
http://svn.apache.org/viewvc/lucene/dev/branches/branch_4x/lucene/core/src/java/org/apache/lucene/analysis/package.html?r1=1378591&r2=1378590&pathrev=1378591




Re: Using coord of one BooleanQuery as a multiplier for its siblings

2012-08-29 Thread Robert Muir
On Mon, Aug 27, 2012 at 8:31 PM, pranshu sharma pran...@cis.upenn.edu wrote:
 Hi there,

 I had a question about migrating the coord value one level up. My current
 query structure has a root BooleanQuery with a bunch of nested BooleanQuery
 children: one of these looks for all terms in the query issued, and I want
 to apply the coord factor for this BooleanQuery to all its siblings. The
 idea here is to have the overall score reflect the number of tokens matched.

 For example, if the query is 'apple banana orange', my query structure
 looks like this:

 BQ0 (root): coord = 1
   BQ1: (coord enabled)
     BQ1a: +apple
     BQ1b: +banana
     BQ1c: +orange
   BQ2: (coord = 1)
     BQ2a: 'apple banana'
     BQ2b: 'banana orange'
     ...

 (BQ2 basically takes care of bigrams.)

 Now, if I have two documents that look like:

 Doc A: apple orange banana
 Doc B: apple banana

 Doc A would get a higher overall score if the score from BQ2 is multiplied
 with coord for BQ1 (=2/3), than if it wasn't. Could you recommend how I go
 about implementing this idea?

I think you want to use BooleanQuery's disableCoord option for BQ2 and BQ0?




reset versus setReader on TokenStream

2012-08-29 Thread Benson Margulies
I've read the javadoc through a few times, but I confess that I'm still
feeling dense.

Are all tokenizers responsible for implementing some way of retaining the
contents of their reader, so that a call to reset without a call to
setReader rewinds? I note that CharTokenizer doesn't implement #reset,
which leads me to suspect that I'm not responsible for the rewind behavior.


Re: reset versus setReader on TokenStream

2012-08-29 Thread Robert Muir
OK, let's help improve it: I think these have likely always been confusing.

Before, they were both called reset: reset() and reset(Reader), even though
they are unrelated. I thought the rename would help this :)

Does the TokenStream workflow here help?
http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/analysis/TokenStream.html
Basically reset() is a mandatory thing the consumer must call. It just
means 'reset any mutable state so you can be reused for processing
again'.
This is something on any TokenStream: Tokenizers, TokenFilters, or
even some direct descendant you make that parses byte arrays, or
whatever.

This means that if you are keeping some state across tokens (like
StopFilter's #skippedTokens), here is where you would set that = 0
again.

setReader(Reader) is only on Tokenizer; it means replace the Reader
with a different one to be processed.
The fact that CharTokenizer is doing 'reset()-like stuff' in here is
bogus IMO, but I don't think it will cause any bugs. Don't emulate it
:)




Re: reset versus setReader on TokenStream

2012-08-29 Thread Benson Margulies
On Wed, Aug 29, 2012 at 3:37 PM, Robert Muir rcm...@gmail.com wrote:

 ok, lets help improve it: I think these have likely always been confusing.

 before they were both reset: reset() and reset(Reader), even though
 they are unrelated. I thought the rename would help this :)

 Does the TokenStream workflow here help?

 http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/analysis/TokenStream.html
 Basically reset() is a mandatory thing the consumer must call. it just
 means 'reset any mutable state so you can be reused for processing
 again'.


I really did read this. setReader I get; I don't understand what reset
accomplishes. What does it mean to reuse a TokenStream without calling
setReader to supply a new input? If it means reuse the old input, who does
the rewinding?









Re: reset versus setReader on TokenStream

2012-08-29 Thread Robert Muir
On Wed, Aug 29, 2012 at 3:45 PM, Benson Margulies ben...@basistech.com wrote:

 I really did read this. setReader I get; I don't understand what reset
 accomplishes. What does it mean to reuse one a TokenStream without calling
 setReader to supply a new input?

TokenStream is more generic; it doesn't have to take a Reader. It can
take anything you want: e.g. a String or a byte array of your Word
document or whatever.

Tokenizer is a subclass that takes a Reader. It's the only thing that has
setReader.

reset() doesn't mean rewind. It just means clearing any accumulated
internal state so it's ready for processing again.

So if I made a StringTokenizer class that extends Tokenizer, I would
probably add setString(String s) to it so I could set new String
objects on it, but consumers must always call reset() on the entire
chain (the outer stop filters, synonym filters, all this stuff that
might be keeping state). This reset() call chains down all
TokenStreams.
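
A toy model of that chaining, in plain Java rather than Lucene, may help.
ToyStream, ToyFilter, IteratorSource, and SkipShortFilter are hypothetical
names, not Lucene classes. The filter base class forwards reset() to its
input, a stateful filter clears its own counter as well, and the source only
receives new input through its own setter. Note that reset() on the source
clears a counter but does not rewind the iterator.

```java
import java.util.Collections;
import java.util.Iterator;

// Hypothetical minimal stream contract: next() returns a token, or null at the end.
abstract class ToyStream {
    abstract String next();

    void reset() {
        // default: stateless, nothing to clear
    }
}

// Source of tokens; new input arrives through a plain setter, like setString(String).
class IteratorSource extends ToyStream {
    private Iterator<String> input = Collections.emptyIterator();
    int emitted = 0; // per-use state, cleared in reset()

    void setInput(Iterator<String> input) {
        this.input = input;
    }

    @Override
    String next() {
        if (!input.hasNext()) return null;
        emitted++;
        return input.next();
    }

    @Override
    void reset() {
        emitted = 0; // clears state only; the iterator is not rewound
    }
}

// Base filter: by default just passes reset() down the chain.
abstract class ToyFilter extends ToyStream {
    protected final ToyStream input;

    ToyFilter(ToyStream input) {
        this.input = input;
    }

    @Override
    void reset() {
        input.reset();
    }
}

// Stateful filter: drops tokens shorter than 3 chars and counts how many it skipped.
class SkipShortFilter extends ToyFilter {
    int skipped = 0;

    SkipShortFilter(ToyStream input) {
        super(input);
    }

    @Override
    String next() {
        String token;
        while ((token = input.next()) != null) {
            if (token.length() >= 3) return token;
            skipped++;
        }
        return null;
    }

    @Override
    void reset() {
        super.reset(); // chain down first
        skipped = 0;   // then clear our own state
    }
}
```

A consumer of this sketch calls setInput() on the source, then reset() on the
outermost filter, then pulls tokens until next() returns null.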




Re: reset versus setReader on TokenStream

2012-08-29 Thread Benson Margulies
 Some interlinear commentary on the doc.

* Resets this stream to the beginning.

To me this implies a rewind.  As previously noted, I don't see how this
works for the existing implementations.

   * As all TokenStreams must be reusable,
   * any implementations which have state that needs to be reset between usages
   * of the TokenStream, must implement this method. Note that if your TokenStream
   * caches tokens and feeds them back again after a reset,

What's the alternative? What happens with all the existing Tokenizers that
have no special implementation of #reset()?

   * it is imperative
   * that you clone the tokens when you store them away (on the first pass) as
   * well as when you return them (on future passes after {@link #reset()}).


Re: reset versus setReader on TokenStream

2012-08-29 Thread Benson Margulies
I think I'm beginning to get the idea. Is the following plausible?

At the bottom of the stack, there's an actual source of data -- like a
tokenizer. For one of those, reset() is a bit silly, and something like
setReader is the brains of the operation.

Some number of other components may be stacked up on top of the source of
data, and these may have local state. Calling #reset prepares them for new
data to emerge from the actual source of data.


Re: reset versus setReader on TokenStream

2012-08-29 Thread Robert Muir
On Wed, Aug 29, 2012 at 3:54 PM, Benson Margulies ben...@basistech.com wrote:
  Some interlinear commentary on the doc.

 * Resets this stream to the beginning.

 To me this implies a rewind.  As previously noted, I don't see how this
 works for the existing implementations.

It's not a rewind. The javadocs here are not good. We need to fix them
to be clear :)


 * As all TokenStreams must be reusable,
 * any implementations which have state that needs to be reset between usages
 * of the TokenStream, must implement this method. Note that if your TokenStream
 * caches tokens and feeds them back again after a reset,

 What's the alternative? What happens with all the existing Tokenizers that
 have no special implementation of #reset()?

Perhaps these Tokenizers have no state to reset()? Lots of TokenStream
classes are stateless.
If you are stateless, then you don't need to implement this method. You
get the default implementation: e.g. TokenFilter's just passes it down
the chain (input.reset()), and I think Tokenizer/TokenStream are
no-ops.





Re: reset versus setReader on TokenStream

2012-08-29 Thread Robert Muir
On Wed, Aug 29, 2012 at 3:58 PM, Benson Margulies ben...@basistech.com wrote:
 I think I'm beginning to get the idea. Is the following plausible?

 At the bottom of the stack, there's an actual source of data -- like a
 tokenizer. For one of those, reset() is a bit silly, and something like
 setReader is the brains of the operation.

Actually I think setReader() is silly in most cases for Tokenizers.
Most tokenizers should never override this (in fact technically we
could make it final or something, to make it super-clear, but that
might be a bit over the top).

The default implementation in Tokenizer.java should almost always
suffice, as it does what you expect a setter would do in java:

  public void setReader(Reader input) throws IOException {
    assert input != null: "input must not be null";
    this.input = input;
  }

So let's take your CharTokenizer example:

  @Override
  public void setReader(Reader input) throws IOException {
    super.setReader(input);
    bufferIndex = 0;
    offset = 0;
    dataLen = 0;
    finalOffset = 0;
    ioBuffer.reset(); // make sure to reset the IO buffer!!
  }

Really this is bogus. I think it should not override this method at
all, and instead should do:

  @Override
  public void reset() throws IOException {
    // reset our internal state
    bufferIndex = 0;
    offset = 0;
    dataLen = 0;
    finalOffset = 0;
    ioBuffer.reset(); // make sure to reset the IO buffer!!
  }

Does that make sense?
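
To see why the buffer state belongs in reset(), here is a standalone sketch
in plain Java. ToyCharTokenizer and nextToken() are hypothetical names that
only loosely mirror the fields above; this is not the CharTokenizer source.
setReader() is a plain setter, and reset() is what discards characters still
buffered from the previous input.

```java
import java.io.IOException;
import java.io.Reader;

// Hypothetical whitespace tokenizer with a small read-ahead buffer.
class ToyCharTokenizer {
    private Reader input;
    private final char[] ioBuffer = new char[16];
    private int bufferIndex = 0; // next unread position in ioBuffer
    private int dataLen = 0;     // number of valid chars in ioBuffer
    private int offset = 0;      // chars consumed from the current input so far

    // Plain setter: swaps the input and touches nothing else.
    public void setReader(Reader input) {
        if (input == null) throw new IllegalArgumentException("input must not be null");
        this.input = input;
    }

    // Clears per-input state so the instance can be used again.
    public void reset() {
        bufferIndex = 0;
        dataLen = 0;
        offset = 0;
    }

    // Returns the next whitespace-delimited token, or null at end of input.
    public String nextToken() throws IOException {
        StringBuilder token = new StringBuilder();
        while (true) {
            if (bufferIndex >= dataLen) { // buffer exhausted: refill from the Reader
                dataLen = input.read(ioBuffer, 0, ioBuffer.length);
                bufferIndex = 0;
                if (dataLen <= 0) {       // end of input
                    dataLen = 0;
                    return token.length() > 0 ? token.toString() : null;
                }
            }
            char c = ioBuffer[bufferIndex++];
            offset++;
            if (Character.isWhitespace(c)) {
                if (token.length() > 0) return token.toString();
            } else {
                token.append(c);
            }
        }
    }

    public int offset() {
        return offset;
    }
}
```

If a consumer stops after the first token of "alpha beta" and then supplies a
new Reader, the characters of "beta" are still sitting in ioBuffer; calling
reset() is what throws them away before the new input is read.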




Re: reset versus setReader on TokenStream

2012-08-29 Thread Benson Margulies
If I'm following, you've created a division of labor between setReader and
reset.

We have a tokenizer that has a good deal of state, since it has to split
the input into chunks. If I'm following here, you'd recommend that we do
nothing special in setReader, but have #reset fix up all the state on the
assumption that we are starting from the beginning of something, and
we'd reinitialize our chunker over what was sitting in the protected
'input'. If someone called #setReader and neglected to call #reset, awful
things would happen, but you've warned them.

To me, it seemed natural to overload #setReader so that our tokenizer was
in a consistent state once it was called. It occurs to me to wonder about
order: if #reset is called before #setReader, I'm up the creek unless I copy
my reset implementation into a local override of #setReader.


RE: reset versus setReader on TokenStream

2012-08-29 Thread Uwe Schindler
Hi,
 
 To me, it seemed natural to overload #setReader so that our tokenizer was in a
 consistent state once it was called. It occurs to me to wonder about
 order: if #reset is called before #setReader, I'm up creek unless I copy my
 reset implementation into a local override of #setReader.

The order is defined in the TokenStream and Tokenizer Javadocs. First call
setReader on the Tokenizer; after that, the *consumer* has to call reset() on
the chain of filters. When a user uses your Tokenizer, he will set a new Reader
and then pass it to the indexer. The indexer (the consumer) will then call
reset() before incrementToken() is called for the first time. In Lucene's
BaseTokenStreamTestCase, this is asserted to be correct.

Uwe





Re: reset versus setReader on TokenStream

2012-08-29 Thread Robert Muir
On Wed, Aug 29, 2012 at 4:18 PM, Benson Margulies ben...@basistech.com wrote:
 If I'm following, you've created a division of labor between setReader and
 reset.

That's not true. setReader shouldn't be doing any labor. It's really only
a setter!

One possibility here is to make it final (though it's not obvious to me
that it would clear up the situation; I think javadocs are more
important here).


 We have a tokenizer that has a good deal of state, since it has to split
 the input into chunks. If I'm following here, you'd recommend that we do
 nothing special in setReader, but have #reset fix up all the state on the
 assumption that we are are starting from the beginning of something, and
 we'd reinitialize our chunker over what was sitting in the protected
 'input'. If someone called #setReader and neglected to call #reset, awful
 things would happen, but you've warned them.

If someone called setReader and neglected to call reset, awful things
will happen to them in general. They would be violating the contracts
of the API and the workflow described in the javadocs.

That's why we test as much consumer code as possible against
MockTokenizer (from the test-framework package). It has a state machine
that will fail if you do this.
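
The idea of such a state machine can be sketched in a few lines of plain
Java. CheckedTokenizer and its states are hypothetical and far simpler than
MockTokenizer's actual checks; in particular, this toy only accepts a new
Reader at the start or once the previous input is exhausted, which is an
assumption made to keep the example small. Each method verifies that the
consumer is in the expected step of the workflow and fails loudly otherwise.

```java
import java.io.IOException;
import java.io.Reader;

// Hypothetical tokenizer that enforces the consumer workflow with a small state machine.
class CheckedTokenizer {
    private enum State { NEW, READER_SET, RESET, EXHAUSTED }

    private State state = State.NEW;
    private Reader input;

    public void setReader(Reader input) {
        // Toy rule: a Reader may be supplied at the start or after the previous one is used up.
        if (state != State.NEW && state != State.EXHAUSTED) {
            throw new IllegalStateException("setReader() called in state " + state);
        }
        this.input = input;
        state = State.READER_SET;
    }

    public void reset() {
        // reset() is only valid right after setReader().
        if (state != State.READER_SET) {
            throw new IllegalStateException("reset() called in state " + state);
        }
        state = State.RESET;
    }

    // Emits one character per token; returns null when the input is exhausted.
    public String incrementToken() throws IOException {
        if (state != State.RESET) {
            throw new IllegalStateException("incrementToken() called in state " + state);
        }
        int c = input.read();
        if (c == -1) {
            state = State.EXHAUSTED;
            return null;
        }
        return String.valueOf((char) c);
    }
}
```

With this in place, a consumer that skips reset(), or calls it before any
Reader has been set, gets an IllegalStateException instead of silently
producing wrong tokens.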


 To me, it seemed natural to overload #setReader so that our tokenizer was
 in a consistent state once it was called. It occurs to me to wonder about
 order: if #reset is called before #setReader, I'm up creek unless I copy my
 reset implementation into a local override of #setReader.

This would also be a violation on the consumer's part (also detected
by MockTokenizer, in case you have such consumers like queryparsers or
whatever you want to test).
