[ 
https://issues.apache.org/jira/browse/LUCENE-1906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753766#action_12753766
 ] 

Uwe Schindler commented on LUCENE-1906:
---------------------------------------

bq. Maybe for 3.0 we can declare that this will become a CharStream?

I think the instanceof check is less evil than:

{quote}
bq. Yes, it's relatively fast, but it's per-token too.

It is once per token. But you do not need to wrap the input Reader using 
CharReader if you do not want to use CharFilters. If you always wrap the 
Reader in a CharReader, you have a larger overhead (one additional method call 
per char read, if you tokenize using Reader.read()!).
{quote}

I think we should wait a while and sleep on it for a night. Let's move RC4 to 
tomorrow morning.

We have both possibilities, let's collect arguments +/-

bq. Excellent point. Hadn't seen it before or didn't remember it.

He brought this up several times in the TokenStream discussion. This is why we 
made this very fancy backwards-compatibility layer that works with many special 
usages of TokenStreams, like subclassing Token and so on (see extra BW Test). 
Hard stuff :-) And all of this because of that argument.

> Problem with CharStream and Tokenizers with custom reset(Reader) method
> -----------------------------------------------------------------------
>
>                 Key: LUCENE-1906
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1906
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 2.9
>            Reporter: Uwe Schindler
>            Assignee: Uwe Schindler
>            Priority: Blocker
>             Fix For: 2.9
>
>         Attachments: backwards-break.patch, LUCENE-1906.patch, 
> LUCENE-1906.patch, LUCENE-1906_contrib.patch
>
>
> When reviewing the new CharStream code added to Tokenizers, I found a
> serious problem with backwards compatibility and with other Tokenizers that
> do not override reset(CharStream).
> The problem is that, for example, CharTokenizer only overrides reset(Reader):
> {code}
>   public void reset(Reader input) throws IOException {
>     super.reset(input);
>     bufferIndex = 0;
>     offset = 0;
>     dataLen = 0;
>   }
> {code}
> If you reset such a Tokenizer with another CharStream (not a Reader), this
> method will never be called, which breaks the whole Tokenizer.
> As CharStream extends Reader, I propose to remove this reset(CharStream)
> method and simply do an instanceof check to detect if the supplied Reader
> is not a CharStream and wrap it. We could also remove the extra ctor (because
> most Tokenizers have no support for passing CharStreams). If the ctor also
> checks with instanceof and wraps as needed, the code is backwards compatible
> and we do not need to add additional ctors in subclasses.
> As this instanceof check is always done in CharReader.get(), why not remove
> ctor(CharStream) and reset(CharStream) completely?
> Any thoughts?
> I would like to fix this somehow before RC4, I'm sorry :(
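
For illustration, here is a minimal, self-contained sketch of the proposal in the issue description above. It is not the actual patch: SketchCharStream, SketchCharReader, SketchTokenizer and SketchCharTokenizer are hypothetical, simplified stand-ins for the Lucene classes discussed here. The idea shown is that with a single reset(Reader) entry point and an instanceof check in a get() helper, a subclass that only overrides reset(Reader) still gets its state cleared when it is handed a CharStream.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Small demo driver (hypothetical name).
class ResetSketch {
    public static void main(String[] args) throws IOException {
        SketchCharTokenizer t = new SketchCharTokenizer(new StringReader("first"));
        SketchCharStream cs = SketchCharReader.get(new StringReader("second"));
        // Only reset(Reader) exists, so this call reaches the subclass override.
        t.reset(cs);
        System.out.println(t.offset + " " + (t.input == cs)); // prints "0 true"
    }
}

// Simplified stand-in for illustration only; NOT the real Lucene CharStream.
abstract class SketchCharStream extends FilterReader {
    protected SketchCharStream(Reader in) { super(in); }
    // Maps an offset in the filtered text back to the original text.
    abstract int correctOffset(int currentOff);
}

// Wraps a plain Reader; get() is the single place doing the instanceof check.
final class SketchCharReader extends SketchCharStream {
    private SketchCharReader(Reader in) { super(in); }

    static SketchCharStream get(Reader input) {
        return (input instanceof SketchCharStream)
                ? (SketchCharStream) input        // already a char stream: reuse it
                : new SketchCharReader(input);    // plain Reader: wrap it once
    }

    @Override
    int correctOffset(int currentOff) { return currentOff; } // no offset correction
}

// Base class with one ctor and one reset method, both taking a Reader.
class SketchTokenizer {
    protected SketchCharStream input;

    SketchTokenizer(Reader input) { this.input = SketchCharReader.get(input); }

    public void reset(Reader input) throws IOException {
        this.input = SketchCharReader.get(input);
    }
}

// A subclass written against the old API: it only overrides reset(Reader).
class SketchCharTokenizer extends SketchTokenizer {
    int bufferIndex = 5, offset = 5, dataLen = 5; // pretend stale state from earlier use

    SketchCharTokenizer(Reader input) { super(input); }

    @Override
    public void reset(Reader input) throws IOException {
        super.reset(input);
        bufferIndex = 0;
        offset = 0;
        dataLen = 0;
    }
}
```

If the base class also had a separate reset(SketchCharStream) overload, the call t.reset(cs) would bind to that overload at compile time and skip the subclass override, which is the breakage described in the issue.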
