[ https://issues.apache.org/jira/browse/LUCENE-2384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854906#action_12854906 ]

Robert Muir commented on LUCENE-2384:
-------------------------------------

bq. For JFlex this does not help as the JFlex-generated code always needs a 
Reader.

This can be fixed. Currently the I/O in all the tokenizers is broken and buggy: 
they do not correctly handle the special cases around their 'buffering'.

The only one that is correct is CharTokenizer, but at what cost? It has so much 
complexity because of this Reader issue.

We should stop pretending we can really stream docs with a Reader.
We should stop pretending 8GB documents or something exist, where we can't 
just analyze the whole doc at once and make things simple.
And then we can fix the lucene tokenizers to be correct.
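
A minimal sketch of the "analyze the whole doc at once" idea: drain the Reader 
into a single String up front, then let the tokenizer work over a plain char 
sequence with no streaming special cases. The helper name readAll below is 
hypothetical, not a Lucene API:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class ReadAll {
    // Drain the entire Reader into one String; after this, tokenization
    // needs no refill/buffer-boundary logic at all.
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String doc = readAll(new StringReader("the quick brown fox"));
        System.out.println(doc.length()); // 19
    }
}
```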


> Reset zzBuffer in StandardTokenizerImpl* when lexer is reset.
> -------------------------------------------------------------
>
>                 Key: LUCENE-2384
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2384
>             Project: Lucene - Java
>          Issue Type: Sub-task
>          Components: Analysis
>    Affects Versions: 3.0.1
>            Reporter: Uwe Schindler
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: reset.diff
>
>
> When indexing large documents, the lexer buffer may stay large forever. This 
> sub-issue resets the lexer buffer back to the default on reset(Reader).
> This is done on the enclosing issue.
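
The idea in the quoted sub-issue can be sketched like this. The constant name 
ZZ_BUFFERSIZE follows JFlex conventions, but the class and its grow() method 
are illustrative stand-ins for the generated lexer, not the actual patch:

```java
import java.io.Reader;

class LexerBufferDemo {
    private static final int ZZ_BUFFERSIZE = 16384; // JFlex default buffer size
    private char[] zzBuffer = new char[ZZ_BUFFERSIZE];
    private Reader zzReader;

    // Stand-in for the lexer growing its buffer while scanning a large doc.
    void grow(int size) {
        if (size > zzBuffer.length) {
            zzBuffer = new char[size];
        }
    }

    // The fix: on reset, drop an oversized buffer back to the default
    // so one huge document does not pin a huge char[] forever.
    void yyreset(Reader reader) {
        zzReader = reader;
        if (zzBuffer.length > ZZ_BUFFERSIZE) {
            zzBuffer = new char[ZZ_BUFFERSIZE];
        }
    }

    int bufferLength() {
        return zzBuffer.length;
    }
}
```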

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

