Re: TokenStreamComponents in Lucene 4.0

Carsten Schnober Mon, 19 Nov 2012 08:48:53 -0800

On 19.11.2012 17:44, Carsten Schnober wrote:

Hi again,
just a little update:


> However, after switching to Lucene 4 and TokenStreamComponents, I'm
> getting a strange behaviour: only the first document in the collection
> is tokenized properly. The others do appear in the index, but
> un-tokenized, although I have tried not to change anything in the logic.
> The Analyzer now has this createComponents() method calling the custom
> TokenStreamComponents class with my custom Tokenizer:
> 
> @Override
> protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
>   final Tokenizer source = new KoraTokenizer(reader);
>   final TokenStreamComponents tokenstream = new KoraTokenStreamComponents(source);
>   try {
>     source.close();
>   } catch (IOException e) {
>     jlog.error(e.getLocalizedMessage());
>     e.printStackTrace();
>   }
>   return tokenstream;
> }

When using the packaged Analyzer.TokenStreamComponents class instead of
my custom KoraTokenStreamComponents class, the behaviour does not seem
to change:

-  final TokenStreamComponents tokenstream = new KoraTokenStreamComponents(source);
+  final TokenStreamComponents tokenstream = new TokenStreamComponents(source);
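
One possible reading of this, which I have not confirmed for my setup: Lucene 4 analyzers reuse the components returned by createComponents() for later documents, so any per-document state that lives in the tokenizer itself would survive regardless of which TokenStreamComponents class wraps it. The following is only a toy model of that idea in plain Java. ReuseSketch, ToyTokenizer and ToyAnalyzer are hypothetical names, not Lucene classes, and the methods only loosely mirror the real API.

```java
import java.util.ArrayList;
import java.util.List;

public class ReuseSketch {

    /** Toy stand-in for a tokenizer. Hypothetical; this is NOT Lucene's Tokenizer API. */
    static class ToyTokenizer {
        private final boolean resetClearsState;
        private String[] words = new String[0];
        private int pos = 0; // per-document state kept in the tokenizer

        ToyTokenizer(boolean resetClearsState) {
            this.resetClearsState = resetClearsState;
        }

        /** Hands a new document to the same, reused instance. */
        void setInput(String text) {
            words = text.trim().split("\\s+");
        }

        /** Called by the consumer before it pulls tokens. */
        void reset() {
            if (resetClearsState) {
                pos = 0;
            }
        }

        /** Returns the next token, or null once the input is exhausted. */
        String next() {
            return pos < words.length ? words[pos++] : null;
        }
    }

    /** Toy stand-in for an analyzer that builds its tokenizer once and reuses it. */
    static class ToyAnalyzer {
        private final boolean resetClearsState;
        private ToyTokenizer cached;

        ToyAnalyzer(boolean resetClearsState) {
            this.resetClearsState = resetClearsState;
        }

        List<String> analyze(String text) {
            if (cached == null) {
                // Only the first document triggers construction.
                cached = new ToyTokenizer(resetClearsState);
            }
            cached.setInput(text);
            cached.reset();
            List<String> tokens = new ArrayList<>();
            for (String t = cached.next(); t != null; t = cached.next()) {
                tokens.add(t);
            }
            return tokens;
        }
    }

    public static void main(String[] args) {
        ToyAnalyzer stale = new ToyAnalyzer(false);
        System.out.println(stale.analyze("a b c")); // [a, b, c]
        System.out.println(stale.analyze("d e f")); // [] because pos is still 3

        ToyAnalyzer clean = new ToyAnalyzer(true);
        System.out.println(clean.analyze("a b c")); // [a, b, c]
        System.out.println(clean.analyze("d e f")); // [d, e, f]
    }
}
```

In this model only the first document is tokenized when reset() leaves the position untouched, and nothing about the wrapper around the tokenizer matters. Whether KoraTokenizer has the same kind of leftover state is an assumption that still needs checking.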

Best,
Carsten


-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP                 | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789      | [email protected]
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform
