On 19.11.2012 17:44, Carsten Schnober wrote:
Hi again,
just a little update:
> However, after switching to Lucene 4 and TokenStreamComponents, I'm
> seeing strange behaviour: only the first document in the collection
> is tokenized properly. The others do appear in the index, but
> un-tokenized, although I have tried not to change anything in the logic.
> The Analyzer now has this createComponents() method calling the custom
> TokenStreamComponents class with my custom Tokenizer:
>
> @Override
> protected TokenStreamComponents createComponents(String fieldName,
>         Reader reader) {
>     final Tokenizer source = new KoraTokenizer(reader);
>     final TokenStreamComponents tokenstream =
>             new KoraTokenStreamComponents(source);
>     try {
>         source.close();
>     } catch (IOException e) {
>         jlog.error(e.getLocalizedMessage());
>         e.printStackTrace();
>     }
>     return tokenstream;
> }
When using the packaged Analyzer.TokenStreamComponents class instead of
my custom KoraTokenStreamComponents class, the behaviour does not seem
to change:
- final TokenStreamComponents tokenstream = new KoraTokenStreamComponents(source);
+ final TokenStreamComponents tokenstream = new TokenStreamComponents(source);
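
One thing that might be worth checking, stated here as an assumption and not verified against KoraTokenizer: a Lucene 4 Analyzer normally creates its TokenStreamComponents once and reuses them, handing each later document to the same Tokenizer through a new Reader. A tokenizer that keeps per-document state and never clears it could then tokenize only the first document. The sketch below models that pattern in plain Java. It uses no Lucene classes, and SketchTokenizer, analyze and ReuseSketch are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of tokenizer reuse; none of these types are Lucene classes.
class ReuseSketch {

    /** Hypothetical whitespace tokenizer that keeps per-document state. */
    static class SketchTokenizer {
        private Reader input = new StringReader("");
        private boolean exhausted = false;

        /** Swap in the next document's reader, as a reusing analyzer would. */
        void setReader(Reader next) {
            input = next;
        }

        /** Clear per-document state so the new reader is actually consumed. */
        void reset() {
            exhausted = false;
        }

        /** Split the current input on whitespace, or return nothing if state is stale. */
        List<String> tokens() {
            List<String> out = new ArrayList<>();
            if (exhausted) {
                return out; // stale flag from the previous document: input is never read
            }
            StringBuilder current = new StringBuilder();
            try {
                int c;
                while ((c = input.read()) != -1) {
                    if (Character.isWhitespace(c)) {
                        if (current.length() > 0) {
                            out.add(current.toString());
                            current.setLength(0);
                        }
                    } else {
                        current.append((char) c);
                    }
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            if (current.length() > 0) {
                out.add(current.toString());
            }
            exhausted = true;
            return out;
        }
    }

    /** Feed one document through a shared tokenizer, optionally resetting it first. */
    static List<String> analyze(SketchTokenizer shared, String doc, boolean callReset) {
        shared.setReader(new StringReader(doc));
        if (callReset) {
            shared.reset();
        }
        return shared.tokens();
    }

    public static void main(String[] args) {
        SketchTokenizer shared = new SketchTokenizer();
        System.out.println(analyze(shared, "first document", false));  // [first, document]
        System.out.println(analyze(shared, "second document", false)); // []
        System.out.println(analyze(shared, "third document", true));   // [third, document]
    }
}
```

In this model the second document yields no tokens only because the state left over from the first one is never cleared. Whether that matches what happens in the real index depends on how KoraTokenizer handles reset and reader replacement.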
Best,
Carsten
--
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789 | [email protected]
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform