Re: Unicode processing - Issue with CharStreamAwareWhitespaceTokenizerFactory

Jan Høydahl / Cominvent Tue, 06 Jul 2010 15:32:29 -0700

Char filters MUST come before the Tokenizer, because they process the 
character stream and not the tokens.


If you need to apply the accent normalization later in the analysis chain, 
either use ISOLatin1AccentFilterFactory or help with the implementation of 
SOLR-1978.
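
As a rough sketch of the difference in schema.xml (the field type names 
text_charfold and text_tokenfold are made up for illustration, and the mapping 
file is assumed to be the mapping-ISOLatin1Accent.txt from the Solr example 
config), the two approaches could look something like this:

```xml
<!-- Hypothetical field type: accents are folded on the character stream,
     so the charFilter runs before the tokenizer wherever it is listed. -->
<fieldType name="text_charfold" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical alternative: accents are folded on the tokens,
     at the position where the token filter appears in the chain. -->
<fieldType name="text_tokenfold" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
  </analyzer>
</fieldType>
```

In the second variant the order of the filter elements matters, which is what 
you would want if the folding has to happen after some other token filter.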

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 5 July 2010, at 17:32, Saïd Radhouani wrote:

> Thanks Koji for the reply and for updating the wiki. As it's written now in 
> the wiki, it sounds (at least to me) like MappingCharFilterFactory works only 
> with WhitespaceTokenizerFactory.
> 
> Did you really mean that? Because this filter also works with other 
> tokenizers. For instance, in my text type, I'm using StandardTokenizerFactory 
> for document processing, and WhitespaceTokenizerFactory for query processing.
> 
> I also noticed that, wherever you put this filter in the definition 
> of a field type, it's always applied (during text processing) before the 
> tokenizer and all the other filters. Is there a reason for that? Is it 
> possible to force the filter to be applied at a certain position among the 
> other filters?
> 
> Thanks,
> -S
> 
> On Jul 5, 2010, at 4:28 PM, Koji Sekiguchi wrote:
> 
>> 
>>> In the same wiki, they say that CharStreamAwareWhitespaceTokenizerFactory 
>>> must be used with MappingCharFilterFactory. But when I use this tokenizer 
>>> and filter together, I get a severe error saying that the field type 
>>> containing this filter and tokenizer is unknown. However, it works when I 
>>> use this filter with StandardTokenizerFactory or WhitespaceTokenizerFactory!
>>> 
>>> 
>> The wiki is not correct today. Before Lucene 2.9 (and Solr 1.4),
>> Tokenizers could only take a Reader argument in their constructor. Since
>> then, they can take a CharStream argument in the constructor, so the
>> *CharStreamAware* Tokenizers are no longer needed (all Tokenizers
>> are aware of CharStream). I'll update the wiki.
>> 
>> Koji
>> 
>> -- 
>> http://www.rondhuit.com/en/
>> 
>
