The current ordering of JapaneseAnalyzer's token filters is as
follows (sketched as code after the list):
1. JapaneseBaseFormFilter
2. JapanesePartOfSpeechStopFilter
3. CJKWidthFilter (similar to NormaliseFilter)
4. StopFilter
5. JapaneseKatakanaStemFilter
6. LowerCaseFilter
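
For reference, here is roughly what I believe that chain looks like as
a createComponents() method. This is paraphrased from memory rather
than copied from the Lucene source, assumes roughly Lucene 5.x/6.x
signatures and package locations, and the null user dictionary and
SEARCH mode are just placeholder arguments:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.cjk.CJKWidthFilter;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
import org.apache.lucene.analysis.ja.JapaneseBaseFormFilter;
import org.apache.lucene.analysis.ja.JapaneseKatakanaStemFilter;
import org.apache.lucene.analysis.ja.JapanesePartOfSpeechStopFilter;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;

// Reconstruction of the chain JapaneseAnalyzer builds, not its actual source.
public class CurrentJapaneseChain extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tok = new JapaneseTokenizer(null, true, JapaneseTokenizer.Mode.SEARCH);
        TokenStream ts = new JapaneseBaseFormFilter(tok);        // 1. base-form "stemming"
        ts = new JapanesePartOfSpeechStopFilter(ts,
                JapaneseAnalyzer.getDefaultStopTags());          // 2. POS-based stop filter
        ts = new CJKWidthFilter(ts);                             // 3. width normalisation
        ts = new StopFilter(ts, JapaneseAnalyzer.getDefaultStopSet()); // 4. stop words
        ts = new JapaneseKatakanaStemFilter(ts);                 // 5. katakana stemming
        ts = new LowerCaseFilter(ts);                            // 6. lowercasing
        return new TokenStreamComponents(tok, ts);
    }
}
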
Our existing support for English applies token filters in the
following order (again sketched as code below the list):
1. Various tokenisation hacks which we use to avoid having to fix
the tokeniser itself.
2. Normalisation
2.1. NormaliseFilter
2.2. LowerCaseFilter
3. StopFilter
4. PorterStemFilter
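
As code, again assuming ~Lucene 5.x/6.x signatures; NormaliseFilter is
our own class, the tokenisation hacks are only indicated by a comment,
and Lucene's default English stop set stands in for our per-language
list, so treat this as a sketch rather than something compilable
outside our tree:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class CurrentEnglishChain extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tok = new StandardTokenizer();
        TokenStream ts = tok;
        // 1. ...our various tokenisation hack filters wrap the stream here...
        ts = new NormaliseFilter(ts);                            // 2.1 normalisation (our class)
        ts = new LowerCaseFilter(ts);                            // 2.2 lowercasing
        ts = new StopFilter(ts, EnglishAnalyzer.getDefaultStopSet()); // 3. stop words
        ts = new PorterStemFilter(ts);                           // 4. stemming
        return new TokenStreamComponents(tok, ts);
    }
}
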
I'm wondering about a couple of things.
1) Is it right/intentional/sane that in JapaneseAnalyzer one stemming
filter (JapaneseBaseFormFilter) comes before the normalisation and
another (JapaneseKatakanaStemFilter) comes after it?
2) How much leeway do I have to change the order? Ideally, I would
like to line the pipeline up something like this:
1. Tokenisation
If English, StandardTokenizer plus all our hacks.
If Japanese, JapaneseTokenizer.
2. Normalisation
NormaliseFilter
LowerCaseFilter
3. Stop words (user can opt out of this feature)
If Japanese, JapanesePartOfSpeechStopFilter
StopFilter (stop word list differs per language)
4. Stemming (user can opt out of this feature)
If English, PorterStemFilter
If Japanese, JapaneseBaseFormFilter
If Japanese, JapaneseKatakanaStemFilter
Stop words and stemming could be swapped, but the main thing is that
the user-setting-dependent parts would be grouped together into a
fairly logical arrangement, instead of the method just becoming a
spaghetti mess of different option checks.
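
To make that concrete, here is roughly the shape I have in mind: one
createComponents() with the option checks grouped by stage. The
japanese/stopWordsEnabled/stemmingEnabled flags and the stop word set
are stand-ins for however we end up wiring the user settings through,
NormaliseFilter is our own class, and I'm again assuming ~Lucene
5.x/6.x signatures, so this is a sketch of the ordering rather than a
finished implementation:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
import org.apache.lucene.analysis.ja.JapaneseBaseFormFilter;
import org.apache.lucene.analysis.ja.JapaneseKatakanaStemFilter;
import org.apache.lucene.analysis.ja.JapanesePartOfSpeechStopFilter;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.util.CharArraySet;

public class UnifiedAnalyzer extends Analyzer {
    private final boolean japanese;          // field/index language
    private final boolean stopWordsEnabled;  // user can opt out
    private final boolean stemmingEnabled;   // user can opt out
    private final CharArraySet stopWords;    // per-language stop word list

    public UnifiedAnalyzer(boolean japanese, boolean stopWordsEnabled,
                           boolean stemmingEnabled, CharArraySet stopWords) {
        this.japanese = japanese;
        this.stopWordsEnabled = stopWordsEnabled;
        this.stemmingEnabled = stemmingEnabled;
        this.stopWords = stopWords;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // 1. Tokenisation
        Tokenizer tok = japanese
                ? new JapaneseTokenizer(null, true, JapaneseTokenizer.Mode.SEARCH)
                : new StandardTokenizer();   // plus our English tokenisation hacks
        TokenStream ts = tok;

        // 2. Normalisation
        ts = new NormaliseFilter(ts);        // our class (CJKWidthFilter is similar)
        ts = new LowerCaseFilter(ts);

        // 3. Stop words (user can opt out)
        if (stopWordsEnabled) {
            if (japanese) {
                ts = new JapanesePartOfSpeechStopFilter(ts,
                        JapaneseAnalyzer.getDefaultStopTags());
            }
            ts = new StopFilter(ts, stopWords);
        }

        // 4. Stemming (user can opt out)
        if (stemmingEnabled) {
            ts = japanese
                    ? new JapaneseKatakanaStemFilter(new JapaneseBaseFormFilter(ts))
                    : new PorterStemFilter(ts);
        }
        return new TokenStreamComponents(tok, ts);
    }
}

The point is that each option is checked in exactly one stage, which I
think is easier to follow than interleaving the checks through the
whole chain.
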
Daniel