Hi, Shawn
Thank you for your reply.
> CJKBigramFilter shouldn't care what tokenizer you're using. It should
> work with any tokenizer. What problem are you seeing that you're trying
> to solve? What version of Solr, what configuration, and what does it do
> that you're not expecting, and what do you want it to do?
I'm sorry for the lack of information. I tried this with Solr 5.5.5 and 7.5.0.
Here is the analyzer configuration from my managed-schema:
<fieldType name="text_classic" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ClassicTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory"/>
  </analyzer>
</fieldType>
What I want to do is:
1. create CJK bigram tokens
2. extract each hyphenated word made up of stopwords as a single token
(e.g. as-is, to-be, etc.) from mixed CJK and English sentences.
CJKBigramFilter seems to check the token type attribute assigned by
StandardTokenizer when deciding which tokens to combine into CJK bigrams.
(See
https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/cjk/CJKBigramFilter.java#L64
)
ClassicTokenizer assigns the obsolete token type "CJ" to CJ tokens and
"ALPHANUM" to Korean (Hangul) tokens, but neither type is one that
CJKBigramFilter handles...
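For reference, swapping in StandardTokenizer does make the bigram filter
fire, because StandardTokenizer emits the <IDEOGRAPHIC>, <HIRAGANA>,
<KATAKANA> and <HANGUL> token types that CJKBigramFilter matches on. A
minimal sketch (the field type name is just an example, and this loses
ClassicTokenizer's hyphenated-word handling, so it is not a full
solution for my case):

<fieldType name="text_cjk_bigram" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <!-- StandardTokenizer assigns the token types that
         CJKBigramFilter checks for -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory"/>
  </analyzer>
</fieldType>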
Thanks,
Yasufumi
On Tue, Oct 2, 2018 at 0:05 Shawn Heisey <[email protected]> wrote:
> On 9/30/2018 10:14 PM, Yasufumi Mizoguchi wrote:
> > I am looking for the way to create CJK bigram tokens with
> ClassicTokenizer.
> > I tried this by using CJKBigramFilter, but it only supports for
> > StandardTokenizer...
>
> CJKBigramFilter shouldn't care what tokenizer you're using. It should
> work with any tokenizer. What problem are you seeing that you're trying
> to solve? What version of Solr, what configuration, and what does it do
> that you're not expecting, and what do you want it to do?
>
> I don't have access to the systems where I was using that filter, but if
> I recall correctly, I was using the whitespace tokenizer.
>
> Thanks,
> Shawn