[
https://issues.apache.org/jira/browse/SOLR-3653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13461572#comment-13461572
]
Lance Norskog commented on SOLR-3653:
-------------------------------------
I ran some counts on a database of 300k Chinese legal documents. The index has
a unigram field based on the StandardAnalyzer, a bigram field based on the CJK
analyzer, and a Smart Chinese field. I pulled the terms for all of them and
filtered for Chinese ideograms only. The term counts, by field and by term
length, were:
* The unigram field had 55k terms.
* The bigram field had 1.8 million terms.
* The Smart Chinese field had 417k terms:
** unigrams: 9.6k
** bigrams: 40k
** trigrams: 14.6k
** 4-grams: 5.6k
** 5-grams: 300
** 6-grams: 70
** 7-grams: 51
** 8-grams: 19
** 9-grams: 7
** 10-grams: 2
** 11-grams: 3
** 12-grams: 2
** 13-grams: 3
The 4+ ngrams are essentially parsing failures by the Smart Chinese tokenizer.
I have attached three Google Translate versions of the longer ngrams.
'translations_first_500.trigrams.txt' and 'translations_first_500.quad.txt' are
the most common 3-ideogram and 4-ideogram terms. They have a lot of phrases
which should have been split. 'translations_450.five2thirteen.txt' are 450
ngrams which are 5 ideograms or longer. The longer ones have a lot of formal
geographical names, government organization names and official propaganda
phrases, more as the length increases.
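The length breakdown above can be reproduced with a short script over a dump of
the indexed terms. A minimal sketch (the term list here is a toy stand-in, and
the ideogram test only checks the basic CJK Unified Ideographs block):

```python
from collections import Counter

def is_cjk(ch):
    # Basic CJK Unified Ideographs block; a simplification, since
    # extension blocks exist but are rare in this kind of text.
    return '\u4e00' <= ch <= '\u9fff'

def length_histogram(terms):
    """Bucket indexed terms by length, keeping only terms made
    entirely of Chinese ideograms."""
    counts = Counter()
    for term in terms:
        if term and all(is_cjk(ch) for ch in term):
            counts[len(term)] += 1
    return counts

# Toy term list standing in for a dumped Smart Chinese term dictionary
terms = ["法院", "人民法院", "中华人民共和国", "court", "判决"]
print(length_histogram(terms))
```

The same histogram, run per field, gives the unigram/bigram/n-gram counts
listed above.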
For this corpus, based on the above breakdown and on other experience:
# CJK is a waste of disk space. Bigrams introduce a ton of noise.
# Unigrams might work well if you only do strict phrase searches. But searching
for A, B, and C separately when given ABC is useless.
# If you search for raw country names, Smart Chinese lets you down when the
document uses the formal name.
Smart Chinese really does need to be split into bigrams. To cut bigram noise, I
would take the database of bigrams that it generates, and then use these to
guide splitting 3+ grams into bigrams. That is, if it ever generates AB, then
the splitter turns ABCD into (AB CD). BC would be considered 'bigram noise'.
Similarly, if Smart Chinese generates EF, then DEFG would become (D EF G).
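The splitting rule above can be sketched as follows. The greedy left-to-right
scan is my reading of the proposal; known bigrams (assumed to come from a dump
of what Smart Chinese actually generates) anchor the split, and any characters
not covered by a known bigram are emitted as leftover runs:

```python
def split_by_known_bigrams(token, known_bigrams):
    """Split a 3+ ideogram token using previously generated
    bigrams as anchors; characters not covered by a known
    bigram are emitted together as leftover runs."""
    out, run, i = [], "", 0
    while i < len(token):
        pair = token[i:i + 2]
        if len(pair) == 2 and pair in known_bigrams:
            if run:                  # flush any uncovered prefix
                out.append(run)
                run = ""
            out.append(pair)         # emit the anchored bigram
            i += 2
        else:
            run += token[i]          # accumulate uncovered chars
            i += 1
    if run:
        out.append(run)
    return out

print(split_by_known_bigrams("ABCD", {"AB"}))  # ['AB', 'CD']
print(split_by_known_bigrams("DEFG", {"EF"}))  # ['D', 'EF', 'G']
```

This reproduces both examples: AB anchors (AB CD), and EF anchors (D EF G),
with BC and DE never emitted as bigram noise.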
However, a good fallback would be to have two fields, Smart Chinese and
unigrams, with Smart Chinese boosted upwards and unigrams only with strict
phrase search. With a high term count, bigrams are not helpful. You might even
want to search Smart Chinese first, and then do unigram loose phrase search
only if the recall is too low or the user is unhappy with the Smart Chinese
results.
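One way to express the two-field fallback with the eDisMax parser would be to
put the Smart Chinese field in qf with a boost and the unigram field only in
pf, so unigrams contribute solely through phrase matches. The field names here
are hypothetical, and this is only a sketch of the idea:

```
q=中华人民共和国
defType=edismax
qf=text_smartcn^2.0
pf=text_unigram
```

The staged variant (unigram phrase search only when Smart Chinese recall is
too low) would need client-side logic rather than a single request.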
> Custom bigramming filter to handle Smart Chinese edge cases
> -----------------------------------------------------------
>
> Key: SOLR-3653
> URL: https://issues.apache.org/jira/browse/SOLR-3653
> Project: Solr
> Issue Type: New Feature
> Components: Schema and Analysis
> Reporter: Lance Norskog
> Attachments: SmartChineseType.pdf, SOLR-3653.patch
>
>
> The "Smart" Simplified Chinese toolkit in lucene/analysis/smartcn does not
> work in some edge cases. It fails to split certain words which were not part
> of the dictionary or training corpus.
> This patch supplies a bigramming class to handle these occasional mistakes.
> The algorithm creates bigrams out of all "words" longer than two ideograms.