[ 
https://issues.apache.org/jira/browse/SOLR-3653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13461572#comment-13461572
 ] 

Lance Norskog commented on SOLR-3653:
-------------------------------------

I ran some counts on a database of 300k Chinese legal documents. The index has 
a unigram field based on the StandardAnalyzer, a bigram field based on the CJK 
analyzer, and a Smart Chinese field. I pulled the terms for all of them and 
filtered for terms consisting of Chinese ideograms only. The counts break down 
as follows:

* The unigram field had 55k terms. 
* The bigram field had 1.8 million terms. 
* The Smart Chinese field had 417k terms:
** unigrams: 9.6k
** bigrams: 40k
** trigrams: 14.6k
** 4-grams: 5.6k
** 5-grams: 300
** 6-grams: 70
** 7-grams: 51
** 8-grams: 19
** 9-grams: 7
** 10-grams: 2
** 11-grams: 3
** 12-grams: 2
** 13-grams: 3

The 4+ ngrams are essentially parsing failures by the Smart Chinese tokenizer. 
I have attached three Google Translate versions of the longer ngrams. 
'translations_first_500.trigrams.txt' and 'translations_first_500.quad.txt' are 
the most common 3-ideogram and 4-ideogram terms. They have a lot of phrases 
which should have been split.  'translations_450.five2thirteen.txt' are 450 
ngrams which are 5 ideograms or longer.  The longer ones have a lot of formal 
geographical names, government organization names and official propaganda 
phrases, more so as the length increases. 

For this corpus, based on the above breakdown and on other experience:
# CJK is a waste of disk space. Bigrams introduce a ton of noise.
# Unigrams might work well if you only do strict phrase searches, but searching 
for A, B, and C as separate terms when the user typed ABC is useless.
# If you search for raw country names, Smart Chinese lets you down when the 
document uses the formal name. 

Smart Chinese really does need to be split into bigrams. To cut bigram noise, I 
would take the database of bigrams that it generates, and then use these to 
guide splitting 3+ grams into bigrams. That is, if it ever generates AB, then 
the splitter turns ABCD into (AB CD). BC would be considered 'bigram noise'. 
Similarly, if Smart Chinese generates EF, then DEFG would become (D EF G).
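The splitting heuristic above can be sketched as follows. This is a minimal
illustration, not the attached patch: the function name `split_ngram`, the
greedy left-to-right anchor scan, and the pairwise chunking of leftover runs
are my own assumptions about how to resolve ties.

```python
def split_ngram(term, known_bigrams):
    """Split a long Smart Chinese token into bigrams, anchored on
    bigrams the tokenizer has emitted elsewhere in the corpus."""
    # Find non-overlapping known bigrams, scanning left to right.
    anchors = []
    i = 0
    while i < len(term) - 1:
        if term[i:i + 2] in known_bigrams:
            anchors.append(i)
            i += 2
        else:
            i += 1
    # Emit anchors as-is; chunk the leftover runs into pairs,
    # leaving a trailing unigram when a run has odd length.
    out = []
    pos = 0
    for a in anchors:
        run = term[pos:a]
        out.extend(run[j:j + 2] for j in range(0, len(run), 2))
        out.append(term[a:a + 2])
        pos = a + 2
    run = term[pos:]
    out.extend(run[j:j + 2] for j in range(0, len(run), 2))
    return out
```

With known bigram AB, `split_ngram("ABCD", {"AB"})` yields `["AB", "CD"]`
(BC is dropped as bigram noise); with known bigram EF, `split_ngram("DEFG",
{"EF"})` yields `["D", "EF", "G"]`, matching the two examples above.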

However, a good fallback would be to have two fields, Smart Chinese and 
unigrams, with Smart Chinese boosted upwards and the unigram field used only 
for strict phrase search. With a high term count, bigrams are not helpful. You might even 
want to search Smart Chinese first, and then do unigram loose phrase search 
only if the recall is too low or the user is unhappy with the Smart Chinese 
results.
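A schema.xml sketch of that two-field setup might look like the following.
The field and type names are hypothetical, and the SmartChineseAnalyzer class
shown is its Lucene 3.x/4.x package location:

```xml
<!-- Smart Chinese field: primary, boosted at query time -->
<fieldType name="text_smartcn" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>
</fieldType>

<!-- Unigram field: StandardTokenizer splits CJK into single ideograms;
     queried only with strict phrases -->
<fieldType name="text_uni" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>

<field name="body_smartcn" type="text_smartcn" indexed="true" stored="false"/>
<field name="body_uni" type="text_uni" indexed="true" stored="false"/>
```

A query would then boost the Smart Chinese field and restrict the unigram
field to phrases, e.g. body_smartcn:(ABC)^4 OR body_uni:"ABC".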

                
> Custom bigramming filter to handle Smart Chinese edge cases
> -----------------------------------------------------------
>
>                 Key: SOLR-3653
>                 URL: https://issues.apache.org/jira/browse/SOLR-3653
>             Project: Solr
>          Issue Type: New Feature
>          Components: Schema and Analysis
>            Reporter: Lance Norskog
>         Attachments: SmartChineseType.pdf, SOLR-3653.patch
>
>
> The "Smart" Simplified Chinese toolkit in lucene/analysis/smartcn does not 
> work in some edge cases. It fails to split certain words which were not part 
> of the dictionary or training corpus. 
> This patch supplies a bigramming class to handle these occasional mistakes. 
> The algorithm creates bigrams out of all "words" longer than two ideograms.
