[jira] [Commented] (SOLR-3653) Support Smart Simplified Chinese in Solr - include clean-up bigramming filter

Lance Norskog (JIRA) Fri, 20 Jul 2012 01:07:41 -0700

    [ https://issues.apache.org/jira/browse/SOLR-3653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13418997#comment-13418997 ]


Lance Norskog commented on SOLR-3653:
-------------------------------------

The SmartChineseWordTokenFilter is a statistical algorithm (a Hidden Markov 
Model, to be exact) that was trained on a corpus of training text. Its purpose 
is to split text into "words", which are single ideograms, bigrams and 
occasionally trigrams of Simplified Chinese ideograms (letters). It does a very 
good job, but since it is statistically based it is not perfect. When it fails, 
it emits "words" that are 4 or more ideograms long. These are really phrases, 
and they contain real words which should be searchable.

The attached PDF of the Analysis page shows the problem. Chinese legal text 
proved a pathological case and created a 7-ideogram word. In order to make 
parts of this text searchable, the 7-letter phrase has to be broken into 
n-grams. Unigrams give more recall while bigrams give more precision. 
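
To make the unigram/bigram trade-off concrete, here is a minimal sketch in 
plain Python. It is not Lucene code, and the 7-ideogram string is only an 
illustrative stand-in, not the phrase from the attached PDF. The helper name 
ngrams is made up for this example.

```python
def ngrams(phrase, n):
    """Return the overlapping character n-grams of phrase, in order."""
    return [phrase[i:i + n] for i in range(len(phrase) - n + 1)]


# Illustrative 7-ideogram string (an assumption, not taken from the PDF).
phrase = "中华人民共和国"

unigrams = ngrams(phrase, 1)  # 7 single ideograms: matches more queries
bigrams = ngrams(phrase, 2)   # 6 overlapping pairs: fewer, stricter matches
```

A query for a single ideogram matches any of the unigrams, while a bigram 
index only matches when two adjacent ideograms agree, which is where the 
recall versus precision difference comes from.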

This patch includes a new SmartChineseBigramFilter, which takes any words not 
split by the WordTokenFilter and creates bigrams from them. The bigrams only 
span a single unsplit phrase; they never straddle the boundary between two 
adjoining unsplit phrases. The attached PDF shows this effect as well, between 
the first and second unsplit phrases.
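
The behaviour described above can be sketched roughly as follows. This is 
plain Python, not the code in the patch: the function name cleanup_bigrams and 
the min_len parameter are hypothetical, the threshold of 4 comes from the "4 or 
more ideograms" rule of thumb earlier in this comment, and the sketch assumes 
the unsplit phrase is replaced by its bigrams. Token offsets and position 
increments, which a real Lucene TokenFilter must maintain, are ignored.

```python
def cleanup_bigrams(tokens, min_len=4):
    """Replace each over-long token with its own overlapping bigrams.

    Tokens shorter than min_len pass through unchanged. Bigrams are built
    per token, so none is formed across two adjacent tokens.
    """
    out = []
    for tok in tokens:
        if len(tok) >= min_len:
            # Bigram only inside this token (assumed "unsplit phrase").
            out.extend(tok[i:i + 2] for i in range(len(tok) - 1))
        else:
            out.append(tok)
    return out


# Two adjacent placeholder "unsplit phrases" followed by a normal word.
tokens = ["甲乙丙丁", "戊己庚辛", "宪法"]
result = cleanup_bigrams(tokens)
```

Here result holds three bigrams from each 4-ideogram token plus the untouched 
2-ideogram word, and no bigram joins the last ideogram of the first phrase to 
the first ideogram of the second.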

I am not an expert on the Chinese language or the HMM technology used in the 
Smart Chinese toolkit. I created the bigram filter after difficulties 
attempting to supply a high-quality search experience for Chinese legal 
documents. This is a straw-man solution to the problem. If you know better, 
please say so and we will iterate.

The patch includes a 'text_zh' field type which includes the bigram filter. The 
bigram filter is essential if 'text_zh' is to be the preferred recommendation.
                
> Support Smart Simplified Chinese in Solr - include clean-up bigramming filter
> -----------------------------------------------------------------------------
>
>                 Key: SOLR-3653
>                 URL: https://issues.apache.org/jira/browse/SOLR-3653
>             Project: Solr
>          Issue Type: New Feature
>          Components: Schema and Analysis
>            Reporter: Lance Norskog
>         Attachments: SOLR-3653.patch, SmartChineseType.pdf
>
>
> The "Smart" Simplified Chinese toolkit in lucene/analysis/smartcn has no Solr 
> factories. Also, since it is a statistical algorithm, it is not perfect.
> This patch supplies factories and a schema.xml type for the existing Lucene 
> Smart Chinese implementation, and includes a "fixup" class to handle the 
> occasional mistake made by the Smart Chinese implementation.

