[
https://issues.apache.org/jira/browse/SOLR-3653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13418997#comment-13418997
]
Lance Norskog commented on SOLR-3653:
-------------------------------------
The SmartChineseWordTokenFilter is a statistical algorithm (a Hidden Markov
Model, to be exact) that was trained on a corpus of text. Its purpose is to
split text into "words", which are single ideograms, bigrams and occasionally
trigrams of Simplified Chinese ideograms (letters). It does a very good job,
but since it is statistically based it is not perfect. When it fails, it emits
"words" that are 4 or more ideograms long. These are really phrases, and they
contain real words which should be searchable.
The attached PDF of the Analysis page shows the problem. Chinese legal text
proved a pathological case and created a 7-ideogram word. In order to make
parts of this text searchable, the 7-ideogram phrase has to be broken into
n-grams. Unigrams give more recall while bigrams give more precision.
This patch includes a new SmartChineseBigramFilter, which takes any words not split by
the WordTokenFilter and creates bigrams from them. The bigrams only span the
unsplit phrase. They do not overlap between two adjoining unsplit phrases. The
attached PDF shows this effect as well between the first and second unsplit
phrases.
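
To make the behavior concrete, here is a minimal sketch of the bigramming
logic in plain Java. It is not the code from the patch and does not use the
Lucene TokenFilter API. The class name BigramFixupSketch, the method names,
and the length threshold of 4 (taken from the "4 or more ideograms" remark
above) are assumptions for illustration, and Latin letters stand in for
ideograms.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the SmartChineseBigramFilter from the patch.
class BigramFixupSketch {

    // Assumption: a token of 4 or more ideograms is treated as an unsplit phrase.
    static final int MIN_PHRASE_LENGTH = 4;

    /** Returns the overlapping bigrams inside one phrase, counted in code points. */
    static List<String> bigrams(String phrase) {
        int[] cps = phrase.codePoints().toArray();
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < cps.length; i++) {
            out.add(new String(cps, i, 2));
        }
        return out;
    }

    /**
     * Passes short tokens through unchanged and replaces each long token with
     * its bigrams. Each token is processed on its own, so no bigram ever spans
     * two adjoining unsplit phrases.
     */
    static List<String> fixup(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            int length = token.codePointCount(0, token.length());
            if (length >= MIN_PHRASE_LENGTH) {
                out.addAll(bigrams(token));
            } else {
                out.add(token);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // "AB" is a normal word; "CDEFG" and "HIJK" are adjoining unsplit phrases.
        List<String> tokens = List.of("AB", "CDEFG", "HIJK");
        System.out.println(fixup(tokens));
        // prints [AB, CD, DE, EF, FG, HI, IJ, JK]
    }
}
```

Note that the output contains no "GH" bigram: the two adjoining phrases are
bigrammed separately, which matches the non-overlapping behavior described
above.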
I am not an expert on the Chinese language or the HMM technology used in the
Smart Chinese toolkit. I created the bigram filter after difficulties
attempting to supply a high-quality search experience for Chinese legal
documents. This is a straw-man solution to the problem. If you know better,
please say so and we will iterate.
The patch includes a 'text_zh' field type which includes the bigram filter. The
bigram filter is essential if 'text_zh' is to be the preferred recommendation.
> Support Smart Simplified Chinese in Solr - include clean-up bigramming filter
> -----------------------------------------------------------------------------
>
> Key: SOLR-3653
> URL: https://issues.apache.org/jira/browse/SOLR-3653
> Project: Solr
> Issue Type: New Feature
> Components: Schema and Analysis
> Reporter: Lance Norskog
> Attachments: SOLR-3653.patch, SmartChineseType.pdf
>
>
> The "Smart" Simplified Chinese toolkit in lucene/analysis/smartcn has no Solr
> factories. Also, since it is a statistical algorithm, it is not perfect.
> This patch supplies factories and a schema.xml type for the existing Lucene
> Smart Chinese implementation, and includes a "fixup" class to handle the
> occasional mistake made by the Smart Chinese implementation.