[ https://issues.apache.org/jira/browse/SOLR-3653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13418997#comment-13418997 ]
Lance Norskog commented on SOLR-3653:
-------------------------------------

The SmartChineseWordTokenFilter is a statistical algorithm (a Hidden Markov Model, to be exact) that was trained on a corpus of training text. Its purpose is to split text into "words", which are unigrams, bigrams and occasionally trigrams of Simplified Chinese ideograms (letters). It does a very good job, but since it is statistically based it is not perfect. When it fails, it emits "words" that are 4 or more ideograms long. These are really phrases, and they contain real words which should be searchable.

The attached PDF of the Analysis page shows the problem. Chinese legal text proved a pathological case and produced a 7-ideogram "word". To make parts of this text searchable, the 7-letter phrase has to be broken into n-grams. Unigrams give more recall, while bigrams give more precision.

This patch includes a new SmartChineseBigramFilter that takes any words not split by the WordTokenFilter and creates bigrams from them. The bigrams only span the unsplit phrase; they do not overlap between two adjoining unsplit phrases. The attached PDF shows this effect as well, between the first and second unsplit phrases.

I am not an expert on the Chinese language or on the HMM technology used in the Smart Chinese toolkit. I created the bigram filter after difficulties in attempting to supply a high-quality search experience for Chinese legal documents. This is a straw-man solution to the problem. If you know better, please say so and we will iterate.

The patch includes a 'text_zh' field type which includes the bigram filter. The bigram filter is essential if 'text_zh' is to be the preferred recommendation.
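
As a rough illustration of the bigramming rule described in the comment above, here is a minimal Python sketch. It is not the code in the attached patch. The function name `ngram_unsplit`, the `min_phrase_len` threshold of 4 (taken from the "4 or more ideograms" remark) and the use of Latin letters as stand-ins for ideograms are all assumptions made for the example.

```python
def ngram_unsplit(tokens, n=2, min_phrase_len=4):
    """Break over-long "words" into n-grams, leaving short words alone.

    tokens: strings as emitted by a word segmenter (hypothetical input).
    n: n-gram size; 2 gives bigrams, 1 gives unigrams.
    min_phrase_len: tokens at least this long are treated as unsplit
        phrases (assumed threshold, must be >= n).
    """
    out = []
    for tok in tokens:
        if len(tok) < min_phrase_len:
            # The segmenter produced a plausible word: pass it through.
            out.append(tok)
        else:
            # Slide a window of n characters over this token only, so
            # no n-gram ever spans two adjoining tokens.
            out.extend(tok[i:i + n] for i in range(len(tok) - n + 1))
    return out


# Latin letters stand in for ideograms: one good word, then two
# adjoining unsplit phrases of 7 and 4 characters.
print(ngram_unsplit(["AB", "CDEFGHI", "JKLM"]))
```

Note that the output contains no "IJ" bigram, because each phrase is windowed independently. Passing `n=1` instead yields unigrams, trading precision for recall as described above.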
> Support Smart Simplified Chinese in Solr - include clean-up bigramming filter
> -----------------------------------------------------------------------------
>
>                 Key: SOLR-3653
>                 URL: https://issues.apache.org/jira/browse/SOLR-3653
>             Project: Solr
>          Issue Type: New Feature
>          Components: Schema and Analysis
>            Reporter: Lance Norskog
>         Attachments: SOLR-3653.patch, SmartChineseType.pdf
>
>
> The "Smart" Simplified Chinese toolkit in lucene/analysis/smartcn has no Solr
> factories. Also, since it is a statistical algorithm, it is not perfect.
> This patch supplies factories and a schema.xml type for the existing Lucene
> Smart Chinese implementation, and includes a "fixup" class to handle the
> occasional mistake made by the Smart Chinese implementation.