[ 
https://issues.apache.org/jira/browse/SOLR-3653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419729#comment-13419729
 ] 

Lance Norskog commented on SOLR-3653:
-------------------------------------

The toolkit cannot be perfect, and this patch solves one problem I have found 
with it. A good example of the edge case is "个人所得税" (personal/individual 
"income tax"). The first two characters are a common modifier meaning 
"individual" or "personal", and the last three characters mean (roughly) 
"income tax". The latter should be an indexable unit. For my sanity I'll refer 
to the word as ABCDE. The characters A and AB are common modifiers. 

In cases where CDE' is not in the dictionary, the toolkit should still split 
the word ABCDE' after AB. That would be the point of using a statistical model 
instead of merely relying on a dictionary, right?  

I found the CJK bigram system clumsy and the unigram approach useless. If 
the user of the Smart Chinese toolkit has to assume that it is perfect, then it 
is only usable within the context of its training data. The toolkit has proved 
a nice general solution to the problem, and it would be a shame to render it 
useless because of this one rough edge. 
                
> Custom bigramming filter to handle Smart Chinese edge cases
> -----------------------------------------------------------
>
>                 Key: SOLR-3653
>                 URL: https://issues.apache.org/jira/browse/SOLR-3653
>             Project: Solr
>          Issue Type: New Feature
>          Components: Schema and Analysis
>            Reporter: Lance Norskog
>         Attachments: SOLR-3653.patch, SmartChineseType.pdf
>
>
> The "Smart" Simplified Chinese toolkit in lucene/analysis/smartcn does not 
> work in some edge cases. It fails to split certain words which were not part 
> of the dictionary or training corpus. 
> This patch supplies a bigramming class to handle these occasional mistakes. 
> The algorithm creates bigrams out of all "words" longer than two ideograms.
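
The bigramming rule in the description above can be sketched roughly as 
follows. This is only an illustration in Python, not the code in 
SOLR-3653.patch: the function name and the max_len parameter are hypothetical, 
and the choice of overlapping bigrams is an assumption, since the description 
does not say whether the bigrams overlap.

```python
def bigram_long_tokens(tokens, max_len=2):
    """Keep tokens of up to max_len characters; split longer ones into bigrams.

    Assumption: bigrams overlap (AB, BC, CD, ...). The issue text only says
    that "words" longer than two ideograms are turned into bigrams.
    """
    out = []
    for token in tokens:
        if len(token) <= max_len:
            # Short tokens are trusted as the segmenter produced them.
            out.append(token)
        else:
            # One bigram starting at each character except the last.
            out.extend(token[i:i + 2] for i in range(len(token) - 1))
    return out
```

Under these assumptions, the unsplit token "个人所得税" becomes "个人", "人所", 
"所得", "得税", while one- and two-character tokens pass through unchanged.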

