[ 
https://issues.apache.org/jira/browse/LUCENE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13470967#comment-13470967
 ] 

Christian Moen commented on LUCENE-3921:
----------------------------------------

Lance,

The idea I had in mind for Japanese uses language specific characteristics for 
katakana terms and perhaps weights that are dictionary-specific as well.  
However, we are hacking the our statistical model here and there are 
limitations as to how far we can go with this.

I don't know a whole lot about the Smart Chinese toolkit, but I believe the 
same approach to compound segmentation could work for Chinese as well.  
However, weights and implementation would likely to be separate.  Note that the 
above is really about one specific kind of compound segmentation that applies 
to Japanese so the thinking was to add additional heuristics for this specific 
type that is particularly tricky.

It might be a good idea to approach this problem also using the 
{{DictionaryCompoundWordTokenFilter}} and collectively build some lexical 
assets for compound splitting for the relevant languages than hacking our 
models.
                
> Add decompose compound Japanese Katakana token capability to Kuromoji
> ---------------------------------------------------------------------
>
>                 Key: LUCENE-3921
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3921
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 4.0-ALPHA
>         Environment: Cent OS 5, IPA Dictionary, Run with "Search mdoe"
>            Reporter: Kazuaki Hiraga
>              Labels: features
>
> Japanese morphological analyzer, Kuromoji doesn't have a capability to 
> decompose every Japanese Katakana compound tokens to sub-tokens. It seems 
> that some Katakana tokens can be decomposed, but it cannot be applied every 
> Katakana compound tokens. For instance, "トートバッグ(tote bag)" and "ショルダーバッグ" 
> don't decompose into "トート バッグ" and "ショルダー バッグ" although the IPA dictionary 
> has "バッグ" in its entry.  I would like to apply the decompose feature to every 
> Katakana tokens if the sub-tokens are in the dictionary or add the capability 
> to force apply the decompose feature to every Katakana tokens.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to