[ https://issues.apache.org/jira/browse/LUCENE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239195#comment-13239195 ]

Christian Moen commented on LUCENE-3921:
----------------------------------------

I've been experimenting with the idea outlined above and I thought I should 
share some very early results.

The improvement here is basically to give the compound splitting heuristic an 
improved ability to split unknown words that are part of compounds.  
Experiments I've run using our compound splitting test cases suggest that 
the effect is indeed positive.  The improved heuristic is able to handle some 
of the test cases that we couldn't handle earlier, but all of this requires 
further experimentation and validation.

I've been able to segment トートバッグ (tote bag with トート being unknown) and also 
ショルダーバッグ (shoulder bag) as you would like with some weight tweaks, but then it 
also segmented エンジニアリング (engineering) into エンジニア (engineer) リング (ring).

It might be possible to tune this up or develop a more advanced heuristic 
that remedies this, but I haven't had a chance to look further into this.  
Also, any change here would require extensive testing and validation.  See the 
evaluation attached to LUCENE-3726 that was done on Wikipedia for search mode.

Please note that there will not be time to provide improvements here for 3.6, 
but we can follow up on katakana segmentation for 4.0.

With the above idea for katakana in mind, I'm thinking we can skip emitting 
katakana words that start with ン、ッ、ー, since we don't want tokens that start 
with these characters, and consider adding this as an option to the tokenizer 
if it works well.
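The skip rule could be sketched roughly as follows. This is a hypothetical illustration, not the actual Kuromoji lattice code; the function name, the `max_len` parameter, and the candidate-enumeration shape are all assumptions made for the example.

```python
# Characters that cannot begin a Japanese word: syllabic n, small tsu,
# and the prolonged sound mark.
ILLEGAL_START = {"ン", "ッ", "ー"}

def emit_katakana_candidates(text, max_len=8):
    """Enumerate (start, end) spans for unknown katakana word candidates,
    skipping any span that would begin with an illegal start character.

    Assumes `text` is a run of katakana characters; `max_len` caps the
    candidate length (see the note on limiting unknown-word length)."""
    spans = []
    for start, ch in enumerate(text):
        if ch in ILLEGAL_START:
            continue  # never emit a token starting with ン, ッ, or ー
        for end in range(start + 1, min(start + max_len, len(text)) + 1):
            spans.append((start, end))
    return spans
```

For "バッグ" this emits candidates starting at バ and グ, but nothing starting at the small ッ in the middle.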

Having said this, there are real limits to what we can achieve by hacking the 
statistical model (and it also affects our karma, you know...).  The approach 
above also has performance and memory impact.  We'd need to introduce a fairly 
short limit on how long unknown words can be, and this could perhaps apply 
only to unknown katakana words. The length restriction would be big enough to 
not have any practical impact on segmentation, though.
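To see why a short length limit matters for performance and memory, here is a back-of-the-envelope count of unknown-word candidate spans over a run of n characters. This is illustrative arithmetic only, not Kuromoji internals: unbounded enumeration is quadratic in the run length, while a cap of L keeps it roughly linear.

```python
def candidate_count(n, max_len=None):
    """Count unknown-word candidate spans over a run of n characters.

    Unbounded, every (start, end) pair is a candidate: n*(n+1)/2 spans.
    With a cap of max_len, each start position contributes at most
    max_len spans, so the total is about n * max_len."""
    total = 0
    for start in range(n):
        longest = n - start if max_len is None else min(max_len, n - start)
        total += longest
    return total
```

For a 10-character katakana run, unbounded enumeration yields 55 candidate spans, while a cap of 3 yields 27.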

An alternative approach to all of this is to build some lexical assets.  I 
think we'd get pretty far for katakana if we apply some of the corpus-based 
compound-splitting algorithms European NLP researchers have developed.  These 
algorithms are simple and quite effective.
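As a rough illustration of the corpus-based idea, in the spirit of the frequency-based splitting used for German compounds: pick the split whose parts have the highest geometric mean of corpus frequencies, and keep the word whole if no split beats its own frequency. The frequency table below is made up for the example, and the function is a sketch, not a proposed implementation.

```python
import math

def best_split(word, freq, min_part=2):
    """Return the best two-way split of `word` by geometric mean of
    part frequencies, or [word] if the unsplit word scores higher.

    `freq` maps surface forms to corpus counts; `min_part` is the
    minimum part length (both are assumptions for this sketch)."""
    best_parts, best_score = [word], freq.get(word, 0)
    for i in range(min_part, len(word) - min_part + 1):
        left, right = word[:i], word[i:]
        score = math.sqrt(freq.get(left, 0) * freq.get(right, 0))
        if score > best_score:
            best_parts, best_score = [left, right], score
    return best_parts
```

With made-up counts this splits トートバッグ into トート and バッグ, but leaves エンジニアリング whole because the full word is far more frequent than the エンジニア + リング decomposition, which is exactly the behavior we'd want from the earlier examples.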

Thoughts?

                
> Add decompose compound Japanese Katakana token capability to Kuromoji
> ---------------------------------------------------------------------
>
>                 Key: LUCENE-3921
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3921
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 4.0
>         Environment: CentOS 5, IPA Dictionary, Run with "Search mode"
>            Reporter: Kazuaki Hiraga
>              Labels: features
>
> The Japanese morphological analyzer Kuromoji doesn't have the capability to 
> decompose every Japanese Katakana compound token into sub-tokens. It seems 
> that some Katakana tokens can be decomposed, but the feature cannot be 
> applied to every Katakana compound token. For instance, "トートバッグ (tote bag)" 
> and "ショルダーバッグ" don't decompose into "トート バッグ" and "ショルダー バッグ" although 
> the IPA dictionary has "バッグ" as an entry.  I would like to apply the 
> decompose feature to every Katakana token whose sub-tokens are in the 
> dictionary, or add the capability to force-apply the decompose feature to 
> every Katakana token.
