[ https://issues.apache.org/jira/browse/LUCENE-3726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13194509#comment-13194509 ]

Christian Moen commented on LUCENE-3726:
----------------------------------------

These are very interesting questions, Robert.  Please find my comments below.

{quote}
should these parameters continue to be static-final, or configurable?
{quote}

It's perhaps possible to make these configurable, but I think we'd be exposing 
configuration that is more likely to confuse most users than to help them.

The values currently used were found through some analysis and 
experimentation, and they can probably be improved both in terms of tuning and 
with added heuristics -- in particular for katakana compounds (more below).

However, changing and improving this requires quite detailed analysis and 
testing.  I think the major case for exposing these parameters is as a means of 
tuning them easily rather than them being generally useful to users.
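For illustration, one hypothetical way to expose these would be an optional parameter object that defaults to today's behavior.  The class, field, and method names below are illustrative, not the actual Kuromoji identifiers, and the default values are placeholders rather than the tuned constants:

```java
// Hypothetical sketch: exposing the search-mode tuning knobs as configuration
// while keeping the current constants as defaults. All names and values here
// are illustrative, not Kuromoji's real fields.
public class SearchModeParams {
    public static final int DEFAULT_KANJI_LENGTH = 2;      // illustrative
    public static final int DEFAULT_KANJI_PENALTY = 3000;  // illustrative

    private final int kanjiLength;
    private final int kanjiPenalty;

    public SearchModeParams() {
        this(DEFAULT_KANJI_LENGTH, DEFAULT_KANJI_PENALTY);
    }

    public SearchModeParams(int kanjiLength, int kanjiPenalty) {
        this.kanjiLength = kanjiLength;
        this.kanjiPenalty = kanjiPenalty;
    }

    /** Extra Viterbi cost for a kanji run of the given length; zero if short enough. */
    public int penaltyFor(int runLength) {
        return runLength > kanjiLength ? (runLength - kanjiLength) * kanjiPenalty : 0;
    }
}
```

A tokenizer constructor could then take a `SearchModeParams` argument, with the no-arg default preserving existing behavior for everyone who doesn't tune.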

{quote}
should POS also play a role in the algorithm (can/should we refine exactly what 
we decompound)?
{quote}

Very good question and an interesting idea.

In the case of long kanji words such as 関西国際空港 (Kansai International Airport), 
which is a known noun, we can possibly use POS info as a hint for applying the 
Viterbi penalty.  In the case of unknown kanji, Kuromoji unigrams them.  
(関西国際空港 becomes 関西  国際  空港 (Kansai International Airport) using search mode.)

Katakana compounds such as シニアソフトウェアエンジニア (senior software engineer) become 
one token without search mode, but when search mode is used, we get the three 
tokens シニア  ソフトウェア  エンジニア as you would expect.  It's also the case that 
シニアソフトウェアエンジニア is an unknown word, but its constituents become known words 
and get the correct POS after search mode. 

In general, unknown words get a noun POS (名詞), so the idea of using POS here 
should be fine.
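As a sketch of the idea, the penalty could be gated on the candidate's POS so that only noun (名詞) candidates are considered for decompounding.  The POS strings below follow IPADIC's comma-separated convention; the class, method names, and threshold values are illustrative, not Kuromoji internals:

```java
// Hypothetical sketch: apply the search-mode decompound penalty only when the
// candidate's IPADIC-style part-of-speech tag denotes a noun (名詞).
public class PosGate {
    /** True if an IPADIC-style POS tag denotes a noun (first field is 名詞). */
    public static boolean isNoun(String pos) {
        return pos != null && pos.startsWith("名詞");
    }

    /** Apply the length-based penalty only to long noun candidates. */
    public static int penalty(String pos, int length, int lengthThreshold, int perCharPenalty) {
        if (isNoun(pos) && length > lengthThreshold) {
            return (length - lengthThreshold) * perCharPenalty;
        }
        return 0;
    }
}
```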

There are some problems with the katakana decompounding in search mode.  For 
example, コニカミノルタホールディングス (Konica Minolta Holdings) becomes コニカ  ミノルタ  ホール  
ディングス  (Konica Minolta hōru dingusu), where we get the spurious extra token 
ホール (which also means hall in Japanese).

To sum up, I think we can potentially use the noun POS as a hint when doing the 
decompounding in search mode.  I'm not sure how much we will benefit from it, 
but I like the idea.  I think we'll benefit most from an improved heuristic for 
non-kanji that improves katakana decompounding.

Let me have a tinker and see how I can improve this.
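One direction such a heuristic could take, sketched very loosely: only accept a katakana split when every constituent is a known dictionary word, so a split producing an out-of-vocabulary fragment like ディングス would be rejected and コニカミノルタホールディングス would keep ホールディングス whole.  The dictionary lookup is simulated with a plain Set here; in Kuromoji it would be a dictionary/lattice query, and this is an assumption about one possible fix, not the planned implementation:

```java
import java.util.List;
import java.util.Set;

// Hypothetical sketch of a katakana decompounding guard: accept a candidate
// split only if every constituent is a known dictionary word. The Set stands
// in for a real dictionary lookup.
public class KatakanaGuard {
    public static boolean acceptSplit(List<String> parts, Set<String> dictionary) {
        for (String part : parts) {
            if (!dictionary.contains(part)) {
                return false; // an unknown fragment like ディングス rejects the split
            }
        }
        return true;
    }
}
```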

{quote}
is the Tokenizer the best place to do this, or should we do it in a 
tokenfilter? or both?
{quote}

Interesting idea and good point regarding IDF.

In order to do the decompounding, we'll need access to the lattice so we can add 
entries to it before we run the Viterbi.  If we do normal segmentation first and 
then run a decompounding filter, I think we'll need to run the Viterbi twice in 
order to get the desired results.  (Optimizations are possible, though.)

One possibility could be to expose candidate decompounds as part of 
Kuromoji's Token interface.  We could potentially have something like

{code:title=Token.java}

/**
 * Returns a list of possible decompounds for this token found by a heuristic
 * 
 * @return a list of candidate decompounds, or null if none are found
 */
public List<Token> getDecompounds() {
  // ...
}
{code}

In the case of シニアソフトウェアエンジニア, the current token would have surface form 
シニアソフトウェアエンジニア, but with tokens シニア, ソフトウェア and エンジニア accessible using 
{{getDecompounds()}}.
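From the consumer side, a filter could then emit the decompound parts followed by the original compound (the posInc=0 idea from the issue description).  A minimal stand-in sketch, where this Token class is illustrative and carries only what the example needs, not Kuromoji's actual Token:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal consumer-side sketch of the proposed getDecompounds() API. The
// expand() helper shows how a TokenFilter might emit decompound parts first,
// then the original compound surface form (which would get posInc=0).
public class Token {
    private final String surface;
    private final List<Token> decompounds;

    public Token(String surface, List<Token> decompounds) {
        this.surface = surface;
        this.decompounds = decompounds;
    }

    public String getSurfaceForm() { return surface; }

    /** Candidate decompounds found by the heuristic, or null if none. */
    public List<Token> getDecompounds() { return decompounds; }

    /** Decompound parts first, then the original surface form. */
    public static List<String> expand(Token token) {
        List<String> out = new ArrayList<>();
        if (token.getDecompounds() != null) {
            for (Token part : token.getDecompounds()) {
                out.add(part.getSurfaceForm());
            }
        }
        out.add(token.getSurfaceForm());
        return out;
    }
}
```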

As a general note, I should point out that how well the heuristic performs 
depends on the dictionary/statistical model used (i.e. IPADIC), and we might 
want to make different heuristics for each of the models we support, as needed.
                
> Default KuromojiAnalyzer to use search mode
> -------------------------------------------
>
>                 Key: LUCENE-3726
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3726
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 3.6, 4.0
>            Reporter: Robert Muir
>
> Kuromoji supports an option to segment text in a way more suitable for search,
> by preventing long compound nouns as indexing terms.
> In general 'how you segment' can be important depending on the application 
> (see http://nlp.stanford.edu/pubs/acl-wmt08-cws.pdf for some studies on this 
> in chinese)
> The current algorithm punishes the cost based on some parameters 
> (SEARCH_MODE_PENALTY, SEARCH_MODE_LENGTH, etc)
> for long runs of kanji.
> Some questions (these can be separate future issues if any useful ideas come 
> out):
> * should these parameters continue to be static-final, or configurable?
> * should POS also play a role in the algorithm (can/should we refine exactly 
> what we decompound)?
> * is the Tokenizer the best place to do this, or should we do it in a 
> tokenfilter? or both?
>   with a tokenfilter, one idea would be to also preserve the original 
> indexing term, overlapping it: e.g. ABCD -> AB, CD, ABCD(posInc=0)
>   from my understanding this tends to help with noun compounds in other 
> languages, because IDF of the original term boosts 'exact' compound matches.
>   but does a tokenfilter provide the segmenter enough 'context' to do this 
> properly?
> Either way, I think as a start we should turn on what we have by default: its 
> likely a very easy win.
