[
https://issues.apache.org/jira/browse/LUCENE-3767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13212623#comment-13212623
]
Christian Moen commented on LUCENE-3767:
----------------------------------------
Mike,
Thanks a lot for this. I'd meant to comment on this earlier and I'd like to
look further into the details, but I really like your idea of running the
Viterbi in a streaming fashion.
Kuromoji originally split input using two punctuation characters as this would
be an articulation point in the lattice/graph in practice, but your idea is
much more elegant and also faithful to the statistical model.
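To make the older approach concrete, here is a minimal sketch of splitting the input at punctuation before running Viterbi on each chunk. It is not Kuromoji's actual code: the class name PunctuationSplitter is hypothetical, and the comment does not name the two characters, so the sketch assumes the ideographic full stop (U+3002) and the ideographic comma (U+3001).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
final class PunctuationSplitter {
    // Assumption: the two split characters are U+3002 (full stop) and U+3001 (comma).
    private static final String SPLIT_CHARS = "\u3002\u3001";

    // Cuts the text after every split character; each chunk keeps its trailing punctuation.
    static List<String> split(String text) {
        List<String> chunks = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            if (SPLIT_CHARS.indexOf(text.charAt(i)) >= 0) {
                chunks.add(text.substring(start, i + 1));
                start = i + 1;
            }
        }
        // Trailing text without closing punctuation becomes the last chunk.
        if (start < text.length()) {
            chunks.add(text.substring(start));
        }
        return chunks;
    }
}
```

Each chunk would then get its own lattice. This hard-codes the assumption that no token spans a split character, which a streaming search does not need to make.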
As for dealing with compounds, the penalization is a crude hack as you know,
but it turns out to work quite well in practice as many of the "decompounds" are
known to the statistical model. However, in cases where not all of them
are known, we sometimes get wrong decompounds. I've done some analysis of these
cases and it's possible to add more heuristics to deal with some that are
obviously wrong, such as a word starting with a long vowel sound in katakana.
This is a slippery slope that I'm reluctant to pursue...
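As an illustration of the kind of heuristic I mean, a check for the long vowel case could look roughly like the sketch below. The class and method names are hypothetical and this is not code from Kuromoji. It assumes the long vowel sound is written with the katakana prolonged sound mark (U+30FC) and that a candidate decompounding is given as a list of surface strings.

```java
import java.util.List;

// Hypothetical helper, for illustration only.
final class DecompoundCheck {
    // Katakana-hiragana prolonged sound mark, which lengthens the preceding vowel.
    private static final char PROLONGED_SOUND_MARK = '\u30FC';

    // Rejects a candidate split if any part is empty or starts with the
    // prolonged sound mark, since the mark needs a preceding vowel to lengthen.
    static boolean isPlausible(List<String> parts) {
        for (String part : parts) {
            if (part.isEmpty() || part.charAt(0) == PROLONGED_SOUND_MARK) {
                return false;
            }
        }
        return true;
    }
}
```

For example, a split whose second part begins with U+30FC would be rejected, while a split at a real word boundary would pass. Each rule of this kind is cheap to add, which is how the rules pile up.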
Robert mentioned earlier that he believes IPADIC could have been annotated with
compounds as the documentation mentions them, but they're not part of the
IPADIC model we are using. If it is possible to get the decompounds from the
training data (Kyoto Corpus), a better overall approach is then to do regular
segmentation (normal mode) and then provide the decompounds directly from the
token info for the compounds. We might need to retrain the model while
preserving the decompounds in order for this to work, but I think it is worth
investigating.
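A rough sketch of what I have in mind follows. Everything in it is hypothetical: the class name, and the idea that the token info can be modelled as a map from a compound's surface form to its annotated parts. The real dictionary format would differ, and whether the annotations can be recovered from the training data is still the open question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, for illustration only.
final class CompoundExpander {
    // Takes tokens from regular (normal mode) segmentation and replaces each
    // compound that has annotated parts with those parts. Tokens without an
    // annotation pass through unchanged.
    static List<String> expand(List<String> tokens, Map<String, List<String>> decompounds) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            List<String> parts = decompounds.get(token);
            if (parts == null) {
                out.add(token);
            } else {
                out.addAll(parts);
            }
        }
        return out;
    }
}
```

With this approach the segmentation itself is never penalized. The decompounds come from the annotation rather than from a second guess by the model.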
> Explore streaming Viterbi search in Kuromoji
> --------------------------------------------
>
> Key: LUCENE-3767
> URL: https://issues.apache.org/jira/browse/LUCENE-3767
> Project: Lucene - Java
> Issue Type: Improvement
> Components: modules/analysis
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: 3.6, 4.0
>
> Attachments: LUCENE-3767.patch, LUCENE-3767.patch, LUCENE-3767.patch,
> compound_diffs.txt
>
>
> I've been playing with the idea of changing the Kuromoji viterbi
> search to be 2 passes (intersect, backtrace) instead of 4 passes
> (break into sentences, intersect, score, backtrace)... this is very
> much a work in progress, so I'm just getting my current state up.
> It's got tons of nocommits, doesn't properly handle the user dict nor
> extended modes yet, etc.
> One thing I'm playing with is to add a double backtrace for the long
> compound tokens, ie, instead of penalizing these tokens so that
> shorter tokens are picked, leave the scores unchanged but on backtrace
> take that penalty and use it as a threshold for a 2nd best
> segmentation...