[
https://issues.apache.org/jira/browse/LUCENE-3767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13214148#comment-13214148
]
Robert Muir commented on LUCENE-3767:
-------------------------------------
{quote}
Robert mentioned earlier that he believes IPADIC could have been annotated with
compounds as the documentation mentions them, but they're not part of the
IPADIC model we are using. If it is possible to get the decompounds from the
training data (Kyoto Corpus), a better overall approach is then to do regular
segmentation (normal mode) and then provide the decompounds directly from the
token info for the compounds. We might need to retrain the model and preserve
the decompounds in order for this to work, but I think it is worth
investigating.
{quote}
The dictionary format documented for the original IPADIC can hold compound
data (it's not in mecab-ipadic though, so maybe it was never implemented?!),
but I don't actually see it used in any implementation. So yeah, to support
that we would need to find a corpus containing compound information (and of
course extend the file format and add support to kuromoji).
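For concreteness, a mecab-ipadic CSV entry looks roughly like the first line
below (context IDs and costs are illustrative). The extension would be one
more column carrying the compound's segmentation, which kuromoji's binary
format would then have to preserve; the second line is purely hypothetical:
{code}
関西国際空港,1288,1288,11914,名詞,固有名詞,組織,*,*,*,関西国際空港,カンサイコクサイクウコウ,カンサイコクサイクーコー
関西国際空港,1288,1288,11914,名詞,固有名詞,組織,*,*,*,関西国際空港,カンサイコクサイクウコウ,カンサイコクサイクーコー,関西/国際/空港
{code}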
However, would this really solve the whole issue? Wouldn't it really only
help for known kanji compounds... whereas most katakana compounds
(e.g. the software engineer example) are expected to be OOV anyway? So it seems
like, even if we ensured the dictionary was annotated for long kanji such that
we always used decompounded forms, we would still need a heuristic
decomposition like search mode either way, at least for the unknown katakana
case?
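To make the katakana case concrete, here is a quick standalone check, written
against the current o.a.l.analysis.ja API (so class names differ from the 3.x
Kuromoji* ones); the exact splits depend on the dictionary and penalty tuning:
{code:java}
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.ja.JapaneseTokenizer.Mode;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class KatakanaModeDemo {

  // Tokenize text in the given mode and print the surface forms.
  static void dump(Mode mode, String text) throws IOException {
    JapaneseTokenizer tok = new JapaneseTokenizer(null, true, mode);
    tok.setReader(new StringReader(text));
    CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
    tok.reset();
    System.out.print(mode + ": ");
    while (tok.incrementToken()) {
      System.out.print("[" + term + "] ");
    }
    tok.end();
    tok.close();
    System.out.println();
  }

  public static void main(String[] args) throws IOException {
    // "software engineer": a katakana compound that is typically OOV
    String text = "ソフトウェアエンジニア";
    dump(Mode.NORMAL, text); // expected: one long unknown token
    dump(Mode.SEARCH, text); // expected: the heuristic decomposition kicks in
  }
}
{code}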
And I tend to like Mike's improvements from a relevance perspective for these
reasons:
# keeping the original compound term for improved precision
# preventing compound decomposition from having any unrelated negative impact
on the rest of the tokenization
So I think we should pursue this change, even if we want to separately train a
dictionary in the future: in that case we would just disable the kanji
decomposition heuristic but keep the (obviously re-tuned!) heuristic for
katakana?
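On point #1: the emitted token graph can be inspected via the position
attributes. A small sketch (same modern-API caveat as above; whether the
compound token survives also depends on tokenizer version and options):
{code:java}
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.ja.JapaneseTokenizer.Mode;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute;

public class CompoundGraphDemo {
  public static void main(String[] args) throws IOException {
    // A known kanji compound; in search mode the decompounded parts are
    // emitted, and (depending on version/options) the original compound
    // as well, at posInc=0 with posLen spanning its parts.
    JapaneseTokenizer tok = new JapaneseTokenizer(null, true, Mode.SEARCH);
    tok.setReader(new StringReader("関西国際空港"));
    CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
    PositionIncrementAttribute posInc = tok.addAttribute(PositionIncrementAttribute.class);
    PositionLengthAttribute posLen = tok.addAttribute(PositionLengthAttribute.class);
    tok.reset();
    while (tok.incrementToken()) {
      System.out.println(term + " posInc=" + posInc.getPositionIncrement()
          + " posLen=" + posLen.getPositionLength());
    }
    tok.end();
    tok.close();
  }
}
{code}
With the ipadic entry for 関西国際空港 present, search mode should emit
関西/国際/空港 plus the compound itself at posInc=0 and posLen=3, so exact
phrase queries against the compound still match.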
> Explore streaming Viterbi search in Kuromoji
> --------------------------------------------
>
> Key: LUCENE-3767
> URL: https://issues.apache.org/jira/browse/LUCENE-3767
> Project: Lucene - Java
> Issue Type: Improvement
> Components: modules/analysis
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: 3.6, 4.0
>
> Attachments: LUCENE-3767.patch, LUCENE-3767.patch, LUCENE-3767.patch,
> compound_diffs.txt
>
>
> I've been playing with the idea of changing the Kuromoji Viterbi
> search to be 2 passes (intersect, backtrace) instead of 4 passes
> (break into sentences, intersect, score, backtrace)... this is very
> much a work in progress, so I'm just getting my current state up.
> It's got tons of nocommits, doesn't properly handle the user dict or
> extended modes yet, etc.
> One thing I'm playing with is to add a double backtrace for the long
> compound tokens, ie, instead of penalizing these tokens so that
> shorter tokens are picked, leave the scores unchanged but on backtrace
> take that penalty and use it as a threshold for a 2nd best
> segmentation...
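To illustrate the double-backtrace idea in the description, here is a toy
sketch: run Viterbi with the scores left unchanged, then on backtrace use the
penalty as a threshold for accepting a second-best decomposition of each long
token. All names, costs, and thresholds here are made up, and this is not the
actual patch (which works over the real lattice rather than re-running
Viterbi per span):
{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class DoubleBacktraceSketch {
  // Hypothetical unigram costs (lower is better); real Kuromoji uses
  // dictionary word costs plus connection costs between POS contexts.
  static final Map<String, Integer> COST = Map.of(
      "関西", 10, "国際", 10, "空港", 10, "関西国際空港", 25);
  static final int LONG_LEN = 3; // tokens longer than this count as compounds
  static final int PENALTY = 15; // the search-mode penalty, used as a threshold

  static int[] best, back;

  // Simple Viterbi over substrings: best[j] = min cost to segment text[0..j).
  // When allowFullSpan is false, the single token covering the whole span is
  // excluded, forcing a decomposition.
  static void viterbi(String text, boolean allowFullSpan) {
    int n = text.length();
    best = new int[n + 1];
    back = new int[n + 1];
    Arrays.fill(best, Integer.MAX_VALUE);
    best[0] = 0;
    for (int i = 0; i < n; i++) {
      if (best[i] == Integer.MAX_VALUE) continue;
      for (int j = i + 1; j <= n; j++) {
        if (!allowFullSpan && i == 0 && j == n) continue;
        Integer c = COST.get(text.substring(i, j));
        if (c != null && best[i] + c < best[j]) {
          best[j] = best[i] + c;
          back[j] = i;
        }
      }
    }
  }

  static List<String> backtrace(String text) {
    List<String> tokens = new ArrayList<>();
    for (int j = text.length(); j > 0; j = back[j]) {
      tokens.add(0, text.substring(back[j], j));
    }
    return tokens;
  }

  public static void main(String[] args) {
    String text = "関西国際空港";
    // First backtrace: the unpenalized best path (picks the compound, 25 < 30).
    viterbi(text, true);
    List<String> tokens = backtrace(text);
    System.out.println("best path: " + tokens);
    // Second backtrace: for each long token, re-segment its span and accept
    // the decomposition only if its cost is within PENALTY of the compound's.
    for (String tok : tokens) {
      if (tok.length() > LONG_LEN) {
        viterbi(tok, false);
        int alt = best[tok.length()];
        if (alt != Integer.MAX_VALUE && alt <= COST.get(tok) + PENALTY) {
          System.out.println("2nd-best for " + tok + ": " + backtrace(tok));
        }
      }
    }
  }
}
{code}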