[ https://issues.apache.org/jira/browse/LUCENE-3726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13194509#comment-13194509 ]
Christian Moen commented on LUCENE-3726:
----------------------------------------

These are very interesting questions, Robert. Please find my comments below.

{quote}
should these parameters continue to be static-final, or configurable?
{quote}

It's perhaps possible to make these configurable, but I think we'd be exposing configuration that is more likely to confuse most users than to help them. The values currently used were found through some analysis and experimentation, and they can probably be improved both through tuning and with added heuristics -- in particular for katakana compounds (more below). However, changing and improving them requires quite detailed analysis and testing. I think the major case for exposing them is as a means of easily tuning them, rather than because these parameters would be generally useful to users.

{quote}
should POS also play a role in the algorithm (can/should we refine exactly what we decompound)?
{quote}

Very good question and an interesting idea.

In the case of long kanji words such as 関西国際空港 (Kansai International Airport), which is a known noun, we can possibly use POS info as a hint for applying the Viterbi penalty. In the case of unknown kanji, Kuromoji unigrams them. (関西国際空港 becomes 関西 国際 空港 (Kansai International Airport) using search mode.)

Katakana compounds such as シニアソフトウェアエンジニア (senior software engineer) become one token without search mode, but when search mode is used, we get the three tokens シニア ソフトウェア エンジニア, as you would expect. It's also the case that シニアソフトウェアエンジニア is an unknown word, but its constituents become known and get the correct POS after search mode. In general, unknown words get a noun POS (名詞), so the idea of using POS here should be fine.

There are some problems with the katakana decompounding in search mode. For example, コニカミノルタホールディングス (Konica Minolta Holdings) becomes コニカ ミノルタ ホール ディングス (Konika Minolta horu dings), where we get the additional token ホール (which also means "hall" in Japanese).

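To make the POS-as-hint idea a bit more concrete, here is a minimal, self-contained sketch of a length-based penalty that is only applied to noun candidates. It is not Kuromoji's actual code: the class name, the {{penalty}} method, the {{isNoun}} flag and all four constants are hypothetical placeholders chosen for illustration, and the real parameters would need the analysis and testing mentioned earlier.

```java
import java.lang.Character.UnicodeScript;

/** Illustrative sketch only; names and values are assumptions, not Kuromoji's. */
class SearchModePenaltySketch {

    // Hypothetical thresholds and penalties (placeholders, not tuned values).
    static final int KANJI_LENGTH = 2;     // all-kanji candidates longer than this are penalized
    static final int KANJI_PENALTY = 3000; // extra cost per kanji beyond KANJI_LENGTH
    static final int OTHER_LENGTH = 7;     // other candidates (e.g. katakana) longer than this are penalized
    static final int OTHER_PENALTY = 1700; // extra cost per character beyond OTHER_LENGTH

    /** Returns true if the string is non-empty and every code point is a Han character. */
    static boolean isAllKanji(String s) {
        return !s.isEmpty()
            && s.codePoints().allMatch(cp -> UnicodeScript.of(cp) == UnicodeScript.HAN);
    }

    /**
     * Extra cost added to a lattice candidate before the Viterbi search.
     * The POS hint is modelled as a plain boolean: non-nouns are never penalized.
     */
    static int penalty(String surface, boolean isNoun) {
        if (!isNoun) {
            return 0;
        }
        int length = surface.codePointCount(0, surface.length());
        if (isAllKanji(surface)) {
            return length > KANJI_LENGTH ? (length - KANJI_LENGTH) * KANJI_PENALTY : 0;
        }
        return length > OTHER_LENGTH ? (length - OTHER_LENGTH) * OTHER_PENALTY : 0;
    }

    public static void main(String[] args) {
        System.out.println(penalty("関西国際空港", true));               // 6 kanji: (6 - 2) * 3000 = 12000
        System.out.println(penalty("関西", true));                       // short enough: 0
        System.out.println(penalty("シニアソフトウェアエンジニア", true)); // 14 katakana: (14 - 7) * 1700 = 11900
        System.out.println(penalty("関西国際空港", false));              // not a noun: 0
    }
}
```

Since unknown words generally get a noun POS, a gate like this would mostly change the outcome for known non-noun entries. Note that a purely length-based penalty of this kind does nothing about over-splitting such as the extra ホール token.
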
To sum up, I think we can potentially use the noun POS as a hint when doing the decompounding in search mode. I like the idea, but I'm not sure how much we will benefit from it. I think we'll benefit most from an improved heuristic for non-kanji to improve katakana decompounding. Let me have a tinker and see how I can improve this.

{quote}
is the Tokenizer the best place to do this, or should we do it in a tokenfilter? or both?
{quote}

Interesting idea and good point regarding IDF.

In order to do the decompounding, we'll need access to the lattice and add entries to it before we run the Viterbi. If we do normal segmentation first and then run a decompounding filter, I think we'll need to run the Viterbi twice in order to get the desired results. (Optimizations are possible, though.)

I'm thinking a possibility could be to expose possible decompounds as part of Kuromoji's Token interface. We can potentially have something like

{code:title=Token.java}
/**
 * Returns a list of possible decompounds for this token found by a heuristic
 *
 * @return a list of candidate decompounds or null if none is found
 */
List<Token> getDecompounds() {
  // ...
}
{code}

In the case of シニアソフトウェアエンジニア, the current token would have surface form シニアソフトウェアエンジニア, but with tokens シニア, ソフトウェア and エンジニア accessible using {{getDecompounds()}}.

As a general note, I should point out that how well the heuristics perform depends on the dictionary/statistical model used (i.e. IPADIC), and we might want to make different heuristics for each of those we support as needed.

> Default KuromojiAnalyzer to use search mode
> -------------------------------------------
>
> Key: LUCENE-3726
> URL: https://issues.apache.org/jira/browse/LUCENE-3726
> Project: Lucene - Java
> Issue Type: Improvement
> Affects Versions: 3.6, 4.0
> Reporter: Robert Muir
>
> Kuromoji supports an option to segment text in a way more suitable for search,
> by preventing long compound nouns as indexing terms.
> In general 'how you segment' can be important depending on the application
> (see http://nlp.stanford.edu/pubs/acl-wmt08-cws.pdf for some studies on this
> in Chinese)
> The current algorithm punishes the cost based on some parameters
> (SEARCH_MODE_PENALTY, SEARCH_MODE_LENGTH, etc)
> for long runs of kanji.
> Some questions (these can be separate future issues if any useful ideas come
> out):
> * should these parameters continue to be static-final, or configurable?
> * should POS also play a role in the algorithm (can/should we refine exactly
> what we decompound)?
> * is the Tokenizer the best place to do this, or should we do it in a
> tokenfilter? or both?
> with a tokenfilter, one idea would be to also preserve the original
> indexing term, overlapping it: e.g. ABCD -> AB, CD, ABCD(posInc=0)
> from my understanding this tends to help with noun compounds in other
> languages, because IDF of the original term boosts 'exact' compound matches.
> but does a tokenfilter provide the segmenter enough 'context' to do this
> properly?
> Either way, I think as a start we should turn on what we have by default: it's
> likely a very easy win.