Tomás:

The ICU code base is used by a *lot*, so I think it is safe to say that it works OK :)
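For readers following along: the Hiragana-to-Katakana folding suggested later in this thread is normally done with ICU's 'Hiragana-Katakana' transliterator (exposed to Solr through the ICU filters mentioned above), but the core of the mapping can be sketched in pure Python, since the two main kana blocks are laid out in parallel in Unicode. This is only an illustration of the idea, not a replacement for ICU, which also handles iteration marks and other edge cases.

```python
def hiragana_to_katakana(text: str) -> str:
    """Fold Hiragana characters into their Katakana equivalents.

    The main Hiragana block (U+3041..U+3096) parallels the Katakana
    block (U+30A1..U+30F6) at a fixed offset of 0x60, so a code-point
    shift covers the common characters; everything else (Katakana,
    Kanji, Latin) passes through unchanged.
    """
    return "".join(
        chr(ord(ch) + 0x60) if "\u3041" <= ch <= "\u3096" else ch
        for ch in text
    )

# 'Toyota' written in Hiragana folds to the Katakana spelling,
# so both forms produce the same index terms:
print(hiragana_to_katakana("とよた"))  # トヨタ
print(hiragana_to_katakana("トヨタ"))  # トヨタ (already Katakana, unchanged)
```

Applying the same fold at index time and query time is what makes a query in one kana script match a document written in the other.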
François:

On Mar 11, 2011, at 12:49 PM, Tomás Fernández Löbbe wrote:

> "the issue has to do with recall; for example, I can write 'Toyota' as 'トヨタ'
> or 'とよた' (Katakana and Hiragana respectively); not doing the transliteration
> will miss results."
> Exactly, that's my problem: searching in a different alphabet than the one
> in which the document was indexed.
> François, thank you for your help. Have you used the new ICU filters? Do
> they work OK? (I know they don't handle Kanji.)
>
> Tomás
>
> 2011/3/11 François Schiettecatte <fschietteca...@gmail.com>
>
>> Good question about transliteration. The issue has to do with recall; for
>> example, I can write 'Toyota' as 'トヨタ' or 'とよた' (Katakana and Hiragana
>> respectively), and not doing the transliteration will miss results. You will
>> find that the big search engines do the transliteration for you
>> automatically. This issue gets even more complicated when you dig into
>> orthographic variation, because Japanese orthography is very variable (i.e.
>> there is more than one way to write a 'word'), as is tokenization (i.e.
>> there is more than one way to tokenize it); see:
>>
>> http://www.cjk.org/cjk/reference/japvar.htm
>>
>> I have used the Basis Technology software in the past. It is very good, but
>> it is also very expensive.
>>
>> François
>>
>> On Mar 11, 2011, at 11:53 AM, Walter Underwood wrote:
>>
>>> Why not index it as-is? Solr can handle Unicode.
>>>
>>> Transliterating Hiragana to Katakana is a very weird idea. I cannot
>>> imagine how that would help.
>>>
>>> You will need some sort of tokenization to find word boundaries. N-grams
>>> work OK for search, but are really ugly for highlighting.
>>>
>>> As far as I know, there are no good-quality free tokenizers for Japanese.
>>> Basis Technology sells Japanese support that works with Lucene and Solr.
>>>
>>> wunder
>>>
>>> On Mar 11, 2011, at 8:09 AM, François Schiettecatte wrote:
>>>
>>>> Tomás
>>>>
>>>> That won't really work; transliteration to Romaji works for individual
>>>> terms only, so you would need to tokenize the Japanese prior to
>>>> transliteration. I am not sure what tool you plan to use for
>>>> transliteration; I have used ICU in the past, and from what I can tell
>>>> it does not transliterate Kanji. Besides, transliterating Kanji is
>>>> debatable for a variety of reasons.
>>>>
>>>> What I would suggest is that you transliterate Hiragana to Katakana,
>>>> leave the Kanji alone, and index/search using n-grams. If you want
>>>> 'proper' tokenization, I would recommend MeCab.
>>>>
>>>> I have looked into this for a client, and there is no clear-cut solution.
>>>>
>>>> Cheers
>>>>
>>>> François
>>>>
>>>> On Mar 11, 2011, at 10:29 AM, Tomás Fernández Löbbe wrote:
>>>>
>>>>> This question is probably not completely a Solr question, but it's
>>>>> related to it. I'm dealing with a Japanese Solr application in which I
>>>>> would like to be able to search in any of the Japanese alphabets. The
>>>>> content can also be in any Japanese alphabet. I've been thinking of
>>>>> this solution: convert everything to roma-ji, at index time and at
>>>>> query time.
>>>>> For example:
>>>>>
>>>>> Indexing time:
>>>>> [Something in Hiragana] --> transliterate it to roma-ji --> index
>>>>>
>>>>> Searching time:
>>>>> [Something in Katakana] --> transliterate it to roma-ji --> search
>>>>> or
>>>>> [Something in Kanji] --> transliterate it to roma-ji --> search
>>>>>
>>>>> I don't have a deep understanding of Japanese, and that's my problem.
>>>>> Has anybody on the list tried something like this before? Did it work?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Tomás
>>>
>>> --
>>> Walter Underwood
>>> Venture ASM, Troop 14, Palo Alto
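To illustrate the n-gram indexing that the thread recommends instead of dictionary tokenization: overlapping character bigrams (the scheme Lucene's CJK analysis also uses) let a query match a document without ever finding word boundaries. A minimal sketch in Python, with the example strings chosen here purely for illustration:

```python
def bigrams(text: str) -> list[str]:
    """Overlapping character bigrams of a string.

    Indexing and querying with the same bigrams sidesteps Japanese word
    segmentation entirely: a document matches when the query's bigrams
    all occur among the document's bigrams.
    """
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]

doc_terms = set(bigrams("トヨタ自動車"))    # terms indexed for the document
query_terms = set(bigrams("トヨタ"))        # terms produced for the query

# Every query bigram appears among the document's bigrams, so the
# document matches with no word segmentation at all:
print(query_terms <= doc_terms)  # True
```

As Walter notes above, this works acceptably for retrieval but produces ugly highlighting, since the matched units are bigrams rather than words; a real tokenizer such as MeCab is needed for word-level output.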