Why not index it as-is? Solr can handle Unicode. Transliterating hiragana to katakana is a very weird idea. I cannot imagine how that would help.
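
For anyone unfamiliar with the proposal in the quoted message below, hiragana-to-katakana folding is a mechanical code point shift. Here is a minimal sketch in Python, assuming only the basic kana blocks matter (U+3041 to U+3096 shifted by 0x60 onto U+30A1 to U+30F6). The function name is made up for illustration and is not part of Solr, Lucene, or ICU. Note that it leaves kanji untouched and does nothing about word boundaries.

```python
# Hypothetical helper for illustration only; not a Solr/Lucene/ICU API.
HIRAGANA_START = 0x3041  # small A
HIRAGANA_END = 0x3096    # small KE
KATAKANA_OFFSET = 0x60   # U+30A1 minus U+3041


def fold_hiragana_to_katakana(text: str) -> str:
    """Shift basic hiragana code points to their katakana counterparts.

    Everything else (kanji, katakana, the prolonged sound mark, Latin
    text) is passed through unchanged.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if HIRAGANA_START <= cp <= HIRAGANA_END:
            out.append(chr(cp + KATAKANA_OFFSET))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(fold_hiragana_to_katakana("ひらがな"))
    print(fold_hiragana_to_katakana("東京たわー"))
```

The same function would have to run at both index time and query time for the folded forms to match.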

You will need some sort of tokenization to find word boundaries. N-grams work OK for search, but are really ugly for highlighting.

As far as I know, there are no good-quality free tokenizers for Japanese. Basis Technology sells Japanese support that works with Lucene and Solr.

wunder

On Mar 11, 2011, at 8:09 AM, François Schiettecatte wrote:

> Tomás
>
> That won't really work. Transliteration to Romaji works for individual terms
> only, so you would need to tokenize the Japanese prior to transliteration. I
> am not sure what tool you plan to use for transliteration. I have used ICU in
> the past and from what I can tell it does not transliterate Kanji. Besides,
> transliterating Kanji is debatable for a variety of reasons.
>
> What I would suggest is that you transliterate Hiragana to Katakana, leave
> the Kanji alone, and index/search using ngrams. If you want 'proper'
> tokenization I would recommend Mecab.
>
> I have looked into this for a client and there is no clear cut solution.
>
> Cheers
>
> François
>
>
> On Mar 11, 2011, at 10:29 AM, Tomás Fernández Löbbe wrote:
>
>> This question is probably not a completely Solr question but it's related to
>> it. I'm dealing with a Japanese Solr application in which I would like to be
>> able to search in any of the Japanese Alphabets. The content can also be in
>> any Japanese Alphabet. I've been thinking of this solution: Convert
>> everything to roma-ji, at index time and query time.
>> For example:
>>
>> Indexing time:
>> [Something in Hiragana] --> translate it to roma-ji --> index
>>
>> Searching time:
>> [Something in Katakana] --> translate it to roma-ji --> search
>> or
>> [Something in Kanji] --> translate it to roma-ji --> search
>>
>> I don't have a deep understanding of Japanese, and that's my problem. Has
>> anybody on the list tried something like this before? Did it work?
>>
>>
>> Thanks,
>>
>> Tomás
>

--
Walter Underwood
Venture ASM, Troop 14, Palo Alto