Why not index it as-is? Solr can handle Unicode.

Transliterating hiragana to katakana is a very weird idea. I cannot imagine how 
that would help.

You will need some sort of tokenization to find word boundaries. N-grams work 
OK for search, but are really ugly for highlighting.
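
To make the n-gram point concrete, here is a minimal Python sketch of overlapping character bigrams. It is only an illustration, not how Solr's n-gram tokenizers are implemented; the function name char_ngrams and the sample string are made up for this example.

```python
def char_ngrams(text, n=2):
    """Return (start_offset, gram) pairs for overlapping character n-grams.

    Hypothetical helper for illustration only; not a Solr or Lucene API.
    """
    if not text:
        return []
    if len(text) < n:
        # Text shorter than n: keep it as a single gram so it stays searchable.
        return [(0, text)]
    return [(i, text[i:i + n]) for i in range(len(text) - n + 1)]


# Sample input chosen for this sketch: 東京都庁
grams = char_ngrams("東京都庁")
for offset, gram in grams:
    print(offset, gram)
```

Every gram overlaps its neighbors, and the string 東京都庁 also yields the gram 京都, which is a different place name. A query for 京都 therefore matches here, and a highlighter has only these two-character fragments to mark up instead of real words.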

As far as I know, there are no good-quality free tokenizers for Japanese. Basis 
Technology sells Japanese support that works with Lucene and Solr.

wunder

On Mar 11, 2011, at 8:09 AM, François Schiettecatte wrote:

> Tomás
> 
> That won't really work. Transliteration to Romaji works for individual terms 
> only, so you would need to tokenize the Japanese prior to transliteration. I 
> am not sure what tool you plan to use for transliteration; I have used ICU in 
> the past and, from what I can tell, it does not transliterate Kanji. Besides, 
> transliterating Kanji is debatable for a variety of reasons.
> 
> What I would suggest is that you transliterate Hiragana to Katakana, leave 
> the Kanji alone, and index/search using n-grams. If you want 'proper' 
> tokenization, I would recommend MeCab.
> 
> I have looked into this for a client and there is no clear-cut solution.
> 
> Cheers
> 
> François
> 
> 
> On Mar 11, 2011, at 10:29 AM, Tomás Fernández Löbbe wrote:
> 
>> This question is probably not strictly a Solr question, but it's related to
>> it. I'm dealing with a Japanese Solr application in which I would like to be
>> able to search in any of the Japanese scripts. The content can also be in
>> any Japanese script. I've been thinking of this solution: convert
>> everything to roma-ji, at index time and at query time.
>> For example:
>> 
>> Indexing time:
>> [Something in Hiragana] --> translate it to roma-ji --> index
>> 
>> Searching time:
>> [Something in Katakana] --> translate it to roma-ji --> search
>> or
>> [Something in Kanji] --> translate it to roma-ji --> search
>> 
>> I don't have a deep understanding of Japanese, and that's my problem. Has
>> anybody on the list tried something like this before? Did it work?
>> 
>> 
>> Thanks,
>> 
>> Tomás
> 

--
Walter Underwood
Venture ASM, Troop 14, Palo Alto


