Tomás:

The ICU code base is used by a *lot*, so I think it is safe to say that it works OK :)
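For readers following along: the Hiragana-to-Katakana folding suggested later in this thread is normally done with ICU's 'Hiragana-Katakana' transliterator (exposed to Solr through the ICU filters mentioned above), but the core of the mapping can be sketched in pure Python, since the two main kana blocks are laid out in parallel in Unicode. This is only an illustration of the idea, not a replacement for ICU, which also handles iteration marks and other edge cases.

```python
def hiragana_to_katakana(text: str) -> str:
    """Fold Hiragana characters into their Katakana equivalents.

    The main Hiragana block (U+3041..U+3096) parallels the Katakana
    block (U+30A1..U+30F6) at a fixed offset of 0x60, so a code-point
    shift covers the common characters; everything else (Katakana,
    Kanji, Latin) passes through unchanged.
    """
    return "".join(
        chr(ord(ch) + 0x60) if "\u3041" <= ch <= "\u3096" else ch
        for ch in text
    )

# 'Toyota' written in Hiragana folds to the Katakana spelling,
# so both forms produce the same index terms:
print(hiragana_to_katakana("とよた"))  # トヨタ
print(hiragana_to_katakana("トヨタ"))  # トヨタ (already Katakana, unchanged)
```

Applying the same fold at index time and query time is what makes a query in one kana script match a document written in the other.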
François:

On Mar 11, 2011, at 12:49 PM, Tomás Fernández Löbbe wrote:

> "the issue has to do with recall; for example, I can write 'Toyota' as 'トヨタ'
> or 'とよた' (Katakana and Hiragana respectively); not doing the transliteration
> will miss results."
> Exactly, that's my problem: searching in a different alphabet than the one
> in which the document was indexed.
> François, thank you for your help. Have you used the new ICU filters? Do
> they work OK? (I know they don't handle Kanji.)
>
> Tomás
>
> 2011/3/11 François Schiettecatte <fschietteca...@gmail.com>
>
>> Good question about transliteration. The issue has to do with recall; for
>> example, I can write 'Toyota' as 'トヨタ' or 'とよた' (Katakana and Hiragana
>> respectively), and not doing the transliteration will miss results. You will
>> find that the big search engines do the transliteration for you
>> automatically. This issue gets even more complicated when you dig into
>> orthographic variation, because Japanese orthography is very variable (i.e.
>> there is more than one way to write a 'word'), as is tokenization (i.e.
>> there is more than one way to tokenize it); see:
>>
>> http://www.cjk.org/cjk/reference/japvar.htm
>>
>> I have used the Basis Technology software in the past. It is very good, but
>> it is also very expensive.
>>
>> François
>>
>> On Mar 11, 2011, at 11:53 AM, Walter Underwood wrote:
>>
>>> Why not index it as-is? Solr can handle Unicode.
>>>
>>> Transliterating Hiragana to Katakana is a very weird idea. I cannot
>>> imagine how that would help.
>>>
>>> You will need some sort of tokenization to find word boundaries. N-grams
>>> work OK for search, but are really ugly for highlighting.
>>>
>>> As far as I know, there are no good-quality free tokenizers for Japanese.
>>> Basis Technology sells Japanese support that works with Lucene and Solr.
>>>
>>> wunder
>>>
>>> On Mar 11, 2011, at 8:09 AM, François Schiettecatte wrote:
>>>
>>>> Tomás
>>>>
>>>> That won't really work; transliteration to Romaji works for individual
>>>> terms only, so you would need to tokenize the Japanese prior to
>>>> transliteration. I am not sure what tool you plan to use for
>>>> transliteration; I have used ICU in the past, and from what I can tell
>>>> it does not transliterate Kanji. Besides, transliterating Kanji is
>>>> debatable for a variety of reasons.
>>>>
>>>> What I would suggest is that you transliterate Hiragana to Katakana,
>>>> leave the Kanji alone, and index/search using n-grams. If you want
>>>> 'proper' tokenization, I would recommend MeCab.
>>>>
>>>> I have looked into this for a client, and there is no clear-cut solution.
>>>>
>>>> Cheers
>>>>
>>>> François
>>>>
>>>> On Mar 11, 2011, at 10:29 AM, Tomás Fernández Löbbe wrote:
>>>>
>>>>> This question is probably not completely a Solr question, but it's
>>>>> related to it. I'm dealing with a Japanese Solr application in which I
>>>>> would like to be able to search in any of the Japanese alphabets. The
>>>>> content can also be in any Japanese alphabet. I've been thinking of
>>>>> this solution: convert everything to roma-ji, at index time and at
>>>>> query time.
>>>>> For example:
>>>>>
>>>>> Indexing time:
>>>>> [Something in Hiragana] --> transliterate it to roma-ji --> index
>>>>>
>>>>> Searching time:
>>>>> [Something in Katakana] --> transliterate it to roma-ji --> search
>>>>> or
>>>>> [Something in Kanji] --> transliterate it to roma-ji --> search
>>>>>
>>>>> I don't have a deep understanding of Japanese, and that's my problem.
>>>>> Has anybody on the list tried something like this before? Did it work?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Tomás
>>>
>>> --
>>> Walter Underwood
>>> Venture ASM, Troop 14, Palo Alto
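To illustrate the n-gram indexing that the thread recommends instead of dictionary tokenization: overlapping character bigrams (the scheme Lucene's CJK analysis also uses) let a query match a document without ever finding word boundaries. A minimal sketch in Python, with the example strings chosen here purely for illustration:

```python
def bigrams(text: str) -> list[str]:
    """Overlapping character bigrams of a string.

    Indexing and querying with the same bigrams sidesteps Japanese word
    segmentation entirely: a document matches when the query's bigrams
    all occur among the document's bigrams.
    """
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]

doc_terms = set(bigrams("トヨタ自動車"))    # terms indexed for the document
query_terms = set(bigrams("トヨタ"))        # terms produced for the query

# Every query bigram appears among the document's bigrams, so the
# document matches with no word segmentation at all:
print(query_terms <= doc_terms)  # True
```

As Walter notes above, this works acceptably for retrieval but produces ugly highlighting, since the matched units are bigrams rather than words; a real tokenizer such as MeCab is needed for word-level output.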