Multiple Japanese Alphabets in Solr

2011-03-11 Thread Tomás Fernández Löbbe
This is probably not strictly a Solr question, but it is related. I'm dealing with a Japanese Solr application in which I would like to be able to search in any of the Japanese Alphabets. The content can also be in any Japanese Alphabet. I've been thinking of this solution: Convert

Re: Multiple Japanese Alphabets in Solr

2011-03-11 Thread François Schiettecatte
Tomás, that won't really work: transliteration to Romaji works for individual terms only, so you would need to tokenize the Japanese prior to transliteration. I am not sure what tool you plan to use for transliteration; I have used ICU in the past, and from what I can tell it does not

Re: Multiple Japanese Alphabets in Solr

2011-03-11 Thread Walter Underwood
Why not index it as-is? Solr can handle Unicode. Transliterating hiragana to katakana is a very weird idea. I cannot imagine how that would help. You will need some sort of tokenization to find word boundaries. N-grams work OK for search, but are really ugly for highlighting. As far as I
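
To illustrate the n-gram remark above, here is a minimal sketch of overlapping character bigrams in plain Python. It is not Solr's n-gram tokenizer; the function name char_ngrams is made up for this example. It shows why matching works without word boundaries, and why highlighting gets ugly: every hit is a fragment that overlaps its neighbours.

```python
from typing import List


def char_ngrams(text: str, n: int = 2) -> List[str]:
    """Return every overlapping run of n characters in text.

    Hypothetical helper for illustration only; it does not reproduce
    the behavior of any particular Solr or Lucene tokenizer.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return [text[i:i + n] for i in range(len(text) - n + 1)]


if __name__ == "__main__":
    # Each bigram shares a character with the next one, so a highlighted
    # match rarely lines up with a real word boundary.
    print(char_ngrams("トヨタ自動車"))
```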

Re: Multiple Japanese Alphabets in Solr

2011-03-11 Thread François Schiettecatte
Good question about transliteration. The issue has to do with recall: for example, I can write 'Toyota' as 'トヨタ' or 'とよた' (Katakana and Hiragana respectively), so not doing the transliteration will miss results. You will find that the big search engines do the transliteration for you
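
The recall problem in the Toyota example can be sketched in a few lines of Python. This is not the ICU transliterator discussed in the thread; it relies only on the fact that the Unicode Hiragana block (U+3041 to U+3096) and the Katakana block (U+30A1 to U+30F6) are laid out in parallel, 0x60 code points apart. The function names are made up for this example, and Kanji and Romaji are left untouched.

```python
# Distance between a hiragana code point and its katakana counterpart.
KANA_OFFSET = 0x60


def hiragana_to_katakana(text: str) -> str:
    """Fold hiragana to katakana; every other character passes through."""
    out = []
    for ch in text:
        cp = ord(ch)
        if 0x3041 <= cp <= 0x3096:  # hiragana letters with a katakana twin
            out.append(chr(cp + KANA_OFFSET))
        else:
            out.append(ch)
    return "".join(out)


def same_kana_term(a: str, b: str) -> bool:
    """True when two terms differ only in which kana script they use."""
    return hiragana_to_katakana(a) == hiragana_to_katakana(b)


if __name__ == "__main__":
    print(hiragana_to_katakana("とよた"))
    print(same_kana_term("トヨタ", "とよた"))
```

Applying the same folding at index time and at query time would let a query in one kana script match a document written in the other, which is the recall gain described above.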

Re: Multiple Japanese Alphabets in Solr

2011-03-11 Thread Tomás Fernández Löbbe
the issue has to do with recall, for example, I can write 'Toyota' as 'トヨタ' or 'とよた' (Katakana and Hiragana respectively), not doing the transliteration will miss results. Exactly, that's my problem: searching in a different alphabet than the one in which a document was indexed. François, thank

Re: Multiple Japanese Alphabets in Solr

2011-03-11 Thread Walter Underwood
Sounds more like generating synonyms than conflating everything to one set of kana. Why not a filter that does that transliteration and adds a token at the same position? wunder On Mar 11, 2011, at 9:49 AM, Tomás Fernández Löbbe wrote: the issue has to do with recall, for example, I can
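
A rough sketch of the filter idea above, in plain Python rather than as a real Lucene TokenFilter. Tokens are modelled as (term, position increment) pairs, borrowing Lucene's convention that an increment of 0 stacks a token on the same position as the previous one. All names here are hypothetical, and the kana conversion is a simple code point shift, not ICU.

```python
from typing import Iterable, Iterator, Tuple

Token = Tuple[str, int]  # (term, position increment)
KANA_OFFSET = 0x60       # hiragana and katakana blocks are 0x60 apart


def _shift(text: str, lo: int, hi: int, delta: int) -> str:
    """Shift code points inside [lo, hi] by delta; leave the rest alone."""
    return "".join(chr(ord(c) + delta) if lo <= ord(c) <= hi else c
                   for c in text)


def to_katakana(text: str) -> str:
    return _shift(text, 0x3041, 0x3096, KANA_OFFSET)


def to_hiragana(text: str) -> str:
    return _shift(text, 0x30A1, 0x30F6, -KANA_OFFSET)


def add_kana_variants(tokens: Iterable[Token]) -> Iterator[Token]:
    """Emit each token, then its other-script spellings at the same position."""
    for term, increment in tokens:
        yield term, increment
        emitted = {term}
        for variant in (to_katakana(term), to_hiragana(term)):
            if variant not in emitted:
                emitted.add(variant)
                yield variant, 0  # increment 0: same position as the original


if __name__ == "__main__":
    for token in add_kana_variants([("トヨタ", 1), ("車", 1)]):
        print(token)
```

Keeping the original token and stacking the variant on top of it preserves the text as written, at the cost of a larger index than folding everything to one kana set.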

Re: Multiple Japanese Alphabets in Solr

2011-03-11 Thread François Schiettecatte
Tomás, the ICU code base is used by a *lot*, so I think it is safe to say that it works OK :) François On Mar 11, 2011, at 12:49 PM, Tomás Fernández Löbbe wrote: the issue has to do with recall, for example, I can write 'Toyota' as 'トヨタ' or 'とよた' (Katakana and Hiragana respectively), not

Re: Multiple Japanese Alphabets in Solr

2011-03-11 Thread François Schiettecatte
You could certainly do it that way if you wanted. The one point I would make here is that from a linguistic POV these are not synonyms but are the same term written in a different alphabet. François On Mar 11, 2011, at 12:51 PM, Walter Underwood wrote: Sounds more like generating synonyms