Chinese to Pinyin transliteration : homophone matching

Catala, Francois Mon, 10 Jun 2013 10:26:17 -0700

Hi,

I've been looking for ways to do homophone matching in Solr for CJK languages. 
I am digging into Chinese for a start.
My inputs are words made of simplified characters, and I need to match words 
that use different characters, but are pronounced the same way.


My conclusion is that I need to index all the possible pinyin representations 
for a given word. Then at query time, generate all pinyin representations for 
the searched word, and match all documents containing any one of them.
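
To make the idea concrete, here is a minimal sketch in plain Python (not a Solr component) of what I mean by "all possible pinyin representations". The READINGS table, pinyin_variants and is_homophone are hypothetical names made up for illustration, and the table is hand-written and not exhaustive; a real setup would load readings from a pronunciation dictionary.

```python
from itertools import product

# Hypothetical, hand-written reading table for illustration only.
# It is not exhaustive; real readings would come from a dictionary.
READINGS = {
    "长": ["cháng", "zhǎng"],
    "常": ["cháng"],
    "城": ["chéng"],
    "成": ["chéng"],
}

def pinyin_variants(word):
    """Return every pinyin rendering of word, picking one reading per character."""
    # Characters missing from the table are passed through unchanged.
    per_char = [READINGS.get(ch, [ch]) for ch in word]
    return {" ".join(combo) for combo in product(*per_char)}

def is_homophone(a, b):
    """True if the two words share at least one pinyin rendering."""
    return not pinyin_variants(a).isdisjoint(pinyin_variants(b))
```

At index time each variant would be stored as a term for the word; at query time the same expansion is applied and any shared variant counts as a match. Note that the number of variants grows multiplicatively with the number of multi-reading characters in a word.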

My question is: which components can do that in Solr? I've been looking at 
ICUTransformFilterFactory, but with id="Han-Latin" it seems to do a one-to-one 
mapping between characters and pinyin, while in reality it should be a 
one-to-many mapping.

Do you know of any Analyzer that could do something like:

- input: 长
- output: cháng | zhǎng | zháng


Thanks so much for your help!
