You'll see katakana used with kanji in noun compounds where one of the words is 
foreign.

In Japanese, "Rice University" is not written with the kanji for "rice". The 
name "Rice" is written in katakana and "university" in kanji, like this: ライス大学.

This is very common. I expect that "President Obama" uses kanji for the title 
and katakana for "Obama".
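
To make the two tokenization options concrete, here is a minimal Python sketch 
(not Solr or Lucene code). It compares bigramming straight across the string 
with segmenting by character class first. The function names and the simplified 
Unicode ranges are my own assumptions for illustration.

```python
def char_class(ch):
    """Rough script class by code point (simplified ranges, an assumption)."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "han"
    return "other"


def bigrams(text):
    """Overlapping character bigrams, ignoring character class."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def split_by_class(text):
    """Split text into runs of characters that share one class."""
    runs = []
    for ch in text:
        if runs and char_class(runs[-1][-1]) == char_class(ch):
            runs[-1] += ch
        else:
            runs.append(ch)
    return runs


def bigrams_within_class(text):
    """Bigram each same-class run separately; keep single characters as-is."""
    out = []
    for run in split_by_class(text):
        out.extend(bigrams(run) if len(run) > 1 else [run])
    return out


name = "ライス大学"  # "Rice University": katakana followed by kanji
print(bigrams(name))               # includes the cross-class bigram ス大
print(bigrams_within_class(name))  # no bigram spans the katakana/kanji boundary
```

The only difference between the two outputs for ライス大学 is the bigram ス大, 
which joins the katakana half of the compound to the kanji half.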

Removing hiragana is a bad idea. There are some words that are only written in 
hiragana.
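
A small sketch of what goes wrong, again in plain Python rather than any real 
analyzer. The helper name is hypothetical, and うどん (udon) is just an example 
I picked of a word that is usually written in hiragana.

```python
def strip_hiragana(text):
    """Drop every character in the Hiragana block (U+3040 to U+309F)."""
    return "".join(ch for ch in text if not ("\u3040" <= ch <= "\u309f"))


# A word written entirely in hiragana disappears completely.
print(repr(strip_hiragana("うどん")))

# In mixed text only the kanji survives, so the hiragana word is not indexed.
print(repr(strip_hiragana("東京のうどん")))
```

After stripping, a query for うどん has nothing left to match.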

wunder

On Apr 30, 2012, at 1:27 PM, Burton-West, Tom wrote:

> Thanks wunder and Lance,
> 
> In the discussions I've seen of Japanese IR in the English language IR 
> literature, Hiragana is either removed or strings are segmented first by 
> character class.  I'm interested in finding out more about why bigramming 
> across classes is desirable.
> Based on my limited understanding of Japanese, I can see how perhaps 
> bigramming a Han and Hiragana character might make sense but what about Han 
> and Katakana?
> 
> Lance, how did you weight the unigram vs bigram fields for CJK? Or did you 
> just OR them together assuming that idf will give the bigrams more weight?
> 
> Tom
> 


