Edward H Trager wrote:
> [...]
> If I were going to write such an algorithm, I would:
>
> * First, insure that the incoming text stream to be classified was
>   sufficiently long to be probabilistically classifiable. In other
>   words, what's the shortest stream of Hanzi characters needed, on
>   average, in a typical Chinese text (on the web, for example) in
>   order to encounter at least one "ge" u+500B or u+4E2A? One "wei"
>   u+70BA or u+4E3A? One "shuo" u+8AAC or u+8BF4? It wouldn't take
>   long to figure this out.

Lucky man! I was discussing a similar subject just yesterday, and
someone came up with this link:

http://lingua.mtsu.edu/chinese-computing/statistics/

The figures in the file <total.html> make it easy to answer your
question: in a typical text, "ge" accounts for 3.54%, "wei" for 1.96%,
"shuo" for 2.58%, etc.

_ Marco
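
As a rough illustration of how such figures could answer the "shortest stream" question: if a character has frequency p, and we assume (unrealistically) that characters occur independently of each other, then the chance of seeing it at least once in n characters is 1 - (1 - p)^n. The sketch below takes the three percentages quoted above at face value and solves for the smallest n that reaches a chosen confidence. The function names and the 95% default are my own assumptions for the example, not anything from the statistics site.

```python
import math

# Frequencies quoted in the message above, taken at face value as the
# per-character probability of each Hanzi in a typical text.
FREQUENCIES = {"ge": 0.0354, "wei": 0.0196, "shuo": 0.0258}


def prob_at_least_one(p, n):
    """Probability that a character of frequency p occurs in n characters.

    Assumes each position is an independent draw, which real text is not.
    """
    return 1.0 - (1.0 - p) ** n


def min_length(p, confidence=0.95):
    """Smallest n for which prob_at_least_one(p, n) >= confidence."""
    n = math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))
    # Guard against floating-point rounding in the closed-form estimate.
    while prob_at_least_one(p, n) < confidence:
        n += 1
    return n


if __name__ == "__main__":
    for name, p in FREQUENCIES.items():
        print(name, min_length(p))
```

Real text is bursty and topic-dependent, so these lengths are only a first estimate; a classifier that wants several marker characters at once would need a longer stream than any single one suggests.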