Edward H Trager wrote:
> [...]
> If I were going to write such an algorithm, I would:
> 
>  * First, ensure that the incoming text stream to be classified was
>    sufficiently long to be probabilistically classifiable.  In other
>    words, what's the shortest stream of Hanzi characters needed, on
>    average, in a typical Chinese text (on the web, for example) in
>    order to encounter at least one "ge" u+500B or u+4E2A? One "wei" 
>    u+70BA or u+4E3A? One "shuo" u+8AAC or u+8BF4? It wouldn't take
>    long to figure this out.

Lucky man! I was discussing a similar subject just yesterday, and
someone came up with this link:

        http://lingua.mtsu.edu/chinese-computing/statistics/

The figures in the file <total.html> make it easy to answer your question: in a
typical text, "ge" accounts for 3.54%, "wei" for 1.96%, "shuo" for 2.58%,
etc.
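
To turn those percentages into a stream length, here is a rough sketch in
Python. It rests on assumptions that are mine, not the site's: that each
figure is a per-character relative frequency, and that characters occur
independently of one another (real text is burstier, so treat the results as
a ballpark). The names FREQ, expected_wait and chars_needed are made up for
this illustration. With frequency p, the mean wait for the first occurrence
is 1/p, and the smallest n giving a chance of at least c of seeing the
character satisfies 1 - (1 - p)^n >= c.

```python
import math

# Frequencies quoted above, as fractions, keyed by pinyin.
# Assumption: these are per-character relative frequencies.
FREQ = {"ge": 0.0354, "wei": 0.0196, "shuo": 0.0258}

def expected_wait(p):
    """Mean number of characters read up to the first occurrence,
    assuming independent draws (geometric distribution)."""
    return 1.0 / p

def chars_needed(p, confidence=0.95):
    """Smallest n such that 1 - (1 - p)**n >= confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))

for name, p in FREQ.items():
    print(name, round(expected_wait(p), 1), chars_needed(p))
```

On these assumptions, a few dozen characters suffice on average for any one
of the three, and somewhat over a hundred are needed to be 95% sure of
meeting a given one.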

_ Marco
