verdy_p Sat, 28 Mar 2015 17:35:57 -0700

[Note: message resent from another domain. Apparently the Unicode mailing list 
rejects as spam all emails posted from Gmail's webmail, even when they contain 
all the relevant tracking MIME headers and are regularly signed by Google and 
my proven identity.]


2015-03-28 12:30 GMT+01:00 Michael Norton :
>
> Thanks Doug.  I did not know there exists a representative sample of the 
> world's text. :)
> I do know that 400 years ago there were about 10,000 languages; now there are 
> about 6,500.
> Time flies!  
>
> Your frequency chart is great. The average char appearance is 2.91%. Only 34% 
> from your list exceed 10% of it.
> Therefore, U+0020 is the elephant in the room (i.e. 15.05% is far > 2.91%).
> In fact, it's almost >50% greater than the next most-appearing character.   
>
> So from the two frequency lists you've given me (my email and yours) we begin 
> to see some patterns emerge.
> Provided prior data and observation, most useful patterns prevail over other 
> more obscure ones
> and present a provocative opportunity for webbers out there...
 
> While this is probably out of context for most of the 700 Unicode members, I 
> can report that it's good news.

A long time ago I learned a "word" (or is it an acronym? It is not really an 
abbreviation in itself, even though it is pronounceable) used by French 
cryptanalysts working on simple substitution ciphers: "ESARTINULOC" (some 
older sources gave "ESANTIRULO"). It is the list of the most frequently used 
basic letters in French, in decreasing order of frequency (ignoring case and 
diacritic differences). It is also used implicitly by gamers (e.g. when 
solving or composing crosswords, or when playing games such as Scrabble(TM), 
where the top letters of the list have lower scoring values, which differ 
between French Scrabble and English Scrabble).
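For anyone who wants to derive such a "word" from their own corpus, here is a 
minimal Python sketch of the counting, assuming that "ignoring case and 
diacritic differences" can be approximated by Unicode NFD decomposition 
followed by dropping the combining marks. The function name letter_ranking is 
mine, and ligatures such as "œ" are left as they are, so this is not 
necessarily how the classic list was compiled.

```python
import unicodedata
from collections import Counter


def letter_ranking(text: str) -> str:
    """Return the basic letters found in `text`, most frequent first."""
    # NFD splits "é" into "e" plus a combining acute accent.
    decomposed = unicodedata.normalize("NFD", text)
    counts = Counter(
        ch.upper()
        for ch in decomposed
        # isalpha() already rejects combining marks, spaces and punctuation;
        # the explicit combining() test just makes the intent visible.
        if ch.isalpha() and not unicodedata.combining(ch)
    )
    # Descending count, ties broken alphabetically so the result is stable.
    return "".join(sorted(counts, key=lambda c: (-counts[c], c)))


if __name__ == "__main__":
    print(letter_ranking("Été élevé"))
```

On a large enough French corpus the first letters should come out close to 
the list above, but the exact order depends on the texts chosen.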

That "word" is slightly different in English, or in the limited "global" 
count Doug did (over an extremely limited set of source texts); and of course 
in French the SPACE would also lead the list, ahead of that "word" (but it 
does not count for crosswords or Scrabble, even in languages that don't use 
spaces for word separation).

More accurate statistics may be obtained from databases with plain-text 
search capabilities (the counts are available in the structure of their 
index), provided they correctly track the language used and their data covers 
a large enough set of domains (e.g. statistics from the plain-text search 
engine of each **localized** edition of Wikipedia, Wiktionary, or Wikisource). 
If you want "global" statistics it will be more difficult (Wikimedia Commons 
is insufficiently translated, and English is overrepresented there), but what 
you can do is estimate the rate of usage of each main language (or 
macrolanguage) and weight the statistics collected for each language by that 
rate to produce an estimated "global" frequency list.
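The weighting described above could be sketched as follows. This is only the 
naive linear combination, under the assumption that every per-language table 
counts letters in the same way; the function name global_frequencies and the 
numbers in the usage example are placeholders of mine, not measured data.

```python
def global_frequencies(per_language, language_share):
    """Combine per-language letter rates into one weighted "global" table.

    per_language:   {language: {letter: rate}}, rates summing to 1 per language
    language_share: {language: estimated share of use}, in any unit
    """
    # Normalize the shares over the languages we actually have tables for.
    total_share = sum(language_share[lang] for lang in per_language)
    combined = {}
    for lang, rates in per_language.items():
        weight = language_share[lang] / total_share
        for letter, rate in rates.items():
            combined[letter] = combined.get(letter, 0.0) + weight * rate
    # Most frequent letter first.
    return dict(sorted(combined.items(), key=lambda kv: -kv[1]))


if __name__ == "__main__":
    # Placeholder figures, for illustration only.
    tables = {"fr": {"E": 0.5, "S": 0.5}, "en": {"E": 0.25, "T": 0.75}}
    shares = {"fr": 1, "en": 3}
    print(global_frequencies(tables, shares))
```

The result is only as good as the assumption that a given letter means the 
same thing in every input table.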

But be careful: each language has its own set of collation rules, so that 
letters considered to have the same primary weight in one language are 
distinguished and counted separately in another. You may find that a source 
"ü" or "ä" had its rate actually computed as "UE" or "AE" in German, but only 
as "U" or "A" in English or French, and this will not allow you to correctly 
estimate the global frequency rates of "U", "A" and "E". A simple linear 
mathematical transform (scalar products of the usage rates of languages and 
the usage rates of letters per language) would not work: the global usage 
rate of "E" would be underestimated where it also represents the German 
umlaut, and both "U" and "A" would be overestimated...
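A small Python sketch makes the discrepancy concrete. It counts the same 
German text twice: once expanding umlauts the German way ("ü" as "UE"), once 
folding them to the base letter as an English or French index might. The 
expansion table and the function name count_letters are my own assumptions 
for illustration, and "ß" is deliberately left out of scope.

```python
import unicodedata
from collections import Counter

# Assumed German-style expansions; "ß" is not handled in this sketch.
GERMAN_EXPANSIONS = {"ä": "ae", "ö": "oe", "ü": "ue"}


def count_letters(text: str, expand_umlauts: bool) -> Counter:
    """Count basic letters, expanding umlauts or folding them away."""
    # NFC first, so that "u" + combining diaeresis is also seen as "ü".
    text = unicodedata.normalize("NFC", text).lower()
    if expand_umlauts:
        for source, expansion in GERMAN_EXPANSIONS.items():
            text = text.replace(source, expansion)
    # NFD separates any remaining diacritics; isalpha() then drops them.
    decomposed = unicodedata.normalize("NFD", text)
    return Counter(ch.upper() for ch in decomposed if ch.isalpha())


if __name__ == "__main__":
    sample = "Müller für Bären"
    for label, expand in (("expanded", True), ("folded", False)):
        counts = count_letters(sample, expand)
        total = sum(counts.values())
        print(label, {c: f"{counts[c]}/{total}" for c in "AEU"})
```

The raw counts of "U" and "A" are the same both ways, but the count of "E" 
and the total number of letters are not, so the folded tally gives "E" a 
lower rate and "U" and "A" higher rates than the expanded one.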

_______________________________________________
Unicode mailing list
[email protected]
http://unicode.org/mailman/listinfo/unicode
