On 8/20/2012 12:04 AM, Manuel Strehl wrote:
Thanks for the answer.

It's clear to me that I could map "Hana" and "Kata" to "US" just for
the sake of having a Japanese minority in the States. Of course, the
mapping must be done in a sensible way, that is, it must be explained
how the mapping was derived. I'd be fine, I guess, with having all
official languages and the important historic ones respected
(disputable cases, where larger minority languages are suppressed, may
of course exist).

Basically I'm looking for an n:m chart with ISO 639 on the left and
ISO 15924 on the right. If the data itself is annotated with "used by
0.2% of the population" or "historic", all the better, because then I
could define my own cut-off limit. If there is only a prose explanation
of how the data was collected, I can judge whether the set suits the
task.
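
Just to make the cut-off idea concrete, here is a rough sketch, in
Python, of how I would filter such an annotated table; the rows and
percentages below are invented purely for illustration:

# Hypothetical rows of the n:m table described above: ISO 639 code,
# ISO 15924 code, share of the population using that pairing, and a
# status flag. All figures are made up for illustration only.
SAMPLE_ROWS = [
    ("ru", "Cyrl", 0.85,  "current"),
    ("tt", "Cyrl", 0.03,  "current"),
    ("tt", "Latn", 0.002, "historic"),
    ("ja", "Jpan", 0.002, "current"),
]

def scripts_above_cutoff(rows, cutoff=0.005, include_historic=False):
    """Return the scripts whose usage share reaches the cut-off limit."""
    return {
        script
        for lang, script, share, status in rows
        if share >= cutoff and (include_historic or status != "historic")
    }

print(scripts_above_cutoff(SAMPLE_ROWS))         # {'Cyrl'}
print(scripts_above_cutoff(SAMPLE_ROWS, 0.001))  # {'Cyrl', 'Jpan'}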

If there is no such data set whatsoever, I'll be off to scrape
Wikipedia again, but that is, as I've written, neither an effective nor
a particularly error-free approach.

There are other sources than Wikipedia.

I think what you are engaging in here is a bit of original research; in other words, you may be the first to try to put together this particular data set.

The usual statistics work off the number of "speakers" of languages, not the script they are written in (few languages are routinely written in more than one script simultaneously, and where they are, the division is usually by territory).

So you might make a map from language to script first (allowing some 1:n and allowing local differences in that map). Then you can plug in statistics on language use. There are many sources you can use; for the US, see http://www.mla.org/census_main/
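
In Python, that two-step approach might look roughly like the sketch below; the maps and percentages are invented for illustration, and the real figures would come from statistics sources such as the MLA data above:

# Step 1: a language -> script(s) map (1:n where needed).
# Step 2: per-territory speaker statistics, plugged in afterwards.
# All data here is invented for illustration only.
LANGUAGE_SCRIPTS = {
    "ru": ["Cyrl"],
    "sr": ["Cyrl", "Latn"],   # one of the few languages routinely written in two scripts
    "zh": ["Hans", "Hant"],
}

TERRITORY_LANGUAGES = {       # territory -> {language: share of speakers}
    "RS": {"sr": 0.88, "ru": 0.01},
    "RU": {"ru": 0.92},
}

def territory_scripts(territory, cutoff=0.02):
    """Collect the scripts of every language above the speaker-share cut-off."""
    scripts = set()
    for lang, share in TERRITORY_LANGUAGES.get(territory, {}).items():
        if share >= cutoff:
            scripts.update(LANGUAGE_SCRIPTS.get(lang, []))
    return scripts

print(territory_scripts("RS"))   # {'Cyrl', 'Latn'}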

A map would be more interesting if you could find a way to split larger territories, such as the US, Russia, China, India, etc. into some suitable subdivisions. Notice how the language map for the US shows non-English languages nicely concentrated along the coast and borders.

A./



Cheers,
Manuel

2012/8/20 Asmus Freytag <asm...@ix.netcom.com>:
On 8/19/2012 4:05 PM, Manuel Strehl wrote:

Hello,

I'm looking for a data source that maps countries to the scripts used
in them. The target application is a visualization in the context of my
codepoints.net site, namely http://codepoints.net/scripts.

At the moment I've extracted the preferred scripts from CLDR (e.g.,
Cyrl for Russia, Latn for Germany, and so on). Then I've added some
historic scripts by looking at the corresponding Wikipedia articles and
did some manual updating. However, the result is not really satisfactory.
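
(One way to do that kind of extraction is sketched below, using the
"und_XX" entries in CLDR's likelySubtags.xml; the file path and the
attribute names are as I recall them from the CLDR release I used, so
they are worth double-checking.)

import xml.etree.ElementTree as ET

def preferred_scripts(likely_subtags_path):
    """Territory code -> likely script, from the 'und_XX' likelySubtag entries."""
    result = {}
    for tag in ET.parse(likely_subtags_path).iter("likelySubtag"):
        source = tag.get("from", "")           # e.g. "und_RU"
        target = tag.get("to", "")             # e.g. "ru_Cyrl_RU"
        if not source.startswith("und_"):
            continue
        region = source[4:]
        # keep only territory entries: two letters ("RU") or three digits ("419")
        if not ((len(region) == 2 and region.isalpha()) or region.isdigit()):
            continue
        parts = target.split("_")
        if len(parts) >= 3:
            result[region.upper()] = parts[1]  # e.g. "RU" -> "Cyrl"
    return result

# preferred_scripts("common/supplemental/likelySubtags.xml").get("RU") == "Cyrl"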

For example, Russia has only Cyrl associated with it, while, as far as
I can tell, at least Latn and Arab should also be mentioned, and
perhaps some historic scripts as well.

I'd appreciate any pointers to data sets that could help me complete
and error-proof this mapping.

Cheers,
Manuel


Heck, my utility bill in the US has Thai and Chinese characters (for the
fine print, not the statement itself). There's one more script, possibly
Cyrillic; I don't have a bill in front of me right now. In some areas of
town you'll find a mixture of scripts on shop signs as well.

The point is that it's easy to identify a majority script, but getting an
accurate handle on "other" scripts is going to be tricky, if not
impossible. And it all depends on your arbitrary decision of which other
scripts to include and on what basis.

A./


