On 8/20/2012 12:04 AM, Manuel Strehl wrote:
Thanks for the answer.

It's clear to me that I could map "Hana" and "Kata" to "US" just for
the sake of having a Japanese minority in the States. Of course, the
mapping must be done in a sensible way, that is, it must be explained
how the mapping was derived. I'd be fine, I guess, with having all
official languages and the important historic ones respected
(disputable cases, where larger minority languages are suppressed, may
of course exist).

Basically I'm looking for an n:m chart with ISO 639 on the left and
ISO 15924 on the right. If the data itself is annotated with "used by
0.2% of the population" or "historic", all the better, because then I
could define my own cut-off limit. If there is only a prose explanation
of how the data was collected, I can judge whether the set suits the
task.
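
Just to make the cut-off idea concrete, here is a rough sketch, in
Python, of how I would filter such an annotated table; the rows and
percentages below are invented purely for illustration:

# Hypothetical rows of the n:m table described above: ISO 639 code,
# ISO 15924 code, share of the population using that pairing, and a
# status flag. All figures are made up for illustration only.
SAMPLE_ROWS = [
    ("ru", "Cyrl", 0.85,  "current"),
    ("tt", "Cyrl", 0.03,  "current"),
    ("tt", "Latn", 0.002, "historic"),
    ("ja", "Jpan", 0.002, "current"),
]

def scripts_above_cutoff(rows, cutoff=0.005, include_historic=False):
    """Return the scripts whose usage share reaches the cut-off limit."""
    return {
        script
        for lang, script, share, status in rows
        if share >= cutoff and (include_historic or status != "historic")
    }

print(scripts_above_cutoff(SAMPLE_ROWS))         # {'Cyrl'}
print(scripts_above_cutoff(SAMPLE_ROWS, 0.001))  # {'Cyrl', 'Jpan'}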

If there is no such data set whatsoever, I'll be off to scrape
Wikipedia again, but that is, as I've written, neither an effective nor
a particularly error-free approach.

There are other sources than Wikipedia.

I think what you are engaging in here is a bit of original research; in other words, you may be the first to try to put together this particular data set.

The usual statistics work off the number of "speakers" of languages, not the script they are written in (few languages are routinely written in more than one script simultaneously, and where they are, the division is usually by territory).

So you might make a map from language to script first (allowing some 1:n and allowing local differences in that map). Then you can plug in statistics on language use. There are many sources you can use; for the US, see http://www.mla.org/census_main/
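
In Python, that two-step approach might look roughly like the sketch below; the maps and percentages are invented for illustration, and the real figures would come from statistics sources such as the MLA data above:

# Step 1: a language -> script(s) map (1:n where needed).
# Step 2: per-territory speaker statistics, plugged in afterwards.
# All data here is invented for illustration only.
LANGUAGE_SCRIPTS = {
    "ru": ["Cyrl"],
    "sr": ["Cyrl", "Latn"],   # one of the few languages routinely written in two scripts
    "zh": ["Hans", "Hant"],
}

TERRITORY_LANGUAGES = {       # territory -> {language: share of speakers}
    "RS": {"sr": 0.88, "ru": 0.01},
    "RU": {"ru": 0.92},
}

def territory_scripts(territory, cutoff=0.02):
    """Collect the scripts of every language above the speaker-share cut-off."""
    scripts = set()
    for lang, share in TERRITORY_LANGUAGES.get(territory, {}).items():
        if share >= cutoff:
            scripts.update(LANGUAGE_SCRIPTS.get(lang, []))
    return scripts

print(territory_scripts("RS"))   # {'Cyrl', 'Latn'}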

A map would be more interesting if you could find a way to split larger territories, such as the US, Russia, China, India, etc. into some suitable subdivisions. Notice how the language map for the US shows non-English languages nicely concentrated along the coast and borders.

A./



Cheers,
Manuel

2012/8/20 Asmus Freytag <asm...@ix.netcom.com>:
On 8/19/2012 4:05 PM, Manuel Strehl wrote:

Hello,

I'm looking for a data source that maps countries to the scripts used
in them. The target application is a visualization in the context of my
codepoints.net site, namely http://codepoints.net/scripts.

At the moment I've extracted the preferred scripts from CLDR (e.g.,
Cyrl for Russia, Latn for Germany, and so on). Then I've added some
historic scripts by looking at the corresponding Wikipedia articles and
did some manual updating. However, the result is not really satisfactory.
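
(One way to do that kind of extraction is sketched below, using the
"und_XX" entries in CLDR's likelySubtags.xml; the file path and the
attribute names are as I recall them from the CLDR release I used, so
they are worth double-checking.)

import xml.etree.ElementTree as ET

def preferred_scripts(likely_subtags_path):
    """Territory code -> likely script, from the 'und_XX' likelySubtag entries."""
    result = {}
    for tag in ET.parse(likely_subtags_path).iter("likelySubtag"):
        source = tag.get("from", "")           # e.g. "und_RU"
        target = tag.get("to", "")             # e.g. "ru_Cyrl_RU"
        if not source.startswith("und_"):
            continue
        region = source[4:]
        # keep only territory entries: two letters ("RU") or three digits ("419")
        if not ((len(region) == 2 and region.isalpha()) or region.isdigit()):
            continue
        parts = target.split("_")
        if len(parts) >= 3:
            result[region.upper()] = parts[1]  # e.g. "RU" -> "Cyrl"
    return result

# preferred_scripts("common/supplemental/likelySubtags.xml").get("RU") == "Cyrl"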

For example, Russia has only Cyrl associated with it, while, as far as
I can tell, at least Latn and Arab should also be mentioned, and
perhaps some historic scripts as well.

I'd appreciate any pointers to data sets that could help me complete
and error-proof this mapping.

Cheers,
Manuel


Heck, my utility bill in the US has Thai and Chinese characters (for the
fine print, not the statement itself). There's one more script, possibly
Cyrillic; I don't have a bill in front of me right now. In some areas of
town you'll find a mixture of scripts on shop signs as well.

The point is that it's easy to identify a majority script, but getting an
accurate handle on "other" scripts is going to be tricky, if not
impossible. And it all depends on your arbitrary decision of which other
scripts to include and on what basis.

A./


