Re: Ethnologue

2000-09-13 Thread Michael Everson

Ar 09:19 -0800 2000-09-12, scríobh [EMAIL PROTECTED]:

>First, by the definitions assumed in the Ethnologue, they are all
>considered to be distinct languages; they would be candidates for separate
>literacy and literature development (if currently spoken-only), and if
>literature were to be developed, then they would need to be distinguished
>for IT-processing purposes such as spell-checking.

Yes, well, it is these definitions and how they are implemented which we
question.

>Thirdly, there are a large number of users looking for a complete set of
>tags. This includes people in the specialised fields of linguistics and
>anthropology, but it also includes governments, development organisations,
>and many businesses in the IT industry.

Peter, why has the Maintenance Agency for ISO 639 never heard from any of
these people, if governments are looking for standardized tags and everyone
else is looking for tags, why haven't they contacted 639 or 1766 to inform
us of the kinds of requirements they have?

Michael Everson  **  Everson Gunn Teoranta  **   http://www.egt.ie
15 Port Chaeimhghein Íochtarach; Baile Átha Cliath 2; Éire/Ireland
Vox +353 1 478 2597 ** Fax +353 1 478 2597 ** Mob +353 86 807 9169
27 Páirc an Fhéithlinn;  Baile an Bhóthair;  Co. Átha Cliath; Éire





Re: Ethnologue

2000-09-13 Thread Peter_Constable


On 09/13/2000 06:37:25 AM Michael Everson wrote:

>>First, by the definitions assumed in the Ethnologue, they are all
>>considered to be distinct languages; they would be candidates for
separate
>>literacy and literature development (if currently spoken-only), and if
>>literature were to be developed, then they would need to be distinguished
>>for IT-processing purposes such as spell-checking.
>
>Yes, well, it is these definitions and how they are implemented which we
>question.

But the only questions raised about the definitions and their
implementation have been based on anecdotal evidence, or on attested
instances of error in the results, which certainly exist but which do not
logically invalidate the definitions.



>>Thirdly, there are a large number of users looking for a complete set of
>>tags. This includes people in the specialised fields of linguistics and
>>anthropology, but it also includes governments, development
organisations,
>>and many businesses in the IT industry.
>
>Peter, why has the Maintenance Agency for ISO 639 never heard from any of
>these people, if governments are looking for standardized tags and
everyone
>else is looking for tags, why haven't they contacted 639 or 1766 to inform
>us of the kinds of requirements they have?

I can't answer for all these users. For example, I have no way of knowing
why UNESCO consulted SIL on enumerating the world's languages rather than
ISO. My understanding is that some have attempted to make their needs
known, but for whatever reason have not succeeded in getting the reponse
they hoped for. Perhaps there is a perception that ISO is unresponsive
leading people not to make their requests. Perhaps the Maintenance Agency
*is*, in fact, unresponsive. That's how you've been coming across in these
discussions: rather than saying, "I recognise the need, but have some
concerns about some details, so lets investigate how we can find the best
all-around solutions," your response has been, "I am not interested in
considering the list of languages enumerated in the Ethnologue."

There may be other reasons I don't know about. As for SIL, we have not made
any request before now since (a) ISO 639 so obviously and thoroughly fell
short of what we needed that there was no indication we could expect the
kind of support we need; (b) our efforts at enumerating languages for IT
purposes have a rather longer history than does ISO 639-x, and we have up
to now met our own needs - that's why the Ethnologue exists in the first
place. We are only now beginning to broach this topic with IT standards
bodies for these reasons:

1. Others keep coming to us asking us to do this.

2. We want to engage with other agencies in developing distributed archives
of online, linguistic data, and want to conform to industry standards like
XML (and, thus, RFC 1766 or its successor) in order to ensure good
documentation and interoperability.

3. We are increasingly involved in partnership with outside agencies like
UNESCO or government agencies for whom standardisation is deemed important.

If we only needed to maintain our own data internally, it would be a whole
lot easier for us to do this ourselves. It's the interaction with those
outside SIL that is pushing us to pursue more. Whether results are achieved
through ISO 639-x or through some successor to RFC1766 doesn't really
matter a lot, as long as the users involved are happy that there is a
de-facto standard.



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>





Re: the Ethnologue

2000-09-12 Thread Michael Everson

I thnk there are codes given to entities in the Ethnologue list that aren't
languages in the sense that we need to identify languages in IT and in
Bibliography (which is what the codes are for). I think that it is not
mature for International Standardization. It is a work in progress, subject
to change. As such it is a living document.

I don't see what the hurry is. Make a list of 100 languages that you *need*
codes for urgently. Make a list of another 100 after that. Encode languages
that you *really* need codes for. That's what I mean by saying "just
because it's in the list doesn't mean it should get a code".

Michael Everson  **  Everson Gunn Teoranta  **   http://www.egt.ie
15 Port Chaeimhghein Íochtarach; Baile Átha Cliath 2; Éire/Ireland
Vox +353 1 478 2597 ** Fax +353 1 478 2597 ** Mob +353 86 807 9169
27 Páirc an Fhéithlinn;  Baile an Bhóthair;  Co. Átha Cliath; Éire





Re: the Ethnologue

2000-09-12 Thread Rick McGowan

Oh Michael...

> I think there are codes given to entities in the Ethnologue list that
> aren't languages in the sense that we need to identify languages in IT
> and in Bibliography

ISO 639, and every other "standard" for language/locale codes also has this problem, 
and from what I remember of the last old version of Ethnologue that I looked at in 
great detail... the Ethnologue database has them in smaller quantity than other 
standards.

> I don't see what the hurry is.

Can you say "Library of Congresss"?

One of the arguments that representatives of the U. S. Government keep making to 
people who attend their shows at the IUC conferences each year... is that nobody knows 
where the next earthquake, flood, war, famine, bomb, or other disaster, be it 
political or natural, is going to occur.  And when there is an "incident", language 
data and translations and many things are needed IMMEDIATELY, not in ten years or more 
when ISO wakes up.

One of the major PROBLEMS with ISO 639, and other such lists developed by ISO over the 
years, is that they are not brought into being, or maintained, with the intent of 
being comprehensive.  They are either intended to, or do serve, some short-term narrow 
interests.

Governments, libraries, and businesses throughout the world have needed a 
comprehensive language and locale identification system for many years.  ISO has not 
provided it.  One place to start is with a comprehensive list of "languages" -- 
however you define that; and please define it at least with fair consistency.  The 
Ethnologue is a place to start.

Can anyone point me to an existing list of languages that is more comprehensive and 
better researched than the Ethnologue?  If there is no such list, then we don't need 
to consider any alternatives, right?

Rick


 


RE: the Ethnologue

2000-09-12 Thread Murray Sargent

Rick asks, 
>>Can anyone point me to an existing list of languages that is more
comprehensive and better
>> researched than the Ethnologue?  If there is no such list, then we don't
need to consider any
>>alternatives, right?

I've heard that the Ethnologue deals only with currently spoken languages
and doesn't provide codes that distinguish between dialects. It would be
nice to have a more general list of language codes.  It's important for
spell checking to distinguish between, say, British and American English.
The Ethnologue describes some such differences in text, but doesn't appear
to provide a corresponding list of secondary language codes (pls correct me
if I'm wrong).

Thanks
Murray



RE: the Ethnologue

2000-09-12 Thread Rick McGowan

Murray wrote:

> I've heard that the Ethnologue deals only with currently spoken languages 
> and doesn't provide codes that distinguish between dialects. It would be 
> nice to have a more general list of language codes.  It's important for 
> spell checking to distinguish between, say, British and American English. 
> The Ethnologue describes some such differences in text, but doesn't appear 
> to provide a corresponding list of secondary language codes (pls correct me 
> if I'm wrong). 

Looked at from the perspective of "locale" tagging like that, it does have 
shortcomings.  But it's still more comprehensive than other lists.  If you use that as 
the basis for a "basic" tag, you can always add a sub-language name space or other 
bits of hierarchy that make it as comprehensive as you want for purposes of use by 
computers.  Obviously also it would have to eventually be expanded to include dead 
languages.

I view the Ethnologue list as being the best candidate to START with; and around that, 
we would build a SYSTEM of identification that is as comprehensive and fine-grained as 
necessary for particular domains.

Rick


 


Re: the Ethnologue

2000-09-12 Thread Christopher J. Fynn


> Can anyone point me to an existing list of languages that is more
> comprehensive and better researched than the Ethnologue?  
> If there is no such list, then we don't need to consider any 
> alternatives, right?

I'm not qualified to judge the merits of one list over another
but there certaily are other comprehensive and well researched 
lists e.g. the Linguasphere Register of the World's Languages 
and Speech Communities see:  http://www.linguasphere.org/

Unfortunately their list is not available online, you have to buy 
the book - a bit like ISO/IEC 10646 and many other standards 
:-) 

I do know that the way the compilers of the Linguasphere have 
classified languages and dialects is different than the way the
compilers of the Ethnolouge have - though I'm sure both could
give you well reasoned arguments why their scheme is better
or more useful than the other. 

- Chris 




Re: the Ethnologue

2000-09-12 Thread Peter_Constable


On 09/12/2000 08:08:14 PM "Christopher J. Fynn" wrote:

>I'm not qualified to judge the merits of one list over another
>but there certaily are other comprehensive and well researched
>lists e.g. the Linguasphere Register of the World's Languages
>and Speech Communities see:  http://www.linguasphere.org/
>
>Unfortunately their list is not available online, you have to buy
>the book - a bit like ISO/IEC 10646 and many other standards
>:-)
>
>I do know that the way the compilers of the Linguasphere have
>classified languages and dialects is different than the way the
>compilers of the Ethnolouge have - though I'm sure both could
>give you well reasoned arguments why their scheme is better
>or more useful than the other.

I think the Linguasphere is a valueable publication, and the only
alternative I'm aware of that is a contender in place of the Ethnologue. My
concerns about it are:

- As Chris mentioned, the info isn't available online. I consider the
availability of online documentation to back up a set of codes to be
essential. Otherwise, there is no easy way for users to find out what
things mean.

- The Linguasphere uses a hierarchical system that begins with 10 divisions
in each of 10 major regions. This was done specifically to avoid questions
about higher-level genetic relationships, but the divisions end up being
rather arbitrary. The languages of the world do not in fact neatly divide
into 10 major groups in each of 10 major regions.

- There is a multi-level hierarchy that begins at levels above what the
Ethnologue considers to be a language, and goes below that level. There is
no certainty that one category in one place within the Linguasphere catalog
that is at a given level represents exactly the same kind of object as
other categories at the same level elsewhere in the catalog. Also, it is
not clear which of these levels are or are not useful for the purposes of
language-specific processing.

In contrast, it is our experience that the categories reflected in the
Ethnologue are the most generally useful for language-specific processing.
There are some exceptions to this (e.g. Murray Sargent pointed out that
there are regional-variant spelling conventions for English), but these are
the exception rather than the norm. Note also that something like spell
checking involves a *paralinguistic* notion, viz. spelling/orthographic
conventions, rather than the notion of *language* itself. There are clearly
cases of language-specific processing which will need to rely on some
paralinguistic notion such as "spelling/orthographic convention" or
"writing system". On the one hand, this area is not yet well enough
understood to come up with comprehensive enumerations of identifiers for
these various purposes. Secondly, identifiers that are appropriate purposes
will generally build from a set of *language* identifiers as a starting
point. (E.g. if you're going to enumerate writing systems, you'll need to
begin with an enumeration of languages.) As Rick responded to Murray,
Ethnologue codes don't solve all problems, but they do give us a
comprehensive list of modern languages that represents a good starting
point from which to work.

So, for these three reasons, I don't think the Linguasphere is as good a
choice for language identifiers for IT purposes. It would be useful for
documenting what identifiers within some system of identifiers denote,
except that the information is not available online.

Some are of the opinion that a hierarchical system is needed. A few people
at IUC17 commented that Ethnologue codes should be supplemented in this
way. Two comments:

1. Someone in the discussion time pointed out that there are many possible
alternate hierarchies based on orthogonal factors (e.g. inferred genetic
relationship, historical connections, geographic proximity, linguistic
similarity, related writing traditions, ...). It would be impossible to
have a single hierarchy that does all of this. (One further comment about
Linguasphere: I haven't read all of the introductory material, but there is
an indication that the choice was made to *not* base the hierarchy on
inferred genetic relationships since this was not considered relevant for
understanding the current socio-linguistic settings of language
communities. That raises the question of just what basis Linguasphere's
hierarchy *is* built on - it's not clear to me what this is.)

2. I don't think there is a clear understanding of what purposes
hierarchical categories would serve. Certainly a hierarchical,
non-leaf-node category can be useful for subject indexing (e.g. to find any
materials about Uto-Aztecan languages), but I don't think it's clear what
other useful purpose such a category would serve. I think it would be
better that identifiers for subject catalogs *not* get mixed up with
identifiers for langua

Re: the Ethnologue

2000-09-12 Thread J%ORG KNAPPEN

Rick McGowan asked:

> Can anyone point me to an existing list of languages that is more =
> comprehensive and better researched than the Ethnologue?  If there is no =
> such list, then we don't need to consider any alternatives, right?

Ask the closest university department of comparative linuguistics, and you will 
receive quite impressive lists. As a starter, 
David Crystall's Cambridge Encyclopedia of Language contains a good list 
of languages in one of its appendices.

I once looked at the ethnologue and its subdivision of the german language
is just ridiculous. Not small errors, a gross misconception. I don't trust
the ethnologue in area where I don't know the fact well, since it fails in one
area where I know them.

--J"org Knappen




Re: the Ethnologue

2000-09-13 Thread John Hudson

Rick McGowan wrote:

>One of the major PROBLEMS with ISO 639, and other such lists developed by
ISO over the years, is that they are not brought into being, or maintained,
with the intent of being comprehensive.  They are either intended to, or do
serve, some short-term narrow interests.

>Governments, libraries, and businesses throughout the world have needed a
comprehensive language and locale identification system for many years.
ISO has not provided it.  One place to start is with a comprehensive list
of "languages" -- however you define that; and please define it at least
with fair consistency.  The Ethnologue is a place to start.

>Can anyone point me to an existing list of languages that is more
comprehensive and better researched than the Ethnologue?  If there is no
such list, then we don't need to consider any alternatives, right?

I agree with everything Rick has said except his conclusions. As I
suggested to Peter Constable after his presentation at the Unicode
conference, the first task should _not_ be to populate any standards with
Enthologue codes or, of that matter, any other set of codes. The first
tasks should be to a) identify the different kinds of information that need
to be represented by tags (spoken languages, written languages, literary
languages (not the same thing as a written languages), particular
orthographies, language-specific script variants, ?, ?) and then b)
identify appropriate existing standards (if any actually exist) or develop
new standards to contain these tags. At the same time, the scope of these
standards should be clearly identified and rules introduced to govern the
addition of future tags (the kind of rules that don't result in a standard
containing codes for both individual languages and language groups).
Without such an approach, any new standard work will be plagued with
exactly the kind of inconsistencies that make both ISO 639 and the
Ethnologue of dubious merit for IT purposes.

This strikes me as a much more useful direction than trying to shove new
tags into already inconsistent standards that were originally designed for
other purposes. It is also a lot more useful than, for instance, trying to
forcefully align OpenType LangSys tags with ISO 639 codes, as has been
suggested simply because the latter is a STANDARD, when it is far from
clear that the two indicate the same kind of information.

John Hudson

Tiro Typeworks  A man was meant to be doubtful about
Vancouver, BC   himself, but undoubting about the truth;
www.tiro.comthis has been exactly reversed.
[EMAIL PROTECTED]   G.K. Chesterton



Re: the Ethnologue

2000-09-13 Thread Michael Everson

Ar 23:56 +0100 2000-09-12, scríobh Christopher J. Fynn:

>A lot of what are listed as "languages" in the Ethnologue are what most people
>would call dialects. For instance almost every known dialect of spoken Tibetan
>is listed as a separate language in the Ethnolouge although they all share
>only one written form.

YES. This is one of the serious problems of the list.

Ar 22:39 -0800 2000-09-12, scríobh Jörg Knappen:

>I once looked at the ethnologue and its subdivision of the german language
>is just ridiculous. Not small errors, a gross misconception. I don't trust
>the ethnologue in area where I don't know the fact well, since it fails in one
>area where I know them.

YES. This is one of the serious problems of the list.

If SIL has 2000 real languages they need codes for in real applications,
then those 2000 (which is a lot) should be proposed to 639 or 1766. That's
what 639 and 1766 are for. It would be nice to know what the applications
are.

I do not think we should adopt all 6000 codes from the Ethnologue as
"language tags". I am, frankly, shocked that linguists should consider
doing so so uncritically.

Or what, Ken? abandon the international standards and freeze the
Ethnologue, warts and all, and just vacuum up all its entities and tell the
world, use these tags? Or do a proper job of review of real requirements?

The Ethnologue itself wasn't designed for the IT purposes everyone seems to
be clamouring for, either, as far as I know. And if it were accepted as-is,
then it couldn't be revised, right?

More haste, less speed, people. Do you need a code for German? Yes. Do you
need a code for Manx? Yes. Though the communities differ vastly in size,
their IT reqirements are quite similar.

Do you need a code for !Xóõ? May I ask what for? There are (according to
the Ethnologue) 3000-4000 speakers. According to Anthony Traill's _A !Xóõ
Dictionary_, "!Xóõ is an unwritten language and its speakers have no notion
of linguistic standardization". Well, honestly, whose IT requirements are
you going to serve? (All the characters used in the dictionary are in the
UCS. I checked, because that *is* a real and important requirement.)

Do you really need 8 codes for "the German languages"? How many "Tibetans"
are there? Is Samvedi a language? It "shares many features with Gujarati.
Survey needed". How many times does "survey needed" appear in the
Etnologue? How many of them aren't really languages (in the sense that we
need to implement for IT and libraries (which use IT by the way)) but only
preliminary studies? HOW DO WE KNOW?

The Ethnologue says there are 6000 speakers of Shelta in Ireland, 50,000 in
the US, and 30,000 in the UK. That's 86,000 speakers?! The Ethnologue says
that Shelta is Indo-European:Celtic:Insular:Goidelic, which it isn't. It
names Hancock 1990 as the source of this (impossibly incorrect)
information. In the bibliography there is no Hancock 1990.

This is just another error, and for a language in Western Europe. I do not
believe that the Ethnologue can be taken so uncritically.

There may be problems with 639 and 1766 but the committees in question have
been addressing these recently so that we can make and maintain more
effective and responsive standards. Has that all been wasted effort? IT
industry can circumvent the standards easily if it wants to. Is that a good
idea?

The Ethnologue is an important resource. I use it, along with other
resources, in my work. But I don't think it is mature enough to BE an
international standard. A namespace in RFC 1766 could be created easily:
define a tag "e-" for "Ethnologue" and allow it next to "i-" and "x-". But
I have grave concerns about the wisdom of doing so, and nothing Peter has
said has dispelled them.

Lest anyone think that *I* am coming to shrink from supporting minority
languages, I will say this:

It appears that six characters needed to support Chipeywan in Canadian
Syllabics are missing from the UCS. I'll be looking further into this and
taking appropriate action to get the missing characters encoded in the
Universal Character Set.

Michael Everson
Language Tag Reviewer, RFC 1766





Re: the Ethnologue

2000-09-13 Thread Misha Wolf

The Library of Congress is very closely involved with ISO 639-2.
In fact, it is mostly their list of codes.

Misha


> Oh Michael...
> 
> > I think there are codes given to entities in the Ethnologue list that
> > aren't languages in the sense that we need to identify languages in IT
> > and in Bibliography
> 
> ISO 639, and every other "standard" for language/locale codes also has this problem, 
>and from what I remember of the last old version of Ethnologue that I looked at in 
>great detail... the Ethnologue database has them in smaller quantity than other 
>standards.
> 
> > I don't see what the hurry is.
> 
> Can you say "Library of Congresss"?
> 
> One of the arguments that representatives of the U. S. Government keep making to 
>people who attend their shows at the IUC conferences each year... is that nobody 
>knows where the next earthquake, flood, war, famine, bomb, or other disaster, be it 
>political or natural, is going to occur.  And when there is an "incident", language 
>data and translations and many things are needed IMMEDIATELY, not in ten years or 
>more when ISO wakes up.
> 
> One of the major PROBLEMS with ISO 639, and other such lists developed by ISO over 
>the years, is that they are not brought into being, or maintained, with the intent of 
>being comprehensive.  They are either intended to, or do serve, some short-term 
>narrow interests.
> 
> Governments, libraries, and businesses throughout the world have needed a 
>comprehensive language and locale identification system for many years.  ISO has not 
>provided it.  One place to start is with a comprehensive list of "languages" -- 
>however you define that; and please define it at least with fair consistency.  The 
>Ethnologue is a place to start.
> 
> Can anyone point me to an existing list of languages that is more comprehensive and 
>better researched than the Ethnologue?  If there is no such list, then we don't need 
>to consider any alternatives, right?
> 
>   Rick
> 
> 

-
Visit our Internet site at http://www.reuters.com

Any views expressed in this message are those of  the  individual
sender,  except  where  the sender specifically states them to be
the views of Reuters Ltd.



Re: the Ethnologue

2000-09-13 Thread Antoine Leca

Peter Constable wrote:
> 
> A tag that denotes a group of languages serves no useful
> purpose for most language-specific processes. For example, if all you know
> about the language of some information object is that it is an Athapascan
> language, you can't spell-check that information.

While I agree with you, there are anyway problems with the way languages
are distinguished.
For example, I know quite well two languages, three if we add English.
And their situation concerning spell checking are quite different.

With French (this is viewed with a France's point of view, Canadians,
Belgians and Swiss may view things differently), the spellings are quite
uniform, at least this allows an useful use of list of words to check spell.

With Valencian, this is viewed (either with ISO 639 any part or Ethnologue)
as a dialect of Catalan. The problem is that the spelling of Standard
Valencian is clearly established (and I am *not* talking about the alternative
spellings that are sometimes in use in Valencia or even more in the Balear
Islands), and it differs in some points with Catalan practice. These
points include: the termination of 1st person of present of indicative,
the whole subjunctive, the ordinals adjectives and the feminine possessive.
As a result, any operation of spell checking leads to quite a number of
false positives. Here, the solution is quite easy: doing specific lists of
words for Valencian (this exists for some tools, particularly for the public
domain softwares); however, there are no solution in sight about the tagging
of data. And Ethnologue does not seem of help, particularly since it seems to
aggregate the deviations I mentioned above with the tentative from a (minor)
part of Valencians to create a different spelling, specific to Valencian
concerns (as Ethnologue correctly notes, "The standard dialect is a literary
composite which no one speaks"; so specific local 'solutions' are easy to
design, particularly if intermixed with political problems of primacy and
rivalry between Barcelona and Valencia).

With English, the problem with spell checking is quite different, and different
lists of words would not be as easy for a solution: the en-US vs. en-GB
tagging does not seem to adequately cover the various differences such as
-ise vs. -ize, -our vs. -or, -re vs. -er, use of shall vs. will at 1st person,...
Or more precisely, if it does, that is if "en-GB" is intended to always cover
the first case in the pairs above, then I believe it will be of less use to
people (this is as I understand things; certainly people much more proficient
with English will contradict me here; please allow for my lack of knowledge in
this field and try to extract the point from my explanations. Thanks.)
So here the solution with spell checking is more to allow "parametrisation"
of the checking process, according to the user's taste and practice. While
this is an feasible solution for English, this is not as easy for all languages.
And certainly this is a process that does not fit well with tagging...


I have no firm idea for what should be the form of a list of languages.

But I am _sure_ that any list will lead to problems, due to the fuzziness
of the borders between languages. And while this problem is more or less
possible to deal with when it comes to the major languages with abundant
literature and standardized spelling, at the very time it narrows to lesser
used languages, problems will arise.

> Change is needed as the objects described change and as our knowledge of
> the objects change. This is no less true of several ISO standards: 10646,
> 3166,... It is especially true of 639: for example, currently if someone
> wants to tag a document containing Hopi text, they would need to use the
> tag nai "North American Indian (other)". Suppose in two years time there is
> a specific code for Hopi added to ISO 639-2; consider what happens to that
> existing data: it is now *incorrectly* tagged (not just sub-optimally
> tagged), because nai no longer includes Hopi since that now has its own
> code. Every time a new code is added to ISO 639, the meaning of some
> existing codes changes.

The problem you mentioned with the incorrect tagging of Hopi is inherent to
any persistent use of an information that uses a varying database.
If Ethnologue is merged with (or into) ISO 639, this problem won't fade away,
because the linguistic map of the planet is alive (not to mention political
pressures like what I spoke about Valencian above). So if CLN (I am sorry,
I do not know Hopi's situation, so I cannot comment on your specific example)
if CLN is split, with a special code for Valencian created, then this very
day all literature in Valencian would be *now* incorrectly tagged. Exactly
the same case as you described above. The same, except for one point: the
number of documents that might be affected...

I do not expect this problem to have any cure.


Antoine



Re: the Ethnologue

2000-09-13 Thread addison

I dunno.

The problem here is that ISO639 has, for better or worse, been adopted by
a wide array of DIFFERING applications. It's a convenience standard that
we vaguely have to live with.

One problem here is that it is being used to define BOTH languages and
locales... and the POSIX locale model in particular has to struggle
against the limitations of using a language tag (in combination with a
3166 tag) to define writing systems, cultural conventions, and other
information that doesn't follow directly from a "language" on its own. And
the "language negotiation" "feature" of the Web is based on LANGUAGE tags
that look exactly like POSIX locale codes... hmmm

I support the IDEA of registering all the language codes that linguists
need in a standard for use in language tagging---and that this
standard be ISO639/RFC1766 seems logical. But I'd also favor a
rethink of the POSIX locale model while we're at it, to rid ourselves of
this ambiguity.

Regards,

Addison

Addison P. Phillips
Globalization Engineering Consultant
Inter-Locale, LLC

On Tue, 12 Sep 2000, John Hudson wrote:

> Rick McGowan wrote:
> 
> >One of the major PROBLEMS with ISO 639, and other such lists developed by
> ISO over the years, is that they are not brought into being, or maintained,
> with the intent of being comprehensive.  They are either intended to, or do
> serve, some short-term narrow interests.
> 
> >Governments, libraries, and businesses throughout the world have needed a
> comprehensive language and locale identification system for many years.
> ISO has not provided it.  One place to start is with a comprehensive list
> of "languages" -- however you define that; and please define it at least
> with fair consistency.  The Ethnologue is a place to start.
> 
> >Can anyone point me to an existing list of languages that is more
> comprehensive and better researched than the Ethnologue?  If there is no
> such list, then we don't need to consider any alternatives, right?
> 
> I agree with everything Rick has said except his conclusions. As I
> suggested to Peter Constable after his presentation at the Unicode
> conference, the first task should _not_ be to populate any standards with
> Enthologue codes or, of that matter, any other set of codes. The first
> tasks should be to a) identify the different kinds of information that need
> to be represented by tags (spoken languages, written languages, literary
> languages (not the same thing as a written languages), particular
> orthographies, language-specific script variants, ?, ?) and then b)
> identify appropriate existing standards (if any actually exist) or develop
> new standards to contain these tags. At the same time, the scope of these
> standards should be clearly identified and rules introduced to govern the
> addition of future tags (the kind of rules that don't result in a standard
> containing codes for both individual languages and language groups).
> Without such an approach, any new standard work will be plagued with
> exactly the kind of inconsistencies that make both ISO 639 and the
> Ethnologue of dubious merit for IT purposes.
> 
> This strikes me as a much more useful direction than trying to shove new
> tags into already inconsistent standards that were originally designed for
> other purposes. It is also a lot more useful than, for instance, trying to
> forcefully align OpenType LangSys tags with ISO 639 codes, as has been
> suggested simply because the latter is a STANDARD, when it is far from
> clear that the two indicate the same kind of information.
> 
> John Hudson
> 
> Tiro TypeworksA man was meant to be doubtful about
> Vancouver, BC himself, but undoubting about the truth;
> www.tiro.com  this has been exactly reversed.
> [EMAIL PROTECTED] G.K. Chesterton
> 








Re: the Ethnologue

2000-09-13 Thread John Hudson

At 02:10 AM 9/14/2000 -0700, [EMAIL PROTECTED] wrote:

>The problem here is that ISO639 has, for better or worse, been adopted by
>a wide array of DIFFERING applications. It's a convenience standard that
>we vaguely have to live with.

No, it's an inconvenience standard that we vaguely have to live with. :)

John Hudson

Tiro Typeworks  A man was meant to be doubtful about
Vancouver, BC   himself, but undoubting about the truth;
www.tiro.comthis has been exactly reversed.
[EMAIL PROTECTED]   G.K. Chesterton



Re: the Ethnologue

2000-09-13 Thread Rick McGowan

Re the Linguasphere, Peter C wrote:

> - As Chris mentioned, the info isn't available online.

Actually, the Linguasphere is available on-line, if you pay for it... One hundred 
sixty pounds sterling (two hundred seventy-five US dollars) for a license to use the 
electronic version.

Rick



 


RE: the Ethnologue

2000-09-13 Thread Ayers, Mike


> With English, the problem with spell checking is quite 
> different, and different
> lists of words would not be as easy for a solution: the en-US 
> vs. en-GB
> tagging does not seem to adequately cover the various 
> differences such as
> -ise vs. -ize, -our vs. -or, -re vs. -er, use of shall vs. 
> will at 1st person,...
> Or more precisely, if it does, that is if "en-GB" is intended 
> to always cover
> the first case in the pairs above, then I believe it will be 
> of less use to
> people (this is as I understand things; certainly people much 
> more proficient
> with English will contradict me here; please allow for my 
> lack of knowledge in
> this field and try to extract the point from my explanations. Thanks.)
> So here the solution with spell checking is more to allow 
> "parametrisation"
> of the checking process, according to the user's taste and 
> practice. While
> this is an feasible solution for English, this is not as easy 
> for all languages.
> And certainly this is a process that does not fit well with tagging...

The en-US vs en-GB case gets mentioned a lot.  I can't speak for the
British, but I know that the "British" variants mentioned above are all
perfectly acceptable in American English - just rarely used.

What I'd really like to know is why there seems to be this
insistence on only one official list of languages when there appears to be a
clear need for two.  There appears to be interest for a comprehensive, if
imperfect, list on one hand, whereas other applications (web use, etc.) are
interested in a fully researched list like RFC1766 provides.  Why must these
be the same list?  Can't we acknowledge that it's going to take a long time
to get everything right and work from two eventually converging lists?  Just
wonderin'...


/|/|ike



RE: the Ethnologue

2000-09-13 Thread Ayers, Mike


> From: Arnt Gulbrandsen [mailto:[EMAIL PROTECTED]]

> 
> Are there valid reasons why the imperfect but comprehensive 
> needs to be a
> standard? I can see one reason for it _not_ to be a standard: 
> A list can
> be added to faster, so it's easier for a list to be truly 
> comprehensive.
> 

Yes - consistency, for starters.  Even if the requirements for
adding languages are very liberal, the procedures for organizing the list
could still be kept very tight, thus ensuring that data which relies on such
a list remains interpretable.  By using an inclusive hierarchy, the impact
of evolutionary changes could be minimized (e.g.  Hopi text is marked as
"Native American general" instead of "Native American, unclassified" so that
when the Hopi tag gets added, existing text is inexactly, but correctly,
specified).

Next is politics.  A standards body could negotiate the sometimes
sensitive political issues regarding language classifications competently -
a list could not handle the situation at all, IMHO.

Finally, there's that standard thing.  It was earlier stated that
the parties interested in a comprehensive set of language tags include
government agencies, which are typically required to use standards.


/|/|ike



RE: the Ethnologue

2000-09-13 Thread Michael Everson

Ar 09:04 -0800 2000-09-13, scríobh Ayers, Mike:
>> With English, the problem with spell checking is quite
>> different, and different
>> lists of words would not be as easy for a solution: the en-US
>> vs. en-GB
>> tagging does not seem to adequately cover the various
>> differences such as
>> -ise vs. -ize, -our vs. -or, -re vs. -er

It does not. The most common forms are:

civilize, color, center (US)
civilize, colour, centre (GB-Oxonia)
civilise, colour, centre (GB-Demotica)

Michael Everson  **  Everson Gunn Teoranta  **   http://www.egt.ie
15 Port Chaeimhghein Íochtarach; Baile Átha Cliath 2; Éire/Ireland
Vox +353 1 478 2597 ** Fax +353 1 478 2597 ** Mob +353 86 807 9169
27 Páirc an Fhéithlinn;  Baile an Bhóthair;  Co. Átha Cliath; Éire





Re: the Ethnologue

2000-09-13 Thread John Cowan

Michael Everson wrote (amplified by me):
 
> tire, civilize, color, center (US)
> tyre, civilize, colour, centre (GB-Oxonia)
> tyre, civilise, colour, centre (GB-Demotica)
  tire, civilise, colour, centre (CA)

I have seen a photograph of an actual Canadian sign saying "Tire Centre",
which in GB would be a place you go to get more tired

-- 
There is / one art   || John Cowan <[EMAIL PROTECTED]>
no more / no less|| http://www.reutershealth.com
to do / all things   || http://www.ccil.org/~cowan
with art- / lessness \\ -- Piet Hein



Re: the Ethnologue

2000-09-13 Thread Misha Wolf

> It takes a long time for data to work its way into an ISO standard.

This generalisation is unhelpful.  Consider ISO 4217, the currency code 
standard.  As soon as the Maintenance Agency (MA) has been notified by a 
competent authority (in this case, a central bank) of a legitimate 
currency code change, all subscribers are sent a fax (soon to be an email) 
informing them of the change.

For example, I have here ISO 4217 amendment 109, dated 12 September 2000, 
announcing a change to the currency of Ecuador, effective 13 September 
2000.

I'm not sure how long it takes before the ISO 4217 (public) Web site is 
updated.

Above I wrote "a legitimate currency code change", as central banks 
sometimes ask for codes which are already in use, etc.

Michael, please tell us how long it takes the ISO 639-2 MA to update the 
standard, following receipt of a legitimate request.

Misha

[This mail was written using voice recognition software]


> Perhaps
> another organization (like the Unicode Consortium) could take it upon
> itself to massage the Ethnologue langauge list and add corrections,
> deletions, and insertions; and put the new list on-line as "the most
> up-to-date information on language tags." Dialect, script information, and
> ISO 2 and 3-letter tags should also be added for each language if the list
> is redeveloped. I have my own personal list of this type, and from
> time-to-time I make revisions, so as to have "the best information on
> languages" available at my fingertips.
> 
> John F. 


-
Visit our Internet site at http://www.reuters.com

Any views expressed in this message are those of  the  individual
sender,  except  where  the sender specifically states them to be
the views of Reuters Ltd.



Re: the Ethnologue

2000-09-13 Thread Peter_Constable


On 09/13/2000 01:39:37 AM J%ORG KNAPPEN wrote:

>I once looked at the ethnologue and its subdivision of the german language
>is just ridiculous. Not small errors, a gross misconception. I don't trust
>the ethnologue in area where I don't know the fact well, since it fails in
one
>area where I know them.

I'm not a specialist on German, and so can't comment on those details. I
will say, though, that there will always be someone with this kind of
complaint on *any* enumeration of languages simply because different people
apply different operational definitions. That is one of the key issues Gary
and I discussed in our paper, and it is for that reason that we suggested
distinct namespaces that match different operational definitions. Even if
the Ethnologue fails in this spot under its own definitions (not a point
I'm conceding), it is a fallacy to conclude that the remainder is invalid.
Just because you don't trust it because you're not happy with the one piece
you have opinions on doesn't mean that it isn't useful for a lot of other
users that do want to use it for other areas. It is just as easy to point
to problems with ISO 639-x (actually easier, I think), but I'm not trying
to keep people from using that if it serves their purposes. The point isn't
to flog ISO 639-x but rather to say we need to get serious about moving
forward on providing identifiers for the thousands of other languages
people are interested in.

The 13th edn. of the Ethnologue lists a total of 15 lanuages under the
classification Indo-European/Germanic/West/Continental/High. I don't think
you'd say that *all* of these are wrong; and for the handful that may have
some problems associated, they don't eliminate the utility for people
interested in the thousands of other languages. As for the attributed
problems for that handful, please provide documented information to the
editor, and be willing to make language attitudes secondary to the
operational definitions that Ethnologue is applying. There is every desire
to make the catalog better wherever possible, and input is always accepted,
but there is need to ensure that the information can be corroborated, and
that the conclusions being claimed from the data conform to the definitions
being assumed. In the mean time, let's provide a solution to the language
identification needs of those interested in thousands of languages from
hundreds of language clusters other than High Continental West Germanic who
don't have *any* current way to tag data using identifiers that are even
close to the particular languages they are interested in.

If we want perfection, we'll never get there. We're all better off because
Unicode and ISO 10646 didn't insist on perfection. Let's not impose that
albatross on the domain of language identifiers.



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>





Re: the Ethnologue

2000-09-13 Thread Peter_Constable


On 09/13/2000 02:17:52 AM John Hudson wrote:

>The first
>tasks should be to a) identify the different kinds of information that
need
>to be represented by tags (spoken languages, written languages, literary
>languages (not the same thing as a written languages), particular
>orthographies, language-specific script variants, ?, ?) and then b)
>identify appropriate existing standards (if any actually exist) or develop
>new standards to contain these tags...

I have no problem with that, except that it would need to be done in the
right way. The point is to understand the needs of specific forms of
information processing, and to evaluate for each exactly what kinds of
distinctions are needed. In some cases, it will be language per se; for
others, it will be writing system (usually language-specific, but in some
exceptional cases may cross multiple languages), etc. The only problem is
that I suspect we're several years away from understanding all of this. In
the mean time there are people who need language identifiers for their
data. It's in the cases of the more familiar languages (many of them
European), that we may need special cases to deal with distinct notions
such as written vs. spoken vs. literary languages. But for someone dealing
with something like Ancash Quechua, this is all a big herring that is
getting in the way of providing them with the language identifier that they
need. And that is true for the majority of the 6000+ languages that don't
yet have any identifier.

We need to work toward perfection, but if we insist on perfection before we
take a first step, we'll likely never make progress; and in the mean time,
lots of users continue to go without the identifiers that they need -
identifiers that often are in no way affected by the issues for which we're
trying to find the perfect solution.



>Without such an approach, any new standard work will be plagued with
>exactly the kind of inconsistencies that make both ISO 639 and the
>Ethnologue of dubious merit for IT purposes.

I don't understand assertions that the Ethnologue is of dubious merit for
IT purposes that are often made by people without much experience working
with thousands of minority languages when the Ethnologue was created by
people who have been working with thousands of minority languages
specifically for their own IT purposes. SIL has considerable experience
using Ethnologue codes as language identifiers, and while we will
acknowledge that it isn't perfect, it has served our IT purposes very well
- FAR better than ISO 639-x currently can. It is fallacious to look at IT
issues in the context of major languages (which are already covered by
other standards, and which have some special complications due to long
histories of literary tradition and sociolinguistic change and
diversification) and extend those conclusions to the context of minority
languages. And this is about the latter. It's not about replacing en-US
with Ethnologue's "eng", since that will never happen (and it is not what
we would propose). This is about having identifiers for languages like
Cuaiquer (KWI) and hundreds of others in South America rather than having
to use sai "South American Indian (other)" for all of them; or something
for Lahu Shi (KDS) and hundreds of other languages of SE Asia and China
rather than having to use sit "Sino-Tibetan (other)" for all of them; etc.


- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>





Re: the Ethnologue

2000-09-13 Thread Peter_Constable


(Apologies for the cross-listing, but this has spanned several lists, and
there are parties on each that are not all on one and that are interested
in the discussion.)


On 09/13/2000 06:37:02 AM Michael Everson wrote:

>Ar 23:56 +0100 2000-09-12, scríobh Christopher J. Fynn:
>
>>A lot of what are listed as "languages" in the Ethnologue are what most
people
>>would call dialects. For instance almost every known dialect of spoken
Tibetan
>>is listed as a separate language in the Ethnolouge although they all
share
>>only one written form.
>
>YES. This is one of the serious problems of the list.

This is a fallacious argument against Ethnologues classification. As Gary
and I point out in our paper, there is not *one, correct* classification of
languages, but potentially many valid classifications serving different
purposes and depending upon ones operational definitions. We uphold
Ethnologues operational definitions, which use mutual non-intelligibility
as a primary factor. For some major languages that are very familiar to
people, this may not give a classification that offers what they're looking
for given their particular purposes, but this isn't about coming up with
new codes for major languages. It's about codes for the thousands of
languages that currently have nothing.



>Ar 22:39 -0800 2000-09-12, scríobh Jörg Knappen:
>
>>I once looked at the ethnologue...
>
>YES. This is one of the serious problems of the list.

I've already responded to this objection, and Michael hasn't added any new
argument.



>If SIL has 2000 real languages they need codes for in real applications,
>then those 2000 (which is a lot) should be proposed to 639 or 1766. That's
>what 639 and 1766 are for. It would be nice to know what the applications
>are.

Did you place the same requirements to demonstrate need including a
statment of what applications were anticipated on those looking for tags
for signed languages? No. You're not being reasonable here. I've already
enumerated several independent agencies looking for tags for all these
languages (the 6000+, not just 2000). It seems that you're continuing to
come back asking for more simply because you're for whatever reason not
wanting to accept that people could really be wanting all that. But you
don't have the pulse on all user needs, as is evident from this discussion.


>I do not think we should adopt all 6000 codes from the Ethnologue as
>"language tags". I am, frankly, shocked that linguists should consider
>doing so so uncritically.

Then perhaps, Michael, you'd like to go through the list and start telling
the linguists, anthropologists, governments, development agencies, etc. of
the world which of the 6000 they should feel free to ignore.



>Do you need a code for !Xóõ? May I ask what for?

For archiving linguistic data. For associating language-specific processes
to work with that data. For categorizing information about that speech
community that may be needed by government agencies interested in
education, or health maintenance, or economic development, or whatever; or
for categorizing similar information used by development or relief
agencies.



>There are (according to
>the Ethnologue) 3000-4000 speakers. According to Anthony Traill's _A !Xóõ
>Dictionary_, "!Xóõ is an unwritten language and its speakers have no
notion
>of linguistic standardization". Well, honestly, whose IT requirements are
>you going to serve? (All the characters used in the dictionary are in the
>UCS. I checked, because that *is* a real and important requirement.)

And if you're going to have a written representation of the language, how
will you apply processes like data validation (aka spell checking),
morphological analysis, etc. if you don't have a way to tag the
language-specific resources needed?


>Do you really need 8 codes for "the German languages"?

It's not German I'm concerned about.


>How many "Tibetans" are there?

Ethnologue lists only one *language* called Tibetan (TIC). It lists 36
languages from the Tibetan family, and if you're interested in (say) Lhomi
(LHM) or Jirel (JUL) or Ladakhi (LBJ), then "Tibetan" doesn't meet your
needs.


>Is Samvedi a language? It "shares many features with Gujarati.
>Survey needed". How many times does "survey needed" appear in the
>Etnologue? How many of them aren't really languages (in the sense that we
>need to implement for IT and libraries (which use IT by the way)) but only
>preliminary studies? HOW DO WE KNOW?

Ethnologue indicates which ones still require more research. Experience has
shown that in most cases these are not simply dialects of existing
languages but are distinct languages in their own right. At any given point
in time, the E

Re: the Ethnologue

2000-09-13 Thread Peter_Constable


On 09/13/2000 10:25:21 AM Antoine Leca wrote:

>While I agree with you, there are anyway problems with the way languages
>are distinguished...

Some comments in response:

- This is not primarily about major languages. They generally already have
the identifiers they need. In addition, because of their history of
literary tradition together with subsequent sociolinguistic change and
diversification, they present complications that are not the norm when
considered in relation to thousands of lesser known languages. The aim of
adding thousands of new language identifiers to some standard system is
focused on the thousands of languages that currently have nothing, not to
replace what is already there for the few hundred that are already covered.

- There is no question that some processes require distinctions based on
one or another type of *paralinguistic* notion, such as writing system or
orthographic convention. My guess is that that these distinctions are most
different from a simple enumeration of languages (based on a given
operational definition) exactly in the cases mentioned above. Further
understanding is needed of what processes depend upon distinctions based on
which paralinguistic notions, but that is likely to take quite some time
yet. In the mean time, the needs of those interested in the thousands of
lesser-known languages that have *nothing* in the way of identifiers
shouldn't be neglected. We can improve our systems as we understand the
needs of different processes better. When we get to that point, it is
likely that a comprehensive enumeration of languages will be much more of
an assistance rather than a hindrance.


>I have no firm idea for what should be the form of a list of languages.
>
>But I am _sure_ that any list will lead to problems, due to the fuzziness
>of the borders between languages.

That is precisely because there is no *one, perfect* enumeration of
languages since alternate categorizations based on different operational
definitions may be valid for different purposes. (All points that Gary and
I have made in our paper.) The challenge then is to find a way to provide
different users with differing purposes solutions that suit their purposes.
Our suggestion of alternate namespaces of identifiers permits exactly this.


>And while this problem is more or less
>possible to deal with when it comes to the major languages with abundant
>literature and standardized spelling, at the very time it narrows to
lesser
>used languages, problems will arise.

Actually, in some respects it is major languages that create some
complications that don't apply to lesser-known languages. (Thus some of
your comments.) On the other hand, it is not clear that an attempt to adopt
a comprehensive enumeration of languages will lead to many more problems.
There will *always* be somebody who says they need something different. On
the other hand, if we use the Ethnologue to add coverage for lesser-known
languages to existing systems, many users interested in modern languages
will feel they are a lot closer to what they need. (Those interested in
ancient languages will not have their needs met, but that is beyond SIL's
expertise.) There will still be occasional dissatisfaction, but not the
wholesale frustration that currently exists.



>The problem you mentioned with the incorrect tagging of Hopi is inherent
to
>any persistent use of an information that uses a varying database.
>If Ethnologue is merged with (or into) ISO 639, this problem won't fade
away,
>because the linguistic map of the planet is alive (not to mention
political
>pressures like what I spoke about Valencian above). So if CLN (I am sorry,
>I do not know Hopi's situation, so I cannot comment on your specific
example)
>if CLN is split, with a special code for Valencian created, then this very
>day all literature in Valencian would be *now* incorrectly tagged. Exactly
>the same case as you described above. The same, except for one point: the
>number of documents that might be affected...


This is precisely my point: people object to the Ethnologue because the
information is incomplete and therefore subject to change, but they assume
that ISO 639 is free of criticism in this regard. That is not true, since
ISO 639 is subject to the same problems. In fact, there is much less of a
problem if a comprehensive list of identifiers based on the Ethnologue were
available for two reasons:

1. The Ethnologue will record change history, and any changes would be from
one *known* quantity to another. Hypothetical example: the data is tagged
as "Lahu Shi", but now we know that, 3 years after the data was created, it
was learned that this corresponds to two distinct languages. The data
become sub-optimally tagged, not completely incorrectly tagged.
Furthermore, even though we may not know precisely how the data should be
tagged based on the new knowledge, we a

Re: the Ethnologue

2000-09-13 Thread Peter_Constable


On 09/13/2000 11:59:01 AM Rick McGowan wrote:

>Re the Linguasphere, Peter C wrote:
>
>> - As Chris mentioned, the info isn't available online.
>
>Actually, the Linguasphere is available on-line, if you pay for it... One
>hundred sixty pounds sterling (two hundred seventy-five US dollars) for a
>license to use the electronic version.

Sorry: isn't *freely* available, and therefore accessible to casual users,
to users in developing nations, ...


- Peter




RE: the Ethnologue

2000-09-13 Thread Peter_Constable


On 09/13/2000 12:04:24 PM "Ayers, Mike" wrote:

>What I'd really like to know is why there seems to be this
>insistence on only one official list of languages when there appears to be
a
>clear need for two.  There appears to be interest for a comprehensive, if
>imperfect, list on one hand, whereas other applications (web use, etc.)
are
>interested in a fully researched list like RFC1766 provides.  Why must
these
>be the same list?  Can't we acknowledge that it's going to take a long
time
>to get everything right and work from two eventually converging lists?
Just
>wonderin'...

I have no problem with that whatsoever. Creating an alternate namespace
mechanism with Ethnologue codes in a separate namespace seems to offer
exactly what you describe.



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>





Re: the Ethnologue

2000-09-14 Thread J%ORG KNAPPEN

What really makes me wonder, is that the ethnologue seems to ignore the 
vast amount of published information on the german language and its dialects.
There is more than a century of dialetological research on german, and there
are easy accessible publications showing the major and minor subdivisions
of the german language.

The ethnologue gives a very strange picture there, compared to the mainstream
german literature. Maybe, because german dialectologists prefer to publish
in german?

--J"org Knappen

P.S. For fans of the german language, I recommend: 

Werner König, DTV-Atlas zur deutschen Sprache, DTV München, 10th printing 1994,
ISBN 3-423-03025-9

Make sure to get the 10th printing or a latter version, it contains more 
fascinating material. 




Re: the Ethnologue

2000-09-14 Thread Antoine Leca

Peter Constable wrote:
> 
> On 09/13/2000 10:25:21 AM Antoine Leca wrote:
> 
> >While I agree with you, there are anyway problems with the way languages
> >are distinguished...
> 
> Some comments in response:
> 
> - This is not primarily about major languages.

I believe I was not clear enough.

Do you consider Valencian to be a major language?
If yes, why do Ethnologue negate it a different code?
If no, then I was pointing out that even major language such as Catalan
may lead to problem of subclassification. I analyse, with my deficient
knwoledge of German, that the problems that Jörg pointed out with German,
is that he thought Ethnologue went too far in that latter case.

I do not expect, not in fact want, an actual answer. But as I said,
"while I agree with you", anyway we had, have, and shall have, problems
with the tagging of languages; it is vain to expect even solutions
(and certainly not perfection, as you pointed out) in this field.
We have to live with imperfection, misinformation, fuzziness, etc.


> The aim of adding thousands of new language identifiers to some standard
> system is focused on the thousands of languages that currently have
> nothing, not to replace what is already there for the few hundred that
> are already covered.

First, this is a point that was not clear enough for me on the first time.
Perhaps the fact that I did not see any actual list of the potential
languages to be added is a problem here.

Then, my point about Valencian was to highlight that some languages
can be claimed, for political reasons, new codes to be embeeded in a
list of 2,000 new codes, thus leading to later problems of mis-tagging
for the IT industry. An obvious example from Ethnologue is the case
for the various dialects of the Oc language (I do not know if there are
considered for addition or not; but I know quite well what are the
politicals positions in this case, and I only see worms here, and
certainly no solutions to real problems).

 
> We can improve our systems as we understand the
> needs of different processes better. When we get to that point, it is
> likely that a comprehensive enumeration of languages will be much more of
> an assistance rather than a hindrance.

This is where I cannot agree with you.


 
> (All points that Gary and I have made in our paper.)

I am sorry, I was not able to assist at your conference for an annoying
problem of distance... and unfortunately, the paper is not yet online;
So certainly I am misunderstanding some of your points.

I want to apologize about that.


> >And while this problem is more or less
> >possible to deal with when it comes to the major languages with abundant
> >literature and standardized spelling, at the very time it narrows to
> >lesser used languages, problems will arise.
> 
> Actually, in some respects it is major languages that create some
> complications that don't apply to lesser-known languages.

Good point. So I stand corrected here.

> On the other hand, it is not clear that an attempt to adopt
> a comprehensive enumeration of languages will lead to many more problems.

Certainly it will.
It will certainly not solve the problems with the major languages, since it
does not attempt to improve the situation here (and fragmenting some "languages"
such as Serbo-Croatian, Occitan, German or Catalan is not likely to improve
the situation, IMHO).
And about lesser-used languages, while it will recognise some current
practices, it will also introduce some new problems with all others systems
that should now deal with all these new codes (an obvious example is the UI
to tag something: at the moment, often a list with all the code from ISO-639-1
is presented).

Please note that I am *not* implying that this should preventing us to make
that move. I certainly do not want to sustain Michael's position.
But saying that is a cure without any harmful effect to much too strong
according to my taste.

> In fact, there is much less of a problem if a comprehensive list of
> identifiers based on the Ethnologue were available for two reasons:
> 
> 1. The Ethnologue will record change history, and any changes would be from
> one *known* quantity to another.

I am not that sure, because the rules for tagging are not that fixed.

It is obvious that a list with 2,000 codes is better than one with 450.
There is more information. And it will be better with a list of 30,000 codes.

So if you are going to introduce "Lahu Shi" in place of "Sino-Tibetan (Other)",
you certainly increase the precision. Then, if in 3 years from now, there
is another subdivision, then information will again increase. I do not see
where there is a gap in the process here.

Certainly, the point you are making is that the codes should *never* lost
a part of their meaning: either they should stay as is, or they be _replaced_
b

Re: the Ethnologue

2000-09-14 Thread Peter_Constable


On 09/14/2000 04:59:55 AM J%ORG KNAPPEN wrote:

>What really makes me wonder, is that the ethnologue seems to ignore the
>vast amount of published information on the german language and its
dialects.
>There is more than a century of dialetological research on german, and
there
>are easy accessible publications showing the major and minor subdivisions
>of the german language.
>
>The ethnologue gives a very strange picture there, compared to the
mainstream
>german literature. Maybe, because german dialectologists prefer to publish
>in german?

I can't comment with any confidence on what the basis for the Ethnologue's
assessment of Germanic. One possibility is that the research effort has
focused a lot more on the thousands of languages of the Americas, Africa,
Asia, Australia and the Pacific that are less well known (and generally
without any close relative that counts as a major language). As mentioned
earlier, the need for a comprehensive list of language identifiers pertains
more to that large set of languages than it does to a handful of Germanic
languages.

Also, I do encourage you to submit your comments to the Ethnologue editor,
or use the feedback mechanism on the Ethnologue site
(http://www.sil.org/ethnologue/feedback.html).



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>





Re: the Ethnologue

2000-09-14 Thread Peter_Constable


Antoine:

You have made a number of points, and I won't take time to respond to all
of them since it seems to me that we are largely on the same page. Indeed,
adding a comprehensive list of identifiers will not solve all problems;
indeed, problems will forever remain with us precisely because languages
are fuzzy categories that are constantly evolving, and can be categorized
differently according to ones purposes. (You mentioned political issues;
this falls under the discussion of alternate operational definitions of
language serving different purposes, which Gary Simons discuss in our
paper.)

I am sorry if I missed your point on Valencian. I must admit I didn't read
it through carefully because (a) I'm not that familiar with the speech
varieties in question, and (b) I had a very full in-box on this topic to
respond to yesterday.

I'm also sorry that I don't yet have the paper and slides posted where
everybody who wasn't at IUC17 can get at them. It is among the items on the
top of the priority list.



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>





Re: the Ethnologue

2000-09-14 Thread Roozbeh Pournader



On Wed, 13 Sep 2000, Michael Everson wrote:

> It names Hancock 1990 as the source of this (impossibly incorrect)
> information. In the bibliography there is no Hancock 1990.

Just like The Unicode Standard Version 3.0, page 317, which names
ISIRI 3342 as a source for ZWJ and ZWNJ, but there's no ISIRI 3342 in the
References. ;)

--roozbeh





Re: the Ethnologue

2000-09-14 Thread John Cowan

[EMAIL PROTECTED] wrote:

> I am sorry if I missed your point on Valencian. I must admit I didn't read
> it through carefully because (a) I'm not that familiar with the speech
> varieties in question, and (b) I had a very full in-box on this topic to
> respond to yesterday.

In a nutshell:  The Ethnologue treats Valencian as a dialect of Catalan, which
is correct based on the mutual intelligibility criterion, but they have distinct
orthographies.  Unfortunately, the two are in the same country, so the 3166
trick (en-us vs. en-gb, e.g.) doesn't work.  (If Valenciana has a 3166-2
regional code, we could do something there, perhaps.)

-- 
There is / one art   || John Cowan <[EMAIL PROTECTED]>
no more / no less|| http://www.reutershealth.com
to do / all things   || http://www.ccil.org/~cowan
with art- / lessness \\ -- Piet Hein



Re: the Ethnologue

2000-09-14 Thread Peter_Constable


On 09/14/2000 10:29:52 AM John Cowan wrote:

>In a nutshell:  The Ethnologue treats Valencian as a dialect of Catalan,
which
>is correct based on the mutual intelligibility criterion, but they have
distinct
>orthographies.  Unfortunately, the two are in the same country, so the
3166
>trick (en-us vs. en-gb, e.g.) doesn't work.  (If Valenciana has a 3166-2
>regional code, we could do something there, perhaps.)

Thanks for the clarification. Orthography is a paralinguistic notion that
certain IT processes need to be sensitive to, but that Ethnologue does not
attempt to record. I have said that there is a need for further research to
establish what paralinguistic notions are needed by what processes and what
would be appropriate enumerations for these other notions, and that
Ethnologue codes would not solve that problem. I have also argued that such
enumerations would build off a comprehensive enumeration of languages, not
be a departure into a completely different direction, and that we should
proceed with adopting such a comprehensive list of language identifiers
right away. There's no question that further work will be needed.



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>





RE: the Ethnologue

2000-09-14 Thread Timothy Partridge

Peter Constable said:

> On 09/13/2000 12:04:24 PM "Ayers, Mike" wrote:

> >What I'd really like to know is why there seems to be this > 
>insistence on only one official list of languages when there appears to be a
> >clear need for two.  There appears to be interest for a comprehensive, if 
>imperfect, list on one hand, whereas other applications (web use, etc.) are
> >interested in a fully researched list like RFC1766 provides.  Why must 
these
> >be the same list?  Can't we acknowledge that it's going to take a long time
> >to get everything right and work from two eventually converging lists? Just
> >wonderin'...

> I have no problem with that whatsoever. Creating an alternate
> namespace mechanism with Ethnologue codes in a separate
> namespace seems to offer exactly what you describe.

I'm wary of having two competing namespaces. As an alternative,
I'd like to suggest something on the lines of en-cockney.
Why not have iso-e-ethnologue as tags? This would be especially
useful where there was just a miscellaneous ISO code.

Applications could choose to parse just the ISO bit, or go for
the full details. When extra languages are added to ISO, the
tags would become out of date, but it would be relatively
easier to identify which of the old tags needed updating.

One potential snag is choosing which ISO tag would prefix a
given Ethnologue tag. Perhaps SIL could give definitive
opinions to avoid user divergence.

 Tim





Re: the Ethnologue

2000-09-16 Thread Michael Everson

Ar 12:04 -0800 2000-09-13, scríobh [EMAIL PROTECTED]:
>In
>the mean time there are people who need language identifiers for their
>data. It's in the cases of the more familiar languages (many of them
>European), that we may need special cases to deal with distinct notions
>such as written vs. spoken vs. literary languages. But for someone dealing
>with something like Ancash Quechua, this is all a big herring that is
>getting in the way of providing them with the language identifier that they
>need. And that is true for the majority of the 6000+ languages that don't
>yet have any identifier.

The Ethnologue lists six different Ancash Quechua, five different Huánaco
Quechuas, and a lot of other Quechuas besides. It's got five kinds of
Italian. How do we evaluate this? And I don't know how many Zapotecos,
there are too many to count. Do we just accept that it's all been evaluated?

Well, then we find errors, and we point them out. And we say, that's why
we're worried about this database. But Peter says that's not good enough,
it's only "anecdotal", and indeed the burden is placed on us to improve the
Ethnologue by filing reports.

I've got Meillet and Cohen's 1924 _Les langues du monde_ here on my desk in
front of me. Like the Ethnologue, it deals with the languages of the world.
It has big lists in it. Would I accept those uncritically either? No.

>This is about having identifiers for languages like
>Cuaiquer (KWI) and hundreds of others in South America rather than having
>to use sai "South American Indian (other)" for all of them; or something
>for Lahu Shi (KDS) and hundreds of other languages of SE Asia and China
>rather than having to use sit "Sino-Tibetan (other)" for all of them; etc.

I agree, these "(other)" categories are unsatisfactory, and I think if I
had been involved with the early drafting of 639-2 I would have complained
rather loudly about it. Sure the Dewey or the LC _cataloguing_ identifier
systems need such groupings (as they do "Romance languages" and "Slavic
languages" but language identification of bibliographical item is a
different thing.

>Perhaps there is a perception that ISO is unresponsive
>leading people not to make their requests. Perhaps the Maintenance Agency
>*is*, in fact, unresponsive.

The MA has revised its working procedures in February and they seem to work
OK. There are voting procedures and consideration procedures.

>That's how you've been coming across in these
>discussions:

I don't represent the 639 Maintenance Agency, though I am the RFC 1766
language tag reviewer.

>rather than saying, "I recognise the need, but have some
>concerns about some details, so lets investigate how we can find the best
>all-around solutions," your response has been, "I am not interested in
>considering the list of languages enumerated in the Ethnologue."

I recognize the need for more languages. My concern with the Ethnologue is
with its classification. I didn't say that I wasn't interesting in
considering what is in the Ethnologue. I said that adopting it uncritically
could be a mistake, especially as it is a work in progress and if we were
suddenly to adopt 6000 tags then we'd be stuck with them forever. You know
how much a fuss there was just because the code for Yiddish was changed
from ji to yi? Well how much fuss is there going to be if we find out that
Upper Kinauri and Lower Kinauri shouldn't really have been given two
different codes? Because we DON'T want to change codes once they have been
used in an RFC 1766 context.

Therefore I am wary of such a huge list. Do you really find this so
unreasonable? I'm not the only one who has expressed this concern.

Michael Everson  **  Everson Gunn Teoranta  **   http://www.egt.ie
15 Port Chaeimhghein Íochtarach; Baile Átha Cliath 2; Éire/Ireland
Vox +353 1 478 2597 ** Fax +353 1 478 2597 ** Mob +353 86 807 9169
27 Páirc an Fhéithlinn;  Baile an Bhóthair;  Co. Átha Cliath; Éire





Re: the Ethnologue

2000-09-20 Thread Peter_Constable


On 09/16/2000 12:56:31 PM Doug Ewell wrote:

>Here's another thing about the Ethnologue list that has been almost,
>but not quite, addressed. Just so everyone knows, the point here is
>*NOT* that the six or seven thousand additional languages in Ethnologue
>are somehow not worthy of encoding, but that the list is incompletely
>edited and not ready to be enshrined as an international standard or
>as the basis for one.
>
>I downloaded the tab-delimited list (langcodes.tdf) from the SIL FTP
>site and discovered that some abbreviations were duplicated...

Doug, I'm afraid that your assessment is based on a misunderstanding of the
way the information you were seeing was organised. John Cowan touched on
this; I'll explain more fully. First, though, we do acknowledge a fault on
our part for allowing that data to be available without documenting how it
is organised.

The records in the text file you looked at are language-countries. It is
important to understand that the categorization is not reflected by the
records in that file, but by the three-letter codes. The reason for codes
being duplicated is because the languages in question are spoken in more
than one country.

The Ethnologue has, in the past, been maintained in a textual, flat-file
database. It was organised by language-countries to accommodate the
organisation of the published versions, in which the data are presented by
country then by language. A flat-file database was originally used because
the database dates back to before the advent of relational databases. Work
has begun to get the data into a relational structure. Once that is done,
it will be possible to view the data in other ways, including directly by
language.


>I looked
>further and found 614 duplicate cases where the language code and
>primary name were identical, but the list of alternate names differed.

This probably reflects that alternate names are different from one country
to another.


>But it gets worse.  When I stripped out the alternate-names field and
>again checked for duplicated codes, I found 14 (AVL AYL CAG CTO FUV GAX
>GSC GSW JUP MHI MHM MKJ SHU SRC).  Some of these duplicates differ only
>in spelling (CAG 'Chulupi' vs. 'Chulupí')

Spelling differences are indeed an unfortunate example of inconsistency in
the data, and it exists exactly because a non-relational database has been
used. This will be cleaned up.


> but other differences are a
>lot more troubling.  For example, SHU is both 'Arabic, Chadian Spoken'
>and 'Arabic, Shuwa.'  As a non-expert in Arabic, how do I know these
>two names describe the same dialect of Arabic?  (These are certainly
>dialects, not discrete languages.)

The intention is that you can tell that these are considered the same
language because they have the same three-letter code. It is not the name
that indicates the categorization, but the codes. The reason for
encountering two different records with the same code but different names
is that different names are considered the default or preferred form in
each country. Again, once the data has been re-organised relationally, it
will be possible to show that there is a single language, that it is spoken
in different countries, that there are various alternate names used, and
that certain names are associated with or are preferred in certain
countries.



>MKJ is the Ethnologue code for both 'Macedonian' and 'Slavic'.
>Absolutely *everyone* knows there is no one 'Slavic' language; the name
>refers to an entire language family.  This is much more imprecise than
>any of the despised 'Other' codes in ISO 639.

As, I think, Michael Everson pointed out, "Slavic" is presented as one
alternate name that is sometimes used. The Ethnologue is *not* trying to
suggest that MKJ is all Slavic languages. Again, the view into the data has
unfortunately been misleading for you.



>SRC is the code for 'Bosnian', 'Croatian', and 'Serbo-Croatian', which
>means that there is a many-to-one mapping from ISO 639-1 'bs', 'hr',
>'sr' to Ethnologue 'SRC'.  This is likely to cause much more widespread
>trouble than the Hopi example mentioned earlier.

This is exactly an example of what Gary and I have argued: different
categorizations based on different operational definitions for different
purposes, each of which may be valid. The reason that the Ethnologue has
only one category where ISO 639-x has multiple categories is that the two
categorizations are based on different definitions for different purposes.
Ethnologue has only one because no evidence has been provided to indicate
that there are distinct, mutually non-intelligible speech varieties. That's
the primary basis of categorization.

This is not a problem at all. For applications that require

Re: the Ethnologue

2000-09-21 Thread Doug Ewell

Hi Peter,

> The records in the text file you looked at are language-countries. It
> is important to understand that the categorization is not reflected
> by the records in that file, but by the three-letter codes. The
> reason for codes being duplicated is because the languages in
> question are spoken in more than one country.

I definitely would not have guessed that.  There generally are no
country indicators for most languages (creoles and pidgins being a
noteworthy exception), and while it is possible that no languages are
spoken in as many countries as are English and Spanish -- each with
considerable country-specific differences -- there are only a few
separate entries for those two (cf. Mixteco and Zapoteco).

(Yes, I know the Ethnologue's emphasis is on categorizing and 
documenting minority languages.  I know one of the main criticisms of
ISO 639 is that it provides support only for relatively major languages
at the expense of minority languages, but it is possible to err in the
other direction as well.)

> A flat-file database was originally used because the database dates
> back to before the advent of relational databases. Work has begun to
> get the data into a relational structure. Once that is done, it will
> be possible to view the data in other ways, including directly by
> language.

That will certainly make it easier for non-SILers like me to figure out
what is intended, and will reduce misunderstandings.

> There is no, single right way to "tile the plane".
(repeated several times in different messages)

Agreed.  This is a refreshing departure from the position I perceived
earlier, that ISO 639 was severely broken and the Ethnologue approach
was inherently superior.  The truth, of course, is that each approach
has its advantages and drawbacks for language tagging.  639 needs more
codes (and we know the MA's are working on this), and Ethnologue needs,
if not fixing, at least clarifying.

> A universally "politically correct" name in every case is insoluable.
> Simply picking on as a default *for the purposes of implementation of
> the system of identifiers* is reasonable, and is a problem we have to
> be able to solve if we are going to present a view of the data that
> is organised first by language - at the least, you have to list one
> name first. This is certainly going to happen.

That is all I was asking for.  I apologize if it sounded otherwise.

> Ethnologue can supplement ISO codes, but we're not suggesting simply
> adding all the Ethnologue codes to the same namespace. That would not
> work. On the other hand, "i-sil-xxx" would. It is also necessary to
> ensure that, if the category denoted by an instance of "i-sil-xxx"
> matches that of some ISO code, then only the ISO code should be used.
> To deal with this, a mapping between ISO and Ethnologue is needed,
> and that is being worked on.

That is a real solution, one that builds on ISO 639 instead of bashing
it.

Thanks,

-Doug Ewell
 Fullerton, California



Re: the Ethnologue

2000-09-21 Thread Peter_Constable


[Apologies if you already got this. It seems to be bouncing, and so am
sending it again.]



On 09/21/2000 10:52:22 AM Doug Ewell wrote:

[snip]

>Agreed.  This is a refreshing departure from the position I perceived
>earlier, that ISO 639 was severely broken and the Ethnologue approach
>was inherently superior.

[snip]

>That is a real solution, one that builds on ISO 639 instead of bashing
>it.

It has never been the intent to merely bash ISO 639 or to suggest that the
Ethnologue simply replace it. ISO 639 does have at least one serious
problem that I think needs to be solved - the problem of inadequate
documentation. There are some other issues that don't necessarily represent
problems that must be solved in order for it to be useful, but that do
point to that standard having some inherent limitations. But we readily
acknowledge that Ethnologue also has some limitations. Part of what we've
been saying is that no one effort can come up with a list of identifiers
that meets every need.

The main intent, then, is to work toward an overall solution to problems.
ISO 639 has to be part of the solution; we're suggesting as a particular
proposal that Ethnologue can also make some very valuable contributions to
the solution (including helping to solve ISO 639's documentation problem).
One particular way to do this relates to the other aspect to what we're
suggesting: a proposal that RFC1766 support additional namespaces (or
"sub-namespaces" - perhaps that's the better way to describe it).

At any rate, I'm glad to know that you think there may be promise in at
least some of what we're suggesting.


>Thanks,

And thank you for the constructive interaction!



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>




Ethnologue 14 online

2001-07-24 Thread Peter_Constable

After considerable and unfortunate delay, the new Ethnologue site,
including the online version of the 14th Edition, is at last available to
the public: http://www.ethnologue.com/home.asp. There are still refinements
being made, but all the basics are there and working.


- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>

  




Re: (iso639.186) the Ethnologue

2000-09-12 Thread Peter_Constable


On 09/12/2000 12:18:37 PM Michael Everson wrote:

>I thnk there are codes given to entities in the Ethnologue list that
aren't
>languages in the sense that we need to identify languages in IT and in
>Bibliography (which is what the codes are for).

Perhaps there is a cat that needs to be let out of the bag here. ISO 639
codes were primarily intended for bibliography purposes. Gary and I point
out in our paper that the needs of that sector do not necessarily
correspond to the general needs of IT, particularly for language-specific
processing. A tag that denotes a group of languages serves no useful
purpose for most language-specific processes. For example, if all you know
about the language of some information object is that it is an Athapascan
language, you can't spell-check that information. The intro to ISO 639
claims that the standard is intending to serve the needs of a variety of
sectors, but in its current state it is failing to adequately serve some.
We're not arguing that it is of no use, but it is an open question as to
whether bibliographic codes were the best starting point for general IT
use. Regardless, we have them, and they are already in use. The important
question then is how to move forward to find something that will serve all
sectors of IT.

Furthermore, we would contend that the categories enumerated in the
Ethnologue by-and-large *are* the categories that need to be identified for
general IT purposes. In the majority of cases, the distinctions made are
those that would be needed to successfully spell-check, for example. (We
acknowledge that that is not true in all cases; for example, Chinese
spelling would cross multiple languages; and alternate English spellings
are needed for what would generally be considered one language. But these
are the exceptions, not the norm.)


>I think that it is not
>mature for International Standardization. It is a work in progress,
subject
>to change. As such it is a living document.

Change is needed as the objects described change and as our knowledge of
the objects change. This is no less true of several ISO standards: 10646,
3166,... It is especially true of 639: for example, currently if someone
wants to tag a document containing Hopi text, they would need to use the
tag nai "North American Indian (other)". Suppose in two years time there is
a specific code for Hopi added to ISO 639-2; consider what happens to that
existing data: it is now *incorrectly* tagged (not just sub-optimally
tagged), because nai no longer includes Hopi since that now has its own
code. Every time a new code is added to ISO 639, the meaning of some
existing codes changes. That is at least as serious a concern that a person
would likely encounter with any changes to the Ethnologue, and it is
probably more serious. Please don't assume that carefulness in defining ISO
639 will avoid problems. It already has inescapable problems. We need to
understand those problems and learn to manage them, and that will be made
rather easier if we quickly expand to include a comprehensive enumeration
of modern languages. Yes, that will not solve all problems, but it will be
a beneficial move forward.



>I don't see what the hurry is. Make a list of 100 languages that you
*need*
>codes for urgently. Make a list of another 100 after that. Encode
languages
>that you *really* need codes for. That's what I mean by saying "just
>because it's in the list doesn't mean it should get a code".

Considering only those languages in which we have been involved, SIL has an
immediate need for a couple of thousand codes. But we know that many others
have similar large-scale needs that collectively include the entire
Ethnologue list. There are *lots* of people asking for this, not just me,
not just SIL.



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>





[OT] Re: the Ethnologue

2000-09-16 Thread Doug Ewell

Here's another thing about the Ethnologue list that has been almost,
but not quite, addressed.  Just so everyone knows, the point here is
*NOT* that the six or seven thousand additional languages in Ethnologue
are somehow not worthy of encoding, but that the list is incompletely
edited and not ready to be enshrined as an international standard or
as the basis for one.

I downloaded the tab-delimited list (langcodes.tdf) from the SIL FTP
site and discovered that some abbreviations were duplicated.  I looked
further and found 614 duplicate cases where the language code and
primary name were identical, but the list of alternate names differed.
OK, I thought, I can see that; the list of alternate names was too long
for one line, so they made two lines and split the alternates between
them.  Fair enough.  (It's not quite that clean, but you get the idea.)

But it gets worse.  When I stripped out the alternate-names field and
again checked for duplicated codes, I found 14 (AVL AYL CAG CTO FUV GAX
GSC GSW JUP MHI MHM MKJ SHU SRC).  Some of these duplicates differ only
in spelling (CAG 'Chulupi' vs. 'Chulupí') but other differences are a
lot more troubling.  For example, SHU is both 'Arabic, Chadian Spoken'
and 'Arabic, Shuwa.'  As a non-expert in Arabic, how do I know these
two names describe the same dialect of Arabic?  (These are certainly
dialects, not discrete languages.)

MKJ is the Ethnologue code for both 'Macedonian' and 'Slavic'.
Absolutely *everyone* knows there is no one 'Slavic' language; the name
refers to an entire language family.  This is much more imprecise than
any of the despised 'Other' codes in ISO 639.

SRC is the code for 'Bosnian', 'Croatian', and 'Serbo-Croatian', which
means that there is a many-to-one mapping from ISO 639-1 'bs', 'hr',
'sr' to Ethnologue 'SRC'.  This is likely to cause much more widespread
trouble than the Hopi example mentioned earlier.

Certainly more codes need to be added to ISO 639, and the Maintenance
Agency needs to be sure not to present an image of unresponsiveness
(if in fact they have been guilty of that in the past).  However, they
have their own, existing guidelines for the level at which languages
should be encoded (one written vs. 60 spoken variants) and this must
be respected.  And the duplicated codes in the Ethnologue list must be
edited down to one code each, or the list will not earn the respect for
accuracy that it perhaps deserves.

-Doug Ewell
 Fullerton, California



[OT] Re: the Ethnologue

2000-09-16 Thread Michael Everson

Ar 08:46 -0800 2000-09-16, scríobh Doug Ewell:
>Here's another thing about the Ethnologue list that has been almost,
>but not quite, addressed.  Just so everyone knows, the point here is
>*NOT* that the six or seven thousand additional languages in Ethnologue
>are somehow not worthy of encoding, but that the list is incompletely
>edited and not ready to be enshrined as an international standard or
>as the basis for one.

That is why I used ISO-speak and said that it's not "mature" for
standardization.

>MKJ is the Ethnologue code for both 'Macedonian' and 'Slavic'.
>Absolutely *everyone* knows there is no one 'Slavic' language; the name
>refers to an entire language family.  This is much more imprecise than
>any of the despised 'Other' codes in ISO 639.

In fairness to the Ethnologue's editor, it is possible that MKJ refers to
"Slavic Macedonian" which is a euphemism that arose because Greeks were
complaining that "Macedonian" could be Greek.

>SRC is the code for 'Bosnian', 'Croatian', and 'Serbo-Croatian', which
>means that there is a many-to-one mapping from ISO 639-1 'bs', 'hr',
>'sr' to Ethnologue 'SRC'.  This is likely to cause much more widespread
>trouble than the Hopi example mentioned earlier.

Ick. This has caused problems in ISO 639 as well.

>Certainly more codes need to be added to ISO 639, and the Maintenance
>Agency needs to be sure not to present an image of unresponsiveness
>(if in fact they have been guilty of that in the past).

There were problems in the past. Things are better now, it seems. At least
there are better procedures for doing the work.

Michael Everson  **  Everson Gunn Teoranta  **   http://www.egt.ie
15 Port Chaeimhghein Íochtarach; Baile Átha Cliath 2; Éire/Ireland
Vox +353 1 478 2597 ** Fax +353 1 478 2597 ** Mob +353 86 807 9169
27 Páirc an Fhéithlinn;  Baile an Bhóthair;  Co. Átha Cliath; Éire





[OT] Re: the Ethnologue

2000-09-16 Thread Doug Ewell

Michael Everson <[EMAIL PROTECTED]> wrote:

>> MKJ is the Ethnologue code for both 'Macedonian' and 'Slavic'.
>> Absolutely *everyone* knows there is no one 'Slavic' language; the
>> name refers to an entire language family.  This is much more
>> imprecise than any of the despised 'Other' codes in ISO 639.
>
> In fairness to the Ethnologue's editor, it is possible that MKJ
> refers to "Slavic Macedonian" which is a euphemism that arose because
> Greeks were complaining that "Macedonian" could be Greek.

Very possible indeed.  I am aware of the sensitive political issues
surrounding the names of languages, countries, scripts, etc.  (Last
night in Sydney, the Taiwanese Olympic team was introduced as "Chinese
Taipei"; when did that name come into use?)

All I am asking in this particular case is for the Ethnologue editor to
assign *one* primary name (and spelling) to each three-letter language
code, and to relegate the other names to alternate status in a
consistent way.  That is the first necessary step, although maybe not
the last, in moving the Ethnologue coding system closer to "maturity."

-Doug Ewell
 Fullerton, California



Re: (iso639.193) the Ethnologue

2000-09-19 Thread Peter_Constable


I've got the revisions to the revisions on the paper sitting on Gary's desk
(was hoping we'd get this online today, but the day's getting old, so
tomorrow is looking more likely). So, I'll return to this discussion and
try to respond to some of the weekend's flurry of messages.


On 09/16/2000 08:21:04 AM Michael Everson wrote:

[snip]

>The Ethnologue lists six different Ancash Quechua, five different Huánaco
>Quechuas, and a lot of other Quechuas besides. It's got five kinds of
>Italian. How do we evaluate this? And I don't know how many Zapotecos,
>there are too many to count. Do we just accept that it's all been
evaluated?
>
>Well, then we find errors, and we point them out. And we say, that's why
>we're worried about this database. But Peter says that's not good enough,
>it's only "anecdotal", and indeed the burden is placed on us to improve
the
>Ethnologue by filing reports.

What I mean here, Michael is this: in the first paragraph above, you
haven't demonstrated that problems exist; you've merely implied that
problems exist based on the assumption that there shouldn't be more than
one Ancash Quechua, etc. This is the kind of thing I'm referring to as
anecdotal: "it's wrong because I don't agree with it".

There is a reason why six different Ancash Quechuas, etc. are listed:
research has indicated that there are that many related but distinct,
mutually non-intelligible, speech varieties there are that have made use of
the name "Ancash Quechua".


>I've got Meillet and Cohen's 1924 _Les langues du monde_ here on my desk
in
>front of me. Like the Ethnologue, it deals with the languages of the
world.
>It has big lists in it. Would I accept those uncritically either? No.

This seems to me to be an important issue: can people involved in creating
standardized systems of language identifiers trust the judgements of
experts from the field of linguistics. I think the answer must be yes for
two reasons:

1. People creating IT standards cannot be experts in all fields, and
certainly cannot all be experts in linguistics, especially of all different
languages and language families of the world. When dealing with something
outside their field of expertise, there must be a willingness to trust the
judgements of experts in that domain, and I think this applies in this
case.

2. The position that those controlling a system of language identifiers
must hold the expertise and be able to make determine how to "tile the
plane" of language variations around the world is based on an invalid
assumption: that there is only one, correct way to tile the plane for use
in IT. There is not one single, correct categorization of languages. This
is one of the key points Gary and I have made in our paper.


>I recognize the need for more languages. My concern with the Ethnologue is
>with its classification.

This seems to argue in favour of the proceeding point: there is no single
consensus on how to enumerate the world's languages, since different people
use different definitions for different purposes. The only solution to that
impossible situation is a system that allows for alternate namespaces, each
based on different particular definitions and maintained by different
authorities.

In various messages, it has sounded like you agree with us that the
international standards process could never cope with providing the
thousands of tags that some existing users need. We are in agreement that
the list of 6000+ Ethnologue codes can't serve as *the* international
standard; and we agree that you could never get everybody to agree on a
list that large - this is precisely our point about categorization. Thus if
you recognize the need for more language tags, then you must like our idea
of namespaces, since that gives us a way to have well-documented codes that
anybody can use to address the full scope of the world's languages, without
requiring that the whole world own the codes. It seems that, in the same
way that the XML community couldn't agree on a single worldwide tag set and
so adopted namespaces, so must the IT community do this for language
tagging.



>You know
>how much a fuss there was just because the code for Yiddish was changed
>from ji to yi? Well how much fuss is there going to be if we find out that
>Upper Kinauri and Lower Kinauri shouldn't really have been given two
>different codes? Because we DON'T want to change codes once they have been
>used in an RFC 1766 context.

This is somewhat overstated. Changing the code for a given meaning from
"abc" to "def" is a serious problem, and it is understandable that people
would be upset. And that is something that the Ethnologue staff is
committed never to do. This is different from changing the categorization
based on impr

RE: Ethnologue 14 online

2001-07-24 Thread Yves Arrouye

> After considerable and unfortunate delay, the new Ethnologue site,
> including the online version of the 14th Edition, is at last 
> available to
> the public: http://www.ethnologue.com/home.asp. There are 
> still refinements
> being made, but all the basics are there and working.

Very nice! Something to get lost into for hours...
YA




RE: Ethnologue 14 online

2001-07-25 Thread Marco Cimarosti

Peter Constable wrote on [EMAIL PROTECTED]:
> After considerable and unfortunate delay, the new Ethnologue site,
> including the online version of the 14th Edition, is at last 
> available to
> the public: http://www.ethnologue.com/home.asp. There are 
> still refinements
> being made, but all the basics are there and working.

Congratulations for the new edition!

I have immediately checked the page about Italy
(http://www.ethnologue.com/show_country.asp?name=Italy), and I verified with
great satisfaction that the mistakes of the previous version have now been
corrected.

In the previous edition, the estimate for the various languages spoken in
Italy approximately summed up to the population of Italy. This was an absurd
result, because it did not take in account the fact that most Italians speak
both the national language AND a local language (or "dialect"). It also led
to an unrealistically low estimate for the Italian language itself.

In the 14th version, the sum of speakers for all languages is much greater
than the population of the country, about twice of it. This is a plausible
figure, which takes into account the widespread bilingualism of the
population.

Also, I notice that the number of speakers of Italian is nearly identical to
Ethnologue's evaluation of the literacy rate; this is also a plausible
figure, because teaching in Italy is generally in Italian (or in German, in
some areas, but with Italian as one of the main subjects).

This consolidation of Ethnologue's data for my own area increases
considerably my confidence in the data for other areas of the world.

_ Marco





Re: [OT] Re: the Ethnologue

2000-09-16 Thread John Cowan

On Sat, 16 Sep 2000, Doug Ewell wrote:

> But it gets worse.  When I stripped out the alternate-names field and
> again checked for duplicated codes, I found 14 (AVL AYL CAG CTO FUV GAX
> GSC GSW JUP MHI MHM MKJ SHU SRC).  Some of these duplicates differ only
> in spelling (CAG 'Chulupi' vs. 'Chulupí') but other differences are a
> lot more troubling.  For example, SHU is both 'Arabic, Chadian Spoken'
> and 'Arabic, Shuwa.'  As a non-expert in Arabic, how do I know these
> two names describe the same dialect of Arabic?  (These are certainly
> dialects, not discrete languages.)

I see the problem: the same language (with the same code) may be preferentially
known by one name in one country and another name in another.  Because
the Ethnologue names languages by country, conflicts like this can appear.
The entry on "Chadian Spoken Arabic" (in Chad) lists "Shuwa Arabic" as a
synonym; the name "Shuwa Arabic" is the primary name in Niger, Nigeria,
and Cameroon.

> MKJ is the Ethnologue code for both 'Macedonian' and 'Slavic'.
> Absolutely *everyone* knows there is no one 'Slavic' language; the name
> refers to an entire language family.  This is much more imprecise than
> any of the despised 'Other' codes in ISO 639.

Again, "Macedonian" is the preferred name in Macedonia, Bulgaria, and
Albania, but "Slavic" is preferred in Greece.

> SRC is the code for 'Bosnian', 'Croatian', and 'Serbo-Croatian', which
> means that there is a many-to-one mapping from ISO 639-1 'bs', 'hr',
> 'sr' to Ethnologue 'SRC'.  This is likely to cause much more widespread
> trouble than the Hopi example mentioned earlier.

By Ethnologue standards of mutual intelligibility, there is only one
language here.

> Certainly more codes need to be added to ISO 639, and the Maintenance
> Agency needs to be sure not to present an image of unresponsiveness
> (if in fact they have been guilty of that in the past).  However, they
> have their own, existing guidelines for the level at which languages
> should be encoded (one written vs. 60 spoken variants) and this must
> be respected.

Precisely.  Unwritten languages, or languages with only a few written
works, or languages whose written form appears only on bamboo, don't
make it into 639-2, which is (like it or not) in practice a standard
for bibliographic use.

In addition, the notion of mapping spoken form A to written form B
on the basis that the speakers of A write B when they need to write
entails the notion that Dongxiang [SCE], a language of the Mongolian family,
is a "dialect" of Chinese in the same sense that Wu Chinese [WUU] is.

> And the duplicated codes in the Ethnologue list must be
> edited down to one code each, or the list will not earn the respect for
> accuracy that it perhaps deserves.

It seems clear from the detailed information that in all 14 cases,
there is only one language, known by different names in different
countries.  Expecting the Ethnologue to solve this problem by fiat,
or even to openly prefer one name over another when nationalist sympathies
decree otherwise, is IMHO not reasonable.

-- 
John Cowan   [EMAIL PROTECTED]
One art/there is/no less/no more/All things/to do/with sparks/galore
--Douglas Hofstadter






Re: [OT] Re: the Ethnologue

2000-09-16 Thread John Cowan


> From: "John Cowan" <[EMAIL PROTECTED]>
> > It seems clear from the detailed information that in all 14 cases,
> > there is only one language, known by different names in different
> > countries.  Expecting the Ethnologue to solve this problem by fiat,
> > or even to openly prefer one name over another when nationalist sympathies
> > decree otherwise, is IMHO not reasonable.
> 
> John, a solution must be acheived, nevertheless. If a large part or even all
> of the Ethnologue is to be used as a part of any of these standards, then it
> must be done.
> 
> In a way, this is one of the only advantages to not giving locale tags any
> significance -- by assigning them numbers, you really are trying to stay out
> of the business of people who have very different ideas about names and
> such. In a world where countries can go to war over lesser matters then
> this, I prefer the numbers to having yet another tightrope to walk. :-(

It does not matter in this case whether the tags are meaningful or not.
Doug wants the Ethnologue to give each of its languages (uniquely tagged)
a single unique worldwide authoritative name.  That's not reasonable
in all cases, though it is in 99.5%.

The issue is not about unique language <-> tag mapping, which we already
have.  It's about a unique language <-> name mapping.

-- 
John Cowan   [EMAIL PROTECTED]
One art/there is/no less/no more/All things/to do/with sparks/galore
--Douglas Hofstadter

On Sat, 16 Sep 2000, Michael (michka) Kaplan wrote:




Re: [OT] Re: the Ethnologue

2000-09-17 Thread Doug Ewell

John Cowan <[EMAIL PROTECTED]> wrote:

> Doug wants the Ethnologue to give each of its languages (uniquely
> tagged) a single unique worldwide authoritative name.  That's not
> reasonable in all cases, though it is in 99.5%.

What names are I supposed to associate with codes like SHU, MKJ, and
SRC in my (possibly hypothetical) application that deals with language
tags?  Such associations are normally expected to be one-to-one.

If Ethnologue codes are going to be regarded as a standard outside the
confines of SIL, each code needs to be associated with a single,
normative name.  Unicode understands this concept, which is why you
have things like U+002E FULL STOP and an explanatory note that this
character is optionally called "period."  Here in the U.S. we would
never call '.' a full stop, always a period (or dot or decimal point),
but in the U.K. the opposite is true, and one normative name had to be
chosen over the other(s).

Spaniards generally refer to their national language as "castellano,"
not "español," but at some point in the ISO 639 process, a decision had
to be made that one name would be preferred over the other.  SIL
evidently felt that way too, as "Castilian" is just one of the many
alternate names given for the primary name "Spanish."  But for the code
GSW, the Ethnologue staff created separate entries for "Allemanisch,"
"Alsatian," and "Schwyzerdütsch," which *may* appease nationalistic
preferences but definitely *does* result in inconsistency and
confusion.

An inconsistent standard can be worse than no standard at all.

-Doug Ewell
 Fullerton, California



Re: [OT] Ethnologue / Swiss German

2000-09-17 Thread Mark Davis

> the Ethnologue staff created separate entries for "Allemanisch,"
> "Alsatian," and "Schwyzerdütsch," which *may* appease nationalistic
> preferences but definitely *does* result in inconsistency and
> confusion.

Interesting example. Some time ago I lived in eastern Switzerland for 4
years, and learned German there. The Swiss German in western Switzerland is
significantly different (I don't know how well I would have understood it
if I hadn't had a Berner as an office-mate at work for several years). The
speech of Alsace (or south around Zermat) was too far afield for me to
really understand. So if one characterizes languages on the basis of mutual
intelligibility, one might in fact distinguish at least Alsatian from Swiss
German. On the other hand, a native speaker (like Martin) might have a
different perspective.

Mark

Doug Ewell wrote:

> John Cowan <[EMAIL PROTECTED]> wrote:
>
> > Doug wants the Ethnologue to give each of its languages (uniquely
> > tagged) a single unique worldwide authoritative name.  That's not
> > reasonable in all cases, though it is in 99.5%.
>
> What names are I supposed to associate with codes like SHU, MKJ, and
> SRC in my (possibly hypothetical) application that deals with language
> tags?  Such associations are normally expected to be one-to-one.
>
> If Ethnologue codes are going to be regarded as a standard outside the
> confines of SIL, each code needs to be associated with a single,
> normative name.  Unicode understands this concept, which is why you
> have things like U+002E FULL STOP and an explanatory note that this
> character is optionally called "period."  Here in the U.S. we would
> never call '.' a full stop, always a period (or dot or decimal point),
> but in the U.K. the opposite is true, and one normative name had to be
> chosen over the other(s).
>
> Spaniards generally refer to their national language as "castellano,"
> not "español," but at some point in the ISO 639 process, a decision had
> to be made that one name would be preferred over the other.  SIL
> evidently felt that way too, as "Castilian" is just one of the many
> alternate names given for the primary name "Spanish."  But for the code
> GSW, the Ethnologue staff created separate entries for "Allemanisch,"
> "Alsatian," and "Schwyzerdütsch," which *may* appease nationalistic
> preferences but definitely *does* result in inconsistency and
> confusion.
>
> An inconsistent standard can be worse than no standard at all.
>
> -Doug Ewell
>  Fullerton, California




Re: [OT] Re: the Ethnologue

2000-09-17 Thread Doug Ewell

Michael Kaplan <[EMAIL PROTECTED]> wrote:

>> Spaniards generally refer to their national language as "castellano,"
>> not "español," 
>
> FWIW, I do not know of any Spaniards who object to "español" for the
> generic language spoken by everyone around the world Castilian
> they reserve for their own (pure) Spanish

Well, perhaps this is another, unintended example of a problem with
incorporating the Ethnologue linguistic distinctions into other
standards without serious review.  If Spaniards consider their language
sufficiently different from the Spanish spoken by Latin Americans,
should there be separate codes for the two, or not?  What about similar
concerns with French vs. Canadian French, American vs. British English,
etc.?  How does this map intelligently to the existing (like it or not)
ISO 639 standard?  Standards intended for widespread use should address
issues like these explicitly.

-Doug Ewell
 Fullerton, California



Re: [OT] Re: the Ethnologue

2000-09-17 Thread Michael \(michka\) Kaplan

Most seem to be okay with the addition of the country/region tag from
ISO-3166 for determing the difference between languages spoken in several
places -- this is usually what is done for English, Arabic, Portuguese,
French, and Chinese, as well.

Under Windows, they just tack on a new sublanguage to create a new LCID...
Spanish, English, and Arabic seem to be duking it out for the largest number
of ones accepted from release to release -- although they are missing a lot
of languages and dialects, on purpose, as well.

I only have a few friends in Spain none of them were offended at those
who would refer to Spanish as español, but none of them were terribly
pleased with referring to Mexican Spanish as Castilian, either. I think it
may be similar to the French vs. Canadian French issue, just with less
emotion behind it.

michka

a new book on internationalization in VB at
http://www.i18nWithVB.com/

- Original Message -
From: "Doug Ewell" <[EMAIL PROTECTED]>
To: "Unicode List" <[EMAIL PROTECTED]>
Sent: Sunday, September 17, 2000 1:19 PM
Subject: Re: [OT] Re: the Ethnologue


> Michael Kaplan <[EMAIL PROTECTED]> wrote:
>
> >> Spaniards generally refer to their national language as "castellano,"
> >> not "español,"
> >
> > FWIW, I do not know of any Spaniards who object to "español" for the
> > generic language spoken by everyone around the world Castilian
> > they reserve for their own (pure) Spanish
>
> Well, perhaps this is another, unintended example of a problem with
> incorporating the Ethnologue linguistic distinctions into other
> standards without serious review.  If Spaniards consider their language
> sufficiently different from the Spanish spoken by Latin Americans,
> should there be separate codes for the two, or not?  What about similar
> concerns with French vs. Canadian French, American vs. British English,
> etc.?  How does this map intelligently to the existing (like it or not)
> ISO 639 standard?  Standards intended for widespread use should address
> issues like these explicitly.
>
> -Doug Ewell
>  Fullerton, California
>




RE: [OT] Re: the Ethnologue

2000-09-17 Thread Carl W. Brown

> Michka wrote :

>Most seem to be okay with the addition of the country/region tag from
>ISO-3166 for determing the difference between languages spoken in several
>places -- this is usually what is done for English, Arabic, Portuguese,
>French, and Chinese, as well.

I don't see how one can use ISO-3166 regions.  The region tags don't follow
linguistic breaks even if they are roughly geographic.  For example, if I
wanted to describe the Northeastern dialects of Brazilian Portuguese that
have traits such as pluralizing the article but not the noun and a heavy
native Amarican Indian influence, I would be hard pressed to find an
ISO-3166-2 designation to use.

The region might be described as a set of states where the differences in
culture between the narrow costal strip and high desert interior is greater
that the differences between states.

Carl




Re: [OT] Re: the Ethnologue

2000-09-17 Thread Michael \(michka\) Kaplan

Well, to cover THAT level of variation, there is only the Ethnologue that I
have ever seen. But the specific question was about language differences
that ISO *can* cover.

michka

a new book on internationalization in VB at
http://www.i18nWithVB.com/

- Original Message -
From: "Carl W. Brown" <[EMAIL PROTECTED]>
To: "Unicode List" <[EMAIL PROTECTED]>
Sent: Sunday, September 17, 2000 3:41 PM
Subject: RE: [OT] Re: the Ethnologue


> > Michka wrote :
>
> >Most seem to be okay with the addition of the country/region tag from
> >ISO-3166 for determing the difference between languages spoken in several
> >places -- this is usually what is done for English, Arabic, Portuguese,
> >French, and Chinese, as well.
>
> I don't see how one can use ISO-3166 regions.  The region tags don't
follow
> linguistic breaks even if they are roughly geographic.  For example, if I
> wanted to describe the Northeastern dialects of Brazilian Portuguese that
> have traits such as pluralizing the article but not the noun and a heavy
> native Amarican Indian influence, I would be hard pressed to find an
> ISO-3166-2 designation to use.
>
> The region might be described as a set of states where the differences in
> culture between the narrow costal strip and high desert interior is
greater
> that the differences between states.
>
> Carl
>
>




RE: [OT] Re: the Ethnologue

2000-09-17 Thread Carl W. Brown

>John Cowan wrote:

>I see the problem: the same language (with the same code) may be
preferentially
>known by one name in one country and another name in another.  Because
>the Ethnologue names languages by country, conflicts like this can appear.
>The entry on "Chadian Spoken Arabic" (in Chad) lists "Shuwa Arabic" as a
>synonym; the name "Shuwa Arabic" is the primary name in Niger, Nigeria,
>and Cameroon.

> 

>It seems clear from the detailed information that in all 14 cases,
>there is only one language, known by different names in different
>countries.  Expecting the Ethnologue to solve this problem by fiat,
>or even to openly prefer one name over another when nationalist sympathies
>decree otherwise, is IMHO not reasonable.

>John Cowan   [EMAIL PROTECTED]
>One art/there is/no less/no more/All things/to do/with sparks/galore
>   --Douglas Hofstadter

I can understand your point of view as a standards person.

You are right the Ethnologue is not appropriate as a standard.  But that
does not make it useless.  Your quote from Doug points this out.  45 years
ago he & I were into exact things such as number theory and physics.
Topology was as far as we would venture in soft sciences.  Then in 5th grade
I left for Brazil.  We just met up after 45 years.  The idea that language
is both a standard and not a standard thrills both of us because it makes
this field far more complex and intriguing than physics.

Where I see using the SIL is as an extension of the ISO standard.  If there
is no ISO code then use the SIL code.  As far as research goes, you have to
do your own to be able to prepare the locale.  This will eliminate 90% of
the flaky SIL languages.  There either will be no demand or the research
will uncover which of several encodings to use.  Yes this is not a standard
but it is a way to implement until a standard can be developed.  It is
easier to deal with the SIL codes than the i-x codes.  Besides I can not
take any standard that implements i-klingon as a human language too
seriously.

On the other hand if you consider that language is part of cultural
expression and that different languages express ideas specific to the
culture then the SIL is incomplete.  For example, Boont is an English slang
language developed around Booneville California.  This is not listed but
then you have to remember that the list is explicitly funded for the purpose
of translating bibles and I doubt that there is any interest in languages
that are not primary languages.  People who speak Boont also speak English.

Standards as extremely important and they should be solid. They must work
for you but in this business you can not be slaves to them.  The
implementations should be based on standards but be flexible to accommodate
exceptions when needed.  If I use the SIL codes I stand a good chance that
the codes may be the same codes that ISO may adopt and I can avoid a later
conversation.  These codes fit into the 639-2 tables with no program
changes.  For me it is a win-win situation.  I just need to keep track of
them and check every time the ISO standard is updated to insure that new ISO
codes are not using the SIL codes.  If so, then I will have to migrate the
SIL codes.

In practice few sites will implement languages that are not covered by the
639 list.  So these exceptions should be very few and should be manageable.

Carl






RE: [OT] Re: the Ethnologue

2000-09-17 Thread John Cowan

On Sun, 17 Sep 2000, Carl W. Brown wrote:

> I can understand your point of view as a standards person.
> 
> You are right the Ethnologue is not appropriate as a standard.  But that
> does not make it useless.

I am not a "standards person", and I think you have my stand mixed up.
I am in favor of registering the tags in the Ethnologue (except for
those which are *semantically* the same as existing 639-2 languages)
in the RFC 1766 registry in the form i-sil-xxx.

> Where I see using the SIL is as an extension of the ISO standard.

RFC 1766 exists to allow flexible extension to the ISO standard.

> If there
> is no ISO code then use the SIL code.

There are already collisions, so simply using one or the other
gets you into trouble.  For example, ARC is the SIL code for Archi,
a Northern Caucasian language spoken in the Russian Federation.
But you cannot use it in an ISO 639 field, because ARC in 639
represents Aramaic, which is differentiated by SIL into 16 languages.

But under my proposal, Archi is i-sil-arc, and Aramaic is arc.  If
you want to specify Assyrian Neo-Aramaic specifically, you can use
i-sil-aii.

> As far as research goes, you have to
> do your own to be able to prepare the locale.  This will eliminate 90% of
> the flaky SIL languages.  There either will be no demand or the research
> will uncover which of several encodings to use.  Yes this is not a standard
> but it is a way to implement until a standard can be developed.

Locales are by no means the only uses of language tagging.  My primary
interest is in labeling the languages used in multimedia objects, including
text, audio content, or both.

> It is
> easier to deal with the SIL codes than the i-x codes.

What i-x codes?  Currently there are only a few.

> Besides I can not
> take any standard that implements i-klingon as a human language too
> seriously.

Why not?  Human beings speak it (some more fluently than others), and
write texts in it.  Just follow the links from www.kli.org.  It is not
anybody's native language, but neither is Ladino (i-sil-spj).

> On the other hand if you consider that language is part of cultural
> expression and that different languages express ideas specific to the
> culture then the SIL is incomplete.

The notion of a complete list of languages is a phantasm.

> For example, Boont is an English slang
> language developed around Booneville California.  This is not listed but
> then you have to remember that the list is explicitly funded for the purpose
> of translating bibles and I doubt that there is any interest in languages
> that are not primary languages.  People who speak Boont also speak English.

There are many languages listed in the Ethnologue that aren't native
languages.  As for the short ling, the kimmies at SIL were plenty bahl to
omeert it.

-- 
John Cowan   [EMAIL PROTECTED]
One art/there is/no less/no more/All things/to do/with sparks/galore
--Douglas Hofstadter





Re: [OT] Re: the Ethnologue

2000-09-17 Thread Doug Ewell

John Cowan <[EMAIL PROTECTED]> wrote:

> I am in favor of registering the tags in the Ethnologue (except for
> those which are *semantically* the same as existing 639-2 languages)
> in the RFC 1766 registry in the form i-sil-xxx.

and later:

> There are already collisions, so simply using one or the other
> gets you into trouble.  For example, ARC is the SIL code for Archi,
> a Northern Caucasian language spoken in the Russian Federation.
> But you cannot use it in an ISO 639 field, because ARC in 639
> represents Aramaic, which is differentiated by SIL into 16 languages.
>
> But under my proposal, Archi is i-sil-arc, and Aramaic is arc.  If
> you want to specify Assyrian Neo-Aramaic specifically, you can use
> i-sil-aii.

Since I have spent this whole, *very* OT discussion as the contrarian
("devil's advocate" is too polite), I will take this opportunity to say
that now that I understand John's proposal more clearly, I like it and
think it makes a good deal of sense in an RFC 1766 bis environment.

If "i-" tags are just an RFC 1766 thing, then this can work exactly as
John suggested.  OTOH, if they are specified by ISO 639 in any way,
then we would have to use "x-" tags instead, since we are not at
liberty to extend ISO 639 unilaterally.

The mechanism for using these codes would need to be explicitly
specified in RFC 1766 bis, and the rules would have to be the same as
for other "i-" and "x-" codes, namely that ISO 639-1 codes must be used
whenever possible, followed in turn by ISO 639-2 codes, "i-sil-xxx"
Ethnologue codes (whoops, John, that's a real code (for Keo)), other
"i-" codes, and finally "x-" codes.  I think that's what John is
proposing, anyway.

My other concerns about the Ethnologue remain: I still believe there
needs to be one normative name for each language (politically incorrect
though it may be); and some common sense needs to prevail regarding the
scope of the language tag (like exactly how specific we need to be
about the exact dialect of Chinese in a text message).  But John's
proposal might be a solution for those people who really need a
standard language tag for Mukumina.

[Note to Harald:  "RFC 1766 bis" was Carl W. Brown's term for your
draft successor to RFC 1766.  He cited an earlier draft, in which the
proposed guidelines for the second subtag were defined explicitly.]

-Doug Ewell
 Fullerton, California



Re: [OT] Re: the Ethnologue

2000-09-17 Thread John Cowan

On Sun, 17 Sep 2000, Doug Ewell wrote:

> Since I have spent this whole, *very* OT discussion as the contrarian
> ("devil's advocate" is too polite), I will take this opportunity to say
> that now that I understand John's proposal more clearly, I like it and
> think it makes a good deal of sense in an RFC 1766 bis environment.

Hurrah, hurrah!

> If "i-" tags are just an RFC 1766 thing, then this can work exactly as
> John suggested.  OTOH, if they are specified by ISO 639 in any way,
> then we would have to use "x-" tags instead, since we are not at
> liberty to extend ISO 639 unilaterally.

They are an RFC 1766 thing:  "i" is short for IANA, the registration
agency associated with the RFCs.

> The mechanism for using these codes would need to be explicitly
> specified in RFC 1766 bis, and the rules would have to be the same as
> for other "i-" and "x-" codes, namely that ISO 639-1 codes must be used
> whenever possible, followed in turn by ISO 639-2 codes, "i-sil-xxx"
> Ethnologue codes (whoops, John, that's a real code (for Keo)), other
> "i-" codes, and finally "x-" codes.  I think that's what John is
> proposing, anyway.

Just so.  Of course, this rule applies to the review/registry system.
Thus, i-sil-eng would never even be registered, because en serves the
same purpose.

> My other concerns about the Ethnologue remain: I still believe there
> needs to be one normative name for each language (politically incorrect
> though it may be);

I too agree that this would be desirable, but for the sake of 14
cases out of 7000, I wouldn't hold the whole system hostage.

> and some common sense needs to prevail regarding the
> scope of the language tag (like exactly how specific we need to be
> about the exact dialect of Chinese in a text message).

We need to be as specific as we need to be to solve the particular
problem, I guess.  If "zh" is all you need, then use it; otherwise
go to zh-guoyu or zh-yue or whatever.

> But John's
> proposal might be a solution for those people who really need a
> standard language tag for Mukumina.

Exactly so.  And BTW "my proposal" is also Harald Alvestrand's proposal.

-- 
John Cowan   [EMAIL PROTECTED]
One art/there is/no less/no more/All things/to do/with sparks/galore
--Douglas Hofstadter





Re: [OT] Re: the Ethnologue

2000-09-19 Thread Antoine Leca

Doug Ewell wrote:
> 
> Michael Kaplan <[EMAIL PROTECTED]> wrote:
> 
> >> Spaniards generally refer to their national language as "castellano,"
> >> not "español,"

In fact, "castellano" is more like a compromise used to describe the
linguistic situation of Spain. When speaking with Spaniards, native
Castilian people will almost never use the word "Castilian", always
"Spanish" (or the equivalent translations "espaõl", "espagnol", etc.)
In fact, someone which naturally uses "Castilian" instead of "Spanish"
in a conversation have probably another language as mothertongue...

>From what I said (tyhat is, very few), Hispanoamericans use "español",
although they know for sure what "castellano" means. Perhaps even,
the use of "castellano" may denotes the European Spanish when or where
it differs with their own languages.


> > FWIW, I do not know of any Spaniards who object to "español" for the
> > generic language spoken by everyone around the world Castilian
> > they reserve for their own (pure) Spanish

I beg to differ.
 
> Well, perhaps this is another, unintended example of a problem with
> incorporating the Ethnologue linguistic distinctions into other
> standards without serious review.  If Spaniards consider their language
> sufficiently different from the Spanish spoken by Latin Americans,

They don't. At the contrary, they are proud that their language is
spoken all around the world.
Now, they are very well aware that there are differencies; the main
differences are systematic differences in prononciation (ll, y mainly).


> should there be separate codes for the two, or not?  What about similar
> concerns with French vs. Canadian French, American vs. British English,
> etc.?  How does this map intelligently to the existing (like it or not)
> ISO 639 standard?  Standards intended for widespread use should address
> issues like these explicitly.

Most of these differences are related to the spoken languages, and do not
appear in writing. Since IT is mainly related with writing, this is a
more minor point that it may appear at first sight.


Antoine



Re: [OT] Re: the Ethnologue

2000-09-20 Thread Peter_Constable


On 09/16/2000 06:15:51 PM "Michael \(michka\) Kaplan" wrote:

>From: "John Cowan" <[EMAIL PROTECTED]>
>> On Sat, 16 Sep 2000, Doug Ewell wrote:
>> > SRC is the code for 'Bosnian', 'Croatian', and 'Serbo-Croatian', which
>> > means that there is a many-to-one mapping from ISO 639-1 'bs', 'hr',
>> > 'sr' to Ethnologue 'SRC'.  This is likely to cause much more
widespread
>> > trouble than the Hopi example mentioned earlier.
>>
>> By Ethnologue standards of mutual intelligibility, there is only one
>> language here.
>
>Well, thisis one that can actually get some of the speakers (or their
>governments) pretty upset, though. And both ISO639-x and rfc1766 have to
>care about such things

As I've been saying, this amounts to differences of operational definitions
(which may not be explicitly and consciously defined). The Ethnologue is
attempting to consistently apply a definition based primarily on mutual
non-intelligibility. There is no question that there are communities that
speak the same "language" (by this definition), but that have distinct
identities for various ethnic, social, religious or political reasons, and
that the distinct identities get carried into their perception of language
categories. Exactly the opposite is also true: e.g. that because people
share a particular written form it is perceived that they all speak the
same language, for instance, "Chinese".

What is crucial here is that there are situations in IT where more than one
way of "tiling the plane" is needed, since different users and different
applications have different requirements. The only resolutions to this
problem are distinct namespaces based on distinct definitions for different
purposes, or chaos, or that some IT needs simply are ignored. The first of
these is the only solution.




>John, a solution must be acheived, nevertheless. If a large part or even
all
>of the Ethnologue is to be used as a part of any of these standards, then
it
>must be done.
>
>In a way, this is one of the only advantages to not giving locale tags any
>significance -- by assigning them numbers, you really are trying to stay
out
>of the business of people who have very different ideas about names and
>such. In a world where countries can go to war over lesser matters then
>this, I prefer the numbers to having yet another tightrope to walk. :-(

This is exactly one of the points Gary and I make in our paper regarding
benifits of dispensing with a requirement that tags be mnemonic.




- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>





Re: [OT] Re: the Ethnologue

2000-09-20 Thread Peter_Constable


On 09/16/2000 04:27:45 PM Doug Ewell wrote:

>All I am asking in this particular case is for the Ethnologue editor to
>assign *one* primary name (and spelling) to each three-letter language
>code, and to relegate the other names to alternate status in a
>consistent way.  That is the first necessary step, although maybe not
>the last, in moving the Ethnologue coding system closer to "maturity."

I hope my previous message adequately demonstrated that the information
inside the Ethnologue database already has the desired consistency in
categorization, and have given adequate assurances that the issue of
presenting the information is currently being addressed.



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>





Re: [OT] Re: the Ethnologue

2000-09-20 Thread Peter_Constable


On 09/17/2000 03:19:32 PM Doug Ewell wrote:

>Well, perhaps this is another, unintended example of a problem with
>incorporating the Ethnologue linguistic distinctions into other
>standards without serious review.  If Spaniards consider their language
>sufficiently different from the Spanish spoken by Latin Americans,
>should there be separate codes for the two, or not?

The answer to such question must be answered in terms of a particular
operational definition of "language" for a given namespace of identifiers.
There is no one "right" answer.


>How does this map intelligently to the existing (like it or not)
>ISO 639 standard?  Standards intended for widespread use should address
>issues like these explicitly.

And there is no way for standards to address such issues without
recognising the role of operational definitions.



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>





Re: [OT] Re: the Ethnologue

2000-09-20 Thread Peter_Constable


On 09/17/2000 11:39:14 AM Doug Ewell wrote:

>What names are I supposed to associate with codes like SHU, MKJ, and
>SRC in my (possibly hypothetical) application that deals with language
>tags?  Such associations are normally expected to be one-to-one.
>
>If Ethnologue codes are going to be regarded as a standard outside the
>confines of SIL, each code needs to be associated with a single,
>normative name.

A universally "politically correct" name in every case is insoluable.
Simply picking on as a default *for the purposes of implementation of the
system of identifiers* is reasonable, and is a problem we have to be able
to solve if we are going to present a view of the data that is organised
first by language - at the least, you have to list one name first. This is
certainly going to happen.


>But for the code
>GSW, the Ethnologue staff created separate entries for "Allemanisch,"
>"Alsatian," and "Schwyzerdütsch," which *may* appease nationalistic
>preferences but definitely *does* result in inconsistency and
>confusion.

I explained the reason for this earlier. I agree that it can result in
confusion. The solution will be to provide better views into the data,
which we intend to do.


- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>





RE: [OT] Re: the Ethnologue

2000-09-20 Thread Peter_Constable


On 09/17/2000 07:22:05 PM "Carl W. Brown" wrote:

>You are right the Ethnologue is not appropriate as a standard.

If we're assuming a single standard, in the sense of a single "tiling of
the plane" of languages, we're not proposing that the Ethnologue be the
standard. We are suggesting, though, that the need for alternate "tilings"
be acknowledged, and that the Ethnologue would serve well as one "tiling".


>Where I see using the SIL is as an extension of the ISO standard.

Just so.


>As far as research goes, you have to
>do your own to be able to prepare the locale.  This will eliminate 90% of
>the flaky SIL languages.  There either will be no demand or the research
>will uncover which of several encodings to use.

(Flaky? Who wants to admit their language is flaky?)

Indeed, I don't expect anybody to suddenly provide full locale data for
thousands of locales. Indeed, implementers will discover what they need to
do based on user requests, and will have to solve the problem of gathering
the necessary data just as they have to do now.

I'm not aware of any group of users requesting a populated locale database
covering thousands of locales. I am aware of several groups of users asking
for thousands of language identifiers, however.


>On the other hand if you consider that language is part of cultural
>expression and that different languages express ideas specific to the
>culture then the SIL is incomplete.  For example, Boont is an English
slang
>language developed around Booneville California.

If sufficiently-documented data is provided to indicate that Boont counts
as a distinct language, according to the operational definitions assumed,
then I would expect the editorial staff would add this to the Ethnologue.
It's not so much a question of whether there is an interest in Bible
translation into the language, but rather of what the sociolinguistic facts
about the language are. (I have no other knowledge of "Boont", so have no
idea whether it would get counted or not. If it is "slang", my guess would
be that it probably doesn't constitute a complete langauge. At this point,
I'm getting in over my head in terms of understanding of sociolinguistics,
so I'll stop here before I get into more trouble than I might have already
gotten into.)


>Standards as extremely important and they should be solid. They must work
>for you but in this business you can not be slaves to them.  The
>implementations should be based on standards but be flexible to
accommodate
>exceptions when needed.  If I use the SIL codes I stand a good chance that
>the codes may be the same codes that ISO may adopt and I can avoid a later
>conversation.

What is important here is that, where ISO doesn't provide a code, that
users do have some other source of codes for internal and, more
importantly, interchange purposes. Many independent agencies and
individuals are already using Ethnologue codes in this way precisely
because ISO provides very limited coverage.



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>





RE: [OT] Re: the Ethnologue

2000-09-20 Thread Peter_Constable


On 09/17/2000 08:02:20 PM John Cowan wrote:

>> Where I see using the SIL is as an extension of the ISO standard.
>
>RFC 1766 exists to allow flexible extension to the ISO standard.
>
>> If there
>> is no ISO code then use the SIL code.
>
>There are already collisions, so simply using one or the other
>gets you into trouble.  For example, ARC is the SIL code for Archi,
>a Northern Caucasian language spoken in the Russian Federation.
>But you cannot use it in an ISO 639 field, because ARC in 639
>represents Aramaic, which is differentiated by SIL into 16 languages.
>
>But under my proposal, Archi is i-sil-arc, and Aramaic is arc.  If
>you want to specify Assyrian Neo-Aramaic specifically, you can use
>i-sil-aii.

John is absolutely correct here, and I need to qualify my agreement to
Carl's statement along exactly the lines John is indicating here.
Ethnologue can supplement ISO codes, but we're not suggesting simply adding
all the Ethnologue codes to the same namespace. That would not work. On the
other hand, "i-sil-xxx" would. It is also necessary to ensure that, if the
category denoted by an instance of "i-sil-xxx" matches that of some ISO
code, then only the ISO code should be used. To deal with this, a mapping
between ISO and Ethnologue is needed, and that is being worked on. (This
mapping will also solve an existing and serious problem of ISO 639-x:
inadequate documentation.)



>Locales are by no means the only uses of language tagging.  My primary
>interest is in labeling the languages used in multimedia objects,
including
>text, audio content, or both.

This is a good example of why an enumeration of "languages" based only on
written forms (as found in ISO 639) is insufficient for all user needs.



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>





Re: [OT] Re: the Ethnologue

2000-09-20 Thread Peter_Constable


On 09/17/2000 10:37:42 PM Doug Ewell wrote:


>Since I have spent this whole, *very* OT discussion as the contrarian

It hasn't been all that off-topic. This has come up on numerous occasions
on this list, and I think is of interest to many of the participants, even
though it isn't strictly about Unicode.

>("devil's advocate" is too polite), I will take this opportunity to say
>that now that I understand John's proposal more clearly, I like it and
>think it makes a good deal of sense in an RFC 1766 bis environment.

John's proposal is one particular implementation of what Gary and I have
proposed in our paper. We favour the creation of a mechanism for distinct
namespaces, however.



>The mechanism for using these codes would need to be explicitly
>specified in RFC 1766 bis,

That may depend upon the exact implementation.


>and the rules would have to be the same as
>for other "i-" and "x-" codes, namely that ISO 639-1 codes must be used
>whenever possible, followed in turn by ISO 639-2 codes,

Absolutely.


>"i-sil-xxx"
>Ethnologue codes (whoops, John, that's a real code (for Keo)), other
>"i-" codes, and finally "x-" codes.  I think that's what John is
>proposing, anyway.

One issue is relative precedence of "i-sil-@@@" codes (where @ is some
ALPHA, to avoid confusion with Keo) and other "i-" codes. Again, we'd
suggest that Ethnologue codes be kept in a distinct namespace (which is one
way to view "i-sil-"), but some issues remain.


>My other concerns about the Ethnologue remain: I still believe there
>needs to be one normative name for each language (politically incorrect
>though it may be);

We will have to address this in some measure in order to present certain
views into the data.


>and some common sense needs to prevail regarding the
>scope of the language tag (like exactly how specific we need to be
>about the exact dialect of Chinese in a text message).

In general, this is determined by application needs together with a
consideration of distinct operational definitions. I think for most users
there will not be too much difficulty in knowing what to use, though, since
the vast majority of new identifiers are each generally of interest only to
a relatively limited set of users (though there are a number of users that
would be interested in the whole lot).


>But John's
>proposal might be a solution for those people who really need a
>standard language tag for Mukumina.

Just so.



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>





Re: [OT] Re: the Ethnologue

2000-09-20 Thread Peter_Constable


On 09/17/2000 11:13:36 PM John Cowan wrote:

>Exactly so.  And BTW "my proposal" is also Harald Alvestrand's proposal.

I wasn't aware of that until Harald mentioned something not too many days
ago.


- Peter




Re: [OT] Re: the Ethnologue

2000-09-20 Thread Peter_Constable


On 09/19/2000 06:01:46 AM Antoine Leca wrote:

>Most of these differences are related to the spoken languages, and do not
>appear in writing. Since IT is mainly related with writing, this is a
>more minor point that it may appear at first sight.

Some domains of IT are mainly interested in writing, but that is by no
means true across the board. Examples include:

- processing that operates on speech data rather than text or text alone
- linguists interested in speech varieties (what they generally think of as
languages)
- governements and development agencies interested in all language
distinctions, including those between unwritten languages



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>





RE: [OT] Re: the Ethnologue

2000-09-20 Thread Carl W. Brown

>From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
>Sent: Wednesday, September 20, 2000 11:06 AM

>What is important here is that, where ISO doesn't provide a code, that
>users do have some other source of codes for internal and, more
>importantly, interchange purposes. Many independent agencies and
>individuals are already using Ethnologue codes in this way precisely
>because ISO provides very limited coverage.

I agree.  For example when it was brought up that other Turkic languages
might be using the dot less i.  I noticed that the SIL confirmed that
Azerbaijan uses the Latin alphabet.  On the other hand it said that Urum was
"Spoken by ethnic 'Greeks'".  Unless this is some kind of inside joke I can
not imagine any Greek having anything to do with anything Turkish.

I was proposing using the SIL codes to supplement the ISO codes rather than
the IANA codes.

Carl




RE: [OT] Re: the Ethnologue

2000-09-20 Thread Nick Nicholas

>From: "Carl W. Brown" <[EMAIL PROTECTED]>
>>From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
>>Sent: Wednesday, September 20, 2000 11:06 AM

>I agree.  For example when it was brought up that other Turkic languages
>might be using the dot less i.  I noticed that the SIL confirmed that
>Azerbaijan uses the Latin alphabet.  On the other hand it said that Urum was
>"Spoken by ethnic 'Greeks'".  Unless this is some kind of inside joke I can
>not imagine any Greek having anything to do with anything Turkish.

Apart from cohabiting in Anatolia for a millenium. :-) In any case, the
Ethnologue is correct about Urum; Urum and Mariupolitan Greek are the two
languages spoken by an ethnically Greek population, which moved to the area
around Mariupol in the Ukraine from Crimea in the 18th century. During
their stay in Crimea, a large part of the population was linguistically
assimilated to Turkic; but the two language groups consider themselves the
same ethnic group (Urum < Rumei "Romans", the mediaeval Greek autonym), and
recently published anthologies of Mariupolitan Greek (in Cyrillic) include
Urum texts.

(Oh, and there's no glyphs in '80s Urum or Mariupolitan that would be out
of place in Ukrainian, in case anyone was interested...)

The Ethnologue does indeed contain inaccuracies and points of contention,
subject to improvement. And its linguistic classification scheme is not
always what meets with the broadest scholarly acceptance. (I worked as a
research assistant on a project on Papuan languages, for instance, where
the researcher had several misgivings.) As Peter Constable has pointed out,
such disagreements are unavoidable in a field in flux, like the linguistic
classification of non-literary languages. At any rate, that the Ethnologue
has the broadest coverage of any source out there, and that it is being
continually refined and improved, is not in dispute. And for the issue of
distinct language tagging, linguistic classification does not seem to me
very germane. In any case, given the nature of the SIL's work, and the
ISO's current coverage, the accuracy of its coverage of Papua New Guinea or
South America is surely more important an issue to evaluate than what it
has to say about Europe.

   Nick Nicholas, TLG, University of California, Irvine
  [EMAIL PROTECTED]www.tlg.uci.edu/~opoudjis
"My most mighty, God-respected, God-glorified, God-promoted, God-governed,
God-magnified Holy Lord King. Health and merriment to your soul, vigour
and well-being to your divine and royal body, prosperity to the benefactions
issuing from your hand, and everything else good and salvific does my
humble self wish to your Holy Majesty on behalf of God Almighty."
--- Miklosich & Mueller I. CLXXXIV; Patriarch to Emperor.





RE: [OT] Re: the Ethnologue

2000-09-20 Thread Carl W. Brown

>From: Nick Nicholas [mailto:[EMAIL PROTECTED]]
>Sent: Wednesday, September 20, 2000 4:48 PM

>Apart from cohabiting in Anatolia for a millenium. :-) In any case, the
>Ethnologue is correct about Urum; Urum and Mariupolitan Greek are the two
>languages spoken by an ethnically Greek population, which moved to the area
>around Mariupol in the Ukraine from Crimea in the 18th century. During
>their stay in Crimea, a large part of the population was linguistically
>assimilated to Turkic; but the two language groups consider themselves the
>same ethnic group (Urum < Rumei "Romans", the mediaeval Greek autonym), and
>recently published anthologies of Mariupolitan Greek (in Cyrillic) include
>Urum texts.

>(Oh, and there's no glyphs in '80s Urum or Mariupolitan that would be out
>of place in Ukrainian, in case anyone was interested...)

Thank you.  That certainly clarifies that issue.  Going one step further
even though the script uses the Cyrillic alphabet how should one treat Latin
characters notably the use of the dot less and dotted i?

Carl





Re: [OT] Re: the Ethnologue

2000-09-21 Thread Antoine Leca

Peter Constable wrote:
> 
> >> > SRC is the code for 'Bosnian', 'Croatian', and 'Serbo-Croatian', which
> >> > means that there is a many-to-one mapping from ISO 639-1 'bs', 'hr',
> >> > 'sr' to Ethnologue 'SRC'.
> >>
> >> By Ethnologue standards of mutual intelligibility, there is only one
> >> language here.
>
> >Well, thisis one that can actually get some of the speakers (or their
> >governments) pretty upset, though.
>
> As I've been saying, this amounts to differences of operational definitions
> (which may not be explicitly and consciously defined). The Ethnologue is
> attempting to consistently apply a definition based primarily on mutual
> non-intelligibility. There is no question that there are communities that
> speak the same "language" (by this definition), but that have distinct
> identities for various ethnic, social, religious or political reasons, and
> that the distinct identities get carried into their perception of language
> categories.


  Hindi, Hindustani, Urdu could be considered co-dialects, but have important
  sociolinguistic differences. Hindi uses the Devanagari writing system, and
  formal vocabulary is borrowed from Sanskrit, de-Persianized, de-Arabicized.
  Literary Hindi, or Hindi-Urdu, has four varieties: Hindi (High Hindi, Nagari
  Hindi, Literary Hindi, Standard Hindi); Urdu; Dakhini; Rekhta. [...]
  Languages and dialects in the Western Hindi group are Hindustani, Bangaru,
  Braj Bhasha, Kanauji, Bundeli; see separate entries.

from the online Ethnologue database, 13th ed.
http://www.sil.org/ethnologue/countries/Inda.html#HND>

Of course, Peter and many people here know that I am taking the worst possible
example. Perhaps one may also fill reports to make clearer that most if not all
of these different entries are mutually intelligible (at least to the extend
that the language I am speaking when speaking of linguistics or of Unicode is
intelligible to the average French-speaking person).


Antoine



Re: [OT] Re: the Ethnologue

2000-09-21 Thread Otto Stolz

Am 2000-09-16 hat Michael Kaplan geschrieben:
> In a way, this is one of the only advantages to not giving locale tags any
> significance -- by assigning them numbers, you really are trying to stay
> out of the business of people who have very different ideas about names and
> such. In a world where countries can go to war over lesser matters then
> this, I prefer the numbers to having yet another tightrope to walk. :-(

And then, they will go to war over the order in which the numbers are
assigned :-(

Best wishes,
   Otto



Re: [OT] Re: the Ethnologue

2000-09-21 Thread Marion Gunn

Arsa Antoine Leca:

> 
>   Hindi, Hindustani, Urdu could be considered co-dialects, but have important
>   sociolinguistic differences. Hindi uses the Devanagari writing system, and
>   formal vocabulary is borrowed from Sanskrit, de-Persianized, de-Arabicized.
>   Literary Hindi, or Hindi-Urdu, has four varieties: Hindi (High Hindi, Nagari
>   Hindi, Literary Hindi, Standard Hindi)...
> 
> from the online Ethnologue database, 13th ed.
> http://www.sil.org/ethnologue/countries/Inda.html#HND>
>

Mm. Maybe a more polite (more PC) turn of phrase might be found than "could be
considered co-dialects", which more than implies, it postulates the existence of a
standard language referent of which the above "could" be considered dialects.

Someone this week, I think it might have been on this list, spoke of languages as
being "allied" to each other. I rather like that. Would it be acceptable to
suggest replacing "co-dialects" with "allied languages"?
mg


>
> Of course, Peter and many people here know that I am taking the worst possible
> example. Perhaps one may also fill reports to make clearer that most if not all
> of these different entries are mutually intelligible (at least to the extend
> that the language I am speaking when speaking of linguistics or of Unicode is
> intelligible to the average French-speaking person).
>
> Antoine

--
Marion Gunn
Everson Gunn Teoranta
<http://www.egt.ie>





Re: [OT] Re: the Ethnologue

2000-09-21 Thread Doug Ewell

Marion Gunn <[EMAIL PROTECTED]> wrote:

>>   Hindi, Hindustani, Urdu could be considered co-dialects...
>
> Mm. Maybe a more polite (more PC) turn of phrase might be found than
> "could be considered co-dialects", which more than implies, it
> postulates the existence of a standard language referent of which the
> above "could" be considered dialects.

Mmm.  I hadn't thought of it that way.  The impression I got from the
prefix "co-" was one of equality among peers, as in "co-author" or
"co-champion"; but now I recognize a separate, contrasting sense of
"co-" to denote subsidiary status, as in "co-pilot."  I suspect the
Ethnologue staff intended the former (polite?) sense, but it could be
intepreted either way as desired.

What fun language is!

-Doug Ewell
 Fullerton, California



Re: [OT] Re: the Ethnologue

2000-09-21 Thread Kevin Bracey

In message <[EMAIL PROTECTED]>
  Doug Ewell <[EMAIL PROTECTED]> wrote:

> Marion Gunn <[EMAIL PROTECTED]> wrote:
> >
> > Mm. Maybe a more polite (more PC) turn of phrase might be found than
> > "could be considered co-dialects", which more than implies, it
> > postulates the existence of a standard language referent of which the
> > above "could" be considered dialects.
> 
> Mmm.  I hadn't thought of it that way.  The impression I got from the
> prefix "co-" was one of equality among peers, as in "co-author" or
> "co-champion"; but now I recognize a separate, contrasting sense of
> "co-" to denote subsidiary status, as in "co-pilot."  I suspect the
> Ethnologue staff intended the former (polite?) sense, but it could be
> intepreted either way as desired.
>
> What fun language is.

As far as I'm aware the co- prefix does mean an equal grouping. Examples that
spring to mind are co-worker, co-conspirator, co-exist, coincidence and
co-operative. I thought co-dialects was a cunningly concise way of saying
that they could all be considered dialects of each other.

I suspect co-pilot was intended as a polite way of NOT saying that the pilot
was secondary to the pilot. But because he clearly is, it looks like a
secondary implication of subsidarity has attached itself to the term, and so
now people start looking for a new term that doesn't imply subsidiarity.
Repeat this cycle until bored, or there are no words left :) 

What fun PC is!

-- 
Kevin Bracey, Principal Software Engineer
Pace Micro Technology plc Tel: +44 (0) 1223 518566
645 Newmarket RoadFax: +44 (0) 1223 518526
Cambridge, CB5 8PB, United KingdomWWW: http://www.acorn.co.uk/



Re: [OT] Re: the Ethnologue

2000-09-21 Thread Marion Gunn

Arsa Kevin Bracey:

>
> As far as I'm aware the co- prefix does mean an equal grouping. Examples that
> spring to mind are co-worker, co-conspirator, co-exist, coincidence and
> co-operative. I thought co-dialects was a cunningly concise way of saying
> that they could all be considered dialects of each other...
>

And so it is, but even the concept "peer" is misleading (some "co-dialects"
priding themselves on being "more equal" than others of their group, some
members of which they may abhor, and deprecate for legal use in their land),
which is also why I favour "allied languages", which neatly sidesteps the
question of hierarchical relationships, real, implied, or created for any
political purpose, so that Croatian-Bosnian-Serbian then become linguistic
allies for solid linguistic reasons, nothing more implied.
mg


>
> --
> Kevin Bracey

Marion Gunn
Everson Gunn Teoranta






Re: [OT] Re: the Ethnologue

2000-09-22 Thread Edward Cherlin

At 6:24 AM -0800 9/21/00, Marion Gunn wrote:
>Arsa Antoine Leca:
>
>>  
>>Hindi, Hindustani, Urdu could be considered co-dialects, but 
>>have important
>>sociolinguistic differences. Hindi uses the Devanagari writing system, and
>>formal vocabulary is borrowed from Sanskrit, de-Persianized, 
>>de-Arabicized.
>>Literary Hindi, or Hindi-Urdu, has four varieties: Hindi (High 
>>Hindi, Nagari
>>Hindi, Literary Hindi, Standard Hindi)...
>>  
>>  from the online Ethnologue database, 13th ed.
>> 
>>http://www.sil.org/ethnologue/countries/Inda.html#HND>
>>
>
>Mm. Maybe a more polite (more PC) turn of phrase might be found than "could be
>considered co-dialects", which more than implies, it postulates the 
>existence of a
>standard language referent of which the above "could" be considered dialects.
>
>Someone this week, I think it might have been on this list, spoke of 
>languages as
>being "allied" to each other. I rather like that. Would it be acceptable to
>suggest replacing "co-dialects" with "allied languages"?
>mg

As long as nobody supposes that the speakers are supposed to be 
allied. Consider

Serbia, Bosnia, Croatia (Serbo-Croatian)
India, Pakistan (Hindi/Urdu)
China, Taiwan, Singapore (Chinese)
North and South Korea
The many Arab countries and dialects
Iran, Afghanistan (Farsi, Dari, Pashto)

or the U.S. and UK in 1776 and 1812. Historical examples could be 
greatly multiplied.
-- 

Edward Cherlin
Generalist
"A knot!" exclaimed Alice. "Oh, do let me help to undo it."
Alice in Wonderland



Re: [OT] Re: the Ethnologue

2000-09-25 Thread Jonathan Coxhead

   On 20 Sep 00, at 9:42, [EMAIL PROTECTED] wrote:

 | On 09/17/2000 11:39:14 AM Doug Ewell wrote:
 | 
 | >What names are I supposed to associate with codes like SHU, MKJ, and
 | >SRC in my (possibly hypothetical) application that deals with language
 | >tags?  Such associations are normally expected to be one-to-one.
 | >
 | >If Ethnologue codes are going to be regarded as a standard outside the
 | >confines of SIL, each code needs to be associated with a single, 
normative
 | >name.
 | 
 | A universally "politically correct" name in every case is insoluable.

   Isn't that exactly what the 3-letter code is? It can be used universally 
and unambiguously to denote any of the languages that are catalogued (or 
Ethnologued).

   It's only when you start addressing humans---in some specific language, 
in some specific locale---that a human-orientated name is needed. And once 
you've got to that point, you are clearly in the realm of human preference, 
and you can invoke whatever political, cultural or social conventions are 
desired by the particular user of the system.

   Maybe the Ethnologue could have a more comprehensive "vocabulary" 
incorporated into it: for each country/language combination, it could list 
all the names of all the *other* country/language combinations---with 
alternates---and notes on the political, cultural or social weights behind 
them. (Most of the entries would be empty, I'd guess.) This would be 
potentially controversial (and a lot of work), but if approached with the 
proper scholarly impartiality, could help its clients avoid many of the 
more egregious errors.

 | Simply picking on as a default *for the purposes of implementation of the
 | system of identifiers* is reasonable, and is a problem we have to be 
able to
 | solve if we are going to present a view of the data that is organised 
first
 | by language - at the least, you have to list one name first. This is
 | certainly going to happen.

   You only have to list one name first in each country/language entry for 
the name of the language in a given country/language. The list of languages 
can readily be sorted by identifier, and so can the names of the other 
country/language combinations listed within. I'd guess there's not much 
controversy about order after that ...

/|
 o o o (_|/
/|
   (_/



RE: [OT] Re: the Ethnologue

2000-11-23 Thread Christopher John Fynn


Peter Constable wrote:

> This is a good example of why an enumeration of "languages" 
> based only on written forms (as found in ISO 639) is 
> insufficient for all user needs.

Of course ISO 639 is insufficient for *all* user needs 
- no standard is. And is there actually a remit for 
ISO 639 to include spoken languages?

Another post mentioned that ISO 639 was started for 
bibliographic purposes. Perhaps ISO 639 should stick 
to being a standard of codes for written languages and 
a separate standard (or a new part of ISO 639) should 
be started for spoken languages. There may just be too 
many conflicts trying to encode both spoken and written 
languages in the same standard with one set of codes. 

Spoken language  is not necessarily at all the same 
thing as written language . 
There are e.g. plenty of mutually incomprehensible 
forms of spoken English which might each deserve a 
code in a standard for spoken languages but probably 
far fewer mutually incomprehensible varieties of written 
English. And the varieties of dialects of written English
do not map neatly to the varieties of spoken English. 
The same is true for other written and spoken
languages.

- Chris



RE: [OT] Re: the Ethnologue

2000-11-30 Thread Elliotte Rusty Harold

At 7:18 AM -0800 11/23/00, Christopher John Fynn wrote:


>Spoken language  is not necessarily at all the same
>thing as written language .
>There are e.g. plenty of mutually incomprehensible
>forms of spoken English which might each deserve a
>code in a standard for spoken languages but probably
>far fewer mutually incomprehensible varieties of written
>English.

I find myself compelled to indulge to some off-topic curiosity here. 
As a native speaker of American English (suburban New Orleans 
dialect, sometimes known as "Yat") I've yet to encounter a spoken 
version of English that I couldn't understand, after at most a couple 
of minutes of accustoming myself to the accent. I've heard some 
pretty thickly accented Englishes (from my perspective) ranging from 
the Cajun bayous of Lousiana to the South Bronx to Yorkshire to New 
Zealand. So far, they were all obviously English, and at least 
intelligible to me. The only times I've had real problems were with 
non-native speakers who had a very limited command of English, and 
even then I was always eventually able to make myself understood and 
vice versa. Could you give some examples that you would consider to 
be "mutually incomprehensible forms of spoken English"?
-- 

+---++---+
| Elliotte Rusty Harold | [EMAIL PROTECTED] | Writer/Programmer |
+---++---+
|  The XML Bible (IDG Books, 1999)   |
|  http://metalab.unc.edu/xml/books/bible/   |
|   http://www.amazon.com/exec/obidos/ISBN=0764532367/cafeaulaitA/   |
+--+-+
|  Read Cafe au Lait for Java News:  http://metalab.unc.edu/javafaq/ |
|  Read Cafe con Leche for XML News: http://metalab.unc.edu/xml/ |
+--+-+



RE: [OT] Re: the Ethnologue

2000-11-30 Thread Doug Ewell

Elliotte Rusty Harold <[EMAIL PROTECTED]> wrote:

> At 7:18 AM -0800 11/23/00, Christopher John Fynn wrote:
>
>> Spoken language  is not necessarily at all the same thing as
>> written language . There are e.g. plenty of mutually
>> incomprehensible forms of spoken English which might each deserve a
>> code in a standard for spoken languages but probably far fewer
>> mutually incomprehensible varieties of written English.
> ...
> I've yet to encounter a spoken version of English that I couldn't
> understand, after at most a couple of minutes of accustoming myself
> to the accent.

I think if you take Christopher's original statement and substitute the
word "Arabic" in place of "English," his point would be proven valid
with a better example.

But Elliotte is basically correct; the differences between dialects of
English are not generally as great as people sometimes make them out
to be.  Sure, it can be a challenge initially to understand another
dialect.  I remember being thrown for a loop by a waiter in Hemel
Hempstead, England who asked me, "Are you on holiday?"  At that point
I had three obstacles to overcome:

1. the word "holiday" used for what I would call "vacation"
2. the dropping of the "h" in "'oliday"
3. the European-style high-falling question tone instead of the
   American-style mid-falling-rising tone

But after a second or two I did understand him (and yes, I was indeed
on holiday).

Differences in accent often say more about the speaker than about the
language he is speaking; a Texan who speaks English with a Texas
accent would most likely speak French or Spanish with a Texas accent as
well.  And most of the vocabulary differences are in well-publicized
word pairs like hood/bonnet and elevator/lift.  This is really no
different from hearing a teenager use a vogue word such as "phat" that
has not yet reached the mainstream (and can easily be confused for a
mainstream homonym).

Naturally, this is all coming from a language hack, not a trained
linguist, so please be gentle as you correct my errors.

-Doug Ewell
 Fullerton, California



Re: [OT] Re: the Ethnologue

2000-11-30 Thread John Cowan

Elliotte Rusty Harold wrote:

> I've yet to encounter a spoken
> version of English that I couldn't understand, after at most a couple
> of minutes of accustoming myself to the accent.

You live in a country where dialect differentiation is a feeble thing,
consisting mainly in pronunciation, and where dialect areas stretch for
hundreds or even thousands of miles.  Australia and Northern China
(in the Chinese-speaking region) are about the only other parts of the
Earth with this property.  English elsewhere is more diverse.

In general, Geordie (the traditional dialect spoken around the Tyne
River in England) is considered to be the English dialect most difficult
for North Americans.

In countries where English is widely spoken as a second language (former
parts of the British Empire), the varieties are often very different.
Indians have trouble with Kenyan English and vice versa, IIRC.

-- 
There is / one art   || John Cowan <[EMAIL PROTECTED]>
no more / no less|| http://www.reutershealth.com
to do / all things   || http://www.ccil.org/~cowan
with art- / lessness \\ -- Piet Hein



Re: [OT] Re: the Ethnologue

2000-11-30 Thread Kenneth Whistler

John Cowan noted:

> 
> In general, Geordie (the traditional dialect spoken around the Tyne
> River in England) is considered to be the English dialect most difficult
> for North Americans.

To that I would add Glaswegian. When watching the
Scots-produced mystery shows that show up on PBS in the United
States on occasion, my wife and I often turn to each other
in bafflement and say, "Subtitles, please."

> In countries where English is widely spoken as a second language (former
> parts of the British Empire), the varieties are often very different.
> Indians have trouble with Kenyan English and vice versa, IIRC.

And in response to Elliotte Harold's comment, when encountering
spoken versions of English, one's task of understanding is often
made easier by the fact that an interlocutor will generally try to
move their pronunciation and usage towards what they (and you)
perceive as more standard, specifically to assist in the task
of communication. And varieties heard in media also tend to be
more comprehensible regional norms, precisely because they are
aiming at a wide audience.

--Ken



Re: [OT] Re: the Ethnologue

2000-11-30 Thread John Cowan

Kenneth Whistler wrote:

> To that I would add Glaswegian. When watching the
> Scots-produced mystery shows that show up on PBS in the United
> States on occasion, my wife and I often turn to each other
> in bafflement and say, "Subtitles, please."

Scots is a separate language!  If you understand anything at all
it's by a happy accident.  (There is of course Scots-flavored
English as well, which is another matter.)

-- 
There is / one art   || John Cowan <[EMAIL PROTECTED]>
no more / no less|| http://www.reutershealth.com
to do / all things   || http://www.ccil.org/~cowan
with art- / lessness \\ -- Piet Hein



Re: [OT] Re: the Ethnologue

2000-11-30 Thread Kenneth Whistler

John Cowan replied:

> Kenneth Whistler wrote:
> 
> > To that I would add Glaswegian. When watching the
> > Scots-produced mystery shows that show up on PBS in the United
> > States on occasion, my wife and I often turn to each other
> > in bafflement and say, "Subtitles, please."
> 
> Scots is a separate language!  If you understand anything at all
> it's by a happy accident.  (There is of course Scots-flavored
> English as well, which is another matter.)

I was, of course, referring to Scots (alleged) English, and not
to Scots Gaelic.

--Ken



Re: [OT] Re: the Ethnologue

2000-11-30 Thread John Cowan

On Thu, 30 Nov 2000, Kenneth Whistler wrote:

> > Scots is a separate language!  If you understand anything at all
> > it's by a happy accident.  (There is of course Scots-flavored
> > English as well, which is another matter.)
> 
> I was, of course, referring to Scots (alleged) English, and not
> to Scots Gaelic.

Sae wes I.  But Scotland's a twa-leidit fowkrick (Scots an Scots Gaelic),
o three gif we coont "mim-mou'd Sudron".

-- 
John Cowan   [EMAIL PROTECTED]
One art/there is/no less/no more/All things/to do/with sparks/galore
--Douglas Hofstadter





Re: [OT] Re: the Ethnologue

2000-09-16 Thread Michael \(michka\) Kaplan

From: "John Cowan" <[EMAIL PROTECTED]>
> On Sat, 16 Sep 2000, Doug Ewell wrote:
> > SRC is the code for 'Bosnian', 'Croatian', and 'Serbo-Croatian', which
> > means that there is a many-to-one mapping from ISO 639-1 'bs', 'hr',
> > 'sr' to Ethnologue 'SRC'.  This is likely to cause much more widespread
> > trouble than the Hopi example mentioned earlier.
>
> By Ethnologue standards of mutual intelligibility, there is only one
> language here.

Well, thisis one that can actually get some of the speakers (or their
governments) pretty upset, though. And both ISO639-x and rfc1766 have to
care about such things

> It seems clear from the detailed information that in all 14 cases,
> there is only one language, known by different names in different
> countries.  Expecting the Ethnologue to solve this problem by fiat,
> or even to openly prefer one name over another when nationalist sympathies
> decree otherwise, is IMHO not reasonable.

John, a solution must be acheived, nevertheless. If a large part or even all
of the Ethnologue is to be used as a part of any of these standards, then it
must be done.

In a way, this is one of the only advantages to not giving locale tags any
significance -- by assigning them numbers, you really are trying to stay out
of the business of people who have very different ideas about names and
such. In a world where countries can go to war over lesser matters then
this, I prefer the numbers to having yet another tightrope to walk. :-(

michka

Michael Kaplan
a new book on internationalization in VB at
http://www.i18nWithVB.com/





ISIRI 3342 (was Re: the Ethnologue)

2000-09-14 Thread Kenneth Whistler

Roozbeh wrote:

> On Wed, 13 Sep 2000, Michael Everson wrote:
> 
> > It names Hancock 1990 as the source of this (impossibly incorrect)
> > information. In the bibliography there is no Hancock 1990.
> 
> Just like The Unicode Standard Version 3.0, page 317, which names
> ISIRI 3342 as a source for ZWJ and ZWNJ, but there's no ISIRI 3342 in the
> References. ;)

This is the kind of error in the book which should be reported to
[EMAIL PROTECTED], where some action can be taken on it -- rather
than to the general list. (Although I realize this was proffered in
part for its rhetorical effect.)

By the way, the passage in question, on p. 317 of TUS 3.0, does
not name ISIRI 3342 as a source for ZWNJ and ZWJ, but rather as a
source for an adapted example of shaping involving such format
controls:

"The preceding examples are adapted from the Iranian national coded
character set standard, ISIRI 3342, which defines these characters as
'pseudo space' and 'pseudo connection,' respectively."

The proximal source of ZWJ and ZWNJ in Unicode was existing
practice in Xerox and Apple implementations of the Arabic script, as of 
1989. The particular text in question, now on p. 317, was added
to Unicode 2.0 in 1996, as part of the general clarification of
cursive joining in that edition of the standard. The failure to
include the reference for ISIRI 3342 dates from that edition.

If you have an exact bibliographic reference for ISIRI 3342 handy,
that would be helpful for the editor to be able to correct this
reference oversight for the next edition.

--Ken



Tagging orthographic systems (was: (iso639.186) the Ethnologue)

2000-09-13 Thread Otto Stolz

Am 2000-09-12 um 17:43 h UCT hat Peter Constable geschrieben:
> ISO 639 codes were primarily intended for bibliography purposes.
> Gary and I point out in our paper that the needs of that sector do
> not necessarily correspond to the general needs of IT, particularly
> for language-specific processing. [...] For example, if all you know
> about the language of some information object is that it is an Athapascan
> language, you can't spell-check that information. The intro to ISO 639
> claims that the standard is intending to serve the needs of a variety of
> sectors, but in its current state it is failing to adequately serve some.
...
> Furthermore, we would contend that the categories enumerated in the
> Ethnologue by-and-large *are* the categories that need to be identified for
> general IT purposes. In the majority of cases, the distinctions made are
> those that would be needed to successfully spell-check, for example. (We
> acknowledge that that is not true in all cases; for example, Chinese
> spelling would cross multiple languages; and alternate English spellings
> are needed for what would generally be considered one language. But these
> are the exceptions, not the norm.)

For many language-specific IT processes involving written language,
such as spell-checking, hyphenating, transliterating (e. g. to Braille),
or audible rendering, it is not enough to know which language you are
dealing with: you also need information about the orthography used.

Orthography is subject to change over time, sometimes several orthograhies
for the same language co-exist, e. g. in transition time-spans or in
neighbouring countries.

For example,
- German orthography has been reformed in 1996; currently, two ortho-
  graphies are legal (e. g. accepted in school assignments): the old
  one, established in 1902, until 2005-07-31, and the new one, effective
  since 1998-08-01; cf. (in German)
  <http://www.ids-mannheim.de/reform/zeitafel.html> (time schedule),
  <http://www.ids-mannheim.de/pub/sprachreport/sr98-extra.pdf> (tutorial),
  and <http://www.ids-mannheim.de/grammis/reform/inhalt.html> (rules);
- France had an orthographic reform for French, in 1991;
- the Dutch spelling-reform of 1934 was enacted 1943 in Belgium,
  and 1947 in the Netherlands; Dutsch spelling was again (marginally)
  reformed in 1995, effective since 1996-08-01;
- Norwegian spelling was reformed in 1907, 1917, and 1938;
- Danish in 1948;
- Spanish in 1910, and again in 1852/55;
- Greek in 1982;
to name just a few. The co-existence of en_US and en_UK has already been
mentioned, im this thread.

Hence, I plead for a tagging-system that allows to represent these dif-
ferences. Currently, all of my WWW pages contain the line:
  
I would rather prefer to incorporate the comment in the tag, as in
the hypothetical:
  
and likewise for other languages, and other applications.

Note that this issue is orthogonal to the country code of RFC 1766.
E. g., both de-AT, de-CH and de-DE could be either spelled the 1902,
or the 1996, way. Hence, the spelling subtag, and the country subtag
should be optional, independend of each other.

I think, the ethnologue lacks information about variant orthographies.
(I last looked in it, a couple of months ago.) Both RFC 1766 and
ISO 639 ignore the issue of variant orthographies.

Best wishes,
   Otto Stolz



Tagging orthographic systems (was: (iso639.186) the Ethnologue)

2000-09-13 Thread Rick McGowan

Otto Stolz wrote:

> I think, the ethnologue lacks information about variant orthographies.

Yes, it does.  But that's OK, because we can make a composite tagging system that tags 
orthography separately from language.

So... does anyone have a comprehensive list of orthographies?

Rick


 


Re: [even more OT] Re: the Ethnologue

2000-09-17 Thread Michael \(michka\) Kaplan

From: "John Cowan" <[EMAIL PROTECTED]>

> > Besides I can not
> > take any standard that implements i-klingon as a human language too
> > seriously.
>
> Why not?  Human beings speak it (some more fluently than others), and
> write texts in it.  Just follow the links from www.kli.org.  It is not
> anybody's native language, but neither is Ladino (i-sil-spj).

Don't forget to use 1554 (0x0612) if you need a Windows LCID for Klingon -
Latin and 2578 (0x0A12) for Klingon - pIqaD.

There's nothing more powerful than a user defined area. :-)

michka

a new book on internationalization in VB at
http://www.i18nWithVB.com/





Re: [even more OT] Re: the Ethnologue

2000-09-17 Thread Doug Ewell

Michael Kaplan <[EMAIL PROTECTED]> wrote:

> Don't forget to use 1554 (0x0612) if you need a Windows LCID for 
> Klingon - Latin and 2578 (0x0A12) for Klingon - pIqaD.
>
> There's nothing more powerful than a user defined area. :-)

This is, at once, the best argument for and the best argument against
user-defined areas that I have ever seen.

-Doug Ewell
 Fullerton, California



Re: Tagging orthographic systems (was: (iso639.186) the Ethnologue)

2000-09-13 Thread Kenneth Whistler

Rick McGowan asked:

> Otto Stolz wrote:
> 
> > I think, the ethnologue lacks information about variant orthographies.
> 
> Yes, it does.  But that's OK, because we can make a composite 
> tagging system that tags orthography separately from language.

I agree that this would be a good idea. script/language/orthography are
3 distinct dimensions for tagging text.

> 
> So... does anyone have a comprehensive list of orthographies?

I rather doubt it. This is an even worse problem than the identification
of languages. It is almost a fractal problem.

There are more or less official spelling reforms of standardized
languages mentioned by Otto. But if you go back to earlier stages
of the languages, you start to run into freeform, nonstandardized
spelling conventions.

Furthermore, there are differences in orthographic conventions
per se, as opposed to spelling differences. This is particularly
the case for recently invented orthographies, developed by
linguists and/or missionaries for formerly unwritten languages.
Usually these are developed on phonological principles, so they
don't yet have a history which results in the proliferation of
arbitrary, archaic spelling conventions, but depending on the
conventions used by a linguist, you can end up with alternate
solutions, and more than one may be in use at any given time.
There may also be differences between formal orthographies, used
in linguistic papers and reference works and practical orthographies
used in teaching materials, for example. And finally, the primary
linguistic materials may, and often do, reflect the idiosyncrasies
of particular linguists and non-linguist recorders, as they
change through time, before anyone rolls out a more or less
standardized orthography for the language in question.

--Ken




Re: Tagging orthographic systems (was: (iso639.186) the Ethnologue)

2000-09-13 Thread Peter_Constable


On 09/13/2000 09:09:12 AM Otto Stolz wrote:

>For many language-specific IT processes involving written language,
>such as spell-checking, hyphenating, transliterating (e. g. to Braille),
>or audible rendering, it is not enough to know which language you are
>dealing with: you also need information about the orthography used.

I *entirely* agree. But let us understand two points:

1. Orthography is not the only paralinguistic notion that IT processes
depend upon.

2. Except in a small number of cases, every category in a list of languages
will map to one or more categories in a list of writing systems (excluding
unwritten languages). In other words, the list of writing systems is a
finer enumeration than the list of languages. What that means is that, in
order to arrive at a comprehensive list of writing systems, you're going to
need a comprehensive list of languages anyway.



>Note that this issue is orthogonal to the country code of RFC 1766.
>E. g., both de-AT, de-CH and de-DE could be either spelled the 1902,
>or the 1996, way. Hence, the spelling subtag, and the country subtag
>should be optional, independend of each other.

I would agree.


>I think, the ethnologue lacks information about variant orthographies.
>(I last looked in it, a couple of months ago.) Both RFC 1766 and
>ISO 639 ignore the issue of variant orthographies.

True.



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>





Analysis of ISO 639 and mappings to SIL Ethnologue

2002-02-13 Thread Peter_Constable

[apologies in advance to those who receive this multiple times]


In connection to work that Gary Simons and I have been doing in 
interaction with ISO/TC 37/SC 2/WG 1, we have added some new pages to the 
Ethnologue web site that present an analysis we have done of the existing 
ISO 639 language codes together with a proposed mapping of those codes to 
entries in the SIL Ethnologue. Here are some relevant URLs:

- http://www.ethnologue.com/iso639/ -- entry point and intro (the 
following pages can be reached via links from this page)

- http://www.ethnologue.com/iso639/codes.asp -- a table of current ISO 639 
codes, with links for each code to a report showing our proposed mapping 
to Ethnologue entries

- http://www.ethnologue.com/iso639/analysis.asp -- our analysis of ISO 639 
codes

- http://www.ethnologue.com/iso639/An_analysis_of_ISO_639.pdf -- a paper 
describing the principles by which we derived our proposed mapping and 
some issues that arise from the analysis (this is the paper I presented at 
IUC 20)

- http://www.ethnologue.com/codes/ -- information on the codes used in the 
Ethnologue, with links to downloadable files containing core language data 
from the Ethnologue

>From the first page mentioned above, you will find links to downlodable 
files containing the mapping data.

It should be noted that our analysis and proposed mapping have not been 
approved by ISO, though they are being given consideration. Some of the 
principles we adopted in deciding on mappings may turn out to be different 
from what ISO would want to adopt. We adopted those principles based on 
our understanding of what would best suit the interests of TC 37 (though 
not necessarily TC 46 -- due mainly to the circumstance of which committee 
we happened to be interacting with), and viewed them as a proposal for ISO 
to consider and revise as might be needed.




- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>





SV: Analysis of ISO 639 and mappings to SIL Ethnologue

2002-02-16 Thread Audun H. Lona

The Norwegian and Sami language pages on this web site are unfortunately
so full of errors that they should be removed or corrected immediately
in order to avoid misleading information to be spread. 

An example on http://www.ethnologue.com/show_language.asp?code=LPR
Dialects RUIJA, TORNE, SEA LAPPISH. Ruija is the Finnish name for the
territory covered by Northern-Troms and Finnmark provinces (fylker), its
not a dialect. 

The term "saami" should be replaced with "sami" as in the 639 original
document (i hope). 


Audun Lona 





Re: Analysis of ISO 639 and mappings to SIL Ethnologue

2002-02-17 Thread Stefan Persson

- Original Message -
From: "Audun H. Lona" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>;
<[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>;
<[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>;
<[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>; "Trond Trosterud "
<[EMAIL PROTECTED]>; "Håvard Hjulstad" <[EMAIL PROTECTED]>
Sent: den 17 februari 2002 01:54
Subject: SV: Analysis of ISO 639 and mappings to SIL Ethnologue


> The Norwegian and Sami language pages on this web site are unfortunately
> so full of errors that they should be removed or corrected immediately
> in order to avoid misleading information to be spread.
>
> An example on http://www.ethnologue.com/show_language.asp?code=LPR
> Dialects RUIJA, TORNE, SEA LAPPISH. Ruija is the Finnish name for the
> territory covered by Northern-Troms and Finnmark provinces (fylker), its
> not a dialect.

Isn't that *both* the Finnish name for the territory *and* the Finnish name
for the dialect?

Stefan


_
Do You Yahoo!?
Get your free @yahoo.com address at http://mail.yahoo.com





SV: Analysis of ISO 639 and mappings to SIL Ethnologue

2002-02-17 Thread Audun H. Lona

> -Opprinnelig melding-
> Fra: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED]] På vegne av Stefan Persson
> Sendt: 17. februar 2002 11:32
> Til: Audun H. Lona; [EMAIL PROTECTED]; 
> [EMAIL PROTECTED]; [EMAIL PROTECTED]; 
> [EMAIL PROTECTED]; [EMAIL PROTECTED]; 
> [EMAIL PROTECTED]; [EMAIL PROTECTED]; 
> [EMAIL PROTECTED]; [EMAIL PROTECTED]; Trond 
> Trosterud ; Håvard Hjulstad
> Emne: Re: Analysis of ISO 639 and mappings to SIL Ethnologue
> 
> Isn't that *both* the Finnish name for the territory *and* 
> the Finnish name for the dialect?
> 
> Stefan

No, not as far as I know. Since you refer to Finnish, i have just
contacted The Sami parliament consultant in Sami-language and he had not
heard about it. Klaus Peter Nickel, Samisk Gramatikk 1994, calls the
dialects in Northern sami; western and eastern OR sea-sami and "inland"
sami. 

Audun Lona










  1   2   >