Re: [unicode] Re: conformance for unicode 2.x?

2003-06-09 Thread Barry Caplan

At 11:54 AM 6/6/2003 -0700, Mark Davis wrote:
We never put the 2.0 Standard
online.
Any particular reason why someone (I volunteer if no one else does)
can't type in the conformance section (it is plain English text if it is
like the 3.0 chapter 3)?
Also I would be curious how many copies of this extant and extinct book
were actually sold so I know how hard it will be to find a copy. OK, I
know *I* can find a copy at a colleague's and have them photocopy the
relevant section for me, but what about the rank-and-file developer?

Is this a good thing for this standard to exist only in a long out-of-print
book? RFCs, even the obsoleted ones, live forever online in
numerous archives... shouldn't Unicode strive for the same
immortality?
I am asking about v2 for a selfish reason, but everything above might as
well be about v1 also.

Barry
We do, of course, keep
copies in
our office, but your best bet is to borrow one from a colleague.
Mark
__
http://www.macchiato.com
► “Eppur si muove” ◄
- Original Message - 
From: Barry Caplan [EMAIL PROTECTED]
To: Mark Davis [EMAIL PROTECTED];
[EMAIL PROTECTED]
Sent: Friday, June 06, 2003 10:14
Subject: Re: conformance for unicode 2.x?

 Thanks Mark, but I had done all that online searching before I
posted to the list. Is the book (which I no longer have a copy of)
the
only place where the details for conformance for 2.x are archived?

 If so, is that a good idea?

 Barry

 At 11:09 AM 6/5/2003 -0700, Mark Davis wrote:

 If you start on
http://www.unicode.org/ and click
on Start Here,
 you'll get to a page about the Unicode Standard.
 
 In the left-hand column, clicking on Versions of the
Unicode
Standard
 will get you to
http://www.unicode.org/standard/versions/.
 
 In the left-hand column you will see the different versions of
the
 standards. Unicode 2.1.9 takes you to

http://www.unicode.org/standard/versions/enumeratedversions.html#Unic
ode_2_1_9,
 where you will find the major and minor references. If you look
in
the
 book, you'll find conformance is chapter 3.
 
 The Unicode Consortium. The Unicode Standard, Version 2.0
 Reading, MA, Addison-Wesley Developers Press, 1996. ISBN
 0-201-48345-9.
 
 Clauses may be amended by:
 
 Moore, Lisa. Unicode Technical Report #8, The Unicode
Standard,
 Version 2.1, Revision 2. Cupertino, CA, The Unicode
Consortium,
1998.
 
 Mark
 __

http://www.macchiato.com
 ► “Eppur si muove” ◄
 
 - Original Message - 
 From: Barry Caplan [EMAIL PROTECTED]
 To: [EMAIL PROTECTED]
 Sent: Thursday, June 05, 2003 10:34
 Subject: conformance for unicode 2.x?
 
 
  I was trying to find the place on unicode.org where
conformance
for
 2.x is defined. I think one of the 2.1.x updates referred back
to
 earlier conformance specs, but I couldn't find them. Any
pointers?
 
  Thanks!
 
  Barry
 
 
 






conformance for unicode 2.x?

2003-06-06 Thread Barry Caplan
I was trying to find the place on unicode.org where conformance for 2.x is defined. I 
think one of the 2.1.x updates referred back to earlier conformance specs, but I 
couldn't find them. Any pointers? 

Thanks!

Barry




Re: conformance for unicode 2.x?

2003-06-06 Thread Barry Caplan
Thanks Mark, but I had done all that online searching before I posted to the list. Is 
the book (which I no longer have a copy of) the only place where the details for 
conformance for 2.x are archived? 

If so, is that a good idea?

Barry

At 11:09 AM 6/5/2003 -0700, Mark Davis wrote:

If you start on http://www.unicode.org/ and click on "Start Here",
you'll get to a page about the Unicode Standard.

In the left-hand column, clicking on "Versions of the Unicode Standard"
will get you to http://www.unicode.org/standard/versions/.

In the left-hand column you will see the different versions of the
standards. "Unicode 2.1.9" takes you to
http://www.unicode.org/standard/versions/enumeratedversions.html#Unicode_2_1_9,
where you will find the major and minor references. If you look in the
book, you'll find conformance is chapter 3.

The Unicode Consortium. The Unicode Standard, Version 2.0
Reading, MA, Addison-Wesley Developers Press, 1996. ISBN
0-201-48345-9.

Clauses may be amended by:

Moore, Lisa. Unicode Technical Report #8, The Unicode Standard,
Version 2.1, Revision 2. Cupertino, CA, The Unicode Consortium, 1998.

Mark
__
http://www.macchiato.com
► “Eppur si muove” ◄

- Original Message - 
From: Barry Caplan [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Thursday, June 05, 2003 10:34
Subject: conformance for unicode 2.x?


 I was trying to find the place on unicode.org where conformance for
2.x is defined. I think one of the 2.1.x updates referred back to
earlier conformance specs, but I couldn't find them. Any pointers?

 Thanks!

 Barry







urban legends just won't go away!

2003-01-29 Thread Barry Caplan
http://archive.devx.com/free/tips/tipview.asp?content_id=4151

Who knew in this day and age flipping bits to change case is still publishable (this 
is from today!)

Barry Caplan
www.i18n.com
Vendor Showcase: http://Showcase.i18n.com


--

Use Logical Bit Operations to Change Character Case


This is a simple example demonstrating my own personal method.

// to lower case
  public char lower(int c)
  {
    return (char)((c >= 65 && c <= 90) ? c |= 0x20 : c);
  }

// to upper case
  public char upper(int c)
  {
    return (char)((c >= 97 && c <= 122) ? c ^= 0x20 : c);
  }
/*
 If I would I could create a method for converting an entire
string to lower, like this:
*/
  public String getLowerString(String s)
  {
     char[] c = s.toCharArray();
     char[] cres = new char[s.length()];
     for (int i = 0; i < c.length; ++i)
        cres[i] = lower(c[i]);
     return String.valueOf(cres);
  }
/*
even converting in capital:
*/
  public String capital(String s)
  {
     return
String.valueOf(upper(s.toCharArray()[0])).concat(s.substring(1));
  }
/* using it */
public static void main(String args[])
  {
     x xx = new x();
     System.out.println(xx.getLowerString("LOWER: " + "FRAME"));
     System.out.println(xx.upper('f'));
     System.out.println(xx.capital("randomaccessfile"));
  }
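The reason this tip qualifies as an urban legend is easy to demonstrate: the 0x20 bit trick only touches the 26 unaccented ASCII letters, so everything else silently passes through unchanged, while a Unicode-aware API does the right thing. A minimal sketch in Java (`asciiLower` is my restatement of the tip's `lower` method, not code from the article):

```java
public class BitFlipCaseDemo {
    // The tip's trick: setting bit 0x20 lowercases ASCII A-Z only.
    static char asciiLower(char c) {
        return (c >= 'A' && c <= 'Z') ? (char) (c | 0x20) : c;
    }

    public static void main(String[] args) {
        System.out.println(asciiLower('F'));            // 'f' -- fine for ASCII
        System.out.println(asciiLower('É'));            // 'É' -- U+00C9 untouched
        System.out.println(Character.toLowerCase('É')); // 'é' -- Unicode-aware
    }
}
```

Run it and the accented letter comes back unchanged from the bit trick; only `Character.toLowerCase` handles it.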





Re: Documenting in Tamil Computing

2002-12-17 Thread Barry Caplan
At 10:34 AM 12/17/2002 +0100, Stephane Bortzmeyer wrote:
 There are various extensions and kluges described in various RFCs
 (ESMTP, MIME, etc. )

All these extensions are referenced in the same RFC, 2821, which is
the authoritative one about SMTP. I do not know any mainstream SMTP
server which does not implement them.

The most important for us is 8BITMIME:

   Eight-bit message content transmission MAY be requested of the server
   by a client using extended SMTP facilities, notably the 8BITMIME
   extension [20].  8BITMIME SHOULD be supported by SMTP servers.


There is another RFC, whose number I forget, that defines "should". Essentially it
says you must not rely on anyone else actually implementing this feature.


 but they are not universally implemented at the server transport
 layer,

This is absolutely wrong. sendmail, Postfix and qmail allow 8-bits
transport for a *very* long time.

Well, aside from the fact that those are far from the only mail transport programs
out there, this feature is a configurable option, and may not always be turned
on.


 But for arbitrary email from one address to another, you can't rely on it.

I send Latin-1 (ISO 8859-1) emails for more than ten years (and
without using quoted-printable or other similar hacks) to
French-speaking people in various parts of the world and I'm still
waiting for an actual problem.

You're playing with words. 

Not really - this is very clearly dealt with in an RFC that defines "SHOULD" and
"MUST".


In real life, all SMTP servers support 
8-bits mail because all SMTP servers authors are aware of the issue 
(true, it was long and difficult to convince them all but it 
worked). Any counter-example?

Jungshik Shin wrote: 
Besides, some email servers still don't 
abide by ESMTP standard and don't include '8BITMIME' in their response 
when queried with 'EHLO' although they support 8bit clean transport 
(as you wrote).

I did a quick survey of mail servers in the .com top-level domain about 18 months ago 
to see which servers implemented 8BITMIME and which didn't. IIRC, about 20% or more 
did not. As I said earlier, that does not mean 8 bits wouldn't go through anyway if 
they are modern servers, but you can't rely on that.

I would like to do a wider survey if someone could donate some bandwidth, or maybe 
someone at the W3C who was going to look into this at the time can bring it back to 
the top of the things-to-do list (no names, but I am pretty sure he is on this list... :)

Barry Caplan
www.i18n.com





Re: Documenting in Tamil Computing

2002-12-16 Thread Barry Caplan
At 08:32 PM 12/15/2002 -0500, Jungshik Shin wrote:
 because
 Unicode is not mature enough to be used in multilingual email yet.
 You just have to make do with the 8bit TSCII encoding for Tamil eMail.

  I don't understand what you meant by Unicode not being
mature enough to support multilingual emails. Modern email clients like
Netscape7/Mozilla, MS Outlook (Express), and Mutt support UTF-8 very well.


Actually, it is not Unicode which is not mature enough. It is SMTP, the core mail 
transport protocol. It is not 8-bit clean. It is very clear in the RFCs that only 
7-bit data is allowed over the wire.

There are various extensions and kluges described in various RFCs (ESMTP, MIME, etc. ) 
but they are not universally implemented at the server transport layer, let alone at 
the client layer.

So Unicode falls into a (very large) class of encodings that are not safe to pass over 
SMTP because they use 8 bits for the encoding of at least some characters.

This is a well-known problem, and some mail servers do not follow the SMTP RFC exactly, 
in that they do not specifically strip the 8th bit of all data and turn it to 0. If 
you are lucky and all the mail servers on the path between you and your recipient act 
this way, then 8-bit data will go through.

But for arbitrary email from one address to another, you can't rely on it.
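The failure mode described above can be simulated without a mail server: a relay that enforces 7-bit transport by zeroing the high bit turns non-ASCII bytes into different ASCII ones, irreversibly. A minimal sketch in Java (class and method names are mine; the "relay" is just a loop, not a real SMTP hop):

```java
import java.nio.charset.StandardCharsets;

public class SevenBitStrip {
    // Simulate a non-8-bit-clean relay: force the high bit of every byte to 0.
    static byte[] stripHighBit(byte[] data) {
        byte[] out = new byte[data.length];
        for (int i = 0; i < data.length; i++) {
            out[i] = (byte) (data[i] & 0x7F);
        }
        return out;
    }

    public static void main(String[] args) {
        // "café" in UTF-8 is the bytes 63 61 66 C3 A9.
        byte[] utf8 = "café".getBytes(StandardCharsets.UTF_8);
        // After stripping: 63 61 66 43 29, i.e. "cafC)" -- the é is gone for good.
        String received = new String(stripHighBit(utf8), StandardCharsets.US_ASCII);
        System.out.println(received); // prints "cafC)"
    }
}
```

The same thing happens to any encoding that uses the 8th bit, which is exactly why UTF-8 (or Latin-1) mail is unsafe over a strictly 7-bit path.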

Barry Caplan
www.i18n.com





RE: UTF-Morse

2002-11-22 Thread Barry Caplan
At 02:37 PM 11/22/2002 +0100, Marco Cimarosti wrote:
Otto Stolz wrote:
 Marco, you shall be called Marcone, or even (granting
 a Pluralis majestatis): Marconi ;-)

And each element shall be called a Morsel

Barry





Re: Patent on æ ø å

2002-11-22 Thread Barry Caplan
I met these guys at a trade show a couple of years ago and, without knowing about this 
claim to fame, ended up discussing internationalized URLs. IIRC they mentioned 
something about a patent. I just assume that whatever working groups are standardizing 
international DNS are working around it.

Barry Caplan
www.i18n.com

At 08:24 PM 11/22/2002 +, Michael Everson wrote:
Can there possibly be any truth in any of this?

The following is an article in the Danish paper Information:

http://www.information.dk/Indgang/VisArtikel.dna?pArtNo=136309

Do you know anything about this. It is supposedly the company Walid
(http://www.walid.com/) that has patented the transformation of non-a-z for
use in URLs.

An article in ComputerWorld (admittedly a year and a half old) -
http://www.computerworld.com/managementtopics/ebusiness/story/0,10801,59043,00.html 
- has some references, among other things to the text of the patent.

The Danish site Softwarepatenter.dk has it also: 
http://www.softwarepatenter.dk/walid.html. It is quite new there. Is this
whole thing just hoax?
-- 
Michael Everson * * Everson Typography *  * http://www.evertype.com





Re: Speaking of Plane 1 characters...

2002-11-11 Thread Barry Caplan
At 05:47 PM 11/11/2002 -0500, John Cowan wrote:
Michael Everson scripsit:

 The scale in question is analogous to a temperature scale, not a
 reptilian one.
 
 Now I very *seriously* don't get it.

A temperature scale enumerates the degrees -273, -272, -271, ..., 0, 1, 2, ...
in order.  When you ask What is the temperature?, you are actually asking
What is the scalar value of the temperature?

The Unicode scale enumerates the characters 0, 1, 2, ... 10FFFF₁₆.  Unicode
scalar values are points on this scale, just as temperature scalar values
are points on the (Celsius) temperature scale.

Well, not exactly... temperature is an arbitrary but standard measure of a continuous 
physical property. The multiple well-known scales attest to that. But code points are 
absolute points, not continuous. And because one character has a greater encoding 
value does not make it greater than in any useful sense. 

Basically, we are talking about continuous ordinal scales vs. discrete cardinal scales. 
Hardly analogous at all IMM.

Barry Caplan
www.i18n.com






Re: A .notdef glyph

2002-11-07 Thread Barry Caplan
At 12:51 PM 11/7/2002 -0700, John Hudson wrote:

Inspired by this, I have made a new .notdef glyph: 
http://www.tiro.com/transfer/notdef.gif


Can you provide a document which shows this in context and with the traditional 
rectangle? Maybe at various point sizes? It looks a lot like pop art to me, but I 
wouldn't head off the discussion yet.

Barry Caplan
www.i18n.com





Re: Character identities

2002-10-28 Thread Barry Caplan
At 04:39 PM 10/28/2002 -0600, David Starner wrote:


But think of the utility if Unicode added a COMBINING SNOWCAP and
COMBINING FIRECAP! But should we combine the SNOWCAP with the ICECAP?

(-:

Unicode captures the ice-age during the global warming era!

Do we have codepoints for images found on the walls of caves?

:)

Barry
www.i18n.com





Re: The character @ and gender studies...

2002-10-25 Thread Barry Caplan
Yes - imagine the burden on open relay mailers when they try to blast spam to 
ill-formed email addresses they harvested!

Hey wait - maybe this is a *good* idea!

Barry
www.i18n.com

At 02:12 PM 10/25/2002 +0100, Michael Everson wrote:
At 05:31 -0700 2002-10-25, Ramiro Espinoza wrote:
In some Latin countries the people involved in gender studies are using the 
character @ to mean a/o.

Example: Tod@s nosotr@s (instead of todos nosotros - All of us -).

They try to give a male and female approach to the Spanish generic words.

That's pretty horrible. Why don't they just use the letter schwa? :-)
-- 
Michael Everson * * Everson Typography *  * http://www.evertype.com





Re: Origin of the term i18n

2002-10-15 Thread Barry Caplan

At 10:25 PM 10/14/2002 -0700, you wrote:
Hmmph.  It was a mildly interesting question at first, and it wouldn't
have been too bad to see six or eight responses, but by my count we are
up to 52 messages in this thread.  (53, counting this one.)

The participants have either fallen into a religious debate over which
group or individual first came up with the idea -- as if that could ever
be proved conclusively -- or have started a fad of coining silly new

I don't see it as a religious debate, or even a debate at all - after all, the 
conclusion was for all intents and purposes on my web site already.

What is more interesting to me is an exploration of the history of 
internationalization, now that we have more or less settled when i18n was coined. The 
history goes through a period of hand-wringing about what to even call what we now 
know as internationalization and localization.

It wasn't always so clear cut - I made some calls to people I know who aren't in this 
community anymore but who were long ago, and who might provide some insight. I have an 
article, written at my request by the source in last week's article, covering some of 
the history - further back than we have covered in this thread. I intend to post it 
ASAP on i18n.com, except I had a server crash over the weekend. Hopefully that will be 
fixed in the morning and I can get the article to you. There is an interesting twist 
in the story about why, at that time and place, internationalization itself was not 
sufficient as Mark suggested, and it is persuasive to me.

Then I intend to raise the question, for those who were around longer than me, of just 
how far back the idea of internationalization actually goes and when the term was 
first used. To me, the two holy grails of computer science from day one have been good 
chess-playing programs and machine translation. So at least back into the mid-1950s 
there was a need for multilingual computing of some type.

I am sure there were a lot of roll-your-own techniques for a good long time. When did 
these techniques get a name at all, and what was the name and definition? Was it 
something other than internationalization? If so, how did it morph into what we know 
now? When did localization come into it?

These are important historical questions and I think wholly appropriate for this list.

You won't see *this* happen every day, but I'm in almost total agreement
with Mark Davis.  Some of these number-based abbreviations may be useful
at times, but for the most part they're like emoticons -- overuse them,
or cross the line inventing new ones, and they immediately become trite
and cutesy.

One of the signs of a mature specialty is a set of jargon and a set of inside humor. 
To me, l10n and i18n are the only ones we should use every day. I respectfully disagree 
about g11n. The rest may be overdoing it a bit, but I see the point if they express a 
concept of i18n/l10n as applied to a specific region or locale beyond the word spelled 
out itself. That is the power of jargon and branding both.

  It
has nil to do with Unicode.


My research over the last week indicates that the origins of Unicode are very 
definitely of the same era and from the same community of the people who brought the 
idea of internationalization to a critical mass, and coined the term i18n. One has not 
been separable from the other since at least 1989.


I can do all that, if it would help kill this thread.

Personally I would love to see it all end up being moved to i18n.com.

There has been a fair amount of off-list discussion going on, btw.

Barry Caplan
www.i18n.com





Re: Origin of the term i18n

2002-10-15 Thread Barry Caplan

At 12:37 AM 10/15/2002 -0700, Doug Ewell wrote:
Barry Caplan bcaplan at i18n dot com wrote:
What I am arguing against is going hog-wild making up new obscure
abbreviations from the same template, and
clogging the Unicode list with them.  Anything beyond i18n and l10n
is tantamount to the man with glasses smoking a cigar and drooling
type of smiley.


Well, some were used in jest by correspondents who often engage in wordplay on list 
and off list, truth be told.

But I pointed out that the scheme is a meme picking up steam, and not just in 
software. I didn't make up a12n, even though I hadn't seen it used before. I also 
didn't make up c17g or m17n. I provided evidence of my claims that this is spreading 
by pointers to the sites.

The only reason I did that is because someone (Mark I think, but I could be wrong) 
objected to the entire abbreviation scheme. The point is, it is not going away and it 
will probably be used more and more in different types of places.

It occurred to me the other day - I haven't had a chance to check this and maybe 
someone else will - that all 4-character domain names under the .com domain may be 
taken, which means there may be a lot more sites of the form xdx.com or xddx.com.

Barry Caplan
www.i18n.com






add a12n to the list...

2002-10-12 Thread Barry Caplan
http://lists.kabissa.org/mailman/listinfo/a12n-collaboration

wasn't there a Red Hot Chili Peppers song called c13n?

Barry





Re: Origin of the term i18n

2002-10-11 Thread Barry Caplan

At 11:11 AM 10/11/2002 -0700, Mark Davis wrote:
Sorry to appear the curmudgeon, but I've never seen any but a relatively few
people use this goofy form of abbreviation, and then for only a few of the
words on your web page. A search for "normalization" and "Unicode" yields
32,800 entries on Google. A search for "n11n" yields 3.


I have seen m17n come out of Japan, and I saw a similar term - the abbreviation 
algorithm misapplied in a totally unrelated context - at 
http://www.christadelphian.org/MEMBERS/index.htm:

Welcome to the inside of C17g.

that's Christadelphian.org shortened - there are 17 characters between the C and the g 
of the name... it saves a lot of typing
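The scheme both examples use is mechanical: keep the first and last characters and replace everything in between with a count of what was dropped. A throwaway sketch in Java (class and method names are my own, not from any standard library):

```java
public class Numeronym {
    // "internationalization" -> "i18n": first letter, interior count, last letter.
    static String numeronym(String word) {
        if (word.length() < 4) {
            return word; // too short to be worth abbreviating
        }
        return word.charAt(0) + String.valueOf(word.length() - 2)
                + word.charAt(word.length() - 1);
    }

    public static void main(String[] args) {
        System.out.println(numeronym("internationalization")); // i18n
        System.out.println(numeronym("localization"));         // l10n
        System.out.println(numeronym("globalization"));        // g11n
    }
}
```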

Not a trend.

Not a trend but a meme

Mark, I am curious why you find this term so distasteful? Is it the algorithm itself 
or just a general objection to acronyms and the like? Or something else entirely?

Barry Caplan
www.i18n.com






Re: [nelocsig] Re: Origin of the term i18n

2002-10-11 Thread Barry Caplan

At 02:49 PM 10/11/2002 -0400, Tex Texin wrote:
According to XenCraft, if the software industry were to exert its
ability to influence the English language thru its control of message
catalogs used in software thruout the world, numeronyms (n7ms) could
replace words completely by the year 2016 (this is the year not
numeronym).

The research analysts at i18n.com differ in their analysis. They assure me that the 
i18n.com developers can write an Apache module that would convert pages encoded with 
characters from the traditional single-byte encodings, such as the ISO-8859 series, to 
the new format in approximately 15 minutes. Any site that is on a server running 
Apache with mod_perl would then be automatically available in this format with no 
further intervention by the site's authors or owners.

Planned follow-on projects include forming a committee to specify precisely how the 
algorithm should apply to languages with more complex writing systems, creating a 
proxy server that browsers can use to convert pages from non-Apache servers, and 
adding support for various wireless browsers.

Once proper funding is secured for the crack i18n.com development team, the conversion 
(c9n) and obsoletion (o8n)  could literally be available overnight.

Barry Caplan
www.i18n.com





Re: Origin of the term i18n

2002-10-11 Thread Barry Caplan
At 12:20 PM 10/11/2002 -0700, Mark Davis wrote:
 Mark, I am curious why you find this term so distasteful? Is it the
algorithm itself or just a general objection to acronyms and the like? Or
something else entirely?

I find this particular way of forming abbreviations particularly ugly and
obscure. 

I think it is a meme that is catching on, and it serves various purposes more important 
than saving keystrokes:

- These are important words that describe entire fields of study in many specialties
- Many of them (internationalization, globalization, e.g.) are in the common 
vernacular, with vague denotations and possibly negative connotations in the general 
public
- As such the words are seriously overloaded and confusing
- Not only that, but they are spelled differently in various parts of the 
English-speaking world, which affects indexing
- They are long and hard to spell for non-native speakers (and probably most US native 
speakers too)
- They are tongue twisters for all, especially for some non-native English speakers
- The overloading of definitions, even within scholarly fields, is calling out for a 
separation and branding (do a search on localization and see how many branches of 
science you get)
- Long words really suck for design purposes. You would be limited to about 9-point 
type on your business card if anything other than your title included 
Internationalization

I am working on digging up some deeper history that might shed more light on how i18n 
was coined initially, so stay tuned.

As for Apple using internationalization internally by 1985, that would be consistent 
with other evidence of the age of that term wrt (oops, with respect to) computer 
software.

But let's not hold Apple up as a corporate bastion of clear terms. The entire 
public-facing corporate branding strategy since the 1984 release of the Mac has been 
to *not* use functional terms for products. This is just now beginning to change with 
iPhoto, etc. The strategy has always been anti-Microsoft in this regard, and Microsoft 
has always preferred generic terms wherever possible. So if Apple still does not use 
i18n in its docs then it is business as usual wrt contrariness to Microsoft's 
approach, but *not* business as usual wrt the rest of Apple's history. This is an 
interesting place for Apple to be (no pun intended).

Barry Caplan
www.i18n.com

PS - I just checked on developer.apple.com - it is indeed devoid of references to i18n 
save a couple of Java APIs, and totally devoid of l10n. This must be a long-term 
enforced policy as Mark hinted - I'd love to speak to whoever came up with it - that 
it could stick for at least 17 years given the changes at Apple is pretty remarkable 
in itself!






Re: Historians- what is origin of i18n, l10n, etc.?

2002-10-10 Thread Barry Caplan

There is a link with the story on the front page of www.i18n.com

Barry Caplan
Publisher, www.i18n.com

At 02:02 AM 10/10/2002 -0400, Tex Texin wrote:
I was asked about the origin of these acronyms. Does anyone know who
created these or where they were first used?
tex
-- 
-
Tex Texin   cell: +1 781 789 1898   mailto:[EMAIL PROTECTED]
Xen Master  http://www.i18nGuy.com
 
XenCrafthttp://www.XenCraft.com
Making e-Business Work Around the World
-





Re: Historians- what is origin of i18n, l10n, etc.?

2002-10-10 Thread Barry Caplan

At 08:35 AM 10/10/2002 -0700, Rick wrote:
The earliest reference I can find to i18n in my old e-mail trail is the  
following e-mail to the sun!unicode mail list by Glenn Wright. This was  
Oct 5, 1989. By that time, the term was definitely current, as Mr. Hiura  
suggests.

I registered i18n.com around '94 or so, and the fellow, whose name I am trying hard to 
recall (first name JR, Australian or British IIRC, red hair), seemed to indicate the 
coinage was quite some time before that, and he was very surprised when I told him how 
extensive the usage was by then.

I'm a johnny-come-lately when it comes to unix and other standards history... is there 
a searchable archive of windows standards anywhere? How about a cvs server of code? 
It seems to me that i18n or variants could have made it into code as a function name 
almost immediately, or possibly even before being put into a standards doc.

It seems to me that l10n was extant by the time I came to CA ~ 1992.

Perhaps Ken Lunde can shed some light - he surely came across a lot of early docs 
while writing his first book, which was a republication of an online archive he 
maintained, I think.

Barry





RE: Historians- what is origin of i18n, l10n, etc.?

2002-10-10 Thread Barry Caplan

How did you find these? I searched on i18n and sorted by date, and could not go past 
the 1000th or so record.

Barry

At 09:52 PM 10/10/2002 +0300, Tor Lillqvist wrote:
Well, the first occurence of i18n in Google's USENET archive seems
to be http://groups.google.com/groups?selm=5570339%40hpfcdc.HP.COM
from Nov 30, 1989.

l10n occurs first in
http://groups.google.com/groups?selm=1990Aug30.115608.3729%40tsa.co.uk
from Aug 30, 1990.

--tml





RE: Historians- what is origin of i18n, l10n, etc.?

2002-10-10 Thread Barry Caplan

At 06:35 PM 10/10/2002 +0200, Marco Cimarosti wrote:
Radovan Garabik wrote:
 Google is your friend :-)
 i18n is first mentioned in USENET on 30 nov 1989,


Here is a mention from 1989-12-02 11:24:11 PST, only 3 days later:

http://groups.google.com/groups?q=i18n+1988hl=enlr=ie=UTF-8selm=454%40longway.TIC.COMrnum=7


that says:

 5.  Messaging

  The UniForum internationalization (I18N) folks brought forward a
  proposal for a messaging facility to be included in P1003.1b.
  The working group decided that it needs some more work but will
  go into the next draft.

  [Editor's note -- The problem being solved here is that
  internationalized applications store all user-visible strings in
  external files, so that vendors and users can change the
  language of an application without recompiling it.  The UniForum
  I18N group is proposing a standard format for those files.]

(December 1989 Standards Update, IEEE 1003.1: System services interface)

This indicates to me that UniForum might be a place to look for earlier references.

This is a very interesting thread from 1990:

http://groups.google.com/groups?hl=enlr=ie=UTF-8threadm=1990Aug30.115608.3729%40tsa.co.ukrnum=20prev=/groups%3Fq%3Di18n%2B1988%26start%3D10%26hl%3Den%26lr%3D%26ie%3DUTF-8%26selm%3D1990Aug30.115608.3729%2540tsa.co.uk%26rnum%3D20





Re: Historians- what is origin of i18n, l10n, etc.?

2002-10-10 Thread Barry Caplan

At 07:34 PM 10/10/2002 -0400, Tex Texin wrote:

Mark Davis wrote:
 
 We used the term internationalization in Apple in late 85. We might have
 also used it earlier than that, I don't remember.
 
 W0e n3r u2d t1e g1d-a3l, g3y a1d o5e a10n i18n, h5r!

Mark,

Given the center of work in the i18n and l10n area that has emerged in Ireland (and 
other places) are you more partial to internationali1ation and locali1ation? :)

Barry
www.i18n.com






Re: Historians- what is origin of i18n, l10n, etc.?

2002-10-10 Thread Barry Caplan

At 07:34 PM 10/10/2002 -0400, Tex Texin wrote:
Mark,
that's good to know. I never worked with Apple and so have no Apple doc
in my collection.

However, the W0e below is a violation of the encoding and is a security
risk. I think the algorithm calls for the shortest string, so people
can't sneak in extra nulls- W0e W00e, etc.


That last one would be W0(2)e. The first is optionally W0(1)e. The (deprecated) part 
of the pattern was designed by the same folks who add ~20% bandwidth (I forget the 
exact number) to all MIME email in order to get it through 7-bit SMTP.
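For what it's worth, the overhead of the usual 7-bit armor is easy to measure: base64, the common MIME transfer encoding for binary data, emits 4 output characters for every 3 input bytes, i.e. about 33% before line breaks (quoted-printable costs less on mostly-ASCII text). A quick check (class name is mine):

```java
import java.util.Base64;

public class MimeOverhead {
    public static void main(String[] args) {
        byte[] payload = new byte[300];        // any 300 raw bytes
        String armored = Base64.getEncoder().encodeToString(payload);
        System.out.println(armored.length());  // 400: 4 chars per 3 bytes
    }
}
```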

Barry





What good is our jargon? was: Re: Historians- what is origin of i18n, l10n, etc.?

2002-10-10 Thread Barry Caplan

This is a fair question. Why is jargon useful? It serves to define a group and a 
concept. The best jargon is memorable, short in name, easy to write, catchy in sound 
to the ear, and universally able to be written. It helps a lot if the term is not 
already taken by another group.

i18n and l10n both meet all of these criteria, as do lan and yahoo! and google. 
In this respect, jargon can become a brand.

What is really interesting to me is that the criteria we hold as common lore about 
*why* the abbreviations were needed (too long to write and type, and too much of a 
tongue twister) apparently never occurred to other professions that also use 
internationalization and localization as key terms.

I think it is the ability to separate what we mean from what others mean that is an 
important value of the jargon. Especially since it is not always clear in context 
which is which, and also especially since globalization has extremely negative 
connotations in the popular collective mind.

Barry Caplan
www.i18n.com

At 05:12 PM 10/10/2002 -0700, Kenneth Whistler wrote:

 W0e n3r u2d t1e g1d-a3l, g3y a1d o5e a10n i18n, h5r!

What I don't understand, since these a10n's are in such
widespread use among programmers and character encoders,
is why they don't use h9l, as in i12n, lan, and gbn?

--K1n

BTW, these aan's are not only o5e, they are also o4e, but
unfortunately, not o6e in use.





[ot]Re: unsuscribe

2002-10-04 Thread Barry Caplan

I think I might put it on the list of things to do to patch all open source list 
management software so you have to triple opt-in: in addition to the usual, you have 
to confirm you read a message that contains nothing but unsubscribe instructions. 
Anyone wanna help? :)

Barry

At 06:51 AM 10/4/2002 -0700, you wrote:
Allow us to help you once more:

http://unicode.org/unicode/consortium/distlist.html#3

It contains the info on how to unsubscribe, and if you scroll down a bit it
gives information on what to do if you have problems unsubscribing.


MichKa





RE: The Currency Symbol of China

2002-10-01 Thread Barry Caplan

At 12:50 AM 10/1/2002 -0700, Ben Monroe wrote:
 For instance, IIRC, Isabella Bird wrote in her (British) English
travelogue in the early Meiji restoration era (1878 AD)
 of travels to Yedo (now commonly called Edo in the literature, and
known by its modern name to all as Tokyo). She called Tokyo Tokiyo.

Just a small correction. The Meiji Restoration was in 1867 (some
historians view it as 1868 though).

That's a timezone issue, right? :) Actually, the 1878 date I referred to is the date of 
the travels discussed in the book, not the date of the Meiji Restoration. The book 
itself, according to my copy from about 100 years later, was first published in 1880.

Barry Caplan
www.i18n.com






Re: The Currency Symbol of China

2002-09-30 Thread Barry Caplan

At 10:08 PM 9/30/2002 +0200, you wrote:
Yen is an ancient on pronunciation for U+5186; today it's pronounced
en.

Stefan

Really? I have no sources either way, but I always assumed "yen" was a Western 
transliteration of "en", since there is no "ye" entry in the kana table.

Barry Caplan
www.i18n.com





RE: The Currency Symbol of China

2002-09-30 Thread Barry Caplan

Wow! I brought Ben out of lurk status after 6 months!

Interesting post too - my limited understanding goes back only to the Heian era (~970-1100 
AD, OTTOMH), combined with various early Western transliterations into what we now 
call romaji, before Hepburn became semi-standardized.

For instance, IIRC, Isabella Bird wrote in her (British) English travelogue in the 
early Meiji restoration era (1878 AD)  of travels to Yedo (now commonly called Edo 
in the literature, and known by its modern name to all as Tokyo). She called Tokyo 
Tokiyo. 

It is these types of early Western writings from Japan where I have seen the "ye" 
used, but since they are also littered with plenty of other examples of weird 
transliterations, I just wrote it off to that.

I also think (but I could be wrong) that "ye" is not one of the characters in the 
famous Buddhist poem (the Iroha) that uses each of the kana once and only once, and 
establishes a de facto sorting order by virtue of being the only such poem.

OTOH, I am pretty sure that poem is either from or post-dates the Heian era, so it 
wouldn't rule out your point.

Barry Caplan
www.i18n.com

At 03:16 PM 9/30/2002 -0700, you wrote:
Barry Caplan wrote:

 To: Stefan Persson; [EMAIL PROTECTED]
 At 10:08 PM 9/30/2002 +0200, you wrote:
 Yen is an ancient on pronunciation for U+5186; today it's 
 pronounced en.

 Really? I have no sources either way, but I always assumed 
 yen was a Western transliteration of en, since there is 
 no ye entry in the kana table.

Modern Japanese has 5 basic vowels, /a, i, u, e, o/.
Old Japanese most likely had 8 vowels, /a, i1, i2, u, e1, e2, o1, o2/.
These can further be traced to a proto-Japanese 4-vowel system /a, i, u,
o/.
In the y-line, there is currently /ya, yu, yo/. During the Nara period
where the first extant literature appears, there is evidence that the
man'yougana (precursor to modern kana; Chinese characters) regularly
distinguished between two types of /e/ (called Kou/Otu or A/B sounds,
among others). This is usually taken by most scholars as /e/ and /ye/.
By the early Heian period, with the emergence of the kana syllabary,
this Kou/Otu distinction vanished, specifically the /e/ and /ye/
distinction by around 938 AD. It is usually assumed that the /e/ and
/ye/ (which is written with /e/) merged into [ye] (or [je], if you
like). Notice that the Portuguese dictionary of 1603 spells this /e/ as
ye. Other documents indicate that this /e/ [ye] must have become [e]
(as modern) by 1775 or earlier. Also note that some dialects in Kyushu
still retain the [ye] pronunciation for /e/.

I do not really have the time to go into more details right now.
I hope this will suffice.

Ben Monroe





Re: Keys. (derives from Re: Sequences of combining characters.)

2002-09-28 Thread Barry Caplan
 U+003C in a way that makes using U+003C with the
meaning LESS-THAN SIGN in body text intermixed with markup sections awkward.
That feature of XML may not matter for situations involving encoding simply
literary works, yet for a comprehensive system which can include the U+003C
character with the meaning LESS-THAN SIGN in body text and in markup
parameters, it does not suit my need.


You may be under the mistaken impression that any but the tiniest amount of raw XML is 
ever edited by hand. If you think your message creators are going to author their files 
directly in the markup language, whether XML or comet circumflex, well, that just won't 
happen often in practice. A UI which handles, well, the user interface, will be 
needed, making the choice of markup language moot until it comes to what other systems 
can accept.

It is not a fact that my proposed markup convention, as you call it, is not
a good idea.  It may be your opinion and it might perhaps be the opinion of
some other people.  Yet my proposed markup convention, as you call it, is
entirely within the rules, for keys generally, as in my original post, and
for my comet circumflex key in particular.

No one is saying it is not valid Unicode. From a market acceptance point of view, 
you have seen a consensus that there are a lot of reasons why it probably is not a 
good idea, coming from people I know to have an enormous amount of experience in these 
specific matters upon which to draw such conclusions.

I for one would be interested if you could come up with some others whose opinion 
supports your own, although perhaps off list is the place for that.

Why should the discussion be taken elsewhere?  It is about the application
of Unicode to markup and of one particular application to language
translation in a manner where Unicode could be widely used, as the comet
circumflex system could be used with all of the languages which Unicode
supports.

Well, the moderator keeps letting it go on... if not, I am willing to carry it ad 
infinitum on i18n.com - just click on Submit Story on any page.

Actually, I was rather hoping that, with your specific interest in languages, 
you would have wished to have a try at using the comet circumflex 
system, as one of the features of the comet circumflex system is that it 
could be used with minority languages as easily as with the major languages 
of the world.

If I may speak for Peter, I think he would be willing to consider it were it XML 
based. 

However, I offer the caveat that you may be in for some rude surprises when you find 
out how hard it is to actually translate beyond the simplest sentences (and sometimes 
even those) when you parameterize them as you propose. I have been of the opinion for 
several years that, as far as localization goes, it is better just to take out the 
parameters and list all the possibilities.

Of course in the general case that may not be possible, but then you are in the realm 
of machine translation, which already exists for better or worse. So in your case, you 
may also need to make a case that your solution is more useful than just listing 
non-parameterized sentences, yet more likely to provide a useful translation than 
existing machine translation systems. Based on the example sentences about the weather 
in (London, Berlin, Tokyo) etc. from your original post, I would say that is a very 
open question.

Barry Caplan
www.i18n.com





Re: Keys. (derives from Re: Sequences of combining characters.)

2002-09-28 Thread Barry Caplan
 otherwise, I can write a 5 
line perl program to run on a spare machine that will create prior art of every 
possible combination of characters.. I can let it run forever and hook it to a web 
server to make it visible too.

An added bonus of using the comet circumflex key is that documents
containing comet circumflex codes do not necessarily need to contain any
characters from the Latin alphabet.

Why is this a bonus, let alone an added one? I have a 4 year old niece just learning 
the Latin alphabet, and as far as I can tell it hasn't changed since I learned it. 
There is no U+003C character in that alphabet.

In fact, the bonus of using 3C as a delimiter (along with other XML delimiters) is 
that they are in every legacy encoding, meaning if no Unicode tools are available for 
editing, a regular text editor can be used and the conversion to Unicode can happen 
later.

Your method requires Unicode support and fonts (not the same thing) at the editing 
stage, which is not realistic unless you want to limit your community to a few of your 
closest friends so to speak.

No one is suggesting such a system can't be built, only that its usefulness would be 
strongly limited for a lot of very good reasons. As others have noted, I concur that 
this is not really a Unicode issue per se, but a software design issue.

Barry Caplan





Re: Keys. (derives from Re: Sequences of combining characters.)

2002-09-27 Thread Barry Caplan

At 04:26 PM 9/27/2002 +0100, William Overington wrote:
I had not heard the description Message catalog previously, so I can
search for that too.

I have previously searched under telegraphic code and language and
translation.

An email correspondent drew my attention to the following list of numbered
I have not yet found any example oriented to language translation.  

Key Unix libraries have used message catalogs as part of the API since time 
immemorial. Hence any Unix application with even a whiff of a chance of being 
internationalized is likely to have used those functions.


I have
not yet found any example oriented to carrying on a complete conversation.

I would look for the earliest references to machine translation in the 1940s and '50s, 
up to the work with Eliza at MIT in the '60s. I think there is an enormous project 
whose name I don't recall right now going on in Texas, perhaps Austin, which is 
spiritually derived from Eliza and focused on sending whole, previously composed 
sentences back conversational-style.

If you want to find the whole of the literature in this area, I suggest searching 
Turing Test.


A proprietary coding system is a bad idea.

Well, it depends what one is trying to do.  If one wishes to establish a
system whereby proprietary intellectual property rights exist, then a
proprietary coding can be a good idea.  Various large companies use
proprietary coding systems for files used with their software packages.  If,
however, one is trying to establish an open system, then you might well be
right.

Or if you want to minimize the amount of reinventing the wheel you do internally. You 
can easily use a proprietary format outside and XML inside, just as you can use SJIS 
outside and Unicode for internal processing.


Failure to investigate the state of the art, (especially where google is
so effortless), means this idea is not pushing any envelope.

Well, if you have any specific suggestions of what keywords to use in a
search, that would be very helpful.


I have given you some. Rather than focusing on pseudo-scientific terms like 
radiogram, I suggest starting with a familiarity with the history of computer 
science, both pure and applied research.


The keys idea is pushing the envelope.  


No it is not. 

As spin off from this discussion,
maybe the XML people, and the Unicode Technical Committee, will do something
about having special characters for the XML tags rather than using U+003C
and thereby help people wanting to place mathematics and software listings
in the same file as markup.  Is using U+003C a legacy from ASCII days?

Why is it not possible to use "<" signs in XML? 


Most of my postings in this thread are in response to people asking me
specific questions and raising interesting points.  That is surely why a
discussion group exists.

But most of the answers you get are based on a shared technical and educational 
background which you don't have and/or don't seem to value. It is difficult to describe, 
but a lot of early computer science research was about how to effectively decompose 
functionality and data. Sadly, I think a lot of this is being lost. For a more 
technical starting point, look for the works of Edsger Dijkstra starting in the 1960s. 
For a less technical point of view, look for The Mythical Man-Month from the mid-1970s 
(recently updated), and its spiritual followups by Ed Yourdon and Tom DeMarco. 

When I read the responses you get, I have the feeling that the authors have 
internalized the lessons of these important texts (even if they may not have studied 
them explicitly). Once you internalize those lessons too, you will have a better 
understanding of the points of view you keep running up against with such friction.


I am hoping that I can publish some web pages with some comet circumflex
codes and sentences about asking about the weather conditions and
temperatures at the message recipient's location, together with codes and
sentences for making replies, so that hopefully people who might be
interested in some concept-proving experiments can have a go at
some fascinating experiments with this technology.  Unicode can be used to
encode many languages and it will be interesting to find out what can be
achieved using the comet circumflex system.

That might be an interesting web site in its own right, but the technology is nothing 
special and has been done a million times under a million names, and ten million times 
with no name at all.

Barry Caplan
Publisher, www.i18n.com





Re: glyph selection for Unicode in browsers

2002-09-26 Thread Barry Caplan

At 02:59 PM 9/26/2002 -0400, Tex Texin wrote:
Shouldn't that be something more like: pan-script Unicode-based font?


or p8e font? :)

Barry Caplan
www.i18n.com





Re: no replies

2002-09-25 Thread Barry Caplan

Roslyn,

I will head off trouble for you, because your message is likely to be otherwise 
ignored or semi-flamed.

The best place to get information on compiling and configuring PHP is on a PHP support 
or developer list. There must be information on how to subscribe to such lists on the 
PHP home page at php.net.

Another great source to find answers, which I use at least 10 times a day with a 90%+ 
success rate, is to search on related keywords on google.com and groups.google.com. 
OTTOMH, in your case I would try searching "php enable-mbstring" in those places and 
see what you find.

This list is for questions related to Unicode. That is probably why no one has replied 
previously. Few if any people here are PHP developers, and even fewer are going to be 
versed in the details of configuring and compiling PHP.

Hope this helps!

Barry Caplan
www.i18n.com


At 04:35 AM 9/24/2002 -0700, you wrote:

aaah finally, one reply to that question!! thankyou BOB. anyways, could anyone tell 
me how i can recompile php to include mbstring support. i used the ./configure 
enable-mbstring option,did the make install..etc etc, but i still can seem to run any 
of the mbstring functions in my php code, i get fatal error: call to undefined 
function mb_(whatever)...could anyone pls assist me here. thanks 

regards, 

roslyn 





Re: about starting off

2002-09-19 Thread Barry Caplan

Roslyn,

I am working on a postgres database too - I haven't yet gotten to extensively testing 
the unicode aspects, but be sure to set the character set of the database to unicode 
when you create it. Otherwise all is probably lost - I don't know that you can simply 
change the char set later, and if you have to dump and import the data, you'd have to 
do some sort of conversions. Why bother making extra work for yourself?


As for the code in PHP (I am using Perl myself, and something similar applies): every 
time you manipulate text (every time!), get used to asking yourself whether you (or PHP) 
are making any assumptions that one byte is the same as one character. The answer needs 
to be no, but will often be yes. Reconciling these issues is the bulk of making Unicode 
work for you.
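[Editorial aside: the one-byte-equals-one-character trap is easy to demonstrate. A quick sketch in Python, used here just for illustration - the same point applies to the PHP and Perl code under discussion:]

```python
# "Nihongo" (the Japanese language): three characters, but more than three bytes.
text = "日本語"

print(len(text))                      # 3 characters
print(len(text.encode("utf-8")))      # 9 bytes in UTF-8
print(len(text.encode("shift_jis")))  # 6 bytes in Shift JIS

# Any code that equates byte length with character length breaks here.
assert len(text) != len(text.encode("utf-8"))
```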

Barry Caplan
Publisher, www.i18n.com


On Thu, 19 Sep 2002, roslyn jose wrote:


 hi,

 im new to unicode, and am working on a project in php/postgresql. i need
 some info on how to start off with unicode. i went thro the web site and
 only saw explanations on what it is, its char set,etc. do i need to
 download or install anything to work with unicode, pls let me know soon.
 and also once downloaded do i need to import any classes or files when
 working with it, as im scripting in php and html. thanx

 regards,

 roslyn






Re: Why w and y are not vowels? [Was: Re: Latin vowels?]

2002-09-09 Thread Barry Caplan

At 04:37 PM 9/9/2002 -0400, John Cowan wrote:
 [Da][n] [Ko][ga][i], 5 Japanese Syllables, 3 English Syllables

5 moras, 3 syllables, actually.

A new vocabulary word for me, so I looked it up...

mo·ra 
n. pl. mo·rae  or mo·ras 
The minimal unit of metrical time in quantitative verse, equal to the short syllable.

How does this apply unless I write something like:

 I think that I shall never see
a Kogai lovely as a tree 

Mora sounds like jargon for a more specialized situation, unless I am missing 
something ...


Barry Caplan
http://www.i18n.com





Forwarded question....

2002-08-29 Thread Barry Caplan

Hi Unicoders...


I received this question and I didn't have a good answer ...perhaps someone else here 
can help?

I have a Japanese text file in Shift JIS and I need
to convert it to escaped Unicode. 

Does anyone know of any tools or utilities that can do this?

The standard character encoding sets available in
text editing tools like Hidemaru don't appear to do this.

Any suggestions would be helpful.

Thank you.

By escaped Unicode, she means \u format.
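[Editorial aside: the conversion itself is mechanical once the Shift JIS decoding is handled. A sketch in Python - an anachronistic illustration added to this archive, with hypothetical file names:]

```python
def sjis_to_escaped_unicode(in_path: str, out_path: str) -> None:
    """Read a Shift JIS file and write it out with non-ASCII as \\uXXXX escapes."""
    with open(in_path, encoding="shift_jis") as f:
        text = f.read()
    # 'unicode_escape' yields \uXXXX for characters above U+00FF
    # (and \xXX for the Latin-1 range).
    escaped = text.encode("unicode_escape").decode("ascii")
    with open(out_path, "w", encoding="ascii") as f:
        f.write(escaped)

# The escaping step on its own:
print("日本語".encode("unicode_escape").decode("ascii"))  # \u65e5\u672c\u8a9e
```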

Barry Caplan
http://www.i18n.com





Re: Revised proposal for Missing character glyph

2002-08-26 Thread Barry Caplan

At 09:49 PM 8/26/2002 -0400, John Cowan wrote:
Nowadays, experts can detect mismatched character sets from the
nature of the byte barf that appears on their screen.

And super-experts can read languages in byte barf as it is not random!

Barry Caplan
http://www.i18n.com





Re: An idea for keeping U+FFFC usable. (spins off from Re: Furigana)

2002-08-16 Thread Barry Caplan



Yes, yes, I think this is an idea which could fly.

--Ken


Good.  It is a solution which could be very useful for people writing
programs in Java, Pascal and C and so on which programs take in plain text
files and process them for such purposes as producing a desktop publishing
package.


Uhh, I think Ken's message was entirely sarcasm, or some higher form of rhetorical 
humor whose obscure name slips my mind right now.

The suggestion to use html as an extension was the giveaway - I was laughing out 
loud from that point on. His point was that the technology to do what you want 
already exists: it is called HTML, and it is displayed by browsers and so forth.

Barry Caplan
www.i18n.com





OT Laugh for the day - I liked the title of this security related article

2002-08-09 Thread Barry Caplan

and the first few sentences as well


Barry Caplan
www.i18n.com

http://www.securitymanagement.com/library/000599.html


How to Keep Out Bad Characters

By DeQuendre Neeley

The business world is one of constant motion. But it is not just people who are on the 
move. It is also information. Businesses today depend on the efficient exchange of 
information, for which they rely increasingly on the Internet and other computer 
networks. Unfortunately, in the digital world, as in its physical counterpart, bad 
characters will sometimes try to slip in with the good. 





Re: The standard disclaimer

2002-07-25 Thread Barry Caplan

At 10:08 PM 7/24/2002 -0700, Doug Ewell wrote:
Tex Texin tex at i18nguy dot com wrote:

 Hall?
 Check?
 Re- ?
 Water?

No, too late.  John Hudson already won this round, for finding a way to
bring it back on topic.  (Turns to John and bows, Pat Morita style.)
Congratulations, master.


And for that we give him high - 

Barry Caplan
www.i18n.com





Re: Unicode certification - was RE: Dublin Conference:

2002-07-25 Thread Barry Caplan

At 08:07 AM 7/25/2002 -0700, David Possin wrote:
After that we can add the chocolate sauce, the cherry, and the
sprinkles of Unicode. The special Unicode compliance tests are harder
to define and to perform, I agree. But in most cases these issues
haven't even been implemented yet.


But isn't the reason someone would want to quantify compliance precisely to find 
out what is implemented and what is not?

Barry Caplan
www.i18n.com





Re: Abstract character?

2002-07-22 Thread Barry Caplan

I usually define an abstract character, in talks I give, as an element of a writing 
system that you care about, independent of glyphs, and certainly independent of 
encodings or specific code points. 

If it could be described more precisely than that, it wouldn't be abstract, would 
it? :)

This is usually brought up in a series of definitions leading from character (what 
we are referring to here as abstract character), and then:

- character list - a list of characters one is interested in
- character set - a list of character lists, which may or may not be ordered, but 
still has no codepoints
- encoding scheme - an algorithm for assigning code points to a character set
- code point - the representation of an abstract character in an encoding scheme
- font - a series of glyphs used to display the characters represented by 
code points, in their immediate context

All of this is filled with examples - building to an explanation of Unicode. For 
example, wrt abstract character, I ask the audience to ponder if upper case A and 
lower case a, are the same abstract character. Also, I ask them to ponder if 
lower case a displayed in Helvetica is the same character as lower case a in  
Times Roman. Finally, how about  lower case a in 9 point Helvetica and lower case 
a in 18 point Helvetica?

And apropos a thread from last week, Unicode introduces new concepts such as 
character properties, which means the anticipation and intrigue I spend time building 
in the audience - that there is a neat solution to the historical morass I just spent 40 
minutes describing - gets thoroughly dashed! Joy!

Implicit in this set of definitions is, of course, that a character may or may not be 
of interest to all character lists, and therefore may or may not end up represented 
in more than one encoding. Also note that even when it does end up in more than one, 
this model in no way implies a round-trip capability.
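[Editorial aside: that one abstract character can land at different code points in different encodings is easy to show concretely. A sketch in Python, added as an illustration and not part of the talk being described:]

```python
# LATIN CAPITAL LETTER A - one abstract character, two very different code points:
print(hex(ord("A")))        # 0x41 - its Unicode (and ASCII) code point
print("A".encode("cp500"))  # b'\xc1' - EBCDIC (cp500) assigns 0xC1 instead

# The round trip happens to work here, but the model guarantees nothing in general:
assert "A".encode("cp500").decode("cp500") == "A"
```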

This leads nicely into a discussion about some very important aspects of 
internationalizing code and working with 3rd party components.

Barry Caplan
www.i18n.com

At 01:38 PM 7/22/2002 -0700, Kenneth Whistler wrote:
Lars Marius Garshol asked:

 I'm trying to find out what an abstract character is. I've been
 looking at chapter 3 of Unicode 3.0, without really achieving
 enlightenment. 
 
 The term Unicode scalar value (apparently synonymous with code point)
 seems clear. It is the identifying number assigned to assigned
 Unicode characters.

Here is one of my attempts at a more rigorous term rectification:

Abstract character

   that which is encoded; an element of the repertoire (existing
   independent of the character encoding standard, and often
   identifiable in other character encoding standards, as well
   as the Unicode Standard); the implicit basis of transcodings.

   Note that while in some sense abstract characters exist a
   priori by virtue of the nature of the units of various writing
   systems, their exact nature is only pinned down at the point
   that an actual encoding is done. They are not always obvious,
   and many new abstract characters may arise as the result of
   particular textual processing needs that can be addressed by
   characters. (E.g. WORD JOINER, OBJECT REPLACEMENT CHARACTER,
   etc., etc.)

Code point

   A number from 0..10FFFF; a point in the codespace 0..10FFFF.

Encoded character

   An *association* of an abstract character with a code point.

Unicode scalar value

   A number from 0..D7FF, E000..10FFFF; the domain of the
   functions which define UTFs. The Unicode scalar value
   definitionally excludes D800..DFFF, which are only code unit
   values used in UTF-16, and which are not code points associated
   with any well-formed UTF code unit sequences.

Assignment (of code points)

   Refers to the process of associating abstract character with
   code points. Mathematically a code point is
   assigned to an abstract character and an abstract
   character is mapped to a code point.

   This is distinguished from the vaguer sense of assigned
   in general parlance as meaning a code point given some
   designated function by the standard, which would include
   noncharacters and surrogates.

 
 So far, so good. Some questions:
 
  - are all assigned Unicode characters also abstract characters?

Yes. Or rather: all encoded characters are assigned to abstract
characters.

(See above for my distinction between assigned and
designated, which would apply to noncharacters and surrogate
code points -- neither of which classes of code points get
assigned to abstract characters.)

 
  - it seems that not all abstract characters have code points (since
abstract characters can be formed using combining characters). Is
that correct?

Yes. (Note above -- abstract characters are also a concept which
applies to other character encodings besides the Unicode Standard,
and not all encoded characters in other character encodings automatically
make it into the Unicode Standard

RE: Inappropriate Proposals FAQ

2002-07-12 Thread Barry Caplan

At 01:27 PM 7/11/2002 -0400, Suzanne M. Topping wrote:
Unicode is a character set. Period. 


Well, maybe. But in a much broader sense than the character sets it subsumes in its 
listings. Each character has numerous properties in Unicode, whereas they generally 
don't in legacy character sets.

Maybe Unicode is more of a shared set of rules that apply to low-level data structures 
surrounding text and its algorithms than a character set.

The Unicode consortium very wisely keeps it's focus narrow. It provides
a mechanism for specifying characters. Not for manipulating them, not
for describing them, not for making them twinkle.

All true, except for some special cases (BOM, bidi issues and algorithms, vertical 
variants, etc.). Not saying those shouldn't be in there, just that they are useful only 
in the use of algorithms that are explicit (bidi) or assumed (upper case/lower case, 
vertical/horizontal), etc.

In many cases, these algorithms are not well known, even amongst the cognoscenti, or 
generally available in nice libraries. Anyone for an open source Japanese word-splitting 
library? (I know that not taking a look at ICU before I press send is going to 
come back to haunt me on this, but if it is in there, then substitute something that 
isn't. :)

Barry Caplan
www.i18n.com





RE: Saying characters out loud (derives from hash, pound,octothor pe?)

2002-07-12 Thread Barry Caplan

At 09:43 AM 7/12/2002 -0400, Suzanne M. Topping wrote:

 -Original Message-
 From: David Possin [mailto:[EMAIL PROTECTED]]
 
 so now we have a chromatic audio attribute for each character?

Don't be ridiculous. Sounds don't have chroma. 

There will however be a need for tone and accent variation so that
proper localization can be executed. 

;^P

I have been dreaming of the idea of synaesthetic applications for years but haven't 
come up with a way to do it yet. But sounds absolutely will need chroma, that much I 
know. And when you say it with feeling, the fonts will literally be perceived as 
feeling.

Such an application better not be written for Windows, because the blue screen of 
death will be felt rather than seen :)

Barry Caplan
www.i18n.com





Hmm, this evolved into an editorial when I wasn't looking :) was: RE: Inappropriate Proposals FAQ

2002-07-12 Thread Barry Caplan

At 05:13 PM 7/12/2002 -0400, Suzanne M. Topping wrote:
 -Original Message-
 From: Barry Caplan [mailto:[EMAIL PROTECTED]]
 
 At 01:27 PM 7/11/2002 -0400, Suzanne M. Topping wrote:
 Unicode is a character set. Period. 
 
 Each character has numerous 
 properties in Unicode, whereas they generally don't in legacy 
 character sets.

Each character, or some characters?


For all intents and purposes, each character. Section 4.5 of my Unicode 3.0 book says 
"The Unicode Character Database on the CDROM defines a General Category for all 
Unicode characters".

So, each character has at least one attribute. One could easily say that each 
character also has an attribute for isUpperCase of either true or false, and so on.

There are no corresponding features in other character sets usually.
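[Editorial aside: those per-character properties are directly queryable today. A quick sketch using Python's standard library - an anachronism relative to this 2002 thread, but it shows the General Category and case attributes in action:]

```python
import unicodedata

# Every assigned character carries a General Category...
print(unicodedata.category("A"))  # Lu - Letter, uppercase
print(unicodedata.category("a"))  # Ll - Letter, lowercase
print(unicodedata.category("3"))  # Nd - Number, decimal digit

# ...from which boolean attributes like isUpperCase fall out:
assert "A".isupper() and not "a".isupper()
```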


 Maybe Unicode is more of a shared set of rules that apply to 
 low level data structures surrounding text and its algorithms 
 then a character set.

Sounds like the start of a philosophical debate. 

Not really. I have been giving presentations for years, and I have seen many others 
give similar presentations. A common definition of character set is a list of 
characters you are interested in, assigned to codepoints. That fits most legacy 
character sets pretty well, but Unicode is sooo much more than that.



If Unicode is described as a set of rules, we'll be in a world of hurt.


Yeah, one of the heaviest books I own is Unicode 3.0. I keep it on a low shelf so the 
book of rules describing Unicode doesn't fall on me, for just that reason. This is 
earthquake country, after all. :)


I choose to look at this stuff as the exceptions that make the rule.


I don't really know if it would be possible to break down Unicode into more fundamental 
units if you started over. Its complexity is inherent in the nature of the task. My 
own interest is more in getting things done with data and algorithms that use the type 
of material represented by the Unicode standard, more so than in the arcana of the 
standard itself. So it doesn't bother me so much that there are exceptions - as long 
as we have the exceptions that everyone agrees on, that is fine by me, because it means 
my data and at least some of my algorithms are likely to be preservable across systems.


(On a serious note, these exceptions are exactly what make writing some
sort of is and isn't FAQ pretty darned hard. 

humor
Be careful what you ask for :)
/humor

I can't very well say
that Unicode manipulates characters given certain historical/legacy
conditions and under duress. 

Why not? It is true.

But what if we took a look at it from a different point of view: that the standard is 
an agreed-upon set of rules and building blocks for text-oriented algorithms? Would 
people start to publish algorithms that extend the base data provided, so we don't 
have to reinvent wheels all the time?

I'm just brainstorming here, this is all just coming to me now. 

If I were to stand in front of a college comp sci class, where the future is all ahead 
of the students, what proportion of time would I want to invest in how much they knew 
about legacy encodings versus how much I could inspire them to build from and extend 
what Unicode provides them?

Seriously, most of the folks on this list that I know personally, and I include myself 
in this category, are approaching or past the halfway point in our careers. What would 
we want the folks who are just starting their careers now to know about Unicode and do 
with it by the time they reach the end of theirs, long after we have stopped working?

For many applications, people are not going to specialize in i18n/l10n issues. They 
need to know what the appropriate text-based building blocks are, and how they can 
expand on them while still building whatever they are working on.

Unicode at least hints at this with the bidi algorothm. Moving forward should other 
algorithms be codified into Unicode, or as separate standards or defacto standards? I 
am thinking of Japanese word splitting algorithm. There are proprietary products 
that do this today with reasonable but not perfect results. Are they good enough that 
the rules can be encoded into a standard? If so, then someone would build an open 
implementation, and then there would always be this building block available for 
people to use.
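A dictionary-driven longest-match pass is the simplest shape such a codified algorithm could take. The sketch below is a toy illustration only — the dictionary and the sentence are invented for the example, and real Japanese word splitters combine large lexicons with statistical models:

```python
# Toy longest-match segmenter over a tiny, invented dictionary.
DICT = {"東京", "都", "に", "住む"}

def segment(text):
    words, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest candidate first
            if text[i:j] in DICT:
                words.append(text[i:j])
                i = j
                break
        else:
            words.append(text[i])  # unknown character: emit as a single token
            i += 1
    return words

print(segment("東京都に住む"))  # ['東京', '都', 'に', '住む']
```

Greedy longest-match gets simple sentences right but fails on genuinely ambiguous splits, which is exactly why a standardized rule set would be valuable.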

I am sure everyone on this list can think of their own favorite algorithms of this 
type, based on the part of Unicode that interests you the most. My point is that the 
raw information already in Unicode *does* suggest the next level of usage, and the 
repeated newbie questions that inspired this thread suggest the need for a 
comprehensive solution at a higher level than a character set provides. Maybe part of 
this means including, or at least facilitating, the description of low-level 
text-handling algorithms.

If I did, people would be scurrying around
trying to figure out how to foment the duress.)


The accomplishments of the Unicode

Re: What Unicode Is (was RE: Inappropriate Proposals FAQ)

2002-07-12 Thread Barry Caplan

At 03:54 PM 7/12/2002 -0700, Kenneth Whistler wrote:
Suzanne responded:

  Maybe Unicode is more of a shared set of rules that apply to 
  low-level data structures surrounding text and its algorithms 
  than a character set.

O.k., so now before asserting or denying that Unicode ... is
a shared set of rules, it would be helpful to pin down
first what you are referring to. That might make the ensuing
debate more fruitful.
Actually, it was I, not Suzanne, who called Unicode a "shared set of rules." As 
Ferris Bueller once said, "I'll take the heat for this." 

I was aware of all of the uses of Unicode that you listed. I have no quarrels with any 
of them. They do point to the fact that the word is overloaded with definitions, which 
means that readers have to choose the appropriate one from the context. The context of 
the statement above is that the "Unicode" referred to is the Standard, and all 
associated documentation. Not Unicode the Consortium, which manages the Standard. Not 
Unicode the way of life :)

I did intend to throw open a debate about the long-term future of Unicode the Standard 
and, by extension, Unicode the Consortium. Since Suzanne is writing a "What Unicode is 
and is not" FAQ, I think the answer to that is going to be very definitely colored 
by the answer to the related question "What will Unicode become?", e.g. Unicode 6.0, 
7.0, 8.0, etc. 

See my previous msg, subject line "Hmm, this evolved into an editorial when I wasn't 
looking :)", for some thoughts on that subject.


Barry Caplan
www.i18n.com





Re: Q: Filesystem Encoding

2002-07-10 Thread Barry Caplan

At 08:43 AM 7/10/2002 -0400, Jungshik Shin wrote:
 In short: should I still stick to ASCII alone in filenames, or are there
 filesystems where I really don't have to anymore? Thanks in advance.

  Definitely/unconditionally no for NTFS. As for Linux ext2(and most other
Unix fs'), unless you mix up UTF-8 and legacy encodings (which you
wouldn't because you have never used non-ASCII), it's all right to switch
to UTF-8 and use non-ASCII chars.

But be aware that such filenames may or may not be able to be transferred *across* 
file systems.
Not only that, but, although I haven't tested in detail for a while, I would not be 
fully comfortable with middleware that is responsible for managing file names across 
systems either, such as FTP, email attachments,  and Samba. Particularly in the case 
of FTP and email, just because one client works does not mean another one will.
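As a quick sanity check on a single system (not across systems, which is where the real trouble lives), one can round-trip a non-ASCII name through the local filesystem. This is only a sketch with an arbitrary example name; the normalization step matters because some filesystems (HFS+, for example) store names in a decomposed form:

```python
import os
import tempfile
import unicodedata

name = "résumé-日本語.txt"  # arbitrary non-ASCII example name
d = tempfile.mkdtemp()
with open(os.path.join(d, name), "w", encoding="utf-8") as f:
    f.write("hello")

listed = os.listdir(d)[0]
# Compare under NFC so a filesystem that stores NFD still counts as a match.
same = unicodedata.normalize("NFC", listed) == unicodedata.normalize("NFC", name)
```

A name that survives this locally can still be mangled by FTP clients, mail gateways, or Samba in transit, which is the point above.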


Also keep in mind that even if the file name transfers exactly correctly, there is no 
guarantee, except for ASCII characters, that the receiving system will have fonts to 
display the file name.

Barry Caplan
www.i18n.com





Re: Saying characters out loud (derives from hash, pound, octothorpe?)

2002-07-08 Thread Barry Caplan

At 11:37 AM 7/5/2002 +0100, Michael Everson wrote:
Also, how does one say the U+007E character out loud while reading out the
address of a web page?

Tilde. Get real, William.


U+FF5E is colloquially known as a "wave" in Japanese, IIRC, and hence U+007E is a 
"small wave" or "half-width wave."
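The formal character names back this up; a quick check with Python's unicodedata module:

```python
import unicodedata

# The two tilde-like code points mentioned above, by formal name.
print(unicodedata.name("\u007E"))  # TILDE
print(unicodedata.name("\uFF5E"))  # FULLWIDTH TILDE
```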

Barry Caplan
www.i18n.com





Re: Inappropriate Proposals FAQ

2002-07-02 Thread Barry Caplan

At 10:01 AM 7/2/2002 -0400, Suzanne M. Topping wrote:
I have a few ideas for fictional proposals to use as examples (my room
layout idea, and Mark's 3-D Mr. Potato Head representation), but I could
use another one or two if anyone feels creative. The closer to being
believable, the better, I suppose. (An alternative would be to use
real-life proposals, and state why they were not accepted, but I thought
it more politic to keep it fictional...)


There was a discussion last year about a symbol to represent pi/2 or pi/4 or something 
like that. If you want to fictionalize that to some other fraction of a mathematical 
constant, that might work (e/2 perhaps?)

Barry Caplan
www.i18n.com





Re: Creative IDN Opportunities

2002-06-20 Thread Barry Caplan

I think it is somehow tied into the whole ICANN political mess. I haven't sorted it 
out yet but I am interested if anyone else has...


Barry Caplan
www.i18n.com

At 02:13 PM 6/20/2002 -0400, Suzanne M. Topping wrote:
Couldn't help but cringe at the last line of this press release.

Can anyone give me a quick update on the status of IDN standards work?
It's been a while since I checked it out...





Re: Support for Japanese characters

2002-03-08 Thread Barry Caplan

At 12:21 PM 3/8/2002 -0600, Eric Ray wrote:
Need help
please. 

Problem: 

1. Current library built for unix and supports ASCII
characters only. 

2. This library must now accept wide characters from
Japanese client.
You need to double-byte-enable the library except for the most trivial
uses. Doing so is not trivial.


Facts:
--
1. The library does not really evaluate the Japanese
characters to make logical decisions. 
If the data just passes through, that might be relatively
trivial.

We believe base64
encoding the character array avoids any bad things happening in the
code (such as hitting a null value or other values that could
potentially cause problems).
Is the (non-Japanese) data already base64-encoded? If so, why? Why
create trouble handling that just to avoid checking for null values?
Anyway, if you really aren't going to process the Japanese characters in
this library except to pass them through, then you need to take the Japanese
text, base64-encode it, and then pass it to the library the usual way.
Then retrieve it the usual way, base64-decode it, and voilà!
Of course this may just move your questions to other parts of your
program, but you haven't asked about those places. Without knowing what
the application or configuration is, beyond "unix,"
it is hard to say more.
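A minimal sketch of that pass-through idea, where `legacy_store` is a hypothetical stand-in for the ASCII-only library: base64 output is pure ASCII with no NUL bytes, so the wrapped text survives byte-oriented handling. Encoding to UTF-8 first also sidesteps the byte-order question that a UTF-16 payload would raise.

```python
import base64

def legacy_store(data: bytes) -> bytes:
    # Hypothetical stand-in for the ASCII-only library: it must never
    # see NUL bytes or other values that would trip up C string handling.
    assert b"\x00" not in data
    return data

text = "日本語のテキスト"
wrapped = base64.b64encode(text.encode("utf-8"))  # pure ASCII bytes
restored = base64.b64decode(legacy_store(wrapped)).decode("utf-8")
print(restored == text)  # True
```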

2. Cannot rewrite library in time allowed and don't
really need to based on Fact item #1. Plus, pressure to get product
to market is greater than internationalizing the
library.
This is probably a guaranteed method to fail in Japan. Japanese users, and
your Japanese partners if you have them, have had many years of
experience with bad software from the US that claims to work. They will
know how to break it quickly. Then you will learn a hard lesson
about doing business in Japan while not taking heed of the well-known
requirement for quality.



What I need help with:
--
1. How do I set up an ASCII based unix machine, test
application and test environment to send Japanese characters to the
library in question.
I see from your web site that the application is likely some sort
of encryption device, possibly for email. Having run the Japanese
software group at an email company in the past, I can tell you Japanese
email is fraught with its own perils under any circumstances.
Without knowing what the actual channel is that you want to pass the text
thru, it is hard to say how you will want to test it.
You also have not described the time schedule and why you consider it
tight. Is it safe to assume that your plan to counteract any lack of
experience and time schedule is to spend money to hire someone who has
both?

2. Do I need to create hex input or binary input to
represent Japanese characters. Since I'm using a standard keyboard
how do we get Japanese characters into the
application?
Use the Japanese Input Method Editor supplied with or for the
operating system. But that does not guarantee that the data will actually
get to the application properly if the application has not been coded to
handle it. This is part of internationalizing your code, and now you see
why cutting corners during the initial development is coming back to
haunt you.

3. What am I not considering here? What gotchas
will I come across by not making my library
i18nized?
The gotchas are going to fall into the categories of "Won't
work" or "Data passes through OK, but the rest of the application
doesn't know how to handle it." OTTOMH, I would watch out for
endianness when you base64-encode Japanese multibyte text, too. Probably
OK, but worth taking a close look at.


Unfortunately, I've never done any i18n or l10n work before
so I'm really having trouble figuring out where and how to get
started. Any advice is appreciated.
There is no magic bullet here in general. If Zixit values the opportunity
in Japan, I would suggest you be open to the offers you are sure to get
from experienced folks to assist you. If you don't get any, contact me
off-list and I will put you in touch with some.

Barry Caplan
Publisher,
www.i18n.com



Re: [OT beyond any repair] House numbers

2002-03-04 Thread Barry Caplan

At 01:16 PM 3/1/2002 -0500, John Cowan wrote:
What about the 100 house numbers per block convention?
This does not hold in the older parts of older US cities
(New York does not obey it south of 8th St or so),
but is quite general in the US as a whole


It holds for the whole of Baltimore and extends, on at least the major arteries, into 
the suburbs. Some suburbs reset the count from their own city centers, and that may or 
may not include the main arteries. I am not aware of any exceptions at all in 
Baltimore city. Note that the main arteries radiate more or less in spokes from the 
center of downtown. All blocks are numbered from the hub (Baltimore Street (east/west) 
at Charles Street (north/south)), so all 2800 blocks are roughly equidistant from the 
center. It is less well known that even numbers are on the left as you head out of 
town in any direction, and odd numbers on the right.

Anyone who wants to reach me by snail (extremely snail)
mail, can do so at:

Cowan
12017-0042
USA

Doesn't every address that the USPS delivers to have a unique 9-digit ZIP code, making 
house numbers a legacy? From the US, couldn't I get a letter to you just by putting 
12017-0042 on the envelope?


Barry Caplan
Publisher, www.i18n.com






Need a quick font? make your own!

2002-02-28 Thread Barry Caplan

This is pretty interesting. Is it art, is it a toy? Make your own TT
fonts created by a genetic algorithm!
http://alphabet.tmema.org/


Best Regards,
Barry Caplan
www.i18n.com
- coming soon, preview available now
News | Tools | Process for Global Software
Team I18N



Re: Off-Topic (Re: This spoofing and security thread)

2002-02-14 Thread Barry Caplan

This was discussed in a book I recently read, called "Code" (don't recall the 
author right now). Apparently the Danish (I think) translation has an 
error, but only one. I guess the proofreader was not familiar with grep :)

Barry


At 08:23 AM 2/14/2003 -0500, Elliotte Rusty Harold wrote:
At 11:59 PM -0500 2/13/02, John Cowan wrote:
There is an English translation (or translation): The Void,
wherein the hero, Anton Voyl, becomes Anton Vowl.  There are German
and Danish translations too.

Do you happen to know if these translations also avoid the letter e? 
German's especially impressive since I think e makes up about 20% of the 
letters in typical German.
--

+---++---+
| Elliotte Rusty Harold | [EMAIL PROTECTED] | Writer/Programmer |
+---++---+
|  The XML Bible, 2nd Edition (Hungry Minds, 2001)   |
|  http://www.ibiblio.org/xml/books/bible2/  |
|   http://www.amazon.com/exec/obidos/ISBN=0764547607/cafeaulaitA/   |
+--+-+
|  Read Cafe au Lait for Java News:  http://www.cafeaulait.org/  |
|  Read Cafe con Leche for XML News: http://www.ibiblio.org/xml/ |
+--+-+





Re: Unicode and Security

2002-02-08 Thread Barry Caplan


At 15:53 -0500 2002-02-07, Elliotte Rusty Harold wrote:
For text files, probably not. But for the domain name system the world 
very well might. Indeed, maybe it should unless this problem can be dealt 
with. I suspect it can be dealt with by prohibiting script mixing in 
domain names (e.g. each component of the name must be entirely Greek or 
entirely Cyrillic or entirely Latin etc. Note: 
something_Cyrillic.something_greek.com is OK.)  Does anybody really need 
mixed Latin and Greek domain names?



Not only that, why limit the alleged security risks to domain names? Why 
not the part of an email address before the @? the allowed characters for 
that are specified in a different RFC than that for domain names, and has 
nothing to do at all with DNS.

And how many variations of numerals are there in Unicode? After all, every 
place you could use a domain name, you could use the actual IP address too. 
How many ways might that be spoofed?
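The per-label restriction quoted above — each component entirely Greek, or entirely Cyrillic, or entirely Latin — is mechanical enough to sketch. This crude version infers the script from the formal character name, which is an assumption made for brevity; a real implementation would consult the Unicode Scripts.txt property data:

```python
import unicodedata

def scripts_in(label):
    # Crude: infer the script from each character's formal name. A real
    # checker would use the Unicode Scripts.txt property data instead.
    scripts = set()
    for ch in label:
        name = unicodedata.name(ch, "")
        for script in ("LATIN", "GREEK", "CYRILLIC"):
            if name.startswith(script):
                scripts.add(script)
    return scripts

print(scripts_in("example"))       # {'LATIN'} -> single script, acceptable
mixed = "examp\u03bfe"             # Greek omicron standing in for 'o'
print(sorted(scripts_in(mixed)))   # ['GREEK', 'LATIN'] -> mixed, reject
```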

Barry






RE: Unicode and Security: Domain Names

2002-02-08 Thread Barry Caplan

I want to review these documents, but since time is short, maybe someone 
can answer my question...

Are the actual domain names as stored in the DB going to be canonical 
normalized Unicode strings? It seems this would go a long way towards 
preventing spoofing ... no one would be allowed to register a non-canonical 
normalized domain name. Then, a resolver would be required to normalize any 
request string before the actual resolve.
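The registration-plus-resolution rule proposed above can be sketched in a few lines. The `canonical_label` function is a hypothetical illustration of the policy (normalize and case-fold before storing or looking up), not anything any registry actually mandates:

```python
import unicodedata

def canonical_label(label: str) -> str:
    # Hypothetical registry rule: store and look up only the
    # case-folded, NFC-normalized form of each label.
    return unicodedata.normalize("NFC", label.casefold())

a = "caf\u00e9"    # "café" with a precomposed é
b = "cafe\u0301"   # "café" as 'e' + COMBINING ACUTE ACCENT
print(a == b)                                    # False: different code points
print(canonical_label(a) == canonical_label(b))  # True: same canonical form
```

With the rule in place, the two visually identical strings above could never be registered as distinct names.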

So my questions are:

1 - Am I way off base here? If so, why?
2 - If not, is it already addressed in these docs?
3 - If it is not in the docs, and the request makes sense, then I will make 
the effort to beat the deadline, which is next Monday.


Thanks!

Barry

At 10:37 AM 2/8/2002 -0800, Yves Arrouye wrote:
Moreover, the IDN WG documents are in final call, so if you have comments to
make on them, now is the time. Visit http://www.i-d-n.net/ and sub-scribe
(with a hyphen here so that listar does not interpret my post as a command!)
to their mailing list (and read their archives) before doing so.

The documents in last call are:

1. Internationalizing Domain Names in Applications (IDNA)
http://www.ietf.org/internet-drafts/draft-ietf-idn-idna-06.txt

2. Stringprep Profile for Internationalized Host Names
http://www.ietf.org/internet-drafts/draft-ietf-idn-nameprep-07.txt

3. Punycode version 0.3.3
http://www.ietf.org/internet-drafts/draft-ietf-idn-punycode-00.txt

4. Preparation of Internationalized Strings (stringprep)
http://www.ietf.org/internet-drafts/draft-hoffman-stringprep-00.txt

and the last call will end on Feb 11th 2002, 23h59m GMT-5. There is little
time left.

YA





Re: Unicode and Security

2002-02-07 Thread Barry Caplan

At 12:22 PM 2/7/2002 -0500, Elliotte Rusty Harold wrote:
I've been thinking about security issues in Unicode, and I've come up with 
one that's quite scary and worse than any I've heard before. It uses only 
plaintext, no fonts involved, doesn't require buggy software, and works 
over e-mail instead of the Web. All it requires added to the existing 
infrastructure is internationalized domain names. So in the hope that this 
becomes a self-defeating prophecy, here's the scenario:

[snip] Can you please update me on your budget? Bob, noticing that the 
e-mail appears to come from Alice, whom he knows and trusts, fires off a 
reply with his confidential information. Only it doesn't go to Alice. It 
goes to me. I can then reply to Bob, asking for clarification or more 
details. I can ask him to attach the latest build of his software. I can 
carry on a conversation in which Bob believes me to be Alice and spills 
his guts. This is very, very bad.


This is precisely the problem digital signing is meant to solve. Signing 
means that Alice has encrypted the message with her private key before 
sending to Bob. Bob then decrypts the message using Alice's public key. 
If the message does not decrypt, then Bob should not trust that the 
message is from Alice. This algorithm works independent of transport 
mechanism (email, etc.) or domains. Alice's key stays with Alice, not with 
the domain. Of course, how you exchange trusted keys in the first place is 
another matter, but I am sure this is all covered in a security FAQ somewhere.


E-mail forgery has been a problem for a long time, but it's always been 
one-way. You couldn't trick somebody into sending you a reply because 
doing so required using a different e-mail address than the one they 
expected, thus revealing the message as forged.

There are many, many ways to get a response from someone via email, even if 
the address is not recognized or forged. Most involve social engineering 
approaches more than anything else. My mailbox filled with spam will attest 
to that!


With a Unicode enabled mailer, that's no longer true. If the fonts Bob 
(not me, but Bob) chooses for his e-mail program do not make a clear 
distinction between an o and an omicron, this works. There are lots of 
other attacks. The Cyrillic and Greek alphabets provide lots of options 
for replacing single letters in Latin domain names.


Unless all messages are signed (technically feasible), there is no 
trust at all. When Outlook/Exchange supports, and in fact requires, messages to 
be signed, then this problem will start to dwindle away, at least in the 
email realm.

Of course, if there is a method to judge the level of trust for properly 
signed messages that arrive from folks you don't know (a human 
fallibility), then knowing the origin of the message might not help much 
either. My inbound spam can be verifiably signed, but it is still spam.

In other words, it's not our fault. Blame the client software. Sounds 
distressingly like the Unicode Consortium's approach to these issues. 
Interestingly, my attack works with a single character representation 
(Unicode).


Your attack is only a social engineering attack, not a technical weakness 
inherent in any protocol or character set (even though there may be such 
issues).

Barry





Re: Unicode and Security

2002-02-07 Thread Barry Caplan

At 02:42 PM 2/7/2002 -0500, Elliotte Rusty Harold wrote:
At 11:34 AM -0800 2/7/02, Asmus Freytag wrote:

But, as the discussion shows, spoofing on the word level (.com
for .gov) is alive and well, and supported by any character set
whatsoever. For that reason, it seems to promise little gain to
try to chase the holy grail of a multilingual character set that
somehow avoids the character level spoofing, if the word level
spoofing can go on unchecked.

Burglary at the broken window level is alive and well. Therefore there's 
little point to putting locks on doors.

I hope the fallacy of the above is obvious, but when translated into the 
computer security domain it's all too common a rationalization, as this 
thread demonstrates.

It is not obvious to me that there is a fallacy at all, let alone what it 
is. Instead of stating that we should be able to infer the fallacy, please 
state it, and a possible solution, explicitly.

It seems to me we have already proposed working, and available (if not 
elegant) solutions to the issue of trust of content.

Now the issue seems to be trust of domain names.

My browser already has built-in support for identifying groups of domains I 
can assign varying levels of trust to, based on certificate technology. Not 
elegant, but available.

Similarly, something for email could be done using today's technology.

More importantly, wrt DNS: under what circumstances can you, today, or in 
the future, actually trust that the address resolving information you get 
is accurate? None, really. The packets go too many places on the way that 
could change them. And even if it is accurate, which of course it usually 
is, how can you be sure that packets at a lower level will actually be 
delivered, as intended, and not misdirected or copied elsewhere? You can't, 
really, for the same reason. This is the nature of the system, especially 
at the IP level. None of this has the slightest bit to do with what 
characters are used for domain names, and hence will not go away with any 
changes to DNS. It has everything to do with why data should be encrypted 
if you care about security of data.


There are many ways to socially engineer someone into doing something they 
shouldn't do. This is just one of them, and one that's mostly theoretical 
at the current time. However, we still need to plug the hole.


That there are other, less damaging holes (or even more damaging ones) is 
no excuse for not fixing this one.



The source code for BIND is available. Go ahead and fix it. Good luck 
persuading people to upgrade such a mission-critical part of the internet, 
though.


Just to pull a number out of a hat, imagine there are 10,000 attacks a day 
using spoofing in the current system. Is this any justification for 
opening up a hole that will add 10,000 more? Of course it's not.


I still don't see the attack as anything but social engineering. That a 
telemarketer or door-to-door salesman can get my credit card info by 
misrepresenting their intent does not mean there is a flaw in either the 
phone numbering scheme, or the credit card system. Your attack is exactly 
analogous.

Barry





Re: Unicode and Security

2002-02-07 Thread Barry Caplan

At 04:17 AM 2/8/2002 +0330, Roozbeh Pournader wrote:
On Thu, 7 Feb 2002, Elliotte Rusty
Harold wrote:
 Trust is a human question decided by human beings, not a boolean
answer
 that comes out of a computer algorithm. I can trust that the message
I'm
 replying to came from a person named Barry Caplan even
if I have no
 proof of that whatsoever.
Or that the book you're reading has been written by a person named 
Nicolas Bourbaki...
(Sorry, I love the idea. I could not stop myself.)
roozbeh
On what basis can Elliotte know that a message purported to
be from Barry Caplan actually is from Barry
Caplan, or that there even is a Barry Caplan? The
person writing this, who claims to be Barry Caplan, has never
met anyone named Elliotte Rusty Harold to the best of his
recollection. He (Barry Caplan) does claim to personally be
acquainted with many others on this list though - hi - sorry I missed you
in DC! :)

Best Regards,
Barry Caplan
www.i18n.com
- coming soon, preview available now
News | Tools | Process for Global Software
Team I18N



Re: Unicode and Security

2002-02-06 Thread Barry Caplan

At 11:54 AM 2/6/2002 -0700, John H. Jenkins wrote:
The original focus was on digital signatures, and I still don't get the 
objection.  Because I don't know *precisely* what bytes Microsoft Word or 
Adobe Acrobat use, do I refuse to sign documents they create?  Is that the 
idea?  I mean, good heavens, I don't even know *precisely* what bytes Mail.
app is going to use for this email.  Should I refuse to sign it?


I don't think the main issue is whether or not you should sign it. I think 
the main issue the original poster tried to raise is that, as the recipient 
of such a signed document, he is not persuaded he should trust it.

This is a serious issue, although as several have noted, not a Unicode-only 
one. No one doubts the security of the encryption algorithms used for 
signing. But the issue of trust is critical.

In the analog world, people are expected to read and understand documents, and 
in general, the world's legal systems are set up to recognize that a 
signature (or stamp or seal or whatever) is binding evidence that such care 
was taken (even if it wasn't really taken). In the digital world, 
individual behavior and legal processes may both not yet be well enough 
formed to support the technology of digital signatures. I believe this is 
what the original point was.

IANAL, but the enforceability of such a kluged, digitally signed document seems 
in doubt. There is a long history of that type of contract support in our 
US legal systems, and probably others as well. There will surely be 
difficulties adapting it to the digital domain, but I think the basis for 
support is already there.

Anyway, it is not, but maybe should be, well known that the purpose of 
digital signatures is to verify who the sender is, and to verify that the 
document has not been changed in transit. That it might contain tricky 
language or information is an important thing to note, and the reader still 
needs to view the document's contents with the same skeptical eye as if 
it were not signed. Just as the Unicode bidi algorithm makes no claims of 
reversibility, digital signing algorithms make no claim that the signed 
contents are correct, or even useful.







Re: Unicode and Security

2002-02-03 Thread Barry Caplan

At 02:15 PM 2/3/2002 +0900, you wrote:
On Sat, 2 Feb 2002, David Starner
wrote:
[...several lines cut to save room...]
 I think I'm missing your perspective. To me, these are minor quirks.
Why
 do you see them as huge problems?
I am thinking about electronically signed Unicode text documents
that are rendered correctly, or believed to be rendered correctly,
yet look different, seem to contain additional text, or do not
seem to contain some text, when viewed with different viewers, due
to some ambiguities inherent in the standard.
An electronically signed document allows you to trust who wrote it, and
that the *byte sequence* hasn't been tampered with. It implies nothing
at all, trust-wise, about what software you should use to interpret it. You
would go through the trouble to verify a signature, but trust the .doc
extension and some machine's implementation of Word with your money?
Makes no sense.
That being said, identifying security issues of existing programs and or
protocols when they intersect with Unicode-based data is an important
issue, and one I intend to cover regularly on
www.i18n.com, once it
launches this month.
For those of you that have specific issues to write about, or are
interested in providing a series of security-related articles (length and
frequency TBD), please contact me off-list. I think there are endless
examples already out there, and I know of at least one that
is serious. Let's find more!


Best Regards,
Barry Caplan
www.i18n.com
- coming soon, preview available now
News | Tools | Process for Global Software
Team I18N



Re: VIRUS!!!!! (was Re: new photos from my party!)

2002-01-28 Thread Barry Caplan

Yeah, I wrote about that before going to bed last night, and the photos
virus *made it through* on a Yahoo Group I am subscribed to, even though
apparently the list is set to *no attachments*.
Great.
Lucky for me I won't let MS Outlook anywhere near any of my
computers.

At 05:29 PM 1/28/2002 +, Michael Everson wrote:
Now, Sarasvati, what did I say
about attachments?
-- 
Michael Everson *** Everson Typography ***
http://www.evertype.com


Best Regards,
Barry Caplan
[EMAIL PROTECTED]
www.i18n.com
- coming soon, preview available now
News | Tools | Process for Global Software
Team I18N


Re: Variation Selection

2002-01-27 Thread Barry Caplan

At 10:29 PM 1/27/2002 -0500, you wrote:
In a message dated 2002-01-27
18:51:35 Pacific Standard Time, 
[EMAIL PROTECTED] writes:
 First, have we all servers?
No. Assuming we all do is no better than assuming we all have
broadband or 
T1 connections.
Yes, we do all have servers:
Yahoo is your friend - you can get an unlimited number of 6mb (I think,
maybe more) accounts for free. Store images in
briefcase.yahoo.com/yourid. Store any files in
briefcase.yahoo.com/yourid.
Also, this list is mirrored on a Yahoo group. The group has storage space
too. I don't know who the moderator of that group is, but maybe he/she can
assist.
In any of these cases, all that needs to be passed to the list is the
url.
Frankly, the issue of unexpected attachments in email is not the size for
me, but it does cause me security concerns. I would much rather decide
whether or not to download a file than wake up one morning with a
virus, or worse.


Best Regards,
Barry Caplan
[EMAIL PROTECTED]
www.i18n.com
- coming soon, preview available now
News | Tools | Process for Global Software
Team I18N



Re: FW: Please help me

2002-01-21 Thread Barry Caplan

If I recall correctly, there was a presentation on Uighur and Unicode at
the September 2000 conference in San Jose. I think one of the main topics
was creating fonts to display the language. Perhaps the talk is archived
at the Unicode.org web site?
Best,
Barry Caplan
At 10:46 AM 1/21/2002 -0800, you wrote:

-Original Message-
From: King of kids
[mailto:[EMAIL PROTECTED]]

Sent: Saturday, January 19, 2002 1:55 AM
To: [EMAIL PROTECTED]
Subject: Please help me
Dear Sir/Madam,
 
 Recently, I have heard of that all the Uighur (also called Uyghur,
which is more standard in Uyghur Langauage) language letters are already
in the Unicode Standard 3.1. I have seen all the Uyghur letters in: 

 1.
http://www.unicode.org/charts/PDF/U0600.pdf
 2.
http://www.unicode.org/charts/PDF/U0600.pdf
 3.
http://www.unicode.org/charts/PDF/UFE70.pdf
 But, I could not find some of them within any font sets of
Windows98/XP/2000. Could you tell me where can I find a font set (ex:Like
Lucida Sans Unicode) in which I can find The Unicode Standard 3.1's
Uyghur letters?(A font that contains all codes points within The Unicode
Standard 3.1.)

 Regards,

An Uyghur in Xinjiang Uyghur's Autonomous Region, PRC
99' Graduage Student
Computer Department, Xinjiang University
Waris Abdukerim

* I would like to remind you that some Uighur(Uyghur) letters were not
available in The Unicode Standard 3.0, but I found all of them in The
Unicode Standard 3.1. Thanks very much.


Re: Devanagari

2002-01-20 Thread Barry Caplan

At 10:44 PM 1/20/2002 -0500, you wrote:
Taking the extra links into account the sizes are:
English: 10.4 Kb
Devanagari: 15.0 Kb
Thus the Dev. page is 1.44 times the Eng. page. For sites providing archives
of documents/manuscripts (in plain text) in Devanagari, this factor could be
as high as approx. 3 using UTF-8 and around 1 using ISCII.


Yes, but that is this page only. Are you suggesting that all pages will 
vary by that factor? Of course not.

Please consider whether the space *in practice* is a limiting factor. It 
seems that folks on the list feel it is not. Not for bandwidth limited 
applications, and not for disk space limited applications.

The amount of space devoted to plain text of any language on a typical web 
page is microscopic compared to the markup, images, sounds, and other files 
also associated with the web page.

Are you suggesting that UTF-8 ought to have been optimized for Devanagari text?
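For what it's worth, the factor of roughly three for pure Devanagari plain text is easy to verify: every Devanagari code point lies in the U+0900 block, which takes three bytes per character in UTF-8, versus one byte per character in ISCII.

```python
text = "\u0926\u0947\u0935"  # three Devanagari code points ("देव")
utf8 = text.encode("utf-8")
print(len(text), len(utf8))  # 3 code points, 9 bytes: 3 bytes each
```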

Barry Caplan
www.i18n.com -- coming soon...






Re: The benefit of a symbol for 2 pi

2002-01-18 Thread Barry Caplan

At 10:06 AM 1/18/2002 -0700, Robert Palais wrote:
Which seems to make Unicode a defender of the status quo. Inaction is
as political as action. We are holders of the standards
for the technology for encoding symbols, and we won't admit new symbols
until they are widely used... not necessarily the intent, but possibly
the impact - that evolution of symbolic communication will be hampered?

I think anyone is free to have other competing standards, and there have 
been other strong ones during the lifecycle of Unicode (ISO 10646 for 
instance).

No one doubts that there are other characters that would be useful to 
encode. But the original concept of Unicode as a two-byte encoding leaves 64K 
code points. Unicode as a group quickly found out that was not enough to 
make everyone happy. As it is, the standard is rife with kluges in the 
encoding scheme.
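The most visible of those kluges is UTF-16's surrogate mechanism: a character outside the original 16-bit range costs two 16-bit code units. A quick check:

```python
clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, beyond the 64K BMP
print(len(clef.encode("utf-16-be")))  # 4 bytes: a surrogate pair
print(len("A".encode("utf-16-be")))   # 2 bytes: a single 16-bit unit
```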

The limitation of characters to those that are in current use is related in 
large part to the code-point limitations and partially to the desire to 
prioritize work. It takes the same amount of work to add a character or 
group of characters regardless of whether or not those characters will be 
used. There are plenty of characters which exist in the literature that are 
not encoded in Unicode, and in fact are specifically excluded: those of 
written but dead languages. Newly proposed characters at least have a 
process: get them in use, and addition to Unicode will be easy.

In your case, one way to go about that may be to build a (probably pretty 
straightforward) script that searches out instances of 2pi in TeX and Word 
files, etc., and replaces them with newpi references. Create a font which 
has this character (maybe where the pi is now, or as a user-defined char?). 
Make it easy for folks to get and use these tools. Soon there either will 
or will not be a substantial body of literature using newpi instead of pi, 
and a large discussion of why and how its adoption in math texts should 
happen. Once that is in place, I do not think you will be disappointed by 
the Unicode group.
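Such a rewrite script could be sketched as follows, assuming TeX source and a hypothetical `\newpi` macro (both the pattern and the macro name are illustrative, not part of any actual proposal):

```python
import re

def replace_two_pi(tex: str) -> str:
    """Rewrite literal '2\\pi' (optionally written '2 \\cdot \\pi')
    as a hypothetical '\\newpi' macro in TeX source."""
    pattern = re.compile(r"2\s*(?:\\cdot\s*)?\\pi\b")
    return pattern.sub(r"\\newpi", tex)

sample = r"The circumference is $2\pi r$; the area is $\pi r^2$."
print(replace_two_pi(sample))
# prints: The circumference is $\newpi r$; the area is $\pi r^2$.
```

A real tool would also need to handle spacing variants, macros that expand to pi, and Word's binary/XML formats, but the text-substitution core is this simple.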

Right now newpi seems like a meme that is likely to die to the Unicode 
folks. Show otherwise, and life will be easy, as it was for the euro 
proponents.

Best,

Barry Caplan
www.i18n.com -- coming soon, sign up for features and launch announcements





Re: The benefit of a symbol for 2 pi

2002-01-18 Thread Barry Caplan

At 01:45 PM 1/18/2002 -0500, you wrote:


The limitation of characters to those that are in current use is related 
in large part to the code point limitations


What limitations?  We have over a million codepoints to play with.
There is plenty of room.

I've always been under the impression that one of the original goals of the 
Unicode effort was to do away with the sort of multi-width encodings we are 
all too familiar with (EUC, JIS, SJIS, etc.). This was to be accomplished 
by using a fixed-width encoding. In my mind, everything other than that in 
order to increase space (but not necessarily to save bandwidth) is a kluge, 
and a compromise, because it means code still has to be aware of the 
details of the encoding scheme.

I do not dispute that with the kluges/compromises, there is plenty of room.
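The point about code having to stay encoding-aware can be seen by measuring per-character widths in the common Unicode encoding forms; only UTF-32 is truly fixed-width (the sample characters are illustrative):

```python
# Per-character encoded widths for ASCII, accented Latin, Hiragana,
# and an emoji. UTF-8 and UTF-16 are variable-width; UTF-32 is fixed.
chars = ["A", "\u00e9", "\u3042", "\U0001F600"]
for enc in ("utf-8", "utf-16-le", "utf-32-le"):
    widths = [len(c.encode(enc)) for c in chars]
    print(enc, widths)
# utf-8     [1, 2, 3, 4]
# utf-16-le [2, 2, 2, 4]
# utf-32-le [4, 4, 4, 4]
```

The 4-byte UTF-16 case (surrogate pairs) is exactly the kind of kluge at issue: it is what let the originally 16-bit design address more than 64K code points.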


There are plenty of characters which exist in the literature that are not 
encoded in Unicode, and in fact are specifically excluded: those of written 
but dead languages.


They are not only not excluded, they are included: Runic and Deseret
are just the beginning.  There are many pending proposals for things
like hieroglyphs and cuneiform.


Only now that there are kluges that allow for extra room. But wasn't it 
historically the case that these languages were, shall we say, less than 
welcome?






Re: The benefit of a symbol for 2 pi

2002-01-16 Thread Barry Caplan

At 11:33 AM 1/16/2002 -0700, Robert Palais wrote:
is at the same time somewhat a Catch-22. Nelson Beebe recommended it since
he figured unicode 3.2 would be the make or break for getting it in use.
I'd be curious if you disagree with the thesis that a symbol for
6.28... has scientific/mathematical merit (in comparison to 3.14...), and if so
why?


My guess is that since pi is the ratio of the circumference to the 
diameter, the diameter is a more natural conception of the size of a 
circle than the radius. Of course mathematically, it doesn't matter other 
than the factor of 2. But other geometrical shapes, particularly polygons, 
are measured by line segments that extend from one point to another on the 
same shape, or series of shapes. A radius just sort of ends in the middle, 
while a diameter or other chord begins and ends on the circle.

I can't quote the history, but if I imagine back to the Greek days, I bet 
the diameter was the primary measure. Other polygonal shapes with which 
they were familiar had their measures in terms of a line segment crossing 
the entire shape and touching the boundaries, or coincident with the boundary.

For mathematicians pondering the circle for the first time, there was 
probably no reason to think otherwise. How to proceed from there to figure 
the area of a circle, or the ratio of the diameter to the circumference, 
were probably some of the greatest challenges of the day. They wanted to 
know the circumference and area, same as they had calculated for other shapes.

I would guess that since pi is the ratio of the circumference to the 
diameter, this problem was solved first. Had it been the other way around, 
our formulas might look the way Dr. Palais suggests.

Now that I think about it, I wonder if the very concept of the radius grew 
out of the solution to the area of the circle: was the original formula 
A = pi * (d/2)^2? If so, then maybe a conceptual leap was made to 
simplify it, thus inventing the radius.

Why simplify the d/2 part and not the other way (pi/4)? Probably because pi 
is just a number, while d/2 turned out to have some connection to the 
physical world - the distance from the edge of a circle to the center.

But this is just idle lunchtime speculation on my part.

Note that using the new symbol the circumference of a circle is simply 
tri*r, but the area changes from pi*r^2 to (1/2)*tri*r^2, so you lose as 
much as you gain, it seems to me.
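The trade-off is easy to check numerically. Python's math module happens to ship a tau = 2*pi constant, which can stand in for the proposed symbol here (a quick sanity check, not an endorsement of either notation):

```python
import math

r = 3.0
# Circumference simplifies with tau = 2*pi ...
assert math.isclose(math.tau * r, 2 * math.pi * r)
# ... but the area formula picks up a compensating factor of 1/2.
assert math.isclose(0.5 * math.tau * r ** 2, math.pi * r ** 2)
print("both identities hold")
```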


Barry Caplan





Re: Question

2002-01-15 Thread Barry Caplan

Can you describe the nature of the script and how it uses Unicode (if at 
all) or what it uses for text processing. What version of Unicode are you 
using now for your data?

Best regards,

Barry Caplan


At 05:15 PM 1/15/2002 -0800, BBCOA  Webmaster wrote:

  Hello. I am looking for help with Unicode. I was recently told by my credit
card processing company that I need to upgrade my site to Unicode 3.2 in
order to get a Perl script working. I was wondering how I might be able to
do this. I have no idea how to install or find the latest version of
Unicode.


Gustavo A. Higuera
BBCOA Webmaster
818-757-7123 ext 222