Char Set Detector

2003-07-17 Thread Yogesh Kumar Ahuja
Hi,
 Can anybody give me the path/URL to get a charset detector that can
detect the encoding scheme, at least for Shift_JIS and Big5?
I downloaded the one from Mozilla and built it. It works fine for UTF-8 but
fails for Shift_JIS and Big5 in some cases. I'm working on HP Unix. If anybody
is doing the same type of work, please let me know.

Thanks,
-Yogesh
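
For readers weighing the same problem, here is a minimal sketch of the simplest possible approach: strict trial decoding. It is not the Mozilla detector and makes no claim about how that detector works; the helper name `candidate_encodings` and the candidate list are assumptions made for illustration. It only reports which encodings accept the bytes without error, so ambiguous input (bytes that are valid in more than one encoding) will yield several candidates.

```python
# Hypothetical helper (not part of any detector library): strict trial decoding.
CANDIDATES = ("utf-8", "shift_jis", "big5")


def candidate_encodings(data, encodings=CANDIDATES):
    """Return the names of all encodings that decode `data` without error."""
    matches = []
    for name in encodings:
        try:
            data.decode(name)  # strict mode raises on any invalid byte sequence
        except UnicodeDecodeError:
            continue
        matches.append(name)
    return matches


if __name__ == "__main__":
    sample = "日本語".encode("shift_jis")
    print(candidate_encodings(sample))
```

Trial decoding can rule encodings out, but it cannot choose between two that both succeed; that is where statistical detectors earn their keep, and probably where short Shift_JIS and Big5 samples become hard to tell apart.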



RE: missing .GIF's for ideographs on unicode.org?

2003-07-17 Thread Ostermueller, Erik
Richard wrote:

   Erik, I think you are correct. The link should be like so:
   
   http://www.unicode.org/cgi-bin/refglyph?24-2
   
   I'm guessing this just hasn't been implemented yet.

I swear I've seen the glyph on this page before.  When I looked at it in the PDF I
immediately recognized it, and I'm not a Chinese speaking-kind-of-guy.



Re: [Private Use Area] Audio Description, Subtitle, Signing

2003-07-17 Thread William Overington
Michael Everson raises some interesting points.

William.

 If CENELEC wishes to standardize a set of icons, they will do so. If
 they have a need to interchange data using those icons, they will (if
 they are wise) come to us and ask to encode them. If they want to use
 the Private Use Area before they do that, they will.

Perhaps I may explain the situation?  The European Commission asked Cenelec
to conduct a project about establishing a process to implement interactive
television in the European Community.

A consultancy was asked to produce a report.  Cenelec organized a forum for
the report to be discussed and also arranged an open meeting which was held
in Brussels on 12 March 2003.  The report was made available in the forum
before the meeting.

I did not attend the meeting, though I did post some comments into the forum
before the meeting.  A list of the people due to attend the meeting was
published in the forum prior to the meeting.  Most, though not all, are
representing organizations.

After the meeting a later version of the report was produced.  The report
suggests various aspects of the necessary work be done by various existing
committees and standards bodies.

The forum remains in use.  One participant recently put forward the idea of
agreeing on logos for Audio Description, Subtitle, Signing and provides a
link to the page which I mentioned earlier in this thread.

I added the suggestion of adding the symbols into the built-in font of
DVB-MHP televisions.  I suggested the desirability of regular Unicode code
point allocations.  I mentioned the time scale and I mentioned the Private
Use Area code points for the symbols, that is one code point for each of
Audio Description, Subtitle, Signing, not one code point for each of the
logos being considered.  I suggested some specific code points.  I like to
think that I quite clearly stated my interest in choosing those suggested
code points so as not to clash with my suggestions for other uses of Private
Use Area code points.

For the avoidance of doubt, my suggestions for using Private Use Area code
points for eutocode graphics do not need to be accepted by a standards body.
They are only meaningful in text files which customize those Java programs
which recognize them, they do not need to access the built-in font of the
television set, and they are only used in programs where the eutocode
graphics system is used.  There are glyphs for authoring-time, but
no glyphs are needed for run-time use as the codes activate graphics
features at run-time.

So the specific suggestion for code points for Audio Description, Subtitle,
Signing is within the forum.

Now, I am unsure at present as to how the various committees and standards
bodies are to proceed, yet I have sought to place my ideas in the forum in
the hope that they will reach the agendas of the meetings.

I also, some time later, posted the code point suggestions here.  I did this
with the intention of making the suggested code points more widely known.

 Please don't tell us all about it over and over again, as you have
 done.

Well, I made one post at the start of this thread.  After that I have just
responded to comments which have been made.

 If you want to talk to CENELEC, do so. Please stop trying to
 peddle your PUA schemes for CENELEC to us.

Well, I had not thought of it as peddling!

 I maintain the ConScript Unicode Registry, which contains PUA
 assignments. I do not promulgate those on this list. (Apart from that
 fun testing of the Phaistos implementation some time ago.)

Well, that is a matter for you.  Actually, I rather enjoy reading your
postings and would indeed be pleased if you did post them in this forum.
However, I do understand your point, though as this mailing list has rules
which encourage discussions amongst users of the Unicode Standard, I feel
that you are being somewhat harsh in your criticism.

Anyway, you only joined in to a thread about the Phaistos Disc script, you
did not start it as I seem to remember.  So you were responding to an
enquiry and, indeed, I feel, being very helpful in adding Phaistos Disc
script into the ConScript Unicode Registry.  It will be fascinating to
observe what happens if an archaeological dig somewhere, maybe nowhere near
where the Phaistos Disc was discovered, produces a lot of items with
Phaistos Disc script upon them.  Then instead of Unicode being ready, there
will then be a long wait for a regular Unicode implementation, when it could
have been done years ago!

The taboo of discussing PUA code points which some people have causes lots
of unnecessary problems.  For example, the Unicode Consortium is so
taboo-avoiding about mentioning PUA assignments that when I had some time
ago heard that Microsoft used part of the PUA in a special way in symbol
fonts I had enormous difficulty in finding out where it was located!  Also,
I seem to remember a long time ago trying to find out where it was that I
had seen Tengwar encoded in the 

Re: missing .GIF's for ideographs on unicode.org?

2003-07-17 Thread John H. Jenkins
On Thursday, July 17, 2003, at 12:00 AM, Richard Cook wrote:

I'm guessing this just hasn't been implemented yet.

You are guessing correctly.  Once some of the dust settles from my day 
job, I expect I can get to this.

==
John H. Jenkins
[EMAIL PROTECTED]
http://homepage.mac.com/jhjenkins/



Re: [Private Use Area] Audio Description, Subtitle, Signing

2003-07-17 Thread Kenneth Whistler
William spilled another ocean of digital ink. Found bobbing
in that ocean was the comment:

 Roozbeh and I assigned two unencoded characters for Afghanistan to
 the PUA, and we encourage implementors to use them until such time as
 the characters are encoded.
 
 Yes.  ...  Now that at least one of them has been approved for
 encoding by the Unicode Technical Committee there is now a long period of
 waiting during which Private Use Area encoded data can be produced.  This
 does seem unfortunate and for individual symbols such as these I would hope
 that the people who are in charge of Standards might like to consider asking
 if the United Nations and the World Trade Organization could perhaps arrange
 for some faster way of achieving agreement.

*rolls eyes*

It seems rather unlikely that getting the United Nations and the
World Trade Organization involved in trying to amend JTC1 standards
directives would be a recipe for speeding anything up. :-)

 It does seem so very slow for
 the twenty-first century with so many electronics communications facilities!
 Why does legacy data have to build up and resolving the problem take so long
 for just a few symbols? 

Because amending and updating a standard is effectively the same
task whether it involves 1 additional character or 181 additional
characters. There are a large number of stages, approvals, reviews,
and other tasks involved -- which are there for a reason, to ensure
the stability and orderly maintenance of the standard.

 I would have thought that with a reasonable
 infrastructure that those two code points could have been formally added
 into regular Unicode and ISO within a couple of weeks. 

The whole idea of adding a couple code points this week and then a
couple more next week, and then another next month, and so on is,
well, just nuts. It would destroy effective version control and
would create a situation where implementers were unsure just what
was in the standard and when it would change further. It would
*damage* the standard rather than improve anything.

A character encoding standard is not just a laundry-list registration
of characters that people happen to notice this week. As such, it
is not advisable to create a mechanism whereby new characters
are noticed, approved, and registered on a weekly basis. 

 An ocean of digital ink!  I like that phrase.  

As well as producing the oceans, clearly.

 That person added that people
 have been telling me for a long time that PUA codes are not suitable for
 interchange.

Not suitable for *public* interchange, because, by definition, in
public interchange the receiver will not be a party to whatever
*private* agreement defines their usage, and so will not be able
to interpret them.

 
 That puzzles me, because I thought that it was alright to interchange
 Private Use Area codes if there is an agreement as to their meaning in a
 particular situation.

Yes, a *private* agreement for *private* interchange. That, as Michael
tried to tell you, is why we call them *private* use characters.  

 Also, Unicode 3.0 mentions the possibility of
 publication of Private Use Area assignments 

Anyone is free to publish anything they wish, including lists of PUA
assignments.

 in the section on the Private
 Use Area.

But the Unicode Consortium will not publish such lists in the
Unicode Standard or on its website in any official way.

 
 So what is the official position please? 

I just stated it. If you want chapter and verse:

All code points in the blocks of private-use characters in the
Unicode Standard are permanently designated for private use--no
assignment to a particular, standard set of characters will ever
be endorsed or documented by the Unicode Consortium for any of
these code points.
 -- The Unicode Standard, Version 4.0,
Section 15.7, Private-Use Characters,
p. 398, 2003 [forthcoming]

 This is important to me because I
 have been proceeding in the belief that suggesting three Private Use Area
 code points for use in interactive television systems is entirely proper and
 compliant with Unicode and the ISO standard.

It is. But other participants on this email list have been telling
you that they are not interested in your *particular* use of
private use characters.

--Ken
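
As a small illustration of the point about private interchange, a receiver can at least detect that incoming text contains private-use code points, even though it cannot interpret them without the private agreement. This is a sketch only; the helper names are hypothetical. It relies on the Unicode general category `Co` (private use), which covers U+E000..U+F8FF and the two private-use planes.

```python
import unicodedata

# The three private-use ranges, listed for reference and used in the check below.
PRIVATE_USE_RANGES = (
    (0xE000, 0xF8FF),      # BMP Private Use Area
    (0xF0000, 0xFFFFD),    # Supplementary Private Use Area-A (plane 15)
    (0x100000, 0x10FFFD),  # Supplementary Private Use Area-B (plane 16)
)


def is_private_use(ch):
    """True if the single character `ch` has general category Co."""
    return unicodedata.category(ch) == "Co"


def private_use_chars(text):
    """List the private-use code points in `text` in U+XXXX notation."""
    return [f"U+{ord(c):04X}" for c in text if is_private_use(c)]


if __name__ == "__main__":
    # Sanity check: the range endpoints agree with the character database.
    for first, last in PRIVATE_USE_RANGES:
        assert is_private_use(chr(first)) and is_private_use(chr(last))
    print(private_use_chars("a\ue001b"))
```

Detection is as far as a generic receiver can go: what U+E001 means in a given file is exactly the part that lives in the private agreement.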




Re: [Private Use Area] Audio Description, Subtitle, Signing

2003-07-17 Thread Michael Everson
At 17:01 +0100 2003-07-17, William Overington wrote:
 Michael Everson raises some interesting points.

 William.

  If CENELEC wishes to standardize a set of icons, they will do so. If
  they have a need to interchange data using those icons, they will (if
  they are wise) come to us and ask to encode them. If they want to use
  the Private Use Area before they do that, they will.

 Perhaps I may explain the situation?

No, thank you. If CENELEC wants to propose characters to the Unicode 
Standard, they can contact us. I'd be interested in helping, if they 
had a good case. But I'm not looking for extra work right now.

 Now, I have never heard of the MES-2 whatever that is.  However, I do not
 have deep knowledge of the various standards which exist.  Could you
 possibly say some more about MES-2 please.

A.4.2 282 MES-2

282 MES-2 is specified by the following ranges of code positions as
indicated for each row.

Rows: Positions (cells)
00: 20-7E A0-FF
01: 00-7F 8F 92 B7 DE-EF FA-FF
02: 18-1B 1E-1F 59 7C 92 BB-BD C6-C7 C9 D8-DD EE
03: 74-75 7A 7E 84-8A 8C 8E-A1 A3-CE D7 DA-E1
04: 00-5F 90-C4 C7-C8 CB-CC D0-EB EE-F5 F8-F9
1E: 02-03 0A-0B 1E-1F 40-41 56-57 60-61 6A-6B 80-85 9B F2-F3
1F: 00-15 18-1D 20-45 48-4D 50-57 59 5B 5D 5F-7D 80-B4 B6-C4 C6-D3
    D6-DB DD-EF F2-F4 F6-FE
20: 13-15 17-1E 20-22 26 30 32-33 39-3A 3C 3E 44 4A 7F 82 A3-A4 A7 AC AF
21: 05 16 22 26 5B-5E 90-95 A8
22: 00 02-03 06 08-09 0F 11-12 19-1A 1E-1F 27-2B 48 59 60-61 64-65 82-83 95 97
23: 02 10 20-21 29-2A
25: 00 02 0C 10 14 18 1C 24 2C 34 3C 50-6C 80 84 88 8C 90-93 A0 AC B2
    BA BC C4 CA-CB D8-D9
26: 3A-3C 40 42 60 63 65-66 6A-6B
FB: 01-02
FF: FD
--
Michael Everson * * Everson Typography *  * http://www.evertype.com
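
The row/cell listing above can be expanded mechanically into a set of code points. The following is a sketch under the assumption that each row number is the high byte of a BMP code point and each cell (or cell range) is the low byte; the names `MES2_ROWS` and `parse_rows` are hypothetical and the table is simply transcribed from this message.

```python
# MES-2 rows as quoted in this thread: "row: cells", all values hexadecimal.
MES2_ROWS = """
00: 20-7E A0-FF
01: 00-7F 8F 92 B7 DE-EF FA-FF
02: 18-1B 1E-1F 59 7C 92 BB-BD C6-C7 C9 D8-DD EE
03: 74-75 7A 7E 84-8A 8C 8E-A1 A3-CE D7 DA-E1
04: 00-5F 90-C4 C7-C8 CB-CC D0-EB EE-F5 F8-F9
1E: 02-03 0A-0B 1E-1F 40-41 56-57 60-61 6A-6B 80-85 9B F2-F3
1F: 00-15 18-1D 20-45 48-4D 50-57 59 5B 5D 5F-7D 80-B4 B6-C4 C6-D3 D6-DB DD-EF F2-F4 F6-FE
20: 13-15 17-1E 20-22 26 30 32-33 39-3A 3C 3E 44 4A 7F 82 A3-A4 A7 AC AF
21: 05 16 22 26 5B-5E 90-95 A8
22: 00 02-03 06 08-09 0F 11-12 19-1A 1E-1F 27-2B 48 59 60-61 64-65 82-83 95 97
23: 02 10 20-21 29-2A
25: 00 02 0C 10 14 18 1C 24 2C 34 3C 50-6C 80 84 88 8C 90-93 A0 AC B2 BA BC C4 CA-CB D8-D9
26: 3A-3C 40 42 60 63 65-66 6A-6B
FB: 01-02
FF: FD
"""


def parse_rows(spec):
    """Expand 'row: cell cell-cell ...' lines into a set of code points."""
    code_points = set()
    for line in spec.strip().splitlines():
        row, cells = line.split(":")
        high = int(row, 16) << 8  # the row supplies the high byte
        for cell in cells.split():
            first, _, last = cell.partition("-")
            start = int(first, 16)
            end = int(last or first, 16)  # a single cell is a one-element range
            code_points.update(range(high + start, high + end + 1))
    return code_points


MES2 = parse_rows(MES2_ROWS)

if __name__ == "__main__":
    for cp in (0x00E9, 0x20AC, 0x0301):
        print(f"U+{cp:04X}", cp in MES2)
```

Having the repertoire as a plain set makes membership questions (is the euro sign in, is a given combining accent out) a one-line check.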


Re: Article on Unicode in Globalization Insider

2003-07-17 Thread Thomas Dickey
On Wed, Jul 16, 2003 at 01:01:30PM -, [EMAIL PROTECTED] wrote:
  http://www.lisa.org/archive_domain/newsletters/2003/
  3.2/lommel_unicode.html
 
  This link seems to be broken. I get a message *Our apologies*
  *The page you requested is not available.*
 
 I guess you just have to combine the whole URL properly into one line.

it works for me (lynx).

-- 
Thomas E. Dickey [EMAIL PROTECTED]
http://invisible-island.net
ftp://invisible-island.net



Re: missing .GIF's for ideographs on unicode.org?

2003-07-17 Thread Rick McGowan
Ostermueller, Erik wrote:

 At unicode.org, when I click this link,
 http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=2
 I'm expecting to see a little square GIF that displays U+2.
 Instead, I see N/A.

This has now been fixed. Thank you for pointing out the error. The code  
was only showing glyphs = 0x, even though the glyph database has all  
of the plane 2 Han glyphs in it.

Rick
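
For context, a code point's plane is simply its value divided by 0x10000, so any check that only admits the Basic Multilingual Plane (plane 0) will silently exclude the plane 2 ideographs. The sketch below is illustrative only and says nothing about the actual script on unicode.org; the function names are hypothetical.

```python
def plane(code_point):
    """Return the Unicode plane (0..16) that a code point belongs to."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    return code_point >> 16


def is_bmp(code_point):
    """True for code points in plane 0, the Basic Multilingual Plane."""
    return plane(code_point) == 0


if __name__ == "__main__":
    for cp in (0x4E00, 0x20000):
        print(f"U+{cp:04X} is in plane {plane(cp)}")
```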



Re: About the European MES-2 subset (was: PUA Audio Description, Subtitle, Signing)

2003-07-17 Thread Philippe Verdy
On Thursday, July 17, 2003 9:23 PM, Michael Everson [EMAIL PROTECTED] wrote:

 At 17:01 +0100 2003-07-17, William Overington wrote:
  Now, I have never heard of the MES-2 whatever that is.  However, I
  do not have deep knowledge of the various standards which exist. 
  Could you possibly say some more about MES-2 please.
 
 282 MES-2 is specified by the following ranges of code positions as
 indicated for each row.
 Rows: Positions (cells)
 00: 20-7E A0-FF
 01: 00-7F 8F 92 B7 DE-EF FA-FF
 02: 18-1B 1E-1F 59 7C 92 BB-BD C6-C7 C9 D8-DD EE
 03: 74-75 7A 7E 84-8A 8C 8E-A1 A3-CE D7 DA-E1
 04: 00-5F 90-C4 C7-C8 CB-CC D0-EB EE-F5 F8-F9
 1E: 02-03 0A-0B 1E-1F 40-41 56-57 60-61 6A-6B 80-85 9B
   F2-F3
 1F: 00-15 18-1D 20-45 48-4D 50-57 59 5B 5D 5F-7D
   80-B4 B6-C4 C6-D3 D6-DB DD-EF F2-F4 F6-FE
 20: 13-15 17-1E 20-22 26 30 32-33 39-3A 3C 3E 44 4A
   7F 82 A3-A4 A7 AC AF
 21: 05 16 22 26 5B-5E 90-95 A8
 22: 00 02-03 06 08-09 0F 11-12 19-1A 1E-1F 27-2B
   48 59 60-61 64-65 82-83 95 97
 23: 02 10 20-21 29-2A
 25: 00 02 0C 10 14 18 1C 24 2C 34 3C 50-6C 80 84 88 8C
   90-93 A0 AC B2 BA BC C4 CA-CB D8-D9
 26: 3A-3C 40 42 60 63 65-66 6A-6B
 FB: 01-02
 FF: FD

As most of these characters are canonically decomposable, shouldn't this
list include also the decomposed characters?

Why is row 03 so restricted? Shouldn't it include those accents and
diacritics that are used by other characters once canonically
decomposed? Or does it imply that MES-2 is only supposed to use
strings in NFC form?

Also, is this list under full closure with existing character properties, like
NFKD decompositions, and case mappings?

-- 
Philippe.
Spam not tolerated: any unsolicited message will be
reported to your Internet service providers.




Re: About the European MES-2 subset (was: PUA Audio Description, Subtitle, Signing)

2003-07-17 Thread Kenneth Whistler

  282 MES-2 is specified by the following ranges of code positions as
  indicated for each row...

Philippe Verdy asked:

 As most of these characters are canonically decomposable, shouldn't this
 list include also the decomposed characters?
 
 Why is row 03 so restricted? Shouldn't it include those accents and
 diacritics that are used by other characters once canonically
 decomposed? Or does it imply that MES-2 is only supposed to use
 strings in NFC form?

MES-2 (and all the rest of the Multilingual European Subsets) are
a CEN construct. See the CEN Workshop Agreement, CWA 13873:2000
posted at Michael Everson's site:

http://www.evertype.com/standards/iso10646/pdf/cwa13873.pdf

Among other things, that CWA states:

This CWA does *not* specify any encoding of the European Subsets.

so conceptually it is more like a repertoire listing.

MES-2 is formally listed in 10646 as one of the normative subsets
there, but since 10646 has no concepts of decomposition, normalization,
or equivalence, the fact that MES-2 contains precomposed characters
but not their decompositions or the relevant combining accents
is formally irrelevant.

The Unicode Standard does not make subsets a normative construct
for that standard and doesn't even mention MES-2. Conformance to
10646 doesn't require you to make use of its subsets, but if anyone
is worried about the articulation of the standards, the Unicode
Standard itself formally consists of Subset 305 of 10646:2003,
namely the UNICODE 4.0 subset -- the subset which contains *all*
of the encoded characters of 10646:2003.

Think of the Multilingual European Subsets as a kind of
way for people in Europe associated with standards organizations
and governments to try to communicate with software vendors
regarding which user characters they want to ensure are
supported by their software. The CWA 13873 contains some
questionable presuppositions about how software vendors are
actually proceeding to roll out their Unicode support, but
the intent of the CWA is clear:

It is estimated that implementing the full character set of the
UCS may be costly in the first stages of UCS use, and that many
manufacturers will implement in subset-stages. To ensure that a
common subset usable to the vast majority of European users be
available for a reasonable price, and as a guide to manufacturers,
it will be helpful to specify, to users and procurers of systems,
European subsets of the UCS encompassing the characters for use
in European languages as well as other frequently used and
specialist characters.

 Also, is this list under full closure with existing character properties, like
 NFKD decompositions, and case mappings?

MES-2 is clearly *not* closed under NFD, NFKD, or NFKC normalizations.

Although less obvious, it is also not closed under NFC
normalization. For example, it includes the angle brackets
U+2329, U+232A, but not their canonical equivalents,
U+3008, U+3009. There are also some characters outside the MES-2 
repertoire where NFC(x) *is* in the MES-2 repertoire. Singleton canonical
equivalences like U+212B ANGSTROM SIGN come to mind, for example.

I haven't checked on case mappings and case foldings, but would
not be too surprised to find an anomaly or two there, as well.
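
The closure examples above can be checked with a few lines of Python. This is a sketch, with a hypothetical helper name and a deliberately tiny stand-in repertoire rather than the full MES-2 set; results depend on the Unicode data shipped with the Python build, though the canonical mappings used here are covered by the normalization stability guarantees.

```python
import unicodedata


def normalization_escapes(form, repertoire):
    """Map each code point in `repertoire` to the code points outside the
    repertoire that appear when it is normalized to `form`."""
    escapes = {}
    for cp in sorted(repertoire):
        normalized = unicodedata.normalize(form, chr(cp))
        outside = [ord(c) for c in normalized if ord(c) not in repertoire]
        if outside:
            escapes[cp] = outside
    return escapes


if __name__ == "__main__":
    # Toy repertoire: the two angle brackets plus one precomposed letter.
    toy = {0x2329, 0x232A, 0x00C5}
    for cp, outside in normalization_escapes("NFC", toy).items():
        print(f"U+{cp:04X} ->", " ".join(f"U+{o:04X}" for o in outside))
    # The reverse direction: a character outside the set normalizing into it.
    print(unicodedata.normalize("NFC", "\u212b") == "\u00c5")
```

Passing a full repertoire set in place of `toy` would list every character whose normalized form falls outside it, for any of the four normalization forms.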

MES-2 was not designed by the UTC, nor did it take any of
these considerations into account. It is not really an
appropriate construct for the Unicode Standard. A more
meaningful way to think of it is: if you want to sell software
in Europe, you better be able to input and display all the
characters we Europeans have in this list.

--Ken




Re: About the European MES-2 subset (was: PUA Audio Description, Subtitle, Signing)

2003-07-17 Thread Philippe Verdy
On Friday, July 18, 2003 2:18 AM, Kenneth Whistler [EMAIL PROTECTED] wrote:

 MES-2 was not designed by the UTC, nor did it take any of
 these considerations into account. It is not really an
 appropriate construct for the Unicode Standard. A more
 meaningful way to think of it is: if you want to sell software
 in Europe, you better be able to input and display all the
 characters we Europeans have in this list.

I interpret it this way:

MES-2 is a collection of characters independent of their actual encoding.
To support MES-2 in a Unicode-compliant application, extra characters
need to be added, notably if the minimum requirement for information
interchange is the NFC form used by XML and HTML related standards.

It would be interesting to inform CEN about how MES-2 can be
documented to comply with all normative Unicode algorithms. The
minimum is to ensure the NFC closure of this subset, which would
have been better off not including compatibility characters that have
singleton canonical decompositions, and which should now reintegrate
the missing NFC forms.

For obvious reasons, the case mappings should also be closed, but
not necessarily compatibility decompositions, or characters needed
for the NFD form (notably combining diacritics, which may be added
only in applications that can process and recompose them when
querying supported precomposed characters in fonts).
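
Closure under case mapping can be tested in the same spirit. The sketch below uses Python's built-in full case mappings (`str.upper` and `str.lower`) and a hypothetical helper name; the toy repertoires are stand-ins, not the MES-2 set, and no claim is made here about what the real set would report.

```python
def case_mapping_escapes(repertoire):
    """Map each code point in `repertoire` to the code points outside the
    repertoire produced by upper- or lower-casing it."""
    escapes = {}
    for cp in sorted(repertoire):
        ch = chr(cp)
        mapped = set(ch.upper()) | set(ch.lower())
        outside = sorted(ord(c) for c in mapped if ord(c) not in repertoire)
        if outside:
            escapes[cp] = outside
    return escapes


if __name__ == "__main__":
    # U+00FF (y with diaeresis) uppercases to U+0178, which this toy set lacks.
    print(case_mapping_escapes({0x00FF}))
    # Adding U+0178 closes the toy set under case mapping.
    print(case_mapping_escapes({0x00FF, 0x0178}))
```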

Do the default TrueType fonts for Windows support the whole
MES-2 repertoire (Times New Roman, Arial and Courier New),
including on Windows 95 without Uniscribe installed and used?

In practice, MES-2 support will always need additional characters
to ensure the minimum closures, and ISO 10646 should work with
CEN to fix their set in a revision.

-- 
Philippe.
Spam not tolerated: any unsolicited message will be
reported to your Internet service providers.




Putting Unicode to Work

2003-07-17 Thread djinn
Although some list members may already be aware of these pages, because 
there are still very few web sites today that present text using a 
variety of Unicode ranges for purposes other than 'display testing', I 
thought the entire list should know about the Unicode pages on the Hot 
Peach Pages/EarthWords site:

1. Languages A to I (http://www.hotpeachpages.net/lang/indexu.html)
2. Languages J to Z (http://www.hotpeachpages.net/lang/index2u.html)
3. Quick definition of DV (http://www.hotpeachpages.net/lang/defnu.html)
(There is also an information page about how to 'access' Unicode at 
http://www.hotpeachpages.net/a/characters.html.)

This non-profit web site was intended as an international resource from 
its inception, and has recently incorporated Unicode in order to 
further that goal. Of the millions of web sites on the internet, it may 
in fact be 'the' one that currently makes the most use of Unicode ranges 
in furtherance of 'non-whimsical', 'non-technical' purposes.

Any suggestions for improvement would be welcomed.