Re: Ternary search trees for Unicode dictionaries

2003-11-17 Thread Mark Davis
We tend to use tries, which have very good performance characteristics. See
"bits of unicode" on my site: www.macchiato.com.

Mark
__
http://www.macchiato.com

- Original Message - 
From: "Theodore H. Smith" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Mon, 2003 Nov 17 18:21
Subject: Re: Ternary search trees for Unicode dictionaries


> I've looked into the TST thing.
>
> I'm not sure that it is optimal, despite how popular they are!
>
> Look at this: if I add "1", "2", "3", "4", "5", "6", "7", "8", "9" to a
> TST, they will all be in a line in the tree. All will be referenced via
> the "high" node.
>
> So, to find "9", I have to read through 9 items!
>
> Now, I'm not sure if this is a bad thing. It's just that compared to a
> binary search on an array, in this example, a TST does more work.
>
> When reading in data, that has already been sorted, I think this could
> be a problem. It's fine for randomly ordered data, but with sorted
> data... I can't tell how much of a problem it could be.
>
> This is the kind of structure that punishes neat people. An unexpected
> payback for the effort of being neat and ordered.
>
> I'm quite new to this, and so I don't know if there is a good solution
> for keeping the trees balanced without imposing huge overhead, or if in
> practice this will be a problem.
>
>
>




Linguistic Diversity and National Unity: Language Ecology in Thailand

2003-11-17 Thread Eric Muller
I just finished reading “Linguistic Diversity and National Unity: 
Language Ecology in Thailand” by William Smalley, University of Chicago 
Press, ISBN 0-226-76288/9, and I found it very interesting. However, I 
have no reference to judge it against.  Can anybody comment on it? Any 
significant change since it was published 10 years ago?

Thanks,
Eric.




Re: Problems encoding the spanish o

2003-11-17 Thread Doug Ewell
Philippe Verdy  wrote:

> If IE really wants to keep some compatibility, it may only accept the
> CESU-8 encoding only as a possible choice for its "automatic
> selection" of charsets, or display a visible replacement character
> (such as a narrow white box) for invalid characters (that could
> internally be handled as if these invalid sequences were representing
> U+).

1.  CESU-8 should *never* be auto-detected.  CESU-8 is intended for
internal use only.  Even the TR says this.

2.  CESU-8 has nothing to do with overlong sequences.  They're just as
invalid there as in UTF-8.  So I really don't know how CESU-8 got
dragged into this thread in the first place.
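
Both points can be illustrated with a minimal Python sketch. Python has no built-in CESU-8 codec, so `cesu8_encode` below is a hypothetical helper written only for this example; it encodes each UTF-16 surrogate as its own three-byte sequence. The overlong check relies on the built-in strict UTF-8 decoder.

```python
def cesu8_encode(text):
    """Hypothetical helper (not a stdlib codec): encode text as CESU-8."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            # BMP characters are encoded exactly as in UTF-8.
            out += ch.encode("utf-8", "surrogatepass")
        else:
            # Supplementary characters: split into a UTF-16 surrogate pair,
            # then encode each surrogate as a separate 3-byte sequence.
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            for unit in (high, low):
                out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


def is_valid_utf8(data):
    """True if the strict built-in decoder accepts the bytes as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


sample = "\U00010400"                 # a supplementary-plane character
utf8_bytes = sample.encode("utf-8")   # F0 90 90 80 (4 bytes)
cesu8_bytes = cesu8_encode(sample)    # ED A0 81 ED B0 80 (6 bytes)
overlong_slash = b"\xc0\xaf"          # overlong 2-byte form of "/"

print(utf8_bytes.hex(), cesu8_bytes.hex())
print(is_valid_utf8(cesu8_bytes))     # False: encoded surrogates are not UTF-8
print(is_valid_utf8(overlong_slash))  # False: overlong forms are rejected
```

The two encodings differ only for supplementary characters; an overlong sequence is a separate kind of malformation.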

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/




Re: Ternary search trees for Unicode dictionaries

2003-11-17 Thread Theodore H. Smith
I've looked into the TST thing.

I'm not sure that it is optimal, despite how popular they are!

Look at this: if I add "1", "2", "3", "4", "5", "6", "7", "8", "9" to a 
TST, they will all be in a line in the tree. All will be referenced via 
the "high" node.

So, to find "9", I have to read through 9 items!

Now, I'm not sure if this is a bad thing. It's just that compared to a 
binary search on an array, in this example, a TST does more work.

When reading in data, that has already been sorted, I think this could 
be a problem. It's fine for randomly ordered data, but with sorted 
data... I can't tell how much of a problem it could be.

This is the kind of structure that punishes neat people. An unexpected 
payback for the effort of being neat and ordered.

I'm quite new to this, and so I don't know if there is a good solution 
for keeping the trees balanced without imposing huge overhead, or if in 
practice this will be a problem.
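
The effect described above can be reproduced with a minimal Python sketch. The node layout and the names `insert`, `nodes_visited` and `median_first` are assumptions made for this illustration, not taken from any particular TST library, and keys are assumed to be non-empty. Inserting the middle key first is one commonly suggested workaround when the whole key set is known in advance; it is shown here only as a possibility.

```python
class Node:
    def __init__(self, ch):
        self.ch = ch
        self.lo = self.eq = self.hi = None
        self.is_end = False


def insert(node, key, i=0):
    """Insert a non-empty key and return the (possibly new) subtree root."""
    ch = key[i]
    if node is None:
        node = Node(ch)
    if ch < node.ch:
        node.lo = insert(node.lo, key, i)
    elif ch > node.ch:
        node.hi = insert(node.hi, key, i)
    elif i + 1 < len(key):
        node.eq = insert(node.eq, key, i + 1)
    else:
        node.is_end = True
    return node


def nodes_visited(node, key):
    """Count nodes touched while looking up key; 0 if the key is absent."""
    i, visited = 0, 0
    while node is not None:
        visited += 1
        ch = key[i]
        if ch < node.ch:
            node = node.lo
        elif ch > node.ch:
            node = node.hi
        elif i + 1 < len(key):
            node, i = node.eq, i + 1
        else:
            return visited if node.is_end else 0
    return 0


def build(keys):
    root = None
    for k in keys:
        root = insert(root, k)
    return root


def median_first(keys):
    """Reorder sorted keys so each middle key is inserted before its neighbours."""
    if not keys:
        return []
    mid = len(keys) // 2
    return [keys[mid]] + median_first(keys[:mid]) + median_first(keys[mid + 1:])


digits = [str(d) for d in range(1, 10)]
sorted_tree = build(digits)                    # keys inserted in sorted order
balanced_tree = build(median_first(digits))    # same keys, middle key first

print(nodes_visited(sorted_tree, "9"))     # 9: the whole chain of "high" links
print(nodes_visited(balanced_tree, "9"))   # 3
```

With sorted input the nine keys form a single chain, so the lookup degrades to a linear scan; with the reordered input no lookup among these keys touches more than four nodes.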




Re: How can I input any Unicode character if I know its hexadecimal code?

2003-11-17 Thread Frank Yung-Fong Tang

Hmm, a very stupid (but workable) way:
1. use vi
2. type "&#x" + the hexadecimal code + ";" for each character
3. save it as .html
4. open the file in a browser
5. copy the text
6. paste it into your software.
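
The same idea can be sketched in a few lines of Python, as an alternative to the editor-and-browser round trip. The helper names `char_from_hex` and `to_ncr` are made up for this example; `html.unescape` is the standard-library function that resolves numeric character references.

```python
import html


def char_from_hex(hex_code):
    """Return the character for a hexadecimal code point such as "F3"."""
    return chr(int(hex_code, 16))


def to_ncr(hex_code):
    """Build the HTML numeric character reference described in step 2."""
    return "&#x" + hex_code + ";"


codes = ["F3", "20AC"]
page = "<html><body>" + "".join(to_ncr(c) for c in codes) + "</body></html>"

print(page)                              # the file you would save as .html
print(html.unescape(to_ncr("F3")))       # what the browser would display
print([char_from_hex(c) for c in codes]) # direct conversion, no HTML needed
```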


--
Frank Yung-Fong Tang
System Architect, International Development, AOL Interactive 
Services
AIM:yungfongta   mailto:[EMAIL PROTECTED] Tel:650-937-2913
Yahoo! Msg: frankyungfongtan

John 3:16 "For God so loved the world that he gave his one and only Son,
that whoever believes in him shall not perish but have eternal life.

Does your software display Thai language text correctly for Thailand users?
-> Basic Concept of Thai Language linked from Frank Tang's
Internationalization Secrets
Want to translate your English text to something Thailand users can
understand ?
-> Try English-to-Thai machine translation at
http://c3po.links.nectec.or.th/parsit/





Re: [OT] "Www" as an internet riddle

2003-11-17 Thread Philippe Verdy
From: "Kenneth Whistler" <[EMAIL PROTECTED]>
> > «That's good for symbolizing e-mail», I said, «but that joint supports
> > no POP3/SMTP access, only webbrowsing. «You should go for a "www"
> > instead...» «Well, I want it as one character only. Any ideas, dummy?»
> >
> > This dummy then produced U+02AC to the startled friend, and hurried in
> > search a sadly inexisting COMBINING LATIN SMALL LETTER W. Settled for
> > U+0651, just for the sake of it...
>
> Only a matter of time before someone asks for a precomposed:
>
> U+ AUDIBLE LIP SMACK WITH SHADDA ABOVE

For such a crazy ISP, the best symbol for it should just be:
U+0024 '$' DOLLAR SYMBOL, or even better:
U+FF04 '＄' FULLWIDTH DOLLAR SYMBOL, if not just:
U+FE69 '﹩' SMALL DOLLAR SYMBOL

Or maybe (who knows?):
U+00A3 '£' POUND SYMBOL
U+00A4 '¤' CURRENCY SYMBOL
U+00A5 '¥' YEN SYMBOL

And why not:
U+20A9 '₩' WON SYMBOL (broken web...)
U+20A5 '₥' MILLIME SYMBOL (broken mail...)
U+20AB '₫' DONG SYMBOL (broken data...)
U+20A6 '₦' NAIRA SYMBOL (you have NO mail waiting...)

:-))




Re: "Www" as an internet riddle

2003-11-17 Thread Kenneth Whistler

> «That's good for symbolizing e-mail», I said, «but that joint supports
> no POP3/SMTP access, only webbrowsing. «You should go for a "www"
> instead...» «Well, I want it as one character only. Any ideas, dummy?»
> 
> This dummy then produced U+02AC to the startled friend, and hurried in
> search a sadly inexisting COMBINING LATIN SMALL LETTER W. Settled for
> U+0651, just for the sake of it...

Only a matter of time before someone asks for a precomposed:

U+ AUDIBLE LIP SMACK WITH SHADDA ABOVE

--Ken *smacks his lips at the prospect*

Of course, you could also suggest to your friend U+02AC with
a "double-u" above: <02AC, 0367, 0367>





Re: Problems encoding the spanish o

2003-11-17 Thread Philippe Verdy
From: "Marco Cimarosti" <[EMAIL PROTECTED]>
To: "'Pim Blokland'" <[EMAIL PROTECTED]>; "Unicode mailing list"
<[EMAIL PROTECTED]>
> Pim Blokland wrote:
> > Not only that, but the process making the mistake of thinking it is
> > UTF-8 also makes the mistake of not generating an error for
> > encountering malformed byte sequences,
>
> BTW, this process has a name: "Internet Explorer".

Don't blame IE too much if it attempts to interpret the text as UTF-8,
because the page is explicitly tagged with a UTF-8 charset. Well, it's true
that IE should stop using this erroneous charset tag as soon as it sees a
violation of the UTF-8 rules, and should instead fall back to its "automatic
selection". But it's also true that IE still attempts to use the legacy
UTF-8 encoding, which allowed interpreting non-shortest sequences.

I think this bug does not occur in recent updates of IE, notably
since the security hole in MSHTML.DLL was fixed so that it no longer
interprets non-shortest sequences. If IE really wants to keep some
compatibility, it may accept the CESU-8 encoding only as a possible
choice for its "automatic selection" of charsets, or display a visible
replacement character (such as a narrow white box) for invalid characters
(that could internally be handled as if these invalid sequences were
representing U+).

But if the user forces UTF-8 decoding in the GUI, IE should still not
accept any invalid UTF-8 sequence; it should interpret it as an invalid
character like U+ or, even better, disable this UTF-8 choice in the user
interface.
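
As an illustration of the two behaviours discussed here (rejecting the invalid sequence or showing a visible replacement), here is a minimal Python sketch. It uses Python's built-in decoder, whose "replace" error handler substitutes U+FFFD REPLACEMENT CHARACTER; other decoders may make a different choice. The sample text is the fragment from this thread, stored as ISO-8859-1 bytes.

```python
# The fragment from the thread, as it would be stored in an ISO-8859-1 file.
raw = "izaci\u00f3n Map".encode("iso-8859-1")

# Strict decoding: the lone byte F3 is not valid UTF-8 here, so this fails.
bad_offset = None
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError as exc:
    strict_ok = False
    bad_offset = exc.start   # index of the offending byte

# Lenient decoding: each invalid sequence becomes one replacement character.
replaced = raw.decode("utf-8", errors="replace")

print(strict_ok, bad_offset)
print(replaced)
```

A strict decoder lets the application notice the mislabelled charset and fall back to detection; the replacing decoder keeps going but makes the damage visible instead of silently producing a bogus character.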

So this is really an effect of the collision of multiple Unicode violations,
both in the User-Agent interpreting the coded strings, and in the content of
the page, incorrectly labelled UTF-8 when it is not (here: complain to your
web page designer, or blame yourself if you created this page with invalid
meta-tags).

Beware, when editing a UTF-8 page that explicitly includes the UTF-8 charset
meta-tag, that your editor does not save it as ISO-8859-1 just because
it thinks this will save storage space...

There are also some bogus "web site optimizers" that perform this kind
of encoding optimization (in addition to removing unnecessary spaces and
newlines, or compressing/obfuscating the JavaScript code and CSS stylesheet
class names) and don't take care to change the value of this meta-tag...

Changing the internal encoding of any text file without an explicit request
from the user should never be done automatically without confirmation and
logging of the actions taken.




Re: Problems encoding the spanish o

2003-11-17 Thread pepe pepe
Hello:

  My knowledge of encodings is very poor and you seem to know a lot about 
this. Could you explain a bit more what you have said? I have done the 
following:

This is the problematic sequence: 11110011-01101110-00100000-01001101 
(F3-6E-20-4D). If I follow the instructions that appear under the question 
"What is UTF-8?" in the UTF-8 FAQ, I obtain 011101110100000001101 instead 
of 1EE80D (111101110100000001101). Have I made a mistake? Following the 
UTF-16 encoding from my result, all works well. So, finally, who do you 
think is responsible for this strange situation: the client, for saying 
that the document is UTF-8, or the parser?

Regards,
Mario.


From: Pim Blokland <[EMAIL PROTECTED]>
To: Unicode mailing list <[EMAIL PROTECTED]>
Subject: Re: Problems  encoding the spanish o
Date: Mon, 17 Nov 2003 13:26:19 +0100
pepe pepe schreef:

>   We have the following sequence of characters "...ización Map.."
> that is the same as "...ización Map..." and that, after undergoing some
> transformations, becomes "...izaci&#56186;&#56333;ap".
> As you can see, the two characters 56186 and 56333 seem to
> represent the sequence "ón M". Any idea?
Yes, your input text obviously gets flagged as being in UTF-8
format, even if it is Latin-1 (or any codepage that has a ó at index
243).
Not only that, but the process making the mistake of thinking it is
UTF-8 also makes the mistake of not generating an error for
encountering malformed byte sequences, AND of outputting the result
as two 16-bit numbers instead of one 21-bit number.
If you take the byte sequence (hex) F3 6E 20 4D and treat it as
UTF-8 and don't care it's not valid, this maps to the value
(hex)1EE80D. Again, not caring this is not a valid codepoint,
turning this into UTF-16 would yield U+DB7A U+DC0D, which is what
you got in your output.
Pim Blokland







RE: Problems encoding the spanish o

2003-11-17 Thread Marco Cimarosti
Pim Blokland wrote:
> Not only that, but the process making the mistake of thinking it is
> UTF-8 also makes the mistake of not generating an error for
> encountering malformed byte sequences,

BTW, this process has a name: "Internet Explorer".

> AND of outputting the result as two 16-bit numbers instead of one
> 21-bit number.

I guess that this resulted from copying & pasting the resulting text into an
editor and saving it as UTF-16.

_ Marco



Re: Problems encoding the spanish o

2003-11-17 Thread Pim Blokland
pepe pepe schreef:

>   We have the following sequence of characters "...ización Map.."
> that is the same as "...ización Map..." and that, after undergoing some
> transformations, becomes "...izaci&#56186;&#56333;ap".
> As you can see, the two characters 56186 and 56333 seem to
> represent the sequence "ón M". Any idea?

Yes, your input text obviously gets flagged as being in UTF-8
format, even if it is Latin-1 (or any codepage that has a ó at index
243).
Not only that, but the process making the mistake of thinking it is
UTF-8 also makes the mistake of not generating an error for
encountering malformed byte sequences, AND of outputting the result
as two 16-bit numbers instead of one 21-bit number.

If you take the byte sequence (hex) F3 6E 20 4D and treat it as
UTF-8 and don't care it's not valid, this maps to the value
(hex)1EE80D. Again, not caring this is not a valid codepoint,
turning this into UTF-16 would yield U+DB7A U+DC0D, which is what
you got in your output.
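
The arithmetic can be checked with a minimal Python sketch. The function names are invented for this example, and the decoder deliberately skips all validation to mimic the faulty process. One caveat: masking each trail byte to its low six bits, as the UTF-8 bit layout prescribes, gives 0xEE80D for these four bytes; that value maps to exactly the surrogate pair quoted above.

```python
def lenient_utf8_4byte(data):
    """Combine 4 bytes as if they were a 4-byte UTF-8 sequence, unvalidated."""
    b0, b1, b2, b3 = data
    return (((b0 & 0x07) << 18) | ((b1 & 0x3F) << 12)
            | ((b2 & 0x3F) << 6) | (b3 & 0x3F))


def to_surrogates(cp):
    """Split a supplementary code point into a UTF-16 surrogate pair."""
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


value = lenient_utf8_4byte(bytes.fromhex("F36E204D"))
high, low = to_surrogates(value)

print(hex(value))            # 0xee80d
print(hex(high), hex(low))   # 0xdb7a 0xdc0d
print(high, low)             # 56186 56333, the two numbers in the output
```

A conforming decoder would stop at the second byte, because 6E is not a trail byte (trail bytes lie in the range 80..BF).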

Pim Blokland





RE: Problems encoding the spanish o

2003-11-17 Thread Marco Cimarosti
pepe pepe wrote:
>   We have the following sequence of characters "...ización 
> Map.." that is the same as "...ización Map..." and that,
> after undergoing some transformations, becomes
> "...izaci&#56186;&#56333;ap". As you can see, the two
> characters 56186 and 56333 seem to represent the 
> sequence "ón M". Any idea?

Yes. In the <head> of your HTML file, you should have a line like this:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

Change "utf-8" to "iso-8859-1", or simply remove the whole line.
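
Why relabelling works can be shown with a minimal Python sketch, assuming the file really contains ISO-8859-1 bytes. The second option, converting the bytes so that the utf-8 label becomes true, is not part of the suggestion above and is included only as an alternative.

```python
# What the file actually contains: the fragment stored as ISO-8859-1.
latin1_bytes = "izaci\u00f3n Map".encode("iso-8859-1")

# Option 1: relabel. Decoding with the charset that matches the bytes works.
as_latin1 = latin1_bytes.decode("iso-8859-1")

# Option 2 (alternative): keep the utf-8 label and convert the bytes instead.
utf8_bytes = as_latin1.encode("utf-8")

print(latin1_bytes.hex(" "))   # the accented letter is the single byte f3
print(utf8_bytes.hex(" "))     # the accented letter is now the two bytes c3 b3
print(utf8_bytes.decode("utf-8") == as_latin1)
```

Either way, the label and the bytes end up agreeing, which is all the browser needs.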

_ Marco





Problems encoding the spanish o

2003-11-17 Thread pepe pepe
Hello:

 We have the following sequence of characters "...ización Map.." that is 
the same as "...ización Map..." and that, after undergoing some 
transformations, becomes "...izaci&#56186;&#56333;ap".
As you can see, the two characters 56186 and 56333 seem to represent the 
sequence "ón M". Any idea?

Regards,
Mario.




Intercalary heads of the Tai Xuan Jing

2003-11-17 Thread Patrick Andries
Does someone know if the intercalary heads of the Tai Xuan Jing are
coded in Unicode? If so, which code points were they given?

(The intercalary heads are used for Dec 21 p.m. and for Feb 29 in leap
years.)

P. A.