Re: unidata is big

2002-04-21 Thread Geoffrey Waigh

> I would just like to know if someone could give me a tip on how to
> structure all the unicode-information in memory?
>
> The UNIDATA files contain quite a bit of information and I can't see
> any obvious way of structuring it that is memory-efficient and gives fast access.

a) you see if there is a Unicode friendly library you can use that already
does this for you.

b) you write a program to parse the file and extract what your application
needs. With clever data encoding you can pack most of the fields of
UNIDATA into a very tight space.  Long ago in the Unicode conference
proceedings somebody illustrated how they used trie structures to
efficiently build the lookup tables - the boring parts of the encoding
space have shorter branches than the areas where every codepoint is
different from its neighbour.
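[The trie idea above can be sketched as a two-stage lookup table. This is a minimal illustration, not the scheme from that conference paper: the block size and the toy property values are assumptions for demonstration.]

```python
# Two-stage lookup table for a per-code-point property. Blocks of the
# encoding space that are uniform ("boring") collapse to one shared row;
# only blocks where neighbouring code points differ get their own row.

BLOCK_SHIFT = 8            # 256 code points per block (a common choice)
BLOCK_SIZE = 1 << BLOCK_SHIFT

def build_two_stage(props, default=0):
    """props: dict {codepoint: value}. Returns (index, blocks)."""
    index = []             # first stage: block number -> row in `blocks`
    blocks = []            # second stage: deduplicated 256-entry rows
    seen = {}              # row tuple -> its position in `blocks`
    max_cp = max(props) if props else 0
    for start in range(0, max_cp + 1, BLOCK_SIZE):
        row = tuple(props.get(cp, default)
                    for cp in range(start, start + BLOCK_SIZE))
        if row not in seen:
            seen[row] = len(blocks)
            blocks.append(row)
        index.append(seen[row])
    return index, blocks

def lookup(index, blocks, cp, default=0):
    """Two loads instead of one huge flat array."""
    i = cp >> BLOCK_SHIFT
    if i >= len(index):
        return default
    return blocks[index[i]][cp & (BLOCK_SIZE - 1)]

# Toy data: tag three code points, everything else defaults to 0.
props = {0x0041: 1, 0x0061: 2, 0x0915: 3}   # A, a, DEVANAGARI LETTER KA
index, blocks = build_two_stage(props)
print(lookup(index, blocks, 0x0915))        # 3
print(len(blocks))                          # far fewer rows than blocks
```

All the empty blocks between U+00xx and U+09xx share a single all-default row, which is where the space saving comes from.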

Geoffrey





Re: Problems with viewing Hindi Unicode Page

2002-01-26 Thread Geoffrey Waigh

> I don't know if its what Peter was referring to, but there are some
> interesting crash bugs that can happen using Mangal or Latha on Win9x
> -- since neither font is supported on Win9x this is obviously more
> pilot error than product bug (unless it ever repro-ed with a font that
> is okay to put on that platform).

If the program crashes it is a bug.  If the OS cannot cope with a font
it should say so.  Unless someone is going to argue it is a design feature
where the OS detects a licensing violation and commits suicide.

Geoffrey





Re: Devanagari

2002-01-20 Thread Geoffrey Waigh

On Sun, 20 Jan 2002, Aman Chawla wrote:

> Taking the extra links into account the sizes are:
> English: 10.4 Kb
> Devanagari: 15.0 Kb
> Thus the Dev. page is 1.44 times the Eng. page. For sites providing archives
> of documents/manuscripts (in plain text) in Devanagari, this factor could be
> as high as approx. 3 using UTF-8 and around 1 using ISCII.

Well a trivial adjustment is to use UTF-16 to store your documents if you
know they are going to be predominantly Devanagari.  Or if you have so much
text that the number of extra disks is going to be painful, use SCSU to
bring it very close to the ISCII ratio.  Of course I would note that you
can store millions of pages of plain text on a single harddisk these
days.  If you are going to be storing so many hundreds of millions of pages
of plain text that the number of extra disks is a bother, I am amazed that
none of it might be outside the ISCII repertoire.  And this huge document
archive has no graphics component to go with it...
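[The ratios under discussion are easy to verify. A quick sketch, treating ISCII as one byte per character, which is approximately true for most Devanagari text:]

```python
# Size comparison for a short Devanagari sample. Devanagari code points
# (U+0900..U+097F) take 3 bytes each in UTF-8 and 2 bytes in UTF-16.

text = "\u0928\u092e\u0938\u094d\u0924\u0947"   # "namaste" in Devanagari

utf8 = len(text.encode("utf-8"))
utf16 = len(text.encode("utf-16-le"))   # -le: no BOM, payload bytes only
iscii = len(text)                        # ~1 byte/char in ISCII (approx.)

print(utf8, utf16, iscii)                # 18 12 6
print(utf8 / iscii, utf16 / iscii)       # 3.0 2.0
```

So UTF-8 is about 3x ISCII for pure Devanagari, UTF-16 about 2x, and SCSU (not shown; it needs a dedicated encoder) comes close to 1x.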

But the real reason for publishing the data in Unicode on the web is so
people not using a machine specially configured for ISCII will still be
able to read and process the data.

[then later wrote:]

> With regards to South Asia, where the most widely used modems are
> approx. 14 kbps, maybe some 36 kbps and rarely 56 kbps, where
> broadband/DSL is mostly unheard of, efficiency in data transmission is
> of paramount importance... how can we convince the south asian user to
> create websites in an encoding that would make his client's 14 kbps
> modem as effective (rather, ineffective) as a 4.6 kbps modem?

Can you read 500 characters per second?  So long as they are receiving
only plain text, even this dawdling speed is not going to impact them.
People wanting to efficiently transfer data will use a compression
program.
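[The 500-characters-per-second figure follows from back-of-envelope arithmetic; the framing overhead here is an assumption (roughly 10 line bits per byte on an async serial link):]

```python
# Throughput of a 14.4 kbps modem carrying Devanagari text in UTF-8.

line_bps = 14_400
bits_per_byte = 10          # 8 data bits + start/stop framing, roughly
bytes_per_char = 3          # Devanagari code points are 3 bytes in UTF-8

bytes_per_sec = line_bps / bits_per_byte
chars_per_sec = bytes_per_sec / bytes_per_char
print(chars_per_sec)        # 480.0 -- on the order of 500 chars/sec
```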

Geoffrey








Re: What constitutes "character"?

2001-11-12 Thread Geoffrey Waigh

On Wed, 7 Nov 2001, Philipp Reichmuth wrote:

> What if it is a character where nobody knows for
> sure whether it is a character in its own right or a variant of some
> sort, in orthography, style or whatever?
> 
> What is necessary for two signs to constitute different characters in
> cases such as these?

It depends how loudly the national body sings

   Every glyph is sacred,
   every glyph is great.
   If every glyph is not encoded...

As for people wondering about doing Han characters with composition,
aside from the technical issues there are the political -
"Remember the Jamos!"

I'll be glad when the day comes that we can do a Unicode successor
with 16 bits, but we will probably have World Peace first.

Geoffrey




Re: Letters d L l and t with caron

2001-10-23 Thread Geoffrey Waigh

On Tue, 23 Oct 2001, Darren Morby wrote:

> In The Unicode Standard Version 3.0, the Latin small letters d l and t with
> caron (U+010F, U+013E, U+0165) are actually shown with a trailing apostrophe
> (d', l', t').  On each character there is the following note:
> 
> the form using apostrophe is preferred in typesetting
> 
> However, the Latin capital letter L with caron (U+013D) is shown with an
> apostrophe (L') but no note.  The Latin capital letters D and T with caron
> (U+010E, U+0164) show proper carons and notes that the preferred form is
> with a caron (hacek).
> 
> Which is the preferred form, L with an actual caron or L with an apostrophe?
> And should there not be a note on capital L like there is on small l?  (The
> note on small l does not say that it applies to capital L also.)
> 

I don't remember the answer but I vaguely recall Agfa consulting a
reference when they made the original Unicode bitmap fonts for Teklogix.

How have things been going at Teklogix/Psion?  I haven't been in touch
with anyone there since Lee left.

Geoffrey





Re: GB18030

2001-09-26 Thread Geoffrey Waigh

On Wed, 26 Sep 2001, Yung-Fong Tang wrote:

> how can you implement tolower(U+4ff3a) without knowing what U+4ff3a is ?

With a data table.  One set of debugged code that handles surrogates,
composing characters, bidirectionality etc. coupled with a datafile that
gets upgraded with each release of Unicode.  How many years does it take
to implement some of these concepts?  It shouldn't require
honest-to-goodness we-weren't-kidding see-here's-one-defined-now characters
for developers to slap themselves on the head and start developing support
for these things.
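[The data-table approach described above can be sketched like this. The table excerpt is a hypothetical fragment of what a generator would emit from UnicodeData.txt; a real table would cover every character with a simple lowercase mapping:]

```python
# Table-driven case mapping: one fixed lookup routine plus a data table
# regenerated from each Unicode release. A code point the table does not
# know about yet maps to itself, so supporting newly assigned characters
# means shipping new data, not new code.

LOWER_TABLE = {
    0x0041: 0x0061,    # LATIN CAPITAL LETTER A -> a
    0x0130: 0x0069,    # I WITH DOT ABOVE -> i (simple mapping)
    0x10400: 0x10428,  # DESERET CAPITAL LONG I (outside the BMP)
}

def to_lower(cp: int) -> int:
    """Simple 1:1 lowercase mapping; identity for unmapped code points."""
    return LOWER_TABLE.get(cp, cp)

print(hex(to_lower(0x0041)))    # 0x61
print(hex(to_lower(0x10400)))   # 0x10428 -- supplementary plane, no special case
print(hex(to_lower(0x4FF3A)))   # 0x4ff3a -- unknown to the table: identity
```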

Geoffrey 





Re: Possibilities of future expansion (from Perception etc thread)

2001-02-25 Thread Geoffrey Waigh

On Sun, 25 Feb 2001, William Overington wrote:

[reams on the notion that the forces of glyph encoding may overwhelm
the defenders of Unicode.]

Well yes, people are free to tunnel anything they like in the PUA and
assuming their Unicode applications are willing to allow much larger
datafields than otherwise expected, they will survive intact.  But
given the aversion most parties have to implementing advanced font
rendering engines, I doubt that a uniengine-like scheme is going to
be installed at the lowest levels where it belongs.

But even your alternate model, where a company registers arbitrary stuff
in the PUA for fellow club members, is not going to be welcomed by people
who want actual data interchange.  Unicode contains much more than a
collection of glyphs - which is why the encoding process is fairly slow.
I just don't see people registering their glyphs for free doing the
necessary work to unify and correctly assign properties and semantics
so programs that do something other than draw pictures can actually
work with the text.

People who *really* want their quirky, "Only 6 people on the planet know
what this means," glyphs included in a document should use a higher level
protocol like HTML which lets them include inline pictures.

Geoffrey




Re: Acronyms (off-topic)

2000-08-02 Thread Geoffrey Waigh

On Wed, 2 Aug 2000, Alain LaBonté  wrote:

> À 07:12 2000-07-11 -0800, Doug Ewell a écrit:
> >Many English speakers also think ISO is an abbreviation or initialism
> >(not "acronym"; that term is correct only when the resulting "word"
> >is actually pronounced, like "AIDS" or "SIDA") of the English name
> >"International Standards (or Standardization) Organization."  Of course,
> >this is wrong.
> 
> [Alain]  ISO is not pronounced as a word in English but it is in French 

Um, I and quite a few people I know pronounce ISO as if it were a word in
English. I have no idea on the origins of the name, but given that an
uncynical view of the function of that body is to produce international
standards, I doubt the correct etymology will win through.

Geoffrey




Re: Euro character in ISO

2000-07-11 Thread Geoffrey Waigh

On Tue, 11 Jul 2000, Robert A. Rosenberg wrote:

> At 15:30 -0800 on 07/11/00, Asmus Freytag wrote about Re: Euro 
> character in ISO:
> 
> >There has been an attempt to create a series of 'touched up' 8859
> >standards. The problem with these is that you get all the issues of
> >character set confusion that abound today with e.g. Windows CP 1252
> >mistaken for 8859-1 with a vengeance:
> 
> The problem would go away if the ISO would get their heads out of 
> their a$$ and drop the C1 junk from the NEW 'TOUCHED UP" 8859s and 
> put the CP125x codes there.

Except that would break all the systems that understand that C1 "junk,"
and a number of systems do so because they are adhering to other
ISO standards.  If you are going to force someone to change their
datastreams to something new, they might as well go to some flavour
of Unicode anyways.
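[The confusion at issue is visible directly: bytes 0x80-0x9F are C1 control codes in ISO 8859-1 but printable characters in Windows CP 1252, so the same byte stream means two different things depending on which label it is read under. A small demonstration:]

```python
# The 0x80-0x9F range at the heart of the 8859-1 / CP 1252 confusion.

raw = bytes([0x80, 0x93, 0x94])

as_8859_1 = raw.decode("latin-1")   # C1 control codes, nothing printable
as_cp1252 = raw.decode("cp1252")    # euro sign and curly double quotes

print([hex(ord(c)) for c in as_8859_1])   # ['0x80', '0x93', '0x94']
print(as_cp1252)                          # euro sign + left/right quotes
```

Systems obeying the ISO control-function standards interpret that range as C1 controls, which is exactly why repurposing it in a "touched up" 8859 would break them.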

Geoffrey
"tilting at terminal emulators, err windmills."